WO2009103820A1 - Systems and methods for acquiring, collecting and processing data relating to remotely or locally accessed electronic documents or applications - Google Patents

Info

Publication number
WO2009103820A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
user
source
server
data
Prior art date
Application number
PCT/EP2009/052134
Other languages
French (fr)
Inventor
Dominique Helene Beatrice Monet
Adrien Michel Arculeo
Original Assignee
Monet Dominique Helene Beatric
Adrien Michel Arculeo
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Monet Dominique Helene Beatric, Adrien Michel Arculeo
Publication of WO2009103820A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3495 Performance evaluation by tracing or monitoring for systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • G06F11/3423 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time where the assessed time is active or idle time

Definitions

  • the present invention concerns in general terms the acquisition, collection and processing of data relating to the use of electronic documents or applications, such as but not limited to web sites, electronic documents, web applications or software applications, either remotely accessed from a client station ("user station") or locally accessed, whichever programming language or communication protocol is used during the process.
  • a client station can be any system able to interpret an electronic document or execute software applications.
  • the document can be accessed or the application be executed either on the user station or on a remote system.
  • the first-generation technology (so-called "server side") is that of the log that appeared at the time of the first client-server machines.
  • the principle of the log is that of a file in which the server records its activity in serving web pages to a client machine and notes the problems, according to the desired log level.
  • the first log level records that someone has connected to the server and has requested pages. Thus the data stream available at the server is simply used, without creating new information strictly speaking.
  • the second-generation technology (so-called "client side") gave rise to "Web Analytics" as a software activity sector.
  • the company WebSideStory (now part of Omniture) has proposed a technology of JavaScript tagging of the pages on a website.
  • This system thus comprises a browser, a page server and a multitude of small predetermined JavaScript tags that send data to a server monitoring and analyzing the actions.
  • This technology is however very cumbersome because, at the page server, it is necessary first to design and organize the tag plan, then to tag all the links and all the pages. For example, if the page server is capable of offering 10,000 pages and there exist approximately 40 links per page (these figures being normal), it is necessary to place 400,000 tags on the respective 400,000 links.
  • the cumbersome nature of such an approach is such that the most widespread tool at present for managing content (CMS, standing for "Content Management System") has had to evolve in order to integrate tagging directly and industrially in its interface, thus avoiding manual tagging as was the case previously.
  • the tool also automatically places the name that the operator or its administrator has set that must appear in the JavaScript.
  • links are regularly broken or displaced.
  • hard-written links do not adapt to changes in an electronic document or application; for a web site or web application, for instance, the quality of the link depends largely on the quality of the CMS solution generally put in place by the operator/user of the website, or on the way in which the tags have been put in place by hand.
  • the present invention thus aims to propose an analysis system for electronic documents or applications, such as but not limited to web sites, electronic documents, web applications or software applications, either remotely accessed from a client station ("user station") or locally accessed, whichever programming language or communication protocol is used during the process, commercial or otherwise, aimed at revealing, in a much finer and more robust way, user behaviors and differences in behavior between users. For example, two persons passing through the same page on a web site and clicking on the same link do not necessarily have the same concerns: it is the whole of the path travelled that makes sense.
  • the invention aims at enabling this information to be put in correlation with other interactions between these users and the other touch points (points of contact) of the operator/user of the electronic documents or applications, in particular the call centre, the sales force, advertising, promotion with money-off coupons, help desks etc. More generally, the invention aims at providing a novel chain for acquiring, organizing and processing behavioral data relating to use of or interaction with electronic sources of information. According to the invention, the acquisition cooperates directly with the collection and processing of the data by providing the acquisition with data organized and time-stamped in real time or almost real time. Accordingly, the present invention provides a method for monitoring the behavior of a user accessing an electronic source of information such as a web site, electronic documents, web applications or software applications, accessed on a user station, comprising the following steps:
  • the step of providing the intelligent agent in said user station comprises embedding code intended to constitute said intelligent agent into a given zone of said source of information, and loading the source of information including said zone or said section into the user station.
  • said intelligent agent comprises at least two codes executable in possibly different executing environments, said codes being deleted at different times when an electronic document or an application is unloaded from the user station.
  • one code is executable in the environment of accessing said source of information, and another code is executable in a scripting environment.
  • said agent is made from a plurality of code sections embedded in different places in the source of information.
  • said data representative of user actions include inputs from human to machine interaction devices.
  • said data representative of semantic contents are collected each time the content is refreshed or at a fixed interval of time or according to any other trigger action.
  • said source of information includes contents markers (microformats)
  • the method further comprises the step of adapting the actions of said intelligent agent depending on these markers.
  • the method further comprises a step of identifying the user station.
  • the method further comprises the steps of storing a user identifier in the user station for each source of information or group of such sources, and transferring said identifier to said collection and processing server.
  • the method further comprises a step of linking different user identifiers in the user station.
  • the source of information is provided by at least one information server, and said collection and processing server is at least partially embodied in said information server.
  • the present invention provides a generator capable of dynamically embedding in sources of information at least one executable code section intended to form an intelligent agent for performing the method as defined above.
  • the present invention provides an intelligent agent stored in a user station for performing the method as defined above.
  • the present invention further provides a method for generating page contents markers in web pages on a web site or a web application for performing the method as defined above.
  • the present invention provides a combination of a server providing sources of information and a plurality of user stations adapted to perform the method as defined above.
  • figure 1 is a synthetic overall logic diagram of the actions and interactions of three entities participating in the present invention (web page server of a site operator or server of an application operator subscribing to the system, client station of a visitor/user and collection and processing server)
  • figure 2 is a more detailed overall logic diagram
  • figure 3 is an idle management diagram used in the present invention
  • figure 4 is a logic diagram illustrating the main steps of a semantic analysis process implemented in the collection and processing server
  • figure 5a is a logic diagram illustrating the main steps of a process of processing markers ("micro formats") included in a page supplied by the web page server
  • figure 5b contains a list of types of information that can be marked by such markers, given by way of example
  • figure 6 is a logic diagram illustrating the main steps of user identity reconciliation.
  • each page addressed by the server contains a code intended to be self-installed and executed, in the form of an intelligent agent, on the client station.
  • this intelligent agent consists of three to five files compiled respectively according to Adobe Flash technology (2 .swf files), according to XML technology (.xml file), according to a proprietary encryption (file without extension) and according to JavaScript technology (.js file).
  • One of the two .swf files and the .xml file are optional.
  • generation of these page or file tags takes place automatically through a few code lines calling the files at the server (files previously stored in a well identified directory) in order to automatically embed them in a given area of any page or file generated.
  • There are two different .swf files. They are the agent and the reader.
  • the reader is required only in case more than one domain or sub-domain is registered in the license file. As shown in figure 6, it is used to link information regarding a user unique identifier from different domains to be able to link the user behavior on a domain to the behavior of the same user on another domain.
  • the agent uses parameters described in the .xml configuration file.
  • this agent functions in particular to recognize the page or file in question, to recognize possibly from where the internet user comes, to recognize if and where he is moving to on the site, to identify and recognize the internet user (even if it is an anonymous visitor, it is possible to determine whether or not he has already come to the site, even if accessed from a different Internet Service Provider than previously, the number of his visits, etc) as shown in figure 2 (flows UID1 to UID6).
  • This agent sends information about the content of the page to the input server as described in figure 2, specifically in flows S1 to S4, and continuing in figure 4.
  • This content can change over time through any kind of technology, such as the one known as AJAX (Asynchronous JavaScript And XML), and the agent can therefore be configured to send the page content at a fixed time interval if it has changed.
  • if the visitor has supplied the site with at least one item of identification or authentication, it is possible to reconcile each visitor with all his previous paths, behaviors and centers of interest detected from the content of the page, i.e. to attribute to him/her the paths, the behaviors and the centers of interest detected from the content of the page, and then to transmit this information to the site operator for injection into his databases (subscribers to his electronic newsletter, updating of the client file, information for the call centre, etc).
  • the invention also covers an approach where the aforementioned code section is added "manually" in a static fashion (although the pages produced from static HTML pages are now on the verge of disappearing from commercial, industrial or professional web sites or web applications).
  • the present invention is not limited to any particular manner for the server to form and address the requested pages or files to the client station.
  • the generation of the tagged pages according to the invention is entirely transparent vis-a-vis the information system of the site operator.
  • the tagging program supplied to the site operator has a first functionality of recognition of the environment in which it is executed.
  • the implementation code lines are placed by the page server in a suitable manner according to the technology that this server uses in each of the pages served (for example PHP, ASP (.net) and Java).
  • the program has in fact several code versions having the same functionalities and adapted to the currently most widespread technologies of dynamic web page generation. In the present case, three versions of the code are provided, which are adapted to the PHP, ASP (.net) and Java environments. Naturally, this configuration is given only by way of example.
  • This program fits within the process of dynamic web page generation in order to insert in each page, preferably in a block present in all the pages of the website (typically the copyright block as seen above), the code sections that form the two components of the agent, Flash and JavaScript, which will be executed when the page is received at the client station.
  • the above process is thus sufficient to dynamically tag the whole of the site. Further, the reliability of this tagging is totally independent of the links contained in the pages that redirect the Internet user to other pages (these links can in particular be active or dead, etc) and is transparent to them.
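The embedding step described above can be pictured with a minimal Python sketch. It is an illustration only, not the patented generator: the marker comment, the `/agent/` paths and the names `AGENT_SNIPPET` and `embed_agent` are all assumptions made for the example.

```python
# Hypothetical snippet referencing the agent files stored on the page server.
AGENT_SNIPPET = (
    '<object data="/agent/agent.swf" width="1" height="1"></object>'
    '<script src="/agent/agent.js"></script>'
)


def embed_agent(page_html: str, marker: str = "<!-- copyright -->") -> str:
    """Insert the agent snippet right after the first occurrence of `marker`.

    `marker` stands for a block assumed to be present in every generated page.
    """
    if AGENT_SNIPPET in page_html:
        return page_html  # page already tagged, do not tag twice
    index = page_html.find(marker)
    if index == -1:
        return page_html  # common block not found: leave the page untouched
    cut = index + len(marker)
    return page_html[:cut] + AGENT_SNIPPET + page_html[cut:]
```

Because the snippet is attached to a block shared by all pages, no individual link has to be tagged, which is the point made in the text.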
  • the Flash component (here a .swf file) of the agent can take any form, visible or not on the display of the client station (a block of NxN pixels that is transparent and therefore invisible to the internet user, logo of the site operator, etc).
  • the Flash component known as reader is executed inside the agent and so has the same appearance.
  • An important feature of the agent .swf file is that the code executed on the client station is suitable for supplying, in real time or almost real time, a collection and processing server with information representative of the fact that this page has been loaded by a given Internet user and/or a given client station.
  • In order to prevent client stations from considering the .swf and .js codes attached to each page as spy programs (web bugs or the like), which characteristically come from third-party servers, it is preferable for the page tagging program to be executed directly in the technical environment of the website. In a variant embodiment, it is however possible to incorporate the Flash and JavaScript files into the pages supplied by the page server at a third-party server, or at the collection and processing server as will be described below.
  • a function of the Flash code of the intelligent agent is to make its execution dependent on the identification of the associated domain (subscriber under contract to the service) in the current browsing or use of application, and on the determination of the validity of the subscription contract in terms of date.
  • the intelligent agent is able to send, to a collection server whose address it knows, a certain number of items of information and in particular the following:
  • the Flash code reads the system date of the client station.
  • the Flash code relies upon a distinct file containing the data relating to the contract and rights to execute.
  • for each site operator, this code is compiled in such a manner that it contains the details of the data collection and processing service contract.
  • the system date of the client station is used to verify the existence of a valid service contract. It should be noted here that all the other dates used by the method and system of the invention are determined at the collection and processing server, serving as time reference.
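The domain and date check described above could look like the following Python sketch. The `License` structure, its field names and the sub-domain matching rule are assumptions introduced for illustration; the source only states that execution depends on the subscribed domain and on the contract dates, compared with the system date of the client station.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class License:
    """Hypothetical contract data compiled into (or shipped with) the agent."""
    domains: frozenset
    valid_from: date
    valid_until: date


def agent_may_run(lic: License, current_domain: str, station_date: date) -> bool:
    """Return True only for a subscribed domain and a contract valid on station_date."""
    host = current_domain.lower()
    # Assumption: a sub-domain of a registered domain is also accepted.
    domain_ok = any(host == d or host.endswith("." + d) for d in lic.domains)
    date_ok = lic.valid_from <= station_date <= lic.valid_until
    return domain_ok and date_ok
```

For example, a license for `example.com` valid during 2009 would allow `shop.example.com` on 1 June 2009 and refuse any domain in 2010.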
  • cookies enable an e-Commerce site or a personalizable site or application to construct pages according to profile data stored in these cookies, as a general rule JavaScript cookies.
  • JavaScript cookies which are passive
  • An advantage of Flash technology or equivalent is that an object of the "Local Shared Object" type, which operates in the same logic as a JavaScript cookie, is placed in a specific directory common to all the browsers on the station.
  • Flash makes it possible to limit the size of such a file to as little as 0 kilobytes, but the default limit is 100 kilobytes. Therefore, if such a Local Shared Object is used to store a user identifier, the size of this object will be only a few bytes, well below the default limit and therefore acceptable.
  • the intelligent agent is capable of extracting from the page a semantic content for the collection and processing server (as will be described in detail below) at each refresh, according to a refresh rate hard-written in the Flash agent code, rather than only when each page is loaded.
  • a unique identifier is generated at the collection server and transmitted to the client station, where the intelligent agent recovers it in order to use it mainly as an identifier of the person who is in the process of visiting the site, storing it in the aforementioned "Local Shared Object".
  • This identifier is preferably encrypted.
  • the encrypting mode is chosen to make the identifier illegible to third parties (a technique based on hashing keys or the like).
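A minimal sketch of how a server-side identifier might be generated and made illegible to third parties is given below. The source only mentions "a technique based on hashing keys or the like"; the use of a random UUID and of HMAC-SHA-256 here is a substitute chosen for the example, and the function names and the secret key are hypothetical.

```python
import hashlib
import hmac
import uuid


def new_visitor_id() -> str:
    """Generate a random identifier on the collection server (assumed UUID4)."""
    return uuid.uuid4().hex


def opaque_token(visitor_id: str, domain: str, secret_key: bytes) -> str:
    """Derive a keyed digest that is meaningless without the server's key.

    Binding the domain into the message yields a distinct token per domain.
    """
    message = f"{domain}:{visitor_id}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()
```

The token, a fixed 64 hexadecimal characters, is small enough to sit comfortably within the storage limits of a Local Shared Object.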
  • This visitor identifier is the cornerstone used for the remainder of the activities performed by the system and the method of the invention.
  • When a client station connects under the control of a visitor to a website belonging to the system and method of the invention (a subscriber), the server of this site, when a page is first loaded, writes on the client station the files of the intelligent agent, the latter generating the identifier of the client station as described above, with consequently a specific identifier for each domain visited.
  • this makes it advantageously possible to reaggregate, by virtue of the personal identifier, the data collected for one and the same user operating from several different client stations (user stations).
  • a user accesses a subscribed site during the day on his office computer and then in the evening on his personal computer, and then the next day on the computer of his secondary residence.
  • accesses from other machines such as PDAs, mobile telephones, computer gaming devices, on-board computers, embedded computers, etc, are advantageously also taken into account as user stations, the portable character of Flash and JavaScript technologies or other languages facilitating this.
  • any other information provided by the user can be used to re-aggregate different identifiers: client information provided via a "microformat" or information provided in the URL. In both cases, the information can be an email address, name, client number or any other information allowing the user to be reliably identified. The generation of this unique identifier and of the so-called "microformats" data will be detailed hereinafter.
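Re-aggregating identifiers from several stations can be sketched as a grouping problem. The example below uses a union-find structure, which is a choice made for this illustration and not something the source prescribes; the class name `IdentityLinker` and the identifier strings are hypothetical.

```python
class IdentityLinker:
    """Group identifiers (station ids, emails, client numbers) per user."""

    def __init__(self) -> None:
        self._parent = {}

    def _find(self, item: str) -> str:
        # Register unknown items as their own group, then walk to the root,
        # shortening the path on the way (path halving).
        self._parent.setdefault(item, item)
        while self._parent[item] != item:
            self._parent[item] = self._parent[self._parent[item]]
            item = self._parent[item]
        return item

    def link(self, a: str, b: str) -> None:
        """Record that identifiers a and b belong to the same user."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_user(self, a: str, b: str) -> bool:
        return self._find(a) == self._find(b)
```

If the office identifier and the home identifier are each linked to the same email address, `same_user` reports them as one user even though they were never linked directly.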
  • the intelligent agent is capable of extracting text contents from the pages consulted on the site, which is carried out mainly by eliminating from the code of the page loaded everything that is images, animations, videos, JavaScript codes, etc.
  • Text content must be understood here in the broad sense, namely any character string present on the page, such as blocks of text but also legends to images or videos for example.
  • the essential role of the JavaScript component of the intelligent agent is to manage the time spent on the page and the activity time on the page, determined by monitoring the input devices (typically mouse and keyboard) .
  • the intelligent agent first of all extracts all the text data from the page. It then carries out a cleaning in order to extract therefrom only the semantic text part; it is this semantic part that will be one of the important parts of the data transferred to the collection server in order to be stored and processed therein.
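The extraction and cleaning step can be approximated with Python's standard `html.parser` module. This is only a sketch of the idea: the real agent is described as Flash and JavaScript code, and treating the `alt` attribute of an image as its legend is an assumption made here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring scripts, styles and similar blocks."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # Assumption: the alt text plays the role of an image legend.
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            text = " ".join(data.split())  # normalise whitespace
            if text:
                self.chunks.append(text)


def extract_text(page_html: str) -> str:
    parser = TextExtractor()
    parser.feed(page_html)
    parser.close()
    return " ".join(parser.chunks)
```

The result is a single string of character data, which is the kind of input the later semantic analysis is described as consuming.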
  • the operator can influence the collection of text data by adding appropriate microformats, as described later in this document.
  • the Flash code of the intelligent agent communicates with the collection and processing server.
  • the Flash code has the function of extracting the text content from the consulted pages. In order to avoid unnecessary work in duplication, this code is able to check whether, for a given page designated by its URL, the text content has already been acquired; this is preferably done by applying a hash function to the content already acquired and to the content of the page being consulted by means of a common key; if the hashing result is the same, the page is considered not to have changed and its extracted text content is not transmitted again to the server.
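The duplicate check described above can be sketched as a keyed hash comparison per URL. HMAC-SHA-256 is an assumption standing in for the unspecified hash function and "common key"; the class name `ContentCache` and its method are hypothetical.

```python
import hashlib
import hmac


class ContentCache:
    """Remember a keyed digest of the last text sent for each URL."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._digests = {}

    def _digest(self, text: str) -> str:
        return hmac.new(self._key, text.encode("utf-8"), hashlib.sha256).hexdigest()

    def should_send(self, url: str, text: str) -> bool:
        """Return True when the text for this URL is new or has changed."""
        digest = self._digest(text)
        if self._digests.get(url) == digest:
            return False  # same hashing result: page considered unchanged
        self._digests[url] = digest
        return True
```

Only the digest has to be kept between visits, so the comparison stays cheap even for long pages.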
  • one of the functionalities of the invention is to reconstitute the whole of the browsing universe of the Internet user, thus constituting a "semantic horizon", composed of the pages requested at least once by a client station.
  • a semantic analysis process as will be described below, to differentiate between a page as it exists in the whole of the website contents of the subscribed operator and the interest of a visitor for this page.
  • this semantic analysis process is capable of analyzing the consistency between the content of a page and the contents of pages consulted before and after it, and of attributing to the meaning of this page a different weight according to the degree of consistency. Such a process makes it possible to ignore pages consulted accidentally.
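One simple way to picture such a consistency weight is to compare the terms of each page with those of its neighbours in the visit path. The source does not say how consistency is measured; the Jaccard overlap used below, and the function names, are assumptions chosen only to make the idea concrete.

```python
def jaccard(a: set, b: set) -> float:
    """Share of terms common to both sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def consistency_weights(pages: list) -> list:
    """Weight each page by its mean overlap with the previous and next page.

    `pages` is a list of term sets in order of visit.
    """
    weights = []
    for i, terms in enumerate(pages):
        neighbours = [pages[j] for j in (i - 1, i + 1) if 0 <= j < len(pages)]
        if not neighbours:
            weights.append(1.0)  # a single page has nothing to disagree with
            continue
        overlap = sum(jaccard(terms, n) for n in neighbours)
        weights.append(overlap / len(neighbours))
    return weights
```

In a path made of three shoe pages and one unrelated tax form, the tax form gets a weight of zero and can be treated as an accidental visit.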
  • the intelligent agent is capable of pre-processing or pre-organizing the text data contained in the pages consulted, which are then transferred to the collection and processing server and stored in this server with a view to processing thereof, allowing later attribution to their visitors or users:
  • the server stores, in the form of a database, the text contents once "cleaned" by the agent, the visitor identifiers, links to be able to aggregate visitor identifiers and reconcile groups of visitors like families or groups of relatives, the URLs of the pages, and finally the activity indicators and their time stamping. Other available information can of course be stored.
  • the Flash code initiates calls by means of the integrated JavaScript code with a view to collecting, at the browser of the client station, information issuing from the input devices (mouse, keyboard, etc), except for the closure of the page, which is managed by a distinct JavaScript file.
  • When the page is unloaded, the intelligent agent is adapted to send to the collection and processing server information representing the inactivity (or activity) time spent on the page.
  • the Flash code increments a count by +1, for example for every elapsed second. On the other hand, it counts +0 (no incrementation) if during the running second an action has been performed on the input device (typically mouse movement, scrolling, etc).
  • the system thus has available, apart from the time-stamped instants of start of loading of the pages and end of unloading of the page, information (message sent by the intelligent agent from the client station to the collection and processing server) representing the duration of inactivity on the page, determined by the result of the count during unloading.
  • the collection and processing server is capable of obtaining duration of activity on the page by simple difference calculation.
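The counting rule and the difference calculation can be written out in a few lines of Python. The one-second granularity follows the example given in the text; the function names and the representation of input activity as a list of booleans are assumptions.

```python
def count_idle_seconds(active_flags) -> int:
    """Add +1 for each second without input activity, +0 otherwise.

    `active_flags` holds one boolean per elapsed second on the page.
    """
    return sum(0 if active else 1 for active in active_flags)


def activity_duration(load_ts: float, unload_ts: float, idle_seconds: float) -> float:
    """Server-side difference: time on page minus reported inactivity."""
    return (unload_ts - load_ts) - idle_seconds
```

For a page open for five seconds with input during the first and fourth seconds only, the agent would report three idle seconds and the server would derive two seconds of activity.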
  • the time stampings can be performed at the client station by means of a system clock thereof, and the calculation of activity time performed by the intelligent agent.
  • the intelligent agent can also transmit to the collection and processing server information such as technical information on the machine (display size, colour depth, version of operating system, type and version of browser, previously visited page), information supplied as standard by the majority of browsers.
  • the intelligent agent records not only the existence or not of activity on the input devices of the client station as described above, but also the nature of these activities (mouse clicks and their types: left button, right button, scroll wheel, etc.) and their consequences, such as display of pop-up menus, text selection, execution of JavaScript codes and more generally everything relating to dynamic HTML (DHTML).
  • a current trend with market players consists of going towards the "Rich Internet Application", using technologies such as Flash, Flex and Air (Adobe) , Silverlight (Microsoft) or Ajax, in order to provide applications having an interface and ergonomics similar to those of a heavy client installed on a station, such as an office application.
  • browsing on some Rich Internet Applications may take place in a manner disconnected from the network, and does not make it possible to associate precise time stamping with the browsing events: it is only when the user reconnects to the network, for example in order to place an order after having consulted a catalogue off-line (this being transparent to the internet user), that the data of his browsing may then be recovered and transferred as a whole to the collection and processing server, with consequently a unique time stamping common to all the activity data during the off-line period. Recourse for example to a local relay clock is then provided.
  • the intelligent agent has a full duplex information channel open between the client station and the collection server. This channel can be used to insert contextual information, resulting from processing inside the collection server, that would be made available to the web page, for example in the form of a JavaScript variable or a triggered JavaScript or Flash function.
  • the intelligent agent can also be implemented in programming languages other than Flash and JavaScript so that it can be used in environments other than internet websites and web applications. Such other environments can be
  • This functionality at the collection and processing server is to process the text content recovered as described above, extracting the key concepts, in a totally automated fashion, and to collect together these concepts according to their meaning (with advantageously the possibility of "zooming" in depth on this meaning) in order to output them at the end of the service chain offered to the subscribing operator.
  • this semantic analysis is preferably based on the analysis of the organization of a text, on the greater importance of certain words compared with others, on the greater importance of certain concepts, pre-defined (with ontologies or otherwise) or not, compared with others, and on the use of particular technical terms rather than a generic term.
  • a first step relies upon statistical processes of the stemming/lemmatization type, by correlation between the words and context analysis, ambiguity removal and special dynamic dictionaries.
  • the second step is based on the links of the text contents to a general purpose ontology, or a designated group-of-concepts ontology (for instance distinct for France/Belgium, distinct for business/leisure activities, etc.), or an ontology established in direct relation to the activity of the subscribed operation (for instance a "mouse" in such a context would be taken into account as an animal or as a computer peripheral, depending on the activity of the subscribed operation) .
  • an ontology is a structured set of terms and concepts establishing the meaning of an information field (a) by the metadata of a namespace, or (b) by the items of a knowledge domain. It is a data model representing a set of concepts in a domain, as well as the relationships between these concepts.
  • the processing of the text content is then arranged to restore a semantic network organized as a set of concepts completely describing a domain, these concepts being linked to each other by relationships that are taxonomic (hierarchization of the concepts) on the one hand and semantic (relationships of meaning between the words) on the other hand.
  • Yet another approach is to have recourse to an ontology that is not trade-based but centered on the user of the processed data.
  • the semantic analysis could be based on the ontology of this trade using for example the Food and Non-food categories, and subdividing the categories more and more finely according to a tree structure.
  • Corresponding key words can be attached to this hierarchy of concepts.
  • each key word is attached to an element of the ontology of the subscribed operator so as to be able to use this in the same way as an ordinary hierarchical structure and to perform on it all kinds of data processing operations, in particular so-called "drill-down" processing operations (see in particular http://en.wikipedia.org/wiki/Data_drilling).
  • these processing operations are possible only in relation to hierarchy codifications, in particular those of analytical accounting tools, except if "hard" cumbersome parameterizations are carried out with the keywords of a specific ontology.
  • the combination of a monitoring of the behavioral type of visitors with a semantic analysis of the visited pages makes it possible for example to supply the subscribing operators with information on the bidirectional behavioral interaction of the visitor/product relationship, and for example:
  • a "microformat" marking consists of a name formed according to a determined keyword whose meaning is predetermined. For example, it may be a generic product name from the catalogue of the website operator, and a marked microformat thus marks (delimits) each location on a page where such a product is found, by means of attributes
  • HTML tag or a tag added specifically to carry the microformat information, such as for example the div tag.
  • each cell in the table can be enhanced with microformat information in order to characterize the content of the cell. If a product name is alone in the middle of a text, it is characterized alone by means of a div tag carrying the microformat information.
  • the microformats are generated directly during the dynamic generation of a HTML page by means of a microformat tagging engine that tags contents according to predetermined tagging rules (unlike the second- generation behavioral analysis systems described in the introduction, the contents are tagged rather than the links) .
  • microformat is characterized by a format (for example product code, customer identifier, advertising campaign code, quantity, price, discount, carriage costs, etc) .
  • the information issuing from these microformats is transferred to the collection and processing server by the intelligent agent of the client station (by means of its Flash code) , and stored in the database of said server for example (typically for an on-line sales website) in a format (product A, page B, client C, quantity D, total amount E) .
  • microformats make it advantageously possible for example to integrate analyses on data collected for non-client visitors (for example a visitor loads his basket but never goes to the checkout) in the transactional system of the subscribing operator.
  • the system makes it possible to determine that a client visits the site N times before reaching a purchase decision, that another client visits only the luxury products pages, whether a client or potential client is a discount hunter or a heavy spender on high-margin items, etc.
  • microformats can have meaningful identifiers (client ID, product name or reference, etc.) by the following mechanism: when an Internet user X identifies himself on the website, he in general enters his ID, and then "Hello Mr X" appears at the top of the page.
  • a microformat can mark this character string, by placing a div type tag and containing the client number, named for example "XXX_client_number" .
  • This number is invisible to the Internet user on the web page that he is consulting, but on the other hand is read by the Flash code of the intelligent agent, which makes it possible to attach the current consultation with the identifier (identical by definition) situated in the client database.
  • the collection and processing server is capable of determining that, when an Internet user places products in a checkout basket, this occurs on a certain client station (for example the one situated at his place of work) , and that when the same internet user, returning in the evening to add a new product to the basket, validates his order, etc, this occurs on another client station (for example the one situated at his home) .
  • the system and method of the invention are thus capable of making a link and reconciling the technical elements in order to provide a seamless vision of the activity of an internet user, and to retrieve a customer- centric vision taking account of his movements, his habits, his/her sensitivity towards certain elements, his/her propensity to act, his/her way to use or interact with electronic documents or applications, etc.
  • the prior art systems are not capable of performing this dual identification, that is to say determining that the same user has connected to the site from two different machines .
  • Internet user has connected to the server from such time to such time in the day, and that he has reconnected in the evening, without being able to determine his location .
  • the semantically legible link established by the method and the system of the invention between the microformats and the reference frames of the subscribing operators is much more precise: if a given person has already been a client, then the transactional accounting system of the subscribing operator already has his invoicing and delivery postal address, which makes it possible to easily apply segmentation processing operations of social-professional group type based on this geolocation.
  • the system and method of the invention also make it possible for example:
  • microformats to be used to exclude certain text areas from the text extraction for the purpose of semantic analysis (for example exclusion of standard browsing menus, not adding any meaning) .
  • microformats it is possible to use at the page server a functionality of automated creation of microformats on the basis of a first parameter (marker type) and a second parameter (content nature) , in order thus to automatically generate the character string constituting the name of the microformat in a CMS.
  • a program for managing microformats having this functionality is used, in several versions adapted to the most widespread dynamic web page generation technologies.
  • three versions of the code adapted to the PHP, ASP (.net) and Java environments are provided.
  • the system and method of the invention can further comprise, particularly at the collection and processing server, functionalities for: determining that certain contents normally protected by passwords of a subscribed site have been accessed by unauthorized persons; - determining that certain contents normally protected by passwords of a subscribed site have been accessed by the same password simultaneously from two different client stations; determining that a subscribed site has been visited by robots rather than by human operators;
  • the information resulting from processing operations performed by the collection and processing server may have several types of uses, and in particular: - an operational "tactical" use, on a day-by-day basis (by the detection of short-term sales evolutions by product, etc.), mainly for performance measurement and correction purposes;
  • the present invention is not limited to the embodiments described above and illustrated in the drawings, but the skilled person can devise numerous other embodiments.
  • the present invention has been described in the specific framework of a server providing web pages, the present invention applies to a user station remotely or locally accessing any source of information including but not limited to static and dynamic web pages, files, streaming contents, web or other applications, etc.

Abstract

The present invention provides a method for monitoring the behavior of a user accessing an electronic source of information such as a web site, electronic documents, web applications or software applications, accessed on a user station, comprising the following steps: - providing in the user station an intelligent agent at the same time as at least a part of the source of information is accessed, - executing said agent in said user station in order to collect data representative of user actions and semantic contents in relation with accessing the source of information, and to transfer said collected data to a collection and processing server, and - collecting and processing said data in said collection and processing server to output user data regarding user behavior. The present invention further provides a source of information generator, an intelligent agent and a combination of a server providing sources of information and a plurality of user stations in relation with this method.

Description

SYSTEMS AND METHODS FOR ACQUIRING, COLLECTING AND
PROCESSING DATA RELATING TO REMOTELY OR LOCALLY ACCESSED
ELECTRONIC DOCUMENTS OR APPLICATIONS
Field of the invention
The present invention concerns in general terms the acquisition, collection and processing of data relating to electronic documents or applications use, such as but not limited to web sites, electronic documents, web applications or software applications, either remotely accessed from a client station ("user station") or locally accessed, whichever programming language or communication protocol is used during the process. A client station can be any system able to interpret an electronic document or execute software applications. The document can be accessed or the application be executed either on the user station or on a remote system.
Background to the invention
Methods and systems for recognizing that an internet user visits a website or uses or interacts with a web application, returns to it, etc. are already known.
The first-generation technology (so-called "server side") is that of the log, which appeared at the time of the first client-server machines. The principle of the log is that of a file in which the server records its activity in serving web pages to a client machine and notes the problems, according to the desired log level. The first log level records that someone has connected to the server and has requested pages. Thus the data stream available at the server is simply used, without creating new information strictly speaking.
The second-generation technology (so-called "client side") has given rise to "Web Analytics" as a software activity sector. In particular, the company WebSideStory (now part of Omniture) has proposed a technology of JavaScript tagging of the pages of a website. It is in fact a tagging that acts at the client station: when the visitor to the internet site, on his client machine, requests the server to serve him pages, a code is installed on each link. The WebSideStory client attaches a specific JavaScript to each link and, each time there is a mouse click, the click has its normal action on the server and in addition the JavaScript tag code registers it and sends information (in the form of a URL link) to the WebSideStory server in order to indicate that the link named in this predetermined manner has been activated.
This system thus comprises a browser, a page server and a multitude of small predetermined JavaScript tags that send data to a server monitoring and analyzing the actions.
This technology is however very cumbersome because, at the page server, it is necessary first to design and organize the tag plan, then to tag all the links and all the pages. For example, if the page server is capable of offering 10,000 pages and there exist approximately 40 links per page (these figures being normal), it is necessary to place 400,000 tags on the respective 400,000 links. The cumbersome nature of such an approach is such that the most widespread tool at present for managing content (CMS, standing for "Content Management System") has had to develop in order to directly and industrially integrate tagging in its interface, thus avoiding manual tagging as was the case previously. Thus, when an operator produces the site content (text and/or photograph of an article, a product description, etc.), the tool also automatically places the name, set by the operator or its administrator, that must appear in the JavaScript.
Yet another drawback of this known technique relates to the fact that it is possible to place only one item of information per tag. This therefore makes it necessary to ensure the exhaustiveness and coherence of all the tagging, so that restoration is faithful to the intention of the site operator.
In this regard, knowing that in the life of a large site it very often happens that links are broken or displaced, the tagging system is not in a position to faithfully reflect the actions of the visitor to the site.
Summary of the invention
By analogy with the shelves of a shop, which are regularly moved according to the observed changes in behavior of visitors (most desired items, the way of moving in a shop, the way of exploring what is on offer in the shop in order to form a decision about purchase, possible difficulties in finding an item or department, etc.), it is often desirable to reorganize for instance the arrangement of a website according to changes in behavior of virtual visitors.
The result of such reorganizations is that links are regularly broken or displaced. In other words, hard-written links do not adapt to changes in an electronic document or application, and for instance for a web site or web application the quality of the link depends largely on the quality of the CMS solution generally put in place by the operator/user of the website, or the way in which the tags have been put in place by hand.
The present invention thus aims to propose an analysis system for electronic documents or applications, such as but not limited to web sites, electronic documents, web applications or software applications, either remotely accessed from a client station ("user station") or locally accessed, whichever programming language or communication protocol is used during the process, commercial or otherwise, aimed at revealing, in a much finer and more robust way, user behaviors and differences in behavior between users. For example, two persons passing through the same page on a web site and clicking on the same link do not necessarily have the same concerns: it is the journey as a whole that makes sense. It is the holistic view of the whole visit, or even the aggregation of several single visits, that makes sense. In addition, the present invention prevents the distortion in meaning that would otherwise be caused by the hard-coded information predefined by the operator and attached to the JavaScript.
It also aims at enabling this information to be put in correlation with other interactions between these users and the other touch points (points of contact) of the operator/user of the electronic documents or applications, in particular the call centre, the sales force, advertising, promotion with money-off coupons, help desks etc.
More generally, the invention aims at providing a novel chain for acquiring, organizing and processing behavioral data relating to use of or interaction with electronic sources of information. According to the invention, the acquisition cooperates directly with the collection and processing of the data by providing the acquisition with data organized and time-stamped in real time or almost real time.
Accordingly, the present invention provides a method for monitoring the behavior of a user accessing an electronic source of information such as a web site, electronic documents, web applications or software applications, accessed on a user station, comprising the following steps:
- providing in the user station an intelligent agent at the same time as at least a part of the source of information is accessed,
- executing said agent in said user station in order to collect data representative of user actions and semantic contents in relation with accessing the source of information, and to transfer said collected data to a collection and processing server, and
- collecting and processing said data in said collection and processing server to output user data regarding user behavior.
Preferred but non-limiting aspects of the above method are as follows:
* the step of providing the intelligent agent in said user station comprises embedding code intended to constitute said intelligent agent into a given zone of said source of information, and loading the source of information including said zone or said section into the user station.
* said intelligent agent comprises at least two codes executable in possibly different executing environments, said codes being deleted at different times when an electronic document or an application is unloaded from the user station.
* one code is executable in the environment of accessing said source of information, and another code is executable in a scripting environment.
* said agent is made from a plurality of code sections embedded in different places in the source of information .
* said data representative of user actions include inputs from human to machine interaction devices.
* said data representative of user actions comprise measured idle periods.
* said data representative of semantic contents comprise text blocks from the source of information.
* the method further comprises the steps of:
- for each loaded source of information, forming a key representative of the source content,
- for each newly loaded source of information, comparing said key with keys of previously loaded sources of information, and
- collecting data representative of semantic contents of the newly loaded source of information only if the key thereof does not equal the key of a previously loaded source of information.
* for sources of information including dynamic web pages, said data representative of semantic contents are collected each time the content is refreshed or at a fixed interval of time or according to any other trigger action .
* said source of information includes contents markers (microformats) , and the method further comprises the step of adapting the actions of said intelligent agent depending on these markers.
* at least some of said markers are also transferred to the collection and processing server.
* the method further comprises a step of identifying the user station.
* the method further comprises the steps of storing a user identifier in the user station for each source of information or group of such sources, and transferring said identifier to said collection and processing server.
* the method further comprises a step of linking different user identifiers in the user station.
* the source of information is provided by at least one information server, and said collection and processing server is at least partially embodied in said information server.
* said transfer of collected data is performed one item of collected data after another.
* said transfer of collected data is performed after accumulating a group of collected data.
* said intelligent agent and said collection and processing server are capable of bidirectional communication together.
According to another aspect, the present invention provides a generator capable of dynamically embedding in sources of information at least one executable code section intended to form an intelligent agent for performing the method as defined above.
According to a third aspect, the present invention provides an intelligent agent stored in a user station for performing the method as defined above.
The present invention further provides a method for generating page contents markers in web pages on a web site or a web application for performing the method as defined above.
Finally, the present invention provides a combination of a server providing sources of information and a plurality of user stations adapted to perform the method as defined above.
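Two of the preferred aspects listed above concern the transfer of collected data, either one item after another or after accumulating a group. The following is a minimal sketch of how a single buffer could cover both modes. The class name `Collector`, the `send` callback and the `batch_size` parameter are illustrative assumptions and do not appear in the original text.

```python
class Collector:
    """Buffers collected items and hands them to `send` in groups.

    Hypothetical sketch: batch_size=1 transfers each item as soon as it
    is collected; a larger value accumulates a group before transfer.
    """

    def __init__(self, send, batch_size=1):
        self._send = send            # callable receiving a list of items
        self._batch_size = batch_size
        self._buffer = []

    def record(self, item):
        # Store the item and transfer once the group is complete.
        self._buffer.append(item)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        # Transfer whatever is buffered, for example when the page unloads.
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()
```

With `batch_size=1` every call to `record` triggers a transfer, while a larger value trades latency for fewer requests to the collection and processing server.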
Brief description of the drawings
The invention will be better understood from a reading of the following detailed description of a preferred embodiment thereof, given by way of non-limitative example and made with reference to the accompanying drawings, in which:
- figure 1 is a synthetic overall logic diagram of the actions and interactions of three entities participating in the present invention (web page server of a site operator or server of an application operator subscribing to the system, client station of a visitor/user and collection and processing server),
- figure 2 is a more detailed overall logic diagram,
- figure 3 is an idle management diagram used in the present invention,
- figure 4 is a logic diagram illustrating the main steps of a semantic analysis process implemented in the collection and processing server,
- figure 5a is a logic diagram illustrating the main steps of a process of processing markers ("microformats") included in a page supplied by the web page server,
- figure 5b contains a list of types of information that can be marked by such markers, given by way of example, and
- figure 6 is a logic diagram illustrating the main steps of user identity reconciliation.
The contents of all figures are considered here as belonging to the description.
Detailed description of a preferred embodiment
1/ Technical environment and general presentation
The following description will be given with reference to figures 1 to 3, where the correspondence between blocks in these figures and the present description, not otherwise indicated, will clearly appear to the skilled reader.
The environment of the invention resides in the cooperation between a client station and a web page server or web application server responding to requests (URLs/URIs) sent by the client station. According to the invention, each page addressed by the server contains a code intended to be self-installed and executed, in the form of an intelligent agent, on the client station. In an embodiment implemented with state-of-the-art tools, this intelligent agent consists of three to five files compiled respectively according to Adobe Flash technology (2 .swf files), according to XML technology (.xml file), according to a proprietary encryption (file without extension) and according to JavaScript technology (.js file). One of the two .swf files and the .xml file are optional.
Thus, when the visitor sends requests and obtains pages or files, those pages or files received in response are tagged with these codes, rather than the links contained in these pages or files as was the case in the prior art.
Advantageously, generation of these page or file tags takes place automatically through a few code lines calling the files at the server (files previously stored in a well identified directory) in order to automatically embed them in a given area of any page or file generated.
These few lines of incorporation code are read systematically whenever a page or file is to be generated dynamically at the server and preferably placed in a block present in all the pages or files of the website (typically the page copyright block), at the end thereof.
It is therefore understood that the tagging according to the system and the method of the invention is effected in an extremely simple and repetitive manner, the same section of code being called for all the pages of the site (domain), including its possible sub-domains and/or other domains.
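By way of illustration only, the server-side incorporation described above might look like the following sketch, in which the agent markup is appended to a footer block shared by every generated page. The file paths `/agent/agent.js` and `/agent/agent.swf`, the markup and the function names are assumptions made for this example; the original text does not specify them.

```python
# Hypothetical incorporation snippet: paths and markup are assumptions.
AGENT_SNIPPET = (
    '<script src="/agent/agent.js"></script>\n'
    '<object data="/agent/agent.swf" width="1" height="1"></object>'
)


def render_footer(copyright_text: str) -> str:
    """Build the block present in all pages, with the agent code at its end."""
    return f'<div class="copyright">{copyright_text}\n{AGENT_SNIPPET}\n</div>'


def render_page(title: str, body: str) -> str:
    """Generate a page dynamically; every page receives the same footer."""
    footer = render_footer("(c) Example operator")
    return (
        f"<html><head><title>{title}</title></head>"
        f"<body>{body}\n{footer}</body></html>"
    )
```

Because the snippet lives in one shared block, no individual link or page needs to be tagged by hand.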
There are two different .swf files: the agent and the reader. The reader is required only if more than one domain or sub-domain is registered in the license file. As shown in figure 6, it is used to link information regarding a unique user identifier from different domains, so that the behavior of a user on one domain can be linked to the behavior of the same user on another domain. To find the reader for each domain or sub-domain, the agent uses parameters described in the .xml configuration file.
The role of the .js and the other .swf code files arriving with each page generated by the site on the client station is to be installed and executed, always in the same way, as illustrated in figures 1 and 2.
The functions of this agent are in particular to recognize the page or file in question, to recognize possibly from where the internet user comes, to recognize if and where he is moving to on the site, and to identify and recognize the internet user (even if it is an anonymous visitor, it is possible to determine whether or not he has already come to the site, even if accessed from a different Internet Service Provider than previously, the number of his visits, etc.) as shown in figure 2 (flows UID1 to UID6).
This agent sends information about the content of the page to the input server as described in figure 2, specifically in flows S1 to S4, and continuing in figure 4. This content can change over time through any kind of technology such as the one known as AJAX (Asynchronous JavaScript And XML), and the agent can therefore be configured to send the page content at a fixed time interval if it has changed.
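The idea of sending page content only when it has changed, which also underlies the content key mentioned in the summary, can be sketched as follows. This is an assumption-laden illustration: the original text does not say how the key is formed, and the SHA-256 hash, the whitespace normalization and the name `ContentTracker` are choices made here for the example.

```python
import hashlib


class ContentTracker:
    """Decides whether the semantic content of a page needs to be sent."""

    def __init__(self):
        self._seen_keys = set()  # keys of contents already collected

    @staticmethod
    def content_key(text: str) -> str:
        # Assumed key: a hash of the whitespace-normalized text.
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def should_send(self, text: str) -> bool:
        # Collect the content only if its key has not been seen before.
        key = self.content_key(text)
        if key in self._seen_keys:
            return False
        self._seen_keys.add(key)
        return True
```

An agent polling at a fixed interval would call `should_send` with the current page text and transfer it only when the result is true.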
If the visitor has supplied the site with at least one item of identification or authentication, it is possible to reconcile each visitor with all his previous paths, behaviors and centers of interest detected from the content of the page i.e. to attribute to him/her the paths, the behaviors and the centers of interest detected from the content of the page and then to transmit this information to the site operator for injection into his databases (subscribers to his electronic newsletter, updating of the client file, information for the call centre, etc) .
It should be noted here that, although the pages of a site are at the present time usually generated dynamically, the invention also covers an approach where the aforementioned code section is added "manually" in a static fashion (although the pages produced from static HTML pages are now on the verge of disappearing from commercial, industrial or professional web sites or web applications). Thus the present invention is not limited to any particular manner for the server to form and address the requested pages or files to the client station.
It should also be noted that the generation of the tagged pages according to the invention is entirely transparent vis-a-vis the information system of the site operator.
2/ Implementation on the server side
The tagging program supplied to the site operator has a first functionality of recognition of the environment in which it is executed. The implementation code lines are placed by the page server in a suitable manner according to the technology that this server uses in each of the pages served (for example PHP, ASP (.net) and Java) . The program has in fact several code versions having the same functionalities and adapted to the currently most widespread technologies of dynamic web page generation. In the present case, three versions of the code are provided, which are adapted to the PHP, ASP (.net) and Java environments. Naturally, this configuration is given only by way of example.
This program fits within the process of dynamic web page generation in order to insert in each page, preferably in the section of a block present in all the pages of the website (typically the copyright block as seen above), the code sections that form the two components, Flash and JavaScript, of the agent that will be executed when the page is received at the client station. The above process is thus sufficient to dynamically tag the whole of the site. Further, the reliability of this tagging is totally independent of the links contained in the pages and redirecting the Internet user to other pages (these links can in particular be active or dead, etc.), and the tagging is transparent to them.
3/ Implementation on the client station side
The Flash component (here a .swf file) of the agent can take any form, visible or not on the display of the client station (a block of NxN pixels that is transparent and therefore invisible to the internet user, logo of the site operator, etc.). The Flash component known as the reader is executed inside the agent and so has the same appearance.
An important feature of the agent .swf file is that the code executed on the client station is suitable for supplying, to a server collecting and processing information, in real time or almost real time, information representative of the fact that this page has been loaded by such or such Internet user and/or such or such client station.
In order to prevent client stations being able to consider the .swf and .js codes attached to each page as being spy programs (web bugs or the like) having the characteristic of coming from third-party servers, it is preferable for the page tagging program to be executed directly in the technical environment of the website. In a variant embodiment, it is however possible to incorporate the Flash and JavaScript files into the pages supplied by the page server at a third-party server, or at the collection and processing server as will be described below.
An advantageous functionality of the Flash code of the intelligent agent is to make its execution dependent on the identification of the associated domain (subscriber under contract to the service) in the current browsing or use of application, and on the determination of the validity of the subscription contract in terms of date.
This code will thus be inoperative in any other context: other domain names, same domain name but contract expired, etc.
With reference to the drawings, the intelligent agent is able to send, to a collection server whose address it knows, a certain number of items of information and in particular the following:
[Table of transmitted items of information, reproduced as images imgf000017_0001 and imgf000018_0001 in the original publication]
4/ Details of each function
a) Authorization of execution by the agent
At the start of its execution, the Flash code reads the system date of the client station. The Flash code relies upon a distinct file containing the data relating to the contract and rights to execute.
According to a variant, it is possible to provide a distinct version of the Flash code for each site operator, this code being compiled in such a manner that it contains the details of the data collection and processing service contract.
The system date of the client station, reliable in the very great majority of cases given the current techniques of automatic date updating by the operating systems, is used to verify the existence of a valid service contract. It should be noted here that all the other dates used by the method and system of the invention are determined at the collection and processing server, serving as time reference.
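The authorization check described in this subsection, which makes execution depend on the current domain and on the contract dates, could be sketched as below. The `License` structure, its field names and the `may_execute` function are hypothetical; the original only states that a distinct file holds the contract data and that the client system date is compared against it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class License:
    """Hypothetical contents of the contract file read by the agent."""
    domains: frozenset   # domains and sub-domains registered in the license
    valid_from: date
    valid_until: date


def may_execute(lic: License, current_domain: str, today: date) -> bool:
    """Return True only for a registered domain and an unexpired contract."""
    domain_ok = current_domain.lower() in lic.domains
    date_ok = lic.valid_from <= today <= lic.valid_until
    return domain_ok and date_ok
```

Here `today` stands for the system date of the client station; in any other context (other domain name, or same domain name with an expired contract) the agent stays inoperative.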
b) Recognition of the client station
In a very conventional manner nowadays, cookies enable an e-Commerce site or a personalizable site or application to construct pages according to profile data stored in these cookies, as a general rule JavaScript cookies.
Advantageously, relying upon Flash technology (or equivalent) rather than upon JavaScript cookies, which are passive, makes it possible to prevent what is stored in the client station at the arrival of the pages from the subscriber website from being deleted by the browser (for example when the internet user closes his browser) and to allow it to be used by several browsers on the same station. This makes it possible to reconstruct internet user paths from one session to another and therefore to collect data relating to all the sessions of an internet user in order to analyze them over the period, which may comprise visits and revisits, and not simply one visit on one day.
An advantage of Flash technology or equivalent is that an object of the "Local Shared Object" type, which operates in the same logic as a JavaScript cookie, is placed in a specific directory common to all the browsers on the station.
Flash makes it possible to limit the size of such a file to as little as 0 kilobytes, but by default the limit is 100 kilobytes. Therefore, if such a Local Shared Object is used to store a user identifier, the size of this object will be only a few bytes, well below the default limit and therefore acceptable.
In a variant and to avoid in the future the risk of automatic deletion or limitation of this type of file, it is possible to use a dynamic naming process for the Flash files.
c) Fineness of data acquisition
For a website whose content within the same page (rather than only the pages themselves) is produced dynamically by a technology such as Ajax (Asynchronous JavaScript and XML), where it is not the pages that are refreshed but only parts of the pages (for example stock exchange quotation ticker, news flashes, etc.), the intelligent agent is capable of extracting from the page a semantic content for the collection and processing server (as will be described in detail below) at each refreshing, according to a refresh rate hard-coded in the Flash agent code, rather than only when each page is loaded.
This makes it possible to acquire all the semantic data consulted by the visitors to these sites and all the actions of said visitors (which the techniques of the prior art do not allow).
d) Recognition of the user
Once the checking operations have been performed, a unique identifier is generated at the collection server and transmitted to the client station, where the intelligent agent recovers it in order to use it mainly as an identifier of the person who is in the process of visiting the site, storing it in the aforementioned "Local Shared Object". This identifier is preferably encrypted. The encrypting mode is chosen to make the identifier illegible to third parties (a technique based on hashing keys or the like). This visitor identifier is the cornerstone used for the remainder of the activities performed by the system and the method of the invention.
When a client station connects under the control of a visitor to a website belonging to the system and method of the invention (a subscriber) , the server of this site, when a page is first loaded, writes on the client station the files of the intelligent agent, the latter generating the identifier of the client station as described above, with consequently a specific identifier for each domain visited.
For an Internet user visiting n domains subscribing to the system, there are therefore n identifiers stored in the client station. It should be noted that for one site using multiple domains, there will be several unique identifiers. The reconciliation of different identifiers is made using both the agent and the reader as described in figure 6. The aim of this process is to store links between identifiers from different domains on a single client station for a single person. Thus the travel of a person between different domains can be followed.
According to one aspect of the invention, this makes it advantageously possible to reaggregate, by virtue of the personal identifier, the data collected for one and the same user operating from several different client stations (user stations). For example, a user accesses a subscribed site during the day on his office computer and then in the evening on his personal computer, and then the next day on the computer of his secondary residence. Of course, accesses from other machines such as PDAs, mobile telephones, computer gaming devices, on-board computers, embedded computers, etc., are advantageously also taken into account as user stations, the portable character of Flash and JavaScript technologies or other languages facilitating this.
Any other information provided by the user can be used to re-aggregate different identifiers. For example, use can be made of client information provided via a "microformat" or of information provided in the URL. In both cases, the information can be an email address, a name, a client number or any other information allowing the user to be reliably identified. The generation of this unique identifier and of the data called "microformats" will be detailed hereinafter.
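The linking of identifiers described above can be pictured, purely as a sketch, with a disjoint-set (union-find) structure: each per-domain identifier, and each item of identifying information such as an email address, is a node, and every observed link merges two groups. The class name, the method names and the identifier strings below are assumptions made for this illustration; the description does not prescribe a particular data structure.

```python
class IdentifierLinks:
    """Hypothetical store of links between identifiers of one and the same person."""

    def __init__(self):
        self._parent = {}

    def _find(self, ident):
        # Register unknown identifiers as their own group, then walk to the root.
        self._parent.setdefault(ident, ident)
        while self._parent[ident] != ident:
            # Path halving keeps later lookups short.
            self._parent[ident] = self._parent[self._parent[ident]]
            ident = self._parent[ident]
        return ident

    def link(self, a, b):
        """Record that identifiers a and b belong to the same person."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_person(self, a, b):
        return self._find(a) == self._find(b)

    def group(self, ident):
        """Return every identifier currently linked to `ident`, sorted."""
        root = self._find(ident)
        return sorted(k for k in list(self._parent) if self._find(k) == root)


links = IdentifierLinks()
links.link("shop.example:id-1", "blog.example:id-7")       # two domains, one station
links.link("blog.example:id-7", "email:x@example.com")     # email seen in a microformat
```

With these two links recorded, data collected under either domain identifier can be re-aggregated under the same person.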
5/ Collection of data
a) Text data and temporal data
In order to determine what an Internet user is consulting on a monitored site, the intelligent agent is capable of extracting text contents from the pages consulted on the site, which is carried out mainly by eliminating from the code of the page loaded everything that is images, animations, videos, JavaScript codes, etc.
"Text content" must be understood here in the broad sense, namely any character string present on the page, such as blocks of text but also legends to images or videos for example.
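As a minimal sketch of this kind of extraction, the following Python code (not the Flash code of the agent) drops script and style content and keeps every character string of the page, including image legends taken here from the `alt` attribute. The class name, the function name and the choice of `alt` as the legend are assumptions made for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text of a page while ignoring script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # Assumption: the legend of an image is carried by its alt attribute.
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt.strip())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # normalize whitespace
            if text:
                self.chunks.append(text)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```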
The essential role of the JavaScript component of the intelligent agent is to manage the time spent on the page and the activity time on the page, determined by monitoring the input devices (typically mouse and keyboard) .
If no action by the user is detected, this is counted as inactivity time. All the other functions described here are preferably carried out by the Flash code, including the extraction of text contents described above.
b) Details of the extraction of text contents
As indicated, the intelligent agent first of all extracts all the text data from the page. It then carries out a cleaning in order to extract therefrom only the semantic text part; it is this semantic part that will be one of the important parts of the data transferred to the collection server in order to be stored and processed therein.
The operator can influence the collection of text data by adding appropriate microformats, as described later in this document.
The Flash code of the intelligent agent communicates with the collection and processing server. The Flash code has the function of extracting the text content from the consulted pages. In order to avoid unnecessary work in duplication, this code is able to check whether, for a given page designated by its URL, the text content has already been acquired; this is preferably done by applying a hash function to the content already acquired and to the content of the page being consulted by means of a common key; if the hashing result is the same, the page is considered not to have changed and its extracted text content is not transmitted again to the server.
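The duplicate check described above can be sketched as follows. The description only speaks of a hash function applied with a common key; HMAC-SHA-256 is used here as one concrete keyed hash, and the key value, the class name and the method name are assumptions made for the illustration.

```python
import hashlib
import hmac

COMMON_KEY = b"hypothetical-shared-key"  # assumption: any key shared by both hashing operations

def content_digest(text, key=COMMON_KEY):
    """Keyed hash of the extracted text content of a page."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()

class PageCache:
    """Remember, per URL, the digest of the content already acquired."""

    def __init__(self):
        self._digests = {}

    def should_transmit(self, url, text):
        digest = content_digest(text)
        if self._digests.get(url) == digest:
            return False  # same hashing result: page considered unchanged
        self._digests[url] = digest
        return True

cache = PageCache()
```

Only the digest needs to be kept per URL, which keeps the comparison cheap even for long pages.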
In the end, one of the functionalities of the invention is to reconstitute the whole of the browsing universe of the Internet user, thus constituting a "semantic horizon", composed of the pages requested at least once by a client station. It should be observed here that such a "semantic horizon" makes it possible, in the context of a semantic analysis process as will be described below, to differentiate between a page as it exists in the whole of the website contents of the subscribed operator and the interest of a visitor for this page. Thus, for example, this semantic analysis process is capable of analyzing the consistency between the content of a page and the contents of pages consulted before and after it, and of attributing to the meaning of this page a different weight according to the degree of consistency. Such a process makes it possible to ignore pages consulted accidentally (involuntary click) or by error (for example "clothing for children" in the context of part or all of the single-visits aggregation on a clearly defined theme such as "footwear for men").
Thus the intelligent agent is capable of pre-processing or pre-organizing the text data contained in the pages consulted, which are then transferred to the collection and processing server and stored in this server with a view to processing thereof, allowing later attribution to their visitors or users.
In general terms, the server stores, in the form of a database, the text contents once "cleaned" by the agent, the visitor identifiers, links making it possible to aggregate visitor identifiers and to reconcile groups of visitors such as families or groups of relatives, the URLs of the pages, and finally the activity indicators and their time stamping. Other available information can of course be stored.
c) Details of the time stamping and activity measurement
The Flash code initiates calls by means of the integrated JavaScript code with a view to collecting, at the browser of the client station, information issuing from the input devices (mouse, keyboard, etc.), except for the closure of the page, which is managed by a distinct JavaScript file.
Use of a JavaScript code distinct from the Flash code for performing these operations is advantageous in that, when a web page is unloaded (a browser event conventionally referred to as "OnUnload") , it is the Flash code, in general relatively heavy compared with the other elements of the page, which is first unloaded from memory. Consequently the use of a JavaScript code instead of a Flash code for these functions makes it possible to continue to monitor the actions of the user during this phase even though the Flash code has been unloaded and is therefore unusable.
When the page is unloaded, the intelligent agent is adapted to send to the collection and processing server information representing the inactivity (or activity) time spent on the page.
This data is obtained in the present example by a temporal counting process as illustrated in figure 3. The Flash code increments a count of +1 for example every elapsed second. On the other hand, it counts +0 (no incrementation) if during the running second an action has been performed on the input device (typically mouse movement, scrolling, etc) .
The system thus has available, apart from the time-stamped instants of start of loading of the pages and end of unloading of the page, information (message sent by the intelligent agent from the client station to the collection and processing server) representing the duration of inactivity on the page, determined by the result of the count during unloading.
In combination with the aforementioned time-stamping data, the collection and processing server is capable of obtaining duration of activity on the page by simple difference calculation. In a variant embodiment, the time stampings can be performed at the client station by means of a system clock thereof, and the calculation of activity time performed by the intelligent agent.
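The temporal counting process and the difference calculation can be sketched as follows, under the assumption that time is cut into whole seconds starting at page load and that a second counts as inactive when no input action falls inside it. The function names and the representation of actions as a list of timestamps are assumptions made for the example.

```python
def count_inactive_seconds(load_time, unload_time, action_times):
    """Count the elapsed whole seconds during which no input action occurred.

    Each elapsed second adds +1 to the count, except (+0) when an action
    on an input device happened during that second.
    """
    total_seconds = int(unload_time - load_time)
    active_slots = {
        int(t - load_time)
        for t in action_times
        if load_time <= t < load_time + total_seconds
    }
    return total_seconds - len(active_slots)

def activity_seconds(load_time, unload_time, inactive_seconds):
    """Server-side difference: time on page minus reported inactivity."""
    return (unload_time - load_time) - inactive_seconds

# Page loaded at t=100 s and unloaded at t=110 s, with four input actions.
inactive = count_inactive_seconds(100, 110, [100.5, 100.75, 103.25, 109.5])
```

Two actions falling within the same second count only once, so the four actions above occupy three active seconds out of ten.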
The intelligent agent can also transmit to the collection and processing server information such as technical information on the machine (display size, colour depth, version of operating system, type and version of browser) and the previously visited page (information supplied as standard by the majority of browsers).
d) Monitoring of activity
The intelligent agent records not only the existence or not of activity on the input devices of the client station as described above, but also the nature of these activities (mouse clicks and their types such as left button, right button, scroll wheel, etc.) and their consequences, such as display of pop-up menus, text selection, execution of JavaScript codes and more generally everything relating to dynamic HTML (DHTML).
Also recorded are entries in forms (subscription to newsletters, postal addresses, financial instruments, passwords, etc), this being done character by character.
e) Basic indicator profile information
Independently of the execution of the Flash agent, some minimal information is gathered. The information sent is the URL, the Flash version and Internet browser information. This is also illustrated in figure 2.
f) Variant embodiments and evolutions of the acquisition process
It will be understood that the acquisition and time stamping processes as described above are intended to evolve with changes to technologies (in particular the new "Air" technology from Adobe) .
A current trend with market players consists of going towards the "Rich Internet Application", using technologies such as Flash, Flex and Air (Adobe), Silverlight (Microsoft) or Ajax, in order to provide applications having an interface and ergonomics similar to those of a heavy client installed on a station, such as an office application. However, browsing on some Rich Internet Applications may take place in a manner disconnected from the network, and does not make it possible to associate precise time stamping with the browsing events: it is only when the user reconnects to the network, for example in order to place an order after having consulted a catalogue off-line (this being transparent to the internet user), that the data of his browsing may then be recovered and transferred as a whole to the collection and processing server, with consequently a unique time stamping common to all the activity data during the off-line period. Recourse for example to a local relay clock is then provided.
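One possible reading of the local relay clock is sketched below: while off-line, the agent stamps each event with a local clock; on reconnection it sends the age of each event, and the server subtracts that age from its own reception time, which remains the time reference. This mechanism, the class name and the function name are assumptions made for the illustration and are not stated in the description.

```python
class OfflineBuffer:
    """Hypothetical client-side buffer for events recorded while off-line."""

    def __init__(self, clock):
        self._clock = clock      # callable returning local time in seconds
        self._events = []

    def record(self, name):
        self._events.append((self._clock(), name))

    def flush(self):
        """Return (age_in_seconds, name) pairs and empty the buffer."""
        now = self._clock()
        batch = [(now - stamp, name) for stamp, name in self._events]
        self._events.clear()
        return batch

def restamp(batch, server_receive_time):
    """Server side: rebuild event times relative to the server clock."""
    return [(server_receive_time - age, name) for age, name in batch]

# A fake local clock is used so that the example is deterministic.
ticks = iter([10.0, 25.0, 40.0])
buffer = OfflineBuffer(lambda: next(ticks))
buffer.record("open_catalogue")   # local time 10.0
buffer.record("view_product")     # local time 25.0
batch = buffer.flush()            # reconnection at local time 40.0
```

Because only differences of local times are sent, an incorrectly set client clock would not distort the rebuilt time stamps.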
The intelligent agent has a full duplex information channel open between the client station and the collection server. This channel can be used to insert contextual information, resulting from processing inside the collection server, that is made available to the web page, for example in the form of a JavaScript variable or of a triggered JavaScript or Flash function.
The intelligent agent can also be implemented in programming languages other than Flash and JavaScript, so that it can be used in environments other than internet websites and web applications. Such other environments can be:
- non-HTML semantic execution contexts,
- non-HTML pages,
- non-Internet Protocol network protocols,
- non-web-browser client station execution environments, such as video games or any software application,
- integration of the agent in the code of the web browser or as a plug-in of the web browser.
6/ Processing of the collected data
a) Semantic analysis
The purpose of this functionality at the collection and processing server is to process the text content recovered as described above, extracting the key concepts, in a totally automated fashion, and to collect together these concepts according to their meaning (with advantageously the possibility of "zooming" in depth on this meaning) in order to output them at the end of the service chain offered to the subscribing operator.
With reference to figure 4, this semantic analysis is preferably based on the analysis of the organization of a text, on the greater importance of certain words compared with others, on the greater importance of certain concepts, pre-defined (by ontologies or otherwise) or not, compared with others, and on the use of particular technical terms rather than a generic term.
A first step relies upon statistical processes of the stemming/lemmatization type, by correlation between the words and context analysis, removal of ambiguities and special dynamic dictionaries.
The second step is based on the links of the text contents to a general purpose ontology, or a designated group-of-concepts ontology (for instance distinct for France/Belgium, distinct for business/leisure activities, etc.), or an ontology established in direct relation to the activity of the subscribed operator (for instance a "mouse" in such a context would be taken into account as an animal or as a computer peripheral, depending on the activity of the subscribed operator).
It should be stated here that an ontology is a structured set of terms and concepts establishing the meaning of an information field (a) by the metadata of a namespace, or (b) by the items of a knowledge domain. It is a data model representing a set of concepts in a domain, as well as the relationships between these concepts. The processing of the text content is then arranged to restore a semantic network organized as a set of concepts completely describing a domain, these concepts being linked to each other by relationships that are taxonomic (hierarchization of the concepts) on the one hand and semantic (relationships of meaning between the words) on the other hand.
Yet another approach is to have recourse to an ontology that is not trade-based but centered on the user of the processed data. For example, for a site of the on-line supermarket type, the semantic analysis could be based on the ontology of this trade using for example the Food and Non-food categories, and subdividing the categories more and more finely according to a tree structure. Corresponding key words can be attached to this hierarchy of concepts.
Advantageously, each key word is attached to an element of the ontology of the subscribed operator so as to be able to use this in the same way as an ordinary hierarchical structure and to perform on it all kinds of data processing operations, in particular so-called "drill-down" processing operations (see in particular http://en.wikipedia.org/wiki/Data_drilling). It should be noted here that, with current data processing techniques, these processing operations are possible only in relation to hierarchy codifications, in particular those of analytical accounting tools, except if "hard" cumbersome parameterizations are carried out with the keywords of a specific ontology.
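To illustrate how key words attached to a hierarchy of concepts lend themselves to drill-down processing, the following sketch rolls key word counts up a small tree and then lists the children of a chosen concept. The tree below the Food and Non-food categories, the key words and the function names are hypothetical and serve only as an example.

```python
from collections import Counter

# Hypothetical ontology: each concept points to its parent (None for a root).
PARENT = {
    "Food": None, "Non-food": None,
    "Dairy": "Food", "Bakery": "Food",
    "Cheese": "Dairy", "Clothing": "Non-food",
}

# Hypothetical attachment of key words to concepts of the ontology.
KEYWORD_CONCEPT = {
    "camembert": "Cheese", "yoghurt": "Dairy",
    "baguette": "Bakery", "socks": "Clothing",
}

def roll_up(keyword_counts):
    """Propagate each key word count to its concept and all its ancestors."""
    totals = Counter()
    for keyword, count in keyword_counts.items():
        concept = KEYWORD_CONCEPT.get(keyword)
        while concept is not None:
            totals[concept] += count
            concept = PARENT[concept]
    return totals

def drill_down(totals, concept):
    """List the direct children of `concept` (None for the roots) with their totals."""
    return {
        child: totals[child]
        for child, parent in PARENT.items()
        if parent == concept and totals[child]
    }

totals = roll_up({"camembert": 3, "yoghurt": 2, "baguette": 1, "socks": 4})
```

Starting from the roots, each call to `drill_down` zooms one level deeper into the meaning of the consulted contents.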
In summary, the combination of a monitoring of the behavioral type of visitors with a semantic analysis of the visited pages makes it possible for example to supply the subscribing operators with information on the bidirectional behavioral interaction of the visitor/product relationship, and for example:
- which products are removed from the purchase basket at the last moment before going to the checkout;
- sensitivity or not to price, i.e. what is the relative price level of the removed items compared to the prices of the whole merchandising assortment or compared to the price level of the categories to which the removed products belong;
- propensity to buy (yes/no, level and proportions) or propensity to buy at a certain price, with or without discounts and other incentives;
- what links exist between the routes, the site browsing or application interaction, centers of interest, etc. and the final content of the basket or the transactions taking place;
- correlation between sensitivity to advertising and semantic context;
- generation of useful information to place cross-sell proposals or up-sell proposals (more extensive associated products, for example the sale of car insurance products to clients requesting household insurance), etc.;
- comparative analysis of concepts that attract the visitor, those that induce sales/transactions and those that encourage loyalty or repeat interactions;
- matches between concepts and segments or microsegments of products, etc.;
- behavior of visitors over long periods, detection of visit cycles and consumption cycles (these not being necessarily the same); etc.
b) Microformats
With reference to figures 5a and 5b, a marking "microformat" consists of a naming according to a determined keyword whose meaning is predetermined. For example, it may be a case of a generic product name from the catalogue of the website operator, and a marked microformat thus marks (delimits) each location on a page where such a product is found, by means of attributes (cited with reference to the current state of the art and liable to change to follow the evolution of technologies): id and class associated with an existing HTML tag or a tag added specifically to carry the microformat information, such as for example the div tag.
For example, if a table contained in a page represents an invoice, each cell in the table can be enhanced with microformat information in order to characterize the content of the cell. If a product name is alone in the middle of a text, it is characterized alone by means of a div tag carrying the microformat information.
Preferably, the microformats are generated directly during the dynamic generation of an HTML page by means of a microformat tagging engine that tags contents according to predetermined tagging rules (unlike the second-generation behavioral analysis systems described in the introduction, the contents are tagged rather than the links).
Each microformat is characterized by a format (for example product code, customer identifier, advertising campaign code, quantity, price, discount, carriage costs, etc) . A certain number of examples of microformats are given in figure 5bis of the drawings.
When a page is received, the information issuing from these microformats is transferred to the collection and processing server by the intelligent agent of the client station (by means of its Flash code) , and stored in the database of said server for example (typically for an on-line sales website) in a format (product A, page B, client C, quantity D, total amount E) .
These multiplets can then be processed by any suitable data processing tool.
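By way of a sketch only, the reading of microformats from a page and the construction of such a multiplet could look as follows in Python. The prefix "XXX_", the marker names other than "XXX_client_number", the assumption that the value is the text content of the marked tag, and the function and class names are all choices made for this example.

```python
from html.parser import HTMLParser

PREFIX = "XXX_"  # assumption: prefix identifying microformat class names

class MicroformatReader(HTMLParser):
    """Collect the text content of every tag whose class carries a microformat name."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._marker = None   # microformat currently being read
        self._tag = None
        self._depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if self._marker is not None:
            if tag == self._tag:
                self._depth += 1  # nested tag of the same kind
            return
        classes = (dict(attrs).get("class") or "").split()
        marker = next((c for c in classes if c.startswith(PREFIX)), None)
        if marker:
            self._marker, self._tag, self._depth, self._parts = marker, tag, 1, []

    def handle_endtag(self, tag):
        if self._marker is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                value = " ".join("".join(self._parts).split())
                self.fields[self._marker[len(PREFIX):]] = value
                self._marker = None

    def handle_data(self, data):
        if self._marker is not None:
            self._parts.append(data)

def page_record(url, html):
    """Build the multiplet (product, page, client, quantity, total amount)."""
    reader = MicroformatReader()
    reader.feed(html)
    reader.close()
    f = reader.fields
    return (f.get("product"), url, f.get("client_number"),
            f.get("quantity"), f.get("total_amount"))
```

A marker missing from the page simply yields `None` in the corresponding position of the multiplet.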
Such microformats make it advantageously possible for example to integrate analyses on data collected for non-client visitors (for example a visitor loads his basket but never goes to the checkout) in the transactional system of the subscribing operator.
They also make it possible for example to produce individualized histories of purchases made by the various clients (or non-finalized purchase intentions for non-clients), histories that the data processing tools can then correlate with the analysis of the browsing routes, centers of interest, product segments and microsegments.
For example, the system makes it possible to determine that a client visits the site N times before reaching a purchase decision, that another client visits only the luxury products pages, or whether a client or potential client is a discount hunter, a heavy spender or a buyer of high-margin items, etc.
These microformats can have meaningful identifiers (client ID, product name or reference, etc.) by the following mechanism: when an Internet user X identifies himself on the website, he in general enters his ID, and then "Hello Mr X" appears at the top of the page.
In this precise example, a microformat can mark this character string, by placing a div type tag containing the client number, named for example "XXX_client_number". This number is invisible to the Internet user on the web page that he is consulting, but on the other hand is read by the Flash code of the intelligent agent, which makes it possible to attach the current consultation to the identifier (identical by definition) situated in the client database.
And by virtue of these identifiers and the sequence of their processing, in combination with the identifiers stored in the "Local Shared Object" on a machine-by-machine basis, the collection and processing server is capable of determining that, when an Internet user places products in a checkout basket, this occurs on a certain client station (for example the one situated at his place of work), and that when the same internet user, returning in the evening to add a new product to the basket, validates his order, etc., this occurs on another client station (for example the one situated at his home).
The system and method of the invention are thus capable of making a link and reconciling the technical elements in order to provide a seamless vision of the activity of an internet user, and to retrieve a customer-centric vision taking account of his movements, his habits, his/her sensitivity towards certain elements, his/her propensity to act, his/her way to use or interact with electronic documents or applications, etc.
In this regard, as indicated above, the prior art systems are not capable of performing this dual identification, that is to say determining that the same user has connected to the site from two different machines.
They only make it possible to determine that an Internet user has connected to the server from such time to such time in the day, and that he has reconnected in the evening, without being able to determine his location.
In addition, in relation to the transactional systems/the information system of the operator, it is possible according to the present invention to correlate these web data with the email or postal addresses, by nature very precise and very segmenting, of the identified visitors.
The only prior art systems that make it possible to determine from where a visitor is connecting are server-side systems that incorporate in their CMS system a tool for acquiring information relating to the IP addresses of the internet users.
But no prior art solution based on client-side intelligence makes it possible to do this.
It is also necessary to bear in mind that the use of information relating to the IP address for deducing a geographical address therefrom is very imprecise.
On the other hand, the semantically legible link established by the method and the system of the invention between the microformats and the reference frames of the subscribing operators is much more precise: if a given person has already been a client, then the transactional accounting system of the subscribing operator already has his invoicing and delivery postal address, which makes it possible to easily apply segmentation processing operations of social-professional group type based on this geolocation. The system and method of the invention also make it possible for example:
- to determine which visitor often connects from a work environment type, or on the other hand from a home environment type;
- to determine that a visitor who has identified himself on a website was previously frequently connected to the site anonymously (with for example the habit over a certain time of visiting a site without identifying himself and without ordering anything), in order thus to be able to allocate to the profile of this person all the data relating to the previously made anonymous visits;
- by operating on the meaning of the browsing routes taken by the Internet users, to retrieve data relating to several internet users in the same group (family with father, mother, children for example) whose routes are semantically very different and to which it is thus possible to apply targeted marketing.
It is also possible to provide for certain microformats to be used to exclude certain text areas from the text extraction for the purpose of semantic analysis (for example exclusion of standard browsing menus, not adding any meaning) .
According to another possibility in relation to microformats, it is possible to use at the page server a functionality of automated creation of microformats on the basis of a first parameter (marker type) and a second parameter (content nature) , in order thus to automatically generate the character string constituting the name of the microformat in a CMS.
Advantageously, a program for managing microformats having this functionality is used, in several versions adapted to the most widespread dynamic web page generation technologies. In the case in question, three versions of the code adapted to the PHP, ASP (.net) and Java environments are provided.
7/ Other variants
The system and method of the invention can further comprise, particularly at the collection and processing server, functionalities for:
- determining that certain contents normally protected by passwords of a subscribed site have been accessed by unauthorized persons;
- determining that certain contents normally protected by passwords of a subscribed site have been accessed by the same password simultaneously from two different client stations;
- determining that a subscribed site has been visited by robots rather than by human operators;
- determining the most suitable advertisement according to centers of interest;
- obtaining traffic data in relation for example to sponsored links;
- attributing customers, according to their profitability, to affiliates of web sites or web applications ("affiliate" being understood here in its marketing sense).
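As an illustration of one of these functionalities, the detection of the same password being used simultaneously from two different client stations can be sketched as an interval overlap test. The session layout (credential, station, start, end) and the function name are assumptions made for this example.

```python
def concurrent_use(sessions):
    """Return the credentials used from two different stations at overlapping times.

    `sessions` is a hypothetical list of (credential, station, start, end)
    tuples, with start and end expressed in server time.
    """
    flagged = set()
    seen = {}
    for credential, station, start, end in sessions:
        for other_station, other_start, other_end in seen.get(credential, []):
            overlaps = start < other_end and other_start < end
            if other_station != station and overlaps:
                flagged.add(credential)
        seen.setdefault(credential, []).append((station, start, end))
    return flagged

sessions = [
    ("u1", "office", 0, 100),
    ("u1", "home", 50, 150),    # overlaps the office session of u1
    ("u2", "office", 0, 10),
    ("u2", "home", 20, 30),     # later session, no overlap
]
```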
8/ Use of the processed data
The information resulting from processing operations performed by the collection and processing server may have several types of uses, and in particular:
- an operational "tactical" use, on a day-by-day basis (by the detection of short-term sales evolutions by product, etc.), mainly for performance measurement and correction purposes;
- a more strategic use (quality and potential of the customer base, location of potential customers, how to segment the customer base, customer changes or product changes, etc.);
- use at the level of the individualized management of the customer base (response to requests, merchant/customer interactions, stimulation channels, purchase cycles, marketing return on investment, global profitability, loyalty-creation actions, etc.);
- differentiation of the members of the same family, all connected under the same identifier, by semantic analysis.
Of course, the present invention is not limited to the embodiments described above and illustrated in the drawings, but the skilled person can devise numerous other embodiments. In particular, although the present invention has been described in the specific framework of a server providing web pages, the present invention applies to a user station remotely or locally accessing any source of information including but not limited to static and dynamic web pages, files, streaming contents, web or other applications, etc.

Claims

1. A method for monitoring the behavior of a user accessing an electronic source of information such as a web site, electronic documents, web applications or software applications, accessed on a user station, comprising the following steps:
- providing in the user station an intelligent agent at the same time as at least a part of the source of information is accessed,
- executing said agent in said user station in order to collect data representative of user actions and semantic contents in relation with accessing the source of information, and to transfer said collected data to a collection and processing server, and
- collecting and processing said data in said collection and processing server to output user data regarding user behavior.
2. A method according to claim 1, wherein the step of providing the intelligent agent in said user station comprises embedding code intended to constitute said intelligent agent into a given zone of said source of information, and loading the source of information including said zone or said section into the user station.
3. A method according to claim 3, wherein said intelligent agent comprises at least two codes executable in possibly different executing environments, said codes being deleted at different times when an electronic document or an application is unloaded from the user station.
4. A method according to claim 3, wherein one code is executable in the environment of accessing said source of information, and another code is executable in a scripting environment.
5. A method according to any one of claims 1 to 4, wherein said agent is made from a plurality of code sections embedded in different places in the source of information.
6. A method according to any one of claims 1 to 5, wherein said data representative of user actions include inputs from human to machine interaction devices.
7. A method according to claim 6, wherein said data representative of user actions comprise measured idle periods.
8. A method according to any one of claims 1 to 7, wherein said data representative of semantic contents comprise text blocks from the source of information.
9. A method according to any one of claims 1 to 8, comprising further steps of:
- for each loaded source of information, forming a key representative of the source content,
- for each newly loaded source of information, comparing said key with keys of previously loaded sources of information, and
- collecting data representative of semantic contents of the newly loaded source of information only if the key thereof does not equal the key of a previously loaded source of information.
10. A method according to any one of claims 1 to 9, wherein, for sources of information including dynamic web pages, said data representative of semantic contents are collected each time the content is refreshed or at a fixed interval of time or according to any other trigger action.
11. A method according to any one of claims 1 to 10, wherein said source of information includes contents markers (microformats) , and further comprising the step of adapting the actions of said intelligent agent depending on these markers.
12. A method according to claim 11, wherein at least some of said markers are also transferred to the collection and processing server.
13. A method according to any one of claims 1 to 12, further comprising a step of identifying the user station.
14. A method according to any one of claims 1 to 13, further comprising the steps of storing a user identifier in the user station for each source of information or group of such sources, and transferring said identifier to said collection and processing server.
15. A method according to claim 14, further comprising a step of linking different user identifiers in the user station.
16. A method according to any one of claims 1 to 15, where the source of information is provided by at least one information server, and said collection and processing server is at least partially embodied in said information server.
17. A method according to any one of claims 1 to 16, wherein said transfer of collected data is performed collected data after collected data.
18. A method according to any one of claims 1 to 16, wherein said transfer of collected data is performed after accumulating a group of collected data.
19. A method according to any one of claims 1 to 18, wherein said intelligent agent and said collection and processing server are capable of bidirectional communication with each other.
20. A generator capable of dynamically embedding in sources of information at least one executable code section intended to form an intelligent agent for performing the method according to any one of claims 1-19.
21. An intelligent agent stored in a user station for performing the method according to any one of claims 1-19.
22. A method for generating page contents markers in web pages on a web site or a web application for performing the method according to claim 11.
23. A combination of a server providing sources of information and a plurality of user stations adapted to perform the method of any one of claims 1-19.
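
As an illustration only, the key comparison of claim 9 and the two transfer modes of claims 17 and 18 could be sketched as follows. This is a minimal sketch under stated assumptions, not the applicants' implementation: the claims do not say how the key is formed, so a SHA-256 digest of whitespace-normalized text is assumed here, and the names `content_key`, `Agent`, `send` and `batch_size` are hypothetical.

```python
import hashlib


def content_key(text: str) -> str:
    """Form a key representative of the source content (claim 9).

    Assumption: a SHA-256 digest of whitespace-normalized text. The claims
    leave the key construction open.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Agent:
    """Hypothetical agent that collects unseen content and transfers it."""

    def __init__(self, send, batch_size: int = 1):
        # send: callable standing in for the collection and processing
        # server; it receives a list of collected items.
        self.send = send
        # batch_size == 1 transfers item by item (claim 17);
        # batch_size > 1 transfers after accumulating a group (claim 18).
        self.batch_size = batch_size
        self.seen_keys = set()
        self.buffer = []

    def on_source_loaded(self, text: str) -> bool:
        """Collect semantic content only if its key was not seen before.

        Returns True when the content was collected, False when skipped.
        """
        key = content_key(text)
        if key in self.seen_keys:
            return False
        self.seen_keys.add(key)
        self.buffer.append({"key": key, "text": text})
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return True

    def flush(self) -> None:
        """Transfer whatever has accumulated, then empty the buffer."""
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()
```

With `batch_size=2`, for example, loading a page, then the same page with different spacing, then a different page results in a single transfer containing two items, because the second load yields the same key and is skipped. A real agent would also need to call `flush()` when the document is unloaded so that a partially filled group is not lost.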
PCT/EP2009/052134 2008-02-22 2009-02-23 Systems and methods for acquiring, collecting and processing data relating to remotely or locally accessed electronic documents or applications WO2009103820A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US3069008P 2008-02-22 2008-02-22
US61/030.690 2008-02-22

Publications (1)

Publication Number Publication Date
WO2009103820A1 (en) 2009-08-27

Family

ID=40707716

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2009/052134 WO2009103820A1 (en) 2008-02-22 2009-02-23 Systems and methods for acquiring, collecting and processing data relating to remotely or locally accessed electronic documents or applications

Country Status (1)

Country Link
WO (1) WO2009103820A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110507997A (en) * 2019-08-12 2019-11-29 广州小丑鱼信息科技有限公司 A kind of user behavior analysis method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002048902A2 (en) * 2000-12-11 2002-06-20 Systar S.A. System and method for providing behavioral information of a user accessing on-line resources
US20030145071A1 (en) * 2002-01-31 2003-07-31 Christopher Straut Method, apparatus, and system for capturing data exchanged between server and a user
US20060155764A1 (en) * 2004-08-27 2006-07-13 Peng Tao Personal online information management system
US20070124202A1 (en) * 2005-11-30 2007-05-31 Chintano, Inc. Systems and methods for collecting data and measuring user behavior when viewing online content
US20070232290A1 (en) * 2006-04-03 2007-10-04 Tatman Lance A System and method for measuring user behavior and use of mobile equipment

Similar Documents

Publication Publication Date Title
Eirinaki et al. Web mining for web personalization
US8185608B1 (en) Continuous usability trial for a website
US7120590B1 (en) Electronically distributing promotional and advertising material based upon consumer internet usage
US10891657B1 (en) Directed content to anonymized users
US20080005313A1 (en) Using offline activity to enhance online searching
US20080004884A1 (en) Employment of offline behavior to display online content
US20120331102A1 (en) Targeted Content Delivery for Networks
JP2007510973A (en) Optimization of advertising activities on computer networks
US20090171754A1 (en) Widget-assisted detection and exposure of cross-site behavioral associations
US20140033007A1 (en) Modifying the presentation of a content item
KR20150130282A (en) Intelligent platform for real-time bidding
WO2001054034A9 (en) Electronic commerce services
WO2014141078A1 (en) A method of and system for providing a client device with particularized information without employing unique identifiers
CA2943338A1 (en) System and method for identifying user habits
US20110313833A1 (en) Reconstructing the online flow of recommendations
Wiedmann et al. Customer profiling in e-commerce: Methodological aspects and challenges
JP2009193465A (en) Information processor, information providing system, information processing method, and program
JP2009265833A (en) Advertisement system and advertisement method
JP2011508925A (en) Detect and publish behavior-related widget support
CN106228391A (en) The method and system of monitoring of the advertisement
JP2010113542A (en) Information provision system, information processing apparatus and program for the information processing apparatus
CN106228390A (en) The monitoring of the advertisement method and the corresponding reward voucher that utilize electronic coupons use terminal
CN102957722A (en) Network service Method and system for generating personalized recommendation
Patel et al. Process of web usage mining to find interesting patterns from web usage data
Kuo et al. Personalization technology application to Internet content provider

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09713622

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09713622

Country of ref document: EP

Kind code of ref document: A1