WO2015183098A1

WO2015183098A1 - Method and system for collecting, transforming, storing, and presentation of data from multiple data sources.

Info

Publication number: WO2015183098A1
Application number: PCT/NO2015/050090
Authority: WO
Inventors: Harald Jellum
Original assignee: Companybook As
Priority date: 2014-05-24
Filing date: 2015-05-26
Publication date: 2015-12-03
Also published as: EP3149690A1; EP3149690A4

Abstract

Method and system for collecting, transforming, storing and presenting data from multiple data sources on a common platform, where data from multiple business sources is transformed to the same structure, which makes it possible to summarize, compare, and see differences, statistics, trends and other relations between the sources.

Description

TITLE:

Method and system for collecting, transforming, storing, and presentation of data from multiple data sources.

PRIOR ART

Business Information databases:

There exist several business information systems which contain information about company names, addresses, contact data, turnover, result and other typical structured data elements. Examples are Yellow pages, Proff.no, D&B, Experian and others. Typically, these are built on a database which is updated manually, or partly automatically, through a content management system where the input originates from persons calling the companies in the database to collect information. The update frequency is typically every 1-2 years. One problem with such products and solutions is that the content production is primarily based on input from one or, at best, a few sources. Therefore there may be information which is of importance but never becomes part of the system, because the information does not appear in the input sources used. The other problem is that these systems are very resource demanding to keep updated.

Media monitoring:

There exist various systems for media monitoring, which allow the user to manually search for a company name and then see news articles containing the exactly defined company name. The challenge is that there are many examples of names, places, and things that have the same name as a company. Examples are Apple the fruit versus the company Apple, persons named Ericsson versus the company Ericsson, or the generic distance term "miles" as compared to the company Miles. In these cases a user performing a search based on such search terms will experience a lot of "noise", or unwanted hits. The same applies when a search for products is performed. A problem with these systems is that they offer no possibility of avoiding the "noise".

Search engines:

There is a lot of business information on the internet. This content is typically indexed by search engines so that the user can search by free text to find the best matching web sites. When looking for a company, the user typically gets a list of URLs. These are often ranked by different parameters controlled by the search engine. One problem with a typical search engine is that it does not understand the content. A further problem is that the pages resulting from a search have only one common feature, namely the search string, or parts of the search string. As a result, completely unrelated articles or pages may be shown as results of the same search.

BRIEF DESCRIPTION OF THE INVENTION:

The invention provides a method and system for collecting, transforming, storing and presenting data from multiple data sources on a common platform, where data from multiple business sources is transformed to the same structure, which makes it possible to summarize, compare, and see differences, statistics, trends and other relations between the sources.

FIGURES

Fig 1 is a block diagram overview of the system

Fig 2 is a block diagram overview of Filtering, Entity extraction and standardized ID's

Fig 3 is a flow chart of an embodiment of the method

DETAILED DESCRIPTION OF THE INVENTION

The invention provides a method and system for providing a complete "world" overview of information related to business information systems, such as company web pages, company databases, private and public registers, search engines and business applications, users, employees, owners, consultancies, business professionals or other relevant business relations, news, forums, blogs, social networks or similar.

The invention provides a method and system for viewing information from different sources which originally have different formats. The object of the invention is to provide an information system usable by employees, owners, consultants or business contacts of a company, and optionally by any person in need of a complete overview of a topic or entity, irrespective of the type of source the information originates from, be it a company information system such as a company database, financial registers, public or private company registers, company catalogues, other business intelligence systems, company applications, search engines, or other relevant business systems.

Further, the invention solves the above mentioned problems of the prior art by structuring all the information present in the available data sources, thereby providing a relation between information about, for example, companies, products, services, trends, sentiments, connections between companies or other business related information, and thereby presenting a "world" overview where the relations between information and source become clearer. The invention will also present a combined information overview of an object collected from different sources. A typical use of the invention is to aid in the search for businesses, products, persons or business opportunities.

It is an object of the invention to provide a user with a method and system for improved ability to create new opportunities for businesses by services with much better combined business knowledge from various sources. The invention can be integrated with other company information solutions.

In other words, the method and system of the invention transforms data from multiple business sources into the same format which makes it possible to summarize, compare, see differences, statistics, trends and other relations between the information from the various sources and to enable combination of the total volume of information in a way that it represents the complete "world's" combined business information volume.

The method and system of the invention, with reference to fig. 1, continuously read 600 all business sources 700, 710, 720, 730, 740, 750 comprising structured or unstructured information. All information in the search results is then filtered 500, which results in relevant business news. All entities are extracted 400, and the entities are mapped to «standardized» ID's 300 and stored in a database 200 of the invention. It is thus possible to summarize 100, and to see differences 120, statistics 140, trends 160 or other relations between the sources.

The method and system of the invention automatically transforms unstructured and structured business data into the same structure which makes it possible to summarize, compare, see differences, statistics, trends and other relations between the different sources, in other words a method and system which makes it possible to combine all business information together.

The method and system of the invention use advanced search technology to crawl 600 business sources 700, 710, 720, 730, 740, 750. The crawled data is then filtered 500, and entities are automatically extracted 400 and transformed to standardized ID's 300, which converts the data into a common structure that is stored in a search database 200. The search database 200 offers a number of services where the information can be combined to provide, for example, but not limited to: Weighted sum from all sources 100, Differences between sources 120, Statistics 140, Trends 160, and Relations between sources, filtering and content 180. These and other combinations of information form the basis for the presentation of the information.

The invention may use machine learning, natural language processing, training sets, word vectors, stemming and other relevant techniques combined with synonyms, dictionaries, databases and language translators.

The method of the invention transforms data from multiple business and data sources to the same data structure by using modern search technology 600 together with filtering 500, entity extraction 400 and "mapping" 300 to a common structure in a database 200. This enables a user to summarize, compare, and see differences, statistics, trends and other relations between the sources 100-160.
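As an illustration only, the chain of crawling 600, filtering 500, entity extraction 400, mapping 300 and storage 200 described above could be sketched as follows. The vocabulary, the function names and the in-memory stand-in for the database are assumptions made for this sketch and are not taken from the application; a real system would use the text recognition techniques mentioned in the description rather than plain substring matching.

```python
from dataclasses import dataclass, field

# Hypothetical vocabulary: surface form -> (entity type, standardized ID).
VOCABULARY = {
    "acme as": ("company", "C1"),
    "smart mobile phones": ("product", "P1"),
    "smart telephone": ("product", "P1"),
}


@dataclass
class Database:
    """In-memory stand-in for database 200: every row has the same structure."""
    rows: list = field(default_factory=list)

    def store(self, source, entity_type, entity_id):
        self.rows.append({"source": source, "type": entity_type, "id": entity_id})


def crawl(sources):
    """Stand-in for crawler 600: yields (source name, text) pairs."""
    for name, documents in sources.items():
        for text in documents:
            yield name, text


def is_business_relevant(text):
    """Stand-in for filter 500: keep text that mentions any known term."""
    lowered = text.lower()
    return any(term in lowered for term in VOCABULARY)


def extract_entities(text):
    """Stand-in for extraction 400: return the known surface forms found."""
    lowered = text.lower()
    return [term for term in VOCABULARY if term in lowered]


def map_to_id(term):
    """Stand-in for mapping 300: surface form -> (entity type, standardized ID)."""
    return VOCABULARY[term]


def run_pipeline(sources):
    db = Database()
    for source, text in crawl(sources):
        if not is_business_relevant(text):
            continue  # dropped by the relevance filter
        for term in extract_entities(text):
            entity_type, entity_id = map_to_id(term)
            db.store(source, entity_type, entity_id)
    return db


# Two hypothetical sources; the weather item is not business relevant.
SOURCES = {
    "news": ["Acme AS launches smart mobile phones", "Weather stays mild"],
    "web": ["Our smart telephone range"],
}
db = run_pipeline(SOURCES)
```

Because both sources end up with rows that carry the same product ID, they can be compared or summed directly, which is the purpose of the common structure.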

One embodiment of the system of the invention is outlined in fig. 1, where the system comprises crawler (600) modules which automatically read information from different business data sources and the like. These sources can comprise both structured and unstructured information, such as one or more of, but not limited to: company web sites 700, company databases, private or public registers 710, search engines, applications within business/company services 720, users, employees, owners, consultancies, business professionals or other relevant business relations 730, News 740 and Forum, Blogs, Social networks or similar 750. The information passes through a business relevance filter (500) module which filters out information such as, but not limited to, business related terms and expressions such as: Company name, products and services, locations, Language, turnover, result or other financial data, customers, competitors or other business relations, Market, Industry, contact data, sentiments, rules or other Business related content. Next is an entity extraction (400) module for identification of entities such as, but not limited to: company name, person name and title, industry, product, location, market, financial data or other business related entities. A mapping module 300 then maps each entity to a standardized unique ID for its type of entity group, in such a way that all information is stored in a common structure in the database 200. From this common structure it is possible to derive relations such as, but not limited to: the weighted sum from all sources 100, the differences between the sources 120, statistics 140 and trends 160 or other relations or operations between the sources.

The system is further discussed in figure 2, where it is shown that business data from structured and unstructured sources 700-750 is checked for being "business relevant". Examples of techniques used are business name dictionaries, addresses, contact data, persons, industry, catalogues, dictionaries, language modules, location, rules and learnings from read content. These may be further optimized depending on what type of business data shall be passed on to the transformation process in the entity extraction module 400. Entities are then extracted from the text of the relevant business information. This is done by, for example, machine learning, natural language processing (NLP), training sets, word vectors, stemming and other relevant techniques combined with dictionaries, synonyms, databases, language translators or other text recognition technology 405. Examples of entities which can be extracted are one or more of, but not limited to: company name, person name and title 410, industry 415, product 420, location 425, market 430, financial data 435 or other business related entities. The entities are then sent on to a mapping module 300 which standardizes the entities to unique ID's. An example is the entities «smart mobile phones» and «smart telephone», which both mean the same and thus will be mapped to and associated with the same ID. Other techniques may be used, such as industry SIC (Standard Industrial Classification) codes. Search technology stemming is another method, which identifies the base form of the word; yet another method is soundex, which identifies the sound picture of the word. Synonyms, NLP, vector representation of expressions, machine learning and known training sets (300) are other examples.
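The mapping of «smart mobile phones» and «smart telephone» to one ID can be illustrated with a minimal sketch that combines a crude stemmer with a synonym table. The synonym table, the plural-stripping rule, the `IdMapper` class and the ID format are all assumptions chosen for the example; the application does not prescribe them.

```python
import re

# Hypothetical synonym table: normalized phrase -> canonical phrase.
SYNONYMS = {
    "smart mobile phone": "smartphone",
    "smart telephone": "smartphone",
}


def stem(token):
    """Very crude stemmer: strip one trailing plural 's' (illustration only)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(phrase):
    """Lowercase, tokenize on letters and digits, and stem each token."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    return " ".join(stem(t) for t in tokens)


class IdMapper:
    """Assigns one standardized ID per canonical phrase within an entity group."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.ids = {}

    def map(self, phrase):
        key = normalize(phrase)
        key = SYNONYMS.get(key, key)  # fold synonyms into one canonical phrase
        if key not in self.ids:
            self.ids[key] = f"{self.prefix}{len(self.ids) + 1}"
        return self.ids[key]


products = IdMapper("P")
id_a = products.map("smart mobile phones")
id_b = products.map("Smart telephone")
id_c = products.map("tablets")
```

Here `id_a` and `id_b` are the same ID, while `id_c` receives a new one. A separate mapper per entity group (company, product, location and so on) keeps the ID spaces apart.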

An embodiment of providing a weighted sum of all sources is presented in fig. 3, where the example makes use of data from the database 200 comprising a common structure for all sources. Business information from different sources comprising company name and associated products can be summarized. For each source, a 3-dimensional representation 101 is created with company name ID as one axis and associated product ID as the other axis, and the probability of belonging to a market is indicated by the height 102 of the displayed surface. It is also possible to combine this with the company industry code, location, financial figures or other business related parameters. This is done in the same way for each business source, illustrated as different layers 103-105 in the figure. These can then be summarized 106 to show the sum of all probabilities within a given market. In addition, one can weight the different sources by their respective quality, size, feedback from users etc.

In one embodiment the height of the 3-D plane is the probability for a company or product to belong to a given market. It is possible to include other parameters which also have effect on the height, such as: Weight of source (trust score), and Size of source. This principle applies for all other properties and combinations of the information from the company or source.
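One possible reading of the weighted summation of layers 103-105 into a combined surface 106 is a weighted average per (company ID, product ID) cell. The sketch below assumes that reading; the source names, weights and probabilities are invented, and treating a cell that a source does not report as probability 0 is an assumption of the sketch, not something stated in the application.

```python
def combine_layers(layers, weights):
    """Combine per-source layers into one weighted-average layer.

    layers:  {source: {(company_id, product_id): probability}}
    weights: {source: non-negative trust weight}
    A cell missing from a source counts as probability 0 for that source.
    """
    total_weight = sum(weights[source] for source in layers)
    if total_weight <= 0:
        raise ValueError("at least one source needs a positive weight")
    combined = {}
    for source, layer in layers.items():
        for cell, probability in layer.items():
            combined[cell] = combined.get(cell, 0.0) + weights[source] * probability
    return {cell: value / total_weight for cell, value in combined.items()}


# Hypothetical layers: the height of each cell is the probability that the
# company, given the product, belongs to the market of interest.
layers = {
    "register": {("C1", "P1"): 0.9, ("C2", "P1"): 0.2},
    "news": {("C1", "P1"): 0.6},
    "web": {("C1", "P1"): 0.3, ("C2", "P1"): 0.8},
}
weights = {"register": 2.0, "news": 1.0, "web": 1.0}  # trust scores (assumed)
combined = combine_layers(layers, weights)
```

Dividing by the total weight keeps the combined height in the range 0 to 1, so it can still be read as a probability.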

In a first embodiment of the system of the invention the system may comprise one or more crawler (600) modules, wherein the crawler modules are set up to search and fetch data from structured and unstructured business and data sources such as, but not limited by: company web pages (700), company data bases, private and public registers (710), search engines and business applications (720), users, employees, owners, consultancies, business professionals or other relevant business relations (730), News (740) and forum, Blogs, Social networks (750);

a filtering (500) module, where the filter module is set up to filter out from the searched and fetched data from the crawler (600) modules terms and expressions such as, but not limited to:

Company name, products and services, locations, Language, turnover, result or other financial data, customers, competitors or other business relations, Market, Industry, contact data, sentiments, rules or other Business related content,

an entity extraction (400) module, wherein the entity extraction (400) module identifies entities such as, but not limited to: company name, person name and title, industry, product, location, market, financial data or other business related entities,

a mapping (300) module for mapping entities to standard ID's,

a database (200) for storing the data that is searched and fetched by the crawler (600) modules, filtered in the filtering (500) module, extracted in the extracting (400) module and mapped in the mapping (300) module, in predefined data structures,

one or more output (100 - 160) modules for providing relational information between the data sources.

In a second embodiment of the system of the invention according to the first embodiment of the system of the invention, the system further comprises a network service and a communication module, for providing communication between the system and a user.

In a third embodiment of the system of the invention according to the first or second embodiments of the system of the invention, the entity extraction module comprises an entity recognizer (405) module for further optimization of the searched and fetched data and recognition of relevant business information by for example, but not limited to: machine learning, natural language processing (NLP), training set, word vectors, stemming and other relevant techniques combined with dictionaries, synonyms, databases, language translators or other text recognition technologies.

In a fourth embodiment of the system of the invention according to any of the previous embodiments of the system of the invention, the relational information from the one or more output modules comprises one or more of, but not restricted to: a weighted sum from all sources (100), differences between sources (120), statistics (140) and trends (160) between the sources.

In a fifth embodiment of the system of the invention according to any of the previous embodiments of the system of the invention, the system is comprised in a cloud service.

In a first embodiment of the method of the invention the method provides a common platform for representation of data from multiple data sources using the system defined in any of the previous claims, the method comprising performing the following steps:

searching and fetching data using crawlers (600) from structured and unstructured business and data sources such as, but not limited by: company web pages (700), company data bases, private and public registers (710), search engines and business applications (720), users, employees, owners, consultancies, business professionals or other relevant business relations (730), News (740) and forum, Blogs, Social networks (750);

filtering out from the searched and fetched data from the crawler (600) modules, in a filtering (500) module, terms and expressions such as, but not limited to: Company name, products and services, locations, Language, turnover, result or other financial data, customers, competitors or other business relations, Market, Industry, contact data, sentiments, rules or other Business related content,

identifying and extracting entities in an entity extraction (400) module, wherein the entities may comprise, but not limited to: company name, person name and title, industry, product, location, market, financial data and other business related entities,

mapping entities to standard ID's, in a mapping (300) module,

storing the data that is searched and fetched by the crawler (600) modules, filtered in the filtering (500) module, extracted in the extracting (400) module and mapped in the mapping (300) module, in predefined data structures in a database (200), and

outputting relational information between the data sources by one or more output (100 - 160) modules.

In a second embodiment of the method of the invention according to the first embodiment of the method of the invention, the filtering operation is further set up to learn from previously filtered content.

In a third embodiment of the method of the invention according to the first or second embodiments of the method of the invention, the identifying and extracting entities operation is further set up to learn from previously identified and extracted content.

In a fourth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the identifying and extraction of entities operation may use techniques such as machine learning, «natural language processing» (NLP), training sets, word vectors, stemming and other relevant techniques combined with dictionaries, synonyms, databases, language translators or other relevant text recognition technologies (405).

In a fifth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the mapping of entities to standardized ID's can use techniques such as, but not limited to: other ID standards, own standards, search technology such as stemming, «soundex», synonyms, NLP, vector representation of expressions, machine learning and known training sets (300).
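To illustrate how «soundex» can bring differently spelled names to the same key before an ID is assigned, the following is a compact sketch of the standard American Soundex coding (first letter plus three digits). It is a generic implementation of the well-known algorithm and is not taken from the application; the example names are chosen for illustration.

```python
# Letter groups of American Soundex; vowels, y, h and w carry no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex key of a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    previous = CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets the previous code to ""
    return (encoded + "000")[:4]  # pad with zeros or truncate to 4 characters


key_a = soundex("Ericsson")
key_b = soundex("Erikson")
```

Both spellings give the same key, so a mapping module could treat them as candidates for the same standardized ID, subject to further checks.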

In a sixth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the common structure (200) of the sources may be used as a weighted sum of the business information from all sources to give a summarized possibility to belong to a given market based on a set of products (100).

In a seventh embodiment of the method of the invention according to any of the previous embodiments of the method of the invention a company name ID and product ID are set together in a 3-dimensional plane (101), where the height is the probability to belong to a given market given a set of products (102) or other relevant combination of a company's properties.

In an eighth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the different planes (101) represent corresponding different business sources (103-105), and these may be summarized (106) into one weighted probability for all sources.

In a ninth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the different sources may be weighted based on their quality, trust, reputation or other relevant parameters.

In a tenth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the probability to belong to a market given a set of products may depend on the company's official industry code, location, financial numbers and other business related parameters.

In an eleventh embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the output of a common structure (200) may show trends over time to develop relationships between companies, products, locations, market, financial strength and other business relations.

In a twelfth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the output of a common structure (200) may show statistics over most common trends, most popular products and services, most popular companies, industries, locations, megatrends, technology development or other relevant relations.
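The trend and statistics outputs described above could, in their simplest form, be counts over the rows of the common structure. The sketch below assumes each stored row carries a period and a product ID, and uses the change in mention count between the first and last period as a deliberately crude trend measure; the data and the measure are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical rows from the common structure: (period, product ID).
MENTIONS = [
    ("2014-01", "P1"), ("2014-01", "P2"), ("2014-01", "P2"),
    ("2014-02", "P1"), ("2014-02", "P1"), ("2014-02", "P2"),
    ("2014-03", "P1"), ("2014-03", "P1"), ("2014-03", "P1"),
]


def counts_per_period(mentions):
    """Return {entity ID: Counter({period: number of mentions})}."""
    series = defaultdict(Counter)
    for period, entity_id in mentions:
        series[entity_id][period] += 1
    return series


def trend(counter, periods):
    """Change in mentions between the first and the last period."""
    return counter[periods[-1]] - counter[periods[0]]


periods = sorted({period for period, _ in MENTIONS})
series = counts_per_period(MENTIONS)
trends = {entity_id: trend(counter, periods) for entity_id, counter in series.items()}

# Statistics: the most mentioned product over all periods and sources.
most_mentioned = Counter(entity_id for _, entity_id in MENTIONS).most_common(1)[0]
```

In this data P1 rises from one to three mentions while P2 falls from two to none, and P1 is the most mentioned product overall.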

In a thirteenth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the output for a common structure (200) may show differences between sources as e.g. based on locations, deviation from the normal, normal distributions, standard deviation, derived over time or similar.
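As a small illustration of showing differences between sources by deviation from the normal, the sketch below computes how many standard deviations each source lies from the mean of all sources for one reported figure. The figures, the source names and the threshold of 1.5 are assumptions made for the example.

```python
from statistics import mean, pstdev


def deviations(values_by_source):
    """Return {source: z-score} relative to the mean over all sources."""
    values = list(values_by_source.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return {source: 0.0 for source in values_by_source}
    return {source: (value - mu) / sigma
            for source, value in values_by_source.items()}


# Hypothetical turnover figures for one company as reported by four sources.
reported = {"register": 10.0, "news": 10.0, "web": 10.0, "forum": 22.0}
z = deviations(reported)
outliers = [source for source, score in z.items() if abs(score) > 1.5]
```

With these figures only the forum source is flagged, which is the kind of difference between sources that the output module could highlight.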

In a fourteenth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the solution may be integrated as a part of other systems such as company databases, financial registers, public company registers, company catalogues, other dictionaries for companies, business applications, search engines and other relevant business systems.

In a fifteenth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the filtering (500), entity extraction (400) and mapping of entities to ID's may be enhanced by feedback from users.

In a sixteenth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the method may be integrated with mobile applications, tablets, «phablets» or other communication devices, making use of the device's information about time, location, user, language, profile etc.

In a seventeenth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the total knowledge from the sources may be shown as different graphs.

In an eighteenth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the entities can be words, known sentences, relations between words or other text relations.

In a nineteenth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the search from different sources may be combined.

In a twentieth embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the companies, persons and news from different sources may be combined.

In a twenty-first embodiment of the method of the invention according to any of the previous embodiments of the method of the invention the output from the output modules (100 - 160) is communicated to a user.

In a twenty-second embodiment of the method of the invention according to any of the previous first to twentieth embodiments of the method of the invention the method is implemented as a cloud service.

Claims

PATENT CLAIMS

1.

System for providing a common platform for representation of data from multiple data sources, the system being characterized by comprising:

one or more crawler (600) modules, wherein the crawler modules are set up to search and fetch data from structured and unstructured business and data sources such as, but not limited by: company web pages (700), company data bases, private and public registers (710), search engines and business applications (720), users, employees, owners, consultancies, business professionals or other relevant business relations (730), News (740) and forum, Blogs, Social networks (750);

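a filtering (500) module, where the filter module is set up to filter out from the searched and fetched data from the crawler (600) modules terms and expressions such as, but not limited to: Company name, products and services, locations, Language, turnover, result or other financial data, customers, competitors or other business relations, Market, Industry, contact data, sentiments, rules or other Business related content,

an entity extraction (400) module, wherein the entity extraction (400) module identifies entities such as, but not limited to: company name, person name and title, industry, product, location, market, financial data or other business related entities,

a mapping (300) module for mapping entities to standard ID's,

a database (200) for storing the data that is searched and fetched by the crawler (600) modules, filtered in the filtering (500) module, extracted in the extracting (400) module and mapped in the mapping (300) module, in predefined data structures, and

one or more output (100 - 160) modules for providing relational information between the data sources.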

2.

System according to claim 1, wherein the system further comprises:

a network service and a communication module, for providing communication between the system and a user.

3.

System according to claim 1 or claim 2, wherein the entity extraction module comprises an entity recognizer (405) module for further optimization of the searched and fetched data and recognition of relevant business information by for example, but not limited to: machine learning, natural language processing (NLP), training set, word vectors, stemming and other relevant techniques combined with dictionaries, synonyms, databases, language translators or other text recognition technologies.

4.

System according to any of the previous claims, wherein the relational information from the one or more output modules comprises one or more of, but not restricted to: a weighted sum from all sources (100), differences between sources (120), statistics (140) and trends (160) between the sources.

5.

System according to any of the previous claims, wherein the system is comprised in a cloud service.

6.

Method for providing a common platform for representation of data from multiple data sources using the system defined in any of the previous claims, the method being characterized by comprising:

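searching and fetching data using crawlers (600) from structured and unstructured business and data sources such as, but not limited by: company web pages (700), company data bases, private and public registers (710), search engines and business applications (720), users, employees, owners, consultancies, business professionals or other relevant business relations (730), News (740) and forum, Blogs, Social networks (750);

filtering out from the searched and fetched data from the crawler (600) modules, in a filtering (500) module, terms and expressions such as, but not limited to: Company name, products and services, locations, Language, turnover, result or other financial data, customers, competitors or other business relations, Market, Industry, contact data, sentiments, rules or other Business related content,

identifying and extracting entities in an entity extraction (400) module, wherein the entities may comprise, but not limited to: company name, person name and title, industry, product, location, market, financial data and other business related entities,

mapping entities to standard ID's, in a mapping (300) module,

storing the data that is searched and fetched by the crawler (600) modules, filtered in the filtering (500) module, extracted in the extracting (400) module and mapped in the mapping (300) module, in predefined data structures in a database (200), and

outputting relational information between the data sources by one or more output (100 - 160) modules.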

7.

Method according to claim 6, wherein the filtering operation is further set up to learn from previously filtered content.

8.

Method according to any of claim 6 to 7, wherein the identifying and extracting entities operation is further set up to learn from previously identified and extracted content.

9.

Method according to any of claim 6 to 8, wherein the identifying and extraction of entities operation may use techniques such as machine learning, «natural language processing» (NLP), training sets, word vectors, stemming and other relevant techniques combined with dictionaries, synonyms, databases, language translators or other relevant text recognition technologies (405).

10.

Method according to any of claim 6 to 9, wherein the mapping of entities to standardized ID's can use techniques such as, but not limited to: other ID standards, own standards, search technology such as stemming, «soundex», synonyms, NLP, vector representation of expressions, machine learning and known training sets (300).

11.

Method according to any of claim 6 to 10, wherein the common structure (200) of the sources may be used as a weighted sum of the business information from all sources to give a summarized possibility to belong to a given market based on a set of products (100).

12.

Method according to any of claim 6 to 11, wherein a company name ID and product ID are set together in a 3-dimensional plane (101) and where the height is the probability to belong to a given market given a set of products (102) or other relevant combination of a company's properties.

13.

Method according to any of claim 6 to 12, wherein different planes (101) represent corresponding different business sources (103-105) and wherein these may be summarized (106) into one weighted probability for all sources.

14.

Method according to any of claim 6 to 13, wherein the different sources may be weighted based on their quality, trust, reputation or other relevant parameters.

15.

Method according to any of claim 6 to 14, wherein the probability to belong to a market given a set of products may depend on the company's official industry code, location, financial numbers and other business related parameters.

16.

Method according to any of claim 6 to 15, wherein the output of a common structure (200) may show trends over time to develop relationships between companies, products, locations, market, financial strength and other business relations.

17.

Method according to any of claim 6 to 16, wherein the output of a common structure (200) may show statistics over most common trends, most popular products and services, most popular companies, industries, locations, megatrends, technology development or other relevant relations.

18.

Method according to any of claim 6 to 17, wherein the output for a common structure (200) may show differences between sources as e.g. based on locations, deviation from the normal, normal distributions, standard deviation, derived over time or similar.

19.

Method according to any of claim 6 to 18, wherein the solution may be integrated as a part of other systems such as company databases, financial registers, public company registers, company catalogues, other dictionaries for companies, business applications, search engines and other relevant business systems.

20.

Method according to any of claim 6 to 19, wherein the filtering (500), entity extraction (400) and mapping of entities to ID's may be enhanced by feedback from users.

21.

Method according to any of claim 6 to 20, wherein the method may be integrated with mobile applications, tablets, «phablets» or other communication devices, making use of the device's information about time, location, user, language, profile etc.

22.

Method according to any of claim 6 to 21, wherein the total knowledge from the sources may be shown as different graphs.

23.

Method according to any of claim 6 to 22, wherein entities can be words, known sentences, relations between words or other text relations.

24.

Method according to any of claim 6 to 23, wherein search from different sources may be combined.

25.

Method according to any of claim 6 to 24, wherein companies, persons and news from different sources may be combined.

26.

Method according to any of claim 6 to 25, wherein the output from the output modules (100 - 160) is communicated to a user.

27.

Method according to any of claim 6 to 25, wherein the method is implemented as a cloud service.