WO2021087308A1 - Efficient crawling using path scheduling, and applications thereof - Google Patents

Publication number
WO2021087308A1
WO2021087308A1 PCT/US2020/058286 US2020058286W WO2021087308A1 WO 2021087308 A1 WO2021087308 A1 WO 2021087308A1 US 2020058286 W US2020058286 W US 2020058286W WO 2021087308 A1 WO2021087308 A1 WO 2021087308A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data source
demographic information
fields
task
Application number
PCT/US2020/058286
Other languages
English (en)
Inventor
Carlos VERA-CIRO
Robert Raymond Lindner
Original Assignee
Veda Data Solutions, Inc.
Priority claimed from US16/668,544 external-priority patent/US20210133275A1/en
Priority claimed from US16/668,524 external-priority patent/US20210134407A1/en
Application filed by Veda Data Solutions, Inc. filed Critical Veda Data Solutions, Inc.
Priority to EP20883009.1A priority Critical patent/EP4052145A4/fr
Priority to CN202080076024.9A priority patent/CN114761945A/zh
Publication of WO2021087308A1 publication Critical patent/WO2021087308A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/22 Social work or social welfare, e.g. community support activities or counselling services

Definitions

  • This field is generally related to processing information.
  • demographic information may include, but is not limited to, their name, address, specialties, academic credentials, certifications, and the like.
  • This demographic information may be available from various public data sources, such as websites. These websites may retrieve the demographic information from underlying databases, such as state, county, city, or municipality databases, that store the data.
  • states may have licensing boards that maintain lists of all licensed healthcare providers, along with their associated demographic information.
  • health insurance companies may have public websites listing the healthcare providers, and associated demographic information, in their network.
  • healthcare providers may themselves set up public websites that list such demographic information about their practices.
  • Some of these websites may be organized by trees of information. For example, to retrieve demographic information about a particular healthcare provider, a user may first select the county from a drop-down list. Then another page appears asking the user to select a town in the selected county from a drop-down list. Then, a third page may appear asking the user to select a health care specialty. Only then are the healthcare providers meeting the selected criteria displayed, along with at least some of the relevant demographic information stored in the underlying database.
  • Entities may have a need to maintain demographic information.
  • health insurance companies may have a need to maintain demographic information about healthcare providers that need to be reimbursed for claimed services. Oftentimes this information may be inaccurate, or less accurate than information available from other public data sources.
  • the returning data may not be structured in a known format. It may be presented in a way that, once rendered, a human user would readily be able to identify the demographic information and how it corresponds to a particular healthcare provider.
  • an automated system may have difficulty parsing the data and associating the demographic information describing a single healthcare provider.
  • FIG. 1 illustrates a diagram of a network for communications between one or more data sources and a system, according to aspects of the present disclosure.
  • FIG. 2 illustrates a diagram of a system for accumulating data from the one or more data sources, according to aspects of the present disclosure.
  • FIG. 3 illustrates an example decision tree generated by the system for accumulating data from the one or more data sources, according to aspects of the present disclosure.
  • FIG. 4 illustrates example priority levels assigned to the one or more data sources, according to aspects of the present disclosure.
  • FIG. 5 illustrates an example report generated by the system for accumulating data from the one or more data sources, according to aspects of the present disclosure.
  • FIG. 6 illustrates a method of extracting unstructured data from a plurality of data sources, according to aspects of the present disclosure.
  • FIG. 7 illustrates a method of training a computing device to extract unstructured data from a plurality of data sources, according to aspects of the present disclosure.
  • FIG. 8 illustrates a method of using a machine learning model.
  • FIGS. 9A-B are diagrams illustrating how to extract geometric distances between page elements.
  • FIGS. 10-12 are diagrams illustrating how to extract distances between fields in a markup language.
  • FIG. 13 is an example computer system useful for implementing various embodiments.
  • Embodiments provide ways to retrieve unstructured data from data sources not optimized for automated retrieval. For example, embodiments may generate a branched tree for each data source that maps out paths to individual sites of, for example, a healthcare provider listing the unstructured data. Using this branched tree, tasks can be generated to navigate along a path within the data source to each site and extract the unstructured data from the data source. In this way, embodiments provide the ability to navigate through a site from a base site to a site that has the relevant data.
  • the data requests are made in a prioritized, yet random, manner.
  • the data sources may be categorized by priority (e.g., high priority, moderate priority, low priority, etc.), and the system may randomly select a data source within a given priority level and assign a task associated with the selected data source to one of the data extractors.
  • the system may monitor the number of data extractors currently navigating a given data source to avoid overloading the data source, which may cause the data source to crash.
  • the data extracted may be unstructured. In other words, it may be in a markup language designed to render to a human user.
  • the demographic information sought might not be tagged. That is, the markup language may not identify what data constitutes a name and associated telephone number or address.
  • the demographic information may be identified using, for example, a simple regular expression.
  • a distance between the respective fields is determined. The distance may be the geometric distance in the rendered page or distance between the two fields within the markup code.
  • a model may be trained based at least in part on this data to predict whether the various pieces of extracted demographic information relate to the same person. In this way, embodiments may use machine learning to interpret automatically documents that are not formatted specifically for a machine.
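The regular-expression step mentioned above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the two patterns (US-style phone numbers and ZIP codes), the function name `find_fields`, and the sample text are assumptions chosen for the example.

```python
import re

# Hypothetical patterns for two kinds of demographic fields (US formats assumed).
PATTERNS = {
    "phone": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def find_fields(text):
    """Return (field_type, matched_text, start_offset) tuples sorted by offset."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


# Illustrative, made-up listing text.
sample = "Dr. Jane Roe, 12 Main St, Springfield 01101. Tel: (555) 123-4567"
hits = find_fields(sample)
```

The character offsets returned here are one possible input to the distance features described above; a production pattern set would need to cover many more formats.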
  • references to “one embodiment”, “an embodiment”, “an example embodiment”, etc. indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • FIG. 1 is a diagram illustrating a system 100 for communications over a network
  • the one or more data sources 105 may be a data source, such as a website, that includes demographic information of one or more individuals, such as healthcare providers, including but not limited to doctors, dentists, physician assistants, nurse practitioners, nurses, or the like. Although the present disclosure describes the individuals as being healthcare providers, it should be understood by those of ordinary skill in the arts that the present disclosure may be implemented by accumulating data from any data source. In some instances, the one or more data sources 105 may include a Center for
  • CMS Centers for Medicare & Medicaid Services
  • DEA Drug Enforcement Agency
  • NPI National Provider Identifier
  • CMS data source may be a data service provided by a government agency.
  • the database may be distributed and different agencies or organizations may be responsible for different data stored in CMS data source.
  • CMS data source may include data on healthcare providers, such as lawfully available demographic information and claims information.
  • CMS data source may also allow a provider to enroll and update its information in the Medicare Provider Enrollment System and to register and assist in the Medicare and Medicaid Electronic Health Records (EHR) Incentive Programs.
  • EHR Electronic Health Records
  • the directory data source may be a directory of healthcare providers.
  • the directory data source may be a proprietary directory that matches healthcare providers with demographic and behavioral attributes that a particular client believes to be true.
  • the directory data source may, for example, belong to an insurance company and can only be accessed and utilized securely with the company’s consent.
  • the DEA data source may be a registration database maintained by a government agency such as the DEA.
  • the DEA may maintain a database of healthcare providers, including physicians, optometrists, pharmacists, dentists, or veterinarians, who are allowed to prescribe or dispense medication.
  • the DEA data source may match a healthcare provider with a DEA number.
  • DEA data sources may include demographic information about healthcare providers.
  • the public data source may perhaps be a web-based data source such as an online review system. These data sources may include demographic information about healthcare providers, area of specialty, and behavioral information such as crowd sourced reviews.
  • the NPI data source may be a data source matching a healthcare provider to a
  • the NPI is a Health Insurance Portability and Accountability Act (HIPAA) Administrative Simplification Standard.
  • HIPAA Health Insurance Portability and Accountability Act
  • the NPI is a unique identification number for covered health care providers. Covered health care providers and all health plans and health care clearinghouses must use the NPIs in the administrative and financial transactions adopted under HIPAA.
  • the NPI is a 10-position, intelligence-free numeric identifier (10-digit number). This means that the numbers do not carry other information about healthcare providers, such as the state in which they live or their medical specialty.
  • NPI data source may also include demographic information about a healthcare provider.
  • the registration data source may include state licensing information.
  • a healthcare provider such as a physician
  • the state licensing board may provide the registration data source information about the healthcare provider, such as demographic information and areas of specialty, including board certifications.
  • the claims data source may be a data source with insurance claims information.
  • claims data source may be a proprietary database.
  • Insurance claims may specify information necessary for insurance reimbursement.
  • claims information may include information on the healthcare provider, the services performed, and perhaps the amount claimed.
  • the services performed may be described using a standardized code system, such as ICD-9.
  • the information on the healthcare provider could include demographic information.
  • the one or more data sources 105 may each have different formats for providing the demographic information of the healthcare providers and/or list different types of demographic information. As such, the demographic information of each healthcare provider may be inconsistent from one data source 105 to another.
  • the data sources 105 may be hosted on a server, such as a host server, a web server, an application server, etc., a data center device, or a similar device, capable of communicating via the network 110.
  • the network 110 may include one or more wired and/or wireless networks.
  • the network 110 may include a cellular network (e.g., a long-term evolution (LTE) network, a code division multiple access (CDMA) network, a 3G network, a 4G network, a 5G network, another type of next generation network, etc.), a public land mobile network (PLMN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, a cloud computing network, and/or the like, and/or a combination of these or other types of networks.
  • LTE long-term evolution
  • CDMA code division multiple access
  • 3G third generation
  • 4G fourth generation
  • 5G fifth generation
  • PLMN public land mobile network
  • LAN local area network
  • system 115 includes a server 200 having one or more scouters
  • System 115 also includes one or more data extractors 210.
  • one or more scouters 205 may be configured to explore all possible permutations of each data source 105 to arrive at a site of each individual listed on the data source 105.
  • model trainer 235 may be used to train the one or more scouters 205 using machine learning algorithms to iteratively navigate a respective data source 105 until reaching the site of each individual.
  • each scouter 205 may be trained to select a combination of one or more of a series of links, drop-down menus, radio buttons, etc., until a path to the site of each individual is determined.
  • the series of links, drop-down menus, etc. may include one or more parameters for searching for healthcare providers.
  • the parameters may include a county, zip code, city, specialty, languages spoken, insurances accepted, and the like. It should be understood by those of ordinary skill in the arts that these are merely example parameters and that any combination of parameters may be used in accordance with aspects of the present disclosure.
  • scouters 205 may be trained, for example, using supervised machine learning algorithms based on sample data sources to learn how to navigate the data sources to the sites of each individual. For example, using the sample data sources, the scouters 205 may be trained on how to select a combination of the one or more of a series of links, the drop-down menus, the radio buttons, etc. That is, the scouters 205 may be trained on a set of training examples (e.g., sample data sources), such that the scouters 205 may navigate the data sources 105 without human intervention.
  • one or more scouters 205 may generate a decision tree for a respective data source 105 that provides a route to the site of each individual. That is, the scouters 205 may generate a decision tree for each of a plurality of data sources, with the decision tree comprising one or more paths to respective sites of the data source 105.
  • FIG. 3 illustrates a decision tree for state A that includes the parameters county, zip code, and specialty. It should be understood that the parameters shown in FIG. 3 are merely example parameters, and that any combination and/or order of parameters may be used to navigate to the site of each individual.
  • the decision tree may include multiple branches to the same site of an individual (i.e., fewer search parameters are required to reach the site of each individual), and in such instances, scouter 205 may retain the shortest path to the site of the individual while discarding all remaining paths to the site of the individual.
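One way to picture the shortest-path pruning described above is the sketch below. It assumes the decision tree is held as a nested dictionary whose keys are selections (for example, a drop-down choice) and whose leaves are site identifiers; that representation, the function names, and the sample tree are illustrative assumptions, not details taken from the disclosure.

```python
def enumerate_paths(node, prefix=()):
    """Yield (site_id, path) for every leaf reachable from `node`."""
    for choice, child in node.items():
        path = prefix + (choice,)
        if isinstance(child, dict):
            # Interior node: keep descending through further selections.
            yield from enumerate_paths(child, path)
        else:
            # Leaf: `child` is the identifier of an individual's site.
            yield child, path


def shortest_paths(tree):
    """Keep only the shortest path to each site, discarding longer ones."""
    best = {}
    for site, path in enumerate_paths(tree):
        if site not in best or len(path) < len(best[site]):
            best[site] = path
    return best


# Hypothetical tree for one data source: provider-1 is reachable two ways.
tree = {
    "county=Adams": {
        "zip=17325": {"specialty=Cardiology": "provider-1"},
        "specialty=Cardiology": "provider-1",
    },
    "county=Berks": {"zip=19601": {"specialty=Dermatology": "provider-2"}},
}
paths = shortest_paths(tree)
```

Here `provider-1` keeps the two-step path and the three-step path through the zip code is dropped.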
  • scouter 205 may routinely survey the respective data source 105 to determine if any updates and/or modifications have been made (e.g., whether any healthcare providers have been added to/removed from the data source, whether the previous paths are still valid, whether any shorter paths have been established, etc.). For example, scouter 205 may survey a data source 105 for updates and/or modifications weekly, monthly, quarterly, etc.
  • the controller 220 may maintain a schedule for surveying data sources 105 and instruct scouter(s) 205 to survey data source 105 accordingly.
  • controller 220 may generate and maintain a list of tasks for each of the plurality of data sources 105.
  • each task may correspond to a respective one of the one or more paths to navigate from a base web site to a destination, leaf web site that includes the desired demographic information.
  • Each task may also include instructions for extracting demographic information from the respective site. That is, controller 220 may split the decision tree into separate tasks having instructions for obtaining the demographic information from the site of each individual.
  • controller 220 may communicate these tasks to a corresponding data extractor 210, with the task providing the corresponding data extractor 210 with instructions on how to extract the demographic information from the respective site.
  • controller 220 may assign and transmit the task to the corresponding data extractor.
  • the controller 220 may store the tasks in a queue such that the data extractor 210 may select one of the tasks from the queue.
  • the task communicated to the data extractor 210 may cause the data extractor 210 to navigate the corresponding data source to the respective site and extract the demographic information from the respective site.
  • controller 220 may track which tasks have been communicated to data extractors 210 in order to ensure that data extractors 210 avoid performing duplicate tasks.
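The task list, queue, and duplicate tracking described above could be modeled roughly as follows. The `Task` fields, the `TaskQueue` class, and the default field names are hypothetical; the source does not specify how tasks are represented.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    source_id: str   # which data source the task belongs to
    path: tuple      # selections leading from the base page to the leaf page
    extract: tuple   # names of the fields to extract at the leaf page


class TaskQueue:
    """Queue that never hands out, or re-queues, the same task twice."""

    def __init__(self):
        self._pending = deque()
        self._dispatched = set()

    def add(self, task):
        # Ignore tasks that are already waiting or were already dispatched.
        if task not in self._dispatched and task not in self._pending:
            self._pending.append(task)

    def next_task(self):
        """Return the next task for a data extractor, or None if empty."""
        if not self._pending:
            return None
        task = self._pending.popleft()
        self._dispatched.add(task)
        return task


def tasks_from_paths(source_id, paths, fields=("name", "address", "phone")):
    """Split a data source's paths into one task per path."""
    return [Task(source_id, tuple(path), tuple(fields)) for path in paths]
```

Making `Task` frozen keeps it hashable, so membership in the dispatched set is a cheap way to avoid duplicate work.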
  • one or more data extractors 210 may be a computing device, such as a mobile phone (e.g., a smart phone, a radiotelephone, etc.), a laptop computer, a tablet computer, a handheld computer, or a similar type of device.
  • the instructions may include instructions for navigating through data source 105 to the respective site.
  • the instructions may indicate which link(s) to click, which drop-down option(s) to select, which radio button(s) to select, or the like, in order to navigate to the respective site.
  • the instructions may also include instructions for emulating movements of a user when navigating the data source 105. That is, the instructions may indicate where to move the mouse on a given site to make the aforementioned selections. Additionally, the instructions may include instructions to move the mouse after clicking the particular link, selecting an option of the drop-down list, selecting a radio button, or the like.
  • Further embodiments may include instructions for obviating a challenge-response test (e.g., a completely automated public Turing test to tell computers and humans apart “CAPTCHA”).
  • the instructions may direct the data extractor 210 to access a specific uniform resource locator (“URL”), rather than navigating through the data source 105.
  • the instructions for navigating through data source 105 may include instructions that cause the data extractor 210 to automatically navigate to a given page, e.g., a “Contact Us” page, of the data source 105 and extract the demographic information from the given site.
  • the controller 220 may communicate the tasks to the data extractors 210 based on a combination of a priority level of a data source 105 and a random selection.
  • the data sources 105 may be assigned a priority level. For example, as illustrated in FIG. 4, the data sources 105 may be assigned a high priority, a moderate priority, or a low priority.
  • the priority levels may be assigned to different states, different regions, different insurance providers, etc. It should be understood by those of ordinary skill in the arts that these are merely example priority levels, and that any number of priority levels are further contemplated in accordance with aspects of the present disclosure.
  • the controller 220 may communicate the tasks from a randomly selected data source 105 within a given priority level to corresponding data extractors 210.
  • the priority level for each data source 105 may be set by an administrator of the system 115 and may be adjusted any time.
  • the controller 220 may manage the number of data extractors performing tasks for a corresponding data source 105.
  • managing the number of data extractors may include managing a maximum number of data extractors 210 performing tasks on each of the plurality of data sources 105. That is, to avoid overloading the data source 105, the controller 220 may limit the number of data extractors 210 performing tasks on a given data source 105. When the maximum number of data extractors for a given data source 105 is reached, the controller 220 may communicate task(s) of another data source 105 having the same priority level to a corresponding data extractor(s) 210.
  • the controller 220 may communicate task(s) of another data source 105 having a different priority level to a corresponding data extractor(s) 210.
  • the other data source 105 of the same or different priority level may be randomly selected.
  • managing the number of data extractors may include periodically adjusting the number of data extractors 210 performing tasks on a data source 105 to increase or decrease the workload on the data source 105.
  • the controller 220 may periodically adjust the number of data extractors 210 performing tasks on a data source 105 in order to avoid overloading the data source 105 or to maximize the load on data source 105 during off-peak usage hours (e.g., overnight).
  • controller 220 may reassign data extractors 210 to perform tasks on another data source 105 having the same priority level.
  • controller 220 may reassign the data extractors 210 to perform tasks on another data source 105 having a different priority level.
  • the other data source 105 of the same or different priority level may be randomly selected.
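The prioritized-yet-random selection with a per-source extractor cap, as described above, might look like the following sketch. It assumes priority levels are integers where a lower number means higher priority, and that the controller tracks active extractor counts in a dictionary; both are illustrative assumptions, as is the function name `pick_source`.

```python
import random


def pick_source(sources, active, max_active, rng=random):
    """Pick a data source for the next task.

    sources:    {source_id: priority_level}; lower number = higher priority
                (an assumption made for this sketch).
    active:     {source_id: number of extractors currently working on it}.
    max_active: {source_id: maximum number of concurrent extractors}.
    Returns a source_id, or None when every source is at its cap.
    """
    for level in sorted(set(sources.values())):
        # Sources at this priority level that still have spare capacity.
        candidates = [
            source
            for source, priority in sources.items()
            if priority == level and active.get(source, 0) < max_active.get(source, 1)
        ]
        if candidates:
            # Random choice within the level spreads load across sources.
            return rng.choice(candidates)
    # Fall through to lower-priority levels happened above; nothing is free.
    return None
```

When every source at one level is saturated, the loop moves on to the next level, which mirrors the fallback to a data source of a different priority level.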
  • controller 220 may also generate a user interface presented on a display 230.
  • the user interface may indicate a color code indicator of the priority level of a data source 105, the number of tasks for each data source 105, an identification number of data source 105, the number of data extractors 210 performing tasks on each data source 105, a progress indicator of the tasks for each data source 105 (e.g., a percentage of jobs completed, whether data extractors 210 have started or completed the tasks, etc.), and an overall status of the tasks (e.g., “none,” “executing,” “initialized,” “completed,” etc.).
  • an administrator may pause one or more data extractors 210 performing tasks on data source 105 and/or change the priority level of a data source 105.
  • the user interface may be updated in predetermined intervals, e.g., every 15 minutes, every hour, etc.
  • controller 220 may also maintain a schedule for each data source 105 indicating when data source 105 should be crawled in order to obtain the demographic information. For example, each data source 105 may be crawled based on its own respective schedule (e.g., daily, weekly, bi-weekly, monthly, bi-monthly, quarterly, etc.). Using these schedules, controller 220 may determine whether to obtain the demographic information from a specific site of a given data source 105. For example, when given data source 105 is scheduled for crawling, controller 220 may communicate a message to one of data extractors 210 with a script for exploring the data source 105.
  • controller 220 may receive a message from data extractor 210 indicating that the job is complete and also requesting a new job.
  • data extractor 210 performing a given task may encounter a failure at data source 105 (e.g., data source 105 itself or the site of each individual is inaccessible).
  • the script may include instructions for repeating the task when data extractor 210 encounters the failure.
  • the instructions may cause data extractor 210 to iteratively attempt to access the site of an individual at a set interval and for a set number of attempts (e.g., every twenty-four hours for three days).
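The retry behavior described above (a set interval and a set number of attempts) can be sketched as a small helper. The use of `ConnectionError` as the failure signal, the function name, and the injectable `sleep` parameter are assumptions made so the example is self-contained and testable.

```python
import time


def run_with_retries(task, attempts=3, interval_seconds=24 * 60 * 60, sleep=time.sleep):
    """Call `task()` up to `attempts` times, waiting between failed attempts.

    The defaults mirror the example in the text: one attempt every
    twenty-four hours, three attempts in total.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except ConnectionError as err:  # stand-in for "site is inaccessible"
            last_error = err
            if attempt < attempts:
                sleep(interval_seconds)
    # Every attempt failed: surface the last failure to the caller.
    raise last_error
```

Passing a custom `sleep` lets a scheduler defer the wait instead of blocking a worker for a full day.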
  • data extractors 210 may be trained using machine learning algorithms to accumulate unstructured demographic data from data sources 105 in a structured manner.
  • trainer 235 may be used to train data extractors 210, for example, using supervised machine learning algorithms to learn, identify, and extract the unstructured data on any given site.
  • data extractors 210 may identify a distance between two or more parameters, e.g., a name and address of a healthcare provider on a rendered image of given site of the data source.
  • the distance between the two or more parameters may be a vertical distance (e.g., the parameters are vertically aligned) or a horizontal distance (e.g., the parameters are horizontally aligned).
  • the distance between the two parameters may be the distance between x-y coordinates of each parameter in a rendered image of the site. In other words, in some embodiments, the distance between two parameters may be a spatial distance.
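As a minimal sketch of the spatial distance described above, assuming each field's position in the rendered page is available as an (x, y) pixel coordinate (how those coordinates are obtained is not covered here, and the function names and sample coordinates are illustrative):

```python
import math


def spatial_distance(a, b):
    """Euclidean distance between two (x, y) points on the rendered page."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def axis_distances(a, b):
    """Return (horizontal, vertical) separation between two (x, y) points."""
    return abs(a[0] - b[0]), abs(a[1] - b[1])


# Hypothetical coordinates: the address sits directly below the name.
name_xy = (100, 200)
address_xy = (100, 230)
distance = spatial_distance(name_xy, address_xy)
horizontal, vertical = axis_distances(name_xy, address_xy)
```

For vertically aligned fields the horizontal separation is zero, so the spatial distance reduces to the vertical distance.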
  • data extractors 210 may be trained to identify other types and combinations of demographic information.
  • data extractors 210 may be trained to identify a number of pairs of parameters on a given site of data source 105. That is, in some situations, multiple healthcare providers may be listed on the same site with common demographic information or unique demographic information associated with each healthcare provider.
  • data extractors 210 may be trained to identify a ratio between a number of healthcare providers and a number of pieces of demographic information.
  • data extractors 210 may be trained to identify the demographic information based on a code used to generate the site.
  • data extractors 210 may identify the distance between the demographic information in a markup language (e.g., XML or Hypertext Markup Language (HTML) code) on any given site.
  • the code for each site may include nested nodes or trees, and the distance between pieces of demographic information may be the number of steps between the nodes of the tree that contain the different types of demographic information.
  • data extractors 210 may identify line number and character number of each of the parameters and determine a distance between them.
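The tree-step distance described above can be illustrated with the standard-library XML parser. This sketch assumes well-formed markup and counts steps up to the lowest common ancestor and back down; the helper names and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def path_to(root, target):
    """Return the list of elements from `root` down to `target`, or None."""
    if root is target:
        return [root]
    for child in root:
        sub = path_to(child, target)
        if sub:
            return [root] + sub
    return None


def tree_distance(root, a, b):
    """Number of parent/child steps between elements `a` and `b`."""
    path_a, path_b = path_to(root, a), path_to(root, b)
    common = 0
    for x, y in zip(path_a, path_b):
        if x is not y:
            break
        common += 1
    # Steps up from `a` to the common ancestor plus steps down to `b`.
    return (len(path_a) - common) + (len(path_b) - common)


# Hypothetical page fragment listing two providers.
markup = """<div>
  <div class="card"><h3>Jane Roe</h3><p><span>555-123-4567</span></p></div>
  <div class="card"><h3>John Doe</h3><p><span>555-765-4321</span></p></div>
</div>"""
root = ET.fromstring(markup)
names = root.findall(".//h3")
phones = root.findall(".//span")
same_card = tree_distance(root, names[0], phones[0])
other_card = tree_distance(root, names[0], phones[1])
```

A name and the phone number in the same card are fewer steps apart than a name and a phone number from a neighboring card, which is the signal the distance feature is meant to capture. Real-world HTML is often not well-formed XML and would need a more tolerant parser.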
  • Data extractors 210 may be trained to identify whether the various pieces of demographic information are related to one another. For example, the distances, number of pairs of parameters, and/or ratio between a number of healthcare providers and a number of pieces of demographic information may be features inputted to generate a model.
  • Trainer 235 may use a sample set generated by humans identifying related demographic information on the same page or by analyzing a sample set of pages with known positions or labeling of related demographic information. The labeling may be, for example, within tags in the markup language.
  • data extractors 210 may identify any combination of demographic information on each respective site of a data source 105. That is, data extractors 210 may be trained on a set of training examples (e.g., sample data sources), such that data extractors 210 may identify and extract the unstructured data on any given site without human intervention.
  • Example supervised machine learning algorithms that may be used to train scouters 205 include, but are not limited to, support vector machines, linear regression, logistic regression, naive Bayes, linear discriminant analysis, decision trees, k-nearest neighbor algorithm, neural networks, and similarity learning. It should be understood by those of ordinary skill in the art that these are merely example supervised machine learning algorithms and that other supervised machine learning algorithms may be used in accordance with aspects of the present disclosure.
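To make the idea of feeding distance features into a supervised model concrete, the sketch below uses a k-nearest neighbor classifier, one of the algorithm families listed above. The choice of k-nearest neighbor for this purpose, the two-feature layout (markup tree steps, pixel distance), and the labeled examples are all illustrative assumptions rather than details from the disclosure.

```python
import math


def knn_predict(train, query, k=3):
    """Predict 1 (same provider) or 0 (different provider) by majority vote.

    train: list of (features, label) pairs, where features is a tuple such as
           (markup_tree_steps, pixel_distance) and label is 0 or 1.
    query: feature tuple for the pair of fields being classified.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


# Made-up, human-labeled examples: close pairs belong to the same provider.
training = [
    ((3, 30), 1),
    ((3, 25), 1),
    ((2, 40), 1),
    ((5, 300), 0),
    ((6, 280), 0),
    ((5, 350), 0),
]
related = knn_predict(training, (3, 35))
unrelated = knn_predict(training, (5, 310))
```

In practice the features would be scaled so that pixel distance does not dominate the tree-step count, and the training set would be far larger.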
  • the data extractors 210 may reformat the demographic data in a structured manner. For example, as illustrated in FIG. 5, the data extractors 210 may generate a report having the data retrieved from the sites in a table format.
  • the structured format may include first name, last name, address, phone number, email address, specialty, license number, and expiration date.
  • the data extractors 210 may transmit the report to the server 200, which may then process the report.
  • the ingester 215 may retrieve the demographic data from the report of the data extractors 210 and separate the demographic information based on the category of data (e.g., name, address, phone, specialty, etc.) into separate databases within the repository 225.
  • the different categories of data may be separated into logical partitions within the repository 225.
  • the different categories of data may be separated into different memories within the repository 225.
  • the ingester 215 retrieves all of the demographic data accumulated for a given data source 105, identifies and categorizes the various pieces of information collected based on a category of data, and stores the categorized data within an assigned partition or database within the repository 225.
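The categorize-and-store step described above could be sketched as below, with an in-memory dictionary standing in for the partitions of the repository. The row format (one dictionary per report row), the function name `ingest`, and the record layout are assumptions for illustration.

```python
from collections import defaultdict


def ingest(report_rows, source_id):
    """Split report rows into per-category partitions.

    report_rows: list of dicts, one per table row, mapping a category name
                 (e.g., "name", "phone") to the extracted value.
    Returns {category: [records]}, where each record remembers the source
    and row it came from so the categories can be re-joined later.
    """
    partitions = defaultdict(list)
    for row_number, row in enumerate(report_rows):
        for category, value in row.items():
            if value:  # skip empty cells
                partitions[category].append(
                    {"source": source_id, "row": row_number, "value": value}
                )
    return dict(partitions)


# Hypothetical report with two rows and one empty cell in each.
rows = [
    {"name": "Jane Roe", "phone": "555-123-4567", "specialty": ""},
    {"name": "John Doe", "phone": "", "specialty": "Cardiology"},
]
partitions = ingest(rows, "state-A")
```

Keeping the source and row number on every record is a design choice of this sketch; it preserves the link between a provider's name and the other fields after they are stored separately.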
  • the ingester 215 may monitor each data source 105 to determine whether data relating to any individual has changed and requires updating.
  • FIG. 6 illustrates a method of extracting unstructured data from a plurality of data sources, according to aspects of the present disclosure.
  • a method 600 may include generating a decision tree for each of a plurality of data sources 605.
  • the decision tree may comprise one or more paths to respective sites of the data source.
  • one or more scouters (e.g., the scouters 205 of FIG. 2)
  • the method 600 may also include generating a list of tasks for each of the plurality of data sources (e.g., data sources 105 of FIG. 1) based on the decision tree 610.
  • Each task may correspond to a respective one of the one or more paths and may comprise instructions for extracting demographic information from the respective site.
  • a controller (e.g., the controller 220 of FIG. 2)
  • the method 600 may also include communicating a task from the list of tasks to a corresponding data extractor based on a priority level of the corresponding data source 615.
  • the controller (e.g., the controller 220 of FIG. 2)
  • the controller may store the tasks in a queue such that the data extractor may select one of the tasks from the queue.
  • the task may provide the corresponding data extractor with instructions on how to extract the demographic information from the respective site.
  • the method 600 may also include causing the corresponding data extractor to navigate the corresponding data source to the respective site and extract the demographic information from the respective site based on the communicated task 620.
  • the communicated task may cause the corresponding data extractor (e.g., the data extractor 210 of FIG. 2) to navigate the corresponding data source to the respective site and extract the demographic information from the respective site based on the communicated task.
  • the method 600 may also include receiving the extracted demographic information 625 from the corresponding data extractor.
  • the corresponding data extractor (e.g., the data extractor 210 of FIG. 2) may transmit the extracted data to a server (e.g., the server 200 of FIG. 2).
  • the method 600 may further include parsing the extracted demographic information into separate categories 630 and storing the parsed demographic information in separate databases based on the separate categories 635.
  • an ingester (e.g., ingester 215 of FIG. 2)
  • the different categories of data may be separated into logical partitions within the repository (e.g., the repository 225 of FIG. 2).
  • the different categories of data may be separated into different memories within the repository (e.g., the repository 225 of FIG. 2).
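The first steps of method 600 (decision tree, task list, priority-ordered queue) can be sketched as follows. This is an illustration under stated assumptions: the tree is modeled as a nested dictionary, a lower priority number is served first, and the names `Task`, `leaf_paths`, `build_queue`, and the sample sources are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                        # assumption: lower value is served first
    path: tuple = field(compare=False)   # navigation steps from the start site
    source: str = field(compare=False)   # data source the path belongs to


def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path of a nested-dict tree."""
    for node, children in tree.items():
        path = prefix + (node,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


def build_queue(trees, priorities):
    """Create one task per path and order tasks by data-source priority."""
    queue = []
    for source, tree in trees.items():
        for path in leaf_paths(tree):
            heapq.heappush(queue, Task(priorities[source], path, source))
    return queue


# Hypothetical trees for two data sources.
trees = {
    "board-a": {"home": {"search": {"provider/1": {}, "provider/2": {}}}},
    "board-b": {"home": {"directory": {}}},
}
priorities = {"board-a": 2, "board-b": 1}

queue = build_queue(trees, priorities)
first = heapq.heappop(queue)  # the task a data extractor would select next
```

A heap keeps selection cheap when many tasks are queued; the disclosure only states that tasks are stored in a queue and communicated based on priority, so the heap is one possible choice rather than the described implementation.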
  • a computing device can include but is not limited to: a personal computer, a mobile device such as a mobile phone, workstation, embedded system, game console, television, set-top box, or any other computing device.
  • a computing device can include, but is not limited to, a device having a processor and memory, including a non-transitory memory, for executing and storing instructions.
  • the memory may tangibly embody the data and program instructions in a non-transitory manner.
  • Software may include one or more applications and an operating system.
  • Hardware can include, but is not limited to, a processor, a memory, and a graphical user interface display.
  • FIG. 7 illustrates a method 700 of training a computing device to extract unstructured data from a plurality of data sources, according to aspects of the present disclosure.
  • Method 700 starts in step 702 by parsing person demographic information from a page.
  • the page may be represented in a markup language.
  • One such example markup language is illustrated in FIG. 9A.
  • FIG. 9A shows an example page 900 providing information about a healthcare provider as part of a medical referral service.
  • page 900 may be represented by a markup language such as HyperText Markup Language (HTML).
  • HTML snippet that may be used to represent the contents of page 900:
  • demographic information may be parsed from either the rendered page 900 or from the underlying markup language illustrated in the code snippet above.
  • regular expressions, or another set of rules, may be used to identify the phone numbers or addresses.
  • machine learning classifiers may be used to identify these various fields in the markup language or in the rendered page (for example, using computer vision techniques).
  • the different regular expressions or classifiers may each be configured to identify a particular type of demographic information, for example, name, address, or phone number.
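A rule-based identifier of this kind might look like the sketch below, where each regular expression targets one type of demographic information. The patterns are simplified assumptions (US-style phone numbers, a handful of street suffixes) and the sample text is invented; production rules would need to be far more tolerant.

```python
import re

# One pattern per field type; all three are deliberately simplified assumptions.
PATTERNS = {
    "phone": re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.]\d{4}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "address": re.compile(r"\d+\s+[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\s(?:St|Ave|Rd|Blvd)\b"),
}


def find_fields(text):
    """Return (field_type, value, start_offset) tuples in document order."""
    found = []
    for field_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((field_type, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])


# Hypothetical page text.
text = (
    "Dr. Jane Doe, 12 Oak St, Springfield. "
    "Phone: (555) 010-0199 or 555-010-0123. jane.doe@example.org"
)
fields = find_fields(text)
```

The character offsets returned alongside each value can double as a simple distance within the markup or text when features are computed later.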
  • a set of features is extracted based on the page and the identified demographic information.
  • the set of features may include, for example, the number of data fields extracted, the number of different types of data fields extracted, and/or the ratio of names extracted to another type of information extracted, such as addresses.
  • the set of features may include a distance between the various pieces of demographic and other information. The distance may include a geometric distance and/or a distance within the markup code.
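The count-based features named above can be computed in a few lines. The sketch assumes the identified fields are available as `(field_type, value)` pairs; the function name `count_features`, the feature keys, and the sample fields are hypothetical.

```python
from collections import Counter


def count_features(fields):
    """Compute simple count features from (field_type, value) pairs on one page."""
    counts = Counter(field_type for field_type, _ in fields)
    addresses = counts.get("address", 0)
    return {
        "n_fields": len(fields),   # number of data fields extracted
        "n_types": len(counts),    # number of different field types
        # Ratio of names to addresses; undefined when no address was found.
        "name_to_address_ratio": counts.get("name", 0) / addresses if addresses else None,
    }


# Hypothetical fields identified on a single page.
page_fields = [
    ("name", "Jane Doe"),
    ("address", "12 Oak St"),
    ("phone", "555-010-0199"),
    ("phone", "555-010-0123"),
]
features = count_features(page_fields)
```

Distance-based features would be appended to the same dictionary before the features are passed to the model.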
  • How a geometric distance may be determined is illustrated in FIGs. 9A and 9B.
  • FIG. 9A shows a page 900 illustrating a rendering of the marked up document.
  • a rendering can be generated, for example, using WebKit or another browser package.
  • In FIG. 9B, a location is determined where each of the plurality of fields is located.
  • the fields are located at positions 952, 954, 956, and 958.
  • a name is detected at position 952, an address is detected at position 954, and phone numbers are detected at positions 956 and 958.
  • detection may be done by retrieving information from the rendering engine.
  • the rendering engine may provide locations of the respective fields.
  • computer vision techniques may be used on the rendered page to determine the locations of the respective fields.
  • a geometric distance between the respective locations of the plurality of fields in the rendered marked-up document is calculated. In one embodiment, a distance is calculated for every pair of fields. In another embodiment, a distance is calculated between each name and each other type of demographic information.
  • a distance 964 is determined between fields 952 and 954.
  • a distance 966 is determined between field 952 and 956.
  • a distance 962 is determined between fields 952 and 958.
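As an illustration of the geometric distance, the sketch below measures the Euclidean distance between the centers of field bounding boxes. The pixel coordinates are invented for the example; in practice they would come from the rendering engine or from computer vision, and measuring center to center is an assumption rather than something the disclosure specifies.

```python
import math
from itertools import combinations

# Hypothetical bounding boxes (x, y, width, height) in pixels, keyed by the
# reference numerals used for the field positions in FIG. 9B.
BOXES = {
    952: (40, 30, 200, 24),   # name
    954: (40, 60, 260, 24),   # address
    956: (40, 90, 140, 24),   # phone number near the name
    958: (40, 400, 140, 24),  # phone number far from the name
}


def center(box):
    """Return the center point of an (x, y, width, height) box."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def pairwise_distances(boxes):
    """Euclidean distance between box centers for every pair of fields."""
    return {
        (a, b): math.dist(center(boxes[a]), center(boxes[b]))
        for a, b in combinations(sorted(boxes), 2)
    }


distances = pairwise_distances(BOXES)
```

With these coordinates the name at 952 ends up much closer to the phone number at 956 than to the one at 958, which is the kind of signal the feature is meant to capture.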
  • the geometric distance may be an advantageous feature to use in the model because page 900 may be designed to present the demographic information to a human user in a way that the human user recognizes that the various demographic information represents a single healthcare provider. As illustrated in the example in FIG. 9B, a distance 962 between fields 952 and 958 is larger than a distance 966 between fields 952 and 956, suggesting that fields 952 and 956 represent demographic information from the same individual, while fields 952 and 958 do not.
  • In addition to, or as an alternative to, the geometric distance, a distance within the markup code may be determined. In one embodiment, the distance may simply be the number of lines or characters of code between fields. In another embodiment, the distance may be a number of nodes separating the fields within a document object model, as illustrated, for example, in FIG. 10.
  • FIG. 10 illustrates a document object model 1000.
  • the document object model may include a plurality of interconnected nodes.
  • the plurality of interconnected nodes may, for example, be structured as a tree.
  • document object model 1000 has a root node 1016 and a number of leaf nodes 1002, 1004, 1006, 1008, 1010, 1012, and 1014 connected by intermediate nodes. Together, these nodes define the contents and format of the page.
  • the various fields of demographic information are embedded within contents of some, but not all, of the leaf nodes.
  • leaf nodes 1002, 1004, 1006, and 1008 have demographic information.
  • a distance between them in the document object model is determined.
  • the distance may be determined by calculating the number of hops between the respective locations of the plurality of fields in the rendered marked-up document.
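The hop count between two fields in a document object model can be sketched with the standard-library XML parser. The markup below is a hypothetical, well-formed stand-in for a provider page, and the `id` attributes exist only to look the nodes up; real HTML would typically need an HTML-aware parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed markup standing in for a provider page.
MARKUP = """
<html><body>
  <div id="card1">
    <h2 id="name">Jane Doe</h2>
    <p id="addr">12 Oak St</p>
    <p id="phone1">555-010-0199</p>
  </div>
  <div id="footer"><p id="phone2">555-010-0100</p></div>
</body></html>
"""


def ancestors(node, parent_of):
    """Return the chain [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def hops(root, id_a, id_b):
    """Count the edges on the tree path between two nodes, via their lowest common ancestor."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    up_a = ancestors(by_id[id_a], parent_of)
    up_b = ancestors(by_id[id_b], parent_of)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("nodes are not in the same tree")


root = ET.fromstring(MARKUP.strip())
near = hops(root, "name", "phone1")  # both fields sit inside the same card
far = hops(root, "name", "phone2")   # the second phone number sits in the footer
```

Fields that share a close common ancestor yield a small hop count, so the feature tends to be small for information belonging to the same provider.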
  • the correct groupings of demographic information representing a single healthcare provider on a page are received at step 706.
  • the groupings may be identified by a human user. Alternatively, the groupings may be generated given a known labeling of the demographic information on certain pages.
  • FIG. 8 illustrates a method 800 of using a machine learning model.
  • demographic information is parsed at step 802, as described above for step 702.
  • Features are extracted at step 804, again as described above for step 704.
  • those features are applied to the model, which is trained to determine whether any two or more fields of demographic information represent the same individual healthcare provider based on the features provided. In this way, embodiments can identify fields of demographic information on a page that represent the same individual or provider.
  • Various embodiments may be implemented, for example, using one or more well-known computer systems, such as computer system 1300 shown in FIG. 13.
  • One or more computer systems 1300 may be used, for example, to implement any of the embodiments discussed herein, as well as combinations and sub-combinations thereof.
  • Computer system 1300 may include one or more processors (also called central processing units, or CPUs), such as a processor 1304.
  • Processor 1304 may be connected to a communication infrastructure or bus 1306.
  • Computer system 1300 may also include user input/output device(s) 1303, such as monitors, keyboards, pointing devices, etc., which may communicate with communication infrastructure 1306 through user input/output interface(s) 1302.
  • communication infrastructure 1306 may communicate with user input/output interface(s) 1302.
  • One or more of processors 1304 may be a graphics processing unit (GPU).
  • a GPU may be a processor that is a specialized electronic circuit designed to process mathematically intensive applications.
  • the GPU may have a parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data common to computer graphics applications, images, videos, etc.
  • Computer system 1300 may also include a main or primary memory 1308, such as random access memory (RAM).
  • Main memory 1308 may include one or more levels of cache.
  • Main memory 1308 may have stored therein control logic (i.e., computer software) and/or data.
  • Computer system 1300 may also include one or more secondary storage devices or memory 1310.
  • Secondary memory 1310 may include, for example, a hard disk drive 1312 and/or a removable storage device or drive 1314.
  • Removable storage drive 1314 may be a floppy disk drive, a magnetic tape drive, a compact disk drive, an optical storage device, tape backup device, and/or any other storage device/drive.
  • Removable storage drive 1314 may interact with a removable storage unit 1318.
  • Removable storage unit 1318 may include a computer usable or readable storage device having stored thereon computer software (control logic) and/or data.
  • Removable storage unit 1318 may be a floppy disk, magnetic tape, compact disk, DVD, optical storage disk, and/or any other computer data storage device.
  • Removable storage drive 1314 may read from and/or write to removable storage unit 1318.
  • Secondary memory 1310 may include other means, devices, components, instrumentalities or other approaches for allowing computer programs and/or other instructions and/or data to be accessed by computer system 1300.
  • Such means, devices, components, instrumentalities or other approaches may include, for example, a removable storage unit 1322 and an interface 1320.
  • Examples of the removable storage unit 1322 and the interface 1320 may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, a memory stick and USB port, a memory card and associated memory card slot, and/or any other removable storage unit and associated interface.
  • Computer system 1300 may further include a communication or network interface 1324.
  • Communication interface 1324 may enable computer system 1300 to communicate and interact with any combination of external devices, external networks, external entities, etc. (individually and collectively referenced by reference number 1328).
  • communication interface 1324 may allow computer system 1300 to communicate with external or remote devices 1328 over communication path 1326, which may be wired and/or wireless (or a combination thereof), and which may include any combination of LANs, WANs, the Internet, etc.
  • Control logic and/or data may be transmitted to and from computer system 1300 via communication path 1326.
  • Computer system 1300 may also be any of a personal digital assistant (PDA), desktop workstation, laptop or notebook computer, netbook, tablet, smart phone, smart watch or other wearable, appliance, part of the Internet-of-Things, and/or embedded system, to name a few non-limiting examples, or any combination thereof.
  • Computer system 1300 may be a client or server, accessing or hosting any applications and/or data through any delivery paradigm, including but not limited to remote or distributed cloud computing solutions; local or on-premises software (“on premise” cloud-based solutions); “as a service” models (e.g., content as a service (CaaS), digital content as a service (DCaaS), software as a service (SaaS), managed software as a service (MSaaS), platform as a service (PaaS), desktop as a service (DaaS), framework as a service (FaaS), backend as a service (BaaS), mobile backend as a service (MBaaS), infrastructure as a service (IaaS), etc.); and/or a hybrid model including any combination of the foregoing examples or other services or delivery paradigms.
  • JSON (JavaScript Object Notation)
  • XML (Extensible Markup Language)
  • YAML (Yet Another Markup Language)
  • XHTML (Extensible Hypertext Markup Language)
  • WML (Wireless Markup Language)
  • MessagePack
  • XUL (XML User Interface Language)
  • proprietary data structures, formats or schemas may be used, either exclusively or in combination with known or open standards.
  • a tangible, non-transitory apparatus or article of manufacture comprising a tangible, non-transitory computer useable or readable medium having control logic (software) stored thereon may also be referred to herein as a computer program product or program storage device.
  • Such control logic, when executed by one or more data processing devices (such as computer system 1300), may cause such data processing devices to operate as described herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Development Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • Game Theory and Decision Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to systems and methods for extracting unstructured data from a data source in a structured manner. Embodiments provide ways of retrieving unstructured data from data sources that are not optimized for automated retrieval. For example, embodiments can generate a branching tree for each data source that maps out paths to individual sites, e.g., of a healthcare provider, listing the unstructured data. Using this branching tree, tasks can be generated to navigate a path with the data source to each site and extract the unstructured data from the data source. In this way, embodiments provide the ability to navigate, through a site, from a starting site to a site that presents the relevant data.
PCT/US2020/058286 2019-10-30 2020-10-30 Efficient crawling using path scheduling, and applications thereof WO2021087308A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20883009.1A EP4052145A4 (fr) 2019-10-30 2020-10-30 Efficient crawling using path scheduling, and applications thereof
CN202080076024.9A CN114761945A (zh) 2019-10-30 2020-10-30 Efficient crawling using path scheduling, and applications thereof

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US16/668,544 US20210133275A1 (en) 2019-10-30 2019-10-30 Extracting unstructured demographic information from a data source in a structured manner
US16/668,544 2019-10-30
US16/668,524 US20210134407A1 (en) 2019-10-30 2019-10-30 Efficient crawling using path scheduling, and applications thereof
US16/668,524 2019-10-30

Publications (1)

Publication Number Publication Date
WO2021087308A1 true WO2021087308A1 (fr) 2021-05-06

Family

ID=75716503

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/058286 WO2021087308A1 (fr) 2019-10-30 2020-10-30 Efficient crawling using path scheduling, and applications thereof

Country Status (3)

Country Link
EP (1) EP4052145A4 (fr)
CN (1) CN114761945A (fr)
WO (1) WO2021087308A1 (fr)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030195889A1 (en) * 2002-04-04 2003-10-16 International Business Machines Corporation Unified relational database model for data mining
US6941318B1 (en) * 2002-05-10 2005-09-06 Oracle International Corporation Universal tree interpreter for data mining models
US20070094060A1 (en) * 2005-10-25 2007-04-26 Angoss Software Corporation Strategy trees for data mining
US20130166207A1 (en) 2011-12-21 2013-06-27 Telenav, Inc. Navigation system with point of interest harvesting mechanism and method of operation thereof
US20130236111A1 (en) * 2012-03-09 2013-09-12 Ancora Software, Inc. Method and System for Commercial Document Image Classification
US20140172754A1 (en) * 2012-12-14 2014-06-19 International Business Machines Corporation Semi-supervised data integration model for named entity classification
US20160092730A1 (en) * 2014-09-30 2016-03-31 Abbyy Development Llc Content-based document image classification

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10242102B2 (en) * 2014-12-29 2019-03-26 Samsung Electronics Co., Ltd. Network crawling prioritization
US20180150562A1 (en) * 2016-11-25 2018-05-31 Cognizant Technology Solutions India Pvt. Ltd. System and Method for Automatically Extracting and Analyzing Data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4052145A4

Also Published As

Publication number Publication date
EP4052145A4 (fr) 2023-11-01
CN114761945A (zh) 2022-07-15
EP4052145A1 (fr) 2022-09-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20883009

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020883009

Country of ref document: EP

Effective date: 20220530