US20210133769A1 - Efficient data processing to identify information and reformant data files, and applications thereof - Google Patents

Info

Publication number
US20210133769A1
Authority
US
United States
Prior art keywords
demographic information
fields
data file
analyzing
identify
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/668,565
Inventor
Carlos Vera-Ciro
Robert Raymond LINDNER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Veda Data Solutions Inc
Original Assignee
Veda Data Solutions Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Veda Data Solutions Inc
Priority to US16/668,565 (US20210133769A1)
Assigned to VEDA DATA SOLUTIONS, INC.: assignment of assignors interest (see document for details). Assignors: LINDNER, ROBERT RAYMOND; VERA-CIRO, Carlos
Priority to PCT/US2020/058200 (WO2021087254A1)
Priority to CN202080076168.4A (CN114830079A)
Priority to EP20882233.8A (EP4052119A4)
Priority to US17/181,519 (US20210174380A1)
Publication of US20210133769A1
Assigned to COMERICA BANK: security interest (see document for details). Assignors: VEDA DATA SOLUTIONS, INC.
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F17/278
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management

Definitions

  • This field is generally related to processing information.
  • demographic information may include, but is not limited to, their name, address, specialties, academic credentials, certifications, and the like.
  • This demographic information may be available from various public data sources, such as websites. These websites may retrieve the demographic information from underlying databases, such as state, county, city, or municipality databases, that store the data. For example, states may have licensing boards that maintain lists of all licensed healthcare providers, along with their associated demographic information.
  • health insurance companies may have public websites listing the healthcare providers, and associated demographic information, in their network.
  • healthcare providers may themselves set up public websites that list such demographic information about their practices.
  • Entities may have a need to maintain demographic information.
  • health insurance companies may have a need to maintain demographic information about healthcare providers that need to be reimbursed for claimed services.
  • these entities often attempt to collect and integrate the demographic information from providers, hospitals, group practices, or the like.
  • responses to requests for this information have poor response rates, are poorly formatted, and may include inaccurate information.
  • the responses may be structured in an unknown format, may include inconsistent or mislabeled headings, or may include spurious information.
  • the responses should be reviewed to verify the contents of the data provided and reformatted into a consistent structure.
  • the responses frequently include hundreds, if not thousands, of entries with any number of different types of demographic data. Consequently, manually reviewing and reformatting data from these responses may be difficult, time-consuming, and expensive, and often takes weeks per file to complete.
  • the present disclosure is directed to a method for identifying demographic information in a data file.
  • the method may include receiving the data file containing a plurality of fields of demographic information from a third-party.
  • the data file may include inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information.
  • the method may also include analyzing the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information.
  • the machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information.
  • the method may further include generating a score indicating a probability that each of the plurality of fields of demographic information was identified correctly.
  • the method may also include generating a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
  • FIG. 1 illustrates a diagram of a network for communications between one or more data sources and a system, according to aspects of the present disclosure.
  • FIG. 2 illustrates a diagram of a system for reviewing and reformatting data files from the one or more data sources, according to aspects of the present disclosure.
  • FIGS. 3-5B illustrate example data files received from the one or more data sources, according to aspects of the present disclosure.
  • FIG. 6 illustrates an example revised data file, according to aspects of the present disclosure.
  • FIG. 7 illustrates a method of reformatting data from a data source, according to aspects of the present disclosure.
  • FIG. 8 is an example computer system useful for implementing various embodiments.
  • Embodiments provide ways to review and reformat data files that include inconsistent or mislabeled nomenclatures for one or more fields of a plurality of fields of demographic information or spurious demographic information, which would require weeks per file to review and reformat manually.
  • embodiments may analyze the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information.
  • the machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information.
  • analyzing the data file may be based on a combination of one or more of semantic content of the demographic information, a shape of the demographic information, or metadata. In this way, embodiments provide the ability to identify different types of demographic data.
  • Embodiments may also generate a score indicating a probability that each of the plurality of fields of demographic information was identified correctly.
  • Embodiments may also generate a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
  • the revised data file may be formatted based on the requirements of the third-party that provided the original data file.
  • the revised data file may be fully customizable based on individual requests for the restructured data.
  • embodiments provide the ability to effectively and efficiently generate data files in a format that is most useful to the third party.
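The revised-file step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `write_revised_file`, the type labels, and the choice of CSV as the output format are all assumptions made for the example.

```python
import csv
import io
from typing import List, Optional


def write_revised_file(rows: List[List[str]],
                       column_types: List[Optional[str]],
                       layout: List[str]) -> str:
    """Relabel and reorder columns according to a requested layout.

    rows         -- the original table, one list of cell values per row
    column_types -- the identified type of each original column (None = drop)
    layout       -- the ordered list of types the recipient asked for
    """
    # Map each identified type to the position of its original column.
    index = {t: i for i, t in enumerate(column_types) if t is not None}
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(layout)  # headings are the identified types
    for row in rows:
        # Types requested but not found in the file become empty cells.
        writer.writerow([row[index[t]] if t in index else "" for t in layout])
    return out.getvalue()
```

Columns whose type is `None` (for example, extraneous metadata) are dropped, and the `layout` argument stands in for the per-recipient formatting requirements.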
  • the present disclosure may implement a combination of a plurality of machine learning algorithms and rules, which improves the functionality of the computing device. Namely, the combination of machine learning algorithms and rules avoids overtraining, and thus overcomplicating, the machine learning model, thereby reducing the amount of resources, e.g., processing consumption and memory resources, required to generate reformatted data files. Additionally, in some aspects, the present disclosure may intelligently identify different types of demographic information based on a sampled portion of the data file, rather than the entire data file, which may include hundreds, if not thousands of entries. By identifying the different types of demographic information based on a sampled portion, the present disclosure may further reduce the amount of resources required to generate reformatted data files.
  • references to “one embodiment”, “an embodiment”, “an example embodiment”, etc. indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • FIG. 1 is a diagram illustrating a network 100 for communications over a network 110 between one or more data sources 105 and a system 115 .
  • the one or more data sources 105 may be any data source that maintains databases of demographic information of one or more individuals, such as healthcare providers, including, but not limited to, doctors, dentists, physician assistants, nurse practitioners, nurses, or the like. Although the present disclosure describes the individuals as being healthcare providers, it should be understood by those of ordinary skill in the art that the present disclosure may be implemented to accumulate data from any data source.
  • the data sources 105 may be hosted on a server, such as a host server, a web server, an application server, etc., a data center device, or a similar device, capable of communicating via the network 110 .
  • the one or more data sources 105 may include a Centers for Medicare & Medicaid Services (CMS) data source, a directory data source, a Drug Enforcement Administration (DEA) data source, a public data source, a National Provider Identifier (NPI) data source, a registration data source, and/or a claims data source.
  • CMS data source may be a data service provided by a government agency.
  • the database may be distributed, and different agencies or organizations may be responsible for different data stored in the CMS data source.
  • the CMS data source may also include data on healthcare providers, such as lawfully available demographic information and claims information.
  • the CMS data source may also allow a provider to enroll and update its information in the Medicare Provider Enrollment System and to register and assist in the Medicare and Medicaid Electronic Health Records (EHR) Incentive Programs.
  • the directory data source may be a directory of healthcare providers.
  • the directory data source may be a proprietary directory that matches healthcare providers with demographic and behavioral attributes that a particular client believes to be true.
  • the directory data source may, for example, belong to an insurance company or a health system, and can only be accessed and utilized securely with the company's consent.
  • the DEA data source may be a registration database maintained by a government agency such as the DEA.
  • the DEA may maintain a database of healthcare providers, including physicians, optometrists, pharmacists, dentists, or veterinarians, who are allowed to prescribe or dispense medication.
  • the DEA data source may match a healthcare provider with a DEA number.
  • the DEA data source may also include demographic information about healthcare providers.
  • the public data source may be a public data source, perhaps a web-based data source such as an online review system. These data sources may include demographic information about healthcare providers, area of specialty, and behavioral information such as crowd sourced reviews.
  • the NPI data source may be a data source matching a healthcare provider to an NPI.
  • the NPI is a Health Insurance Portability and Accountability Act (HIPAA) Administrative Simplification Standard.
  • the NPI is a unique identification number for covered health care providers. Covered health care providers and all health plans and health care clearinghouses must use the NPIs in the administrative and financial transactions adopted under HIPAA.
  • the NPI is a 10-position, intelligence-free numeric identifier (10-digit number). This means that the numbers do not carry other information about healthcare providers, such as the state in which they live or their medical specialty.
  • the NPI data source may also include demographic information about a healthcare provider.
  • the registration data source may include state licensing information.
  • a healthcare provider such as a physician
  • the state licensing board may provide the registration data source information about the healthcare provider, such as demographic information and areas of specialty, including board certifications.
  • the claims data source may be a data source with insurance claims information.
  • the claims data source may be a proprietary database.
  • Insurance claims may specify information necessary for insurance reimbursement.
  • claims information may include information on the healthcare provider, the services performed, and perhaps the amount claimed.
  • the services performed may be described using a standardized code system, such as ICD-9.
  • the information on the healthcare provider could include demographic information.
  • the one or more data sources 105 may receive data files from any number of origins, e.g., multiple practice groups, other ones of the plurality of data sources 105 , etc.
  • the one or more data sources 105 may receive responses to requests for demographic information from, for example, medical practice groups, hospitals, or the like. This information may be entered by an administrator, and as such, the data file may include inconsistent or mislabeled nomenclatures for one or more fields of a plurality of fields of demographic information or it may include spurious demographic information.
  • the one or more data sources 105 may acquire another entity that utilizes different nomenclatures for one or more fields of the plurality of fields.
  • one or more of the plurality of data sources 105 may transmit a data file containing the plurality of fields of demographic information to the server 115 .
  • the data file may include a table of information having any number of headings labeling a plurality of fields of demographic information.
  • the data file may include a table having the headings “Name,” “Addrs.,” “PH#,” “FX#,” “Specialty,” “License No.,” and “Expiration Date.”
  • the demographic information provided under the heading “FX#” is a number of email addresses, rather than facsimile numbers.
  • one of the entries under the heading “Addrs.” includes a typographical error in the zip code.
  • the data file may include extraneous metadata and/or superfluous information. Namely, as shown in FIG. 3 , the data file may include, for example, “Author Name” and “Date Generated,” indicating who authored the data file and the date it was created.
  • the data file may include a table of information having a heading and subheadings.
  • the data file may have a heading labeled “Group” with subheadings labeled “Name,” “Address # 1 ,” “Address # 2 ,” “Phone No.,” and “Fx #.”
  • the data file may have a heading labeled “Group” with subheadings labeled “Name,” “Billing,” and “Service.”
  • the data file may have a heading labeled “Group Name” with subheadings labeled “Name,” “Addr,” “Name,” and “Addr.”
  • the data file may have inconsistent or mislabeled nomenclatures or spurious demographic information.
  • the format of each data file having the demographic information may be inconsistent from one source to another.
  • the network 110 may include one or more wired and/or wireless networks.
  • the network 110 may include a cellular network (e.g., a long-term evolution (LTE) network, a code division multiple access (CDMA) network, a 3G network, a 4G network, a 5G network, another type of next generation network, etc.), a public land mobile network (PLMN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, a cloud computing network, and/or the like, and/or a combination of these or other types of networks.
  • the server 115 may include an ingester 205 , a repository 210 , a display 215 , and a model trainer 220 , as illustrated in FIG. 2 .
  • the ingester 205 may analyze the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information.
  • the model trainer 220 may train the machine learning model using a number of Monte Carlo training sets having sample data files. That is, the model trainer 220 may use a sample set generated by humans identifying demographic information in a data file.
  • the machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information.
  • the plurality of machine learning algorithms may be supervised machine learning algorithms including, but not limited to, support vector machines, linear regression, logistic regression, naive Bayes, linear discriminant analysis, decision trees, k-nearest neighbor algorithm, neural networks, and similarity learning. It should be understood by those of ordinary skill in the art that these are merely example supervised machine learning algorithms and that other supervised machine learning algorithms may be used in accordance with aspects of the present disclosure.
  • the ingester 205 may analyze the data file by analyzing semantic content of each of the plurality of fields of demographic information to identify the different types of demographic information. For example, the ingester 205 may identify semantic content, such as a state name or state abbreviation, which indicates that the demographic information is likely an address, rather than, for example, a phone number or facsimile number. Similarly, the ingester 205 may identify semantic content, such as street names (e.g., Avenue, Road, Street, Lane, etc.) and/or their associated abbreviations (e.g., Ave., Rd., St., Ln., etc.), which would likewise indicate that the demographic information is an address.
  • the ingester 205 may identify semantic content, such as state names (or country names) and/or their associated abbreviations, which would likewise also indicate that the demographic information is an address.
  • the ingester 205 may also be able to identify a billing address based on the semantic content.
  • the semantic content may include, for example, a PO Box number, which would indicate that the content is a billing address, rather than a service address.
  • the ingester 205 may identify the semantic content, such as a hyperlink, which may indicate that the demographic information is an email address. It should be understood by those of ordinary skill in the art that these are merely examples of semantic content that may be identified, and that other types of semantic content are contemplated in accordance with aspects of the present disclosure.
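The semantic cues described above (state abbreviations, street suffixes, PO Box numbers) can be sketched as simple lookups. This is only an illustration under stated assumptions: the vocabularies are deliberately tiny, and the names `semantic_hints` and `guess_address_kind` are hypothetical. The disclosure describes a trained machine learning model, not fixed lookups like these.

```python
import re
from typing import Optional, Set

# Hypothetical, abbreviated vocabularies; a real system would use fuller lists.
STATE_ABBREVIATIONS = {"CA", "NY", "TX", "VA", "WI"}
STREET_SUFFIXES = {"avenue", "ave", "road", "rd", "street", "st", "lane", "ln"}
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*Box\s+\d+", re.IGNORECASE)


def semantic_hints(value: str) -> Set[str]:
    """Return the semantic cues found in a single cell value."""
    hints = set()
    tokens = re.findall(r"[A-Za-z]+", value)
    if any(tok in STATE_ABBREVIATIONS for tok in tokens):
        hints.add("state")
    if any(tok.lower() in STREET_SUFFIXES for tok in tokens):
        hints.add("street")
    if PO_BOX.search(value):
        hints.add("po_box")
    return hints


def guess_address_kind(value: str) -> Optional[str]:
    """Map cues to a coarse label; a PO Box suggests a billing address."""
    hints = semantic_hints(value)
    if "po_box" in hints:
        return "billing_address"
    if hints & {"state", "street"}:
        return "address"
    return None
```

A value with none of these cues, such as a phone number, yields `None`, leaving the decision to the other signals (shape and metadata).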
  • the ingester 205 may analyze the data file by analyzing a shape of each of the plurality of fields of demographic information to identify the different types of demographic information. For example, the ingester 205 may analyze the demographic information to identify the number of characters, the type of the characters (e.g., numeric versus letter characters), the number of non-alphanumeric characters (e.g., spaces, commas, periods, or the like), and an overall arrangement of the alphanumeric characters and non-alphanumeric characters.
  • the shape of the demographic information may be “XXX[comma][space]XXX” or “XXX[comma][space]XXX [space]X[period]”, with each X representing a letter character, which are common formats identifying names.
  • the ingester 205 may identify the state within an address based on the semantic content, as discussed herein.
  • the ingester 205 may identify the shape of the demographic information, such as XXX@XXX[period]XXXX, which indicates that the demographic information is an email address. It should be understood by those of ordinary skill in the art that these are merely examples of shapes of demographic content that may be identified, and that other types of shapes of demographic content are contemplated in accordance with aspects of the present disclosure.
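One way to picture the "shape" of a value is to collapse runs of letters and digits into placeholder symbols while keeping punctuation. The sketch below assumes that encoding (`X` for a run of letters, `9` for a run of digits). The function names and the small shape table are hypothetical, chosen to mirror the name and email examples above.

```python
import re
from typing import Dict, Optional


def shape_of(value: str) -> str:
    """Collapse letter runs to 'X' and digit runs to '9'; keep other characters."""
    shape = re.sub(r"[A-Za-z]+", "X", value)
    return re.sub(r"\d+", "9", shape)


def shape_features(value: str) -> Dict[str, int]:
    """Character counts of the kind the description lists."""
    return {
        "length": len(value),
        "letters": sum(c.isalpha() for c in value),
        "digits": sum(c.isdigit() for c in value),
        "other": sum(not c.isalnum() for c in value),
    }


# Hypothetical shape table covering the examples in the text.
SHAPE_TO_TYPE = {
    "X, X": "name",      # e.g. "Doe, Jane"
    "X, X X.": "name",   # e.g. "Doe, Jane A."
    "X@X.X": "email",    # e.g. "jane@example.com"
}


def type_from_shape(value: str) -> Optional[str]:
    return SHAPE_TO_TYPE.get(shape_of(value.strip()))
```

Phone and facsimile numbers collapse to the same shape under this encoding, which is why shape alone cannot separate them.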
  • the ingester 205 may analyze the data file by analyzing metadata of each of the plurality of fields of demographic information to identify the different types of demographic information.
  • the metadata may include each nomenclature of the headings.
  • the semantic content and shapes of the demographic information may be similar.
  • phone numbers and facsimile numbers may have similar semantic content and shapes.
  • service addresses and billing addresses may have similar semantic content and shapes.
  • the ingester 205 may analyze the metadata of the headings (or subheadings). For example, the ingester 205 may identify common nomenclatures used for the different types of demographic information.
  • common nomenclatures for phone numbers may include, but are not limited to, “Phone No.,” “Phone Number,” “P:,” “PH No.,” or the like
  • common nomenclatures for facsimile numbers may include, but are not limited to, “Fax No.,” “Fax Number,” “F:,” “FX No.,” or the like
  • common nomenclatures for service addresses may include the terms, for example, “Service,” “Serv.,” or the like, or the service address may be listed only as “Address” or some variation thereof, whereas the billing address may be specifically identified as such.
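The heading nomenclatures listed above lend themselves to a normalized lookup table. The following is a sketch under assumptions: the alias sets contain only the examples from the text plus a hypothetical "billing" entry, and `heading_type` is an illustrative name.

```python
import re
from typing import Optional

# Aliases are stored in normalized form: lowercase, no periods or colons.
HEADING_ALIASES = {
    "phone": {"phone no", "phone number", "p", "ph no"},
    "fax": {"fax no", "fax number", "f", "fx no"},
    # A bare "Address" defaults to the service address, as the text suggests.
    "service_address": {"service", "serv", "address"},
    # Hypothetical aliases; the text only says billing is identified as such.
    "billing_address": {"billing", "billing address"},
}

# Reverse lookup from normalized alias to field type.
ALIAS_TO_TYPE = {alias: field_type
                 for field_type, aliases in HEADING_ALIASES.items()
                 for alias in aliases}


def normalize_heading(heading: str) -> str:
    """Lowercase, drop periods and colons, and collapse whitespace."""
    cleaned = re.sub(r"[.:]", "", heading).strip().lower()
    return re.sub(r"\s+", " ", cleaned)


def heading_type(heading: str) -> Optional[str]:
    """Return the field type a heading suggests, or None if unrecognized."""
    return ALIAS_TO_TYPE.get(normalize_heading(heading))
```

Headings that match nothing, such as "Author Name", return `None` and can then be treated as candidates for removal or for content-based identification.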
  • the ingester 205 may analyze layered headings, as illustrated in the examples shown in FIGS. 3 and 4A-B.
  • the ingester 205 may analyze the headings “Author Name” and “Date Generated,” and determine that these fields are merely extraneous metadata and/or superfluous information that should be removed when reformatting the data file.
  • the ingester 205 may analyze the primary heading and subheadings, and determine that the demographic information provided below the primary heading is related to a practice group, i.e., a group name, group service address, group billing address, group phone number, and group facsimile number.
  • the ingester 205 may analyze the primary heading and subheadings, and determine that the demographic information provided below the primary heading is related to a practice group, e.g., a group name. Where the remaining subheadings are “Billing” and “Service,” the ingester 205 may determine that the demographic information provided under these subheadings is a billing address, billing phone number, service address, and service phone number, respectively.
  • the machine learning model may also be trained on respective rules for common types of demographic information.
  • the rules may include a rule that a five-digit number, or a five-digit number followed by a hyphen and a four-digit number, is a zip code, as these are the only available formats for zip codes.
  • an NPI may be formatted as a ten-digit number with the first digit being a “1,” and as such, the rules may include a rule indicating that any ten-digit number commencing with a “1” is an NPI.
  • the rules may include a rule for determining responses to binary pieces of demographic information, e.g., whether a healthcare provider is accepting new patients—“Yes”/“Y” or “No”/“N.”
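The rules above translate almost directly into pattern checks. The sketch below encodes them exactly as stated in the text (five-digit or ZIP+4 zip codes, ten-digit numbers starting with "1" as NPIs, and Yes/No style binary answers). The function name `apply_rules` and the returned labels are assumptions for illustration.

```python
import re
from typing import Optional

ZIP_RE = re.compile(r"\d{5}(-\d{4})?")   # 12345 or 12345-6789
NPI_RE = re.compile(r"1\d{9}")           # ten digits, first digit "1" (per the text)
BINARY_ANSWERS = {"yes", "y", "no", "n"}


def apply_rules(value: str) -> Optional[str]:
    """Return a field type if a hard rule matches, otherwise None."""
    v = value.strip()
    if ZIP_RE.fullmatch(v):
        return "zip_code"
    if NPI_RE.fullmatch(v):
        return "npi"
    if v.lower() in BINARY_ANSWERS:
        return "binary"
    return None
```

Values that match no rule return `None`, which is where the learned model would take over.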
  • the ingester 205 may analyze the inter-columnar relationship between multiple columns. For example, as illustrated in FIG. 5A , the data file includes alternating headings of “Name” and “Addr.” After reviewing the semantic content, shape, and metadata of the rows under each column, the ingester 205 may determine that the respective types of demographic information are names and addresses. Furthermore, by analyzing the inter-columnar relationship between multiple columns, the ingester 205 may determine that the alternating headings should be grouped as pairs, e.g., a healthcare provider name and their associated address. As another example illustrated in FIG. 5B , the data file may include multiple addresses for a single healthcare provider, i.e., “Addrs.
  • the ingester 205 may determine that each address is associated with the same healthcare provider, and separate each address into separate entries, e.g., separate row of information, in a revised data file, while still associating the addresses with the same healthcare provider.
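The two inter-columnar cases above (alternating name and address columns, and several addresses for one provider) can be sketched as plain list operations. The function names and record layout are hypothetical, and the column types are assumed to have been identified already.

```python
from typing import Dict, List


def pair_alternating_columns(types: List[str], row: List[str]) -> List[Dict[str, str]]:
    """Group adjacent (name, address) columns of one row into records."""
    records = []
    i = 0
    while i + 1 < len(types):
        if types[i] == "name" and types[i + 1] == "address":
            records.append({"name": row[i], "address": row[i + 1]})
            i += 2  # consume the pair
        else:
            i += 1  # skip a column that does not start a pair
    return records


def explode_addresses(name: str, addresses: List[str]) -> List[Dict[str, str]]:
    """Emit one output row per non-empty address, all tied to the same provider."""
    return [{"name": name, "address": a} for a in addresses if a]
```

For a row typed `["name", "address", "name", "address"]`, the first function yields two name and address records; the second turns one provider with several addresses into several rows.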
  • the ingester 205 may also generate a score indicating a probability that each of the plurality of fields of demographic information was identified correctly. For example, the ingester 205 may generate a baseline score for each of the plurality of fields of demographic information, which may then be adjusted. For example, the ingester 205 may increase the scores for demographic information having well-known semantic content and/or shapes, e.g., zip codes and NPIs. Additionally, the ingester 205 may increase or decrease the score based on whether the heading correctly identifies the associated demographic information, e.g., whether the heading correctly identifies “NPIs.” For example, the score may be decreased when the heading and the content do not match, whereas the score may be increased when the heading and content match.
  • the ingester 205 may adjust the score based on whether other demographic information having similar semantic content and/or shapes has been detected. For example, the ingester 205 may increase the score for a telephone number or address if only a single piece of demographic information having the given semantic content and/or shape is identified. However, in the event two or more identified fields of demographic information have the same semantic content and/or shape (e.g., a phone number and a facsimile number or a service address and a billing address), the ingester 205 may decrease the score for each of the two or more identified fields of demographic information, and these identified fields may have the same score. Furthermore, in some situations, the ingester 205 may generate an alert notifying an administrator of the two or more identified fields of demographic information having the same semantic content and/or shape, such that the administrator may provide input to resolve the conflict.
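The scoring adjustments described above can be sketched as a baseline plus fixed increments. The baseline of 0.5, the step of 0.1, the field dictionary keys, and the notion of a "family" of look-alike types are all assumptions for this example. The text does not give numeric values.

```python
from collections import Counter
from typing import Dict, List


def score_fields(fields: List[Dict], baseline: float = 0.5, step: float = 0.1) -> List[float]:
    """Score each identified field; higher means more likely identified correctly.

    Each field dict has 'inferred' (type from content), 'family' (group of
    look-alike types, e.g. phone and fax), and optionally 'heading_type'
    (type suggested by the heading) and 'well_known' (zip code, NPI, ...).
    """
    family_counts = Counter(f["family"] for f in fields)
    scores = []
    for f in fields:
        s = baseline
        if f.get("well_known"):
            s += step                      # well-known content or shape
        heading = f.get("heading_type")
        if heading is not None:            # heading agrees or disagrees with content
            s += step if heading == f["inferred"] else -step
        # A unique shape raises the score; look-alike siblings lower it.
        s += step if family_counts[f["family"]] == 1 else -step
        scores.append(round(min(1.0, max(0.0, s)), 2))
    return scores


def conflicting_families(fields: List[Dict]) -> List[str]:
    """Families with two or more fields; candidates for an administrator alert."""
    counts = Counter(f["family"] for f in fields)
    return sorted(fam for fam, n in counts.items() if n > 1)
```

In this sketch, a zip code with a matching heading ends up well above the baseline, while a phone and facsimile pair is pulled down and reported by `conflicting_families`.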
  • the ingester 205 may apply additional processing to distinguish between the two or more identified fields of demographic information. For example, in some embodiments, the ingester 205 may cross-check at least one of the plurality of fields of demographic information against known demographic information stored in, for example, the repository 210 . For example, the ingester 205 may cross-check an identified phone number and an identified facsimile number against known phone numbers and facsimile numbers to verify which is the phone number and which is the facsimile number. In some embodiments, the ingester 205 may sequentially check the digits of the phone and facsimile numbers until the ingester 205 determines that one of the two is a phone number.
  • the ingester 205 may identify one of the two or more identified fields of demographic information, accordingly, with the remaining field of demographic information being identified as the most reasonable alternative (e.g., the facsimile number).
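  • As a non-limiting illustration, cross-checking two look-alike columns against known numbers may be sketched as follows. The function names and the use of in-memory sets in place of the repository 210 are assumptions made for this example only. A column is labeled by whichever known list it matches more often, and when exactly one of two columns is resolved, the other receives the remaining label.

```python
import re


def digits_only(number):
    """Strip punctuation so '(202) 555-0101' and '202-555-0101' compare equal."""
    return re.sub(r"\D", "", number)


def resolve_phone_fax(columns, known_phones, known_faxes):
    """Label each column 'phone' or 'fax' by cross-checking known numbers.

    columns: dict mapping a column name to its list of values.
    known_phones / known_faxes: sets of digit strings (a stand-in for a
    repository lookup; hypothetical for this sketch).
    """
    labels = {}
    for name, values in columns.items():
        digits = {digits_only(v) for v in values}
        phone_hits = len(digits & known_phones)
        fax_hits = len(digits & known_faxes)
        if phone_hits > fax_hits:
            labels[name] = "phone"
        elif fax_hits > phone_hits:
            labels[name] = "fax"
        else:
            labels[name] = None  # no evidence either way
    unresolved = [name for name, label in labels.items() if label is None]
    resolved = [label for label in labels.values() if label is not None]
    # With two columns and one resolved, the other is the remaining alternative.
    if len(labels) == 2 and len(unresolved) == 1:
        labels[unresolved[0]] = "fax" if resolved[0] == "phone" else "phone"
    return labels
```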
  • the ingester 205 may cross-check other pieces of demographic information, such as the NPI, service addresses, and billing addresses. It should be understood by those of ordinary skill in the arts that these are merely examples of the types of demographic information that may be cross-checked, and that other types of demographic information may be cross-checked in accordance with aspects of the present disclosure.
  • the ingester 205 may identify incorrect information and, in some instances, update the incorrect information. For example, as illustrated in FIG. 3 , the zip code in the address associated with “Jane Doe” included a typographical error, and to fix this error, the ingester 205 may query the repository 210 to identify a correct zip. Additionally, or alternatively, the ingester 205 may compare the incorrect zip code to other zip codes of the data file, e.g., the zip code associated with “John Doe,” as illustrated in FIG. 3 .
  • the ingester 205 may determine the zip code associated with “John Doe” is the correct zip code and update the zip code for “Jane Doe” accordingly. Additionally, the ingester 205 may determine whether identified information is correct by cross-checking, for example, identified phone numbers against known phone numbers. In some instances, the cross-checking may confirm that the identified numbers are indeed phone numbers. In other instances, the cross-checking may determine that the identified phone numbers were incorrectly labeled in the data file and are, in fact, facsimile numbers rather than phone numbers.
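  • As a non-limiting illustration, correcting a zip code by comparison with other rows of the same data file may be sketched as follows. The row layout (`street`, `city`, `state`, and `zip` keys) and the function name `repair_zips` are hypothetical. The sketch only repairs zip codes whose shape is malformed; a typographical error that still yields five digits would require a lookup such as the repository query described above.

```python
import re
from collections import Counter, defaultdict

# Shape of a U.S. zip code: five digits, optionally followed by -NNNN.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")


def _address_key(row):
    return (row["street"].lower(), row["city"].lower(), row["state"].upper())


def repair_zips(rows):
    """Return copies of rows with malformed zips replaced from matching rows.

    rows: list of dicts with 'street', 'city', 'state', 'zip' keys
    (a hypothetical layout chosen for this sketch).
    """
    # Collect well-formed zips seen for each street/city/state combination.
    seen = defaultdict(Counter)
    for row in rows:
        if ZIP_RE.fullmatch(row["zip"]):
            seen[_address_key(row)][row["zip"]] += 1

    repaired = []
    for row in rows:
        fixed = dict(row)  # do not mutate the caller's data
        candidates = seen.get(_address_key(row))
        if not ZIP_RE.fullmatch(row["zip"]) and candidates:
            # Use the most common well-formed zip for the same address.
            fixed["zip"] = candidates.most_common(1)[0][0]
        repaired.append(fixed)
    return repaired
```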
  • the ingester 205 may analyze a limited number of rows of demographic information in the data file (i.e., less than the full number of rows in the data file) to improve the overall efficiency of the ingester 205 .
  • the ingester 205 may be able to identify the type of demographic information of each of the plurality of fields of demographic information, and assume that all remaining rows that have not been analyzed are the identified type of demographic information.
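  • As a non-limiting illustration, identifying the type of a column from a limited number of rows may be sketched as follows. The regular expressions in `SHAPES`, the default `sample_size`, and the function names are assumptions made for this example; the disclosure describes a machine learning model, for which these simple shape rules are only a stand-in.

```python
import re
from collections import Counter

# Illustrative shape rules (assumptions, not the disclosed model).
SHAPES = [
    ("email", re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")),
    ("zip", re.compile(r"\d{5}(-\d{4})?")),
    ("phone", re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")),
]


def classify_value(value):
    """Return the first shape label whose pattern matches the whole value."""
    text = value.strip()
    for label, pattern in SHAPES:
        if pattern.fullmatch(text):
            return label
    return "unknown"


def infer_column_type(values, sample_size=50):
    """Vote on a column's type using only the first sample_size rows.

    Returns (label, fraction_of_sampled_rows_that_agree). The remaining,
    unsampled rows are assumed to share the winning label.
    """
    sample = values[:sample_size]
    if not sample:
        return "unknown", 0.0
    votes = Counter(classify_value(v) for v in sample)
    label, count = votes.most_common(1)[0]
    return label, count / len(sample)
```

  • The fraction returned alongside the label could serve as one input to the score described above, since a column whose sampled rows disagree is less likely to have been identified correctly.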
  • the ingester 205 may generate the revised data file in smaller segments of rows, rather than the entire data file, which may require substantial amounts of resources, e.g., processing consumption and memory resources.
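  • As a non-limiting illustration, generating the revised data file in smaller segments of rows may be sketched as follows. The function name `write_in_segments` and the default `segment_size` are hypothetical. The sketch holds at most one segment of rows in memory at a time, which is the resource saving the preceding paragraph describes.

```python
import csv
from itertools import islice


def write_in_segments(rows, header, out, segment_size=1000):
    """Write rows to a CSV file object one segment at a time.

    rows may be any iterable (including a generator), so the full data file
    never needs to be materialized. Returns the number of segments written.
    """
    writer = csv.writer(out)
    writer.writerow(header)
    iterator = iter(rows)
    segments = 0
    while True:
        # Pull at most segment_size rows from the source.
        segment = list(islice(iterator, segment_size))
        if not segment:
            break
        writer.writerows(segment)
        segments += 1
    return segments
```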
  • the ingester 205 reduces the overall amount of resources used and improves the efficiency of the server 115 .
  • the ingester 205 may generate a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
  • the ingester 205 may generate a revised data file having a format that is customized according to a request from the data source 105 .
  • the requested format may be a format that is consistent with preexisting data files of the data source 105 .
  • the requested format may be an entirely new format. For example, as illustrated in FIG.
  • the data source 105 may request that the demographic information be separated into “F Name,” “L Name,” “Street Address,” “City,” “State,” and “Zip Code.”
  • the ingester 205 may identify fields for the requested format and parse through the identified types of demographic information to determine which demographic information belongs in which field of the requested format.
  • the ingester 205 may parse the demographic information and separate it into different fields in the revised data file, i.e., “First Name” and “Last Name.” That is, the ingester 205 may generate new columns by separating a column of a single type of demographic information (e.g., “Full Name”) into separate columns, parsing the single type of demographic information into separate subcomponents (e.g., “First Name” and “Last Name” as separate columns).
  • the ingester 205 may generate a new column by combining separate columns of information (e.g., “First Name” and “Last Name”) into a single column (e.g., “Full Name”). It should be understood by those of ordinary skill in the arts that this is merely an example, and that the ingester 205 may parse other types of demographic information in accordance with aspects of the present disclosure. In further embodiments, the ingester 205 may separate a single incoming data file into any number of revised data files.
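  • As a non-limiting illustration, separating a “Full Name” column into “First Name” and “Last Name” columns, and combining them again, may be sketched as follows. The function names are hypothetical, and treating the final whitespace-delimited token as the last name is a simplifying assumption that does not hold for every name.

```python
def split_full_name(full_name):
    """Split a full name into (first, last); the final token is the last name."""
    parts = full_name.strip().split()
    if not parts:
        return "", ""
    if len(parts) == 1:
        return parts[0], ""
    return " ".join(parts[:-1]), parts[-1]


def combine_name(first, last):
    """Join first and last names into one value, skipping empty parts."""
    return " ".join(p for p in (first.strip(), last.strip()) if p)


def split_name_column(rows, source="Full Name"):
    """Replace the source column with 'First Name' and 'Last Name' columns."""
    revised = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != source}
        new_row["First Name"], new_row["Last Name"] = split_full_name(row[source])
        revised.append(new_row)
    return revised
```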
  • a given piece of demographic information may not match what the ingester 205 identified as the type of demographic information.
  • the ingester 205 may identify one of the plurality of fields of demographic information as being NPIs, but one entry may not match the known format for an NPI. In such circumstances, the ingester 205 may pass through the mismatching demographic information untouched, render the value null, or insert special characters flagging the particular entry. Alternatively, the ingester 205 may generate an alert notifying an administrator of the mismatching demographic information, such that the administrator may provide input to resolve the discrepancy.
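  • As a non-limiting illustration, the three ways of handling a mismatching entry (passing it through, rendering it null, or flagging it) may be sketched as follows. The function name, the policy names, and the `??` flag characters are hypothetical. The only format rule assumed is that an NPI is a 10-digit number.

```python
import re

# An NPI is a 10-digit numeric identifier.
NPI_RE = re.compile(r"\d{10}")


def handle_npi_column(values, policy="flag"):
    """Clean a column identified as NPIs and report mismatching row indexes.

    policy: 'pass' leaves a mismatching entry untouched, 'null' replaces it
    with None, and 'flag' wraps it in marker characters (illustrative choice).
    """
    cleaned, mismatches = [], []
    for index, value in enumerate(values):
        text = value.strip()
        if NPI_RE.fullmatch(text):
            cleaned.append(text)
            continue
        mismatches.append(index)
        if policy == "pass":
            cleaned.append(value)
        elif policy == "null":
            cleaned.append(None)
        elif policy == "flag":
            cleaned.append("??" + value + "??")
        else:
            raise ValueError("unknown policy: " + policy)
    return cleaned, mismatches
```

  • The returned list of mismatching row indexes could alternatively be used to generate the administrator alert described above.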
  • the ingester 205 may determine additional information based on the identified demographic information. For example, using the identified address, the ingester 205 may determine the geolocation or coordinates of the healthcare provider. As another example, the ingester 205 may supplement a missing zip code based on a known street address, city, and state. The ingester 205 may include such additional information in the revised data file upon request. The ingester 205 may store the revised data file in the repository 210, and the server 115 may transmit the revised data file to the data source 105 over the network 110.
  • FIG. 7 illustrates a method for identifying demographic information in a data file.
  • a computing device may receive the data file containing a plurality of fields of demographic information from a third-party.
  • the data file may have inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information.
  • the computing device may analyze the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information.
  • the machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information.
  • a computing device can include, but is not limited to, a personal computer, a mobile device such as a mobile phone, a workstation, an embedded system, a game console, a television, a set-top box, or any other computing device. Further, a computing device can include, but is not limited to, a device having a processor and memory, including a non-transitory memory, for executing and storing instructions. The memory may tangibly embody the data and program instructions in a non-transitory manner. Software may include one or more applications and an operating system. Hardware can include, but is not limited to, a processor, a memory, and a graphical user interface display. The computing device may also have multiple processors and multiple shared or separate memory components. For example, the computing device may be a part of or the entirety of a clustered or distributed computing environment or server farm.
  • Various embodiments may be implemented, for example, using one or more well-known computer systems, such as computer system 800 shown in FIG. 8.
  • One or more computer systems 800 may be used, for example, to implement any of the embodiments discussed herein, as well as combinations and sub-combinations thereof.
  • Computer system 800 may include one or more processors (also called central processing units, or CPUs), such as a processor 804 .
  • Processor 804 may be connected to a communication infrastructure or bus 806 .
  • Computer system 800 may also include user input/output device(s) 803 , such as monitors, keyboards, pointing devices, etc., which may communicate with communication infrastructure 806 through user input/output interface(s) 802 .
  • processors 804 may be a graphics processing unit (GPU).
  • a GPU may be a processor that is a specialized electronic circuit designed to process mathematically intensive applications.
  • the GPU may have a parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data common to computer graphics applications, images, videos, etc.
  • Computer system 800 may also include a main or primary memory 808 , such as random access memory (RAM).
  • Main memory 808 may include one or more levels of cache.
  • Main memory 808 may have stored therein control logic (i.e., computer software) and/or data.
  • Computer system 800 may also include one or more secondary storage devices or memory 810 .
  • Secondary memory 810 may include, for example, a hard disk drive 812 and/or a removable storage device or drive 814 .
  • Removable storage drive 814 may be a floppy disk drive, a magnetic tape drive, a compact disk drive, an optical storage device, tape backup device, and/or any other storage device/drive.
  • Removable storage drive 814 may interact with a removable storage unit 818 .
  • Removable storage unit 818 may include a computer usable or readable storage device having stored thereon computer software (control logic) and/or data.
  • Removable storage unit 818 may be a floppy disk, magnetic tape, compact disk, DVD, optical storage disk, and/or any other computer data storage device.
  • Removable storage drive 814 may read from and/or write to removable storage unit 818 .
  • Secondary memory 810 may include other means, devices, components, instrumentalities or other approaches for allowing computer programs and/or other instructions and/or data to be accessed by computer system 800 .
  • Such means, devices, components, instrumentalities or other approaches may include, for example, a removable storage unit 822 and an interface 820 .
  • Examples of the removable storage unit 822 and the interface 820 may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, a memory stick and USB port, a memory card and associated memory card slot, and/or any other removable storage unit and associated interface.
  • Computer system 800 may further include a communication or network interface 824 .
  • Communication interface 824 may enable computer system 800 to communicate and interact with any combination of external devices, external networks, external entities, etc. (individually and collectively referenced by reference number 828 ).
  • communication interface 824 may allow computer system 800 to communicate with external or remote devices 828 over communications path 826 , which may be wired and/or wireless (or a combination thereof), and which may include any combination of LANs, WANs, the Internet, etc.
  • Control logic and/or data may be transmitted to and from computer system 800 via communication path 826 .
  • Computer system 800 may also be any of a personal digital assistant (PDA), desktop workstation, laptop or notebook computer, netbook, tablet, smart phone, smart watch or other wearable, appliance, part of the Internet-of-Things, and/or embedded system, to name a few non-limiting examples, or any combination thereof.
  • Computer system 800 may be a client or server, accessing or hosting any applications and/or data through any delivery paradigm, including but not limited to remote or distributed cloud computing solutions; local or on-premises software (“on-premise” cloud-based solutions); “as a service” models (e.g., content as a service (CaaS), digital content as a service (DCaaS), software as a service (SaaS), managed software as a service (MSaaS), platform as a service (PaaS), desktop as a service (DaaS), framework as a service (FaaS), backend as a service (BaaS), mobile backend as a service (MBaaS), infrastructure as a service (IaaS), etc.); and/or a hybrid model including any combination of the foregoing examples or other services or delivery paradigms.
  • Any applicable data structures, file formats, and schemas in computer system 800 may be derived from standards including but not limited to JavaScript Object Notation (JSON), Extensible Markup Language (XML), Yet Another Markup Language (YAML), Extensible Hypertext Markup Language (XHTML), Wireless Markup Language (WML), MessagePack, XML User Interface Language (XUL), comma-separated values (CSV), or any other functionally similar representations alone or in combination.
  • proprietary data structures, formats or schemas may be used, either exclusively or in combination with known or open standards.
  • a tangible, non-transitory apparatus or article of manufacture comprising a tangible, non-transitory computer useable or readable medium having control logic (software) stored thereon may also be referred to herein as a computer program product or program storage device.
  • control logic when executed by one or more data processing devices (such as computer system 800 ), may cause such data processing devices to operate as described herein.

Abstract

The present disclosure is directed to systems and methods for identifying demographic information in a data file. The method may include: receiving the data file containing a plurality of fields of demographic information from a third-party, the data file having inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information; analyzing the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information, the machine learning model being based on a plurality of machine learning algorithms to identify different types of demographic information; generating a score indicating a probability that each of the plurality of fields of demographic information was identified correctly; and generating a revised data file labeling each of the plurality of fields of demographic information based on the identified type.

Description

    BACKGROUND
  • Field
  • This field is generally related to processing information.
  • Background
  • As technology advances, an ever-increasing amount of demographic information is becoming digitized. For example, for healthcare providers, demographic information may include, but is not limited to, their name, address, specialties, academic credentials, certifications, and the like. This demographic information may be available from various public data sources, such as websites. These websites may retrieve the demographic information from underlying databases, such as state, county, city, or municipality databases, that store the data. For example, states may have licensing boards that maintain lists of all licensed healthcare providers, along with their associated demographic information. In another example, health insurance companies may have public websites listing the healthcare providers, and associated demographic information, in their network. In another example, healthcare providers may themselves set up public websites that list such demographic information about their practices.
  • Entities may have a need to maintain demographic information. For example, health insurance companies may have a need to maintain demographic information about healthcare providers that need to be reimbursed for claimed services. To maintain the demographic information, these entities often attempt to collect and integrate the demographic information from providers, hospitals, group practices, or the like. Oftentimes, requests for this information have poor response rates, and the responses that are received are poorly formatted and may include inaccurate information. For example, the responses may be structured in an unknown format, may include inconsistent or mislabeled headings, or may include spurious information. As such, the responses should be reviewed to verify the contents of the data provided and reformatted into a consistent structure. However, the responses frequently include hundreds, if not thousands, of entries with any number of different types of demographic data. Consequently, manually reviewing and reformatting data from these responses may be difficult, time-consuming, and expensive, and often takes weeks per file to complete. These costs and time delays significantly contribute to the administrative overhead costs that account for about one third of healthcare premiums in the United States.
  • Thus, systems and methods are needed to improve reviewing and reformatting these responses into a validated format by automating expensive administrative tasks, thereby eliminating manual data formatting and reducing wasteful spending.
  • BRIEF SUMMARY
  • In an embodiment, the present disclosure is directed to a method for identifying demographic information in a data file. The method may include receiving the data file containing a plurality of fields of demographic information from a third-party. The data file may include inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information. The method may also include analyzing the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information. The machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information. The method may further include generating a score indicating a probability that each of the plurality of fields of demographic information was identified correctly. The method may also include generating a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
  • System and computer program product embodiments are also disclosed.
  • Further embodiments, features, and advantages of the invention, as well as the structure and operation of the various embodiments, are described in detail below with reference to accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are incorporated herein and form part of the specification, illustrate the present disclosure and, together with the description, further serve to explain the principles of the disclosure and to enable a person skilled in the relevant art to make and use the disclosure.
  • FIG. 1 illustrates a diagram of a network for communications between one or more data sources and a system, according to aspects of the present disclosure.
  • FIG. 2 illustrates a diagram of a system for reviewing and reformatting data files from the one or more data sources, according to aspects of the present disclosure.
  • FIGS. 3-5B illustrate example data files received from the one or more data sources, according to aspects of the present disclosure.
  • FIG. 6 illustrates an example revised data file, according to aspects of the present disclosure.
  • FIG. 7 illustrates a method of reformatting data from a data source, according to aspects of the present disclosure.
  • FIG. 8 is an example computer system useful for implementing various embodiments.
  • The drawing in which an element first appears is typically indicated by the leftmost digit or digits in the corresponding reference number. In the drawings, like reference numbers may indicate identical or functionally similar elements.
  • DETAILED DESCRIPTION
  • Embodiments provide ways to review and reformat data files that include inconsistent or mislabeled nomenclatures for one or more fields of a plurality of fields of demographic information or spurious demographic information, which would require weeks per file to review and reformat manually. For example, embodiments may analyze the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information. The machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information. For example, analyzing the data file may be based on a combination of one or more of semantic content of the demographic information, a shape of the demographic information, or metadata. In this way, embodiments provide the ability to identify different types of demographic data. Embodiments may also generate a score indicating a probability that each of the plurality of fields of demographic information was identified correctly. Embodiments may also generate a revised data file labeling each of the plurality of fields of demographic information based on the identified type. For example, the revised data file may be formatted based on the requirements of the third-party that provided the original data file. In other words, the revised data file may be fully customizable based on individual requests for the restructured data. Thus, embodiments provide the ability to effectively and efficiently generate data files in a format that is most useful to the third party.
  • Furthermore, the present disclosure may implement a combination of a plurality of machine learning algorithms and rules, which improves the functionality of the computing device. Namely, the combination of machine learning algorithms and rules avoids overtraining, and thus overcomplicating, the machine learning model, thereby reducing the amount of resources, e.g., processing consumption and memory resources, required to generate reformatted data files. Additionally, in some aspects, the present disclosure may intelligently identify different types of demographic information based on a sampled portion of the data file, rather than the entire data file, which may include hundreds, if not thousands of entries. By identifying the different types of demographic information based on a sampled portion, the present disclosure may further reduce the amount of resources required to generate reformatted data files.
  • In the detailed description that follows, references to “one embodiment”, “an embodiment”, “an example embodiment”, etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • FIG. 1 is a diagram illustrating a network 100 for communications over a network 110 between one or more data sources 105 and a system 115. In some embodiments, the one or more data sources 105 may be any data source that maintains databases of demographic information of one or more individuals, such as healthcare providers, including but not limited to, doctors, dentists, physician assistants, nurse practitioners, nurses, or the like. Although the present disclosure describes the individuals as being healthcare providers, it should be understood by those of ordinary skill in the arts that the present disclosure may be implemented to accumulate data from any data source. In some embodiments, the data sources 105 may be hosted on a server, such as a host server, a web server, an application server, etc., a data center device, or a similar device, capable of communicating via the network 110.
  • In some instances, the one or more data sources 105 may include a Centers for Medicare & Medicaid Services (CMS) data source, a directory data source, a Drug Enforcement Administration (DEA) data source, a public data source, a National Provider Identifier (NPI) data source, a registration data source, and/or a claims data source. The CMS data source may be a data service provided by a government agency. The database may be distributed, and different agencies or organizations may be responsible for different data stored in the CMS data source. The CMS data source may also include data on healthcare providers, such as lawfully available demographic information and claims information. The CMS data source may also allow a provider to enroll and update its information in the Medicare Provider Enrollment System and to register and assist in the Medicare and Medicaid Electronic Health Records (EHR) Incentive Programs.
  • The directory data source may be a directory of healthcare providers. In one example, the directory data source may be a proprietary directory that matches healthcare providers with demographic and behavioral attributes that a particular client believes to be true. The directory data source may, for example, belong to an insurance company or a health system, and can only be accessed and utilized securely with the company's consent.
  • The DEA data source may be a registration database maintained by a government agency such as the DEA. The DEA may maintain a database of healthcare providers, including physicians, optometrists, pharmacists, dentists, or veterinarians, who are allowed to prescribe or dispense medication. The DEA data source may match a healthcare provider with a DEA number. In addition, the DEA data source may include demographic information about healthcare providers.
  • The public data source may be a public data source, perhaps a web-based data source such as an online review system. These data sources may include demographic information about healthcare providers, area of specialty, and behavioral information such as crowd sourced reviews.
  • The NPI data source may be a data source matching a healthcare provider to an NPI. The NPI is a Health Insurance Portability and Accountability Act (HIPAA) Administrative Simplification Standard. The NPI is a unique identification number for covered health care providers. Covered health care providers and all health plans and health care clearinghouses must use the NPIs in the administrative and financial transactions adopted under HIPAA. The NPI is a 10-position, intelligence-free numeric identifier (10-digit number). This means that the numbers do not carry other information about healthcare providers, such as the state in which they live or their medical specialty. The NPI data source may also include demographic information about a healthcare provider.
  • The registration data source may include state licensing information. For example, a healthcare provider, such as a physician, may need to register with a state licensing board. The state licensing board may provide the registration data source information about the healthcare provider, such as demographic information and areas of specialty, including board certifications.
  • The claims data source may be a data source with insurance claims information.
  • Like the directory data source, the claims data source may be a proprietary database. Insurance claims may specify information necessary for insurance reimbursement. For example, claims information may include information on the healthcare provider, the services performed, and perhaps the amount claimed. The services performed may be described using a standardized code system, such as ICD-9. The information on the healthcare provider could include demographic information.
  • The one or more data sources 105 may receive data files from any number of origins, e.g., multiple practice groups, other ones of the plurality of data sources 105, etc. For example, the one or more data sources 105 may receive responses to requests for demographic information from, for example, medical practice groups, hospitals, or the like. This information may be entered by an administrator, and as such, the data file may include inconsistent or mislabeled nomenclatures for one or more fields of a plurality of fields of demographic information or it may include spurious demographic information. As another example, the one or more data sources 105 may acquire another entity that utilizes different nomenclatures for one or more fields of the plurality of fields. In some implementations, one or more of the plurality of data sources 105 may transmit a data file containing the plurality of fields of demographic information to the server 115.
  • In some embodiments, the data file may include a table of information having any number of headings labeling a plurality of fields of demographic information. For example, as illustrated in FIG. 3, the data file may include a table having the headings “Name,” “Addrs.,” “PH#,” “FX#,” “Specialty,” “License No.,” and “Expiration Date.” However, as illustrated in FIG. 3, the demographic information provided under the heading “FX#” is a number of email addresses. Furthermore, as illustrated in FIG. 3, one of the entries under the heading “Addrs.” includes a typographical error in the zip code. As further shown in FIG. 3, the data file may include extraneous metadata and/or superfluous information. Namely, as shown in FIG. 3, the data file may include, for example, “Author Name” and “Date Generated,” indicating who authored the data file and the date it was created.
  • In further embodiments, the data file may include a table of information having a heading and subheadings. For example, as illustrated in FIG. 4A, the data file may have a heading labeled “Group” with subheadings labeled “Name,” “Address # 1,” “Address # 2,” “Phone No.,” and “Fx #.” In another example, as illustrated in FIG. 4B, the data file may have a heading labeled “Group” with subheadings labeled “Name,” “Billing,” and “Service.” In yet another example, as illustrated in FIG. 5A, the data file may have a heading labeled “Group Name” with subheadings labeled “Name,” “Addr,” “Name,” and “Addr.” Thus, as illustrated in the examples shown in FIGS. 3-5B, the data file may have inconsistent or mislabeled nomenclatures or spurious demographic information. In some instances, the format of each data file having the demographic information may be inconsistent from one source to another.
  • The network 110 may include one or more wired and/or wireless networks. For example, the network 110 may include a cellular network (e.g., a long-term evolution (LTE) network, a code division multiple access (CDMA) network, a 3G network, a 4G network, a 5G network, another type of next generation network, etc.), a public land mobile network (PLMN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, a cloud computing network, and/or the like, and/or a combination of these or other types of networks.
  • To review and reformat the data files from the data sources 105, the server 115 may include an ingester 205, a repository 210, a display 215, and a model trainer 220, as illustrated in FIG. 2. In some embodiments, the ingester 205 may analyze the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information. For example, in some embodiments, the model trainer 220 may train the machine learning model using a number of Monte Carlo training sets having sample data files. That is, the model trainer 220 may use a sample set generated by humans identifying demographic information in a data file. In some embodiments, the machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information. In some embodiments, the plurality of machine learning algorithms may be supervised machine learning algorithms including, but not limited to, support vector machines, linear regression, logistic regression, naive Bayes, linear discriminant analysis, decision trees, k-nearest neighbor algorithms, neural networks, and similarity learning. It should be understood by those of ordinary skill in the art that these are merely example supervised machine learning algorithms and that other supervised machine learning algorithms may be used in accordance with aspects of the present disclosure.
  • As one example, the ingester 205 may analyze the data file by analyzing semantic content of each of the plurality of fields of demographic information to identify the different types of demographic information. For example, the ingester 205 may identify semantic content, such as a state name or state abbreviation, which indicates that the demographic information is likely an address, rather than, for example, a phone number or facsimile number. Similarly, the ingester 205 may identify semantic content, such as street names (e.g., Avenue, Road, Street, Lane, etc.) and/or their associated abbreviations (e.g., Ave., Rd., St., Ln., etc.), which would likewise indicate that the demographic information is an address. Even further, the ingester 205 may identify semantic content, such as state names (or country names) and/or their associated abbreviations, which would likewise indicate that the demographic information is an address. In some embodiments, the ingester 205 may also be able to identify a billing address based on the semantic content. For example, the semantic content may include a PO Box number, which would indicate that the content is a billing address, rather than a service address. In yet another example, the ingester 205 may identify semantic content, such as a hyperlink, which may indicate that the demographic information is an email address. It should be understood by those of ordinary skill in the art that these are merely examples of semantic content that may be identified, and that other types of semantic content are contemplated in accordance with aspects of the present disclosure.
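  • By way of non-limiting illustration, the semantic cues described above might be sketched as follows. The keyword lists, the function name `semantic_hints`, the hint labels, and the treatment of an “@” or a `mailto:` prefix as an email cue are all assumptions made for this sketch and are not taken from the disclosure.

```python
import re

# Hypothetical keyword lists; a fuller dictionary would likely be needed in practice.
STREET_WORDS = {"avenue", "ave", "road", "rd", "street", "st", "lane", "ln"}
STATE_WORDS = {"wi", "wisconsin", "ny", "ca", "california"}


def semantic_hints(value: str) -> set:
    """Return coarse type hints suggested by the tokens found in a field value."""
    text = value.lower()
    tokens = set(re.findall(r"[a-z]+", text))
    hints = set()
    # Street words or state names/abbreviations suggest an address.
    if tokens & STREET_WORDS or tokens & STATE_WORDS:
        hints.add("address")
    # A PO Box suggests a billing address rather than a service address.
    if re.search(r"\bp\.?\s*o\.?\s*box\b", text):
        hints.add("billing_address")
    # Assumed email cue: an "@" sign or a mailto: hyperlink.
    if "@" in text or text.startswith("mailto:"):
        hints.add("email")
    return hints
```

  • Short tokens such as “st” are ambiguous (for example, in a name beginning with “St.”), so in such a sketch the hints would be combined with shape and heading cues rather than used alone.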
  • As another example, the ingester 205 may analyze the data file by analyzing a shape of each of the plurality of fields of demographic information to identify the different types of demographic information. For example, the ingester 205 may analyze the demographic information to identify the number of characters, the type of the characters (e.g., numeric versus letter characters), the number of non-alphanumeric characters (e.g., spaces, commas, periods, or the like), and an overall arrangement of the alphanumeric characters and non-alphanumeric characters. For example, the shape of the demographic information may be “XXX[comma][space]XXX” or “XXX[comma][space]XXX[space]X[period]”, with each X representing a letter character, which are common formats identifying names. In another example, the shape of the demographic information may be ###[space]XXX[space]XXX[space]XXX[comma]XX[space]##### (or #####-####), with each # representing a numeric character and each X representing a letter character, which is a common format of an address. However, some data files may use a full state name, rather than the two-letter abbreviation for the state, and as such, the ingester 205 may identify the state within an address based on the semantic content, as discussed herein. In yet another example, the ingester 205 may identify the shape of the demographic information, such as XXX@XXX[period]XXXX, which indicates that the demographic information is an email address. It should be understood by those of ordinary skill in the art that these are merely examples of shapes of demographic content that may be identified, and that other types of shapes of demographic content are contemplated in accordance with aspects of the present disclosure.
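  • As a minimal sketch of the shape analysis described above, each character may be mapped to a class symbol (“#” for a digit, “X” for a letter, with other characters kept as-is). The function names and the run-collapsing step, which makes the shape independent of word length, are assumptions introduced for illustration.

```python
def shape_of(value: str) -> str:
    """Map each character to a class symbol: '#' for digits, 'X' for letters.
    Non-alphanumeric characters (spaces, commas, periods, '@') are kept as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("#")
        elif ch.isalpha():
            out.append("X")
        else:
            out.append(ch)
    return "".join(out)


def collapsed_shape(value: str) -> str:
    """Collapse runs of the same symbol so that word length does not matter."""
    out = []
    for sym in shape_of(value):
        if not out or out[-1] != sym:
            out.append(sym)
    return "".join(out)
```

  • Under these assumptions, a “Last, First” name and an email address yield distinct collapsed shapes (“X, X” versus “X@X.X”), while the uncollapsed shape still separates a five-digit zip code from a zip+4 code.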
  • As yet another example, the ingester 205 may analyze the data file by analyzing metadata of each of the plurality of fields of demographic information to identify the different types of demographic information. For example, the metadata may include each nomenclature of the headings. In some instances, the semantic content and shapes of the demographic information may be similar. For example, phone numbers and facsimile numbers may have similar semantic content and shapes. In another example, service addresses and billing addresses may have similar semantic content and shapes. To differentiate between demographic information having similar semantic content and shapes, the ingester 205 may analyze the metadata of the headings (or subheadings). For example, the ingester 205 may identify common nomenclatures used for the different types of demographic information. For example, common nomenclatures for phone numbers may include, but are not limited to, “Phone No.,” “Phone Number,” “P:,” “PH No.,” or the like, whereas common nomenclatures for facsimile numbers may include, but are not limited to, “Fax No.,” “Fax Number,” “F:,” “FX No.,” or the like. Likewise, common nomenclatures for service addresses may include the terms, for example, “Service,” “Serv.,” or the like, or the service address may be listed only as “Address” or some variation thereof, whereas the billing address may be specifically identified as such. Furthermore, the ingester 205 may analyze layered headings, as illustrated in the examples shown in FIGS. 3 and 4A-B. Using the data file shown in FIG. 3, the ingester 205 may analyze the headings “Author Name” and “Date Generated,” and determine that these fields are merely extraneous metadata and/or superfluous information that should be removed when reformatting the data file. As another example, using the data file shown in FIG. 4A, the ingester 205 may analyze the primary heading and subheadings, and determine that the demographic information provided below the primary heading is related to a practice group, i.e., a group name, group service address, group billing address, group phone number, and group facsimile number. In yet another example, using the data file shown in FIG. 4B, the ingester 205 may analyze the primary heading and subheadings, and determine that the demographic information provided below the primary heading is related to a practice group, i.e., a group name. The remaining subheadings, however, are “Billing” and “Service,” and the ingester 205 may determine that the demographic information provided under these subheadings is a billing address, billing phone number, service address, and service phone number, respectively.
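  • One way to picture the heading analysis described above is a lookup of normalized headings against a table of known nomenclatures, qualified by a parent heading when headings are layered. The alias table, function names, and label format below are hypothetical and serve only as a sketch.

```python
import re

# Hypothetical alias table built from the nomenclatures listed above.
HEADING_ALIASES = {
    "phone": {"phone no", "phone number", "p", "ph no", "ph#"},
    "fax": {"fax no", "fax number", "f", "fx no", "fx#", "fx #"},
}


def normalize_heading(heading: str) -> str:
    """Lowercase, drop periods and colons, and collapse whitespace."""
    cleaned = re.sub(r"[.:]", "", heading.lower())
    return " ".join(cleaned.split())


def heading_type(heading, parent=None):
    """Return a type label for a heading, prefixed by its parent heading if any.
    Returns None when the heading is not a known nomenclature."""
    norm = normalize_heading(heading)
    for kind, aliases in HEADING_ALIASES.items():
        if norm in aliases:
            return f"{normalize_heading(parent)}_{kind}" if parent else kind
    return None
```

  • For a layered table such as the one in FIG. 4A, passing the primary heading as `parent` yields a qualified label such as `group_fax`, which keeps group-level fields distinct from provider-level fields.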
  • In some embodiments, the machine learning model may also be trained on respective rules for common types of demographic information. For example, the rules may include a rule that a five digit number or a five digit number followed by a hyphen and another four digit number is a zip code, as these are the only available formats for zip codes. As another example, an NPI may be formatted as a ten digit number with the first digit being a “1,” and as such, the rules may include a rule indicating that any ten digit number commencing with a “1” is an NPI. In a further example, the rules may include a rule for determining responses to binary pieces of demographic information, e.g., whether a healthcare provider is accepting new patients—“Yes”/“Y” or “No”/“N.” By using rules for common types of demographic information, the present disclosure avoids overtraining, and thus overcomplicating, the machine learning model and also improves efficiency of the machine learning model. In some embodiments, these rules may be defined as regular expressions; however, it should be understood by those of ordinary skill in the art that other types of rules may be used.
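  • For illustration only, the rules described above might be expressed as regular expressions along the following lines. The rule names, the `match_rule` helper, and the rule ordering are assumptions; the patterns themselves mirror the formats stated above.

```python
import re

# Patterns mirror the formats stated in the text; names and ordering are illustrative.
RULES = {
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),           # 12345 or 12345-6789
    "npi": re.compile(r"1\d{9}"),                         # ten digits starting with 1
    "yes_no": re.compile(r"(yes|y|no|n)", re.IGNORECASE),
}


def match_rule(value: str):
    """Return the name of the first rule matching the whole value, else None."""
    text = value.strip()
    for name, pattern in RULES.items():
        if pattern.fullmatch(text):
            return name
    return None
```

  • Using `fullmatch` rather than `search` keeps a ten digit NPI from being mistaken for a value that merely contains five consecutive digits.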
  • In some embodiments, the ingester 205 may analyze the inter-columnar relationship between multiple columns. For example, as illustrated in FIG. 5A, the data file includes alternating headings of “Name” and “Addr.” After reviewing the semantic content, shape, and metadata of the rows under each column, the ingester 205 may determine that the respective types of demographic information are names and addresses. Furthermore, by analyzing the inter-columnar relationship between multiple columns, the ingester 205 may determine that the alternating headings should be grouped as pairs, e.g., a healthcare provider name and their associated address. As another example illustrated in FIG. 5B, the data file may include multiple addresses for a single healthcare provider, i.e., “Addrs. 1,” “City 1,” “State 1,” as well as “Addrs. 2,” “City 2,” “State 2.” In this instance, the ingester 205 may determine that each address is associated with the same healthcare provider, and separate each address into separate entries, e.g., separate row of information, in a revised data file, while still associating the addresses with the same healthcare provider.
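  • The separation of numbered column groups into separate rows, as in the FIG. 5B example, can be sketched as follows. This assumes that related columns share a trailing number in their headings; the function name and the dictionary-based row representation are hypothetical.

```python
import re
from collections import defaultdict


def split_numbered_groups(row: dict) -> list:
    """Turn columns such as 'Addrs. 1', 'City 1', 'Addrs. 2', 'City 2' into one
    output record per numeric suffix, repeating the unnumbered columns."""
    shared = {}
    groups = defaultdict(dict)
    for heading, value in row.items():
        m = re.fullmatch(r"(.+?)\s*(\d+)", heading)
        if m:
            # Group by the trailing number; keep the heading without the number.
            groups[int(m.group(2))][m.group(1)] = value
        else:
            shared[heading] = value
    if not groups:
        return [shared]
    return [{**shared, **groups[n]} for n in sorted(groups)]
```

  • Each output record repeats the shared columns (for example, the provider name), so the separate address rows remain associated with the same healthcare provider.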
  • The ingester 205 may also generate a score indicating a probability that each of the plurality of fields of demographic information was identified correctly. For example, the ingester 205 may generate a baseline score for each of the plurality of fields of demographic information, which may then be adjusted. For example, the ingester 205 may increase the scores for demographic information having well-known semantic content and/or shapes, e.g., zip codes and NPIs. Additionally, the ingester 205 may increase or decrease the score based on whether the heading correctly identifies the associated demographic information, e.g., whether the heading correctly identifies “NPIs.” For example, the score may be decreased when the heading and the content do not match, whereas the score may be increased when the heading and content match. In some embodiments, the ingester 205 may adjust the score based on whether other demographic information having similar semantic content and/or shapes has been detected. For example, the ingester 205 may increase the score for a telephone number or address if only a single piece of demographic information having the given semantic content and/or shape is identified. However, in the event two or more identified fields of demographic information having the same semantic content and/or shape are identified (e.g., a phone number and a facsimile number or a service address and a billing address), the ingester 205 may decrease the score for each of the two or more identified fields of demographic information, and these identified fields may have the same score. Furthermore, in some situations, the ingester 205 may generate an alert notifying an administrator of the two or more identified fields of demographic information having the same semantic content and/or shape, such that the administrator may provide input to resolve the conflict.
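  • The score adjustments described above might be sketched as a simple additive scheme. The disclosure does not specify how much each cue changes the score, so the step sizes below (0.2 and 0.1), the clamping to the range 0 to 1, and the function signature are assumptions for illustration.

```python
def adjust_score(baseline, rule_matched, heading_matches, n_similar_fields):
    """Adjust a baseline confidence using the cues described above.

    The step sizes are illustrative assumptions, not values from the text.
    n_similar_fields counts the fields sharing this semantic content/shape,
    including the field being scored.
    """
    score = baseline
    if rule_matched:                      # well-known format, e.g. zip code or NPI
        score += 0.2
    score += 0.1 if heading_matches else -0.1
    if n_similar_fields == 1:             # unambiguous: only one such field
        score += 0.1
    elif n_similar_fields >= 2:           # e.g. phone and fax look alike
        score -= 0.2
    return round(max(0.0, min(1.0, score)), 3)
```

  • Because two look-alike fields receive the same penalty from the same baseline, they end with the same score, which is consistent with the behavior described above.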
  • To resolve this, the ingester 205 may apply additional processing to distinguish between the two or more identified fields of demographic information. For example, in some embodiments, the ingester 205 may cross-check at least one of the plurality of fields of demographic information against known demographic information stored in, for example, the repository 210. For example, the ingester 205 may cross-check an identified phone number and an identified facsimile number against known phone numbers and facsimile numbers to verify which is the phone number and which is the facsimile number. In some embodiments, the ingester 205 may sequentially check the digits of the phone and facsimile numbers until the ingester 205 determines that one of the two is a phone number. In some instances, only one of the two identified fields of demographic information may be known, e.g., the phone number, and the ingester 205 may identify one of the two or more identified fields of demographic information, accordingly, with the remaining field of demographic information being identified as the most reasonable alternative (e.g., the facsimile number). Similarly, the ingester 205 may cross-check other pieces of demographic information, such as the NPI, service addresses, and billing addresses. It should be understood by those of ordinary skill in the arts that these are merely examples of the types of demographic information that may be cross-checked, and that other types of demographic information may be cross-checked in accordance with aspects of the present disclosure.
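  • As a hedged sketch of the cross-checking described above, two look-alike number fields may be labeled by comparing their digits against known values, with the remaining field receiving the alternative label when only one is recognized. The in-memory sets stand in for a lookup against the repository 210; the function name and data layout are hypothetical.

```python
def resolve_phone_fax(candidates, known_phones, known_faxes):
    """Label look-alike number fields by cross-checking against known values.

    candidates maps a column name to a sample value from that column;
    known_phones and known_faxes are sets of digit-only strings.
    """
    labels = {}
    for column, value in candidates.items():
        digits = "".join(ch for ch in value if ch.isdigit())
        if digits in known_phones:
            labels[column] = "phone"
        elif digits in known_faxes:
            labels[column] = "fax"
    # If exactly one of two columns was recognized, give the other column
    # the remaining label as the most reasonable alternative.
    unresolved = [c for c in candidates if c not in labels]
    if len(candidates) == 2 and len(labels) == 1 and len(unresolved) == 1:
        known = next(iter(labels.values()))
        labels[unresolved[0]] = "fax" if known == "phone" else "phone"
    return labels
```

  • When neither value is known, the function returns an empty mapping, which in this sketch would correspond to the case where an administrator alert is appropriate.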
  • Additionally, the ingester 205 may identify incorrect information and, in some instances, update the incorrect information. For example, as illustrated in FIG. 3, the zip code in the address associated with “Jane Doe” includes a typographical error, and to fix this error, the ingester 205 may query the repository 210 to identify a correct zip code. Additionally, or alternatively, the ingester 205 may compare the incorrect zip code to other zip codes of the data file, e.g., the zip code associated with “John Doe,” as illustrated in FIG. 3. As the addresses of “Jane Doe” and “John Doe” have the same street address, city, and state, the ingester 205 may determine the zip code associated with “John Doe” is the correct zip code and update the zip code for “Jane Doe” accordingly. Additionally, the ingester 205 may determine whether identified information is correct by cross-checking, for example, identified phone numbers against known phone numbers. In some instances, the cross-checking may confirm that the identified numbers are indeed phone numbers. In other instances, the cross-checking may determine that the identified phone numbers were incorrectly labeled in the data file, and in fact, are facsimile numbers, rather than phone numbers.
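  • The in-file zip code correction described above might look like the following sketch. It assumes that the typographical error leaves the zip code malformed (so that it fails the zip code format rule) and that rows are dictionaries with hypothetical `street`, `city`, `state`, and `zip` keys; neither assumption is stated in the disclosure.

```python
import re

ZIP_RE = re.compile(r"\d{5}(-\d{4})?")


def _address_key(row):
    """Case-insensitive key over street, city, and state."""
    return (row["street"].lower(), row["city"].lower(), row["state"].lower())


def repair_zip_codes(rows):
    """Replace malformed zip codes with the single valid zip code found on
    other rows sharing the same street, city, and state. Returns new dicts."""
    valid = {}
    for row in rows:
        if ZIP_RE.fullmatch(row["zip"]):
            valid.setdefault(_address_key(row), set()).add(row["zip"])
    repaired = []
    for row in rows:
        new_row = dict(row)
        options = valid.get(_address_key(row), set())
        # Only repair when the replacement is unambiguous.
        if not ZIP_RE.fullmatch(row["zip"]) and len(options) == 1:
            new_row["zip"] = next(iter(options))
        repaired.append(new_row)
    return repaired
```

  • A well-formed but wrong zip code would not be caught by this sketch; that case would rely on the repository lookup described above.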
  • In some embodiments, the ingester 205 may analyze a limited number of rows of demographic information in the data file (i.e., less than the full number of rows in the data file) to improve the overall efficiency of the ingester 205. For example, after analyzing the semantic content, shape, and metadata of a number of rows, the ingester 205 may be able to identify the type of demographic information of each of the plurality of fields of demographic information, and assume that all remaining rows that have not been analyzed are the identified type of demographic information. Furthermore, the ingester 205 may generate the revised data file in smaller segments of rows, rather than the entire data file, which may require substantial amounts of resources, e.g., processing consumption and memory resources. By assuming the type of demographic information of the remaining rows, the ingester 205 reduces the overall amount of resources used and improves the efficiency of the server 115.
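  • A minimal sketch of the limited-row analysis described above is shown below. The sample size, the majority-vote rule, and the `classify` callback (standing in for the semantic, shape, and metadata analysis) are assumptions made for illustration.

```python
from collections import Counter
from itertools import islice


def infer_column_type(values, classify, sample_size=50):
    """Classify only the first sample_size values of a column and return the
    majority label together with the fraction of sampled values that agreed."""
    sample = list(islice(values, sample_size))
    if not sample:
        return None, 0.0
    counts = Counter(classify(v) for v in sample)
    label, hits = counts.most_common(1)[0]
    return label, hits / len(sample)
```

  • Because `islice` stops after `sample_size` items, a column supplied as an iterator is never read past the sample, which is where the resource savings in this sketch come from.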
  • Once the plurality of fields of demographic information have been identified and corrected as needed, the ingester 205 may generate a revised data file labeling each of the plurality of fields of demographic information based on the identified type. In some embodiments, the ingester 205 may generate a revised data file having a format that is customized according to a request from the data source 105. For example, the requested format may be a format that is consistent with preexisting data files of the data source 105. As another example, the requested format may be an entirely new format. For example, as illustrated in FIG. 6, the data source 105 may request that the demographic information be separated into “F Name,” “L Name,” “Street Address,” “City,” “State,” and “Zip Code.” To achieve this, the ingester 205 may identify fields for the requested format and parse through the identified types of demographic information to determine which demographic information belongs in which field of the requested format. That is, for example, when the ingester 205 identifies the demographic information as being “Last Name, First Name” or “Full Name,” the ingester 205 may parse the demographic information and separate it into different fields in the revised data file, i.e., “First Name” and “Last Name.” In other words, the ingester 205 may generate new columns by parsing a column of a single type of demographic information (e.g., “Full Name”) into its subcomponents and placing each subcomponent in a separate column (e.g., “First Name” and “Last Name”). Likewise, the ingester 205 may generate a new column by combining separate columns of information (e.g., “First Name” and “Last Name”) into a single column (e.g., “Full Name”). It should be understood by those of ordinary skill in the art that this is merely an example, and that the ingester 205 may parse other types of demographic information in accordance with aspects of the present disclosure. In further embodiments, the ingester 205 may separate a single incoming data file into any number of revised data files.
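  • The splitting and combining of name columns described above can be sketched as follows. The function names are hypothetical, and the sketch deliberately ignores middle names, suffixes, and credentials, which a fuller parser would need to handle.

```python
def split_full_name(value: str):
    """Split 'Last, First' or 'First Last' into a (first, last) tuple.
    Middle names and suffixes are not handled in this sketch."""
    text = value.strip()
    if "," in text:
        last, first = (part.strip() for part in text.split(",", 1))
    else:
        # Everything before the final space is treated as the first name.
        first, _, last = text.rpartition(" ")
    return first.strip(), last.strip()


def join_full_name(first: str, last: str) -> str:
    """Combine separate first and last name fields into one 'First Last' field."""
    return f"{first} {last}".strip()
```

  • A single-token value such as “Doe” is treated as a last name with an empty first name, which is one possible convention among several.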
  • In some instances, a given piece of demographic information may not match what the ingester 205 identified as the type of demographic information. For example, the ingester 205 may identify one of the plurality of fields of demographic information as being NPIs, but one entry may not match the known format for an NPI. In such circumstances, the ingester 205 may pass through the mismatching demographic information untouched, render the value null, or insert special characters flagging the particular entry. Alternatively, the ingester 205 may generate an alert notifying an administrator of the mismatching demographic information, such that the administrator may provide input to resolve the discrepancy.
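  • The three handling options described above (pass through, null, or flag) might be sketched as a single helper. The mode names and the “??” flag marker are hypothetical choices for this sketch; the NPI pattern follows the format stated earlier in this description.

```python
import re

NPI_RE = re.compile(r"1\d{9}")  # ten digits starting with 1, per the description


def handle_mismatch(value, pattern, mode="flag"):
    """Return the value unchanged if it matches the expected pattern; otherwise
    pass it through, null it, or wrap it in a flag marker, depending on mode."""
    if pattern.fullmatch(value.strip()):
        return value
    if mode == "pass":
        return value
    if mode == "null":
        return None
    if mode == "flag":
        return f"??{value}??"  # hypothetical marker for later review
    raise ValueError(f"unknown mode: {mode}")
```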
  • In some embodiments, the ingester 205 may determine additional information based on the identified demographic information. For example, using the identified address, the ingester 205 may determine the geolocation or coordinates of the healthcare provider. As another example, the ingester 205 may supplement a missing zip code based on a known street address, city, and state. The ingester 205 may include such additional information in the revised data file upon request. The ingester 205 may store the revised data file in the repository 210, and the server 115 may transmit the revised data file to the data source 105 over the network 110.
  • FIG. 7 illustrates a method for identifying demographic information in a data file.
  • At 705, a computing device, e.g., server 115, may receive the data file containing a plurality of fields of demographic information from a third-party. The data file may have inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information.
  • At 710, the computing device, e.g., server 115, may analyze the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information. The machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information.
  • At 715, the computing device, e.g., server 115, may generate a score indicating a probability that each of the plurality of fields of demographic information was identified correctly.
  • At 720, the computing device, e.g., server 115, may generate a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
  • Each of the servers and modules described above can be implemented in software, firmware, or hardware on a computing device. A computing device can include, but is not limited to, a personal computer, a mobile device such as a mobile phone, workstation, embedded system, game console, television, set-top box, or any other computing device. Further, a computing device can include, but is not limited to, a device having a processor and memory, including a non-transitory memory, for executing and storing instructions. The memory may tangibly embody the data and program instructions in a non-transitory manner. Software may include one or more applications and an operating system. Hardware can include, but is not limited to, a processor, a memory, and a graphical user interface display. The computing device may also have multiple processors and multiple shared or separate memory components. For example, the computing device may be a part of or the entirety of a clustered or distributed computing environment or server farm.
  • Various embodiments may be implemented, for example, using one or more well-known computer systems, such as computer system 800 shown in FIG. 8. One or more computer systems 800 may be used, for example, to implement any of the embodiments discussed herein, as well as combinations and sub-combinations thereof.
  • Computer system 800 may include one or more processors (also called central processing units, or CPUs), such as a processor 804. Processor 804 may be connected to a communication infrastructure or bus 806.
  • Computer system 800 may also include user input/output device(s) 803, such as monitors, keyboards, pointing devices, etc., which may communicate with communication infrastructure 806 through user input/output interface(s) 802.
  • One or more of processors 804 may be a graphics processing unit (GPU). In an embodiment, a GPU may be a processor that is a specialized electronic circuit designed to process mathematically intensive applications. The GPU may have a parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data common to computer graphics applications, images, videos, etc.
  • Computer system 800 may also include a main or primary memory 808, such as random access memory (RAM). Main memory 808 may include one or more levels of cache. Main memory 808 may have stored therein control logic (i.e., computer software) and/or data.
  • Computer system 800 may also include one or more secondary storage devices or memory 810. Secondary memory 810 may include, for example, a hard disk drive 812 and/or a removable storage device or drive 814. Removable storage drive 814 may be a floppy disk drive, a magnetic tape drive, a compact disk drive, an optical storage device, tape backup device, and/or any other storage device/drive.
  • Removable storage drive 814 may interact with a removable storage unit 818. Removable storage unit 818 may include a computer usable or readable storage device having stored thereon computer software (control logic) and/or data. Removable storage unit 818 may be a floppy disk, magnetic tape, compact disk, DVD, optical storage disk, and/or any other computer data storage device. Removable storage drive 814 may read from and/or write to removable storage unit 818.
  • Secondary memory 810 may include other means, devices, components, instrumentalities or other approaches for allowing computer programs and/or other instructions and/or data to be accessed by computer system 800. Such means, devices, components, instrumentalities or other approaches may include, for example, a removable storage unit 822 and an interface 820. Examples of the removable storage unit 822 and the interface 820 may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, a memory stick and USB port, a memory card and associated memory card slot, and/or any other removable storage unit and associated interface.
  • Computer system 800 may further include a communication or network interface 824. Communication interface 824 may enable computer system 800 to communicate and interact with any combination of external devices, external networks, external entities, etc. (individually and collectively referenced by reference number 828). For example, communication interface 824 may allow computer system 800 to communicate with external or remote devices 828 over communications path 826, which may be wired and/or wireless (or a combination thereof), and which may include any combination of LANs, WANs, the Internet, etc. Control logic and/or data may be transmitted to and from computer system 800 via communication path 826.
  • Computer system 800 may also be any of a personal digital assistant (PDA), desktop workstation, laptop or notebook computer, netbook, tablet, smart phone, smart watch or other wearable, appliance, part of the Internet-of-Things, and/or embedded system, to name a few non-limiting examples, or any combination thereof.
  • Computer system 800 may be a client or server, accessing or hosting any applications and/or data through any delivery paradigm, including but not limited to remote or distributed cloud computing solutions; local or on-premises software (“on-premise” cloud-based solutions); “as a service” models (e.g., content as a service (CaaS), digital content as a service (DCaaS), software as a service (SaaS), managed software as a service (MSaaS), platform as a service (PaaS), desktop as a service (DaaS), framework as a service (FaaS), backend as a service (BaaS), mobile backend as a service (MBaaS), infrastructure as a service (IaaS), etc.); and/or a hybrid model including any combination of the foregoing examples or other services or delivery paradigms.
  • Any applicable data structures, file formats, and schemas in computer system 800 may be derived from standards including but not limited to JavaScript Object Notation (JSON), Extensible Markup Language (XML), Yet Another Markup Language (YAML), Extensible Hypertext Markup Language (XHTML), Wireless Markup Language (WML), MessagePack, XML User Interface Language (XUL), comma-separated values (CSV), or any other functionally similar representations alone or in combination. Alternatively, proprietary data structures, formats or schemas may be used, either exclusively or in combination with known or open standards.
  • In some embodiments, a tangible, non-transitory apparatus or article of manufacture comprising a tangible, non-transitory computer useable or readable medium having control logic (software) stored thereon may also be referred to herein as a computer program product or program storage device. This includes, but is not limited to, computer system 800, main memory 808, secondary memory 810, and removable storage units 818 and 822, as well as tangible articles of manufacture embodying any combination of the foregoing. Such control logic, when executed by one or more data processing devices (such as computer system 800), may cause such data processing devices to operate as described herein.
  • Based on the teachings contained in this disclosure, it will be apparent to persons skilled in the relevant art(s) how to make and use embodiments of this disclosure using data processing devices, computer systems and/or computer architectures other than that shown in FIG. 8. In particular, embodiments can operate with software, hardware, and/or operating system embodiments other than those described herein.
  • Conclusion
  • The present invention has been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
  • The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
  • The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (20)

What is claimed is:
1. A computer-implemented method of identifying demographic information in a data file, comprising:
receiving the data file containing a plurality of fields of demographic information from a third-party, the data file having inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information;
analyzing the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information, the machine learning model being based on a plurality of machine learning algorithms to identify different types of demographic information;
generating a score indicating a probability that each of the plurality of fields of demographic information was identified correctly; and
generating a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
2. The method of claim 1, wherein analyzing the data file comprises analyzing semantic content of each of the plurality of fields of demographic information to identify the different types of demographic information.
3. The method of claim 1, wherein analyzing the data file comprises analyzing a shape of each of the plurality of fields of demographic information to identify the different types of demographic information.
4. The method of claim 1, wherein analyzing the data file comprises analyzing metadata of each of the plurality of fields of demographic information to identify the different types of demographic information.
5. The method of claim 4, wherein the metadata includes each nomenclature of each of the plurality of fields of demographic information.
6. The method of claim 1, wherein, in response to identifying different ones of the plurality of fields of demographic information, the method further comprises cross-checking at least one of the plurality of fields of demographic information against known demographic information.
7. The method of claim 1, further comprising transmitting the revised data file to the third-party.
8. A system for identifying demographic information in a data file, comprising:
a memory that stores instructions for identifying the demographic information in the data file; and
a processor configured to execute the instructions that cause the processor to:
receive the data file containing a plurality of fields of demographic information from a third-party, the data file having inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information;
analyze the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information, the machine learning model being based on a plurality of machine learning algorithms to identify different types of demographic information;
generate a score indicating a probability that each of the plurality of fields of demographic information was identified correctly; and
generate a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
9. The system of claim 8, wherein analyzing the data file comprises analyzing semantic content of each of the plurality of fields of demographic information to identify the different types of demographic information.
10. The system of claim 8, wherein analyzing the data file comprises analyzing a shape of each of the plurality of fields of demographic information to identify the different types of demographic information.
11. The system of claim 10, wherein the metadata includes each nomenclature of each of the plurality of fields of demographic information.
12. The system of claim 8, wherein analyzing the data file comprises analyzing each nomenclature to identify the different types of demographic information.
13. The system of claim 8, wherein, in response to identifying different ones of the plurality of fields of demographic information, the instructions further cause the processor to cross-check at least one of the plurality of fields of demographic information against known demographic information.
14. The system of claim 8, wherein the instructions further cause the processor to transmit the revised data file to the third-party.
15. A non-transitory program storage device having instructions stored thereon that, when executed by at least one computing device, cause the at least one computing device to perform a method, the method comprising:
receiving the data file containing a plurality of fields of demographic information from a third-party, the data file having inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information;
analyzing the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information, the machine learning model being based on a plurality of machine learning algorithms to identify different types of demographic information;
generating a score indicating a probability that each of the plurality of fields of demographic information was identified correctly; and
generating a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
16. The method of claim 15, wherein analyzing the data file comprises analyzing semantic content of each of the plurality of fields of demographic information to identify the different types of demographic information.
17. The method of claim 15, wherein analyzing the data file comprises analyzing a shape of each of the plurality of fields of demographic information to identify the different types of demographic information.
18. The method of claim 15, wherein analyzing the data file comprises analyzing metadata of each of the plurality of fields of demographic information to identify the different types of demographic information.
19. The method of claim 18, wherein the metadata includes each nomenclature of each of the plurality of fields of demographic information.
20. The method of claim 15, wherein, in response to identifying different ones of the plurality of fields of demographic information, the method further comprises cross-checking at least one of the plurality of fields of demographic information against known demographic information.
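The steps recited in the independent claims (receive a mislabeled file, identify each field's type, score the identification, emit a relabeled file) can be illustrated with a minimal sketch. It is not the claimed machine learning model: a simple weighted combination of two hand-written heuristics stands in for the trained model, one scoring the "shape" of the values (claims 3, 10, 17) and one scoring the field's nomenclature (claims 5, 12, 19). The field types, regular expressions, name hints, weight, threshold, and function names are all assumptions made for illustration.

```python
import csv
import io
import re

# Hypothetical field types and value "shapes" (not taken from the patent).
SHAPES = {
    "phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
    "date_of_birth": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

# Hypothetical substrings that suggest a type when found in a column header.
NAME_HINTS = {
    "phone": ("phone", "tel"),
    "zip_code": ("zip", "postal"),
    "date_of_birth": ("dob", "birth"),
}


def shape_score(values, field_type):
    """Fraction of non-empty values matching the type's pattern."""
    filled = [v.strip() for v in values if v.strip()]
    if not filled:
        return 0.0
    pattern = SHAPES[field_type]
    return sum(1 for v in filled if pattern.match(v)) / len(filled)


def name_score(header, field_type):
    """1.0 if the header contains a hint for the type, else 0.0."""
    lowered = header.lower()
    return 1.0 if any(h in lowered for h in NAME_HINTS[field_type]) else 0.0


def identify(header, values, shape_weight=0.8):
    """Return (best_type, score) from a weighted blend of both heuristics."""
    best_type, best_score = "unknown", 0.0
    for field_type in SHAPES:
        score = (shape_weight * shape_score(values, field_type)
                 + (1 - shape_weight) * name_score(header, field_type))
        if score > best_score:
            best_type, best_score = field_type, score
    return best_type, best_score


def relabel(csv_text, threshold=0.5):
    """Relabel columns whose identified type scores at least `threshold`."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    headers, body = rows[0], rows[1:]
    new_headers, scores = [], []
    for i, header in enumerate(headers):
        column = [row[i] for row in body if i < len(row)]
        field_type, score = identify(header, column)
        # Keep the original header when the identification is too uncertain.
        new_headers.append(field_type if score >= threshold else header)
        scores.append(round(score, 2))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(new_headers)
    writer.writerows(body)
    return out.getvalue(), scores


# Usage: the first two headers are swapped and the third is uninformative.
sample = (
    "dob,zip,col3\n"
    "20001,1990-04-12,202-555-0143\n"
    "94105,1985-11-02,(415) 555-0199\n"
)
revised_text, field_scores = relabel(sample)
```

Weighting the value shape more heavily than the header reflects the premise of the claims that the nomenclature may be inconsistent or mislabeled; a trained model would learn such a balance from other data files rather than fix it by hand.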
US16/668,565 2019-10-30 2019-10-30 Efficient data processing to identify information and reformant data files, and applications thereof Abandoned US20210133769A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US16/668,565 US20210133769A1 (en) 2019-10-30 2019-10-30 Efficient data processing to identify information and reformant data files, and applications thereof
PCT/US2020/058200 WO2021087254A1 (en) 2019-10-30 2020-10-30 Efficient data processing to identify information and reformat data files, and applications thereof
CN202080076168.4A CN114830079A (en) 2019-10-30 2020-10-30 Efficient data processing for identifying information and reformatting data files and applications thereof
EP20882233.8A EP4052119A4 (en) 2019-10-30 2020-10-30 Efficient data processing to identify information and reformat data files, and applications thereof
US17/181,519 US20210174380A1 (en) 2019-10-30 2021-02-22 Efficient data processing to identify information and reformant data files, and applications thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US16/668,565 US20210133769A1 (en) 2019-10-30 2019-10-30 Efficient data processing to identify information and reformant data files, and applications thereof

Related Child Applications (1)

Application Number Priority Date Filing Date Title
US17/181,519 Continuation US20210174380A1 (en) 2019-10-30 2021-02-22 Efficient data processing to identify information and reformant data files, and applications thereof

Publications (1)

Publication Number Publication Date
US20210133769A1 (en) 2021-05-06

Family

ID=75689018

Family Applications (2)

Application Number Priority Date Filing Date Title
US16/668,565 Abandoned US20210133769A1 (en) 2019-10-30 2019-10-30 Efficient data processing to identify information and reformant data files, and applications thereof
US17/181,519 Pending US20210174380A1 (en) 2019-10-30 2021-02-22 Efficient data processing to identify information and reformant data files, and applications thereof

Family Applications After (1)

Application Number Priority Date Filing Date Title
US17/181,519 Pending US20210174380A1 (en) 2019-10-30 2021-02-22 Efficient data processing to identify information and reformant data files, and applications thereof

Country Status (4)

Country Link
US (2) US20210133769A1 (en)
EP (1) EP4052119A4 (en)
CN (1) CN114830079A (en)
WO (1) WO2021087254A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230273934A1 (en) * 2022-02-25 2023-08-31 Veda Data Solutions, Inc. Efficient column detection using sequencing, and applications thereof
US20230273848A1 (en) * 2022-02-25 2023-08-31 Veda Data Solutions, Inc. Converting tabular demographic information into an export entity file
US20230273900A1 (en) * 2022-02-25 2023-08-31 Veda Data Solutions, Inc. Fault tolerant method for processing data with human intervention

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2003207856A1 (en) * 2002-02-04 2003-09-02 Cataphora, Inc A method and apparatus to visually present discussions for data mining purposes
US8782087B2 (en) * 2005-03-18 2014-07-15 Beyondcore, Inc. Analyzing large data sets to find deviation patterns
US20070282681A1 (en) * 2006-05-31 2007-12-06 Eric Shubert Method of obtaining and using anonymous consumer purchase and demographic data
WO2007149216A2 (en) * 2006-06-21 2007-12-27 Information Extraction Systems An apparatus, system and method for developing tools to process natural language text
US20100191689A1 (en) * 2009-01-27 2010-07-29 Google Inc. Video content analysis for automatic demographics recognition of users and videos
US8583609B2 (en) * 2011-02-08 2013-11-12 Barry Sewall Method and system for creating an industry-specific computer dictionary and metadata apparatus for computer management applications using a multi-level database of terms and definitions
US9406072B2 (en) * 2012-03-29 2016-08-02 Spotify Ab Demographic and media preference prediction using media content data analysis
CN113571187A (en) * 2014-11-14 2021-10-29 Zoll医疗公司 Medical premonitory event estimation system and externally worn defibrillator
US9952962B2 (en) * 2015-03-26 2018-04-24 International Business Machines Corporation Increasing accuracy of traceability links and structured data
US20170286622A1 (en) * 2016-03-29 2017-10-05 International Business Machines Corporation Patient Risk Assessment Based on Machine Learning of Health Risks of Patient Population
US10599129B2 (en) * 2017-08-04 2020-03-24 Duro Labs, Inc. Method for data normalization
US10784000B2 (en) * 2018-03-16 2020-09-22 Vvc Holding Corporation Medical system interface apparatus and methods to classify and provide medical data using artificial intelligence
US11568302B2 (en) * 2018-04-09 2023-01-31 Veda Data Solutions, Llc Training machine learning algorithms with temporally variant personal data, and applications thereof
US11055327B2 (en) * 2018-07-01 2021-07-06 Quadient Technologies France Unstructured data parsing for structured information
US10963378B2 (en) * 2019-03-19 2021-03-30 International Business Machines Corporation Dynamic capacity allocation of stripes in cluster based storage systems

Also Published As

Publication number Publication date
CN114830079A (en) 2022-07-29
EP4052119A4 (en) 2023-10-18
EP4052119A1 (en) 2022-09-07
WO2021087254A1 (en) 2021-05-06
US20210174380A1 (en) 2021-06-10

Similar Documents

Publication Publication Date Title
Kho et al. Design and implementation of a privacy preserving electronic health record linkage tool in Chicago
Roski et al. Creating value in health care through big data: opportunities and policy implications
US11232365B2 (en) Digital assistant platform
US20210174380A1 (en) Efficient data processing to identify information and reformant data files, and applications thereof
Ross et al. The HMO research network virtual data warehouse: a public data model to support collaboration
US20170091391A1 (en) Patient Protected Information De-Identification System and Method
US20200265931A1 (en) Systems and methods for coding health records using weighted belief networks
US20140025393A1 (en) System and method for providing clinical decision support
US20140316797A1 (en) Methods and system for evaluating medication regimen using risk assessment and reconciliation
US8050937B1 (en) Method and system for providing relevant content based on claim analysis
US20170330102A1 (en) Rule-based feature engineering, model creation and hosting
Newgard et al. Building a longitudinal cohort from 9‐1‐1 to 1‐year using existing data sources, probabilistic linkage, and multiple imputation: a validation study
CN111145847A (en) Clinical test data entry method and device, medium and electronic equipment
US20160267115A1 (en) Methods and Systems for Common Key Services
Moe et al. Identifying subgroups and risk among frequent emergency department users in British Columbia
US11205504B2 (en) System and method for computerized synthesis of simulated health data
Dzando et al. Healthcare in Ghana amidst the coronavirus pandemic: A narrative literature review
US20210043319A1 (en) Healthcare data cloud system, server and method
Marsolo et al. Assessing the impact of privacy-preserving record linkage on record overlap and patient demographic and clinical characteristics in PCORnet®, the National Patient-Centered Clinical Research Network
Srinivasa et al. Analytics on medical records collected from a distributed system deployed in the Indian rural demographic
Hanson et al. Generating real-world data from health records: design of a patient-centric study in multiple sclerosis using a commercial health records platform
Neira et al. Extraction of data from a hospital information system to perform process mining
US20220189641A1 (en) Opioid Use Disorder Predictor
de Silva Relational databases and biomedical big data
US20230273848A1 (en) Converting tabular demographic information into an export entity file

Legal Events

Date Code Title Description
AS Assignment

Owner name: VEDA DATA SOLUTIONS, INC., DISTRICT OF COLUMBIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:VERA-CIRO, CARLOS;LINDNER, ROBERT RAYMOND;SIGNING DATES FROM 20191024 TO 20191029;REEL/FRAME:050883/0151

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: COMERICA BANK, MICHIGAN

Free format text: SECURITY INTEREST;ASSIGNOR:VEDA DATA SOLUTIONS, INC.;REEL/FRAME:065668/0675

Effective date: 20231120