CN111161824A

CN111161824A - Automatic report interpretation method and system

Info

Publication number: CN111161824A
Application number: CN201911328539.9A
Authority: CN
Inventors: 梁萌萌; 余伟师; 谢欣
Original assignee: Suzhou Semek Gene Technology Co ltd
Current assignee: Suzhou Semek Gene Technology Co ltd
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2020-05-15
Also published as: WO2021120528A1

Abstract

The invention belongs to the field of bioinformatics analysis and provides an automated report interpretation method and system. The method comprises the following steps: acquiring the evidence data sources of a bioinformatics analysis; calculating a score from all evidence data sources, defining the value representing a pathogenic result as value A and the value representing a benign result as value B; ranking all sites of the bioinformatics analysis according to their calculated scores; screening pathogenic sites according to an industry gold standard; matching the pathogenic site data, the patient's phenotype data and the related data in the local relational database, and importing them into a template; and adding a conclusion description to the template to obtain a complete report. The core data obtained by searching and its peripheral information are displayed together in a multi-dimensional manner, associated data are integrated to the greatest extent, and the bioinformatics analysis report becomes simple and easy to read.

Description

Automatic report interpretation method and system

Technical Field

The invention belongs to the field of bioinformatics analysis and provides an automated report interpretation method and an automated report interpretation system.

Background

With the rapid development of sequencing technology and the continuing fall in its cost, more and more patients follow their doctor's advice and undergo molecular diagnostic testing, of which gene sequencing is currently the most popular. However, neither the raw result file produced by sequencing nor the output file that a bioinformatics engineer obtains by analyzing, filtering and annotating the raw results with various algorithms gives doctors a direct reference; a professional medical interpreter must process the data further to produce a clear, readable final report that aids clinical decision-making. While writing the report, the interpreter has to query various public databases to re-screen the thousands of sites output by the bioinformatics analysis, and grade the variants at the selected sites according to the industry gold standard, classifying each site as pathogenic, likely pathogenic or of uncertain clinical significance. Finally, the interpreter must complete the report in the document format prescribed by the doctor.

At present, although all major public databases provide web pages for information retrieval, the databases are poorly linked to one another and form obvious information islands, so interpreters must keep switching between query pages instead of obtaining a multi-dimensional display of the complete data from a single query. Moreover, when interpreters manually screen thousands of sites, there is no automatic ranking mechanism for the pathogenic sites of a specific disease, so this step consumes considerable time. In addition, the standardization of the report and the quality of its layout are also important factors affecting the overall speed of interpretation.

Disclosure of Invention

The application provides an automated report interpretation method and system that display the core data obtained by searching together with its peripheral information in a multi-dimensional manner, integrate related data to the greatest extent, and make the bioinformatics analysis report simple and easy to read.

In order to achieve the technical purpose, the technical scheme adopted by the application is as follows: an automated report interpretation method, comprising:

acquiring the evidence data sources of a bioinformatics analysis;

calculating a score from all evidence data sources, defining the value representing a pathogenic result as value A, and defining the value representing a benign result as value B;

ranking all sites of the bioinformatics analysis according to their calculated scores;

screening pathogenic sites according to an industry gold standard;

matching the pathogenic site data, the patient's phenotype data and the related data in the local relational database, and then importing them into a template, wherein the related data comprises gene function data, phenotype description data and rating evidence;

and adding a conclusion description to the template to obtain a complete report.

As an improved technical scheme of the application, A in the value A is a number; B in the value B is a number; and the sites of the bioinformatics analysis are sorted numerically according to their calculated scores.

As an improved technical scheme of the application, the method further comprises generating JSON-format report data from the complete report and storing the JSON-format report in a historical report database.

As an improved technical scheme of the application, the local relational database comprises an OMIM database, a CHPO database, an HGMD database and a historical report database; these databases are associated with one another according to the gene-phenotype relationship, following an entity-relationship (ER) diagram, to form a multi-dimensional data system.

As an improved technical scheme of the application, the method further comprises combining the JSON-format report data generated from the complete report with HTML text to form a PDF report.

As an improved technical scheme of the application, the weighted average calculation is adopted for calculating the scores of all evidence data sources.

As an improved technical scheme of the application, a logistic regression algorithm is adopted for calculating the scores of all evidence data sources.

It is another object of the present application to provide an automated report interpretation system, comprising:

the intelligent analysis module, used for acquiring the evidence data source files of a bioinformatics analysis, performing a weighted average calculation on the data in the result file, and ranking all sites in the calculation result by pathogenicity;

the report writing module, used for acquiring the calculation result of the intelligent analysis module, the patient phenotype data and the data in the local relational database, and for adding conclusive descriptive text;

and the generating module, used for receiving the report data from the report writing module, combining it with the HTML text in the template editing module, and generating a PDF report.

According to another embodiment of the application, a storage medium is characterized in that the storage medium has stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.

According to another embodiment of the present application, an electronic device comprises a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the steps of any of the above method embodiments.

Advantageous effects

The beneficial effects of the application are as follows. The weighted calculation over multiple pathogenicity evidence data sources is performed with a logistic regression algorithm, which speeds up the screening of pathogenic sites and makes that screening semi-automatic; combined with continuously accumulated historical data, the accuracy of the ranking can be improved continuously, which increases the interpreter's confidence in judging a detection result to be positive and also improves efficiency.

By converting HTML to PDF, the typesetting and styling of interpretation reports are managed centrally through HTML style editing. This shortens the time spent editing reports and improves the uniformity of report pages; interpreters only need to attend to the content of a report rather than its style, which can save about 30% of the time.

The interpretation data written in the report can be stored effectively in the database in a structured form that is convenient to search and consult.

Through the association and integration of the databases, and the gene, phenotype and disease association structure established during data integration, the problem of information islands among the data sources can be effectively solved. Interpreters no longer need to repeat queries to obtain information related to the core query results, which saves their time.

it should be understood that all combinations of the foregoing concepts and additional concepts described in greater detail below can be considered as part of the subject matter of the present disclosure unless such concepts are mutually inconsistent.

The foregoing and other aspects, embodiments and features of the present teachings can be more fully understood from the following description taken in conjunction with the accompanying drawings. Additional aspects of the present disclosure, such as features and/or advantages of exemplary embodiments, will be apparent from the description which follows, or may be learned by practice of the specific embodiments in accordance with the teachings of the present disclosure.

Drawings

Fig. 1 is a schematic diagram of an overall structure of an automated report interpretation method according to the present application.

Fig. 2 is the ER diagram used by the local relational database.

Detailed Description

For a better understanding of the technical content of the present application, specific embodiments are described below in conjunction with the appended drawings.

Embodiments of the present disclosure are not necessarily intended to include all aspects of the present application. It should be appreciated that the various concepts and embodiments described above, as well as those described in greater detail below, may be implemented in any of numerous ways, as the concepts and embodiments disclosed herein are not limited to any implementation. In addition, some aspects of the present disclosure may be used alone or in any suitable combination with other aspects of the present disclosure.

In designing the technical scheme, the application aims to reduce the drawbacks caused by information islands during the preparation of the interpretation report, displaying the core data obtained by searching together with its peripheral information in a multi-dimensional manner and integrating related data to the greatest extent. At the same time, an easy-to-use automatic ranking model for pathogenic sites that fits the company's interpretation framework needs to be established to speed up site screening. When the final PDF report is produced, report styles need to be made uniform and centrally managed, so that interpreters can focus on report content rather than layout while writing.

Example 1

The method provided by the embodiment of the application can be executed in a cloud or a local server cluster. The local server cluster may include one or more processors (which may include, but are not limited to, x86 or ARM architecture processing devices) and memory for storing data, and optionally may also include transmission equipment for communication functions and input-output equipment.

The memory may be used to store computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the automatic report interpretation method in the embodiment of the present application, and the processor executes various functional applications and data processing by running the computer programs stored in the memory, that is, implementing the method described above.

The storage can comprise high-speed random access memory, and data redundancy is realized through a RAID1 or RAID5 disk array, so that the safety of data is ensured.

The transmission device is used for receiving or transmitting data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the local server cluster. In one example, the transmission device includes a network interface controller (NIC), which can connect to other network devices through a base station so as to communicate with the internet.

In this embodiment, an automated report interpretation method operating in the local server cluster or the network architecture is provided, and with reference to fig. 1, the automated report interpretation method includes the following steps:

acquiring the evidence data sources of the bioinformatics analysis, which are listed in Table 1;

Table 1. Evidence data sources in a bioinformatics analysis result file

calculating a score from all evidence data sources, defining the value representing a pathogenic result as value A, and defining the value representing a benign result as value B, where A is a number such as 1.0 and B is a number such as 0.0;

screening pathogenic sites according to an industry gold standard; the industry gold standard may be the ACMG standards and guidelines for the classification of genetic variants;

matching the pathogenic site data, the patient's phenotype data and the related data in the local relational database, and then importing them into a template; the related data comprise gene function data, phenotype description data, rating evidence and the like, which effectively eliminates the information island problem among the data sources.

The local relational database comprises an OMIM database, a CHPO database, an HGMD database and a historical report database; these databases are associated with one another according to the gene-phenotype relationship, following an entity-relationship (ER) diagram, to form a multi-dimensional data system. The local relational database is generated in advance and continuously updated, so the local relational model in this embodiment is a continuously updated model.

JSON-format report data is generated from the complete report and stored in the historical report database, and the JSON-format report data is combined with HTML text to form a PDF report.

Through the above steps, the gene, phenotype and disease association structure created during data integration effectively eliminates the information island problem among the data sources; the weighted calculation over multiple pathogenicity evidence data sources, performed with a logistic regression algorithm, speeds up the screening of pathogenic sites; and HTML style editing handles the typesetting and styling of the interpretation report and shortens the time spent editing it. This effectively addresses three problems: the poor linkage among databases, which forms obvious information islands and forces interpreters to keep switching between query pages instead of obtaining a multi-dimensional display of the complete data from a single query; the lack of an automatic ranking mechanism for the pathogenic sites of a specific disease when interpreters manually screen thousands of sites, which makes that step time-consuming; and the standardization and layout quality of the report when it is produced.
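The gene-phenotype association described above can be illustrated with a small relational sketch. This is only an assumption-laden illustration, not the schema of Fig. 2: the table names (`gene`, `phenotype`, `gene_phenotype`), the placeholder rows and the helper `phenotypes_for_gene` are all hypothetical, and an in-memory SQLite database stands in for the local relational database.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the local relational database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT UNIQUE NOT NULL);
    CREATE TABLE phenotype (phenotype_id INTEGER PRIMARY KEY, term TEXT UNIQUE NOT NULL);
    -- Link table: one gene maps to many links (1:m), many links map to one phenotype (m:1).
    CREATE TABLE gene_phenotype (
        gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
        phenotype_id INTEGER NOT NULL REFERENCES phenotype(phenotype_id),
        source TEXT NOT NULL,
        PRIMARY KEY (gene_id, phenotype_id, source)
    );
    """
)

# Placeholder rows only; real content would come from OMIM, CHPO, HGMD and past reports.
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "GENE_A"), (2, "GENE_B")])
conn.executemany(
    "INSERT INTO phenotype VALUES (?, ?)", [(1, "phenotype_x"), (2, "phenotype_y")]
)
conn.executemany(
    "INSERT INTO gene_phenotype VALUES (?, ?, ?)",
    [(1, 1, "OMIM"), (1, 2, "CHPO"), (2, 2, "HGMD")],
)


def phenotypes_for_gene(symbol):
    """Return (phenotype term, source database) pairs linked to one gene symbol."""
    return conn.execute(
        """
        SELECT p.term, gp.source
        FROM gene AS g
        JOIN gene_phenotype AS gp ON gp.gene_id = g.gene_id
        JOIN phenotype AS p ON p.phenotype_id = gp.phenotype_id
        WHERE g.symbol = ?
        ORDER BY p.term
        """,
        (symbol,),
    ).fetchall()
```

With such a structure, a single query for one gene returns the phenotype records from every source at once, which is the kind of one-time, multi-dimensional lookup the embodiment aims for.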

Preferably, the scores of the evidence data sources are combined by a weighted average calculation using a logistic regression algorithm, so as to speed up the screening of pathogenic sites. The sites of the bioinformatics analysis are sorted numerically according to their calculated scores.
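The scoring and ranking step can be sketched as a logistic model: a weighted sum of the evidence values is passed through the logistic function, so that scores near 1.0 indicate pathogenic and scores near 0.0 indicate benign. The feature names, the weights and the bias below are hypothetical values chosen for illustration; the application does not disclose fitted coefficients.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical weights and bias for illustration only (not from the application).
WEIGHTS: Dict[str, float] = {
    "polyphen2_hvar": 2.0,   # function prediction score in [0, 1]
    "sift_damaging": 1.5,    # 1.0 if predicted damaging, else 0.0
    "gerp_scaled": 1.0,      # conservation score rescaled to [0, 1]
    "gnomad_af": -40.0,      # population allele frequency; common variants score lower
}
BIAS = -2.0


def pathogenicity_score(evidence: Dict[str, float]) -> float:
    """Weighted sum of evidence values mapped into (0, 1) by the logistic function."""
    z = BIAS + sum(w * evidence.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_sites(sites: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (site id, score) pairs sorted from most to least likely pathogenic."""
    scored = [(site_id, pathogenicity_score(ev)) for site_id, ev in sites.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder sites with made-up evidence values.
sites = {
    "site_1": {"polyphen2_hvar": 0.99, "sift_damaging": 1.0, "gerp_scaled": 0.9, "gnomad_af": 0.0},
    "site_2": {"polyphen2_hvar": 0.10, "sift_damaging": 0.0, "gerp_scaled": 0.2, "gnomad_af": 0.05},
}
ranking = rank_sites(sites)
```

In this sketch `site_1` ranks first because its evidence values all point toward a damaging, conserved and rare variant; the interpreter would then review the top of the list against the gold standard.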

Example 2

This embodiment further provides an automated report interpretation system, which is used to implement the foregoing embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.

An automated report interpretation system comprising:

the report writing module, used for acquiring the calculation result of the intelligent analysis module, the patient phenotype data and the data in the local relational database, and for adding conclusive descriptive text; in actual operation, the calculation result of the intelligent analysis module, the patient phenotype data and the data in the local relational database are acquired in one-to-one correspondence;

Optionally, the intelligent analysis module comprises:

the receiving unit, used for acquiring the evidence data source files of the bioinformatics analysis;

the calculating unit, used for performing a weighted average calculation on each item of data in the result file;

and the sorting unit, used for ranking all sites in the weighted average calculation result by pathogenicity.

Optionally, the report composition module comprises:

the receiving unit, used for acquiring the calculation result of the intelligent analysis module, the patient phenotype data and the data in the local relational database, and for matching them with one another one by one;

and the description unit, used for adding conclusive descriptive text to the data.

Optionally, the generating module includes:

the receiving unit is used for receiving the data reported by the report writing module;

the integration unit is used for combining the data obtained by the receiving unit with HTML text in the template editing module for synthesis;

and the report generating unit is used for generating a PDF report from the synthesized text.

Optionally, a wkhtmltopdf tool is provided in the report generation unit.

Example 3

An automated report interpretation method comprising the steps of:

After the bioinformatics analysis result file is obtained, it is first imported into the intelligent analysis module. The module performs a weighted average calculation on the scores of all evidence data sources in the file and then ranks the sites by pathogenicity according to the result, where a score of 1.0 represents pathogenic and a score of 0.0 represents benign. The weighted average calculation is performed on the evidence data sources in Table 1 using a logistic regression algorithm, and the sites are then ranked from most to least pathogenic according to the result.

Kind of evidence	Data source
Function prediction	Polyphen2-HVAR
Evolutionary conservation	LRT
Function prediction	SIFT
Evolutionary conservation	phastCons100way
Evolutionary conservation	GERP++
Structural domain	Gene
Population frequency	gnomAD
Structural domain	dbNSFP Interpro
Function prediction	MutationTaster2
Historical rating	Company history data

Based on this result, the interpreter can finally screen the pathogenic sites according to the industry gold standard. Meanwhile, each imported file and the interpreter's screening results are fed into the module's continuous model learning, so that the accuracy of subsequent calculation and ranking keeps improving.
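One way to picture this continuous learning is an incremental update of a logistic model each time an interpreter confirms or rejects a site. The following is a minimal sketch that assumes plain stochastic gradient descent on the log loss; the application does not state which training procedure is used, and the function names and the learning rate are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: List[float], bias: float, features: List[float]) -> float:
    """Current model score for one site."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def sgd_update(
    weights: List[float],
    bias: float,
    features: List[float],
    label: float,
    learning_rate: float = 0.1,
) -> Tuple[List[float], float]:
    """One gradient step on the log loss for a single interpreter decision.

    label is 1.0 when the interpreter judged the site pathogenic, 0.0 when benign.
    """
    error = predict(weights, bias, features) - label
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


# Placeholder example: an untrained model sees one site confirmed as pathogenic.
features = [0.9, 0.8]
weights, bias = [0.0, 0.0], 0.0
before = predict(weights, bias, features)
weights, bias = sgd_update(weights, bias, features, label=1.0)
after = predict(weights, bias, features)
```

After the update the model scores the same site higher than before, which mirrors how accumulated interpreter decisions would gradually sharpen the ranking.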

After the screening result is determined, the data are imported into the report writing module together with the patient's phenotype data and the related data and historical data that have been captured from various public databases and integrated in the local relational database, including but not limited to gene function, phenotype description and rating evidence. The local relational database is generated in advance and continuously updated. Data from the public databases are loaded through REST API interfaces and through files in tab-separated (TSV) or comma-separated (CSV) format, and are associated according to the gene-phenotype relationship to form a multi-dimensional data system. The core of this database is built on the ER diagram shown in Fig. 2, where 1:m denotes a one-to-many relationship and m:1 denotes a many-to-one relationship.
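Loading a tab-separated export and grouping it by gene might look like the sketch below. The column names `gene` and `phenotype`, the placeholder rows and the helper `load_gene_phenotype` are assumptions for illustration; real exports from the public databases have their own layouts.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO

# Placeholder TSV content standing in for a file downloaded from a public database.
TSV_EXPORT = (
    "gene\tphenotype\n"
    "GENE_A\tphenotype_x\n"
    "GENE_A\tphenotype_y\n"
    "GENE_B\tphenotype_y\n"
)


def load_gene_phenotype(handle: TextIO, delimiter: str = "\t") -> Dict[str, List[str]]:
    """Read gene/phenotype rows and group the phenotypes under each gene.

    Pass delimiter="," to read a comma-separated (CSV) file instead.
    """
    grouped = defaultdict(set)
    for row in csv.DictReader(handle, delimiter=delimiter):
        grouped[row["gene"]].add(row["phenotype"])
    # Sort for a stable, reproducible order.
    return {gene: sorted(terms) for gene, terms in grouped.items()}


index = load_gene_phenotype(io.StringIO(TSV_EXPORT))
```

The resulting mapping is the one-to-many side of the gene-phenotype relationship; in practice the rows would be written into the relational tables rather than held in memory.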

Combining the automatically acquired data, the interpreter enters conclusive descriptive text in the report writing module, which generates style-free report data in JSON format for the final synthesis of the report and saves it in the historical report database. JSON-format report data is easy to extend and remains compatible with multiple report templates as report content is continuously optimized. Once the JSON-format report is stored in the PostgreSQL relational database, PostgreSQL's capacity for processing JSON-format data makes later searching and review convenient. Because the JSON-format report is stored without any style, page content and layout are decoupled to the greatest extent, and the report can easily be re-imported when the report template is updated.
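The idea of a style-free JSON report that is saved and later retrieved can be sketched as follows. To keep the example self-contained, an in-memory SQLite table with a plain text column stands in for the PostgreSQL historical report database; the table name, the report fields and the helper functions are all hypothetical.

```python
import json
import sqlite3
from typing import Optional

# SQLite is only a stand-in here; the embodiment stores reports in PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history_report (report_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
)


def save_report(report_id: str, report: dict) -> None:
    """Serialize the style-free report content to JSON and store it."""
    conn.execute(
        "INSERT INTO history_report VALUES (?, ?)",
        (report_id, json.dumps(report, ensure_ascii=False)),
    )


def load_report(report_id: str) -> Optional[dict]:
    """Fetch a stored report and parse it back into a dictionary."""
    row = conn.execute(
        "SELECT body FROM history_report WHERE report_id = ?", (report_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Placeholder report: content only, no fonts, colors or layout.
report = {
    "patient_phenotypes": ["phenotype_x"],
    "sites": [{"gene": "GENE_A", "classification": "pathogenic", "score": 0.92}],
    "conclusion": "Placeholder conclusion text.",
}
save_report("report-0001", report)
```

Because the stored document carries no styling, the same record can be rendered again with a newer template without touching the saved content.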

The JSON-format report data is then combined with styled HTML text designed in advance in the template editing module. The HTML template is generated in advance and centrally controlled. The template's style is handled by CSS, and once a modified template is published it applies to all reports edited by all users.

The final PDF report is generated with the open-source wkhtmltopdf tool, which combines the JSON content and the HTML template into the final PDF. In this process, the user writing the report does not need to attend to the page style of the report.
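The synthesis step can be sketched by filling a centrally managed HTML template with the JSON content and handing the result to the wkhtmltopdf command-line tool, which takes an input HTML file and an output PDF path. The template text, the report fields and the function names are hypothetical, and the conversion is skipped when wkhtmltopdf is not installed.

```python
import html
import shutil
import subprocess
from pathlib import Path
from string import Template

# Hypothetical template; in the embodiment it would come from the template editing module.
HTML_TEMPLATE = Template(
    """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>h1 { font-size: 18pt; } p { font-size: 11pt; }</style>
</head>
<body>
<h1>$title</h1>
<p>$conclusion</p>
</body>
</html>
"""
)


def render_html(report: dict) -> str:
    """Insert escaped report content into the styled template."""
    return HTML_TEMPLATE.substitute(
        title=html.escape(report["title"]),
        conclusion=html.escape(report["conclusion"]),
    )


def write_pdf(report: dict, html_path: Path, pdf_path: Path) -> bool:
    """Write the rendered HTML and convert it to PDF if wkhtmltopdf is available."""
    html_path.write_text(render_html(report), encoding="utf-8")
    if shutil.which("wkhtmltopdf") is None:
        return False  # tool not installed; only the HTML file is produced
    subprocess.run(["wkhtmltopdf", str(html_path), str(pdf_path)], check=True)
    return True


# Placeholder report content.
page = render_html({"title": "Report <1>", "conclusion": "A & B"})
```

Keeping all styling in the template means a layout change is made once and picked up by every report rendered afterwards, while the report author only supplies content.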

Example 4

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

acquiring the evidence data sources of a bioinformatics analysis;

screening pathogenic loci according to an industry gold standard;

Example 5

acquiring the evidence data sources of a bioinformatics analysis;

screening pathogenic loci according to an industry gold standard;

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An automated report interpretation method, comprising:

acquiring the evidence data sources of a bioinformatics analysis;

screening pathogenic loci according to an industry gold standard;

2. The automated report interpretation method of claim 1, wherein A in the value A is a number; B in the value B is a number; and the sites of the bioinformatics analysis are sorted numerically according to their calculated scores.

3. The automated report interpretation method of claim 1, further comprising generating JSON-format report data from the complete report and storing the JSON-format report in a historical report database.

4. The automated report interpretation method of claim 1, wherein the local relational database comprises an OMIM database, a CHPO database, an HGMD database and a historical report database; and the OMIM database, the CHPO database, the HGMD database and the historical report database in the local relational database are associated according to the gene-phenotype relationship, following an entity-relationship (ER) diagram, to form a multi-dimensional data system.

5. The automated report interpretation method of claim 1, further comprising combining the JSON-format report data generated from the complete report with HTML text to form a PDF report.

6. The automated report interpretation method of claim 1, wherein the score of each evidence data source is calculated using a weighted average calculation.

7. The automated report interpretation method of claim 1, wherein a logistic regression algorithm is used to calculate the scores of the evidence data sources.

8. An automated report interpretation system, comprising

9. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 7 when executed.

10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 7.