CN113407522A - Data processing method and device, computer equipment and computer readable storage medium - Google Patents


Info

Publication number
CN113407522A
Authority
CN
China
Prior art keywords
data
acquisition
interface
message
data interface
Prior art date
Legal status
Granted
Application number
CN202110677339.5A
Other languages
Chinese (zh)
Other versions
CN113407522B (en)
Inventor
吴丹枫
Current Assignee
Shanghai Tenth People's Hospital
Original Assignee
Shanghai Tenth People's Hospital
Priority date
Filing date
Publication date
Application filed by Shanghai Tenth People's Hospital
Priority to CN202110677339.5A
Publication of CN113407522A
Application granted
Publication of CN113407522B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/219 Managing data history or versioning
    • G06F16/2358 Change logging, detection, and notification
    • G06F16/2433 Query languages
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/258 Data format conversion from or to a database
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures

Abstract

The invention relates to a data processing method and apparatus, a computer device, and a computer-readable storage medium. The method includes: determining a data acquisition mode, where the modes include acquisition via a data interface, acquisition via replication subscription, and acquisition via SSIS; and using a data acquisition management system to collect, according to the acquisition mode, the raw data of each application subsystem of each hospital information system from a data center, perform data cleaning and data conversion on the collected raw data, and store the resulting target data by category. The data acquisition management system provides data acquisition, acquisition script configuration, acquisition scheduling and monitoring, message persistence, an active/passive clustering mode, a data verification service, a data error handling mechanism, a message tracking mechanism, a connection re-establishment mechanism, a message retransmission mechanism, and related functions, thereby solving the problem in the related art that abnormal data cannot be processed.

Description

Data processing method and device, computer equipment and computer readable storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, a computer device, and a computer-readable storage medium.
Background
In the whole process of data application, from raw data to the data center and from the data center to the data warehouse and data marts, the data must be cleaned in multiple rounds according to the different levels. Data cleaning is mainly driven by conclusions drawn from exploratory analysis, that is, from the application perspective, and mainly handles four types of abnormal data: missing values, outliers, duplicate data, and noisy data.
At present, no effective solution has been proposed for the problem in the related art that abnormal data cannot be processed.
Disclosure of Invention
The present application aims to overcome the above defects in the prior art by providing a data processing method and apparatus, a computer device, and a computer-readable storage medium, so as to solve at least the problem in the related art that abnormal data cannot be processed.
To this end, the technical solution adopted by the present application is as follows:
In a first aspect, an embodiment of the present application provides a data processing method, including: determining a data acquisition mode, where the data acquisition modes include acquisition via a data interface, acquisition via replication subscription, and acquisition via SSIS; and using a data acquisition management system to collect, according to the acquisition mode, the raw data of each application subsystem of each hospital information system from a data center, perform data cleaning and data conversion on the collected raw data, and store the resulting target data by category, where the data acquisition management system provides the following functions: data acquisition, acquisition script configuration, acquisition scheduling and monitoring, message persistence, an active/passive clustering mode, a data verification service, a data error handling mechanism, a message tracking mechanism, a connection re-establishment mechanism, a message retransmission mechanism, a duplicate message detection mechanism, an erroneous message verification mechanism, an error response handling mechanism, a data access service based on the data center, operation and maintenance monitoring, and a notification mechanism and monitoring list; data cleaning refers to filtering out data that does not meet requirements, and data conversion includes conversion of inconsistent data, conversion of data granularity, and conversion according to business rules.
In some embodiments, the data interface is provided by a data interface management module of the data acquisition management system. In the process of collecting raw data of each application subsystem of each hospital information system from the data center according to the acquisition mode, performing data cleaning and data conversion on the collected raw data, and storing the resulting target data by category, the method further includes executing the following procedures through the data interface management module:
executing an interface editing flow: editing data interface information according to the acquired domain knowledge represented by the template;
executing a data interface management flow: querying basic information of data interfaces, managing the life cycle state of each data interface, and managing versions, where the life cycle states of a data interface include created, released, locked, and deprecated. The initial state of a basic data interface after its template is deployed is created. Each newly deployed template is judged: if the template has not been deployed before, the state of the created basic data interface is updated to released; if the template is deployed after a version update, it is compared with the previous version, and if compatible, the previous version of the basic data interface is updated to deprecated, while if incompatible, the previous version is updated to locked; in either case the newly created basic data interface has its version upgraded and its state changed to released. The initial state of a user-defined data interface is created; after the user submits it, an administrator reviews it, the state is updated to released upon approval, and it returns to created upon rejection. If the user releases a new version of a released data interface, the current version is updated to deprecated and the new version is set to created; a data interface in the created state can be modified by the user (a minimal sketch of this life cycle appears after this list);
executing a data interface display flow: providing an online description document for each data interface, parsing the data interface information, displaying template-based data interface resource names according to the template classification, and displaying the description document for each request mode according to the resource name and version information, where the description document includes the URL (uniform resource locator) of the request mode, the input parameters, the output parameters, and related description information;
executing a data interface test flow: for a deployed data interface, entering the parameter values required by its parameters, sending an HTTP request to the generic generator service of the data interface, and verifying the function of the data interface;
executing a data interface deployment flow: deploying the data interface information in the released state so that the data interface generic generator can access the data interface information and thereby generate the data interface;
executing a data interface maintenance flow: providing a maintenance function for the data interfaces and adding corresponding Chinese description information for each data interface;
executing a data interface generation service flow: exposing model management in the form of an interface; after a template is successfully deployed, the interface is called, the corresponding data interface information is generated according to the basic data interface generation method, version management is performed according to the version management constraints, and the life cycle state is updated to released;
executing a data interface generic generation service flow: as the server side, processing the request for each deployed data interface and executing the data interface information.
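The life cycle described in the data interface management flow is, in effect, a small state machine. The following is a minimal, illustrative Python sketch of those states and transitions; the class and function names are hypothetical and not part of the claimed system:

```python
from enum import Enum

class State(Enum):
    CREATED = "created"
    RELEASED = "released"
    LOCKED = "locked"
    DEPRECATED = "deprecated"

class DataInterface:
    """Hypothetical model of one versioned data interface."""
    def __init__(self, name: str, version: int = 1):
        self.name = name
        self.version = version
        self.state = State.CREATED   # initial state after template deployment

    def release(self) -> None:
        # A created interface is released after deployment/approval.
        assert self.state == State.CREATED
        self.state = State.RELEASED

def redeploy_template(current: DataInterface, compatible: bool) -> DataInterface:
    """Template redeployed with a new version: the previous version becomes
    deprecated if compatible, locked if incompatible; the successor gets an
    upgraded version number (e.g. v1 -> v2) and is released."""
    current.state = State.DEPRECATED if compatible else State.LOCKED
    successor = DataInterface(current.name, current.version + 1)
    successor.release()
    return successor
```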
In some embodiments, the data acquisition management system executes a data acquisition procedure, including: extracting, converting, loading, and integrating data; configuring an ETL script for data conversion and transmission according to business requirements; testing whether the script meets the requirements; uploading the script to an ETL management platform; and controlling the starting and stopping of the script through the ETL management platform, thereby starting and stopping data acquisition.
In some embodiments, the acquisition script configuration executed by the data acquisition management system includes: setting database links for each business system of the hospital or for each hospital data center. For acquisition from data sources, the system supports two modes, calendar-based acquisition and SQL acquisition. In SQL acquisition, the system writes SQL acquisition statements according to the standard content of the data set and saves the configuration, thereby matching the structure of each hospital business database or hospital data center to the data set standard of the regional medical data center. Different acquisition times and acquisition periods are set for different acquisition models, so that the data acquisition time points of the business models are staggered.
In some embodiments, the acquisition scheduling and monitoring executed by the data acquisition management system includes: performing centralized management according to the acquisition period of each model and uniformly managing the acquisition time of each model in each hospital, so as to start and stop each acquisition model; and monitoring the data acquisition process, monitoring the details of successful and failed acquisitions to form a data acquisition log, re-acquiring failed data to control the overall data acquisition quality, and improving the data acquisition configuration scheme according to the monitoring results.
In some of these embodiments, the message persistence executed by the data acquisition management system includes: cleaning the data to filter out data that does not meet requirements, delivering the filtering result to the business administration department, and having the business unit confirm whether the data should be filtered out or corrected before extraction, where the non-conforming data falls into three categories: incomplete data, erroneous data, and duplicate data.
In some embodiments, the data verification service executed by the data acquisition management system includes: when the integration platform server loses power abnormally, the operating system fails to write, or data is damaged, the integration engine provides a data verification service to recover and repair the damaged data; when the integration engine detects an abnormal shutdown, it performs data verification on the message store to ensure data integrity.
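As an illustration of such a verification pass, the sketch below recomputes a digest for each persisted message after an abnormal shutdown and flags damaged entries for recovery. The file layout, field names, and choice of SHA-256 are assumptions for the sketch, not details of the patented engine:

```python
import hashlib
import json
from pathlib import Path

def verify_message_store(store_dir: str) -> list:
    """Scan a directory of persisted messages; each .json record is assumed
    to carry the SHA-256 digest computed when it was written. Return the
    ids of records whose payload no longer matches its digest."""
    damaged = []
    for path in Path(store_dir).glob("*.json"):
        record = json.loads(path.read_text(encoding="utf-8"))
        digest = hashlib.sha256(record["payload"].encode("utf-8")).hexdigest()
        if digest != record["checksum"]:
            damaged.append(record["message_id"])  # candidate for repair/replay
    return damaged
```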
In a second aspect, an embodiment of the present application provides a data processing apparatus, including: a determining unit configured to determine a data acquisition mode, where the data acquisition modes include acquisition via a data interface, acquisition via replication subscription, and acquisition via SSIS; and a processing unit configured to use the data acquisition management system to collect, according to the acquisition mode, the raw data of each application subsystem of each hospital information system from the data center, perform data cleaning and data conversion on the collected raw data, and store the resulting target data by category, where the data acquisition management system provides the following functions: data acquisition, acquisition script configuration, acquisition scheduling and monitoring, message persistence, an active/passive clustering mode, a data verification service, a data error handling mechanism, a message tracking mechanism, a connection re-establishment mechanism, a message retransmission mechanism, a duplicate message detection mechanism, an erroneous message verification mechanism, an error response handling mechanism, a data access service based on the data center, operation and maintenance monitoring, and a notification mechanism and monitoring list; data cleaning refers to filtering out data that does not meet requirements, and data conversion includes conversion of inconsistent data, conversion of data granularity, and conversion according to business rules.
In a third aspect, the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method described above when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method as described above.
By adopting the above technical solution, compared with the prior art, the method provided by the embodiments of the present application collects the raw data of each application subsystem of each hospital information system from the data center according to the acquisition mode, performs data cleaning and data conversion on the collected raw data, and stores the resulting target data by category, thereby solving the problem in the related art that abnormal data cannot be processed.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a block diagram of a mobile terminal according to an embodiment of the present application;
FIG. 2 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a data processing scheme according to an embodiment of the present application;
FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present application;
fig. 5 is a hardware structure diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. References to "a", "an", "the", and similar words in this application do not denote a limitation of quantity and may refer to the singular or the plural. The terms "including", "comprising", "having", and any variations thereof in this application are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a list of steps or modules (units) is not limited to the listed steps or units but may include other steps or units not expressly listed or inherent to such process, method, product, or device. Words such as "connected" and "coupled" in this application are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. The term "plurality" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the preceding and following objects. The terms "first", "second", "third", and the like herein merely distinguish similar objects and do not denote a particular ordering of the objects.
The embodiment provides a mobile terminal. Fig. 1 is a block diagram of a mobile terminal according to an embodiment of the present application. As shown in fig. 1, the mobile terminal includes: a Radio Frequency (RF) circuit 110, a memory 120, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a wireless fidelity (WiFi) module 170, a processor 180, and a power supply 190. Those skilled in the art will appreciate that the mobile terminal architecture shown in fig. 1 is not intended to be limiting of mobile terminals and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each constituent element of the mobile terminal in detail with reference to fig. 1:
the RF circuit 110 may be used for receiving and transmitting signals during information transmission and reception or during a call, and in particular, receives downlink information of a base station and then processes the received downlink information to the processor 180; in addition, the data for designing uplink is transmitted to the base station. In general, RF circuits include, but are not limited to, an antenna, at least one Amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 110 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 120 may be used to store software programs and modules, and the processor 180 executes various functional applications and data processing of the mobile terminal by operating the software programs and modules stored in the memory 120. The memory 120 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile terminal, and the like. Further, the memory 120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 130 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile terminal. Specifically, the input unit 130 may include a touch panel 131 and other input devices 132. The touch panel 131, also referred to as a touch screen, may collect touch operations of a user on or near the touch panel 131 (e.g., operations of the user on or near the touch panel 131 using any suitable object or accessory such as a finger or a stylus pen), and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 131 may include two parts, i.e., a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 180, and can receive and execute commands sent by the processor 180. In addition, the touch panel 131 may be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 130 may include other input devices 132 in addition to the touch panel 131. In particular, other input devices 132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 140 may be used to display information input by a user or information provided to the user and various menus of the mobile terminal. The Display unit 140 may include a Display panel 141, and optionally, the Display panel 141 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 131 can cover the display panel 141, and when the touch panel 131 detects a touch operation on or near the touch panel 131, the touch operation is transmitted to the processor 180 to determine the type of the touch event, and then the processor 180 provides a corresponding visual output on the display panel 141 according to the type of the touch event. Although the touch panel 131 and the display panel 141 are shown in fig. 1 as two separate components to implement the input and output functions of the mobile terminal, in some embodiments, the touch panel 131 and the display panel 141 may be integrated to implement the input and output functions of the mobile terminal.
The mobile terminal may also include at least one sensor 150, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 141 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 141 and/or the backlight when the mobile terminal is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of the mobile terminal, and related functions (such as pedometer and tapping) for vibration recognition; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile terminal, further description is omitted here.
A speaker 161 and a microphone 162 in the audio circuit 160 may provide an audio interface between the user and the mobile terminal. The audio circuit 160 may transmit the electrical signal converted from the received audio data to the speaker 161, and convert the electrical signal into a sound signal for output by the speaker 161; on the other hand, the microphone 162 converts the collected sound signal into an electric signal, converts the electric signal into audio data after being received by the audio circuit 160, and then outputs the audio data to the processor 180 for processing, and then transmits the audio data to, for example, another mobile terminal via the RF circuit 110, or outputs the audio data to the memory 120 for further processing.
WiFi belongs to a short-distance wireless transmission technology, and the mobile terminal can help a user to send and receive e-mails, browse webpages, access streaming media and the like through the WiFi module 170, and provides wireless broadband internet access for the user. Although fig. 1 shows the WiFi module 170, it is understood that it does not belong to the essential components of the mobile terminal, and it can be omitted or replaced with other short-range wireless transmission modules, such as Zigbee module or WAPI module, etc., as required within the scope not changing the essence of the invention.
The processor 180 is a control center of the mobile terminal, connects various parts of the entire mobile terminal using various interfaces and lines, and performs various functions of the mobile terminal and processes data by operating or executing software programs and/or modules stored in the memory 120 and calling data stored in the memory 120, thereby performing overall monitoring of the mobile terminal. Alternatively, processor 180 may include one or more processing units; preferably, the processor 180 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 180.
The mobile terminal also includes a power supply 190 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 180 via a power management system that may be configured to manage charging, discharging, and power consumption.
Although not shown, the mobile terminal may further include a camera, a bluetooth module, and the like, which will not be described herein.
In this embodiment, the processor 180 is configured to: determine a data acquisition mode, where the data acquisition modes include acquisition via a data interface, acquisition via replication subscription, and acquisition via SSIS; and use the data acquisition management system to collect, according to the acquisition mode, the raw data of each application subsystem of each hospital information system from the data center, perform data cleaning and data conversion on the collected raw data, and store the resulting target data by category, where the data acquisition management system provides the following functions: data acquisition, acquisition script configuration, acquisition scheduling and monitoring, message persistence, an active/passive clustering mode, a data verification service, a data error handling mechanism, a message tracking mechanism, a connection re-establishment mechanism, a message retransmission mechanism, a duplicate message detection mechanism, an erroneous message verification mechanism, an error response handling mechanism, a data access service based on the data center, operation and maintenance monitoring, and a notification mechanism and monitoring list; data cleaning refers to filtering out data that does not meet requirements, and data conversion includes conversion of inconsistent data, conversion of data granularity, and conversion according to business rules.
The embodiment provides a data processing method. Fig. 2 is a flowchart of a data processing method according to an embodiment of the present application, and as shown in fig. 2, the flowchart includes the following steps:
step S201, determining a data acquisition mode, wherein the data acquisition mode comprises the following steps: the method comprises the steps of collecting by using a data interface, collecting by using a copy subscription mode and collecting by using an SSIS mode.
Step S202, acquiring original data of each application subsystem of each hospital information system from a data center according to the acquisition mode by using a data acquisition management system, performing data cleaning and data conversion on the acquired original data, and performing classified storage on the obtained target data, wherein the data acquisition management system has the following functions: data collection, collection script configuration, collection scheduling and monitoring, message persistence, active and passive clustering mode, data verification service, data error handling mechanism, message tracking mechanism, connection reestablishment mechanism, message retransmission mechanism, repeated message detection mechanism, error message verification mechanism, error response handling mechanism, data center-based data access service, operation and maintenance monitoring, notification mechanism and monitoring list, wherein data cleaning represents filtering of data which is not qualified, and data conversion comprises conversion of inconsistent data, conversion of data granularity and conversion of business rules.
Through the above steps, the raw data of each application subsystem of each hospital information system can be collected from the data center according to the acquisition mode, the collected raw data undergoes data cleaning and data conversion, and the resulting target data is stored by category, thereby solving the problem in the related art that abnormal data cannot be processed.
Fig. 3 is a schematic diagram of a data processing scheme according to an embodiment of the present application, and the embodiment of the present application is described and illustrated below by a preferred embodiment in conjunction with fig. 3.
1. Developing a data interface.
The data interfaces come from the CDR (clinical data repository) and the HRP platform system. The intent is that changes over time in the inputs of "people, property, and materials" form a set of outputs; by analyzing those outputs in terms of economic and social benefit, the investment of people, property, and materials is integrated and the operation system is further improved.
One particularity of constructing the management system used in this scheme is that its construction standards and requirements are determined not only by the system's business needs and strategic development goals: a large number of industry requirements must also be adopted, such as the information construction requirements of the Ministry of Health, the medical administration requirements of the Ministry of Health, and clinical quality control specifications, and the professional technical requirements of regional information construction must be met. Meanwhile, evidence-based clinical medicine and medical technology are constantly fusing and developing, so the informatization-driven improvement of hospital processes and the continuous development of medical informatization form another special aspect of hospital informatization construction. Therefore, the goals of hospital informatization construction should weigh long-term and short-term objectives together and combine the information construction requirements of every layer of the hospital, summarized as follows:
1) Using Business Intelligence (BI) technology, establish an analysis-oriented special operating data repository (SODR) based on the data center platform, the cost accounting system, and financial system data, building on the data sets established by the original IT investment; data source data is pushed to the SODR as needed, and connections to the SODR use unified authorization, thereby ensuring data security.
2) In accordance with the scientific, standardized, and refined management concepts of modern hospitals, construct the management system according to the management regime, realize data utilization by the management layer, functional departments, and clinical departments, and strengthen bottom-up information communication and feedback within the hospital.
3) Help users efficiently handle data analysis within their scope of post responsibility and dynamically monitor operation analysis, budget management, resource allocation, performance management, and other items, so that abnormal conditions are discovered in time and operation management is continuously improved.
4) Clinical department personnel can consult the department's core KPI indicator data, with warning-light prompts given against reference standard values, helping to optimize clinical department management.
The data center platform, as the core of hospital informatization, reorganizes the data of existing systems into reusable data structures after discretization, integrates the data of the existing systems with data from the cost accounting system and the financial system to form the data platform of the administrative center, and realizes integrated applications based on that platform. Various reports are generated, and statistical analysis is then carried out according to the indicator content of each management special topic.
The data interface management module mainly includes data interface editing, data interface management, data interface display, data interface testing, and data interface maintenance functions, and also provides a data interface generation service, a data interface generic generator service, and a data interface deployment service:
1) Interface editing: acquire the domain knowledge represented by the template and edit the data interface information according to it.
2) Data interface management: mainly includes querying basic information of data interfaces, managing the life cycle state of each data interface, and managing data interface versions; the life cycle states include created, released, locked, and deprecated. The initial state of a basic data interface after template deployment is created. Each newly deployed template is judged: if it has not been deployed before, the created basic data interface is updated to released; if the template is deployed after a version update, it is compared with the previous version, and if compatible, the previous version is updated to deprecated, while if incompatible, it is updated to locked; the newly created basic data interface then has its version upgraded and its state changed to released. The initial state of a user-defined data interface is created; after the user submits it, an administrator reviews it, the state changes to released upon approval and returns to created upon rejection; when the user releases a new version of a released data interface, the current version becomes deprecated and the new version starts in created, and a data interface in the created state can be modified by its user. Version upgrade means incrementing the current version number by 1, for example upgrading from v1 to v2.
3) And (3) data interface display: providing an online description document of a data interface, analyzing data interface information, displaying a data interface resource name based on a template according to the template classification, and displaying the description document of each request mode according to the resource name and version information, wherein the description document mainly comprises a URL (uniform resource locator) of the request mode, an input parameter, an output parameter and related description information.
4) And (3) testing a data interface: and inputting parameter values required by the parameters aiming at the deployed data interface, sending an HTTP request to a universal generator service of the data interface, and verifying the function of the data interface.
5) Deployment of a data interface: and deploying the data interface information in the release state, and enabling the data interface general generator to access the data interface information so as to generate the data interface.
6) And (3) data interface maintenance: the maintenance function of the data interface is provided, and corresponding Chinese description information is mainly added to each data interface, for example, function description information is added to each request mode.
7) Data interface generation service: expose model management in the form of a RESTful web API; after a template is successfully deployed, this interface is called to generate the corresponding data interface information according to the basic data interface generation method described above, perform version management according to the version management constraints, and update the life cycle state to released.
8) Data interface generic generation service: acting as the server side, process the requests for each deployed data interface and execute the data interface information, thereby dynamically realizing the RESTful web API.
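One plausible shape for such a generic generator is to register one HTTP route per deployed interface definition at startup. The Flask-based sketch below is illustrative only; the definition format and the execute_sql helper are assumptions, not the patented implementation:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical deployed data interface information: URL plus SQL template.
DEPLOYED = [
    {"url": "/api/v1/departments",
     "sql": "SELECT dept_id, dept_name FROM dim_dept"},
    {"url": "/api/v1/suppliers",
     "sql": "SELECT supplier_id, supplier_name FROM dim_supplier"},
]

def execute_sql(sql: str, params: dict) -> list:
    """Placeholder for the real database access layer."""
    raise NotImplementedError

def make_view(definition: dict):
    def view():
        # Query-string parameters become the input parameters of the interface.
        rows = execute_sql(definition["sql"], request.args.to_dict())
        return jsonify(rows)
    return view

# One dynamically realized RESTful endpoint per deployed interface definition.
for d in DEPLOYED:
    app.add_url_rule(d["url"], endpoint=d["url"],
                     view_func=make_view(d), methods=["GET"])
```

Because the routes are derived from stored interface definitions rather than hand-written handlers, a newly deployed definition becomes a live endpoint without code changes, which is the essence of the generic generation service described above.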
2. Data acquisition management (ETL).
The raw data of the data center is collected from each application subsystem of each hospital information system, and the clinical diagnosis, treatment, and management data collected from the subsystems must be processed and organized into standard data before being stored by category to form the resource databases of the data center. The extraction, conversion, and loading of data for the management center information platform are realized with an ETL tool. ETL is responsible for data extraction (Extract), cleaning (Cleaning), transformation (Transform), and loading (Load), and is an important link in constructing the data center. ETL extracts data from distributed, heterogeneous data sources, such as relational data and flat data files, into a temporary staging layer, then cleans, converts, and integrates it, and finally loads it into the data warehouse or data mart, where it becomes the basis for online analytical processing and data mining.
2.1 Data acquisition system.
The ETL data acquisition tool quickly realizes the extraction, conversion, loading, and integration of data. According to business requirements, an ETL script for data conversion and transmission is configured and tested to check whether it meets the requirements; the script is then uploaded to the ETL management platform, through which the script is started and stopped.
For CDR data, the data source design is relatively easy. Generally, the DBMS (SQL Server) provides database link functionality: data synchronization is performed through publication and subscription and through log synchronization, data extraction is performed on the synchronized data using SQL scripts, and a direct link relationship is established between the DW database server and the original business system, so that Select statements can be written for direct access. For other data files, such as Excel files, a database link is typically established by means of ODBC-style SQL Server data import. If a database link cannot be established, there are two alternatives: one is to map the source data into database tables via a tool (SSIS) and then import these source system files into the ODS; the other is through a program interface.
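As a minimal sketch of the Select-based direct access described above, the following uses pyodbc (one common ODBC driver for Python; the connection string, table, and column names are placeholders, not part of the scheme):

```python
import pyodbc

# Placeholder connection string for the linked/subscribed business database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dw-server;"
    "DATABASE=ODS;UID=etl_user;PWD=***"
)
cursor = conn.cursor()
# Incremental pull of synchronized CDR rows into the staging area.
cursor.execute(
    "SELECT patient_id, visit_id, updated_at "
    "FROM cdr.dbo.visits WHERE updated_at > ?",
    "2021-06-01",
)
rows = cursor.fetchall()
```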
2.2 Acquisition script configuration.
Configuring the acquisition script first requires setting database links for each business system of the hospital or for each hospital data center. For acquisition from data sources, the system supports two modes, calendar-based acquisition and SQL acquisition. In SQL acquisition, the system writes SQL acquisition statements according to the standard content of the data sets and saves the configuration, thereby matching the structures of the hospital business databases or hospital data centers to the data set standard of the regional medical data center. The system also sets different acquisition times and acquisition periods for different acquisition models, staggering the data acquisition time points of the business models, avoiding peak operating hours, and preventing interference with the hospital business systems.
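Such a configuration might be recorded along the following lines; this is a hypothetical sketch, and the field names and schedule values are illustrative only:

```python
# Per-model acquisition configuration: SQL statement, start time, and period.
ACQUISITION_MODELS = {
    "outpatient_visits": {
        "sql": "SELECT * FROM his.dbo.outpatient WHERE visit_date = ?",
        "start": "01:00",       # off-peak window
        "period_hours": 24,
    },
    "lab_results": {
        "sql": "SELECT * FROM lis.dbo.results WHERE report_date = ?",
        "start": "02:30",       # staggered from the previous model
        "period_hours": 24,
    },
}
```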
2.3 Acquisition scheduling and monitoring.
The acquisition scheduling is to perform centralized management according to the acquisition period of each model, uniformly manage the acquisition time of each model in each hospital, and start and stop each acquisition model. The acquisition monitoring is to monitor the data acquisition process, monitor the successful data acquisition and failure detail conditions to form a data acquisition log, acquire the failed data again, control the overall data acquisition quality, and perfect a data acquisition configuration scheme according to the data acquisition monitoring result.
Job scheduling and monitoring in the ETL are implemented mainly with the SQL Server Agent service, which schedules and maintains the following: maintenance of scheduling-system parameters, namely setting and modifying the task type, execution frequency, data date, and current date; definition and maintenance of job steps, defining the actual ETL process corresponding to each job, generating a job number, and defining the job type, the driving relations of the job, and the conditions required for the job to run, or directly writing an SQL script as the job content; and scheduling exception handling, namely handling exceptions in the scheduling process and providing error-lookup and error-rerun functions.
The configurable job types include: ActiveX scripts, operating system commands (CMD), PowerShell, various replication tasks, SQL Server Analysis Services (SSAS) commands (e.g., XML/A), SSAS queries (MDX), SQL Server Integration Services (SSIS) packages (SQL Server 2000 DTS packages), and T-SQL scripts.
Job execution logs can also be checked: the scheduling log manages and records the main processes and exception information in scheduling, such as scheduling start, scheduling completion, database operation exceptions, and file read/write exceptions; the Job execution log manages the log records of Job execution information and provides query, delete, and execution-state reset functions; the Job detailed event log manages and records the detailed events during Job execution (numbers of cleansed records, specific database operations) and provides query and delete operations on these logs.
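For instance, an ETL job with a T-SQL step and a daily schedule could be registered with SQL Server Agent through the msdb stored procedures roughly as follows (a sketch; the job name, command, and DSN are placeholders, and a real deployment would also call sp_add_jobserver to target a server):

```python
import pyodbc

conn = pyodbc.connect("DSN=msdb_admin", autocommit=True)  # placeholder DSN
cur = conn.cursor()
cur.execute("EXEC msdb.dbo.sp_add_job @job_name = N'ETL_Load_ODS'")
cur.execute(
    "EXEC msdb.dbo.sp_add_jobstep @job_name = N'ETL_Load_ODS', "
    "@step_name = N'load', @subsystem = N'TSQL', "
    "@command = N'EXEC ods.dbo.usp_load_daily'"   # placeholder T-SQL command
)
# freq_type = 4 (daily), freq_interval = 1, start at 01:00:00 (hhmmss).
cur.execute(
    "EXEC msdb.dbo.sp_add_jobschedule @job_name = N'ETL_Load_ODS', "
    "@name = N'daily_0100', @freq_type = 4, @freq_interval = 1, "
    "@active_start_time = 10000"
)
```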
Function type setting and data processing of the job steps (ETL_Step) are performed as follows:
1) File registration: after the FTP source data file is decompressed, the ETL system can process the corresponding flow using related components in SSIS (or an independently developed FTP file parsing service).
2) Data cleansing: the source data files from FTP and the CDR synchronized data may contain illegal data, redundant data, or data with inconsistent rule standards, and the file format may also be non-standard, so the data files cannot be used immediately by the ETL process; therefore the data files must be cleansed (deleting illegal and redundant data, unifying data rule standards, and converting into a file format that the ETL process can load).
3) Data loading: the cleansed data (file format) is loaded into the corresponding database tables of the SQL Server database through Jobs.
4) ODS data merging: source business system data of the same type from each branch is merged into the same data table in the SQL Server database.
5) PI processing: the PIs required by the business are computed from the ODR data tables according to the business requirements, business rules, and analysis models.
6) Report processing: the reports required by the business are produced from the ODR data tables and the PI tables according to the business requirements, business rules, and analysis models.
7) ETL scheduler: schedules the operation of each step of the ETL processing.
8) Monitoring program: monitors the running-state information of the ETL process (processing progress, processing efficiency, success, warnings, errors, and the like) and reports the system's running state to the operation and maintenance personnel in a timely manner.
The flow and the dependency relationship of the operation steps are as follows:
1) The running of a cleansing-type Job depends on the state of the corresponding data source; after the ETL system is notified that the source data is ready (through triggers, log shipping, and the like), the cleansing Job can be scheduled.
2) The running of ODS layer load type Job depends on whether the corresponding clean file is generated by the cleansing program, i.e. whether the corresponding cleansing Job is running properly.
3) The data transfer from ODS to ODR depends on whether the associated data of the ODS layer is ready, i.e., whether the corresponding load Job is running properly.
4) The PI processing is carried out depending on ODR layer data, namely whether the corresponding conversion Job is correctly operated or not.
5) And according to the data dependency relationship, performing job scheduling in different regions, wherein ETL processing among the regions can be processed in parallel.
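These dependencies form a directed acyclic graph: a cleansing Job unlocks its load Job, which unlocks the ODS-to-ODR transfer and then PI processing. A minimal topological dispatch sketch (job names are illustrative) using Python's standard graphlib:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Job -> set of predecessor Jobs, mirroring dependencies 1) to 4) above.
DAG = {
    "clean_visits": set(),
    "load_ods_visits": {"clean_visits"},
    "ods_to_odr_visits": {"load_ods_visits"},
    "pi_visits": {"ods_to_odr_visits"},
}

ts = TopologicalSorter(DAG)
ts.prepare()
while ts.is_active():
    for job in ts.get_ready():   # Jobs whose predecessors all completed
        print("run", job)        # independent regions could run in parallel
        ts.done(job)
```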
The job scheduling modes are as follows:
1) Driven by preceding Jobs: the operations in the ETL process must proceed in a certain order; a preceding Job is one that must be processed first in the ETL flow, and a Job may have multiple preceding Jobs.
2) Time-driven: when a certain time point is reached, the Job starts running.
3) Combined driving: a Job runs only when at least two of the above conditions are jointly satisfied.
4) Concurrency design: each Job starts running in the background as soon as its driving relations are satisfied. This achieves maximum parallelism for Jobs across different areas and within the same area; the maximum degree of parallelism can be set in advance in consideration of system resources.
5) Concurrency-conflict design: when Jobs running in parallel need to share the same resource, resource-occupation conflicts, which are common in ETL processing, arise; conflicts are avoided with tokens, and a Job runs only after obtaining the token, otherwise it waits for the token to be released.
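The token mechanism in item 5) amounts to a counting semaphore guarding the shared resource, as in the following sketch (the maximum parallel number and job names are assumed values):

```python
import threading

MAX_PARALLEL = 4  # assumed preset maximum degree of parallelism
# One token pool per shared resource; a Job must hold a token to run.
tokens = threading.BoundedSemaphore(MAX_PARALLEL)

def run_job(job_name: str) -> None:
    with tokens:                    # blocks until a token is obtained
        print("running", job_name)  # placeholder for the actual ETL work

for name in ("clean_visits", "load_ods_visits", "clean_labs"):
    threading.Thread(target=run_job, args=(name,)).start()
```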
Finally, defining inspection points and verification points in the data conversion process:
1) Download file: compare the file against the source system data to check the accuracy of the downloaded data.
2) Cleansed file: compare the cleansed file against the downloaded file to judge the correctness of the cleansing process.
3) ODS base tables: compare the data in the ODS base tables with the data in the download file to judge the correctness of the loading process.
4) ODR base tables: compare the data in the LDM base tables with the data in the ODS base tables to judge the correctness of the conversion process.
5) PI values: compare the PI values against the related base tables of the LDM layer to judge the correctness of the PI calculation process.
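Each checkpoint reduces to a reconciliation between two adjacent layers; a typical row-count comparison might look like the sketch below (table and column names are placeholders):

```python
def reconcile(cursor, src_table: str, dst_table: str, data_date: str) -> bool:
    """Compare row counts between adjacent layers (e.g. download staging
    vs. ODS base table) for one data date; a mismatch marks the
    corresponding load/convert step as suspect."""
    sql = "SELECT COUNT(*) FROM {} WHERE data_date = ?"
    cursor.execute(sql.format(src_table), data_date)
    src_count = cursor.fetchone()[0]
    cursor.execute(sql.format(dst_table), data_date)
    dst_count = cursor.fetchone()[0]
    return src_count == dst_count
```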
Designing log information:
1) Scheduling process log: exists as a file and records the main processes and exception information in Job scheduling, such as scheduling start, scheduling completion, database operation exceptions, and file read/write exceptions.
2) Job execution log: exists as a database table, provides the necessary information for Job scheduling, and serves both as one basis for computing the Job scheduling policy and as the interface between the scheduling module and the Jobs.
3) Job detailed event log: exists as a database table and records detailed information of the ETL process, such as the number of successfully cleansed records, the number of failed records, and database operations (INSERT/UPDATE/DELETE).
Exception-handling design: record all rejected rows, define an acceptable number of errors, and provide a reasonable exit pattern.
Notification design (notification of important information, success/failure):
1. Successful exit:
1) Staged-submission mode: exit ETL scheduling when all tasks submitted in the current stage have completed correctly, i.e., all Job states registered in the Job running-state temporary table are completed.
2) Automatic-submission mode: exit ETL scheduling when all tasks have completed correctly on schedule, i.e., all Job states registered in the Job running-state table are completed.
2. Failed exit:
1) Key job exception: if a key job runs abnormally and the remaining jobs are affected and cannot run, exit ETL scheduling.
2) ETL time limit exceeded: exit ETL scheduling when the time limit is exceeded.
3) Database exception: exit ETL scheduling when the database cannot operate normally.
4) Operating system exception: if programs cannot run normally, for example a file system exception causes file read/write errors, ETL scheduling needs to exit.
5) Manual exit: when manual intervention is required, ETL scheduling can be exited by manual operation.
2.4 Message persistence.
After the business systems are integrated through the integration platform, various kinds of exceptions may occur while the business systems exchange information or while the integration platform processes messages. Drawing on years of experience, the problems that can arise during integrated operation have been analyzed comprehensively, and a multi-level, multi-angle, multi-path exception handling mechanism has been designed covering both the integration platform product and the integration scheme, guaranteeing the reliability and stability of the whole integration scheme.
1) Data cleansing: the task of data cleansing is to filter out data that do not meet requirements. The filtered results are handed to the business administration department, which confirms whether the data should be discarded or corrected by the business unit and then re-extracted. Unqualified data fall into three main categories: incomplete data, erroneous data, and duplicate data.
(1) Incomplete data: mainly records missing information that should be present, such as a missing supplier name, department name, or job number, or master and detail records that do not match in the business system. Data completeness is judged in two ways: either the data are screened by script, with an incomplete-data query applied to each mandatory field, or constraints are added during data model design so that violations raise errors that are caught by exception handling. The filtered records are written to separate log files according to the missing content and submitted to the customer, who must complete them within a specified time; once completed, the data are written into the data warehouse.
(2) Erroneous data: such errors arise because the business system is not robust enough, or because dirty data generated during system upgrades after business adjustments were written directly into the back-end database without input validation, for example numeric data entered as full-width digit characters, a carriage return appended after string data, an incorrect date format, or an out-of-range date. This data is likewise categorized: problems such as full-width characters and invisible characters before or after the data can only be found by writing SQL statements, after which the customer is asked to re-extract once the business system has been corrected. Errors such as an incorrect date format or an out-of-range date cause the ETL run to fail; they must be picked out of the business-system database with SQL, submitted to the business administration department for correction within a deadline, and re-extracted after correction.
(3) Duplicate data: for this type of data, particularly duplicates appearing in dimension tables, all fields of the duplicate records are exported for the customer to confirm and collate.
Data cleansing is an iterative process that cannot be finished within a few days; problems can only be found and resolved continuously. Whether to filter and whether to correct generally require customer confirmation. Filtered data are written to an Excel file or a data table; in the early stage of ETL development, the filtered data can be e-mailed to the business units daily, both to prompt them to correct errors quickly and to serve as a basis for later data verification. Care must be taken not to filter out useful data: every filtering rule should be verified carefully and confirmed by the user. A screening sketch follows.
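As an illustration of the incomplete-data screening described above, the following Python/pandas sketch sets aside rows with missing mandatory fields and exports them for the business unit to confirm; the column names and file name are hypothetical:

    import pandas as pd

    REQUIRED = ["supplier_name", "dept_name", "job_no"]   # hypothetical mandatory fields

    def split_incomplete(df):
        mask = df[REQUIRED].isna().any(axis=1)            # any mandatory field missing?
        incomplete = df[mask]
        incomplete.to_excel("incomplete_records.xlsx", index=False)  # log for confirmation
        return df[~mask], incomplete                      # clean rows, rejected rows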
2) Data conversion.
The task of data conversion is mainly to convert inconsistent data, convert data granularity, and compute certain business rules.
(1) Inconsistent data conversion: this is an integration step that unifies the same kind of data from different business systems; for example, if a supplier's code is XX0001 in the settlement system and YY0001 in CRM, the extracted data are unified under a single code.
(2) Conversion of data granularity: business systems typically store very detailed data, whereas data-warehouse data are used for analysis and do not need that level of detail. Business-system data are therefore generally aggregated to the data-warehouse granularity.
(3) Calculation of business rules: different enterprises have different business rules and different data indicators, which cannot always be obtained by simple addition and subtraction; such indicators are computed during ETL and stored in the data warehouse for analytical use. A conversion sketch follows.
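A minimal pandas sketch of the first two conversions just described, unifying inconsistent codes and rolling detail rows up to the warehouse granularity; the code mapping and column names are hypothetical:

    import pandas as pd

    CODE_MAP = {"XX0001": "SUP0001", "YY0001": "SUP0001"}   # hypothetical unified codes

    def convert(detail):
        # (1) unify the same supplier's codes from different source systems
        detail["supplier_code"] = detail["supplier_code"].replace(CODE_MAP)
        # (2) aggregate detail rows to the warehouse granularity (date x supplier)
        return (detail.groupby(["biz_date", "supplier_code"], as_index=False)
                      .agg(amount=("amount", "sum")))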
2.5 Support for active-passive clustering.
Because the integration platform sits at the center of the hospital's various systems, its high availability (HA) is critical. According to the number of systems a hospital connects, the real-time requirements, and the message throughput, several high-availability deployment schemes are provided to minimize the impact on services when the integration platform system fails.
The integration platform achieves high availability through clustering and supports active-passive clustering, separation, and load-balancing modes.
In active-passive cluster mode, two integration platform servers are connected to shared data storage, typically placed in a SAN.
The integration platform service runs on the active node, while the service on the passive node is stopped.
When the service on the active node fails, the cluster software activates the passive node and starts the integration platform service there. Both nodes use the same data store, so the platform configuration and the messages currently being processed continue on the new service as if the integration platform had simply performed a normal restart. Connections to the integration platform are directed to the active node via a virtual IP.
2.6 Support for data verification services.
When the integration platform server suffers a power failure, abnormal shutdown, or similar event, data corruption may result from failed operating-system writes. The integration engine therefore provides a data verification service that can recover and repair corrupted data: when the engine detects an abnormal shutdown, the integration platform verifies the message store to ensure data integrity.
The data verification process has two phases:
Phase 1: verify the messages currently being processed in the integration engine;
Phase 2: verify the processed historical messages.
2.7 Support for error handling mechanisms.
If an error occurs while a message is being processed in the integration platform, the platform's multi-level error handling mechanism and its various components allow the error to be handled conveniently and flexibly.
2.8 Support for a message tracking mechanism.
The integration engine uses heterogeneous messaging to send messages to a wide variety of systems. Some messaging standards, notably HL7, explicitly define acknowledgment messages to indicate that the remote system received and correctly processed the information. These acknowledgments may cause the integration engine to resend a message, or to notify the system administrator that some system processed it incorrectly or did not respond at all.
The message tracking mechanism in the integration platform matches each sent message with the acknowledgment received from the remote system, and decides and executes the corresponding follow-up action (retransmission, error reporting, notification, and so on) according to the acknowledgment.
2.9 Support for a connection re-establishment mechanism.
After application systems are connected to the integration platform, message sending and receiving services may be temporarily interrupted for reasons such as a fault in the sending or receiving system, a virus-infected program, operating-system downtime, server power failure, or a network communication outage.
When a connection failure occurs, the sending system is required to detect the failure proactively and to re-establish the connection automatically at a set time interval until reconnection succeeds.
2.10 Support for a message retransmission mechanism.
HL7's send-and-acknowledge pattern ensures that a message can be confirmed as successfully received by the receiving system. After the sending system delivers a message to the integration platform, however, the receiving system's acknowledgment may never arrive because of network congestion, packet loss, or an error in the receiving service.
A timing mechanism is therefore needed after a message is sent: if no acknowledgment has been received when the preset maximum response time elapses, the sending system automatically retransmits the message and increments the retransmission counter by 1. If there is still no acknowledgment, retransmission continues until the preset maximum number of retransmissions is reached, at which point the message is marked as failed; if an acknowledgment arrives after a retransmission, retransmission ends and the message is marked as successfully sent. A sketch follows.
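A minimal sketch of this retransmission loop, assuming the send and acknowledgment-wait operations are supplied by the caller; the timeout and retry limit are illustrative:

    MAX_RETRIES = 3      # assumed preset maximum number of retransmissions
    ACK_TIMEOUT = 5.0    # assumed preset maximum response time, in seconds

    def send_with_retry(send, wait_for_ack, message):
        for attempt in range(MAX_RETRIES + 1):
            send(message)
            if wait_for_ack(ACK_TIMEOUT):   # blocks up to ACK_TIMEOUT seconds
                return True                 # acknowledged: mark message as sent
            # no acknowledgment in time: fall through and retransmit
        return False                        # retries exhausted: mark as failed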
2.11 Support for a duplicate message detection mechanism.
Because of network delays, system failures, or an unresponsive receiving system, the sending system may send the same message several times. If the receiving system processes the same message more than once, it may in some circumstances perform the business operation repeatedly, causing data errors.
The receiving system therefore needs to detect duplicate messages: if a received message is identical to one already processed, it is not processed again and a reject acknowledgment is returned directly, as sketched below.
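A minimal sketch of duplicate detection keyed on a unique message ID; in practice the set of processed IDs would be persisted, and the process/reject callbacks are assumptions:

    processed_ids = set()   # IDs of messages already processed

    def on_receive(msg_id, payload, process, reject):
        if msg_id in processed_ids:
            return reject(msg_id)      # duplicate: return a reject acknowledgment
        result = process(payload)      # first delivery: perform the business operation
        processed_ids.add(msg_id)
        return result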
2.12 Support for an error message verification mechanism.
Receiving systems often encounter two types of error information. One is information that cannot be processed, for example a received message event type that should not be handled, an incompatible message version, or an inapplicable processing identifier. The other is processing errors, such as mandatory items missing while parsing a received message, unintelligible data, an incorrect format, an incorrect type, or an over-long field.
Every received message must be acknowledged, whether correct or erroneous. If a message cannot be processed, a reject acknowledgment MSA-1 = AR (or CR) must be returned; if an error is found while parsing and validating the message, an error acknowledgment MSA-1 = AE (or CE) must be returned; if the message has no problems at all, a success acknowledgment MSA-1 = AA (or CA) is returned.
If an error occurs during message parsing and validation, the error information is stored in the ERR segment of the acknowledgment message. A code-selection sketch follows.
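A minimal sketch of the acknowledgment-code selection described above; the function name and flags are illustrative assumptions, with the CA/CE/CR codes chosen when HL7 enhanced acknowledgment mode is in use:

    def msa_code(can_process, parse_ok, enhanced=False):
        # choose the MSA-1 acknowledgment code for a received message
        if not can_process:
            return "CR" if enhanced else "AR"   # reject: message cannot be processed
        if not parse_ok:
            return "CE" if enhanced else "AE"   # error found during parsing/validation
        return "CA" if enhanced else "AA"       # success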
2.13 Support for an error acknowledgment handling mechanism.
After the sending system finishes sending a message and receives the acknowledgment returned from the integration platform or another system, it must locate the corresponding sent message by the unique identification ID in the acknowledgment and update that message's state to successful, rejected, or erroneous.
It can then decide how to proceed, for example displaying an error prompt to the operator or recording an error log. System administration and maintenance staff can decide on further processing based on the message state.
2.14 Support for data access services based on the data center.
These include interconnection with regional platforms, interconnection with medical consortium institutions, and reporting of external data.
They help realize the sharing of medical and data resources and business collaboration inside and outside the hospital, improve the efficiency and quality of medical services, and build an extended hospital information system.
The hospital medical collaboration and sharing system takes the hospital as the linkage core and business guidance core, and links medical and health collaboration services by accessing the hospital data center platform and the collaboration platform, forming a medical service collaboration network. Its business scope includes medical data sharing, appointment registration, appointment examination, entrusted image reading, regional inspection and testing centers, regional pathology centers, health care for special populations, and whole-course management of chronic disease prevention and treatment. The data scope mainly covers medical service data such as electronic medical records, health examinations, medical images, and tests, as well as medical collaboration data such as bidirectional referral, remote consultation, and remote education.
2.15 Support for operation and maintenance monitoring.
The integration platform exchanges data with the various application systems in real time and carries the hospital's core business processes; any problem in the information exchange must be discovered promptly and reported to the relevant staff. The platform provides a multi-angle, multi-level monitoring mechanism combining active and passive approaches, reducing the risk of information exchange as much as possible.
The integrated monitoring platform can display the system's running-state information and the message processing situation inside the engine. Problems such as system connection faults and abnormal interface throughput can be monitored and flagged in real time; the detailed processing log of each problem can be inspected, and follow-up actions such as reprocessing and resending can be performed as needed.
The monitoring platform can access snapshot records of all messages processed by the integration engine. These are stored in the integration platform's archive; the retention period of archived messages can be configured, and archived messages can be edited, retransmitted, and reprocessed as required.
The monitoring platform can display a message's processing flow and its complete path visually.
The monitoring platform is a web-browser-based application accessible through Internet Explorer, Microsoft Edge, Firefox, Chrome, or Safari, so it can be reached for monitoring from any operating system.
1) Error and hold queues: display the number and content of the messages currently in the error and hold queues.
2) Message view: messages in the monitoring platform can be presented in a more readable format; HL7 messages can be shown as a tree structure down to field names and values, or viewed as text with keywords highlighted. This helps users read messages more easily and find key information faster.
3) Engine statistics:
System statistics: monitor the integration engine's message throughput and memory usage.
Latency statistics: monitor the time the integration engine takes to process a message and the time external systems take to respond to it.
Performance statistics: monitor the time each route and filter spends processing messages per unit time.
Message statistics: after parsing message content, count messages by type, source, and so on.
4) Server state: the integration platform's running log and the system audit log can be inspected.
System threads can be viewed to help identify problems with the integration engine.
The system can collect diagnostics such as logs, configuration, and system information, and package them into a standalone archive file for technical support staff to analyze further.
5) Engine uptime: the monitoring platform reports the engine's uptime, and users can view the total number of messages recorded and processed by the engine within a time range.
2.16 Notification mechanism.
The monitoring platform provides a multi-tiered notification mechanism: an administrator can set thresholds that trigger alerts or alarms on each individual component or globally for the system. If an alert is not handled within a specified time, the issue is escalated and a designated person or group is notified.
For example, when the LIS system's message receiving service fails, the integration platform cannot deliver messages and a large number of messages accumulate in the message queue; once the set threshold is reached, the integration platform administrator and the LIS system engineer can be notified promptly so the problem can be resolved immediately.
2.17 Monitoring lists.
Monitoring lists group components by logical domain; components can be monitored individually or escalated according to the list. Notifications can be sent by list on specified dates and time periods, and through the communication channel the user selects (e-mail, SMS, paging).
A monitoring list is a user-defined group of routes, communication points, and web services: create the list, the components to monitor, the related subscribers, and the notification calendar for the list.
3. Data access processing.
After data access, exploratory analysis of the raw data is needed. For the data set as a whole, this builds an initial understanding and prior knowledge, such as data types, missing values, data set size, and the distribution of the data under each feature. A third-party plotting library is used for visual inspection of the data's basic attributes and distributions; in addition, univariate and multivariate analysis can preliminarily explore the relationships among the features in the data set, verifying the hypotheses proposed in the business analysis stage.
3.1 Missing value handling.
Missing values in a data set can be found directly with the various methods built into pandas in Python. Most data sets contain missing values, so the quality of missing-value handling directly affects the model's final result. How to handle a missing value depends mainly on the importance of the attribute containing it and the distribution of the missing values.
When the missing rate is low and the attribute's importance is low, a numeric attribute can simply be filled according to the data distribution, for example: if the data are evenly distributed, fill with the mean; if the distribution is skewed, filling with the median suffices. For a categorical attribute, a global constant such as "unknown" or "NULL" can be used, but this is often ineffective because the algorithm may treat it as a brand-new category, so it is rarely used.
When the missing rate is high (> 95%) and the attribute's importance is low, the attribute can be deleted directly. When the missing rate is high and the attribute's importance is also high, however, deleting the attribute outright will harm the algorithm's results.
When the missing rate is high and the attribute is important, the main methods used are interpolation and modeling.
Interpolation methods mainly include random interpolation, multiple imputation, hot-deck imputation, Lagrange interpolation, and Newton interpolation. Random interpolation draws samples at random from the population to replace the missing samples. Multiple imputation predicts the missing data from the relationships among variables, generates several complete data sets with a Monte Carlo method, analyzes them, and finally pools the analysis results. Hot-deck imputation finds a sample similar to the one containing the missing value (a matching sample) in the complete part of the data set and fills the missing value with an observed value from that sample. Lagrange and Newton interpolation fit a polynomial through the known points and evaluate it at the missing position.
Modeling methods predict the missing data with models such as regression, Bayesian methods, random forests, and decision trees. For example, a decision tree can be built from the other attributes in the data set to predict the missing values.
In general, there is no single procedure for handling missing values; the method must be chosen according to the actual data distribution, its skewness, the proportion of missing values, and so on. Apart from simple filling and deletion, modeling-based filling is used in most cases during preprocessing, mainly because it predicts unknown values from existing values with high accuracy. Modeling may, however, inflate the correlation among attributes, which can affect the training of the final model. A simple filling sketch follows.
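A minimal pandas sketch of the simple filling strategy for a numeric attribute, combining the rules above; the thresholds are illustrative assumptions:

    import pandas as pd

    def fill_numeric(df, col, drop_threshold=0.95):
        rate = df[col].isna().mean()
        if rate > drop_threshold:                   # very high missing rate: drop the column
            return df.drop(columns=[col])
        if abs(df[col].skew()) < 1:                 # roughly symmetric: fill with the mean
            return df.fillna({col: df[col].mean()})
        return df.fillna({col: df[col].median()})   # skewed: fill with the median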
3.2 Outliers.
3.2.1 Detecting outliers.
1) Simple statistical analysis.
2) The 3σ principle: outlier detection based on the normal distribution. If the data follow a normal distribution, an outlier is a value that deviates from the mean by more than three standard deviations; the probability of a value falling outside the interval μ ± 3σ is at most about 0.3%, an extremely rare event. If the data do not follow a normal distribution, outliers can still be described by how many standard deviations they lie from the mean. A detection sketch follows.
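A minimal pandas sketch of this rule; k defaults to 3 but, as noted, other multiples of the standard deviation can be used:

    import pandas as pd

    def three_sigma_outliers(s, k=3.0):
        mu, sigma = s.mean(), s.std()
        return s[(s - mu).abs() > k * sigma]   # values more than k std devs from the mean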
3) Model-based detection: first build a model of the data; anomalies are the objects the model cannot fit well. If the model is a collection of clusters, an anomaly is an object that does not clearly belong to any cluster; when a regression model is used, anomalies are objects relatively far from their predicted values.
4) Distance-based detection: with a proximity measure defined between objects, anomalous objects are those far from the other objects.
5) Density-based detection: a point is classified as an outlier when its local density is significantly lower than that of most of its neighbors. Suitable for non-uniformly distributed data.
6) Clustering-based detection: an object is a clustering-based outlier if it does not strongly belong to any cluster. Detecting outliers by clustering poses a problem, because the outliers themselves affect the clustering and its validity. To deal with this, the objects can be clustered, the outliers deleted, and the objects clustered again.
3.2.2 Outlier treatment.
Deletion: values that are obviously anomalous and few in number can be deleted directly.
No treatment: outliers can be left as-is if the algorithm is insensitive to them; if the algorithm is sensitive to outliers, such as distance-based algorithms including k-means and kNN, it is best not to use it on untreated data.
Mean replacement: replacing outliers with the mean loses little information and is simple and efficient.
Treat as missing: handle the outlier with the missing-value methods described above.
3.3 Deduplication.
Duplicates are judged with the basic idea of "sort and merge": the records in the data set are first sorted by some rule, and duplicates are then detected by comparing whether adjacent records are similar. This involves two operations: sorting and similarity computation. In current practice (for example in competitions), a duplicated/drop_duplicates-style method is mainly used for the judgment, after which the duplicate samples are simply deleted, as sketched below.
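A minimal pandas sketch of the sort-and-merge idea; the key columns are supplied by the caller:

    import pandas as pd

    def dedup(df, keys):
        # sort so that duplicates become adjacent, then drop the repeats
        return (df.sort_values(list(keys))
                  .drop_duplicates(subset=list(keys), keep="first"))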
3.4 Noise treatment.
Noise is random error or variance in a measured variable, and is to be distinguished from outliers. In formula terms: Measurement = True Data + Noise. An outlier is an observation; it may be produced by the true data or by noise, but in general it is an observation significantly different from most observations. Noise includes erroneous values and deviations from the expected values; noise points are not necessarily outliers, although most data mining methods discard outliers as noise or anomalies. In some applications (e.g., fraud detection), however, outlier analysis or anomaly mining is performed specifically on the outliers. Some points are also outliers locally while being normal from a global perspective.
Noise is mainly handled by binning and regression:
Binning: the binning method smooths ordered data values by consulting their "neighbors". The ordered values are distributed into a number of "buckets" or bins; because binning consults neighboring values, it performs local smoothing.
Smoothing by bin means: each value in a bin is replaced by the mean of the bin.
Smoothing by bin medians: each value in a bin is replaced by the median of the bin.
Smoothing by bin boundaries: the maximum and minimum values in a bin are taken as the bin boundaries, and each value in the bin is replaced by the nearest boundary value.
Generally, the wider the bins, the more pronounced the smoothing. Bins may be of equal width, with each bin's value range held constant. Binning can also serve as a discretization technique; a bin-mean sketch follows.
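A minimal pandas sketch of smoothing by bin means with equal-width bins; the bin count is an illustrative assumption:

    import pandas as pd

    def smooth_by_bin_mean(s, n_bins=4):
        bins = pd.cut(s, bins=n_bins)              # equal-width bins
        return s.groupby(bins).transform("mean")   # replace each value by its bin mean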
Regression: a function can be fitted to the data to smooth it. Linear regression finds the "best" straight line relating two attributes (or variables) so that one attribute can predict the other. Multiple linear regression extends this to more than two attributes, fitting the data to a multidimensional surface. Regression yields a mathematical equation that fits the data, which helps eliminate noise.
It should be noted that the steps illustrated in the flow diagrams above may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown, in some cases the steps may be performed in an order different from the one shown or described here.
This embodiment provides a data processing apparatus used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the terms "module," "unit," "subunit," and the like may denote a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 4, the apparatus includes:
a determining unit 41, configured to determine a data acquisition mode, the data acquisition modes comprising: acquisition via a data interface, via replication subscription, and via SSIS;
a processing unit 43, configured to use the data acquisition management system to acquire, according to the acquisition mode, the raw data of each application subsystem of each hospital information system from the data center, perform data cleansing and data conversion on the acquired raw data, and store the resulting target data by category, the data acquisition management system providing the following functions: data acquisition, acquisition script configuration, acquisition scheduling and monitoring, message persistence, active-passive clustering, data verification service, data error handling mechanism, message tracking mechanism, connection re-establishment mechanism, message retransmission mechanism, duplicate message detection mechanism, error message verification mechanism, error acknowledgment handling mechanism, data-center-based data access service, operation and maintenance monitoring, notification mechanism, and monitoring lists, where data cleansing means filtering out unqualified data, and data conversion includes conversion of inconsistent data, conversion of data granularity, and conversion of business rules.
The above modules may be functional modules or program modules, implemented in software or hardware. Modules implemented in hardware may be located in the same processor, or distributed across different processors in any combination.
An embodiment provides a computer device. The data processing methods of the embodiments of the present application can be implemented by a computer device. Fig. 5 is a hardware structure diagram of a computer device according to an embodiment of the present application.
The computer device may comprise a processor 51 and a memory 52 in which computer program instructions are stored.
Specifically, the processor 51 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 52 may include mass storage for data or instructions. By way of example and not limitation, the memory 52 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 52 may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 52 is non-volatile memory. In particular embodiments, the memory 52 includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), flash memory, or a combination of two or more of these. The RAM may be static RAM (SRAM) or dynamic RAM (DRAM), where the DRAM may be fast page mode DRAM (FPM DRAM), extended data out DRAM (EDO DRAM), synchronous DRAM (SDRAM), and the like.
The memory 52 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions executed by the processor 51.
The processor 51 realizes any of the data processing methods in the above embodiments by reading and executing computer program instructions stored in the memory 52.
In some of these embodiments, the computer device may also include a communication interface 53 and a bus 50. As shown in fig. 5, the processor 51, the memory 52, and the communication interface 53 are connected via the bus 50 to complete mutual communication.
The communication interface 53 implements communication among the modules, apparatuses, units, and/or devices in the embodiments of the present application. It may also communicate with other components, for example exchanging data with external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.
The bus 50 comprises hardware, software, or both, coupling the components of the computer device to each other. The bus 50 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, or a local bus. By way of example and not limitation, the bus 50 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. The bus 50 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated.
In addition, in combination with the methods in the foregoing embodiments, the embodiments of the present application may be implemented as a computer-readable storage medium on which computer program instructions are stored; when executed by a processor, the instructions implement any of the data processing methods in the above embodiments.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any combination that contains no contradiction should be considered within the scope of this specification.
The embodiments above express only several implementations of the present application; their description is comparatively specific and detailed but should not be construed as limiting the scope of the invention. A person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within its scope of protection. The scope of protection of this patent shall therefore be subject to the appended claims.

Claims (10)

1. A data processing method, comprising:
determining a data acquisition mode, the data acquisition modes comprising: acquisition via a data interface, via replication subscription, and via SSIS;
acquiring, with a data acquisition management system and according to the acquisition mode, the raw data of each application subsystem of each hospital information system from a data center, performing data cleansing and data conversion on the acquired raw data, and storing the resulting target data by category, wherein the data acquisition management system provides the following functions: data acquisition, acquisition script configuration, acquisition scheduling and monitoring, message persistence, active-passive clustering, data verification service, data error handling mechanism, message tracking mechanism, connection re-establishment mechanism, message retransmission mechanism, duplicate message detection mechanism, error message verification mechanism, error acknowledgment handling mechanism, data-center-based data access service, operation and maintenance monitoring, notification mechanism, and monitoring lists, and wherein data cleansing means filtering out unqualified data, and data conversion comprises conversion of inconsistent data, conversion of data granularity, and conversion of business rules.
2. The data processing method according to claim 1, wherein the data interface is provided by a data interface management module of the data acquisition management system, and wherein, while the raw data of each application subsystem of each hospital information system are acquired from the data center according to the acquisition mode, cleansed, converted, and stored by category, the method further comprises the following processes performed by the data interface management module:
executing an interface editing process: editing data interface information according to the acquired domain knowledge represented by templates;
executing a data interface management process: querying the basic information of data interfaces and managing the life-cycle state and versions of each data interface, wherein the life-cycle states of a data interface comprise created, released, locked, and deprecated; the initial state of a basic data interface after template deployment is created; a newly deployed template is examined: if no basic data interface has been deployed before, the state of the created basic data interface is updated to released; if the template is deployed after a version update, the template is compared with the previous version: if compatible, the previous version of the basic data interface is updated to deprecated, and if incompatible, the previous version is updated to locked, after which the newly created basic data interface's version is upgraded and its state changed to released; the initial state of a user-customized data interface is created, and after the user submits it an administrator reviews it, the state being updated to released if the review passes and returned to created otherwise; if the user creates a new version of a released data interface, the current version is updated to deprecated and the new version's state is set to created; a data interface in the created state can be modified by the user;
executing a data interface display process: providing an online description document for each data interface, parsing the data interface information, displaying the data interface resource names based on templates according to template category, and displaying the description document of each request method according to resource name and version information, the description document comprising the request method's URL, input parameters, output parameters, and related descriptions;
executing a data interface test process: entering the parameter values required by the parameters of a deployed data interface, sending an HTTP request to the data interface's universal generator service, and verifying the data interface's function;
executing a data interface deployment process: deploying the data interface information in the released state so that the universal data interface generator can access it and generate the data interface;
executing a data interface maintenance process: providing maintenance functions for the data interfaces and adding corresponding Chinese description information to each data interface;
executing a data interface generation service process: exposing model management in the form of an interface; after a template is successfully deployed, calling the interface, generating the corresponding data interface information according to the basic data interface generation method, performing version management under the version management constraints, and updating the life-cycle state to released;
executing a universal data interface generation service process: the server executes the data interface information as a request, handling each deployed data interface.
3. The data processing method of claim 1, wherein the data collection management system executes a data collection procedure comprising:
extracting, converting, loading, and integrating data; configuring an ETL script for data conversion and transmission according to business requirements; testing whether the script meets the requirements; uploading the script to an ETL management platform; and controlling the script's start and stop through the ETL management platform, thereby starting and stopping data acquisition.
4. The data processing method of claim 1, wherein the acquisition script configuration executed by the data acquisition management system comprises:
setting up database links to each business system of the hospital or to each hospital data center; supporting, for the acquisition of data sources, two modes, calendar acquisition and SQL acquisition; for SQL acquisition, writing the SQL acquisition statements according to the standard content of the data set and saving the configuration, so that the structure of each hospital business database or hospital data center is matched and mapped to the regional medical data center data-set standard; and setting different acquisition times and acquisition periods for different acquisition models so as to stagger the data acquisition time points of the business models.
5. The data processing method of claim 1, wherein the data acquisition scheduling and monitoring performed by the data acquisition management system comprises:
performing centralized management according to each model's acquisition period and managing each hospital's acquisition time for each model in a unified way, so as to start and stop each acquisition model;
monitoring the data acquisition process and the details of acquisition successes and failures, forming a data acquisition log, re-acquiring failed data to control the overall data acquisition quality, and refining the data acquisition configuration scheme according to the monitoring results.
6. The data processing method of claim 1, wherein the message persistence performed by the data collection management system comprises:
and cleaning the data to filter the data which do not meet the requirements, delivering the filtering result to a business administration department, and determining whether the data are filtered or corrected by a business unit and then extracting the data, wherein the data which do not meet the requirements comprise three types of incomplete data, wrong data and repeated data.
7. The data processing method of claim 1, wherein the data verification service executed by the data acquisition management system comprises:
when the integration platform server is powered off or shut down abnormally and an operating-system write failure corrupts data, the integration engine provides a data verification service to recover and repair the corrupted data; when the integration engine detects an abnormal shutdown, the integration platform performs data verification on the message store to ensure data integrity.
8. A data processing apparatus, comprising:
a determining unit, configured to determine a data acquisition mode, the data acquisition modes comprising: acquisition via a data interface, via replication subscription, and via SSIS;
a processing unit, configured to use the data acquisition management system to acquire, according to the acquisition mode, the raw data of each application subsystem of each hospital information system from the data center, perform data cleansing and data conversion on the acquired raw data, and store the resulting target data by category, the data acquisition management system providing the following functions: data acquisition, acquisition script configuration, acquisition scheduling and monitoring, message persistence, active-passive clustering, data verification service, data error handling mechanism, message tracking mechanism, connection re-establishment mechanism, message retransmission mechanism, duplicate message detection mechanism, error message verification mechanism, error acknowledgment handling mechanism, data-center-based data access service, operation and maintenance monitoring, notification mechanism, and monitoring lists, where data cleansing means filtering out unqualified data, and data conversion includes conversion of inconsistent data, conversion of data granularity, and conversion of business rules.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the data processing method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data processing method of any one of claims 1 to 7.

