CN113138980A

CN113138980A - Data processing method, device, terminal and storage medium

Info

Publication number: CN113138980A
Application number: CN202110522186.7A
Authority: CN
Inventors: 王成; 王雅洁; 赵培祯
Original assignee: Dermatology Hospital Of Southern Medical University
Current assignee: Dermatology Hospital Of Southern Medical University
Priority date: 2021-05-13
Filing date: 2021-05-13
Publication date: 2021-07-20

Abstract

The embodiment of the invention discloses a data processing method, device, terminal and storage medium applied to disease report data. The method comprises the following steps: acquiring a plurality of original report data of the same disease; converting each identity feature of the patient in each original report data into a preset character string by a random algorithm to obtain data to be processed; processing all the data to be processed based on a plurality of preset rules to determine repeated data; excluding, from the repeated data, the data with the earliest disease reporting time, and setting the remaining data as final repeated data; and associating the final repeated data with the corresponding relation and then storing them respectively. Because the conversion is performed after the original report data is acquired, patient information cannot be seen directly during the subsequent duplicate checking, which protects patient privacy; and because the duplicate checking is rule-based, it lends itself to automatic processing, improves data processing efficiency, and can meet large-scale data processing requirements.

Description

Data processing method, device, terminal and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a terminal, and a storage medium for data processing.

Background

At present, cases of many infectious diseases must be reported and registered. Some infectious diseases, however, have special characteristics. Syphilis is one example: even after standard treatment, both syphilis-specific and non-specific antibodies may remain detectably positive for life, so multiple re-examinations are required. As a result, a patient may attend many visits, including return visits, repeated visits and visits in different years.

Multiple visits by the same patient can lead to repeated reports. At the same time, most syphilis cases are currently reported by general hospitals, where the cases come mainly from preoperative screening, and many of the doctors involved in preoperative screening are not dermatology specialists. They are therefore often unclear about the reporting criteria for syphilis, so syphilis is easily reported again.

Specifically, re-reporting of syphilis, i.e. repeated reporting, refers to the phenomenon in which the same case is reported two or more times after reinfection has been excluded. Once reinfection has been excluded, every report of the same case with the same diagnosis after the first report is a repeated report. Repeated reporting of syphilis hinders accurate monitoring of the disease and an exact count of the people actually infected; it directly affects judgments about the true level of the epidemic and, in turn, government decisions.

Some duplicate-checking methods exist for this situation, such as manual duplicate checking and manual screening in Excel (spreadsheet software in Microsoft's office suite). However, these methods involve a heavy workload, are time-consuming and labor-intensive, are error-prone, and cannot meet the needs of large-scale data processing. In addition, patient information is displayed directly during data processing, which does not protect patient privacy.

For this reason, there is a need for a better solution to the problems of the prior art.

Disclosure of Invention

In view of the above, the present invention provides a data processing method, apparatus, terminal and storage medium. The conversion is performed after the original report data is acquired, so that patient information cannot be seen directly during the subsequent duplicate checking, which helps protect patient privacy; and the duplicate checking is rule-based, which facilitates automation, improves data processing efficiency, and can meet large-scale data processing requirements.

Specifically, the present invention proposes the following specific embodiments:

the embodiment of the invention provides a data processing method, which is applied to report data of diseases and comprises the following steps:

acquiring a plurality of original report data of the same disease;

converting each identity characteristic of the patient in each original report data into a preset character string by a random algorithm to obtain data to be processed, and establishing a corresponding relation between the identity characteristic and the preset character string; the preset character strings are different after different identity characteristics are converted;

processing all the data to be processed based on a plurality of preset rules to determine repeated data; each preset rule is generated based on the identity characteristics of the patient and the unique characteristics of the disease, and the identity characteristics comprise the unique identity characteristics of the patient and/or other identity characteristics except the unique identity characteristics;

excluding, from the repeated data, the data with the earliest disease reporting time, and setting the remaining data as final repeated data;

and associating the final repeated data with the corresponding relation and then respectively storing.

In a specific embodiment, obtaining multiple raw report data for the same disease includes:

acquiring a plurality of original data of the same disease;

performing data cleaning on all the original data to remove the original data which does not comprise the identity features or the unique features of the diseases;

setting the remaining original data as original report data.

In a specific embodiment, the type of identity feature comprises any combination of one or more of the following: name, certificate number, gender, age, address, contact, birth time;

when the disease is syphilis, the disease is characterized by stages of syphilis.

In a specific embodiment, when the identity feature is a name, the conversion is: randomly converting the Chinese characters in the name, as a whole, into a preset character string; and converting each Chinese character in the name into pinyin and randomly converting each letter of the pinyin into a preset character; wherein different Chinese characters are converted into different preset character strings, and different letters are converted into different preset characters;

when the identity feature is a feature other than the name, the conversion is: randomly converting the feature, as a whole, into a preset character string.

In a specific embodiment, when the preset character strings obtained by converting the Chinese characters of two names as a whole are the same, or when every preset character obtained by converting the pinyin of the Chinese characters of the two names is the same, the name identity features are determined to be the same.

In a specific embodiment, the "processing the data to be processed based on a plurality of preset rules to determine duplicate data" includes:

if the unique characteristics of the diseases in the two or more data to be processed are the same and the unique characteristics of the identities of the patients in the two or more data to be processed are the same, determining the two or more data to be processed as initial repeated data;

if the unique characteristics of diseases in two or more data to be processed are the same, and the other identity characteristics of which the quantity exceeds a preset value in the two or more data to be processed are the same, determining that the two or more data to be processed are initial repeated data;

and summarizing all the initial repeated data and performing deduplication to obtain repeated data.

The embodiment of the invention also provides a data processing device, which is applied to report data of diseases and comprises the following components:

the acquisition module is used for acquiring a plurality of original report data of the same disease;

the conversion module is used for converting each identity characteristic of the patient in each original report data into a preset character string by a random algorithm to obtain data to be processed and establishing a corresponding relation between the identity characteristic and the preset character string; the preset character strings are different after different identity characteristics are converted;

the processing module is used for processing all the data to be processed based on a plurality of preset rules so as to determine repeated data; each preset rule is generated based on the identity characteristics of the patient and the unique characteristics of the disease, and the identity characteristics comprise the unique identity characteristics of the patient and/or other identity characteristics except the unique identity characteristics;

the removing module is used for excluding, from the repeated data, the data with the earliest disease reporting time and setting the remaining data as final repeated data;

and the storage module is used for associating the final repeated data with the corresponding relation and then respectively storing the associated repeated data.

In a specific embodiment, the obtaining module is configured to: acquiring a plurality of original data of the same disease; performing data cleaning on all the original data to remove the original data which does not comprise the identity features or the unique features of the diseases; setting the remaining original data as original report data.

The embodiment of the invention also provides a terminal, which comprises a processor and a memory, wherein an application program is stored in the memory, and the application program executes the data processing method when running on the processor.

The embodiment of the present invention further provides a storage medium, where an application program is stored in the storage medium, and the application program executes the data processing method when running on a processor.

Therefore, the embodiment of the invention discloses a data processing method, device, terminal and storage medium applied to disease report data. The method comprises the following steps: acquiring a plurality of original report data of the same disease; converting each identity feature of the patient in each original report data into a preset character string by a random algorithm to obtain data to be processed, and establishing a corresponding relation between the identity feature and the preset character string, wherein different identity features are converted into different preset character strings; processing all the data to be processed based on a plurality of preset rules to determine repeated data, wherein each preset rule is generated based on the identity features of the patient and the unique features of the disease, and the identity features comprise the unique identity feature of the patient and/or other identity features; excluding, from the repeated data, the data with the earliest disease reporting time, and setting the remaining data as final repeated data; and associating the final repeated data with the corresponding relation and then storing them respectively. Because the conversion is performed after the original report data is acquired, patient information cannot be seen directly during the subsequent duplicate checking, which protects patient privacy; and because the duplicate checking is rule-based, it lends itself to automatic processing, improves data processing efficiency, and can meet large-scale data processing requirements.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention. Like components are numbered similarly in the various figures.

FIG. 1 shows a flow diagram of a data processing method;

FIG. 2 shows a schematic structural diagram of a data processing apparatus;

FIG. 3 shows a schematic structural diagram of a terminal;

FIG. 4 shows a schematic structural diagram of a storage medium.

Description of reference numerals:

201: acquisition module; 202: conversion module; 203: processing module; 204: removing module; 205: storage module.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

Hereinafter, the terms "including", "having", and their derivatives, which may be used in various embodiments of the present invention, are only intended to indicate specific features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as first excluding the existence of, or adding to, one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.

Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.

Example 1

Embodiment 1 of the present invention discloses a data processing method, which is applied to report data of diseases, and as shown in fig. 1, the method includes the following steps:

step S101, acquiring a plurality of original report data of the same disease;

specifically, the step S101 of obtaining the original report data of the same disease includes:

acquiring a plurality of original data of the same disease; performing data cleansing on all of the raw data to remove the raw data that does not include patient identity or unique characteristics of the disease; setting the remaining original data as original report data.

Specifically, for example, in the case where the disease is syphilis, the raw data may be downloaded from, for example, a national infectious disease report information management system, or downloaded from another data source.

For syphilis, the raw data must include a feature unique to the disease, such as the syphilis stage. Syphilis cases have the particularity that the disease can be divided into primary syphilis (corresponding to an ulcer or chancre at the site of infection), secondary syphilis (corresponding to rash, skin and mucosal lesions, lymph node lesions and the like) and tertiary syphilis (corresponding to cardiac lesions or gummas), so the stage can be used as the unique feature of the disease.

In addition, the report data also needs to include identity information, so that the original data is cleaned, the original data which does not include the identity characteristics or unique characteristics of the disease of the patient is removed, and the original data which remains after the removal operation is used as the original report data.
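The cleaning step of S101 can be sketched as follows. This is a minimal illustration only: the field names (`name`, `id_number`, `stage` and so on) are hypothetical, and the sketch assumes that a record is usable when it carries the disease feature and at least one identity feature, a threshold the text does not specify.

```python
# Hypothetical field names; the text does not prescribe a record schema.
IDENTITY_FIELDS = ("name", "id_number", "gender", "age", "address", "phone", "birth_date")
DISEASE_FIELD = "stage"  # unique disease feature, e.g. the syphilis stage


def has_value(record, field):
    """Return True if the field is present and not blank."""
    value = record.get(field)
    return value is not None and str(value).strip() != ""


def clean(raw_records):
    """Keep records that have the disease feature and at least one identity feature."""
    return [
        r for r in raw_records
        if has_value(r, DISEASE_FIELD) and any(has_value(r, f) for f in IDENTITY_FIELDS)
    ]


raw = [
    {"name": "张三", "id_number": "ID-001", "stage": "primary"},
    {"name": "李四", "stage": ""},   # disease feature missing: removed
    {"stage": "secondary"},          # identity features missing: removed
]
original_reports = clean(raw)
```

Only the first record survives the cleaning and becomes original report data.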

Furthermore, the type of identity feature comprises any combination of one or more of: name, certificate number, gender, age, address, contact, birth time; when the disease is syphilis, the disease is characterized by stages of syphilis.

Step S102, converting each identity characteristic of the patient in each original report data into a preset character string by a random algorithm to obtain data to be processed, and establishing a corresponding relation between the identity characteristic and the preset character string; the preset character strings are different after different identity characteristics are converted;

specifically, when the identity feature is a name, the conversion is: randomly converting all Chinese characters in the name into preset character strings; converting each Chinese character in the name into pinyin, and randomly converting each letter in the pinyin into a preset character; converting different Chinese characters into different preset character strings, and converting different letters into different preset characters;

for example, when the name of the patient is "Zhang San", "Zhang San" is converted as a whole into a preset character string, for example Asc; in addition, the pinyin of "Zhang San" is "zhang san", and each letter is converted into a preset character, for example "z" is converted into "a", "h" is converted into "S", and so on.

As another example, when the name of the patient is "Li Si", "Li Si" is converted as a whole into a preset character string, for example sdc; in addition, the pinyin of "Li Si" is "li si", and each letter is converted into a preset character, for example "l" is converted into "g", "i" is converted into "o", and so on.

Identity features of other types, such as the identification number and age, are converted as a whole, and the resulting preset character string uniquely identifies the feature value before conversion.

When the identity feature is a feature other than the name, the conversion is: randomly converting the feature, as a whole, into a preset character string. For example, the gender "male" is converted as a whole into "P" and the gender "female" into "U"; likewise, the birth time is converted into a preset character string.
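One possible reading of the conversion in step S102 is sketched below. The class `Pseudonymizer`, the functions `make_letter_table` and `convert_name`, the token length and the token alphabet are all assumptions made for illustration. The sketch also assumes the pinyin of the name is already available as an input, since the text does not say how pinyin is obtained.

```python
import random
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


class Pseudonymizer:
    """Maps each distinct value to a distinct random token and records the correspondence."""

    def __init__(self, token_length=6):
        self.token_length = token_length
        self.forward = {}  # (feature, value) -> token
        self.reverse = {}  # (feature, token) -> value, used later for restoration

    def convert(self, feature, value):
        key = (feature, value)
        if key not in self.forward:
            token = self._fresh_token(feature)
            self.forward[key] = token
            self.reverse[(feature, token)] = value
        return self.forward[key]

    def _fresh_token(self, feature):
        # Redraw until the token is unused, so different values never share a token.
        while True:
            token = "".join(secrets.choice(ALPHABET) for _ in range(self.token_length))
            if (feature, token) not in self.reverse:
                return token


def make_letter_table():
    """Random one-to-one substitution for pinyin letters (different letters, different characters)."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.SystemRandom().shuffle(shuffled)
    return dict(zip(letters, shuffled))


def convert_name(pseudonymizer, letter_table, name, pinyin):
    """Return (token for the whole name, pinyin with every letter substituted)."""
    whole = pseudonymizer.convert("name", name)
    spelled = "".join(letter_table.get(ch, ch) for ch in pinyin)
    return whole, spelled


p = Pseudonymizer()
table = make_letter_table()
whole, spelled = convert_name(p, table, "张三", "zhang san")
male, female = p.convert("gender", "male"), p.convert("gender", "female")
```

Because the same input always yields the same token, converted records can still be compared for equality, while `p.reverse` holds the corresponding relation needed to restore the original values.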

After the conversion, when the preset character strings after the whole conversion of the Chinese characters in the name are the same or each preset character obtained after the pinyin conversion of the Chinese characters in the name is the same, the identity characteristic of the name is determined to be the same.

Specifically, taking the name "Zhang San" again as an example: if the names in two records are both "Zhang San", or the pinyin of both names is "zhang san", the name identity features of the two records are considered the same.

This is because, in practice, names are often communicated verbally, and speech rate, accent and the like can easily cause a name to be recorded with the wrong characters.
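The name comparison described above can be illustrated with a small sketch. The type `ConvertedName` and the token values are hypothetical; the only logic taken from the text is that two names match when either the whole-name string or the converted pinyin is identical.

```python
from typing import NamedTuple


class ConvertedName(NamedTuple):
    whole: str    # preset string for the name converted as a whole
    spelled: str  # pinyin after letter-by-letter conversion


def same_name(a: ConvertedName, b: ConvertedName) -> bool:
    """Names match if the whole-name strings match or the converted pinyin matches."""
    return a.whole == b.whole or a.spelled == b.spelled


# Made-up tokens: two homophones share the converted pinyin but not the whole-name string.
zhang_san = ConvertedName(whole="Asc", spelled="aSkdr mkd")
homophone = ConvertedName(whole="Bfd", spelled="aSkdr mkd")
li_si = ConvertedName(whole="sdc", spelled="go mo")
```

Here `same_name(zhang_san, homophone)` is true because the converted pinyin is identical, whereas `same_name(zhang_san, li_si)` is false.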

Step S103, processing all the data to be processed based on a plurality of preset rules to determine repeated data; each preset rule is generated based on the identity characteristics of the patient and the unique characteristics of the disease, and the identity characteristics comprise the unique identity characteristics of the patient and/or other identity characteristics except the unique identity characteristics;

specifically, the step S103 of processing all the to-be-processed data based on a plurality of preset rules to determine the repeated data includes: if the unique characteristics of the diseases in the two or more data to be processed are the same and the unique characteristics of the identities of the patients in the two or more data to be processed are the same, determining the two or more data to be processed as initial repeated data; if the unique characteristics of diseases in two or more data to be processed are the same, and the other identity characteristics of which the quantity exceeds a preset value in the two or more data to be processed are the same, determining that the two or more data to be processed are initial repeated data; and summarizing all the initial repeated data and performing deduplication to obtain repeated data.
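A minimal sketch of the two general rules of step S103 follows, operating on already converted records. The field names, the default threshold `min_matches=3` and the pairwise comparison strategy are assumptions; the text only requires that the number of matching other identity features exceed a preset value.

```python
from itertools import combinations

UNIQUE_ID = "id_number"  # hypothetical name of the unique identity feature
OTHER_FEATURES = ("name", "gender", "age", "address", "phone", "birth_date")


def is_duplicate_pair(a, b, min_matches=3):
    """Apply the two general rules to one pair of converted records."""
    if a["stage"] != b["stage"]:          # the unique disease feature must match
        return False
    if a.get(UNIQUE_ID) and a.get(UNIQUE_ID) == b.get(UNIQUE_ID):
        return True                       # rule 1: same unique identity feature
    matches = sum(1 for f in OTHER_FEATURES if a.get(f) and a.get(f) == b.get(f))
    return matches > min_matches          # rule 2: enough other features match


def find_duplicates(records, min_matches=3):
    """Collect the indices of all records in any duplicate pair, without repeats."""
    flagged = set()
    for i, j in combinations(range(len(records)), 2):
        if is_duplicate_pair(records[i], records[j], min_matches):
            flagged.update((i, j))
    return sorted(flagged)


records = [
    {"id_number": "Qx1", "stage": "T1", "name": "Asc", "gender": "P", "age": "k7", "phone": "m2"},
    {"id_number": "Qx1", "stage": "T1", "name": "Zed", "gender": "P", "age": "k7", "phone": "m9"},
    {"id_number": "", "stage": "T1", "name": "sdc", "gender": "U", "age": "k9", "phone": "h4", "address": "w1"},
    {"id_number": "", "stage": "T1", "name": "sdc", "gender": "U", "age": "k9", "phone": "h4", "address": "w1"},
    {"id_number": "Yy2", "stage": "T2", "name": "Asc", "gender": "P", "age": "k7", "phone": "m2"},
]
duplicate_indices = find_duplicates(records)
```

Records 0 and 1 match on the unique identity feature, records 2 and 3 match on five other features, and record 4 is excluded because its stage differs. Using a set for `flagged` plays the role of the final summarizing and deduplication step. Comparing every pair is quadratic in the number of records, so a real system would likely group records by key first.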

In one particular embodiment, reference is made to table 1 below:

TABLE 1

Note: a check mark indicates that the feature is consistent, and O indicates homophones written with different characters. For cases reported within the same year, ages may differ by ±1 year.

Based on the features in Table 1, each condition corresponds to one preset rule. In the specific processing, data satisfying any one of the following conditions is regarded as repeated data.

1. Selecting from the database the case data whose identity card number and syphilis stage are both consistent; the selected case data is repeated data and is marked with a label in the database;

2. selecting from the database the case data whose name is completely or substantially consistent (homophones written with different characters) and whose gender, age (for cases reported within the same year, ages may differ by ±1 year), telephone number and syphilis stage are consistent; the selected case data is repeated data and is marked with a label in the database;

3. selecting from the database the case data whose name is completely or substantially consistent (homophones written with different characters) and whose gender, age (for cases reported within the same year, ages may differ by ±1 year), current address (down to the township or street level) and syphilis stage are consistent; the selected case data is repeated data and is marked with a label in the database;

4. selecting from the database the case data whose name is completely or substantially consistent (homophones written with different characters) and whose gender, birth date and syphilis stage are consistent; the selected case data is repeated data and is marked with a label in the database;

5. selecting from the database the case data whose gender, telephone number, age (for cases reported within the same year, ages may differ by ±1 year) and syphilis stage are consistent; the selected case data is repeated data and is marked with a label in the database;

Finally, the labels obtained under the 5 criteria are integrated, and duplicates among the labeled data are removed.

Specifically, the repetition determination may be performed based on other features and rules than the above features and rules.
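The five criteria can be written as predicates over pairs of records, as in the following sketch. Several points are assumptions rather than part of the described method: the field names are hypothetical; age is kept as a number so that the ±1 year tolerance can be evaluated (the text does not explain how this tolerance is applied to converted values); and the "same year" condition on the tolerance is omitted for brevity.

```python
from itertools import combinations


def name_ok(a, b):
    """Identical whole-name string, or identical converted pinyin (homophones)."""
    return a["name_whole"] == b["name_whole"] or a["name_spelled"] == b["name_spelled"]


def age_ok(a, b):
    """Ages may differ by at most one year."""
    return abs(a["age"] - b["age"]) <= 1


def eq(a, b, *fields):
    return all(a[f] == b[f] for f in fields)


RULES = {
    1: lambda a, b: bool(a["id_number"]) and eq(a, b, "id_number", "stage"),
    2: lambda a, b: name_ok(a, b) and age_ok(a, b) and eq(a, b, "gender", "phone", "stage"),
    3: lambda a, b: name_ok(a, b) and age_ok(a, b) and eq(a, b, "gender", "address", "stage"),
    4: lambda a, b: name_ok(a, b) and eq(a, b, "gender", "birth_date", "stage"),
    5: lambda a, b: age_ok(a, b) and eq(a, b, "gender", "phone", "stage"),
}


def tag_duplicates(records):
    """Return {record index: set of rule numbers that flagged it}."""
    tags = {}
    for i, j in combinations(range(len(records)), 2):
        for number, rule in RULES.items():
            if rule(records[i], records[j]):
                tags.setdefault(i, set()).add(number)
                tags.setdefault(j, set()).add(number)
    return tags


def rec(id_number, whole, spelled, gender, age, phone, address, birth, stage):
    return dict(id_number=id_number, name_whole=whole, name_spelled=spelled, gender=gender,
                age=age, phone=phone, address=address, birth_date=birth, stage=stage)


records = [
    rec("Qx1", "Asc", "aSkdr mkd", "P", 30, "m2", "w1", "b1", "T1"),
    rec("Qx1", "Asc", "aSkdr mkd", "P", 30, "m2", "w1", "b1", "T1"),  # identical report
    rec("Zz9", "Bfd", "aSkdr mkd", "P", 31, "m7", "w1", "b5", "T1"),  # homophone, same address
    rec("Yy2", "sdc", "go mo", "U", 52, "h4", "w8", "b9", "T2"),      # unrelated case
]
tags = tag_duplicates(records)
```

Keying `tags` by record index integrates the labels from all five criteria, so a record flagged by several rules still appears only once.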

Step S104, excluding, from the repeated data, the data with the earliest disease reporting time, and setting the remaining data as final repeated data;

specifically, the repeated data includes the disease reporting time, i.e. the time at which the case was reported. After the duplicate checking of this scheme, the first report of a case also appears in the repeated data, so the report with the earliest reporting time must be excluded from the repeated cases; the remaining repeated data is the final repeated data.
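Step S104 can be sketched as below. The sketch assumes that the repeated data has already been grouped so that each group holds the reports of one case, and that each record has a hypothetical `report_date` field; neither the grouping nor the field name is given in the text.

```python
from datetime import date


def final_duplicates(groups):
    """For each group of reports of the same case, drop the earliest report and keep the rest."""
    final = []
    for group in groups:
        ordered = sorted(group, key=lambda r: r["report_date"])
        final.extend(ordered[1:])  # ordered[0] is the first report, not a repeated one
    return final


groups = [
    [
        {"case": "Asc", "report_date": date(2020, 5, 1)},
        {"case": "Asc", "report_date": date(2019, 3, 12)},  # earliest: excluded
        {"case": "Asc", "report_date": date(2021, 1, 20)},
    ],
    [
        {"case": "sdc", "report_date": date(2020, 7, 7)},   # earliest: excluded
        {"case": "sdc", "report_date": date(2020, 9, 9)},
    ],
]
result = final_duplicates(groups)
```

Of the five reports, the two first reports are excluded and the remaining three form the final repeated data.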

Step S105, associating the final repeated data with the corresponding relation and then storing them respectively.

Specifically, after the final repeated data is obtained, in application scenarios that require it, the final repeated data can be restored to the original data on the basis of the corresponding relation.
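Restoration from the corresponding relation might look like the following sketch. It assumes the correspondence is stored as a dictionary keyed by (feature, preset string), which is one possible representation and is not specified in the text; the token values are made up.

```python
def restore(record, reverse_mapping):
    """Replace each preset string with its original value; unmapped fields pass through."""
    return {
        field: reverse_mapping.get((field, token), token)
        for field, token in record.items()
    }


# Hypothetical correspondence built during the conversion step.
reverse_mapping = {
    ("name", "Asc"): "张三",
    ("gender", "P"): "male",
}
converted = {"name": "Asc", "gender": "P", "stage": "T1"}
restored = restore(converted, reverse_mapping)
```

Fields without an entry in the correspondence, such as the disease stage here, are returned unchanged.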

In addition, after the final repeated data is obtained, other analysis processing can be performed, for example, statistical analysis can be performed on the final repeated data according to factors such as year, area and the like. In addition, the final repeated data can be handed over to various cities for rechecking the data.
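As an illustration of such follow-up analysis, the final repeated data can be counted by year and area. The fields `report_date` (an ISO date string) and `region`, as well as the sample values, are hypothetical.

```python
from collections import Counter


def count_by_year_and_region(final_duplicates):
    """Count repeated reports per (year, region); expects ISO dates such as '2020-03-01'."""
    return Counter((r["report_date"][:4], r["region"]) for r in final_duplicates)


final_duplicates = [
    {"report_date": "2020-03-01", "region": "City A"},
    {"report_date": "2020-11-15", "region": "City A"},
    {"report_date": "2021-02-02", "region": "City B"},
]
counts = count_by_year_and_region(final_duplicates)
```

The resulting counter gives the number of repeated reports for each combination of year and area.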

Example 2

Embodiment 2 of the present invention discloses a data processing apparatus, which is applied to report data of diseases, and as shown in FIG. 2, the apparatus includes:

an obtaining module 201, configured to obtain multiple original report data of the same disease;

a conversion module 202, configured to convert each identity feature of the patient in each original report data into a preset character string by using a random algorithm, obtain data to be processed, and establish a corresponding relationship between the identity feature and the preset character string; the preset character strings are different after different identity characteristics are converted;

the processing module 203 is configured to process all the to-be-processed data based on a plurality of preset rules to determine duplicate data; each preset rule is generated based on the identity characteristics of the patient and the unique characteristics of the disease, and the identity characteristics comprise the unique identity characteristics of the patient and/or other identity characteristics except the unique identity characteristics;

the removing module 204 is configured to exclude, from the repeated data, the data with the earliest disease reporting time, and to set the remaining data as final repeated data;

and the storage module 205 is configured to associate the final duplicate data with the corresponding relationship and then store the final duplicate data respectively.

In a specific embodiment, the obtaining module 201 is configured to: acquiring a plurality of original data of the same disease; performing data cleaning on all the original data to remove the original data which does not comprise the identity features or the unique features of the diseases; setting the remaining original data as original report data.

In a specific embodiment, the type of identity feature comprises any combination of one or more of the following: name, certificate number, gender, age, address, contact, birth time; when the disease is syphilis, the disease is characterized by stages of syphilis.

In a specific embodiment, when the identity is a name, the conversion is: randomly converting all Chinese characters in the name into preset character strings; converting each Chinese character in the name into pinyin, and randomly converting each letter in the pinyin into a preset character; converting different Chinese characters into different preset character strings, and converting different letters into different preset characters; when the identity feature is other than name, the conversion is: and converting all the other characteristics into preset character strings at random.

In a specific embodiment, the processing module 203 is configured to: if the unique characteristics of the diseases in the two or more data to be processed are the same and the unique characteristics of the identities of the patients in the two or more data to be processed are the same, determining the two or more data to be processed as initial repeated data; if the unique characteristics of diseases in two or more data to be processed are the same, and the other identity characteristics of which the quantity exceeds a preset value in the two or more data to be processed are the same, determining that the two or more data to be processed are initial repeated data; and summarizing all the initial repeated data and performing deduplication to obtain repeated data.
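The cooperation of modules 201 to 205 can be pictured as a simple pipeline. The class `DataProcessingDevice` and the stand-in functions below are hypothetical placeholders that only show how the modules hand data to one another; in particular, the stand-in conversion is not random and the stand-in processing groups records by a single field.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class DataProcessingDevice:
    acquire: Callable[[], List[Record]]                    # acquisition module 201
    convert: Callable[[List[Record]], List[Record]]        # conversion module 202
    process: Callable[[List[Record]], List[List[Record]]]  # processing module 203
    cull: Callable[[List[List[Record]]], List[Record]]     # removing module 204
    store: Callable[[List[Record]], None]                  # storage module 205

    def run(self) -> List[Record]:
        final = self.cull(self.process(self.convert(self.acquire())))
        self.store(final)
        return final


def group_by_id(records):
    """Stand-in for module 203: group records sharing an id, keep groups of two or more."""
    groups = {}
    for r in records:
        groups.setdefault(r["id"], []).append(r)
    return [g for g in groups.values() if len(g) > 1]


def drop_earliest(groups):
    """Stand-in for module 204: exclude the earliest report of each group."""
    return [r for g in groups for r in sorted(g, key=lambda x: x["t"])[1:]]


saved: List[Record] = []
device = DataProcessingDevice(
    acquire=lambda: [{"id": "A", "t": 2}, {"id": "A", "t": 1}, {"id": "B", "t": 3}],
    convert=lambda rs: [dict(r, id="tok_" + r["id"]) for r in rs],
    process=group_by_id,
    cull=drop_earliest,
    store=saved.extend,
)
final = device.run()
```

Passing the modules in as callables keeps each one replaceable, which mirrors the statement that the modules may exist separately or be integrated.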

Example 3

Embodiment 3 of the present invention further discloses a terminal, as shown in fig. 3, including a processor and a memory, where the memory stores an application program, and the application program executes the data processing method described in embodiment 1 when running on the processor.

Example 4

Embodiment 4 of the present invention further discloses a storage medium, as shown in fig. 4, where an application program is stored in the storage medium, and the application program executes the data processing method described in embodiment 1 when running on a processor.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, each functional module or unit in each embodiment of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention or a part of the technical solution that contributes to the prior art in essence can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims

1. A method of data processing, applied to report data on a disease, the method comprising:

acquiring a plurality of original report data of the same disease;

2. The method of claim 1, wherein obtaining multiple raw report data for the same disease comprises:

acquiring a plurality of original data of the same disease;

setting the remaining original data as original report data.

3. The method of claim 1, wherein the type of identity feature comprises any combination of one or more of: name, certificate number, gender, age, address, contact, birth time;

4. The method of claim 1, wherein when the identity feature is a name, the converting is to: randomly converting all Chinese characters in the name into preset character strings; converting each Chinese character in the name into pinyin, and randomly converting each letter in the pinyin into a preset character; converting different Chinese characters into different preset character strings, and converting different letters into different preset characters;

5. The method as claimed in claim 4, wherein the identity of the name is determined to be the same when the predetermined character string after the entire conversion of the chinese characters in the name is the same or each predetermined character obtained after the pinyin conversion of the chinese characters in the name is the same.

6. The method according to claim 1, wherein the "processing all the data to be processed based on a plurality of preset rules to determine duplicate data" comprises:

7. An apparatus for data processing, applied to report data of a disease, the apparatus comprising:

8. The apparatus of claim 7, wherein the acquisition module is to:

acquiring a plurality of original data of the same disease;

setting the remaining original data as original report data.

9. A terminal, characterized in that it comprises a processor and a memory, in which an application program is stored, which, when run on the processor, performs the method of data processing according to any one of claims 1 to 6.

10. A storage medium, in which an application program is stored, which, when run on a processor, performs the method of data processing according to any one of claims 1-6.