CN112035451A - Data verification optimization processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112035451A
CN112035451A
Authority
CN
China
Prior art keywords
data, target, terms, term, similar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010869181.7A
Other languages
Chinese (zh)
Inventor
宋京青
陈淼波
刘颖权
杨苑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Linglong Software Technology Co ltd
Original Assignee
Shanghai Linglong Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Linglong Software Technology Co ltd
Priority to CN202010869181.7A
Publication of CN112035451A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor, of structured data, e.g. relational data
    • G06F 16/21: Design, administration or maintenance of databases
    • G06F 16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/80: Information retrieval; Database structures therefor; File system structures therefor, of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a data verification optimization processing method and apparatus, an electronic device, and a storage medium. The processing method comprises: reading data to be checked from a database and verifying it, so as to determine target data that does not meet a predefined data specification; determining, in an industry term knowledge base, a target canonical term that matches the target data, the knowledge base recording a plurality of canonical terms that all conform to the data specification; and writing the target canonical term into the database to iterate the target data. The invention provides both fully automatic and human-in-the-loop optimization processing.

Description

Data verification optimization processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a processing method and apparatus for data verification optimization, an electronic device, and a storage medium.
Background
The present era is one of data explosion, and large volumes of data can underpin scientific decision-making. However, data quality problems are widespread: erroneous, low-quality data makes it difficult to use big data effectively, so acquired data must be cleaned and optimized.
In the prior art, data cleaning and optimization is mostly done manually, e.g. by data labeling, which is time-consuming, labor-intensive, and inefficient, and in turn seriously hinders the wide application of big data. Moreover, most existing cleaning approaches rely on an intermediate database and clean the data in a static environment, and therefore cannot meet the need to optimize large amounts of data dynamically while it is in transit.
Disclosure of Invention
The invention provides a data verification optimization processing method and apparatus, an electronic device, and a storage medium, aiming to solve the problem that data cleaning and optimization is time-consuming, labor-intensive, and inefficient.
According to a first aspect of the present invention, there is provided a processing method for data verification optimization, including:
reading and checking data to be detected from a database to determine target data which do not meet predefined data specifications;
determining a target specification term matched with the target data in an industry term knowledge base, wherein a plurality of specification terms are recorded in the industry term knowledge base, and all the specification terms conform to the data specification;
writing the target specification term to the database, and iterating the target data.
Optionally, the data to be checked is verified dynamically and in real time during transmission, as it is read from the database.
Optionally, in the industry term knowledge base, determining a target specification term matching the target data includes:
determining, among the plurality of canonical terms, N similar canonical terms that are semantically similar to the target data, where N ≥ 1;
determining the target canonical term among the N similar canonical terms.
Optionally, determining the target canonical term among the N similar canonical terms includes any one of:
feeding back the N similar canonical terms to relevant personnel, and determining the target canonical term in response to an approval operation by those personnel;
when there is exactly one similar canonical term, automatically taking that unique term as the target canonical term;
performing semantic analysis on the similar terms, and/or on the similar terms combined with the original text to which the target data belongs, and automatically selecting the most similar one of the N similar canonical terms as the target canonical term.
Optionally, determining N similar canonical terms that are semantically similar to the target data, among the plurality of canonical terms, includes:
identifying the target data using the trained recognition model, determining the N similar canonical terms; the recognition model is configured to be able to recognize, for the input data, one or more canonical terms that are semantically similar to the input data.
Optionally, after determining the target canonical term from the N similar canonical terms, the method further includes:
inputting the target specification terms and the target data into the recognition model to train the recognition model with the target specification terms and the target data.
Optionally, after reading and verifying the data to be detected from the database, the method further includes:
evaluating the data quality of the data to be tested according to the verification result of the data to be tested to obtain a data quality evaluation result;
and determining a corresponding data quality optimization suggestion according to the verification result of the data to be checked and/or the data quality evaluation result, wherein the suggestion is used to guide the maintainer of the data to improve its quality.
Optionally, the data to be tested is verified by using a universal template corresponding to the data specification; the generic template specifies data for each of a plurality of data nodes.
Optionally, the data specification is an XML- or JSON-based data specification, and the corresponding generic template is based on the XML Schema or JSON Schema format.
Optionally, the generic template is designed based on the JSON Schema format and CDIF.
According to a second aspect of the present invention, there is provided a data verification optimized processing apparatus, including:
the reading and checking module is used for reading and checking the data to be detected from the database and determining the target data which do not accord with the predefined data specification;
the determining module is used for determining a target specification term matched with the target data in an industry term knowledge base, wherein a plurality of specification terms are recorded in the industry term knowledge base, and all the specification terms conform to the data specification;
and the iteration module is used for writing the standard terms matched with the target data into the database and iterating the target data.
According to a third aspect of the invention, there is provided an electronic device comprising a processor and a memory,
the memory is used for storing codes;
the processor is configured to execute the code in the memory to implement the method according to the first aspect and its alternatives.
According to a fourth aspect of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, carries out the method of the first aspect and its alternatives.
According to the data verification optimization processing method and apparatus, electronic device, and storage medium described above, data verification can discover target data that does not conform to the predefined data specification; left in place, such data would hinder further data applications built on the database.
Furthermore, because the invention can verify transmitted data dynamically and in real time while it is being read, it can conveniently be deployed on data middleware (buses, gateways, and the like), realizing real-time data quality verification and optimization during data transmission.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a first flowchart illustrating a processing method for data verification optimization according to an embodiment of the present invention;
FIG. 2 is a first flowchart illustrating the step S12 according to an embodiment of the present invention;
FIG. 3 is a second flowchart illustrating the step S12 according to an embodiment of the present invention;
FIG. 4 is a second flowchart illustrating a processing method for data verification optimization according to an embodiment of the present invention;
FIG. 5 is a block diagram of a processing apparatus for data verification optimization according to an embodiment of the present invention;
FIG. 6 is a block diagram of a processing apparatus for data verification optimization according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device in an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
The invention provides a processing method and device for data verification optimization, electronic equipment and a storage medium, which can be applied to any service equipment, such as a server, a terminal and a computer, and meanwhile, the service equipment can also interact with a maintenance end (such as a database server) of a database.
The data verification optimization processing method (and the apparatus that applies it) can process databases maintained by different maintainers; the database and the data in it may vary arbitrarily across application scenarios and are not limited to any one of the exemplary scenarios below.
When applied to a medical scenario, the data in the database may be case data.
When applied to a government approval scenario, the data in the database may be various table data in the government approval process. The data in the table are ensured to meet the standard requirements, the examination and approval delay caused by the fact that the data do not meet the requirements in the examination and approval process is avoided, the range of some data is regulated in the quality control process, and automatic alarm exceeding the range is formed.
In addition to the above examples, the embodiment of the present invention may also perform data verification and optimization with respect to an automation specification of software programming, so as to improve program quality, and may also perform data verification and optimization with respect to indexes and specifications of academic articles, so as to improve retrieval efficiency of the articles.
Meanwhile, the following examples mainly use case data for explanation, but it should be understood that the solution of the embodiments of the present invention is not limited to any one exemplary application scenario.
Referring to fig. 1, an embodiment of the present invention provides a processing method for data verification optimization, including:
S11: reading data to be checked from a database and verifying it, to determine target data that does not meet a predefined data specification;
S12: determining a target canonical term matching the target data in an industry term knowledge base;
a plurality of canonical terms are recorded in the industry term knowledge base, and all of them conform to the data specification;
S13: writing the canonical term matched with the target data into the database, and iterating the target data.
The data to be checked is verified dynamically and in real time while it is being read from the database, i.e., during transmission. The embodiment of the present invention can therefore realize real-time data quality verification and optimization during data transmission, and can conveniently be deployed on data middleware such as buses and gateways (but is not limited thereto).
The data specification may be a data specification which is self-defined by a client according to application requirements.
Thus, in one embodiment, before step S11, the client may define formatted and unformatted data specifications under a management portal provided by the service device, according to the application scenario, including but not limited to: data type, number of parameters, field length, field type, and data enumeration values.
In step S11, according to the data specification, the service device connects to the database carrying the data to be checked through a standard interface (API), automatically reads the various types of data in the database, and verifies the data in real time against the data specification as it is read. Data that does not meet the specification (i.e., the target data) can be written into a cache library of the service device, which constitutes determining the target data; the next optimization service operation is then triggered (i.e., the data is optimized through steps S12 and S13).
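The read-verify-cache flow of step S11 can be sketched as follows. This is a minimal illustration; the record fields, the rules in the specification, and the in-memory cache are assumptions for demonstration, not anything the patent specifies:

```python
# Minimal sketch of step S11: stream records from a source, check each
# against a user-defined specification, and cache the non-conforming ones.
# The field names and rules below are illustrative assumptions.

SPEC = {
    "surgery_mode": {"type": str, "enum": {"thoracoscopic", "open", "robotic"}},
    "tumor_max_diameter_mm": {"type": float, "min": 0.0},
    "notes": {"type": str, "max_length": 200},
}

def violations(record):
    """Return a list of (field, reason) pairs for one record."""
    problems = []
    for field, rule in SPEC.items():
        value = record.get(field)
        if not isinstance(value, rule["type"]):
            problems.append((field, "wrong type or missing"))
            continue
        if "enum" in rule and value not in rule["enum"]:
            problems.append((field, "not an allowed enumeration value"))
        if "max_length" in rule and len(value) > rule["max_length"]:
            problems.append((field, "exceeds field length"))
        if "min" in rule and value < rule["min"]:
            problems.append((field, "below allowed range"))
    return problems

def read_and_verify(records):
    """Verify records as they stream by; cache the target (bad) data."""
    cache = []  # stands in for the service device's cache library
    for record in records:
        found = violations(record)
        if found:
            cache.append({"record": record, "problems": found})
    return cache

stream = [
    {"surgery_mode": "thoracoscopic", "tumor_max_diameter_mm": 12.5, "notes": "ok"},
    {"surgery_mode": "laparoscopic", "tumor_max_diameter_mm": -1.0, "notes": "bad"},
]
cache = read_and_verify(stream)
```

The cached entries then feed steps S12 and S13 in place of an intermediate static database.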
In one embodiment, the data to be checked may be verified using a generic template corresponding to the data specification. The content of the generic template can be designed from common industry knowledge; specifically, the generic template specifies the data for each of a plurality of data nodes.
Further, the data specification may be an XML- or JSON-based data specification, with the corresponding generic template based on the XML Schema or JSON Schema format.
In a further specific example, the generic template is designed based on the JSON Schema format and CDIF.
In a specific example, the data nodes of the case general template corresponding to the case data may be, for example:
the data nodes can be expanded, for example, the operation mode can be a thoracic operation mode, and the secondary expansion can be lung nodule positioning technology, a television thoracoscope operation, an open thoracic operation, a robot thoracic operation (reservation) and a transit open thoracic operation.
Wherein the TV thoracoscopic surgery expands again to: single-hole, two-hole, three-hole, four-hole thoracoscopic surgery; according to a special disease template, important information seen in the operation is selected, for example, structural nodes in the radical lung cancer operation are as follows: adhesion in the thoracic cavity, pleural effusion, other occupation in the thoracic cavity, tumor localization (internal nodes: left upper lobe, left lower lobe, right upper lobe, right middle lobe, right lower lobe), tumor maximum diameter, pleural depression on the surface of the tumor, tumor invasion (internal node involvement range: adjacent lung lobe, lobe bronchus, main bronchus, chest wall, etc.), operative resection range, lymph node clearing condition (left side: 5, 6, 7, 9, 10L, 11; right side: 2R, 4R, 7, 9, 10R, 11), predicted metastatic lymph nodes, operative complications (for example: respiratory system complications-persistent air leakage, pneumothorax, post-operative pleural effusion, etc.), total operative duration, intraoperative blood loss, post-operative diagnosis, post-operative treatment measures (automatic acquisition), etc.
The data nodes are not limited to the above examples; further examples can be found in "Practice of structured electronic medical records in thoracic surgery based on standardized terms" in the Chinese Journal of Lung Cancer.
In a specific example, for the above general case template, the JSON Schema format together with the CDIF specification can be used to design a general case-record template specification: data is constrained through schema verification, and data quality is improved through a closed-loop feedback mechanism.
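As a hedged illustration of such a template, the fragment below mimics a JSON Schema style node definition and checks a record against it with a tiny hand-rolled validator. The node names, enumeration values, and the validator itself are assumptions for illustration; the actual template and the CDIF specification are not reproduced in the patent text:

```python
# A toy fragment of a general case template in JSON Schema style,
# plus a minimal validator covering only the keywords used here
# (required, type, enum). The content is assumed, not the real template.
import json

CASE_TEMPLATE = json.loads("""
{
  "type": "object",
  "required": ["surgical_approach", "tumor_localization"],
  "properties": {
    "surgical_approach": {
      "type": "string",
      "enum": ["VATS", "open", "robotic", "conversion_to_open"]
    },
    "tumor_localization": {
      "type": "string",
      "enum": ["left_upper_lobe", "left_lower_lobe", "right_upper_lobe",
               "right_middle_lobe", "right_lower_lobe"]
    }
  }
}
""")

def check_node(instance, schema):
    """Validate a dict against the tiny JSON Schema subset used above."""
    errors = []
    for field in schema.get("required", []):
        if field not in instance:
            errors.append(f"{field}: required node missing")
    for field, sub in schema.get("properties", {}).items():
        if field not in instance:
            continue
        value = instance[field]
        if sub.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{field}: expected string")
        elif "enum" in sub and value not in sub["enum"]:
            errors.append(f"{field}: not a canonical enumeration value")
    return errors

record = {"surgical_approach": "VATS", "tumor_localization": "left_lobe"}
errors = check_node(record, CASE_TEMPLATE)
```

In a production system a full JSON Schema validator would replace `check_node`; the point here is only how schema verification constrains the data of each node.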
In one embodiment, referring to fig. 2, step S12 may include:
s121: determining N similar canonical terms that are semantically similar to the target data, among the plurality of canonical terms; wherein N is more than or equal to 1;
s122: determining the target canonical term among the N similar canonical terms.
The content of the canonical terms can vary freely according to the standards of each industry. The canonical terms in the corresponding industry term knowledge base can be maintained in advance or updated in real time, and maintenance of the knowledge base can run in parallel with data verification and optimization without the two interfering with each other.
In one example, the canonical terms may be standardized diagnosis and treatment terms: uniformly named, strictly defined versions of the basic language used in clinical diagnosis and treatment, so that selecting or establishing the most appropriate term resolves the irregular, inconsistent phenomena of multiple terms for one concept, multiple wordings, synonyms, and near-synonyms.
In a specific implementation process, referring to fig. 3, step S121 may include:
s1211: identifying the target data using the trained recognition model, determining the N similar canonical terms.
The recognition model can be configured to recognize, for given input data, one or more canonical terms semantically similar to that data. The recognition model may be an artificial intelligence model that performs semantic analysis on text and industry data and normalizes it against industry terminology, thereby realizing the optimization and repair process.
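The patent does not specify the recognition model's internals. As an illustrative stand-in, a simple string-similarity lookup can sketch the contract of step S1211, i.e. "return the N canonical terms closest to the target data"; the knowledge base contents and the difflib-based similarity below are assumptions, not the trained model:

```python
# Stand-in for step S1211: return the N canonical terms most similar to
# the target data. A real system would use a trained semantic model;
# difflib's string similarity is only an illustrative substitute.
from difflib import get_close_matches

KNOWLEDGE_BASE = [
    "video-assisted thoracoscopic surgery",
    "open thoracic surgery",
    "robotic thoracic surgery",
    "lung nodule localization",
]

def similar_canonical_terms(target_data, n=3):
    """Return up to n canonical terms close to target_data (best first)."""
    return get_close_matches(target_data, KNOWLEDGE_BASE, n=n, cutoff=0.4)

candidates = similar_canonical_terms("thoracoscopic surgery")
```

Swapping `get_close_matches` for an embedding-based model preserves the same interface: target data in, N ranked similar canonical terms out.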
Meanwhile, the recognition capability of the recognition model may also be updated in a real-time optimization manner, and further, after step S122, the method may further include:
s123: inputting the target specification terms and the target data into the recognition model to train the recognition model with the target specification terms and the target data.
In the above solution, the target canonical term finally determined in step S122 may be the optimal canonical term (that is, the most similar one, or the one best suited for iteration), determined automatically or manually. Feeding this optimal canonical term back to the recognition model effectively improves the model's recognition accuracy and optimizes its recognition capability.
In one embodiment, step S122 may include any one of the following:
feeding back the N similar canonical terms to relevant personnel, and determining the target canonical term in response to an approval operation by those personnel; this variant involves manual intervention;
when there is exactly one similar canonical term, automatically taking that unique term as the target canonical term; this is one form of automatic term determination;
automatically selecting the most similar of the N similar canonical terms as the target canonical term; this is another form of automatic determination. A specific example: perform semantic analysis on the similar terms, and/or on the similar terms combined with the original text to which the target data belongs, and select the single most similar canonical term as the target canonical term. In this case, the N similar canonical terms can each be written into a temporary database, and the above processing performed on that basis.
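The selection variants above can be sketched in one function. The `semantic_score` placeholder and the `ask_human` callback are assumed stand-ins for the semantic analysis and the personnel-approval path, respectively; neither is specified in the patent:

```python
# Sketch of step S122: choose the target canonical term from N candidates.
# semantic_score is a placeholder for the semantic analysis described in
# the text; the human-approval path is injected as a callback.
from difflib import SequenceMatcher

def semantic_score(term, target_data, context=""):
    """Placeholder similarity: string ratio against data plus context."""
    return SequenceMatcher(None, term, target_data + " " + context).ratio()

def choose_target_term(candidates, target_data, context="", ask_human=None):
    if not candidates:
        return None
    if len(candidates) == 1:      # unique candidate: take it automatically
        return candidates[0]
    if ask_human is not None:     # manual-intervention variant
        return ask_human(candidates)
    # automatic variant: most similar under the placeholder semantic analysis
    return max(candidates, key=lambda t: semantic_score(t, target_data, context))

chosen = choose_target_term(["pneumothorax"], "pneumo thorax")
```

Passing `context` lets the automatic path analyze the similar terms combined with the original text to which the target data belongs, as the third variant describes.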
In a specific example, after the machine (for example, the recognition model, but not limited to the recognition model) determines the terms (i.e., N similar canonical terms) similar to the meaning of the target data, the recommended N similar canonical terms may be written into a temporary modification document, and the target canonical terms in the document may be officially confirmed after the approval confirmation of the doctor, so as to rewrite the target canonical terms into the database.
The above automatic recognition enables automatic repair: repair and optimization can proceed without manual participation, and once the recognition model is incorporated, experience accumulates continuously through machine learning, further optimizing recognition capability.
The recognition model thus also optimizes the repair process itself. During repair the machine keeps learning, for example learning which canonical terms new words and sentences should map to; the learning results continuously enrich the data matched to standardized terms and improve recognition accuracy, so the model gains better repair capability and suits ever wider data.
Through the combined semantic analysis, similar canonical terms can be located accurately, and lookup accuracy improves with continued optimization and learning.
Meanwhile, in some schemes, if target data does meet the data specification and is sufficiently specialized, or could itself serve as a standardized term, it can also be written into the knowledge base manually or automatically, enriching the term knowledge base. For example, if certain words and phrases occur frequently and their corresponding terms in the knowledge base can be uniquely determined (manually or automatically), they too can be written into the knowledge base.
The steps S11 to S13 may be repeated to continuously promote the adaptive machine learning process, optimize the correction efficiency, and improve the data quality.
In one embodiment, in addition to the above verification and iterative optimization of the data, data quality can also be evaluated. Combining the later evaluation and recommendation stages with the earlier verification and iterative optimization, quality verification and semantic analysis can be run again in a repeating loop, so that the data in the database is optimized step by step and continuously approaches the specification requirements.
Therefore, referring to fig. 4, after step S11, the method may further include:
s14: evaluating the data quality of the data to be tested according to the verification result of the data to be tested to obtain a data quality evaluation result;
s15: and determining a corresponding data quality optimization suggestion according to the verification result of the to-be-detected data and/or the data quality evaluation result, wherein the data quality optimization suggestion is used for guiding a maintenance main body of the to-be-detected data to improve the data quality.
Further, in some embodiments, only step S14 may be implemented, without step S15; the data quality evaluation result is then fed back to the corresponding data maintainer (such as a hospital) without a specific optimization suggestion.
In a specific implementation of step S14, a summary analysis may be performed on the verification results, a score given to the data source (i.e., the corresponding data or database under test) with reference to industry quality standards, a grade assigned according to the score, and a report issued.
The verification result may include, for example, which data is the target data (i.e., which data does not conform to the specification), and further the node positions, quantity, and severity of the quality problems. Quantifying these verification results and applying weighted statistics yields an objective composite score, on which data quality can be graded; maintainers of different databases can then be compared (e.g., the data quality of different hospitals), and finally the analysis can be integrated into a report and transmitted to the database maintainers (e.g., hospitals).
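The weighted scoring and grading can be sketched as follows. The severity weights, the 100-point deduction scheme, and the grade thresholds are illustrative assumptions rather than the industry quality standard the patent refers to:

```python
# Sketch of the composite scoring in step S14: quantify verification
# results with per-severity weights and grade the data source. Weights,
# deductions, and thresholds are assumed for illustration.

SEVERITY_WEIGHT = {"low": 1.0, "medium": 3.0, "high": 10.0}

def composite_score(problems):
    """problems: list of dicts with a 'severity' key; 100 = flawless."""
    deduction = sum(SEVERITY_WEIGHT[p["severity"]] for p in problems)
    return max(0.0, 100.0 - deduction)

def grade(score):
    if score >= 90:
        return "A"
    if score >= 75:
        return "B"
    if score >= 60:
        return "C"
    return "D"

problems = [
    {"node": "surgical_approach", "severity": "high"},
    {"node": "notes", "severity": "low"},
    {"node": "tumor_localization", "severity": "medium"},
]
score = composite_score(problems)
report = {"score": score, "grade": grade(score), "problem_count": len(problems)}
```

The `report` dict stands in for the graded report that would be transmitted to the database maintainer.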
In a specific implementation of step S15, the processing results of the optimization operation (for example, the result of step S12, which may include the determined target data and target canonical terms) may be incorporated to form the quality improvement suggestion, for example suggesting that certain data be recorded as the corresponding canonical term in future maintenance.
Further, in step S15, simulation prediction may be performed and offered as a value-added service, for example simulating how the database might change after its maintainer adopts the suggestions.
In addition, improving data quality is a cyclic, iterative process: quality problems are discovered (step S11), simulated scoring is performed (step S14), statistical scores are fed back to the leadership of the maintainer (e.g., a hospital), improvement measures (whether manual intervention or automatic quality optimization) are implemented, and detection, scoring, and feedback are repeated after each round of optimization, addressing any new problems, until the composite score reaches the quality standard (i.e., the data quality evaluation result meets the preset compliance requirement). This cycle effectively drives the data maintainer (e.g., a hospital) toward compliance.
Moreover, in the above scheme, different definition standards can be referenced according to the data test results (i.e., the verification results), and data quality levels reasonably defined for different data-owning groups (i.e., data maintainers), encouraging different data owners to refine data quality further and reach higher-level standards.
Referring to fig. 5, an embodiment of the present invention further provides a processing apparatus 200 for data verification optimization, including:
a reading and checking module 201, configured to read and verify the data to be tested from the database and determine target data that does not conform to a predefined data specification;
a determining module 202, configured to determine, in an industry term knowledge base, a target canonical term that matches the target data, where a plurality of canonical terms are recorded in the industry term knowledge base and all of them conform to the data specification;
and an iteration module 203, configured to write the canonical term matched with the target data into the database and iterate the target data.
Optionally, the determining module 202 is specifically configured to:
determining, among the plurality of canonical terms, N similar canonical terms that are semantically similar to the target data, where N ≥ 1;
determining the target canonical term among the N similar canonical terms.
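The first sub-step, finding N canonical terms similar to the target data, might be sketched as follows. This is illustrative Python only: the term list and function names are invented, and plain string similarity stands in for the semantic matching the embodiment describes.

```python
from difflib import get_close_matches

# Stand-in for the similarity search over the industry term knowledge
# base. A real embodiment would use semantic matching; string similarity
# via difflib merely approximates it. The term list is invented.

CANONICAL_TERMS = ["hypertension", "type 2 diabetes mellitus",
                   "coronary heart disease"]

def find_similar_terms(target, terms=CANONICAL_TERMS, n=3, cutoff=0.5):
    """Return up to N canonical terms similar to the target data (N >= 1)."""
    return get_close_matches(target, terms, n=n, cutoff=cutoff)
```

For a misspelled entry such as "hypertention", the search returns the single close canonical term "hypertension", which the second sub-step then confirms as the target.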
Optionally, the determining module 202 is specifically configured to implement any one of the following:
feeding back the N similar canonical terms to relevant personnel, and determining the target canonical term in response to an approval operation by the relevant personnel;
when the number of similar canonical terms is one, automatically determining the unique similar canonical term as the target canonical term;
and performing semantic analysis on the similar canonical terms, and/or on the similar canonical terms in combination with the original text to which the target data belongs, and selecting the most similar canonical term as the target canonical term.
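The three selection options above can be sketched in one dispatch function. The function name, the approval callback, and the use of string similarity as a stand-in for true semantic analysis are assumptions for illustration, not part of the patent.

```python
from difflib import SequenceMatcher

def select_target_term(similar_terms, original_text="", confirm=None):
    """Pick the target canonical term from the N similar candidates."""
    if len(similar_terms) == 1:     # option 2: a unique candidate is taken directly
        return similar_terms[0]
    if confirm is not None:         # option 1: feed back to personnel for approval
        return confirm(similar_terms)
    # option 3: automatic ranking against the original text; string
    # similarity approximates the semantic analysis described above
    return max(similar_terms,
               key=lambda t: SequenceMatcher(None, t, original_text).ratio())
```

In practice the approval callback would be a user-interface action; here any callable that picks one candidate serves the same role.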
Optionally, the determining module 202 is specifically configured to:
identifying the target data using a trained recognition model to determine the N similar canonical terms, where the recognition model is configured to recognize, for input data, one or more canonical terms that are semantically similar to the input data.
Optionally, the determining module 202 is further configured to:
inputting the target canonical term and the target data into the recognition model, so as to train the recognition model with the target canonical term and the target data.
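The retraining step above, feeding each confirmed (target data, canonical term) pair back into the model, can be illustrated with a toy recognizer. A real recognition model would be fine-tuned on such pairs; the dictionary update below only shows the data flow, and all names are invented.

```python
# Toy stand-in for the recognition model's feedback training: once a
# pair is confirmed, the model recognizes that data directly next time,
# falling back to similarity search for unseen inputs.

class TermRecognizer:
    def __init__(self):
        self.learned = {}                  # confirmed data -> canonical term

    def recognize(self, data, fallback):
        """Return learned term(s) for data, else delegate to fallback search."""
        return [self.learned[data]] if data in self.learned else fallback(data)

    def train(self, data, canonical_term):
        """Feed a confirmed (data, canonical term) pair back into the model."""
        self.learned[data] = canonical_term
```

Each confirmed correction thus improves subsequent recognition, matching the iterative character of the scheme.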
Optionally, referring to fig. 6, the apparatus further includes:
an evaluation module 204, configured to evaluate the data quality of the data to be tested according to the verification result of the data to be tested, so as to obtain a data quality evaluation result;
and a suggestion module 205, configured to determine, if the data quality evaluation result does not meet a preset standard-reaching requirement, a corresponding data quality optimization suggestion according to the verification result of the data to be tested and/or the data quality evaluation result, where the data quality optimization suggestion is used to guide the maintenance subject of the data to be tested in improving data quality.
Optionally, the data to be tested is verified using a generic template corresponding to the data specification;
the generic template specifies the data of each of a plurality of data nodes.
Optionally, the data specification is an XML- or JSON-based data specification, and the corresponding generic template is a generic template based on the XML Schema or JSON Schema format.
For example, the generic template may be designed based on the JSON Schema format and CDIF.
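A template-driven check of this kind can be illustrated with a minimal validator over required fields and types. This is a hand-rolled sketch, not the patent's template format: the template contents and record fields are invented, and a real implementation would use a full JSON Schema validator rather than the simplified checks shown here.

```python
# Minimal illustration of verifying a record against a generic template
# that specifies the data of each data node. The template below mimics
# the "required"/"properties" keywords of JSON Schema in simplified
# form; field names and types are invented for this example.

TEMPLATE = {
    "required": ["diagnosis", "department"],
    "properties": {"diagnosis": str, "department": str},
}

def verify_record(record, template=TEMPLATE):
    """Return the names of data nodes that violate the template."""
    bad = [k for k in template["required"] if k not in record]   # missing nodes
    bad += [k for k, t in template["properties"].items()         # wrong types
            if k in record and not isinstance(record[k], t)]
    return bad
```

Records whose returned list is non-empty correspond to the target data that does not conform to the data specification.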
Referring to fig. 7, an embodiment of the present invention further provides an electronic device 30, including:
a processor 31; and
a memory 32 for storing executable instructions of the processor;
wherein the processor 31 is configured to perform the above-mentioned method via execution of the executable instructions.
The processor 31 can communicate with the memory 32 via a bus 33.
Embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the above-mentioned method.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (13)

1. A data verification optimization processing method, characterized by comprising the following steps:
reading and verifying data to be tested from a database to determine target data that does not conform to a predefined data specification;
determining, in an industry term knowledge base, a target canonical term that matches the target data, wherein a plurality of canonical terms are recorded in the industry term knowledge base, and all of the canonical terms conform to the data specification;
writing the target canonical term into the database, and iterating the target data.
2. The data verification optimization processing method according to claim 1, wherein the data to be tested is dynamically verified in real time while being read and transmitted from the database.
3. The data verification optimization processing method according to claim 1, wherein determining, in the industry term knowledge base, the target canonical term matching the target data comprises:
determining, among the plurality of canonical terms, N similar canonical terms that are semantically similar to the target data, where N ≥ 1;
determining the target canonical term among the N similar canonical terms.
4. The data verification optimization processing method according to claim 3, wherein determining the target canonical term among the N similar canonical terms comprises any one of:
feeding back the N similar canonical terms to relevant personnel, and determining the target canonical term in response to an approval operation by the relevant personnel;
when the number of similar canonical terms is one, automatically determining the unique similar canonical term as the target canonical term;
and performing semantic analysis on the similar canonical terms, and/or on the similar canonical terms in combination with the original text to which the target data belongs, and automatically selecting the most similar one of the N similar canonical terms as the target canonical term.
5. The data verification optimization processing method according to claim 3, wherein determining, among the plurality of canonical terms, the N similar canonical terms that are semantically similar to the target data comprises:
identifying the target data using a trained recognition model to determine the N similar canonical terms, wherein the recognition model is configured to recognize, for input data, one or more canonical terms that are semantically similar to the input data.
6. The data verification optimization processing method according to claim 5, further comprising, after determining the target canonical term among the N similar canonical terms:
inputting the target canonical term and the target data into the recognition model, so as to train the recognition model with the target canonical term and the target data.
7. The data verification optimization processing method according to any one of claims 1 to 6, further comprising, after reading and verifying the data to be tested from the database:
evaluating the data quality of the data to be tested according to the verification result of the data to be tested, so as to obtain a data quality evaluation result;
and determining a corresponding data quality optimization suggestion according to the verification result of the data to be tested and/or the data quality evaluation result, wherein the data quality optimization suggestion is used to guide the maintenance subject of the data to be tested in improving data quality.
8. The data verification optimization processing method according to any one of claims 1 to 6, wherein the data to be tested is verified using a generic template corresponding to the data specification, the generic template specifying the data of each of a plurality of data nodes.
9. The data verification optimization processing method according to claim 8, wherein the data specification is an XML- or JSON-based data specification, and the corresponding generic template is a generic template based on the XML Schema or JSON Schema format.
10. The data verification optimization processing method according to claim 9, wherein the generic template is designed based on the JSON Schema format and CDIF.
11. A data verification optimization processing apparatus, characterized by comprising:
a reading and checking module, configured to read and verify the data to be tested from the database and determine target data that does not conform to a predefined data specification;
a determining module, configured to determine, in an industry term knowledge base, a target canonical term that matches the target data, wherein a plurality of canonical terms are recorded in the industry term knowledge base, and all of the canonical terms conform to the data specification;
and an iteration module, configured to write the canonical term matched with the target data into the database and iterate the target data.
12. An electronic device, comprising a processor and a memory,
the memory is used for storing codes;
the processor configured to execute the code in the memory to implement the method of any one of claims 1 to 10.
13. A storage medium having stored thereon a computer program which, when executed by a processor, carries out the method of any one of claims 1 to 10.
CN202010869181.7A 2020-08-25 2020-08-25 Data verification optimization processing method and device, electronic equipment and storage medium Pending CN112035451A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010869181.7A CN112035451A (en) 2020-08-25 2020-08-25 Data verification optimization processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112035451A true CN112035451A (en) 2020-12-04

Family

ID=73581470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010869181.7A Pending CN112035451A (en) 2020-08-25 2020-08-25 Data verification optimization processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112035451A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7432802B2 (en) 2021-10-19 2024-02-16 之江実験室 Medical terminology normalization system and method based on heterogeneous graph neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256074A (en) * 2018-01-17 2018-07-06 链家网(北京)科技有限公司 Method, apparatus, electronic equipment and the storage medium of checking treatment
CN109284399A (en) * 2018-10-11 2019-01-29 深圳前海微众银行股份有限公司 Similarity prediction model training method, equipment and computer readable storage medium
CN110349639A (en) * 2019-07-12 2019-10-18 之江实验室 A kind of multicenter medical terms standardized system based on common therapy terminology bank
CN110826309A (en) * 2019-10-25 2020-02-21 上海市第六人民医院 System and method for generating clinical test electronic case report table
CN110909121A (en) * 2019-10-10 2020-03-24 北京东软望海科技有限公司 Method and system for medical industry data standardization

Similar Documents

Publication Publication Date Title
KR100931515B1 (en) Auto Essay Annotation System and Method
CN108595657B (en) Data table classification mapping method and device of HIS (hardware-in-the-system)
JP4860903B2 (en) How to automatically index documents
CN110442847B (en) Code similarity detection method and device based on code warehouse process management
CN113076133B (en) Deep learning-based Java program internal annotation generation method and system
CN105122208A (en) Source program analysis system, source program analysis method, and recording medium on which program is recorded
CN109522193A (en) A kind of processing method of operation/maintenance data, system and device
CN111026884A (en) Dialog corpus generation method for improving quality and diversity of human-computer interaction dialog corpus
WO2004046956A1 (en) Automated evaluation of overly repetitive word use in an essay
CN112749563A (en) Named entity identification data labeling quality evaluation and control method and system
CN112035451A (en) Data verification optimization processing method and device, electronic equipment and storage medium
CN110781673B (en) Document acceptance method and device, computer equipment and storage medium
CN114969387A (en) Document author information disambiguation method and device and electronic equipment
CN117454884B (en) Method, system, electronic device and storage medium for correcting historical character information
CN112052154A (en) Test case processing method and device
CN113569988A (en) Algorithm model evaluation method and system
US6269367B1 (en) System and method for automated identification, remediation, and verification of computer program code fragments with variable confidence factors
CN111639160A (en) Domain identification method, interaction method, electronic device and storage medium
TW201502812A (en) Text abstract editing system, text abstract scoring system and method thereof
CN108009157B (en) Statement classification method and device
CN116627804A (en) Test method, system, electronic equipment and storage medium based on artificial intelligence
CN111104503A (en) Construction engineering quality acceptance standard question-answering system and construction method thereof
CN108304330B (en) Content extraction method and device and computer equipment
CN110659200A (en) Method and system for comparing and analyzing source code and target code of airborne software
US20210382947A1 (en) Accuracy metric for regular expression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201204