CN111078639A

CN111078639A - Data standardization method and device and electronic equipment

Info

Publication number: CN111078639A
Application number: CN201911219128.6A
Authority: CN
Inventors: 张云
Original assignee: Wanghai Kangxin Beijing Technology Co Ltd
Current assignee: Wanghai Kangxin Beijing Technology Co Ltd
Priority date: 2019-12-03
Filing date: 2019-12-03
Publication date: 2020-04-28
Anticipated expiration: 2039-12-03
Also published as: CN111078639B

Abstract

The application provides a data standardization method and device and electronic equipment, and relates to the technical field of computers. The method comprises the following steps: acquiring data to be standardized and each standardized dictionary in a standardized dictionary library; determining the preset data matching model corresponding to each standardized dictionary; matching the data to be standardized with each standardized dictionary based on the corresponding preset data matching model to obtain matching results, and determining a target standardized dictionary based on the matching results; and determining a standardization result of the data to be standardized based on the target standardized dictionary. The method and device realize standardization of the data to be standardized and improve the accuracy of data standardization.

Description

Data standardization method and device and electronic equipment

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data normalization method and apparatus, and an electronic device.

Background

The data dictionary is used for defining and describing data items, data structures, data streams, data stores, processing logic, external entities and the like, and aims to describe each element in the data flow chart in detail. A data dictionary is a catalog that records database and application metadata and that a user may access.

In the current data dictionary standardization scheme, the client uploads the unknown standardized dictionary to the cloud computer terminal, so that the cloud computer terminal matches the unknown standardized dictionary with the standard dictionary and automatically establishes the corresponding relation between the two dictionaries.

Disclosure of Invention

In order to solve at least one of the problems in the prior art, embodiments of the present application provide a data normalization method, an apparatus, and an electronic device, and the technical solution provided by the embodiments of the present application is as follows:

a first aspect of the present application provides a data normalization method, including:

acquiring data to be standardized and each standardized dictionary in a standardized dictionary library;

determining preset data matching models corresponding to the standardized dictionaries;

respectively matching the data to be standardized with each standardized dictionary based on a preset data matching model to obtain matching results, and determining a target standardized dictionary based on the matching results;

based on the target normalization dictionary, a normalization result of the data to be normalized is determined.

A second aspect of the present application provides a data normalization apparatus, including:

the acquisition module is used for acquiring data to be standardized and each standardized dictionary in the standardized dictionary library;

the first determining module is used for determining preset data matching models corresponding to the standardized dictionaries;

the matching module is used for matching the data to be standardized with each standardized dictionary respectively based on a preset data matching model to obtain a matching result, and determining a target standardized dictionary based on the matching result;

and the second determination module is used for determining a normalization result of the data to be normalized based on the target normalization dictionary.

Optionally, the matching result includes a similarity matching result;

the matching module is specifically configured to, when matching the data to be standardized with each standardized dictionary respectively based on the preset data matching model to obtain matching results:

and for each standardized dictionary, inputting the data to be standardized and the standardized dictionary into a preset data matching model corresponding to the standardized dictionary to obtain a similarity matching result of the data to be standardized and the standardized dictionary.

Optionally, when determining the target standardized dictionary based on the matching result, the matching module is specifically configured to:

and determining the standardized dictionary corresponding to the maximum similarity matching result as a target standardized dictionary.

Optionally, when determining that the normalized dictionary corresponding to the maximum similarity matching result is the target normalized dictionary, the matching module is specifically configured to:

if at least one similarity matching result in the similarity matching results exceeds a matching threshold, determining the standardized dictionary corresponding to the maximum similarity matching result as a target standardized dictionary;

the device further comprises a first transceiver module, wherein the first transceiver module is used for sending the standardized dictionaries corresponding to the first N similarity matching results with high similarity to the terminal equipment corresponding to the manager if the similarity matching results do not exceed the matching threshold, receiving the determination result sent by the terminal equipment, and determining the target standardized dictionary based on the indication information containing the target standardized dictionary in the determination result, wherein N is an integer greater than 0.

Optionally, for any standardized dictionary, the data matching algorithm of the preset data matching model is determined based on the matching result of each field in the data to be standardized and the corresponding field in the standardized dictionary and the weight corresponding to each field, and the expression of the data matching algorithm is as follows:

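$$S = \sum_{k=1}^{n} M_k \, W_k \qquad \text{(formula 1)}$$

wherein $S$ is the similarity matching result of the data to be standardized and the standardized dictionary, and $M_k$ is the matching result of the k-th field in the data to be standardized and the corresponding field in the standardized dictionary;

$W_k$ is the weight of the k-th field, and $n$ is the number of fields in the data to be standardized.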

Optionally, for any normalized dictionary, the weight corresponding to each field is determined by:

acquiring training sample data, wherein the training sample data comprises unnormalized data of each sample and a sample matching result of the unnormalized data of each sample and a standardized dictionary;

and optimizing and adjusting each weight corresponding to the data matching algorithm of the standardized dictionary based on the training sample data until the matching result of each sample non-standardized data and the standardized dictionary determined based on the data matching algorithm and the corresponding sample matching result meet the preset condition.

Optionally, when matching the data to be normalized with each normalized dictionary based on the preset data matching model to obtain a matching result and determining the target normalized dictionary based on the matching result, the matching module is specifically configured to:

if the standardized result of the data to be standardized does not exist in the standardized result library, respectively matching the data to be standardized with each standardized dictionary based on a preset data matching model to obtain a matching result, determining a target standardized dictionary based on the matching result, and executing a step of determining the standardized result of the data to be standardized based on the target standardized dictionary;

the device also comprises a third determining module which is used for determining the normalization result of the data to be normalized from the normalization result library if the normalization result of the data to be normalized exists in the normalization result library.
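The lookup-before-matching behaviour described above can be sketched as a simple cache check. This is a minimal illustration under stated assumptions, not the implementation of the application: the `result_library` mapping, the `fingerprint` helper and the `match_fn` callback are hypothetical names introduced here, and writing a freshly computed result back into the library is also an assumption.

```python
import hashlib
import json
from typing import Callable, Dict

Record = Dict[str, str]  # hypothetical shape: field name -> value


def fingerprint(data: Record) -> str:
    """Hypothetical lookup key: a stable hash of the data to be standardized."""
    canonical = json.dumps(data, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def standardize_with_library(
    data: Record,
    result_library: Dict[str, dict],
    match_fn: Callable[[Record], dict],
) -> dict:
    """Return a stored result if one exists; otherwise run the matching step."""
    key = fingerprint(data)
    if key in result_library:
        # A standardized result already exists in the library: reuse it.
        return result_library[key]
    # No stored result: match against the dictionaries and derive the result.
    result = match_fn(data)
    # Assumption: the new result is stored so later requests can reuse it.
    result_library[key] = result
    return result
```

With this shape, a second request carrying the same data never reaches the matching models.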

Optionally, the apparatus further comprises a second transceiver module and an adjustment module,

after the second determination module determines the normalization result of the data to be normalized, the second transceiver module is used for receiving the manual inspection result aiming at the normalization result; and the adjusting module is used for adjusting the weight of each field in the preset data matching model corresponding to the target standardized dictionary based on the manual checking result so as to optimize the preset data matching model corresponding to the target standardized dictionary.

Optionally, the device further comprises a checking module and a cleaning module;

after the acquisition module acquires the data to be standardized and before the matching module matches the data to be standardized with each standardized dictionary based on the preset data matching model to obtain a matching result, the checking module is used for checking the data to be standardized based on a preset data checking rule; and the cleaning module is used for cleaning the checked data to be standardized based on a preset data cleaning rule.

Optionally, the data to be normalized includes at least one of:

the dictionary to be standardized, the business stock data and the business increment data.

Optionally, the at least one standardized dictionary comprises at least one of:

the dictionary comprises a national standardized dictionary, an international standardized dictionary, a standardized dictionary of expert specifications in the corresponding field of the dictionary, a standardized dictionary of an original version and a standardized dictionary of an updated version.

In a third aspect of the present application, an electronic device is provided, including:

the electronic device comprises a memory and a processor;

the memory has a computer program stored therein;

a processor for performing the method of any of the first aspects when running the computer program.

In a fourth aspect of the present application, a computer-readable medium is provided, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any one of the first aspect.

The beneficial effects of the technical solution provided by the present application are as follows:

The method obtains the data to be standardized and the standardized dictionaries, determines the preset data matching model corresponding to each standardized dictionary, matches each standardized dictionary with the data to be standardized, determines a target standardized dictionary based on the matching results, and determines the standardized result of the data to be standardized based on the target standardized dictionary. Compared with the existing scheme of matching between two dictionaries, the preset data matching model of each standardized dictionary is established in advance, so matching the data to be standardized with each standardized dictionary based on its corresponding model yields more accurate matching results and, in turn, a more accurate standardized result. In addition, the method is not limited to the standardization of dictionaries: any data that needs standardization can be standardized with this scheme, which widens the range of data that can be processed.

Drawings

In order to clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.

FIG. 1 is a schematic diagram of an embodiment of a data normalization method of the present application;

FIG. 2 is a schematic diagram of another embodiment of a data normalization method of the present application;

FIG. 3 is a schematic diagram of a data normalization apparatus according to the present application;

FIG. 4 is a schematic structural diagram of an electronic device according to the present application.

Detailed Description

Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an", and "the" include plural referents unless the context clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any one of, and all combinations of, the associated listed items.

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

It should be noted that the same thing (e.g., a drug, a consumable, or a gender definition) is described in different ways by different parties or different systems. To integrate scattered business data when collecting, counting, and analyzing the business data of each system, the data dictionary must be standardized.

The existing data standardization scheme has the following defects:

1. The scheme only provides standardization for user dictionary data, the applicable range is limited, and statistics, analysis and utilization of large-range data cannot be realized.

2. The unknown standardized dictionaries uploaded by the users have great difference, so that the matched standardized dictionaries are probably not found, and the accuracy of dictionary standardization is low even if the matched standardized dictionaries are found.

3. The unknown standardized dictionary is directly standardized without being preprocessed, so that the quality of dictionary data obtained after standardization is not high, and the effect of practical application is not good.

The application provides a data standardization method, a data standardization device and electronic equipment, and aims to solve the technical problems in the prior art.

The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Referring to fig. 1, the present application provides a data normalization method, which may be specifically executed by a server or a cloud computer, and includes:

step S101: acquiring data to be standardized and each standardized dictionary in a standardized dictionary library;

step S102: determining preset data matching models corresponding to the standardized dictionaries;

After the server acquires the data to be standardized sent by the client, the server acquires each standardized dictionary in a preset standardized dictionary library. The standardized dictionaries and the preset data matching models are in one-to-one correspondence, and each preset data matching model can be used for determining the matching result between its standardized dictionary and the data to be standardized.

It is understood that the preset data matching model is a pre-trained model, and the preset data matching model has a corresponding data matching algorithm.

Step S103: respectively matching the data to be standardized with each standardized dictionary based on a preset data matching model to obtain matching results, and determining a target standardized dictionary based on the matching results;

step S104: based on the target normalization dictionary, a normalization result of the data to be normalized is determined.

The standardized dictionary library generally contains two or more standardized dictionaries. For each standardized dictionary, the server can input the standardized dictionary and the data to be standardized into the corresponding preset data matching model to obtain the matching result of that standardized dictionary and the data to be standardized. The matching result of every standardized dictionary with the data to be standardized is obtained in this way, and the target standardized dictionary can then be determined from the standardized dictionaries based on the matching results.

Based on the target standardized dictionary, the server finds fields corresponding to all the fields in the data to be standardized in the target standardized dictionary, so that a corresponding relation file of all the fields in the data to be standardized and corresponding fields in the target standardized dictionary is formed, the corresponding relation file is a standardized result, the server sends the corresponding relation file to the client, and the client can standardize the data to be standardized through the corresponding relation file.
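Steps S101 to S104 can be sketched end to end as follows. This is an illustrative sketch under stated assumptions rather than the method of the application: a standardized dictionary is modelled here as a plain mapping from a lower-cased field name to its standardized field, `overlap_model` is a toy stand-in for a preset data matching model, and all names (`standardize`, `library`, `models`) are hypothetical.

```python
from typing import Callable, Dict, Tuple

Record = Dict[str, str]          # field name -> value of the data to be standardized
StdDictionary = Dict[str, str]   # lower-cased field name -> standardized field name
MatchModel = Callable[[Record, StdDictionary], float]


def overlap_model(data: Record, dictionary: StdDictionary) -> float:
    """Toy stand-in for a preset data matching model: share of fields found."""
    if not data:
        return 0.0
    hits = sum(1 for field in data if field.lower() in dictionary)
    return hits / len(data)


def standardize(
    data: Record,
    library: Dict[str, StdDictionary],
    models: Dict[str, MatchModel],
) -> Tuple[str, Dict[str, str]]:
    """Pick the best-matching dictionary and build the correspondence mapping."""
    # S101/S102: the dictionaries and their per-dictionary models are given.
    # S103: score the data against every dictionary with that dictionary's model.
    scores = {name: models[name](data, d) for name, d in library.items()}
    target = max(scores, key=scores.get)
    # S104: map each field of the data to its field in the target dictionary.
    target_dict = library[target]
    mapping = {
        field: target_dict[field.lower()]
        for field in data
        if field.lower() in target_dict
    }
    return target, mapping
```

The returned `mapping` plays the role of the correspondence file that the server sends back to the client.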

In this embodiment, after the data to be standardized and each standardized dictionary are obtained, each standardized dictionary is matched with the data to be standardized based on its corresponding preset data matching model, the target standardized dictionary is determined based on the matching results, and the standardized result of the data to be standardized is determined based on the target standardized dictionary. Compared with the existing scheme of matching between two dictionaries, the preset data matching model of each standardized dictionary is established in advance, so the matching results obtained with the corresponding models are more accurate, and the resulting standardized result is more accurate as well. Meanwhile, the application is not limited to the standardization of dictionaries: any data that needs standardized processing can be standardized with the scheme of the application, which widens the range of data that can be processed.

Compared with the prior art, the standardized dictionary library maintained by the method is more complete and includes every version of each standardized dictionary. Meanwhile, the method can standardize not only dictionaries but also non-dictionary data, specifically as follows:

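Optionally, the data to be normalized includes at least one of:

a dictionary to be standardized, business stock data, and business increment data.

Optionally, the at least one standardized dictionary comprises at least one of:

a national standardized dictionary, an international standardized dictionary, a standardized dictionary of expert specifications in the field corresponding to the dictionary, an original version of a standardized dictionary, and an updated version of a standardized dictionary.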

According to this technical solution, both dictionaries and other data can be standardized. Specifically, the data to be standardized can be the dictionary to be standardized, as well as at least one of the business stock data and the business increment data generated by a server in the dictionary standardization process. Business stock data refers to cached data that exists before an application program executes a service, and business increment data refers to the service data that is added after an application program executes a service.

The standardized dictionary of the existing scheme is generally the latest version dictionary, the standardized dictionary library of the application comprises a first version (V1.0) dictionary of each standard dictionary, if a certain dictionary is updated, the newly-added version of the dictionary and the first version dictionary can coexist in the standardized dictionary library, so that the requirements of different users can be met, meanwhile, the perfection of the dictionary in the standardized dictionary library also provides more standardized dictionaries for data to be standardized, and the accuracy of data standardization is improved.

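Optionally, the matching result includes a similarity matching result;

matching the data to be standardized with each standardized dictionary based on the preset data matching model to obtain matching results includes:

for each standardized dictionary, inputting the data to be standardized and the standardized dictionary into the preset data matching model corresponding to the standardized dictionary to obtain a similarity matching result of the data to be standardized and the standardized dictionary.

Optionally, determining the target standardized dictionary based on the matching results includes:

determining the standardized dictionary corresponding to the maximum similarity matching result as the target standardized dictionary.

Optionally, determining the standardized dictionary corresponding to the maximum similarity matching result as the target standardized dictionary includes:

if at least one of the similarity matching results exceeds a matching threshold, determining the standardized dictionary corresponding to the maximum similarity matching result as the target standardized dictionary;

the method further comprises:

if none of the similarity matching results exceeds the matching threshold, sending the standardized dictionaries corresponding to the top N similarity matching results to the terminal equipment corresponding to the manager, receiving the determination result sent by the terminal equipment, and determining the target standardized dictionary based on the indication information of the target standardized dictionary contained in the determination result, wherein N is an integer greater than 0.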

In this embodiment, the matching result of the data to be normalized and the normalized dictionary may be a similarity matching result, the similarity matching result may be obtained by calculating cosine similarity between the data to be normalized and the normalized dictionary, and a specific calculation process will be described in the following embodiments.

After the similarity matching results of the data to be standardized and each standardized dictionary are obtained, the standardized dictionary corresponding to the maximum similarity matching result can be determined as the target standardized dictionary. Specifically, one possible case is that at least one of the similarity matching results exceeds the matching threshold, in which case the standardized dictionary corresponding to the maximum similarity matching result is determined as the target standardized dictionary. Another possible case is that none of the similarity matching results exceeds the matching threshold. The similarity matching results are then sorted from large to small, and the top N standardized dictionaries are sent to the terminal device corresponding to the manager so that the manager can perform manual matching. If the manager finds among the top N a standardized dictionary that matches the data to be standardized (specifically, manual matching finds that each field in the data to be standardized is the same as, or highly similar to, the corresponding field in that standardized dictionary), that dictionary is determined as the target standardized dictionary; the terminal device sends the determination result, which contains indication information of the target standardized dictionary, to the server, and the server determines the target standardized dictionary based on the indication information. The probability of this outcome is relatively low, however: when the similarity matching result between a standardized dictionary and the data to be standardized is smaller than the matching threshold, manual matching is generally unlikely to find a matching standardized dictionary either. In most cases the manager finds that no matching standardized dictionary exists among the top N (or cannot be sure whether one exists), and the top N standardized dictionaries are then added to the ambiguity library.
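The two cases above (automatic selection versus escalation of the top N candidates to the manager) can be sketched as a small selection function. This is a sketch only; the function name `select_target` and its parameters are hypothetical, and how the candidates reach the manager's terminal is left out.

```python
from typing import Dict, List, Optional, Tuple


def select_target(
    scores: Dict[str, float],
    threshold: float,
    top_n: int,
) -> Tuple[Optional[str], List[str]]:
    """Return (target, []) when a score exceeds the threshold,
    otherwise (None, top-N candidates) for manual matching."""
    # Rank dictionary names by similarity matching result, largest first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    if ranked and scores[ranked[0]] > threshold:
        # At least one result exceeds the threshold: the maximum wins.
        return ranked[0], []
    # No result exceeds the threshold: hand the top N to the manager.
    return None, ranked[:top_n]
```

For example, `select_target({"a": 0.9, "b": 0.5}, 0.8, 2)` selects `"a"` directly, while scores that all stay at or below the threshold yield no target and a ranked candidate list.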

In the implementation, the target standardized dictionary can be determined by determining the similarity matching result between the data to be standardized and the standardized dictionary based on the preset data matching model, and meanwhile, for the standardized dictionary with the similarity matching result smaller than the matching threshold, the target standardized dictionary is determined by adopting a manual processing mode, so that the accuracy of standardizing the data to be standardized based on the target standardized dictionary is higher.

Further, the preset data matching model is obtained through model training, and after the preset data matching model is obtained, the preset data matching model can be applied to a standardization process of data to be standardized, and the training process and the using process of the preset data matching model are respectively described below.

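Optionally, for any standardized dictionary, the data matching algorithm of the preset data matching model is determined based on the matching result of each field in the data to be standardized and the corresponding field in the standardized dictionary and the weight corresponding to each field, and the expression of the data matching algorithm is as follows:

$$S = \sum_{k=1}^{n} M_k \, W_k \qquad \text{(formula 1)}$$

wherein $S$ is the similarity matching result, $M_k$ is the matching result of the k-th field in the data to be standardized and the corresponding field in the standardized dictionary, $W_k$ is the weight of the k-th field, and $n$ is the number of fields in the data to be standardized.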

Optionally, after the data to be normalized is obtained and before the data to be normalized is matched with each normalized dictionary based on the preset data matching model to obtain the matching result, the method further includes:

performing data checking on the data to be standardized based on a preset data checking rule;

and cleaning the checked data to be standardized based on a preset data cleaning rule.

First, the use process of the preset data matching model corresponding to a standardized dictionary is introduced:

Each standardized dictionary has a data structure, which comprises a data logic structure, a data storage structure and a data operation structure. When the data to be standardized is matched for similarity with a standardized dictionary, a template must first be determined based on the data structure of that standardized dictionary. After the server obtains the data to be standardized, the data to be standardized and the template can be integrated to obtain a template containing the data to be standardized. This forms the correspondence between each field in the data to be standardized and each field in the standardized dictionary, so that each field in the data to be standardized has a unique corresponding field in the standardized dictionary.

Meanwhile, after the server obtains the template containing the data to be standardized, the data to be standardized is checked based on a preset data checking rule, and then the data to be standardized after checking is cleaned based on a preset data cleaning rule.
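The check-then-clean order described above can be sketched as two small functions. The application does not specify the preset checking or cleaning rules, so the rules below (required fields must be non-empty; whitespace is collapsed; empty fields are dropped) are purely illustrative assumptions, as are the names `check` and `clean`.

```python
from typing import Dict, List


def check(data: Dict[str, str], required: List[str]) -> List[str]:
    """Assumed checking rule: every required field must be present and non-empty.
    Returns a list of problems; an empty list means the data passed."""
    return [
        f"missing or empty field: {name}"
        for name in required
        if not str(data.get(name, "")).strip()
    ]


def clean(data: Dict[str, str]) -> Dict[str, str]:
    """Assumed cleaning rule: trim names, collapse whitespace, drop empty values."""
    cleaned: Dict[str, str] = {}
    for name, value in data.items():
        normalized = " ".join(str(value).split())
        if normalized:
            cleaned[name.strip()] = normalized
    return cleaned
```

In this sketch, cleaning would be applied only to data for which `check` returns no problems, mirroring the order in the text.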

The data to be standardized, after data checking and data cleaning, and the standardized dictionary are input into the preset data matching model. One possible form of the data matching algorithm of the preset data matching model is the following expression, which gives the similarity matching result $S$ of the data to be standardized and the standardized dictionary:

$$S = \sum_{k=1}^{n} M_k \, W_k \qquad \text{(formula 1)}$$

wherein $M_k$ is the matching result of the k-th field in the data to be standardized with the corresponding field in the standardized dictionary, $W_k$ is the weight of the k-th field, and $n$ is the number of fields in the data to be standardized.

The standardized dictionary comprises a plurality of fields (also called feature vectors), and each field is configured with a weight value. The data to be standardized comprises n fields; the n fields corresponding one-to-one to them are extracted from the standardized dictionary, and similarity matching is performed on each pair of corresponding fields. The matching result of each field in the data to be standardized and the corresponding field in the standardized dictionary is multiplied by the weight of that field to obtain a result value, and the result values of the n fields are accumulated to obtain the final similarity matching result of the data to be standardized and the standardized dictionary.

In this embodiment, the performing similarity matching on the two fields specifically may be calculating cosine similarity of the two fields, where the algorithm expression is as follows:

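$$\mathrm{sim} = \cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} \qquad \text{(formula 2)}$$

wherein $\mathbf{u}$ is the vector expression of one field and $\mathbf{v}$ is the vector expression of the other field.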
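A minimal sketch of formula (2) follows. The application does not say how a field is turned into a vector, so the character-bigram count vector used here (`bigram_vector`) is an assumption made only to have something concrete to compare; `cosine_similarity` itself is the standard cosine of the angle between two vectors.

```python
import math
from collections import Counter


def bigram_vector(text: str) -> Counter:
    """Assumed vector expression of a field: counts of character bigrams."""
    s = text.lower().strip()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse count vectors (formula 2)."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        # An empty vector has no direction; treat it as no similarity.
        return 0.0
    return dot / (norm_u * norm_v)
```

Identical field texts give a similarity of 1, and texts that share no bigram give 0; any other vectorization could be substituted without changing `cosine_similarity`.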

The matching result of the two fields can be determined based on their cosine similarity. One possible case is as follows: if the cosine similarity of the two fields is greater than a preset threshold, the two fields are considered successfully matched and their matching result can be 1; otherwise the match is unsuccessful and their matching result can be 0. For example, the data to be standardized comprises four fields A, B, C and D, and the standardized dictionary comprises four fields A1, B1, C1 and D1 corresponding to A, B, C and D respectively, with weights x1, x2, x3 and x4 (the standardized dictionary may of course include other fields as well). The similarity between fields A and A1 is calculated according to formula (2); if the similarity is greater than the threshold, the matching result of A and A1 is 1, and otherwise it is 0. The matching results of the remaining fields B, C and D are determined in the same way as for field A. The similarity matching result of the data to be standardized and the standardized dictionary is then (matching result of A and A1 × x1 + matching result of B and B1 × x2 + matching result of C and C1 × x3 + matching result of D and D1 × x4).

It should be noted that, the matching result of the field may be 1 when the matching of the two fields is successful, and the matching result of the field may be 0 when the matching of the two fields is failed, which is only an implementable scheme, and actually 0 and 1 may be replaced with other parameters based on the normalization requirement of the user.

Another possible case is that the cosine similarity of each field in the data to be standardized and the corresponding field of the standardized dictionary is used directly as the matching result of the two fields. For example, the data to be standardized comprises four fields A, B, C and D, and the standardized dictionary comprises four fields A1, B1, C1 and D1 corresponding to A, B, C and D respectively, with weights x1, x2, x3 and x4 (the standardized dictionary may of course include other fields as well). The similarity between fields A and A1 is calculated according to formula (2), and the similarities of the remaining fields B, C and D are determined in the same way as for field A. The similarity matching result of the data to be standardized and the standardized dictionary is then (cosine similarity of A and A1 × x1 + cosine similarity of B and B1 × x2 + cosine similarity of C and C1 × x3 + cosine similarity of D and D1 × x4).
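Both variants of the weighted sum (thresholded 0/1 field results, or raw cosine similarities) can be sketched in one function. The name `dictionary_similarity` and the `field_threshold` parameter are hypothetical; the numeric values in the usage note are made-up illustration data, and weights summing to 1 is an assumption, not a requirement stated in the text.

```python
from typing import Optional, Sequence


def dictionary_similarity(
    field_sims: Sequence[float],
    weights: Sequence[float],
    field_threshold: Optional[float] = None,
) -> float:
    """Weighted sum of per-field matching results.

    field_sims: cosine similarity of each field pair (A/A1, B/B1, ...).
    weights: weight of each field (x1, x2, ...).
    field_threshold: if given, each similarity is first turned into 1 (match)
        or 0 (no match); if None, the raw similarity is the matching result.
    """
    if len(field_sims) != len(weights):
        raise ValueError("one weight is needed per field")
    if field_threshold is None:
        results = list(field_sims)
    else:
        results = [1.0 if s > field_threshold else 0.0 for s in field_sims]
    return sum(m * w for m, w in zip(results, weights))
```

With illustrative similarities `[0.9, 0.4, 0.85, 0.7]` and weights `[0.4, 0.3, 0.2, 0.1]`, the thresholded variant with a field threshold of 0.8 counts only fields A and C, while the raw variant lets every field contribute in proportion to its similarity.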

The above is the process of calculating the similarity matching result between the data to be standardized and the standardized dictionary. In one preferred scheme, if the similarity matching result between the data to be standardized and the standardized dictionary exceeds 0.8, the matching is considered successful, and if it does not exceed 0.8, the matching is considered unsuccessful.

In the implementation, the dictionaries to be standardized uploaded by the users have great difference, and the weighted values of the fields in the data matching algorithm corresponding to the preset data matching model can be configured according to the characteristics of the dictionaries to be standardized, so that the matching of the dictionaries to be standardized uploaded by the users can be adapted, and the accuracy of data standardization is high.

Meanwhile, data cleaning and data verification are carried out on data to be standardized, and the quality of dictionary data obtained after standardization is improved.

Secondly, a training process of the preset data matching model is introduced.

Training sample data is acquired, wherein the training sample data includes the non-standardized data of each sample and the sample matching result of the non-standardized data of each sample and the standardized dictionary. The sample matching result is obtained by manual matching, that is, it is the known correct result in model training.

The preset data matching models can be obtained through machine learning algorithm training. The resulting preset data matching models differ according to the algorithm characteristics and the dictionary characteristics of each dictionary; specifically, different data matching models have different weights for the fields in their data matching algorithms. For the training of one preset data matching model, the initial data matching algorithm of the model is known and is as shown in formula (1). The matching result of each field in each sample's non-standardized data with the corresponding field in the standardized dictionary, together with the weight of that field, is substituted into formula (1) to calculate the matching result of each sample's non-standardized data and the standardized dictionary. Since the sample matching result of each sample's non-standardized data and the standardized dictionary is known, the weight of each field in the data matching algorithm is reversely adjusted based on the sample matching results until the matching result determined based on the data matching algorithm and the corresponding sample matching result meet a preset condition, at which point the preset data matching model is obtained. The preset condition can be that the error between the matching result of the sample non-standardized data and the standardized dictionary obtained based on the data matching algorithm and the known sample matching result is within a preset range.
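
The training process described above can be sketched as follows. The text does not name a specific machine learning algorithm, so this sketch assumes plain batch gradient descent on the squared error of the weighted sum; the names `predict`, `train_weights`, `lr`, `tol` and `max_epochs` are hypothetical, as is the choice of equal initial weights.

```python
def predict(field_results, weights):
    """Weighted sum of per-field matching results for one sample."""
    return sum(m * w for m, w in zip(field_results, weights))


def train_weights(samples, n_fields, lr=0.1, tol=0.05, max_epochs=1000):
    """Adjust field weights until every sample error is within `tol`.

    samples: list of (field_results, known_result) pairs, where
             field_results holds the per-field matching results of one
             sample and known_result is its manually determined sample
             matching result.
    """
    if not samples:
        raise ValueError("training sample data is required")
    weights = [1.0 / n_fields] * n_fields  # assumed start: equal weights
    for _ in range(max_epochs):
        errors = [predict(fr, weights) - known for fr, known in samples]
        # Preset condition (assumed form): all errors within a range.
        if max(abs(e) for e in errors) <= tol:
            break
        # Reverse adjustment: move each weight against its error gradient.
        for (fr, _known), err in zip(samples, errors):
            weights = [w - lr * err * m for w, m in zip(weights, fr)]
    return weights
```

The loop stops either when the preset condition is met or when `max_epochs` is reached, so a caller that needs a guarantee should re-check the errors on the returned weights.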

Similarly, the training data may also be subjected to data verification and data cleansing.

Thirdly, the standardization result in this embodiment can be used to standardize the data to be standardized only after manual checking. The manual checking result can also be used to reversely update and refine the weights of the fields in the data matching algorithm of the preset data matching model. Finally, this process of reversely updating and refining the data matching algorithm of the preset data matching model is described.

Optionally, after determining a normalization result of the data to be normalized, the method further includes:

receiving a manual inspection result for the standardized result;

and based on the manual checking result, adjusting the weight of each field in the preset data matching model corresponding to the target standardized dictionary to optimize the preset data matching model corresponding to the target standardized dictionary.

The server sends the generated standardized result to the client corresponding to the user. If the user finds a problem in the standardized result, manual checking can be performed and the erroneous standardized result can be corrected, so that the client standardizes the data to be standardized based on the corrected standardized result. In this case, the user can add the standardized dictionary that was sent by the server and contains the error to the ambiguity library.

In this embodiment, it may be that manual verification is performed only on the normalized result generated by automatic matching of the server (that is, in the case that the similarity matching result for the normalized dictionary exceeds the matching threshold), and manual verification is not required on the normalized result generated by manual matching of the manager (that is, in the case that the similarity matching result for the normalized dictionary does not exceed the matching threshold, and the manager needs to determine the target normalized dictionary from the top N similarity matching results with higher similarities).

In this embodiment, any standardized dictionary added to the ambiguity library, whether it is one of the top N standardized dictionaries added by the manager or a standardized dictionary with errors added by the user, can be used to reversely adjust the weights of the fields in the data matching algorithm corresponding to that standardized dictionary. For example, for a target standardized dictionary in the ambiguity library, the standardized result obtained for the target standardized dictionary and the data to be standardized is manually checked, the client sends the manual checking result to the server, and the server adjusts the weights of the fields in the data matching algorithm of the target standardized dictionary based on the manual checking result, thereby optimizing the data matching algorithm of the preset data matching model corresponding to the target standardized dictionary. Similarly, the data matching algorithms of the preset data matching models corresponding to the top N standardized dictionaries can also be adjusted in this way.
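
One way the server might apply a single manual checking result is sketched below. The text does not specify the adjustment rule, so this is an assumption: the manual verdict is mapped to a target score of 1 (confirmed) or 0 (rejected), and each weight is moved one small step toward that target. The names `adjust_weights`, `manual_ok` and `lr` are hypothetical.

```python
def adjust_weights(weights, field_results, manual_ok, lr=0.05):
    """Apply one corrective step to the field weights after a manual check.

    weights:       current field weights of the dictionary's matching model
    field_results: per-field matching results of the checked case
    manual_ok:     True if the checker confirmed the standardized result,
                   False if the checker found it to be wrong
    """
    target = 1.0 if manual_ok else 0.0
    score = sum(m * w for m, w in zip(field_results, weights))
    error = score - target
    # Fields that did not match (result 0) keep their weight unchanged;
    # weights are clamped at 0 so that no field counts negatively.
    return [max(0.0, w - lr * error * m) for w, m in zip(weights, field_results)]
```

A rejected result lowers the weights of the fields that contributed to the wrong match, so the same data scores lower against that dictionary the next time it is matched.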

In the present application, it is possible that the normalized result of the data to be normalized exists in the normalized result library, and then the normalized result of the data to be normalized can be directly obtained from the normalized result library.

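Optionally, matching the data to be standardized with each standardized dictionary respectively based on the preset data matching model to obtain matching results, and determining the target standardized dictionary based on the matching results, includes:

if the standardized result of the data to be standardized does not exist in the standardized result library, matching the data to be standardized with each standardized dictionary respectively based on the preset data matching model to obtain matching results, determining the target standardized dictionary based on the matching results, and determining the standardized result of the data to be standardized based on the target standardized dictionary;

the method further comprises the following steps:

if the standardized result of the data to be standardized exists in the standardized result library, the standardized result of the data to be standardized is determined from the standardized result library.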

A standardized result library is maintained in the server and stores the standardized results of each piece of data or each dictionary. After the client sends the data to be standardized to the server, if the server finds that the data has already been standardized, that is, the standardized result of the data to be standardized exists in the standardized result library, the server directly sends that standardized result to the client to complete the standardization, without matching the data to be standardized with each standardized dictionary. If the standardized result of the data to be standardized does not exist in the standardized result library, the data needs to be standardized according to the steps of the embodiment shown in fig. 1. Further, after the data has been standardized according to those steps, its standardized result can be added to the standardized result library, thereby improving the standardized result library.

In this embodiment, for data that is submitted for standardization multiple times, the standardized result is obtained directly from the standardized result library, which reduces the amount of data processed by the matching algorithm and improves system performance.
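
The lookup in the standardized result library can be sketched as a simple cache. The text does not state how previously standardized data is recognized, so keying the library by a SHA-256 hash of a canonical JSON form of the data is an assumption, and the names `ResultLibrary`, `standardize` and `match_fn` are hypothetical.

```python
import hashlib
import json


class ResultLibrary:
    """In-memory stand-in for the standardized result library."""

    def __init__(self):
        self._results = {}

    @staticmethod
    def _key(data):
        # Canonical JSON makes the key independent of field order.
        canonical = json.dumps(data, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, data):
        return self._results.get(self._key(data))

    def put(self, data, result):
        self._results[self._key(data)] = result


def standardize(data, library, match_fn):
    """Return the stored result if one exists; otherwise match and store."""
    cached = library.get(data)
    if cached is not None:
        return cached           # no matching against the dictionaries needed
    result = match_fn(data)     # full matching, as in the embodiment of fig. 1
    library.put(data, result)   # add the new result to the library
    return result
```

Here `match_fn` stands for the whole matching procedure, so the matching algorithm runs only the first time a given piece of data is submitted.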

In summary, referring to fig. 2, the present disclosure specifically includes the following steps:

step S201, a server acquires data to be standardized, which is sent by a client;

step S202, the server judges whether a standardized result of the data to be standardized exists in a standardized result library;

step S203, if yes, the server sends a standardization result of the data to be standardized, which is determined from the standardization result library, to the client;

in the present embodiment, if the normalization result of the data to be normalized exists in the normalization result library, the client normalizes the data to be normalized directly with the normalization result without performing the subsequent steps S204 to S211.

Step S204, if not, the server acquires each standardized dictionary in the standardized dictionary library;

step S205, the server matches the data to be standardized with each standardized dictionary respectively based on the preset data matching model corresponding to each standardized dictionary to obtain similarity matching results;

step S206, the server judges whether at least one similarity matching result in the similarity matching results exceeds a matching threshold value;

step S207, if yes, the server determines that the standardized dictionary corresponding to the maximum similarity matching result is a target standardized dictionary, and determines a standardized result of the data to be standardized based on the target standardized dictionary;

step S208, the standardized result is manually checked;

When at least one of the similarity matching results exceeds the matching threshold, the server automatically extracts the target standardized dictionary to generate a standardized result; a standardized result generated by automatic matching of the server needs to be manually checked to avoid server matching errors. When none of the similarity matching results exceeds the matching threshold, manual matching by a manager is needed, specifically:

step S209, if not, the server sends the standardized dictionaries corresponding to the first N similarity matching results with higher similarity to the terminal equipment corresponding to the manager;

step S210, the server receives a determination result sent by the terminal equipment;

step S211, the server determines a target standardized dictionary based on the determination result, so as to determine a standardized result;

Further, the server sends to the client either the final standardized result obtained by manually checking the automatically matched standardized result, or the standardized result determined through manual matching by the manager. Meanwhile, the server also stores that result in the standardized result library.
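
The selection of the target standardized dictionary in steps S206 to S211 can be sketched as follows. This is a simplified illustration under stated assumptions: the manager's terminal device is represented by a hypothetical callback `ask_manager`, the default threshold of 0.8 is taken from the preferred scheme described earlier, and the names `select_target` and `top_n` are hypothetical.

```python
def select_target(scores, threshold=0.8, top_n=3, ask_manager=None):
    """Pick the target standardized dictionary from similarity results.

    scores: mapping of dictionary name to similarity matching result.
    Returns (dictionary name, "auto" or "manual").
    """
    if not scores:
        raise ValueError("no standardized dictionaries were matched")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]
    if best_score > threshold:
        # S207: the maximum result exceeds the threshold, so the match is
        # automatic; the result is still subject to manual checking (S208).
        return best_name, "auto"
    if ask_manager is None:
        raise ValueError("manual matching requires a manager callback")
    # S209: send the top N candidates to the manager's terminal device.
    candidates = [name for name, _ in ranked[:top_n]]
    # S210: receive the manager's determination result.
    chosen = ask_manager(candidates)
    if chosen not in candidates:
        raise ValueError("the manager must choose one of the candidates")
    # S211: the chosen dictionary becomes the target standardized dictionary.
    return chosen, "manual"
```

Checking only the maximum score against the threshold is equivalent to asking whether at least one similarity matching result exceeds it, which is the condition tested in step S206.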

It should be noted that the method shown in this embodiment is substantially the same as the method shown in fig. 1 in the foregoing, and therefore, the solution shown in this embodiment may specifically refer to the description of the method shown in fig. 1 and the solution in the optional embodiment thereof in the foregoing, and is not described again here.

Specifically, the present application has the following advantages:

1. a unified standardized dictionary library is provided, so that scattered data of each system can be integrated; the library is suitable for standardizing both non-standardized dictionaries and non-standardized non-dictionary data, has a wide application range, and enables statistics, analysis and utilization of the data;

2. the matching accuracy is improved by applying the standardized result library; for data that is submitted for standardization multiple times, the standardized result is obtained directly from the standardized result library, which reduces the amount of data processed by the matching algorithm and improves system performance;

3. multiple versions of the standardized dictionaries are maintained in the standardized dictionary library, so that a wider range of users is served and the matching accuracy is higher;

4. flexible matching rules are provided by setting the weight of each field in the data matching algorithm corresponding to the preset data matching model, so that various types of data to be standardized can be matched and the personalized requirements of users are met;

5. an ambiguity library is provided for updating and refining the preset data matching model of each standardized dictionary, gradually improving the matching accuracy;

6. data cleaning and data verification are performed on the data to be standardized, so that the quality of the dictionary data obtained after standardization is improved.

Fig. 1 to fig. 2 describe a data normalization method provided in the present application, and the present application further provides a data normalization apparatus, please refer to fig. 3, which includes:

an obtaining module 301, configured to obtain data to be standardized and each standardized dictionary in a standardized dictionary library;

a first determining module 302, configured to determine preset data matching models corresponding to the standardized dictionaries;

a matching module 303, configured to match the data to be standardized with each standardized dictionary respectively based on the preset data matching model to obtain matching results, and to determine a target standardized dictionary based on the matching results;

a second determining module 304, configured to determine a normalization result of the data to be normalized based on the target normalization dictionary.

Optionally, the matching result includes a similarity matching result;

the matching module 303 is specifically configured to, when matching the data to be normalized with each normalized dictionary based on the preset data matching model to obtain a matching result:

Optionally, when determining the target normalized dictionary based on the matching result, the matching module 303 is specifically configured to:

M_K is the weight of the kth field.

Optionally, the matching module 303 is specifically configured to, when matching the data to be normalized with each normalized dictionary based on the preset data matching model to obtain a matching result, and determining the target normalized dictionary based on the matching result:

after the second determining module 304 determines the normalization result of the data to be normalized, a manual inspection result for the normalization result is received; and an adjusting module is configured to adjust the weight of each field in the preset data matching model corresponding to the target standardized dictionary based on the manual inspection result, so as to optimize the preset data matching model corresponding to the target standardized dictionary.

after the obtaining module 301 obtains the data to be standardized, and before the matching module 303 matches the data to be standardized with each standardized dictionary respectively based on the preset data matching model to obtain a matching result, data checking is performed on the data to be standardized based on a preset data checking rule; and a cleaning module is configured to clean the checked data to be standardized based on a preset data cleaning rule.

Optionally, the data to be normalized includes at least one of:

Optionally, the at least one standardized dictionary comprises at least one of:

Since the apparatus provided in the embodiments of the present application is an apparatus capable of executing the corresponding method in the embodiments of the present application, a specific implementation manner of the apparatus provided in the embodiments of the present application and various modifications thereof can be known to those skilled in the art based on the method provided in the embodiments of the present application, and therefore, a detailed description of how to implement the method in the embodiments of the present application by the apparatus is not provided herein. The apparatus used by those skilled in the art to implement the method in the embodiments of the present application is within the scope of the present application.

Based on the same principle as the method provided by the embodiment of the present application, the embodiment of the present application also provides an electronic device, which includes a memory and a processor; the memory has a computer program stored therein; the processor is adapted to perform the method provided in any of the alternative embodiments of the present application when executing the computer program.

Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program can implement the method provided in any optional embodiment of the present application.

As an example, fig. 4 shows a schematic structural diagram of an electronic device to which the present application may be applied. The electronic device may specifically be a server or a cloud computer, or may be another device. The electronic device 8000 includes a memory 8003 and a processor 8001; the memory 8003 stores a computer program, and the processor 8001 is configured to perform any one of the above methods when executing the computer program. It should be noted that fig. 4 shows only one optional schematic structure of the electronic device, and the structure of the electronic device 8000 shown in fig. 4 does not limit the embodiments of the present application.

Processor 8001 is coupled to memory 8003, such as via bus 8002. Optionally, the electronic device 8000 may also include a transceiver 8004. It should be noted that the transceiver 8004 is not limited to one in practical applications, and the transceiver 8004 may be specifically used for communication between the electronic device 8000 and other devices.

Processor 8001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of the present application. Processor 8001 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

Bus 8002 may include a path to transfer information between the aforementioned components. The bus 8002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 8002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 4, but this does not mean that there is only one bus or one type of bus.

The memory 8003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.

The memory 8003 is used for storing application program codes for executing the scheme of the present application, and the execution is controlled by the processor 8001. Processor 8001 is configured to execute application program code stored in memory 8003 to implement what is shown in any of the method embodiments above.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present application, and these improvements and refinements shall also fall within the protection scope of the present application.

Claims

1. A method of data normalization, comprising:

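acquiring data to be standardized and each standardized dictionary in a standardized dictionary library;

determining a preset data matching model corresponding to each standardized dictionary;

respectively matching the data to be standardized with each standardized dictionary on the basis of the preset data matching model to obtain matching results, and determining a target standardized dictionary on the basis of the matching results;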

based on the target standardized dictionary, a standardized result of the data to be standardized is determined.

2. The method of claim 1, wherein the matching result comprises a similarity matching result;

the matching of the data to be standardized with each standardized dictionary based on the preset data matching model to obtain matching results comprises the following steps:

3. The method of claim 2, wherein determining a target normalized dictionary based on the matching results comprises:

and determining the standardized dictionary corresponding to the maximum similarity matching result as the target standardized dictionary.

4. The method of claim 3, wherein determining the normalized dictionary corresponding to the maximum similarity matching result as the target normalized dictionary comprises:

if at least one similarity matching result in the similarity matching results exceeds a matching threshold, determining the standardized dictionary corresponding to the maximum similarity matching result as the target standardized dictionary;

the method further comprises the following steps:

if the similarity matching results do not exceed the matching threshold, sending the standardized dictionaries corresponding to the first N similarity matching results with high similarity to the terminal equipment corresponding to the manager, receiving a determination result sent by the terminal equipment, and determining a target standardized dictionary based on indication information including the target standardized dictionary in the determination result, wherein N is an integer greater than 0.

5. The method according to any one of claims 1 to 4, wherein for any standardized dictionary, the data matching algorithm of the preset data matching model is determined based on the matching result of each field in the data to be standardized and the corresponding field in the standardized dictionary and the weight corresponding to each field, and the expression of the data matching algorithm is as follows:

wherein M_K is the matching result of the kth field in the data to be standardized and the corresponding field in the standardized dictionary;

the M_K is the weight of the kth field.

6. The method of claim 5, wherein for any normalized dictionary, the weight for each field is determined by:

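acquiring training sample data, wherein the training sample data comprises non-standardized data of each sample and a sample matching result of the non-standardized data of each sample and the standardized dictionary;

and optimizing and adjusting each weight corresponding to the data matching algorithm of the standardized dictionary based on the training sample data until the matching result of the non-standardized data of each sample and the standardized dictionary determined based on the data matching algorithm and the corresponding sample matching result meet a preset condition.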

7. The method according to any one of claims 1 to 4, wherein the matching the data to be standardized with each standardized dictionary respectively based on the preset data matching model to obtain matching results, and determining a target standardized dictionary based on the matching results comprises:

if the standardized result of the data to be standardized does not exist in the standardized result library, respectively matching the data to be standardized with each standardized dictionary based on the preset data matching model to obtain a matching result, determining a target standardized dictionary based on the matching result, and determining the standardized result of the data to be standardized based on the target standardized dictionary;

the method further comprises the following steps:

and if the standardized result of the data to be standardized exists in the standardized result library, determining the standardized result of the data to be standardized from the standardized result library.

8. The method according to any one of claims 1 to 4, wherein after determining the normalization result of the data to be normalized, the method further comprises:

receiving a manual inspection result for the standardized result;

and adjusting the weight of each field in the preset data matching model corresponding to the target standardized dictionary based on the manual checking result so as to optimize the preset data matching model corresponding to the target standardized dictionary.

9. The method according to any one of claims 1 to 4, wherein after the obtaining of the data to be normalized, before the matching of the data to be normalized with each normalized dictionary based on the preset data matching model to obtain the matching result, the method further comprises:

performing data checking on the data to be standardized based on a preset data checking rule;

and based on a preset data cleaning rule, cleaning the data to be standardized after the checking.

10. The method according to any one of claims 1 to 4, wherein the data to be normalized comprises at least one of:

11. The method according to any of claims 1 to 4, wherein said at least one standardized dictionary comprises at least one of:

12. A data normalization apparatus, comprising:

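an obtaining module, configured to obtain data to be standardized and each standardized dictionary in a standardized dictionary library;

a first determining module, configured to determine a preset data matching model corresponding to each standardized dictionary;

a matching module, configured to match the data to be standardized with each standardized dictionary respectively based on the preset data matching model to obtain matching results, and to determine a target standardized dictionary based on the matching results;

a second determining module, configured to determine a standardized result of the data to be standardized based on the target standardized dictionary.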

13. An electronic device, comprising:

a memory and a processor;

the memory has stored therein a computer program;

the processor, when running the computer program, is configured to perform the method of any of claims 1-11.

14. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1-11.