CN115794899A - Data standard matching method, device, equipment, medium and product - Google Patents

Publication number
CN115794899A
Authority
CN
China
Prior art keywords
data
standard
feature
name
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211402748.5A
Other languages
Chinese (zh)
Inventor
吴宏杰
周檬
檀康
李静
陈汉
华锦芝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd filed Critical China Unionpay Co Ltd
Priority to CN202211402748.5A priority Critical patent/CN115794899A/en
Publication of CN115794899A publication Critical patent/CN115794899A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data standard matching method, device, equipment, medium and product. The data standard matching method comprises: obtaining data item features corresponding to a target data item and standard features corresponding to each of a plurality of data standards, wherein the data item features and the standard features each comprise at least two feature items among name, data format and remark information; for each data standard, determining the feature similarity corresponding to each feature item between the target data item and the data standard according to the data item features and the standard features corresponding to the data standard; determining the similarity between the target data item and the data standard according to the feature similarities corresponding to the at least two feature items; and determining, from the plurality of data standards, a target data standard with the highest similarity, and determining the target data standard as the data standard matching the target data item. According to the embodiments of the application, matching efficiency can be improved while the accuracy of the matching result is also improved.

Description

Data standard matching method, device, equipment, medium and product
Technical Field
The present application relates to data processing technologies, and in particular, to a data standard matching method, apparatus, device, medium, and product.
Background
In data governance work, managers may formulate multiple data standards in order to unify data specifications. Finding the matching data standard for an existing data item, so that the way the data item is recorded can be updated according to that standard, is one of the key factors in whether the data standards can be successfully put into practice.
In existing approaches, the matching data standard for a target data item is either selected manually based on work experience, or matched solely according to the Chinese name of the data item.
Manual selection not only has low matching efficiency, but also leads to inaccurate matching results because it depends heavily on work experience. When the data standard is matched only according to the Chinese name of the data item, the matching quality depends on whether the Chinese name expresses the full meaning of the data item: if the Chinese name is similar to the Chinese name in a data standard but the business definition differs from that of the data standard, the matching result is wrong.
Disclosure of Invention
The embodiment of the application provides a data standard matching method, device, equipment, medium and product, which can improve the matching efficiency and the accuracy of a matching result.
In a first aspect, an embodiment of the present application provides a data standard matching method, where the method includes:
acquiring data item characteristics corresponding to a target data item and standard characteristics corresponding to a plurality of data standards respectively, wherein the data item characteristics and the standard characteristics respectively comprise at least two characteristic items in name, data format and remark information;
for each data standard of the plurality of data standards, determining a feature similarity corresponding to each feature item between the target data item and the data standard according to the data item feature and a standard feature corresponding to the data standard;
determining the similarity between the target data item and the data standard according to the feature similarity corresponding to the at least two feature items respectively;
and determining a target data standard with the highest similarity from the plurality of data standards based on the similarity between the target data item and each of the plurality of data standards, and determining the target data standard as the data standard matching the target data item.
In a second aspect, an embodiment of the present application provides a data standard matching apparatus, including:
a feature acquisition module, configured to acquire data item features corresponding to a target data item and standard features corresponding to each of a plurality of data standards, wherein the data item features and the standard features each comprise at least two feature items among name, data format and remark information;
a first determining module, configured to determine, for each of the plurality of data standards, a feature similarity corresponding to each feature item between the target data item and the data standard according to the data item feature and a standard feature corresponding to the data standard;
a second determining module, configured to determine, according to feature similarities corresponding to the at least two feature items, a similarity between the target data item and the data standard;
and a standard determining module, configured to determine a target data standard with the highest similarity from the plurality of data standards based on the similarity between the target data item and each of the plurality of data standards, and to determine the target data standard as the data standard matching the target data item.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory storing computer program instructions;
the computer program instructions, when executed by the processor, implement the steps of the data standard matching method as described in any embodiment of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, implement the steps of the data standard matching method as described in any one of the embodiments of the first aspect.
In a fifth aspect, the present application provides a computer program product, and instructions in the computer program product, when executed by a processor of an electronic device, cause the electronic device to perform the steps of the data standard matching method as described in any embodiment of the first aspect.
According to the data standard matching method, device, equipment, medium and product in the embodiment of the application, the feature similarity corresponding to each feature item between the target data item and the data standard is calculated from the dimensions corresponding to at least two feature items in the name, data format and remark information, the similarity between the target data item and the data standard is comprehensively determined according to the feature similarities under at least two dimensions, and the target data standard with the highest similarity is determined from the multiple data standards and is the data standard matched with the target data item. Therefore, the embodiment of the application can realize automatic matching between the data item and the data standard, and compared with a manual matching mode, the matching mode of the embodiment of the application has higher efficiency. In addition, according to the embodiment of the application, the similarity between the target data item and the data standard is comprehensively calculated according to the dimensionalities corresponding to at least two feature items in the name, the data format and the remark information, so that compared with a mode of matching the data standard only according to the Chinese name, the embodiment of the application can enable the matching result to be more accurate. In summary, the embodiment of the application can improve the matching efficiency and improve the accuracy of the matching result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments of the present application will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a data standard matching method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a data standard matching method according to another embodiment of the present application;
FIG. 3 is a flow chart of a synonym dictionary establishing process provided by the present application;
FIG. 4 is a schematic flow chart of a data standard matching method according to another embodiment of the present application;
FIG. 5 is a schematic flow chart of a data standard matching method according to yet another embodiment of the present application;
FIG. 6 is a schematic flow chart of a data standard matching method provided herein;
FIG. 7 is a schematic structural diagram of a data standard matching device according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Features and exemplary embodiments of various aspects of the present application will be described in detail below, and in order to make objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are intended to be illustrative only and are not intended to be limiting. It will be apparent to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present application by illustrating examples thereof.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
In enterprise-level data management work, managers generally formulate business data standards, which mainly comprise attributes such as the standard Chinese and English names, business definition, business rules, data type, data format, fill-value specification and value description. To put the data standards into practice more effectively, a working mechanism can be established that embeds the data standards into the key stages of system construction. This mechanism mainly comprises two stages: in the first stage, business personnel match the business elements in a requirement with the data standards when writing the requirement; in the second stage, developers match the fields required for development with the data standards during database design. In both stages, a standard data format, fill-value specification and the like can be established for the business element or field according to its matching result.
Matching data items to data standards helps business personnel and developers define and use data consistently, and promotes the sharing of information resources. However, the matching work faces two challenges. On the one hand, the matching process depends on manual work: when data standards are matched for the fields of an existing system, and the system is large and has many data fields, it is difficult to complete the matching quickly by hand, so the matching efficiency is low. On the other hand, the accuracy of the matching depends heavily on work experience: the content of the data standards covers many business domains of the enterprise, and the person performing the matching may not fully understand the meaning of a data standard or data item, which may make the matching result wrong or inaccurate.
In addition, to address the low efficiency of manual matching, the related art provides a way of matching the data standard only according to the Chinese name of the data item. The effectiveness of this approach depends greatly on whether the Chinese name of the data item expresses the full meaning of the data item. If the Chinese name of a data item is similar to that of a data standard, but its business definition and data format differ from those of the data standard, the matching result may be incorrect.
In order to solve the problems of the prior art, embodiments of the present application provide a data standard matching method, apparatus, device, medium, and product. The data standard matching method can be applied to a scene of performing data standard matching on data items, and the data standard matching method provided by the embodiment of the application is introduced below first.
Fig. 1 is a schematic flow chart of a data standard matching method according to an embodiment of the present application. As shown in fig. 1, the data standard matching method may specifically include the following steps:
S110, acquiring data item features corresponding to a target data item and standard features corresponding to each of a plurality of data standards, wherein the data item features and the standard features each comprise at least two feature items among name, data format and remark information;
S120, for each data standard in the plurality of data standards, determining the feature similarity corresponding to each feature item between the target data item and the data standard according to the data item features and the standard features corresponding to the data standard;
S130, determining the similarity between the target data item and the data standard according to the feature similarities corresponding to the at least two feature items;
S140, determining a target data standard with the highest similarity from the plurality of data standards based on the similarity between the target data item and each data standard in the plurality of data standards, and determining the target data standard as the data standard matching the target data item.
Therefore, the feature similarity corresponding to each feature item between the target data item and the data standard is calculated from the dimensions corresponding to at least two feature items in the name, the data format and the remark information, the similarity between the target data item and the data standard is comprehensively determined from the feature similarities under at least two dimensions, and the target data standard with the highest similarity is determined from the multiple data standards and is the data standard matched with the target data item. Therefore, the embodiment of the application can realize automatic matching between the data item and the data standard, and compared with a manual matching mode, the matching mode of the embodiment of the application has higher efficiency. In addition, according to the embodiment of the application, the similarity between the target data item and the data standard is comprehensively calculated according to the dimensionalities corresponding to at least two feature items in the name, the data format and the remark information, so that compared with a mode of matching the data standard only according to the Chinese name, the embodiment of the application can enable the matching result to be more accurate. In summary, the embodiment of the application can improve the matching efficiency and improve the accuracy of the matching result.
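As a rough illustration only, steps S110 to S140 could be sketched in Python as follows. The `feature_similarity` helper (a generic string ratio from the standard library), the feature keys, the sample standards, and the plain average used for fusion are all assumptions made for this sketch; the application itself leaves the per-feature similarity measure open.

```python
from difflib import SequenceMatcher
from typing import Dict

# A feature set maps a feature item (e.g. "name", "data_format") to its text.
Features = Dict[str, str]


def feature_similarity(a: str, b: str) -> float:
    """Placeholder per-feature similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def match_standard(item: Features, standards: Dict[str, Features]) -> str:
    """Return the id of the data standard with the highest overall similarity."""
    scores = {}
    for std_id, std in standards.items():
        # S120: one similarity per feature item present on both sides.
        sims = [feature_similarity(item[k], std[k]) for k in item if k in std]
        # S130: fuse the feature similarities (plain average in this sketch).
        scores[std_id] = sum(sims) / len(sims) if sims else 0.0
    # S140: the standard with the highest fused similarity is the match.
    return max(scores, key=scores.get)


# S110: hypothetical features of a target data item and two data standards.
item = {"name": "card number", "data_format": "CHAR(19)"}
standards = {
    "STD_CARD": {"name": "bank card number", "data_format": "CHAR(19)"},
    "STD_AMT": {"name": "transaction amount", "data_format": "DECIMAL(12,2)"},
}
matched = match_standard(item, standards)
```

Here `matched` is the identifier of the standard whose name and data format are, on average, closest to those of the target data item.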
Specific implementations of the above steps are described below.
In some embodiments, in S110, the target data item may be a business element in a requirement book when a business person writes the requirement book, may also be an existing field in an inventory database when a developer designs a database, and may also be a data item corresponding to metadata of a certain database, which is not limited herein. At least two characteristic items of name, data format and remark information can be included in the data item characteristics corresponding to the target data item.
The data standard may be a standard set in advance by managers to specify how a data item is to be recorded. Corresponding to the feature items included in the data item features, the standard features of a data standard may likewise include at least two feature items among name, data format and remark information. The data item features and the standard features contain the same types of feature items.
For example, the name may include a Chinese name, or an English name, or a Chinese name and the English name corresponding to the Chinese name. The remark information may be a fill-value specification or a value description of the data item.
In some embodiments, in S120, for each data standard, a similarity between each feature item in the data item features and the same type of feature item in the data standard, that is, a feature similarity may be calculated. For the target data item and a data standard, a feature similarity can be calculated for a type of feature item.
For example, if the data item features corresponding to the target data item and the standard features corresponding to the data standard both include three feature items, namely name, data format and remark information, then the feature similarity between the name of the target data item and the standard name of the data standard, the feature similarity between the data format of the target data item and the data format of the data standard, and the feature similarity between the remark information of the target data item and the remark information of the data standard may be calculated. That is, three feature similarities are calculated, one for each of the three feature dimensions.
Here, the manner of calculating the feature similarity may be different for different kinds of feature items, and is not limited herein.
In some embodiments, in S130, after the feature similarity corresponding to each feature item is calculated, the feature similarities may be fused by using a preset fusion algorithm, so as to calculate a final similarity between the target data item and the data standard. Wherein, the preset fusion algorithm may be a post-fusion model.
Illustratively, an average fusion algorithm may be adopted to fuse the feature similarities, that is, the average of the feature similarities is used as the final similarity between the target data item and the data standard. Alternatively, the weight corresponding to each feature item may be learned by a learning-to-rank method, and the feature similarities are then combined by weighted summation to obtain the final similarity between the target data item and the data standard.
In order to fully consider the influence of the priority relationship between the feature items on the weight, in some embodiments, the step S130 may specifically include:
determining the weight corresponding to each of the at least two characteristic items according to the priority corresponding to each of the at least two characteristic items;
and carrying out weighted summation on the feature similarity based on the weights respectively corresponding to the at least two feature items to obtain the similarity between the target data item and the data standard.
Here, the weight corresponding to each feature item may be assigned according to the priority, and the weight corresponding to the feature item with higher priority is also larger.
For example, the similarity between the target data item and each data standard may be calculated according to the following formula (1):
$$S = \sum_{l=1}^{L} \pi_l \, s_l \qquad (1)$$
where $L$ is the number of feature items, $\pi_l$ is the weight corresponding to the $l$-th feature item, $s_l$ is the feature similarity corresponding to the $l$-th feature item between the target data item and the data standard, and $S$ is the similarity between the target data item and the data standard.
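The weighted summation described above can be sketched in a few lines of Python. The function name `fuse_similarities` and the sample numbers are hypothetical; the sketch assumes the similarities and weights are supplied in the same feature-item order.

```python
from typing import Sequence


def fuse_similarities(sims: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-feature similarities (one weight per feature item)."""
    if len(sims) != len(weights):
        raise ValueError("each feature similarity needs exactly one weight")
    return sum(w * s for w, s in zip(weights, sims))


# Two feature items: similarity 0.9 with weight 0.6, similarity 0.6 with weight 0.24.
overall = fuse_similarities([0.9, 0.6], [0.6, 0.24])
```

With these sample values the result is 0.6 × 0.9 + 0.24 × 0.6 = 0.684.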
In some embodiments, the weight corresponding to a first target feature item among the at least two feature items is greater than the sum of the weights corresponding to the second target feature items, where the first target feature item is any feature item other than the one with the lowest priority, and the second target feature items are all feature items whose priority is lower than that of the first target feature item.
For example, the feature items may be sorted by priority from highest to lowest, and, based on the rank $l$ of each feature item after sorting, the weights may be assigned according to the following formula (2):
$$\pi_l = p\,(1-p)^{\,l-1}, \quad p \in [0.5, 1) \qquad (2)$$
where $\pi_l$ is the weight corresponding to the $l$-th feature item.
Based on this, the weight corresponding to the feature item with rank $t$ satisfies the following inequality (3):
$$\pi_t > \sum_{l=t+1}^{L} \pi_l \qquad (3)$$
It follows that the weight of the $t$-th feature item is always higher than the total weight of the remaining lower-priority feature items, i.e. the improved post-fusion model lets the more effective features have a greater influence on the fusion.
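A minimal sketch of the weight assignment in formula (2), together with a numerical check of the dominance property, might look as follows. The function name `priority_weights` and the choice p = 0.6 are assumptions made for illustration.

```python
from typing import List


def priority_weights(num_items: int, p: float = 0.6) -> List[float]:
    """Weights p * (1 - p) ** (l - 1) for ranks l = 1..num_items (rank 1 = highest priority)."""
    if not 0.5 <= p < 1:
        raise ValueError("p must lie in [0.5, 1)")
    return [p * (1 - p) ** (rank - 1) for rank in range(1, num_items + 1)]


weights = priority_weights(4, p=0.6)

# Each weight should exceed the sum of all lower-priority weights.
dominates = all(
    weights[t] > sum(weights[t + 1:]) for t in range(len(weights) - 1)
)
```

The property holds because the weights after rank t form a geometric tail whose finite sum is (1-p)^t - (1-p)^L, which is strictly below p(1-p)^(t-1) whenever p ≥ 0.5.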
In some embodiments, in the case that the data item features and the standard features each include a name, a data format and remark information, and the name includes a Chinese name and the English name corresponding to the Chinese name, the priorities of the feature items may be, in order from highest to lowest: the priority corresponding to the Chinese name, the priority corresponding to the English name, the priority corresponding to the remark information, and the priority corresponding to the data format.
That is, the priority ranking among the feature items may be: the priority corresponding to the Chinese name > the priority corresponding to the English name > the priority corresponding to the remark information > the priority corresponding to the data format.
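To make the effect of this ordering concrete, the following sketch scores two hypothetical data standards with priority-derived weights. The key names, the value p = 0.6, and all similarity values are invented for illustration and are not taken from the application.

```python
# Priority order stated in the text, highest first (key names are hypothetical).
PRIORITY = ["chinese_name", "english_name", "remark", "data_format"]
P = 0.6  # assumed value of p in [0.5, 1)

# Rank 0 in enumerate corresponds to l = 1 in the weight formula.
WEIGHTS = {name: P * (1 - P) ** rank for rank, name in enumerate(PRIORITY)}


def overall_similarity(feature_sims):
    """Weighted sum of the per-feature similarities."""
    return sum(WEIGHTS[name] * feature_sims[name] for name in PRIORITY)


# Invented feature similarities of one data item against two standards.
candidates = {
    "STD_A": {"chinese_name": 0.95, "english_name": 0.3, "remark": 0.2, "data_format": 0.1},
    "STD_B": {"chinese_name": 0.5, "english_name": 0.9, "remark": 0.9, "data_format": 0.9},
}
best = max(candidates, key=lambda std: overall_similarity(candidates[std]))
```

In this example `STD_A` scores about 0.665 and `STD_B` about 0.637, so the strong Chinese-name match outweighs the three lower-priority features combined.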
In some embodiments, in S140, after calculating the similarity between the target data item and each data criterion, the target data criterion with the highest similarity may be determined as the data criterion matching the target data item. Therefore, the data standard corresponding to the target data item can be finally matched, so that the working personnel can conveniently make data recording rules for the target data item according to the data standard, the data recording rules of the same business data item in an enterprise can be unified, and the enterprise can conveniently manage the data in a unified manner.
The feature similarity calculation method in each feature item dimension is described in detail below.
As shown in fig. 2, in some embodiments, in the case that the data item feature and the standard feature both include names, the step S120 may specifically include the following steps:
S210, obtaining at least one synonym corresponding to the data item name of the target data item to obtain a target synonym set;
S220, for each data standard in the plurality of data standards, calculating, based on the target synonym set, the similarity between the standard name of the data standard and each of the data item name and its corresponding synonyms, to obtain a plurality of name similarities corresponding to the target data item;
S230, determining the feature similarity corresponding to the name between the target data item and the data standard according to the plurality of name similarities.
Here, the data item feature corresponding to the target data item may include a data item name, and correspondingly, the data standard may also include a standard name. The data item names and the standard names can both comprise Chinese names or English names or Chinese names and corresponding English names.
Illustratively, for each data item name, at least one synonym corresponding to the data item name may be obtained and assembled into a target synonym set. For example, if the data item name is a Chinese name, the synonyms corresponding to the Chinese name of the target data item may be obtained first, and then the similarity between the Chinese name of the data standard and each of the Chinese name and its synonyms may be calculated, so that a plurality of name similarities is obtained. Based on the plurality of name similarities, the feature similarity corresponding to the name between the target data item and the data standard may be determined. Ways of doing so include, but are not limited to, taking the largest name similarity as the feature similarity corresponding to the name, or taking the average of the name similarities as the feature similarity corresponding to the name.
In addition, the calculating the similarity between the data item name and the corresponding synonym thereof and the standard name of the data standard to obtain the similarity of a plurality of names corresponding to the target data item may specifically include:
and calculating cosine similarity between the data item name and the corresponding synonym thereof and the standard name of the data standard respectively to obtain a plurality of name similarities corresponding to the target data item.
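The text does not specify how names are turned into vectors before the cosine similarity is computed. The sketch below assumes character bigram counts as the vector representation; this choice, and the helper names `char_ngrams` and `cosine_similarity`, are illustrative assumptions only.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count character n-grams of a name (assumed vectorisation)."""
    text = text.strip().lower()
    if len(text) < n:
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(name_a: str, name_b: str) -> float:
    """Cosine of the angle between the n-gram count vectors of two names."""
    va, vb = char_ngrams(name_a), char_ngrams(name_b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


same = cosine_similarity("交易金额", "交易金额")      # identical names
partial = cosine_similarity("交易金额", "交易日期")   # one shared bigram out of three
unrelated = cosine_similarity("交易金额", "卡号")     # no shared bigram
```

Character n-grams avoid the need for a word segmenter in this sketch; an embedding-based vector could be substituted without changing the cosine computation.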
Therefore, by acquiring at least one synonym corresponding to the data item name and combining the data item name and the corresponding synonym, the feature similarity of the target data item and the data standard in the name dimension is calculated, so that the name coverage is more comprehensive, and the feature similarity calculation result in the name dimension is more accurate.
Based on this, in order to more accurately represent the similarity between different names of the same service function in the service and to shorten the distance between different names of the same service function, in some embodiments, the S230 may specifically include:
and determining the highest name similarity in the plurality of name similarities as the feature similarity corresponding to the name between the target data item and the data standard.
In this way, even if the similarity between the data item name and the standard name is not high, but there is a word with high similarity to the standard name in the synonym corresponding to the data item name, the higher similarity value can be determined as the final similarity value between the data item name and the standard name, that is, the feature similarity corresponding to the name between the target data item and the data standard.
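Taking the highest name similarity over the data item name and its synonyms can be sketched as follows. The default `text_sim` measure is a stand-in from the Python standard library rather than the cosine similarity of the embodiment, and the sample names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable


def text_sim(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; any name similarity could be plugged in."""
    return SequenceMatcher(None, a, b).ratio()


def name_feature_similarity(
    item_name: str,
    synonyms: Iterable[str],
    standard_name: str,
    sim: Callable[[str, str], float] = text_sim,
) -> float:
    """Highest similarity between the standard name and the item name or any synonym."""
    candidates = [item_name, *synonyms]
    return max(sim(candidate, standard_name) for candidate in candidates)


direct = text_sim("acct no", "account number")
with_synonyms = name_feature_similarity("acct no", ["account number"], "account number")
```

Because the item name itself is always among the candidates, the result can never be lower than the direct name-to-name similarity.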
In addition, in order to conveniently and accurately obtain the synonym corresponding to the data item name of the target data item, in some embodiments, before the step S210, the data standard matching method provided in the embodiment of the present application may further include:
acquiring field names of a plurality of fields contained in an existing data table;
aiming at the field name of each field in a plurality of fields, acquiring a plurality of synonyms corresponding to the field name;
and establishing a synonym dictionary according to the field names of the fields and the synonyms corresponding to the field names.
Accordingly, the S210 may specifically include:
at least one synonym corresponding to the data item name of the target data item is queried based on the synonym dictionary.
Here, in order to facilitate the later synonym query, an enterprise-level synonym dictionary may be first established based on the existing data tables in the databases of various types of the enterprise.
Illustratively, the establishment process of the synonym dictionary may be: acquiring field names in the existing data table structure from databases of various types, wherein the field names can comprise Chinese names or English names or Chinese names and English names corresponding to the Chinese names; respectively acquiring synonyms corresponding to the field names according to the field names in a preset mode; and merging, removing the duplicate and other processing of the field names and the corresponding synonyms to obtain the enterprise-level synonym dictionary.
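The merge-and-deduplicate step of this process might be sketched as below. The `lookup` callable stands for whatever "preset mode" supplies synonyms for a field name; it and the toy data are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set


def build_synonym_dictionary(
    field_names: Iterable[str],
    lookup: Callable[[str], Iterable[str]],
) -> Dict[str, Set[str]]:
    """Map each distinct field name to its deduplicated set of synonyms."""
    dictionary: Dict[str, Set[str]] = {}
    for raw_name in field_names:
        name = raw_name.strip()
        if not name:
            continue
        # setdefault merges repeated field names into a single entry.
        entry = dictionary.setdefault(name, set())
        for synonym in lookup(name):
            synonym = synonym.strip()
            if synonym and synonym != name:
                entry.add(synonym)  # the set removes duplicate synonyms
    return dictionary


# Toy synonym source standing in for the preset acquisition mode.
toy_source = {"card_no": ["card_number", "pan", "card_number"]}
synonym_dict = build_synonym_dictionary(
    ["card_no", "card_no ", "amt"],
    lambda name: toy_source.get(name, []),
)
```

Field names collected from several databases collapse into one entry each, and a name with no known synonyms still receives an (empty) entry so later lookups do not fail.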
In addition, in a case where the name includes a Chinese name and the English name corresponding to the Chinese name, in order to give each kind of name its own synonym dictionary, in some embodiments, the step of acquiring a plurality of synonyms corresponding to the field names may specifically include:
acquiring a plurality of Chinese synonyms corresponding to the Chinese names;
acquiring a plurality of English synonyms corresponding to the English names;
Correspondingly, the step of establishing a synonym dictionary according to the field names of the fields and the synonyms corresponding to the field names may specifically include:
establishing a Chinese synonym dictionary according to the Chinese names of the fields and the corresponding Chinese synonyms; and
establishing an English synonym dictionary according to the English names of the fields and the corresponding English synonyms;
wherein the synonym dictionary comprises the Chinese synonym dictionary and the English synonym dictionary.
Illustratively, after the Chinese name and the English name of a field are obtained from the existing table structure, the synonyms corresponding to each name can be obtained for the Chinese name and the English name respectively. A Chinese synonym dictionary can then be constructed from the Chinese names of the fields and their corresponding Chinese synonyms after merging and de-duplication, and can be used to query the synonyms corresponding to the Chinese name of the target data item. Similarly, an English synonym dictionary can be constructed from the English names of the fields and their corresponding English synonyms after merging and de-duplication, and can be used to query the synonyms corresponding to the English name of the target data item.
Therefore, the Chinese synonym dictionary and the English synonym dictionary are respectively constructed, so that the language types of the names can be richer, and the synonyms corresponding to the names of different language types can be conveniently inquired in the later period.
Based on this, in order to make the content of the enterprise-level Chinese synonym dictionary richer and more comprehensive, the step of acquiring a plurality of Chinese synonyms corresponding to the Chinese names may specifically include:
performing word segmentation processing on the Chinese name of the field to obtain Chinese word segments corresponding to the field, and searching for Chinese synonyms corresponding to the Chinese word segments by using a preset search engine to obtain a first Chinese synonym set;
under the condition that the number of occurrences of the field in the existing data table is greater than or equal to a preset number of times, acquiring a target field in the existing data table that has the same English name as the field, determining the target Chinese name corresponding to the target field as a Chinese synonym corresponding to the Chinese name, and obtaining a second Chinese synonym set;
and determining a plurality of Chinese synonyms corresponding to the Chinese names according to the first Chinese synonym set and the second Chinese synonym set.
Here, there may be a plurality of preset search engines, and different search engines may be used to search for synonyms in different domains. The preset number of times may be, for example, 3.
For example, after the Chinese name of a field is obtained, the Chinese name may first be subjected to word segmentation processing to obtain one or more Chinese word segments, and synonym searches are then performed on the Chinese word segments through search engines of different domains, so that a first Chinese synonym set consisting of a plurality of Chinese synonyms can be obtained. In addition, when the field occurs frequently, for example, when the field occurs in the existing data table 3 or more times, a target field in the existing data table that has the same English name as the field is obtained, and the target Chinese name corresponding to the target field is determined as a Chinese synonym corresponding to the Chinese name of the field, so that a second Chinese synonym set consisting of a plurality of Chinese synonyms can be obtained. For example, if the English name corresponding to the Chinese name A of the field is B and the English name corresponding to the Chinese name C of the target field is also B, the Chinese name C can be used as a Chinese synonym corresponding to the Chinese name A.
After the first Chinese synonym set and the second Chinese synonym set are obtained, the first Chinese synonym set and the second Chinese synonym set can be merged and deduplicated, and then words contained in a processing result are used as the Chinese synonyms corresponding to the Chinese names of the fields.
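As a non-limiting illustration, the derivation of the second Chinese synonym set and its merging with the first set can be sketched as follows. The sketch assumes that the existing data tables are available as one (Chinese name, English name) pair per field occurrence and that the occurrence count is taken per pair; both are assumptions, as are all function names. The placeholder names A, B and C mirror the example above, and the manual confirmation of candidate fields is not modeled.

```python
from collections import Counter

PRESET_COUNT = 3  # example threshold mentioned in the description


def second_synonym_set(field_pairs, chinese_name, preset_count=PRESET_COUNT):
    """Collect candidate Chinese synonyms that share an English name.

    field_pairs holds one (chinese_name, english_name) tuple per field
    occurrence in the existing data tables (hypothetical input layout).
    """
    occurrences = Counter(field_pairs)
    # English names of sufficiently frequent fields carrying this Chinese name.
    english_names = {
        en
        for (zh, en), count in occurrences.items()
        if zh == chinese_name and count >= preset_count
    }
    # Other Chinese names attached to the same English names are candidates.
    return {
        zh for zh, en in field_pairs
        if en in english_names and zh != chinese_name
    }


def merge_synonym_sets(first_set, second_set):
    """Merge the two synonym sets; the set union also de-duplicates."""
    return set(first_set) | set(second_set)


# Field (A, B) occurs three times; field (C, B) shares the English name B.
field_pairs = [("A", "B")] * 3 + [("C", "B"), ("E", "F")]
second = second_synonym_set(field_pairs, "A")
# {"A1", "C"} stands in for a first set returned by search engines.
chinese_synonyms = merge_synonym_sets({"A1", "C"}, second)
```

The same routine could serve the English case by swapping the roles of the two names in each pair.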
It should be noted that, the manner of determining the target field may be a semi-automatic manner, that is, it may be determined whether two fields with the same english name are actually replaceable fields with the same function in the business field by means of manual comparison, and in case that the two fields are determined to be replaceable fields, the replaceable fields are determined to be the target fields.
Similarly, in order to make the content of the enterprise-level English synonym dictionary richer and more comprehensive, the step of obtaining a plurality of English synonyms corresponding to the English names may specifically include:
performing word segmentation processing on the English name of the field to obtain English word segments corresponding to the field, and searching for English synonyms corresponding to the English word segments by using a preset search engine to obtain a first English synonym set;
under the condition that the occurrence frequency of the field in the existing data table is greater than or equal to the preset frequency, acquiring a target field which is the same as the Chinese name corresponding to the field in the existing data table, determining a target English name corresponding to the target field as an English synonym corresponding to the English name, and acquiring a second English synonym set;
and determining a plurality of English synonyms corresponding to the English names according to the first English synonym set and the second English synonym set.
Here, the principle of obtaining the English synonyms is the same as that of obtaining the Chinese synonyms, and is not described here again.
In some specific examples, the process of creating the synonym dictionary may be as shown in FIG. 3. After the Chinese name and English name pair of each field in the existing table structure is obtained, synonym sets can be obtained in two ways. In the first way, the Chinese names and the English names are respectively subjected to word segmentation processing to obtain a Chinese word segment set A1 and an English word segment set B1, and synonym searches are performed on A1 and B1 through a plurality of preset search engines to obtain the synonyms of the word segments, namely a Chinese synonym set A2 and an English synonym set B2. In the second way, the fields whose frequency is greater than or equal to 3 in the existing table structure are screened out, and the Chinese names corresponding to fields with the same English name are manually compared; if the two Chinese names express the same business meaning, they can be determined to be synonyms. Similarly, the English names corresponding to fields with the same Chinese name are manually compared; if the two English names express the same business meaning, they can be determined to be synonyms. A synonym set A3 of Chinese names and a synonym set B3 of English names can thus be obtained. Finally, the union of A2 and A3 is taken and de-duplicated to obtain the enterprise-level Chinese synonym dictionary, and the union of B2 and B3 is taken and de-duplicated to obtain the enterprise-level English synonym dictionary.
Therefore, by acquiring synonyms for the Chinese names and the English names in two ways, namely searching different domains for the general synonyms corresponding to each name, and searching the industry-specific vocabulary and the enterprise-specific vocabulary recorded in the enterprise data tables for the synonyms of specialized terms, the synonyms corresponding to the field name of each field can be acquired more comprehensively, and the contents of the enterprise-level Chinese synonym dictionary and English synonym dictionary are richer.
In addition, as shown in fig. 4, in some embodiments, when the data format is included in both the data item feature and the standard feature, the step S120 may specifically include the following steps:
S410, for each data standard in the multiple data standards, representing a standard data format corresponding to the data standard as attribute values respectively corresponding to multiple preset attributes, to obtain a standard attribute value corresponding to the data standard;
S420, representing a target data format corresponding to the target data item as attribute values respectively corresponding to the multiple preset attributes, to obtain a target attribute value corresponding to the target data item;
S430, for each attribute in the multiple preset attributes, comparing the target attribute value with the standard attribute value to obtain an attribute value comparison result;
S440, determining the feature similarity corresponding to the data format between the target data item and the data standard based on the attribute value comparison result corresponding to each attribute in the multiple preset attributes.
Here, because the expressions of different documents and different databases for the data formats are different, the present application proposes a method capable of performing unified expression on the data formats in different documents and different databases, and provides a similarity calculation method after the unified expression.
In some embodiments, the plurality of preset attributes may include at least two of whether a numeric character is included, whether an alphabetical character is included, whether a special character is included, a maximum length, a minimum length, a length after decimal point, whether a date is included, and whether a time is included.
Illustratively, by investigating the data formats of various databases and the data formats of documents related to financial field data standards, these different data formats can be uniformly expressed as eight types of attributes such as "whether or not to include numeric characters", "whether or not to include alphabetic characters", "whether or not to include special characters", "maximum length", "minimum length", "length after decimal point", "date", "time", and the like.
For example, the standard data format "n..17,2" corresponding to a data standard can be uniformly represented as eight attribute values: "includes numeric characters, does not include alphabetic characters, does not include special characters, maximum length 17, no minimum length, length after decimal point 2, not a date, not a time".
For another example, the target data format "double (5, 3)" corresponding to a target data item in a database can be uniformly represented as eight attribute values: "includes numeric characters, does not include alphabetic characters, does not include special characters, maximum length 5, no minimum length, length after decimal point 3, not a date, not a time".
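As a non-limiting illustration, the unified representation of the two example formats can be sketched as follows. The class name, the field names, and in particular the two parsing rules are assumptions chosen to cover only the example formats "n..17,2" and "double (5, 3)"; the application does not specify how individual format notations are parsed.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormatAttributes:
    """The eight preset attributes used for the unified representation."""
    has_digits: bool
    has_letters: bool
    has_special: bool
    max_length: Optional[int]
    min_length: Optional[int]
    decimal_length: Optional[int]
    is_date: bool
    is_time: bool


# Assumed notation: character classes (a, n, s), optional "..", length, decimals.
_STANDARD = re.compile(
    r"^(?P<chars>[ans]+)(?P<upto>\.\.)?(?P<length>\d+)(?:,(?P<dec>\d+))?$"
)
# Assumed notation for numeric database column types, e.g. "double (5, 3)".
_DB_NUMERIC = re.compile(
    r"^(?P<type>double|decimal|numeric|float)\s*\(\s*(?P<length>\d+)\s*"
    r"(?:,\s*(?P<dec>\d+)\s*)?\)$",
    re.IGNORECASE,
)


def parse_standard_format(text):
    """Parse a standard-document format such as "n..17,2"."""
    m = _STANDARD.match(text.strip())
    if m is None:
        raise ValueError(f"unsupported standard format: {text!r}")
    length = int(m["length"])
    return FormatAttributes(
        has_digits="n" in m["chars"],
        has_letters="a" in m["chars"],
        has_special="s" in m["chars"],
        max_length=length,
        # Assumption: ".." means "up to", so no minimum length is defined.
        min_length=None if m["upto"] else length,
        decimal_length=int(m["dec"]) if m["dec"] else None,
        is_date=False,
        is_time=False,
    )


def parse_db_format(text):
    """Parse a numeric database column type such as "double (5, 3)"."""
    m = _DB_NUMERIC.match(text.strip())
    if m is None:
        raise ValueError(f"unsupported database format: {text!r}")
    return FormatAttributes(
        has_digits=True, has_letters=False, has_special=False,
        max_length=int(m["length"]), min_length=None,
        decimal_length=int(m["dec"]) if m["dec"] else None,
        is_date=False, is_time=False,
    )


standard_attrs = parse_standard_format("n..17,2")
target_attrs = parse_db_format("double (5, 3)")
```

Once both sides are expressed with the same eight attributes, they can be compared attribute by attribute regardless of the notation they originally used.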
Based on this, the attribute values corresponding to the uniformly expressed data formats can be compared to calculate the feature similarity corresponding to the data format between the target data item and the data standard.
For example, if the attribute values of the target data item and the data standard corresponding to the attribute of "date or time" are both "yes", that is, the data formats of the target data item and the data standard are both date or time, it may be determined that the feature similarity corresponding to the data format between the target data item and the data standard is 1; otherwise, if the attribute values of the target data item and the data standard corresponding to other attributes are the same or equal, the weight values can be accumulated according to the weight values corresponding to the attributes with the same or equal attribute values, so that the finally obtained accumulation result can be used as the feature similarity corresponding to the data format between the target data item and the data standard.
In some specific examples, suppose the weight values of "date" and "time" are 1, the weight value of "includes numeric characters" is 0.25, the weight value of "includes alphabetic characters" is 0.25, the weight value of "includes special characters" is 0.2, the weight value of "maximum length" is 0.15, the weight value of "minimum length" is 0.1, and the weight value of "length after decimal point" is 0.05. After the target data format "double (5, 3)" and the standard data format "n..17,2" are uniformly represented, the feature similarity corresponding to the data format between the target data item and the data standard can be calculated to be 0.25.
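As a non-limiting illustration, the weighted comparison can be sketched as follows. To arrive at the value 0.25 stated in the example, the sketch assumes that an attribute contributes its weight only when it is actually present in both formats (a true flag or a defined length) and the two values are equal. This reading is an assumption: the description does not say whether attributes that are absent in both formats also count, and counting them would give a different total. The dictionary keys are hypothetical names for the eight preset attributes.

```python
# Weights from the example; "date" and "time" short-circuit to 1.
WEIGHTS = {
    "has_digits": 0.25,
    "has_letters": 0.25,
    "has_special": 0.2,
    "max_length": 0.15,
    "min_length": 0.1,
    "decimal_length": 0.05,
}


def format_similarity(target, standard):
    """Accumulate the weights of attributes that match in both formats."""
    # If both formats are dates (or both are times), the similarity is 1.
    for key in ("is_date", "is_time"):
        if target[key] and standard[key]:
            return 1.0
    score = 0.0
    for key, weight in WEIGHTS.items():
        t, s = target[key], standard[key]
        # Assumption: False / None means "attribute absent" and is not a match.
        present = (t is not False and t is not None
                   and s is not False and s is not None)
        if present and t == s:
            score += weight
    return score


# Unified representations of "double (5, 3)" and "n..17,2".
target = {"has_digits": True, "has_letters": False, "has_special": False,
          "max_length": 5, "min_length": None, "decimal_length": 3,
          "is_date": False, "is_time": False}
standard = {"has_digits": True, "has_letters": False, "has_special": False,
            "max_length": 17, "min_length": None, "decimal_length": 2,
            "is_date": False, "is_time": False}
similarity = format_similarity(target, standard)
print(similarity)
```

Under this reading only "includes numeric characters" matches, so the accumulated weight is 0.25.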
Therefore, through the process, the characteristic similarity between the target data item and the data standard can be calculated from the dimensionality of the data format, so that the similarity between the target data item and the data standard can be calculated more comprehensively, and the comprehensiveness and the accuracy of similarity calculation are further improved.
In addition, as shown in fig. 5, in some embodiments, when the remark information is included in both the data item feature and the standard feature, the step S120 may specifically include the following steps:
S510, for each data standard in the multiple data standards, calculating, according to a preset edit distance algorithm, an edit distance between target remark information corresponding to the target data item and standard remark information corresponding to the data standard;
S520, determining the feature similarity corresponding to the remark information between the target data item and the data standard according to the edit distance.
Here, the preset edit distance algorithm may be, for example, a Jaro-Winkler distance algorithm. The target remark information and the standard remark information can record information such as filling value specification and/or value description.
For example, the Jaro-Winkler distance between the target remark information corresponding to the target data item and the standard remark information corresponding to the data standard may be calculated according to the Jaro-Winkler distance algorithm, and the distance, or a result obtained by applying a preset transformation to the distance, may be determined as the feature similarity corresponding to the remark information between the target data item and the data standard.
Therefore, through the process, the characteristic similarity between the target data item and the data standard can be calculated from the dimensionality of the remark information, so that the similarity between the target data item and the data standard can be calculated more comprehensively, and the comprehensiveness and the accuracy of similarity calculation are further improved.
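As a non-limiting illustration, the Jaro-Winkler computation can be sketched as follows. The sketch returns the measure in its similarity form (1 for identical strings, 0 for no matching characters), and the prefix scale of 0.1 with a prefix cap of 4 are the commonly used defaults rather than values taken from the application. The function names are hypothetical.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro-Winkler similarity: Jaro boosted by the common prefix length."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


remark_similarity = jaro_winkler("MARTHA", "MARHTA")
```

Because the value already lies between 0 and 1 and grows with similarity, it can be used directly as the feature similarity, or passed through a preset transformation first.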
Based on this, in order to better describe the process of data standard matching, some specific examples are given based on the above embodiments.
For example, FIG. 6 shows a flow diagram of a data standard matching method. The calculation of the similarity may include the following four parts.
(1) Similarity calculation of Chinese names
Synonym replacement is performed on the Chinese name of the target data item according to the established enterprise-level Chinese synonym dictionary to obtain a set C consisting of the new names after replacement and the original name, and the cosine similarity between each Chinese name in the set C and the Chinese name of each data standard is calculated.
(2) Similarity calculation of English names
Synonym replacement is performed on the English name of the target data item according to the established enterprise-level English synonym dictionary to obtain a set D consisting of the new names after replacement and the original name, and the cosine similarity between each English name in the set D and the English name of each data standard is calculated.
(3) Similarity calculation of data formats
The data format of the target data item and the data format of the data standard are each expressed in a unified manner, and the similarity between the data format of the target data item and the data format of the data standard is calculated based on the unified expression.
(4) Similarity calculation of remark information
The Jaro-Winkler distance between the remark information of the target data item and the remark information of the data standard is calculated to obtain the similarity between the remark information of the target data item and the remark information of the data standard.
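As a non-limiting illustration, the name similarity of parts (1) and (2) can be sketched as follows. The application does not specify how a name is turned into a vector for the cosine similarity; this sketch assumes a simple character-count vector, and the function names and sample names are hypothetical. The highest similarity over the original name and its synonyms is taken as the feature similarity for the name.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two strings using character-count vectors.

    The character-level vectorization is an assumption of this sketch.
    """
    va, vb = Counter(a), Counter(b)
    dot = sum(count * vb[ch] for ch, count in va.items())
    norm = math.sqrt(
        sum(v * v for v in va.values()) * sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def name_feature_similarity(item_name, synonyms, standard_name):
    """Compare the original name and every synonym with the standard name
    and keep the highest similarity."""
    candidates = {item_name, *synonyms}
    return max(cosine_similarity(c, standard_name) for c in candidates)


# Hypothetical names: the synonym equals the standard name exactly.
direct = cosine_similarity("cust_name", "customer_name")
with_synonyms = name_feature_similarity(
    "cust_name", ["customer_name"], "customer_name"
)
```

Including the synonyms can only raise the score, since the original name always remains one of the candidates.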
Based on this, after the similarities in the four dimensions are obtained, they are input into the improved post-fusion model, which outputs the similarity between the target data item and each data standard, and the one or more data standards with the highest similarity are selected as the data standards matching the target data item.
To sum up, in the embodiment of the present application, an enterprise-level synonym dictionary is first established. Then, in combination with the established enterprise-level synonym dictionary, the similarity between the target data item and each data standard is calculated in four dimensions: Chinese name, English name, data format, and remark information. For the calculation of the data format similarity, the embodiment of the present application provides a method for uniformly expressing the data formats in different documents and different databases, together with a similarity calculation method based on the unified expression. The similarities calculated in the four dimensions are then input into a post-fusion model. Because an inherent priority relationship exists among the different feature items, the post-fusion model is improved so that features with a higher priority have a greater influence in the model. The data standard with the highest similarity to the target data item is determined through the improved post-fusion model.
It should be noted that the application scenarios described in the embodiments of the present application are for more clearly illustrating the technical solutions in the embodiments of the present application, and do not constitute limitations on the technical solutions provided in the embodiments of the present application, and as a person having ordinary skill in the art can appreciate, with the occurrence of new application scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
Based on the same inventive concept, the application also provides a data standard matching device. The details are described with reference to fig. 7.
Fig. 7 is a schematic structural diagram of a data standard matching apparatus according to an embodiment of the present application.
As shown in fig. 7, the data standard matching apparatus 700 may include:
a feature obtaining module 701, configured to obtain a data item feature corresponding to a target data item and standard features corresponding to multiple data standards, where the data item feature and the standard features each include at least two feature items in a name, a data format, and remark information;
a first determining module 702, configured to determine, for each data standard of the plurality of data standards, a feature similarity corresponding to each feature item between the target data item and the data standard according to the data item feature and a standard feature corresponding to the data standard;
a second determining module 703, configured to determine similarity between the target data item and the data standard according to feature similarities corresponding to the at least two feature items, respectively;
a criterion determining module 704, configured to determine, based on the similarity between the target data item and each of the multiple data standards, a target data standard with the highest similarity from the multiple data standards, and determine the target data standard as the data standard that matches the target data item.
The data standard matching apparatus 700 is described in detail below.
In some embodiments, the first determining module 702 comprises:
the first obtaining sub-module is used for obtaining at least one synonym corresponding to the data item name of the target data item under the condition that the data item feature and the standard feature both comprise the name, and obtaining a target synonym set;
the first calculation submodule is used for calculating the similarity between the data item name and the corresponding synonym thereof and the standard name of the data standard respectively based on the target synonym set aiming at each data standard in the plurality of data standards to obtain a plurality of name similarities corresponding to the target data item;
and the first determining submodule is used for determining the feature similarity corresponding to the name between the target data item and the data standard according to the plurality of name similarities.
In some embodiments, the first determining submodule is specifically configured to:
and determining the highest name similarity in the plurality of name similarities as the feature similarity corresponding to the name between the target data item and the data standard.
In some embodiments, the first determining module 702 further comprises:
the second acquisition submodule is used for acquiring field names of a plurality of fields contained in the existing data table before acquiring at least one synonym corresponding to the data item name of the target data item;
a third obtaining sub-module, configured to obtain, for a field name of each of the multiple fields, multiple synonyms corresponding to the field name;
the dictionary establishing sub-module is used for establishing a synonym dictionary according to the field names of the fields and the synonyms corresponding to the field names;
the first obtaining submodule is specifically configured to:
querying, based on the synonym dictionary, at least one synonym corresponding to a data item name of the target data item.
In some of these embodiments, the names include Chinese names and their corresponding English names.
In some of these embodiments, the third acquisition submodule includes:
a first acquisition unit configured to acquire a plurality of Chinese synonyms corresponding to the Chinese names;
a second obtaining unit configured to obtain a plurality of English synonyms corresponding to the English name;
the dictionary establishing sub-module includes:
the first establishing unit is used for establishing a Chinese synonym dictionary according to the Chinese names of the fields and the corresponding Chinese synonyms; and
the second establishing unit is used for establishing an English synonym dictionary according to the English names of the fields and the corresponding English synonyms;
wherein the synonym dictionary comprises the Chinese synonym dictionary and the English synonym dictionary.
In some embodiments, the first obtaining unit comprises:
the first processing subunit is used for performing word segmentation processing on the Chinese name of the field to obtain a Chinese word segmentation corresponding to the field, and searching a Chinese synonym corresponding to the Chinese word segmentation by using a preset search engine to obtain a first Chinese synonym set;
the second processing subunit is used for acquiring a target field in the existing data table that has the same English name as the field under the condition that the number of occurrences of the field in the existing data table is greater than or equal to the preset number of times, determining a target Chinese name corresponding to the target field as a Chinese synonym corresponding to the Chinese name, and obtaining a second Chinese synonym set;
and the first determining subunit is used for determining a plurality of Chinese synonyms corresponding to the Chinese names according to the first Chinese synonym set and the second Chinese synonym set.
In some embodiments, the second obtaining unit comprises:
the third processing subunit is configured to perform word segmentation processing on the English name of the field to obtain English word segments corresponding to the field, and search for English synonyms corresponding to the English word segments by using a preset search engine to obtain a first English synonym set;
the fourth processing subunit is configured to, when the number of occurrences of the field in the existing data table is greater than or equal to a preset number of times, obtain a target field in the existing data table that has the same Chinese name as the field, determine a target English name corresponding to the target field as an English synonym corresponding to the English name, and obtain a second English synonym set;
and the second determining subunit is configured to determine, according to the first English synonym set and the second English synonym set, a plurality of English synonyms corresponding to the English name.
In some embodiments, the first determining module 702 comprises:
a first representation submodule, configured to, when the data item feature and the standard feature both include the data format, represent, for each of the multiple data standards, a standard data format corresponding to the data standard as attribute values corresponding to multiple preset attributes, respectively, and obtain a standard attribute value corresponding to the data standard;
the second representation submodule is used for representing a target data format corresponding to the target data item into attribute values respectively corresponding to the plurality of preset attributes to obtain a target attribute value corresponding to the target data item;
the attribute comparison submodule is used for comparing the target attribute value with the standard attribute value aiming at each attribute in the plurality of preset attributes to obtain an attribute value comparison result;
and the second determining submodule is used for determining the characteristic similarity corresponding to the data format between the target data item and the data standard based on the attribute value comparison result corresponding to each attribute in the plurality of preset attributes.
In some of these embodiments, the plurality of preset attributes includes at least two of whether a numeric character is included, whether an alphabetic character is included, whether a special character is included, a maximum length, a minimum length, a length after a decimal point, whether a date is included, and whether a time is included.
In some embodiments, the first determining module 702 comprises:
a distance calculation sub-module, configured to calculate, according to a preset edit distance algorithm, an edit distance between target remark information corresponding to the target data item and standard remark information corresponding to the data standard for each of the multiple data standards when the remark information is included in both the data item feature and the standard feature;
and the third determining sub-module is used for determining the feature similarity corresponding to the remark information between the target data item and the data standard according to the editing distance.
In some embodiments, the second determining module 703 includes:
the weight determining submodule is used for determining the weight corresponding to each of the at least two characteristic items according to the priority corresponding to each of the at least two characteristic items;
and the weighted summation sub-module is used for carrying out weighted summation on the feature similarity based on the weights respectively corresponding to the at least two feature items to obtain the similarity between the target data item and the data standard.
In some embodiments, the weight corresponding to a first target feature item of the at least two feature items is greater than the sum of the weights corresponding to the second target feature items, where the first target feature item is any one of the at least two feature items other than the feature item with the lowest priority, and the second target feature items are the feature items with a lower priority than the first target feature item.
In some embodiments, when the data item feature and the standard feature each include a name, a data format, and remark information, and the name includes a Chinese name and an English name corresponding to the Chinese name, the priorities of the feature items are, in order from highest to lowest: the priority corresponding to the Chinese name, the priority corresponding to the English name, the priority corresponding to the remark information, and the priority corresponding to the data format.
Therefore, the feature similarity corresponding to each feature item between the target data item and the data standard is calculated from the dimensions corresponding to at least two feature items in the name, the data format and the remark information, the similarity between the target data item and the data standard is comprehensively determined from the feature similarities under at least two dimensions, and the target data standard with the highest similarity is determined from the multiple data standards and is the data standard matched with the target data item. Therefore, the embodiment of the application can realize automatic matching between the data items and the data standards, and compared with a manual matching mode, the matching mode of the embodiment of the application has higher efficiency. In addition, according to the embodiment of the application, the similarity between the target data item and the data standard is comprehensively calculated according to the dimensionalities corresponding to at least two feature items in the name, the data format and the remark information, so that compared with a mode of matching the data standard only according to the Chinese name, the embodiment of the application can enable the matching result to be more accurate. In summary, the embodiment of the application can improve the matching efficiency and improve the accuracy of the matching result.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
The electronic device 800 may include a processor 801 and a memory 802 that stores computer program instructions.
Specifically, the processor 801 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
Memory 802 may include mass storage for data or instructions. By way of example, and not limitation, memory 802 may include a Hard Disk Drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, a tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 802 may include removable or non-removable (or fixed) media, where appropriate. The memory 802 may be internal or external to the electronic device 800, where appropriate. In a particular embodiment, the memory 802 is a non-volatile solid-state memory.
In particular embodiments, memory may include Read Only Memory (ROM), random Access Memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, electrical, optical, or other physical/tangible memory storage devices. Thus, in general, the memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software comprising computer-executable instructions and when the software is executed (e.g., by one or more processors), it is operable to perform operations described with reference to the methods according to an aspect of the application.
The processor 801 reads and executes computer program instructions stored in the memory 802 to implement any of the data standard matching methods in the above-described embodiments.
In some examples, electronic device 800 may also include a communication interface 803 and a bus 810. As shown in fig. 8, the processor 801, the memory 802, and the communication interface 803 are connected via a bus 810 to complete communication therebetween.
The communication interface 803 is mainly used for implementing communication between various modules, apparatuses, units and/or devices in the embodiment of the present application.
Bus 810 includes hardware, software, or both, and couples the components of the electronic device 800 to each other. By way of example, and not limitation, the bus 810 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or other suitable bus, or a combination of two or more of these. Bus 810 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
Illustratively, the electronic device 800 may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like.
The electronic device 800 may execute the data standard matching method in the embodiment of the present application, so as to implement the data standard matching method and apparatus described in conjunction with fig. 1 to 7.
In addition, in combination with the data standard matching method in the foregoing embodiments, the embodiments of the present application may provide a computer-readable storage medium for implementing the method. The computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any of the data standard matching methods in the above embodiments. Examples of computer-readable storage media include non-transitory computer-readable storage media such as portable disks, hard disks, Random Access Memories (RAMs), Read Only Memories (ROMs), erasable programmable read only memories (EPROMs or flash memories), portable compact disc read only memories (CD-ROMs), optical storage devices, magnetic storage devices, and so forth.
It is to be understood that the present application is not limited to the particular arrangements and instrumentality described above and shown in the attached drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present application are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications, and additions or change the order between the steps after comprehending the spirit of the present application.
The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, a functional block may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the present application are the programs or code segments used to perform the required tasks. The programs or code segments can be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranets, etc.
It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems based on a series of steps or devices. However, the present application is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Aspects of the present application are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such a processor may be, but is not limited to, a general purpose processor, a special purpose processor, an application specific processor, or a field programmable logic circuit. It will also be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware for performing the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As described above, only the specific embodiments of the present application are provided, and it can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the module and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It should be understood that the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered within the scope of the present application.

Claims (18)

1. A data standard matching method, comprising:
acquiring data item characteristics corresponding to a target data item and standard characteristics corresponding to a plurality of data standards respectively, wherein the data item characteristics and the standard characteristics respectively comprise at least two characteristic items in name, data format and remark information;
for each data standard of the plurality of data standards, determining a feature similarity corresponding to each feature item between the target data item and the data standard according to the data item feature and a standard feature corresponding to the data standard;
determining the similarity between the target data item and the data standard according to the feature similarity corresponding to the at least two feature items respectively;
and determining a target data standard with the highest similarity from the plurality of data standards based on the similarity between the target data item and each of the plurality of data standards, and determining the target data standard as the data standard matched with the target data item.
2. The method of claim 1, wherein, in the case that the name is included in both the data item feature and the standard feature, the determining, for each of the plurality of data standards, a feature similarity for each feature item between the target data item and the data standard from the data item feature and a standard feature corresponding to the data standard comprises:
acquiring at least one synonym corresponding to the data item name of the target data item to obtain a target synonym set;
for each data standard in the multiple data standards, calculating similarity between the data item name and the corresponding synonym thereof and the standard name of the data standard respectively based on the target synonym set to obtain multiple name similarities corresponding to the target data item;
and determining feature similarity corresponding to the name between the target data item and the data standard according to the plurality of name similarities.
3. The method of claim 2, wherein determining the feature similarity corresponding to the name between the target data item and the data standard based on the plurality of name similarities comprises:
and determining the highest name similarity in the plurality of name similarities as the feature similarity corresponding to the name between the target data item and the data standard.
4. The method of claim 2, wherein prior to obtaining at least one synonym corresponding to a data item name of a target data item, the method further comprises:
acquiring field names of a plurality of fields contained in an existing data table;
aiming at the field name of each field in the fields, acquiring a plurality of synonyms corresponding to the field name;
establishing a synonym dictionary according to the field names of the fields and the corresponding synonyms;
the obtaining of the at least one synonym corresponding to the data item name of the target data item includes:
querying, based on the synonym dictionary, at least one synonym corresponding to a data item name of the target data item.
5. The method of claim 4, wherein the names comprise Chinese names and their corresponding English names.
6. The method of claim 5, wherein obtaining a plurality of synonyms corresponding to the field name comprises:
acquiring a plurality of Chinese synonyms corresponding to the Chinese names;
acquiring a plurality of English synonyms corresponding to the English names;
establishing a synonym dictionary according to the field names of the fields and the synonyms corresponding to the field names, wherein the establishing of the synonym dictionary comprises the following steps:
establishing a Chinese synonym dictionary according to the Chinese names of the fields and the corresponding Chinese synonyms; and
establishing an English synonym dictionary according to the English names of the fields and the corresponding English synonyms;
wherein the synonym dictionary includes the Chinese synonym dictionary and the English synonym dictionary.
7. The method of claim 6, wherein obtaining a plurality of Chinese synonyms corresponding to the Chinese name comprises:
performing word segmentation processing on the Chinese name of the field to obtain a Chinese word segmentation corresponding to the field, and searching a Chinese synonym corresponding to the Chinese word segmentation by using a preset search engine to obtain a first Chinese synonym set;
under the condition that the occurrence frequency of the field in the existing data table is larger than or equal to the preset frequency, acquiring a target field with the same English name as the field in the existing data table, determining a target Chinese name corresponding to the target field as a Chinese synonym corresponding to the Chinese name, and acquiring a second Chinese synonym set;
and determining a plurality of Chinese synonyms corresponding to the Chinese names according to the first Chinese synonym set and the second Chinese synonym set.
8. The method of claim 6, wherein the obtaining a plurality of English synonyms corresponding to the English name comprises:
performing word segmentation processing on the English name of the field to obtain an English word corresponding to the field, and searching an English synonym corresponding to the English word by using a preset search engine to obtain a first English synonym set;
under the condition that the occurrence frequency of the field in the existing data table is greater than or equal to the preset frequency, acquiring a target field with the same Chinese name corresponding to the field in the existing data table, determining a target English name corresponding to the target field as an English synonym corresponding to the English name, and acquiring a second English synonym set;
and determining a plurality of English synonyms corresponding to the English names according to the first English synonym set and the second English synonym set.
9. The method of claim 1, wherein, in the case where the data format is included in both the data item feature and the standard feature, the determining, for each of the plurality of data standards, a feature similarity for each feature item between the target data item and the data standard from the data item feature and a standard feature corresponding to the data standard comprises:
aiming at each data standard in the plurality of data standards, expressing a standard data format corresponding to the data standard as attribute values respectively corresponding to a plurality of preset attributes to obtain a standard attribute value corresponding to the data standard;
representing a target data format corresponding to the target data item as attribute values corresponding to the preset attributes respectively to obtain a target attribute value corresponding to the target data item;
comparing the target attribute value with the standard attribute value aiming at each attribute in the plurality of preset attributes to obtain an attribute value comparison result;
and determining the characteristic similarity corresponding to the data format between the target data item and the data standard based on the attribute value comparison result corresponding to each attribute in the plurality of preset attributes.
10. The method of claim 9, wherein the plurality of preset attributes includes at least two of: whether numeric characters are included, whether alphabetic characters are included, whether special characters are included, a maximum length, a minimum length, a length after the decimal point, whether the value is a date, and whether the value is a time.
11. The method of claim 1, wherein, in the case where the remark information is included in both the data item feature and the standard feature, the determining, for each of the plurality of data standards, a feature similarity corresponding to each feature item between the target data item and the data standard from the data item feature and a standard feature corresponding to the data standard comprises:
calculating an editing distance between target remark information corresponding to the target data item and standard remark information corresponding to the data standard according to a preset editing distance algorithm aiming at each data standard in the plurality of data standards;
and determining the feature similarity corresponding to the remark information between the target data item and the data standard according to the editing distance.
12. The method according to claim 1, wherein the determining the similarity between the target data item and the data standard according to the feature similarity corresponding to each of the at least two feature items comprises:
determining the weight corresponding to each of the at least two characteristic items according to the priority corresponding to each of the at least two characteristic items;
and carrying out weighted summation on the feature similarity based on the weights respectively corresponding to the at least two feature items to obtain the similarity between the target data item and the data standard.
13. The method according to claim 12, wherein a weight corresponding to a first target feature of the at least two features is greater than a sum of weights corresponding to second target features, wherein the first target feature is any one of the at least two features except a feature with a smallest priority, and the second target feature is a feature with a smaller priority than the first target feature.
14. The method according to claim 12 or 13, wherein when the data item feature and the standard feature each include a name, a data format, and remark information, and the name includes a chinese name and an english name corresponding thereto, the priority of each feature item is, in order from large to small: the priority corresponding to the Chinese name, the priority corresponding to the English name, the priority corresponding to the remark information and the priority corresponding to the data format.
15. A data standard matching apparatus, comprising:
a feature acquisition module, configured to acquire data item features corresponding to a target data item and standard features corresponding to a plurality of data standards respectively, wherein the data item features and the standard features each comprise at least two feature items among name, data format and remark information;
a first determining module, configured to determine, for each of the plurality of data standards, a feature similarity corresponding to each feature item between the target data item and the data standard according to the data item feature and a standard feature corresponding to the data standard;
the second determining module is used for determining the similarity between the target data item and the data standard according to the feature similarity corresponding to the at least two feature items respectively;
and a standard determining module, configured to determine a target data standard with the highest similarity from the plurality of data standards based on the similarity between the target data item and each data standard in the plurality of data standards, and to determine the target data standard as the data standard matched with the target data item.
16. An electronic device, characterized in that the device comprises: a processor and a memory storing computer program instructions;
the processor, when executing the computer program instructions, performs the steps of the data standard matching method of any one of claims 1-14.
17. A computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, implement the steps of the data standard matching method of any one of claims 1-14.
18. A computer program product, wherein instructions in the computer program product, when executed by a processor of an electronic device, cause the electronic device to perform the steps of the data standard matching method according to any one of claims 1-14.
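For illustration only, the per-feature similarities recited in claims 2-3 and 9-11 might be sketched in Python as follows. The claims do not fix how an edit distance is converted into a similarity or how attribute comparison results are aggregated, so the normalization by the longer string, the fraction-of-matching-attributes score, and all function and attribute names below are assumptions rather than the claimed implementation.

```python
from typing import Callable, Dict, Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def remark_similarity(target: str, standard: str) -> float:
    """Map the edit distance to [0, 1]; this normalization is an assumption."""
    longest = max(len(target), len(standard))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(target, standard) / longest


def format_similarity(target_attrs: Dict[str, object],
                      standard_attrs: Dict[str, object]) -> float:
    """Fraction of preset attributes whose values agree (assumed aggregation)."""
    if not standard_attrs:
        return 0.0
    matches = sum(1 for k, v in standard_attrs.items()
                  if target_attrs.get(k) == v)
    return matches / len(standard_attrs)


def name_similarity(name: str, synonyms: Iterable[str], standard_name: str,
                    sim: Callable[[str, str], float]) -> float:
    """Highest similarity over the data item name and its synonyms."""
    return max(sim(candidate, standard_name) for candidate in [name, *synonyms])


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(format_similarity({"has_digits": True, "max_length": 19},
                            {"has_digits": True, "max_length": 16}))
    print(name_similarity("card_no", ["card_number"], "card_number",
                          remark_similarity))
```

Here `name_similarity` reuses the string similarity as a stand-in scorer; any other name-similarity function with the same signature could be passed instead.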
CN202211402748.5A 2022-11-10 2022-11-10 Data standard matching method, device, equipment, medium and product Pending CN115794899A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211402748.5A CN115794899A (en) 2022-11-10 2022-11-10 Data standard matching method, device, equipment, medium and product

Publications (1)

Publication Number Publication Date
CN115794899A true CN115794899A (en) 2023-03-14

Family

ID=85436520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211402748.5A Pending CN115794899A (en) 2022-11-10 2022-11-10 Data standard matching method, device, equipment, medium and product

Country Status (1)

Country Link
CN (1) CN115794899A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination