CN115081531A - Data processing method and device and electronic equipment - Google Patents

Publication number
CN115081531A
Authority
CN
China
Prior art keywords
determining
similarity
data
data item
standard data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210761255.4A
Other languages
Chinese (zh)
Inventor
李鹏飞
王倩
甘长华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dt Dream Technology Co Ltd filed Critical Hangzhou Dt Dream Technology Co Ltd
Priority to CN202210761255.4A priority Critical patent/CN115081531A/en
Publication of CN115081531A publication Critical patent/CN115081531A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data processing method, a data processing apparatus, and an electronic device, and relates to the technical field of data management. The method includes: in the process of determining the data element that matches a data item to be processed, determining, from a plurality of standard data elements and according to the name information and the value range information of the data item, a target data element whose name information and value range information both match the data item. By combining the name information and the value range information of the data item, the target data element matching the data item is determined accurately without manual participation, which improves the efficiency of determining the target data element and reduces cost.

Description

Data processing method and device and electronic equipment
Technical Field
The present application relates to the field of data management technologies, and in particular, to a data processing method and apparatus, and an electronic device.
Background
Standard data elements can be used to standardize and classify the names, types, and values of an industry's data. Standard data elements are themselves data, and whether standardized management can be achieved during data management directly determines the efficiency and quality of that management. For a data item to be processed in a data table, accurately and quickly finding the corresponding standard data element among the many existing standard data elements is therefore a key part of fast and efficient standardized governance.
In the related art, the standard data element corresponding to a data item is usually determined manually. However, the manual approach is labor-intensive, and determining the standard data element corresponding to a data item in this way is inefficient.
Disclosure of Invention
The object of the present application is to solve at least to some extent one of the above mentioned technical problems.
Therefore, the application provides a data processing method, a data processing device and electronic equipment.
An embodiment of a first aspect of the present application provides a data processing method, including: acquiring a data item to be processed and a plurality of preset standard data elements; and determining a target data element with the name information and the value range information matched with the data item from the plurality of standard data elements according to the name information and the value range information of the data item.
Optionally, the determining, according to the name information and the value range information of the data item, a target data element whose name information and value range information are both matched with the data item from the plurality of standard data elements includes: determining a first similarity between the name information of the data item and the name information of each standard data element; determining a second similarity between the value range information of the data item and the value range information of each of the standard data elements; and determining a target data element with name information and value range information matched with the data item from the plurality of standard data elements based on the first similarity and the second similarity.
Optionally, the determining, based on the first similarity and the second similarity, a target data element whose name information and value range information are both matched with the data item from the plurality of standard data elements includes: determining a total similarity between the data item and each of the standard data elements according to the first similarity and the second similarity; and determining a target data element with name information and value range information matched with the data item from the plurality of standard data elements according to the total similarity.
Optionally, the determining, based on the first similarity and the second similarity, a target data element whose name information and value range information are both matched with the data item from the plurality of standard data elements includes: determining a plurality of first candidate data elements from the plurality of standard data elements according to the first similarity; determining a plurality of second candidate data elements from the plurality of first candidate data elements according to the second similarity; and determining the target data element according to the plurality of second candidate data elements.
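The two-stage narrowing described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the first and second similarities have already been computed and stored in dictionaries keyed by a hypothetical element identifier, and the cut-offs `name_threshold` and `range_threshold` are assumptions, since the text does not state how candidates are selected at each stage.

```python
from typing import Dict, List


def cascade_candidates(
    first_sim: Dict[str, float],
    second_sim: Dict[str, float],
    name_threshold: float = 0.5,   # hypothetical cut-off, not from the source
    range_threshold: float = 0.5,  # hypothetical cut-off, not from the source
) -> List[str]:
    """Filter by name similarity first, then by value-range similarity."""
    # Stage 1: first candidate data elements, selected by the first similarity.
    first_candidates = [e for e, s in first_sim.items() if s >= name_threshold]
    # Stage 2: second candidate data elements, drawn only from stage 1.
    return [
        e for e in first_candidates
        if second_sim.get(e, 0.0) >= range_threshold
    ]


# Illustrative scores for three made-up standard data elements.
first = {"gender_code": 0.9, "gender_name": 0.8, "birth_date": 0.1}
second = {"gender_code": 0.7, "gender_name": 0.2, "birth_date": 0.9}
print(cascade_candidates(first, second))
```

Filtering on names before value ranges means the second similarity only matters for elements that already passed the name check, which is the ordering the paragraph above describes.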
Optionally, the determining the target data element according to the plurality of second candidate data elements includes: acquiring a similar data item whose similarity to the data item is greater than a preset similarity threshold and which has been labeled with a data element, wherein the labeled data element is one of the standard data elements; determining a common data element shared by the labeled data element and the plurality of second candidate data elements; and taking the common data element as the target data element.
Optionally, the first similarity is obtained by performing coarse-grained similarity calculation on the name information of the data item and the name information of the standard data elements, and the determining, according to the first similarity, a plurality of first candidate data elements from the plurality of standard data elements includes: sorting the plurality of standard data elements in descending order of the first similarity to obtain a sorting result; performing fine-grained similarity calculation between the name information of the data item and each standard data element ranked in the top N positions of the sorting result, to obtain a third similarity between each of those standard data elements and the name information of the data item, wherein N is an integer greater than 1, and the word-segmentation granularity applied to the name information in the coarse-grained similarity calculation is larger than that applied in the fine-grained similarity calculation; and determining, according to the third similarity, a plurality of first candidate data elements from the standard data elements ranked in the top N positions of the sorting result.
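A coarse-then-fine ranking of this kind can be sketched as below. The source does not name the similarity measures or the segmentation method, so everything here is an assumption made for illustration: whole words stand in for the coarse segmentation, overlapping character pairs stand in for the fine segmentation, Jaccard overlap is used as the score in both passes, and `n` and `fine_threshold` are hypothetical parameters.

```python
import re
from typing import List, Set, Tuple


def word_tokens(name: str) -> Set[str]:
    """Coarse segmentation (assumed): whole words."""
    return {t for t in re.split(r"[\W_]+", name.lower()) if t}


def char_bigrams(name: str) -> Set[str]:
    """Fine segmentation (assumed): overlapping two-character pieces."""
    s = re.sub(r"[\W_]+", "", name.lower())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def first_candidates(
    item_name: str,
    element_names: List[str],
    n: int = 3,                   # hypothetical N
    fine_threshold: float = 0.5,  # hypothetical cut-off on the third similarity
) -> List[Tuple[str, float]]:
    # Coarse pass: rank every standard data element by the first similarity.
    ranked = sorted(
        element_names,
        key=lambda e: jaccard(word_tokens(item_name), word_tokens(e)),
        reverse=True,
    )
    # Fine pass: recompute a third similarity for the top N only.
    scored = [
        (e, jaccard(char_bigrams(item_name), char_bigrams(e)))
        for e in ranked[:n]
    ]
    return [(e, s) for e, s in scored if s >= fine_threshold]


names = ["gender of user", "user age", "order id", "user genders"]
result = first_candidates("user gender", names)
```

Running the cheap coarse score over the whole library and the finer score over only N survivors keeps the more detailed comparison off most of the standard data elements.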
Optionally, in a case that the value range information includes a plurality of numbers, the determining a second similarity between the value range information of the data item and the value range information of each of the standard data elements includes: determining a first total number of the plurality of numbers of the data item and determining a second total number of the plurality of numbers of each standard data element; and respectively determining a second similarity between the data item and each standard data element according to the first total number and each second total number.
Optionally, the determining, according to the first total number and each second total number, a second similarity between the data item and each standard data element includes: determining, for each of the standard data elements, a maximum value of the first total number and the second total number of the plurality of numbers of the standard data element; determining a quantity difference between the first total number and the second total number of the plurality of numbers of the standard data element; determining a difference value obtained by subtracting the maximum value from the quantity difference, and determining a proportional value between the difference value and the maximum value; and determining a second similarity between the plurality of numbers of the data item and the plurality of numbers of the standard data element based on the proportional value.
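The count-based comparison above can be sketched as a small function. One point is an interpretation rather than something the text settles: the sketch takes the difference as the maximum value minus the quantity difference, so that the resulting proportion falls between 0 and 1 and equals 1 when the two counts agree. The handling of two empty value ranges is likewise an assumption.

```python
def numeric_range_similarity(first_total: int, second_total: int) -> float:
    """Second similarity from the counts of numeric candidate values.

    Assumed reading: (max - |first - second|) / max.
    """
    largest = max(first_total, second_total)
    if largest == 0:
        # Assumption: two empty value ranges are treated as identical.
        return 1.0
    quantity_difference = abs(first_total - second_total)
    return (largest - quantity_difference) / largest


# A data item with 3 candidate numbers against an element with 4.
print(numeric_range_similarity(3, 4))
```

Under this reading the score reduces to the smaller count divided by the larger one, so it depends only on how many values each side has, not on the values themselves.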
Optionally, in a case that the value range information includes a plurality of texts, the determining a second similarity between the value range information of the data item and the value range information of each of the standard data elements includes: determining, for each of the standard data elements, a number of matches of the text of the data item with the text of the standard data element; determining a number of repetitions that a plurality of texts of the data item match a same text of the standard data element; and determining a second similarity between the plurality of texts of the data item and the plurality of texts of the standard data element according to the number, the repetition times and the total number of texts in the value range information of the data item.
Optionally, the determining the number of matches between the text of the data item and the text of the standard data element includes: for each text of the data item, determining a fourth degree of similarity between the text and the respective text of the standard data element; determining the maximum similarity among the fourth similarities; and taking the number of the maximum similarity degrees which is larger than a preset similarity degree threshold value as the number of the matching between the text of the data item and the text of the standard data element.
Optionally, the determining a second similarity between the plurality of texts of the data item and the plurality of texts of the standard data element according to the number, the repetition number, and the total number of texts in the value range information of the data item includes: determining a difference obtained by subtracting the repetition times from the total number; acquiring a sum obtained by summing the difference value and a preset value; determining a ratio of said number to said sum; and normalizing the ratio, and determining second similarity between the plurality of texts of the data item and the plurality of texts of the standard data element according to a normalization processing result.
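The text-valued case above can be sketched as follows. Several choices are assumptions, because the text leaves them open: `difflib.SequenceMatcher` stands in for the unspecified fourth similarity, the repetition count is taken as the number of matches minus the number of distinct matched texts, the preset value defaults to 1.0, and normalization is done by capping the ratio at 1.

```python
from difflib import SequenceMatcher
from typing import List


def text_range_similarity(
    item_texts: List[str],
    element_texts: List[str],
    threshold: float = 0.8,  # hypothetical preset similarity threshold
    preset: float = 1.0,     # hypothetical preset value in the denominator
) -> float:
    if not item_texts or not element_texts:
        return 0.0
    matched_targets = []
    for text in item_texts:
        # Fourth similarity against every element text; keep the maximum.
        best_score, best_target = max(
            (SequenceMatcher(None, text, other).ratio(), other)
            for other in element_texts
        )
        if best_score > threshold:
            matched_targets.append(best_target)
    number = len(matched_targets)
    # Assumed definition: extra matches that land on an already matched text.
    repetitions = number - len(set(matched_targets))
    total = len(item_texts)
    ratio = number / (total - repetitions + preset)
    # Assumed normalization: cap the ratio at 1.
    return min(1.0, ratio)


score = text_range_similarity(["male", "female", "unknown"], ["male", "female"])
```

In this example two of the three item texts find an exact counterpart and none repeat, so the ratio is 2 / (3 - 0 + 1).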
Optionally, in a case that the value range information includes a plurality of character strings, the determining a second similarity between the value range information of the data item and the value range information of each of the standard data elements includes: determining, for each of the standard data elements, a number of matches of the character string of the data item with the character string of the standard data element; determining a number of repetitions that a plurality of strings of the data item match a same string of the standard data element; and determining a second similarity between the plurality of character strings of the data item and the plurality of character strings of the standard data element according to the number, the total number of character strings in the value range information of the data item and the repetition times.
Optionally, the determining the number of matches between the character strings of the data item and the character strings of the standard data element includes: for each character string of the data item, determining a fourth similarity between the character string and each character string of the standard data element; determining the maximum similarity among the fourth similarities; and taking the number of maximum similarities that are greater than a preset similarity threshold as the number of matches between the character strings of the data item and the character strings of the standard data element.
Optionally, the determining a second similarity between the multiple character strings of the data item and the multiple character strings of the standard data element according to the number, the repetition number, and the total number of character strings in the value range information of the data item includes: determining a difference obtained by subtracting the repetition times from the total number; acquiring a sum obtained by summing the difference value and a preset value; determining a ratio of said number to said sum; and carrying out normalization processing on the ratio, and determining second similarity between the character strings of the data item and the character strings of the standard data element according to a normalization processing result.
According to the data processing method, in the process of determining the data elements matched with the data items to be processed, the target data elements with the name information and the value range information matched with the data items are determined from the standard data elements according to the name information and the value range information of the data items. Therefore, the name information and the value range information of the data item are combined, manual participation is not needed, the target data element matched with the data item is accurately determined, the efficiency of determining the target data element matched with the data item is improved, and the cost is reduced.
An embodiment of a second aspect of the present application provides a data processing apparatus, including: the first acquisition module is used for acquiring a data item to be processed and a plurality of preset standard data elements; and the determining module is used for determining a target data element of which the name information and the value range information are matched with the data item from the plurality of standard data elements according to the name information and the value range information of the data item.
Optionally, the determining module includes: a first determining submodule for determining a first similarity between the name information of the data item and the name information of each of the standard data elements; a second determining submodule for determining a second similarity between the value range information of the data item and the value range information of each of the standard data elements; and the third determining submodule is used for determining a target data element of which the name information and the value range information are matched with the data item from the plurality of standard data elements on the basis of the first similarity and the second similarity.
Optionally, the third determining submodule includes: a first determining unit, configured to determine, according to the first similarity and the second similarity, a total similarity between the data item and each of the standard data elements; and the second determining unit is used for determining a target data element of which the name information and the value range information are matched with the data item from the plurality of standard data elements according to the total similarity.
Optionally, the second determining unit includes: a first determining subunit, configured to determine, according to the first similarity, a plurality of first candidate data elements from the plurality of standard data elements; a second determining subunit, configured to determine, according to the second similarity, a plurality of second candidate data elements from the plurality of first candidate data elements; a third determining subunit, configured to determine the target data element according to the plurality of second candidate data elements.
Optionally, the third determining subunit is specifically configured to: acquire a similar data item whose similarity to the data item is greater than a preset similarity threshold and which has been labeled with a data element, wherein the labeled data element is one of the standard data elements; determine a common data element shared by the labeled data element and the plurality of second candidate data elements; and take the common data element as the target data element.
Optionally, the first similarity is obtained by performing coarse-grained similarity calculation on the name information of the data item and the name information of the standard data elements, and the first determining subunit is specifically configured to: sort the plurality of standard data elements in descending order of the first similarity to obtain a sorting result; perform fine-grained similarity calculation between the name information of the data item and each standard data element ranked in the top N positions of the sorting result, to obtain a third similarity between each of those standard data elements and the name information of the data item, wherein N is an integer greater than 1, and the word-segmentation granularity applied to the name information in the coarse-grained similarity calculation is larger than that applied in the fine-grained similarity calculation; and determine, according to the third similarity, a plurality of first candidate data elements from the standard data elements ranked in the top N positions of the sorting result.
Optionally, in a case that the value range information includes a plurality of numbers, the second determining submodule includes: a third determining unit, configured to determine a first total number of the plurality of numbers of the data item and determine a second total number of the plurality of numbers of each standard data element; and a fourth determining unit, configured to determine, according to the first total number and each of the second total numbers, a second similarity between the data item and each of the standard data elements, respectively.
Optionally, the fourth determining unit is specifically configured to: determine, for each of the standard data elements, a maximum value of the first total number and the second total number of the plurality of numbers of the standard data element; determine a quantity difference between the first total number and the second total number of the plurality of numbers of the standard data element; determine a difference value obtained by subtracting the maximum value from the quantity difference, and determine a proportional value between the difference value and the maximum value; and determine a second similarity between the plurality of numbers of the data item and the plurality of numbers of the standard data element based on the proportional value.
Optionally, in a case that the value range information includes a plurality of texts, the second determining sub-module includes: a fifth determining unit, configured to determine, for each of the standard data elements, the number of matches between the text of the data item and the text of the standard data element; a sixth determining unit configured to determine a number of repetitions that a plurality of texts of the data item match a same text of the standard data element; a seventh determining unit, configured to determine a second similarity between the plurality of texts of the data item and the plurality of texts of the standard data element according to the number, the repetition number, and a total number of texts in the value range information of the data item.
Optionally, the fifth determining unit is specifically configured to: for each text of the data item, determining a fourth degree of similarity between the text and the respective text of the standard data element; determining the maximum similarity among the fourth similarities; and taking the number of the maximum similarity degrees which is larger than a preset similarity threshold value as the number of the texts of the data items which are matched with the texts of the standard data elements.
Optionally, the seventh determining unit is specifically configured to: determining a difference obtained by subtracting the repetition times from the total number; acquiring a sum obtained by summing the difference value and a preset value; determining a ratio of said number to said sum; and normalizing the ratio, and determining a second similarity between the texts of the data item and the texts of the standard data element according to a normalization processing result.
Optionally, in a case that the value range information includes a plurality of character strings, the second determining sub-module includes: an eighth determining unit, configured to determine, for each of the standard data elements, the number of matches between the character string of the data item and the character string of the standard data element; a ninth determining unit configured to determine a number of repetitions that a plurality of character strings of the data item match to the same character string of the standard data element; a tenth determining unit, configured to determine, according to the number, the total number of character strings in the value range information of the data item, and the repetition number, a second similarity between the plurality of character strings of the data item and the plurality of character strings of the standard data element.
Optionally, the eighth determining unit is specifically configured to: for each character string of the data item, determine a fourth similarity between the character string and each character string of the standard data element; determine the maximum similarity among the fourth similarities; and take the number of maximum similarities that are greater than a preset similarity threshold as the number of matches between the character strings of the data item and the character strings of the standard data element.
Optionally, the tenth determining unit is specifically configured to: determining a difference obtained by subtracting the repetition times from the total number; acquiring a sum obtained by summing the difference value and a preset value; determining a ratio of said number to said sum; and carrying out normalization processing on the ratio, and determining second similarity between the character strings of the data item and the character strings of the standard data element according to a normalization processing result.
According to the data processing device, in the process of determining the data elements matched with the data items to be processed, the target data elements with the name information and the value range information matched with the data items are determined from the standard data elements according to the name information and the value range information of the data items. Therefore, the name information and the value range information of the data item are combined, manual participation is not needed, the target data element matched with the data item is accurately determined, the efficiency of determining the target data element matched with the data item is improved, and the cost is reduced.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the data processing method according to the first aspect when executing the program.
An embodiment of a fourth aspect of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored, the program implementing the data processing method according to the first aspect when executed by a processor.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of another data processing method according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of another data processing method according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of another data processing method according to an embodiment of the present application;
Fig. 5 is a schematic flowchart of another data processing method according to an embodiment of the present application;
Fig. 6 is a schematic flowchart of another data processing method according to an embodiment of the present application;
Fig. 7 is a schematic flowchart of another data processing method according to an embodiment of the present application;
Fig. 8 is a block diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application.
The execution subject of the embodiments of the present application is the data processing apparatus provided by the present application. The data processing apparatus may be an electronic device, or may be configured in an electronic device to perform a data processing function; for example, the data processing apparatus may be an application program configured in the electronic device, so that the application program can perform the data processing function.
The electronic device may be any device with computing capability, where the device itself or an application in the device is capable of performing the data processing function. The device with computing capability may be, for example, a personal computer (PC), a mobile terminal, or a server. The mobile terminal may be, for example, a hardware device having an operating system, a touch screen, and/or a display screen, such as an in-vehicle device, a mobile phone, a tablet computer, a personal digital assistant, or a wearable device.
As shown in fig. 1, the data processing method includes the steps of:
Step 101, acquiring a data item to be processed and a plurality of preset standard data elements.
In some exemplary embodiments, in structured data, the data item to be processed generally refers to a data field, and may therefore also be called a data field to be processed.
The data item to be processed may be a data item in a data table. In some exemplary embodiments, any data table may be monitored, and a data item newly added to the data table may be treated as the data item to be processed. In other exemplary embodiments, a data governance instruction may be received, and the data item corresponding to the data governance instruction may be used as the data item to be processed, which is not limited in this embodiment.
In some embodiments, a predetermined plurality of standard data elements may be obtained from a standard data element library.
Step 102, determining, from the plurality of standard data elements and according to the name information and the value range information of the data item, a target data element whose name information and value range information both match the data item.
In some embodiments, the name information of the data item may include the name of the data item; in structured data, it may include the field name of the data field.
In some exemplary embodiments, the value range information of the data item may be obtained from a value range table corresponding to the data item.
The value range information may include a plurality of candidate values corresponding to the data item, and correspondingly, the plurality of candidate values may be numbers, character strings, texts, or the like, which is not limited in this embodiment.
It is to be understood that the target data element in this example may be one or more. That is, the target data element may be one or more of a plurality of standard data elements, and this embodiment is not particularly limited thereto.
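The inputs described for steps 101 and 102 can be pictured with a pair of small record types. This is only an illustrative sketch: the class names, field names, and sample values are all hypothetical, and the source does not prescribe any particular representation.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Candidate values may be numbers, character strings, or texts.
Value = Union[int, float, str]


@dataclass
class DataItem:
    """A data item to be processed, e.g. a field in a data table."""
    name: str                                             # name information
    value_range: List[Value] = field(default_factory=list)  # candidate values


@dataclass
class StandardDataElement:
    """One entry of a (hypothetical) standard data element library."""
    name: str
    value_range: List[Value] = field(default_factory=list)


# Made-up example: one item and a two-entry library.
item = DataItem(name="gender", value_range=["male", "female"])
library = [
    StandardDataElement("gender code", ["male", "female", "unknown"]),
    StandardDataElement("age", [0, 150]),
]
```

Matching then amounts to comparing `item.name` and `item.value_range` against the same two attributes of every entry in `library`, and returning one or more entries as target data elements.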
According to the data processing method, in the process of determining the data elements matched with the data items to be processed, the target data elements with the name information and the value range information matched with the data items are determined from the standard data elements according to the name information and the value range information of the data items. Therefore, the name information and the value range information of the data item are combined, manual participation is not needed, the target data element matched with the data item is accurately determined, the efficiency of determining the target data element matched with the data item is improved, and the cost is reduced.
Based on the foregoing embodiment, in order to accurately determine the target data element matching the data item, one possible implementation of step 102, that is, determining, from the plurality of standard data elements and according to the name information and the value range information of the data item, the target data element whose name information and value range information both match the data item, is shown in fig. 2 and may include:
Step 201, determining a first similarity between the name information of the data item and the name information of each standard data element.
For each standard data element, a semantic similarity calculation may be performed between the name information of the data item and the name information of the standard data element, and the result is used as the first similarity between the name information of the data item and the name information of that standard data element.
Step 202, determining a second similarity between the value range information of the data item and the value range information of each standard data element.
In one embodiment of the application, for each standard data element, a similarity calculation is performed on the value range information of the data item and the value range information of the standard data element to obtain a second similarity between the value range information of the data item and the value range information of the standard data element.
Step 203, based on the first similarity and the second similarity, determining a target data element with name information and value range information matched with the data item from the plurality of standard data elements.
It can be understood that the manner of determining, based on the first similarity and the second similarity, the target data element whose name information and value range information both match the data item may differ between application scenarios. Exemplary descriptions follow:
As an example, a total similarity between the data item and each standard data element is determined according to the first similarity and the second similarity, and the target data element whose name information and value range information both match the data item is determined from the plurality of standard data elements according to the total similarity.
As an exemplary embodiment, after the total similarity between the data item and each standard data element is determined, the plurality of standard data elements may be sorted in descending order of total similarity to obtain a sorting result, and the standard data elements ranked in the top K positions of the sorting result are taken as the target data elements, where K is an integer greater than 1.
In other embodiments of the present application, after the data item to be processed and the plurality of preset standard data elements are obtained, the total similarity between the data item and each standard data element may be calculated, the standard data elements whose total similarity is greater than a preset similarity threshold may be obtained from the plurality of standard data elements, and the obtained standard data elements are taken as the target data elements.
The preset similarity threshold is a critical value of the total similarity preset in the data processing device, and in practical application, a value of the preset similarity threshold may be preset according to an actual requirement, which is not specifically limited in this embodiment.
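The two selection rules above (top K and threshold) can be sketched as follows. This is only an illustrative sketch: the text does not state how the first and second similarities are combined into the total similarity, so the weighted mean in `total_similarity` and its `name_weight` parameter are assumptions, and the function and element names are hypothetical.

```python
from typing import Dict, List, Optional


def total_similarity(first: float, second: float, name_weight: float = 0.5) -> float:
    """Combine name similarity and value-range similarity.

    ASSUMPTION: a weighted mean; the source does not specify the combination.
    """
    return name_weight * first + (1.0 - name_weight) * second


def select_targets(
    totals: Dict[str, float],
    k: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[str]:
    """Pick target data elements by threshold and/or top-K on total similarity."""
    # Sort standard data elements in descending order of total similarity.
    ranked = sorted(totals, key=totals.get, reverse=True)
    if threshold is not None:
        # Keep only elements whose total similarity exceeds the preset threshold.
        ranked = [name for name in ranked if totals[name] > threshold]
    if k is not None:
        # Keep only the top K positions of the sorting result.
        ranked = ranked[:k]
    return ranked


# Hypothetical standard data elements with (first, second) similarities.
totals = {
    "gender_code": total_similarity(0.9, 0.8),
    "age": total_similarity(0.2, 0.1),
    "sex": total_similarity(0.7, 0.9),
}
top_two = select_targets(totals, k=2)
above_half = select_targets(totals, threshold=0.5)
```

Either rule can be used on its own, as in the two embodiments described above; passing both applies the threshold first and then truncates to K.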
As another example, a plurality of first candidate data elements are determined from the plurality of standard data elements according to the first similarity; determining a plurality of second candidate data elements from the plurality of first candidate data elements according to the second similarity; and determining the target data element according to the plurality of second candidate data elements.
It should be noted that the total number of the second candidate data elements is smaller than or equal to the total number of the first candidate data elements.
In some exemplary embodiments, the plurality of first candidate data elements may be sorted in descending order of the second similarity between the data item and each first candidate data element to obtain a sorting result, and the first candidate data elements ranked in the top M positions of the sorting result may be selected as the second candidate data elements, where M is an integer greater than 1.
In other exemplary embodiments, the first candidate data elements whose second similarity to the data item is greater than a preset similarity threshold may be selected from the plurality of first candidate data elements, and the selected first candidate data elements may be used as the second candidate data elements.
As another example, a plurality of first candidate data elements are determined from the plurality of standard data elements based on the second similarity; determining a plurality of second candidate data elements from the plurality of first candidate data elements according to the first similarity; and determining the target data element according to the plurality of second candidate data elements.
In one embodiment of the present application, the second candidate data elements may be used directly as the target data elements, or a preset number of data elements may be arbitrarily selected from the plurality of second candidate data elements and used as the target data elements. The preset number is set in advance, and its value may be set according to actual requirements, which is not specifically limited in this embodiment.
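The two-stage screening described above, in either order, might be sketched as below. The helper names, the top-M selection at each stage, and the example scores are assumptions made for illustration; a threshold could be substituted for either stage.

```python
from typing import Dict, List


def top_m(candidates: List[str], scores: Dict[str, float], m: int) -> List[str]:
    """Return the m candidates with the highest score, in descending order."""
    return sorted(candidates, key=lambda name: scores[name], reverse=True)[:m]


def cascade_select(
    elements: List[str],
    first_sim: Dict[str, float],
    second_sim: Dict[str, float],
    m_first: int,
    m_second: int,
    name_first: bool = True,
) -> List[str]:
    """Screen standard data elements in two stages.

    name_first=True  : first similarity (names), then second similarity (value ranges).
    name_first=False : second similarity first, then first similarity.
    """
    stage1, stage2 = (first_sim, second_sim) if name_first else (second_sim, first_sim)
    first_candidates = top_m(elements, stage1, m_first)
    # The second candidates are drawn only from the first candidates,
    # so there can never be more of them than first candidates.
    return top_m(first_candidates, stage2, m_second)


# Hypothetical scores for four standard data elements.
elements = ["a", "b", "c", "d"]
first_sim = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
second_sim = {"a": 0.2, "b": 0.9, "c": 0.6, "d": 0.95}

by_name_then_range = cascade_select(elements, first_sim, second_sim, 2, 2, name_first=True)
by_range_then_name = cascade_select(elements, first_sim, second_sim, 2, 2, name_first=False)
```

With these example scores the two orders produce different candidate sets, which is why the choice of order is left to the application scenario.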
Fig. 3 is a schematic flow chart of another data processing method according to an embodiment of the present application. It should be noted that the present embodiment is a further refinement or optimization of the foregoing embodiments.
As shown in Fig. 3, the method may include:
Step 301, acquiring a data item to be processed and a plurality of preset standard data elements.
Step 302, determining a first similarity between the name information of the data item and the name information of each standard data element.
Step 303, determining a second similarity between the value range information of the data item and the value range information of each standard data element.
Step 304, determining a plurality of first candidate data elements from the plurality of standard data elements according to the first similarity.
Step 305, determining a plurality of second candidate data elements from the plurality of first candidate data elements according to the second similarity.
For specific implementation manners of step 301 to step 305, reference may be made to the related descriptions of the above embodiments, and details are not described herein.
Step 306, obtaining similar data items whose similarity to the data item is greater than a preset similarity threshold, together with the labeled data elements of the similar data items, wherein the labeled data elements are data elements among the plurality of standard data elements.
In some embodiments, in order to quickly determine the target data element of the data item with the help of data items that have already been labeled, the similarity between the data item and each of a plurality of candidate data items that have labeled data elements may be calculated, and the candidate data items whose similarity is greater than a preset similarity threshold are taken as the similar data items of the data item.
The preset similarity threshold is a critical value of similarity preset in the data processing device, for example, the preset similarity threshold may be 0.75, and in practical application, a value of the preset similarity threshold may be preset in the data processing device according to an actual requirement, which is not specifically limited in this embodiment.
Step 307, determining the common data elements shared by the labeled data elements and the plurality of second candidate data elements.
Wherein a common data element is a data element that exists both among the plurality of second candidate data elements and among the plurality of third candidate data elements, the third candidate data elements being the labeled data elements of the similar data items.
Step 308, taking the common data element as the target data element matched with the data item.
In an embodiment of the present application, in order to avoid that duplicate data elements affect the annotation of the data item, after determining the common data elements, the common data elements may be subjected to deduplication processing, and a target data element matching the data item is determined according to the deduplicated common data elements.
In some embodiments, the deduplicated common data element may be the target data element that matches the data item. In other examples, one or more of the deduplicated common data elements may be randomly selected as target data elements that match the data item.
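Steps 306 to 308 can be sketched as a filter followed by a deduplicated intersection. The field names, labels, and similarity values below are invented for illustration; the 0.75 threshold is the example value mentioned above.

```python
from typing import Dict, Iterable, List


def similar_items(item_sims: Dict[str, float], threshold: float = 0.75) -> List[str]:
    """Step 306: keep labeled candidate data items whose similarity exceeds the threshold."""
    return [name for name, sim in item_sims.items() if sim > threshold]


def common_elements(second_candidates: Iterable[str], labeled: Iterable[str]) -> List[str]:
    """Steps 307-308: elements present in both collections, deduplicated, order preserved."""
    labeled_set = set(labeled)
    seen = set()
    result = []
    for element in second_candidates:
        if element in labeled_set and element not in seen:
            seen.add(element)
            result.append(element)
    return result


# Hypothetical, previously labeled data items and their labeled data elements.
labels = {"sex_cd": "gender_code", "gender": "gender_code", "birth": "birth_date"}
# Hypothetical similarity between the data item to be processed and each labeled data item.
item_sims = {"sex_cd": 0.9, "gender": 0.8, "birth": 0.1}

similar = similar_items(item_sims)
labeled_elements = [labels[name] for name in similar]
second_candidates = ["gender_code", "marital_status", "gender_code"]
targets = common_elements(second_candidates, labeled_elements)
```

Deduplicating while iterating keeps the ranking order of the second candidates, which matters if only some of the common data elements are later selected.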
In the data processing method provided by the embodiment of the application, in the process of determining the data element matched with the data item to be processed, a plurality of first candidate data elements are determined from the plurality of standard data elements by combining the similarity between the name information of the data item and the name information of each standard data element, a plurality of second candidate data elements are determined from the plurality of first candidate data elements according to the similarity between the value range information of the data item and the value range information of each first candidate data element, and the target data element matched with the data item is determined according to the plurality of second candidate data elements. Therefore, the name information and the value range information of the data item are combined, the target data element matched with the data item is accurately determined, and the efficiency of determining the target data element matched with the data item is improved.
Fig. 4 is a schematic flowchart of another data processing method according to an embodiment of the present application.
One possible implementation manner of determining a plurality of first candidate data elements from the plurality of standard data elements according to the first similarity, as shown in fig. 4, may include:
Step 401, sorting the plurality of standard data elements in descending order of the first similarity to obtain a sorting result.
In this example, the first similarity is obtained by performing a coarse-grained similarity calculation between the name information of the data item and the name information of each standard data element.
In some embodiments, the name information may include a data item name.
In some exemplary embodiments, in the structured data, the data item may also be referred to as a data field, and correspondingly, the name information may be referred to as a field name.
In some embodiments, the coarse-grained similarity calculation between the name information of the data item and the name information of each standard data element may be implemented as follows: a preset first feature extraction model for extracting coarse-grained feature information is applied to the name information of the data item and to the name information of each standard data element, to obtain the coarse-grained feature information of the data item and of each standard data element; a semantic similarity calculation is then performed on the coarse-grained feature information of the data item and that of each standard data element, to obtain the first similarity between the name information of the data item and the name information of each standard data element.
The coarse-grained feature information refers to feature information obtained by performing word segmentation on name information by the first feature extraction model according to the coarse-grained size and performing feature extraction on word segmentation results of the name information.
The coarse-grained size may include a word level, a phrase level, a fixed four-character expression level, and the like.
Step 402, performing a fine-grained similarity calculation between the name information of the data item and the name information of each standard data element ranked in the top N positions of the sorting result, to obtain a third similarity between the data item and each standard data element ranked in the top N positions.
Wherein N is an integer greater than 1, and the granularity at which the name information is segmented in the coarse-grained similarity calculation is larger than that used in the fine-grained similarity calculation.
In some embodiments, the fine-grained similarity calculation may be implemented as follows: a preset second feature extraction model for extracting fine-grained feature information is applied to the name information of the data item and to the name information of each standard data element ranked in the top N positions, to obtain the fine-grained feature information of the data item and of each of those standard data elements; a semantic similarity calculation is then performed on the fine-grained feature information of the data item and that of each standard data element ranked in the top N positions, to obtain the third similarity between the name information of the data item and the name information of each standard data element ranked in the top N positions.
The fine-grained feature information refers to feature information obtained by performing word segmentation on the name information by the second feature extraction model according to the fine-grained size and performing feature extraction on word segmentation results of the name information.
The fine-grained size may include a character level, a word level, and the like.
Wherein the coarse-grained size is larger than the fine-grained size. For example, the coarse-grained size may be word level and the fine-grained size may be character level. For another example, the coarse-grained size may be phrase level and the fine-grained size may be word level. In practical applications, the coarse-grained size and the fine-grained size may be set according to actual requirements, as long as the coarse-grained size is larger than the fine-grained size, which is not specifically limited in this embodiment.
Step 403, determining a plurality of first candidate data elements from the standard data elements ranked in the top N positions of the sorting result according to the third similarity.
In some exemplary embodiments, the standard data elements ranked in the top N positions may be sorted again in descending order of the third similarity to obtain a corresponding sorting result, the standard data elements ranked in the top K positions may be obtained from that sorting result, and the obtained standard data elements may be used as the first candidate data elements.
Wherein K is an integer greater than 1 and less than N. For example, N may be 90 and K may be 5; that is, the standard data elements ranked in the top 90 positions may be re-sorted by the third similarity to obtain a corresponding sorting result, the standard data elements ranked in the top 5 positions may be selected from that sorting result, and the selected standard data elements may be used as the first candidate data elements.
In other exemplary embodiments, a standard data element with a third similarity greater than a preset similarity threshold may be obtained from the top N standard data elements according to the third similarity, and the obtained standard data element may be used as the first candidate data element.
In this embodiment, coarse-grained matching is first performed on the plurality of standard data elements based on the result of the coarse-grained similarity calculation, to obtain the standard data elements whose similarity ranks in the top N positions. A fine-grained similarity calculation is then performed between the name information of the data item and the name information of each standard data element ranked in the top N positions, and fine-grained matching is performed on those standard data elements according to the result, so that the standard data elements are further screened and the first candidate data elements are obtained. Thereby, the first candidate data elements can be determined quickly and accurately from the plurality of standard data elements.
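The coarse-then-fine screening of steps 401 to 403 might look like the sketch below. The embodiment uses feature extraction models and semantic similarity; here a Jaccard overlap on word tokens (coarse) and on characters (fine) is swapped in purely as a stand-in so the control flow can be run. The function names and the example field names are hypothetical.

```python
from typing import List


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two token sets (a stand-in for semantic similarity)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def coarse_sim(x: str, y: str) -> float:
    """ASSUMED coarse granularity: whitespace-separated words."""
    return jaccard(set(x.lower().split()), set(y.lower().split()))


def fine_sim(x: str, y: str) -> float:
    """ASSUMED fine granularity: individual characters."""
    return jaccard(set(x.lower().replace(" ", "")), set(y.lower().replace(" ", "")))


def first_candidates(item_name: str, standard_names: List[str], n: int, k: int) -> List[str]:
    # Step 401: sort all standard data elements by coarse (first) similarity.
    coarse_ranked = sorted(
        standard_names, key=lambda s: coarse_sim(item_name, s), reverse=True
    )
    shortlist = coarse_ranked[:n]
    # Step 402: compute the fine (third) similarity only for the top N.
    fine_ranked = sorted(shortlist, key=lambda s: fine_sim(item_name, s), reverse=True)
    # Step 403: keep the top K of the re-sorted shortlist (K < N).
    return fine_ranked[:k]


standards = ["gender code", "gender", "postal code", "birth date"]
candidates = first_candidates("gender code", standards, n=3, k=2)
```

The point of the two passes is cost: the cheaper coarse comparison runs over every standard data element, while the more detailed comparison runs only over the N survivors.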
It is understood that the types of the values in the value range information may include numbers, character strings, texts, and other types, and in practical applications the type of the values in the value range information may be selected from numbers, character strings, and texts according to implementation requirements. For example, the value range information may include a plurality of numbers. For another example, the value range information may include a plurality of character strings. For another example, the value range information may include a plurality of texts; if the number of texts is 3, the 3 texts may be text 1, text 2, and text 3. It can be understood that, when the value range information includes a plurality of texts, the texts may be in any language, for example Chinese or Japanese, which is not specifically limited in this embodiment.
Fig. 5 is a schematic flow chart of another data processing method according to an embodiment of the present application. It should be noted that, in the case where the value range information includes a plurality of numbers, the present embodiment exemplarily describes one possible implementation manner of determining the second similarity between the value range information of the data item and the value range information of each standard data element.
As shown in Fig. 5, the method may include:
Step 501, determining a first total number of the plurality of numbers of the data item, and determining a second total number of the plurality of numbers of each standard data element.
It is to be understood that, for different standard data elements, the total number of numbers in the corresponding value ranges may be the same or different, which is not limited in this embodiment.
Step 502, determining a second similarity between the data item and each standard data element according to the first total number and each second total number.
In some exemplary embodiments, in order to accurately determine the second similarity between the data item and each standard data element, the following is performed for each standard data element: a quantity difference between the second total number corresponding to the standard data element and the first total number is determined; the maximum value of the first total number and the second total number is determined; a difference obtained by subtracting the maximum value from the quantity difference is determined; a ratio of that difference to the maximum value is determined; and the second similarity between the data item and the standard data element is determined based on the ratio.
Wherein the quantity difference is an absolute value of a difference between the second total quantity and the first total quantity corresponding to the standard data element.
In some embodiments, the calculation formula for determining the second similarity B between the value range information of the data item and the value range information of the standard data element is as follows:
B = 1 - sigmoid(5 * (count_diff - max_count) / max_count)
wherein count_diff is the absolute value of the difference between the first total number and the second total number, that is, the quantity difference between the first total number and the second total number; max_count is the larger of the first total number and the second total number; and sigmoid() is a normalization function that maps the value of 5 * (count_diff - max_count) / max_count into the range between 0 and 1.
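The formula for B can be evaluated directly. The sketch below implements it as written; the function names and the guard against an empty value range are additions for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def count_similarity(first_total: int, second_total: int) -> float:
    """Second similarity B from the counts of numbers in two value ranges."""
    max_count = max(first_total, second_total)
    if max_count == 0:
        # Guard added for this sketch: the formula divides by max_count.
        raise ValueError("at least one value range must contain a number")
    count_diff = abs(second_total - first_total)
    return 1.0 - sigmoid(5.0 * (count_diff - max_count) / max_count)


same_size = count_similarity(4, 4)
double_size = count_similarity(4, 8)
ten_times = count_similarity(4, 40)
```

Because count_diff never exceeds max_count, the sigmoid argument lies between -5 and 0, so B decreases from roughly 0.993 (equal counts) toward 0.5 as the counts diverge.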
In some exemplary embodiments, since there may be a case where the total number of the numbers in the plurality of standard data elements is the same, in order to accurately determine the second similarity between the value range information of the data item and the value range information of the standard data element, the second similarity between the value range information of the data item and the value range information of the standard data element may be weighted according to whether the first total number and the second total number are the same.
Specifically, in the case where the first total number and the second total number are the same, after the second similarity between the value range information of the data item and that of the standard data element is determined based on the quantity difference and the maximum value, the standard data elements whose value range information contains a total number of numbers equal to the first total number may be determined from the plurality of standard data elements; a third total number, being the number of such standard data elements, and a fourth total number, being the number of the plurality of standard data elements, may be determined; the frequency of occurrence of the first total number may be determined according to the ratio of the third total number to the fourth total number; a corresponding weight may be determined according to that frequency; and the second similarity may be adjusted by the weight.
In some exemplary embodiments, if the frequency of the first total number is greater than a preset frequency threshold, a corresponding weight is determined according to the frequency of the first total number (where the value of the weight is greater than zero and less than 1), and otherwise, the value of the weight is determined to be 1. It is understood that, in the case where the weight takes a value of 1, the values of the second similarities before and after the adjustment are the same.
The preset frequency threshold is a critical value of a preset frequency in the data processing device, and in practical application, a value of the preset frequency threshold may be set according to an actual requirement, which is not specifically limited in this embodiment.
In other exemplary embodiments, in the case where the first total number and the second total number are different, after the second similarity between the value range information of the data item and that of the standard data element is determined based on the quantity difference and the maximum value, a quantity deviation range of the value range of the data item may be determined according to the first total number and the second total number, the frequency corresponding to each number in the quantity deviation range may be determined, the weights generated for all the numbers in the quantity deviation range may be summed, the resulting sum may be used as the final weight, and the second similarity may be adjusted based on the final weight.
The quantity deviation range of the value range of the data item refers to the range between a first value and a second value, wherein the first value is obtained by subtracting the absolute value of the difference between the second total number and the first total number from the first total number, and the second value is obtained by adding the absolute value of the difference between the second total number and the first total number to the second total number.
In other exemplary embodiments of the present application, in the case where the value range information includes a plurality of numbers, another possible implementation of determining the second similarity between the value range information of the data item and the value range information of each standard data element is as follows: for each standard data element, the quantity of numbers of the standard data element that are identical to numbers among the plurality of numbers of the data item may be counted, and the similarity between the value range information of the data item and that of the standard data element is determined from that quantity.
Wherein the total number of numbers in the value range information of the standard data element is the same as the total number of numbers in the value range information of the data item.
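A minimal sketch of this overlap-based alternative follows. The text says only that the similarity is determined from the quantity of identical numbers; dividing that quantity by the shared total is an assumption made here, as is treating the values within one value range as distinct. The function name is hypothetical.

```python
from typing import Sequence


def overlap_similarity(item_values: Sequence[float], standard_values: Sequence[float]) -> float:
    """Share of the data item's numbers that also occur in the standard data element.

    ASSUMPTIONS: similarity = identical numbers / total numbers, and the
    values inside each value range are distinct.
    """
    if len(item_values) != len(standard_values):
        # This variant applies only when both value ranges hold the same total.
        raise ValueError("value ranges must contain the same number of values")
    if not item_values:
        return 0.0
    shared = len(set(item_values) & set(standard_values))
    return shared / len(item_values)


half_overlap = overlap_similarity([1, 2, 3, 4], [1, 2, 5, 6])
full_overlap = overlap_similarity([0, 1], [1, 0])
```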
Fig. 6 is a schematic flow chart of another data processing method according to an embodiment of the present application. It should be noted that, for the case where the value range information includes a plurality of texts, this embodiment exemplarily describes one possible implementation of determining the second similarity between the value range information of the data item and the value range information of each standard data element.
As shown in Fig. 6, the method may include:
Step 601, for each standard data element, determining the number of texts of the data item that match texts of the standard data element.
In an embodiment of the application, in order to accurately determine the number of texts of the data item that match texts of the standard data element, for each text of the data item, a fourth similarity between that text and each text of the standard data element may be determined, and the maximum similarity among those fourth similarities may be determined; the number of texts of the data item whose maximum similarity is greater than a preset similarity threshold is then taken as the number of texts of the data item that match texts of the standard data element. In this way, the number of matching texts between the data item and the standard data element is determined accurately.
Step 602, determining the number of repetitions with which multiple texts of the data item are matched to the same text of the standard data element.
For example, suppose the two texts of the data item are text 1, "man", and text 2, "man", and the text of the standard data element is "man". When the matching calculation is performed between the data item and the standard data element, text 1 of the data item is matched to the text "man" of the standard data element, and text 2 of the data item is also matched to the text "man" of the standard data element. That is, text 1 and text 2 of the data item are both matched to the same text "man" of the standard data element, and the number of repetitions of matching to that text can be determined accordingly.
Step 603, determining a second similarity between the plurality of texts of the data item and the plurality of texts of the standard data element according to the number, the number of repetitions, and the total number of texts in the value range information of the data item.
For example, if the value range information of the data item includes four chinese texts, the total number of the chinese texts in the value range information of the corresponding data item is 4.
In some exemplary embodiments of the present application, in order to accurately determine the second similarity between the texts of the data item and the texts of the standard data element, the second similarity may be determined by combining the total number of texts in the value range information of the data item, the number of texts of the data item whose maximum similarity is greater than the preset similarity threshold, and the number of repetitions of matching to the same text of the standard data element.
In an embodiment of the present application, one possible implementation of determining the second similarity between the texts of the data item and the texts of the standard data element according to the number, the number of repetitions, and the total number of texts in the value range information of the data item is as follows: determining a difference obtained by subtracting the number of repetitions from the total number; obtaining a sum of the difference and a preset value; determining the ratio of the number to the sum; and normalizing the ratio, and determining the second similarity between the plurality of texts of the data item and the plurality of texts of the standard data element according to the normalization result.
The preset value is an error value preset empirically, and may be 1e-8, for example.
In an exemplary embodiment, in the case where the above preset value is 1e-8, the calculation formula for the second similarity A between the plurality of texts of the data item and the plurality of texts of the standard data element is:
A = sigmoid((match_count / (phy_chncnt - repeat_count + 1e-8)) * 10 - 5)
wherein match_count is the number of texts of the data item matched with texts of the standard data element, and is equal to the number of texts of the data item whose maximum similarity is greater than the preset similarity threshold minus repeat_count;
phy_chncnt is the total number of texts in the value range information of the data item;
repeat_count is the number of repetitions of matching to the same text of the standard data element.
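As a worked illustration of the formula for A, the sketch below evaluates it from the three counts. The function and parameter names are hypothetical; `above_threshold` stands for the number of texts of the data item whose maximum similarity exceeds the preset threshold.

```python
import math


def similarity_from_counts(
    above_threshold: int, repeat_count: int, total: int, eps: float = 1e-8
) -> float:
    """Second similarity A computed from match statistics."""
    # match_count excludes repeated matches to the same standard text.
    match_count = above_threshold - repeat_count
    # eps is the preset value that keeps the denominator non-zero.
    ratio = match_count / (total - repeat_count + eps)
    # Scaling by 10 and shifting by 5 spreads the ratio over [-5, 5] before the sigmoid.
    return 1.0 / (1.0 + math.exp(-(ratio * 10.0 - 5.0)))


all_matched = similarity_from_counts(above_threshold=4, repeat_count=0, total=4)
none_matched = similarity_from_counts(above_threshold=0, repeat_count=0, total=4)
partly_matched = similarity_from_counts(above_threshold=3, repeat_count=1, total=4)
```

For four texts with three above the threshold and one repetition, match_count is 2 and the denominator is about 3, so the ratio is about 0.667 and A is about 0.84.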
Fig. 7 is a schematic flowchart of another data processing method according to an embodiment of the present application. It should be noted that, in the present embodiment, a possible implementation manner of the above-mentioned determining the second similarity between the value range information of the data item and the value range information of each standard data element in the case that the value range information includes a plurality of character strings is exemplarily described.
As shown in Fig. 7, the method may include:
Step 701, for each standard data element, determining the number of character strings of the data item that match character strings of the standard data element.
In an embodiment of the present application, one possible implementation of determining the number of character strings of the data item that match character strings of the standard data element is as follows: for each character string of the data item, determining a fourth similarity between that character string and each character string of the standard data element, and determining the maximum similarity among those fourth similarities; and taking the number of character strings of the data item whose maximum similarity is greater than a preset similarity threshold as the number of character strings of the data item that match character strings of the standard data element. In this way, the number of matching character strings is determined accurately.
Step 702, determining the number of repetitions with which multiple character strings of the data item are matched to the same character string of the standard data element.
Step 703, determining a second similarity between the plurality of character strings of the data item and the plurality of character strings of the standard data element according to the number, the number of repetitions, and the total number of character strings in the value range information of the data item.
For example, if the value range information of the data item includes four character strings, the total number of the character strings in the value range information of the corresponding data item is 4.
In some exemplary embodiments of the present application, in order to further accurately determine the second similarity between the plurality of character strings of the data item and the plurality of character strings of the standard data element, the second similarity between the plurality of character strings of the data item and the plurality of character strings of the standard data element may be determined by combining the total number of character strings in the value range information of the data item, the number of character strings of the data item, the maximum similarity of which is greater than the preset similarity threshold, and the number of repetitions of the same character string matched to the standard data element.
In an embodiment of the present application, in order to further accurately determine the second similarity between the plurality of character strings of the data item and the plurality of character strings of the standard data element, one possible implementation of determining the second similarity according to the number, the number of repetitions, and the total number of character strings in the value range information of the data item is as follows: determining a difference obtained by subtracting the number of repetitions from the total number; acquiring a sum obtained by adding a preset value to the difference; determining the ratio of the number to the sum; and carrying out normalization processing on the ratio, and determining the second similarity between the character strings of the data item and the character strings of the standard data element according to the normalization processing result.
The preset value is an error value preset empirically, and may be 1e-8, for example.
In an exemplary embodiment, in the case where the above preset value is 1e-8, the calculation formula for the second similarity A between the plurality of character strings of the data item and the plurality of character strings of the standard data element is:
A=sigmoid((match_count/(phy_chncnt-repeat_count+1e-8))*10-5);
wherein match_count is the number of character strings of the data item that are matched to character strings of the standard data element, and is equal to the number of maximum similarities of the character strings of the data item that are greater than the preset similarity threshold minus repeat_count;
phy_chncnt is the total number of character strings in the value range information of the data item;
repeat_count is the number of repetitions in which multiple character strings of the data item are matched to the same character string of the standard data element.
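The formula above can be evaluated directly. The following Python sketch is a worked illustration only: the function and parameter names are hypothetical, and the inputs are assumed to have been computed beforehand as described for match_count, phy_chncnt, and repeat_count.

```python
import math


def second_similarity(num_above_threshold, total_strings, repeat_count, eps=1e-8):
    """Evaluate A = sigmoid((match_count / (phy_chncnt - repeat_count + eps)) * 10 - 5).

    num_above_threshold: number of data-item strings whose maximum similarity
    exceeds the preset threshold; match_count is this value minus repeat_count.
    """
    match_count = num_above_threshold - repeat_count
    ratio = match_count / (total_strings - repeat_count + eps)
    z = ratio * 10 - 5
    return 1.0 / (1.0 + math.exp(-z))


# Four strings, all above the threshold, one repetition: ratio is close to 1.
print(round(second_similarity(4, 4, 1), 4))
# Four strings, two above the threshold, no repetition: ratio is close to 0.5.
print(round(second_similarity(2, 4, 0), 4))
```

Scaling the ratio by 10 and shifting it by 5 maps a ratio of 0.5 to the midpoint of the sigmoid, so ratios near 1 yield a similarity close to 1 and ratios near 0 yield a similarity close to 0.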
Corresponding to the data processing methods provided by the above several embodiments, an embodiment of the present application further provides a data processing apparatus. Since the data processing apparatus provided in the embodiments of the present application corresponds to the data processing methods provided in the above several embodiments, the implementation of the data processing method is also applicable to the data processing apparatus provided in the embodiments, and will not be described in detail in the embodiments.
Fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
As shown in fig. 8, the data processing apparatus 800 may include: an obtaining module 801 and a determining module 802, wherein:
an obtaining module 801, configured to obtain a data item to be processed and a plurality of preset standard data elements;
the determining module 802 is configured to determine, according to the name information and the value range information of the data item, a target data element from the plurality of standard data elements, where both the name information and the value range information are matched with the data item.
In one embodiment of the present application, the determining module 802 includes: the first determining submodule is used for determining first similarity between the name information of the data item and the name information of each standard data element; a second determining submodule for determining a second similarity between the value range information of the data item and the value range information of each standard data element; and the third determining submodule is used for determining a target data element of which the name information and the value range information are matched with the data item from the plurality of standard data elements on the basis of the first similarity and the second similarity.
In one embodiment of the present application, the third determining submodule includes: a first determining unit configured to determine a total similarity between the data item and each of the standard data elements based on the first similarity and the second similarity; and the second determining unit is used for determining the target data element of which the name information and the value range information are matched with the data item from the plurality of standard data elements according to the total similarity.
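A minimal Python sketch of selecting a target data element by total similarity is given below. This passage does not state how the first similarity and the second similarity are combined, so the weighted sum, the weight `name_weight`, and the function names are assumptions chosen only for illustration.

```python
def total_similarity(first, second, name_weight=0.5):
    """Assumed combination: weighted sum of name and value-range similarity."""
    return name_weight * first + (1.0 - name_weight) * second


def pick_target(scores, name_weight=0.5):
    """Return the standard data element with the highest total similarity.

    scores maps a (hypothetical) element identifier to a tuple
    (first_similarity, second_similarity).
    """
    return max(
        scores,
        key=lambda k: total_similarity(scores[k][0], scores[k][1], name_weight),
    )


scores = {"element_A": (0.9, 0.2), "element_B": (0.7, 0.8)}
print(pick_target(scores))  # prints element_B
```

With equal weights, element_B is preferred because its value range information matches much better, even though its name similarity is slightly lower.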
In one embodiment of the present application, the second determination unit includes: a first determining subunit, configured to determine, according to the first similarity, a plurality of first candidate data elements from the plurality of standard data elements; a second determining subunit, configured to determine, according to the second similarity, a plurality of second candidate data elements from the plurality of first candidate data elements; and the third determining subunit is used for determining the target data element according to the plurality of second candidate data elements.
In an embodiment of the application, the third determining subunit is specifically configured to: acquire the labeled data element of a similar data item whose similarity to the data item is greater than a preset similarity threshold, wherein the labeled data element is one of the plurality of standard data elements; determine the data element common to the labeled data element and the plurality of second candidate data elements; and take the common data element as the target data element.
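The final selection step can be sketched as a simple intersection. In the following Python illustration, the identifiers and the function name are hypothetical, and the labeled data elements of the similar data items are assumed to have been retrieved already.

```python
def target_from_candidates(second_candidates, labeled_elements):
    """Keep the second candidates that also appear among the labeled data elements.

    The order of second_candidates is preserved.
    """
    labeled = set(labeled_elements)
    return [c for c in second_candidates if c in labeled]


second_candidates = ["DE001", "DE017", "DE042"]
labeled_elements = ["DE017", "DE099"]
print(target_from_candidates(second_candidates, labeled_elements))  # prints ['DE017']
```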
In an embodiment of the present application, the first similarity is obtained by performing coarse-grained similarity calculation on the name information of the data item and of each standard data element, and the first determining subunit is specifically configured to: sort the plurality of standard data elements in descending order of the first similarity to obtain a sorting result; perform fine-grained similarity calculation on the name information of the data item and of each standard data element ranked in the top N positions of the sorting result, to obtain a third similarity between the data item and each standard data element ranked in the top N positions, wherein N is an integer greater than 1, and the granularity used for word segmentation of the name information in the coarse-grained similarity calculation is larger than that used in the fine-grained similarity calculation; and determine, according to the third similarity, a plurality of first candidate data elements from the standard data elements ranked in the top N positions of the sorting result.
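The coarse-to-fine selection described above can be sketched in Python as follows. The concrete similarity measures are not given in this passage, so the sketch assumes Jaccard similarity over whitespace-separated words for the coarse-grained step and over individual characters for the fine-grained step; the threshold `fine_threshold` and all function names are likewise assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of tokens."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coarse_similarity(x, y):
    """Assumed coarse-grained measure: tokens are whole words."""
    return jaccard(x.split(), y.split())


def fine_similarity(x, y):
    """Assumed fine-grained measure: tokens are single characters."""
    return jaccard(x.replace(" ", ""), y.replace(" ", ""))


def first_candidates(item_name, standard_names, top_n=3, fine_threshold=0.7):
    # Sort in descending order of the first (coarse-grained) similarity.
    ranked = sorted(
        standard_names,
        key=lambda n: coarse_similarity(item_name, n),
        reverse=True,
    )
    # Re-score only the top N entries with the fine-grained measure.
    shortlist = ranked[:top_n]
    return [n for n in shortlist if fine_similarity(item_name, n) > fine_threshold]


names = ["date of birth", "birth place", "gender code", "birth date time"]
print(first_candidates("birth date", names))
```

Restricting the more detailed comparison to the top N entries keeps its cost independent of the total number of standard data elements.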
In one embodiment of the present application, in a case where the value range information includes a plurality of digits, the second determination submodule includes: a third determining unit for determining a first total number of the plurality of digits of the data item and determining a second total number of the plurality of digits of each standard data element; and the fourth determining unit is used for respectively determining second similarity between the data item and each standard data element according to the first total quantity and each second total quantity.
In an embodiment of the application, the fourth determining unit is specifically configured to: determine, for each standard data element, the maximum value of the first total number and the second total number of the plurality of digits of the standard data element; determine a quantity difference between the first total number and the second total number of the plurality of digits of the standard data element; determine a difference value obtained by subtracting the maximum value from the quantity difference, and determine a proportional value between the difference value and the maximum value; and determine a second similarity between the plurality of digits of the data item and the plurality of digits of the standard data element based on the proportional value.
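The digit-count comparison can be illustrated with the following Python sketch. The translated wording leaves the order of the subtraction open to interpretation; the sketch assumes that the quantity difference is taken as an absolute value and subtracted from the maximum value, which keeps the resulting proportional value between 0 and 1. The handling of two empty value ranges and the function name are also assumptions.

```python
def digit_count_similarity(first_total, second_total):
    """Assumed form: (max - |first - second|) / max, a value in [0, 1]."""
    largest = max(first_total, second_total)
    if largest == 0:
        # Assumption: two empty value ranges are treated as identical.
        return 1.0
    quantity_difference = abs(first_total - second_total)
    return (largest - quantity_difference) / largest


print(digit_count_similarity(18, 18))  # prints 1.0
print(round(digit_count_similarity(18, 15), 4))
```

Under this assumption the score equals the smaller total divided by the larger total, so identical digit counts give 1.0 and strongly differing counts approach 0.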
In one embodiment of the present application, in a case where the value range information includes a plurality of texts, the second determination submodule includes: a fifth determining unit configured to determine, for each standard data element, the number of matches of the text of the data item with the text of the standard data element; a sixth determining unit configured to determine a number of repetitions that the plurality of texts of the data item match the same text of the standard data element; and a seventh determining unit, configured to determine a second similarity between the plurality of texts of the data item and the plurality of texts of the standard data element according to the number, the repetition number, and the total number of texts in the value range information of the data item.
In an embodiment of the application, the fifth determining unit is specifically configured to: for each text of the data item, determining a fourth degree of similarity between the text and the respective text of the standard data element; determining the maximum similarity among the fourth similarities; and taking the number of the maximum similarity degrees which is larger than a preset similarity threshold value as the number of the matching between the text of the data item and the text of the standard data element.
In an embodiment of the application, the seventh determining unit is specifically configured to: determine a difference obtained by subtracting the number of repetitions from the total number; acquire a sum obtained by adding a preset value to the difference; determine the ratio of the number to the sum; and carry out normalization processing on the ratio, and determine a second similarity between the plurality of texts of the data item and the plurality of texts of the standard data element according to the normalization processing result.
In one embodiment of the present application, in a case where the value range information includes a plurality of character strings, the second determination submodule includes: an eighth determining unit configured to determine, for each standard data element, the number of matches between the character string of the data item and the character string of the standard data element; a ninth determining unit configured to determine the number of repetitions that a plurality of character strings of the data item match a same character string of the standard data element; and the tenth determining unit is used for determining a second similarity between the plurality of character strings of the data item and the plurality of character strings of the standard data element according to the number, the total number of the character strings in the value range information of the data item and the repetition times.
In an embodiment of the application, the eighth determining unit is specifically configured to: for each string of the data item, determining a fourth degree of similarity between the string and the respective string of the standard data element; determining the maximum similarity among the fourth similarities; and taking the number of the maximum similarity degrees which is larger than a preset similarity threshold value as the number of the character strings of the data items which are matched with the character strings of the standard data elements.
In an embodiment of the application, the tenth determining unit is specifically configured to: determine a difference obtained by subtracting the number of repetitions from the total number; acquire a sum obtained by adding a preset value to the difference; determine the ratio of the number to the sum; and carry out normalization processing on the ratio, and determine a second similarity between the plurality of character strings of the data item and the plurality of character strings of the standard data element according to the normalization processing result.
According to the data processing device provided by the embodiment of the application, in the process of determining the data element matched with the data item to be processed, the target data element with the name information and the value range information matched with the data item is determined from the plurality of standard data elements according to the name information and the value range information of the data item. Therefore, the name information and the value range information of the data item are combined, manual participation is not needed, the target data element matched with the data item is accurately determined, the efficiency of determining the target data element matched with the data item is improved, and the cost is reduced.
In order to implement the foregoing embodiments, the present application further provides an electronic device, and fig. 9 is a schematic structural diagram of the electronic device provided in the embodiments of the present application. The electronic device includes:
a memory 901, a processor 902 and a computer program stored on the memory 901 and executable on the processor 902.
The processor 902, when executing the program, implements the data processing method provided in the above-described embodiments.
Further, the electronic device further includes:
a communication interface 903 for communication between the memory 901 and the processor 902.
A memory 901 for storing computer programs executable on the processor 902.
The memory 901 may include a high-speed RAM and may also include a non-volatile memory, such as at least one disk memory.
The processor 902 is configured to implement the data processing method of the above embodiment when executing the program.
If the memory 901, the processor 902, and the communication interface 903 are implemented independently, the communication interface 903, the memory 901, and the processor 902 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 9, but this does not indicate only one bus or one type of bus.
Optionally, in a specific implementation, if the memory 901, the processor 902, and the communication interface 903 are integrated on a chip, the memory 901, the processor 902, and the communication interface 903 may complete mutual communication through an internal interface.
The processor 902 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
In order to implement the foregoing embodiments, the present application also proposes a non-transitory computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the data processing method provided in the foregoing embodiments.
In order to implement the foregoing embodiments, the present application further provides a computer program product; when instructions in the computer program product are executed by a processor, the data processing method provided in the foregoing embodiments is implemented.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by hardware instructed by a program, that the program may be stored in a computer-readable storage medium, and that the program, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (17)

1. A method of data processing, the method comprising:
acquiring a data item to be processed and a plurality of preset standard data elements;
and determining a target data element with the name information and the value range information matched with the data item from the plurality of standard data elements according to the name information and the value range information of the data item.
2. The method of claim 1, wherein determining a target data element from the plurality of standard data elements that has name information and value range information that match the data item based on the name information and value range information of the data item comprises:
determining a first similarity between the name information of the data item and the name information of each standard data element;
determining a second similarity between the value range information of the data item and the value range information of each of the standard data elements;
and determining a target data element with name information and value range information matched with the data item from the plurality of standard data elements based on the first similarity and the second similarity.
3. The method of claim 2, wherein determining a target data element from the plurality of standard data elements that has name information and value range information that both match the data item based on the first similarity and the second similarity comprises:
determining a total similarity between the data item and each of the standard data elements according to the first similarity and the second similarity;
and determining a target data element with name information and value range information matched with the data item from the plurality of standard data elements according to the total similarity.
4. The method of claim 2, wherein determining a target data element from the plurality of standard data elements that has name information and value range information that match the data item based on the first similarity and the second similarity comprises:
determining a plurality of first candidate data elements from the plurality of standard data elements according to the first similarity;
determining a plurality of second candidate data elements from the plurality of first candidate data elements according to the second similarity;
and determining the target data element according to the plurality of second candidate data elements.
5. The method of claim 4, wherein said determining said target data element from said plurality of second candidate data elements comprises:
acquiring the labeled data element of a similar data item whose similarity to the data item is greater than a preset similarity threshold, wherein the labeled data element is one of the plurality of standard data elements;
determining the data element common to the labeled data element and the plurality of second candidate data elements;
and taking the common data element as the target data element.
6. The method of claim 4, wherein the first similarity is obtained by performing coarse-grained similarity calculation on the name information of the data item and of each of the standard data elements, and wherein determining a plurality of first candidate data elements from the plurality of standard data elements based on the first similarity comprises:
sorting the plurality of standard data elements in descending order of the first similarity to obtain a sorting result;
performing fine-grained similarity calculation on the name information of the data item and of each standard data element ranked in the top N positions of the sorting result, to obtain a third similarity between the data item and each standard data element ranked in the top N positions of the sorting result, wherein N is an integer greater than 1, and the granularity used for word segmentation of the name information in the coarse-grained similarity calculation is larger than that used in the fine-grained similarity calculation;
and determining, according to the third similarity, a plurality of first candidate data elements from the standard data elements ranked in the top N positions of the sorting result.
7. The method of claim 2, wherein, in the case that the value range information includes a plurality of digits, the determining a second similarity between the value range information of the data item and the value range information of each of the standard data elements includes:
determining a first total number of the plurality of digits of the data item and determining a second total number of the plurality of digits of each standard data element;
and respectively determining second similarity between the data item and each standard data element according to the first total quantity and each second total quantity.
8. The method of claim 7, wherein said determining a second degree of similarity between said data item and each standard data element, respectively, based on said first total number and each said second total number, comprises:
determining, for each of the standard data elements, a maximum of a second total number of the plurality of digits of the standard data element and the first total number;
determining a quantity difference between the first total quantity and a second total quantity of the plurality of digits of the standard data element;
determining a difference value obtained by subtracting the maximum value from the quantity difference, and determining a proportional value between the difference value and the maximum value;
determining a second similarity between the plurality of digits of the data item and the plurality of digits of the standard data element based on the scale value.
9. The method of claim 2, wherein, in the case that the value range information includes a plurality of texts, the determining a second similarity between the value range information of the data item and the value range information of each of the standard data elements comprises:
determining, for each of the standard data elements, a number of matches of the text of the data item with the text of the standard data element;
determining a number of repetitions that a plurality of texts of the data item match to a same text of the standard data element;
and determining a second similarity between the plurality of texts of the data item and the plurality of texts of the standard data element according to the number, the repetition times and the total number of texts in the value range information of the data item.
10. The method of claim 9, wherein said determining a number of matches of the text of the data item with the text of the standard data element comprises:
for each text of the data item, determining a fourth degree of similarity between the text and the respective text of the standard data element;
determining the maximum similarity among the fourth similarities;
and taking the number of the maximum similarity degrees which is larger than a preset similarity threshold value as the number of the texts of the data items which are matched with the texts of the standard data elements.
11. The method of claim 9 or 10, wherein determining the second similarity between the plurality of texts of the data item and the plurality of texts of the standard data element according to the number, the number of repetitions, and a total number of texts in the value range information of the data item comprises:
determining a difference obtained by subtracting the repetition times from the total number;
acquiring a sum obtained by summing the difference value and a preset value;
determining a ratio of said number to said sum;
and normalizing the ratio, and determining a second similarity between the texts of the data item and the texts of the standard data element according to a normalization processing result.
12. The method of claim 2, wherein, in a case where the value range information includes a plurality of character strings, the determining a second similarity between the value range information of the data item and the value range information of each of the standard data elements includes:
determining, for each of the standard data elements, a number of matches of the character string of the data item with the character string of the standard data element;
determining a number of repetitions that a plurality of strings of the data item match a same string of the standard data element;
and determining a second similarity between the plurality of character strings of the data item and the plurality of character strings of the standard data element according to the number, the total number of character strings in the value range information of the data item and the repetition times.
13. The method of claim 12, wherein said determining a number of matches of the character string of the data item with the character string of the standard data element comprises:
for each character string of the data item, determining a fourth similarity between the character string and each character string of the standard data element;
determining the maximum similarity among the fourth similarities;
and taking the number of the maximum similarity degrees which is larger than a preset similarity threshold value as the number of the character strings of the data items which are matched with the character strings of the standard data elements.
14. The method of claim 12 or 13, wherein determining the second similarity between the plurality of strings of the data item and the plurality of strings of the standard data element based on the number, the number of repetitions, and a total number of strings in the value range information of the data item comprises:
determining a difference obtained by subtracting the repetition times from the total number;
acquiring a sum obtained by summing the difference value and a preset value;
determining a ratio of said number to said sum;
and carrying out normalization processing on the ratio, and determining second similarity between the character strings of the data item and the character strings of the standard data element according to a normalization processing result.
15. A data processing apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring a data item to be processed and a plurality of preset standard data elements;
and the determining module is used for determining a target data element of which the name information and the value range information are matched with the data item from the plurality of standard data elements according to the name information and the value range information of the data item.
16. An electronic device, comprising:
a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the data processing method according to any one of claims 1 to 14.
17. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a data processing method according to any one of claims 1 to 14.
CN202210761255.4A 2022-06-30 2022-06-30 Data processing method and device and electronic equipment Pending CN115081531A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210761255.4A CN115081531A (en) 2022-06-30 2022-06-30 Data processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210761255.4A CN115081531A (en) 2022-06-30 2022-06-30 Data processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115081531A (en) 2022-09-20

Family

ID=83256391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210761255.4A Pending CN115081531A (en) 2022-06-30 2022-06-30 Data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115081531A (en)

Similar Documents

Publication Publication Date Title
WO2021164231A1 (en) Official document abstract extraction method and apparatus, and device and computer readable storage medium
CN111143597B (en) Image retrieval method, terminal and storage device
CN110941951B (en) Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN111767713B (en) Keyword extraction method and device, electronic equipment and storage medium
CN111382248B (en) Question replying method and device, storage medium and terminal equipment
CN112035480A (en) Data table management method, device, equipment and storage medium
CN114116973A (en) Multi-document text duplicate checking method, electronic equipment and storage medium
CN107885875B (en) Synonymy transformation method and device for search words and server
CN111159481B (en) Edge prediction method and device for graph data and terminal equipment
CN112613310A (en) Name matching method and device, electronic equipment and storage medium
CN114418226B (en) Fault analysis method and device for power communication system
CN117216239A (en) Text deduplication method, text deduplication device, computer equipment and storage medium
CN115081531A (en) Data processing method and device and electronic equipment
CN113051919A (en) Method and device for identifying named entity
CN115630595A (en) Automatic logic circuit generation method and device, electronic device and storage medium
CN115617978A (en) Index name retrieval method and device, electronic equipment and storage medium
CN114418114A (en) Operator fusion method and device, terminal equipment and storage medium
CN111459540B (en) Hardware performance improvement suggestion method and device and electronic equipment
CN109783816B (en) Short text clustering method and terminal equipment
CN110135412B (en) Business card recognition method and device
CN111782812A (en) K-Means text clustering method and device and terminal equipment
CN112783840B (en) Method and device for storing document, electronic equipment and storage medium
CN116894209B (en) Sampling point classification method, device, electronic equipment and readable storage medium
CN111723229B (en) Data comparison method, device, computer readable storage medium and electronic equipment
CN111382244B (en) Deep retrieval matching classification method and device and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination