CN110889286A - Dependency relationship identification method and device based on data table and computer equipment - Google Patents

Publication number
CN110889286A
CN110889286A CN201910968542.0A CN201910968542A CN110889286A CN 110889286 A CN110889286 A CN 110889286A CN 201910968542 A CN201910968542 A CN 201910968542A CN 110889286 A CN110889286 A CN 110889286A
Authority
CN
China
Prior art keywords
field name
source
word
words
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910968542.0A
Other languages
Chinese (zh)
Other versions
CN110889286B (en)
Inventor
徐知己
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910968542.0A
Publication of CN110889286A
Application granted
Publication of CN110889286B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Abstract

The application relates to a dependency relationship identification method and device based on a data table, and to computer equipment, in the field of data analysis. The method comprises the following steps: acquiring a dependency relationship identification task carrying a source table identifier and a target table identifier, and reading a source field name in the source table corresponding to the source table identifier according to the task; calling multiple threads to perform word segmentation on the source field name in parallel to obtain a feature word in the source field name; obtaining descriptors corresponding to the feature word from a preset word bank; generating a reference field name by using the feature word and the descriptors; calculating a first similarity between the source field name and the reference field name, marking the reference field name whose first similarity meets a preset condition as an intermediate field name, and recording the intermediate field name; and searching the target table for a target field name corresponding to the intermediate field name, and determining the dependency relationship between the source table and the target table when the target table includes the target field name. The method can improve the accuracy of identifying dependency relationships among data tables.

Description

Dependency relationship identification method and device based on data table and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a dependency relationship identification method and apparatus based on a data table, a computer device, and a storage medium.
Background
With the development of computer technology, the number of data tables processed by computers has greatly increased. In governing data tables, dependency relationships are introduced to clarify how a large number of data tables relate to one another. The dependency relationship of a data table describes how its data is associated with data in other data tables over the lifecycle of generation, transmission, use, storage, sharing and destruction. Identifying dependencies between data tables therefore facilitates data table governance.
There are two conventional ways of identifying dependencies. One is to identify dependencies between data tables manually. Because the volume of data tables is large, manual identification is inefficient and prone to errors such as omissions and mix-ups. The other is to trace data links through the code logic to obtain the dependency relationships between data tables. This approach depends heavily on the system code; when the system code is not known, dependencies between data tables cannot be identified this way. Moreover, because the code behind different data may differ, conversion between the code of different systems is prone to deviation, which causes dependency identification errors and lowers identification accuracy.
Disclosure of Invention
In view of the above technical problem of low accuracy in identifying data table dependencies, it is necessary to provide a dependency relationship identification method, device, computer device, and storage medium based on a data table that can improve the accuracy of identifying dependencies between data tables.
A method for dependency identification based on a data table, the method comprising:
acquiring a dependency relationship identification task carrying a source table identifier and a target table identifier, and reading a source field name in a source table corresponding to the source table identifier according to the dependency relationship identification task;
calling multiple threads to perform word segmentation on the source field name in parallel to obtain a feature word in the source field name;
acquiring descriptors corresponding to the feature word from a preset word bank;
generating a reference field name by using the feature word and the descriptors;
calculating a first similarity between the source field name and the reference field name, marking the reference field name corresponding to the first similarity meeting preset conditions as an intermediate field name, and recording the intermediate field name;
and acquiring a target table corresponding to the target table identifier from a data table file, searching a target field name corresponding to the intermediate field name from the target table, and determining the dependency relationship between the source table and the target table when the target table comprises the target field name.
In one embodiment, after invoking multiple threads to perform word segmentation on the source field names in parallel, the method further comprises:
obtaining the unlabeled words in the multiple words obtained after word segmentation;
acquiring a word blacklist from the data table file;
when the unlabeled words do not belong to the word blacklist, reading historical word segmentation records corresponding to the unlabeled words, wherein the historical word segmentation records comprise the historical word segmentation counts of the unlabeled words;
and generating prompt information according to the unlabeled words whose historical word segmentation counts are larger than a threshold value, and sending the prompt information to the corresponding terminal.
In one embodiment, invoking multiple threads to perform word segmentation on the source field names in parallel includes:
traversing the source field names, and matching the source field names against the preset word bank to obtain multiple word segmentation results;
determining a directed acyclic relation corresponding to the source field name according to the multiple word segmentation results;
calling a preset probability model to evaluate the directed acyclic relation corresponding to the source field name to obtain the probability corresponding to each word segmentation result;
and segmenting the source field name by using the word segmentation result with the maximum probability to obtain a plurality of words.
In one embodiment, the searching the target table for the target field name corresponding to the intermediate field name includes:
reading candidate field names in the target table;
and calculating a second similarity between the candidate field names and the intermediate field names, and marking the candidate field names corresponding to the second similarity larger than a preset value as target field names.
In one embodiment, the calculating the first similarity between the source field name and the reference field name includes:
generating a corresponding source word vector according to the source field name;
generating a reference word vector corresponding to the reference field name by using the feature words and the descriptor;
and calculating cosine similarity between the source word vector and the reference word vector, and taking the cosine similarity as the first similarity.
An apparatus for dependency identification based on a data table, the apparatus comprising:
the field name reading module is used for acquiring a dependency relationship identification task carrying a source table identifier and a target table identifier, and reading a source field name in a source table corresponding to the source table identifier according to the dependency relationship identification task;
the word segmentation module is used for calling multiple threads and performing word segmentation on the source field name in parallel to obtain a feature word in the source field name;
the descriptor acquisition module is used for acquiring descriptors corresponding to the feature words from a preset word bank;
the field name generating module is used for generating a reference field name by using the feature words and the descriptors;
the similarity generating module is used for calculating a first similarity between the source field name and the reference field name, marking the reference field name corresponding to the first similarity meeting preset conditions as an intermediate field name, and recording the intermediate field name;
and the dependency relationship determining module is used for acquiring a target table corresponding to the target table identifier from a data table file, searching a target field name corresponding to the intermediate field name from the target table, and determining the dependency relationship between the source table and the target table when the target table comprises the target field name.
In one embodiment, the apparatus further includes a prompt information generation module, configured to: obtain the unlabeled words among the multiple words obtained after word segmentation; acquire a word blacklist from the data table file; when the unlabeled words do not belong to the word blacklist, read historical word segmentation records corresponding to the unlabeled words, wherein the historical word segmentation records comprise the historical word segmentation counts of the unlabeled words; and generate prompt information according to the unlabeled words whose historical word segmentation counts are larger than a threshold value, and send the prompt information to the corresponding terminal.
In one embodiment, the word segmentation module is further configured to: traverse the source field names, and match the source field names against the preset word bank to obtain multiple word segmentation results; determine a directed acyclic relation corresponding to the source field name according to the multiple word segmentation results; call a preset probability model to evaluate the directed acyclic relation corresponding to the source field name to obtain the probability corresponding to each word segmentation result; and segment the source field name by using the word segmentation result with the maximum probability to obtain a plurality of words.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above dependency identification method based on a data table when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned dependency identification method based on data tables.
According to the above dependency relationship identification method, device, computer device and storage medium based on a data table, the source field names in the source table are read according to the dependency relationship identification task, multiple threads are called to segment the source field names in parallel to obtain the feature words, and the semantic features of the source field names are determined through the feature words. Reference field names are generated around the feature words, which standardizes the generated reference field names and improves the accuracy of semantic analysis on shorter source field names. When a target field name corresponding to an intermediate field name exists in the target table, the accurately matched field names are used to determine the dependency relationship between the source table and the target table, thereby effectively improving the accuracy of dependency relationship identification between data tables.
Drawings
FIG. 1 is a diagram of an application environment for a dependency identification method based on a data table in one embodiment;
FIG. 2 is a flowchart illustrating a method for dependency identification based on a data table in one embodiment;
FIG. 3 is a flowchart illustrating a method for dependency identification based on a data table in another embodiment;
FIG. 4 is a block diagram of an apparatus for dependency identification based on data tables according to an embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The dependency relationship identification method based on the data table can be applied to a terminal and can also be applied to an application environment shown in fig. 1. Here, the application to the application environment shown in fig. 1 is taken as an example. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may send the dependency identification task carrying the source table identifier and the target table identifier to the server 104. The server 104 reads the source field names in the source table corresponding to the source table identification according to the dependency relationship identification task, and invokes multithreading to perform word segmentation on the source field names in parallel to obtain the feature words in the source field names. The server 104 obtains the descriptor corresponding to the feature word from the preset word stock, and generates a reference field name by using the feature word and the descriptor. The server 104 calculates a first similarity between the source field name and the reference field name, marks the reference field name corresponding to the first similarity meeting a preset condition as an intermediate field name, and records the intermediate field name. The server 104 obtains the target table corresponding to the target table identifier from the data table file, searches the target field name corresponding to the intermediate field name from the target table, and when the target field name exists in the target table, the server 104 determines the dependency relationship between the source table and the target table to obtain the identification result. The server 104 may return the recognition result to the terminal 102. 
The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a dependency relationship identification method based on a data table is provided, and the method may be applied to a terminal, and may also be applied to a server in fig. 1. The method is described by taking the server in fig. 1 as an example, and comprises the following steps:
Step 202, obtaining a dependency identification task carrying a source table identifier and a target table identifier, and reading a source field name in a source table corresponding to the source table identifier according to the dependency identification task.
The source table is a data table whose dependency relationship with the target table is unknown. The source table may record data from various domains. For example, the source table may contain data from at least one of several domains such as finance, medical care, and machinery. The source field names are the field names corresponding to each column in the source table, and are generally multi-word phrases of 4-10 characters.
The server may obtain the dependency identification task in a variety of ways. For example, the server may retrieve the dependency identification task from a task list. The server may also receive a dependency relationship identification request uploaded by the terminal and establish the dependency relationship identification task according to that request. The dependency relationship identification task carries a source table identifier and a target table identifier, and the server can extract the source table and the target table from the data table file according to the respective identifiers. The data table file is used to store data tables, including but not limited to source tables and target tables, and may be stored in a database. After obtaining the source table, the server may read the table structure of the source table. Specifically, the server may read the source field corresponding to each column in the source table, together with the source field name, field type, and field comment of each source field, where the field comment explains the source field name.
Step 204, calling multiple threads to perform word segmentation on the source field name in parallel to obtain the feature words in the source field name.
The server can call a plurality of threads to segment the source field names in parallel, which improves the efficiency of word segmentation. Segmenting a source field name yields a plurality of words, from which the server obtains the feature words in the source field name. Specifically, the server obtains the label corresponding to each word and determines the feature words in the source field name according to those labels.
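The parallel segmentation step can be sketched as follows. This is a minimal illustration only: the patent does not name a threading API or a segmenter, so the thread pool, the tiny `VOCAB` list, and the greedy `segment` function are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical vocabulary; greedy longest-match segmentation stands in for
# whatever segmenter the server actually uses.
VOCAB = ["internal medicine", "patient", "name", "surgery", "doctor"]


def segment(field_name):
    """Split a field name into words by greedy forward longest matching."""
    words = []
    rest = field_name.strip().lower()
    while rest:
        match = next(
            (w for w in sorted(VOCAB, key=len, reverse=True) if rest.startswith(w)),
            None,
        )
        if match is None:
            # Unknown text: take everything up to the next space as one word.
            match = rest.split(" ", 1)[0]
        words.append(match)
        rest = rest[len(match):].lstrip()
    return words


def segment_all(field_names, workers=4):
    """Segment every field name on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so zip lines them up.
        return dict(zip(field_names, pool.map(segment, field_names)))
```

Calling `segment_all(["internal medicine patient name"])` returns a dictionary that maps each field name to its list of words.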
The server can acquire the labels corresponding to the words in various ways. For example, when a preset mapping relationship exists between each word and its label, the server may obtain the label corresponding to a word according to that mapping. The server can also look up the words in the preset word bank and determine their labels by using preset information in the word bank.
The label may take a variety of forms. For example, the label may mark a word as a "feature word" or a "descriptor", and the server determines that a word labeled "feature word" is the feature word in the source field name. Alternatively, words may be labeled "1" and "0": a word labeled "1" is a feature word in the source field name, and a word labeled "0" is a descriptor describing the feature word. Since the source field name is a short multi-word phrase, the words obtained by segmenting it usually include only one feature word, and the other words are descriptors describing that feature word.
For example, a source table recording medical information includes a source field named "internal medicine patient name". The server segments the source field name into the words "internal medicine", "patient" and "name". The server can match these words against the preset word bank to obtain the label corresponding to each word. If the label of the word "internal medicine" is "descriptor", the label of the word "patient" is "descriptor", and the label of the word "name" is "feature word", the server can determine that the feature word in the source field name "internal medicine patient name" is "name".
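The label lookup in this example can be sketched as a dictionary lookup. The `LEXICON_LABELS` mapping and the function name are hypothetical; the patent only states that labels come from a preset mapping or from the preset word bank.

```python
# Hypothetical preset word bank: word -> label ("feature" or "descriptor").
LEXICON_LABELS = {
    "internal medicine": "descriptor",
    "patient": "descriptor",
    "name": "feature",
}


def split_by_label(words):
    """Return (feature_words, descriptors, unlabeled) for segmented words."""
    features, descriptors, unlabeled = [], [], []
    for word in words:
        label = LEXICON_LABELS.get(word)
        if label == "feature":
            features.append(word)
        elif label == "descriptor":
            descriptors.append(word)
        else:
            # Words missing from the word bank carry no label.
            unlabeled.append(word)
    return features, descriptors, unlabeled
```

Words that are absent from the mapping are returned separately, which matches the later embodiment that handles unlabeled words.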
Step 206, obtaining the descriptors corresponding to the feature words from the preset word bank.
Each feature word has corresponding descriptors, which may be preset and stored in the preset word bank together with the feature word. A mapping relationship exists between each feature word and its descriptors. A feature word may correspond to one descriptor or to two or more descriptors, and one descriptor may describe one feature word or two or more feature words. The server can extract from the preset word bank the preset descriptors that have a mapping relationship with the feature words in the source field name.
Step 208, generating a reference field name by using the feature words and the descriptors.
The server may generate the reference field names using the feature words in the source field name and their corresponding descriptors. Here the descriptors are all the preset descriptors acquired by the server that have a mapping relationship with the feature words; they include, but are not limited to, the descriptors that describe the feature words in the source field name. The server can permute and combine the feature words and the preset descriptors to generate a plurality of reference field names. The generated reference field names include, but are not limited to, the source field name.
For example, the server determines that the feature word in the source field name is "name" and, according to the mapping relationship between feature words and descriptors, obtains the descriptors "internal medicine", "surgery", "patient" and "doctor" corresponding to the feature word "name". By permuting and combining the feature word and the descriptors, the server generates a plurality of reference field names, including "internal medicine name", "surgery name", "patient name", "doctor name", "internal medicine patient name", "internal medicine doctor name", "surgery patient name" and "surgery doctor name"; the remaining combinations are not listed one by one.
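One way to picture the permutation and combination step is the sketch below. It assumes that descriptors precede the feature word and that at most two descriptors are used per name; both are assumptions for illustration, since the patent does not fix the ordering or the maximum length.

```python
from itertools import permutations


def reference_field_names(feature_word, descriptors, max_descriptors=2):
    """Build candidate names: 1..max_descriptors descriptors, then the feature word."""
    names = []
    for count in range(1, max_descriptors + 1):
        for combo in permutations(descriptors, count):
            names.append(" ".join(combo + (feature_word,)))
    return names


names = reference_field_names(
    "name", ["internal medicine", "surgery", "patient", "doctor"]
)
```

With four descriptors this produces 4 one-descriptor names and 12 two-descriptor names, including "internal medicine patient name" and "surgery doctor name".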
Step 210, calculating a first similarity between the source field name and the reference field name, marking the reference field name whose first similarity meets the preset condition as an intermediate field name, and recording the intermediate field name.
The server calculates a first similarity between the source field name and each of the plurality of reference field names, and judges whether each calculated first similarity meets a preset condition. The preset condition may be set in advance according to actual requirements. For example, the preset condition may be that the first similarity is greater than the first similarities corresponding to the other reference field names, that is, that it is the maximum first similarity. Alternatively, the preset condition may be that the first similarity is the maximum and is also not less than a threshold. If the condition holds, the preset condition is met; otherwise it is not. The server marks the reference field name whose first similarity meets the preset condition as an intermediate field name and records the intermediate field name in a cache list.
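The selection rule (maximum first similarity, optionally also not less than a threshold) can be sketched as follows. One embodiment of the patent takes the cosine similarity between a source word vector and a reference word vector as the first similarity; how those vectors are built is not stated in this passage, so the sketch assumes simple bag-of-words count vectors. The function names and the 0.5 threshold are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine similarity between bag-of-words count vectors (an assumption)."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_intermediate(source_words, references, threshold=0.5):
    """references maps each reference field name to its segmented words.

    Returns the reference name with the highest first similarity, or None
    when no reference exists or the best score is below the threshold.
    """
    if not references:
        return None
    scores = {
        name: cosine_similarity(source_words, words)
        for name, words in references.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

The returned name plays the role of the intermediate field name that the server records in its cache list.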
Step 212, obtaining the target table corresponding to the target table identifier from the data table file, searching the target field name corresponding to the intermediate field name from the target table, and determining the dependency relationship between the source table and the target table when the target table includes the target field name.
According to the dependency relationship identification task, the server acquires the target table corresponding to the target table identifier from the data table file and searches the target table for the target field name corresponding to the intermediate field name. The target table is the data table whose dependency relationship with the source table needs to be identified, and it can record data from multiple domains. For example, the target table may contain data from at least one of several domains such as finance, medical care, and machinery. The target field names are the field names in the target table that correspond to the intermediate field names, and are typically multi-word phrases of 4-10 words. The server may search one or more target tables. The target field name may or may not be identical to the intermediate field name. One intermediate field name may correspond to one or more target field names, and one target field name may also correspond to one or more intermediate field names.
When the server finds the target field name corresponding to the intermediate field name from the target table, it may be determined that a dependency relationship exists between the source table and the target table, specifically, a dependency relationship exists between the source field in the source table and the target field in the target table. The server can send the identification result to the corresponding terminal, so that the corresponding terminal can display the identification result through the display interface.
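The target lookup can be sketched as below. In one embodiment the patent marks candidate field names whose second similarity to the intermediate field name exceeds a preset value; the similarity measure is not specified in this passage, so the sketch substitutes `difflib.SequenceMatcher.ratio()` from the Python standard library as a stand-in. The function names and the 0.8 preset value are assumptions.

```python
from difflib import SequenceMatcher


def find_target_fields(intermediate_name, candidate_names, preset_value=0.8):
    """Candidates whose (stand-in) second similarity exceeds the preset value."""
    return [
        candidate
        for candidate in candidate_names
        if SequenceMatcher(None, intermediate_name, candidate).ratio() > preset_value
    ]


def find_dependencies(intermediate_names, target_field_names, preset_value=0.8):
    """Map each intermediate field name to its matching target field names.

    A non-empty result indicates a dependency between the source table and
    the target table.
    """
    matches = {}
    for name in intermediate_names:
        found = find_target_fields(name, target_field_names, preset_value)
        if found:
            matches[name] = found
    return matches
```

Returning the matched pairs rather than a bare boolean keeps the field-level detail, since the dependency is described as existing between specific source and target fields.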
In the embodiment, the source field names in the source table are read according to the dependency relationship identification task, and the multithreading is called to perform word segmentation on the source field names in parallel, so that the efficiency of performing word segmentation on the source field names is improved. The server determines the feature words and the description words through the labels corresponding to the words, and determines the semantic features of the source field names through the feature words. The reference field names are generated around the feature words, the output result of the reference field names is standardized, the screening range is narrowed, and the accuracy of semantic analysis on shorter field names is effectively improved. When the target table has a target field name corresponding to the intermediate field name, the dependency relationship between the source table and the target table is determined using the accurately corresponding field name. The whole identification process does not need manpower, the efficiency of dependency identification is improved, and the accuracy of dependency identification on the data table is effectively improved.
In one embodiment, before invoking multiple threads to segment the source field names in parallel, the server may identify the field type corresponding to each source field name in the source table, where the field type may include the language of the source field name. For example, the source field name may be in Chinese or English. When the source field name is in Chinese, the server segments the source field name directly to obtain a plurality of words. When the source field name is in English, the server segments the field comment in place of the source field name; the field comment is generally written in Chinese.
In this embodiment, since Chinese word segmentation differs from English word segmentation, the server identifies the language of the source field name before segmenting it. When the source field name is in English, the field comment replaces the source field name for word segmentation. Data tables containing various field types can thus be handled effectively, which improves the generality of dependency identification.
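A minimal sketch of this language check follows. It assumes that a field name counts as Chinese when it contains at least one character in the CJK Unified Ideographs range, and that the field name itself is used when no comment is available; both rules are assumptions for illustration, not stated in the patent.

```python
import re

# CJK Unified Ideographs block; a simple heuristic for detecting Chinese text.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def text_to_segment(field_name, field_comment):
    """Choose the text to segment: the name if Chinese, else the comment."""
    if _CJK.search(field_name):
        return field_name
    # Falling back to the field name when the comment is empty is an assumption.
    return field_comment or field_name
```

For example, `text_to_segment("patient_name", "患者姓名")` selects the Chinese comment, while a Chinese field name is passed through unchanged.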
In one embodiment, invoking multiple threads to perform word segmentation on the source field name in parallel includes: traversing the source field name and matching it against a preset word bank to obtain multiple word segmentation results; determining a directed acyclic relation corresponding to the source field name according to the multiple word segmentation results; invoking a preset probability model to evaluate the directed acyclic relation corresponding to the source field name to obtain the probability corresponding to each word segmentation result; and performing word segmentation on the source field name by using the word segmentation result with the maximum probability to obtain a plurality of words.
The server may traverse the source field name in a variety of ways. For example, the server may perform a binary tree traversal of the source field name: the server may use each word in the source field name as a node of a binary tree and access each word in turn. The server may also traverse the source field name in different orders, for example by one of pre-order traversal, in-order traversal, and post-order traversal. The server performs string matching between the words in the preset word bank and the source field name to obtain multiple word segmentation results. The string matching mode may include forward string matching and reverse string matching.
The server can determine the directed acyclic relation corresponding to all the words in the source field name according to the obtained word segmentation results: each word can be used as an independent node, and directed connections exist between the nodes without forming a cycle. The server may generate a DAG (Directed Acyclic Graph) according to the directed acyclic relation corresponding to the source field name, where the words in the source field name are nodes in the DAG, and each edge in the DAG corresponds to a word in one word segmentation mode.
The server can invoke a preset probability model to evaluate the directed acyclic relation corresponding to the source field name. Specifically, the preset probability model may be a unigram probability model preset by a user. The server can determine the weight of each edge in the directed acyclic relation according to the word frequency of the corresponding word across the multiple word segmentation results, and calculate the probability corresponding to each word segmentation mode based on the preset probability model and the edge weights. The server compares the probabilities corresponding to the word segmentation results generated by the various word segmentation modes, and performs word segmentation on the source field name by using the word segmentation result with the maximum probability to obtain a plurality of words.
In this embodiment, the server obtains a plurality of word segmentation results by traversing the source field names and matching the source field names by using the preset lexicon, determines the directed acyclic relation corresponding to the source field names according to the plurality of word segmentation results, converts the maximum probability into the maximum path, and performs word segmentation on the source field names by using the word segmentation result with the maximum probability, thereby effectively improving the accuracy of word segmentation on the source field names.
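The segmentation procedure of this embodiment can be illustrated with a minimal sketch. It is not the implementation of the application; it assumes a hypothetical toy word bank `LEXICON` with invented word frequencies, an English field name written without spaces, and a unigram model in which the probability of a segmentation is the product of the relative frequencies of its words. The function names `build_dag` and `segment` are likewise assumptions made for illustration.

```python
import math

# Hypothetical toy word bank: word -> frequency (values are invented).
LEXICON = {"ad": 30, "mission": 20, "admission": 50, "diagnosis": 80}


def build_dag(text, lexicon):
    """Map each start index to the end indices of word-bank words starting there."""
    dag = {}
    for start in range(len(text)):
        ends = [end for end in range(start + 1, len(text) + 1)
                if text[start:end] in lexicon]
        # Fall back to a single character so every position has an outgoing edge.
        dag[start] = ends or [start + 1]
    return dag


def segment(text, lexicon):
    """Return the segmentation with the highest unigram log-probability."""
    total = sum(lexicon.values())
    dag = build_dag(text, lexicon)
    n = len(text)
    # best[i] = (best log-probability of text[i:], end index of its first word)
    best = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        best[start] = max(
            # Words missing from the word bank get a frequency of 1.
            (math.log(lexicon.get(text[start:end], 1) / total) + best[end][0], end)
            for end in dag[start]
        )
    # Follow the recorded end indices to recover the best path through the DAG.
    words, pos = [], 0
    while pos < n:
        end = best[pos][1]
        words.append(text[pos:end])
        pos = end
    return words


print(segment("admissiondiagnosis", LEXICON))
```

In this sketch the candidate segmentations "ad" + "mission" and "admission" are both edges leaving position 0 of the graph, and working in log-probabilities turns the search for the most probable segmentation into a search for the best path, which is the conversion from maximum probability to maximum path described above.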
In one embodiment, as shown in fig. 3, after invoking multiple threads to perform word segmentation on the source field name in parallel, the method further comprises:
Step 302, obtaining the unlabeled words among the multiple words obtained after word segmentation.
Step 304, obtaining the word blacklist from the data table file.
Step 306, when an unlabeled word does not belong to the word blacklist, reading the historical word segmentation record corresponding to the unlabeled word, wherein the historical word segmentation record comprises the historical word segmentation times corresponding to the unlabeled word.
Step 308, generating prompt information according to the unlabeled words whose historical word segmentation times are larger than the threshold value, and sending the prompt information to the corresponding terminal.
The server divides the source field name into a plurality of words. The server may obtain the labels corresponding to the words in a variety of ways. For example, a preset mapping relationship exists between a word and its corresponding label, and the server may obtain the label corresponding to the word according to the preset mapping relationship. The server can also look up the word in the preset word bank and determine the label corresponding to the word by using preset information in the preset word bank. In the process of obtaining the labels corresponding to the words, some words may have no corresponding label. The server treats the words that have no mapping relationship with any label as unlabeled words, and extracts the unlabeled words from the multiple words obtained by word segmentation. In one embodiment, when the server obtains an unlabeled word, the server may invoke a Hidden Markov Model (HMM) to perform word segmentation on the source field name.
The server may obtain the word blacklist from the data table file, which may be stored in a database. The word blacklist records unlabeled words that do not need to be marked with a corresponding label. The server matches the obtained unlabeled words against the words in the word blacklist and judges whether each unlabeled word belongs to the word blacklist. When an unlabeled word belongs to the word blacklist, this indicates that the word does not need a corresponding label, and the server ignores the unlabeled word.
When an unlabeled word does not belong to the word blacklist, the unlabeled word may be a new word or a word whose label was omitted. The server reads the historical word segmentation record corresponding to the unlabeled word. The historical word segmentation records are records kept by the server during previous word segmentation of source field names, and include the historical source field names, the historical words, the occurrence times of the historical words, the historical word segmentation times corresponding to the unlabeled words, and the like.
And the server acquires the historical word segmentation times corresponding to the unlabeled words. When the historical word segmentation times are larger than the threshold value, the server generates prompt information according to the unlabeled words corresponding to the historical word segmentation times, and the server sends the generated prompt information to the terminal to prompt the user to mark the labels corresponding to the unlabeled words. The user can set the mapping relation between the unlabeled words and the labels through the terminal according to the prompt information displayed by the terminal, or the unlabeled words are added into a preset word bank. In one embodiment, when the historical word segmentation times are less than a threshold, the server ignores the unlabeled word. The threshold value can be a natural number, and can be set according to the actual application requirement.
For example, to avoid missing any unlabeled word, the threshold may be set to 0, i.e., the server generates prompt information for every unlabeled word that occurs and does not belong to the word blacklist. Alternatively, to filter out unlabeled words that occur only occasionally, the threshold may be set to N, so that prompt information is generated only for unlabeled words whose historical word segmentation times are greater than N, where N may be a positive integer.
In this embodiment, the server obtains the unlabeled words from the plurality of words obtained after word segmentation, and determines whether to generate prompt information by judging whether each unlabeled word belongs to the word blacklist and whether its historical word segmentation times are greater than the threshold value, so as to prompt a user to mark a corresponding label for the unlabeled word and to update the words and their corresponding labels. This helps the server identify the feature words and descriptors in a source field name more accurately, and effectively improves the accuracy with which the server identifies dependency relationships between data tables.
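The filtering described in steps 302 to 308 can be sketched as follows. This is only an illustrative sketch: the function name `words_needing_labels`, the in-memory dictionaries and sets standing in for the label mapping, the word blacklist, and the historical word segmentation records, and all sample values are assumptions rather than details given in the application.

```python
def words_needing_labels(words, labels, blacklist, history_counts, threshold=0):
    """Return the unlabeled words for which prompt information should be generated.

    words          -- words obtained after word segmentation
    labels         -- assumed mapping from word to label
    blacklist      -- assumed set of words that intentionally have no label
    history_counts -- assumed mapping from word to historical segmentation times
    threshold      -- prompts are generated only for counts greater than this
    """
    prompts = []
    for word in words:
        if word in labels:
            continue  # the word already has a label
        if word in blacklist:
            continue  # the word is deliberately left unlabeled
        if history_counts.get(word, 0) > threshold:
            prompts.append(word)
    return prompts


# Hypothetical sample data.
segmented = ["admission", "diagnosis", "tmp", "triage"]
label_map = {"admission": "descriptor", "diagnosis": "feature"}
word_blacklist = {"tmp"}
history = {"triage": 7, "tmp": 12}

print(words_needing_labels(segmented, label_map, word_blacklist, history, threshold=3))
```

With a threshold of 0 every unlabeled, non-blacklisted word that has been seen at least once is reported, while a larger threshold suppresses words that appear only occasionally.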
In one embodiment, calculating a first similarity between the source field name and the reference field name comprises: generating a corresponding source word vector according to the source field name; generating a reference word vector corresponding to the reference field name by using the feature words and the descriptor; and calculating cosine similarity between the source word vector and the reference word vector, and taking the cosine similarity as a first similarity.
And the server divides the source field name into words to obtain a plurality of words. And the server generates a source word vector corresponding to the source field name by using a plurality of words obtained by word segmentation. The reference field name is formed by arranging and combining the characteristic words and the description words, and the server can also generate the reference word vector corresponding to the reference field name by using the characteristic words and the description words of the combined reference field name. Each reference field name corresponds to a reference word vector. The server can calculate cosine similarity between the source word vector and each reference word vector, and the cosine similarity obtained through calculation is used as first similarity between the source field name and the reference field name. In one embodiment, the way in which the server calculates the cosine similarity between the source word vector and the reference word vector can be expressed as:
$$\cos(\theta) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
where $\cos(\theta)$ represents the cosine similarity between the source word vector and the reference word vector, $x_i$ represents the components of the source word vector $x$ for $i = 1$ to $i = n$, $y_i$ represents the components of the reference word vector $y$ for $i = 1$ to $i = n$, and $n$ is a positive integer.
The server determines the intermediate field name according to whether the first similarity meets the preset condition. The server marks the reference field name corresponding to the first similarity meeting the preset condition as the intermediate field name. In one embodiment, when none of the first similarities corresponding to the plurality of reference field names meets the preset condition, the server may generate alarm information and send the alarm information to the corresponding terminal to prompt that an error may have occurred.
For example, a source table that records medical data includes a source field named "admission diagnosis". After the server performs word segmentation on the source field name, the feature word in the source field name is obtained as "diagnosis" and the descriptor as "admission". According to the feature word, the server obtains the corresponding preset descriptors, including "discharge", "admission", and "outpatient". The server can generate reference field names including "discharge diagnosis", "admission diagnosis", and "outpatient diagnosis" according to the feature word and the preset descriptors. For the manner in which the server generates the reference field names, refer to the description of generating reference field names in the above embodiments, which is not repeated here. The server may generate a corresponding reference word vector according to each reference field name, where the reference word vector corresponding to the reference field name "discharge diagnosis" is (0,1,0,1), the reference word vector corresponding to the reference field name "admission diagnosis" is (1,0,0,1), and the reference word vector corresponding to the reference field name "outpatient diagnosis" is (0,0,1,1). The source word vector corresponding to the source field name "admission diagnosis" is likewise (1,0,0,1), so the server obtains first similarities of 0.5, 1, and 0.5 for the three reference field names, respectively. The server may mark the reference field name "admission diagnosis" as the intermediate field name according to whether the first similarities meet the preset condition.
In this embodiment, the server determines the source word vector corresponding to the source field name and the reference word vector corresponding to the reference field name, and calculates the cosine similarity between the two vectors to obtain the first similarity between the source field name and the reference field name, which helps to accurately determine the intermediate field name corresponding to the source field name and effectively improves the accuracy of identifying the dependency relationship between the data tables.
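The "admission diagnosis" example can be reproduced with a short sketch. The helper names `cosine_similarity` and `to_vector`, the ordering of the vocabulary, and the choice of "highest similarity" as the preset condition are assumptions made for illustration; the application does not fix the preset condition.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def to_vector(words, vocabulary):
    """Binary word vector: 1 where a vocabulary term occurs in `words`."""
    return [1 if term in words else 0 for term in vocabulary]


feature_word = "diagnosis"
descriptors = ["admission", "discharge", "outpatient"]  # assumed preset descriptors
vocabulary = descriptors + [feature_word]

# Source field name "admission diagnosis" after word segmentation.
source_vector = to_vector(["admission", "diagnosis"], vocabulary)

# Reference field names are combinations of a descriptor and the feature word.
reference_names = {d + " " + feature_word: [d, feature_word] for d in descriptors}

similarities = {
    name: round(cosine_similarity(source_vector, to_vector(words, vocabulary)), 4)
    for name, words in reference_names.items()
}

# Assumed preset condition: keep the reference field name with the highest similarity.
intermediate_field_name = max(similarities, key=similarities.get)
print(similarities, intermediate_field_name)
```

Rounding to four decimal places only hides floating-point noise; the underlying values match the first similarities of 0.5, 1, and 0.5 given in the example.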
In one embodiment, looking up the target field name corresponding to the intermediate field name from the target table comprises: reading the candidate field names in the target table; and calculating a second similarity between each candidate field name and the intermediate field name, and marking the candidate field names whose second similarity is larger than a preset value as target field names.
The candidate field names are the field names corresponding to the candidate fields in the target table, and may be all of the field names in the target table or only some of them. For example, the candidate field names may be the field names corresponding to each column in the target table, or may be the field names that include the same feature word as the source field name. The server reads the candidate field names in the target table and calculates a second similarity between each candidate field name and the intermediate field name. The second similarity between a candidate field name and the intermediate field name may be calculated in a manner similar to the first similarity in the above embodiment, and the description is not repeated here.
According to the second similarity between each candidate field name and the intermediate field name, the server marks the candidate field names whose second similarity is larger than the preset value as target field names. The preset value can be set according to actual application requirements and may be any number from 0 to 1, inclusive. The server looks up the target field name in the target table, and when the target table contains a candidate field name marked as the target field name, the server determines the dependency relationship between the source table and the target table. In particular, the server may determine a dependency between the source field name in the source table and the target field name in the target table.
In this embodiment, the server reads the candidate field names in the target table, and marks the target field names by using the second similarity between the candidate field names and the intermediate field names, which is helpful for the server to search the target field names corresponding to the intermediate field names from the target table, so as to determine the dependency between the source table and the target table, thereby effectively improving the accuracy of identifying the dependency between the data tables.
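The lookup of target field names can be sketched as below. The sketch assumes that the candidate field names have already been segmented into words and that the second similarity is a cosine similarity over word counts, in line with the statement that it may be calculated like the first similarity; the names `name_similarity` and `find_target_fields`, the sample fields, and the preset value of 0.7 are hypothetical.

```python
import math
from collections import Counter


def name_similarity(words_a, words_b):
    """Cosine similarity between two field names given as lists of words."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def find_target_fields(intermediate_words, candidates, preset_value):
    """Return candidate field names whose second similarity exceeds preset_value."""
    return [name for name, words in candidates.items()
            if name_similarity(intermediate_words, words) > preset_value]


# Hypothetical candidate field names of a target table, already segmented.
candidates = {
    "admission diagnosis code": ["admission", "diagnosis", "code"],
    "discharge diagnosis": ["discharge", "diagnosis"],
    "patient id": ["patient", "id"],
}

targets = find_target_fields(["admission", "diagnosis"], candidates, preset_value=0.7)
print(targets)
```

A non-empty result indicates that the target table contains a target field name, which is the condition under which the dependency relationship between the source table and the target table is determined.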
In one embodiment, the server may read the source field names in the source table and the candidate field names in the target table after obtaining the source table and the target table. And the server compares the source field name with the candidate field name, and when the source field name is consistent with the candidate field name, the server determines that the comparison is successful. Otherwise, the comparison fails. The server can determine the dependency relationship between the source table and the target table according to the successfully compared source field names and candidate field names.
In this embodiment, the server determines the dependency relationship between the source table and the target table corresponding to the successfully-compared source field name and candidate field name according to the comparison result by directly comparing the source field name and the candidate field name, thereby effectively improving the identification efficiency of the dependency relationship between the data tables.
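The direct comparison in this embodiment amounts to an exact-match test between the two sets of field names. A minimal sketch, assuming the field names are available as lists of strings and using the hypothetical function name `match_identical_fields`:

```python
def match_identical_fields(source_fields, candidate_fields):
    """Return the source field names that also appear verbatim in the target table."""
    candidate_set = set(candidate_fields)  # set lookup keeps each comparison cheap
    return [name for name in source_fields if name in candidate_set]


# Hypothetical field names.
source_fields = ["patient id", "admission diagnosis", "ward"]
candidate_fields = ["admission diagnosis", "patient id", "discharge date"]

matched = match_identical_fields(source_fields, candidate_fields)
print(matched)
```

Because no word segmentation or similarity calculation is needed for names that match exactly, this check can serve as a fast path before the similarity-based lookup.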
It should be understood that although the steps in the flow charts of fig. 2-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least some of the steps in fig. 2-3 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided a dependency relationship identifying apparatus based on a data table, including: a field name reading module 402, a word segmentation module 404, a descriptor acquisition module 406, a field name generation module 408, a similarity generation module 410, and a dependency determination module 412, wherein:
The field name reading module 402 is configured to acquire a dependency identification task carrying a source table identifier and a target table identifier, and read a source field name in a source table corresponding to the source table identifier according to the dependency identification task.
The word segmentation module 404 is configured to invoke multiple threads to perform word segmentation on the source field name in parallel to obtain a feature word in the source field name.
The descriptor acquisition module 406 is configured to obtain a descriptor corresponding to the feature word from a preset word bank.
The field name generation module 408 is configured to generate a reference field name by using the feature word and the descriptor.
The similarity generation module 410 is configured to calculate a first similarity between the source field name and the reference field name, mark the reference field name corresponding to the first similarity meeting the preset condition as an intermediate field name, and record the intermediate field name.
The dependency determination module 412 is configured to obtain a target table corresponding to the target table identifier from the data table file, search a target field name corresponding to the intermediate field name from the target table, and determine a dependency relationship between the source table and the target table when the target table includes the target field name.
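The parallel word segmentation performed by the word segmentation module can be sketched with a thread pool. This is an illustrative sketch only: `segment_field_name` is a placeholder that splits on underscores instead of using a word bank, and the function names, the sample field names, and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_field_name(name):
    """Placeholder segmenter: splits on underscores instead of using a word bank."""
    return name.split("_")


def segment_all(source_field_names, max_workers=4):
    """Segment every source field name, distributing the work over a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the input names.
        results = pool.map(segment_field_name, source_field_names)
        return dict(zip(source_field_names, results))


print(segment_all(["admission_diagnosis", "patient_id"]))
```

In CPython, threads mainly help when the segmenter waits on I/O such as word bank lookups in a database; for purely CPU-bound segmentation a process pool may be the more suitable choice.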
In one embodiment, after the word segmentation module, the apparatus further includes a prompt information generation module, configured to obtain an unlabeled word in the multiple words obtained after the word segmentation; acquiring a word blacklist from a data table file; when the unlabeled words do not belong to the word blacklist, reading historical word segmentation records corresponding to the unlabeled words, wherein the historical word segmentation records comprise historical word segmentation times corresponding to the unlabeled words; and generating prompt information according to the unlabeled words with the historical word segmentation times larger than the threshold value, and sending the prompt information to the corresponding terminal.
In one embodiment, the word segmentation module 404 is further configured to traverse the source field names, and match the source field names with a preset word library to obtain multiple word segmentation results; determining a directed acyclic relation corresponding to the source field name according to the multiple word segmentation results; calling a preset probability model to operate the directed acyclic relation corresponding to the source field name to obtain the probability corresponding to the word segmentation result; and performing word segmentation on the source field name by using the word segmentation result with the maximum probability to obtain a plurality of words.
In one embodiment, the dependency determination module 412 is further configured to read candidate field names in the target table; and calculating a second similarity between the candidate field names and the middle field names, and marking the candidate field names corresponding to the second similarity larger than a preset value as target field names.
In one embodiment, the similarity generating module 410 is further configured to generate a corresponding source word vector according to the source field name; generating a reference word vector corresponding to the reference field name by using the feature words and the descriptor; and calculating cosine similarity between the source word vector and the reference word vector, and taking the cosine similarity as a first similarity.
For specific limitations of the dependency relationship identification apparatus based on the data table, reference may be made to the above limitations of the dependency relationship identification method based on the data table, and details are not repeated here. The modules in the above dependency relationship identification apparatus can be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal or a server. Here, the server is taken as an example, and the internal structure thereof may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing dependency identification data based on the data table. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a dependency identification method based on a data table.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above-described method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described, but any combination of these technical features should be considered within the scope of the present specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for dependency identification based on a data table, the method comprising:
acquiring a dependency relationship identification task carrying a source table identifier and a target table identifier, and reading a source field name in a source table corresponding to the source table identifier according to the dependency relationship identification task;
calling multiple threads to perform word segmentation on the source field name in parallel to obtain a feature word in the source field name;
acquiring descriptive words corresponding to the characteristic words from a preset word bank;
generating a reference field name by using the feature words and the descriptor;
calculating a first similarity between the source field name and the reference field name, marking the reference field name corresponding to the first similarity meeting preset conditions as an intermediate field name, and recording the intermediate field name;
and acquiring a target table corresponding to the target table identifier from a data table file, searching a target field name corresponding to the intermediate field name from the target table, and determining the dependency relationship between the source table and the target table when the target table comprises the target field name.
2. The method of claim 1, wherein after the calling multiple threads to perform word segmentation on the source field name in parallel, the method further comprises:
obtaining the unlabeled words in the multiple words obtained after word segmentation;
acquiring a word blacklist from the data table file;
when the unlabeled words do not belong to the word blacklist, reading historical word segmentation records corresponding to the unlabeled words, wherein the historical word segmentation records comprise historical word segmentation times corresponding to the unlabeled words;
and generating prompt information according to the unlabeled words with the historical word segmentation times larger than a threshold value, and sending the prompt information to the corresponding terminal.
3. The method of claim 1, wherein the calling multiple threads to perform word segmentation on the source field name in parallel comprises:
traversing the source field names, and matching the source field names by using the preset word stock to obtain various word segmentation results;
determining a directed acyclic relation corresponding to the source field name according to the multiple word segmentation results;
calling a preset probability model to operate the directed acyclic relation corresponding to the source field name to obtain the probability corresponding to the word segmentation result;
and performing word segmentation on the source field name by using the word segmentation result with the maximum probability to obtain a plurality of words.
4. The method of claim 1, wherein the looking up the target field name corresponding to the intermediate field name from the target table comprises:
reading candidate field names in the target table;
and calculating a second similarity between the candidate field names and the intermediate field names, and marking the candidate field names corresponding to the second similarity larger than a preset value as target field names.
5. The method of claim 1, wherein calculating the first similarity between the source field name and the reference field name comprises:
generating a corresponding source word vector according to the source field name;
generating a reference word vector corresponding to the reference field name by using the feature words and the descriptor;
and calculating cosine similarity between the source word vector and the reference word vector, and taking the cosine similarity as the first similarity.
6. An apparatus for identifying dependencies based on data tables, the apparatus comprising:
the field name reading module is used for acquiring a dependency relationship identification task carrying a source table identifier and a target table identifier, and reading a source field name in a source table corresponding to the source table identifier according to the dependency relationship identification task;
the word segmentation module is used for calling multiple threads and performing word segmentation on the source field name in parallel to obtain a feature word in the source field name;
the descriptor acquisition module is used for acquiring descriptors corresponding to the feature words from a preset word bank;
the field name generating module is used for generating a reference field name by utilizing the characteristic words and the description words;
the similarity generating module is used for calculating a first similarity between the source field name and the reference field name, marking the reference field name corresponding to the first similarity meeting preset conditions as an intermediate field name, and recording the intermediate field name;
and the dependency relationship determining module is used for acquiring a target table corresponding to the target table identifier from a data table file, searching a target field name corresponding to the intermediate field name from the target table, and determining the dependency relationship between the source table and the target table when the target table comprises the target field name.
7. The device according to claim 6, wherein after the word segmentation module, the device further comprises a prompt information generation module for obtaining an unlabeled word in the plurality of words obtained after the word segmentation; acquiring a word blacklist from the data table file; when the unlabeled words do not belong to the word blacklist, reading historical word segmentation records corresponding to the unlabeled words, wherein the historical word segmentation records comprise historical word segmentation times corresponding to the unlabeled words; and generating prompt information according to the unlabeled words with the historical word segmentation times larger than a threshold value, and sending the prompt information to the corresponding terminal.
8. The device of claim 6, wherein the word segmentation module is further configured to traverse the source field names, and match the source field names with the preset lexicon to obtain multiple word segmentation results; determining a directed acyclic relation corresponding to the source field name according to the multiple word segmentation results; calling a preset probability model to operate the directed acyclic relation corresponding to the source field name to obtain the probability corresponding to the word segmentation result; and performing word segmentation on the source field name by using the word segmentation result with the maximum probability to obtain a plurality of words.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN201910968542.0A 2019-10-12 2019-10-12 Dependency relationship identification method and device based on data table and computer equipment Active CN110889286B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910968542.0A CN110889286B (en) 2019-10-12 2019-10-12 Dependency relationship identification method and device based on data table and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910968542.0A CN110889286B (en) 2019-10-12 2019-10-12 Dependency relationship identification method and device based on data table and computer equipment

Publications (2)

Publication Number Publication Date
CN110889286A (en) 2020-03-17
CN110889286B (en) 2022-04-12

Family

ID=69746085

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN201910968542.0A Active CN110889286B (en) 2019-10-12 2019-10-12 Dependency relationship identification method and device based on data table and computer equipment

Country Status (1)

Country Link
CN (1) CN110889286B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723210A (en) * 2020-06-29 2020-09-29 深圳壹账通智能科技有限公司 Method and device for storing data table, computer equipment and readable storage medium
CN112883014A (en) * 2021-03-25 2021-06-01 上海众源网络有限公司 Data backtracking method and device, computer equipment and storage medium
CN112948400A (en) * 2020-09-17 2021-06-11 深圳市明源云科技有限公司 Database management method, database management device and terminal equipment
CN114385623A (en) * 2021-11-30 2022-04-22 北京达佳互联信息技术有限公司 Data table acquisition method, device, apparatus, storage medium, and program product
CN114896352A (en) * 2022-04-06 2022-08-12 北京月新时代科技股份有限公司 Method, system, medium and computer device for automatically matching field names of well files without field names

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06334181A (en) * 1993-05-26 1994-12-02 Fujitsu Ltd Field effect transistor
WO2014168899A2 (en) * 2013-04-11 2014-10-16 Microsoft Corporation Word breaker from cross-lingual phrase table
CN107291672A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 Method and apparatus for processing data tables
CN107688664A (en) * 2017-09-25 2018-02-13 平安科技(深圳)有限公司 Chart generation method, device, computer equipment and storage medium
CN107704625A (en) * 2017-10-30 2018-02-16 锐捷网络股份有限公司 Field matching method and apparatus
CN108038135A (en) * 2017-11-21 2018-05-15 平安科技(深圳)有限公司 Electronic device, multi-table associated query method and storage medium
CN108776673A (en) * 2018-05-23 2018-11-09 哈尔滨工业大学 Automatic conversion method, device and storage medium for relation schemas
CN109325078A (en) * 2018-09-18 2019-02-12 拉扎斯网络科技(上海)有限公司 Method and device for determining data lineage based on structured data
CN110019825A (en) * 2017-07-25 2019-07-16 华为技术有限公司 Method and device for analyzing data semantics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
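"Application of Fuzzy Lookup and Fuzzy Grouping in Batch Data Merging" (模糊查找与模糊分组在批量数据合并中的应用), 《北京印刷学院学报》 (Journal of Beijing Institute of Graphic Communication) *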

Also Published As

Publication number Publication date
CN110889286B (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN110889286B (en) Dependency relationship identification method and device based on data table and computer equipment
CN109933785B (en) Method, apparatus, device and medium for entity association
CN110457431B (en) Knowledge graph-based question and answer method and device, computer equipment and storage medium
WO2022105122A1 (en) Answer generation method and apparatus based on artificial intelligence, and computer device and medium
CN108664595B (en) Domain knowledge base construction method and device, computer equipment and storage medium
CN110751533B (en) Product portrait generation method and device, computer equipment and storage medium
CN112015900B (en) Medical attribute knowledge graph construction method, device, equipment and medium
CN110688853B (en) Sequence labeling method and device, computer equipment and storage medium
CN112380837B (en) Similar sentence matching method, device, equipment and medium based on translation model
CN112488896B (en) Emergency plan generation method and device, computer equipment and storage medium
CN111445968A (en) Electronic medical record query method and device, computer equipment and storage medium
CN112181489B (en) Code migration method, device, computer equipment and storage medium
CN113536735B (en) Text marking method, system and storage medium based on keywords
CN112231224A (en) Business system testing method, device, equipment and medium based on artificial intelligence
CN113707300A (en) Search intention identification method, device, equipment and medium based on artificial intelligence
CN112988595A (en) Dynamic synchronization test method, device, equipment and storage medium
CN111985241A (en) Medical information query method, device, electronic equipment and medium
CN110781677A (en) Medicine information matching processing method and device, computer equipment and storage medium
CN113961768B (en) Sensitive word detection method and device, computer equipment and storage medium
US10866944B2 (en) Reconciled data storage system
Alatawi et al. The expansion of source code abbreviations using a language model
CN113255343A (en) Semantic identification method and device for label data, computer equipment and storage medium
CN111191446B (en) Interactive information processing method and device, computer equipment and storage medium
Zou et al. Quantity tagger: A latent-variable sequence labeling approach to solving addition-subtraction word problems
CN111191028A (en) Sample labeling method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant