WO2019120169A1 - Method, apparatus, and electronic device for automatically associating synonymous data in heterogeneous databases - Google Patents

Method, apparatus, and electronic device for automatically associating synonymous data in heterogeneous databases

Info

Publication number
WO2019120169A1
Authority
WO
WIPO (PCT)
Prior art keywords
field
word
database
thesaurus
words
Prior art date
Application number
PCT/CN2018/121512
Other languages
English (en)
French (fr)
Inventor
郭杏荣
Original Assignee
北京金山云网络技术有限公司
北京金山云科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京金山云网络技术有限公司, 北京金山云科技有限公司
Publication of WO2019120169A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Definitions

  • The present application relates to the field of data processing technologies, and in particular to a method, apparatus, and electronic device for automatically associating synonymous data in heterogeneous databases.
  • Software systems that perform the same or similar business functions often have multiple implementations, such as the various ride-hailing applications for individual consumers, the various banking systems for enterprises, and the various hospital information systems.
  • Software systems whose functions are the same or similar, but whose implementation and internal structure differ, are called heterogeneous systems, and each database in a heterogeneous system is called a heterogeneous database.
  • In heterogeneous databases, the same data may be named, processed, and stored differently in each database.
  • Data that corresponds to the same business object or attribute across heterogeneous systems may be referred to as synonymous data.
  • The main reason heterogeneous systems arise is that many enterprises compete in the same market segment. For example, according to incomplete statistics, more than 130 vendors in China provide information systems for hospitals, of which more than 10 are large national vendors; no single software system has a high market share, and the market is highly fragmented. As a result, the industry's data has become heavily fragmented into so-called "data islands". Data cannot be exchanged between the software systems of different vendors, or even between different deployment instances of the same vendor's software system, which creates significant obstacles to industry convergence, business linkage, rich data-based applications, and government and industry regulation. The first step in solving these problems is to connect the data on the "data islands", which requires associating the synonymous data in the heterogeneous databases.
  • Existing methods for associating synonymous data in heterogeneous databases work by uniformly converting the synonymous data into a canonical format. Specifically, a national competent authority or industry organization first develops a data standard specification, and the synonymous data in the heterogeneous databases is then manually converted into a standardized data format according to that specification, so that the converted synonymous data has a consistent format, thereby associating the synonymous data in the heterogeneous databases.
  • The purpose of the embodiments of the present application is to provide a method, an apparatus, and an electronic device for automatically associating synonymous data in heterogeneous databases, so as to improve the efficiency of associating synonymous data between heterogeneous databases.
  • The specific technical solutions are as follows:
  • An embodiment of the present application discloses a method for automatically associating synonymous data in heterogeneous databases, the method including:
  • An embodiment of the present application discloses an apparatus for automatically associating synonymous data in heterogeneous databases, the apparatus including:
  • an obtaining module, configured to acquire fields in a first database and a second database, where the first database and the second database are heterogeneous with respect to each other;
  • a search module, configured to search the thesaurus for the words corresponding to the acquired fields, based on a mapping relationship between preset fields and words in the thesaurus, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, where the thesaurus contains the technical terms of the industry to which the first database and the second database belong;
  • a comparison module, configured to compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and to associate a first field in the first database with a second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.
  • An embodiment of the present application further discloses an electronic device, including a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions executable by the processor, and the machine-executable instructions cause the processor to:
  • An embodiment of the present application further provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the steps of any of the above methods for automatically associating synonymous data in heterogeneous databases.
  • An embodiment of the present application further provides a computer program product comprising instructions that, when run on a computer, cause the computer to perform the steps of any of the foregoing methods for automatically associating synonymous data in heterogeneous databases.
  • An embodiment of the present application further provides a computer program that, when run on a computer, causes the computer to perform the steps of the method for automatically associating synonymous data in heterogeneous databases provided by the above first aspect.
  • With the method, apparatus, and electronic device for automatically associating synonymous data in heterogeneous databases provided by the embodiments of the present application, the fields in the first database and the second database are first acquired, where the first database and the second database are heterogeneous with respect to each other. Then, based on the mapping relationship between preset fields and words in the thesaurus, the thesaurus is searched for the words corresponding to the acquired fields, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database. Finally, the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database is compared, and a first field in the first database is associated with a second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.
  • FIG. 1 is a schematic flowchart of a method for automatically associating synonymous data in heterogeneous databases according to an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of an apparatus for automatically associating synonymous data in heterogeneous databases according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 4 is another schematic structural diagram of an electronic device according to an embodiment of the present application.
  • Table 1 is the database of the military and civilian health file system of the manufacturer A.
  • Both Table 1 and Table 2 include items such as the date of physical examination, body temperature, and pulse rate.
  • The items that appear in both Table 1 and Table 2, such as the date of physical examination, body temperature, and pulse rate, can be referred to as synonymous data.
  • The expression, naming, and storage of such synonymous data may differ between heterogeneous systems. Being able to associate the synonymous data in these heterogeneous databases is valuable for both individuals and groups.
  • The present application provides a method for automatically associating synonymous data in heterogeneous databases. The method can automatically associate the synonymous data in the databases of heterogeneous software systems (for example, systems from different software developers, or different versions from the same developer) that perform the same or similar functions, thereby addressing the problems of opening up, integrating, and linking industry data and of big data analysis. The specific process is as follows:
  • FIG. 1 is a schematic flowchart of a method for automatically associating synonymous data in heterogeneous databases according to an embodiment of the present application, including the following steps:
  • A heterogeneous database is any database in a heterogeneous system, where a heterogeneous system is a software system with the same or similar business functions as another system but with a different implementation and internal structure.
  • To associate synonymous data in heterogeneous databases, the fields in the heterogeneous databases must first be acquired; the meanings of the different fields are then compared, and the fields that have the same meaning are associated.
  • The acquired fields in the first database and the second database are fields of software systems whose business functions are the same or similar but whose implementation and internal structure differ; that is, the first database and the second database are heterogeneous with respect to each other.
  • Among the acquired fields of the first database and the second database, fields with the same or similar meaning are synonymous data.
  • A mapping is a correspondence between the elements of two sets. The mapping relationship between preset fields and words in the thesaurus is established in advance. For example, four mapping relationships m1, m2, m3, and m4 are established, where each mapping relationship contains multiple key-to-value pairs, the key being a preset field and the value being the words in the thesaurus that correspond to that preset field; a value is one or more words in the thesaurus. The corresponding words are looked up in the thesaurus through the mapping relationship, and the words in the returned result are sorted by priority, where the technical terms of the industry to which the first database and the second database belong have a higher priority.
  • The higher-priority word is used as the word corresponding to each field in the first database and to each field in the second database.
  • In this way, the words corresponding to the acquired fields can be found in the thesaurus, yielding the word corresponding to each field in the first database and the word corresponding to each field in the second database, so that the synonymous data is converted into a uniform format, which lays the foundation for associating the synonymous data. For example, if the mapping relationship between the preset field and the words in the thesaurus maps the English word of a word to the corresponding word in the thesaurus, then the word found in the thesaurus through that mapping relationship is "date".
  • The word corresponding to each field in the first database is compared with the word corresponding to each field in the second database, to obtain a comparison result for every pair of words.
  • One way to compare the similarity of two words is to convert the string of each word into a four-character code with the SOUNDEX function, compare the two SOUNDEX codes with the DIFFERENCE function, and evaluate the similarity of the two words based on those codes. The result is a value between 0 and 4, where 4 indicates the highest similarity.
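The SOUNDEX comparison described above can be sketched in Python as follows. This is a minimal sketch, not the implementation used by any particular database: `soundex` follows the common American Soundex rules, and `difference` is a simplified stand-in that counts matching code positions, which is an assumption and may differ from the exact algorithm behind a database's DIFFERENCE function.

```python
# Letter-to-digit table of American Soundex.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(word: str) -> str:
    """Return the four-character Soundex code of an ASCII word."""
    letters = [c for c in word.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]  # pad with zeros, keep four characters


def difference(a: str, b: str) -> int:
    """Simplified stand-in for DIFFERENCE: number of equal code positions (0 to 4)."""
    code_a, code_b = soundex(a), soundex(b)
    if not code_a or not code_b:
        return 0
    return sum(x == y for x, y in zip(code_a, code_b))
```

Because Soundex encodes English pronunciation, a sketch like this is only meaningful for field names or words written in Latin letters.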
  • Another way to compare the similarity of two words is to compute the cosine similarity of the tf-idf (Term Frequency-Inverse Document Frequency) features of the two words directly, and take the result as their similarity.
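A tf-idf cosine comparison of this kind might look like the sketch below. The source does not say which features are extracted from a word, so the use of character bigrams, the `#` padding, and the smoothed idf formula are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(word: str, n: int = 2) -> list[str]:
    """Split a word into character n-grams, padded with '#' at both ends."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(words: list[str]) -> list[dict[str, float]]:
    """Build one sparse tf-idf vector per word, treating each word as a document."""
    docs = [Counter(char_ngrams(w)) for w in words]
    df = Counter(term for doc in docs for term in doc)  # document frequency
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            # term frequency times smoothed inverse document frequency
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in doc.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

For example, after `vecs = tfidf_vectors(["temperature", "temp", "pulse"])`, `cosine(vecs[0], vecs[1])` is larger than `cosine(vecs[0], vecs[2])` because "temperature" and "temp" share most of their bigrams.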
  • A further way to compare the similarity of two words is to use a likelihood function to compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and to associate the first field in the first database with the second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.
  • Any method for comparing the similarity between the words corresponding to the fields in the first database and the words corresponding to the fields in the second database falls within the scope of protection of the present application.
  • Fields whose similarity is higher than the preset threshold are associated, where the preset threshold is set according to actual needs; for example, the preset threshold may be 0.8.
  • Among a plurality of words, the field corresponding to the word with the highest similarity may be selected for association; alternatively, the field corresponding to the word whose similarity is closest to the actually set value may be selected for association.
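The threshold rule can be illustrated with a short Python sketch. The function name `associate_fields`, the dictionary layout (field name to thesaurus word), and the pluggable `similarity` callable are hypothetical; only the 0.8 threshold and the "highest similarity wins" choice come from the text.

```python
from typing import Callable


def associate_fields(
    words_a: dict[str, str],
    words_b: dict[str, str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.8,
) -> dict[str, str]:
    """Link each field of the first database to its best match in the second.

    words_a and words_b map a field name to its thesaurus word. A link is
    created only if the best similarity is strictly above the threshold.
    """
    links = {}
    for field_a, word_a in words_a.items():
        best_field, best_score = None, threshold
        for field_b, word_b in words_b.items():
            score = similarity(word_a, word_b)
            if score > best_score:  # keep the most similar candidate so far
                best_field, best_score = field_b, score
        if best_field is not None:
            links[field_a] = best_field
    return links
```

With an exact-match similarity such as `lambda a, b: 1.0 if a == b else 0.0`, the fields "TIWEN" and "Temp" are linked when both map to the word "body temperature".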
  • The method for automatically associating synonymous data in heterogeneous databases first acquires the fields in the first database and the second database, where the first database and the second database are heterogeneous with respect to each other. Based on the mapping relationship between preset fields and words in the thesaurus, it then searches the thesaurus for the words corresponding to the acquired fields, obtaining the word corresponding to each field in the first database and the word corresponding to each field in the second database. Finally, it compares the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associates a first field in the first database with a second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.
  • Because the format is converted without manual operation, operational errors caused by manual operations can be avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.
  • For example, the method can associate the records of the same natural person across different financial institutions, so that the bank lending situation and credit status of that person can be further analyzed;
  • a patient's medical records in different medical institutions can be linked in chronological order to show the person's health trajectory;
  • the license plate number of a car can be associated across different ride-hailing systems to show how the car is operated.
  • The first mapping relationship may be a mapping between a first preset field and the words in the thesaurus, where the first preset field is the Hanyu Pinyin of a word in the thesaurus.
  • The pinyin of each word in the thesaurus is first used as the first preset field, and the correspondence between that pinyin and the word in the thesaurus is then used as the first mapping relationship. For example, the pinyin of the thesaurus word "body temperature" is "TIWEN" or "tiwen"; with "TIWEN" or "tiwen" as the first preset field, the first mapping relationship maps "TIWEN" or "tiwen" to the word "body temperature" in the thesaurus.
  • When the pinyin of each word in the thesaurus is used as the first preset field and different words share the same pinyin, that pinyin corresponds to multiple words in the thesaurus in the first mapping relationship. For example, the words corresponding to "TIWEN" in the thesaurus are "body temperature", "question", "Taiwan", and so on.
  • The second mapping relationship may be a mapping between a second preset field and the words in the thesaurus, where the second preset field consists of the initial letters of the Hanyu Pinyin of a word in the thesaurus.
  • The pinyin initials of each word in the thesaurus may be used as the second preset field, and the correspondence between the second preset field and the word in the thesaurus is then used as the second mapping relationship. For example, the pinyin initials of the thesaurus word "body temperature" are "TW" or "tw"; with "TW" or "tw" as the second preset field, the second mapping relationship maps "TW" or "tw" to the word "body temperature" in the thesaurus.
  • When the pinyin initials of each word in the thesaurus are used as the second preset field and different words share the same initials, those initials correspond to multiple words in the thesaurus in the second mapping relationship. For example, the words corresponding to "TW" or "tw" in the thesaurus are "body temperature", "question", "Taiwan", "streak", "dance", and so on.
  • The third mapping relationship may be a mapping between a third preset field and the words in the thesaurus, where the third preset field is the English word for a word in the thesaurus.
  • The English word for each word in the thesaurus may be used as the third preset field, and the correspondence between the third preset field and the word in the thesaurus is then used as the third mapping relationship. For example, the English word for the thesaurus word "body temperature" is "Temperature"; with "Temperature" as the third preset field, the third mapping relationship maps "Temperature" to the word "body temperature" in the thesaurus.
  • When the English word for each word in the thesaurus is used as the third preset field and different words share the same English word, that English word corresponds to multiple words in the thesaurus in the third mapping relationship. For example, the words corresponding to "Temperature" in the thesaurus are "body temperature", "temperature", "air temperature", and so on.
  • The fourth mapping relationship may be a mapping between a fourth preset field and the words in the thesaurus, where the fourth preset field is the abbreviation of the English word for a word in the thesaurus.
  • The abbreviation of the English word for each word in the thesaurus may be used as the fourth preset field, and the correspondence between the fourth preset field and the word in the thesaurus is then used as the fourth mapping relationship. For example, the English word for the thesaurus word "body temperature" is abbreviated as "Temp"; with "Temp" as the fourth preset field, the fourth mapping relationship maps "Temp" to the word "body temperature" in the thesaurus.
  • When the abbreviation of the English word for each word in the thesaurus is used as the fourth preset field and different words share the same abbreviation, that abbreviation corresponds to multiple words in the thesaurus in the fourth mapping relationship. For example, the words corresponding to "Temp" in the thesaurus are "body temperature", "temperature", "air temperature", "temporary", and so on.
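The four mapping relationships can be sketched in Python as four dictionaries built from one thesaurus. The entry layout, the sample entries, the abbreviations "Ques" and "Dt", and the choice to normalize keys to upper case are all hypothetical and serve only to illustrate the structure.

```python
from collections import defaultdict

# Hypothetical thesaurus entries:
# (word, Hanyu Pinyin syllables, English word, English abbreviation)
THESAURUS = [
    ("body temperature", ["ti", "wen"], "Temperature", "Temp"),
    ("question", ["ti", "wen"], "Question", "Ques"),
    ("date", ["ri", "qi"], "Date", "Dt"),
]


def build_mappings(entries):
    """Return (m1, m2, m3, m4): pinyin, pinyin initials, English word, abbreviation."""
    m1, m2, m3, m4 = (defaultdict(list) for _ in range(4))
    for word, syllables, english, abbreviation in entries:
        m1["".join(syllables).upper()].append(word)                # e.g. TIWEN
        m2["".join(s[0] for s in syllables).upper()].append(word)  # e.g. TW
        m3[english.upper()].append(word)                           # e.g. TEMPERATURE
        m4[abbreviation.upper()].append(word)                      # e.g. TEMP
    return dict(m1), dict(m2), dict(m3), dict(m4)
```

Storing keys in upper case means "TIWEN" and "tiwen" resolve to the same entry if the lookup also upper-cases the field; a key shared by several words simply maps to a list with more than one element.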
  • Searching the thesaurus for the corresponding words, and obtaining the word corresponding to each field in the first database and the word corresponding to each field in the second database, may specifically be:
  • The mapping finder looks up the corresponding value in the mapping specified by the optional parameter possible_type and returns it. If possible_type is not specified, all mappings are consulted, and the words in the returned result are again sorted by priority. For example, calling mapper("TIWEN") returns a value equal to {"body temperature", "question"}. In this way, by looking up the acquired fields in the mapping finder, the word corresponding to each field in the first database and the word corresponding to each field in the second database can be found quickly.
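A mapping finder with an optional possible_type parameter could be sketched as below. The names `mapper` and `possible_type` come from the text; the mapping tables, the type labels, the `INDUSTRY_TERMS` set, and the return type (an ordered list) are assumptions for illustration.

```python
# Hypothetical mapping tables and priorities, for illustration only.
MAPPINGS = {
    "pinyin": {"TIWEN": ["question", "body temperature"]},
    "initials": {"TW": ["question", "body temperature"]},
    "english": {"TEMPERATURE": ["air temperature", "body temperature"]},
    "abbreviation": {"TEMP": ["temporary", "body temperature"]},
}
INDUSTRY_TERMS = {"body temperature"}  # technical terms are given higher priority


def mapper(field, possible_type=None):
    """Return the thesaurus words for a field, highest priority first.

    If possible_type is given, only that mapping is consulted;
    otherwise all mappings are consulted.
    """
    names = [possible_type] if possible_type else list(MAPPINGS)
    found = []
    for name in names:
        for word in MAPPINGS[name].get(field.upper(), []):
            if word not in found:
                found.append(word)
    # Stable sort: industry terms first, original order otherwise.
    return sorted(found, key=lambda w: w not in INDUSTRY_TERMS)
```

Under these assumptions, `mapper("TIWEN")` lists "body temperature" before "question", so a caller that takes the first element gets the industry term.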
  • Searching the thesaurus for the corresponding words through one of the first mapping relationship, the second mapping relationship, the third mapping relationship, and the fourth mapping relationship may specifically include:
  • S1021: Determine the preset field category of the acquired field, where the preset field category is one of the first preset field, the second preset field, the third preset field, and the fourth preset field.
  • The mapping relationship includes at least four mapping relationships, and the four mapping relationships correspond to four preset fields, namely the first preset field, the second preset field, the third preset field, and the fourth preset field. Therefore, the preset field category of the acquired field needs to be determined first, so that the corresponding mapping relationship can be determined directly from the preset field category.
  • Once the mapping relationship corresponding to the preset field category has been determined, the word corresponding to the field is looked up in the thesaurus through that mapping relationship. For example, if the preset field category of the field is the second preset field, the mapping relationship corresponding to the second preset field is the second mapping relationship, and the corresponding word in the thesaurus is found through the second mapping relationship.
  • Determining the preset field category of the acquired field first, that is, determining that the field is one of the first preset field, the second preset field, the third preset field, and the fourth preset field, allows the word corresponding to the field to be looked up directly in the mapping relationship corresponding to that preset field, without searching once in every mapping relationship, which improves the efficiency of finding the words corresponding to the field in the thesaurus.
  • Looking up the corresponding words in the thesaurus through the mapping relationship determined from the field's preset field category converts the synonymous data into a unified format, which lays a foundation for associating the synonymous data.
  • When the acquired field includes multiple preset fields, the acquired field is segmented to obtain multiple subfields, and the preset field category of each subfield is determined, where the preset field category may be one of the first preset field, the second preset field, the third preset field, and the fourth preset field.
  • The field may be segmented according to the different preset field categories it contains. For example, the field "ZERENYS" does not belong to a single preset field category; segmenting "ZERENYS" yields "ZEREN", whose preset field category is the first preset field, and "YS", whose preset field category is the second preset field.
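The source does not specify how a field is split into subfields. One plausible approach, shown here purely as an assumption, is forward maximum matching against the set of known preset fields; the set `KNOWN_PRESET_FIELDS` and the function name `segment` are hypothetical.

```python
# Hypothetical set of known preset fields (keys of the mapping relationships),
# stored in upper case so that matching is case-insensitive.
KNOWN_PRESET_FIELDS = {"SHANGCI", "TIJIAN", "RQ", "ZEREN", "YS", "JUTI", "TIME"}


def segment(field, known=KNOWN_PRESET_FIELDS):
    """Split a field into subfields by forward maximum matching.

    At each position the longest known preset field is taken. The original
    casing of the field is preserved in the returned subfields.
    """
    upper = field.upper()
    longest = max(map(len, known))
    parts, i = [], 0
    while i < len(upper):
        for size in range(min(longest, len(upper) - i), 0, -1):
            if upper[i:i + size] in known:
                parts.append(field[i:i + size])
                i += size
                break
        else:
            raise ValueError(f"no preset field matches at position {i} of {field!r}")
    return parts
```

Greedy matching can pick the wrong split when one preset field is a prefix of another, so a real system might need backtracking or the case and category cues that the text mentions.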
  • Searching the thesaurus for the words corresponding to the field through the mapping relationship corresponding to the determined preset field category, to obtain the words corresponding to each field in the first database and the words corresponding to each field in the second database, may include:
  • through the mapping relationship corresponding to the determined preset field category, separately searching the thesaurus for the words corresponding to each subfield, and combining the found words to obtain the words corresponding to each field in the first database and the words corresponding to each field in the second database.
  • For two subfields, the words corresponding to the first subfield in the thesaurus are taken in turn, and, given that word, the word with the highest probability among the words corresponding to the second subfield in the thesaurus is selected, and the two words are combined. For three subfields, the Markov probability model can likewise be used to obtain, given the combination of the words corresponding to the first and second subfields, the word with the highest probability among the words corresponding to the third subfield in the thesaurus. For more subfields, the same method is applied to obtain the words corresponding to each field in the first database and the words corresponding to each field in the second database.
  • Separately searching the thesaurus for the words corresponding to each subfield, and combining the found words to obtain the words corresponding to each field in the first database and the words corresponding to each field in the second database, may specifically include:
  • The preset field category of the acquired field is determined first, then the mapping relationship corresponding to that preset field category is determined, and the word corresponding to each subfield is then looked up in the thesaurus through that mapping relationship.
  • For example, "shangciTIJIANRQ" is divided into three subfields: "shangci", "TIJIAN", and "RQ". Looking up "shangci" in the thesaurus gives the word "last time"; looking up "TIJIAN" gives "physical examination" and "kicking"; and looking up "RQ" gives "date", "gas", and so on.
  • Because each word is looked up directly in the mapping relationship corresponding to the preset field category, instead of searching once in every mapping relationship, the efficiency of finding the words corresponding to the field in the thesaurus is improved.
  • S10222: Following the order of the subfields from left to right, combine the words corresponding to the first two subfields in the thesaurus, and use the combined word as the first word corresponding to the field.
  • The words corresponding to the first two subfields of each field in the thesaurus are combined first, and the resulting word is used as the first word corresponding to that field, so that it can then be combined with the words corresponding to the remaining subfields in the thesaurus.
  • After the first two subfields are combined to obtain the first word, the first word is combined with the word corresponding to the next adjacent subfield in the thesaurus to obtain a new word, and the new word replaces the first word. The remaining uncombined words are combined in turn in the same way until all the corresponding words in the thesaurus have been combined.
  • For example, "TIJIANRQJutiTime" is divided into four subfields: "TIJIAN", "RQ", "Juti", and "Time". First, "TIJIAN" is looked up in the thesaurus and the corresponding word is "physical examination"; the words corresponding to "RQ" in the thesaurus are "date", "gas", and so on; the word corresponding to "Juti" is "specific"; and the word corresponding to "Time" is "time". The word corresponding to "TIJIAN" is then combined with the word corresponding to "RQ" to obtain the word corresponding to "TIJIANRQ" in the thesaurus.
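The left-to-right combination can be sketched as a fold over the subfields. The candidate lists and the bigram probabilities below are invented for illustration, and the sketch conditions only on the previously chosen word (a first-order Markov assumption), whereas the text conditions on the whole combined word so far.

```python
# Hypothetical candidate words per subfield, already sorted by priority,
# and hypothetical bigram probabilities P(next word | previous word).
CANDIDATES = {
    "TIJIAN": ["physical examination", "kicking"],
    "RQ": ["date", "gas"],
    "Juti": ["specific"],
    "Time": ["time"],
}
BIGRAM = {
    ("physical examination", "date"): 0.9,
    ("physical examination", "gas"): 0.1,
    ("kicking", "date"): 0.2,
    ("date", "specific"): 0.6,
    ("specific", "time"): 0.8,
}


def combine(subfields):
    """Combine subfield words from left to right into one phrase.

    The first subfield contributes its highest-priority word; each later
    subfield contributes the candidate most probable after the previous word.
    """
    if not subfields:
        return ""
    chosen = [CANDIDATES[subfields[0]][0]]
    for sub in subfields[1:]:
        prev = chosen[-1]
        chosen.append(max(CANDIDATES[sub], key=lambda w: BIGRAM.get((prev, w), 0.0)))
    return " ".join(chosen)
```

With these numbers, "RQ" resolves to "date" rather than "gas" because "date" is far more likely to follow "physical examination".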
  • Searching the thesaurus for the word corresponding to each subfield covers two cases:
  • In the first case, the subfield corresponds to a single word in the thesaurus, and that word is determined as the word corresponding to the subfield in the thesaurus.
  • In the second case, the subfield corresponds to multiple words in the thesaurus, and the higher-priority word among them is determined as the word corresponding to the subfield in the thesaurus, where technical terms in the thesaurus have a higher priority.
  • That is, if the mapping relationship yields multiple words in the thesaurus for a subfield, one of those words is selected as the word corresponding to the subfield in the thesaurus.
  • Specifically, the word with the higher priority among the multiple words is selected and determined as the word corresponding to the subfield in the thesaurus.
  • the database corresponding to the field in advance is correspondingly
  • the priority of the terminology is set to high priority.
  • the words “TIJIAN” in the thesaurus are "physical examination” and “snack”, in which "physical examination” is the technical terminology of the industry, that is, the priority of "physical examination” is higher than the priority of "snapping". Therefore, the corresponding word in "TIJIAN” in the thesaurus is “physical examination”.
  • the words corresponding to "TZ” in the thesaurus are "weight", characteristics, "notification”, etc.
  • the first field in the first database is associated with the second field in the second database, specifically:
  • a likelihood function likehood(value1, value2) is designed, where value1 is the word corresponding to a field in the first database and value2 is the word corresponding to a field in the second database. The two words (or phrases) are passed as parameters; the function compares value1 with value2 and returns their similarity. If value1 and value2 are equal or highly similar, it returns true, meaning that value1 and value2 are associated; otherwise it returns false, meaning that they are not associated.
  • the algorithm of the likelihood function likehood may be the simple strcmp() method (string comparison in the C language), a string Hamming distance algorithm, or a word similarity algorithm such as word2vec.
  • the strcmp function compares the ASCII (American Standard Code for Information Interchange) codes of the characters. It works as follows: the first characters of the two strings are compared; if they differ, the comparison stops and the result of comparing the two ASCII codes is returned; if they are equal, the second characters are compared, then the third, and so on. Whatever the two strings are, strcmp compares at most until one of the strings reaches the terminator '\0', at which point the result is known.
  • the string Hamming distance algorithm vectorizes the text, or in other words extracts the features of the text and maps them to a code, and then XORs the codes to compute the Hamming distance, from which the word similarity is obtained.
  • word2vec is a tool open-sourced by Google for computing word vectors. It can be trained efficiently on dictionaries with millions of entries and data sets with hundreds of millions of items. The training result of the tool is a word embedding, which measures the similarity between words well.
  • the method for automatically associating synonymous data in heterogeneous databases provided by the embodiment of the present application first acquires the fields in the first database and the second database, then looks up the thesaurus words corresponding to the acquired fields based on the mapping relationship between preset fields and words in the thesaurus, then compares the similarity between the words corresponding to the fields in the first database and the words corresponding to the fields in the second database, and finally associates the first field in the first database with the second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.
  • the fields in the first database and the second database are acquired first, where the first database and the second database are heterogeneous with respect to each other; the acquired fields of the first database and of the second database are shown in Table 3:
  • then, based on the mapping relationship between preset fields and words in the thesaurus, the thesaurus words corresponding to the acquired fields are looked up, giving the word corresponding to each field in the first database and the word corresponding to each field in the second database, where the thesaurus contains the technical terms of the industry to which the first database and the second database belong.
  • the preset field types of "TJRQ", "TZ", "SG", "TIJIANRQ", "QITA", and "SHENGAO" are looked up first: the preset field type of "TJRQ", "TZ", and "SG" is the second preset field,
  • the preset field type of "QITA" and "SHENGAO" is the first preset field,
  • and "TIJIANRQ" contains two preset field types,
  • so "TIJIANRQ" is segmented into the two sub-fields "TIJIAN" and "RQ"; the preset field type of "TIJIAN" is the first preset field,
  • and the preset field type of "RQ" is the second preset field.
  • the first preset field corresponds to the first mapping relationship, and the second preset field corresponds to the second mapping relationship. Therefore, the first mapping relationship is searched for "QITA", "SHENGAO", and "TIJIAN", whose corresponding words in the thesaurus are "other", "height", and "physical examination".
  • the second mapping relationship is searched for "TJRQ", "TZ", "SG", and "RQ", whose corresponding words in the thesaurus are "physical examination date", "body weight", "height", and "date"; the thesaurus words for "TIJIAN" and "RQ" are then merged to obtain "physical examination date".
  • the preset field category of the acquired field is determined first, that is, the field is determined to be one of a first preset field, a second preset field, a third preset field, and a fourth preset field, so that the thesaurus word for the field can be looked up directly in the mapping relationship corresponding to that preset field, without searching every mapping relationship once; this improves the efficiency of looking up the thesaurus words corresponding to fields.
  • looking up the thesaurus word of a field in the mapping relationship corresponding to its determined preset field category means that synonymous data is always converted into a uniform format, which lays the foundation for associating synonymous data.
  • the fields in the heterogeneous databases are first converted, through the preset mapping relationships, into their corresponding words in the thesaurus, and the fields with high similarity in the heterogeneous databases are then associated, so that synonymous data is converted into a uniform format and the errors introduced by manual operation are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.
  • FIG. 2 is a schematic structural diagram of a synonymous data automatic association device in a heterogeneous database according to an embodiment of the present application, including the following modules:
  • the obtaining module 201 is configured to acquire the fields in the first database and the second database, wherein the first database and the second database are mutually heterogeneous databases.
  • the searching module 202 is configured to look up, based on the mapping relationship between preset fields and words in the thesaurus, the thesaurus words corresponding to the acquired fields, and obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database.
  • the comparison module 203 is configured to compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and to associate the first field in the first database with the second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.
  • the apparatus for automatically associating synonymous data in heterogeneous databases provided by the embodiment of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through preset mapping relationships, and then associates the fields with high similarity in the heterogeneous databases, so that synonymous data is converted into a uniform format and the errors introduced by manual operation are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.
  • the mapping relationship includes one or more of the following mapping relationships:
  • a second mapping relationship between a second preset field and the words in the thesaurus, where the second preset field is the initials of the Hanyu Pinyin of a word in the thesaurus;
  • a fourth mapping relationship between a fourth preset field and the words in the thesaurus, where the fourth preset field is the abbreviation of the English word for a word in the thesaurus;
  • the searching module 202 is specifically configured to:
  • searching module 202 includes:
  • a determining submodule configured to determine the preset field category of the acquired field, where the preset field category is one of a first preset field, a second preset field, a third preset field, and a fourth preset field;
  • a search submodule configured to look up, in the mapping relationship corresponding to the determined preset field category, the thesaurus word corresponding to the field, and obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database.
  • the search submodule includes:
  • a determining unit configured to, when the acquired field contains multiple preset fields, segment the acquired field to obtain a plurality of sub-fields, and determine the preset field category of each sub-field;
  • a first search unit configured to look up, in the mapping relationship corresponding to the determined preset field category, the thesaurus word corresponding to each sub-field, and to combine the words found.
  • the first search unit includes:
  • a first search subunit configured to look up, in the mapping relationship corresponding to the determined preset field category, the thesaurus word corresponding to each sub-field;
  • a first combination subunit configured to combine, in left-to-right order of the sub-fields, the thesaurus words corresponding to the first two sub-fields, and to take the combined word as the first word corresponding to the field;
  • a second combination subunit configured to combine, in turn, the first word with the thesaurus word of the next adjacent uncombined sub-field and replace the first word with the combined word, until the thesaurus words of all sub-fields have been combined, giving the word corresponding to the field.
  • the first search subunit is specifically configured to:
  • when a sub-field has exactly one corresponding word in the thesaurus, determine that word as the sub-field's corresponding word in the thesaurus;
  • when a sub-field has multiple corresponding words in the thesaurus, determine the higher-priority word among them as the sub-field's corresponding word, where technical terms have a higher priority in the thesaurus.
  • comparison module 203 is specifically configured to:
  • the embodiment of the present application further provides an electronic device.
  • corresponding to the above method for automatically associating synonymous data in heterogeneous databases, FIG. 3 is a schematic structural diagram of the electronic device, which includes a processor 301 and a machine-readable storage medium 302, the machine-readable storage medium 302 storing machine-executable instructions executable by the processor 301, and the processor 301 being caused by the machine-executable instructions to implement the method for automatically associating synonymous data in heterogeneous databases of any of the above embodiments.
  • the electronic device provided by the embodiment of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through preset mapping relationships, and then associates the fields with high similarity in the heterogeneous databases, so that synonymous data is converted into a uniform format and the errors introduced by manual operation are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.
  • the embodiment of the present application further provides an electronic device, as shown in FIG. 4, including the foregoing processor 301 and machine-readable storage medium 302 as well as a communication interface 303 and a communication bus 304, where the processor 301, the communication interface 303, and the machine-readable storage medium 302 communicate with one another via the communication bus 304; the machine-readable storage medium 302 is configured to store a computer program; and the processor 301 is configured to execute the program stored on the machine-readable storage medium 302 to implement the method for automatically associating synonymous data in heterogeneous databases of any one of the foregoing embodiments.
  • the communication bus mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • the communication bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
  • the communication interface is configured for communication between the electronic device and other devices.
  • the machine-readable storage medium 302 may include a random access memory (RAM), and may also include a non-volatile memory (NVM), for example at least one disk storage.
  • the memory may also be at least one storage device located away from the aforementioned processor.
  • the processor 301 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc., or may be a digital signal processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or the like.
  • the electronic device provided by the embodiment of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through preset mapping relationships, and then associates the fields with high similarity in the heterogeneous databases, so that synonymous data is converted into a uniform format and the errors introduced by manual operation are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.
  • a computer-readable storage medium is further provided, having stored therein instructions that, when run on a computer, cause the computer to perform the method for automatically associating synonymous data in heterogeneous databases of any of the above embodiments.
  • the computer-readable storage medium provided by the embodiment of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through preset mapping relationships, and then associates the fields with high similarity in the heterogeneous databases, so that synonymous data is converted into a uniform format and the errors introduced by manual operation are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.
  • a computer program product comprising instructions is further provided which, when run on a computer, causes the computer to perform the method for automatically associating synonymous data in heterogeneous databases of any of the above embodiments.
  • the computer program product comprising instructions provided by the embodiment of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through preset mapping relationships, and then associates the fields with high similarity in the heterogeneous databases, so that synonymous data is converted into a uniform format and the errors introduced by manual operation are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.
  • the embodiment of the present application further provides a computer program which, when run on a computer, causes the computer to execute the method for automatically associating synonymous data in heterogeneous databases of any one of the above embodiments.
  • the computer program provided by the embodiment of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through preset mapping relationships, and then associates the fields with high similarity in the heterogeneous databases, so that synonymous data is converted into a uniform format and the errors introduced by manual operation are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

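Embodiments of the present application provide a method, an apparatus, and an electronic device for automatically associating synonymous data in heterogeneous databases. The method includes: acquiring the fields in a first database and a second database, where the first database and the second database are heterogeneous with respect to each other; based on a mapping relationship between preset fields and words in a thesaurus, looking up the thesaurus words corresponding to the acquired fields, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, where the thesaurus contains the technical terms of the industry to which the first database and the second database belong; and comparing the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associating a first field in the first database with a second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold. Applying the embodiments of the present application can improve the efficiency of associating synonymous data between heterogeneous databases.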

Description

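Method, apparatus, and electronic device for automatically associating synonymous data in heterogeneous databases
The present application claims priority to Chinese patent application No. 201711377197.0, filed with the Chinese Patent Office on December 19, 2017 and entitled "Method, apparatus, and electronic device for automatically associating synonymous data in heterogeneous databases", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of data processing, and in particular to a method, an apparatus, and an electronic device for automatically associating synonymous data in heterogeneous databases.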
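Background
At present, software systems that fulfil the same or similar business functions are often implemented in many different ways, for example the various ride-hailing applications aimed at individual consumers, the various banking systems aimed at enterprises, and the various hospital information systems. Software systems whose business functions are the same or similar but whose implementations and internal structures differ are called heterogeneous systems, and the databases in heterogeneous systems are called heterogeneous databases. In heterogeneous databases, the same data may be named, processed, and stored differently in each database; data in heterogeneous systems that express exactly the same business object or have the same attributes can be called synonymous data.
The main cause of heterogeneous systems is that many competing companies exist in the same market segment. For example, by incomplete statistics there are more than 130 vendors providing information systems for hospitals in China, more than 10 of which are large nationwide vendors; no single software system has a high market share, and the market is highly fragmented. As a result, the industry's data is split into a great many fragments, or "data islands", and data cannot be connected between the software systems of different vendors, or even between different deployment instances of the same vendor's software system. This greatly hinders industry integration, business coordination, the enrichment of big-data-based applications, and government and industry supervision. Solving these problems starts with connecting the data on the "data islands", which requires associating the synonymous data in heterogeneous databases.
The existing method for associating synonymous data in heterogeneous databases works by converting synonymous data into a uniform, standardized format. Specifically, a competent national authority or an industry organization first formulates a data standard, and the synonymous data in the heterogeneous databases is then manually converted into the standardized data format according to that standard. After conversion the synonymous data has a consistent format, which realizes the association of synonymous data in heterogeneous databases.
However, in the existing method, on the one hand, the data standard is not mandatory and is only weakly binding, so some vendors do not comply with it or comply only partially, and the converted data does not conform to the standard. On the other hand, manual operation inevitably introduces errors during conversion, which also leaves the converted data non-conforming. The end result is that synonymous data is associated between heterogeneous databases with relatively low efficiency.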
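Summary
The purpose of the embodiments of the present application is to provide a method, an apparatus, and an electronic device for automatically associating synonymous data in heterogeneous databases, so as to improve the efficiency of associating synonymous data between heterogeneous databases. The specific technical solutions are as follows:
An embodiment of the present application discloses a method for automatically associating synonymous data in heterogeneous databases, the method including:
acquiring the fields in a first database and a second database, where the first database and the second database are heterogeneous with respect to each other;
based on a mapping relationship between preset fields and words in a thesaurus, looking up the thesaurus words corresponding to the acquired fields, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, where the thesaurus contains the technical terms of the industry to which the first database and the second database belong;
comparing the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associating a first field in the first database with a second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.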
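An embodiment of the present application discloses an apparatus for automatically associating synonymous data in heterogeneous databases, the apparatus including:
an acquiring module configured to acquire the fields in a first database and a second database, where the first database and the second database are heterogeneous with respect to each other;
a searching module configured to look up, based on a mapping relationship between preset fields and words in a thesaurus, the thesaurus words corresponding to the acquired fields, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, where the thesaurus contains the technical terms of the industry to which the first database and the second database belong;
a comparison module configured to compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and to associate a first field in the first database with a second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.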
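An embodiment of the present application further discloses an electronic device including a processor and a machine-readable storage medium, the machine-readable storage medium storing machine-executable instructions executable by the processor, and the processor being caused by the machine-executable instructions to implement the steps of the above method for automatically associating synonymous data in heterogeneous databases.
In yet another aspect of the present application, a computer-readable storage medium is further disclosed, having stored therein instructions that, when run on a computer, cause the computer to perform the steps of any of the above methods for automatically associating synonymous data in heterogeneous databases.
In yet another aspect of the present application, an embodiment further provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of any of the above methods for automatically associating synonymous data in heterogeneous databases.
In yet another aspect of the present application, an embodiment further provides a computer program which, when run on a computer, causes the computer to perform the steps of the method for automatically associating synonymous data in heterogeneous databases provided in the first aspect above.
The method, apparatus, and electronic device for automatically associating synonymous data in heterogeneous databases provided by the embodiments of the present application first acquire the fields in a first database and a second database, where the first database and the second database are heterogeneous with respect to each other; then, based on a mapping relationship between preset fields and words in a thesaurus, look up the thesaurus words corresponding to the acquired fields, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database; and finally compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associate a first field in the first database with a second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold. By first converting the fields in the heterogeneous databases into their corresponding words in the thesaurus through preset mapping relationships and then associating the fields with high similarity, synonymous data is converted into a uniform format and the errors introduced by manual operation are avoided, which improves the efficiency of associating synonymous data between heterogeneous databases.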
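Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present application and of the prior art more clearly, the drawings needed for the embodiments and the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for automatically associating synonymous data in heterogeneous databases according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an apparatus for automatically associating synonymous data in heterogeneous databases according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 4 is another schematic structural diagram of an electronic device according to an embodiment of the present application.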
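Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
With the rapid development of information technology, many competing companies have emerged in the same market segment, but the data of these competing companies is not interconnected, so the data is split into a great many fragments, or "data islands", which greatly hinders industry integration and business coordination, big-data-based applications, and government and industry supervision. It is therefore necessary to associate the synonymous data in heterogeneous databases. The key to associating synonymous data is connecting the data on the "data islands"; more critical still is how to identify and associate the synonymous data items in different software systems. For example, there are resident health record systems from two different vendors, vendor A and vendor B. The two systems implement the same business functions, but their database designs differ, and so does the way the data is stored.
Table 1 shows the database of vendor A's resident health record system.
Table 1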
Figure PCTCN2018121512-appb-000001
In the database of vendor B's resident health record system, the same data appears in several different tables, as shown in Table 2:
Table 2
Figure PCTCN2018121512-appb-000002
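Table 1 and Table 2 both contain terms such as physical examination date, body temperature, and pulse rate; the terms that appear in both tables can be called synonymous terms. In heterogeneous systems, however, these synonymous terms may be expressed, named, and stored differently, and being able to associate the synonymous data in these heterogeneous databases is very valuable for both individuals and groups. On this basis, the present application provides a method for automatically associating synonymous data in heterogeneous databases, which can automatically associate the synonymous data in the databases of heterogeneous software systems that fulfil the same or similar functions (for example, systems from different software vendors, or different versions from the same vendor), thereby solving the problems of connecting, integrating, and coordinating industry data and of big data analysis. The specific process is as follows:
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a method for automatically associating synonymous data in heterogeneous databases according to an embodiment of the present application, including the following steps:
S101: acquire the fields in a first database and a second database, where the first database and the second database are heterogeneous with respect to each other.
Specifically, heterogeneous databases are the databases in heterogeneous systems, where heterogeneous systems are software systems whose business functions are the same or similar but whose implementations and internal structures differ. To associate synonymous data in heterogeneous databases, the fields in the heterogeneous databases are acquired first, the meanings expressed by the different fields are then compared, and the fields that express the same meaning are associated.
Here, the acquired fields of the first database and the second database are fields in software systems whose business functions are the same or similar but whose implementations and internal structures differ; that is, the first database and the second database are heterogeneous with respect to each other. Among the acquired fields, those that express the same or similar meanings are synonymous data, and automatically associating them effectively solves the problems of connecting, integrating, and coordinating industry data and of big data analysis.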
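S102: based on a mapping relationship between preset fields and words in a thesaurus, look up the thesaurus words corresponding to the acquired fields, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, where the thesaurus contains the technical terms of the industry to which the first database and the second database belong.
Specifically, a mapping is a relationship in which the elements of two sets "correspond" to each other. The mapping relationships are established in advance and represent the correspondence between preset fields and words in the thesaurus. For example, four mapping relationships m1, m2, m3, and m4 are established, each containing several correspondences from a key (a preset field) to a value (the thesaurus words corresponding to that preset field), where one value is one or more words in the thesaurus. The thesaurus words corresponding to an acquired field are looked up through the mapping relationships, and the words in the returned result are sorted by priority; the technical terms of the industry to which the first database and the second database belong have a higher priority, and the higher-priority word is taken as the word corresponding to each field in the first database and in the second database.
Here, based on the mapping relationship between preset fields and words in the thesaurus, the thesaurus words corresponding to the acquired fields can be found, giving the word corresponding to each field in the first database and the word corresponding to each field in the second database, so that synonymous data is always converted into a uniform format, which lays the foundation for associating synonymous data. For example, if the mapping relationship maps the English word for a term to the corresponding word in the thesaurus, then looking up "DATE" through that mapping relationship yields the thesaurus word "日期" (date).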
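S103: compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associate a first field in the first database with a second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.
Specifically, the word corresponding to each field in the first database is compared with the word corresponding to each field in the second database, giving a comparison result for every pair of words. One way to compare the similarity of two words is to convert the strings of the two words into four-character codes with the SOUNDEX function, compare the SOUNDEX values of the two codes with the DIFFERENCE function, evaluate the similarity between the two words on that basis, and return a value between 0 and 4, where 4 indicates the highest similarity. Another way is to compare directly how close the tf-idf (Term Frequency-Inverse Document Frequency) features of the two words are in terms of cosine similarity, giving the similarity of the two words. A further way is to use a likelihood function to compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database. The first field in the first database is then associated with the second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold. Associating the fields in this way avoids the errors introduced by manual operation and thereby improves the efficiency of associating synonymous data between heterogeneous databases. It should be noted that any method capable of comparing the similarity between the words corresponding to the fields in the first database and the words corresponding to the fields in the second database falls within the scope of protection of the present application.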
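The cosine-similarity comparison mentioned above can be illustrated with a minimal sketch. It is a simplification and not the patent's implementation: it uses raw character counts as the feature vector instead of tf-idf weights (no corpus and no inverse document frequency), and the function name cosine_similarity is chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(word_a, word_b):
    """Cosine similarity between character-count vectors of two words.

    Simplifying assumption: each word is represented by how often each
    character occurs in it, rather than by real tf-idf features.
    """
    vec_a, vec_b = Counter(word_a), Counter(word_b)
    # Dot product over the characters of the first word; a Counter returns 0
    # for characters that the second word does not contain.
    dot = sum(count * vec_b[ch] for ch, count in vec_a.items())
    norm_a = sqrt(sum(c * c for c in vec_a.values()))
    norm_b = sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty word has no direction to compare
    return dot / (norm_a * norm_b)
```

Identical words score 1 and words with no character in common score 0, so the result can be compared directly against a preset threshold.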
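In addition, after the similarity between the words corresponding to the fields in the first database and the words corresponding to the fields in the second database has been compared, the fields whose similarity is above the preset threshold are associated. The preset threshold is set according to actual needs; for example, it may be 0.8. When the similarity between one word of the first database and several words of the second database is above the preset threshold, the field corresponding to the word with the highest similarity may be chosen for association, or the field corresponding to the word whose similarity is closest to a value set according to actual needs may be chosen.
It can thus be seen that the method for automatically associating synonymous data in heterogeneous databases provided by the embodiment of the present application first acquires the fields in a first database and a second database, where the first database and the second database are heterogeneous with respect to each other; then, based on a mapping relationship between preset fields and words in a thesaurus, looks up the thesaurus words corresponding to the acquired fields, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database; and finally compares the similarity between those words and associates a first field in the first database with a second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold. By first converting the fields in the heterogeneous databases into their corresponding words in the thesaurus through preset mapping relationships and then associating the fields with high similarity, synonymous data is converted into a uniform format and the errors introduced by manual operation can be avoided, which improves the efficiency of associating synonymous data between heterogeneous databases.
With the method provided by the embodiment of the present application, the records of a natural person at different financial institutions can be associated, so that all of that person's bank loans and credit standing can be analysed further; a patient's visit records at different medical institutions can be associated in chronological order to show that person's health trajectory; a vehicle's licence plate number can be associated across different ride-hailing systems to show how the vehicle is operated, which provides a basis for supervision, insurance, and the like; and the synonymous data of a group of people can be associated, which helps in studying the trends and characteristics of group data.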
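The mapping relationship in the embodiments of the present application includes one or more of the following four mapping relationships:
The first mapping relationship may be a first mapping relationship between a first preset field and the words in the thesaurus, where the first preset field is the Hanyu Pinyin of a word in the thesaurus.
Specifically, the Hanyu Pinyin of each word in the thesaurus may first be taken as a first preset field, and the correspondence from that first preset field to its word in the thesaurus is then taken as the first mapping relationship. For example, the thesaurus word "体温" (body temperature) has the Hanyu Pinyin "TIWEN" or "tiwen", so "TIWEN" or "tiwen" is taken as the first preset field, and the first mapping relationship is: "TIWEN" or "tiwen" corresponds to the thesaurus word "体温".
In addition, when several different words share the same Hanyu Pinyin, that Pinyin corresponds to multiple thesaurus words in the first mapping relationship; for example, the thesaurus words corresponding to "TIWEN" are "体温" (body temperature), "提问" (question), etc.
The second mapping relationship may be a second mapping relationship between a second preset field and the words in the thesaurus, where the second preset field is the initials of the Hanyu Pinyin of a word in the thesaurus.
Specifically, the Pinyin initials of each word in the thesaurus may first be taken as a second preset field, and the correspondence from that second preset field to its word in the thesaurus is then taken as the second mapping relationship. For example, the Pinyin initials of the thesaurus word "体温" are "TW" or "tw", so "TW" or "tw" is taken as the second preset field, and the second mapping relationship is: "TW" or "tw" corresponds to the thesaurus word "体温".
Likewise, when several different words share the same Pinyin initials, those initials correspond to multiple thesaurus words in the second mapping relationship; for example, the thesaurus words corresponding to "TW" or "tw" are "体温" (body temperature), "提问" (question), "台湾" (Taiwan), "条纹" (stripe), "跳舞" (dance), etc.
The third mapping relationship may be a third mapping relationship between a third preset field and the words in the thesaurus, where the third preset field is the English word for a word in the thesaurus.
Specifically, the English word for each word in the thesaurus may first be taken as a third preset field, and the correspondence from that third preset field to its word in the thesaurus is then taken as the third mapping relationship. For example, the English word for the thesaurus word "体温" is "Temperature", so "Temperature" is taken as the third preset field, and the third mapping relationship is: "Temperature" corresponds to the thesaurus word "体温".
Likewise, when several different words share the same English word, that English word corresponds to multiple thesaurus words in the third mapping relationship; for example, the thesaurus words corresponding to "Temperature" are "体温" (body temperature), "温度" (temperature), "气温" (air temperature), etc.
The fourth mapping relationship may be a fourth mapping relationship between a fourth preset field and the words in the thesaurus, where the fourth preset field is the abbreviation of the English word for a word in the thesaurus.
Specifically, the abbreviation of the English word for each word in the thesaurus may first be taken as a fourth preset field, and the correspondence from that fourth preset field to its word in the thesaurus is then taken as the fourth mapping relationship. For example, the abbreviation of the English word for the thesaurus word "体温" is "Temp", so "Temp" is taken as the fourth preset field, and the fourth mapping relationship is: "Temp" corresponds to the thesaurus word "体温".
Likewise, when several different words share the same English abbreviation, that abbreviation corresponds to multiple thesaurus words in the fourth mapping relationship; for example, the thesaurus words corresponding to "Temp" are "体温" (body temperature), "温度" (temperature), "气温" (air temperature), "临时" (temporary), etc.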
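The four mapping relationships can be pictured as four key-to-candidates dictionaries. The following is a minimal sketch under stated assumptions: the dictionary names, the handful of entries, and the lookup helper are hypothetical and only mirror the examples in the text, with each candidate list ordered from highest to lowest priority.

```python
# Hypothetical miniature mappings; keys are stored upper-case so that
# "TIWEN" and "tiwen" resolve to the same entry.
M1_PINYIN = {"TIWEN": ["体温", "提问"], "TIJIAN": ["体检", "踢毽"]}
M2_PINYIN_INITIALS = {"TW": ["体温", "提问", "台湾", "条纹", "跳舞"], "RQ": ["日期", "燃气"]}
M3_ENGLISH = {"TEMPERATURE": ["体温", "温度", "气温"], "DATE": ["日期"]}
M4_ENGLISH_ABBREV = {"TEMP": ["体温", "温度", "气温", "临时"]}


def lookup(mapping, key):
    """Return the candidate thesaurus words for a key, or an empty list."""
    return mapping.get(key.upper(), [])
```

Because one key can map to several words, every lookup returns a list; choosing a single word from that list is a separate step.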
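In the embodiments of the present application, looking up the thesaurus words corresponding to the acquired fields based on the mapping relationship between preset fields and words in the thesaurus, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, may specifically be:
based on one or more of the first mapping relationship, the second mapping relationship, the third mapping relationship, and the fourth mapping relationship, looking up the thesaurus words corresponding to the acquired fields, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database.
Specifically, a mapping finder mapper is built on the first, second, third, and fourth mapping relationships. Its function is equivalent to value = mapper(key, [possible_type]), where value is the thesaurus word corresponding to the acquired field, key is the acquired field, and possible_type is the mapping relationship that may apply. For the key passed in, the mapping finder looks up the corresponding value in the mapping relationship specified by the optional parameter possible_type. If possible_type is not specified, all mapping relationships are consulted, and the words in the returned result are again sorted by priority. For example, calling mapper("TIWEN") returns the value {"体温", "提问"}. Looking up the thesaurus words for the acquired fields through the mapping finder makes it possible to find quickly the word corresponding to each field in the first database and the word corresponding to each field in the second database.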
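A minimal sketch of such a mapping finder follows. The signature mirrors value = mapper(key, [possible_type]) from the text; the mapping contents, the PRIORITY table, and the rule of sorting by a numeric priority are assumptions made only for illustration.

```python
# Hypothetical mappings m1..m4 (Pinyin, Pinyin initials, English, English abbreviation).
MAPPINGS = {
    "m1": {"TIWEN": ["体温", "提问"]},
    "m2": {"TW": ["体温", "提问", "台湾"]},
    "m3": {"TEMPERATURE": ["体温", "温度"]},
    "m4": {"TEMP": ["体温", "温度", "临时"]},
}

# Assumed priority table: industry terms get a higher score, all others 0.
PRIORITY = {"体温": 10}


def mapper(key, possible_type=None):
    """Look up key in one mapping (possible_type) or in all of them.

    Returns the candidate words sorted from highest to lowest priority.
    """
    names = [possible_type] if possible_type else list(MAPPINGS)
    found = []
    for name in names:
        for word in MAPPINGS[name].get(key.upper(), []):
            if word not in found:  # keep each candidate once
                found.append(word)
    # sorted() is stable, so equal-priority words keep their mapping order.
    return sorted(found, key=lambda w: -PRIORITY.get(w, 0))
```

Passing possible_type restricts the search to one dictionary, which is the shortcut the text describes for fields whose category is already known.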
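In an optional embodiment of the present application, looking up the thesaurus words corresponding to the acquired fields based on one or more of the first, second, third, and fourth mapping relationships, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, may specifically include:
S1021: determine the preset field category of the acquired field, where the preset field category is one of a first preset field, a second preset field, a third preset field, and a fourth preset field.
Specifically, since the mapping relationships include at least four kinds, which in turn involve four kinds of preset fields (the first, second, third, and fourth preset fields), the preset field category of the acquired field needs to be determined first, so that the mapping relationship corresponding to that category can be determined directly.
S1022: in the mapping relationship corresponding to the determined preset field category, look up the thesaurus word corresponding to the field, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database.
Specifically, once the preset field category has been determined, the mapping relationship corresponding to it can be determined, and the thesaurus word corresponding to the field can be looked up in that mapping relationship. For example, if the preset field category of a field is the second preset field, whose mapping relationship is the second mapping relationship, the thesaurus word corresponding to the field is found through the second mapping relationship. Determining the preset field category of the acquired field first, that is, determining that the field is one of the first, second, third, and fourth preset fields, allows the thesaurus word to be looked up directly in the corresponding mapping relationship without searching every mapping relationship once, which improves lookup efficiency. Moreover, looking up the thesaurus word in the mapping relationship corresponding to the determined category means that synonymous data is always converted into a uniform format, which lays the foundation for associating synonymous data.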
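Determining the preset field category of the acquired field may specifically include:
when the acquired field contains multiple kinds of preset fields, segmenting the acquired field to obtain a plurality of sub-fields, and determining the preset field category of each sub-field, where the preset field category may be one of the first preset field, the second preset field, the third preset field, and the fourth preset field.
Here, a field may be segmented according to the different preset field categories it contains. For example, the field "ZERENYS" does not belong to a single preset field category: after segmentation it consists of "ZEREN", which belongs to the first preset field category, and "YS", which belongs to the second. A column name normalizer normalizer(column) may also be designed, which converts a given column name (column) into a normalized expression. The column name here is the field corresponding to each column in the database. Starting from the first character of column, substrings of decreasing length are taken in turn, each denoted sub_name1. If vi = mapper(sub_name1) exists, vi is recorded and column is set to the part remaining after the substring. This step is repeated until the substring length is 0. Let v = v1 + v2 + … + vi (i being the number of records made in this step); if i is 0, v is the empty value "". For example, if column is "tijianRQ", the substring "tijianR" is not found in the thesaurus either, and shortening continues until the substring "tijian" yields v1 = {"体检", "踢毽"}; the remaining part "RQ" is then looked up, giving v2 = {"燃气", "日期"}. When v1 and v2 are combined, the highest-priority word of v1, "体检" (physical examination), is taken first; given "体检", the probability that "日期" (date) follows is greater than the probability that "燃气" (gas) follows, so "日期" is chosen from v2, giving v = "体检日期" (physical examination date).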
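The segmentation step of the normalizer can be sketched as a greedy longest-prefix match. This is one reading of the description rather than a definitive implementation: TOY_THESAURUS, toy_mapper, and segment are hypothetical names, and stopping as soon as no prefix of the remainder is known is an assumption about how the "until the substring length is 0" condition behaves.

```python
# Hypothetical two-entry thesaurus lookup, keyed by upper-case field fragments.
TOY_THESAURUS = {"TIJIAN": ["体检", "踢毽"], "RQ": ["日期", "燃气"]}


def toy_mapper(key):
    """Stand-in for mapper(): candidate words for a fragment, or []."""
    return TOY_THESAURUS.get(key.upper(), [])


def segment(column, mapper=toy_mapper):
    """Split a column name into (sub_field, candidates) pairs.

    Tries the longest prefix of the remaining text first and shortens it one
    character at a time until the mapper recognises it.
    """
    parts = []
    rest = column
    while rest:
        for end in range(len(rest), 0, -1):
            candidates = mapper(rest[:end])
            if candidates:
                parts.append((rest[:end], candidates))
                rest = rest[end:]
                break
        else:
            break  # no prefix of the remainder is known: keep what was recorded
    return parts
```

For "tijianRQ" the sketch records "tijian" and then "RQ"; a column with no recognisable prefix yields an empty list, matching the empty value described above.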
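Correspondingly, looking up the thesaurus word corresponding to the field in the mapping relationship corresponding to the determined preset field category, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database, may include:
in the mapping relationship corresponding to the determined preset field category, looking up the thesaurus word corresponding to each sub-field, and combining the words found, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database.
Specifically, according to a Markov probability model, given the thesaurus word of the first sub-field, the most probable word among the thesaurus words of the second sub-field is taken, and the two words are combined. When there are three sub-fields, the same Markov probability model is used: given the combined word of the first and second sub-fields, the most probable word among the thesaurus words of the third sub-field is taken. When there are more sub-fields, the words corresponding to the fields in the first database and in the second database are obtained in the same way.
In addition, in the embodiments of the present application, looking up the thesaurus word corresponding to each sub-field in the mapping relationship corresponding to the determined preset field category and combining the words found, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database, may specifically include:
S10221: in the mapping relationship corresponding to the determined preset field category, look up the thesaurus word corresponding to each sub-field.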
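Here, the preset field category of the acquired field is determined first, then the mapping relationship corresponding to that category, and the thesaurus word corresponding to each sub-field is then looked up in that mapping relationship. For example, "shangciTIJIANRQ" is segmented into the three sub-fields "shangci", "TIJIAN", and "RQ"; the thesaurus word found for "shangci" is "上次" (last time), the thesaurus words found for "TIJIAN" are "体检" (physical examination) and "踢毽" (shuttlecock kicking), and the thesaurus words found for "RQ" are "日期" (date), "燃气" (gas), etc. When only one thesaurus word is found for a field, that word is taken directly as the field's corresponding word without searching every mapping relationship once, which improves lookup efficiency.
S10222: in left-to-right order of the sub-fields, combine the thesaurus words corresponding to the first two sub-fields, and take the combined word as the first word corresponding to the field.
Specifically, in left-to-right order of the sub-fields after segmentation, the thesaurus words of the first two sub-fields of each field are combined first, giving the word corresponding to the combination of the first two sub-fields; this word is taken as the first word corresponding to the field, which makes it convenient to go on combining it with the thesaurus words of the remaining sub-fields.
S10223: combine, in turn, the first word with the thesaurus word of the next adjacent uncombined sub-field, and replace the first word with the combined word, until the thesaurus words of all sub-fields have been combined, giving the word corresponding to the field.
Specifically, once the first two sub-fields have been combined into the first word, the first word is combined with the thesaurus word of the next adjacent uncombined sub-field to obtain a new word, the new word replaces the first word, and the remaining uncombined words are combined in turn in the same way until the thesaurus words of all sub-fields have been combined. For example, "TIJIANRQJutiTime" is segmented into the four sub-fields "TIJIAN", "RQ", "Juti", and "Time". After the thesaurus word for "TIJIAN" is found to be "体检" (physical examination), the thesaurus words for "RQ" are found to be "日期" (date), "燃气" (gas), etc., the thesaurus word for "Juti" is "具体" (specific), and the thesaurus word for "Time" is "时间" (time). The thesaurus word for "TIJIAN" is then combined with the thesaurus words for "RQ" to obtain the word corresponding to "TIJIANRQ". Because "RQ" has several thesaurus words, "体检" is combined with each of "日期", "燃气", etc., and the combination with the greater probability is chosen, so the word corresponding to "TIJIANRQ" is "体检日期" (physical examination date). "体检日期" is then combined with "具体", the thesaurus word for "Juti", giving "体检日期具体" for "TIJIANRQJuti". Finally, "体检日期具体" is combined with "时间", the thesaurus word for "Time", giving "体检日期具体时间" (specific time of the physical examination date) for "TIJIANRQJutiTime".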
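The left-to-right combination in S10222 and S10223 can be sketched as a simple fold. The conditional probabilities in COND_PROB are invented for illustration; the text only states that the more probable combination is chosen and gives no numbers, and the function name combine is likewise hypothetical.

```python
# Hypothetical P(next word | word combined so far); unseen pairs count as 0.
COND_PROB = {
    ("体检", "日期"): 0.9,
    ("体检", "燃气"): 0.01,
}


def combine(candidate_lists):
    """Combine per-sub-field candidates from left to right.

    candidate_lists must be non-empty and holds one priority-ordered
    candidate list per sub-field.
    """
    # Start from the highest-priority word of the first sub-field.
    combined = candidate_lists[0][0]
    for candidates in candidate_lists[1:]:
        # Pick the candidate most likely to follow the word built so far;
        # on ties max() keeps the first (highest-priority) candidate.
        best = max(candidates, key=lambda w: COND_PROB.get((combined, w), 0.0))
        combined = combined + best  # the new word replaces the previous one
    return combined
```

For example, combine([["体检", "踢毽"], ["日期", "燃气"], ["具体"], ["时间"]]) walks through the same four sub-fields as the "TIJIANRQJutiTime" example.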
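In an optional embodiment of the present application, looking up the thesaurus word corresponding to each sub-field in the mapping relationship corresponding to the determined preset field category covers two cases:
In the first case, when a sub-field has exactly one corresponding word in the thesaurus, that word is determined as the sub-field's corresponding word in the thesaurus.
Specifically, when a sub-field has exactly one corresponding word in the thesaurus, that is, when the mapping relationship yields only one thesaurus word for the sub-field, that word is the sub-field's corresponding word.
In the second case, when a sub-field has multiple corresponding words in the thesaurus, the higher-priority word among them is determined as the sub-field's corresponding word, where technical terms have a higher priority in the thesaurus.
Specifically, when a sub-field has multiple corresponding words in the thesaurus, that is, when the mapping relationship yields several thesaurus words for the sub-field, one of those words is selected as the sub-field's corresponding word.
The specific selection method is to select the higher-priority word among the candidates and determine it as the sub-field's corresponding word. When the thesaurus is built, the technical terms of the industry served by the database containing the field are set to high priority in advance. For example, the thesaurus words for "TIJIAN" are "体检" (physical examination) and "踢毽" (shuttlecock kicking); "体检" is a technical term of the industry, so its priority is higher than that of "踢毽", and the word corresponding to "TIJIAN" is therefore "体检". The thesaurus words for "TZ" are "体重" (body weight), "特征" (characteristic), "通知" (notification), etc. Which of these is taken as the word corresponding to "TZ" is determined by their priorities: the technical terms of the industry to which the first database and the second database belong have high priority in the thesaurus, and since the example belongs to the medical industry, "体重" has the higher priority and is finally selected as the word corresponding to "TZ".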
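The priority rule for the second case can be sketched in a few lines. The PRIORITY table and the name pick_word are assumptions; the text only says that industry terms are given a high priority when the thesaurus is built.

```python
# Hypothetical priorities for a medical thesaurus; unlisted words default to 0.
PRIORITY = {"体检": 10, "体重": 10}


def pick_word(candidates):
    """Return the highest-priority candidate, or None if there are none.

    With a single candidate this simply returns it (the first case above).
    """
    if not candidates:
        return None
    return max(candidates, key=lambda w: PRIORITY.get(w, 0))
```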
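In an optional embodiment of the present application, comparing the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associating the first field in the first database with the second field in the second database, may specifically be:
using a likelihood function to compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associating the first field in the first database with the second field in the second database.
Specifically, a likelihood function likehood(value1, value2) is designed, where value1 is the word corresponding to a field in the first database and value2 is the word corresponding to a field in the second database. The two words (or phrases) are passed as parameters; the function compares value1 with value2 and returns their similarity. If value1 and value2 are equal or highly similar, it returns true, meaning that value1 and value2 are associated; otherwise it returns false, meaning that they are not associated. Associating the first field in the first database with the second field in the second database in this way avoids the errors introduced by manual operation and thereby improves the efficiency of associating synonymous data between heterogeneous databases.
When synonyms are associated, the algorithm of the likelihood function likehood may be the simple strcmp() method (string comparison in the C language), a string Hamming distance algorithm, or a word similarity algorithm such as word2vec.
The strcmp function compares the ASCII (American Standard Code for Information Interchange) codes of the characters. It works as follows: the first characters of the two strings are compared; if they differ, the comparison stops and the result of comparing the two ASCII codes is returned; if they are equal, the second characters are compared, then the third, and so on. Whatever the two strings are, strcmp compares at most until one of the strings reaches the terminator '\0', at which point the result is known. The string Hamming distance algorithm vectorizes the text, or in other words extracts the features of the text and maps them to a code, and then XORs the codes to compute the Hamming distance, from which the word similarity is obtained. word2vec is a tool open-sourced by Google for computing word vectors; it can be trained efficiently on dictionaries with millions of entries and data sets with hundreds of millions of items, and the training result is a word embedding, which measures the similarity between words well.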
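The Hamming-distance variant can be sketched as follows. The encoding is a deliberately crude assumption made for illustration: each character sets one bit of a 64-bit code chosen by its code point, so character order is ignored and different characters may collide. The names encode, hamming_distance, and hamming_similarity are hypothetical.

```python
CODE_BITS = 64  # width of the toy code


def encode(word):
    """Map a word to a 64-bit code: one bit per character (toy feature extraction)."""
    code = 0
    for ch in word:
        code |= 1 << (ord(ch) % CODE_BITS)
    return code


def hamming_distance(code_a, code_b):
    """Number of differing bits, obtained by XOR-ing the two codes."""
    return bin(code_a ^ code_b).count("1")


def hamming_similarity(word_a, word_b):
    """Similarity in [0, 1]; 1.0 means the two codes are identical."""
    return 1.0 - hamming_distance(encode(word_a), encode(word_b)) / CODE_BITS
```

A production system would replace encode with a real feature hash; only the XOR-and-count step is the part described in the text.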
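The method for automatically associating synonymous data in heterogeneous databases provided by the embodiment of the present application first acquires the fields in the first database and the second database, then looks up the thesaurus words corresponding to the acquired fields based on the mapping relationship between preset fields and words in the thesaurus, then compares the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and finally associates the first field in the first database with the second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold. A concrete example of the process follows:
The fields in the first database and the second database are acquired first, where the first database and the second database are heterogeneous with respect to each other; the acquired fields of the first database and of the second database are shown in Table 3:
Table 3
Fields in the first database    Fields in the second database
TJRQ                            TIJIANRQ
TZ                              QITA
SG                              SHENGAO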
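Then, based on the mapping relationship between preset fields and words in the thesaurus, the thesaurus words corresponding to the acquired fields are looked up, giving the word corresponding to each field in the first database and the word corresponding to each field in the second database, where the thesaurus contains the technical terms of the industry to which the first database and the second database belong.
Specifically, the preset field types of "TJRQ", "TZ", "SG", "TIJIANRQ", "QITA", and "SHENGAO" are looked up first. The preset field type of "TJRQ", "TZ", and "SG" is the second preset field; the preset field type of "QITA" and "SHENGAO" is the first preset field; and "TIJIANRQ" contains two preset field types, so it is segmented into the two sub-fields "TIJIAN" and "RQ", where the preset field type of "TIJIAN" is the first preset field and that of "RQ" is the second preset field.
Since the first preset field corresponds to the first mapping relationship and the second preset field corresponds to the second mapping relationship, the first mapping relationship is searched for "QITA", "SHENGAO", and "TIJIAN", whose thesaurus words are "其他" (other), "身高" (height), and "体检" (physical examination), and the second mapping relationship is searched for "TJRQ", "TZ", "SG", and "RQ", whose thesaurus words are "体检日期" (physical examination date), "体重" (body weight), "身高" (height), and "日期" (date); the thesaurus words for "TIJIAN" and "RQ" are then merged to obtain "体检日期". Determining the preset field category of the acquired field first, that is, determining that the field is one of the first, second, third, and fourth preset fields, allows the thesaurus word to be looked up directly in the corresponding mapping relationship without searching every mapping relationship once, which improves lookup efficiency. Moreover, looking up the thesaurus word in the mapping relationship corresponding to the determined category means that synonymous data is always converted into a uniform format, which lays the foundation for associating synonymous data.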
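Next, the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database is compared, and the first field in the first database is associated with the second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than the preset threshold.
Specifically, the word “体检日期” corresponding to the field “TJRQ” in the first database is first compared with the words corresponding to the fields “TIJIANRQ”, “QITA” and “SHENGAO” in the second database. The word corresponding to “TJRQ” is found to be highly similar to the word corresponding to “TIJIANRQ”, so the fields “TJRQ” and “TIJIANRQ” are associated.
Second, the word “体重” corresponding to the field “TZ” in the first database is compared with the words corresponding to the fields “TIJIANRQ”, “QITA” and “SHENGAO” in the second database. The similarity between the word corresponding to “TZ” and each of those words is low, so no association is made.
Finally, the word “身高” corresponding to the field “SG” in the first database is compared with the words corresponding to the fields “TIJIANRQ”, “QITA” and “SHENGAO” in the second database. The word corresponding to “SG” is found to be highly similar to the word corresponding to “SHENGAO”, so the fields “SG” and “SHENGAO” are associated.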
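The pairwise comparison over Table 3 can be sketched as below. The field-to-word dictionaries are taken from the example; the similarity measure is an assumption, using the ratio from Python's standard `difflib.SequenceMatcher` as a stand-in for whichever algorithm an implementation chooses, and the threshold of 0.8 is likewise illustrative.

```python
from difflib import SequenceMatcher

# Words already looked up in the thesaurus for each field (from the example).
FIRST_DB = {"TJRQ": "体检日期", "TZ": "体重", "SG": "身高"}
SECOND_DB = {"TIJIANRQ": "体检日期", "QITA": "其他", "SHENGAO": "身高"}


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the measure itself is an assumption."""
    return SequenceMatcher(None, a, b).ratio()


def associate(first, second, threshold=0.8):
    """Return (first_field, second_field) pairs whose words are similar enough."""
    pairs = []
    for field1, word1 in first.items():
        for field2, word2 in second.items():
            if similarity(word1, word2) > threshold:
                pairs.append((field1, field2))
    return pairs
```

Calling `associate(FIRST_DB, SECOND_DB)` pairs “TJRQ” with “TIJIANRQ” and “SG” with “SHENGAO”, and leaves “TZ” unassociated, as in the walkthrough above.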
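It can be seen that first converting the fields in the heterogeneous databases into their corresponding words in the thesaurus through the preset mapping relationships, and then associating the fields with high similarity, converts synonymous data into a uniform format and avoids the operating errors caused by manual work, thereby improving the efficiency of associating synonymous data between heterogeneous databases.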
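Referring to Fig. 2, Fig. 2 is a schematic structural diagram of an apparatus for automatically associating synonymous data in heterogeneous databases provided by an embodiment of the present application. The apparatus includes the following modules:
an obtaining module 201, configured to obtain the fields in a first database and a second database, where the first database and the second database are heterogeneous with respect to each other;
a lookup module 202, configured to look up, based on the mapping relationship between preset fields and words in a thesaurus, the words corresponding to the obtained fields in the thesaurus, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, where the thesaurus contains the technical terms of the industry to which the first database and the second database belong;
a comparison module 203, configured to compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and to associate a first field in the first database with a second field in the second database, where the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.
It can thus be seen that the apparatus provided by the embodiments of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through the preset mapping relationships, and then associates the fields with high similarity, so that synonymous data is converted into a uniform format and the operating errors caused by manual work are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.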
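Further, the mapping relationship includes one or more of the following mapping relationships:
a first mapping relationship between a first preset field and words in the thesaurus, the first preset field being the Chinese pinyin of a word in the thesaurus;
a second mapping relationship between a second preset field and words in the thesaurus, the second preset field being the initial letters of the Chinese pinyin of a word in the thesaurus;
a third mapping relationship between a third preset field and words in the thesaurus, the third preset field being the English word for a word in the thesaurus;
a fourth mapping relationship between a fourth preset field and words in the thesaurus, the fourth preset field being the abbreviation of the English word for a word in the thesaurus.
The lookup module 202 is specifically configured to:
look up, based on one or more of the first mapping relationship, the second mapping relationship, the third mapping relationship and the fourth mapping relationship, the words corresponding to the obtained fields in the thesaurus, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database.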
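One way to picture the four mapping relationships is as four dictionaries keyed by preset field category. The sketch below is purely illustrative: the category labels, the entries “TIZHONG”, “WEIGHT” and “WT”, and the function name are hypothetical and do not appear in the application; only “TZ” → “体重” comes from the text.

```python
# Hypothetical mapping tables, one per preset field category.
MAPPINGS_BY_CATEGORY = {
    "pinyin": {"TIZHONG": "体重"},          # first mapping relationship
    "pinyin_initials": {"TZ": "体重"},      # second mapping relationship
    "english": {"WEIGHT": "体重"},          # third mapping relationship
    "english_abbrev": {"WT": "体重"},       # fourth mapping relationship
}


def lookup_in_category(field, category, mappings=MAPPINGS_BY_CATEGORY):
    """Look a field up only in the mapping for its preset field category."""
    return mappings.get(category, {}).get(field)
```

Restricting the lookup to a single category is what lets an implementation avoid searching every mapping relationship for every field.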
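Further, the lookup module 202 includes:
a determining submodule, configured to determine the preset field category of an obtained field, the preset field category being one of the first preset field, the second preset field, the third preset field and the fourth preset field;
a lookup submodule, configured to look up, in the mapping relationship corresponding to the determined preset field category, the word corresponding to the field in the thesaurus, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database.
Further, the lookup submodule includes:
a determining unit, configured to, when an obtained field contains multiple kinds of preset fields, segment the obtained field into multiple sub-fields and determine the preset field category of each sub-field;
a first lookup unit, configured to look up, in the mapping relationships corresponding to the determined preset field categories, the word corresponding to each sub-field in the thesaurus, and to combine the words found, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database.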
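Further, the first lookup unit includes:
a first lookup subunit, configured to look up, in the mapping relationships corresponding to the determined preset field categories, the word corresponding to each sub-field in the thesaurus;
a first combining subunit, configured to combine, in left-to-right order of the sub-fields, the words corresponding to the first two sub-fields in the thesaurus, and to take the combined word as the first word corresponding to the field;
a second combining subunit, configured to successively combine the first word with the word corresponding in the thesaurus to the next adjacent sub-field that has not yet been combined, and to replace the first word with the combined word, until the words corresponding to all the sub-fields in the thesaurus have been combined, to obtain the word corresponding to the field.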
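The left-to-right combination performed by the two combining subunits amounts to a fold over the list of sub-field words. A minimal sketch, assuming that "combining" two words means plain string concatenation (the application does not define the operation more precisely) and using the hypothetical function name `combine_words`:

```python
from functools import reduce


def combine_words(words):
    """Fold sub-field words left to right into a single word.

    The first two words are combined into a running "first word"; each
    remaining word is then combined with it in turn. Concatenation is
    an assumed stand-in for the combining operation.
    """
    if not words:
        return ""
    return reduce(lambda combined, nxt: combined + nxt, words)
```

For the sub-fields “TIJIAN” and “RQ”, `combine_words(["体检", "日期"])` gives “体检日期”.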
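Further, the first lookup subunit is specifically configured to:
when a sub-field corresponds to one word in the thesaurus, determine that word as the word corresponding to the sub-field in the thesaurus;
when a sub-field corresponds to multiple words in the thesaurus, determine the word with the higher priority among the multiple words as the word corresponding to the sub-field in the thesaurus, where technical terms have a high priority in the thesaurus.
Further, the comparison module 203 is specifically configured to:
use a likelihood function to compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associate the first field in the first database with the second field in the second database.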
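An embodiment of the present application further provides an electronic device. Fig. 3 is a schematic structural diagram of an electronic device to which the method for automatically associating synonymous data in heterogeneous databases of the embodiments of the present application is applied. The electronic device may include a processor 301 and a machine-readable storage medium 302. The machine-readable storage medium 302 stores machine-executable instructions that can be executed by the processor 301, and the machine-executable instructions cause the processor 301 to implement the method for automatically associating synonymous data in heterogeneous databases according to any one of the above embodiments.
It can thus be seen that the electronic device provided by the embodiments of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through the preset mapping relationships, and then associates the fields with high similarity, so that synonymous data is converted into a uniform format and the operating errors caused by manual work are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.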
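An embodiment of the present application further provides an electronic device which, as shown in Fig. 4, includes the above processor 301 and machine-readable storage medium 302 as well as a communication interface 303 and a communication bus 304. The processor 301, the communication interface 303 and the machine-readable storage medium 302 communicate with one another through the communication bus 304. The machine-readable storage medium 302 is configured to store a computer program; the processor 301 is configured to implement, when executing the program stored on the machine-readable storage medium 302, the method for automatically associating synonymous data in heterogeneous databases according to any one of the above embodiments.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus and so on. For ease of illustration, only one thick line is used in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is configured for communication between the above electronic device and other devices. The machine-readable storage medium 302 may include Random Access Memory (RAM), and may also include Non-Volatile Memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor 301 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP) and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It can thus be seen that the electronic device provided by the embodiments of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through the preset mapping relationships, and then associates the fields with high similarity, so that synonymous data is converted into a uniform format and the operating errors caused by manual work are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.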
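In yet another embodiment provided by the present application, a computer-readable storage medium is further provided. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute the method for automatically associating synonymous data in heterogeneous databases according to any one of the above embodiments.
It can thus be seen that the computer-readable storage medium provided by the embodiments of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through the preset mapping relationships, and then associates the fields with high similarity, so that synonymous data is converted into a uniform format and the operating errors caused by manual work are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.
In yet another embodiment provided by the present application, a computer program product containing instructions is further provided which, when run on a computer, causes the computer to execute the method for automatically associating synonymous data in heterogeneous databases according to any one of the above embodiments.
It can thus be seen that the computer program product containing instructions provided by the embodiments of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through the preset mapping relationships, and then associates the fields with high similarity, so that synonymous data is converted into a uniform format and the operating errors caused by manual work are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.
An embodiment of the present application further provides a computer program which, when run on a computer, causes the computer to execute the method for automatically associating synonymous data in heterogeneous databases according to any one of the above embodiments.
It can thus be seen that the computer program provided by the embodiments of the present application first converts the fields in the heterogeneous databases into their corresponding words in the thesaurus through the preset mapping relationships, and then associates the fields with high similarity, so that synonymous data is converted into a uniform format and the operating errors caused by manual work are avoided, thereby improving the efficiency of associating synonymous data between heterogeneous databases.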

Claims (17)

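  1. A method for automatically associating synonymous data in heterogeneous databases, wherein the method comprises:
    obtaining fields in a first database and a second database, wherein the first database and the second database are heterogeneous with respect to each other;
    looking up, based on a mapping relationship between preset fields and words in a thesaurus, the words corresponding to the obtained fields in the thesaurus, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, wherein the thesaurus contains technical terms of the industry to which the first database and the second database belong;
    comparing the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associating a first field in the first database with a second field in the second database, wherein the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.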
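  2. The method according to claim 1, wherein the mapping relationship comprises one or more of the following mapping relationships:
    a first mapping relationship between a first preset field and words in the thesaurus, the first preset field being the Chinese pinyin of a word in the thesaurus;
    a second mapping relationship between a second preset field and words in the thesaurus, the second preset field being the initial letters of the Chinese pinyin of a word in the thesaurus;
    a third mapping relationship between a third preset field and words in the thesaurus, the third preset field being the English word for a word in the thesaurus;
    a fourth mapping relationship between a fourth preset field and words in the thesaurus, the fourth preset field being the abbreviation of the English word for a word in the thesaurus;
    the looking up, based on a mapping relationship between preset fields and words in a thesaurus, the words corresponding to the obtained fields in the thesaurus, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, comprises:
    looking up, based on one or more of the first mapping relationship, the second mapping relationship, the third mapping relationship and the fourth mapping relationship, the words corresponding to the obtained fields in the thesaurus, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database.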
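  3. The method according to claim 2, wherein the looking up, based on one or more of the first mapping relationship, the second mapping relationship, the third mapping relationship and the fourth mapping relationship, the words corresponding to the obtained fields in the thesaurus, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, comprises:
    determining the preset field category of an obtained field, the preset field category being one of the first preset field, the second preset field, the third preset field and the fourth preset field;
    looking up, in the mapping relationship corresponding to the determined preset field category, the word corresponding to the field in the thesaurus, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database.
  4. The method according to claim 3, wherein the determining the preset field category of an obtained field comprises:
    when the obtained field contains multiple kinds of preset fields, segmenting the obtained field into multiple sub-fields, and determining the preset field category of each sub-field;
    the looking up, in the mapping relationship corresponding to the determined preset field category, the word corresponding to the field in the thesaurus, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database, comprises:
    looking up, in the mapping relationships corresponding to the determined preset field categories, the word corresponding to each sub-field in the thesaurus, and combining the words found, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database.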
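  5. The method according to claim 4, wherein the looking up, in the mapping relationships corresponding to the determined preset field categories, the word corresponding to each sub-field in the thesaurus, and combining the words found, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database, comprises:
    looking up, in the mapping relationships corresponding to the determined preset field categories, the word corresponding to each sub-field in the thesaurus;
    combining, in left-to-right order of the sub-fields, the words corresponding to the first two sub-fields in the thesaurus, and taking the combined word as the first word corresponding to the field;
    successively combining the first word with the word corresponding in the thesaurus to the next adjacent sub-field that has not yet been combined, and replacing the first word with the combined word, until the words corresponding to all the sub-fields in the thesaurus have been combined, to obtain the word corresponding to the field.
  6. The method according to claim 5, wherein the looking up, in the mapping relationships corresponding to the determined preset field categories, the word corresponding to each sub-field in the thesaurus comprises:
    when the sub-field corresponds to one word in the thesaurus, determining that word as the word corresponding to the field in the thesaurus;
    when the sub-field corresponds to multiple words in the thesaurus, determining the word with the high priority among the multiple words as the word corresponding to the sub-field in the thesaurus, wherein technical terms have a high priority in the thesaurus.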
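  7. The method according to claim 1, wherein the comparing the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associating a first field in the first database with a second field in the second database, comprises:
    using a likelihood function to compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associating the first field in the first database with the second field in the second database.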
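  8. An apparatus for automatically associating synonymous data in heterogeneous databases, wherein the apparatus comprises:
    an obtaining module, configured to obtain fields in a first database and a second database, wherein the first database and the second database are heterogeneous with respect to each other;
    a lookup module, configured to look up, based on a mapping relationship between preset fields and words in a thesaurus, the words corresponding to the obtained fields in the thesaurus, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database, wherein the thesaurus contains technical terms of the industry to which the first database and the second database belong;
    a comparison module, configured to compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and to associate a first field in the first database with a second field in the second database, wherein the similarity between the word corresponding to the first field and the word corresponding to the second field is greater than a preset threshold.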
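  9. The apparatus according to claim 8, wherein the mapping relationship comprises one or more of the following mapping relationships:
    a first mapping relationship between a first preset field and words in the thesaurus, the first preset field being the Chinese pinyin of a word in the thesaurus;
    a second mapping relationship between a second preset field and words in the thesaurus, the second preset field being the initial letters of the Chinese pinyin of a word in the thesaurus;
    a third mapping relationship between a third preset field and words in the thesaurus, the third preset field being the English word for a word in the thesaurus;
    a fourth mapping relationship between a fourth preset field and words in the thesaurus, the fourth preset field being the abbreviation of the English word for a word in the thesaurus;
    the lookup module is configured to:
    look up, based on one or more of the first mapping relationship, the second mapping relationship, the third mapping relationship and the fourth mapping relationship, the words corresponding to the obtained fields in the thesaurus, to obtain the word corresponding to each field in the first database and the word corresponding to each field in the second database.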
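  10. The apparatus according to claim 9, wherein the lookup module comprises:
    a determining submodule, configured to determine the preset field category of an obtained field, the preset field category being one of the first preset field, the second preset field, the third preset field and the fourth preset field;
    a lookup submodule, configured to look up, in the mapping relationship corresponding to the determined preset field category, the word corresponding to the field in the thesaurus, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database.
  11. The apparatus according to claim 10, wherein the lookup submodule comprises:
    a determining unit, configured to, when an obtained field contains multiple kinds of preset fields, segment the obtained field into multiple sub-fields and determine the preset field category of each sub-field;
    a first lookup unit, configured to look up, in the mapping relationships corresponding to the determined preset field categories of the field, the word corresponding to each sub-field in the thesaurus, and to combine the words found, to obtain the words corresponding to the fields in the first database and the words corresponding to the fields in the second database.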
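  12. The apparatus according to claim 11, wherein the first lookup unit comprises:
    a first lookup subunit, configured to look up, in the mapping relationships corresponding to the determined preset field categories, the word corresponding to each sub-field in the thesaurus;
    a first combining subunit, configured to combine, in left-to-right order of the sub-fields, the words corresponding to the first two sub-fields in the thesaurus, and to take the combined word as the first word corresponding to the field;
    a second combining subunit, configured to successively combine the first word with the word corresponding in the thesaurus to the next adjacent sub-field that has not yet been combined, and to replace the first word with the combined word, until the words corresponding to all the sub-fields in the thesaurus have been combined, to obtain the word corresponding to the field.
  13. The apparatus according to claim 12, wherein the first lookup subunit is configured to:
    when the sub-field corresponds to one word in the thesaurus, determine that word as the word corresponding to the field in the thesaurus;
    when the sub-field corresponds to multiple words in the thesaurus, determine the word with the high priority among the multiple words as the word corresponding to the sub-field in the thesaurus, wherein technical terms have a high priority in the thesaurus.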
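  14. The apparatus according to claim 8, wherein the comparison module is configured to:
    use a likelihood function to compare the similarity between the word corresponding to each field in the first database and the word corresponding to each field in the second database, and associate the first field in the first database with the second field in the second database.
  15. An electronic device, comprising a processor and a machine-readable storage medium, wherein the machine-readable storage medium stores machine-executable instructions that can be executed by the processor, and the machine-executable instructions cause the processor to implement the method steps of any one of claims 1-7.
  16. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps of any one of claims 1-7.
  17. An application program, wherein the application program is used to execute, when run, the method steps of any one of claims 1-7.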
PCT/CN2018/121512 2017-12-19 2018-12-17 异构数据库中的同义数据自动关联方法、装置及电子设备 WO2019120169A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711377197.0 2017-12-19
CN201711377197.0A CN110019474B (zh) 2017-12-19 2017-12-19 异构数据库中的同义数据自动关联方法、装置及电子设备

Publications (1)

Publication Number Publication Date
WO2019120169A1 true WO2019120169A1 (zh) 2019-06-27

Family

ID=66993094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/121512 WO2019120169A1 (zh) 2017-12-19 2018-12-17 异构数据库中的同义数据自动关联方法、装置及电子设备

Country Status (2)

Country Link
CN (1) CN110019474B (zh)
WO (1) WO2019120169A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680083A (zh) * 2020-04-30 2020-09-18 四川弘智远大科技有限公司 智能化多级政府财政数据采集系统及数据采集方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347320A (zh) * 2020-11-05 2021-02-09 杭州数梦工场科技有限公司 数据表字段的关联字段推荐方法及装置
CN112597124A (zh) * 2020-11-30 2021-04-02 新华三大数据技术有限公司 一种数据字段映射方法、装置及存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090012928A1 (en) * 2002-11-06 2009-01-08 Lussier Yves A System And Method For Generating An Amalgamated Database
CN104933183A (zh) * 2015-07-03 2015-09-23 重庆邮电大学 一种融合词向量模型和朴素贝叶斯的查询词改写方法
CN107045534A (zh) * 2017-01-20 2017-08-15 中国航天系统科学与工程研究院 大数据环境下基于HBase的异构数据库在线交换与共享系统

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1708099A1 (en) * 2005-03-29 2006-10-04 BRITISH TELECOMMUNICATIONS public limited company Schema matching
US20090228463A1 (en) * 2008-03-10 2009-09-10 Cramer Richard D Method for Searching Compound Databases Using Topomeric Shape Descriptors and Pharmacophoric Features Identified by a Comparative Molecular Field Analysis (CoMFA) Utilizing Topomeric Alignment of Molecular Fragments
CN102385635A (zh) * 2011-12-14 2012-03-21 湖南科技大学 一种基于本体模式的异构数据集成方法
US9087044B2 (en) * 2012-08-30 2015-07-21 Wal-Mart Stores, Inc. Establishing “is a” relationships for a taxonomy
CN103336852B (zh) * 2013-07-24 2017-04-05 清华大学 跨语言本体构建方法及装置
CN103412917B (zh) * 2013-08-08 2016-08-10 广西大学 一种可扩展的多类型领域数据协调管理的数据库系统和管理方法
CN103488759A (zh) * 2013-09-25 2014-01-01 深圳好视网络科技有限公司 一种根据关键词搜索应用程序的方法和装置
CN104036048B (zh) * 2014-07-02 2016-12-21 电子科技大学 一种本体与关系数据库模式之间的映射方法
US9075840B1 (en) * 2014-10-27 2015-07-07 Intuitive Control Systems, Llc Method and computer program product for allowing a software application to interact with a product
TWI619115B (zh) * 2014-12-30 2018-03-21 鴻海精密工業股份有限公司 會議記錄裝置及其自動生成會議記錄的方法

Also Published As

Publication number Publication date
CN110019474B (zh) 2022-03-04
CN110019474A (zh) 2019-07-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18890718

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC , EPO FORM 1205A DATED 22.09.20.

122 Ep: pct application non-entry in european phase

Ref document number: 18890718

Country of ref document: EP

Kind code of ref document: A1