CN107844560B - Data access method and device, computer equipment and readable storage medium - Google Patents

Info

Publication number
CN107844560B
Authority
CN
China
Prior art keywords
field
data set
mapped
standard
identity
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711052158.3A
Other languages
Chinese (zh)
Other versions
CN107844560A (en
Inventor
谢永恒
李贺
火一莽
万月亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Application filed by Beijing Ruian Technology Co Ltd
Priority to CN201711052158.3A
Publication of CN107844560A
Application granted
Publication of CN107844560B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database

Abstract

The invention discloses a data access method, a data access device, computer equipment and a readable storage medium, wherein at least one first field to be mapped corresponding to a first external data set to be accessed is obtained; obtaining a data set vector to be matched corresponding to the first external data set according to the word vector corresponding to each participle in the first field to be mapped; determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model; establishing a first field mapping relation between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped; and accessing the first external data set into the first standard data set according to the first field mapping relation. The invention can realize automatic recommendation of data access and automatic mapping of field level.

Description

Data access method and device, computer equipment and readable storage medium
Technical Field
Embodiments of the present invention relate to data processing technologies, and in particular, to a method and an apparatus for data access, a computer device, and a readable storage medium.
Background
In the production process of enterprises, a large amount of data access work is carried out every day. For example, China Mobile needs to access each user's telephone records, short messages, QQ chat data, WeChat data and the like. The formats of the accessed data often differ, so enterprises have to invest a large amount of time and manpower in configuring format conversion strategies for loading data of different formats into storage.
At present, data access is mainly carried out manually, and manually configuring format conversion strategies has the following technical drawbacks: data access is costly, inefficient and poorly scalable.
Disclosure of Invention
The invention provides a data access method, a data access device, computer equipment and a readable storage medium, which are used for realizing automatic recommendation of data access and automatic mapping of field levels.
In a first aspect, an embodiment of the present invention provides a method for data access, including:
acquiring at least one first field to be mapped corresponding to a first external data set to be accessed;
obtaining a data set vector to be matched corresponding to the first external data set according to the word vector corresponding to each participle in the first field to be mapped;
determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model;
establishing a first field mapping relation between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped;
and accessing the first external data set into the first standard data set according to the first field mapping relation.
In a second aspect, an embodiment of the present invention further provides a data access apparatus, including:
the first field to be mapped acquisition module is used for acquiring at least one first field to be mapped corresponding to a first external data set to be accessed;
a to-be-matched data set vector obtaining module, configured to obtain a to-be-matched data set vector corresponding to the first external data set according to a word vector corresponding to each participle in the first to-be-mapped field;
the first standard data set determining module is used for determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model;
a first field mapping relationship establishing module, configured to establish a first field mapping relationship between the first field to be mapped and the standard field according to a similarity between the standard field in the first standard data set and the first field to be mapped;
and the first external data set access module is used for accessing the first external data set into the first standard data set according to the first field mapping relation.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for accessing data according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, implement a method for data access according to any embodiment of the present invention.
The method comprises the steps of obtaining at least one first field to be mapped corresponding to a first external data set to be accessed; obtaining a data set vector to be matched corresponding to the first external data set according to the word vector corresponding to each participle in the first field to be mapped; determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model; establishing a first field mapping relation between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped; and accessing the first external data set into the first standard data set according to the first field mapping relation. The invention can realize automatic recommendation of data access and automatic mapping of field level.
Drawings
Fig. 1 is a flowchart of a data access method according to a first embodiment of the present invention;
Fig. 2a is a flowchart of a data access method according to a second embodiment of the present invention;
Fig. 2b is a flowchart of constructing a data set vector to be matched according to a second embodiment of the present invention;
Fig. 2c is a flowchart of calculating the first field identity according to a second embodiment of the present invention;
Fig. 3a is a flowchart of a data access method according to a third embodiment of the present invention;
Fig. 3b is a technical route diagram of data access according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a data access apparatus according to a fourth embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a flowchart of a data access method according to an embodiment of the present invention, where the present embodiment is applicable to a data access situation, and the method may be executed by a data access apparatus, where the data access apparatus may be implemented by software and/or hardware, and may be generally integrated in a computer device. The method of the embodiment specifically comprises the following steps:
Step 110, obtaining at least one first field to be mapped corresponding to a first external data set to be accessed.
In the data access method, the object being accessed is an external data set. For example, China Mobile needs to access each user's telephone records, short messages, QQ chat data, WeChat data and the like, and each of these is accessed into China Mobile's data sets as an external data set. Specifically, when the QQ chat data is accessed as the first external data set, the QQ chat external data set includes the fields chat time, chat content, chat person A, chat person B and QQ space, and these fields serve as the first fields to be mapped.
Step 120, obtaining a to-be-matched data set vector corresponding to the first external data set according to the word vector corresponding to each participle in the first to-be-mapped field.
Word segmentation is the process of splitting a sequence of Chinese characters into individual words, that is, recombining a continuous character sequence into a word sequence according to a certain specification. It can be performed by a word segmentation model; for example, the sentence "I go off duty" is split into separate participles (segmented words) such as "I" and "off duty". On this basis, stop-word filtering can be applied to the segmentation result to exclude common words that carry little information, such as "the". When natural language is handed to a machine learning algorithm, it usually has to be converted into a numerical form first. A word vector is one way of representing the words of a language numerically, and the word corresponding to each participle is vectorized according to a word vector model. The data set vector can be obtained as a weighted average of the word vectors of the participles, or by combining the word vectors of the participles.
Step 130, determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model.
Specifically, suppose the QQ chat external data set is to be accessed into the standard data sets, and the standard data sets cover many categories, such as QQ, WeChat, telephone and SMS. It must then be determined that the QQ chat external data set should be accessed into the QQ category of the standard data sets and not, for example, into the WeChat category. Determining the first standard data set that matches the first external data set is therefore a classification problem, and a data set classification model such as a Multilayer Perceptron (MLP) or a Support Vector Machine (SVM) can be used.
Before the data set vector to be matched is input into the data set classification model, whether to normalize it is decided according to the model used; if an SVM is adopted as the data set classification model, the data set vector to be matched needs to be normalized.
Step 140, establishing a first field mapping relationship between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped.
It will be appreciated that after the category of the standard data set into which the external data set is to be accessed has been determined, each field of the external data set still needs to be matched. Continuing the above example, after it is determined that the QQ chat external data set is to be accessed into China Mobile's QQ standard data set, and assuming that this standard data set includes the fields time, content and speaker, the fields of the QQ chat external data set (chat time, chat content, chat person A, chat person B, QQ space) need to be matched to the time, content and speaker fields of the QQ standard data set. The correspondence established in this way is the field mapping relationship.
The mapping relationship can be determined from the similarity between fields, computed for example as cosine similarity. If the similarity is high, the first field to be mapped of the first external data set is considered to correspond to the standard field in the first standard data set, and the mapping relationship is established; if the similarity is low, no mapping relationship is established. In the above example, the mapping relationships that can be established are: chat time to time, chat content to content, chat person A to speaker, and chat person B to speaker. The similarity between QQ space and each of time, content and speaker in the standard data set is low, so no mapping relationship can be established for it.
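The similarity-based mapping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the 3-dimensional toy field vectors and the threshold of 0.8 are all assumptions introduced here, since the text only says that a "high" similarity leads to a mapping.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def map_fields(external, standard, threshold=0.8):
    """Map each external field to its most similar standard field.

    Fields whose best similarity is below the (assumed) threshold
    are left unmapped.
    """
    mapping = {}
    for name, vec in external.items():
        best_name, best_score = None, 0.0
        for std_name, std_vec in standard.items():
            score = cosine_similarity(vec, std_vec)
            if score > best_score:
                best_name, best_score = std_name, score
        if best_name is not None and best_score >= threshold:
            mapping[name] = best_name
    return mapping


# Hypothetical field vectors, chosen only to mirror the QQ chat example.
standard_fields = {
    "time": [1.0, 0.0, 0.0],
    "content": [0.0, 1.0, 0.0],
    "speaker": [0.0, 0.0, 1.0],
}
external_fields = {
    "chat time": [0.9, 0.1, 0.0],
    "chat content": [0.1, 0.95, 0.0],
    "chat person A": [0.0, 0.1, 0.9],
    "chat person B": [0.05, 0.0, 0.9],
    "QQ space": [0.5, 0.5, 0.5],
}
field_mapping = map_fields(external_fields, standard_fields)
```

With these toy vectors, the four chat fields are mapped and "QQ space" is left unmapped because its best similarity falls below the assumed threshold.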
Step 150, accessing the first external data set to the first standard data set according to the first field mapping relation.
After the first external data set is obtained, its fields are automatically matched according to the first field mapping relation. If the matching succeeds, the first external data set is accessed automatically; if the matching fails, manual intervention is required. The leftover QQ space field, for which no mapping was established, may then be stored or deleted depending on storage capacity and actual requirements.
According to the technical scheme of the embodiment, at least one first field to be mapped corresponding to a first external data set to be accessed is obtained; obtaining a data set vector to be matched corresponding to the first external data set according to the word vector corresponding to each participle in the first field to be mapped; determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model; establishing a first field mapping relation between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped; and accessing the first external data set into the first standard data set according to the first field mapping relation. The invention can realize automatic recommendation of data access and automatic mapping of field level.
Example two
Fig. 2a is a flowchart of a data access method according to a second embodiment of the present invention, which is optimized on the basis of the foregoing embodiment. In this embodiment, after establishing the first field mapping relationship between the first field to be mapped and the standard field according to the similarity between the standard field and the first field to be mapped in the first standard data set, the method further includes: calculating a first field identity corresponding to the first field to be mapped; and storing the first field identity, the first standard data set identity and the first field mapping relation in a mapping cache table.
As shown in fig. 2a, the embodiment of the present invention specifically includes:
Step 210, obtaining at least one first field to be mapped corresponding to a first external data set to be accessed.
Step 220, obtaining a to-be-matched data set vector corresponding to the first external data set according to the word vector corresponding to each participle in the first to-be-mapped field.
In fig. 2b, a flowchart for constructing a to-be-matched data set vector according to a second embodiment of the present invention is shown, and as shown in fig. 2b, a construction process of the to-be-matched data set vector includes:
Step 221, determining each participle of the first field to be mapped according to a pre-trained word segmenter.
The word segmenter can use a conditional random field (CRF) word segmentation model, which performs automatic word segmentation through sequence labeling. The CRF model considers that each character occupies a certain word-formation position when forming a specific word, and assumes that each character has only four possible positions: the beginning of a word (B), the middle of a word (M), the end of a word (E) and a single-character word (S). The segmentation result is expressed as a sequence of position labels. The CRF word segmentation labeling model is a discriminative model that models a conditional probability. During learning, the model parameters are solved iteratively on a training data set by maximum likelihood estimation or regularized maximum likelihood estimation; during prediction, the output sequence with the maximum conditional probability is produced for a given input sequence.
For example, segmenting the sentence "上海计划到本世纪末实现人均国内生产总值五千美元。" ("Shanghai plans to achieve a per-capita gross domestic product of five thousand US dollars by the end of this century.") with the CRF word segmentation model gives:
上/B 海/E 计/B 划/E 到/S 本/S 世/B 纪/E 末/S 实/B 现/E 人/B 均/E 国/B 内/E 生/B 产/E 总/B 值/E 五/B 千/M 美/M 元/E 。/S
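To make the B/M/E/S labeling concrete, the following sketch turns a labeled character sequence back into words. It only illustrates the decoding step; the CRF model that produces the labels is not shown, and the function name `decode_bmes` and the short sample input are assumptions made for this illustration.

```python
def decode_bmes(chars, tags):
    """Rebuild words from characters labeled B (begin), M (middle),
    E (end) or S (single-character word)."""
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if current:  # flush an unfinished word defensively
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            if current:
                words.append(current)
            current = ch
        elif tag == "M":
            current += ch
        elif tag == "E":
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    if current:
        words.append(current)
    return words


# A short labeled sequence in the same "character/tag" form as above.
labeled = "上/B 海/E 计/B 划/E 到/S 五/B 千/M 美/M 元/E"
pairs = [token.split("/") for token in labeled.split()]
words = decode_bmes([c for c, _ in pairs], [t for _, t in pairs])
```

Here `words` holds the recovered word sequence, one entry per B...E span or S label.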
Step 222, obtaining a word vector corresponding to each participle according to the word vector model.
The word vector model is a Skip-Gram model based on Hierarchical Softmax, which is one of the neural network word representation models. Such models use a neural network to model the context and the relation between the context and the target word; because neural networks are flexible, their greatest advantage is the ability to represent complex contexts. The Skip-Gram model based on Hierarchical Softmax comprises an input layer, a projection layer and an output layer. Taking the sample (w, context(w)) as an example:
Input layer: contains only the word vector $v(w) \in \mathbb{R}^m$ of the center word w of the current sample.
Projection layer: identical to the input layer, containing only the word vector $v(w) \in \mathbb{R}^m$ of the center word w of the current sample.
Output layer: a Huffman tree constructed from statistics over the corpus.
In the model, each leaf node of the Huffman tree represents a word, and each branching is regarded as one binary classification. The probability P(w, context(w)) is modeled, and the model parameters and word vectors are solved by taking the log-likelihood function as the objective function and applying a gradient ascent method.
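The idea that each branch of the Huffman tree is one binary classification can be sketched as below. This is an illustrative toy, not the trained model: the tree shape, the parameter vectors `theta_root` and `theta_inner`, the center word vector, and the convention that code 0 uses sigmoid(x) while code 1 uses 1 - sigmoid(x) are all assumptions made here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def path_probability(v_w, path_thetas, path_codes):
    """Probability of reaching one leaf (word) from the root.

    v_w         -- vector of the center word
    path_thetas -- parameter vector of each inner node on the path
    path_codes  -- Huffman code bit (0 or 1) taken at each inner node
    """
    prob = 1.0
    for theta, code in zip(path_thetas, path_codes):
        s = sigmoid(sum(a * b for a, b in zip(v_w, theta)))
        # Each branch is one binary classification (assumed convention).
        prob *= s if code == 0 else 1.0 - s
    return prob


# Toy tree with three leaves: A = [0], B = [1, 0], C = [1, 1].
v_center = [0.2, -0.4]
theta_root = [0.5, 0.1]
theta_inner = [-0.3, 0.8]
p_a = path_probability(v_center, [theta_root], [0])
p_b = path_probability(v_center, [theta_root, theta_inner], [1, 0])
p_c = path_probability(v_center, [theta_root, theta_inner], [1, 1])
total = p_a + p_b + p_c
```

Because every inner node splits its probability mass between its two children, the leaf probabilities sum to one without computing a normalizer over the whole vocabulary, which is the usual motivation for Hierarchical Softmax.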
Step 223, combining the word vectors corresponding to the participles to obtain a to-be-matched data set vector corresponding to the first external data set.
The data set vector to be matched is obtained by combining word vectors, and can be represented as a weighted average of the word vectors. Specifically, a term frequency-inverse document frequency (TF-IDF) model can be used to compute the weights. First, the term frequency (TF) of each participle of each field of the data set to be matched is counted; then the inverse document frequency (IDF) of the participle is computed over the historically accessed standard data sets. Because IDF is intended to suppress noise, its calculation is smoothed. Finally, the weight coefficient of the word vector corresponding to each participle is obtained as TF·IDF. Once the weight coefficients are determined, the weighted average of the word vectors is taken as the data set vector to be matched. Alternatively, a Continuous Bag-of-Words (CBOW) model may be used to obtain the data set vector.
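A TF-IDF weighted average of word vectors could look like the sketch below. The text does not give the smoothing formula, so the form log((1 + N) / (1 + df)) + 1 used here is an assumption (one common smoothing choice), as are the function names and the toy vocabulary.

```python
import math
from collections import Counter


def smoothed_idf(word, documents):
    """Smoothed inverse document frequency (assumed formula)."""
    df = sum(1 for doc in documents if word in doc)
    return math.log((1 + len(documents)) / (1 + df)) + 1.0


def dataset_vector(words, word_vectors, documents):
    """TF-IDF weighted average of the vectors of the given words."""
    counts = Counter(w for w in words if w in word_vectors)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no word has a known vector")
    dim = len(next(iter(word_vectors.values())))
    weighted = [0.0] * dim
    weight_sum = 0.0
    for word, count in counts.items():
        weight = (count / total) * smoothed_idf(word, documents)
        weight_sum += weight
        for i, x in enumerate(word_vectors[word]):
            weighted[i] += weight * x
    return [x / weight_sum for x in weighted]


# Hypothetical 2-dimensional word vectors and historical "documents".
vectors = {"chat": [1.0, 0.0], "time": [0.0, 1.0], "content": [1.0, 1.0]}
history = [{"chat", "time"}, {"chat", "content"}, {"time"}]
vec = dataset_vector(["chat", "time", "chat", "content"], vectors, history)
```

Dividing by the sum of the weights keeps the result inside the range spanned by the input vectors, so data sets with many fields and data sets with few fields remain comparable.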
Step 230, determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model.
The pre-trained data set classification model is a multilayer perceptron (MLP) model trained on historical data. The model takes the negative log-likelihood function as its cost function, and the model parameters are solved by gradient descent. Because the performance of the model depends on the weight initialization, the solved parameters may be a suboptimal solution; therefore, for small sample data, S-fold cross-validation is considered, and the model with the minimum average test error over the S evaluations is selected. The output of the model is the category code of the first standard data set matched with the first external data set, and this category code is used as the identity of the first standard data set.
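The S-fold selection step can be sketched independently of the classifier. In the sketch below, two simple stand-in classifiers (a majority-class baseline and a nearest-centroid classifier) replace the MLP so that the example stays self-contained; they are not the model described in the text, and all names and data are hypothetical.

```python
import random


def s_fold_indices(n_samples, s, seed=0):
    """Shuffle sample indices and split them into s folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::s] for i in range(s)]


def average_test_error(train_fn, samples, labels, s=5, seed=0):
    """Average misclassification rate over the s held-out folds."""
    folds = s_fold_indices(len(samples), s, seed)
    errors = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, f in enumerate(folds) if j != k for i in f]
        predict = train_fn([samples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        wrong = sum(1 for i in test_idx if predict(samples[i]) != labels[i])
        errors.append(wrong / len(test_idx))
    return sum(errors) / len(errors)


def train_majority(xs, ys):
    """Stand-in model 1: always predict the most frequent label."""
    most_common = max(set(ys), key=ys.count)
    return lambda x: most_common


def train_nearest_centroid(xs, ys):
    """Stand-in model 2: predict the label of the closest class mean."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def predict(x):
        return min(centroids, key=lambda lab: sum(
            (a - b) ** 2 for a, b in zip(x, centroids[lab])))
    return predict


# Hypothetical data set vectors for two standard data set categories.
samples = ([[0.01 * i, 0.0] for i in range(10)]
           + [[1.0 + 0.01 * i, 1.0] for i in range(10)])
labels = ["QQ"] * 10 + ["WeChat"] * 10
candidates = {"majority": train_majority,
              "nearest_centroid": train_nearest_centroid}
errors = {name: average_test_error(fn, samples, labels, s=5)
          for name, fn in candidates.items()}
best_name = min(errors, key=errors.get)
```

The candidate with the smallest average test error is kept; with an MLP, the candidates would typically be runs with different weight initializations rather than different model types.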
Step 240, establishing a first field mapping relationship between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped.
Step 250, calculating a first field identity corresponding to the first field to be mapped.
Because storage memory is limited, storing the fields to be mapped themselves in the mapping cache table may slow down operation. An identity representing the first field to be mapped is therefore needed. The identity may be a feature value corresponding to the field to be mapped; for example, the first field to be mapped contains at least one word, and the weights of the word vectors corresponding to the words are used as the feature value of the first field to be mapped.
Fig. 2c shows a flowchart for calculating the first field identity according to the second embodiment of the present invention, and as shown in fig. 2c, the specific calculation step of the first field identity includes:
Step 251, if it is determined that the number of the first fields to be mapped is at least two, sorting the first fields to be mapped according to a preset sorting rule.
If there is only one first field to be mapped, no sorting is needed. If there are at least two, the first fields to be mapped of the first external data set to be accessed need to be sorted, so that the output result is not affected by the order in which the fields appear. The preset sorting rule may, for example, follow the order of the fields of the external data set: taking QQ chat as the first external data set to be accessed, its fields (chat time, chat content, chat person A, chat person B, QQ space and so on) can be sorted as "chat time chat content chat person A chat person B".
Step 252, merging the sorted first fields to be mapped into a long character string.
Following the above example, the sorted "chat time chat content chat person A chat person B" is converted into a binary string.
Step 253, calculating the hash value of the long character string according to a hash algorithm as the first field identity.
The first field identity can be computed with a hash algorithm. A hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length, and this value is the identity of the first field. The hash value is a very compact numerical representation of the first field and represents its characteristic information.
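Steps 251 to 253 can be sketched as follows. The text names neither the sorting rule nor the hash algorithm, so lexicographic sorting, the unit-separator joining character and SHA-256 are assumptions chosen for illustration; `field_identity` is a hypothetical name.

```python
import hashlib


def field_identity(fields):
    """Sort the field names, merge them into one string and hash it."""
    ordered = sorted(fields)           # assumed preset sorting rule
    joined = "\x1f".join(ordered)      # assumed separator between fields
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


id_1 = field_identity(["chat time", "chat content",
                       "chat person A", "chat person B"])
id_2 = field_identity(["chat person B", "chat time",
                       "chat person A", "chat content"])
id_3 = field_identity(["chat time", "chat content"])
```

Because the fields are sorted before hashing, `id_1` and `id_2` are identical even though the fields arrive in a different order, while a different field set such as the one behind `id_3` yields a different identity.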
Step 260, storing the first field identity, the first standard data set identity and the first field mapping relationship in a mapping cache table.
The first field identity may be a hash value, the first standard data set identity may be a first standard data set class code matched with the first external data set, and the first field mapping relationship is a one-to-one correspondence relationship between the first field to be mapped and the standard field in the first standard data set.
The three are stored in the mapping cache table to build a cache in advance. When another external data set is accessed, the mapping cache table can be checked for a record corresponding to that external data set; if one exists, the external data set can be accessed directly according to the stored field mapping relation.
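A mapping cache table of this kind can be sketched with an in-memory dictionary keyed by the field identity. The storage layout, the function names and the sample identity string are assumptions for illustration; a real deployment might use a database table instead.

```python
# field identity -> (standard data set identity, field mapping relation)
mapping_cache = {}


def store_mapping(field_identity, standard_dataset_id, field_mapping):
    """Record the result of a completed classification and field mapping."""
    mapping_cache[field_identity] = (standard_dataset_id, dict(field_mapping))


def lookup_mapping(field_identity):
    """Return the cached (data set id, mapping) pair, or None on a miss."""
    return mapping_cache.get(field_identity)


def apply_mapping(record, field_mapping):
    """Rename the fields of one external record to standard field names.

    Values are collected in lists because several external fields
    (chat person A, chat person B) may map to one standard field.
    """
    converted = {}
    for external_field, value in record.items():
        standard_field = field_mapping.get(external_field)
        if standard_field is not None:
            converted.setdefault(standard_field, []).append(value)
    return converted


store_mapping("hypothetical-field-identity", "QQ", {
    "chat time": "time",
    "chat content": "content",
    "chat person A": "speaker",
    "chat person B": "speaker",
})
hit = lookup_mapping("hypothetical-field-identity")
record = {"chat time": "09:00", "chat content": "hello",
          "chat person A": "alice", "chat person B": "bob",
          "QQ space": "unused"}
converted = apply_mapping(record, hit[1])
```

On a cache hit the classification and similarity steps are skipped entirely; on a miss (`lookup_mapping` returns `None`) the full procedure of the first embodiment would be run and its result stored.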
Step 270, according to the first field mapping relationship, accessing the first external data set to the first standard data set.
According to the embodiment of the invention, the field to be mapped of the first external data set to be accessed is subjected to word segmentation and word vectorization through the combination of the CRF model, the Skip-Gram model based on Hierarchical Softmax and the MLP model, the identity of the first standard data set is determined, the identity of the first field is determined through the hash algorithm, and meanwhile, the first field identity, the identity of the first standard data set and the mapping relation of the first field are stored in the mapping cache table.
Example three
Fig. 3a is a flowchart of data access provided by a third embodiment of the present invention, and as shown in fig. 3a, the specific steps of the data access are as follows:
Step 310, obtaining at least one second field to be mapped corresponding to a second external data set to be accessed.
Step 320, calculating a second field identity corresponding to the second field to be mapped.
The calculation process of the second field identity is the same as that of the first field identity in the second embodiment of the present invention. And the terms "first" and "second" in any embodiment of the present invention are used merely for distinction and are not intended to be limiting.
Step 330, determining whether the mapping cache table stores the second field identity.
Step 340, if yes, acquiring a second standard data set identity and a second field mapping relation corresponding to the second field identity in the mapping cache table, and then performing step 370.
When determining whether the second field identity is stored in the mapping cache table, a Bloom filter can be considered to speed up the lookup over the field identities (for example, the hash values of the fields) in the mapping cache table, so that the result is obtained in a short time.
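A Bloom filter answers "possibly stored" or "definitely not stored" without scanning the stored identities. The sketch below is a minimal illustration; the bit-array size, the number of hash functions and the use of seeded SHA-256 digests are assumptions, not parameters given in the text.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over string items."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


known_identities = BloomFilter()
known_identities.add("hypothetical-field-identity")
```

A `False` answer lets the method go straight to the full classification path; a `True` answer still has to be confirmed against the mapping cache table, because Bloom filters can report false positives.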
It can be understood that, if it is determined that the second field identity is not stored in the mapping cache table, the specific content of the first embodiment of the present invention is executed, that is:
Step 350, if not, obtaining a to-be-matched data set vector corresponding to the second external data set according to the word vector corresponding to each participle in the second to-be-mapped field.
Step 360, determining a second standard data set matched with the second external data set according to the data set vector to be matched and a pre-trained data set classification model.
Step 370, establishing a second field mapping relationship between the second field to be mapped and the standard field according to the similarity between the standard field and the second field to be mapped in the second standard data set.
Step 380, accessing the second external data set into the second standard data set according to the second field mapping relation.
According to the embodiment of the invention, whether the mapping cache pool has the second field identity is judged, if yes, the second external data set is accessed to the second standard data set according to the second field mapping relation; and if not, segmenting words of a second field to be mapped corresponding to the second external data set, vectorizing the words to obtain a data set vector to be matched corresponding to the second external data set, determining the class of a second standard data set corresponding to the second external data set according to the data set classification model, and meanwhile, accessing the second external data set into the second standard data set according to the second field mapping relation. When the field identity of the data set to be accessed exists in the mapping cache pool, word segmentation is not needed, and the external data set is accessed according to the field mapping relation in the cache pool, so that the data access is faster and more convenient.
Further, on the basis of any of the above embodiments, fig. 3b illustrates a technical route diagram for data access provided by the embodiment of the present invention, and as shown in fig. 3b, a person skilled in the art may implement automatic access of external data according to the technical route for data access.
EXAMPLE four
Fig. 4 is a schematic structural diagram of a data access apparatus according to a fourth embodiment of the present invention, and as shown in fig. 4, the apparatus includes: a first field to be mapped obtaining module 410, a data set vector to be matched obtaining module 420, a first standard data set determining module 430, a first field mapping relationship establishing module 440, and a first external data set accessing module 450, wherein:
a first field to be mapped obtaining module 410, configured to obtain at least one first field to be mapped corresponding to a first external data set to be accessed;
a to-be-matched data set vector obtaining module 420, configured to obtain a to-be-matched data set vector corresponding to the first external data set according to a word vector corresponding to each word segment in the first field to be mapped;
a first standard data set determining module 430, configured to determine, according to the to-be-matched data set vector and a pre-trained data set classification model, a first standard data set that matches the first external data set;
a first field mapping relationship establishing module 440, configured to establish a first field mapping relationship between the first field to be mapped and the standard field according to a similarity between the standard field and the first field to be mapped in the first standard data set;
a first external data set accessing module 450, configured to access the first external data set into the first standard data set according to the first field mapping relationship.
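As a rough illustration of what module 420 might compute, the sketch below combines word vectors into a single data set vector. Mean pooling, the function name `dataset_vector` and the toy embedding table are assumptions; the source only states that the word vectors are "combined" and does not say how.

```python
from typing import Dict, List, Sequence


def dataset_vector(
    segmented_fields: Sequence[Sequence[str]],
    word_vectors: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the vectors of all known word segments across all fields."""
    total = [0.0] * dim
    count = 0
    for tokens in segmented_fields:
        for token in tokens:
            vec = word_vectors.get(token)
            if vec is None:
                continue  # skip word segments missing from the vocabulary
            for i in range(dim):
                total[i] += vec[i]
            count += 1
    if count == 0:
        return total  # all-zero vector when nothing could be looked up
    return [x / count for x in total]
```

The resulting fixed-length vector is what a classifier such as the data set classification model would take as input.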
The invention discloses a data access method, a data access device, computer equipment and a readable storage medium, wherein at least one first field to be mapped corresponding to a first external data set to be accessed is obtained; a to-be-matched data set vector corresponding to the first external data set is obtained according to the word vector corresponding to each word segment in the first field to be mapped; a first standard data set matched with the first external data set is determined according to the to-be-matched data set vector and a pre-trained data set classification model; a first field mapping relationship between the first field to be mapped and the standard field is established according to the similarity between the standard field in the first standard data set and the first field to be mapped; and the first external data set is accessed into the first standard data set according to the first field mapping relationship. The invention thereby enables automatic recommendation of the standard data set to access and automatic field-level mapping.
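The similarity-based field mapping summarized above might look like the following sketch. Cosine similarity, the 0.8 threshold, and the names `cosine` and `map_fields` are assumptions chosen for illustration; the source speaks only of "similarity" between the standard field and the field to be mapped.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def map_fields(
    external: Dict[str, List[float]],
    standard: Dict[str, List[float]],
    threshold: float = 0.8,
) -> Dict[str, Optional[str]]:
    """Map each external field to its most similar standard field, or None."""
    mapping: Dict[str, Optional[str]] = {}
    for name, vec in external.items():
        best_name, best_score = None, threshold
        for std_name, std_vec in standard.items():
            score = cosine(vec, std_vec)
            if score >= best_score:
                best_name, best_score = std_name, score
        mapping[name] = best_name  # None means no standard field was close enough
    return mapping
```

Fields left unmapped (value `None`) could then be handed to a human reviewer, which is one possible design choice and not something the source prescribes.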
On the basis of the above embodiments, the apparatus may further include, following the first field mapping relationship establishing module:
a first field identity calculation module, configured to calculate a first field identity corresponding to the first field to be mapped;
a mapping cache table storage module, configured to store the first field identity, the first standard data set identity, and the first field mapping relationship in a mapping cache table;
a second to-be-mapped field obtaining module, configured to obtain at least one second field to be mapped corresponding to a second external data set to be accessed;
a second field identity calculation module, configured to calculate a second field identity corresponding to the second field to be mapped;
a second field identity judgment module, configured to, if it is determined that the second field identity is stored in the mapping cache table, obtain from the mapping cache table a second standard data set identity and a second field mapping relationship corresponding to the second field identity;
a second external data set access module, configured to access the second external data set into the second standard data set according to the second field mapping relationship;
the to-be-matched data set vector obtaining module comprises:
a word segmentation submodule, configured to determine each word segment of the first field to be mapped according to a pre-trained word segmenter;
a word vector submodule, configured to obtain a word vector corresponding to each word segment according to a word vector model;
a combination submodule, configured to combine the word vectors corresponding to the word segments to obtain the to-be-matched data set vector corresponding to the first external data set;
the first field identity calculation module comprises:
a sorting submodule, configured to sort the first fields to be mapped according to a preset sorting rule;
a conversion submodule, configured to merge the sorted first fields to be mapped into a long character string;
a calculation submodule, configured to calculate the hash value of the long character string according to a hash algorithm and use it as the first field identity;
in the first standard data set determining module, the data set classification model is a multi-layer perceptron (MLP) model;
in the word segmentation submodule, the word segmenter is a conditional random field (CRF) model;
in the word vector submodule, the word vector model is a term frequency-inverse document frequency (TF-IDF) model.
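The three submodules of the first field identity calculation module (sorting, conversion to a long character string, hashing) can be sketched in a few lines. Lexicographic ordering as the "preset sorting rule", the unit-separator character used for joining, and SHA-256 as the hash algorithm are assumptions of this sketch; the function name `field_identity` is hypothetical.

```python
import hashlib
from typing import Iterable


def field_identity(fields: Iterable[str], separator: str = "\x1f") -> str:
    """Sort the field names, join them into one long string, and hash it."""
    ordered = sorted(fields)  # assumed sorting rule: plain lexicographic order
    # A separator unlikely to occur in field names keeps ["ab", "c"] and
    # ["a", "bc"] from collapsing into the same long string.
    long_string = separator.join(ordered)
    return hashlib.sha256(long_string.encode("utf-8")).hexdigest()
```

Sorting before joining means that two data sets with the same fields in a different column order yield the same identity, so the second one can be served from the mapping cache table.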
EXAMPLE five
Fig. 5 is a schematic structural diagram of a computer apparatus according to a fifth embodiment of the present invention, as shown in fig. 5, the apparatus includes a processor 50, a memory 51, an input device 52, and an output device 53; the number of processors 50 in the device may be one or more, and one processor 50 is taken as an example in fig. 5; the processor 50, the memory 51, the input means 52 and the output means 53 of the device may be connected by a bus or other means, as exemplified by the bus connection in fig. 5.
The memory 51, as a computer-readable storage medium, can be used for storing software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the data access method in the embodiment of the present invention (for example, the first field to be mapped obtaining module 410, the to-be-matched data set vector obtaining module 420, the first standard data set determining module 430, the first field mapping relationship establishing module 440, and the first external data set accessing module 450 in the data access apparatus). The processor 50 executes the various functional applications and data processing of the device, that is, implements the above-mentioned data access method, by running the software programs, instructions and modules stored in the memory 51.
The memory 51 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 51 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 51 may further include memory located remotely from the processor 50, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 52 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output device 53 may include a display device such as a display screen.
The product can execute the method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
EXAMPLE six
The sixth embodiment of the present invention further provides a storage medium containing computer-executable instructions, where the computer-executable instructions are not limited to the operations of the method described above, and may also perform related operations in the data access method provided in any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the data access apparatus, each unit and each module included in the apparatus are only divided according to functional logic, but are not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for data access, comprising:
acquiring at least one first field to be mapped corresponding to a first external data set to be accessed;
obtaining a data set vector to be matched corresponding to the first external data set according to the word vector corresponding to each word segment in the first field to be mapped;
determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model;
establishing a first field mapping relation between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped;
and accessing the first external data set into the first standard data set according to the first field mapping relation.
2. The method according to claim 1, wherein after establishing a first field mapping relationship between the first field to be mapped and the standard field according to a similarity between the standard field in the first standard data set and the first field to be mapped, the method further comprises:
calculating a first field identity corresponding to the first field to be mapped;
and storing the first field identity, the first standard data set identity and the first field mapping relation in a mapping cache table.
3. The method of claim 2, further comprising:
acquiring at least one second field to be mapped corresponding to a second external data set to be accessed;
calculating a second field identity corresponding to the second field to be mapped;
if the second field identity is determined to be stored in the mapping cache table, acquiring a second standard data set identity and a second field mapping relation corresponding to the second field identity in the mapping cache table;
and accessing the second external data set into the second standard data set according to the second field mapping relation.
4. The method of claim 1, wherein obtaining a to-be-matched data set vector corresponding to the first external data set according to the word vector corresponding to each word segment in the first field to be mapped comprises:
determining each word segment of the first field to be mapped according to a pre-trained word segmenter;
obtaining a word vector corresponding to each word segment according to a word vector model;
and combining the word vectors corresponding to the word segments to obtain the to-be-matched data set vector corresponding to the first external data set.
5. The method of claim 2, wherein calculating the first field identity corresponding to the first field to be mapped comprises:
if the number of the first fields to be mapped is determined to be at least two, sequencing the first fields to be mapped according to a preset sequencing rule;
merging the sorted first fields to be mapped into a long character string;
and calculating the hash value of the long character string according to a hash algorithm to be used as the first field identity.
6. The method of claim 4, wherein:
the data set classification model is a multi-layer perceptron (MLP) model;
the word segmenter is a conditional random field (CRF) model;
the word vector model is a Skip-Gram model based on Hierarchical Softmax.
7. An apparatus for data access, comprising:
the first field to be mapped acquisition module is used for acquiring at least one first field to be mapped corresponding to a first external data set to be accessed;
a to-be-matched data set vector obtaining module, configured to obtain a to-be-matched data set vector corresponding to the first external data set according to a word vector corresponding to each word segment in the first field to be mapped;
the first standard data set determining module is used for determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model;
a first field mapping relationship establishing module, configured to establish a first field mapping relationship between the first field to be mapped and the standard field according to a similarity between the standard field in the first standard data set and the first field to be mapped;
and the first external data set access module is used for accessing the first external data set into the first standard data set according to the first field mapping relation.
8. The apparatus of claim 7, further comprising, following the first field mapping relationship establishing module:
a first field identity calculation module, configured to calculate a first field identity corresponding to the first field to be mapped;
a mapping cache table storage module, configured to store the first field identity, the first standard data set identity, and the first field mapping relationship in a mapping cache table;
a second to-be-mapped field obtaining module, configured to obtain at least one second field to be mapped corresponding to a second external data set to be accessed;
a second field identity calculation module, configured to calculate a second field identity corresponding to the second field to be mapped;
a second field identity judgment module, configured to, if it is determined that the second field identity is stored in the mapping cache table, obtain from the mapping cache table a second standard data set identity and a second field mapping relationship corresponding to the second field identity;
a second external data set access module, configured to access the second external data set into the second standard data set according to the second field mapping relationship;
the to-be-matched data set vector obtaining module comprises:
a word segmentation submodule, configured to determine each word segment of the first field to be mapped according to a pre-trained word segmenter;
a word vector submodule, configured to obtain a word vector corresponding to each word segment according to a word vector model;
a combination submodule, configured to combine the word vectors corresponding to the word segments to obtain the to-be-matched data set vector corresponding to the first external data set;
the first field identity calculation module comprises:
a sorting submodule, configured to sort the first fields to be mapped according to a preset sorting rule;
a conversion submodule, configured to merge the sorted first fields to be mapped into a long character string;
a calculation submodule, configured to calculate the hash value of the long character string according to a hash algorithm and use it as the first field identity;
in the first standard data set determining module, the data set classification model is a multi-layer perceptron (MLP) model;
in the word segmentation submodule, the word segmenter is a conditional random field (CRF) model;
the word vector model is a Skip-Gram model based on Hierarchical Softmax.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of data access according to any of claims 1-6 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of data access according to any one of claims 1 to 5.
CN201711052158.3A 2017-10-30 2017-10-30 Data access method and device, computer equipment and readable storage medium Active CN107844560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711052158.3A CN107844560B (en) 2017-10-30 2017-10-30 Data access method and device, computer equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN107844560A CN107844560A (en) 2018-03-27
CN107844560B CN107844560B (en) 2020-09-08

Family

ID=61681153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711052158.3A Active CN107844560B (en) 2017-10-30 2017-10-30 Data access method and device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN107844560B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant

PE01 Entry into force of the registration of the contract for pledge of patent right
Denomination of invention: The invention relates to a data access method, a device, a computer device and a readable storage medium
Effective date of registration: 20220105
Granted publication date: 20200908
Pledgee: China Construction Bank Corp Beijing Zhongguancun branch
Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING
Registration number: Y2022990000005

PC01 Cancellation of the registration of the contract for pledge of patent right
Date of cancellation: 20220712
Granted publication date: 20200908
Pledgee: China Construction Bank Corp Beijing Zhongguancun branch
Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING
Registration number: Y2022990000005

PE01 Entry into force of the registration of the contract for pledge of patent right
Denomination of invention: A method, apparatus, computer device and readable storage medium for data access
Effective date of registration: 20220907
Granted publication date: 20200908
Pledgee: China Construction Bank Corp Beijing Zhongguancun branch
Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING
Registration number: Y2022110000206

PC01 Cancellation of the registration of the contract for pledge of patent right
Granted publication date: 20200908
Pledgee: China Construction Bank Corp Beijing Zhongguancun branch
Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING
Registration number: Y2022110000206