CN107844560B - Data access method and device, computer equipment and readable storage medium

Info

Publication number: CN107844560B
Application number: CN201711052158.3A
Authority: CN
Inventors: 谢永恒; 李贺; 火一莽; 万月亮
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2017-10-30
Filing date: 2017-10-30
Publication date: 2020-09-08
Anticipated expiration: 2037-10-30
Also published as: CN107844560A

Abstract

The invention discloses a data access method, a data access device, computer equipment and a readable storage medium, wherein at least one first field to be mapped corresponding to a first external data set to be accessed is obtained; obtaining a data set vector to be matched corresponding to the first external data set according to the word vector corresponding to each participle in the first field to be mapped; determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model; establishing a first field mapping relation between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped; and accessing the first external data set into the first standard data set according to the first field mapping relation. The invention can realize automatic recommendation of data access and automatic mapping of field level.

Description

Data access method and device, computer equipment and readable storage medium

Technical Field

Embodiments of the present invention relate to data processing technologies, and in particular, to a method and an apparatus for data access, a computer device, and a readable storage medium.

Background

In enterprise production, a large amount of data access work is carried out every day. For example, China Mobile needs to access each user's telephone records, short messages, QQ chat data, WeChat data and the like. The accessed data often arrive in different formats, so enterprises must invest a large amount of time and manpower in configuring warehousing and format-conversion strategies for data in each format.

At present, data access is mainly performed manually, and manually configuring format-conversion strategies has the following technical defects: data access is costly, inefficient and poorly scalable.

Disclosure of Invention

The invention provides a data access method, a data access device, computer equipment and a readable storage medium, which are used for realizing automatic recommendation of data access and automatic mapping of field levels.

In a first aspect, an embodiment of the present invention provides a method for data access, including:

acquiring at least one first field to be mapped corresponding to a first external data set to be accessed;

obtaining a data set vector to be matched corresponding to the first external data set according to the word vector corresponding to each participle in the first field to be mapped;

determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model;

establishing a first field mapping relation between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped;

and accessing the first external data set into the first standard data set according to the first field mapping relation.

In a second aspect, an embodiment of the present invention further provides a data access apparatus, including:

the first field to be mapped acquisition module is used for acquiring at least one first field to be mapped corresponding to a first external data set to be accessed;

a to-be-matched data set vector obtaining module, configured to obtain a to-be-matched data set vector corresponding to the first external data set according to a word vector corresponding to each participle in the first to-be-mapped field;

the first standard data set determining module is used for determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model;

a first field mapping relationship establishing module, configured to establish a first field mapping relationship between the first field to be mapped and the standard field according to a similarity between the standard field in the first standard data set and the first field to be mapped;

and the first external data set access module is used for accessing the first external data set into the first standard data set according to the first field mapping relation.

In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for accessing data according to any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, implement a method for data access according to any embodiment of the present invention.

The method comprises the steps of obtaining at least one first field to be mapped corresponding to a first external data set to be accessed; obtaining a data set vector to be matched corresponding to the first external data set according to the word vector corresponding to each participle in the first field to be mapped; determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model; establishing a first field mapping relation between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped; and accessing the first external data set into the first standard data set according to the first field mapping relation. The invention can realize automatic recommendation of data access and automatic mapping of field level.

Drawings

Fig. 1 is a flowchart of a data access method according to a first embodiment of the present invention;

Fig. 2a is a flowchart of a data access method according to a second embodiment of the present invention;

Fig. 2b is a flowchart of constructing a data set vector to be matched according to the second embodiment of the present invention;

Fig. 2c is a flowchart of calculating the first field identity according to the second embodiment of the present invention;

Fig. 3a is a flowchart of a data access method according to a third embodiment of the present invention;

Fig. 3b is a technical route diagram of data access according to the third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a data access apparatus according to a fourth embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings.

Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.

Example one

Fig. 1 is a flowchart of a data access method according to an embodiment of the present invention, where the present embodiment is applicable to a data access situation, and the method may be executed by a data access apparatus, where the data access apparatus may be implemented by software and/or hardware, and may be generally integrated in a computer device. The method of the embodiment specifically comprises the following steps:

Step 110, at least one first field to be mapped corresponding to a first external data set to be accessed is obtained.

In the data access method, the object to be accessed is an external data set. For example, China Mobile needs to access each user's telephone records, short messages, QQ chat data, WeChat data and the like; these are accessed into China Mobile's data sets as external data sets. Specifically, suppose the QQ chat data is accessed as the first external data set. The QQ chat external data set includes the fields chat time, chat content, chat person A, chat person B and QQ space, and these fields serve as the first fields to be mapped.

Step 120, obtaining a to-be-matched data set vector corresponding to the first external data set according to the word vector corresponding to each participle in the first to-be-mapped field.

Word segmentation is the process of splitting a continuous Chinese character sequence into individual words, that is, recombining the character sequence into a word sequence according to a certain specification; each resulting word is referred to herein as a participle. Word segmentation can be performed by a word segmentation model; for example, a sentence meaning "I got off duty" is split into separate words such as "I" and "off duty". On this basis, stop-word filtering can be applied to the segmentation result to exclude common words that carry little information, such as "the". To hand natural language to a machine learning algorithm, the language usually has to be converted into numerical form first. A word vector is one way of representing a word numerically, and the word corresponding to each participle is vectorized according to a word vector model. The data set vector can be obtained as a weighted average of the word vectors of the participles, or by combining the word vectors of the participles.

Step 130, determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model.

Specifically, suppose the QQ chat external data set is to be accessed into the standard data sets, which cover many categories such as QQ, WeChat, telephone and SMS. It must then be determined that the QQ chat external data set should be accessed into the QQ category of the standard data sets and not, for instance, into the WeChat category. Determining the first standard data set that matches the first external data set is therefore a classification problem, and a data set classification model such as a Multilayer Perceptron (MLP) or a Support Vector Machine (SVM) may be used.

It can be understood that, before the data set vector to be matched is input into the data set classification model, whether to normalize the vector must be decided according to the model used. If an SVM is adopted as the data set classification model, the data set vector to be matched needs to be normalized.

Step 140, establishing a first field mapping relationship between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped.

It will be appreciated that, after the category of the standard data set into which the external data set is to be accessed has been determined, each field of the external data set also needs to be matched. Continuing the above example, once it is determined that the QQ chat external data set should be accessed into the QQ standard data set of China Mobile, and assuming that this QQ standard data set includes the fields time, content and conversation person, the fields of the QQ chat external data set (chat time, chat content, chat person A, chat person B and QQ space) must be put into one-to-one correspondence with the time, content and conversation person fields of the QQ standard data set. The established correspondence is the field mapping relationship.

The mapping relationship can be determined according to the similarity between the fields, for example, according to cosine similarity calculation, if the similarity result is higher, the first to-be-mapped field of the first external data set can be determined to be in one-to-one correspondence with the standard field in the first standard data set, and the mapping relationship can be established; and when the similarity result is lower, the mapping relation cannot be established. As in the above example, the mapping relationship that can be established is: chat time-time, chat content-content, chat person A-conversation person, chat person B-conversation person. The similarity between the QQ space and the time, content and speaker in the standard data set is low, and the mapping relation cannot be established.
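The similarity-based mapping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `cosine` and `map_fields`, the threshold of 0.8 and the toy three-dimensional field vectors are all assumptions introduced for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def map_fields(external, standard, threshold=0.8):
    """Map each external field to its most similar standard field.

    A mapping is created only when the best similarity reaches the
    threshold (an assumed value); other fields stay unmapped.
    """
    mapping = {}
    for name, vec in external.items():
        best_name, best_score = None, threshold
        for std_name, std_vec in standard.items():
            score = cosine(vec, std_vec)
            if score >= best_score:
                best_name, best_score = std_name, score
        if best_name is not None:
            mapping[name] = best_name
    return mapping


# Toy field vectors, invented for illustration only.
standard = {
    "time": [1.0, 0.0, 0.0],
    "content": [0.0, 1.0, 0.0],
    "conversation person": [0.0, 0.0, 1.0],
}
external = {
    "chat time": [0.9, 0.1, 0.0],
    "chat content": [0.1, 0.9, 0.1],
    "chat person A": [0.0, 0.1, 0.95],
    "chat person B": [0.05, 0.0, 0.9],
    "QQ space": [0.5, 0.5, 0.5],
}
mapping = map_fields(external, standard)
```

With these toy vectors the four chat fields are mapped and the QQ space field is left out, because its similarity to every standard field stays below the threshold.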

Step 150, accessing the first external data set to the first standard data set according to the first field mapping relation.

After the first external data set is obtained, it is automatically matched against the first field mapping relation. If the matching succeeds, the first external data set is accessed automatically; if the matching fails, manual intervention is required. The unmapped QQ space field can then be stored or discarded according to the available storage and actual requirements.
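As a rough sketch of this access step, the snippet below renames the fields of one external record according to a field mapping and sets unmapped fields aside. The function `access_record`, its `keep_spare` flag and the sample record are hypothetical and only illustrate the idea.

```python
def access_record(record, field_mapping, keep_spare=True):
    """Rename the fields of one external record to standard field names.

    Returns (standard_record, spare). Several external fields may map to
    the same standard field, so standard values are collected in lists.
    """
    standard_record = {}
    spare = {}
    for field, value in record.items():
        target = field_mapping.get(field)
        if target is not None:
            standard_record.setdefault(target, []).append(value)
        elif keep_spare:
            spare[field] = value  # unmapped field kept for later review
    return standard_record, spare


# Hypothetical mapping and record, following the QQ chat example.
field_mapping = {
    "chat time": "time",
    "chat content": "content",
    "chat person A": "conversation person",
    "chat person B": "conversation person",
}
record = {
    "chat time": "2017-10-30 09:00",
    "chat content": "hello",
    "chat person A": "alice",
    "chat person B": "bob",
    "QQ space": "some page",
}
standard_record, spare = access_record(record, field_mapping)
```

Passing `keep_spare=False` corresponds to discarding the unmapped field instead of storing it.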

According to the technical scheme of the embodiment, at least one first field to be mapped corresponding to a first external data set to be accessed is obtained; obtaining a data set vector to be matched corresponding to the first external data set according to the word vector corresponding to each participle in the first field to be mapped; determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model; establishing a first field mapping relation between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped; and accessing the first external data set into the first standard data set according to the first field mapping relation. The invention can realize automatic recommendation of data access and automatic mapping of field level.

Example two

Fig. 2a is a flowchart of a data access method according to a second embodiment of the present invention, which is optimized on the basis of the first embodiment. In this embodiment, after establishing the first field mapping relationship between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped, the method further includes: calculating a first field identity corresponding to the first field to be mapped; and storing the first field identity, the first standard data set identity and the first field mapping relation in a mapping cache table.

As shown in fig. 2a, the embodiment of the present invention specifically includes:

Step 210, at least one first field to be mapped corresponding to a first external data set to be accessed is obtained.

Step 220, obtaining a to-be-matched data set vector corresponding to the first external data set according to the word vector corresponding to each participle in the first to-be-mapped field.

In fig. 2b, a flowchart for constructing a to-be-matched data set vector according to a second embodiment of the present invention is shown, and as shown in fig. 2b, a construction process of the to-be-matched data set vector includes:

Step 221, determining each participle of the first field to be mapped according to a pre-trained word segmenter.

The word segmenter may adopt a conditional random field (CRF) word segmentation model, which performs automatic word segmentation through sequence labeling. In the CRF model, every character occupies a definite position within the word it helps form, and each character is assumed to take one of only four positions: beginning of a word (B), middle of a word (M), end of a word (E), or single-character word (S). The segmentation result is expressed as a sequence of these position labels. The CRF labeling model is a discriminative model that models a conditional probability: during learning, the model parameters are solved iteratively on a training data set by maximum likelihood estimation or regularized maximum likelihood estimation, and during prediction, the output sequence with the maximum conditional probability is found for a given input sequence.

For example, segmenting the sentence "上海计划到本世纪末实现人均国内生产总值五千美元。" (Shanghai plans to reach a per-capita gross domestic product of five thousand US dollars by the end of this century.) with the CRF word segmentation model gives:

上/B 海/E 计/B 划/E 到/S 本/S 世/B 纪/E 末/S 实/B 现/E 人/B 均/E 国/B 内/E 生/B 产/E 总/B 值/E 五/B 千/M 美/M 元/E 。/S
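To make the B/M/E/S labeling concrete, the following sketch turns a labeled character sequence back into words. It covers only the decoding of labels that a segmenter has already produced, not the CRF model itself; the function name `bmes_to_words` and the short sample input are assumptions for illustration.

```python
def bmes_to_words(chars, tags):
    """Rebuild words from characters labeled B (begin), M (middle),
    E (end) or S (single-character word)."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if current:  # flush an unfinished word from a malformed sequence
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            if current:
                words.append(current)
            current = ch
        elif tag == "M":
            current += ch
        elif tag == "E":
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    if current:
        words.append(current)
    return words


# Five characters labeled B E B E S form two two-character words and one
# single-character word.
words = bmes_to_words(list("上海计划到"), list("BEBES"))
```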

Step 222, obtaining a word vector corresponding to each participle according to the word vector model.

The word vector model is a Skip-Gram model based on Hierarchical Softmax, which is one of the neural network word representation models. It uses a neural network to model the context and the relation between the context and the target word; because neural networks are flexible, the greatest advantage of the model is that it can capture complex contexts. The Skip-Gram model based on Hierarchical Softmax comprises an input layer, a projection layer and an output layer. Taking the sample (w, context(w)) as an example:

Input layer: contains only the word vector v(w) ∈ R^m of the center word w of the current sample;

Projection layer: identical to the input layer, containing only the word vector v(w) ∈ R^m of the center word w of the current sample;

Output layer: a Huffman tree constructed from word statistics of the corpus.

In this model, each leaf node of the Huffman tree represents a word and each branch is treated as one binary classification. The probability P(w, context(w)) is modeled on this basis, and the model parameters and word vectors are solved by a gradient ascent method with the likelihood function as the objective function.
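The output-layer structure can be illustrated by building a Huffman tree from word frequencies and reading off each word's binary path. This is a generic Huffman construction given as a sketch; the function name `huffman_codes` and the sample frequencies are assumptions, and the training of the Skip-Gram model itself is not shown.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {word: binary path} for a Huffman tree built from frequencies.

    Leaves are words (strings); internal nodes are (left, right) tuples.
    Each '0' or '1' on a path is one branch decision in the tree.
    """
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), w) for w, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


# Invented word frequencies for illustration.
codes = huffman_codes({"chat": 8, "time": 5, "content": 3, "space": 1})
```

Frequent words end up close to the root, so fewer binary classifications are evaluated for them than for rare words.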

Step 223, combining the word vectors corresponding to the participles to obtain a to-be-matched data set vector corresponding to the first external data set.

The data set vector to be matched is obtained by combining word vectors; for example, it can be the weighted average of the word vectors. Specifically, a term frequency-inverse document frequency (TF-IDF) model can be used to compute the weights. First, the term frequency (TF) with which each participle of each field appears in the data set to be accessed is counted. Next, the inverse document frequency (IDF) of the participle is computed over the historically accessed data sets; because the IDF is intended to suppress noise, its calculation is smoothed. Finally, the weight coefficient of the word vector corresponding to each participle is TF × IDF. Once the weight coefficients are determined, the weighted average of the word vectors is taken as the data set vector to be matched. Alternatively, a Continuous Bag-of-Words (CBOW) model may be employed to obtain the data set vector.
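A TF-IDF weighted average of word vectors might look like the sketch below. The text does not give the smoothing formula, so the code assumes one common variant, idf = ln((1 + N) / (1 + df)) + 1; the function names, the toy history and the two-dimensional word vectors are likewise assumptions.

```python
import math
from collections import Counter


def tfidf_weights(tokens, history):
    """TF-IDF weight per participle.

    tokens:  participles of the data set being accessed
    history: list of token sets, one per historically accessed data set
    The smoothed IDF used here is an assumed, commonly used variant.
    """
    tf = Counter(tokens)
    n_docs = len(history)
    weights = {}
    for tok, cnt in tf.items():
        df = sum(1 for doc in history if tok in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[tok] = (cnt / len(tokens)) * idf
    return weights


def dataset_vector(tokens, word_vectors, history):
    """Weighted average of word vectors, using TF-IDF weights."""
    weights = tfidf_weights(tokens, history)
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, w in weights.items():
        vec = word_vectors.get(tok)
        if vec is None:  # participle without a word vector is skipped
            continue
        for i, x in enumerate(vec):
            total[i] += w * x
        weight_sum += w
    return [x / weight_sum for x in total] if weight_sum else total


# Toy inputs, invented for illustration.
history = [{"chat", "time"}, {"chat", "content"}, {"call", "time"}]
word_vectors = {"chat": [1.0, 0.0], "time": [0.0, 1.0], "content": [0.0, 1.0]}
tokens = ["chat", "time", "chat", "content"]
vector = dataset_vector(tokens, word_vectors, history)
```

In this toy case "content" appears in fewer historical data sets than "time", so it receives the larger weight even though both occur once in the current data set.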

Step 230, determining a first standard data set matched with the first external data set according to the data set vector to be matched and a pre-trained data set classification model.

The pre-trained data set classification model is a multilayer perceptron (MLP) trained on historical data. The model takes the negative log-likelihood function as its cost function, and the model parameters are solved by gradient descent. Because the performance of the model depends on the weight initialization, the solved parameters may be only a suboptimal solution; therefore, for small-sample data, S-fold cross-validation is used and the model with the minimum average test error over the S evaluations is selected. The output of the model is the category code of the first standard data set matched with the first external data set, and this category code serves as the identity of the first standard data set.
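The selection by S-fold cross-validation can be sketched independently of the classifier. In the snippet below a nearest-centroid classifier and a majority-label classifier stand in for differently initialized MLPs, which keeps the example small; these stand-ins, the function names and the toy data are assumptions and not part of the described method.

```python
from collections import Counter


def s_fold_splits(n, s):
    """Yield (train_indices, test_indices) for S contiguous folds."""
    sizes = [n // s + (1 if i < n % s else 0) for i in range(s)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def average_test_error(train_fn, xs, ys, s):
    """Mean misclassification rate of train_fn over the S folds."""
    errors = []
    for train, test in s_fold_splits(len(xs), s):
        predict = train_fn([xs[i] for i in train], [ys[i] for i in train])
        wrong = sum(1 for i in test if predict(xs[i]) != ys[i])
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)


def train_centroid(xs, ys):
    """Stand-in classifier: predict the label of the nearest class centroid."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    cents = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def predict(x):
        return min(cents, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, cents[y])))

    return predict


def train_majority(xs, ys):
    """Stand-in classifier: always predict the most frequent training label."""
    label = Counter(ys).most_common(1)[0][0]
    return lambda x: label


def select_model(candidates, xs, ys, s):
    """Return the candidate name with the minimum average test error."""
    return min(candidates, key=lambda name: average_test_error(candidates[name], xs, ys, s))


# Toy data set vectors and category codes, invented for illustration.
xs = [[0, 0], [5, 5], [0, 1], [5, 6], [1, 0], [6, 5]]
ys = ["QQ", "SMS", "QQ", "SMS", "QQ", "SMS"]
candidates = {"centroid": train_centroid, "majority": train_majority}
best = select_model(candidates, xs, ys, s=3)
```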

Step 240, establishing a first field mapping relationship between the first field to be mapped and the standard field according to the similarity between the standard field in the first standard data set and the first field to be mapped.

Step 250, calculating a first field identity corresponding to the first field to be mapped.

Due to the limitation of a storage memory, storing a field to be mapped in a mapping cache pool may affect an operation speed, and therefore, an identity representing a first field needs to be obtained, where the identity may be a feature value corresponding to the field to be mapped, for example, the first field to be mapped includes at least one word, and a weight of a word vector corresponding to each word is used as a feature value of the first field to be mapped.

Fig. 2c shows a flowchart for calculating the first field identity according to the second embodiment of the present invention, and as shown in fig. 2c, the specific calculation step of the first field identity includes:

Step 251, if it is determined that the number of the first fields to be mapped is at least two, sorting the first fields to be mapped according to a preset sorting rule.

If there is only one first field to be mapped, no sorting is needed. If there are at least two, the first fields to be mapped of the first external data set to be accessed must be sorted, so that the output result is not affected by the order in which the fields appear. The preset sorting rule may, for example, follow the sequence of the fields in the external data set: taking QQ chat as the first external data set to be accessed, the fields chat time, chat content, chat person A, chat person B, QQ space and the like included in the external data set can be ordered as "chat time chat content chat person A chat person B".

Step 252, merging the sorted first fields to be mapped into a long character string.

Following the above example, the sorted fields "chat time chat content chat person A chat person B" are merged into one long string, which is converted to a binary string.

Step 253, calculating the hash value of the long character string according to a hash algorithm as the first field identity.

The first field identity can be computed with a hash algorithm. A hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length, and this value is the identity of the first field. The hash value is a very compact numerical representation of the first field and represents its characteristic information.
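Steps 251 to 253 can be sketched with the standard library as follows. The text names neither the sorting rule nor the hash algorithm in detail, so lexicographic sorting, the unit-separator join character and SHA-256 are assumptions, as is the function name `field_identity`.

```python
import hashlib


def field_identity(fields):
    """Sort the field names, merge them into one long string and hash it.

    Sorting makes the identity independent of the order in which the
    fields arrive. Lexicographic order and SHA-256 are assumed choices.
    """
    ordered = sorted(fields)
    long_string = "\x1f".join(ordered)  # separator avoids ambiguous merges
    return hashlib.sha256(long_string.encode("utf-8")).hexdigest()


fields = ["chat time", "chat content", "chat person A", "chat person B"]
identity = field_identity(fields)
same_identity = field_identity(list(reversed(fields)))
```

The two calls return the same identity because only the set of field names matters after sorting.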

Step 260, storing the first field identity, the first standard data set identity and the first field mapping relationship in a mapping cache table.

The first field identity may be a hash value, the first standard data set identity may be a first standard data set class code matched with the first external data set, and the first field mapping relationship is a one-to-one correspondence relationship between the first field to be mapped and the standard field in the first standard data set.

The three are stored in a mapping cache table, a pre-cache is established, if other external data sets are accessed, whether records corresponding to the external data sets exist can be judged according to the mapping cache table, and if the records corresponding to the external data sets exist, the external data sets can be directly accessed according to a field mapping relation.

Step 270, according to the first field mapping relationship, accessing the first external data set to the first standard data set.

According to this embodiment of the invention, the fields to be mapped of the first external data set to be accessed are segmented and vectorized through the combination of the CRF model, the Skip-Gram model based on Hierarchical Softmax and the MLP model, and the identity of the first standard data set is determined; the first field identity is determined by the hash algorithm; and the first field identity, the first standard data set identity and the first field mapping relation are stored in the mapping cache table.

Example three

Fig. 3a is a flowchart of data access provided by a third embodiment of the present invention, and as shown in fig. 3a, the specific steps of the data access are as follows:

Step 310, at least one second field to be mapped corresponding to a second external data set to be accessed is obtained.

Step 320, calculating a second field identity corresponding to the second field to be mapped.

The calculation process of the second field identity is the same as that of the first field identity in the second embodiment of the present invention. And the terms "first" and "second" in any embodiment of the present invention are used merely for distinction and are not intended to be limiting.

Step 330, determining whether the mapping cache table stores the second field identity.

Step 340, if yes, acquiring a second standard data set identity and a second field mapping relation corresponding to the second field identity in the mapping cache table, and then performing step 370.

When determining whether the second field identity is stored in the mapping cache pool, a Bloom filter may be used to speed up the lookup of field identities (for example, field hash values) in the mapping cache pool, so that the result is obtained in a short time.
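A Bloom filter answers "definitely not stored" or "possibly stored" without scanning the cache. The sketch below is a generic, minimal Bloom filter; the class name, the bit-array size, the number of hash functions and the use of SHA-256 to derive positions are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings such as field identities."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means the item was never added; True means it may have been.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bloom = BloomFilter()
was_present_before = bloom.might_contain("identity-1")
bloom.add("identity-1")
is_present_after = bloom.might_contain("identity-1")
```

A negative answer lets the method skip the cache lookup entirely; a positive answer still has to be confirmed against the mapping cache table, because false positives are possible.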

It can be understood that, if it is determined that the second field identity is not stored in the mapping cache table, the specific content of the first embodiment of the present invention is executed, that is:

Step 350, if not, obtaining a to-be-matched data set vector corresponding to the second external data set according to the word vector corresponding to each participle in the second to-be-mapped field.

Step 360, determining a second standard data set matched with the second external data set according to the data set vector to be matched and a pre-trained data set classification model.

Step 370, establishing a second field mapping relationship between the second field to be mapped and the standard field according to the similarity between the standard field and the second field to be mapped in the second standard data set.

Step 380, accessing the second external data set into the second standard data set according to the second field mapping relation.

According to the embodiment of the invention, whether the mapping cache pool has the second field identity is judged, if yes, the second external data set is accessed to the second standard data set according to the second field mapping relation; and if not, segmenting words of a second field to be mapped corresponding to the second external data set, vectorizing the words to obtain a data set vector to be matched corresponding to the second external data set, determining the class of a second standard data set corresponding to the second external data set according to the data set classification model, and meanwhile, accessing the second external data set into the second standard data set according to the second field mapping relation. When the field identity of the data set to be accessed exists in the mapping cache pool, word segmentation is not needed, and the external data set is accessed according to the field mapping relation in the cache pool, so that the data access is faster and more convenient.
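The cache-hit and cache-miss paths of this embodiment can be sketched as follows. The class `MappingCache`, the function `resolve_mapping` and the callable `classify_and_map` (a placeholder for segmentation, vectorization, classification and field matching) are hypothetical names; storing the result after a miss follows the caching step of the second embodiment.

```python
import hashlib


def field_identity(fields):
    """Order-independent hash of the field names (assumed: sorted, SHA-256)."""
    return hashlib.sha256("\x1f".join(sorted(fields)).encode("utf-8")).hexdigest()


class MappingCache:
    """In-memory mapping cache table keyed by field identity."""

    def __init__(self):
        self._table = {}

    def get(self, identity):
        return self._table.get(identity)

    def put(self, identity, dataset_id, mapping):
        self._table[identity] = (dataset_id, dict(mapping))


def resolve_mapping(fields, cache, classify_and_map):
    """Return (dataset_id, mapping, cache_hit).

    classify_and_map is a hypothetical callable for the slow path; it is
    only invoked when the field identity is not in the cache.
    """
    identity = field_identity(fields)
    cached = cache.get(identity)
    if cached is not None:
        dataset_id, mapping = cached
        return dataset_id, mapping, True
    dataset_id, mapping = classify_and_map(fields)
    cache.put(identity, dataset_id, mapping)
    return dataset_id, mapping, False


calls = []


def fake_classify_and_map(fields):
    # Placeholder for the model-based path; records each invocation.
    calls.append(tuple(fields))
    return "QQ", {f: f.replace("chat ", "") for f in fields}


cache = MappingCache()
first = resolve_mapping(["chat time", "chat content"], cache, fake_classify_and_map)
second = resolve_mapping(["chat content", "chat time"], cache, fake_classify_and_map)
```

The second call presents the same fields in a different order, hits the cache and never reaches the slow path.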

Further, on the basis of any of the above embodiments, fig. 3b illustrates a technical route diagram for data access provided by the embodiment of the present invention, and as shown in fig. 3b, a person skilled in the art may implement automatic access of external data according to the technical route for data access.

Example four

Fig. 4 is a schematic structural diagram of a data access apparatus according to a fourth embodiment of the present invention, and as shown in fig. 4, the apparatus includes: a first field to be mapped obtaining module 410, a data set vector to be matched obtaining module 420, a first standard data set determining module 430, a first field mapping relationship establishing module 440, and a first external data set accessing module 450, wherein:

a first field to be mapped obtaining module 410, configured to obtain at least one first field to be mapped corresponding to a first external data set to be accessed;

a to-be-matched data set vector obtaining module 420, configured to obtain a to-be-matched data set vector corresponding to the first external data set according to a word vector corresponding to each participle in the first to-be-mapped field;

a first standard data set determining module 430, configured to determine, according to the to-be-matched data set vector and a pre-trained data set classification model, a first standard data set that matches the first external data set;

a first field mapping relationship establishing module 440, configured to establish a first field mapping relationship between the first field to be mapped and the standard field according to a similarity between the standard field and the first field to be mapped in the first standard data set;

a first external data set accessing module 450, configured to access the first external data set into the first standard data set according to the first field mapping relationship.

On the basis of the above embodiments, the apparatus may further include, after the first field mapping relationship establishing module:

a first field identity calculation module, configured to calculate a first field identity corresponding to the first field to be mapped;

a mapping cache table storage module, configured to store the first field identity, the first standard data set identity, and the first field mapping relationship in a mapping cache table;

a second field to be mapped obtaining module, configured to obtain at least one second field to be mapped corresponding to a second external data set to be accessed;

a second field identity calculation module, configured to calculate a second field identity corresponding to the second field to be mapped;

a second field identity judgment module, configured to, if it is determined that the second field identity is stored in the mapping cache table, obtain from the mapping cache table a second standard data set identity and a second field mapping relationship corresponding to the second field identity;

a second external data set access module, configured to access the second external data set into the second standard data set according to the second field mapping relationship;

the module for obtaining the vector of the data set to be matched comprises:

the word segmentation submodule is used for determining each word segmentation of the first field to be mapped according to a pre-trained word segmentation device;

the word vector submodule is used for obtaining a word vector corresponding to each participle according to the word vector model;

the combination submodule is used for combining the word vectors corresponding to the participles to obtain a data set vector to be matched corresponding to the first external data set;

the first field identity calculation module comprises:

the sorting submodule is used for sorting the first fields to be mapped according to a preset sorting rule;

the conversion submodule is used for merging the sorted first fields to be mapped into a long character string;

the calculation submodule is used for taking the hash value of the long character string as the first field identity according to a hash algorithm;

in the first standard data set determining module, the data set classification model is a multi-layer perceptron MLP model;

in the word segmentation submodule, the word segmentation device is a conditional random field CRF model;

in the word vector submodule, the word vector model is a Skip-Gram model based on Hierarchical Softmax.
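The field identity calculation (sorting, merging into a long character string, hashing) and the mapping cache table lookup described above can be sketched as follows. This is only an illustration under stated assumptions: lexicographic order is assumed as the preset sorting rule, the `|` separator and SHA-256 are assumed choices (the embodiment only requires a hash algorithm), and the in-memory dictionary `mapping_cache` together with the helper names `store_mapping` and `lookup_mapping` are hypothetical.

```python
import hashlib

# Hypothetical in-memory mapping cache table:
# field identity -> (standard data set identity, field mapping relationship)
mapping_cache = {}


def field_identity(field_names):
    """Sort the fields, merge them into one long string, and hash that string."""
    long_string = "|".join(sorted(field_names))  # separator is an assumption
    return hashlib.sha256(long_string.encode("utf-8")).hexdigest()


def store_mapping(field_names, standard_dataset_id, field_mapping):
    """Store the mapping established for a first external data set."""
    mapping_cache[field_identity(field_names)] = (standard_dataset_id, dict(field_mapping))


def lookup_mapping(field_names):
    """Return the cached (data set identity, mapping) for these fields, or None."""
    return mapping_cache.get(field_identity(field_names))


# First external data set: the mapping is established once and cached.
store_mapping(
    ["sender", "receiver", "send_time"],
    "sms_standard",
    {"sender": "from_number", "receiver": "to_number", "send_time": "time"},
)

# Second external data set with the same fields in a different order: cache hit.
hit = lookup_mapping(["send_time", "sender", "receiver"])
# A data set with a different field set: cache miss, so the full matching flow is needed.
miss = lookup_mapping(["sender", "receiver"])
```

Sorting before hashing makes the identity independent of the order in which the external data set lists its fields, so a second data set with the same field set reuses the cached mapping without running classification and similarity matching again.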

EXAMPLE five

Fig. 5 is a schematic structural diagram of a computer apparatus according to a fifth embodiment of the present invention, as shown in fig. 5, the apparatus includes a processor 50, a memory 51, an input device 52, and an output device 53; the number of processors 50 in the device may be one or more, and one processor 50 is taken as an example in fig. 5; the processor 50, the memory 51, the input means 52 and the output means 53 of the device may be connected by a bus or other means, as exemplified by the bus connection in fig. 5.

The memory 51, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the data access method in the embodiments of the present invention (for example, the first field to be mapped obtaining module 410, the data set vector to be matched obtaining module 420, the first standard data set determining module 430, the first field mapping relationship establishing module 440, and the first external data set accessing module 450 in the data access apparatus). The processor 50 executes the various functional applications and data processing of the device, that is, implements the above data access method, by running the software programs, instructions and modules stored in the memory 51.

The memory 51 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 51 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 51 may further include memory located remotely from the processor 50, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 52 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output device 53 may include a display device such as a display screen.

The computer device can execute the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.

EXAMPLE six

The sixth embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a data access method. The computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the data access method provided in any embodiment of the present invention.

From the above description of the embodiments, it will be clear to those skilled in the art that the present invention can be implemented by software together with the necessary general-purpose hardware, and can also be implemented by hardware alone, although the former is the better implementation in many cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the data access apparatus, each unit and each module included in the apparatus are only divided according to functional logic, but are not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing only illustrates the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made by those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the spirit of the present invention; the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method for data access, comprising:

2. The method according to claim 1, wherein after establishing a first field mapping relationship between the first field to be mapped and the standard field according to a similarity between the standard field in the first standard data set and the first field to be mapped, the method further comprises:

calculating a first field identity corresponding to the first field to be mapped;

and storing the first field identity, the first standard data set identity and the first field mapping relation in a mapping cache table.

3. The method of claim 2, further comprising:

acquiring at least one second field to be mapped corresponding to a second external data set to be accessed;

calculating a second field identity corresponding to the second field to be mapped;

if the second field identity is determined to be stored in the mapping cache table, acquiring a second standard data set identity and a second field mapping relation corresponding to the second field identity in the mapping cache table;

and accessing the second external data set into the second standard data set according to the second field mapping relation.

4. The method of claim 1, wherein obtaining a to-be-matched dataset vector corresponding to the first external dataset from a word vector corresponding to each participle in the first to-be-mapped field comprises:

determining each participle of the first field to be mapped according to a pre-trained word segmentation device;

obtaining a word vector corresponding to each participle according to a word vector model;

and combining the word vectors corresponding to the participles to obtain a data set vector to be matched corresponding to the first external data set.

5. The method of claim 2, wherein calculating the first field identity corresponding to the first field to be mapped comprises:

if the number of the first fields to be mapped is determined to be at least two, sequencing the first fields to be mapped according to a preset sequencing rule;

merging the sorted first fields to be mapped into a long character string;

and calculating the hash value of the long character string according to a hash algorithm to be used as the first field identity.

6. The method of claim 4, wherein:

the data set classification model is a multi-layer perceptron MLP model;

the word segmentation device is a conditional random field CRF model;

the word vector model is a Skip-Gram model based on Hierarchical Softmax.

7. An apparatus for data access, comprising:

8. The apparatus of claim 7,

the module for obtaining the vector of the data set to be matched comprises:

the first field identity calculation module comprises:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of data access according to any of claims 1-6 when executing the program.

10. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the method of data access according to any one of claims 1 to 6.