CN109947858B - Data processing method and device

Info

Publication number: CN109947858B
Application number: CN201710619053.5A
Authority: CN (China)
Other versions: CN109947858A (Chinese)
Inventor: 管蓉
Assignee: Tencent Technology Shenzhen Co Ltd
Legal status: Active (granted)
Prior art keywords: training data, element group, matrix, training, data

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method and a data processing device. The method comprises: acquiring a training data set to be processed, wherein the training data set comprises at least two pieces of training data that have undergone semantic analysis; performing cluster analysis on the training data in the training data set to obtain a target data set, wherein the target data set comprises at least two pieces of training data whose similarity is higher than a preset similarity; and mapping each piece of training data in the target data set to the same category directory, wherein the category directory is used for providing an entry for acquiring the training data under it. With this scheme, the accuracy of multi-data-source mapping can be improved: training data that differ in form but are identical or similar in semantics can be accurately identified, improving the reliability and fault tolerance of the mapping.

Description

Data processing method and device
Technical Field
The present application relates to the field of big data processing technologies, and in particular, to a data processing method and apparatus.
Background
At present, in the field of big data processing, some O2O websites provide users with introduction information for various merchants, organizations, and the like. Such a website acquires data sources provided by multiple third parties and then maps the data sources describing the same merchant or organization to the same directory for the user to select from. However, when the website maps multiple data sources, mapping may fail because a third party's data sources may be insufficiently detailed or non-standard, or may even contain partially inaccurate information. The prevailing approach is character string matching, i.e., exact matching and fuzzy substring matching: third-party data sources are mapped to the same directory only when they are completely or partially identical in form. Consequently, when a data source is not standardized or its information has changed, string matching may fail to complete the mapping; its recognition rate is limited and its fault tolerance is low.
Although the existing multidimensional matching approach can match on each piece of local information in a data source, it fails whenever some of that local information is inconsistent. For example, when several third parties simultaneously provide data for one hospital, if the hospital names do not match exactly and information such as telephone numbers, addresses, or doctors differs, the several pieces of hospital data are deemed unmappable to the same directory. In reality, one hospital comprises multiple departments, multiple doctors, multiple telephone numbers, and so on; the hospital data provided by the third parties may simply be incomplete while still describing the same hospital. The mapping rate of the multidimensional matching approach is therefore low, and its fault tolerance is likewise low.
Disclosure of Invention
The application provides a data processing method and device, which can solve the problem in the prior art that the mapping rate of multi-data-source mapping is not high.
A first aspect of the present application provides a method for data processing, where the method includes:
acquiring a training data set to be processed, wherein the training data set comprises at least two pieces of training data that have undergone semantic analysis;
performing cluster analysis on the training data in the training data set to obtain a target data set, wherein the target data set comprises at least two pieces of training data whose similarity is higher than a preset similarity;
and mapping each piece of training data in the target data set to the same category directory, wherein the category directory is used for providing an entry for acquiring the training data under the category directory.
A second aspect of the present application provides an apparatus for processing data having functions to implement the method of data processing provided by the first aspect described above. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions; the modules may be software and/or hardware.
In one possible design, the apparatus includes:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a training data set to be processed, and the training data set comprises at least two training data subjected to semantic analysis;
the processing module is used for carrying out cluster analysis on the training data in the training data set acquired by the acquisition module to obtain a target data set, and the target data set comprises at least two pieces of training data with the similarity higher than a preset similarity;
and the mapping module is used for mapping each training data in the target training set obtained by the processing module to the same category directory, and the category directory is used for providing an entry for acquiring the training data in the category directory.
A further aspect of the application provides an apparatus for processing data comprising at least one processor, a memory, a transmitter, and a receiver that are connected to one another, wherein the memory is configured to store program code, and the processor is configured to invoke the program code in the memory to perform the method of the first aspect.
A further aspect of the present application provides a computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the method of the first aspect described above.
Yet another aspect of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of the above-described aspects.
Compared with the prior art, in the scheme provided by the application, the acquired training data set comprises at least two pieces of training data that have undergone semantic analysis; the semantic-analysis preprocessing thus gives a preliminary, coarse judgment of which training data map to the same category directory and narrows the mapping range. Cluster analysis is then performed on the training data in the training data set to obtain a target data set comprising at least two pieces of training data whose similarity is higher than the preset similarity; because the cluster analysis identifies the training data with higher similarity, the training data that can truly be mapped to the same category can be further determined. Finally, each piece of training data in the target data set is mapped to the same category directory. The application can therefore improve the accuracy of multi-data-source mapping, accurately identify training data that differ in form but are identical or similar in semantics, and improve the reliability and fault tolerance of the mapping.
Drawings
FIG. 1 is a schematic diagram of a network topology according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a method of data processing according to an embodiment of the present invention;
FIG. 3-a is a diagram illustrating element group partitioning according to an embodiment of the present invention;
FIG. 3-b is a schematic diagram of a second matrix in an embodiment of the invention;
FIG. 4 is a diagram illustrating a method of data processing according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a word frequency matrix according to an embodiment of the present invention;
FIG. 6 is a diagram of a TF-IDF matrix according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating hospital similarity ranking according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an apparatus for data processing according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of another structure of an apparatus for data processing according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of another structure of a server for data processing according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a mobile phone for data processing according to an embodiment of the present invention.
Detailed Description
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprise," "include," and "have," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus, wherein the division of modules presented herein is merely a logical division and may be implemented in a practical application in a different manner, such that a plurality of modules may be combined or integrated into another system or that certain features may be omitted or not implemented, and wherein shown or discussed as coupled or directly coupled or communicatively coupled to each other via interfaces and indirectly coupled or communicatively coupled to each other via electrical or other similar means, all of which are not intended to be limiting in this application. The modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed in a plurality of circuit modules, and some or all of the modules may be selected according to actual needs to achieve the purpose of the present disclosure.
The application provides a data processing method and device that can be used in the field of big data processing, for example to collect data provided by third-party platforms, such as merchant details provided by various websites, associate the merchant details belonging to the same merchant under the same directory, and provide browsing services for users. For example, suppose 4 third-party platforms provide hospital data about the Shenzhen First People's Hospital. Although the hospital data provided by the 4 platforms may be inconsistent in details such as hospital name, hospital address, doctors, departments, or department telephone numbers, data analysis shows that the 4 pieces of hospital data all essentially describe the Shenzhen First People's Hospital, so the 4 pieces of hospital data are associated under the same hospital directory for the user to view and select. Fig. 1 is a schematic diagram of a network topology for collecting and processing multiple data sources. In fig. 1, a server interacts with multiple terminal devices and collects hospital data 1, hospital data 2, ..., hospital data n from them. After the hospital data are collected, they are pre-screened to obtain a set of similar hospital data; cluster analysis is then performed on the hospital data in that set, mapping the hospital data from word space to a semantic space to obtain the pieces of hospital data whose similarity exceeds a preset threshold. Finally, the several top-ranked pieces of hospital data by similarity are mapped to the same hospital directory and provided to the online platform, so that a patient can autonomously select a hospital for consultation on the online platform.
It should be noted that the terminal device referred to in this application may be a device providing voice and/or data connectivity to a user, a handheld device having a wireless connection function, or another processing device connected to a wireless modem. A wireless terminal may communicate with one or more core networks via a Radio Access Network (RAN), and may be a mobile terminal, such as a mobile phone (or "cellular" phone) or a computer with a mobile terminal, for example a portable, pocket, handheld, computer-embedded, or vehicle-mounted mobile device that exchanges voice and/or data with the RAN. Examples of such devices include Personal Communication Service (PCS) phones, cordless phones, Session Initiation Protocol (SIP) phones, Wireless Local Loop (WLL) stations, and Personal Digital Assistants (PDAs). A wireless terminal may also be referred to as a system, a Subscriber Unit, a Subscriber Station, a Mobile Station, a Remote Station, an Access Point, a Remote Terminal, an Access Terminal, a User Terminal, a Terminal Device, a User Agent, a User Device, or a User Equipment.
In order to solve the technical problems, the application mainly provides the following technical scheme:
the mapping of multiple data sources can be processed based on the potential semantic analysis model, after the multiple data sources are obtained, the semantics of each data source are extracted based on the potential semantic analysis model, and the semantics can be expressed by a mathematical language. And then comparing the similarity of the semantics of the data sources, converting a high-dimensional space formed by all words in the data sources into a low-dimensional semantic space, and performing semantic comparison after abstraction in the semantic space. The present application need not be concerned with the order of occurrence of these terms, but rather may consider two terms to have semantic similarity based on a "co-occurrence" assumption, such as the large number of simultaneous occurrences of the two terms in multiple data sources. For example, a large number of articles describing an automobile may mix an "engine" and an "engine", and based on a potential semantic analysis model, the two terms are considered to have semantic similarity, and the two terms are not considered to be different terms, so that the accuracy of similarity analysis is improved, and the probability of more data sources belonging to the same category can be identified to a certain extent, thereby increasing the fault tolerance of the data sources.
Referring to fig. 2, a data processing method provided in the present application is illustrated as follows, where the data processing method mainly includes:
201. Acquire a training data set to be processed.
The training data set comprises at least two training data subjected to semantic analysis.
Semantic analysis refers to semantic checking and processing according to the grammatical categories recognized by a grammar analyzer, generating corresponding intermediate code or object code. In the application, before the training data are cluster-analyzed, each piece of training data in the training data set can be pre-screened through semantic analysis in order to reduce the workload. This narrows the range of the cluster analysis, screens out the training data with higher similarity, improves the accuracy of the data analysis, and excludes training data that are partially similar but do not in essence belong to the same category directory.
202. Perform cluster analysis on the training data in the training data set to obtain a target data set.
The target data set comprises at least two pieces of training data whose similarity is higher than a preset similarity.
In some embodiments, the target data set may be obtained by the following steps (1) and (2):
(1) Map each piece of training data in the training data set from an element group space to a semantic space.
Mapping training data from the element group space to the semantic space may include:
Firstly, element group division is performed on each piece of training data in the training data set to obtain at least two element group sets, where each element group set comprises at least one element group, each element group set corresponds to one piece of training data, and an element group denotes an inseparable set of at least one element. Fig. 3-a is a schematic diagram of element group division: the original training data comprise element group 1, element group 2, ..., element group n, noise data 1, and noise data 2, where noise data 1 and noise data 2 are data that interfere with bag-of-words training and therefore need to be removed. For example, keywords such as hospital names, telephone numbers, hospital addresses, department names, and doctors in hospital documents are segmented into words; a Chinese word segmentation tool is used for the Chinese word segmentation, which effectively removes noise data such as punctuation marks, stop words, and HyperText Markup Language (HTML) tags, reducing the interference of noise data with the training of the bag-of-words model. The application does not limit the element-group division method or the word segmentation tool.
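As an illustration only, the following Python sketch shows one possible realization of this division and noise removal; the jieba segmenter, the stop-word list, and the regular expressions are assumptions for the example and are not prescribed by the application.

```python
import re
import jieba  # one of several mature, open-source Chinese word segmentation tools

# Hypothetical stop-word list; a real deployment would load a complete list.
STOP_WORDS = {"的", "了", "是", "在"}

def split_into_element_groups(raw_text):
    """Divide one piece of training data into element groups (words),
    removing noise data such as HTML tags, punctuation and stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_text)          # strip HTML tags
    text = re.sub(r"[^\w\u4e00-\u9fff]+", " ", text)  # strip punctuation
    tokens = jieba.lcut(text)                         # Chinese word segmentation
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

# Example: a hospital document consisting of name and address keywords
element_groups = split_into_element_groups("<p>焦煤中央医院，河南省焦作市健康路</p>")
```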
Secondly, the at least two element group sets are respectively vectorized to obtain a first matrix, where the first matrix can be used to represent the frequency with which the at least one element group appears in each element group set. In some embodiments, the first matrix may be obtained by:
respectively vectorizing the at least two element group sets according to the frequency with which the element groups appear in each element group set, obtaining at least two training vectors;
and forming the first matrix from the at least two training vectors so obtained. The application does not limit the manner of obtaining the first matrix.
Fig. 3-b is a schematic diagram of a first matrix, which may be a word frequency matrix (as shown in fig. 5) when analyzing hospital data.
Then, a second matrix is calculated from the weights of the element groups, the frequencies of the element groups, and the first matrix, where the second matrix is used to express the frequency weight values of the element groups.
Finally, bag-of-words model training is performed on the second matrix.
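A minimal sketch of this matrix construction, assuming plain numpy arrays and a natural-logarithm IDF as the element-group weight (the variable names are illustrative):

```python
import numpy as np

def build_first_matrix(element_group_sets, vocabulary):
    """First matrix: each row is an element group (word), each column a piece
    of training data; entries are raw occurrence frequencies."""
    first = np.zeros((len(vocabulary), len(element_group_sets)))
    index = {group: i for i, group in enumerate(vocabulary)}
    for col, groups in enumerate(element_group_sets):
        for g in groups:
            first[index[g], col] += 1
    return first

def build_second_matrix(first):
    """Second matrix: frequency weight values, i.e. frequency multiplied by
    the element-group weight ln(total documents / documents containing it)."""
    n_docs = first.shape[1]
    docs_containing = np.count_nonzero(first > 0, axis=1)
    idf = np.log(n_docs / docs_containing)
    return first * idf[:, np.newaxis]
```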
(2) Calculate the similarity between the training data mapped to the semantic space, and determine the target data set according to the similarity between the training data.
203. Map each piece of training data in the target data set to the same category directory.
The category directory is used for providing an entry for acquiring training data under the category directory.
According to the scheme provided by the application, the acquired training data set comprises at least two pieces of training data that have undergone semantic analysis, so the semantic-analysis preprocessing gives a preliminary judgment of which training data map to the same category directory and narrows the mapping range. Cluster analysis is then performed on the training data in the training data set to obtain a target data set comprising at least two pieces of training data whose similarity is higher than the preset similarity; because the cluster analysis identifies the training data with higher similarity, the training data that can truly be mapped to the same category can be further determined. Finally, each piece of training data in the target data set is mapped to the same category directory. The application can therefore improve the accuracy of multi-data-source mapping, accurately identify training data that differ in form but are identical or similar in semantics, and improve the reliability and fault tolerance of the mapping.
Optionally, in some embodiments of the present invention, the calculated first matrix may be too sparse; in particular, a large amount of training data can make the computation severely expensive, resulting in a long computation time. The first matrix may also contain noise data that interferes with the training of the bag-of-words model. In addition, near-synonyms greatly interfere with the similarity calculation: the computed similarity comes out low, so training data that should be mapped to the same category directory are deemed unmappable. To eliminate these interferences, the application provides the following scheme. In some embodiments, when performing bag-of-words model training on the second matrix, singular value decomposition may be performed on the second matrix based on the bag-of-words model to obtain a left singular matrix, a diagonal matrix, and a right singular matrix, thereby performing dimension reduction on the second matrix and removing the noise data in it.
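The dimension reduction can be sketched with numpy's SVD; truncating to the k largest singular values (k chosen by the implementer) discards the components that mostly carry noise:

```python
import numpy as np

def reduce_dimensions(second_matrix, k):
    """SVD of the second matrix followed by truncation to k dimensions."""
    U, s, Vt = np.linalg.svd(second_matrix, full_matrices=False)
    # U is the left singular matrix, np.diag(s) the diagonal matrix,
    # Vt the right singular matrix; keep only the k largest singular values.
    doc_vectors = np.diag(s[:k]) @ Vt[:k, :]  # documents in the semantic space
    return U[:, :k], s[:k], doc_vectors
```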
The TF-IDF matrix can be trained based on a Bag-of-Words Model (BOWM). The bag-of-words model ignores the grammar and word order of a text and expresses a passage or a document as a set of unordered words; it is mainly used for text classification and is a simplifying assumption in natural language processing and information retrieval. In this model, text (a paragraph or a document) is treated as an unordered collection of words, ignoring grammar and even word order. The basic ideas of the bag-of-words model include:
1. Feature extraction: select features from the data set and describe them to form feature data, e.g., detect SIFT keypoints in an image and then compute keypoint descriptors, generating 128-dimensional feature vectors;
2. Bag-of-words learning: combine all the processed feature data and divide the feature words into several classes with a clustering algorithm, the number of classes being set by the user; each class is equivalent to a visual word;
3. Quantifying image features with the visual bag of words: each image is composed of several visual words, and the statistical word-frequency histogram can indicate the class of the image.
When the bag-of-words model is used for model training, the main steps are feature point extraction and cluster analysis. A cluster is composed of several patterns (Pattern); generally, a pattern is a vector of measurements (Measurement), or a point in a multidimensional space.
Cluster analysis is based on similarity: patterns within one cluster are more similar to each other than patterns in different clusters. In the present application, the cluster analysis can adopt a model-based method (Model-Based Methods), which mainly comprises three steps (a minimal sketch follows the list):
1) Determine an initial cluster center for each cluster, so that there are k initial cluster centers;
2) Assign the samples in the sample set to the nearest-neighbor cluster according to the minimum-distance principle;
3) Use the sample mean within each cluster as the new cluster center, repeating until the cluster centers no longer change.
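A minimal sketch of this iterative procedure (essentially k-means), assuming numpy and Euclidean distance; all names are illustrative and empty clusters are not handled:

```python
import numpy as np

def cluster(samples, k, iterations=100):
    """Assign samples to the nearest cluster center, then recompute each
    center as the mean of its cluster, until the centers stop changing."""
    rng = np.random.default_rng(0)
    centers = samples[rng.choice(len(samples), size=k, replace=False)]
    for _ in range(iterations):
        # Minimum-distance principle: each sample goes to its nearest center
        dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([samples[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):  # centers no longer change
            break
        centers = new_centers
    return labels, centers
```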
In some embodiments, the bag of words model mainly includes a Latent Semantic Analysis (LSA) model and a Probabilistic Latent Semantic Analysis (PLSA) model.
In other embodiments, each word may be mapped by training into a K-dimensional real-valued vector (K is generally a hyper-parameter of the model) based on word vector expression (word2vec), and the semantic similarity between words is determined by the distance between the vectors (such as cosine similarity or Euclidean distance). word2vec uses a three-layer neural network: input layer - hidden layer - output layer. A core technique is Huffman coding by word frequency, so that the hidden-layer activations of words with similar frequencies are essentially consistent and higher-frequency words activate fewer hidden-layer units, which effectively reduces the computational complexity. The application does not limit the model on which the dimension reduction of the second matrix is based.
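A sketch of this word-vector alternative using the gensim library (an assumed toolkit; the application does not name one). Words are mapped to K-dimensional vectors and compared by cosine similarity; the toy corpus is illustrative only:

```python
from gensim.models import Word2Vec

# Toy corpus of segmented sentences; a real corpus would be far larger.
sentences = [["the", "engine", "sound", "is", "loud"],
             ["the", "motor", "sound", "is", "loud"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)  # K = 100

# Semantic similarity of two words = cosine similarity of their vectors
similarity = model.wv.similarity("engine", "motor")
```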
Optionally, in some embodiments of the present invention, each element group has a weight, which can be used to represent the relative importance of the element group in the overall evaluation; keywords in the training data can be effectively distinguished through these weights. Specifically, for a first element group in an element group set, the weight of the first element group is obtained from the total number of element group sets and the number of element group sets containing the first element group, where the first element group refers to any element group in the element group set.
Optionally, in some embodiments of the present invention, after the similarity comparison of the training data, the training data ranked in the top A by similarity (TopA) may further enter the next, precise judgment, namely judgment through association rules, in order to increase the system's fault tolerance. Thus, even if the similarity comparison contains some error and the training data that should be mapped are not ranked first, they will not be missed. In some application scenarios, the value of A may be selected according to the service scenario, the current total number of data sources on the registration platform, and the degree of coincidence of the training data, and may change dynamically. For hospital data, the top 10 can be taken for further association-rule judgment. Specifically, after performing cluster analysis on the training data in the training data set and before associating each piece of training data in the target data set to the same category directory, the embodiment of the present application may further include:
judging whether the training data in the target data set meet a mapping rule, and if they do, mapping all the training data in the target data set to the same category directory.
Optionally, in some embodiments, the mapping rule is:
judging, in descending order of element-group level, whether the element groups in one element group set are the same as or similar, in the semantic space, to the element groups of the same level in another element group set; if so, the mapping rule is met; if not, judging the next level.
For ease of understanding, hospital data on a registration platform is taken as an example below. The LSA model may be deployed in the data processing module of the registration platform. After hospital data from several partners are acquired, the data processing module stores the hospital, department, and doctor data from the hospital data into database tables, referred to as external tables. The data in these external tables are then mapped into internal tables for use by the online modules of the registration platform.
The data processing module belongs to the preprocessing part of the registration platform and can run offline, so a user viewing hospital data on the registration platform is unaware of it. The data processing module processes hospital and department data: the hospital data mainly comprise the hospital name, profile, telephone number, address, city and district information, hospital nature and level, and so on; the department data mainly comprise the department name, introduction, doctor introductions, and so on. Information expressed in natural language in the external tables, such as hospital names, aliases, profiles, address information, and telephone numbers, is extracted to form a document describing the hospital. The hospital data provided by operators are preliminarily screened by judging the similarity between these documents; the preliminary screening yields several pieces of hospital data with high similarity, which are then subjected to association judgment according to the association rules. These steps are described separately below:
1. training of LSA model-performing document clustering
The LSA model is an unsupervised learning model; the training data need not be labeled in advance, and the hospital documents formed above are the training data. Model training, however, requires a series of processing steps, shown in the LSA model training preparation flow of fig. 4: Chinese word segmentation, document vectorization, calculation of the TF-IDF values of the document set, and training of the LSA model with the TF-IDF matrix. Each is described below:
(1) For Chinese word segmentation there are many mature, open-source word segmentation tools. Punctuation marks, stop words, and HTML tags all constitute noise data in the model training process and need to be removed.
(2) Inspect all the words in the whole document set, assign a number to each word, and calculate the word frequencies.
For example, consider hospital document 1, "Coking Coal Central Hospital, Health Road, Jiaozuo City, Henan Province", which consists of the hospital name and address. After Chinese word segmentation, the result is: coking coal, central, hospital, Henan Province, Jiaozuo City, health, road. Assume that "coking coal" is numbered 8, "central" 52, "hospital" 268, "Henan Province" 500, "Jiaozuo City" 1608, "health" 2112, and "road" 3068. The document vector of this hospital document can then be represented as follows:
[(8,1),(52,1),(268,1),(500,1),(1608,1),(2112,1),(3068,1)]。
Assume another hospital document 2, "Coking Coal Group Central Hospital, Health Road, Jiaozuo, Henan". After Chinese word segmentation and dictionary numbering, its document vector can be represented as follows:
[(8,1),(52,1),(268,1),(297,1),(574,1),(1608,1),(2142,1),(3068,1)]。
It can be seen that there is a certain similarity between the two hospital documents. The dictionaries of the document set composed of the two hospital documents are then merged, giving the following combined word-frequency vector:
[(8,2),(52,2),(268,2),(297,1),(574,1),(1608,2),(2142,1),(3068,2),(500,1),(2112,1)]
Since the dictionary numbers themselves are not useful for model training, only the second-dimension word-frequency value is used; however, the document vector of each document must cover all the dictionary numbers in the dictionary.
The final document vector for hospital document 1 is therefore: [1,1,1,0,0,1,0,1,1,1]
The final document vector for hospital document 2 is: [1,1,1,1,1,1,1,1,0,0]
Considering that the overlap of words between hospital documents is not high, each document vector may be a sparse vector containing a large number of 0s. After the whole document set is vectorized, a sparse matrix of word frequencies is formed, in which each row is a word and each column is a document. The word frequency matrix is shown in fig. 5.
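The (number, frequency) pairs above correspond to the sparse bag-of-words format produced by, for example, gensim's Dictionary and doc2bow (an illustrative choice; the actual numbers depend on the dictionary built over the whole document set):

```python
from gensim import corpora

doc1 = ["coking coal", "central", "hospital", "Henan Province",
        "Jiaozuo City", "health", "road"]
doc2 = ["coking coal", "group", "central", "hospital", "Henan",
        "Jiaozuo City", "health", "road"]

dictionary = corpora.Dictionary([doc1, doc2])       # assigns a number to each word
doc_vectors = [dictionary.doc2bow(d) for d in (doc1, doc2)]
# Each vector is a list of (word number, word frequency) pairs; the dense
# form pads every dictionary number that is absent from a document with 0.
```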
(3) Calculating TF-IDF values of a document collection
TF (Term Frequency) is the word frequency, which the previous steps have already computed. IDF (Inverse Document Frequency) is calculated by dividing the total number of documents by the number of documents containing the word, and then taking the natural logarithm of the quotient.
TF-IDF is TF multiplied by IDF and is equivalent to the weight of the word. Compared with raw word frequency, the TF-IDF value describes a word more reasonably. For example, if a word appears many times in a document, its TF is large; but if the word is also common across the document set, it contributes little to distinguishing documents. IDF solves this problem by weighting the word frequency of each word: the more frequently a word appears across the document set, the smaller its IDF value. The word "hospital" is such a word: almost every document in the hospital data contains it, and its word frequency within each document is high, so its contribution to document similarity would be larger than that of other words, yet its power to distinguish documents is low. It should therefore be given a lower weight to balance the negative influence of its high word frequency, and IDF is that weight.
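As a worked example under purely illustrative numbers — a collection of 1,000 hospital documents in which "hospital" appears in 950 documents and "coking coal" in only 2:

```python
import math

n_docs = 1000                             # illustrative collection size
idf_hospital    = math.log(n_docs / 950)  # ≈ 0.051: common word, low weight
idf_coking_coal = math.log(n_docs / 2)    # ≈ 6.215: rare word, high weight

# TF-IDF = TF × IDF, so even with TF = 3 the common word "hospital"
# contributes little to distinguishing documents.
tfidf_hospital = 3 * idf_hospital         # ≈ 0.154
```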
Multiplying the word frequency matrix by these weights yields the TF-IDF matrix, whose column vectors represent the documents; the TF-IDF matrix is shown in fig. 6.
In some embodiments, once the TF-IDF matrix is obtained, the document vectors (the column vectors of the matrix) are determined and could be used directly to calculate the cosine similarity of the documents, but this direct calculation has three problems:
1. The TF-IDF matrix is too sparse, and when the data volume is large the calculation is very time-consuming;
2. Excessive noise data due to singular values;
3. Interference from near-synonyms.
For hospital data, a document contains the hospital profile, in which many words contribute little to the clustering; these words can be called noise. The document vectors are all high-dimensional sparse vectors, and the usual treatment for noise is dimensionality reduction, which simultaneously alleviates the problem of the matrix being too large and too sparse.
The interference of near-synonyms also greatly affects the similarity calculation. For example, the two documents "Coking Coal Central Hospital, Health Road, Jiaozuo City, Henan Province" and "Coking Coal Group Central Hospital, Health Road, Jiaozuo, Henan" need to be mapped as two data sources of the same hospital, but if the similarity is computed from the TF-IDF vectors it will not be very high, because "Henan Province" in the former and "Henan" in the latter are in fact near-synonyms yet are given different dictionary numbers, making the document vectors differ. A more general case: two documents describing an automobile engine, document one "the engine sound is loud" and document two "the motor sound is loud", have very low similarity computed from their TF-IDF vectors, yet from the perspective of natural language understanding they are extremely similar. This is the near-synonym interference problem: the information in the TF-IDF matrix is not sufficient to judge that "engine" and "motor" are synonyms. A method of transforming the matrix is therefore needed, one that reduces the dimensionality and recognizes synonyms — this is the LSA model.
The LSA model is used as an example below. Model training is performed on the TF-IDF matrix based on the LSA model. The basic principle of training the LSA model from the TF-IDF matrix is Singular Value Decomposition (SVD) from linear algebra: a matrix can be decomposed into the product of three matrices, A = U Σ V^T, where A is the original matrix, U is the left singular matrix, Σ is the diagonal matrix, and V is the right singular matrix. Each row of U represents a category of words related in meaning, and each column of V represents a category of semantically related documents; the singular values in Σ are arranged from large to small, top to bottom. Σ is truncated — say the original n-order square matrix is truncated to order k — and, by the rules of matrix multiplication, U and V must be truncated correspondingly. The product of the truncated matrices does not reduce the number of documents or words; it only merges and decomposes semantics. This is SVD-based matrix dimensionality reduction, retaining the important information of the original matrix. The merging of semantics can be represented by mathematical expressions, as follows:
0.73 × engine + 0.54 × motor + 0.3 × automobile — such a weighted combination of synonyms is the semantic merging of "engine" and "motor";
0.72 × tire + 0.7 × automobile — such a weighted word combination is the semantic merging around "tire";
semantic decomposition can be understood through "automobile" in the expressions above: the 0.3 component of "automobile" is decomposed into semantics related to "engine", and the 0.7 component into semantics related to "tire", because the word "automobile" can carry multiple layers of semantics.
The LSA model's semantic decomposition of the original words is the key point that accomplishes document clustering and dimensionality reduction. More precisely: through semantic decomposition, the mapping from the original word space to the semantic space is completed, and similar documents lie closer together in the semantic space, which completes the clustering of the documents. The LSA model training is then complete.
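Putting the training preparation together, a compact sketch mirroring the flow of fig. 4 — segmentation, vectorization, TF-IDF, LSA training, similarity query — using the gensim library (an assumed toolkit) and toy documents:

```python
from gensim import corpora, models, similarities

# Toy segmented hospital documents (placeholders for the real corpus)
documents = [["coking coal", "central", "hospital", "Henan Province",
              "Jiaozuo City", "health", "road"],
             ["coking coal", "group", "central", "hospital", "Henan",
              "Jiaozuo City", "health", "road"]]

dictionary = corpora.Dictionary(documents)
bow_corpus = [dictionary.doc2bow(d) for d in documents]   # word-frequency vectors

tfidf = models.TfidfModel(bow_corpus)                     # TF-IDF weighting
tfidf_corpus = tfidf[bow_corpus]

# Truncated SVD down to k latent topics: word space -> semantic space
lsa = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsa[tfidf_corpus])

# Rank existing hospital documents by similarity to a document to be mapped
query_doc = ["coking coal", "central", "hospital", "Jiaozuo City", "health", "road"]
query = lsa[tfidf[dictionary.doc2bow(query_doc)]]
ranking = sorted(enumerate(index[query]), key=lambda x: -x[1])
```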
2. Comparison of document similarity
After the LSA model is trained, a hospital document to be mapped — after Chinese word segmentation, vectorization, and TF-IDF calculation — can be mapped into the semantic space using the trained LSA model and then matched for similarity against the other hospital document vectors. The calculation method adopted in this scheme is cosine similarity, with the following formula:
cos θ = (A · B) / (‖A‖ × ‖B‖)
Cosine similarity does not consider the lengths of the vectors, only the angle θ between them, which is well suited to comparing high-dimensional sparse vectors such as document vectors. Fig. 7 shows the similarity ranking, from high to low, between the hospital document "Coking Coal Central Hospital, Health Road, Jiaozuo City, Henan Province" and the existing hospitals in the database. In fig. 7, the first column is the category index, the second the hospital directory index, the third the hospital name, and the fourth the similarity computed with the cosine similarity formula; it can be seen that all similar hospitals are clustered together and the ordering is essentially correct.
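A direct numpy rendering of the cosine-similarity computation, applied to the two document vectors from the earlier example:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (A · B) / (|A| × |B|): independent of vector length,
    suitable for high-dimensional sparse document vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 1])  # hospital document 1
doc2 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])  # hospital document 2
print(cosine_similarity(doc1, doc2))             # ≈ 0.67
```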
Specifically, the mapping of hospital data can be mainly divided into the following two scenarios:
A. Mapping in the initial state
Initially there is hospital data from only one partner, which is used as the reference data, and each hospital has only one piece of hospital data. When other partners later enter data, duplicate hospital data will appear, so similarity comparisons are needed, in which each piece of hospital data from the new partner is compared with all of the reference data.
In the initial state, the hospital data first entered into the database can be used as the reference data. When data is entered later, it is compared for similarity with each piece of hospital data in the database; the several pieces of hospital data ranked highest are then taken, and the association rules decide whether the newly entered hospital data can be mapped to hospital data already in the database.
If there are several pieces of data in the initial state, one of them is selected as the reference data, and similarity comparisons are then performed against it respectively.
B. Subsequent update mapping
For the hospital data of a newly joining partner, the similarity can likewise be compared with each piece of hospital data in the database, yielding a similarity ranking table.
3. Design of association rules
After the similarity pre-judgment, in order to increase the system's fault tolerance, whether the data can be mapped to an existing category directory in the database is judged only after the similarity comparison has passed: the documents ranked in the top 10 by similarity are selected for the next, precise judgment, namely the association-rule judgment. Thus, even if the previous step contains some error — i.e., the hospital to be mapped is not ranked first — it will not be missed. Given the business scenario, the current number of partners on the registration platform, and the number of coinciding hospitals, the top 10 is a suitable value.
In addition, the design of the association rules must be adjusted to the business scenario; for example, hospitals and departments do not have the same data dimensions, so their association rules differ.
The association rules for hospital data in the registration platform have three orders: the first order is that the hospital names are the same => hospital mappable; the second order is that the hospital alias and hospital name are the same => hospital mappable; the third order is that the city code, district code, and telephone number are all the same => hospital mappable. The three orders are applied as follows: if the first-order rule succeeds, the second- and third-order rules are not considered; if the first order fails, the second order is examined; if the second order also fails, the third order is examined. If none of the three orders succeeds, it is considered that there is no mappable hospital, and a new hospital is created for the partner or the case is submitted for manual review.
The association rules for department data are likewise designed with three orders: the first order is that the department names are the same => department mappable; the second order is that the department names have a containment relationship => department mappable; the third order is that the doctor names under the department have a matching rate of 60% => department mappable. The three orders are applied in the same way as for hospitals.
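The cascade described above can be sketched as a short function; the record field names are illustrative assumptions about how hospital data might be structured, not the application's schema:

```python
def hospital_mappable(new, existing):
    """Apply the three association-rule orders in sequence; a higher
    order is examined only when the lower orders fail."""
    # First order: hospital names are the same
    if new["name"] == existing["name"]:
        return True
    # Second order: the alias of one equals the name of the other
    if (new["name"] in existing.get("aliases", [])
            or existing["name"] in new.get("aliases", [])):
        return True
    # Third order: city code, district code and telephone number all match
    if (new["city_code"], new["district_code"], new["phone"]) == \
       (existing["city_code"], existing["district_code"], existing["phone"]):
        return True
    return False  # no mappable hospital: create a new one or send to manual review
```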
The application scenario of the second-order association rule of the hospital is as follows:
daqing Hospital VS Chongqing third military medical university Hospital [0.634987]
The similarity between the two hospitals is 0.634987, and the mapping is successful because the former is an alias of the latter.
The application scenario of the third-order association rule of the hospital is as follows:
shenzhen city traditional Chinese medicine Jindi seascape community health service center 440300 440304 0755-23811165
Jindi Jing Jie kang 440300 440304 0755-23811165 [0.775298]
The similarity between the two hospitals is 0.775298; because the city codes, district codes, and telephone numbers of the two are consistent, the mapping finally succeeds.
The application scenario of the second-order association rule of departments is as follows:
medical cosmetology department (master yard) VS medical cosmetology department [0.94992]
Because the names have a containment relationship, the departments are mapped.
The application scenario of the third-order association rule of departments is as follows:
1158 department of procreation medicine (North institute) Wangjunxia Chenhuazhou Jianjun Wanfen
1158 25347 North institute of China, center of reproduction, royal membrane, chenhua royal Junxia Zhou Jianjun (0.536493)
The 60% coincidence rate for doctor names in the third-order department rule is an empirical value and can be adjusted for different application scenarios. Doctor names are included in the association-rule judgment because department information is very sparse compared with hospital information: a department has no address or telephone information, and the introduction field of departments has a very high null rate — above 50% on the current registration platform — so using doctor names in the association rules is quite necessary. Doctor names are not incorporated into the training data of the LSA model because a doctor's name can be regarded as a proper noun that cannot be decomposed or merged.
The theoretical basis of the LSA model is the SVD decomposition of linear algebra; a drawback of this scheme is that the latent semantics are not interpretable. The application therefore also provides another bag-of-words model, Probabilistic Latent Semantic Analysis (PLSA), which is based on probability: the semantics are a latent variable, the basic idea is likewise a space transformation, but the theoretical support is probability theory, which gives better interpretability — theoretically a better model. In the application scenario of hospital and department data mapping, however, PLSA is less effective than LSA. Among 513 hospitals, there are 32 mappable hospitals; the comparison of LSA and PLSA is shown in Table 1 below:
Model                  Rule 1    Rule 2    Rule 3
Based on LSA model       20         1         9
Based on PLSA model      12         2         8
TABLE 1
Here Rule 1, Rule 2, and Rule 3 all denote association rules: Rule 1 denotes the first-order association rule, Rule 2 the second-order association rule, and Rule 3 the third-order association rule.
When Rule1 is used for the association judgment, 20 pieces of hospital data that can be mapped to the same hospital can be associated based on the LSA model, and 12 pieces of hospital data that can be mapped to the same hospital can be associated based on the PLSA model.
Because association with Rule 1 may miss some hospital data that can be mapped to the same hospital, the second-order association judgment is performed next, i.e., association judgment with Rule 2. The result: based on the LSA model, 1 further piece of hospital data could be associated to the same hospital; based on the PLSA model, 2 pieces.
Similarly, because association with Rule 2 may still miss some mappable hospital data, the third-order association judgment is performed, i.e., association judgment with Rule 3. The result: based on the LSA model, 9 further pieces of hospital data could be associated to the same hospital; based on the PLSA model, 8 pieces.
Finally, a total of 30 hospital data pieces that can be mapped to the same hospital are identified based on the LSA model, and a total of 22 hospital data pieces that can be mapped to the same hospital are identified based on the PLSA model.
In some embodiments, the LSA model may be deployed alone, the PLSA model may be deployed alone, or the LSA and PLSA models may both be deployed for parallel computation, which can effectively improve computational efficiency; results can be pushed to the client in time, and for a user of the client the change in background data is imperceptible.
In some embodiments, the application may also build on an extension of word2vec: word2vec is a word-vector expression and can be generalized to document-vector expression, so similarity comparison can be performed between such document vectors. Models of this kind take the order of words — i.e., their context — into account and can therefore fit natural language better than bag-of-words models (such as the LSA and PLSA models).
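One such extension is the Doc2Vec model available in gensim (an illustrative choice), which learns one vector per document while taking word context into account:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy segmented documents (placeholders for real hospital documents)
documents = [["coking", "coal", "central", "hospital", "health", "road"],
             ["people", "hospital", "shenzhen", "health", "center"]]

tagged = [TaggedDocument(words, [i]) for i, words in enumerate(documents)]
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, epochs=20)

# Infer a vector for a new document and rank the known documents by similarity
vec = model.infer_vector(["coking", "coal", "group", "hospital", "health", "road"])
most_similar = model.dv.most_similar([vec], topn=2)
```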
The following describes an apparatus for executing the above data processing method. The apparatus may be a server or a terminal device, or an interactive application installed on the server or terminal device; the description below mainly takes the server, with the interactive application installed on it, as an example.
1. Referring to fig. 8, which illustrates the apparatus 80 for data processing, the apparatus 80 may include:
an obtaining module 801, configured to obtain a training data set to be processed, where the training data set includes at least two pieces of training data that have undergone semantic analysis;
a processing module 802, configured to perform cluster analysis on the training data in the training data set acquired by the obtaining module 801 to obtain a target data set, where the target data set includes at least two pieces of training data whose similarity is higher than a preset similarity;
a mapping module 803, configured to map each piece of training data in the target data set obtained by the processing module 802 to the same category directory, where the category directory is used to provide an entry for obtaining the training data under the category directory.
In the embodiment of the present application, the training data set acquired by the obtaining module 801 includes at least two pieces of training data that have undergone semantic analysis; the semantic-analysis preprocessing thus gives a preliminary, coarse judgment of which training data map to the same category directory and narrows the mapping range. The processing module 802 then performs cluster analysis on the training data in the training data set to obtain a target data set including at least two pieces of training data whose similarity is higher than a preset similarity; because the cluster analysis identifies the training data with higher similarity, the training data that can truly be mapped to the same category can be further determined. Finally, the mapping module 803 maps each piece of training data in the target data set to the same category directory. The apparatus can therefore improve the accuracy of multi-data-source mapping, accurately identify training data that differ in form but are identical or similar in semantics, and improve the reliability and fault tolerance of the mapping.
Optionally, in some embodiments of the present invention, the processing module 802 is specifically configured to:
mapping each piece of training data in the training data set from an element group space to a semantic space;
and calculating the similarity between the training data mapped to the semantic space, and determining the target data set according to the similarity between the training data.
Optionally, in some embodiments of the present invention, the processing module 802 is specifically configured to:
performing element group division on each piece of training data in the training data set to obtain at least two element group sets, where each element group set comprises at least one element group, each element group set corresponds to one piece of training data, and an element group denotes an inseparable set of at least one element;
vectorizing the at least two element group sets respectively to obtain a first matrix, where the first matrix is used to represent the frequency with which at least one element group appears in each element group set;
calculating a second matrix according to the weights of the element groups, the frequencies of the element groups, and the first matrix, where the second matrix is used to express the frequency weight values of the element groups;
and performing bag-of-words model training on the second matrix.
Optionally, in some embodiments of the present invention, the processing module 802 is specifically configured to:
perform singular value decomposition on the second matrix based on the bag-of-words model to obtain a left singular matrix, a diagonal matrix, and a right singular matrix, thereby performing dimension reduction on the second matrix to remove the noise data in it.
Optionally, in some embodiments, the weight of a first element group is obtained according to the total number of element group sets and the number of element group sets containing the first element group, where the first element group refers to any element group in the element group set.
Optionally, in some embodiments of the present invention, the processing module 802 is specifically configured to:
vectorizing the at least two element group sets respectively according to the frequency of the element groups appearing in each element group set to obtain at least two training vectors;
and forming the first matrix according to the at least two obtained training vectors.
Optionally, in some embodiments of the present invention, after performing cluster analysis on the training data in the training data set and before associating each piece of training data in the target data set to the same category directory, the processing module 802 is further configured to:
judge whether the training data in the target data set meet a mapping rule, and if they do, map all the training data in the target data set to the same category directory.
Optionally, in some embodiments of the present invention, the mapping rule is:
judging, in descending order of element-group level, whether the element groups in one element group set are the same as or similar, in the semantic space, to the element groups of the same level in another element group set; if so, the mapping rule is met; if not, judging the next level.
The present application also provides a computer storage medium storing a program which, when executed, performs some or all of the steps of the data processing method performed by the data processing apparatus.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform some or all of the steps of the data processing method performed by the data processing apparatus.
The apparatus for data processing in the embodiment of the present invention has been described above from the perspective of modular functional entities; the server and the terminal device in the embodiment of the present invention are described below from the perspective of hardware processing. It should be noted that, in the embodiment shown in fig. 8, the entity device corresponding to the obtaining module may be an input/output unit, and the entity device corresponding to the processing module may be a processor. The apparatus shown in fig. 8 may have the structure shown in fig. 9; in that case, the processor and the input/output unit in fig. 9 implement the same or similar functions as the processing module and the obtaining module provided in the corresponding apparatus embodiment, and the memory in fig. 9 stores the program code that the processor calls when executing the data processing method.
Fig. 10 is a schematic diagram of a server 1000 according to an embodiment of the present invention. The server 1000 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 1022 (e.g., one or more processors), a memory 1032, and one or more storage media 1030 (e.g., one or more mass storage devices) storing applications 1042 or data 1044. The memory 1032 and the storage medium 1030 may provide transient or persistent storage. The program stored on the storage medium 1030 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and to execute, on the server 1000, the series of instruction operations stored in the storage medium 1030.
The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so forth.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 10.
As shown in fig. 11, for convenience of description, only the parts related to the embodiment of the present invention are shown; for specific technical details that are not disclosed, refer to the method part of the embodiment of the present invention. The terminal device may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, an in-vehicle computer, and the like. The following takes a mobile phone as an example:
Fig. 11 is a block diagram showing a partial structure of a mobile phone related to the terminal device provided in an embodiment of the present invention. Referring to fig. 11, the mobile phone includes: a radio frequency (RF) circuit 1111, a memory 1120, an input unit 1130, a display unit 1140, a sensor 1150, an audio circuit 1160, a wireless fidelity (WiFi) module 1170, a processor 1180, and a power supply 1190. Those skilled in the art will appreciate that the mobile phone structure shown in fig. 11 is not limiting; the phone may include more or fewer components than shown, combine certain components, or arrange the components differently.
The following specifically describes each component of the mobile phone with reference to fig. 11:
The RF circuit 1111 may be configured to receive and send signals during information transmission and reception or during a call. In particular, it receives downlink information from a base station and forwards it to the processor 1180 for processing, and it sends uplink data to the base station. In general, the RF circuit 1111 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1111 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Message Service (SMS), and the like.
The memory 1120 may be used to store software programs and modules, and the processor 1180 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data and a phonebook). Further, the memory 1120 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 1130 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1130 may include a touch panel 1131 and other input devices 1132. The touch panel 1131, also referred to as a touch screen, can collect touch operations of a user on or near it (for example, operations performed on or near the touch panel 1131 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection devices according to a preset program. Optionally, the touch panel 1131 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch orientation of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 1180, and it receives and executes commands sent by the processor 1180. The touch panel 1131 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 1131, the input unit 1130 may include other input devices 1132, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 1140 may be used to display information input by the user or provided to the user, as well as the various menus of the mobile phone. The display unit 1140 may include a display panel 1141, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 1131 may cover the display panel 1141; when the touch panel 1131 detects a touch operation on or near it, the operation is transmitted to the processor 1180 to determine the type of the touch event, and the processor 1180 then provides a corresponding visual output on the display panel 1141 according to that type. Although in fig. 11 the touch panel 1131 and the display panel 1141 are shown as two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 1131 and the display panel 1141 may be integrated to implement both functions.
The mobile phone may also include at least one sensor 1150, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 1141 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 1141 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in each direction (generally along three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the posture of the mobile phone (such as landscape/portrait switching, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer or tap detection). Other sensors that can be configured on the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here.
The audio circuit 1160, the speaker 1161, and the microphone 1162 may provide an audio interface between the user and the mobile phone. On one hand, the audio circuit 1160 can transmit an electrical signal, converted from received audio data, to the speaker 1161, which converts it into a sound signal for output; on the other hand, the microphone 1162 converts a collected sound signal into an electrical signal, which the audio circuit 1160 receives and converts into audio data. The audio data is then output to the processor 1180 for processing and sent through the RF circuit 1111 to, for example, another mobile phone, or it is output to the memory 1120 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1170, the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access. Although fig. 11 shows the WiFi module 1170, it is not an essential part of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 1180 is the control center of the mobile phone. It connects the various parts of the whole phone through various interfaces and lines and, by running or executing the software programs and/or modules stored in the memory 1120 and invoking the data stored in the memory 1120, executes the various functions of the phone and processes data, thereby monitoring the phone as a whole. Optionally, the processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor, which mainly handles the operating system, user interfaces, and application programs, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor need not be integrated into the processor 1180.
The mobile phone also includes a power supply 1190 (e.g., a battery) that supplies power to the various components. Preferably, the power supply may be logically connected to the processor 1180 through a power management system, so that charging, discharging, and power-consumption management are handled through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In the embodiment of the present invention, the processor 1180 included in the mobile phone also has the function of controlling execution of the above method procedure performed by the terminal device.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer readable storage medium.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state disk (SSD)), among others.

Claims (11)

1. A method of data processing, the method comprising:
acquiring a training data set to be processed, wherein the training data set comprises at least two training data subjected to semantic analysis; pre-screening each training data in the training data set through the semantic analysis;
performing cluster analysis on the training data in the training data set to obtain a target data set, including: mapping each training data in the training data set from an element group space to a semantic space; calculating the similarity between training data mapped to a semantic space, and determining a target data set according to the similarity between the training data, wherein the target data set comprises at least two training data with the similarity higher than a preset similarity; wherein the mapping each training data in the training data set from an element group space to a semantic space, respectively, comprises: respectively carrying out element group division processing on each training data in the training data set to obtain at least two element group sets, wherein each element group set comprises at least one element group, each element group set corresponds to one piece of training data, and the element groups represent inseparable sets of at least one element; vectorizing the at least two element group sets respectively to obtain a first matrix, wherein the first matrix is used for representing the frequency of at least one element group appearing in each element group set; calculating to obtain a second matrix according to the weight of the element group, the frequency of the element group and the first matrix, wherein the second matrix is used for expressing the frequency weight value of the element group; performing bag-of-words model training on the second matrix;
and mapping each training data in the target training set to the same category directory, wherein the category directory is used for providing an entry for acquiring the training data in the category directory.
2. The method of claim 1, wherein the training of the bag of words model on the second matrix comprises:
and performing singular value decomposition on the second matrix based on the bag-of-words model to obtain a left singular matrix, a diagonal matrix and a right singular matrix, and performing dimension reduction processing on the second matrix to remove noise data in the second matrix.
3. The method of claim 1, wherein the weight of a first element group is obtained according to the total number of element group sets and the number of element group sets that include the first element group, and the first element group refers to any element group in an element group set.
4. The method of claim 3, wherein the vectorizing the at least two element group sets respectively to obtain a first matrix comprises:
respectively carrying out vectorization processing on the at least two element group sets according to the frequency of the element groups appearing in each element group set to obtain at least two training vectors;
and forming the first matrix according to the at least two obtained training vectors.
5. The method according to any one of claims 1-4, wherein after performing the cluster analysis on the training data in the training data set and before mapping each training data in the target training set to the same category directory, the method further comprises:
and judging whether the training data in the target data set meets a mapping rule or not, and if the training data in the target data set meets the mapping rule, mapping each training data in the target training set to the same category directory.
6. The method of claim 5, wherein the mapping rule satisfies:
and judging whether the element groups in one element group set are the same or similar to the element groups in the same level in another element group set in the semantic space according to the descending order of the levels of the element groups, if so, determining that the mapping rule is met, and if not, judging the next level.
7. An apparatus for data processing, the apparatus comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a training data set to be processed, and the training data set comprises at least two training data subjected to semantic analysis; pre-screening each training data in the training data set through the semantic analysis;
the processing module is used for carrying out cluster analysis on the training data in the training data set acquired by the acquisition module to obtain a target data set, and the target data set comprises at least two pieces of training data with the similarity higher than a preset similarity;
the mapping module is used for mapping each training data in the target training set obtained by the processing module to the same category directory, and the category directory is used for providing an entry for acquiring the training data in the category directory;
the processing module is specifically configured to:
mapping each training data in the training data set from an element group space to a semantic space; calculating the similarity between the training data mapped to the semantic space, and determining the target data set according to the similarity between the training data;
the processing module is specifically configured to:
respectively carrying out element group division processing on each training data in the training data set to obtain at least two element group sets, wherein each element group set comprises at least one element group, each element group set corresponds to one piece of training data, and each element group represents an inseparable set of at least one element; vectorizing the at least two element group sets respectively to obtain a first matrix, wherein the first matrix is used for representing the frequency of at least one element group appearing in each element group set; calculating to obtain a second matrix according to the weight of the element group, the frequency of the element group and the first matrix, wherein the second matrix is used for expressing the frequency weight value of the element group; and performing bag-of-words model training on the second matrix.
8. The apparatus of claim 7, wherein the processing module is specifically configured to:
and performing singular value decomposition on the second matrix based on the bag-of-words model to obtain a left singular matrix, a diagonal matrix and a right singular matrix, and performing dimension reduction processing on the second matrix to remove noise data in the second matrix.
9. The apparatus according to claim 8, wherein the processing module is specifically configured to:
vectorizing the at least two element group sets respectively according to the frequency of the element groups appearing in each element group set to obtain at least two training vectors;
and forming the first matrix according to the at least two obtained training vectors.
10. A computer storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 6.
11. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of claims 1 to 6.
CN201710619053.5A 2017-07-26 2017-07-26 Data processing method and device Active CN109947858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710619053.5A CN109947858B (en) 2017-07-26 2017-07-26 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710619053.5A CN109947858B (en) 2017-07-26 2017-07-26 Data processing method and device

Publications (2)

Publication Number Publication Date
CN109947858A (en) 2019-06-28
CN109947858B (en) 2022-10-21

Family

ID=67003894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710619053.5A Active CN109947858B (en) 2017-07-26 2017-07-26 Data processing method and device

Country Status (1)

Country Link
CN (1) CN109947858B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674293B (en) * 2019-08-27 2022-03-25 电子科技大学 Text classification method based on semantic migration
CN111930463A (en) * 2020-09-23 2020-11-13 杭州橙鹰数据技术有限公司 Display method and device
CN114696946B (en) * 2020-12-28 2023-07-14 郑州大学 Data encoding and decoding method and device, electronic equipment and storage medium
CN112650836B (en) * 2020-12-28 2022-11-18 成都网安科技发展有限公司 Text analysis method and device based on syntax structure element semantics and computing terminal
CN113191147A (en) * 2021-05-27 2021-07-30 中国人民解放军军事科学院评估论证研究中心 Unsupervised automatic term extraction method, apparatus, device and medium
CN113420328B (en) * 2021-06-23 2023-04-28 鹤壁国立光电科技股份有限公司 Big data batch sharing exchange system
CN114743681B (en) * 2021-12-20 2024-01-30 健康数据(北京)科技有限公司 Case grouping screening method and system based on natural language processing
CN114732634B (en) * 2022-05-19 2022-11-29 佳木斯大学 Clinical medicine is with preventing neonate and infecting probability analytic system and isolating device thereof

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105284089A (en) * 2013-06-27 2016-01-27 华为技术有限公司 Data transmission method and apparatus

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6751621B1 (en) * 2000-01-27 2004-06-15 Manning & Napier Information Services, Llc. Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
US20070150424A1 (en) * 2005-12-22 2007-06-28 Pegasus Technologies, Inc. Neural network model with clustering ensemble approach
CN100517330C * 2007-06-06 2009-07-22 East China Normal University Word sense based local file searching method
CN103473369A * 2013-09-27 2013-12-25 Tsinghua University Semantic-based information acquisition method and semantic-based information acquisition system
CN106557485B * 2015-09-25 2020-11-06 Beijing Gridsum Technology Co., Ltd. Method and device for selecting text classification training set
CN106021578B * 2016-06-01 2019-07-23 Nanjing University of Posts and Telecommunications An improved text classification algorithm based on clustering and membership-degree fusion
CN106776713A * 2016-11-03 2017-05-31 Sun Yat-sen University A massive short-text clustering method based on word-vector semantic analysis

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105284089A (en) * 2013-06-27 2016-01-27 华为技术有限公司 Data transmission method and apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Peer-to-Peer in Metric Space and Semantic; Zhuge H; IEEE Transactions on Knowledge & Data Engineering; 2007-12-31; Vol. 19, No. 6; pp. 759-771 *
A Text Classification Method Based on Latent Semantic Analysis and Transductive Spectral Graph Algorithm (LSASGT); Dai Xinyu et al.; Acta Electronica Sinica; 2008-08-15 (No. 08); full text *

Also Published As

Publication number Publication date
CN109947858A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109947858B (en) Data processing method and device
CN111339774B (en) Text entity relation extraction method and model training method
CN107943860B (en) Model training method, text intention recognition method and text intention recognition device
CN107665252B (en) Method and device for creating knowledge graph
CN110019825B (en) Method and device for analyzing data semantics
CN107145571B (en) Searching method and device
CN111553162A (en) Intention identification method and related device
CN111414763A (en) Semantic disambiguation method, device, equipment and storage device for sign language calculation
CN107861753A (en) APP generations index, search method and system and readable storage medium storing program for executing
CN115022098B (en) Artificial intelligence safety target range content recommendation method, device and storage medium
CN113761122A (en) Event extraction method, related device, equipment and storage medium
CN112749252A (en) Text matching method based on artificial intelligence and related device
CN111651604A (en) Emotion classification method based on artificial intelligence and related device
CN110597957B (en) Text information retrieval method and related device
CN108536665A (en) A kind of method and device of determining sentence consistency
CN116975295B (en) Text classification method and device and related products
CN114116997A (en) Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium
CN114328908A (en) Question and answer sentence quality inspection method and device and related products
CN113342944A (en) Corpus generalization method, apparatus, device and storage medium
CN116758362A (en) Image processing method, device, computer equipment and storage medium
CN112017740A (en) Disease inference method, device, equipment and storage medium based on knowledge graph
US20200210456A1 (en) Structuring unstructured machine-generated content
CN110781274A (en) Question-answer pair generation method and device
CN113505596A (en) Topic switching marking method and device and computer equipment
CN110263347A (en) A kind of construction method and relevant apparatus of synonym

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant