CN109947858A - Data processing method and apparatus - Google Patents

Data processing method and apparatus Download PDF

Info

Publication number
CN109947858A
CN109947858A
Authority
CN
China
Prior art keywords
element group
training data
training
matrix
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710619053.5A
Other languages
Chinese (zh)
Other versions
CN109947858B (en)
Inventor
管蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201710619053.5A
Publication of CN109947858A
Application granted
Publication of CN109947858B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a data processing method and apparatus. The method includes: obtaining a training data set to be processed, the training data set including at least two pieces of training data that have undergone semantic analysis; performing cluster analysis on the training data in the training data set to obtain a target data set, the target data set including at least two pieces of training data whose similarity exceeds a preset similarity; and mapping each piece of training data in the target data set to the same category directory, the category directory being used to provide an entry for obtaining the training data under it. With this scheme, the accuracy of multi-source data mapping can be improved: training data that differ in surface form but are semantically identical or similar can be recognized accurately, improving the reliability and fault tolerance of the mapping.

Description

Data processing method and apparatus
Technical field
This application relates to the field of big data processing technology, and in particular to a data processing method and apparatus.
Background technique
At present, in the field of big data processing, some O2O websites provide users with introductory information about various merchants, institutions, and the like. Such a website may obtain data sources provided by multiple third parties and then map the data sources describing the same merchant or institution under the same directory for the user to choose from. However, the data sources provided by third parties may suffer from incomplete or non-standard descriptions, or even partially inaccurate information, which can cause the mapping to fail when the website maps multiple data sources. The current approach relies mainly on string matching, i.e., exact matching and fuzzy substring matching: only if the third-party data sources are identical, or partially identical, in surface form are they considered mappable under the same directory. String matching may therefore fail to complete the mapping when a data source is non-standard or its information has changed; its recognition rate is limited and its fault tolerance is low.
Although existing multi-dimensional matching can match each local piece of information of the data sources, it fails when some local information is inconsistent. For example, when several third parties each provide data for the same hospital, and the hospital names do not match exactly while the phone numbers, addresses, or doctors all differ, the several pieces of hospital data are judged not to map under the same directory. In reality, however, a hospital includes information such as multiple departments, multiple doctors, and multiple phone numbers; the hospital data provided by a third party may be incomplete yet still describe the same hospital. Thus the mapping rate of multi-dimensional matching is not high, and its fault tolerance is also low.
Summary of the invention
This application provides a data processing method and apparatus that can solve the problem in the prior art that the mapping rate of multi-source data mapping is low.
A first aspect of this application provides a data processing method, the method comprising:
obtaining a training data set to be processed, the training data set including at least two pieces of training data that have undergone semantic analysis;
performing cluster analysis on the training data in the training data set to obtain a target data set, the target data set including at least two pieces of training data whose similarity exceeds a preset similarity;
mapping each piece of training data in the target data set to the same category directory, the category directory being used to provide an entry for obtaining the training data under it.
A second aspect of this application provides an apparatus for processing data, which has the function of implementing the data processing method provided in the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function; the modules may be software and/or hardware.
In a possible design, the apparatus includes:
an obtaining module, configured to obtain a training data set to be processed, the training data set including at least two pieces of training data that have undergone semantic analysis;
a processing module, configured to perform cluster analysis on the training data in the training data set obtained by the obtaining module, to obtain a target data set including at least two pieces of training data whose similarity exceeds a preset similarity;
a mapping module, configured to map each piece of training data in the target data set obtained by the processing module to the same category directory, the category directory being used to provide an entry for obtaining the training data under it.
Another aspect of this application provides an apparatus for processing data, comprising at least one interconnected processor, a memory, a transmitter, and a receiver, wherein the memory is used to store program code and the processor is used to invoke the program code in the memory to perform the method described in the first aspect.
Another aspect of this application provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method described in the first aspect.
Another aspect of this application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the methods described in the above aspects.
Compared with the prior art, in the scheme provided by this application the obtained training data set includes at least two pieces of training data that have undergone semantic analysis. The semantic-analysis preprocessing thus gives a coarse judgment of the training data likely to be mapped to the same category directory, narrowing the mapping range. Cluster analysis is then performed on the training data in the set, yielding a target data set of at least two pieces of training data whose similarity exceeds a preset similarity; since the cluster analysis identifies the most similar training data, it can be further determined which data can truly be mapped to the same category. Finally, each piece of training data in the target data set is mapped to the same category directory. This application can therefore improve the accuracy of multi-source data mapping, accurately recognize training data that differ in surface form but are semantically identical or similar, and improve the reliability and fault tolerance of the mapping.
Detailed description of the invention
Fig. 1 is a schematic diagram of a network topology in an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a data processing method in an embodiment of the present invention;
Fig. 3-a is a schematic diagram of element-group division in an embodiment of the present invention;
Fig. 3-b is a schematic diagram of the second matrix in an embodiment of the present invention;
Fig. 4 is a schematic diagram of a data processing method in an embodiment of the present invention;
Fig. 5 is a schematic diagram of a frequency matrix in an embodiment of the present invention;
Fig. 6 is a schematic diagram of a TF-IDF matrix in an embodiment of the present invention;
Fig. 7 is a schematic diagram of a hospital similarity ranking in an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an apparatus for data processing in an embodiment of the present invention;
Fig. 9 is another schematic structural diagram of an apparatus for data processing in an embodiment of the present invention;
Fig. 10 is another schematic structural diagram of a server for data processing in an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a mobile phone for data processing in an embodiment of the present invention.
Specific embodiment
The terms "first", "second", and the like in the description, claims, and drawings of this application are used to distinguish similar objects and are not intended to describe a particular order or sequence. It should be understood that data so termed are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described. In addition, the terms "comprising" and "having" and any variants thereof are intended to cover non-exclusive inclusion: a process, method, system, product, or device comprising a series of steps or modules is not necessarily limited to the steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to the process, method, product, or device. The division into modules in this application is only a logical division; other divisions are possible in actual implementation. For example, multiple modules may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the couplings, direct couplings, or communication connections shown or discussed may be implemented through interfaces; indirect couplings or communication connections between modules may be electrical or of other similar form, which is not limited in this application. Modules or sub-modules described as separate components may or may not be physically separate, may or may not be physical modules, and may be distributed among multiple circuit modules; some or all of them may be selected according to actual needs to achieve the purpose of the solution of this application.
This application provides a data processing method and apparatus that can be used in the field of big data processing, for example for collecting data provided by third-party platforms, such as merchant details provided by individual websites, associating the details that belong to the same merchant under the same directory, and providing a browsing service for users. For example, four third-party platforms each provide hospital data about First People's Hospital, Shenzhen. Although the detailed information in these four pieces of hospital data, such as the hospital name, hospital address, doctors, departments, or department phone numbers, may be inconsistent, analysis of the data shows that all four essentially describe First People's Hospital, Shenzhen, so the four pieces of hospital data are associated under the same hospital directory for users to view and select. Fig. 1 is a schematic diagram of a network topology for collecting and processing multiple data sources. In Fig. 1, a server can interact with multiple terminal devices and collect hospital data 1, hospital data 2, ..., hospital data N from these terminal devices. After collecting the hospital data, the server can first pre-screen them to filter out a set of similar hospital data, and then perform cluster analysis on the hospital data in that set, mapping the hospital data from word space to semantic space to obtain the pieces of hospital data whose similarity exceeds a preset threshold. Finally, the top-ranked pieces of hospital data are mapped under the same hospital directory and made available to the online platform, allowing patients to select on the platform the hospital they want to view.
It should be noted that the terminal device involved in this application may be a device that provides voice and/or data connectivity to a user, a handheld device with a wireless connection function, or another processing device connected to a radio modem. A wireless terminal may communicate with one or more core networks via a radio access network (RAN). The wireless terminal may be a mobile terminal, such as a mobile phone (also called a "cellular" phone) or a computer with a mobile terminal, for example a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile device, which exchanges voice and/or data with the radio access network. Examples include a Personal Communication Service (PCS) phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, and a Personal Digital Assistant (PDA). A wireless terminal may also be called a system, a subscriber unit, a subscriber station, a mobile station, a mobile, a remote station, an access point, a remote terminal, an access terminal, a user terminal, a user agent, a user device, or user equipment.
To solve the above technical problem, this application mainly provides the following technical solutions.
This application can handle the mapping of multiple data sources based on a latent semantic analysis model. After multiple data sources are obtained, the semantics of each data source are first extracted based on the latent semantic analysis model; the semantics can be expressed in mathematical terms. The similarity of the data sources is then compared semantically: the high-dimensional space formed by the words in each data source is converted into a low-dimensional semantic space, and the comparison is performed on the abstracted semantics in the semantic space. In this way, when similarity is compared, differences between data sources that are inconsistent on particular words but essentially identical or similar in semantics can be ignored. This application does not need to attend to the order in which the words appear; it relies instead on a "co-occurrence" assumption: if, for example, two words occur together extensively across multiple data sources, the two words can be considered semantically similar. For example, many articles describing automobiles may use both "engine" and "motor"; based on the latent semantic analysis model, the two words are considered semantically similar rather than treated as different words. This improves the accuracy of the similarity analysis and, to a certain extent, increases the probability of recognizing data sources that belong to the same category, thereby increasing the fault tolerance for the data sources.
Referring to Fig. 2, a data processing method provided by this application is described below. The method mainly includes the following steps.
201. Obtain a training data set to be processed.
The training data set includes at least two pieces of training data that have undergone semantic analysis.
Here, semantic analysis refers to performing semantic checking and processing according to the grammatical categories recognized by a syntax analyzer, generating corresponding intermediate code or object code. In this application, before cluster analysis is performed on the training data, each piece of training data in the training data set can be pre-screened through semantic analysis to reduce the workload. This narrows the scope of the cluster analysis, filters out the more similar training data, and improves the accuracy of the data analysis by excluding pieces that are similar in places but do not in fact belong under the same category directory.
202. Perform cluster analysis on the training data in the training data set to obtain a target data set.
The target data set includes at least two pieces of training data whose similarity exceeds a preset similarity.
In some embodiments, the target data set can be obtained through the following steps (1) and (2):
(1) Map each piece of training data in the training data set from element-group space to semantic space.
Mapping training data from element-group space to semantic space may include the following.
First, element-group division is performed on each piece of training data in the training data set to obtain at least two element-group sets. An element-group set includes at least one element group, each element-group set corresponds to one piece of training data, and an element group denotes a set of at least one indivisible element. Fig. 3-a is a schematic diagram of element-group division: the original training data contains element group 1, element group 2, ..., element group n, noise data 1, and noise data 2. The noise data interfere with the training of the bag-of-words model, so noise data 1 and noise data 2 need to be removed. As another example, keywords in a hospital document such as the hospital name, phone number, hospital address, department names, and doctors are divided into words. Specifically, a Chinese word segmentation tool can be used, which can effectively remove noise data such as punctuation marks, stop words, and HyperText Markup Language (HTML) tags. After such Chinese word segmentation, the interference of noise data with the bag-of-words training process is reduced. This application does not limit the manner of element-group division or the segmentation tool.
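The noise-removal step above can be sketched as follows. This is a minimal illustration, assuming a simple regex tokenizer and a tiny hypothetical stop-word list in place of a real Chinese word segmentation tool (no specific tool is prescribed by the application); it strips HTML tags and punctuation as described:

```python
import re

# Hypothetical stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "of", "and"}

def tokenize(doc):
    """Split a document into element groups (tokens), dropping noise:
    HTML tags, punctuation, and stop words."""
    doc = re.sub(r"<[^>]+>", " ", doc)        # strip HTML tags
    tokens = re.findall(r"\w+", doc.lower())  # keep word characters only
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize("<b>Central Hospital</b>, Health Road of the Jiaozuo City"))
# ['central', 'hospital', 'health', 'road', 'jiaozuo', 'city']
```

A production pipeline would substitute a proper segmenter for the regex, since Chinese text has no whitespace word boundaries.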
Second, vectorization is performed on each of the at least two element-group sets to obtain a first matrix. The first matrix can be used to indicate the frequency with which each element group occurs in each element-group set. In some embodiments, the first matrix can be obtained by the following operations:
vectorizing each of the at least two element-group sets according to the frequency with which the element groups occur in that set, to obtain at least two training vectors;
forming the first matrix from the at least two training vectors thus obtained. This application does not limit the manner of obtaining the first matrix.
Fig. 3-b is a schematic diagram of the first matrix; when hospital data is analyzed, the matrix can be a frequency matrix (as shown in Fig. 5).
Then, a second matrix is calculated according to the weights of the element groups, the frequencies of the element groups, and the first matrix; the second matrix is used to indicate the weighted frequency values of the element groups.
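The first and second matrices can be illustrated together with a short sketch, assuming the common TF-IDF weighting in which an element group's weight is log(total number of sets / number of sets containing the group); the function name and toy documents are illustrative:

```python
import math

def tfidf_matrix(docs):
    """From raw frequencies (the first matrix) compute TF-IDF weighted
    values (the second matrix), one sparse row per element-group set."""
    n = len(docs)
    df = {}                               # in how many sets a group occurs
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    rows = []
    for doc in docs:
        row = {}
        for term in doc:                  # term frequency
            row[term] = row.get(term, 0.0) + 1.0
        for term in row:                  # weight by idf
            row[term] *= math.log(n / df[term])
        rows.append(row)
    return rows

docs = [["hospital", "health", "road"],
        ["hospital", "group", "road"],
        ["school"]]
m = tfidf_matrix(docs)
# "hospital" occurs in 2 of 3 sets, so its weight log(3/2) is lower
# than that of the set-specific group "health", weighted log(3/1).
print(m[0]["health"] > m[0]["hospital"])  # True
```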
Finally, bag-of-words model training is performed on the second matrix.
(2) Calculate the similarity between the pieces of training data mapped to the semantic space, and determine the target data set according to the similarity between the pieces of training data.
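The similarity in step (2) is typically a cosine similarity between the low-dimensional semantic vectors; a minimal sketch with made-up toy vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two semantic-space vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Two hospital documents close together in semantic space, one unrelated.
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.1]
doc_c = [0.0, 0.1, 0.9]
print(cosine(doc_a, doc_b) > cosine(doc_a, doc_c))  # True
```

Pieces whose pairwise similarity exceeds the preset similarity would then be collected into the target data set.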
203. Map each piece of training data in the target data set to the same category directory.
The category directory is used to provide an entry for obtaining the training data under the category directory.
In the scheme provided by this application, the obtained training data set includes at least two pieces of training data that have undergone semantic analysis; the semantic-analysis preprocessing thus gives a coarse judgment of the training data likely to be mapped to the same category directory, narrowing the mapping range. Cluster analysis is then performed on the training data in the set to obtain a target data set including at least two pieces of training data whose similarity exceeds a preset similarity; since the cluster analysis identifies the most similar training data, it can be further determined which data can truly be mapped to the same category. Finally, each piece of training data in the target data set is mapped to the same category directory. This application can therefore improve the accuracy of multi-source data mapping, accurately recognize training data that differ in surface form but are semantically identical or similar, and improve the reliability and fault tolerance of the mapping.
Optionally, in some embodiments of the invention, the calculated first matrix may be excessively sparse, especially when the volume of training data is large; this leads to a large amount of computation and long computation times. The first matrix may also contain considerable noise data, which interferes with the training of the bag-of-words model. In addition, near-synonyms can substantially interfere with the similarity calculation, so that the computed similarity is lower than it essentially should be, and training data that ought to be mapped under the same category directory are judged unmappable. To eliminate these interferences, this application further provides the following scheme. In some embodiments, when performing bag-of-words training on the second matrix, singular value decomposition can be performed on the second matrix based on the bag-of-words model, yielding a left singular matrix, a diagonal matrix, and a right singular matrix, so as to reduce the dimensionality of the second matrix and remove the noise data in the second matrix.
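The rank reduction can be illustrated with a stdlib-only power iteration that recovers the dominant right singular vector, which is the first direction a rank-k LSA truncation keeps. This is a sketch under simplifying assumptions; a real implementation would compute a full truncated SVD with a numerical library:

```python
def dominant_right_singular_vector(A, iters=200):
    """Power iteration on A^T A: converges to the dominant right
    singular vector of the (documents x element groups) matrix A."""
    m, n = len(A), len(A[0])
    v = [1.0 / n ** 0.5] * n
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(A[i][j] * Av[i] for i in range(m)) for j in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Toy second matrix: rows = documents, columns = element groups.
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 1.0]]
v = dominant_right_singular_vector(A)
print(round(sum(x * x for x in v), 6))  # unit norm: 1.0
```

Projecting each document row onto the top k such directions gives its low-dimensional semantic-space representation.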
This application can train the TF-IDF matrix based on a bag-of-words model (BOWM). The bag-of-words model disregards the grammar and word order of a text and expresses a passage or a document as an unordered set of words; it is one of the simple assumptions used in natural language processing and information retrieval, mainly for text classification. In this model, a text (paragraph or document) is regarded as an unordered collection of words, ignoring grammar and even word order. The basic idea of the bag-of-words model includes:
1. Extract features: select features according to the data set and describe them to form feature data. For images, detect SIFT keypoints and then compute keypoint descriptors, generating 128-dimensional feature vectors.
2. Learn the bag of words: merge all the processed feature data, then divide the feature words into several classes with a clustering algorithm; the number of classes is set as desired, and each class corresponds to one visual word.
3. Quantize image features with the visual bag of words: each image is composed of many visual words, and a statistical word-frequency histogram can indicate which class an image belongs to.
Model training based on the bag-of-words model mainly includes feature point extraction and cluster analysis, where cluster analysis operates on a number of patterns; in general, a pattern is a measurement vector, i.e., a point in a multidimensional space.
Cluster analysis is based on similarity: patterns within one cluster are more similar to each other than to patterns not in the same cluster. In this application, the cluster analysis can use model-based methods, which mainly involve three steps:
1) Determine an initial cluster center for each cluster, giving k initial cluster centers;
2) Assign the samples in the sample set to the nearest cluster center according to the minimum-distance principle;
3) Use the sample mean within each cluster as the new cluster center, repeating until the cluster centers no longer change.
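The three steps above can be sketched as a minimal one-dimensional k-means; the function name and toy samples are illustrative:

```python
def kmeans_1d(samples, centers, iters=20):
    """Steps 1) to 3): assign each sample to its nearest center
    (minimum-distance principle), then move each center to its
    cluster's mean, until the centers no longer change."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for s in samples:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(s - centers[i]))
            clusters[nearest].append(s)
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:                 # step 3) stop condition
            break
        centers = new
    return centers

print(kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centers=[1.0, 12.0]))
# [2.0, 11.0]
```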
In some embodiments, the bag-of-words models mainly include the Latent Semantic Analysis (LSA) model and the Probabilistic Latent Semantic Analysis (PLSA) model.
In other embodiments, a word-vector representation (word2vec) can also be used. Based on this model, each word can be mapped through training to a K-dimensional real-valued vector (K is generally a hyperparameter of the model), and the semantic similarity between words is judged by the distance between their vectors (such as cosine similarity or Euclidean distance), using a three-layer neural network: input layer, hidden layer, and output layer. The core technique is Huffman coding according to word frequency, so that the hidden-layer activations of words with similar frequency are almost identical; the more frequently a word occurs, the fewer hidden-layer units it activates, which appreciably reduces the computational complexity. This application does not limit the model on which the dimensionality reduction of the second matrix is based.
Optionally, in some embodiments of the invention, each element group has a weight, which can be used to indicate the relative importance of the element group in the overall evaluation; the keywords in each piece of training data can be effectively distinguished by it. Specifically, for a first element group in an element-group set, the weight of the first element group is obtained from the total number of element-group sets and the number of element-group sets that include the first element group, where the first element group refers to any element group in the element-group set.
Optionally, in some embodiments of the invention, after similarity comparison is performed on the pieces of training data, to increase the fault tolerance of the system, the training data ranked in the top A by similarity can also be selected for the next, precise judgment, i.e., judgment by association rules. Thus, even if the similarity comparison of the training data contains some error, i.e., the training data that should be mapped is not ranked first, that data will not be missed. In some application scenarios, A can also be selected according to the business scenario, the current total number of data-source origins on the registration platform, and the number of overlapping pieces of training data; the value of A can change dynamically and is not limited by this application. For hospital data, the top 10 can be taken for the next step of association-rule judgment. Specifically, after cluster analysis is performed on the training data in the training data set, and before the pieces of training data in the target data set are associated with the same category directory, this embodiment of the application may further include:
judging whether the training data in the target data set satisfies a mapping rule, and if it is determined that the mapping rule is satisfied, mapping each piece of training data in the target data set to the same category directory.
Optionally, in some embodiments, the mapping rule is satisfied when:
the similarity between the pieces of training data exceeds the preset similarity, and, proceeding in descending order of element-group level, the element group in one element-group set is judged to be identical or similar in semantic space to the element group of the same level in the other element-group set. If so, the mapping rule is determined to be satisfied; if not, the judgment proceeds to the next lower level.
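One reading of this level-by-level rule can be sketched as follows. The ordering of levels, the example fields, and the early-accept interpretation (accept as soon as some level matches) are assumptions for illustration; the similarity judgment `same` would in practice operate in semantic space:

```python
def satisfies_mapping_rule(groups_a, groups_b, same):
    """Walk element groups from the highest level downward; `same`
    judges whether two groups of the same level are identical or
    similar."""
    for a, b in zip(groups_a, groups_b):
        if same(a, b):
            return True    # matched at this level: rule satisfied
    return False           # no level matched: rule not satisfied

hospital_a = ["shenzhen first people's hospital", "0755-111", "health road"]
hospital_b = ["shenzhen no.1 people's hospital", "0755-222", "health road"]
# Toy judgment: exact string equality stands in for semantic similarity.
print(satisfies_mapping_rule(hospital_a, hospital_b,
                             same=lambda a, b: a == b))  # True
```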
For ease of understanding, the hospital data of a registration platform is taken as an example below. The data processing module of the registration platform can use the LSA model. After obtaining the hospital data of multiple partners, the data processing module stores the hospital, department, and doctor data from the hospital data in database tables; these tables are called external tables. The data in the external tables are then mapped into internal tables and supplied to the online module of the registration platform.
The data processing module belongs to the preprocessing part of the registration platform and can run offline, so it is imperceptible to users who view hospital data on the registration platform. The data processing module needs to process two kinds of data, hospitals and departments. A hospital mainly comprises the hospital name, introduction, phone number, address, city and district information, hospital nature, and rank; a department mainly comprises the department name, introduction, and doctor introductions. The natural-language information in the external tables, such as the hospital name, aliases, introduction, address information, and phone numbers, is extracted to form a document describing the hospital. By judging the similarity between the documents, the hospital data provided by partners is pre-screened; the pre-screening yields the several pieces of hospital data with the highest similarity, which are then subjected to association judgment according to the association rules. These are described separately below:
1. Training the LSA model: document clustering
LSA model is unsupervised learning model, does not need to mark training data in advance, hospital's document that front is formed is exactly Training data, but need just to can be carried out model training by a series of processing, process flow LSA model instruction as shown in Figure 4 Practice and prepare process, which prepares process and be broadly divided into Chinese word segmentation, document vectorization, the TF- for calculating collection of document IDF value and use TF-IDF matrix training LSA model, are illustrated separately below:
(1) Chinese word segmentation. Many mature open-source segmentation tools are available. Note that punctuation marks, stop words, and HTML tags should be removed, as they are all noise for model training.
(2) Document vectorization. All words in the whole document set are examined; each word is assigned a numeric id, and its frequency is counted.
For example, hospital document 1, "Coking Coal Central Hospital, Health Road, Jiaozuo City, Henan Province", consists of the hospital name and address. After Chinese word segmentation, the result for this hospital document is: Coking Coal, Central, Hospital, Henan Province, Jiaozuo City, Health, Road. Suppose the id of "Coking Coal" is set to 8, the id of "Central" is 52, the id of "Hospital" is 268, the id of "Henan Province" is 500, the id of "Jiaozuo City" is 1608, the id of "Health" is 2112, and the id of "Road" is 3068. The document vector of this hospital document can then be expressed as:
[(8,1), (52,1), (268,1), (500,1), (1608,1), (2112,1), (3068,1)].
Suppose another hospital document 2 is "Coking Coal Group Hospital, Health Road, Jiaozuo, Henan". After Chinese word segmentation and dictionary numbering, its document vector can be expressed as:
[(8,1), (52,1), (268,1), (297,1), (574,1), (1608,1), (2142,1), (3068,1)].
As it can be seen that there are certain similitudes for this Liang Pian hospital document, then, the suspicious text for forming this Liang Pian hospital document The dictionary space of shelves set merges following document vector:
[(8,2), (52,2), (268,2), (297,1), (574,1), (1608,2), (2142,1), (3068,2), (500,1), (2112,1)]
Since the dictionary ids themselves are of no use for model training, only the word-frequency value in the second dimension is needed; however, the document vector of each document must cover all dictionary ids in the dictionary.
Therefore the final document vector of hospital document 1 is: [1,1,1,0,0,1,0,1,1,1]
The final document vector of hospital document 2 is: [1,1,1,1,1,1,1,1,0,0]
Considering that the word overlap between hospital documents will not be too high, each document vector is likely a sparse vector containing many zeros. After the entire document set is vectorized, a sparse word-frequency matrix is formed, in which each row is a word and each column is a document, as illustrated by the frequency matrix shown in Fig. 5.
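The dictionary-merging and vectorization steps above can be sketched as follows. The token names are hypothetical English placeholders for the segmented Chinese words, and ids are assigned in order of first appearance rather than the fixed ids of the example:

```python
def build_dictionary(docs):
    """Assign an integer id to each distinct word, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def to_dense_vector(doc, vocab):
    """Dense word-frequency vector covering the full merged dictionary."""
    vec = [0] * len(vocab)
    for word in doc:
        vec[vocab[word]] += 1
    return vec

# Hypothetical tokenized hospital documents (English placeholders).
doc1 = ["coking-coal", "central", "hospital", "henan-province",
        "jiaozuo-city", "health", "road"]
doc2 = ["coking-coal", "central", "hospital", "group",
        "henan", "jiaozuo", "health", "road"]

vocab = build_dictionary([doc1, doc2])
v1 = to_dense_vector(doc1, vocab)
v2 = to_dense_vector(doc2, vocab)
```

Both dense vectors have the same length (the size of the merged dictionary), which is what allows the whole document set to be stacked into a single word-frequency matrix.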
(3) Calculating the TF-IDF values of the document set
TF, that is, Term Frequency, is the word frequency, already computed in the previous step. IDF, that is, Inverse Document Frequency, is computed as the total number of documents divided by the number of documents containing the word, with the natural logarithm then taken of the quotient.
TF-IDF is simply TF multiplied by IDF; IDF is equivalent to the weight of the word. Compared with raw word frequency, the TF-IDF value describes a word more reasonably. For example, some words occur many times in a document, so the TF of those words is large, but these words also occur in almost every document in the set and therefore contribute little to distinguishing documents. IDF solves this problem by assigning a weight to each word's frequency: the more ubiquitous a word is across the document set, the smaller its IDF value. Take the word "hospital": nearly every document in the hospital data contains it, and its frequency within each document is high, so its influence on document similarity would be larger than that of other words. Yet "hospital" actually does little to distinguish documents, so it should be given a lower weight to balance the negative effect of its high word frequency; IDF is exactly that weight.
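Under the definitions above (IDF is the natural log of the total document count over the count of documents containing the word), a minimal TF-IDF computation might look like the following sketch; the toy frequency matrix is an assumption for illustration, not the Fig. 5 matrix:

```python
import math

def tf_idf(freq_rows):
    """freq_rows: one row per document, one column per word (raw counts)."""
    n_docs = len(freq_rows)
    n_words = len(freq_rows[0])
    # Document frequency: in how many documents does each word occur?
    df = [sum(1 for row in freq_rows if row[w] > 0) for w in range(n_words)]
    # IDF = ln(total docs / docs containing the word)
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[w] * idf[w] for w in range(n_words)] for row in freq_rows]

# Toy matrix: the word in column 0 (think "hospital") appears in every
# document, so its IDF, and hence its TF-IDF weight, is zero.
freq = [
    [3, 1, 0],
    [2, 0, 1],
    [4, 0, 0],
]
weighted = tf_idf(freq)
```

Note how the ubiquitous column-0 word is weighted down to zero regardless of its high raw frequency, which is exactly the balancing effect described above.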
The frequency matrix is converted into the computed TF-IDF matrix, whose column vectors represent the documents; the TF-IDF matrix is shown in Fig. 6.
In some embodiments, once the TF-IDF matrix is obtained, the document vectors (the column vectors of the matrix) are determined and could in fact already be used to compute cosine similarity between documents. However, computing similarity directly in this way has three problems:
1. The TF-IDF matrix is too sparse, and the computation becomes very time-consuming when the data volume is large;
2. The data contain excessive noise;
3. Near-synonym interference.
For the hospital data, because each document contains the hospital introduction, there are words of little significance to model training; these words can be called noise. The document vectors are all high-dimensional sparse vectors, and the usual way of handling noise is dimensionality reduction, which also solves the matrix-sparsity problem at the same time.
Near-synonym interference also strongly affects similarity computation. For example, "Coking Coal Central Hospital, Health Road, Jiaozuo City, Henan Province" and "Coking Coal Group Hospital, Health Road, Jiaozuo, Henan" are two documents that in fact need to be mapped as two data sources of the same hospital, but if similarity is computed from TF-IDF vectors, it will not be very high: "Henan Province" in the former and "Henan" in the latter are in fact near-synonyms, yet they are assigned different dictionary ids, so the document vectors differ accordingly. A more general case: two documents describing car engines, document one "the engine sound is sonorous" and document two "the motor sound is loud". Computed from TF-IDF vectors, the similarity of these two documents would be very low, yet from the perspective of natural-language understanding they are extremely similar. This is the problem of near-synonym interference: the information in the TF-IDF matrix is not sufficient to judge that "engine" and "motor" are synonyms, or that "sonorous" and "loud" are synonyms. A method is therefore needed that both reduces dimensionality and can identify synonyms when transforming the matrix; that method is the LSA model.
Taking the LSA model as an example, model training is performed on the TF-IDF matrix. The basic principle of training the LSA model on the TF-IDF matrix is singular value decomposition (SVD) from linear algebra: a matrix can be decomposed into the product of three matrices, A = U Σ V^T, where A is the original matrix, U is the left singular matrix, Σ is a diagonal matrix, and V is the right singular matrix. Each row of U represents a class of words related in meaning, each column of V represents a class of semantically related documents, and the singular values in Σ are arranged in descending order from top to bottom. Σ is truncated: suppose the original n-order square matrix is truncated to order k; by the rules of matrix multiplication, U and V are truncated accordingly. The product of these truncated matrices does not reduce the number of documents or words; it is equivalent to merging and disassembling words only semantically, performing matrix dimensionality reduction with SVD while retaining the important information of the original matrix. Semantic merging can be expressed by a formula, as follows:
0.73*engine + 0.54*motor + 0.3*car — such a weighted near-synonym combination is the semantic merging of "engine" and "motor";
0.72*tire + 0.7*car — such a weighted near-synonym combination is the semantic merging of "tire".
Semantic disassembly can be understood from "car" in the expressions above: a 0.3 component of "car" is disassembled into the semantic class related to "engine", while a 0.7 component of "car" is disassembled into the semantic class related to "tire", because the word "car" can carry multiple layers of meaning.
The LSA model thus semantically disassembles the original words, which is the key point in completing document clustering and dimensionality reduction. More precisely: semantic disassembly completes the process of mapping the original word space to a semantic space; in the semantic space, similar documents are closer together, so document clustering is achieved. At this point the LSA model has been trained.
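The truncated-SVD dimensionality reduction described above can be sketched with NumPy; the toy word-by-document matrix and the chosen rank are assumptions for illustration, while a production LSA implementation would run this on the full TF-IDF matrix:

```python
import numpy as np

def lsa_reduce(A, k):
    """Truncate the SVD A = U Sigma V^T to rank k and return the pieces.

    The columns of the truncated V^T give each document's coordinates in
    the k-dimensional semantic space.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Toy word-by-document matrix of rank 2: the first two rows (two
# near-synonymous words) have identical document profiles, so truncating
# to k = 2 loses nothing while merging them semantically.
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
Uk, sk, Vtk = lsa_reduce(A, 2)
A_approx = Uk @ np.diag(sk) @ Vtk
```

Because the toy matrix has rank 2, the rank-2 truncation reconstructs it exactly; on a real TF-IDF matrix a much smaller k discards the small singular values, which is where the noise removal comes from.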
Two, comparison of document similarity
After the LSA model is trained, a hospital to be mapped goes through Chinese word segmentation, vectorization, and TF-IDF computation, after which the trained LSA model maps it into the semantic space. Similarity matching against the other hospital document vectors is then performed in the semantic space. The method used in this scheme is cosine similarity, whose formula is: cos θ = (A · B) / (‖A‖ ‖B‖), where A and B are the two document vectors.
Cosine similarity does not consider vector length, only the angle θ between the vectors, which makes it particularly suitable for comparing high-dimensional sparse vectors such as document vectors. Fig. 7 shows the similarity ranking, from high to low, of the hospital document "Coking Coal Central Hospital, Health Road, Jiaozuo City, Henan Province" against the hospitals already in the database. In Fig. 7, the first column is the category index, the second column the hospital directory index, the third column the hospital name, and the fourth column the similarity computed by the cosine-similarity formula above. It can be seen that all similar hospitals cluster together and the ordering is largely correct.
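A direct implementation of the cosine-similarity comparison under the formula above (the sample vectors are illustrative):

```python
import math

def cosine_similarity(a, b):
    """cos theta = (a . b) / (||a|| ||b||): length is ignored, only the angle counts."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# A scaled copy of a vector has similarity 1; an orthogonal vector has 0,
# which is why length differences between documents do not matter here.
base = [1.0, 2.0, 0.0]
scaled = [2.0, 4.0, 0.0]
orthogonal = [0.0, 0.0, 5.0]
```

The zero-norm guard matters in practice: an empty document maps to the zero vector, and dividing by its norm would otherwise raise an error.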
Specifically, the mapping of hospital data can be divided into the following two scenarios:
A. Mapping in the initial state
Initially there is hospital data from only one partner, and this partner's hospital data serves as the reference data; at this moment each hospital has only one hospital record. When other partners later import data, duplicate hospital records will appear, so similarity comparison is needed: each hospital record from the new partner is compared against all the reference data.
In the initial state, the first batch of hospital data entering the database is taken as the reference data. When new data arrives later, each newly added record is compared for similarity against every hospital record in the database; the top-ranked several records are taken, and the association rules then judge whether the newly added record can be mapped to a hospital record in the database.
If the initial state contains multiple batches of data, one of them is selected as the reference data and the similarity comparisons are performed accordingly.
B. Subsequent update mapping
For the hospital data of a newly added partner, each record is compared for similarity against every hospital record in the database, yielding a ranked similarity table.
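The update-mapping step above — score a new record against every record in the database and keep a ranked table — could be sketched as follows. The cutoff of 10 follows the scheme; scoring with plain cosine similarity on already-mapped semantic-space vectors, and the vector values themselves, are assumptions for illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(new_vec, database, k=10):
    """Return the k database entries most similar to new_vec, best first."""
    scored = [(name, cosine(new_vec, vec)) for name, vec in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical semantic-space vectors for hospitals already in the database.
database = {
    "hospital-a": [1.0, 0.0],
    "hospital-b": [0.9, 0.1],
    "hospital-c": [0.0, 1.0],
}
ranking = rank_candidates([1.0, 0.05], database, k=2)
```

Only the ranked head of this table is passed on to the association-rule judgment, so a near-miss in the similarity step is not fatal.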
Three, design of the association rules
To increase the fault tolerance of the system, after the similarity pre-judgment a further precise judgment — the association-rule judgment — decides whether the data can be mapped to an existing category directory in the database. The top 10 documents by similarity ranking enter this further judgment, so even if the previous step has some error, i.e., the hospital that should be mapped is not ranked first, it will not be missed. Given the business scenario, the current number of partners on the registration platform, and the number of overlapping hospitals, taking the top 10 is a well-suited value.
In addition, the design of the association rules also needs to be adjusted to the business scenario; for example, the data dimensions of hospitals and departments differ, so their association rules cannot be the same.
The association rules for the registration platform's hospital data have three tiers: tier one, identical hospital names => the hospitals can be mapped; tier two, a hospital alias identical to a hospital name => the hospitals can be mapped; tier three, identical city code, district code, and phone number => the hospitals can be mapped. The three tiers operate as follows: if tier one succeeds, tiers two and three are no longer considered; if tier one fails, tier two is considered, and if it succeeds, tier three is no longer considered. If all three tiers fail, the hospital is considered unmappable and is treated as a new hospital provided by the partner, or handed over for manual review.
The association rules for department data are also designed with three tiers: tier one, identical department names => the departments can be mapped; tier two, one department name contains the other => the departments can be mapped; tier three, the doctor names under the departments have a 60% match rate => the departments can be mapped. The three tiers operate in the same way as for hospitals.
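The three-tier rule cascade for hospitals, with the short-circuiting between tiers described above, might be implemented like this; the record fields and sample values are illustrative assumptions, not the platform's actual schema:

```python
def hospital_rule_tier(candidate, reference):
    """Return which tier maps the two records, or None if all three fail."""
    # Tier 1: identical hospital names.
    if candidate["name"] == reference["name"]:
        return 1
    # Tier 2: one record's name appears among the other's aliases.
    if (candidate["name"] in reference.get("aliases", ())
            or reference["name"] in candidate.get("aliases", ())):
        return 2
    # Tier 3: identical city code, district code, and phone number.
    keys = ("city", "district", "phone")
    if all(candidate.get(k) == reference.get(k) for k in keys):
        return 3
    return None  # unmappable: new hospital, or hand over to manual review

ref = {"name": "Daping Hospital, Third Military Medical University",
       "aliases": ("Chongqing Daping Hospital",),
       "city": "500100", "district": "500107", "phone": "023-000000"}
cand = {"name": "Chongqing Daping Hospital", "city": "500100",
        "district": "500107", "phone": "023-111111"}
```

Returning the tier number rather than a boolean also makes it easy to tally results per rule, as in the LSA/PLSA comparison in Table 1.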
An application scenario of the tier-two hospital rule:
Chongqing Daping Hospital VS Daping Hospital of the Third Military Medical University [0.634987]
The similarity of these two hospitals is 0.634987; they are finally mapped successfully because the former is an alias of the latter.
An application scenario of the tier-three hospital rule:
Shenzhen Traditional Chinese Medicine Hospital Jindi Seascape Community Healthcare Service Center 440300 440304 0755-23811165
Jindi Seascape Community Health 440300 440304 0755-23811165 [0.775298]
The similarity of these two hospitals is 0.775298; they are finally mapped successfully because their city codes, district codes, and phone numbers are identical.
An application scenario of the tier-two department rule:
Medical Cosmetology Department (Zhuyuan Branch) VS Medical Cosmetology Department [0.94992]
Because one name contains the other, they are mapped.
An application scenario of the tier-three department rule:
1158 Reproductive Medicine Department (Beiyuan) Wang Junxia Chen Hua Zhou Jianjun Wang Fen
1158 25347 Reproductive Center, North Branch Wang Fen Chen Hua Wang Junxia Zhou Jianjun [0.536493]
The 60% doctor-name match rate in the tier-three department rule is an empirical value and can be adjusted for different application scenarios. Doctor names are included in the association-rule judgment because department information is very sparse relative to hospital information: departments lack fields such as address and phone, and the null rate of department introductions on the current registration platform is above 50%, so doctor names are necessary here as part of the association rule. Doctor names are deliberately not included in the LSA model's training data, since doctor names may be regarded as proper nouns that must not be disassembled or merged.
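The 60% doctor-name match rate could be computed as below. Treating the rate as the name overlap divided by the size of the smaller list is an assumption, since the text does not fix the denominator, and "Li Ming" is an invented name for illustration:

```python
def doctor_match_rate(doctors_a, doctors_b):
    """Fraction of shared doctor names, relative to the smaller department list."""
    a, b = set(doctors_a), set(doctors_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def tier3_department_match(doctors_a, doctors_b, threshold=0.6):
    """Empirical 60% threshold; adjustable per application scenario."""
    return doctor_match_rate(doctors_a, doctors_b) >= threshold

# Doctor lists from the tier-three department example above.
dept1 = ["Wang Junxia", "Chen Hua", "Zhou Jianjun", "Wang Fen"]
dept2 = ["Chen Hua", "Wang Junxia", "Zhou Jianjun", "Wang Fen", "Li Ming"]
```

Using sets makes the comparison order-insensitive, which matters because the two data sources list the same doctors in different orders.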
The theoretical basis of the LSA model is the SVD decomposition of linear algebra, and the drawback of this scheme is the non-interpretability of the semantics. The application therefore also provides another bag-of-words model, probabilistic latent semantic analysis (PLSA), which is probability-based and treats semantics as latent variables. Its basic idea is likewise a space transformation, but its theoretical support is probability theory, so it has better interpretability and is theoretically a superior model. However, in this application scenario of mapping hospital and department data, the effect of PLSA is not as good as LSA. Among 513 hospitals, 32 hospitals can be mapped; the comparison of LSA and PLSA is shown in Table 1 below:
Hospitals mapped       Rule 1   Rule 2   Rule 3
Based on LSA model       20        1        9
Based on PLSA model      12        2        8
Table 1
Here Rule 1, Rule 2, and Rule 3 denote the association rules: Rule 1 is the tier-one rule, Rule 2 the tier-two rule, and Rule 3 the tier-three rule.
When judging with Rule 1, the LSA-based model associates 20 hospital records that can be mapped to the same hospital, while the PLSA-based model associates 12.
Since some records mappable to the same hospital may be missed when associating with Rule 1, the tier-two judgment with Rule 2 is then performed. The result: the LSA-based model associates 1 additional record mappable to the same hospital, while the PLSA-based model associates 2.
Similarly, since some records mappable to the same hospital may still be missed when associating with Rule 2, the tier-three judgment with Rule 3 is performed. The result: the LSA-based model associates 9 additional records mappable to the same hospital, while the PLSA-based model associates 8.
In total, the LSA-based model identifies 30 hospital records that can be mapped to the same hospital, while the PLSA-based model identifies 22.
In some embodiments, the LSA model can be deployed alone, the PLSA model can be deployed alone, or both models can be deployed to compute in parallel, which effectively improves operating efficiency and pushes results to the client in time; for users of the client, the change in back-end data is imperceptible.
In some embodiments, the application may also be extended based on word2vec. word2vec is a form of word-vector representation that can be generalized to document-vector representations, so document-similarity comparison can be performed between such document vectors. This kind of model considers word order, i.e., the context of words, and so fits natural language better than bag-of-words models (such as the LSA and PLSA models).
A method of data processing in the application has been described above. A device for performing the above data-processing method is described below. The device may be a server or a terminal device, or an interactive application installed on the server or terminal device; the application mainly takes the device to be a server, illustrated with the device being an interactive application installed on the server.
One, referring to Fig. 8, a device 80 for data processing is described; the device 80 for data processing may include:
an acquisition module 801, configured to obtain a training data set to be processed, the training data set including at least two items of training data after semantic analysis;
a processing module 802, configured to perform cluster analysis on the training data in the training data set obtained by the acquisition module 801 to obtain a target data set, the target data set including at least two items of training data whose similarity is higher than a preset similarity;
a mapping module 803, configured to map each item of training data in the target training set obtained by the processing module 802 to a same category directory, the category directory being used to provide an entry for obtaining the training data under the category directory.
In the embodiment of the application, the training data set obtained by the acquisition module 801 includes at least two items of training data after semantic analysis; through the preprocessing of semantic analysis, the training data that may map to the same category directory can be roughly judged, narrowing the mapping range. The processing module 802 then performs cluster analysis on the training data in the training data set to obtain a target data set including at least two items of training data whose similarity is higher than the preset similarity; since cluster analysis identifies the more similar training data, the training data that can truly be mapped to the same category can be further determined. Finally, the mapping module 803 maps each item of training data in the target training set to the same category directory. It can be seen that the application can improve the accuracy of multi-source mapping, accurately recognize training data that differ in form but are semantically identical or similar, and improve the reliability and fault tolerance of the mapping.
Optionally, in some embodiments of the invention, the processing module 802 is specifically configured to:
map each item of training data in the training data set from the element-group space to a semantic space; and
calculate the similarity between the items of training data mapped to the semantic space, and determine the target data set according to the similarity between the training data.
Optionally, in some embodiments of the invention, the processing module 802 is specifically configured to:
perform element-group division on each item of training data in the training data set to obtain at least two element-group sets, each element-group set including at least one element group and corresponding to one item of training data, an element group denoting a set of at least one indivisible element;
perform vectorization on the at least two element-group sets to obtain a first matrix, the first matrix being used to indicate the frequency with which at least one element group occurs in each element-group set;
calculate a second matrix from the weight of the element groups, the frequency of the element groups, and the first matrix, the second matrix being used to indicate the frequency-weighted values of the element groups; and
perform bag-of-words model training on the second matrix.
Optionally, in some embodiments of the invention, the processing module 802 is specifically configured to:
perform singular value decomposition on the second matrix based on the bag-of-words model to obtain a left singular matrix, a diagonal matrix, and a right singular matrix, so as to perform dimensionality reduction on the second matrix and remove the noise data in the second matrix.
Optionally, in some embodiments, the weight of a first element group is obtained from the total number of element-group sets and the number of element-group sets including the first element group, the first element group referring to an element group in an element-group set.
Optionally, in some embodiments of the invention, the processing module 802 is specifically configured to:
perform vectorization on the at least two element-group sets according to the frequency with which the element groups occur in each element-group set, obtaining at least two training vectors; and
form the first matrix from the obtained at least two training vectors.
Optionally, in some embodiments of the invention, after performing cluster analysis on the training data in the training data set and before associating each item of training data in the target training set with the same category directory, the processing module 802 is further configured to:
judge whether the training data in the target data set meets a mapping rule, and if it is determined that the mapping rule is met, map each item of training data in the target training set to the same category directory.
Optionally, in some embodiments of the invention, the mapping rule is met when:
the similarity between the training data is higher than the preset similarity; and, in descending order of element-group tier, it is judged whether an element group in one element-group set is identical or similar in the semantic space to the element group of the same tier in another element-group set; if so, the mapping rule is determined to be met; if not, the judgment of the next lower tier is performed.
The application also provides a computer storage medium storing a program; when executed, the program performs some or all of the steps of the above data-processing method performed by the above device for data processing.
The application also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform some or all of the steps of the method performed by the device for data processing.
The device for data processing in the embodiments of the present invention has been described above from the perspective of modular functional entities; the network authentication server and the terminal device in the embodiments of the present invention are described below from the perspective of hardware processing. It should be noted that the entity device corresponding to the acquisition module in the embodiment shown in Fig. 8 may be an input/output unit, and the entity device corresponding to the processing module may be a processor. The device shown in Fig. 8 may have the structure shown in Fig. 9; in that case, the processor and input/output unit in Fig. 9 implement functions identical or similar to those of the processing module and acquisition module provided by the aforementioned device embodiment, and the memory in Fig. 9 stores the program code that the processor needs to call when executing the above data-processing method.
Fig. 10 is a schematic structural diagram of a server provided by an embodiment of the present invention. The server 1000 may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 1022 (for example, one or more processors), a memory 1032, and one or more storage media 1030 (for example, one or more mass-storage devices) storing application programs 1042 or data 1044. The memory 1032 and storage medium 1030 may provide transient or persistent storage. The program stored on the storage medium 1030 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and execute, on the server 1000, the series of instruction operations in the storage medium 1030.
The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, for example Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in Fig. 10.
An embodiment of the present invention also provides another terminal device. As shown in Fig. 11, for ease of description only the parts relevant to the embodiment of the present invention are shown; for specific technical details not disclosed, refer to the method part of the embodiments of the present invention. The terminal device may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, an in-vehicle computer, and the like; the terminal device is taken to be a mobile phone as an example:
Fig. 11 shows a block diagram of part of the structure of a mobile phone related to the terminal device provided by an embodiment of the present invention. Referring to Fig. 11, the mobile phone includes components such as a radio frequency (RF) circuit 1111, a memory 1120, an input unit 1130, a display unit 1140, a sensor 1150, an audio circuit 1160, a wireless fidelity (WiFi) module 1170, a processor 1180, and a power supply 1190. Those skilled in the art will appreciate that the mobile-phone structure shown in Fig. 11 does not limit the mobile phone, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
Each component of the mobile phone is introduced below with reference to Fig. 11:
The RF circuit 1111 may be used to receive and send signals during messaging or a call; in particular, after receiving downlink information from a base station, it passes the information to the processor 1180 for processing, and it sends uplink data to the base station. In general, the RF circuit 1111 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier (LNA), a duplexer, and so on. In addition, the RF circuit 1111 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and so on.
The memory 1120 may be used to store software programs and modules; the processor 1180 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, application programs required by at least one function (such as a sound-playing function and an image-playing function), and so on, and the data storage area may store data created according to the use of the mobile phone (such as audio data and a phone book). In addition, the memory 1120 may include high-speed random access memory, and may also include non-volatile memory, for example at least one magnetic-disk storage device, a flash-memory device, or another volatile solid-state storage device.
The input unit 1130 may be used to receive input numeric or character information and to generate key-signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1130 may include a touch panel 1131 and other input devices 1132. The touch panel 1131, also called a touch screen, collects touch operations by the user on or near it (such as operations by the user with a finger, a stylus, or any suitable object or accessory on or near the touch panel 1131) and drives the corresponding connected devices according to a preset program. Optionally, the touch panel 1131 may include two parts: a touch-detection device and a touch controller. The touch-detection device detects the touch orientation of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch-detection device, converts it into contact coordinates, sends them to the processor 1180, and can receive and execute commands sent by the processor 1180. In addition, the touch panel 1131 may be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 1131, the input unit 1130 may also include other input devices 1132, which specifically may include but are not limited to one or more of a physical keyboard, function keys (such as volume-control keys and a switch key), a trackball, a mouse, a joystick, and the like.
The display unit 1140 may be used to display information input by the user or information provided to the user, as well as the various menus of the mobile phone. The display unit 1140 may include a display panel 1141, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 1131 may cover the display panel 1141. After the touch panel 1131 detects a touch operation on or near it, the operation is transmitted to the processor 1180 to determine the type of the touch event, and the processor 1180 then provides a corresponding visual output on the display panel 1141 according to the type of the touch event. Although in Figure 11 the touch panel 1131 and the display panel 1141 are implemented as two independent components to realize the input and output functions of the mobile phone, in some embodiments the touch panel 1131 and the display panel 1141 may be integrated to realize the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 1150, such as an optical sensor, a motion sensor, and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 1141 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 1141 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the phone's posture (such as landscape/portrait switching, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer or tap detection). Other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor may also be configured on the mobile phone; details are not described herein.
The audio circuit 1160, the loudspeaker 1161, and the microphone 1162 can provide an audio interface between the user and the mobile phone. The audio circuit 1160 can transmit the electric signal converted from the received audio data to the loudspeaker 1161, which converts it into a sound signal for output; conversely, the microphone 1162 converts a collected sound signal into an electric signal, which the audio circuit 1160 receives and converts into audio data. After the audio data is output to the processor 1180 for processing, it is sent through the RF circuit 1111 to, for example, another mobile phone, or is output to the memory 1120 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1170, the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media, and so on; it provides the user with wireless broadband Internet access. Although Figure 11 shows the WiFi module 1170, it can be understood that it is not a necessary component of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 1180 is the control center of the mobile phone. It connects the various parts of the whole mobile phone through various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 1120 and by calling the data stored in the memory 1120, thereby monitoring the mobile phone as a whole. Optionally, the processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 1180.
The mobile phone further includes a power supply 1190 (such as a battery) that supplies power to all components. Preferably, the power supply may be logically connected to the processor 1180 through a power management system, so as to implement functions such as charging management, discharging management, and power consumption management through the power management system.
Although not shown, the mobile phone may also include a camera, a Bluetooth module, and the like; details are not described herein.
In the embodiments of the present invention, the processor 1180 included in the mobile phone also has the function of controlling the execution of the method flow performed by the terminal device described above.
In the above embodiments, the description of each embodiment has its own emphasis. For any part that is not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
It is apparent to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, devices, and modules described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not described herein.
The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules; they may be located in one place, or they may be distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist physically alone, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may be stored in a computer-readable storage medium.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, server, or data center to another web site, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state disk (SSD)), etc.

Claims (15)

1. A method of data processing, characterized in that the method comprises:
obtaining a training data set to be processed, the training data set including at least two pieces of training data that have undergone semantic analysis;
performing cluster analysis on the training data in the training data set to obtain a target data set, the target data set including at least two pieces of training data whose similarity is higher than a preset similarity; and
mapping each piece of training data in the target data set to a same category list, the category list being used to provide an entry for obtaining the training data under the category list.
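The three steps of claim 1 can be sketched as a greedy grouping pass. This is a minimal illustration under stated assumptions, not the patented implementation: the token-overlap (Jaccard) similarity below is a hypothetical stand-in for the semantic-space similarity that claims 2 to 4 develop, and the 0.5 threshold is an arbitrary choice for the preset similarity.

```python
def similarity(a, b):
    """Stand-in similarity: token overlap (Jaccard). The patent's
    semantic-space similarity (claims 2-4) would be used instead."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cluster_and_map(training_data, preset_similarity=0.5):
    """Greedy clustering: each cluster holds the training data whose
    similarity to the cluster's seed exceeds the preset threshold;
    every member of a cluster is then mapped to one category list."""
    category_lists = []                # each inner list = one category list
    for doc in training_data:
        for members in category_lists:
            if similarity(doc, members[0]) > preset_similarity:
                members.append(doc)    # map to the same category list
                break
        else:
            category_lists.append([doc])  # start a new category list
    return category_lists

docs = ["buy cheap phone online",
        "buy cheap phone now online",
        "weather forecast today"]
clusters = cluster_and_map(docs, 0.5)
```

Each inner list plays the role of one category list: retrieving any member gives an entry point to all training data mapped under it.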
2. The method according to claim 1, characterized in that the performing cluster analysis on the training data in the training data set to obtain a target data set comprises:
mapping each piece of training data in the training data set from an element-group space to a semantic space; and
calculating the similarity between the pieces of training data mapped to the semantic space, and determining the target data set according to the similarity between the pieces of training data.
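Once each piece of training data is a vector in the semantic space, a pairwise similarity can be computed. The patent does not fix the metric; cosine similarity is a common choice for latent-semantic vectors and is shown here as an assumed example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two semantic-space vectors; a common
    (assumed) choice for the similarity of claim 2."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Two vectors pointing in the same direction score ~1.0 regardless of length:
score = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
```

A threshold on this score then serves as the preset similarity of claim 1.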
3. The method according to claim 2, characterized in that the mapping each piece of training data in the training data set from an element-group space to a semantic space comprises:
performing element-group division processing on each piece of training data in the training data set to obtain at least two element-group sets, each element-group set including at least one element group and corresponding to one piece of training data, an element group representing a set of at least one indivisible element;
performing vectorization processing on the at least two element-group sets to obtain a first matrix, the first matrix being used to indicate the frequency with which at least one element group occurs in each element-group set;
calculating a second matrix according to the weights of the element groups, the frequencies of the element groups, and the first matrix, the second matrix being used to indicate the frequency-weighted values of the element groups; and
performing bag-of-words model training on the second matrix.
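Claims 3 and 5 read like a TF-IDF construction: the first matrix holds raw element-group frequencies per set, and the second matrix weights each frequency by a quantity derived from the total number of sets and the number of sets containing the group. The sketch below assumes the classic log(N/df) form for that weight; the patent states the inputs of the weight but not its exact formula, so this is an illustrative reading. Element groups here are plain word tokens.

```python
import math
from collections import Counter

def first_matrix(element_group_sets):
    """Rows = element-group sets (one per piece of training data),
    columns = vocabulary; entries = raw occurrence counts
    (claim 3's first matrix)."""
    vocab = sorted({g for s in element_group_sets for g in s})
    return vocab, [[Counter(s)[g] for g in vocab] for s in element_group_sets]

def second_matrix(element_group_sets):
    """Frequency-weighted matrix. Assumed weight: log(N / df), with N the
    number of element-group sets and df the number of sets containing the
    group (the two inputs named in claim 5)."""
    vocab, tf = first_matrix(element_group_sets)
    n = len(element_group_sets)
    df = [sum(1 for s in element_group_sets if g in s) for g in vocab]
    idf = [math.log(n / d) for d in df]
    return vocab, [[t * idf[j] for j, t in enumerate(row)] for row in tf]

sets = [["cheap", "phone"], ["cheap", "phone", "deal"], ["weather"]]
vocab, m2 = second_matrix(sets)
```

Row i of the second matrix is the weighted vector for the i-th piece of training data; claim 4's singular value decomposition is then applied to this matrix.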
4. The method according to claim 3, characterized in that the performing bag-of-words model training on the second matrix comprises:
performing singular value decomposition on the second matrix based on the bag-of-words model to obtain a left singular matrix, a diagonal matrix, and a right singular matrix, so as to perform dimensionality reduction on the second matrix and remove noise data from the second matrix.
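A minimal numeric sketch of the decomposition in claim 4, using NumPy's SVD. The matrix values and the choice k = 2 are illustrative assumptions; in the patented method, M would be the weighted element-group matrix and k the number of retained semantic dimensions.

```python
import numpy as np

# Decompose M into a left singular matrix U, a diagonal of singular
# values s, and a right singular matrix Vt; keeping only the k largest
# singular values reduces dimensionality and discards noise.
M = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                                        # semantic dimensions to keep
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of M
```

Because this M has rank 2, the rank-2 approximation reconstructs it essentially exactly; with real data, the discarded small singular values carry the noise that claim 4 aims to remove.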
5. The method according to claim 3, characterized in that the weight of a first element group is obtained according to the total number of element-group sets and the number of element-group sets that include the first element group, the first element group being any element group in an element-group set.
6. The method according to claim 5, characterized in that the performing vectorization processing on the at least two element-group sets to obtain a first matrix comprises:
performing vectorization processing on the at least two element-group sets according to the frequency with which each element group occurs in each element-group set, to obtain at least two training vectors; and
forming the first matrix from the at least two training vectors obtained.
7. The method according to any one of claims 3 to 6, characterized in that, after the performing cluster analysis on the training data in the training data set and before the mapping each piece of training data in the target data set to a same category list, the method further comprises:
judging whether the training data in the target data set satisfies a mapping rule, and if it is determined that the mapping rule is satisfied, mapping each piece of training data in the target data set to the same category list.
8. The method according to claim 7, characterized in that the mapping rule is satisfied when:
the similarity between the pieces of training data is higher than the preset similarity; and, judging in descending order of element-group level, it is determined whether an element group in one element-group set and the element group of the same level in another element-group set are identical or similar in the semantic space; if so, it is determined that the mapping rule is satisfied, and if not, the judgment proceeds to the next lower level.
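Claim 8's level-by-level check can be sketched as follows. The list ordering (index 0 = highest level) and the pluggable same_or_similar predicate are assumptions for illustration; the semantic-space comparison itself is the one developed in claims 2 to 4, and the preceding similarity-threshold precondition is omitted here.

```python
def satisfies_mapping_rule(groups_a, groups_b, same_or_similar):
    """Walk the element groups of two sets in descending level order:
    if the groups at the same level are identical or similar in semantic
    space, the mapping rule is met; otherwise drop to the next lower
    level and judge again."""
    for ga, gb in zip(groups_a, groups_b):  # index 0 = highest level (assumed)
        if same_or_similar(ga, gb):
            return True
    return False                            # no level matched

# Trivial predicate (exact equality) standing in for semantic similarity:
equal = lambda p, q: p == q
```

With the equality predicate, two sets that agree at any level, highest first, satisfy the rule; real use would substitute a semantic-space comparison.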
9. A device for data processing, characterized in that the device comprises:
an obtaining module, configured to obtain a training data set to be processed, the training data set including at least two pieces of training data that have undergone semantic analysis;
a processing module, configured to perform cluster analysis on the training data in the training data set obtained by the obtaining module, to obtain a target data set, the target data set including at least two pieces of training data whose similarity is higher than a preset similarity; and
a mapping module, configured to map each piece of training data in the target data set obtained by the processing module to a same category list, the category list being used to provide an entry for obtaining the training data under the category list.
10. The device according to claim 9, characterized in that the processing module is specifically configured to:
map each piece of training data in the training data set from an element-group space to a semantic space; and
calculate the similarity between the pieces of training data mapped to the semantic space, and determine the target data set according to the similarity between the pieces of training data.
11. The device according to claim 10, characterized in that the processing module is specifically configured to:
perform element-group division processing on each piece of training data in the training data set to obtain at least two element-group sets, each element-group set including at least one element group and corresponding to one piece of training data, an element group representing a set of at least one indivisible element;
perform vectorization processing on the at least two element-group sets to obtain a first matrix, the first matrix being used to indicate the frequency with which at least one element group occurs in each element-group set;
calculate a second matrix according to the weights of the element groups, the frequencies of the element groups, and the first matrix, the second matrix being used to indicate the frequency-weighted values of the element groups; and
perform bag-of-words model training on the second matrix.
12. The device according to claim 11, characterized in that the processing module is specifically configured to:
perform singular value decomposition on the second matrix based on the bag-of-words model to obtain a left singular matrix, a diagonal matrix, and a right singular matrix, so as to perform dimensionality reduction on the second matrix and remove noise data from the second matrix.
13. The device according to claim 12, characterized in that the processing module is specifically configured to:
perform vectorization processing on the at least two element-group sets according to the frequency with which each element group occurs in each element-group set, to obtain at least two training vectors; and
form the first matrix from the at least two training vectors obtained.
14. A computer storage medium comprising instructions, characterized in that, when the instructions are run on a computer, they cause the computer to execute the method according to any one of claims 1 to 8.
15. A computer program product comprising instructions, characterized in that, when the instructions are run on a computer, they cause the computer to execute the method according to any one of claims 1 to 8.
CN201710619053.5A 2017-07-26 2017-07-26 Data processing method and device Active CN109947858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710619053.5A CN109947858B (en) 2017-07-26 2017-07-26 Data processing method and device

Publications (2)

Publication Number Publication Date
CN109947858A true CN109947858A (en) 2019-06-28
CN109947858B CN109947858B (en) 2022-10-21

Family

ID=67003894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710619053.5A Active CN109947858B (en) 2017-07-26 2017-07-26 Data processing method and device

Country Status (1)

Country Link
CN (1) CN109947858B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040193414A1 (en) * 2000-01-27 2004-09-30 Manning & Napier Information Services, Llc Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
US20070150424A1 (en) * 2005-12-22 2007-06-28 Pegasus Technologies, Inc. Neural network model with clustering ensemble approach
CN101059806A (en) * 2007-06-06 2007-10-24 华东师范大学 Word sense based local file searching method
CN103473369A (en) * 2013-09-27 2013-12-25 清华大学 Semantic-based information acquisition method and semantic-based information acquisition system
CN105284089A (en) * 2013-06-27 2016-01-27 华为技术有限公司 Data transmission method and apparatus
CN106021578A (en) * 2016-06-01 2016-10-12 南京邮电大学 Improved text classification algorithm based on integration of cluster and membership degree
CN106557485A (en) * 2015-09-25 2017-04-05 北京国双科技有限公司 A kind of method and device for choosing text classification training set
CN106776713A (en) * 2016-11-03 2017-05-31 中山大学 It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHUGE H., "Peer-to-Peer in Metric Space and Semantic", IEEE Transactions on Knowledge & Data Engineering *
DAI Xinyu et al., "A text classification method LSASGT based on latent semantic analysis and a transductive spectral graph algorithm", Acta Electronica Sinica *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674293A (en) * 2019-08-27 2020-01-10 电子科技大学 Text classification method based on semantic migration
CN111930463A (en) * 2020-09-23 2020-11-13 杭州橙鹰数据技术有限公司 Display method and device
CN114696946A (en) * 2020-12-28 2022-07-01 郑州大学 Data encoding method, data decoding method, data encoding device, data decoding device, electronic equipment and storage medium
CN112650836A (en) * 2020-12-28 2021-04-13 成都网安科技发展有限公司 Text analysis method and device based on syntax structure element semantics and computing terminal
CN114696946B (en) * 2020-12-28 2023-07-14 郑州大学 Data encoding and decoding method and device, electronic equipment and storage medium
CN113191147A (en) * 2021-05-27 2021-07-30 中国人民解放军军事科学院评估论证研究中心 Unsupervised automatic term extraction method, apparatus, device and medium
CN113420328B (en) * 2021-06-23 2023-04-28 鹤壁国立光电科技股份有限公司 Big data batch sharing exchange system
CN113420328A (en) * 2021-06-23 2021-09-21 鹤壁国立光电科技股份有限公司 Big data batch sharing exchange system
CN114743681A (en) * 2021-12-20 2022-07-12 健康数据(北京)科技有限公司 Case grouping screening method and system based on natural language processing
CN114743681B (en) * 2021-12-20 2024-01-30 健康数据(北京)科技有限公司 Case grouping screening method and system based on natural language processing
CN114732634A (en) * 2022-05-19 2022-07-12 佳木斯大学 Clinical medicine is with preventing neonate and infecting probability analytic system and isolating device thereof
CN118335348A (en) * 2024-06-12 2024-07-12 临沂亿通软件有限公司 Medical big data processing method and system based on cloud computing
CN118335348B (en) * 2024-06-12 2024-09-17 互网嘉(上海)信息技术有限公司 Medical big data processing method and system based on cloud computing

Also Published As

Publication number Publication date
CN109947858B (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN109947858A (en) A kind of method and device of data processing
CN111339774B (en) Text entity relation extraction method and model training method
CN107943860B (en) Model training method, text intention recognition method and text intention recognition device
CN104239535B (en) A kind of method, server, terminal and system for word figure
CN110334344B (en) Semantic intention recognition method, device, equipment and storage medium
CN111931501B (en) Text mining method based on artificial intelligence, related device and equipment
CN110276075A (en) Model training method, name entity recognition method, device, equipment and medium
CN110704661B (en) Image classification method and device
US20080065623A1 (en) Person disambiguation using name entity extraction-based clustering
CN111177180A (en) Data query method and device and electronic equipment
CN107330022A (en) A kind of method and device for obtaining much-talked-about topic
CN107426177A (en) A kind of user behavior clustering method and terminal, computer-readable recording medium
WO2021147421A1 (en) Automatic question answering method and apparatus for man-machine interaction, and intelligent device
CN107273416A (en) The dark chain detection method of webpage, device and computer-readable recording medium
CN110276010A (en) A kind of weight model training method and relevant apparatus
CN111651604A (en) Emotion classification method based on artificial intelligence and related device
CN115022098B (en) Artificial intelligence safety target range content recommendation method, device and storage medium
CN112749252A (en) Text matching method based on artificial intelligence and related device
CN110597957B (en) Text information retrieval method and related device
CN108897846A (en) Information search method, equipment and computer readable storage medium
CN116975295B (en) Text classification method and device and related products
CN110781274A (en) Question-answer pair generation method and device
CN114328908A (en) Question and answer sentence quality inspection method and device and related products
CN113822038A (en) Abstract generation method and related device
CN113704008A (en) Anomaly detection method, problem diagnosis method and related products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TG01 Patent term adjustment