CN109947858A - A kind of method and device of data processing - Google Patents
- Publication number: CN109947858A
- Application number: CN201710619053.5A
- Authority
- CN
- China
- Prior art keywords
- element group
- training data
- training
- matrix
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present application discloses a method and device for data processing. The method comprises: obtaining a training data set to be processed, the training data set including at least two pieces of training data that have undergone semantic analysis; performing cluster analysis on the training data in the training data set to obtain a target data set, the target data set including at least two pieces of training data whose similarity is higher than a preset similarity; and mapping each piece of training data in the target data set to the same category list, the category list being used to provide an entry for obtaining the training data under that category list. By adopting this scheme, the accuracy of multi-data-source mapping can be improved: training data that differ in form but are semantically identical or similar can be accurately recognized, improving the reliability and fault tolerance of the mapping.
Description
Technical field
The present application relates to the field of big data processing, and in particular to a method and device for data processing.
Background technique
At present, in the field of big data processing, some O2O websites provide users with introductory information about various merchants, institutions and the like. Such a website may obtain data sources provided by multiple third parties and then map the data sources that describe the same merchant or institution under the same catalogue for the user to choose from. However, because the data sources provided by the third parties may be incomplete, non-standard or even partly inaccurate, the mapping may fail when the website performs multi-data-source mapping. String matching is mainly used at present, i.e. exact matching and fuzzy substring matching: only on the premise that the third-party data sources are identical, or partly identical, in form are they considered to map under the same catalogue. It can be seen that when a data source is non-standard or its information changes, string matching may be unable to complete the mapping; its recognition probability is limited and its fault tolerance is low.
Although existing multi-dimensional matching can match on each piece of local information of a data source, the matching fails if some local information is inconsistent. For example, when several third parties simultaneously provide hospital data for one hospital, if the hospital names do not match exactly, or information such as the telephone number, address or doctors differs, the several pieces of hospital data are considered not to map under the same catalogue. In essence, however, one hospital includes information such as multiple departments, multiple doctors and multiple telephone numbers; the hospital data provided by a third party may be incomplete and yet still describe the same hospital. It can be seen that the mapping probability of multi-dimensional matching is not high, and neither is its fault tolerance.
Summary of the invention
The present application provides a method and device for data processing, which can solve the prior-art problem that the mapping probability of multi-data-source mapping is not high.
A first aspect of the present application provides a method of data processing, the method comprising:
obtaining a training data set to be processed, the training data set including at least two pieces of training data that have undergone semantic analysis;
performing cluster analysis on the training data in the training data set to obtain a target data set, the target data set including at least two pieces of training data whose similarity is higher than a preset similarity; and
mapping each piece of training data in the target data set to the same category list, the category list being used to provide an entry for obtaining the training data under that category list.
A second aspect of the present application provides a device for processing data, which has the function of implementing the method of data processing provided in the first aspect above. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function; the modules may be software and/or hardware.
In one possible design, the device includes:
an obtaining module, configured to obtain a training data set to be processed, the training data set including at least two pieces of training data that have undergone semantic analysis;
a processing module, configured to perform cluster analysis on the training data in the training data set obtained by the obtaining module, to obtain a target data set, the target data set including at least two pieces of training data whose similarity is higher than a preset similarity; and
a mapping module, configured to map each piece of training data in the target data set obtained by the processing module to the same category list, the category list being used to provide an entry for obtaining the training data under that category list.
Another aspect of the present application provides a device for processing data, comprising at least one processor, a memory, a transmitter and a receiver connected to one another, wherein the memory is used to store program code, and the processor is used to call the program code in the memory to execute the method described in the first aspect above.
Another aspect of the present application provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to execute the method described in the first aspect above.
Another aspect of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
Compared with the prior art, in the scheme provided by the present application the obtained training data set includes at least two pieces of training data that have undergone semantic analysis. Through this semantic-analysis preprocessing, the training data that may map to the same category list can be roughly identified, narrowing the mapping range. Cluster analysis is then performed on the training data in the training data set to obtain a target data set containing at least two pieces of training data whose similarity is higher than a preset similarity; because the cluster analysis identifies the training data with higher similarity, the training data that can truly be mapped to the same category can be further determined. Finally, each piece of training data in the target data set is mapped to the same category list. It can be seen that the present application can improve the accuracy of multi-data-source mapping, accurately recognize training data that differ in form but are semantically identical or similar, and improve the reliability and fault tolerance of the mapping.
Detailed description of the invention
Fig. 1 is a schematic diagram of a network topology in an embodiment of the present invention;
Fig. 2 is a schematic flow diagram of a method of data processing in an embodiment of the present invention;
Fig. 3-a is a schematic diagram of element-group division in an embodiment of the present invention;
Fig. 3-b is a schematic diagram of a second matrix in an embodiment of the present invention;
Fig. 4 is a schematic diagram of a method of data processing in an embodiment of the present invention;
Fig. 5 is a schematic diagram of a frequency matrix in an embodiment of the present invention;
Fig. 6 is a schematic diagram of a TF-IDF matrix in an embodiment of the present invention;
Fig. 7 is a schematic diagram of a similarity ranking for a traditional-Chinese-medicine hospital in an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a device for data processing in an embodiment of the present invention;
Fig. 9 is another schematic structural diagram of a device for data processing in an embodiment of the present invention;
Fig. 10 is another schematic structural diagram of a server for data processing in an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a mobile phone for data processing in an embodiment of the present invention.
Specific embodiment
The terms "first", "second" and the like in the description, claims and drawings of the present application are used to distinguish similar objects and are not intended to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "comprise" and "have", and any variants thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or modules is not necessarily limited to the steps or modules explicitly listed, but may include other steps or modules that are not explicitly listed or that are inherent to the process, method, product or device.
The division into modules in the present application is only a logical division; other divisions are possible in practical implementation. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the couplings, direct couplings or communication connections shown or discussed may be implemented through interfaces, and indirect couplings or communication connections between modules may be electrical or of other similar forms, none of which is limited in the present application. Modules or sub-modules described as separate components may or may not be physically separate, may or may not be physical modules, and may be distributed over multiple circuit modules; part or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of the present application.
The present application provides a method and device for data processing that can be used in the field of big data processing, for example for collecting data provided by third-party platforms. For instance, merchant details provided by various websites are collected, and the merchant details belonging to the same merchant are associated under the same catalogue to provide the user with a browsing service. As an example, hospital data about Shenzhen First People's Hospital is provided by four third-party platforms. Although detailed information in these four pieces of hospital data, such as the hospital name, hospital address, doctors, departments or department telephone numbers, may be inconsistent, after the data is analysed all four pieces essentially belong to Shenzhen First People's Hospital, so the four pieces of hospital data are associated under the same hospital catalogue for the user to view and select.
Fig. 1 is a schematic diagram of a network topology for collecting and processing multiple data sources. In Fig. 1, a server can interact with multiple terminal devices and collect hospital data 1, hospital data 2, ..., hospital data n from these terminal devices. After collecting the hospital data, the server may first pre-screen it to filter out a set of similar hospital data, then perform cluster analysis on the hospital data in the set, mapping the hospital data from word space to semantic space to obtain the pieces of hospital data whose similarity exceeds a preset threshold. Finally, the top-ranked pieces of hospital data by similarity are mapped under the same hospital catalogue and provided to an online platform, allowing patients to autonomously choose on the platform the hospital they want to view.
It should be noted that a terminal device referred to in the present application may be a device that provides voice and/or data connectivity to a user, a handheld device with a wireless connection function, or another processing device connected to a wireless modem. A wireless terminal may communicate with one or more core networks via a radio access network (RAN). The wireless terminal may be a mobile terminal, such as a mobile phone (or "cellular" phone), or a computer with a mobile terminal, for example a portable, pocket-sized, handheld, computer-built-in or vehicle-mounted mobile device that exchanges voice and/or data with the radio access network. Examples include personal communication service (PCS) phones, cordless phones, Session Initiation Protocol (SIP) phones, wireless local loop (WLL) stations and personal digital assistants (PDA). A wireless terminal may also be referred to as a system, a subscriber unit (Subscriber Unit), a subscriber station (Subscriber Station), a mobile station (Mobile Station or Mobile), a remote station (Remote Station), an access point (Access Point), a remote terminal (Remote Terminal), an access terminal (Access Terminal), a user terminal (User Terminal), a terminal device, a user agent (User Agent), a user device (User Device) or user equipment (User Equipment).
To solve the above technical problems, the present application mainly provides the following technical scheme:
The present application can handle the mapping of multiple data sources based on a latent semantic analysis model. After multiple data sources are obtained, the semantics of each data source are first extracted based on the latent semantic analysis model; semantics can be expressed in mathematical language. The similarity comparison between the data sources is then carried out on the semantics: the high-dimensional space formed by the words in the data sources is converted into a low-dimensional semantic space, and the comparison is performed on the abstracted semantics in that space. In this way, when comparing similarity, inconsistencies in individual words between data sources can be ignored when the underlying semantics are identical or similar. The present application does not need to pay attention to the order in which the words appear, but relies on a "co-occurrence" assumption: if, for example, two words frequently appear together across many data sources, the two words can be considered semantically similar. For instance, many articles describing automobiles may use both "engine" and "motor"; based on the latent semantic analysis model, the two words are then considered semantically similar rather than treated as unrelated words. This improves the accuracy of the similarity analysis and, to a certain extent, the probability of recognizing that data sources belong to the same category, thereby increasing fault tolerance for the data sources.
Referring to Fig. 2, a method of data processing provided by the present application is described below. The method mainly includes:
201. Obtain a training data set to be processed.
The training data set includes at least two pieces of training data that have undergone semantic analysis.
Semantic analysis refers to performing semantic checking and processing according to the grammatical categories recognized by a syntax analyser, generating corresponding intermediate code or object code. In the present application, before cluster analysis is performed on the training data, pre-screening may be carried out on each piece of training data in the training data set through semantic analysis in order to reduce the workload. This narrows the range of the cluster analysis, filters out the training data with higher similarity, and improves the accuracy of the data analysis by excluding pieces that are superficially similar but do not essentially belong under the same category catalogue.
202. Perform cluster analysis on the training data in the training data set to obtain a target data set.
The target data set includes at least two pieces of training data whose similarity is higher than a preset similarity.
In some embodiments, the target data set can be obtained through the following steps (1) and (2):
(1) Map each piece of training data in the training data set from element-group space to semantic space.
Mapping training data from element-group space to semantic space may include the following.
First, element-group division is performed on each piece of training data in the training data set, obtaining at least two element-group sets. Each element-group set includes at least one element group and corresponds to one piece of training data; an element group denotes a set of at least one indivisible element. Fig. 3-a is a schematic diagram of element-group division: the original training data contains element group 1, element group 2, ..., element group n, noise data 1 and noise data 2. The noise data interferes with the training of the bag-of-words model, so noise data 1 and noise data 2 need to be removed. As another example, word division is performed on key words such as the hospital name, telephone number, hospital address, department names and doctors in a hospital document. Specifically, a Chinese word segmentation tool may be used; this effectively removes noise data such as punctuation marks, stop words and HyperText Markup Language (HTML) tags, so that after the Chinese word segmentation the interference of noise data with the bag-of-words training process is reduced. The present application does not limit the manner of element-group division or the word segmentation tool.
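As a rough sketch of this noise-removal step (the patent leaves the segmentation tool open; a real Chinese pipeline would use a dedicated segmenter, and the stop-word list below is purely illustrative), HTML tags, punctuation and stop words can be stripped while word tokens are kept:

```python
import re

STOP_WORDS = {"the", "a", "of", "and"}   # illustrative stop-word list

def tokenize(text):
    """Split a document into an 'element group' list: strip HTML tags
    and punctuation (noise data) and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags
    tokens = re.findall(r"[A-Za-z0-9]+", text)    # drop punctuation
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

doc = "<b>Shenzhen No.1 Hospital</b>, Dept. of Cardiology."
print(tokenize(doc))
# ['shenzhen', 'no', '1', 'hospital', 'dept', 'cardiology']
```

Each resulting token list plays the role of one element-group set for the vectorization step that follows.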
Second, vectorization is performed on the at least two element-group sets to obtain a first matrix, which can be used to indicate the frequency with which at least one element group occurs in each element-group set. In some embodiments, the first matrix can be obtained by the following operations:
performing vectorization on the at least two element-group sets according to the frequency with which the element groups occur in each element-group set, obtaining at least two training vectors; and
forming the first matrix from the at least two training vectors obtained. The present application does not limit the manner of obtaining the first matrix.
Fig. 3-b is a schematic diagram of the first matrix; when hospital data is analysed, this matrix can be a frequency matrix (as shown in Fig. 5).
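A minimal sketch of building such a frequency matrix from tokenized documents (the documents and vocabulary here are illustrative, not from the patent's figures):

```python
from collections import Counter

def frequency_matrix(docs):
    """First matrix: one row (training vector) per document, one column
    per vocabulary term, entries are raw occurrence counts."""
    vocab = sorted({t for doc in docs for t in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[t] for t in vocab])
    return vocab, rows

docs = [["hospital", "cardiology", "hospital"],
        ["hospital", "dermatology"]]
vocab, M = frequency_matrix(docs)
print(vocab)   # ['cardiology', 'dermatology', 'hospital']
print(M)       # [[1, 0, 2], [0, 1, 1]]
```

Stacking one such training vector per document yields the first matrix described above.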
Then, a second matrix is calculated according to the weights of the element groups, the frequencies of the element groups and the first matrix; the second matrix is used to indicate the weighted frequency values of the element groups.
Finally, bag-of-words training is performed on the second matrix.
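Since the embodiment's second matrix is a TF-IDF matrix (Fig. 6), the weighting step can be sketched as follows; the log-based idf form is one common choice and an assumption here, as the patent does not fix the exact formula:

```python
import math

def tfidf_matrix(M):
    """Second matrix: weight each raw count by idf = log(N / df), where
    df is the number of documents containing the term and N the number
    of documents. Assumes every term occurs in at least one document."""
    n_docs = len(M)
    n_terms = len(M[0])
    df = [sum(1 for row in M if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in M]

M = [[1, 0, 2],    # frequency matrix from the previous step
     [0, 1, 1]]
W = tfidf_matrix(M)
# The last column's term appears in every document, so its idf - and
# hence its weighted frequency - is 0; rarer terms keep positive weight.
print(W)
```

Down-weighting terms that occur everywhere is what lets the later similarity comparison focus on the distinguishing key words of each piece of training data.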
(2) Calculate the similarity between the pieces of training data mapped to semantic space, and determine the target data set according to the similarity between the pieces of training data.
203. Map each piece of training data in the target data set to the same category list.
The category list is used to provide an entry for obtaining the training data under that category list.
In the scheme provided by the present application, the obtained training data set includes at least two pieces of training data that have undergone semantic analysis. Through this semantic-analysis preprocessing, the training data that may map to the same category catalogue can be roughly identified, narrowing the mapping range. Cluster analysis is then performed on the training data in the training data set to obtain a target data set containing at least two pieces of training data whose similarity is higher than a preset similarity; because the cluster analysis identifies the training data with higher similarity, the training data that can truly be mapped to the same category can be further determined. Finally, each piece of training data in the target data set is mapped to the same category list. It can be seen that the present application can improve the accuracy of multi-data-source mapping, accurately recognize training data that differ in form but are semantically identical or similar, and improve the reliability and fault tolerance of the mapping.
Optionally, in some embodiments of the invention, the calculated first matrix may be excessively sparse, especially when the training data volume is large, which seriously increases the amount of computation and lengthens the computation time. In addition, the first matrix may contain considerable noise data, which interferes with the training of the bag-of-words model. Near-synonyms also cause considerable interference in the similarity calculation: the computed similarity is low even though the essential similarity is high, which can cause training data that should be mapped under the same category catalogue to be considered unmappable. To eliminate these kinds of interference, the present application further provides the following scheme. In some embodiments, when bag-of-words training is performed on the second matrix, singular value decomposition can be performed on the second matrix based on the bag-of-words model, obtaining a left singular matrix, a diagonal matrix and a right singular matrix, so as to perform dimensionality reduction on the second matrix and remove the noise data in it.
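The dimensionality-reduction step can be sketched as a truncated SVD (a minimal illustration with made-up numbers; the rank k would be tuned in practice):

```python
import numpy as np

def truncated_svd(M, k):
    """Keep only the k largest singular values of the second matrix,
    so M is approximated by (left singular matrix) x (diagonal matrix)
    x (right singular matrix). Small singular values carry the noise
    directions and are discarded."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

M = np.array([[3.0, 0.0],
              [0.0, 0.01]])            # the tiny component stands in for noise
U, s, Vt = truncated_svd(M, k=1)
M_denoised = U @ np.diag(s) @ Vt
print(M_denoised)                      # ~[[3, 0], [0, 0]]: noise removed
```

The reduced matrix is both denser to work with and free of the weak directions that made near-duplicate training data look dissimilar.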
The present application can train the TF-IDF matrix based on a bag-of-words model (BoW). A bag-of-words model ignores the grammar and word order of a text and expresses a passage or a document as a group of unordered words; it is mainly used for text classification and is one of the simple assumptions in natural language processing and information retrieval. In this model, a text (paragraph or document) is regarded as an unordered collection of words, ignoring grammar and even word order. The basic idea of a bag-of-words model includes:
1. Extract features: select features according to the data set and describe them to form feature data; for images, detect SIFT keypoints and then compute the keypoint descriptors, generating 128-dimensional feature vectors.
2. Learn the bag of words: merge all of the processed feature data, then divide the feature words into several classes with a clustering algorithm, the number of classes being set as required; each class is equivalent to one visual word.
3. Quantize image features using the visual bag of words: each image is composed of many visual words, and a statistical word-frequency histogram can be used to indicate which class an image belongs to.
Model training based on a bag-of-words model mainly includes feature point extraction and cluster analysis. In cluster analysis, a cluster consists of several patterns; in general, a pattern is a measurement vector, i.e. a point in a multi-dimensional space.
Cluster analysis is based on similarity: there is more similarity between patterns within one cluster than between patterns that are not in the same cluster. In the present application, cluster analysis can use model-based methods, mainly comprising three steps:
1) determine an initial cluster centre for each cluster, so that there are k initial cluster centres;
2) assign each sample in the sample set to its nearest cluster centre according to the minimum-distance principle;
3) use the sample mean in each cluster as the new cluster centre, until the cluster centres no longer change.
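The three steps above are the classic k-means iteration; a minimal sketch (the sample points and initial centres are illustrative only):

```python
import math

def kmeans(points, centers, iters=100):
    """Steps 1-3 above: the caller supplies k initial centres (step 1);
    each sample is assigned to its nearest centre (step 2); each centre
    is recomputed as its cluster mean (step 3) until centres stop moving."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:                      # step 2: minimum-distance assignment
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        new_centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]  # step 3: mean as new centre
        if new_centers == centers:            # centres no longer change
            break
        centers = new_centers
    return centers

# Two well-separated toy clusters; k = 2 initial centres (step 1).
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
result = kmeans(pts, centers=[(0.0, 0.0), (10.0, 10.0)])
print(result)   # [(0.0, 0.5), (10.0, 10.5)]
```

In the embodiment the "points" would be the training data's vectors in the reduced semantic space rather than 2-D coordinates.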
In some embodiments, bag-of-words models mainly include the latent semantic analysis (LSA) model and the probabilistic latent semantic analysis (PLSA) model.
In other embodiments, a word-vector representation (word2vec) can also be used. Based on this model, each word can be mapped by training to a K-dimensional real-valued vector (K is generally a hyperparameter of the model), and the semantic similarity between words is judged by the distance between them (such as cosine similarity or Euclidean distance). It uses a three-layer neural network: input layer, hidden layer and output layer. The core technique is Huffman coding according to word frequency, so that the hidden-layer activations of words with similar frequency are almost identical; the more frequently a word occurs, the fewer hidden-layer units it activates, which reduces computational complexity. The present application does not limit the model on which the dimensionality reduction of the second matrix is based.
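The cosine-similarity distance mentioned above can be sketched as follows; the 3-dimensional "embeddings" are made-up values for illustration (real word2vec vectors would be learned, with a much larger K):

```python
import math

def cosine_similarity(u, v):
    """Semantic similarity between two K-dimensional word vectors:
    dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings; the values are illustrative only.
engine = [0.9, 0.1, 0.0]
motor  = [0.8, 0.2, 0.1]
fruit  = [0.0, 0.1, 0.9]
print(cosine_similarity(engine, motor) > cosine_similarity(engine, fruit))   # True
```

Words with nearby vectors (high cosine similarity) are treated as semantically similar, mirroring how the latent-semantic-analysis route compares abstracted semantics.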
Optionally, in some embodiments of the invention, each element group has a weight, which can be used to indicate the relative importance of the element group in the overall evaluation and thereby effectively distinguish the keywords in each piece of training data. Specifically, for a first element group in an element group set, the weight of the first element group is obtained from the total number of element group sets and the number of element group sets that contain the first element group, where the first element group refers to an arbitrary element group in the element group set.
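The weight just described, derived from the total number of element group sets and the number of sets containing the group, corresponds to an inverse-document-frequency style weight. A minimal sketch, assuming the natural-logarithm form used later in the TF-IDF discussion:

```python
import math

def element_group_weight(group, group_sets):
    """Weight of an element group: ln(total sets / sets containing it).
    The ln form follows the IDF definition given later in the text."""
    total = len(group_sets)
    containing = sum(1 for s in group_sets if group in s)
    return math.log(total / containing)

# Illustrative element group sets (one set per piece of training data).
sets_ = [{"hospital", "henan"}, {"hospital", "road"},
         {"hospital", "health"}, {"center", "health"}]

# "hospital" appears in 3 of 4 sets -> low weight;
# "center" appears in only 1 set -> high weight.
w_hospital = element_group_weight("hospital", sets_)
w_center = element_group_weight("center", sets_)
```

A group shared by many training data thus carries little distinguishing power, while a rare group is weighted up.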
Optionally, in some embodiments of the invention, after the similarity comparison of each piece of training data, in order to increase the fault tolerance of the system, the training data ranked in the top A by similarity may be selected to enter the next, more precise judgment, namely the association-rule judgment. In this way, even if the similarity comparison has a certain error, i.e., the training data that should be mapped is not ranked first, it will not be missed. In some application scenarios, A may also be selected according to the business scenario, the total number of current data sources of the registration platform, and the number of overlapping training data; the value of A may change dynamically, and the present application does not limit it. For hospital data, the top 10 may be taken for the next-step association-rule judgment. Specifically, after the cluster analysis of the training data in the training data set and before each piece of training data in the target training set is associated with the same category list, the embodiment of the present application may further include:
judging whether the training data in the target data set meets the mapping rule, and, if it is determined that the mapping rule is met, mapping each piece of training data in the target training set to the same category list.
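The Top-A fault-tolerance step can be sketched as follows. This is a hedged illustration only: `meets_rules` is a hypothetical placeholder for the association-rule judgment described later, not a function defined by the application:

```python
# Keep the A most similar candidates from the similarity comparison,
# then hand each to the stricter association-rule check in rank order.

def top_a_candidates(scored, a):
    """scored: list of (candidate, similarity); return the best `a`."""
    return sorted(scored, key=lambda x: x[1], reverse=True)[:a]

def map_with_rules(scored, a, meets_rules):
    for cand, sim in top_a_candidates(scored, a):
        if meets_rules(cand):
            return cand  # first candidate passing the rules is mapped
    return None          # nothing passes: new entry or manual review

scored = [("A", 0.41), ("B", 0.92), ("C", 0.77), ("D", 0.63)]
best = map_with_rules(scored, 3, lambda c: c == "C")
```

Even though "C" is not ranked first by similarity, it survives the Top-3 cut and is mapped by the rule check, which is exactly the error tolerance the text describes.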
Optionally, in some embodiments, the mapping rule is met when:
the similarity between training data is higher than the preset similarity, and, in descending order of element-group grade, the element group in one element group set is judged to be identical or similar in the semantic space to the element group of the same grade in another element group set; if so, it is determined that the mapping rule is met; if not, the judgment proceeds to the next lower grade.
For ease of understanding, the hospital data of a registration platform is taken as an example below. The data processing module of the registration platform may use the LSA model. After obtaining the hospital data of multiple partners, the data processing module stores the hospital, department, and doctor data in the hospital data into database tables; these tables are called external tables. The data in the external tables is then mapped into internal tables and provided to the online module of the registration platform.
The data processing module belongs to the preprocessing part of the registration platform and can be completed offline, so users who use the registration platform to view hospital data are unaware of it. The data processing module needs to process two parts of data, hospital and department: the hospital part mainly contains information such as the hospital name, brief introduction, telephone number, address, city and district information, and hospital nature and rank; the department part mainly contains information such as the department name, a brief introduction, and doctors' brief introductions. Information composed of natural language in the external tables, such as the hospital name, alias, introduction, address, and telephone number, is extracted to form a document describing the hospital. A primary screening of the hospital data provided by partners is carried out by judging the similarity between the documents, which yields several pieces of hospital data with higher similarity; association-rule judgment is then applied to these high-similarity pieces. They are described separately below:
One, training the LSA model to cluster documents
The LSA model is an unsupervised learning model and does not require pre-labeled training data; the hospital documents formed above are the training data, but a series of processing steps is needed before model training can be carried out. The preparation process of LSA model training, shown in Figure 4, is mainly divided into Chinese word segmentation, document vectorization, calculating the TF-IDF values of the document collection, and training the LSA model with the TF-IDF matrix. These are described separately below:
(1) Chinese word segmentation. There are many mature open-source segmentation tools. Note that punctuation marks, stop words, and HTML tags should be removed, as they are all noise data in model training.
(2) All words in the whole document collection are examined, each word is assigned a numeric id, and its word frequency is calculated.
For example, consider hospital document 1, "Coking Coal Central Hospital, Health Road, Jiaozuo City, Henan Province", which consists of the hospital name and address. After Chinese word segmentation, the result is: coking coal, central, hospital, Henan Province, Jiaozuo City, health, road. Suppose the id of "coking coal" is 8, "central" is 52, "hospital" is 268, "Henan Province" is 500, "Jiaozuo City" is 1608, "health" is 2112, and "road" is 3068. Then the document vector of this hospital document can be expressed as:
[(8,1), (52,1), (268,1), (500,1), (1608,1), (2112,1), (3068,1)].
Suppose another hospital document 2 is "Coking Coal Group Hospital, Health Road, Jiaozuo, Henan". After Chinese word segmentation and dictionary numbering, its document vector can be expressed as:
[(8,1), (52,1), (268,1), (297,1), (574,1), (1608,1), (2142,1), (3068,1)].
It can be seen that these two hospital documents have a certain similarity. The dictionary space of the document collection formed by these two documents merges into the following document vector:
[(8,2), (52,2), (268,2), (297,1), (574,1), (1608,2), (2142,1), (3068,2), (500,1), (2112,1)]
Since the dictionary ids are of no use for model training, only the second dimension, the word-frequency value, is needed; however, the document vector of each document must cover all dictionary ids in the dictionary.
Therefore, the final document vector of hospital document 1 is: [1,1,1,0,0,1,0,1,1,1]
and the final document vector of hospital document 2 is: [1,1,1,1,1,1,1,1,0,0]
Considering that the overlap of words between hospital documents will not be too high, each document vector is likely to be a sparse vector containing a large number of zeros. After the entire document collection is vectorized, a sparse matrix of word frequencies is formed, in which each row is a word and each column is a document. The word-frequency matrix is illustrated in Figure 5.
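The vectorization in step (2) can be sketched as follows. English stand-in tokens replace the segmented Chinese words, and the token lists are illustrative assumptions; the point is that every document vector is built over the merged dictionary of the whole collection, so vectors share one index space:

```python
# Each document becomes a word-frequency vector over the merged dictionary,
# mirroring the two hospital documents in the example above.

doc1 = ["coking-coal", "central", "hospital", "henan-province",
        "jiaozuo-city", "health", "road"]
doc2 = ["coking-coal", "central", "hospital", "group", "henan",
        "jiaozuo", "health", "road"]

# Merged dictionary: every word seen anywhere in the collection.
vocab = sorted(set(doc1) | set(doc2))

def to_vector(doc):
    return [doc.count(w) for w in vocab]  # frequency per dictionary entry

v1, v2 = to_vector(doc1), to_vector(doc2)
```

Because each vector must cover the whole dictionary, words absent from a document appear as zeros, which is why the vectors are sparse.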
(3) Calculating the TF-IDF values of the document collection
TF, i.e., Term Frequency, is the word frequency, which was calculated in the previous step. IDF, i.e., Inverse Document Frequency, is calculated as the total number of documents divided by the number of documents containing the word, with the natural logarithm then taken of the quotient.
TF-IDF is TF multiplied by IDF; IDF is equivalent to a weight on the word. Compared with raw word frequency, the TF-IDF value describes a word more reasonably. For example, some words appear many times in a document, so their TF is large, but these words are also common throughout the document collection and therefore contribute little to distinguishing the documents; IDF solves this problem by weighting the frequency of each word, so that the more common a word is in the document collection, the smaller its IDF value. Take the word "hospital": almost every document in the hospital data contains it, and its frequency in each document is high, so its contribution to document similarity would outweigh that of other words; yet "hospital" in fact has little power to distinguish documents. It should therefore be given a lower weight to offset the negative effect of its high word frequency, and IDF is exactly such a weight.
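The TF-IDF weighting just described can be sketched directly, assuming the natural-logarithm IDF form given above (the example documents are illustrative):

```python
import math

# tf is the in-document frequency of the word;
# idf = ln(total documents / documents containing the word).

def tf_idf(word, doc, docs):
    tf = doc.count(word)
    df = sum(1 for d in docs if word in d)
    return tf * math.log(len(docs) / df)

docs = [["hospital", "health", "road"],
        ["hospital", "central"],
        ["hospital", "henan", "health"]]

# "hospital" occurs in every document, so its weight collapses to 0;
# a word seen in only one document keeps a positive weight.
w_common = tf_idf("hospital", docs[0], docs)
w_rare = tf_idf("central", docs[1], docs)
```

This matches the "hospital" example: a word present in every document gets IDF = ln(1) = 0, neutralizing its high frequency.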
The TF-IDF matrix calculated from the word-frequency matrix also represents the documents as vectors (the column vectors of the matrix); the TF-IDF matrix is shown in Figure 6.
In some embodiments, after the TF-IDF matrix is obtained, the document vectors (the column vectors of the matrix) are determined and could in fact already be used to calculate the cosine similarity of documents, but direct calculation has three problems:
1. the TF-IDF matrix is too sparse, and the calculation is very time-consuming when the data volume is large;
2. noise data caused by singular values is excessive;
3. near-synonyms interfere.
For hospital data, because the documents contain hospital introductions, there are words of little significance to model clustering; these words can be called noise, and the document vectors are all high-dimensional sparse vectors. Noise is usually handled by dimension reduction, which also solves the matrix-sparsity problem at the same time.
The interference of near-synonyms with similarity calculation is also significant. For example, "Coking Coal Central Hospital, Health Road, Jiaozuo City, Henan Province" and "Coking Coal Group Hospital, Health Road, Jiaozuo, Henan" are in fact two data sources that need to be mapped to the same hospital, but if calculated from the TF-IDF vectors their similarity will not be high, because "Henan Province" in the former document and "Henan" in the latter are near-synonyms that have been assigned different dictionary ids, so the document vectors differ accordingly. A more general case: two documents describing automobile engines, document one "the engine sound is loud and clear" and document two "the motor sound is loud". Calculated from TF-IDF vectors, the similarity of these two documents is very low, yet from the perspective of natural-language understanding they are extremely similar. This is the problem of near-synonym interference: the information in the TF-IDF matrix is not sufficient to judge that "engine" and "motor" are synonyms, or that "loud and clear" and "loud" are synonyms. A method is therefore needed that both reduces dimensionality and can identify synonyms to transform the matrix, and this is the LSA model.
Model training is carried out on the TF-IDF matrix based on a bag-of-words model; the LSA model is taken as the example below. The basic principle of the LSA model is the singular value decomposition (SVD) of linear algebra: a matrix can be decomposed into the product of three matrices, A = UΣV^T, where A is the original matrix, U is the left singular matrix, Σ is a diagonal matrix, and V is the right singular matrix. Each row of U represents a class of words with related meanings, each column of V represents a class of semantically related documents, and the singular values in Σ are arranged in descending order from top to bottom. Σ is then truncated: suppose the original n-order square matrix is truncated to order k; by the rules of matrix multiplication, U and V must be truncated accordingly. Multiplying these truncated matrices does not reduce the number of documents or the number of words; it is equivalent to merging and disassembling only at the semantic level, so the SVD performs matrix dimension reduction while retaining the important information of the original matrix. Semantic merging can be shown by a formulation, as follows:
0.73*engine + 0.54*motor + 0.3*automobile: such a weighted combination of near-synonyms is the semantic merging of "engine" and "motor";
0.72*tire + 0.7*automobile: such a weighted combination of near-synonyms is the semantic merging of "tire".
Semantic disassembly can be understood through the "automobile" in the above expressions: a 0.3 component of "automobile" is disassembled into the semantics related to "engine", and a 0.7 component into the semantics related to "tire", because the word "automobile" can carry multiple layers of semantics.
The LSA model disassembles the original words semantically, and this is the key point in completing document clustering and dimension reduction. More precisely: the semantic disassembly completes the mapping from the original word space to the semantic space, and in the semantic space similar documents are closer together, so document clustering is achieved. At this point the training of the LSA model is complete.
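The truncation described above is the standard rank-k approximation of the SVD; written out with dimensions, a sketch of it is:

```latex
A_{m\times n} = U_{m\times m}\,\Sigma_{m\times n}\,V^{T}_{n\times n}
\;\approx\;
A_k = U_{m\times k}\,\Sigma_{k\times k}\,V^{T}_{k\times n},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k
```

The rows of U_k relate words to the k retained semantic dimensions, and the columns of V^T_k place documents in the same k-dimensional semantic space, where the document similarity comparison of the next section is then carried out.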
Two, comparison of document similarity
After the LSA model is trained, a hospital to be mapped goes through Chinese word segmentation, vectorization, and TF-IDF calculation, after which the trained LSA model can map it into the semantic space, where similarity matching is carried out against the other hospital document vectors. The calculation method used in this scheme is cosine similarity, with the formula cos θ = (A·B)/(‖A‖‖B‖).
Cosine similarity does not consider vector length, only the angle θ between vectors, which makes it particularly suitable for comparing high-dimensional sparse vectors such as document vectors. Figure 7 shows the similarity ranking, from high to low, of the hospital document "Coking Coal Central Hospital, Health Road, Jiaozuo City, Henan Province" against the hospitals already in the database. In Fig. 7, the first column indicates the category index, the second column the hospital directory index, the third column the hospital name, and the fourth column the similarity calculated with the above cosine-similarity formula. It can be seen that all similar hospitals cluster together, and the ordering is largely correct.
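The cosine-similarity comparison can be sketched directly on the final document vectors from the earlier example (the vectors are those of hospital documents 1 and 2):

```python
import math

# Cosine similarity as used in this scheme: only the angle between the
# document vectors matters, not their lengths.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Final vectors of hospital documents 1 and 2 from the earlier example.
v1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 1]
v2 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
sim = cosine_similarity(v1, v2)
```

The two documents share five dictionary entries, so their raw cosine similarity is moderate; the near-synonym pairs ("Henan Province"/"Henan") only merge after the LSA mapping.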
Specifically, the mapping of hospital data can be divided mainly into the following two scenarios:
A. Mapping in the initial state
Initially, there is only one partner's hospital data, and this partner's hospital data serves as the reference data; at this point each hospital has only one piece of hospital data. When other partners later input data, duplicate hospital data will appear, so similarity comparison is needed: each piece of hospital data from the new partner is compared with all of the reference data.
In the initial state, the first batch of hospital data entered into the database is taken as the reference data. When data is added later, each newly added piece is compared for similarity with every piece of hospital data in the database; the top-ranked pieces of hospital data are then taken, and the association rules are used to judge whether the newly added hospital data can be mapped to hospital data in the database.
If there are multiple batches of data in the initial state, one of them is selected as the reference data, and the similarity comparisons are then carried out separately.
B. Subsequent update mapping
The hospital data of a newly added partner is compared for similarity with every piece of hospital data in the database, and a similarity ranking table is then obtained.
Three, design of the association rules
In order to increase the fault tolerance of the system, after the similarity pre-judgment, i.e., after the similarity comparison passes, it can further be judged whether the data can be mapped to an existing category directory in the database. That is, the top 10 documents by similarity ranking are selected to enter the next, more precise judgment, namely the association-rule judgment. In this way, even if there is a certain error in the previous step, i.e., the hospital that should be mapped is not ranked first, it will not be missed. Judging from the business scenario, the current number of partners of the registration platform, and the number of overlapping hospitals, taking the top 10 is a fairly suitable value.
In addition, the design of the association rules also needs to be adjusted according to the business scenario; for example, the hospital and department data dimensions are different, so their association rules cannot be the same.
The association rules for the hospital data of the registration platform have three tiers: the first tier is identical hospital names => the hospitals can be mapped; the second tier is a hospital alias identical to a hospital name => the hospitals can be mapped; the third tier is identical city code, district code, and telephone number => the hospitals can be mapped. The operation mode of the three-tier association rules is: if the first tier succeeds, the second and third tiers are no longer examined; if the first tier fails, the second tier is examined; if the second tier succeeds, the third tier is no longer examined. If all three tiers fail, it is considered that this hospital cannot be mapped: it is either a new hospital provided by the partner, or it is handed over to manual review.
The association rules for department data are also designed in three tiers: the first tier is identical department names => the departments can be mapped; the second tier is department names with an inclusion relation => the departments can be mapped; the third tier is a matching rate of 60% among the doctors' names under the department => the departments can be mapped. The operation mode of the three-tier rules is the same as for hospitals.
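The three-tier hospital rule can be sketched as an ordered cascade. This is a hedged illustration: the field names and the sample codes/phone number are assumptions, not data from the application:

```python
# Each tier is tried in order; the first tier that succeeds maps the pair.

def hospital_rule_tier(a, b):
    """Return the tier (1/2/3) that maps hospitals a and b, or None."""
    if a["name"] == b["name"]:                        # tier 1: same name
        return 1
    if (b["name"] in a.get("aliases", ()) or
            a["name"] in b.get("aliases", ())):       # tier 2: alias match
        return 2
    if (a["city"], a["district"], a["phone"]) == \
       (b["city"], b["district"], b["phone"]):        # tier 3: codes + phone
        return 3
    return None                     # unmappable: new hospital or manual review

h1 = {"name": "Daping Hospital, Third Military Medical University",
      "aliases": ("Chongqing Daping Hospital",),
      "city": "500100", "district": "500107", "phone": "023-0000000"}
h2 = {"name": "Chongqing Daping Hospital", "aliases": (),
      "city": "500100", "district": "500107", "phone": "023-0000000"}
tier = hospital_rule_tier(h1, h2)
```

Here the names differ but h2's name is an alias of h1, so the pair maps at tier 2, matching the Daping Hospital scenario described below.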
An application scenario of the second-tier hospital association rule is as follows:
Chongqing Daping Hospital VS Daping Hospital, Third Military Medical University [0.634987]
The similarity of these two hospitals is 0.634987; they are finally mapped successfully because the former is an alias of the latter.
An application scenario of the third-tier hospital association rule is as follows:
Shenzhen Institute of Traditional Chinese Medicine, Jindi Seascape Community Healthcare Service Center 440300 440304 0755-23811165
Jindi Seascape Community Health 440300 440304 0755-23811165 [0.775298]
The similarity of these two hospitals is 0.775298; they are finally mapped successfully because their city codes, district codes, and telephone numbers are consistent.
An application scenario of the second-tier department association rule is as follows:
Medical Cosmetology Department (Zhuyuan District) VS Medical Cosmetology Department [0.94992]
Because the titles have an inclusion relation, they are mapped.
An application scenario of the third-tier department association rule is as follows:
1158 Reproductive Medicine Department (Beiyuan) Wang Junxia Chen Hua Zhou Jianjun Wang Fen
1158 25347 North Campus Reproductive Center Wang Fen Chen Hua Wang Junxia Zhou Jianjun [0.536493]
The 60% doctor-name matching rate in the third-tier department association rule is an empirical value and can be adjusted for different application scenarios. Doctor names are included in the association-rule judgment because department information is very sparse compared with hospital information: departments lack information such as address and telephone, and the null-value rate of department introductions in the current registration platform is above 50%, so doctor names are necessary here as part of the association rule. Doctor names are not included in the training data of the LSA model, which means they can be regarded as proper nouns that cannot be disassembled or merged.
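The 60% doctor-name matching rate can be sketched as a set-overlap ratio. This is an assumption-laden illustration: the text does not specify the matching direction, so taking the smaller doctor list as the base is a design choice of this sketch:

```python
# Third-tier department rule: share of one department's doctor names
# found under the other; 0.6 is the empirical threshold from the text.

def doctor_match_rate(doctors_a, doctors_b):
    a, b = set(doctors_a), set(doctors_b)
    if not a or not b:
        return 0.0
    base, other = (a, b) if len(a) <= len(b) else (b, a)
    return len(base & other) / len(base)

dept1 = ["Wang Junxia", "Chen Hua", "Zhou Jianjun", "Wang Fen"]
dept2 = ["Wang Fen", "Chen Hua", "Wang Junxia", "Zhou Jianjun", "Li Ming"]
rate = doctor_match_rate(dept1, dept2)
mappable = rate >= 0.6
```

All four doctors of the smaller list appear under the larger one, so the rate is 1.0 and the departments map; the threshold can be tuned per application scenario as the text notes.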
The theoretical foundation of the LSA model is the SVD decomposition of linear algebra, and the problem with this scheme is the non-interpretability of the semantics. The present application also provides another bag-of-words model, probabilistic latent semantic analysis (PLSA), which is based on probability, with the semantics as a latent variable. Its basic idea is also a space transformation, but its theoretical support is probability theory, so it has better interpretability and is theoretically a better model. However, in the application scenario of mapping hospital and department data, the effect of PLSA is not as good as that of LSA. Among 513 hospitals, 32 can be mapped; the comparison results of LSA and PLSA are shown in Table 1 below:
Hospital | Rule 1 | Rule 2 | Rule 3 |
Based on LSA model | 20 | 1 | 9 |
Based on PLSA model | 12 | 2 | 8 |
Table 1
Here Rule1, Rule2, and Rule3 denote the association rules: Rule1 denotes the first-tier association rule, Rule2 the second-tier association rule, and Rule3 the third-tier association rule.
When association judgment is carried out with Rule1, 20 pieces of hospital data that can be mapped to the same hospital are associated based on the LSA model, while 12 pieces of hospital data that can be mapped to the same hospital are associated based on the PLSA model.
Since some hospital data that can be mapped to the same hospital may be missed when associating with Rule1, the second-tier association judgment continues with Rule2. The association results are: 1 piece of hospital data that can be mapped to the same hospital is associated based on the LSA model, and 2 pieces based on the PLSA model.
Similarly, since some hospital data that can be mapped to the same hospital may be missed when associating with Rule2, the third-tier association judgment continues with Rule3. The association results are: 9 pieces of hospital data that can be mapped to the same hospital are associated based on the LSA model, and 8 pieces based on the PLSA model.
Finally, 30 pieces of hospital data that can be mapped to the same hospital are identified in total based on the LSA model, and 22 pieces in total based on the PLSA model.
In some embodiments, the LSA model may be deployed alone, the PLSA model may be deployed alone, or the LSA and PLSA models may be deployed together for parallel computation, which effectively improves operation efficiency and allows results to be pushed to the client in a timely manner; for users of the client, the change of back-end data is imperceptible.
In some embodiments, the present application may also be based on an extension of word2vec. Word2vec is a word-vector representation that can be generalized to the representation of document vectors, so the similarity comparison of documents can be carried out between such document vectors. This model can take into account the order of words, i.e., the context of words, and can therefore fit natural language better than bag-of-words models (such as the LSA and PLSA models).
The method of data processing in the present application has been described above; the device for executing the above data-processing method is described below. The device may be a server or a terminal device, or an interactive application installed on a server or terminal device. The present application mainly takes the device as a server, and uses as an example the case where the device is an interactive application installed on a server.
One, referring to Fig. 8, the device 80 for data processing is described. The device 80 for data processing may include:
an obtaining module 801, configured to obtain a training data set to be processed, the training data set including at least two pieces of training data after semantic analysis;
a processing module 802, configured to perform cluster analysis on the training data in the training data set obtained by the obtaining module 801 to obtain a target data set, the target data set including at least two pieces of training data whose similarity is higher than a preset similarity;
a mapping module 803, configured to map each piece of training data in the target training set obtained by the processing module 802 to the same category list, the category list being used to provide an entrance for obtaining the training data under the category list.
In the embodiment of the present application, the training data set obtained by the obtaining module 801 includes at least two pieces of training data after semantic analysis. It can be seen that through the preprocessing of semantic analysis, the training data that may be mapped to the same category directory can be roughly judged, reducing the mapping range. The processing module 802 then performs cluster analysis on the training data in the training data set to obtain the target data set, which includes at least two pieces of training data whose similarity is higher than the preset similarity. Since the cluster analysis identifies training data with higher similarity, the training data that can really be mapped to the same category can be further determined. Finally, the mapping module 803 maps each piece of training data in the target training set to the same category list. It can be seen that the present application can improve the accuracy of multi-data-source mapping, accurately recognize training data that differs in form but is semantically the same or similar, and improve the reliability and fault tolerance of mapping.
Optionally, in some embodiments of the invention, the processing module 802 is specifically configured to:
map each piece of training data in the training data set from the element group space to the semantic space;
calculate the similarity between the pieces of training data mapped to the semantic space, and determine the target data set according to the similarity between the training data.
Optionally, in some embodiments of the invention, the processing module 802 is specifically configured to:
perform element group division processing on each piece of training data in the training data set to obtain at least two element group sets, each element group set including at least one element group and corresponding to one piece of training data, an element group indicating a set of at least one indivisible element;
perform vectorization processing on the at least two element group sets to obtain a first matrix, the first matrix being used to indicate the frequency with which at least one element group occurs in each element group set;
calculate a second matrix according to the weight of the element group, the frequency of the element group, and the first matrix, the second matrix being used to indicate the frequency-weighted value of the element group;
perform bag-of-words model training on the second matrix.
Optionally, in some embodiments of the invention, the processing module 802 is specifically configured to:
perform singular value decomposition on the second matrix based on the bag-of-words model to obtain a left singular matrix, a diagonal matrix, and a right singular matrix, so as to carry out dimension-reduction processing on the second matrix and remove the noise data in the second matrix.
Optionally, in some embodiments, the weight of the first element group is obtained according to the total number of element group sets and the number of element group sets including the first element group, the first element group referring to an element group in the element group set.
Optionally, in some embodiments of the invention, the processing module 802 is specifically configured to:
perform vectorization processing on the at least two element group sets according to the frequency with which the element groups occur in each element group set, to obtain at least two training vectors;
form the first matrix from the at least two training vectors obtained.
Optionally, in some embodiments of the invention, after performing cluster analysis on the training data in the training data set and before each piece of training data in the target training set is associated with the same category list, the processing module 802 is further configured to:
judge whether the training data in the target data set meets the mapping rule, and, if it is determined that the mapping rule is met, map each piece of training data in the target training set to the same category list.
Optionally, in some embodiments of the invention, the mapping rule is met when:
the similarity between training data is higher than the preset similarity, and, in descending order of element-group grade, the element group in one element group set is judged to be identical or similar in the semantic space to the element group of the same grade in another element group set; if so, it is determined that the mapping rule is met; if not, the judgment proceeds to the next lower grade.
The present application also provides a computer storage medium storing a program which, when executed, performs some or all of the steps in the above data-processing method executed by the above device for data processing.
The present application also provides a computer program product containing instructions which, when run on a computer, cause the computer to execute some or all of the steps in the method performed by the device for data processing.
The device for data processing in the embodiment of the present invention has been described above from the perspective of modular functional entities; the network authentication server and the terminal device in the embodiment of the present invention are described below from the perspective of hardware processing. It should be noted that the entity device corresponding to the obtaining module in the embodiment shown in Fig. 8 may be an input/output unit, and the entity device corresponding to the processing module may be a processor. The device shown in Fig. 8 may have the structure shown in Fig. 9; when it does, the processor and input/output unit in Fig. 9 can realize the same or similar functions as the processing module and obtaining module provided by the aforementioned device embodiment, and the memory in Fig. 9 stores the program code that the processor needs to call when executing the above data-processing method.
Figure 10 is a schematic structural diagram of a server provided by an embodiment of the present invention. The server may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPU) 1022 (for example, one or more processors), memory 1032, and one or more storage media 1030 (such as one or more mass storage devices) storing application programs 1042 or data 1044. The memory 1032 and the storage medium 1030 may provide transient or persistent storage. The program stored in the storage medium 1030 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and to execute, on the server 1000, the series of instruction operations in the storage medium 1030.
The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
The steps performed by the server in the above embodiments may be based on the server structure shown in Fig. 10.
An embodiment of the present invention also provides another terminal device. As shown in Fig. 11, for ease of description, only the parts related to the embodiment of the present invention are shown; for specific technical details not disclosed here, please refer to the method part of the embodiments of the present invention. The terminal device may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale terminal (Point of Sales, POS), an in-vehicle computer, and the like. The following takes a mobile phone as an example:
Figure 11 shows a block diagram of the part of a mobile phone's structure related to the terminal device provided by an embodiment of the present invention. Referring to Fig. 11, the mobile phone includes components such as a radio frequency (RF) circuit 1111, a memory 1120, an input unit 1130, a display unit 1140, a sensor 1150, an audio circuit 1160, a wireless fidelity (WiFi) module 1170, a processor 1180, and a power supply 1190. Those skilled in the art will appreciate that the mobile phone structure shown in Fig. 11 does not constitute a limitation on the mobile phone, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
Each component of the mobile phone is described in detail below with reference to Fig. 11:
The RF circuit 1111 may be used for receiving and sending signals during information transmission and reception or during a call. In particular, after receiving downlink information from a base station, it passes the information to the processor 1180 for processing; in addition, it sends uplink data to the base station. Generally, the RF circuit 1111 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1111 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and so on.
The memory 1120 may be used to store software programs and modules. The processor 1180 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like, and the data storage area may store data created according to the use of the mobile phone (such as audio data or a phone book), and the like. In addition, the memory 1120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 1130 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1130 may include a touch panel 1131 and other input devices 1132. The touch panel 1131, also referred to as a touch screen, can collect touch operations of the user on or near it (such as operations performed by the user on or near the touch panel 1131 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connected devices according to a preset program. Optionally, the touch panel 1131 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the touch position of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection apparatus, converts it into contact coordinates, sends them to the processor 1180, and can receive and execute commands sent by the processor 1180. In addition, the touch panel 1131 may be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 1131, the input unit 1130 may also include other input devices 1132. Specifically, the other input devices 1132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys or switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 1140 may be used to display information input by the user or information provided to the user, as well as the various menus of the mobile phone. The display unit 1140 may include a display panel 1141, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 1131 may cover the display panel 1141; after the touch panel 1131 detects a touch operation on or near it, it transmits the operation to the processor 1180 to determine the type of the touch event, and the processor 1180 then provides corresponding visual output on the display panel 1141 according to the type of the touch event. Although in Fig. 11 the touch panel 1131 and the display panel 1141 are two separate components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 1131 and the display panel 1141 may be integrated to implement the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 1150, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 1141 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 1141 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes) and can detect the magnitude and direction of gravity when stationary; it can be used in applications that recognize the posture of the mobile phone (such as landscape/portrait switching, related games, and magnetometer pose calibration) and in vibration-recognition functions (such as a pedometer or tap detection), etc. The mobile phone may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described in detail here.
The audio circuit 1160, a speaker 1161, and a microphone 1162 can provide an audio interface between the user and the mobile phone. The audio circuit 1160 can transmit the electrical signal converted from received audio data to the speaker 1161, which converts it into a sound signal for output; conversely, the microphone 1162 converts a collected sound signal into an electrical signal, which is received by the audio circuit 1160 and converted into audio data. After the audio data is processed by the processor 1180, it is sent via the RF circuit 1111 to, for example, another mobile phone, or output to the memory 1120 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1170, the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media, and so on; it provides the user with wireless broadband Internet access. Although Fig. 11 shows the WiFi module 1170, it can be understood that it is not an essential component of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 1180 is the control center of the mobile phone. It connects all parts of the entire mobile phone through various interfaces and lines, and executes the various functions and data processing of the mobile phone by running or executing the software programs and/or modules stored in the memory 1120 and calling the data stored in the memory 1120, thereby monitoring the mobile phone as a whole. Optionally, the processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs, and so on, while the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 1180.
The mobile phone also includes a power supply 1190 (such as a battery) that supplies power to the components. Preferably, the power supply may be logically connected to the processor 1180 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system.
Although not shown, the mobile phone may also include a camera, a Bluetooth module, and so on, which are not described in detail here.
In the embodiment of the present invention, the processor 1180 included in the mobile phone also has the function of controlling execution of the above method flow performed by the terminal device.
In the above embodiments, the description of each embodiment has its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, devices, and modules described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist physically alone, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may be stored in a computer-readable storage medium.
The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented wholly or partly in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, server, or data center to another web site, computer, server, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means. The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (such as a solid-state disk (SSD)), etc.
Claims (15)
1. A method of data processing, characterized in that the method comprises:
obtaining a training data set to be processed, the training data set including at least two pieces of training data that have undergone semantic analysis;
performing cluster analysis on the training data in the training data set to obtain a target data set, the target data set including at least two pieces of training data whose similarity is higher than a preset similarity;
mapping each piece of training data in the target training set to a same category list, the category list being used to provide an entry for obtaining the training data under the category list.
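As a minimal illustration of the final mapping step of claim 1, the following sketch uses a plain dictionary as the "category list", whose key serves as the entry for obtaining the training data stored under it; the data structure is an assumption for illustration, not specified by the claim.

```python
def map_to_category_list(category_lists, category, cluster):
    """Map every piece of training data in `cluster` under the same category
    list; the dict key acts as the entry for retrieving the data under it."""
    category_lists.setdefault(category, []).extend(cluster)
    return category_lists

lists = {}
map_to_category_list(lists, "coking coal", ["coking coal price", "coking coal index"])
print(lists["coking coal"])
```

Looking up the key then yields all clustered training data mapped under that entry.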
2. The method according to claim 1, characterized in that the performing cluster analysis on the training data in the training data set to obtain a target data set comprises:
mapping each piece of training data in the training data set from an element group space to a semantic space;
calculating the similarity between the pieces of training data mapped to the semantic space, and determining the target data set according to the similarity between the pieces of training data.
3. The method according to claim 2, characterized in that the mapping each piece of training data in the training data set from an element group space to a semantic space comprises:
performing element group division on each piece of training data in the training data set to obtain at least two element group sets, each element group set including at least one element group and corresponding to one piece of training data, where an element group represents a set of at least one indivisible element;
performing vectorization on the at least two element group sets to obtain a first matrix, the first matrix being used to indicate the frequency with which the at least one element group occurs in each element group set;
calculating a second matrix according to the weight of each element group, the frequency of each element group, and the first matrix, the second matrix being used to indicate the frequency weighted value of each element group;
performing bag-of-words model training on the second matrix.
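The matrix construction in claim 3 parallels a standard term-document pipeline. The following is a minimal sketch under two assumptions that the claims do not fix: whitespace-separated tokens serve as the indivisible element groups, and the weight of claim 5 is realized as a smoothed inverse-set-frequency.

```python
import math
from collections import Counter

def build_matrices(training_data):
    """Build the 'first matrix' (raw element-group frequencies) and the
    'second matrix' (frequency weighted values) from raw text strings."""
    # Element group division: here each whitespace token is one element group.
    group_sets = [text.split() for text in training_data]
    vocab = sorted({g for gs in group_sets for g in gs})
    n_sets = len(group_sets)

    # First matrix: frequency of each element group in each element group set.
    first = [[Counter(gs)[g] for g in vocab] for gs in group_sets]

    # Weight of an element group, derived (per claim 5) from the total number
    # of element group sets and the number of sets containing it — realized
    # here as a smoothed IDF, one concrete choice among many.
    containing = {g: sum(1 for gs in group_sets if g in gs) for g in vocab}
    weight = {g: math.log((1 + n_sets) / (1 + containing[g])) + 1 for g in vocab}

    # Second matrix: frequency weighted value of each element group.
    second = [[first[i][j] * weight[g] for j, g in enumerate(vocab)]
              for i in range(n_sets)]
    return vocab, first, second

vocab, first, second = build_matrices(
    ["coking coal price", "coking coal index", "health data"])
print(vocab)
print(first)
```

Element groups shared by many sets (like "coal") receive lower weights than rarer ones (like "price"), so the second matrix emphasizes discriminative groups.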
4. The method according to claim 3, characterized in that the performing bag-of-words model training on the second matrix comprises:
performing singular value decomposition on the second matrix based on the bag-of-words model to obtain a left singular matrix, a diagonal matrix, and a right singular matrix, so as to perform dimension reduction on the second matrix and remove noise data from the second matrix.
5. The method according to claim 3, characterized in that the weight of a first element group is obtained according to the total number of element group sets and the number of element group sets that include the first element group, the first element group being an element group in an element group set.
6. The method according to claim 5, characterized in that the performing vectorization on the at least two element group sets to obtain a first matrix comprises:
performing vectorization on the at least two element group sets according to the frequency with which each element group occurs in each element group set, to obtain at least two training vectors;
forming the first matrix from the at least two training vectors obtained.
7. The method according to any one of claims 3 to 6, characterized in that after the cluster analysis is performed on the training data in the training data set and before each piece of training data in the target training set is associated with the same category list, the method further comprises:
judging whether the training data in the target data set meets a mapping rule; if it is determined that the mapping rule is met, mapping each piece of training data in the target training set to the same category list.
8. The method according to claim 7, characterized in that the mapping rule is met when:
the similarity between the pieces of training data is higher than the preset similarity; and, in descending order of element group level, it is judged whether an element group in one element group set and the element group of the same level in another element group set are identical or similar in the semantic space: if so, it is determined that the mapping rule is met; if not, the judgment is carried out at the next lower level.
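The level-by-level check of claim 8 can be sketched as follows, assuming each element group set is given as a list ordered from highest to lowest level and approximating "identical or similar in the semantic space" by exact equality; both are simplifying assumptions for illustration.

```python
def meets_mapping_rule(similarity, preset_similarity, groups_a, groups_b):
    """Return True if two pieces of training data may be mapped to the same
    category list: the similarity threshold must be passed, and some pair of
    same-level element groups must match, checked from the highest level down.
    'Identical or similar' is approximated here by string equality."""
    if similarity <= preset_similarity:
        return False
    for ga, gb in zip(groups_a, groups_b):
        if ga == gb:      # same-level element groups match -> rule is met
            return True
        # otherwise fall through to the judgment at the next lower level
    return False

ok = meets_mapping_rule(0.95, 0.9,
                        ["coking coal", "price"],
                        ["coking coal", "index"])
print(ok)  # matches at the highest level
```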
9. A device for data processing, characterized in that the device comprises:
an acquisition module, configured to obtain a training data set to be processed, the training data set including at least two pieces of training data that have undergone semantic analysis;
a processing module, configured to perform cluster analysis on the training data in the training data set obtained by the acquisition module to obtain a target data set, the target data set including at least two pieces of training data whose similarity is higher than a preset similarity;
a mapping module, configured to map each piece of training data in the target training set obtained by the processing module to a same category list, the category list being used to provide an entry for obtaining the training data under the category list.
10. The device according to claim 9, characterized in that the processing module is specifically configured to:
map each piece of training data in the training data set from an element group space to a semantic space;
calculate the similarity between the pieces of training data mapped to the semantic space, and determine the target data set according to the similarity between the pieces of training data.
11. The device according to claim 10, characterized in that the processing module is specifically configured to:
perform element group division on each piece of training data in the training data set to obtain at least two element group sets, each element group set including at least one element group and corresponding to one piece of training data, where an element group represents a set of at least one indivisible element;
perform vectorization on the at least two element group sets to obtain a first matrix, the first matrix being used to indicate the frequency with which the at least one element group occurs in each element group set;
calculate a second matrix according to the weight of each element group, the frequency of each element group, and the first matrix, the second matrix being used to indicate the frequency weighted value of each element group;
perform bag-of-words model training on the second matrix.
12. The device according to claim 11, characterized in that the processing module is specifically configured to:
perform singular value decomposition on the second matrix based on the bag-of-words model to obtain a left singular matrix, a diagonal matrix, and a right singular matrix, so as to perform dimension reduction on the second matrix and remove noise data from the second matrix.
13. The device according to claim 12, characterized in that the processing module is specifically configured to:
perform vectorization on the at least two element group sets according to the frequency with which each element group occurs in each element group set, to obtain at least two training vectors;
form the first matrix from the at least two training vectors obtained.
14. A computer storage medium, characterized in that it includes instructions which, when run on a computer, cause the computer to execute the method according to any one of claims 1 to 8.
15. A computer program product comprising instructions, characterized in that, when run on a computer, the instructions cause the computer to execute the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710619053.5A CN109947858B (en) | 2017-07-26 | 2017-07-26 | Data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109947858A true CN109947858A (en) | 2019-06-28 |
CN109947858B CN109947858B (en) | 2022-10-21 |
Family
ID=67003894
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710619053.5A Active CN109947858B (en) | 2017-07-26 | 2017-07-26 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109947858B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674293A (en) * | 2019-08-27 | 2020-01-10 | 电子科技大学 | Text classification method based on semantic migration |
CN111930463A (en) * | 2020-09-23 | 2020-11-13 | 杭州橙鹰数据技术有限公司 | Display method and device |
CN112650836A (en) * | 2020-12-28 | 2021-04-13 | 成都网安科技发展有限公司 | Text analysis method and device based on syntax structure element semantics and computing terminal |
CN113191147A (en) * | 2021-05-27 | 2021-07-30 | 中国人民解放军军事科学院评估论证研究中心 | Unsupervised automatic term extraction method, apparatus, device and medium |
CN113420328A (en) * | 2021-06-23 | 2021-09-21 | 鹤壁国立光电科技股份有限公司 | Big data batch sharing exchange system |
CN114696946A (en) * | 2020-12-28 | 2022-07-01 | 郑州大学 | Data encoding method, data decoding method, data encoding device, data decoding device, electronic equipment and storage medium |
CN114743681A (en) * | 2021-12-20 | 2022-07-12 | 健康数据(北京)科技有限公司 | Case grouping screening method and system based on natural language processing |
CN114732634A (en) * | 2022-05-19 | 2022-07-12 | 佳木斯大学 | Clinical medicine is with preventing neonate and infecting probability analytic system and isolating device thereof |
CN118335348A (en) * | 2024-06-12 | 2024-07-12 | 临沂亿通软件有限公司 | Medical big data processing method and system based on cloud computing |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040193414A1 (en) * | 2000-01-27 | 2004-09-30 | Manning & Napier Information Services, Llc | Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors |
US20070150424A1 (en) * | 2005-12-22 | 2007-06-28 | Pegasus Technologies, Inc. | Neural network model with clustering ensemble approach |
CN101059806A (en) * | 2007-06-06 | 2007-10-24 | 华东师范大学 | Word sense based local file searching method |
CN103473369A (en) * | 2013-09-27 | 2013-12-25 | 清华大学 | Semantic-based information acquisition method and semantic-based information acquisition system |
CN105284089A (en) * | 2013-06-27 | 2016-01-27 | 华为技术有限公司 | Data transmission method and apparatus |
CN106021578A (en) * | 2016-06-01 | 2016-10-12 | 南京邮电大学 | Improved text classification algorithm based on integration of cluster and membership degree |
CN106557485A (en) * | 2015-09-25 | 2017-04-05 | 北京国双科技有限公司 | A kind of method and device for choosing text classification training set |
CN106776713A (en) * | 2016-11-03 | 2017-05-31 | 中山大学 | It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis |
Non-Patent Citations (2)
Title |
---|
ZHUGE H: "Peer-to-Peer in Metric Space and Semantic", 《IEEE TRANSACTIONS ON KNOWLEDGE & DATA ENGINEERING》 * |
DAI Xinyu et al.: "LSASGT: A Text Classification Method Based on Latent Semantic Analysis and Transductive Spectral Graph Algorithm", Acta Electronica Sinica * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674293A (en) * | 2019-08-27 | 2020-01-10 | 电子科技大学 | Text classification method based on semantic migration |
CN111930463A (en) * | 2020-09-23 | 2020-11-13 | 杭州橙鹰数据技术有限公司 | Display method and device |
CN114696946A (en) * | 2020-12-28 | 2022-07-01 | 郑州大学 | Data encoding method, data decoding method, data encoding device, data decoding device, electronic equipment and storage medium |
CN112650836A (en) * | 2020-12-28 | 2021-04-13 | 成都网安科技发展有限公司 | Text analysis method and device based on syntax structure element semantics and computing terminal |
CN114696946B (en) * | 2020-12-28 | 2023-07-14 | 郑州大学 | Data encoding and decoding method and device, electronic equipment and storage medium |
CN113191147A (en) * | 2021-05-27 | 2021-07-30 | 中国人民解放军军事科学院评估论证研究中心 | Unsupervised automatic term extraction method, apparatus, device and medium |
CN113420328B (en) * | 2021-06-23 | 2023-04-28 | 鹤壁国立光电科技股份有限公司 | Big data batch sharing exchange system |
CN113420328A (en) * | 2021-06-23 | 2021-09-21 | 鹤壁国立光电科技股份有限公司 | Big data batch sharing exchange system |
CN114743681A (en) * | 2021-12-20 | 2022-07-12 | 健康数据(北京)科技有限公司 | Case grouping screening method and system based on natural language processing |
CN114743681B (en) * | 2021-12-20 | 2024-01-30 | 健康数据(北京)科技有限公司 | Case grouping screening method and system based on natural language processing |
CN114732634A (en) * | 2022-05-19 | 2022-07-12 | 佳木斯大学 | Clinical medicine is with preventing neonate and infecting probability analytic system and isolating device thereof |
CN118335348A (en) * | 2024-06-12 | 2024-07-12 | 临沂亿通软件有限公司 | Medical big data processing method and system based on cloud computing |
CN118335348B (en) * | 2024-06-12 | 2024-09-17 | 互网嘉(上海)信息技术有限公司 | Medical big data processing method and system based on cloud computing |
Also Published As
Publication number | Publication date |
---|---|
CN109947858B (en) | 2022-10-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TG01 | Patent term adjustment |