CN106980639A - Short text data aggregation system and method

Info

Publication number: CN106980639A
Application number: CN201611242641.3A
Authority: CN
Inventors: 郑建宾; 华锦芝; 周钰
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2016-12-29
Filing date: 2016-12-29
Publication date: 2017-07-25
Anticipated expiration: 2036-12-29
Also published as: CN106980639B

Abstract

The present invention relates to a short text data aggregation system, comprising: a data acquisition module for obtaining a first set of short text data and short text data to be aggregated; a data extraction module for forming a second set of first field attribute data and second field attribute data; and a data aggregation module comprising a candidate data query unit, a similarity calculation unit, and a short text data aggregation unit. The candidate data query unit queries the second set for several items of first field attribute data related to the second field attribute data; the similarity calculation unit calculates the similarity between each such item of first field attribute data and the second field attribute data; and the short text data aggregation unit aggregates the short text data corresponding to the first field attribute data with the highest similarity with the short text data to be aggregated. Data aggregation based on the system has high matching accuracy and high execution efficiency. The system logic is simple and configuration is convenient.

Description

Short text data aggregation system and method

Technical field

The present invention relates to the technical field of data fusion, and more specifically to a short text data aggregation system and method.

Background art

At present, social media, the mobile Internet, big data analytics, cloud computing, and the Internet of Things are not developing in isolation; they are merging with one another and thereby developing rapidly in concert. As a means of supporting intelligent decision-making, the application of big data in financial institutions, enterprises, public institutions, government, and social management and development is a direction in which researchers are working.

Traditional statistical analysis typically tracks and analyzes a single data source (marketing data, administrative reports, questionnaires, census data, etc.) in depth, and analysts have a degree of control over, and a deep understanding of, the source and structure of the data. In the big data era, by contrast, data sources are diverse, naturally formed, and massive, and the data are often semi-structured or unstructured. This requires data scientists and analysts to handle diverse, multi-source data, and to mine and analyze the data after organizing them.

Classifying and analyzing data from different sources involves two technical bottlenecks. First, the various kinds of data differ in source and structure, so their common fields must be extracted before clustering and fusion. Second, the accuracy of the data clustering and fusion technique limits the breadth and depth of its application. In the prior art, many techniques exist for clustering and fusing short text data, but they often use the frequency with which keywords occur in the short text as the primary basis for aggregation. This easily leads to one-sided decisions and seriously affects aggregation accuracy. In addition, in settings that must process massive data, the execution efficiency of data aggregation is a technical problem to which those skilled in the art pay particular attention.

Summary of the invention

An object of the present invention is to provide a short text data aggregation system with high aggregation accuracy and high execution efficiency.

To achieve the above object, the present invention provides the following technical solution:

A short text data aggregation system, comprising: a data acquisition module comprising an internal data loading unit and an external data acquisition unit, wherein the internal data loading unit obtains a first set of short text data from a data storage module of the system, and the external data acquisition unit obtains short text data to be aggregated from outside the system; a data extraction module coupled with the data acquisition module and comprising a field extraction unit, wherein the field extraction unit extracts, from the first set, the fields of each item of short text data that participate in aggregation, to form a second set of first field attribute data, and extracts, from the short text data to be aggregated, the fields that participate in aggregation, to form second field attribute data; and a data aggregation module coupled with the data extraction module and comprising a candidate data query unit, a similarity calculation unit, and a short text data aggregation unit. The candidate data query unit queries the second set for several items of first field attribute data related to the second field attribute data, to form a third set of first field attribute data; the similarity calculation unit calculates the similarity between each item of first field attribute data in the third set and the second field attribute data; and the short text data aggregation unit aggregates the short text data corresponding to the item of first field attribute data in the third set that has the highest similarity to the second field attribute data with the short text data to be aggregated.

Preferably, the data aggregation module further comprises an inverted list construction unit, which builds an inverted list for the second field attribute data; the candidate data query unit queries the second set for the several related items of first field attribute data according to the inverted list.

Preferably, the data extraction module further comprises a data filtering unit, which filters out of the second set the first field attribute data that cannot be matched with the second field attribute data.

Preferably, the candidate data query unit calculates the relevance between each item of first field attribute data in the second set and the second field attribute data, and forms the third set from the first field attribute data whose relevance exceeds a relevance threshold.

Preferably, the relevance uses, as a calculation factor, the number of identical segmented words shared by the word-segmentation sequence of the first field attribute data and the word-segmentation sequence of the second field attribute data.

Preferably, the system further comprises a serialization unit and a deserialization unit; the serialization unit serializes in-memory data for storage on disk, and the deserialization unit converts disk files back into in-memory data.

The invention further discloses a short text data aggregation method, comprising the following steps: a) obtaining a first set of short text data from a data storage module, and obtaining short text data to be aggregated from outside; b) extracting, from the first set, the fields of each item of short text data that participate in aggregation, to form a second set of first field attribute data, and extracting, from the short text data to be aggregated, the fields that participate in aggregation, to form second field attribute data; c) querying the second set for several items of first field attribute data whose relevance to the second field attribute data satisfies a relevance threshold, to form a third set of first field attribute data; d) calculating the similarity between each item of first field attribute data in the third set and the second field attribute data; e) aggregating the short text data corresponding to the item of first field attribute data in the third set that has the highest similarity to the second field attribute data with the short text data to be aggregated.

The short text data aggregation system and method provided by the present invention realize a data aggregation process with high matching accuracy and high execution efficiency. Through multiple rounds of filtering or matching, the time taken to process massive external data is significantly shortened. The system logic is simple, configuration is convenient, and implementation cost is low, making it easy to adopt in the industry.

Brief description of the drawings

Fig. 1 is a schematic diagram of the module structure of a short text data aggregation system according to one embodiment of the present invention.

Detailed description of the embodiments

As shown in Fig. 1, one embodiment of the present invention provides a short text data aggregation system comprising data acquisition module 10, data extraction module 20, data aggregation module 30, and data storage module 40.

Data acquisition module 10 comprises internal data loading unit 101 and external data acquisition unit 102. Internal data loading unit 101 obtains a first set of short text data from data storage module 40; external data acquisition unit 102 obtains the input from outside the system, i.e. an item of short text data to be aggregated or a set of such items.

Because the short text data stored in data storage module 40 may be quite large, the system may further include a serialization unit and a deserialization unit (not shown in the drawing). The serialization unit serializes in-memory data for storage on disk, and the deserialization unit converts disk files back into in-memory data.
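The serialization and deserialization units described above can be sketched as a simple round trip. This is a minimal illustration only: the source does not name a storage format, so the use of Python's `pickle` module, the function names, and the sample record are all assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def serialize_to_disk(records, path):
    """Write in-memory records to a disk file (pickle is an assumed format)."""
    with open(path, "wb") as fh:
        pickle.dump(records, fh, protocol=pickle.HIGHEST_PROTOCOL)


def deserialize_from_disk(path):
    """Read a disk file back into in-memory records."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical first set of short text data.
first_set = [{"merchant_id": "M001", "name": "Wang Steak Pudong"}]

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "first_set.pkl"
    serialize_to_disk(first_set, store)
    restored = deserialize_from_disk(store)
```

A columnar or line-oriented format may suit very large sets better; the round trip itself is the point of the sketch.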

Data extraction module 20 is coupled with data acquisition module 10 and comprises at least field extraction unit 201. Field extraction unit 201 extracts, from the first set, the fields of each item of short text data that participate in aggregation, to form a second set of first field attribute data; it also extracts, from the short text data to be aggregated, the fields that participate in aggregation, to form second field attribute data.

Field extraction unit 201 may include a field configuration table in which the user configures or defines the fields that participate in aggregation. Once the user has finished configuring it, field extraction unit 201 loads the field configuration table directly and performs field extraction according to it.
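A configuration-driven extraction step of this kind might look as follows. The table layout, the field names, and the `extract_fields` helper are hypothetical; the source only states that the user configures which fields participate in aggregation.

```python
# Hypothetical field configuration table: which fields of each source
# participate in aggregation.
FIELD_CONFIG = {
    "internal": ["merchant_id", "name", "mcc"],
    "external": ["shop_name", "branch", "city", "address"],
}


def extract_fields(record, source, config=FIELD_CONFIG):
    """Keep only the configured fields that are present in the record."""
    return {field: record[field] for field in config[source] if field in record}


# A crawled external record with more fields than aggregation needs.
external_record = {
    "shop_name": "Wang Steak",
    "city": "Shanghai",
    "views": 643919,
    "favorites": 1895,
}
second_field_attrs = extract_fields(external_record, "external")
```

Keeping the field list in data rather than in code means new sources can be added by editing the table only.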

Further, data extraction module 20 may also include a data filtering unit (not shown in the drawing), which filters out of the second set the first field attribute data that clearly cannot be matched with the second field attribute data. For example, if a data element (first field attribute data) in the second set has no field whose content intersects with any field of the short text data to be aggregated, that data element can be filtered out of the second set.
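The "no intersection" filter in the example above can be sketched with set operations. This is an illustration under assumptions: fields are compared as lowercase whitespace-separated tokens (Chinese text would need a word segmenter instead), and the helper names are hypothetical.

```python
def tokens_of(attrs):
    """Collect lowercase whitespace-separated tokens from all field values."""
    toks = set()
    for value in attrs.values():
        toks.update(str(value).lower().split())
    return toks


def drop_unmatchable(second_set, second_attrs):
    """Keep only elements sharing at least one token with the external record."""
    target = tokens_of(second_attrs)
    return [attrs for attrs in second_set if tokens_of(attrs) & target]


second_set = [{"name": "Wang Steak Pudong"}, {"name": "City Bank ATM"}]
second_attrs = {"name": "Wang Steak", "city": "Shanghai"}
remaining = drop_unmatchable(second_set, second_attrs)
```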

Data aggregation module 30 is coupled with data extraction module 20 and comprises candidate data query unit 301, similarity calculation unit 302, and short text data aggregation unit 303. Candidate data query unit 301 is coupled to similarity calculation unit 302, which in turn is coupled to short text data aggregation unit 303.

Specifically, candidate data query unit 301 queries the second set for several items of first field attribute data related to the second field attribute data, to form a third set of first field attribute data. Similarity calculation unit 302 calculates the similarity between each item of first field attribute data in the third set and the second field attribute data. Short text data aggregation unit 303 aggregates the short text data corresponding to the item of first field attribute data in the third set with the highest similarity with the short text data to be aggregated, and the aggregation result forms the output of the system.

Candidate data query unit 301 calculates the relevance between each item of first field attribute data in the second set and the second field attribute data, and forms the third set from the first field attribute data whose relevance exceeds a relevance threshold.

Similarity calculation unit 302 may calculate the similarity using one of the following algorithms or a combination of several of them: the Jaro-Winkler similarity algorithm; the Levenshtein similarity algorithm; the longest common substring algorithm; a phrase similarity algorithm; and the cosine similarity algorithm.
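As one concrete instance of the listed algorithms, a Levenshtein-based similarity can be sketched as below. The source does not say how the edit distance is turned into a similarity; dividing by the longer string's length is a common convention and is an assumption here.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map distance to [0, 1]; normalizing by the longer length is assumed."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

For example, `levenshtein_distance("kitten", "sitting")` is 3, so the similarity under this normalization is 1 - 3/7.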

Preferably, data aggregation module 30 further comprises an inverted list construction unit (not shown in the drawing), which builds an inverted list for the second field attribute data; candidate data query unit 301 queries the second set for the several related items of first field attribute data according to the inverted list.

Specifically, an inverted list is built over the external data, i.e. the short text data to be aggregated. On the basis of the generated inverted list, the internal data stored by the system (that is, the second set of first field attribute data) is queried for first field attribute data related to the second field attribute data, producing the third set of first field attribute data. Each data element in the third set has thus passed through a mapping between internal and external data. Compared with directly calculating the similarity between every data element in the second set and the second field attribute data, the third set is far smaller than the second set; exploiting the correlations between internal and external data avoids computing on completely unrelated data, which greatly reduces the amount of computation and improves efficiency.
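The inverted-list lookup can be sketched as follows: tokens of the external records are indexed once, and each internal element then touches only the external records that share at least one token with it. The data layout, identifiers, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_list(external_tokens):
    """Map each token to the set of external record ids that contain it.

    external_tokens: {external_id: [token, ...]}
    """
    inverted = defaultdict(set)
    for ext_id, toks in external_tokens.items():
        for tok in toks:
            inverted[tok].add(ext_id)
    return inverted


def shared_token_counts(inverted, internal_tokens):
    """For one internal element, count distinct shared tokens per external id."""
    counts = Counter()
    for tok in set(internal_tokens):
        for ext_id in inverted.get(tok, ()):
            counts[ext_id] += 1
    return counts


external = {
    "e1": ["wang", "steak", "pudong"],
    "e2": ["north", "xinjiang", "restaurant"],
}
inverted = build_inverted_list(external)
counts = shared_token_counts(inverted, ["wang", "steak", "house"])
```

External records with no shared token never appear in `counts`, which is where the saving over an all-pairs comparison comes from.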

As an example of the relevance calculation performed by candidate data query unit 301, one method is as follows. For each pair of data, i.e. the short text data to be aggregated and any item of short text data in the first set, field extraction unit 201 extracts the fields that participate in aggregation, forming the second field attribute data and one data element (first field attribute data) in the second set. An inverted list is built for the second field attribute data, and the number count of segmented words in the inverted list that are identical to those of the data element is counted. The relevance is then calculated according to the following formula:

[formula not reproduced in this text]

where len(termsA) is the length of word-segmentation sequence A of the first field attribute data, and len(termsB) is the length of word-segmentation sequence B of the second field attribute data.

The relevance values are then sorted in descending order, and, for example, the top N data elements (first field attribute data) of the second set with the highest relevance are selected to form the third set for subsequent processing by similarity calculation unit 302. Selecting the top N rather than processing the whole second set is mainly a trade-off between execution efficiency and accuracy. By the definition of relevance, the more internal and external data have in common (identical segmented words), the higher the relevance; moreover, when the second set is narrowed to the third set, the probability of excluding the correct data element (the one best suited to aggregation with the short text data to be aggregated) is very low.
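Top-N candidate selection can be sketched as below. The patent's own relevance formula is not reproduced in this text, so the sketch substitutes a Dice-style score, 2 * count / (len(termsA) + len(termsB)), built from the same three quantities; that choice, the function names, and the sample data are assumptions, not the patented formula.

```python
import heapq


def relevance(terms_a, terms_b):
    """Stand-in score (assumption): 2 * count / (len(A) + len(B))."""
    if not terms_a or not terms_b:
        return 0.0
    count = len(set(terms_a) & set(terms_b))  # identical segmented words
    return 2.0 * count / (len(terms_a) + len(terms_b))


def top_n_candidates(second_set, terms_b, n):
    """Return the n (score, key) pairs with the highest relevance."""
    scored = ((relevance(terms, terms_b), key) for key, terms in second_set.items())
    return heapq.nlargest(n, scored)


# Hypothetical segmented internal names, keyed by merchant id.
second_set = {
    "m1": ["wang", "steak", "pudong"],
    "m2": ["wang", "noodle"],
    "m3": ["bank", "atm"],
}
third_set = top_n_candidates(second_set, ["wang", "steak", "shanghai"], n=2)
```

`heapq.nlargest` avoids sorting the whole second set when N is small relative to its size.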

The short text data aggregation system of the above embodiment performs data extraction, filtering, relevance calculation, and similarity calculation, so that the overall aggregation process has high matching accuracy and high execution efficiency. The system logic is simple and configuration is convenient. Preferably, the system can be deployed on a cloud computing system, which facilitates upgrading, maintenance, and adoption in the industry.

Another embodiment of the present invention provides a short text data aggregation method, comprising the following steps. Step S10: obtain a first set of short text data from a data storage module, and obtain short text data to be aggregated from outside.

Step S20: extract, from the first set, the fields of each item of short text data that participate in aggregation, to form a second set of first field attribute data; and extract, from the short text data to be aggregated, the fields that participate in aggregation, to form second field attribute data.

Step S30: query the second set for several items of first field attribute data whose relevance to the second field attribute data satisfies a relevance threshold, to form a third set of first field attribute data.

Specifically, the relevance threshold can be set statically, or set dynamically according to the results of the relevance calculation. The relevance is calculated by the following formula:

[formula not reproduced in this text]

where len(termsA) is the length of word-segmentation sequence A of the first field attribute data, len(termsB) is the length of word-segmentation sequence B of the second field attribute data, and count is the number of identical segmented words shared by word-segmentation sequence A of the first field attribute data and word-segmentation sequence B of the second field attribute data.

Step S40: calculate the similarity between each item of first field attribute data in the third set and the second field attribute data.

Specifically, the similarity may be calculated using one of the following algorithms or a combination of several of them: the Jaro-Winkler similarity algorithm; the Levenshtein similarity algorithm; the longest common substring algorithm; a phrase similarity algorithm; and the cosine similarity algorithm.
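Cosine similarity, one of the options listed for step S40, can be sketched over word-segmentation sequences. Representing each sequence as a bag of token counts is an assumption; the source does not specify the vectorization.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine of the angle between two token-count vectors."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sequence is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Two sequences sharing one of two tokens each, such as `["wang", "steak"]` and `["wang", "noodle"]`, score about 0.5 under this definition.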

Step S50: aggregate the short text data corresponding to the item of first field attribute data in the third set that has the highest similarity to the second field attribute data with the short text data to be aggregated.
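Steps S20 to S50 can be strung together in a compact sketch. Everything below is illustrative: token overlap (Jaccard) stands in for the relevance formula, which is not reproduced in this text; `difflib.SequenceMatcher.ratio()` from the Python standard library stands in for the listed similarity algorithms; and the record layout and the `aggregate_one` helper are hypothetical.

```python
from difflib import SequenceMatcher


def aggregate_one(internal_records, external_record, field="name", threshold=0.0):
    """Merge one external record with its most similar internal record."""
    # S20: extract the participating field and split it into tokens.
    ext_name = external_record[field]
    ext_terms = set(ext_name.lower().split())

    # S30: keep internal records whose token overlap exceeds the threshold.
    candidates = []
    for rec in internal_records:
        terms = set(rec[field].lower().split())
        overlap = len(terms & ext_terms) / max(len(terms | ext_terms), 1)
        if overlap > threshold:
            candidates.append(rec)
    if not candidates:
        return None

    # S40: score each candidate; S50: merge the best one with the external record.
    def score(rec):
        return SequenceMatcher(None, rec[field].lower(), ext_name.lower()).ratio()

    best = max(candidates, key=score)
    merged = dict(external_record)
    merged.update({f"internal_{key}": value for key, value in best.items()})
    return merged


internal = [
    {"id": "M1", "name": "Wang Steak Pudong"},
    {"id": "M2", "name": "Wang Noodle House"},
    {"id": "M3", "name": "City Bank ATM"},
]
result = aggregate_one(internal, {"name": "Wang Steak Pudong Branch", "city": "Shanghai"})
```

In the sample data, `M3` shares no token with the external name and is dropped at S30, so the similarity step only scores two candidates.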

As a concrete application of the above embodiments of the present invention, an aggregation example for merchant data is given below.

External merchant data comes from external Internet platforms, such as the websites Dianping, Ctrip, and eLong. On the one hand, these third-party public data platforms contain public information on many merchants and draw on diverse data sources. On the other hand, many of them are interactive: users can rate each merchant according to their own preferences, which forms a latent, socialized evaluation of merchant credit and helps in appropriately assessing a merchant's real value.

In the first stage of this application, a web crawler obtains part of the merchant information from the above three websites. Taking Dianping as an example, the crawled fields are shown in the table below:

Field	Sample value
Merchant ID	2209663
City	Shanghai
Administrative district	Pudong New District
Shop name	Wang Steak (王品台塑牛排)
Shop alias	NA
Branch information	China Resources Shi Di branch
Number of branches	5
Category	{Western food - steak}
Business district	Yaohan
Address	7F, China Resources Times Square, No. 500 Zhangyang Road, Pudong New District (near Pudong South Road)
Business hours	{11.5-14, 17.5-21}
Per-capita spend	323 yuan
Overall score	4.5
Key tags	{couples' date: 1418, cards accepted: 543, dining with friends: 534, business entertaining: 484}
Score breakdown	{531, 699, 187, 22, 6}
Default reviews	2815
Check-in short reviews	698
All reviews	3224
Group-buy reviews	4
Taste score (sub-item A)	8.3
Environment score (sub-item B)	8.8
Service score (sub-item C)	9.1
Favorites	1895
Views	643919
Views in the last week	2328
Also viewed	BaiWanZhuangYuan (Guangan branch), ten thousand Lou Fu local flavors restaurant, Chan Ren chop house, north Xinjiang restaurant ...
Coordinates	116.37707,39.89292
Timestamp	2014-3-14 15:43
Transport information	riek_mam: parking is in the lane and fairly difficult (13-08-14); the fragrance four seasons: free parking

The merchant data obtained from outside contains a large number of fields. Some fields are discrete information that describes merchant characteristics in words; others are continuous information that describes a merchant's value numerically. Clearly, not all of these fields need to be used in aggregation: some contribute nothing to the aggregation process and only increase its data throughput and processing load, lowering the system's execution efficiency.

As the other side of the aggregation, internal merchant data is relatively easy to obtain. However, internal data involves specific information about individual merchants, and operating on it directly can easily introduce errors. Internal merchant data is therefore exported and then processed, which both isolates the original data and allows the internal merchant data to be organized in the same form as the external merchant data.

In the second stage of this application, the fields that participate in aggregation mainly include, for example:

When field extraction unit 201 extracts external merchant data, each field in the table above can be extracted to form the second field attribute data. When field extraction unit 201 extracts internal merchant data, the extracted data fields include at least the following three:

Field	Explanation
Merchant ID	ID that uniquely identifies an individual merchant internally; facilitates later backtracking
Merchant name	The core field for aggregation
Merchant MCC	The merchant's category

The two tables above show only some of the fields. As needed, any number and any category of fields can be configured for extraction through the field configuration table, for both internal and external merchant data.

After the internal and external merchant data have been extracted to form, respectively, the second set of first field attribute data (corresponding to internal merchant data) and the second field attribute data (corresponding to external merchant data), whether these data are suitable for aggregation still needs further verification. Because Dianping, Ctrip, and eLong are third-party public data platforms, the merchant data they cover has a certain bias; Ctrip, for example, mainly covers hotel information. Not all internal merchant data can therefore be aggregated; only by adding different data sources can as much internal merchant data as possible be aggregated with external merchant data.

Because a large amount of internal merchant data must go through the relevance and similarity calculations, filtering the second set as much as possible helps improve system execution efficiency. The main filtering targets include, for example:

Filter target	Filtering rule
ATM	If the internal name contains a string such as "ATM" or "cash dispenser", the merchant is removed
POS	If the internal name contains a string such as "POS", the merchant is removed
Self-employed	If the internal name contains a string such as "self-employed" or "individual", the merchant is removed
Special category (MCC)	If the internal merchant's MCC is "special category", the merchant is also filtered out
Name length	If the internal merchant name is too short, it carries too little information and cannot participate in aggregation

The filtering targets above fall into two patterns:

First, the category pattern. Inspection of the MCCs of internal merchant data shows that "special category" is an unusual MCC that mostly contains clearing entries rather than real merchant information. Such merchants are unsuitable for aggregation and must be removed.

Second, the containment pattern. Merchants are removed either because the merchant name contains certain keywords or because the name is so short that it carries very little information.

The configuration file used for this filtering therefore needs two parts: under the containment pattern, keywords can be added so that any merchant whose name contains one of them is filtered out; under the category pattern, MCCs are specified so that every merchant with one of those MCCs is filtered out.
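A two-part filter configuration of this kind could be sketched as follows. The configuration keys, the placeholder MCC value `"SPECIAL"`, the minimum name length, and the helper name are all hypothetical; the keyword list is taken from the filtering table. Removed records are returned rather than dropped so that they can be stored separately.

```python
# Hypothetical filter configuration with a containment part and a category part.
FILTER_CONFIG = {
    "contains_keywords": ["ATM", "cash dispenser", "POS", "self-employed", "individual"],
    "excluded_mcc": {"SPECIAL"},  # placeholder for the "special category" MCC
    "min_name_length": 3,         # assumed cut-off for "name too short"
}


def split_by_filter(merchants, config=FILTER_CONFIG):
    """Split merchants into (kept, removed) according to the configuration."""
    kept, removed = [], []
    for merchant in merchants:
        name = merchant["name"]
        lowered = name.lower()
        hit = (
            any(keyword.lower() in lowered for keyword in config["contains_keywords"])
            or merchant.get("mcc") in config["excluded_mcc"]
            or len(name) < config["min_name_length"]
        )
        (removed if hit else kept).append(merchant)
    return kept, removed


merchants = [
    {"name": "Bank ATM 12", "mcc": "6011"},
    {"name": "Wang Steak", "mcc": "5812"},
    {"name": "AB", "mcc": "5812"},
    {"name": "Clearing Account", "mcc": "SPECIAL"},
]
kept, removed = split_by_filter(merchants)
```

Plain substring matching is crude (a short keyword such as "POS" also matches inside longer words), so a real rule set would likely need word-boundary or language-specific matching.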

After field extraction and abnormal-data screening, the internal and external merchant data that satisfy the aggregation conditions are obtained. These data are stored on an HDFS platform as the working data for aggregation. The filtered-out data, although removed from the source data, should not be discarded; they should be stored appropriately for subsequent analysis and evaluation.

In the third stage of this application, candidate data query unit 301 calculates the relevance for each pair of merchant data (specifically, one particular item of second field attribute data and any item of first field attribute data in the second set) using the relevance formula described above. Sorting by relevance in descending order, candidate data query unit 301 then selects from the second set the top 1000 items of first field attribute data whose word-segmentation sequences are most relevant to that of the second field attribute data, forming the third set for the similarity calculation in the next step.

For each data element (first field attribute data) in the third set, similarity calculation unit 302 calculates its similarity to the particular second field attribute data, using the Jaro-Winkler similarity algorithm. It will be appreciated that the similarity can also be calculated using edit-distance (Levenshtein) similarity, the longest common substring (LCS) algorithm, a phrase similarity method, a cosine similarity method, or the like.
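For reference, a textbook Jaro-Winkler similarity can be sketched as below. The source does not give its parameters; the prefix scale `p = 0.1` and the prefix cap of 4 characters are the conventional defaults and are assumptions here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)
```

The prefix boost favors names that start the same way, which suits merchant names where the brand usually comes first and branch details follow.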

Short text data aggregation unit 303 sorts the similarities calculated in the preceding step in descending order, aggregates the internal merchant data corresponding to the first field attribute data with the highest similarity with the external merchant data corresponding to the particular second field attribute data, and outputs the resulting aggregated merchant data.

Tests were conducted with merchant data as the aggregation object. Sampled verification on UnionPay (internal) merchant data for the Beijing and Shanghai areas and Dianping (external) merchant data shows that the overall average matching rate between internal and external merchant data is 27.5%; the best matching accuracy of the aggregation model (the share of correctly matched records in the matching result set) reaches about 75%, and the recall (the number of correctly matched records in the result set divided by the number of records in the test set that have a match) reaches about 85%.
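The accuracy and recall definitions given in parentheses above can be written out as a small helper. The dictionaries below are made-up illustrations, not the test data behind the reported figures, and the function name is hypothetical.

```python
def precision_recall(predicted, truth):
    """Compute matching accuracy (precision) and recall.

    predicted: {internal_id: external_id} produced by the aggregation.
    truth:     {internal_id: external_id} for test records that have a match.
    """
    correct = sum(1 for key, value in predicted.items() if truth.get(key) == value)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical result set and ground truth.
predicted = {"M1": "e1", "M2": "e2", "M3": "e9", "M4": "e4"}
truth = {"M1": "e1", "M2": "e2", "M3": "e3", "M5": "e5"}
precision, recall = precision_recall(predicted, truth)
```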

The above describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Those skilled in the art may make various modifications without departing from the spirit of the invention and the appended claims.

Claims

1. A short text data aggregation system, comprising:

a data acquisition module comprising an internal data loading unit and an external data acquisition unit, wherein the internal data loading unit obtains a first set of short text data from a data storage module of the system, and the external data acquisition unit obtains short text data to be aggregated from outside the system;

a data extraction module coupled with the data acquisition module and comprising a field extraction unit, wherein the field extraction unit extracts, from the first set, the fields of each item of the short text data that participate in aggregation, to form a second set of first field attribute data, and extracts, from the short text data to be aggregated, the fields that participate in aggregation, to form second field attribute data; and

a data aggregation module coupled with the data extraction module and comprising a candidate data query unit, a similarity calculation unit, and a short text data aggregation unit;

wherein the candidate data query unit queries the second set for several items of the first field attribute data related to the second field attribute data, to form a third set of the first field attribute data; the similarity calculation unit calculates the similarity between each item of the first field attribute data in the third set and the second field attribute data; and the short text data aggregation unit aggregates the short text data corresponding to the item of the first field attribute data in the third set that has the highest similarity to the second field attribute data with the short text data to be aggregated.

2. The system according to claim 1, characterized in that the data aggregation module further comprises an inverted list construction unit, the inverted list construction unit builds an inverted list for the second field attribute data, and the candidate data query unit queries the second set for the several related items of the first field attribute data according to the inverted list.

3. The system according to claim 1, characterized in that the data extraction module further comprises a data filtering unit, and the data filtering unit filters out of the second set the first field attribute data that cannot be matched with the second field attribute data.

4. The system according to claim 1, characterized in that the candidate data query unit calculates the relevance between each item of the first field attribute data in the second set and the second field attribute data, and forms the third set from the first field attribute data whose relevance exceeds a relevance threshold.

5. The system according to claim 4, characterized in that the relevance uses, as a calculation factor, the number of identical segmented words shared by the word-segmentation sequence of the first field attribute data and the word-segmentation sequence of the second field attribute data.

6. The system according to claim 1, characterized in that the similarity calculation unit calculates the similarity using one of the following algorithms or a combination of several of them:

the Jaro-Winkler similarity algorithm;

the Levenshtein similarity algorithm;

the longest common substring algorithm;

a phrase similarity algorithm; and

the cosine similarity algorithm.

7. The system according to claim 1, characterized in that the field extraction unit comprises a field configuration table in which a user configures the fields that participate in aggregation.

8. The system according to any one of claims 1 to 7, characterized in that the system further comprises a serialization unit and a deserialization unit, the serialization unit serializes in-memory data for storage on disk, and the deserialization unit converts disk files into in-memory data.

9. The system according to claim 8, characterized in that the system is deployed on a cloud computing system.

10. A short text data aggregation method, comprising the following steps:

a) obtaining a first set of short text data from a data storage module, and obtaining short text data to be aggregated from outside;

b) extracting, from the first set, the fields of each item of the short text data that participate in aggregation, to form a second set of first field attribute data, and extracting, from the short text data to be aggregated, the fields that participate in aggregation, to form second field attribute data;

c) querying the second set for several items of the first field attribute data whose relevance to the second field attribute data satisfies a relevance threshold, to form a third set of the first field attribute data;

d) calculating the similarity between each item of the first field attribute data in the third set and the second field attribute data;

e) aggregating the short text data corresponding to the item of the first field attribute data in the third set that has the highest similarity to the second field attribute data with the short text data to be aggregated.

11. The method according to claim 10, characterized in that, in step c), the relevance is calculated by the following formula: [formula not reproduced in this text], where len(termsA) is the length of word-segmentation sequence A of the first field attribute data, len(termsB) is the length of word-segmentation sequence B of the second field attribute data, and count is the number of identical segmented words shared by word-segmentation sequence A of the first field attribute data and word-segmentation sequence B of the second field attribute data.

12. The method according to claim 10, characterized in that, in step d), the similarity is calculated using one of the following algorithms or a combination of several of them:

the Jaro-Winkler similarity algorithm; the Levenshtein similarity algorithm; the longest common substring algorithm; a phrase similarity algorithm; and the cosine similarity algorithm.