CN106980639A - Short text data aggregation system and method
- Publication number: CN106980639A
- Application number: CN201611242641.3A
- Authority: CN (China)
- Prior art keywords: data, field attribute, field, attribute data, short text
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
- G06F16/35—Clustering; Classification
Abstract
The invention relates to a short text data aggregation system comprising: a data acquisition module for obtaining a first set of short text data and short text data to be aggregated; a data extraction module for forming a second set of first field attribute data and second field attribute data; and a data aggregation module comprising a candidate data query unit, a similarity calculation unit and a short text data aggregation unit. The candidate data query unit queries the second set for several first field attribute data related to the second field attribute data; the similarity calculation unit calculates the similarity between each of those first field attribute data and the second field attribute data; and the short text data aggregation unit aggregates the short text data corresponding to the first field attribute data with the highest similarity with the short text data to be aggregated. Data aggregation based on the system achieves high matching accuracy and high execution efficiency. The system logic is simple and configuration is convenient.
Description
Technical field
The invention relates to the technical field of data fusion, and more specifically to a short text data aggregation system and method.
Background
At present, social media, the mobile Internet, big data analytics, cloud computing and the Internet of Things are not developing in isolation; they are merging with one another and thereby achieving rapid, coordinated development. As a means of supporting intelligent decision-making, the application of big data in financial institutions, enterprises, public institutions, government, and social management and development is a direction of active research.
Traditional statistical analysis usually tracks and analyzes a given data source (marketing data, administrative reports, questionnaires, censuses and so on) in depth, and analysts have a degree of control over, and a deep understanding of, the origin and structure of the data. In the big data era, by contrast, data sources are diverse and form naturally, and massive data are often semi-structured or unstructured. This requires data scientists and analysts to handle diverse, multi-source data and to mine and analyze it after sorting it out.
Classifying and analyzing data from different sources involves two technical bottlenecks. First, the various kinds of data differ in origin and structure, so their common fields must be extracted before the data can be clustered and fused. Second, the accuracy of data clustering and fusion techniques limits how broadly and deeply they can be applied. In the prior art there are many techniques for clustering and fusing short text data, but they often take the frequency of keywords in the short text as the primary basis for aggregation, which easily leads to one-sided decisions and seriously degrades aggregation accuracy. In addition, where massive data must be processed, the execution efficiency of data aggregation is a technical problem to which those skilled in the art pay particular attention.
Summary of the invention
An object of the invention is to provide a short text data aggregation system with high aggregation accuracy and high execution efficiency.
To achieve this object, the invention provides the following technical solution:
A short text data aggregation system, comprising: a data acquisition module, which includes an internal data loading unit and an external data acquisition unit, the internal data loading unit obtaining a first set of short text data from a data storage module of the system and the external data acquisition unit obtaining short text data to be aggregated from outside the system; a data extraction module, coupled to the data acquisition module, which includes a field extraction unit, the field extraction unit extracting from the first set the fields of each short text data that participate in aggregation, to form a second set of first field attribute data, and extracting from the short text data to be aggregated the fields that participate in aggregation, to form second field attribute data; and a data aggregation module, coupled to the data extraction module, which includes a candidate data query unit, a similarity calculation unit and a short text data aggregation unit. The candidate data query unit queries the second set for several first field attribute data related to the second field attribute data, to form a third set of first field attribute data; the similarity calculation unit calculates the similarity between each first field attribute data in the third set and the second field attribute data; and the short text data aggregation unit aggregates the short text data corresponding to the first field attribute data in the third set that has the highest similarity to the second field attribute data with the short text data to be aggregated.
Preferably, the data aggregation module also includes an inverted list construction unit, which constructs an inverted list for the second field attribute data; the candidate data query unit queries the several related first field attribute data from the second set according to the inverted list.
Preferably, the data extraction module also includes a data filtering unit, which filters out of the second set the first field attribute data that cannot be matched with the second field attribute data.
Preferably, the candidate data query unit calculates the relevance between each first field attribute data in the second set and the second field attribute data, and forms the third set from the first field attribute data whose relevance is greater than a relevance threshold.
Preferably, the relevance takes as a calculation factor the number of identical tokens shared by the token sequence of the first field attribute data and the token sequence of the second field attribute data.
Preferably, the system also includes a serialization unit and a deserialization unit; the serialization unit serializes in-memory data for storage on disk, and the deserialization unit converts disk files into in-memory data.
The invention also discloses a short text data aggregation method comprising the following steps: a) obtaining a first set of short text data from a data storage module, and obtaining short text data to be aggregated from outside; b) extracting from the first set the fields of each short text data that participate in aggregation, to form a second set of first field attribute data, and extracting from the short text data to be aggregated the fields that participate in aggregation, to form second field attribute data; c) querying the second set for several first field attribute data whose relevance to the second field attribute data meets a relevance threshold, to form a third set of first field attribute data; d) calculating the similarity between each first field attribute data in the third set and the second field attribute data; e) aggregating the short text data corresponding to the first field attribute data in the third set that has the highest similarity to the second field attribute data with the short text data to be aggregated.
The short text data aggregation system and method provided by the invention achieve a data aggregation process with high matching accuracy and high execution efficiency. Because filtering and matching are performed in several passes, the time taken to process massive external data is significantly shortened. The system logic is simple, configuration is convenient and implementation cost is low, which favors adoption across the industry.
Brief description of the drawings
Fig. 1 is a schematic diagram of the module structure of a short text data aggregation system according to one embodiment of the invention.
Detailed description
As shown in Fig. 1, one embodiment of the invention provides a short text data aggregation system that includes a data acquisition module 10, a data extraction module 20, a data aggregation module 30 and a data storage module 40.
The data acquisition module 10 includes an internal data loading unit 101 and an external data acquisition unit 102. The internal data loading unit 101 obtains the first set of short text data from the data storage module 40, and the external data acquisition unit 102 obtains the input from outside the system, that is, the short text data to be aggregated or a set of such data.
Because the volume of short text data stored in the data storage module 40 may be quite large, the system may also include a serialization unit and a deserialization unit (not shown in the drawing). The serialization unit serializes in-memory data for storage on disk, and the deserialization unit converts disk files back into in-memory data.
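The round trip performed by the serialization and deserialization units can be sketched as follows. This is a minimal illustration that assumes Python's standard `pickle` module as the storage format; the function names, the record layout and the file name are hypothetical and do not come from the patent.

```python
import pickle
import tempfile
from pathlib import Path


def serialize_to_disk(records, path):
    """Write in-memory records to a disk file (serialization unit)."""
    with open(path, "wb") as fh:
        pickle.dump(records, fh)


def deserialize_from_disk(path):
    """Read a disk file back into in-memory records (deserialization unit)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical first set of short text data.
first_set = [{"id": 1, "name": "Wang Steak"}, {"id": 2, "name": "North Xinjiang Restaurant"}]

with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "first_set.pkl"
    serialize_to_disk(first_set, snapshot)
    restored = deserialize_from_disk(snapshot)
```

Any format that reproduces the in-memory structure would serve equally well; `pickle` is used here only because it needs no extra dependencies.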
The data extraction module 20 is coupled to the data acquisition module 10 and includes at least a field extraction unit 201. The field extraction unit 201 extracts from the first set the fields of each short text data that participate in aggregation, to form the second set of first field attribute data, and extracts from the short text data to be aggregated the fields that participate in aggregation, to form the second field attribute data.
The field extraction unit 201 may include a field configuration table in which the user configures or defines the fields that participate in aggregation. Once the user has completed the configuration, the field extraction unit 201 loads the field configuration table directly and performs field extraction according to it.
Further, the data extraction module 20 may also include a data filtering unit (not shown in the drawing), which filters out of the second set the first field attribute data that clearly cannot be matched with the second field attribute data. For example, if the second set contains a data element (a first field attribute data) none of whose fields has any intersection with any field of the short text data to be aggregated, that data element can be filtered out of the second set.
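The intersection test in this example can be sketched as below. The whitespace tokenizer is a stand-in assumption (Chinese text would need a proper word segmenter), and the function names are hypothetical.

```python
def tokens(text):
    """Whitespace tokenizer; a real system would use a proper word segmenter."""
    return set(text.lower().split())


def has_overlap(candidate, target):
    """True if any field of `candidate` shares a token with any field of `target`."""
    target_tokens = set().union(*(tokens(value) for value in target.values()))
    return any(tokens(value) & target_tokens for value in candidate.values())


def filter_second_set(second_set, target):
    """Drop data elements that share nothing with the data to be aggregated."""
    return [element for element in second_set if has_overlap(element, target)]


second_set = [{"name": "Wang Steak Pudong"}, {"name": "North Xinjiang Restaurant"}]
target = {"name": "Wang Steak"}
kept = filter_second_set(second_set, target)
```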
Data aggregate module 30 couples 20 with data extraction module and is coupled, and data aggregate module 30 is looked into including candidate data
Unit 301, similarity calculated 302 and short text data polymerized unit 303 are ask, wherein, candidate data query unit 301
Coupled to similarity calculated 302, similarity calculated 302 is coupled to short text data polymerized unit 303.
Specifically, if candidate data query unit 301 inquires about related to the second field attribute data from second set
A dry first field attribute data, to form the 3rd set of the first field attribute data, similarity calculated 302 calculates the
The similarity between every one first field attribute data and the second field attribute data in three set, short text data polymerization is single
Short text data and textual data to be polymerized during member 303 is gathered the 3rd corresponding to similarity highest the first field attribute data
According to progress data aggregate, and with the output of the result formation system of polymerization.
Wherein, candidate data query unit 301 calculates each first field attribute data in second set and belonged to the second field
Property data between the degree of correlation, and with the degree of correlation be more than relevance threshold the first field attribute data formation the 3rd gather.
Wherein, similarity calculated 302 can calculate similarity using following algorithm one of which or multinomial combination:
Jaro-Winkler similarity algorithms;Levenshetin similarity algorithms;Longest Common Substring algorithm;Phrase similarity algorithm;
And cosine similarity algorithm.
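Two of the listed measures can be sketched in a few lines. Normalizing the edit distance by the longer string's length to obtain a similarity in [0, 1] is an assumption made here; the patent does not say how the distance is turned into a similarity.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance, kept to two rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """ASSUMED normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def longest_common_substring(a, b):
    """Return the longest contiguous run of characters shared by a and b."""
    best_len, best_end = 0, 0
    previous = [0] * (len(b) + 1)
    for i, ca in enumerate(a, start=1):
        current = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len, best_end = current[j], i
        previous = current
    return a[best_end - best_len:best_end]
```

For example, `levenshtein_distance("kitten", "sitting")` is 3, and `longest_common_substring("wang steak", "steakhouse")` is `"steak"`.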
Preferably, data aggregate module 30 also includes inverted list structural unit (accompanying drawing is not shown), arranges
Table structural unit is to the second field attribute data configuration inverted list, and candidate data query unit 301 will be according to inverted list come from
Several the first field attribute data of correlation are inquired about in two set.
Specifically, to external data, i.e. short text data to be polymerized carries out the row's of falling training, in the base of the inverted list of generation
On plinth, (that is, the second sets of the first field attribute data) inquiry and the second field in the range of the internal data that system is stored
The first related field attribute data of attribute data, to produce the 3rd set of the first field attribute data.In 3rd set
Data element experienced one by one mapping of the internal data with external data, and this is compared to directly by each data in second set
Element carries out Similarity Measure with the second field attribute data, and the 3rd is integrated into scale and to be far smaller than second set, utilizes
Some correlations between the data of inside and outside, can avoid calculating those completely uncorrelated data, so as to greatly reduce fortune
Calculation amount, improves computational efficiency.
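A minimal sketch of this lookup, assuming whitespace tokenization and hypothetical record identifiers: the inverted list maps each token of the external data to the external records containing it, and each internal record is then looked up by its own tokens.

```python
from collections import defaultdict


def build_inverted_list(external_names):
    """Map each token to the ids of the external records that contain it."""
    inverted = defaultdict(set)
    for ext_id, name in external_names.items():
        for token in name.lower().split():
            inverted[token].add(ext_id)
    return inverted


def query_candidates(inverted, internal_names):
    """For each external id, collect the internal ids sharing at least one token."""
    candidates = defaultdict(set)
    for int_id, name in internal_names.items():
        for token in set(name.lower().split()):
            for ext_id in inverted.get(token, ()):
                candidates[ext_id].add(int_id)
    return candidates


external = {"e1": "Wang Steak", "e2": "North Xinjiang Restaurant"}
internal = {"i1": "Wang Steak Pudong", "i2": "Chan Ren Chop House", "i3": "Xinjiang Restaurant"}
inverted = build_inverted_list(external)
candidates = query_candidates(inverted, internal)
```

Internal records that share no token with any external record (here `i2`) never reach the similarity stage, which is where the saving in computation comes from.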
As an example of the relevance calculation performed by the candidate data query unit 301, one method is as follows. Each pair of data consists of the short text data to be aggregated and any short text data in the first set. After the field extraction unit 201 has extracted the fields that participate in aggregation from each of them, the pair becomes the second field attribute data and one data element (a first field attribute data) of the second set. An inverted list is constructed for the second field attribute data, and the number, count, of tokens in the inverted list that are identical to tokens of the data element is counted. The relevance is then calculated according to the following formula:
[formula not reproduced in the source text]
where len(termsA) is the length of the token sequence A of the first field attribute data, and len(termsB) is the length of the token sequence B of the second field attribute data.
The relevance values are then sorted in descending order, and the top N data elements (first field attribute data) of the second set, that is, the N with the highest relevance, are selected to form the third set for subsequent processing by the similarity calculation unit 302. Selecting the top N rather than processing the whole second set is chiefly a trade-off between practical execution efficiency and accuracy. By the definition of the relevance, the more the internal and external data have in common (the more identical tokens they share), the higher the relevance. Moreover, the probability that the correct data element (the one best suited to aggregation with the short text data to be aggregated) is excluded when the second set is narrowed to the third set is very low.
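The relevance formula itself is not available in this text, so the sketch below substitutes an assumed Dice-style form, 2 * count / (len(termsA) + len(termsB)), built only from the three quantities the description names. It illustrates the ranking and top-N selection, not the patent's actual formula.

```python
import heapq
from collections import Counter


def relevance(terms_a, terms_b):
    """ASSUMED form: 2 * count / (len(termsA) + len(termsB)).

    `count` is the number of identical tokens shared by the two sequences.
    """
    if not terms_a and not terms_b:
        return 0.0
    count = sum((Counter(terms_a) & Counter(terms_b)).values())
    return 2.0 * count / (len(terms_a) + len(terms_b))


def top_n(second_set, terms_b, n):
    """Return the n token sequences most relevant to terms_b, best first."""
    return heapq.nlargest(n, second_set, key=lambda terms_a: relevance(terms_a, terms_b))


terms_b = ["wang", "steak"]
second_set = [
    ["wang", "steak", "pudong"],
    ["north", "xinjiang", "restaurant"],
    ["wang", "hotel"],
]
third_set = top_n(second_set, terms_b, 2)
```

With any formula that grows with count, the ranking step stays the same; only the body of `relevance` would change.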
In the short text data aggregation system provided by the above embodiment, data extraction, filtering, relevance calculation and similarity calculation together give the whole data aggregation process high matching accuracy and high execution efficiency. The system logic is simple and configuration is convenient. Preferably, the system can be deployed on a cloud computing system, which eases upgrading, maintenance and adoption across the industry.
A further embodiment of the invention provides a short text data aggregation method comprising the following steps:
Step S10: obtain the first set of short text data from the data storage module, and obtain the short text data to be aggregated from outside.
Step S20: extract from the first set the fields of each short text data that participate in aggregation, to form the second set of first field attribute data, and extract from the short text data to be aggregated the fields that participate in aggregation, to form the second field attribute data.
Step S30: query the second set for several first field attribute data whose relevance to the second field attribute data meets the relevance threshold, to form the third set of first field attribute data.
Specifically, the relevance threshold may be set statically, or it may be set dynamically according to the results of the relevance calculation. The relevance is calculated by the formula:
[formula not reproduced in the source text]
where len(termsA) is the length of the token sequence A of the first field attribute data, len(termsB) is the length of the token sequence B of the second field attribute data, and count is the number of identical tokens shared by the token sequence A of the first field attribute data and the token sequence B of the second field attribute data.
Step S40: calculate the similarity between each first field attribute data in the third set and the second field attribute data.
Specifically, the similarity may be calculated using one of the following algorithms or a combination of several of them: the Jaro-Winkler similarity algorithm; the Levenshtein similarity algorithm; the longest common substring algorithm; a phrase similarity algorithm; and the cosine similarity algorithm.
Step S50: aggregate the short text data corresponding to the first field attribute data in the third set that has the highest similarity to the second field attribute data with the short text data to be aggregated.
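Steps S10 to S50 can be strung together as in the sketch below. Several simplifications are assumptions made here: a single field takes part in aggregation, relevance is reduced to a shared-token count, and `difflib.SequenceMatcher.ratio()` from the Python standard library stands in for the similarity algorithms named in step S40. All function and key names are hypothetical.

```python
from difflib import SequenceMatcher


def tokenize(text):
    return text.lower().split()


def aggregate(first_set, incoming, field="name", relevance_threshold=1):
    """S10 supplies `first_set` (stored data) and `incoming` (external data)."""
    # S20: extract the participating field from both sides.
    second_set = [(record, record[field]) for record in first_set]
    target = incoming[field]
    target_tokens = set(tokenize(target))
    # S30: keep candidates whose shared-token count meets the threshold.
    third_set = [
        (record, value) for record, value in second_set
        if len(set(tokenize(value)) & target_tokens) >= relevance_threshold
    ]
    if not third_set:
        return None
    # S40: similarity between each candidate and the target (stand-in measure).
    scored = [
        (SequenceMatcher(None, value.lower(), target.lower()).ratio(), record)
        for record, value in third_set
    ]
    # S50: pair the most similar stored record with the incoming record.
    best_score, best_record = max(scored, key=lambda pair: pair[0])
    return {"internal": best_record, "external": incoming, "similarity": best_score}


first_set = [
    {"id": 1, "name": "Wang Steak Pudong"},
    {"id": 2, "name": "Wang Family Hotel"},
    {"id": 3, "name": "North Xinjiang Restaurant"},
]
result = aggregate(first_set, {"name": "Wang Steak", "city": "Shanghai"})
```

Here the third record is dropped in S30 because it shares no token with the incoming name, and the first record wins in S40.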
As a concrete application of the above embodiments, an aggregation example for merchant data is given below.
External merchant data comes from external Internet platforms, for example websites such as Dianping, Ctrip and eLong. On the one hand, these third-party public data platforms hold public information on a great many merchants and therefore offer diverse data sources. On the other hand, many third-party public data platforms are interactive: users can rate each merchant according to their own preferences, which forms a latent, crowd-sourced evaluation of a merchant's credit standing and helps in assessing the merchant's real value appropriately.
In the first stage of this application, a web crawler obtains part of the merchant information from the three websites above. Taking Dianping as an example, the crawled fields are shown in the table below:
Field information | Sample |
Merchant ID | 2209663 |
City | Shanghai |
District | Pudong New District |
Shop name | Wang Steak (Wang Pin Tai Su Steak) |
Shop alias | NA |
Branch information | China Resources Times branch |
Number of branches | 5 |
Category | { Western food - steak } |
Business district | Yaohan |
Address | 7F, China Resources Times Square, 500 Zhangyang Road, Pudong New District (near Pudong South Road) |
Business hours | { 11.5-14, 17.5-21 } |
Per-capita spend | 323 yuan |
Overall score | 4.5 |
Key tags | { couples' date: 1418, cards accepted: 543, gathering with friends: 534, business entertaining: 484 } |
Score details | { 531, 699, 187, 22, 6 } |
Default reviews | 2815 |
Check-in short reviews | 698 |
All reviews | 3224 |
Group-buy reviews | 4 |
Taste score (sub-score A) | 8.3 |
Environment score (sub-score B) | 8.8 |
Service score (sub-score C) | 9.1 |
Favorites | 1895 |
Views | 643919 |
Views in the last week | 2328 |
Also viewed | BaiWanZhuangYuan (Guangan branch), Wan Lou Fu local flavor restaurant, Chan Ren chop house, North Xinjiang restaurant ... |
Geographic information | 116.37707,39.89292 |
Timestamp | 2014-3-14 15:43 |
Transport information | riek_mam: Parking is in the lane and rather difficult (13-08-14); the fragrance four seasons: Free parking |
The merchant data obtained from outside contains a large number of fields. Some are discrete information that describes the merchant's characteristics in words; others are continuous information that describes the merchant's value numerically. Clearly, not all of these fields need to be used in aggregation: some contribute nothing to the aggregation process and would only increase its data throughput and processing load, lowering the execution efficiency of the system.
As the other party to the aggregation, internal merchant data is relatively easy to obtain. However, internal data involves the specific information of individual merchants, and operating on it directly is error-prone. Internal merchant data is therefore exported and then processed, which both isolates the original data and allows the internal merchant data to be organized in the format of the external merchant data.
In the second stage of this application, the fields that participate in aggregation mainly include, for example:
When the field extraction unit 201 extracts external merchant data, each field in the table above may be extracted to form the second field attribute data. When the field extraction unit 201 extracts internal merchant data, the extracted data fields include at least the following three fields:
Field | Explanation |
Merchant ID | Internal unique identifier of the individual merchant; makes later back-tracing easier |
Merchant name | The core field for aggregation |
Merchant MCC | The type of merchant |
The two tables above show only some of the fields. As needed, any number and any category of fields can be extracted from internal and external merchant data according to the field configuration table.
After the internal and external merchant data have been extracted to form, respectively, the second set of first field attribute data (corresponding to internal merchant data) and the second field attribute data (corresponding to external merchant data), whether these data are suitable for aggregation still needs further verification. Dianping, Ctrip and eLong are third-party public data platforms, and the merchant data each focuses on has a certain bias; Ctrip, for example, is mainly concerned with hotel information. Not all internal merchant data can therefore be aggregated, and only by adding different data sources can as much internal merchant data as possible be aggregated with external merchant data.
Because a large amount of internal merchant data must take part in the relevance and similarity calculations, filtering the second set as far as possible helps improve execution efficiency. The main filter targets include, for example:
Filter target | Filtering rule |
ATM | Merchants whose internal name contains a string such as "ATM" or "cash dispenser" are removed |
POS | Merchants whose internal name contains a string such as "POS" are removed |
Self-employed | Merchants whose internal name contains a string such as "self-employed" or "individual" are removed |
Special category (MCC) | Internal merchants whose MCC is "special category" are also filtered out |
Name length | If the internal merchant name is too short, it carries too little information and cannot take part in aggregation |
Judging by the filter targets above, the filtering involves two patterns:
First, the category pattern. Inspection of the MCCs of the internal merchant data showed that "special category" is a rather unusual MCC: it contains a great deal of clearing content rather than real merchant information, so it is unsuitable for aggregation and merchants of this kind must be removed.
Second, the contains pattern. Some merchants are removed on the basis of keywords contained in the merchant name, and merchants whose names carry very little information are filtered out by name length.
The configuration file used for this filtering therefore needs two parts. Under the contains pattern, relevant keywords can be added, and any merchant whose name contains one of them is filtered out. Under the category pattern, MCCs are specified, and any merchant with one of those MCCs is filtered out.
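A two-part filter configuration of this kind could be sketched as follows. The configuration keys, the minimum name length of 3 and the sample MCC codes are hypothetical values chosen for illustration; the keyword and category entries mirror the filter targets listed above.

```python
# Hypothetical filter configuration with a "contains" part and a "category" part.
FILTER_CONFIG = {
    "contains": ["ATM", "cash dispenser", "POS", "self-employed", "individual"],
    "categories": ["special category"],
    "min_name_length": 3,  # assumed cut-off for "name too short"
}


def is_filtered(merchant, config=FILTER_CONFIG):
    """True if the merchant should be excluded from aggregation."""
    name = merchant.get("name", "")
    if merchant.get("mcc") in config["categories"]:
        return True  # category pattern
    if len(name) < config["min_name_length"]:
        return True  # name carries too little information
    lowered = name.lower()
    return any(keyword.lower() in lowered for keyword in config["contains"])  # contains pattern


def split_merchants(merchants, config=FILTER_CONFIG):
    """Return (kept, removed) so that removed records can still be stored."""
    kept, removed = [], []
    for merchant in merchants:
        (removed if is_filtered(merchant, config) else kept).append(merchant)
    return kept, removed


merchants = [
    {"name": "Wang Steak", "mcc": "5812"},
    {"name": "Bank ATM 12", "mcc": "6011"},
    {"name": "Clearing Account", "mcc": "special category"},
    {"name": "AB", "mcc": "5812"},
]
kept, removed = split_merchants(merchants)
```

Plain substring matching is crude for English names (a keyword such as "POS" also matches inside longer words), so a production rule set would likely need word-boundary or segmenter-based matching.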
After field extraction and the screening out of abnormal data, the internal and external merchant data that meet the aggregation conditions are obtained. These data are stored on an HDFS platform and serve as the working data for aggregation. Although the filtered-out data are removed from the source data, they should not be discarded; they should be stored appropriately for later analysis and evaluation.
In the third stage of this application, the candidate data query unit 301 calculates the relevance for every pair of merchant data (specifically, one particular second field attribute data and any first field attribute data in the second set) using the relevance formula described above. The results are then sorted by relevance in descending order, and the candidate data query unit 301 selects from the second set the top 1000 first field attribute data with the highest relevance to the token sequence of the second field attribute data. These form the third set, which is passed to the similarity calculation in the next step.
For each data element (first field attribute data) in the third set, the similarity calculation unit 302 calculates its similarity to that particular second field attribute data, using the Jaro-Winkler similarity algorithm. It will be appreciated that the similarity calculation could also use edit-distance (Levenshtein) similarity, the longest common substring (LCS) algorithm, a phrase similarity method, a cosine similarity method, and so on.
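For reference, the standard Jaro-Winkler similarity can be written in plain Python as below. The prefix scale of 0.1 and the maximum prefix length of 4 are the conventional defaults for the algorithm, not values stated in the patent.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

The prefix boost suits merchant names, where a brand name typically comes first and branch or location details follow.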
The short text data aggregation unit 303 sorts the similarities calculated in the preceding step in descending order, aggregates the internal merchant data corresponding to the first field attribute data with the highest similarity with the external merchant data corresponding to that particular second field attribute data, and outputs the resulting aggregated merchant data.
An experiment was carried out with merchant data as the objects of aggregation. Sampled verification results for UnionPay (internal) merchant data and Dianping (external) merchant data for the Beijing and Shanghai regions show that the overall average matching rate between internal and external merchant data is 27.5%. The best matching precision of the aggregation model (the proportion of entries in the matching result set that are correctly matched) reaches about 75%, and the recall (the number of correctly matched entries in the result set divided by the number of entries in the test set that have a match) reaches about 85%.
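The two metrics as defined in the preceding paragraph can be computed as in the following sketch. The dictionaries and identifiers are made-up illustration data, not the experiment's data.

```python
def precision_recall(predicted, truth):
    """Both arguments map external record ids to internal record ids.

    `predicted` is the matching result set; `truth` lists only the test-set
    entries that really have a match.
    """
    correct = sum(1 for ext_id, int_id in predicted.items() if truth.get(ext_id) == int_id)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Made-up example: four predicted matches, three of them correct,
# against four entries that truly have a match.
predicted = {"e1": "i1", "e2": "i9", "e3": "i3", "e4": "i4"}
truth = {"e1": "i1", "e3": "i3", "e4": "i4", "e5": "i5"}
precision, recall = precision_recall(predicted, truth)
```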
The above describes only preferred embodiments of the invention and is not intended to limit its scope. Those skilled in the art may make various modifications without departing from the spirit of the invention and the appended claims.
Claims (12)
1. a kind of short text data paradigmatic system, including:
Data acquisition module, it includes internal data loading unit and external data acquiring unit, and the internal data loading is single
Member obtains the first set of short text data from the data memory module of the system, and the external data acquiring unit is from described
The outside of system obtains short text data to be polymerized;
Data extraction module, is coupled with the data acquisition module, and it includes field extracting unit, the field extracting unit from
The field of the participation polymerization of each short text data is extracted in the first set respectively, to form the first field attribute data
Second set, and from the short text data to be polymerized extract participate in polymerization field, to form the second field attribute number
According to;And
Data aggregate module, is coupled with the data extraction module, and it includes candidate data query unit, similarity calculated
And short text data polymerized unit;
Wherein, the candidate data query unit is inquired about related to the second field attribute data from the second set
Several described first field attribute data, to form the 3rd set of the first field attribute data, the similarity meter
Calculate between each first field attribute data and the second field attribute data that unit is calculated in the 3rd set
Similarity, the short text data polymerized unit by the described 3rd gather in, with the second field attribute data similarity
The short text data described in highest corresponding to the first field attribute data carries out data with the text data to be polymerized
Polymerization.
2. The system according to claim 1, characterized in that the data aggregation module further includes an inverted list construction unit, wherein the inverted list construction unit constructs an inverted list for the second field attribute data, and the candidate data query unit queries the several related first field attribute data from the second set according to the inverted list.
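The inverted-list lookup in claim 2 can be illustrated with a minimal sketch. It assumes the index maps each token to the stored records whose field text contains it, so that the tokens of the incoming record can be looked up directly; the function names, the whitespace tokenizer, and the sample records are hypothetical and not taken from the patent.

```python
from collections import defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase whitespace split.
    # Chinese text would need a proper word segmenter instead.
    return text.lower().split()


def build_inverted_list(records):
    """Map each token to the ids of the records whose text contains it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in tokenize(text):
            index[token].add(record_id)
    return index


def query_candidates(index, query_text):
    """Return ids of records that share at least one token with the query."""
    candidates = set()
    for token in tokenize(query_text):
        candidates |= index.get(token, set())
    return candidates


# Illustrative data only.
stored = {
    1: "Starbucks Coffee Nanjing Road",
    2: "KFC Nanjing Road",
    3: "Starbucks People Square",
    4: "McDonalds Century Avenue",
}
index = build_inverted_list(stored)
candidates = query_candidates(index, "starbucks coffee nanjing")
```

Only records sharing a token with the query are returned, so the later similarity step does not have to scan the whole second set.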
3. The system according to claim 1, characterized in that the data extraction module further includes a data filtering unit, wherein the data filtering unit filters out, from the second set, the first field attribute data that cannot be matched with the second field attribute data.
4. The system according to claim 1, characterized in that the candidate data query unit calculates the relevance between each first field attribute data in the second set and the second field attribute data, and forms the third set from the first field attribute data whose relevance is greater than a relevance threshold.
5. The system according to claim 4, characterized in that the relevance takes as a calculation factor the number of identical segmented words shared between the word segmentation sequence of the first field attribute data and the word segmentation sequence of the second field attribute data.
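The relevance filter of claims 4 and 5 can be sketched as follows. The claims state only that the count of shared segmented words is a calculation factor; the Dice-style normalization `2 * count / (len(A) + len(B))` used here is an assumption made for illustration, and the function names and sample data are hypothetical.

```python
def relevance(terms_a, terms_b):
    """Score two token sequences by their number of shared distinct tokens.

    The normalization is an assumed Dice-style ratio, not the patent's formula.
    """
    count = len(set(terms_a) & set(terms_b))
    total = len(terms_a) + len(terms_b)
    return 2 * count / total if total else 0.0


def filter_candidates(second_set, query_terms, threshold):
    """Keep the entries whose relevance to the query exceeds the threshold."""
    return {
        record_id: terms
        for record_id, terms in second_set.items()
        if relevance(terms, query_terms) > threshold
    }


# Illustrative data only.
second_set = {
    1: ["starbucks", "coffee", "nanjing", "road"],
    2: ["kfc", "nanjing", "road"],
    3: ["mcdonalds", "century", "avenue"],
}
query_terms = ["starbucks", "coffee", "nanjing"]
third_set = filter_candidates(second_set, query_terms, threshold=0.5)
```

With these inputs only record 1 clears the threshold, because it shares three tokens with the query while record 2 shares one and record 3 shares none.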
6. The system according to claim 1, characterized in that the similarity calculation unit calculates the similarity using one of the following algorithms or a combination of several of them:
the Jaro-Winkler similarity algorithm;
the Levenshtein similarity algorithm;
the longest common substring algorithm;
the phrase similarity algorithm; and
the cosine similarity algorithm.
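As an illustration of combining several of the listed algorithms, the sketch below implements two of them, a Levenshtein-based similarity and a character-level cosine similarity, and blends them with a weighted sum. The normalization of edit distance to the range 0 to 1, the use of character counts as vectors, and the equal weights are all assumptions; the patent does not specify how the algorithms are combined.

```python
import math
from collections import Counter


def levenshtein_similarity(a, b):
    """Return 1 - edit_distance / max_length, so identical strings score 1.0."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def cosine_similarity(a, b):
    """Cosine of the angle between character-count vectors of a and b."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def combined_similarity(a, b, weights=(0.5, 0.5)):
    """Weighted sum of the two measures; the weights are an assumption."""
    return (weights[0] * levenshtein_similarity(a, b)
            + weights[1] * cosine_similarity(a, b))


score = combined_similarity("starbucks nanjing road", "starbucks nanjing rd")
```

A weighted sum is only one possible way to combine the measures; taking the maximum or training the weights on labelled pairs would fit the claim wording equally well.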
7. The system according to claim 1, characterized in that the field extraction unit includes a field configuration table with which a user configures the fields that participate in aggregation.
8. The system according to any one of claims 1 to 7, characterized in that the system further includes a serialization unit and a deserialization unit, wherein the serialization unit serializes in-memory data for storage on disk, and the deserialization unit converts disk files into in-memory data.
9. The system according to claim 8, characterized in that the system is deployed as a cloud computing system.
10. A short text data aggregation method, comprising the following steps:
a) obtaining a first set of short text data from a data storage module, and obtaining short text data to be aggregated from outside;
b) extracting, from the first set, the fields of each short text data item that participate in aggregation, to form a second set of first field attribute data, and extracting the fields that participate in aggregation from the short text data to be aggregated, to form second field attribute data;
c) querying, from the second set, several first field attribute data whose relevance to the second field attribute data meets a relevance threshold, to form a third set of first field attribute data;
d) calculating the similarity between each first field attribute data in the third set and the second field attribute data;
e) aggregating the short text data corresponding to the first field attribute data in the third set that has the highest similarity to the second field attribute data with the short text data to be aggregated.
11. The method according to claim 10, characterized in that, in step c), the relevance is calculated by a formula (the formula itself is not reproduced in this text), wherein len(termsA) denotes the length of the word segmentation sequence A of the first field attribute data, len(termsB) denotes the length of the word segmentation sequence B of the second field attribute data, and count is the number of identical segmented words shared between the word segmentation sequence A of the first field attribute data and the word segmentation sequence B of the second field attribute data.
12. The method according to claim 10, characterized in that, in step d), the similarity is calculated using one of the following algorithms or a combination of several of them:
the Jaro-Winkler similarity algorithm; the Levenshtein similarity algorithm; the longest common substring algorithm; the phrase similarity algorithm; and the cosine similarity algorithm.
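Steps a) to e) of the method in claim 10 can be sketched end to end as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the configured field names, the Dice-style relevance score, and the threshold value are hypothetical, and Python's `difflib.SequenceMatcher` is used as a stand-in similarity measure that is not one of the algorithms named in the claims.

```python
from difflib import SequenceMatcher

# Hypothetical configuration of the fields that participate in aggregation.
AGGREGATION_FIELDS = ("name", "address")


def extract_fields(record):
    """Step b): join the configured fields into one lowercase string."""
    return " ".join(str(record.get(f, "")) for f in AGGREGATION_FIELDS).lower()


def relevance(terms_a, terms_b):
    """Assumed Dice-style score based on the count of shared tokens."""
    count = len(set(terms_a) & set(terms_b))
    total = len(terms_a) + len(terms_b)
    return 2 * count / total if total else 0.0


def find_aggregation_target(first_set, incoming, threshold=0.3):
    """Return the stored record to aggregate with `incoming`, or None."""
    second_set = [extract_fields(r) for r in first_set]   # step b)
    query = extract_fields(incoming)
    third_set = [                                         # step c)
        i for i, text in enumerate(second_set)
        if relevance(text.split(), query.split()) >= threshold
    ]
    if not third_set:
        return None
    best = max(                                           # steps d) and e)
        third_set,
        key=lambda i: SequenceMatcher(None, second_set[i], query).ratio(),
    )
    return first_set[best]


# Step a): illustrative stored records and one incoming record.
first_set = [
    {"name": "Starbucks Coffee", "address": "100 Nanjing Road", "phone": "111"},
    {"name": "KFC", "address": "100 Nanjing Road"},
    {"name": "McDonalds", "address": "8 Century Avenue"},
]
incoming = {"name": "Starbucks", "address": "100 Nanjing Rd"}
target = find_aggregation_target(first_set, incoming)
```

The cheap token-overlap filter runs over the whole second set, while the more expensive string similarity runs only over the smaller third set, which is the division of labour that the claims describe.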
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611242641.3A CN106980639B (en) | 2016-12-29 | 2016-12-29 | Short text data aggregation system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611242641.3A CN106980639B (en) | 2016-12-29 | 2016-12-29 | Short text data aggregation system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106980639A (en) | 2017-07-25 |
CN106980639B (en) | 2020-07-28 |
Family
ID=59340500
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611242641.3A Active CN106980639B (en) | 2016-12-29 | 2016-12-29 | Short text data aggregation system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106980639B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108647213A (en) * | 2018-05-21 | 2018-10-12 | 辽宁工程技术大学 | A kind of composite key semantic relevancy appraisal procedure based on coupled relation analysis |
CN109190117A (en) * | 2018-08-10 | 2019-01-11 | 中国船舶重工集团公司第七〇九研究所 | A kind of short text semantic similarity calculation method based on term vector |
WO2019128409A1 (en) * | 2017-12-28 | 2019-07-04 | 中国银联股份有限公司 | Data compression and storage method and data compression and storage device |
CN110750588A (en) * | 2019-10-29 | 2020-02-04 | 珠海格力电器股份有限公司 | Multi-source heterogeneous data fusion method, system, device and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060224584A1 (en) * | 2005-03-31 | 2006-10-05 | Content Analyst Company, Llc | Automatic linear text segmentation |
CN104809117A (en) * | 2014-01-24 | 2015-07-29 | 深圳市云帆世纪科技有限公司 | Video data aggregation processing method, aggregation system and video searching platform |
CN104866631A (en) * | 2015-06-18 | 2015-08-26 | 北京京东尚科信息技术有限公司 | Method and device for aggregating counseling problems |
-
2016
- 2016-12-29 CN CN201611242641.3A patent/CN106980639B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN106980639B (en) | 2020-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102890803B (en) | The defining method of the abnormal process of exchange of electronic goods and device thereof | |
CN104035917B (en) | A kind of knowledge mapping management method and system based on semantic space mapping | |
CN110413707A (en) | The excavation of clique's relationship is cheated in internet and checks method and its system | |
CN106529968A (en) | Customer classification method and system thereof based on transaction data | |
Yao et al. | An ensemble model for fake online review detection based on data resampling, feature pruning, and parameter optimization | |
CN106980639A (en) | Short text data paradigmatic system and method | |
CN106708966A (en) | Similarity calculation-based junk comment detection method | |
CN109416761A (en) | Use the machine learning and prediction of figure community | |
CN104462592B (en) | Based on uncertain semantic social network user behavior relation deduction system and method | |
CN105931068A (en) | Cardholder consumption figure generation method and device | |
Gustafsson et al. | Comparison and validation of community structures in complex networks | |
CN106886518A (en) | A kind of method of microblog account classification | |
CN109299258A (en) | A kind of public sentiment event detecting method, device and equipment | |
Shirole et al. | Customer segmentation using rfm model and k-means clustering | |
CN107133289A (en) | A kind of method and apparatus of determination commercial circle | |
CN114611959A (en) | O2O big data technology-based product selection strategy system | |
CN108763496A (en) | A kind of sound state data fusion client segmentation algorithm based on grid and density | |
CN109472626A (en) | A kind of intelligent finance risk control method and system towards mobile phone charter business | |
CN107908733A (en) | A kind of querying method of global trade data, apparatus and system | |
CN107341199A (en) | A kind of recommendation method based on documentation & info general model | |
Watts et al. | Exchange network topologies and agent-based modeling: economies of the Sedentary-Period Hohokam | |
CN112819544A (en) | Advertisement putting method, device, equipment and storage medium based on big data | |
CN109145187A (en) | Cross-platform electric business fraud detection method and system based on comment data | |
CN105931055A (en) | Service provider feature modeling method for crowdsourcing platform | |
KR20210058525A (en) | Method and device for classifying unstructured item data automatically for goods or services |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||