CN107291748A - Feature extraction method and device - Google Patents

Feature extraction method and device

Info

Publication number
CN107291748A
Authority
CN
China
Prior art keywords
word
string
feature
address text
segmentation processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610202581.6A
Other languages
Chinese (zh)
Other versions
CN107291748B (en)
Inventor
王国印
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cainiao Smart Logistics Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201610202581.6A priority Critical patent/CN107291748B/en
Publication of CN107291748A publication Critical patent/CN107291748A/en
Application granted granted Critical
Publication of CN107291748B publication Critical patent/CN107291748B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Creation or modification of classes or clusters

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the field of data mining technology, and in particular to a feature extraction method and device. The feature extraction method provided by the application includes: determining an address text that has undergone word segmentation; and, according to a preset take-word number and a preset jump-word number, taking words from the segmented address text to form feature word strings of the segmented address text, wherein the number of words in each feature word string equals the take-word number, and each feature word string contains two adjacent words that are separated in the address text by a number of words equal to the jump-word number. The scheme can apply jump-word processing to address text, which makes it possible to obtain feature word strings with stronger distinguishing power and thereby improve the mining of address text.

Description

Feature extraction method and device
Technical field
The application relates to the field of data mining technology, and in particular to a feature extraction method and device.
Background
With the rapid growth of text information in data warehouses, text mining has become a research hotspot in the information field. Address information is stored in data warehouses as text. Because address information plays a very important role in big data analysis, address feature mining, as one kind of text mining, is becoming increasingly important.
Word segmentation of Chinese address text is the basis of text mining, which follows from the characteristics of the Chinese language. For example, segmenting the Chinese address text "Zhejiang Province Hangzhou City Yuhang District Wuchang Subdistrict Jingfeng Community Wenyi West Road" yields an address text consisting of the words Zhejiang Province, Hangzhou City, Yuhang District, Wuchang Subdistrict, Jingfeng Community and Wenyi West Road. Each word in the segmented address text has its own address meaning (for example, the three characters Zhe, Jiang and Sheng carry no address meaning when viewed individually, but the combined word Zhejiang Province does). In many cases, even if only some of the words of a Chinese address text are extracted, the extracted words still have strong distinguishing power.
Fig. 1 shows the process of feature extraction from Chinese address text in text classification. As Fig. 1 shows, text mining first performs word segmentation on the Chinese address text, then performs feature extraction (that is, takes words from the Chinese address text), and then performs classification based on the words taken. Therefore, once the Chinese address text has been segmented, feature extraction is the primary factor affecting how well the address text is mined.
At present, feature extraction is mainly based on the n-gram model. The n-gram is defined as follows: if an address text consists of m words $(w_1 w_2 w_3 \dots w_m)$, where $w_i$ (written wi below) is the i-th word in the address text, then the n-grams are defined as $\{\, w_i w_{i+1} \dots w_{i+n-1} \mid 1 \le i \le m-n+1 \,\}$.
For example, if the current address text consists of 5 words, w1w2w3w4w5, then:
When n=1, the 1-grams generated are w1, w2, w3, w4, w5.
When n=2, the 2-grams generated are w1w2, w2w3, w3w4, w4w5.
When n=3, the 3-grams generated are w1w2w3, w2w3w4, w3w4w5.
A mixed n-gram model takes the union of all the grams. For example, the grams of the mixed trigram model are: w1, w2, w3, w4, w5, w1w2, w2w3, w3w4, w4w5, w1w2w3, w2w3w4, w3w4w5.
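The n-gram definition and the mixed model above can be sketched in a few lines of Python. This is only an illustration of the background technique; the function names `ngrams` and `mixed_ngrams` are hypothetical and do not come from the patent.

```python
from typing import List, Tuple

def ngrams(words: List[str], n: int) -> List[Tuple[str, ...]]:
    """All runs of n consecutive words: {w_i ... w_{i+n-1} | 1 <= i <= m-n+1}."""
    m = len(words)
    # range() is empty when n > m, so no grams are produced in that case.
    return [tuple(words[i:i + n]) for i in range(m - n + 1)]

def mixed_ngrams(words: List[str], max_n: int) -> List[Tuple[str, ...]]:
    """Union of the 1-gram through max_n-gram sets, in order of increasing n."""
    result: List[Tuple[str, ...]] = []
    for n in range(1, max_n + 1):
        result.extend(ngrams(words, n))
    return result

words = ["w1", "w2", "w3", "w4", "w5"]
bigrams = ngrams(words, 2)        # w1w2, w2w3, w3w4, w4w5
mixed = mixed_ngrams(words, 3)    # 5 + 4 + 3 = 12 grams
```

Every gram produced this way is contiguous in the address text, which is the limitation the rest of the description addresses.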
It can be seen that n-gram-based address feature extraction continuously extracts n words from the address text to obtain a feature word string of n words. In some cases, however, the words in an address text have long-distance dependencies, or people leave out some unimportant words when describing the same address. Take the standard address text "Zhejiang Province Hangzhou City Yuhang District Wuchang Subdistrict Jingfeng Community Wenyi West Road 969 Alibaba Xixi Park" as an example. When entering the address, people may use a shortened form: "Hangzhou Yuhang District Wenyi West Road 969 Alibaba Xixi Park". Clearly, the continuous extraction approach cannot extract features that match this shortened address, because "Yuhang District Wenyi West Road", which appears in the shortened address, is not contiguous in the standard address text, and yet "Yuhang District Wenyi West Road" is precisely a string with very strong distinguishing power.
In summary, in current address feature extraction the words in each extracted feature word string are all contiguous in the address text, so feature word strings with stronger distinguishing power may be missed, which results in poor mining of address text.
Summary of the invention
The embodiments of the present application provide a feature extraction method and device to improve the mining of address text.
An embodiment of the present application provides a feature extraction method, including:
determining an address text that has undergone word segmentation, where the segmented address text contains N words and N is an integer greater than 1;
according to a preset take-word number and a preset jump-word number, taking words from the segmented address text to form feature word strings of the segmented address text, where the number of words in each feature word string equals the take-word number, and each feature word string contains two adjacent words that are separated in the address text by a number of words equal to the jump-word number.
Optionally, taking words from the segmented address text according to the preset take-word number and jump-word number to form the feature word strings of the segmented address text specifically includes:
presetting the take-word number as n and the jump-word number as the integers from 1 to k, where n is an integer greater than 1 and less than N, and k is an integer greater than 1 and less than N-1;
according to the current jump-word number s, choosing n words from the segmented address text starting from the word at the current position to obtain the feature word string, where s is an integer greater than 0 and less than or equal to k.
Optionally, choosing n words from the segmented address text starting from the word at the current position according to the current jump-word number s to obtain the feature word string includes:
in the segmented address text, continuously choosing n words starting from the word at the current position to obtain a first word string;
in the segmented address text, determining the words remaining after the n words continuously chosen from the word at the current position; when the number of remaining words is greater than or equal to s, continuously choosing s words starting from the first of the remaining words to obtain a second word string;
determining first target words among the words of the first word string other than its first word, and determining in the second word string second target words equal in number to the first target words;
determining the feature word string by replacing the first target words in the first word string with the second target words.
Optionally, determining first target words among the words of the first word string other than its first word, and determining in the second word string second target words equal in number to the first target words, includes:
taking each of the second through the last words of the first word string in turn as the starting jump word, and performing the following operations:
when, in the first word string, the number q of words from the starting jump word to the n-th word of the first word string is greater than or equal to s, determining the s consecutive words starting from the starting jump word as the first target words, and determining the words in the second word string as the second target words, where q is an integer greater than 1 and less than n;
when, in the first word string, the number q of words from the starting jump word to the n-th word of the first word string is less than s, determining the q words from the starting jump word to the n-th word as the first target words, and, starting from the last word of the second word string and moving toward the first word of the second word string, continuously choosing q words as the second target words.
Optionally, determining the feature word string by replacing the first target words in the first word string with the second target words includes:
replacing the first target words in the first word string with the second target words to obtain a third word string;
reordering the words in the third word string according to the order in which the N words are arranged in the segmented address text to obtain the feature word string.
An embodiment of the present application provides a feature extraction device, including:
a determining module, configured to determine an address text that has undergone word segmentation, where the segmented address text contains N words and N is an integer greater than 1;
a word-taking module, configured to take words from the segmented address text according to a preset take-word number and a preset jump-word number to form feature word strings of the segmented address text, where the number of words in each feature word string equals the take-word number, and each feature word string contains two adjacent words that are separated in the address text by a number of words equal to the jump-word number.
The scheme of the application can apply jump-word processing to address text, which makes it possible to obtain feature word strings with stronger distinguishing power and thereby improve the mining of address text.
Other features and advantages of the application will be set forth in the following description, and in part will become apparent from the description or be understood by implementing the application. The objectives and other advantages of the application can be realized and obtained through the structures particularly pointed out in the written description, the claims and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of the application and form part of the specification. Together with the embodiments of the application they serve to explain the application and do not limit it. In the drawings:
Fig. 1 is a flow chart of text classification in data mining in the prior art;
Fig. 2 is a flow chart of a feature extraction method provided in an embodiment of the present application;
Fig. 3A is a schematic diagram of generating feature word strings in the case of s < n, provided in an embodiment of the present application;
Fig. 3B is another schematic diagram of generating feature word strings in the case of s < n, provided in an embodiment of the present application;
Fig. 4 is a schematic diagram of generating feature word strings in the case of s >= n, provided in an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a feature extraction device provided in an embodiment of the present application.
Detailed description
The preferred embodiments of the application are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the application, not to limit it. Where there is no conflict, the embodiments in the application and the features in the embodiments can be combined with one another.
An embodiment of the present application provides a feature extraction method which, as shown in Fig. 2, includes:
Step 21: determine an address text that has undergone word segmentation. The segmented address text contains N words, where N is an integer greater than 1.
Step 22: according to a preset take-word number and a preset jump-word number, take words from the segmented address text to form feature word strings of the segmented address text.
In step 22, words are taken from the segmented address text according to the preset take-word number and jump-word number to obtain a feature word string set containing at least one feature word string, where the number of words in each feature word string equals the take-word number, and each feature word string contains two adjacent words that are separated in the segmented address text by a number of words equal to the jump-word number.
In a specific implementation, words can additionally be taken from the segmented address text by continuous extraction. The resulting feature word string set then includes both continuously extracted feature word strings and feature word strings extracted by the discontinuous extraction provided by the scheme of the application.
In a specific implementation, step 22 can be, but is not limited to being, realized as follows:
Step A1: preset the take-word number as n and the jump-word number as the integers from 1 to k (if words are also taken from the address text by continuous extraction, the jump-word number can be set to the integers from 0 to k), where n is an integer greater than 1 and less than N, and k is an integer greater than 1 and less than N-1.
For example, for the segmented address text w1w2w3w4w5, which contains the 5 words w1, w2, w3, w4, w5 (that is, N=5), the take-word number can be set to n=3 and the jump-word number to 1, 2 or to 0, 1, 2 (that is, k=2).
In a specific implementation, if continuous extraction is used in addition to discontinuous extraction, the jump-word number can take the values from 0 to k. The description below uses jump-word numbers from 0 to k as an example. Note that a jump-word number of 0 is only one implementation of the embodiments of the present application; in an actual implementation the jump-word number need not include the value 0.
Step A2: according to the current jump-word number s, choose n words from the segmented address text starting from the word at the current position to obtain the feature word string.
In a specific implementation, the embodiments of the present application can realize discontinuous feature extraction based on k-skip-n-gram or based on conti-k-skip-n-gram. Here, k-skip-n-gram means that any two adjacent words in an extracted feature word string are separated by k words in the address text, whereas conti-k-skip-n-gram means that the extracted feature word string contains only one pair of adjacent words that are separated by k words in the address text. Because k-skip-n-gram is relatively difficult to implement in a program when k>2 and n>3, the embodiments of the present application preferably realize feature extraction based on conti-k-skip-n-gram.
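The difference between the two variants can be made concrete with a small Python sketch. It assumes a particular reading of the text: in k-skip-n-gram every adjacent pair is k words apart, and in conti-k-skip-n-gram exactly one adjacent pair is k words apart while all other pairs are contiguous. The function names are hypothetical, and k >= 1 is assumed.

```python
from typing import List, Tuple

def strict_skip_ngrams(words: List[str], n: int, k: int) -> List[Tuple[str, ...]]:
    """k-skip-n-gram as read here: every adjacent pair is k words apart."""
    step = k + 1
    span = (n - 1) * step + 1  # how many address words one string stretches over
    return [tuple(words[i:i + span:step]) for i in range(len(words) - span + 1)]

def conti_skip_ngrams(words: List[str], n: int, k: int) -> List[Tuple[str, ...]]:
    """conti-k-skip-n-gram as read here: exactly one gap of k words, rest contiguous."""
    out = []
    span = n + k  # n chosen words plus the k skipped words
    for start in range(len(words) - span + 1):
        window = words[start:start + span]
        for cut in range(1, n):  # number of words kept before the gap
            out.append(tuple(window[:cut] + window[cut + k:]))
    return out

words = ["w1", "w2", "w3", "w4", "w5"]
strict = strict_skip_ngrams(words, 3, 1)  # only w1w3w5 fits in 5 words
conti = conti_skip_ngrams(words, 3, 1)    # w1w3w4, w1w2w4, w2w4w5, w2w3w5
```

Under this reading the conti variant yields the same strings as the worked examples for s=1 and s=2 in the description.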
Step A2 is executed in a loop. The loop can iterate over the current jump-word number with a nested loop over the current position, or iterate over the current position with a nested loop over the current jump-word number. The two modes are described below.
Loop mode one: loop over the current jump-word number, with a nested loop over the current position.
In this mode, for the current jump-word number s, each of the first through the (N-n-s+1)-th words of the segmented address text is taken in turn as the word at the current position (to obtain feature word strings that satisfy the take-word number and jump-word number conditions, the current position cannot exceed N-n-s+1), and the following operation is performed: according to the current jump-word number s, choose n words from the segmented address text starting from the word at the current position to obtain feature word strings.
For example, for the segmented address text w1w2w3w4w5, which contains the 5 words w1, w2, w3, w4, w5 (that is, N=5), with the take-word number set to n=3 and the jump-word number s set to 0, 1 and 2 in turn (that is, k=2), the operations under loop mode one are:
1) The current jump-word number s is 0. With w1 as the word at the current position, 3 words are chosen continuously starting from w1, giving the feature word string w1w2w3. Next, with w2 as the word at the current position, 3 words are chosen continuously starting from w2, giving w2w3w4. Next, with w3 as the word at the current position, 3 words are chosen continuously starting from w3, giving w3w4w5. Since N-n-s+1=5-3-0+1=3, when the current jump-word number s is 0 the current position goes only as far as w3.
2) The current jump-word number s is 1. With w1 as the word at the current position, 3 words are chosen starting from w1 such that the chosen feature word string contains two adjacent words separated by 1 word in the address text, giving the feature word strings w1w3w4 and w1w2w4. Next, with w2 as the word at the current position, 3 words are chosen starting from w2 in the same way, giving the feature word strings w2w3w5 and w2w4w5. Since N-n-s+1=5-3-1+1=2, when the current jump-word number s is 1 the current position goes only as far as w2.
3) The current jump-word number s is 2. With w1 as the word at the current position, 3 words are chosen starting from w1 such that the chosen feature word string contains two adjacent words separated by 2 words in the address text, giving the feature word strings w1w4w5 and w1w2w5. Since N-n-s+1=5-3-2+1=1, when the current jump-word number s is 2 the current position goes only as far as w1.
The feature word strings obtained with loop mode one are thus w1w2w3, w2w3w4, w3w4w5, w1w3w4, w1w2w4, w2w3w5, w2w4w5, w1w4w5, w1w2w5.
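Loop mode one can be sketched as follows. This is a minimal illustration rather than the patented procedure itself: the helper `strings_at` builds each feature word string directly by dropping s words from a window of n+s words, which is an assumed shortcut that yields the same strings as the worked example. Positions are 0-based in the code, and all function names are hypothetical.

```python
from typing import List, Tuple

def strings_at(words: List[str], pos: int, n: int, s: int) -> List[Tuple[str, ...]]:
    """Feature word strings that start at 0-based index pos with jump-word number s."""
    if s == 0:
        return [tuple(words[pos:pos + n])]  # plain continuous extraction
    window = words[pos:pos + n + s]
    # Keep `cut` words, skip s words, keep the rest: exactly one gap of s words.
    return [tuple(window[:cut] + window[cut + s:]) for cut in range(1, n)]

def loop_mode_one(words: List[str], n: int, k: int) -> List[Tuple[str, ...]]:
    N = len(words)
    result = []
    for s in range(0, k + 1):                # outer loop: current jump-word number
        for pos in range(0, N - n - s + 1):  # inner loop: positions 1..N-n-s+1 (1-based)
            result.extend(strings_at(words, pos, n, s))
    return result

words = ["w1", "w2", "w3", "w4", "w5"]
features = loop_mode_one(words, n=3, k=2)
```

For the five-word example this produces nine feature word strings; the order of the two strings generated at a given position may differ from the order listed in the text.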
Loop mode two: loop over the current position, with a nested loop over the current jump-word number.
In this mode, for the word at the current position (the 1st through the (N-n+1)-th words are taken in turn as the word at the current position), each value from 0 to k is taken in turn as the current jump-word number, and the following operation is performed: according to the current jump-word number s, choose n words from the segmented address text starting from the word at the current position to obtain feature word strings.
Again taking the segmented address text w1w2w3w4w5 as an example, which contains the 5 words w1, w2, w3, w4, w5 (that is, N=5), with the take-word number set to n=3 and the jump-word number s set to 0, 1 and 2 in turn (that is, k=2), the operations under loop mode two are:
1) The word at the current position is w1. With the current jump-word number s equal to 0, 3 words are chosen continuously starting from w1, giving the feature word string w1w2w3. With s equal to 1, 3 words are chosen starting from w1 such that the chosen feature word string contains two adjacent words separated by 1 word in the address text, giving the feature word strings w1w3w4 and w1w2w4. With s equal to 2, 3 words are chosen starting from w1 such that the chosen feature word string contains two adjacent words separated by 2 words in the address text, giving the feature word strings w1w4w5 and w1w2w5.
2) The word at the current position is w2. With the current jump-word number s equal to 0, 3 words are chosen continuously starting from w2, giving the feature word string w2w3w4. With s equal to 1, 3 words are chosen starting from w2 such that the chosen feature word string contains two adjacent words separated by 1 word in the address text, giving the feature word strings w2w3w5 and w2w4w5. To obtain feature word strings that satisfy the take-word number and jump-word number conditions, the current position cannot exceed N-n-s+1; since N-n-s+1=2 gives s=1 (that is, s is at most 1), when the word at the current position is w2 the current jump-word number s goes only as far as 1.
3) The word at the current position is w3. With the current jump-word number s equal to 0, 3 words are chosen continuously starting from w3, giving the feature word string w3w4w5. Since N-n-s+1=3 gives s=0, when the word at the current position is w3 the current jump-word number s goes only as far as 0.
The feature word strings obtained with loop mode two are thus w1w2w3, w1w3w4, w1w2w4, w1w4w5, w1w2w5, w2w3w4, w2w3w5, w2w4w5, w3w4w5.
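Loop mode two swaps the nesting: the position is the outer loop, and the largest usable jump-word number for each position follows from the bound "current position <= N-n-s+1". The sketch below is an assumed illustration with hypothetical names and 0-based positions; as in the text, it yields the same set of strings as loop mode one, only in a different order.

```python
from typing import List, Tuple

def loop_mode_two(words: List[str], n: int, k: int) -> List[Tuple[str, ...]]:
    N = len(words)
    result = []
    for pos in range(0, N - n + 1):       # outer loop: positions 1..N-n+1 (1-based)
        # Rearranging pos_1based <= N-n-s+1 gives s <= N-n-pos for 0-based pos.
        max_s = min(k, N - n - pos)
        for s in range(0, max_s + 1):     # inner loop: current jump-word number
            if s == 0:
                result.append(tuple(words[pos:pos + n]))
                continue
            window = words[pos:pos + n + s]
            for cut in range(1, n):       # keep `cut` words, skip s, keep the rest
                result.append(tuple(window[:cut] + window[cut + s:]))
    return result

words = ["w1", "w2", "w3", "w4", "w5"]
features = loop_mode_two(words, n=3, k=2)
```

For w3 the bound gives `max_s = 0`, matching the observation in the text that only continuous extraction is possible from that position.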
Besides the two loop modes above, the loop need not increment the current jump-word number and the current position by 1 in order, as long as all possible combinations of current jump-word number and current position are eventually traversed.
As the results above show, the embodiment of the present application obtains 9 feature word strings, whereas conventional continuous extraction obtains only the three feature word strings w1w2w3, w2w3w4 and w3w4w5. The scheme of the application therefore not only obtains non-contiguous feature word strings but also increases the number of feature word strings obtained, and can thus improve the mining of address text.
Whichever loop mode is used, for each iteration of the loop (corresponding to one current jump-word number s and one current position), step A2 above can be, but is not limited to being, realized by the following steps:
Step B1: in the segmented address text, continuously choose n words starting from the word at the current position to obtain a first word string.
Continuous choosing here means choosing continuously toward the last word of the address text (below, unless a direction toward the first word is specified, continuous choosing always means choosing toward the last word).
Step B1 is a continuous word-selection process. For the segmented address text w1w2w3w4w5, if the word at the current position is w1, then n=3 words are chosen continuously and the first word string is w1w2w3.
Step B2: in the segmented address text, determine the words remaining after the n words (the first word string) continuously chosen from the word at the current position; when the number of remaining words is greater than or equal to s, continuously choose s words starting from the first of the remaining words to obtain a second word string.
Here, if the number of remaining words is less than s, n words cannot be chosen starting from the word at the current position according to the current jump-word number s, because doing so would go beyond the boundary of the address text. In this case the current iteration of the loop ends. For example, for the segmented address text w1w2w3w4w5, assume n=3, s=2 and the word at the current position is w2. The number of words remaining after the 3rd word counted from w2 is 1, so continuously choosing 2 words starting from w5 cannot be performed. Therefore, in the embodiments of the present application, the second word string is obtained only when the number of remaining words is not less than s.
For example, for the segmented address text w1w2w3w4w5, the words remaining after continuously choosing 3 words starting from the word w1 at the current position are w4, w5. The number of remaining words is 2, which equals s (the current jump-word number s=2), so the 1st and 2nd of the remaining words w4, w5 are chosen and the second word string is w4w5.
In a specific implementation, when the current jump-word number s is known, the maximum current position can be set to N-n-s+1 (see loop mode one above), which guarantees that the number of remaining words is not less than s.
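Steps B1 and B2 amount to slicing the segmented address text twice. The sketch below is illustrative only: the function name is hypothetical, `pos` is the 1-based current position used in the text, and returning `None` is an assumed way of signalling that the iteration ends because fewer than s words remain.

```python
from typing import List, Optional, Tuple

def first_and_second_strings(
    words: List[str], pos: int, n: int, s: int
) -> Optional[Tuple[List[str], List[str]]]:
    """Return (first word string, second word string), or None if the
    window would run past the end of the address text."""
    start = pos - 1                   # convert the 1-based position to an index
    first = words[start:start + n]    # step B1: n consecutive words
    remaining = words[start + n:]     # words left after the first word string
    if len(first) < n or len(remaining) < s:
        return None                   # step B2 precondition not met
    second = remaining[:s]            # step B2: the next s consecutive words
    return first, second

words = ["w1", "w2", "w3", "w4", "w5"]
ok = first_and_second_strings(words, pos=1, n=3, s=2)       # (w1w2w3, w4w5)
too_far = first_and_second_strings(words, pos=2, n=3, s=2)  # None: only w5 remains
```

Limiting `pos` to at most N-n-s+1, as the text suggests, makes the `None` branch unreachable.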
Step B3: among the words of the first word string other than its first word, determine first target words, and in the second word string determine second target words equal in number to the first target words.
For example, among the words w2, w3 of the first word string w1w2w3 other than its first word w1, the first target word is determined to be w3, and in the second word string w4w5 the second target word is determined to be w5.
Step B4: determine the feature word string by replacing the first target words in the first word string with the second target words.
Here, after the first target words in the first word string are replaced with the second target words, the order of the words may no longer match their order in the address text, so step B4 can be performed as:
Step B4*: replace the first target words in the first word string with the second target words to obtain a third word string; then reorder the words in the third word string according to the order in which the N words are arranged in the segmented address text to obtain the feature word string.
For example, replacing the first target word w3 in the first word string w1w2w3 with the second target word w5 gives w1w5w2; reordering w1w5w2 according to the order of the words in the address text w1w2w3w4w5 then gives the feature word string w1w2w5.
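Step B4* can be sketched by tracking word positions instead of the words themselves, so that the final reordering is a plain sort. Working with indices is an assumption made here (it also copes with an address text in which the same word occurs twice); the function name and parameters are hypothetical.

```python
from typing import List

def replace_and_reorder(
    address_words: List[str],
    first_idx: List[int],
    first_target_idx: List[int],
    second_target_idx: List[int],
) -> List[str]:
    """All index arguments are 0-based positions in address_words."""
    # Replace each first target word with the corresponding second target word.
    mapping = dict(zip(first_target_idx, second_target_idx))
    third_idx = [mapping.get(i, i) for i in first_idx]  # the third word string
    # Restore the order in which the words appear in the address text.
    feature_idx = sorted(third_idx)
    return [address_words[i] for i in feature_idx]

words = ["w1", "w2", "w3", "w4", "w5"]
# First word string w1w2w3; replace w3 with w5.
a = replace_and_reorder(words, [0, 1, 2], [2], [4])
# First word string w1w2w3; replace w2 with w4, giving w1w4w3 before reordering.
b = replace_and_reorder(words, [0, 1, 2], [1], [3])
```

The second call shows why the reordering matters: an in-place replacement leaves w4 ahead of w3, and sorting restores the address order.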
In a specific implementation, besides being equal in number, the first target words and second target words determined in step B3 must also ensure that the skip number of the feature word string obtained after step B4 equals the current skip number. To satisfy this condition, one could first enumerate all possible first target words in the first word string and all possible second target words in the second word string, determine all possible combinations of first and second target words, and then select the combinations that satisfy the current skip number. However, this approach involves a large workload and consumes considerable system resources. For this reason, the embodiment of the present application proposes a preferred implementation based on steps C1 and C2, described below.
The specific implementation of step B3 may be:
Taking each word from the second word to the last word of the first word string in turn as the starting skip word (for example, taking w2 and w3 of the first word string w1w2w3 in turn as the starting skip word), perform the following steps:
Step C1: when, in the first word string, the number q of words from the starting skip word to the n-th word of the first word string is greater than or equal to s, determine the s consecutive words starting from the starting skip word as the first target words, and determine the words in the second word string as the second target words; q is an integer greater than 1 and less than n.
For example, in the first word string w1w2w3, the number of words from the starting skip word w2 to the 3rd word w3 is 2, which equals the current skip number 2. The 2 consecutive words starting from w2, namely w2 and w3, are therefore determined as the first target words, and the words w4 and w5 of the second word string w4w5 are determined as the second target words. Replacing the first target words w2 and w3 in the first word string w1w2w3 with the second target words w4 and w5 yields w1w4w5.
Step C2: when, in the first word string, the number q of words from the starting skip word to the n-th word of the first word string is less than s, determine the q words from the starting skip word to the n-th word as the first target words and, starting from the last word of the second word string and moving toward the first word of the second word string, consecutively select q words as the second target words.
For example, in the first word string w1w2w3, the number of words from the starting skip word w3 to the 3rd word w3 is 1, which is less than the current skip number 2. The starting skip word w3 is therefore determined as the first target word, and the last word w5 of the second word string w4w5 is determined as the second target word. Replacing the first target word in the first word string w1w2w3 with the second target word yields w1w2w5.
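The choice between steps C1 and C2 can be sketched in a few lines. The function name `choose_target_positions` and the use of 0-based positions in the address text are assumptions made for illustration; the patent itself describes the steps only in prose.

```python
def choose_target_positions(i, n, s, j):
    """Return (first_targets, second_targets) as 0-based address positions.

    i: position of the first word of the first word string
    n: word-taking number, s: current skip number
    j: offset of the starting skip word inside the first word string (1 <= j <= n - 1)
    """
    q = n - j  # words from the starting skip word to the n-th word
    second_string = list(range(i + n, i + n + s))
    if q >= s:
        # Step C1: s consecutive words from the starting skip word,
        # replaced by the whole second word string.
        first_targets = list(range(i + j, i + j + s))
        second_targets = second_string
    else:
        # Step C2: the q trailing words of the first word string,
        # replaced by the last q words of the second word string.
        first_targets = list(range(i + j, i + n))
        second_targets = second_string[-q:]
    return first_targets, second_targets


# Worked example from the text: first word string w1w2w3, second word string w4w5, s = 2.
c1 = choose_target_positions(i=0, n=3, s=2, j=1)  # starting skip word w2
c2 = choose_target_positions(i=0, n=3, s=2, j=2)  # starting skip word w3
```

Positions 1 and 2 correspond to w2 and w3, and positions 3 and 4 to w4 and w5, so `c1` matches the step C1 example and `c2` the step C2 example.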
To aid understanding of the embodiment of the present application, its specific implementation is illustrated below with concrete examples.
Example one
Figures 3A and 3B are schematic diagrams of the feature word string generation process in the case s < n.
In this example, n = 6 and s = 3, and the word at the current position is the i-th word counted from the first word of the address text (that is, the current position is i). The gram generation process is as follows:
1. Starting from the word at the current position, consecutively select n (n = 6) words and put them into buff; the words in buff, concatenated, form the first word string (corresponding to step B1 above);
2. Starting from the 1st word after the first word string (position i+n), consecutively select s words as the second word string (corresponding to step B2 above);
3. Depending on the position of the starting skip word in the first word string, two cases arise: the case where the number q of words from the starting skip word to the n-th word of the first word string is not less than s (case one), and the case where q is less than s (case two).
For case one: for example, when the starting skip word is the 2nd word of the first word string, i.e. the (i+1)-th word, the s (s = 3) consecutive words starting from the starting skip word (positions i+1, i+2 and i+3) are determined as the first target words, and the words at positions i+n, i+n+1 and i+n+2 (the words of the second word string) are determined as the second target words (corresponding to step C1 above). The detailed process is shown in Figure 3A.
For case two: when the starting skip word is the 5th word of the first word string (position i+4), the words from the starting skip word to the n-th (n = 6) word of the first word string (positions i+4 and i+5) are determined as the first target words. Then, starting from the last word of the second word string (position i+n+2) and moving toward the first word of the second word string, the same number of consecutive words as there are first target words are selected as the second target words, i.e. the words at positions i+n+2 and i+n+1 (corresponding to step C2 above). The detailed process is shown in Figure 3B.
4. Replace the first target words in the first word string with the second target words, and reorder the resulting word string according to the order in which the N words are arranged in the address text, to obtain the reordered feature word string (corresponding to step B4* above).
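The four steps of example one can be traced with a small sketch that works on positions rather than words. The function name `feature_positions` and the 0-based positions (with i = 0) are illustrative assumptions.

```python
def feature_positions(i, n, s, j):
    """Positions of the words in one feature word string (steps B1, B2, C1/C2, B4*)."""
    first = list(range(i, i + n))            # step B1: first word string
    second = list(range(i + n, i + n + s))   # step B2: second word string
    q = n - j                                # words from the starting skip word onward
    if q >= s:                               # case one (step C1)
        first_targets, second_targets = first[j:j + s], second
    else:                                    # case two (step C2)
        first_targets, second_targets = first[j:], second[-q:]
    kept = [p for p in first if p not in first_targets]
    return sorted(kept + second_targets)     # step B4*: restore address order


case_one = feature_positions(i=0, n=6, s=3, j=1)  # starting skip word is the 2nd word
case_two = feature_positions(i=0, n=6, s=3, j=4)  # starting skip word is the 5th word
```

In both cases exactly s = 3 positions are missing between two neighbouring entries of the result (1 to 3 in case one, 4 to 6 in case two), which is the condition that the skip number of the feature word string equals the current skip number.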
Example two
Figure 4 is a schematic diagram of another gram generation process, provided by example two, for the case s >= n.
In this example, n = 4 and s = 5, and the word at the current position is the i-th word counted from the first word of the address text (that is, the current position is i). The gram generation process is as follows:
1. Starting from the word at the current position, consecutively select n (n = 4) words and put them into buff; the words in buff, concatenated, form the first word string (corresponding to step B1 above);
2. Starting from the 1st word after the first word string (position i+n), consecutively select s words as the second word string (corresponding to step B2 above);
3. In the case s >= n, wherever the starting skip word lies in the first word string, the number q of words from the starting skip word to the n-th word of the first word string is less than s. For example, when the starting skip word is the 2nd word of the first word string (position i+1), the words at the remaining positions including that of the starting skip word (i+1, i+2 and i+3) are determined as the first target words. Then, starting from the word at the last position of the second word string (i+n+4) and moving toward the first word of the second word string, the same number of consecutive words as there are first target words are selected as the second target words, i.e. the words at positions i+n+4, i+n+3 and i+n+2 (corresponding to step C2 above).
4. Replace the first target words with the second target words to obtain the feature word string (corresponding to step B4 above).
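One way to read steps C1, C2 and B4* together is that every feature word string consists of j leading words, a gap of exactly s words, and n - j trailing words. This closed form is an observation derived from the steps above rather than something the text states; the function name `skip_gram_positions` and the 0-based positions are hypothetical.

```python
def skip_gram_positions(i, n, s, j):
    """j consecutive words, then a gap of s words, then n - j consecutive words."""
    return list(range(i, i + j)) + list(range(i + j + s, i + n + s))


# Example two: n = 4, s = 5, current position i = 0.
second_word_start = skip_gram_positions(i=0, n=4, s=5, j=1)  # starting skip word at i+1
last_word_start = skip_gram_positions(i=0, n=4, s=5, j=3)    # starting skip word at i+3
```

With i = 0 and n = 4, positions 6, 7 and 8 are i+n+2, i+n+3 and i+n+4, which are the second target words named in item 3 of example two.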
Here, the number of feature word strings extracted with Conti-k-skip-n-gram as described above is compared with the number extracted with n-gram.
Table 1 lists comparison data for the number of feature word strings (grams) extracted with 2-gram and with Conti-k-skip-2-gram (k = 1, 2, 3 and 4); Table 2 lists comparison data for the number of grams extracted with 3-gram and with Conti-k-skip-3-gram (k = 1, 2, 3 and 4).
Table 1
N 2-gram Conti-1-skip Conti-2-skip Conti-3-skip Conti-4-skip
5 4 7 9 10 10
10 9 17 24 30 35
15 14 27 39 50 60
20 19 35 51 66 80
Table 2
Tables 1 and 2 show that Conti-k-skip-n-gram extracts markedly more grams than n-gram; that is, Conti-k-skip-n-gram can produce grams that n-gram cannot (grams in which adjacent words are not adjacent in the address text).
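The gram counts can be checked with a short enumeration. The sketch below is an assumption-laden illustration: the function name `conti_k_skip_n_grams` is hypothetical, and it treats each gram as j leading words, a gap of s words, and n - j trailing words, generated only when at least s words remain after the first word string. Under that reading it reproduces the N = 5, 10 and 15 rows of Table 1.

```python
def conti_k_skip_n_grams(words, n, k):
    """Plain n-grams plus all grams with a single gap of s words, for s = 1..k."""
    total = len(words)
    grams = [tuple(words[i:i + n]) for i in range(total - n + 1)]  # plain n-grams
    for s in range(1, k + 1):
        # A second word string of s words must exist after the first word string.
        for i in range(total - n - s + 1):
            for j in range(1, n):  # starting skip word: 2nd to last word
                grams.append(tuple(words[i:i + j] + words[i + j + s:i + n + s]))
    return grams


def count_row(num_words, n, max_k):
    """Gram counts for k = 0 (plain n-gram) through max_k."""
    words = list(range(num_words))
    return [len(conti_k_skip_n_grams(words, n, k)) for k in range(max_k + 1)]


row_5 = count_row(5, 2, 4)
row_10 = count_row(10, 2, 4)
row_15 = count_row(15, 2, 4)
```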
In addition, corresponding to the traditional mixed n-gram model, the embodiment of the present application may use a Conti-k-skip mixed n-gram model (k = 2, n = 3). Take the detailed address of the Alibaba Xixi Campus as an example:
"Zhejiang Province / Hangzhou City / Yuhang District / Wuchang Subdistrict / Jingfeng Community / Wenyi West Road / 969 / No. / Alibaba Xixi Campus / Building 5" (given in source word order, with segmented words separated by slashes). The numbers of grams generated are shown in Table 3.
Table 3
Conti-2-skip-1-gram 10+0
Conti-2-skip-2-gram 9+15
Conti-2-skip-3-gram 8+26
Total 68
Table 3 shows that the Conti-2-skip mixed 3-gram model generates 68 grams, whereas the traditional n-gram mixed 3-gram model generates 27 grams, i.e. 41 fewer. The 41 additional grams are:
1. 15 2-grams:
1) Zhejiang Province / Yuhang District
2) Zhejiang Province / Wuchang Subdistrict (※)
3) Hangzhou City / Wuchang Subdistrict (※)
4) Hangzhou City / Jingfeng Community (※)
5) Yuhang District / Jingfeng Community (※)
6) Yuhang District / Wenyi West Road (※)
7) Wuchang Subdistrict / Wenyi West Road (※)
8) Wuchang Subdistrict / 969
9) Jingfeng Community / 969
10) Jingfeng Community / No.
11) Wenyi West Road / No.
12) Wenyi West Road / Alibaba Xixi Campus (※)
13) 969 / Alibaba Xixi Campus (※)
14) 969 / Building 5
15) Alibaba Xixi Campus / Building 5 (※)
2. 26 3-grams:
1) Zhejiang Province / Yuhang District / Wuchang Subdistrict (※)
2) Zhejiang Province / Hangzhou City / Wuchang Subdistrict (※)
3) Zhejiang Province / Wuchang Subdistrict / Jingfeng Community (※)
4) Zhejiang Province / Hangzhou City / Jingfeng Community (※)
5) Hangzhou City / Wuchang Subdistrict / Jingfeng Community (※)
6) Hangzhou City / Yuhang District / Jingfeng Community (※)
7) Hangzhou City / Jingfeng Community / Wenyi West Road (※)
8) Hangzhou City / Yuhang District / Wenyi West Road (※)
9) Yuhang District / Jingfeng Community / Wenyi West Road (※)
10) Yuhang District / Wuchang Subdistrict / Wenyi West Road (※)
11) Yuhang District / Wenyi West Road / 969 (※)
12) Yuhang District / Wuchang Subdistrict / 969
13) Wuchang Subdistrict / Wenyi West Road / 969 (※)
14) Wuchang Subdistrict / Jingfeng Community / 969
15) Wuchang Subdistrict / 969 / No.
16) Wuchang Subdistrict / Jingfeng Community / No.
17) Jingfeng Community / 969 / No.
18) Jingfeng Community / Wenyi West Road / No.
19) Jingfeng Community / No. / Alibaba Xixi Campus
20) Jingfeng Community / Wenyi West Road / Alibaba Xixi Campus (※)
21) Wenyi West Road / No. / Alibaba Xixi Campus (※)
22) Wenyi West Road / 969 / Alibaba Xixi Campus (※)
23) Wenyi West Road / Alibaba Xixi Campus / Building 5 (※)
24) Wenyi West Road / 969 / Building 5
25) 969 / Alibaba Xixi Campus / Building 5 (※)
26) 969 / No. / Building 5
Among the 41 grams produced in addition to those of n-gram, many are highly discriminative. On test corpora based on address applications, feature selection found 27 features (those marked ※) to be highly discriminative. The feature extraction method based on Conti-k-skip-n-gram can thus markedly improve the mining of address text.
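The totals in Table 3 can be reproduced with a short sketch. Two assumptions are made here and are not stated in the text: the address is taken to segment into the ten words listed in `ADDRESS` (inferred from the 1-gram count of 10), and the helper `skip_only_grams` is a hypothetical name for a function that enumerates only the grams containing a gap.

```python
ADDRESS = [
    "Zhejiang Province", "Hangzhou City", "Yuhang District",
    "Wuchang Subdistrict", "Jingfeng Community", "Wenyi West Road",
    "969", "No.", "Alibaba Xixi Campus", "Building 5",
]  # assumed segmentation, ten words


def skip_only_grams(words, n, k):
    """Grams of n words with a single gap of s words, for s = 1..k."""
    grams = []
    for s in range(1, k + 1):
        for i in range(len(words) - n - s + 1):
            for j in range(1, n):
                grams.append(tuple(words[i:i + j] + words[i + j + s:i + n + s]))
    return grams


extra_2 = skip_only_grams(ADDRESS, 2, 2)   # additional 2-grams
extra_3 = skip_only_grams(ADDRESS, 3, 2)   # additional 3-grams
plain = sum(len(ADDRESS) - n + 1 for n in (1, 2, 3))  # traditional mixed 3-gram model
total = plain + len(extra_2) + len(extra_3)
```

Under these assumptions the counts come out as 27 plain grams, 15 additional 2-grams and 26 additional 3-grams, 68 in total.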
In addition, the embodiment of the present application applies the grams extracted with n-gram and with Conti-k-skip-n-gram to classifying text as address text or non-address text; Table 4 below gives statistics on the classification accuracy.
Table 4
Table 4 shows that, with a small amount of experimental data, increasing the number of extracted feature word strings gives text classification based on Conti-k-skip-n-gram better accuracy on address text than on non-address text.
Based on the same inventive concept, the embodiment of the present application also provides a feature extraction device corresponding to the feature extraction method. Because the device solves the problem on a principle similar to that of the feature extraction method provided by the embodiment, repeated details are omitted.
As shown in Figure 5, the feature extraction device provided by the embodiment of the present application includes:
A determining module 51, configured to determine address text that has undergone word segmentation; the segmented address text contains N words, N being an integer greater than 1;
A word-taking module 52, configured to take words from the segmented address text according to a preset word-taking number and a preset skip number, to form feature word strings of the segmented address text; wherein the number of words taken in each feature word string equals the word-taking number, and each feature word string contains two adjacent words that are separated in the address text by a number of words equal to the skip number.
Optionally, the word-taking module 52 is specifically configured to:
Preset the word-taking number to n and the skip number to the integers from 1 to k, n being an integer greater than 1 and less than N, and k being an integer greater than 1 and less than N-1; according to the current skip number s, select n words from the segmented address text starting from the word at the current position, to obtain the feature word string; s is an integer greater than 0 and less than or equal to k.
Optionally, the word-taking module 52 is specifically configured to:
In the segmented address text, consecutively select n words starting from the word at the current position, to obtain a first word string; in the segmented address text, determine the words remaining after the n words consecutively selected from the word at the current position; when the number of remaining words is greater than or equal to s, consecutively select s words starting from the first of the remaining words, to obtain a second word string; determine first target words among the words of the first word string other than its first word, and determine in the second word string the same number of second target words; determine the feature word string by replacing the first target words in the first word string with the second target words.
Optionally, the word-taking module 52 is specifically configured to:
Taking each word from the second word to the last word of the first word string in turn as the starting skip word, perform the following operations:
When, in the first word string, the number q of words from the starting skip word to the n-th word of the first word string is greater than or equal to s, determine the s consecutive words starting from the starting skip word as the first target words, and determine the words in the second word string as the second target words; q is an integer greater than 1 and less than n;
When, in the first word string, the number q of words from the starting skip word to the n-th word of the first word string is less than s, determine the q words from the starting skip word to the n-th word as the first target words and, starting from the last word of the second word string and moving toward the first word of the second word string, consecutively select q words as the second target words.
Optionally, the word-taking module 52 is specifically configured to:
Replace the first target words in the first word string with the second target words to obtain a third word string; reorder the words in the third word string according to the order in which the N words are arranged in the segmented address text, to obtain the feature word string.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the application.
Obviously, those skilled in the art can make various changes and variations to the application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the application and their technical equivalents, the application is also intended to cover them.

Claims (12)

1. A feature extraction method, characterized by comprising:
determining address text that has undergone word segmentation, the segmented address text containing N words, N being an integer greater than 1;
taking words from the segmented address text according to a preset word-taking number and a preset skip number, to form feature word strings of the segmented address text; wherein the number of words taken in each feature word string equals the word-taking number, and each feature word string contains two adjacent words that are separated in the address text by a number of words equal to the skip number.
2. The method of claim 1, characterized in that taking words from the segmented address text according to the preset word-taking number and skip number, to form the feature word strings of the segmented address text, specifically comprises:
presetting the word-taking number to n and the skip number to the integers from 1 to k, n being an integer greater than 1 and less than N, and k being an integer greater than 1 and less than N-1;
according to the current skip number s, selecting n words from the segmented address text starting from the word at the current position, to obtain the feature word string; s is an integer greater than 0 and less than or equal to k.
3. The method of claim 2, characterized in that selecting, according to the current skip number s, n words from the segmented address text starting from the word at the current position, to obtain the feature word string, comprises:
in the segmented address text, consecutively selecting n words starting from the word at the current position, to obtain a first word string;
in the segmented address text, determining the words remaining after the n words consecutively selected from the word at the current position; when the number of remaining words is greater than or equal to s, consecutively selecting s words starting from the first of the remaining words, to obtain a second word string;
determining first target words among the words of the first word string other than its first word, and determining in the second word string the same number of second target words;
determining the feature word string by replacing the first target words in the first word string with the second target words.
4. The method of claim 3, characterized in that determining first target words among the words of the first word string other than its first word, and determining in the second word string the same number of second target words, comprises:
taking each word from the second word to the last word of the first word string in turn as the starting skip word, and performing the following operations:
when, in the first word string, the number q of words from the starting skip word to the n-th word of the first word string is greater than or equal to s, determining the s consecutive words starting from the starting skip word as the first target words, and determining the words in the second word string as the second target words; q is an integer greater than 1 and less than n;
when, in the first word string, the number q of words from the starting skip word to the n-th word of the first word string is less than s, determining the q words from the starting skip word to the n-th word as the first target words and, starting from the last word of the second word string and moving toward the first word of the second word string, consecutively selecting q words as the second target words.
5. The method of claim 4, characterized in that determining the feature word string by replacing the first target words in the first word string with the second target words comprises:
replacing the first target words in the first word string with the second target words to obtain a third word string;
reordering the words in the third word string according to the order in which the N words are arranged in the segmented address text, to obtain the feature word string.
6. The method of claim 1, characterized in that each feature word string containing two adjacent words that are separated in the address text by a number of words equal to the skip number comprises:
any two adjacent words in each feature word string being separated in the address text by a number of words equal to the skip number.
7. A feature extraction device, characterized by comprising:
a determining module, configured to determine address text that has undergone word segmentation, the segmented address text containing N words, N being an integer greater than 1;
a word-taking module, configured to take words from the segmented address text according to a preset word-taking number and a preset skip number, to form feature word strings of the segmented address text; wherein the number of words taken in each feature word string equals the word-taking number, and each feature word string contains two adjacent words that are separated in the address text by a number of words equal to the skip number.
8. The device of claim 7, characterized in that the word-taking module is specifically configured to:
preset the word-taking number to n and the skip number to the integers from 1 to k, n being an integer greater than 1 and less than N, and k being an integer greater than 1 and less than N-1; according to the current skip number s, select n words from the segmented address text starting from the word at the current position, to obtain the feature word string; s is an integer greater than 0 and less than or equal to k.
9. The device of claim 8, characterized in that the word-taking module is specifically configured to:
in the segmented address text, consecutively select n words starting from the word at the current position, to obtain a first word string; in the segmented address text, determine the words remaining after the n words consecutively selected from the word at the current position; when the number of remaining words is greater than or equal to s, consecutively select s words starting from the first of the remaining words, to obtain a second word string; determine first target words among the words of the first word string other than its first word, and determine in the second word string the same number of second target words; determine the feature word string by replacing the first target words in the first word string with the second target words.
10. The device of claim 9, characterized in that the word-taking module is specifically configured to:
take each word from the second word to the last word of the first word string in turn as the starting skip word, and perform the following operations:
when, in the first word string, the number q of words from the starting skip word to the n-th word of the first word string is greater than or equal to s, determine the s consecutive words starting from the starting skip word as the first target words, and determine the words in the second word string as the second target words; q is an integer greater than 1 and less than n;
when, in the first word string, the number q of words from the starting skip word to the n-th word of the first word string is less than s, determine the q words from the starting skip word to the n-th word as the first target words and, starting from the last word of the second word string and moving toward the first word of the second word string, consecutively select q words as the second target words.
11. The device of claim 10, characterized in that the word-taking module is specifically configured to:
replace the first target words in the first word string with the second target words to obtain a third word string; reorder the words in the third word string according to the order in which the N words are arranged in the segmented address text, to obtain the feature word string.
12. The device of claim 7, characterized in that any two adjacent words in each feature word string are separated in the address text by a number of words equal to the skip number.
CN201610202581.6A 2016-03-31 2016-03-31 Feature extraction method and device Active CN107291748B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610202581.6A CN107291748B (en) 2016-03-31 2016-03-31 Feature extraction method and device

Publications (2)

Publication Number Publication Date
CN107291748A true CN107291748A (en) 2017-10-24
CN107291748B CN107291748B (en) 2021-01-15

Family

ID=60087452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610202581.6A Active CN107291748B (en) 2016-03-31 2016-03-31 Feature extraction method and device

Country Status (1)

Country Link
CN (1) CN107291748B (en)

Also Published As

Publication number Publication date
CN107291748B (en) 2021-01-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20180418

Address after: Fourth Floor, Capital Building, P.O. Box 847, Grand Cayman, Cayman Islands

Applicant after: CAINIAO SMART LOGISTICS HOLDING Ltd.

Address before: Fourth Floor, Capital Building, P.O. Box 847, Grand Cayman, Cayman Islands

Applicant before: ALIBABA GROUP HOLDING Ltd.

GR01 Patent grant