CN106294736A - Text feature extraction method based on keyword frequency - Google Patents
Text feature extraction method based on keyword frequency
- Publication number
- CN106294736A CN106294736A CN201610649942.1A CN201610649942A CN106294736A CN 106294736 A CN106294736 A CN 106294736A CN 201610649942 A CN201610649942 A CN 201610649942A CN 106294736 A CN106294736 A CN 106294736A
- Authority
- CN
- China
- Prior art keywords
- text
- term
- term frequency
- cluster
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/83—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/83—Querying
- G06F16/835—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a text feature extraction method based on keyword frequency. The method cyclically reads the terms in a user's search text and, taking the predefined set of clusters, the texts in each cluster, and each term's frequency within its cluster as initial conditions, segments and indexes the search text. Then, within each cluster of the training set, it counts the number of texts in which a feature word's frequency exceeds a threshold, computes a feature value for each term within each cluster, and stores it in a web page feature set. The web page feature values are sorted as keywords and indexed; the full-sentence feature values of the page to be analyzed are looked up in the existing web page index to retrieve candidate pages; the similarity between each candidate page and the page to be analyzed is then computed, and the result determines whether the page is recommended to the user. The present invention proposes a text feature extraction method based on keyword frequency that mines valuable information quickly and improves the user experience of search engines.
Description
Technical field
The present invention relates to natural language processing, and in particular to a text feature extraction method based on keyword frequency.
Background technology
With the rapid development of Internet technology and related industries, data is growing at an unprecedented rate. Big data brings momentum but also challenges. Finding valuable resources in massive Internet data and recommending similar content based on a user's search is a core task of big-data text processing. For near-duplicate detection of web pages, the space and time complexity of the algorithm should be kept as low as possible to meet user demand. Existing recommendation methods based on text similarity have the following shortcomings: when the data scale is huge, generating and computing web page feature values takes a long time; in specialized domains they rely too heavily on a base corpus for computing term weights; and their discrimination of short-text similarity is low.
Summary of the invention
To solve the problems of the prior art described above, the present invention proposes a text feature extraction method based on keyword frequency, comprising:
cyclically reading the terms in a user's search text, taking the predefined set of clusters, the texts in each cluster, and each term's frequency within its cluster as initial conditions, and segmenting and indexing the search text; then, within each cluster of the training set, counting the number of texts in which a feature word's frequency exceeds a threshold; computing a feature value for each term within each cluster and storing it in the web page feature set;
sorting the web page feature values as keywords and building an index; looking up the full-sentence feature values of the page to be analyzed in the existing web page index to retrieve candidate pages; performing a similarity computation between each candidate page and the page to be analyzed, and deciding from the result whether to recommend the page to the user.
Preferably, computing the term feature values within each cluster further comprises:
Given a predefined set of clusters {c1, c2, …, cm}, each cluster cj containing texts (dj1, dj2, …, djn) and each text dj containing terms (t1, t2, …, tk); a threshold MM on the frequency with which term tk appears in cluster cj; and NM, the number of feature words to select:
(1) Segment the text collection and build an index; initialize the feature set S to empty.
(2) Cyclically read the terms in the index file.
(3) Compute DF(tk, ci), the number of texts in each cluster of the training set in which term tk occurs no fewer than MM times.
(4) Compute the characteristic frequency FF and the average frequency AN of tk relative to each cluster (the formulas are given only as figures in the original), where tfik is the frequency with which feature t appears in text dik.
(5) Compute the feature weight MI(tk, ci) of tk in each cluster:
MI(tk, ci) = FF × AN × log(Pm(tk, ci) / (P(ci) Pm(tk)))
where Pm(tk, ci) = DF(tk, ci) / DF(tk)
P(ci) = n / N
Pm(tk) = DF(tk) / N
and DF(tk) is the number of texts in the whole training set in which the frequency of feature tk is no less than the threshold MM, and N is the total number of texts in the training set.
(6) Select the feature word with the largest MI value and add it to S as the first feature word; then select the next feature word by the principle of minimum interdependence with the terms already in S.
(7) Repeat step 6 until the number of feature words reaches the threshold NM.
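The feature-weighting steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the formulas for FF and AN are given only as figures in the source, so both factors are set to 1.0 here, and the minimum-interdependence selection of step 6 is replaced by a plain greedy top-NM choice.

```python
import math
from collections import defaultdict

def select_features(clusters, MM, NM):
    """Sketch of the term-weighting step.  clusters maps a cluster id to a
    list of documents, each document a list of terms; MM is the
    term-frequency threshold, NM the number of features to keep."""
    N = sum(len(docs) for docs in clusters.values())   # total texts in training set
    # DF(t, c): number of texts in cluster c where t occurs at least MM times
    df_tc = defaultdict(lambda: defaultdict(int))
    df_t = defaultdict(int)                            # DF(t) over all texts
    for c, docs in clusters.items():
        for doc in docs:
            counts = defaultdict(int)
            for t in doc:
                counts[t] += 1
            for t, tf in counts.items():
                if tf >= MM:
                    df_tc[t][c] += 1
                    df_t[t] += 1
    scores = {}
    for t, per_cluster in df_tc.items():
        for c, df in per_cluster.items():
            n = len(clusters[c])
            p_tc = df / df_t[t]        # Pm(t, c) = DF(t, c) / DF(t)
            p_c = n / N                # P(c) = n / N
            p_t = df_t[t] / N          # Pm(t) = DF(t) / N
            # FF and AN are set to 1.0 because their formulas are omitted
            # in the source text.
            mi = 1.0 * 1.0 * math.log(p_tc / (p_c * p_t))
            scores[t] = max(scores.get(t, float("-inf")), mi)
    # greedy top-NM selection (the patent additionally applies a
    # minimum-interdependence criterion, not reproduced here)
    return sorted(scores, key=scores.get, reverse=True)[:NM]
```

A term that concentrates in one cluster (high Pm(t, c) relative to P(c)·Pm(t)) receives a large MI score and is kept.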
Compared with the prior art, the present invention has the following advantage:
The present invention proposes a text feature extraction method based on keyword frequency that checks similar data quickly and efficiently on large-scale datasets, mines valuable information rapidly, and improves the user experience of search engines.
Brief description of the drawings
Fig. 1 is a flow chart of the text feature extraction method based on keyword frequency according to an embodiment of the present invention.
Detailed description of the invention
A detailed description of one or more embodiments of the invention is provided below, together with the accompanying drawings that illustrate its principles. The invention is described in connection with such embodiments, but is not limited to any one of them; its scope is limited only by the claims, and it encompasses many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for exemplary purposes, and the invention may be practiced according to the claims without some or all of them.
One aspect of the present invention provides a text feature extraction method based on keyword frequency. Fig. 1 is a flow chart of the method according to an embodiment of the invention.
For near-duplicate detection of web pages, the invention cyclically reads the terms in a user's search text and, taking the predefined set of clusters, the texts in each cluster, and each term's frequency within its cluster as initial conditions, segments and indexes the search text. Then, within each cluster of the training set, it counts the number of texts in which a feature word's frequency exceeds a threshold, computes a feature value for each term within each cluster, and stores it in the web page feature set, completing the extraction of text features. After the feature values of a page are obtained, they are sorted as keywords and indexed; the full-sentence feature values of the page to be analyzed are looked up in the existing web page index to retrieve candidate pages; finally, the similarity between each candidate page and the page to be analyzed is computed, and the result determines whether the page is recommended to the user.
The invention first defines a feature extraction strategy based on the crawled web page data source, covering page structure, position information, extraction flow, fallback rules, and output results. It then preprocesses the pages: it determines the content of the acquired page and discards term attributes irrelevant to the extracted information. According to the extraction strategy, the required data items are obtained and saved into an XML document; feature vectors are extracted from the XML document and clustered; the clustered documents are stored by cluster in the corresponding database.
The feature extraction process further comprises:
Given a predefined set of clusters {c1, c2, …, cm}, each cluster cj containing texts (dj1, dj2, …, djn) and each text dj containing terms (t1, t2, …, tk); a threshold MM on the frequency with which term tk appears in cluster cj; and NM, the number of feature words to select:
(1) Segment the text collection and build an index; initialize the feature set S to empty.
(2) Cyclically read the terms in the index file.
(3) Compute DF(tk, ci), the number of texts in each cluster of the training set in which term tk occurs no fewer than MM times.
(4) Compute the characteristic frequency FF and the average frequency AN of tk relative to each cluster (the formulas are given only as figures in the original), where tfik is the frequency with which feature t appears in text dik.
(5) Compute the feature weight MI(tk, ci) of tk in each cluster:
MI(tk, ci) = FF × AN × log(Pm(tk, ci) / (P(ci) Pm(tk)))
where Pm(tk, ci) = DF(tk, ci) / DF(tk)
P(ci) = n / N
Pm(tk) = DF(tk) / N
and DF(tk) is the number of texts in the whole training set in which the frequency of feature tk is no less than the threshold MM, and N is the total number of texts in the training set.
(6) Select the feature word with the largest MI value and add it to S as the first feature word; then select the next feature word by the principle of minimum interdependence with the terms already in S.
(7) Repeat step 6 until the number of feature words reaches the threshold NM.
Optionally, for pages that have an abstract, feature extraction uses the following higher-accuracy method, with these steps:
(1) Filter out information at the head and tail of the page body that is irrelevant to feature extraction, obtaining the denoised page body.
(2) Obtain the Chinese word segmentation results of the abstract and of the body text, respectively.
(3) Classify the parts of speech of the segmentation results of the abstract and the body; after classification, perform predicate extraction and content-word recognition on the part-of-speech classification results of both.
(4) According to a preset merging rule set, merge the part-of-speech classification result of the page body after predicate extraction with the content-word recognition result of the page body, obtaining the merged result of the original text; merge the part-of-speech classification result of the abstract after predicate extraction with the content-word recognition result of the abstract, obtaining the merged result of the abstract.
(5) Perform unit merging on the merged result of the page body and the merged result of the abstract, obtaining the unit-merged results of the page body's information sentences and of the abstract.
(6) Cluster the unit-merged result of the page body and obtain the feature extraction result of the page body after clustering according to a characterization rule set; the characterization rule set consists of a weight allocation strategy, statement segmentation rules for the unit-merged result of the page body, atomic sentence segmentation rules, voice extraction rules, and tone recognition rules.
The clustering process further comprises:
(6.1) Perform dimensionality reduction on the input page body and obtain, for each feature word in the page, a (word, frequency) pair, denoted <word, value>.
(6.2) Sort the pairs in lexicographic order and build an index according to this ordering.
(6.3) Establish a correspondence between the index and the feature words, converting each feature word's <word, value> pair into a correspondence between its index and its frequency, denoted as the vector <index, value>.
(6.4) Define the iteration counter t and the maximum number of iterations tmax, and initialize t = 0. In round t, take n index vectors from the index vector set <index, value>, denoted N(t) = {N1(t), N2(t), …, Nn(t)}, where Ni(t) is the i-th index vector <indexi(t), valuei(t)> of round t. Compute the normalized similarity of the i-th and j-th index vectors of round t: Nsim(i, j) = Nj(t) · Ni(t).
(6.5) Denote the weights of the n index vectors of round t as WEN(t) = {WEN1(t), WEN2(t), …, WENn(t)}, where WENi(t) is the weight of the i-th index vector Ni(t); initialize WENi(t) = 1. Compute the similarity distance matrix of the i-th and j-th index vectors of round t:
S(t)(i, j) = (1 + WENi(t) / WENj(t)) / Nsim(i, j)
(6.6) Assign S(t)(i, j) to the Affinity Propagation algorithm and cluster the n index vectors N(t) of round t, obtaining the mt preliminary cluster centers of round t, denoted C(t) = {C1(t), C2(t), …, Cmt(t)}. Increase t by 1 and check whether t = tmax; if so, proceed to step 6.8; otherwise take the n index vectors N(t) = {N1(t), N2(t), …, Nn(t)} of round t from the index vector set <index, value>.
(6.7) Append the mt-1 cluster centers C(t-1) of round t-1 to the n index vectors N(t) of round t, obtaining n + mt-1 index vectors; assign the updated n + mt-1 index vectors N(t)' to the index vectors N(t) of round t, and return to step 6.5 to continue execution; the mt final cluster centers C(t) of round t are thereby obtained.
(6.8) Obtain the cluster centers of every round, completing the clustering.
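The per-round distance matrix of step 6.5 can be sketched as below. This is an illustrative assumption-laden sketch: the source does not say how Nsim is normalized, so the dot product of L2-normalized sparse vectors (cosine similarity) is used here, and the resulting matrix would then be handed to an Affinity Propagation implementation as in step 6.6.

```python
import math

def similarity_distance_matrix(vectors, weights=None):
    """Build S(i, j) = (1 + WEN_i / WEN_j) / Nsim(i, j) per the patent's
    formula.  vectors is a list of sparse index vectors, each a dict
    {index: value}; weights defaults to WEN_i = 1 as in step 6.5."""
    n = len(vectors)
    if weights is None:
        weights = [1.0] * n
    # L2-normalize each sparse vector so Nsim becomes cosine similarity
    normed = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        normed.append({k: x / norm for k, x in v.items()})
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            nsim = sum(x * normed[j].get(k, 0.0) for k, x in normed[i].items())
            if nsim > 0:
                S[i][j] = (1 + weights[i] / weights[j]) / nsim
            else:
                S[i][j] = float("inf")   # orthogonal vectors: maximal distance
    return S
```

With equal weights, identical vectors get distance 2 and dissimilar vectors get larger distances, which is the shape Affinity Propagation expects from a precomputed dissimilarity input.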
After the feature values have been obtained, the similarity computation of the present invention on the one hand obtains and computes full-sentence feature values on a per-sentence basis, then uses edit distance to compute similarity. A multidimensional feature vector is mapped to a reduced vector space, and an x-bit feature value (x > 1) is produced from the reduced vector, each dimension being 1 or -1. Each feature item is weighted in the x-dimensional vector space, each dimension's weight in the x-dimensional vector is finally mapped to 0 or 1 according to a predefined rule, and these bits are concatenated to obtain the x-bit hash value of the page vector. Similarity detection then proceeds:
Step 1: Initialize an x-dimensional vector v to 0 and an x-bit binary number fbin to 0.
Step 2: For statement si in the full-sentence set SP, use the SHA-1 hash algorithm to obtain an x-bit hash value.
Step 3: Define the function g(hj(si)) (its formula is given only as a figure in the original), where hj(si) is the binary value of the j-th bit of si's hash; define vj as the j-th dimension of v, and for j from 1 to x compute the weight
vj = vj + W(si) × g(hj(si))
where W(si) is the weight of statement si.
Step 4: If an unprocessed statement remains in the set SP, jump to step 2 and iterate; otherwise go to step 5.
Step 5: Define fbinj as the j-th bit of fbin; for j from 1 to x, if vj > 0 then fbinj = 1, and if vj ≤ 0 then fbinj = 0.
Step 6: Take the resulting binary sequence fbin as the feature value of the current full sentence. Then, for given pages X and Y, combine the feature values of their full sentences into the full-sentence feature sets SX and SY; let |SX| and |SY| denote the number of elements in each set and |SX ∩ SY| the number of approximate sentences shared by the two sets, and compute the similarity of pages X and Y:
sim(X, Y) = |SX ∩ SY| / (|SX| + |SY| - |SX ∩ SY|)
The criterion for approximate sentences is: if the similarity of the feature values of two full sentences a and b (the comparison formula is given only as a figure in the original) exceeds a predefined threshold η, the two sentences are judged to be approximate.
Step 7: If sim(X, Y) > λ (a preset similarity threshold), pages X and Y are determined to be similar; otherwise they are dissimilar.
In the search engine's page recommendation process, the invention recommends pages with different visit counts by different methods.
For pages whose visit count exceeds a predetermined threshold α, the recommendation to the user is completed as follows:
1.1 Find the similar users u' of each user u in the search user set U', where users who have browsed the same pages are taken as similar users. For each term t browsed by each similar user u', assign weights according to the term's sequence number; for each term, compute the total weight:
Wgh(ti) = θ × Fr(ti) + ζ × Se(ti)
where Fr(ti) is the number of times all users used the term to browse pages, Se(ti) is the term's browse order, and θ, ζ are tuning coefficients satisfying θ + ζ = 1.
1.2 Sort the terms by total weight in descending order and merge synonymous terms; finally, recommend to user u the pages corresponding to the predetermined number of terms with the largest weights.
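The total-weight computation of step 1.1 and the top-k selection of step 1.2 can be sketched as below. The coefficient values and the shape of the `term_stats` input are illustrative assumptions, and the synonym-merging step is omitted.

```python
def term_weights(term_stats, theta=0.6, zeta=0.4):
    """Wgh(t) = θ·Fr(t) + ζ·Se(t), with θ + ζ = 1.  term_stats maps a
    term to (Fr, Se): its browse count and its browse-order score."""
    assert abs(theta + zeta - 1.0) < 1e-9, "the patent requires θ + ζ = 1"
    return {t: theta * fr + zeta * se for t, (fr, se) in term_stats.items()}

def recommend_terms(term_stats, k, theta=0.6, zeta=0.4):
    """Return the k terms with the highest total weight; the pages
    corresponding to these terms would then be recommended to the user."""
    w = term_weights(term_stats, theta, zeta)
    return sorted(w, key=w.get, reverse=True)[:k]
```

For example, a term browsed often (high Fr) outranks a rarely used one even if its browse-order score Se is lower, as long as θ dominates.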
For pages whose visit count is less than the predetermined threshold α, search for the most-visited pages with the highest similarity to the current page, and recommend to the user the terms with larger total weight in the computed pages. The steps are as follows:
2.1 Evaluate the difference degree of a term w by the following method (the formula is given only as a figure in the original), where P is the set of all pages crawled during gathering, T is the set of all terms, and FP(p, w) is the frequency with which term w occurs in page p.
2.2 Pages containing terms of higher difference degree receive higher weights; compute the page weights accordingly (the formula is given only as a figure in the original). Then, using the aforementioned page similarity sim(X, Y), compute the candidate recommendation similarity sim(X, Y) × IM(X) × IM(Y) with the page weights, retain the results whose final similarity exceeds the predetermined threshold Φ and whose visit count exceeds the threshold α, and recommend them.
Still further optionally, for the above page weights, a term semantic-similarity quadtree may be used, then combined with the preceding similarity sim(X, Y) by weighted summation. The quadtree contains leaf nodes and non-leaf nodes: within a leaf node, all terms whose similarity exceeds the threshold Φ are arranged in descending order and saved sequentially, while the term-count information is saved in the non-leaf nodes. When computing the semantic similarity between feature word vectors, if features wik and wjl of some dimension of the vectors vi and vj satisfy condition 1 or condition 2 below, the similarity result of vi and vj is weighted.
Condition 1: If wjl belongs to the descending term queue of some leaf node in the quadtree and wik does not, then the ordinal position of wik in the term queue is determined from the similarity of wik to the other terms in the descending term queue containing wjl.
Condition 2: If neither wik nor wjl belongs to the descending term queue of any leaf node in the quadtree, and the similarity of wik and wjl both to the feature word with maximum similarity and to the feature word with minimum similarity in the descending term queue of some leaf node is less than a certain threshold Φ, then a new branch is created and wik and wjl are inserted into the feature word queue of that branch's leaf node.
After the quadtree has been built, starting from each term in vi, find the most similar term in vj and record the similarity between the two terms. Repeat this search for the other terms of vi until every term of vi has found its most similar term in vj. Sum the recorded similarities and divide by the number of terms in vi to obtain the similarity sim(vi, vj). Then compute the mean of sim(vi, vj) and sim(vj, vi) as the semantic similarity of the vectors vi and vj. Weight this semantic similarity to obtain the final weighted semantic similarity.
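The averaged best-match similarity described in the last paragraph can be sketched as follows. Here `term_sim` is a stand-in for the quadtree-backed term-to-term similarity, which is not reproduced; any function returning a similarity in [0, 1] for a pair of terms will do for illustration.

```python
def vector_similarity(vi, vj, term_sim):
    """sim(vi, vj): for each term of vi find the most similar term in vj,
    sum those best-match similarities, and divide by |vi|."""
    if not vi or not vj:
        return 0.0
    total = sum(max(term_sim(a, b) for b in vj) for a in vi)
    return total / len(vi)

def semantic_similarity(vi, vj, term_sim):
    """Mean of sim(vi, vj) and sim(vj, vi), as the patent specifies; the
    result would then be weighted into the final similarity."""
    return 0.5 * (vector_similarity(vi, vj, term_sim) +
                  vector_similarity(vj, vi, term_sim))
```

Averaging both directions makes the measure symmetric even when the two term vectors differ in length.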
In summary, the present invention proposes a text feature extraction method based on keyword frequency that checks similar data quickly and efficiently on large-scale datasets, mines valuable information rapidly, and improves the user experience of search engines.
Obviously, those skilled in the art should understand that the modules and steps of the present invention described above can be realized with a general-purpose computing system: they may be concentrated in a single computing system or distributed over a network formed by multiple computing systems; optionally, they may be realized with program code executable by a computing system, and thus may be stored in a storage system and executed by a computing system. The invention is therefore not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention are used only to exemplify or explain its principles and are not to be construed as limiting it. Therefore, any modification, equivalent substitution, improvement, and the like made without departing from the spirit and scope of the present invention should be included within its scope. In addition, the appended claims are intended to cover all changes and modifications that fall within the scope and boundary of the claims, or the equivalents of such scope and boundary.
Claims (2)
1. A text feature extraction method based on keyword frequency, characterized by comprising:
cyclically reading the terms in a user's search text, taking the predefined set of clusters, the texts in each cluster, and each term's frequency within its cluster as initial conditions, and segmenting and indexing the search text; then, within each cluster of the training set, counting the number of texts in which a feature word's frequency exceeds a threshold; computing a feature value for each term within each cluster and storing it in the web page feature set;
sorting the web page feature values as keywords and building an index; looking up the full-sentence feature values of the page to be analyzed in the existing web page index to retrieve candidate pages; performing a similarity computation between each candidate page and the page to be analyzed, and deciding from the result whether to recommend the page to the user.
2. The method according to claim 1, characterized in that computing the term feature values within each cluster further comprises:
given a predefined set of clusters {c1, c2, …, cm}, each cluster cj containing texts (dj1, dj2, …, djn) and each text dj containing terms (t1, t2, …, tk); a threshold MM on the frequency with which term tk appears in cluster cj; and NM, the number of feature words to select:
(1) segmenting the text collection and building an index, and initializing the feature set S to empty;
(2) cyclically reading the terms in the index file;
(3) computing DF(tk, ci), the number of texts in each cluster of the training set in which term tk occurs no fewer than MM times;
(4) computing the characteristic frequency FF and the average frequency AN of tk relative to each cluster (the formulas are given only as figures in the original), where tfik is the frequency with which feature t appears in text dik;
(5) computing the feature weight MI(tk, ci) of tk in each cluster:
MI(tk, ci) = FF × AN × log(Pm(tk, ci) / (P(ci) Pm(tk)))
where Pm(tk, ci) = DF(tk, ci) / DF(tk)
P(ci) = n / N
Pm(tk) = DF(tk) / N
and DF(tk) is the number of texts in the whole training set in which the frequency of feature tk is no less than the threshold MM, and N is the total number of texts in the training set;
(6) selecting the feature word with the largest MI value and adding it to the set S as the first feature word, then selecting the next feature word by the principle of minimum interdependence with the terms already in S;
(7) repeating step 6 until the number of feature words reaches the threshold NM.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610649942.1A CN106294736A (en) | 2016-08-10 | 2016-08-10 | Text feature extraction method based on keyword frequency |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610649942.1A CN106294736A (en) | 2016-08-10 | 2016-08-10 | Text feature extraction method based on keyword frequency |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106294736A true CN106294736A (en) | 2017-01-04 |
Family
ID=57667587
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610649942.1A Pending CN106294736A (en) | 2016-08-10 | 2016-08-10 | Text feature based on key word frequency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106294736A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107608965A (en) * | 2017-09-14 | 2018-01-19 | 掌阅科技股份有限公司 | Method for extracting book protagonist names, electronic device and storage medium |
CN107992477A (en) * | 2017-11-30 | 2018-05-04 | 北京神州泰岳软件股份有限公司 | Text topic determination method, apparatus and electronic device |
CN109918624A (en) * | 2019-03-18 | 2019-06-21 | 北京搜狗科技发展有限公司 | Method and device for calculating webpage text similarity |
CN110069630A (en) * | 2019-03-20 | 2019-07-30 | 重庆信科设计有限公司 | Improved mutual information feature selection method |
CN113239687A (en) * | 2021-05-08 | 2021-08-10 | 北京天空卫士网络安全技术有限公司 | Data processing method and device |
CN117648409A (en) * | 2024-01-30 | 2024-03-05 | 北京点聚信息技术有限公司 | OCR-based format file anti-counterfeiting recognition method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101122909A (en) * | 2006-08-10 | 2008-02-13 | 株式会社日立制作所 | Text information indexing device and text information indexing method |
CN101414300A (en) * | 2008-11-28 | 2009-04-22 | 电子科技大学 | Method for classifying and processing Internet public opinion information |
CN101441663A (en) * | 2008-12-02 | 2009-05-27 | 西安交通大学 | Chinese text classification characteristic dictionary generating method based on LZW compression algorithm |
CN104598532A (en) * | 2014-12-29 | 2015-05-06 | 中国联合网络通信有限公司广东省分公司 | Information processing method and device |
-
2016
- 2016-08-10 CN CN201610649942.1A patent/CN106294736A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101122909A (en) * | 2006-08-10 | 2008-02-13 | 株式会社日立制作所 | Text information indexing device and text information indexing method |
CN101414300A (en) * | 2008-11-28 | 2009-04-22 | 电子科技大学 | Method for classifying and processing Internet public opinion information |
CN101441663A (en) * | 2008-12-02 | 2009-05-27 | 西安交通大学 | Chinese text classification characteristic dictionary generating method based on LZW compression algorithm |
CN104598532A (en) * | 2014-12-29 | 2015-05-06 | 中国联合网络通信有限公司广东省分公司 | Information processing method and device |
Non-Patent Citations (5)
Title |
---|
丁益斌: "Research and Implementation of Parallelization of Near-Duplicate Web Page Deduplication Algorithms", China Masters' Theses Full-text Database, Information Science and Technology *
冷强奎 et al.: "Research on a Paper Plagiarism Detection Model Based on Sentence Similarity", Computer Engineering and Applications *
王文 (ed.): "Modern Library Construction", 31 October 2012 *
赵晓永: "Research on Key Technologies of Data Storage for Cloud Computing", 31 December 2014 *
邓彩凤: "Research on Mutual Information Feature Selection Methods for Chinese Text Classification", China Masters' Theses Full-text Database, Information Science and Technology *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107608965A (en) * | 2017-09-14 | 2018-01-19 | 掌阅科技股份有限公司 | Method for extracting book protagonist names, electronic device and storage medium |
CN107608965B (en) * | 2017-09-14 | 2018-10-19 | 掌阅科技股份有限公司 | Method for extracting book protagonist names, electronic device and storage medium |
CN107992477A (en) * | 2017-11-30 | 2018-05-04 | 北京神州泰岳软件股份有限公司 | Text topic determination method, apparatus and electronic device |
CN107992477B (en) * | 2017-11-30 | 2019-03-29 | 北京神州泰岳软件股份有限公司 | Text topic determination method and device |
CN109918624A (en) * | 2019-03-18 | 2019-06-21 | 北京搜狗科技发展有限公司 | Method and device for calculating webpage text similarity |
CN109918624B (en) * | 2019-03-18 | 2022-10-04 | 北京搜狗科技发展有限公司 | Method and device for calculating similarity of webpage texts |
CN110069630A (en) * | 2019-03-20 | 2019-07-30 | 重庆信科设计有限公司 | Improved mutual information feature selection method |
CN113239687A (en) * | 2021-05-08 | 2021-08-10 | 北京天空卫士网络安全技术有限公司 | Data processing method and device |
CN113239687B (en) * | 2021-05-08 | 2024-03-22 | 北京天空卫士网络安全技术有限公司 | Data processing method and device |
CN117648409A (en) * | 2024-01-30 | 2024-03-05 | 北京点聚信息技术有限公司 | OCR-based format file anti-counterfeiting recognition method |
CN117648409B (en) * | 2024-01-30 | 2024-04-05 | 北京点聚信息技术有限公司 | OCR-based format file anti-counterfeiting recognition method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106294733B (en) | Page detection method based on text analysis | |
CN109190117B (en) | Short text semantic similarity calculation method based on word vector | |
CN106294736A (en) | Text feature extraction method based on keyword frequency | |
US20230195773A1 (en) | Text classification method, apparatus and computer-readable storage medium | |
CN110750640B (en) | Text data classification method and device based on neural network model and storage medium | |
CN108132927B (en) | Keyword extraction method for combining graph structure and node association | |
CN106599054B (en) | Method and system for classifying and pushing questions | |
CN106156272A (en) | Information retrieval method based on multi-source semantic analysis | |
Wang et al. | Ptr: Phrase-based topical ranking for automatic keyphrase extraction in scientific publications | |
JP2002014999A (en) | Similar document retrieval device and relative keyword extract device | |
JP2012524314A (en) | Method and apparatus for data retrieval and indexing | |
CN106844632A (en) | Product review sentiment classification method and device based on improved support vector machines | |
CN108875065B (en) | Indonesia news webpage recommendation method based on content | |
US20220180317A1 (en) | Linguistic analysis of seed documents and peer groups | |
CN108090178A (en) | A kind of text data analysis method, device, server and storage medium | |
US9652997B2 (en) | Method and apparatus for building emotion basis lexeme information on an emotion lexicon comprising calculation of an emotion strength for each lexeme | |
CN112527958A (en) | User behavior tendency identification method, device, equipment and storage medium | |
CN101187919A (en) | Method and system for abstracting batch single document for document set | |
CN111241410A (en) | Industry news recommendation method and terminal | |
Bhutada et al. | Semantic latent dirichlet allocation for automatic topic extraction | |
CN110990003B (en) | API recommendation method based on word embedding technology | |
CN106294295B (en) | Article similarity recognition method based on word frequency | |
CN114138979B (en) | Cultural relic safety knowledge map creation method based on word expansion unsupervised text classification | |
CN111563361B (en) | Text label extraction method and device and storage medium | |
Amini | Interactive learning for text summarization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170104 |