CN104778158B - Text representation method and device

Text representation method and device

Info

Publication number
CN104778158B
CN104778158B (application CN201510096570.XA)
Authority
CN
China
Prior art keywords
term vector
word
text
vector
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510096570.XA
Other languages
Chinese (zh)
Other versions
CN104778158A (en)
Inventor
刘洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sina Technology China Co Ltd
Original Assignee
Sina Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sina Technology China Co Ltd
Priority to CN201510096570.XA
Publication of CN104778158A
Application granted
Publication of CN104778158B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text representation method and device for improving the accuracy of text representation and thereby the accuracy of text processing. The method includes: determining the words that constitute a current text; determining a word vector for each word; clustering the word vectors; determining, from the words and according to the clustering result, the feature words of the current text and the weights of those feature words; and determining the text vector of the current text according to the word vectors and weights of the feature words. Because the clustering step already takes into account the semantics of a word within a sentence and the correlations between sentences, the word vectors of the feature words determined in this way express the meaning of the text accurately, so the accuracy of text representation, and in turn the accuracy of text processing, is improved.

Description

Text representation method and device
Technical field
The present invention relates to information processing technology, and in particular to a text representation method and device.
Background technology
In the field of information processing technology, text processing is frequently involved. Text processing refers to performing operations such as text retrieval, text classification and text analysis on text content after the text has been represented. Text representation means converting the original text content into an internal structure that a computer program can analyze; for example, the words and phrases in the text content may be used to form a vector structure that the computer can analyze.
The higher the accuracy of the text representation, the more precisely it expresses the meaning of the current text, and the better and more efficient the text processing; conversely, the lower the accuracy of the text representation, the further the expressed meaning deviates from the actual meaning of the text, and the worse and less efficient the text processing.
In the prior art, text representation methods are mainly based on the vector space model. The vector space model represents a text as follows: the text is first segmented to obtain multiple words; then, according to the frequency with which these words occur in the text, the words whose frequency exceeds a preset value are selected as the feature words representing the text, and the weight of each feature word is calculated; finally, the feature words and their corresponding weights form the text vector, which is the representation of the text. For example, for a text whose i-th feature word is fi with weight wi, the representation is {<f1:w1>, <f2:w2>, ..., <fi:wi>, ...}, where i = 1, 2, 3, ....
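For illustration only (this sketch is not part of the original patent text), the prior-art frequency-based representation described above might be implemented as follows in Python; the frequency threshold and the normalized-frequency weighting are assumptions chosen for the example.

```python
from collections import Counter

def vsm_represent(tokens, freq_threshold=2):
    """Prior-art style representation: keep words whose frequency in the text
    exceeds a preset threshold and pair each kept word with a weight
    (here, frequency normalized by the largest frequency)."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {word: count / max_count
            for word, count in counts.items() if count > freq_threshold}

tokens = ["display", "display", "display", "panel", "panel", "panel", "liquid", "crystal"]
print(vsm_represent(tokens))  # {'display': 1.0, 'panel': 1.0}
```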
In the above prior-art text representation method, feature word selection considers neither the semantics of a word within a sentence nor the correlations between sentences; it merely extracts, mechanically, the words whose frequency in the text exceeds a preset value as feature words. Furthermore, because the feature words in the text vector are individual words of the text and an isolated word may carry multiple meanings, the meaning of the text cannot be expressed accurately. The accuracy with which the text vector expresses the text is therefore low, and the accuracy of text processing is correspondingly low.
Summary of the invention
Embodiments of the present invention provide a text representation method and device to improve the accuracy of text representation and, in turn, the accuracy of text processing.
A text representation method provided in an embodiment of the present invention includes:
determining each word constituting a current text;
determining a word vector of each word;
clustering the word vectors;
determining, from the words and according to the clustering result, feature words of the current text and the weights of the feature words; and
determining a text vector of the current text according to the word vectors and weights of the feature words.
A text representation device provided in an embodiment of the present invention includes:
a first determining module, configured to determine each word constituting a current text;
a second determining module, configured to determine a word vector of each word;
a clustering module, configured to cluster the word vectors;
a third determining module, configured to determine, from the words and according to the clustering result, feature words of the current text and the weights of the feature words; and
a fourth determining module, configured to determine a text vector of the current text according to the word vectors and weights of the feature words.
According to the text representation method and device provided in the embodiments of the present invention, the method determines the words constituting a current text, determines a word vector for each word, clusters the word vectors, determines the feature words of the current text and their weights according to the clustering result, and determines the text vector of the current text according to the word vectors and weights of the feature words. As can be seen, a word in the present invention is represented by a word vector; compared with the word itself, the word vector describes the word along multiple dimensions and therefore represents its semantic information more accurately. In addition, the clustering process already takes into account the semantics of a feature word within a sentence and the correlations between sentences. Therefore, by clustering word vectors to determine the feature words, the present invention effectively improves the accuracy of the feature words determined for the current text and, in turn, effectively improves the accuracy of text processing.
Description of the drawings
The accompanying drawings described herein are provided for further understanding of the present invention and constitute a part of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
Fig. 1 is a schematic flowchart of a text representation method provided in an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a method for presetting a word vector library provided in an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a text representation device provided in an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to specific embodiments of the present invention and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to Fig. 1, which is a schematic flowchart of a text representation method provided in an embodiment of the present invention, the method includes:
S101: determine each word constituting the current text.
In this embodiment of the present invention, the current text is the text, obtained by a server, that needs to be represented. The text may be a sentence, a paragraph or a full document in Chinese, and may be in a format such as txt, doc, pdf or wps.
In this embodiment of the present invention, the server may, without limitation, obtain the text from a preset storage area (such as a corpus), or obtain online a text currently uploaded by a user, and treat the obtained text as the current text.
After obtaining the current text, this embodiment of the present invention may segment the current text to obtain the words constituting it. The segmentation method used may include, but is not limited to, character-by-character traversal, mechanical (dictionary-based) segmentation, and so on. For example, suppose the server obtains an article and treats it as the current text; the article content is pre-processed and then segmented, and the words obtained after segmentation are: display, tablet, liquid crystal, illumination, device. These five words may then be determined as the words constituting the current text.
To reduce the computation load of the server during segmentation and to avoid interference from certain content, this embodiment of the present invention may pre-process the current text before segmentation, for example, removing Hypertext Markup Language (HTML) from the current text, converting traditional Chinese characters in the current text into simplified Chinese characters, converting full-width characters into half-width characters, and so on.
Considering that, in a practical application scenario, the words obtained after segmentation may include not only words with substantive meaning but also words without substantive meaning, while feature words are generally words with substantive meaning, this embodiment of the present invention, when determining the words constituting the current text, may segment the current text to obtain multiple words and then identify, among them, the words of a specified type; to avoid retaining identical words, de-duplication may further be performed on the words of the specified type, and the de-duplicated words are taken as the words constituting the current text. The words of the specified type may specifically be words with substantive meaning, which may include, but are not limited to, nouns, verbs and adjectives, whereas words without substantive meaning are usually auxiliary words, adverbs, function words and the like.
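As an illustrative sketch of step S101 (not part of the original patent text), the following Python code segments a text, keeps only words of the specified types and removes duplicates; the jieba segmenter and its POS tags are assumptions, since the patent does not name a specific segmentation tool.

```python
import jieba.posseg as pseg  # third-party Chinese segmenter with POS tagging (an assumed choice)

def extract_words(text, keep_pos=("n", "v", "a")):
    """S101 sketch: segment the current text, keep only words of the specified
    types (nouns, verbs, adjectives) and de-duplicate, preserving order."""
    seen, words = set(), []
    for pair in pseg.cut(text):
        if pair.flag and pair.flag[0] in keep_pos and pair.word not in seen:
            seen.add(pair.word)
            words.append(pair.word)
    return words
```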
S102: determine the word vector of each word.
In this embodiment of the present invention, in order to express the meaning (i.e., the semantic information) of a word in more detail, an N-dimensional vector containing N elements may be used to represent the word; this N-dimensional vector is the word vector of the word. Among the N elements of the word vector, each element is the weight of the word for a corresponding text category, where the text categories may include: computer, transportation, education, economy, military, sports, medicine, art, politics, environment, and so on.
For example, suppose the text categories of the word vector are represented by the 4-dimensional vector {computer, transportation, education, economy}, where N = 4. Given that display, tablet, liquid crystal, illumination and device are the words constituting the current text, the word vector of "liquid crystal" may be expressed as {0.175, 0.095, 0.185, 0.041}, meaning that the weights of "liquid crystal" for the four text categories computer, transportation, education and economy are 0.175, 0.095, 0.185 and 0.041, respectively.
In this embodiment of the present invention, the server may compute the word vectors online when determining the word vector of each word. Optionally, the server may use the word2vec tool to determine the word vector of each word.
To improve the efficiency of determining the word vectors, preferably, in this embodiment of the present invention, the word vector of each word may also be determined in advance; when the word vectors need to be determined, the word vector corresponding to each word is determined (e.g., looked up) in a preset word vector library. Determining the word vectors corresponding to the words in a preset word vector library is convenient and fast and can effectively improve the processing efficiency of the server.
In this embodiment of the present invention, the word2vec tool may likewise be used when the word vectors are determined in advance.
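An illustrative sketch of building word vectors with a word2vec implementation (the gensim library and its parameter names are assumptions; any word2vec tool would do):

```python
from gensim.models import Word2Vec  # gensim >= 4; parameter names are assumptions

# Each training "sentence" is a list of already-segmented words from the corpus.
corpus = [["display", "tablet", "liquid_crystal"], ["illumination", "device", "display"]]
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=2)

def word_vector(word, model):
    """S102 sketch: look up the word vector; return None for out-of-vocabulary
    words (e.g. words absent from the preset word vector library)."""
    return model.wv[word] if word in model.wv else None
```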
S103: cluster the word vectors.
In this embodiment of the present invention, after the word vector of each word has been determined in step S102, the word vectors may be clustered.
The basic principle of clustering is that word vectors of the same class have high similarity while word vectors of different classes differ greatly, so the word vectors can be clustered by measuring the similarity between them. Specifically, the similarity between two word vectors can be determined by computing the cosine distance (cosine) between them: the larger the cosine value, the more similar the word vectors; conversely, the smaller the cosine value, the smaller the similarity between them.
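A minimal sketch of the cosine-similarity measure described above (illustrative, not from the patent):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine between two word vectors: values closer to 1 mean more similar words."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity([0.175, 0.095, 0.185, 0.041],
                        [0.123, 0.195, 0.085, 0.441]))
```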
In this embodiment of the present invention, usable clustering algorithms include, but are not limited to: the Chinese Restaurant Process (CRP) algorithm, the K-means clustering algorithm, the K-medoids clustering algorithm, the CLARANS algorithm, the BIRCH algorithm, the CLIQUE algorithm, the DBSCAN algorithm, and so on.
In this embodiment of the present invention, clustering the word vectors yields multiple classes of word vector sets, and these word vector sets are the result of clustering the word vectors; each class of word vector set contains a number of word vectors.
Continuing the example above, suppose the word vectors corresponding to the five words display, tablet, liquid crystal, illumination and device are clustered into three word vector sets. The first set contains the word vectors of liquid crystal, display and device; the second set contains only the word vector of tablet; and the third set contains only the word vector of illumination. This indicates that the word vectors of liquid crystal, display and device are most similar to one another and most highly correlated, whereas the correlation between the word vectors of tablet and illumination is low, and the correlations of tablet and illumination with liquid crystal, display and device are also low. In other words, among the three classes, the words of the first class best reflect the features of the current text, and the second and third classes come next.
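As an illustrative sketch of step S103 (the scikit-learn K-means implementation and the fixed number of clusters are assumptions; any of the algorithms listed above could be used):

```python
import numpy as np
from sklearn.cluster import KMeans  # one usable clustering algorithm among those listed

def cluster_word_vectors(words, vectors, n_clusters=3):
    """S103 sketch: cluster the word vectors and return a mapping from cluster
    label to the words whose vectors fall in that cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(np.array(vectors))
    clusters = {}
    for word, label in zip(words, labels):
        clusters.setdefault(int(label), []).append(word)
    return clusters
```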
S104: determine, from the words and according to the clustering result, the feature words of the current text and the weights of the feature words.
In this embodiment of the present invention, determining the feature words of the current text from the words according to the clustering result may specifically comprise determining, among the classes of word vector sets, the word vector sets whose number of word vectors exceeds a preset threshold.
Continuing the example above, suppose the preset threshold is 2. According to the first, second and third word vector sets (the clustering result), the feature words of the current text are determined among display, tablet, liquid crystal, illumination and device. Specifically, among the three sets, the sets containing more than 2 word vectors are determined: the first set contains 3 word vectors, while the second and third sets each contain 1, so the set whose number of word vectors exceeds the preset threshold of 2 is the first set. The words corresponding to the word vectors in the first set are taken as the feature words; that is, liquid crystal, display and device are taken as the feature words of the current text.
In this embodiment of the present invention, determining the feature words of the current text from the words according to the clustering result may alternatively comprise sorting the classes of word vector sets in descending order of the number of word vectors they contain and selecting the first m word vector sets, where m is a preset value, and taking the words corresponding to the word vectors in the selected sets as the feature words.
Continuing the example above, suppose the preset value m = 1. The first, second and third word vector sets are sorted in descending order of the number of word vectors they contain; since the first set contains 3 word vectors and the second and third sets each contain 1, the sorted order is: first set, second set, third set. The words (liquid crystal, display, device) corresponding to the word vectors in the first (m = 1) word vector set are taken as the feature words.
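Both selection strategies of step S104 might be sketched as follows (illustrative only; the threshold and m are the preset values described above):

```python
def select_feature_words(clusters, threshold=None, top_m=None):
    """S104 sketch: take the words of every cluster whose size exceeds a preset
    threshold, or of the top_m largest clusters, as the feature words."""
    groups = sorted(clusters.values(), key=len, reverse=True)
    if threshold is not None:
        groups = [g for g in groups if len(g) > threshold]
    elif top_m is not None:
        groups = groups[:top_m]
    return [word for group in groups for word in group]

# With clusters {0: ['liquid_crystal', 'display', 'device'], 1: ['tablet'], 2: ['illumination']}:
# select_feature_words(clusters, threshold=2) -> ['liquid_crystal', 'display', 'device']
# select_feature_words(clusters, top_m=1)     -> ['liquid_crystal', 'display', 'device']
```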
In this embodiment of the present invention, the weight wi of a feature word of the current text may specifically be determined by formula (1-1):
wi = log(1 + ni/nm)    (1-1)
where wi is the weight of the i-th feature word in the current text, ni is the number of times the i-th feature word occurs in the current text (hereinafter referred to as its word frequency), and nm is the largest word frequency among the word frequencies of the feature words.
For example, if the word frequencies of the feature words liquid crystal, display and device are 10, 30 and 20 respectively, then display has the largest word frequency, i.e., nm = 30. The weight of liquid crystal is w1 = log(1 + 10/30); the weight of display is w2 = log(1 + 30/30); and the weight of device is w3 = log(1 + 20/30).
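A sketch of formula (1-1) in code (illustrative; note that the patent does not fix the logarithm base, so the natural logarithm is used here):

```python
import math
from collections import Counter

def feature_word_weights(feature_words, all_tokens):
    """Formula (1-1): w_i = log(1 + n_i / n_m), where n_i is the frequency of
    feature word i in the current text (all_tokens is the un-deduplicated token
    list) and n_m is the largest frequency among the feature words."""
    counts = Counter(all_tokens)
    n_m = max(counts[w] for w in feature_words)
    return {w: math.log(1 + counts[w] / n_m) for w in feature_words}
```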
S105: determine the text vector of the current text according to the word vectors and weights of the feature words.
Specifically, a multi-dimensional vector composed of multiple elements is determined according to the word vectors and weights of the feature words, and this multi-dimensional vector is taken as the text vector of the current text, where each element of the multi-dimensional vector is composed of the word vector of one feature word and the weight of that feature word.
For example, the text vector of the current text may be expressed as {<F1:w1>, <F2:w2>, ..., <Fi:wi>, ...}, where i = 1, 2, 3, ... and Fi is the word vector corresponding to the i-th feature word.
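An illustrative sketch of assembling the text vector of step S105 from the feature words' word vectors and weights:

```python
def build_text_vector(feature_word_vectors, weights):
    """S105 sketch: the text vector is a sequence of (word vector, weight) pairs,
    one element per feature word, mirroring {<F1:w1>, <F2:w2>, ...}."""
    return [(feature_word_vectors[word], weights[word]) for word in feature_word_vectors]
```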
In the method shown in Fig. 1 above, the method determines the words constituting the current text, determines the word vector of each word, clusters the word vectors, determines the feature words of the current text and their weights according to the clustering result, and determines the text vector of the current text according to the word vectors and weights of the feature words. As can be seen, a word in the present invention is represented by a word vector; compared with the word itself, the word vector describes the word along multiple dimensions and represents its semantic information more accurately. In addition, the clustering process already takes into account the semantics of a feature word within a sentence and the correlations between sentences. Therefore, by clustering word vectors to determine the feature words, the present invention effectively improves the accuracy of the feature words determined for the current text and, in turn, improves the accuracy of text processing.
When the word vector corresponding to each word is determined (e.g., looked up) in a preset word vector library as described above, the word vector library needs to be preset.
Referring to Fig. 2, in an embodiment of the present invention, the method for presetting the word vector library may specifically include the following steps:
S201: obtain multiple history texts.
When obtaining multiple history texts, multiple texts may be obtained from a corpus and used as the history texts; the number of texts obtained may be hundreds, thousands, etc., which is not specifically limited here.
S202: determine the words constituting each history text.
The method for determining the words constituting each history text is similar to the method described above for determining the words constituting the current text; for example, each history text may be segmented by a segmentation method to obtain the words.
Optionally, in order to reduce the computation load of the server and avoid interference from certain content, each history text may be pre-processed before segmentation. Pre-processing may include, but is not limited to: removing HTML from the history texts, converting traditional Chinese characters into simplified Chinese characters, converting full-width characters into half-width characters, and de-duplicating the history texts.
When de-duplicating the history texts, an information digest of each history text may be computed by a message digest algorithm. For example, each obtained history text may be processed by the fifth-version message digest algorithm (Message-Digest Algorithm 5, MD5); after the MD5 value of each history text is obtained, only one history text is retained for each identical MD5 value (thereby achieving de-duplication).
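An illustrative sketch of the MD5-based de-duplication of history texts (hashlib is a standard-library choice; the patent only requires a message digest algorithm):

```python
import hashlib

def deduplicate_texts(texts):
    """Pre-processing sketch: compute an MD5 digest per history text and keep
    only one text for each distinct digest."""
    unique = {}
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        unique.setdefault(digest, text)
    return list(unique.values())
```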
Considering that, in a practical application scenario, the feature words used to represent a text are usually words with substantive meaning, optionally, after each history text is segmented, the words of the specified type constituting each history text may be determined, where the words of the specified type may specifically be words with substantive meaning. In this way, the computation load of the server can be further reduced.
S203: represent each word in the history texts as a multi-dimensional vector, and take this multi-dimensional vector as the initial word vector of the word. In this embodiment of the present invention, the word2vec tool may likewise be used to determine the word vector of each word, which is not repeated here.
S204: perform digital fingerprint processing on each initial word vector to obtain fingerprinted word vectors.
Performing digital fingerprint processing on an initial word vector means digitizing it, for example converting it into a "0"/"1" string of a certain length (such as 64 bits). In this embodiment of the present invention, the word vector may be converted into a "0"/"1" string by a locality-sensitive hashing (LSH) algorithm.
For example, if the word vector of "liquid crystal" is {0.175, 0.095, 0.185, 0.041}, digital fingerprint processing of this word vector may yield the fingerprinted word vector <000000000010>;
and if the word vector of "display" is {0.123, 0.195, 0.085, 0.441}, digital fingerprint processing of it may yield the fingerprinted word vector <100101010010>.
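A sketch of one possible digital fingerprint step using random-hyperplane locality-sensitive hashing (illustrative; the patent specifies LSH but not this particular variant, and the fingerprints it would produce need not match the example values above):

```python
import numpy as np

def lsh_fingerprint(vector, n_bits=64, seed=0):
    """S204 sketch: random-hyperplane LSH. Project the word vector onto n_bits
    random hyperplanes and record the sign of each projection as a 0/1 bit."""
    rng = np.random.default_rng(seed)  # the same seed keeps the hyperplanes fixed for every word
    planes = rng.standard_normal((n_bits, len(vector)))
    bits = (planes @ np.asarray(vector, dtype=float)) >= 0
    return "".join("1" if b else "0" for b in bits)

print(lsh_fingerprint([0.175, 0.095, 0.185, 0.041], n_bits=12))
```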
S205: construct the preset word vector library from the fingerprinted word vectors.
In this embodiment of the present invention, the preset word vector library is constructed from the fingerprinted word vectors, so after the words of the current text are determined, the word vectors found for them in the preset word vector library are fingerprinted word vectors. When those word vectors are clustered, it is the fingerprinted word vectors that are clustered, and the similarity between two word vectors during clustering can be determined by computing the Hamming distance between them: the larger the Hamming distance between two word vectors, the smaller their correlation; conversely, the smaller the Hamming distance, the greater their similarity. Clustering the digitized word vectors greatly reduces the computation load of the server and effectively improves its processing efficiency.
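The Hamming-distance similarity measure used on fingerprinted word vectors might be sketched as follows (illustrative):

```python
def hamming_distance(fp_a, fp_b):
    """Number of positions at which two equal-length fingerprints differ;
    a smaller distance means more similar word vectors."""
    assert len(fp_a) == len(fp_b)
    return sum(a != b for a, b in zip(fp_a, fp_b))

print(hamming_distance("000000000010", "100101010010"))  # 4
```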
In this embodiment of the present invention, in order to further avoid word vectors corresponding to words without substantive meaning remaining among the initial word vectors, the initial word vectors may be filtered; specifically, according to attributes such as part of speech, word frequency and a stop-word list, the words without substantive meaning are removed from the initial word vectors and only the words with substantive meaning are retained, thereby effectively reducing the interference of words without substantive meaning and, in turn, the computation load of the server.
In this embodiment of the present invention, after the text vector of the current text has been determined according to the word vectors and weights of the feature words, text processing can be performed based on the text vector, for example, text retrieval, text classification, text analysis, text similarity computation, and so on.
To reduce the computation load of the server during text processing and thereby effectively improve its processing efficiency, in an embodiment of the present invention the method further includes: after the text vector of the current text has been determined according to the word vectors and weights of the feature words, performing digital fingerprint processing on the text vector of the current text.
This digital fingerprint processing is likewise a digitization process; optionally, the present invention may use simhash, one of the LSH algorithms, to perform digital fingerprint processing on the text vector.
For example, suppose the digitized word vectors of the feature words liquid crystal, display and device are <010>, <001> and <110>, and the weights of liquid crystal, display and device are 0.1, 0.2 and 0.4 respectively; the text vector is then expressed as {<liquid crystal word vector: 0.1>, <display word vector: 0.2>, <device word vector: 0.4>}.
Digitizing the text vector {<liquid crystal word vector: 0.1>, <display word vector: 0.2>, <device word vector: 0.4>} then proceeds as follows:
replace each "0" in every word vector with "-1" and keep each "1" as "1", multiply each word vector by its weight to obtain new word vectors, then accumulate the first values of all the word vectors to obtain a first value, accumulate the second values to obtain a second value, and accumulate the third values to obtain a third value;
among the first to third values, replace positive values with 1 and negative values with 0; the resulting vector composed of 0s and 1s is the digitized vector.
For example, the "0"s in <010>, <001> and <110> are replaced with "-1", the "1"s with "1", and each word vector is multiplied by its corresponding weight, yielding the following vectors:
the word vector <010> corresponds to vector 1: <-0.1, 0.1, -0.1>;
the word vector <001> corresponds to vector 2: <-0.2, -0.2, 0.2>;
the word vector <110> corresponds to vector 3: <0.4, 0.4, -0.4>;
adding the first elements of vectors 1 to 3, namely -0.1, -0.2 and 0.4, gives a first value of 0.1, which is positive;
adding the second elements of vectors 1 to 3, namely 0.1, -0.2 and 0.4, gives a second value of 0.3, which is positive;
adding the third elements of vectors 1 to 3, namely -0.1, 0.2 and -0.4, gives a third value of -0.3, which is negative;
then, among the first to third values, positive values are replaced with 1 and negative values with 0, and the resulting vector <110> composed of 0s and 1s is the vector after digitization.
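The weighted-sign digitization worked through above is essentially simhash over the fingerprinted word vectors; an illustrative sketch that reproduces the <110> result of the example:

```python
def simhash_text_vector(fingerprints, weights):
    """Simhash-style digitization: each 0 bit counts as -weight and each 1 bit
    as +weight; sum per bit position, then map positive sums to 1, others to 0."""
    sums = [0.0] * len(fingerprints[0])
    for fingerprint, weight in zip(fingerprints, weights):
        for i, bit in enumerate(fingerprint):
            sums[i] += weight if bit == "1" else -weight
    return "".join("1" if s > 0 else "0" for s in sums)

print(simhash_text_vector(["010", "001", "110"], [0.1, 0.2, 0.4]))  # '110'
```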
The above is the text representation method provided in the embodiments of the present invention. Based on the same idea, an embodiment of the present invention further provides a text representation device, as shown in Fig. 3, including:
a first determining module 31, configured to determine each word constituting a current text;
a second determining module 32, configured to determine a word vector of each word;
a clustering module 33, configured to cluster the word vectors;
a third determining module 34, configured to determine, from the words and according to the clustering result, feature words of the current text and the weights of the feature words;
a fourth determining module 35, configured to determine a text vector of the current text according to the word vectors and weights of the feature words.
Optionally, the device further includes:
a processing module 36, configured to perform digital fingerprint processing on the text vector of the current text.
Optionally, the second determining module 32 is specifically configured to determine, in a preset word vector library, the word vector corresponding to each word.
Optionally, the device further includes:
a preset word vector library module 37, configured to preset the word vector library;
the preset word vector library module 37 is specifically configured to obtain multiple history texts, determine the words constituting each history text, represent each word in the history texts as a multi-dimensional vector and take the multi-dimensional vector as the initial word vector of the word, perform digital fingerprint processing on each initial word vector to obtain fingerprinted word vectors, and construct the preset word vector library from the fingerprinted word vectors.
Optionally, the preset word vector library module 37 is specifically configured to determine the words of the specified type constituting each history text.
Optionally, the first determining module 31 is specifically configured to segment the current text to obtain multiple words, determine the words of the specified type among them, perform de-duplication on the words of the specified type, and take the de-duplicated words as the words constituting the current text.
Optionally, the clustering result includes multiple classes of word vector sets, each class containing a number of word vectors;
the third determining module 34 is specifically configured to determine, among the word vector sets, the word vector sets whose number of word vectors exceeds a preset threshold, or to sort the word vector sets in descending order of the number of word vectors they contain and select the first m word vector sets, where m is a preset value, and to take the words corresponding to the word vectors in the determined word vector sets as the feature words.
Optionally, the fourth determining module 35 is specifically configured to determine, according to the word vectors and weights of the feature words, a multi-dimensional vector composed of multiple elements and take the multi-dimensional vector as the text vector of the current text, wherein each element of the multi-dimensional vector is composed of the word vector of one feature word and the weight of that feature word.
According to the text representation method and device provided in the embodiments of the present invention, the method determines the words constituting a current text, determines a word vector for each word, clusters the word vectors, determines the feature words of the current text and their weights according to the clustering result, and determines the text vector of the current text according to the word vectors and weights of the feature words. As can be seen, a word in the present invention is represented by a word vector; compared with the word itself, the word vector describes the word along multiple dimensions and represents its semantic information more accurately. In addition, the clustering process already takes into account the semantics of a feature word within a sentence and the correlations between sentences. Therefore, by clustering word vectors to determine the feature words, the present invention effectively improves the accuracy of the feature words determined for the current text and, in turn, improves the accuracy of text processing.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system or a computer program product. Therefore, the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage and the like) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and a memory.
The memory may include computer-readable media in the form of volatile memory, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
Those skilled in the art will understand that the embodiments of the present invention may be provided as a method, a system or a computer program product. Therefore, the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage and the like) that contain computer-usable program code.
The above are only embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, various modifications and variations of the present invention are possible. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of the claims of the present invention.

Claims (10)

1. A text representation method, characterized by comprising:
determining each word constituting a current text;
determining a word vector of each word;
clustering the word vectors to obtain multiple classes of word vector sets;
determining, from the words and according to the clustering result, feature words of the current text and the weights of the feature words, wherein the weight of a feature word is the logarithm of the ratio of the sum of the frequency with which the feature word occurs in the current text and the largest frequency among the feature words in the text, to the largest frequency with which a feature word occurs in the current text; and
determining a text vector of the current text according to the word vectors and weights of the feature words.
2. The method according to claim 1, characterized in that the method further comprises:
performing digital fingerprint processing on the text vector of the current text.
3. The method according to claim 1, characterized in that determining the word vector of each word specifically comprises: determining, in a preset word vector library, the word vector corresponding to each word;
wherein the method for presetting the word vector library specifically comprises:
obtaining multiple history texts;
determining the words constituting each history text;
representing each word in the history texts as a multi-dimensional vector and taking the multi-dimensional vector as the initial word vector of the word;
performing digital fingerprint processing on each initial word vector to obtain fingerprinted word vectors; and
constructing the preset word vector library from the fingerprinted word vectors.
4. The method according to claim 1, characterized in that the clustering result includes multiple classes of word vector sets, each class containing a number of word vectors;
determining the feature words of the current text from the words according to the clustering result specifically comprises:
determining, among the word vector sets, the word vector sets whose number of word vectors exceeds a preset threshold, or sorting the word vector sets in descending order of the number of word vectors they contain and selecting the first m word vector sets, where m is a preset value; and
taking the words corresponding to the word vectors in the determined word vector sets as the feature words.
5. The method according to claim 1, characterized in that determining each word constituting the current text specifically comprises: segmenting the current text to obtain multiple words; determining, among the words, the words of a specified type; performing de-duplication on the words of the specified type; and taking the de-duplicated words as the words constituting the current text;
and/or
determining the text vector of the current text according to the word vectors and weights of the feature words specifically comprises: determining, according to the word vectors and weights of the feature words, a multi-dimensional vector composed of multiple elements, and taking the multi-dimensional vector as the text vector of the current text, wherein each element of the multi-dimensional vector is composed of the word vector of one feature word and the weight of that feature word.
6. A text representation device, characterized by comprising:
a first determining module, configured to determine each word constituting a current text;
a second determining module, configured to determine a word vector of each word;
a clustering module, configured to cluster the word vectors to obtain multiple classes of word vector sets;
a third determining module, configured to determine, from the words and according to the clustering result, feature words of the current text and the weights of the feature words, wherein the weight of a feature word is the logarithm of the ratio of the sum of the frequency with which the feature word occurs in the current text and the largest frequency among the feature words in the text, to the largest frequency with which a feature word occurs in the current text; and
a fourth determining module, configured to determine a text vector of the current text according to the word vectors and weights of the feature words.
7. The device according to claim 6, characterized in that the device further comprises:
a processing module, configured to perform digital fingerprint processing on the text vector of the current text.
8. The device according to claim 6, characterized in that the second determining module is specifically configured to determine, in a preset word vector library, the word vector corresponding to each word;
the device further comprises: a preset word vector library module, configured to preset the word vector library;
the preset word vector library module is specifically configured to obtain multiple history texts, determine the words constituting each history text, represent each word in the history texts as a multi-dimensional vector and take the multi-dimensional vector as the initial word vector of the word, perform digital fingerprint processing on each initial word vector to obtain fingerprinted word vectors, and construct the preset word vector library from the fingerprinted word vectors.
9. The device according to claim 6, characterized in that the clustering result includes multiple classes of word vector sets, each class containing a number of word vectors;
the third determining module is specifically configured to determine, among the word vector sets, the word vector sets whose number of word vectors exceeds a preset threshold, or to sort the word vector sets in descending order of the number of word vectors they contain and select the first m word vector sets, where m is a preset value, and to take the words corresponding to the word vectors in the determined word vector sets as the feature words.
10. The device according to claim 6, characterized in that the first determining module is specifically configured to segment the current text to obtain multiple words, determine, among the words, the words of a specified type, perform de-duplication on the words of the specified type, and take the de-duplicated words as the words constituting the current text; and/or
the fourth determining module is specifically configured to determine, according to the word vectors and weights of the feature words, a multi-dimensional vector composed of multiple elements and take the multi-dimensional vector as the text vector of the current text, wherein each element of the multi-dimensional vector is composed of the word vector of one feature word and the weight of that feature word.
CN201510096570.XA 2015-03-04 2015-03-04 Text representation method and device Active CN104778158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510096570.XA CN104778158B (en) 2015-03-04 2015-03-04 Text representation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510096570.XA CN104778158B (en) 2015-03-04 2015-03-04 Text representation method and device

Publications (2)

Publication Number Publication Date
CN104778158A CN104778158A (en) 2015-07-15
CN104778158B true CN104778158B (en) 2018-07-17

Family

ID=53619632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510096570.XA Active CN104778158B (en) 2015-03-04 2015-03-04 Text representation method and device

Country Status (1)

Country Link
CN (1) CN104778158B (en)


Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095444A (en) * 2015-07-24 2015-11-25 百度在线网络技术(北京)有限公司 Information acquisition method and device
CN106484682B (en) 2015-08-25 2019-06-25 阿里巴巴集团控股有限公司 Machine translation method, device and electronic equipment based on statistics
CN106484681B (en) * 2015-08-25 2019-07-09 阿里巴巴集团控股有限公司 A kind of method, apparatus and electronic equipment generating candidate translation
CN105426354B (en) * 2015-10-29 2019-03-22 杭州九言科技股份有限公司 The fusion method and device of a kind of vector
CN105426356B (en) * 2015-10-29 2019-05-21 杭州九言科技股份有限公司 A kind of target information recognition methods and device
CN106446264B (en) * 2016-10-18 2019-08-27 哈尔滨工业大学深圳研究生院 Document representation method and system
CN106503184B (en) * 2016-10-24 2019-09-20 海信集团有限公司 Determine the method and device of the affiliated class of service of target text
CN107357895B (en) * 2017-01-05 2020-05-19 大连理工大学 Text representation processing method based on bag-of-words model
CN107247704B (en) * 2017-06-09 2020-09-08 阿里巴巴集团控股有限公司 Word vector processing method and device and electronic equipment
CN109408797A (en) * 2017-08-18 2019-03-01 普天信息技术有限公司 A kind of text sentence vector expression method and system
US11823013B2 (en) * 2017-08-29 2023-11-21 International Business Machines Corporation Text data representation learning using random document embedding
CN107862620A (en) * 2017-12-11 2018-03-30 四川新网银行股份有限公司 A kind of similar users method for digging based on social data
CN108304480B (en) * 2017-12-29 2020-08-04 东软集团股份有限公司 Text similarity determination method, device and equipment
CN110362815A (en) * 2018-04-11 2019-10-22 北京京东尚科信息技术有限公司 Text vector generation method and device
CN109033307B (en) * 2018-07-17 2021-08-31 华北水利水电大学 CRP clustering-based word multi-prototype vector representation and word sense disambiguation method
CN109101620B (en) * 2018-08-08 2022-07-05 阿里巴巴(中国)有限公司 Similarity calculation method, clustering method, device, storage medium and electronic equipment
CN110874528B (en) * 2018-08-10 2020-11-10 珠海格力电器股份有限公司 Text similarity obtaining method and device
CN109710845A (en) * 2018-12-25 2019-05-03 百度在线网络技术(北京)有限公司 Information recommended method, device, computer equipment and readable storage medium storing program for executing
CN110083828A (en) * 2019-03-29 2019-08-02 珠海远光移动互联科技有限公司 A kind of Text Clustering Method and device
CN110147449A (en) * 2019-05-27 2019-08-20 中国联合网络通信集团有限公司 File classification method and device
CN110309515B (en) * 2019-07-10 2023-08-11 北京奇艺世纪科技有限公司 Entity identification method and device
CN111428180B (en) * 2020-03-20 2022-02-08 创优数字科技(广东)有限公司 Webpage duplicate removal method, device and equipment
CN111913912A (en) * 2020-07-16 2020-11-10 北京字节跳动网络技术有限公司 File processing method, file matching device, electronic equipment and medium
CN112527971A (en) * 2020-12-25 2021-03-19 华戎信息产业有限公司 Method and system for searching similar articles
CN113536763A (en) * 2021-07-20 2021-10-22 北京中科闻歌科技股份有限公司 Information processing method, device, equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
CN101853486A (en) * 2010-06-08 2010-10-06 华中科技大学 Image copying detection method based on local digital fingerprint
CN103049569A (en) * 2012-12-31 2013-04-17 武汉传神信息技术有限公司 Text similarity matching method on basis of vector space model
CN103744905A (en) * 2013-12-25 2014-04-23 新浪网技术(中国)有限公司 Junk mail judgment method and device
CN104008090A (en) * 2014-04-29 2014-08-27 河海大学 Multi-subject extraction method based on concept vector model
CN104182388A (en) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Semantic analysis based text clustering system and method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345605A (en) * 2017-01-24 2018-07-31 苏宁云商集团股份有限公司 A kind of text search method and device
CN108345605B (en) * 2017-01-24 2022-04-05 苏宁易购集团股份有限公司 Text search method and device

Also Published As

Publication number Publication date
CN104778158A (en) 2015-07-15

Similar Documents

Publication Publication Date Title
CN104778158B (en) Text representation method and device
US11599714B2 (en) Methods and systems for modeling complex taxonomies with natural language understanding
US11243993B2 (en) Document relationship analysis system
US9542477B2 (en) Method of automated discovery of topics relatedness
US11573996B2 (en) System and method for hierarchically organizing documents based on document portions
US8457950B1 (en) System and method for coreference resolution
KR20180011254A (en) Web page training methods and devices, and search intent identification methods and devices
US20140207782A1 (en) System and method for computerized semantic processing of electronic documents including themes
US20170344822A1 (en) Semantic representation of the content of an image
CN109471944A (en) Training method, device and the readable storage medium storing program for executing of textual classification model
CN111090731A (en) Electric power public opinion abstract extraction optimization method and system based on topic clustering
CN107357895B (en) Text representation processing method based on bag-of-words model
CN110674297B (en) Public opinion text classification model construction method, public opinion text classification device and public opinion text classification equipment
US11886515B2 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
Barua et al. Multi-class sports news categorization using machine learning techniques: resource creation and evaluation
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN114416926A (en) Keyword matching method and device, computing equipment and computer readable storage medium
US20220309276A1 (en) Automatically classifying heterogenous documents using machine learning techniques
CN115129890A (en) Feedback data map generation method and generation device, question answering device and refrigerator
US20180260476A1 (en) Expert stance classification using computerized text analytics
CN114461809A (en) Method and equipment for automatically generating semantic knowledge graph of Chinese abstract
CN106484724A (en) Information processor and information processing method
US20240168999A1 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
Nagrale et al. Document theme extraction using named-entity recognition
Khatai et al. An implementation of text mining decision feedback model using Hadoop MapReduce

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230315

Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193

Patentee after: Sina Technology (China) Co.,Ltd.

Address before: 100080, International Building, No. 58 West Fourth Ring Road, Haidian District, Beijing, 20 floor

Patentee before: Sina.com Technology (China) Co.,Ltd.

TR01 Transfer of patent right