CN104778158A - Method and device for representing text - Google Patents

Info

Publication number: CN104778158A
Authority: CN (China)
Application number: CN201510096570.XA
Other versions: CN104778158B (granted publication)
Other languages: Chinese (zh)
Inventor: 刘洋
Original Assignee: Sina Technology China Co Ltd
Current Assignee: Sina Technology China Co Ltd
Legal status: Granted, active (the listed status is an assumption and is not a legal conclusion)
Prior art keywords: term vector, word, vector, text, feature words

Abstract

The invention discloses a method and a device for representing a text, which improve the accuracy of text representation and thereby the accuracy of text processing. The method comprises determining all words of a current text, determining the word vectors of all the words, clustering the word vectors, determining the feature words of the current text and their weights from the words according to the clustering result, and determining the text vector of the current text according to the word vectors and weights of the feature words. Because clustering takes into account the semantics of the words within sentences and the correlation between sentences when the feature words are determined, the word vectors of the determined feature words can accurately represent the meaning of the text. The accuracy of the text representation is thus improved, and with it the accuracy of text processing.

Description

A text representation method and device
Technical field
The present invention relates to information processing technology, and in particular to a text representation method and device.
Background
Information processing frequently involves text processing. Text processing means performing operations such as text retrieval, text classification, and text analysis on the content of a text once that text has been represented. Text representation means converting the original text content into an internal structure that a computer program can analyze. For example, the words and phrases in the text content can be assembled into a vector structure that a computer can analyze.
The more accurate the text representation, the more faithfully it expresses the meaning of the current text, and the more effective and efficient the text processing. Conversely, the less accurate the text representation, the further the expressed meaning departs from the actual meaning of the text, and the less effective and efficient the text processing.
In the prior art, text representation methods are mainly based on the vector space model. The vector space model represents a given text as follows. The text is first segmented into multiple words. Then, according to how often these words occur in the text, the words whose frequency exceeds a preset value are selected as the feature words that express the text, and a weight is calculated for each feature word. Finally, the feature words and their weights form a text vector, which is the representation of the text. For example, if the i-th feature word of a text is fi and its weight is wi, the text is represented as <f1:w1>, <f2:w2>, ..., <fi:wi>, ..., where i = 1, 2, 3, ...
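As a rough illustration of the frequency-based selection described above, the following Python sketch assumes the text has already been segmented into a list of tokens. The function name, the sample tokens, and the threshold are illustrative assumptions, and the raw count stands in for the weight because the source does not say how the prior art computes weights.

```python
from collections import Counter

def vsm_representation(tokens, min_freq):
    """Keep words whose frequency exceeds min_freq; the raw count is a stand-in weight."""
    counts = Counter(tokens)
    return {word: count for word, count in counts.items() if count > min_freq}

# Hypothetical segmented text.
tokens = ["liquid", "crystal", "display", "display", "device", "display", "device"]
representation = vsm_representation(tokens, min_freq=1)
print(representation)
```

Note that the selection looks only at counts, which is exactly the limitation the invention sets out to address.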
In the prior-art text representation method described above, feature words are selected without considering their semantics within a sentence or the correlation between sentences. Words whose frequency exceeds the preset value are simply extracted mechanically from the text. In addition, the feature words in the text vector are the raw words of the text, and because an isolated word may carry several meanings, it cannot accurately express the meaning of the text. The text vector therefore represents the text with low accuracy, and the accuracy of text processing is correspondingly low.
Summary of the invention
Embodiments of the present invention provide a text representation method and device for improving the accuracy of text representation and thereby the accuracy of text processing.
A text representation method provided by an embodiment of the present invention comprises:
determining the words that make up the current text;
determining the word vector of each word;
clustering the word vectors;
determining, according to the clustering result, the feature words of the current text among the words, and the weight of each feature word;
determining the text vector of the current text according to the word vector and weight of each feature word.
A text representation device provided by an embodiment of the present invention comprises:
a first determination module, configured to determine the words that make up the current text;
a second determination module, configured to determine the word vector of each word;
a clustering module, configured to cluster the word vectors;
a third determination module, configured to determine, according to the clustering result, the feature words of the current text among the words, and the weight of each feature word;
a fourth determination module, configured to determine the text vector of the current text according to the word vector and weight of each feature word.
With the text representation method and device provided by embodiments of the present invention, the words that make up the current text are determined, the word vector of each word is determined, the word vectors are clustered, the feature words of the current text and their weights are determined according to the clustering result, and the text vector of the current text is determined according to the word vectors and weights of the feature words. Because words are represented by word vectors, and a word vector describes a word along multiple dimensions where the bare word cannot, the semantic information of each word is represented more accurately. In addition, the clustering process takes into account the semantics of the feature words within sentences and the correlation between sentences. Determining the feature words by clustering word vectors therefore improves the accuracy with which the feature words of the current text are determined, and in turn the accuracy of text processing.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form a part of it. The illustrative embodiments of the invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flowchart of a text representation method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a method for presetting a word vector library provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a text representation device provided by an embodiment of the present invention.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1, which is a schematic flowchart of a text representation method provided by an embodiment of the present invention, the method comprises:
S101: Determine the words that make up the current text.
In this embodiment, the current text is the text that the server has obtained and that needs to be represented. The text may be a Chinese sentence, paragraph, or whole document, and may be in a format such as txt, doc, pdf, or wps.
The server may, without being limited to these sources, obtain the text from a preset storage area (such as a corpus) or obtain online a text that a user is currently uploading, and takes the obtained text as the current text.
After obtaining the current text, the embodiment may segment it to obtain the words that make up the current text. The segmentation methods that may be used include, but are not limited to, word-by-word traversal and mechanical word segmentation. For example, suppose the server obtains an article and takes it as the current text. The article content is preprocessed and then segmented, and segmentation yields five words: display, flat panel, liquid crystal, illumination, and device. These five words can be determined as the words that make up the current text.
To reduce the server's computation during segmentation and to avoid interference from certain words, the embodiment may preprocess the current text before segmentation, for example by removing Hypertext Markup Language (HTML) markup from the current text, converting traditional Chinese characters to simplified Chinese characters, and converting full-width characters to half-width characters.
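Two of the preprocessing steps named above can be sketched with the Python standard library. This is a minimal sketch under stated assumptions: the regular expression is a crude tag stripper rather than a full HTML parser, and Unicode NFKC normalization is used as a stand-in for full-width to half-width conversion (it also folds other compatibility characters). Traditional-to-simplified conversion needs a mapping table that the standard library does not provide, so it is omitted here.

```python
import re
import unicodedata

TAG_PATTERN = re.compile(r"<[^>]+>")

def preprocess(raw_text):
    """Strip HTML-like tags, then fold full-width characters to half-width."""
    without_tags = TAG_PATTERN.sub("", raw_text)
    # NFKC maps full-width letters, digits, and the ideographic space
    # to their ordinary half-width counterparts.
    return unicodedata.normalize("NFKC", without_tags)

# Full-width "LCD", ideographic spaces, and full-width "123", written as escapes.
sample = "<p>\uff2c\uff23\uff24\u3000display\u3000\uff11\uff12\uff13</p>"
cleaned = preprocess(sample)
print(cleaned)
```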
In practical application scenarios, the words obtained by segmentation may include not only words with practical meaning but also words without it, whereas feature words are generally words with practical meaning. Therefore, when determining the words that make up the current text, the embodiment may segment the current text to obtain multiple words and then determine the words of a specified type among them. To avoid selecting the same word more than once, the words of the specified type may further be deduplicated, and the deduplicated words are taken as the words that make up the current text. Here, the words of the specified type may be words with practical meaning, which include but are not limited to nouns, verbs, and adjectives. Words without practical meaning are generally particles, adverbs, function words, and the like.
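The filtering and deduplication step can be sketched as follows, assuming the segmenter has already attached a part-of-speech label to each word. The tag names and the sample input are hypothetical and do not correspond to any particular tagger.

```python
# Hypothetical part-of-speech labels treated as "words with practical meaning".
CONTENT_POS = {"noun", "verb", "adjective"}

def select_content_words(tagged_tokens):
    """Keep words of an allowed type and drop repeats, preserving first-seen order."""
    seen = set()
    selected = []
    for word, pos in tagged_tokens:
        if pos in CONTENT_POS and word not in seen:
            seen.add(word)
            selected.append(word)
    return selected

tagged = [
    ("liquid crystal", "noun"),
    ("display", "noun"),
    ("the", "function"),
    ("display", "noun"),
    ("very", "adverb"),
    ("device", "noun"),
]
words = select_content_words(tagged)
print(words)
```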
S102: Determine the word vector of each word.
In this embodiment, to express the meaning (that is, the semantic information) of a word in more detail, a word may be represented by an N-dimensional vector with N elements. This N-dimensional vector is the word vector of the word. Each of the N elements is the weight of the word for one text category, where the text categories may include computer, traffic, education, economy, military, sports, medicine, art, politics, environment, and so on.
For example, suppose the text categories of the word vector are expressed as the N-dimensional vector {computer, traffic, education, economy}, where N = 4. Suppose display, flat panel, liquid crystal, illumination, and device are the words that make up the current text. The word vector of the word "liquid crystal" can then be expressed as {0.175, 0.095, 0.185, 0.041}, which means that the weights of "liquid crystal" for the four text categories computer, traffic, education, and economy are 0.175, 0.095, 0.185, and 0.041 respectively.
In this embodiment, the server may determine the word vector of each word directly online using a word vector tool. Optionally, the server may use the word2vec tool to determine the word vector of each word.
To determine the word vectors more efficiently, the word vector of each word may preferably be determined in advance. When the word vector of a word is needed, the vector corresponding to that word is determined (for example, looked up) in a preset word vector library. Looking up word vectors in a preset library is convenient and fast and improves the processing efficiency of the server.
When the word vectors are determined in advance, the word2vec tool may likewise be used to determine the word vector of each word.
S103: Cluster the word vectors.
In this embodiment, after the word vector of each word has been determined in step S102, the word vectors may be clustered.
The basic principle of clustering is that word vectors in the same class are highly similar, while word vectors in different classes differ greatly. The word vectors can therefore be clustered by measuring the similarity between them. Specifically, the similarity between two word vectors is determined by calculating the cosine of the angle between them: the larger the cosine value, the greater the similarity, and the smaller the cosine value, the smaller the similarity.
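The cosine measure can be sketched in a few lines of Python. The two sample vectors are the ones this document gives for "liquid crystal" and "display"; returning 0.0 for a zero-length vector is an assumption made here to avoid division by zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)

liquid_crystal = [0.175, 0.095, 0.185, 0.041]
display = [0.123, 0.195, 0.085, 0.441]
similarity = cosine_similarity(liquid_crystal, display)
print(round(similarity, 3))
```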
In this embodiment, the clustering algorithms that may be used include, but are not limited to, the Chinese Restaurant Process (CRP) algorithm, the K-means algorithm, the K-medoids algorithm, the CLARANS algorithm, the BIRCH algorithm, the CLIQUE algorithm, and the DBSCAN algorithm.
Clustering the word vectors yields multiple classes of word vector sets, which together are the clustering result. Each class of word vector set contains one or more word vectors.
Continuing the example, suppose the word vectors corresponding to the five words display, flat panel, liquid crystal, illumination, and device are clustered, yielding three classes of word vector sets. The first set contains the word vectors of liquid crystal, display, and device; the second set contains only the word vector of flat panel; and the third set contains only the word vector of illumination. This indicates that the word vectors of liquid crystal, display, and device are the most similar and most highly correlated with one another, while the word vectors of flat panel and illumination have low correlation with each other and with those of liquid crystal, display, and device. In other words, among the three classes, the words in the first class best reflect the characteristics of the current text, followed by those in the second and third classes.
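To make the grouping in this example concrete, the sketch below uses a simple greedy "leader" clustering: each vector joins the first cluster whose seed vector is similar enough, and otherwise starts a new cluster. This is a deliberately simple stand-in, not one of the algorithms listed above, and the three-dimensional vectors and the 0.9 threshold are invented so that the result matches the three classes described.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def greedy_cluster(vectors, threshold):
    """Assign each word to the first cluster whose seed is similar enough, else open a new one."""
    clusters = []  # list of (seed_vector, member_words)
    for word, vec in vectors.items():
        for seed, members in clusters:
            if cosine_similarity(seed, vec) >= threshold:
                members.append(word)
                break
        else:
            clusters.append((vec, [word]))
    return [members for _, members in clusters]

# Hypothetical word vectors chosen for illustration only.
vectors = {
    "liquid crystal": [0.9, 0.1, 0.0],
    "display": [0.8, 0.2, 0.0],
    "device": [0.85, 0.15, 0.05],
    "flat panel": [0.1, 0.9, 0.0],
    "illumination": [0.0, 0.1, 0.9],
}
clusters = greedy_cluster(vectors, threshold=0.9)
print(clusters)
```

The result of a greedy scheme like this depends on input order and on the threshold, which is one reason a production system would more likely use one of the named algorithms.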
S104: According to the clustering result, determine the feature words of the current text among the words, and the weight of each feature word.
In this embodiment, one way to determine the feature words from the clustering result is to find, among the classes of word vector sets, those sets whose number of word vectors exceeds a preset threshold. The words corresponding to the word vectors in those sets are taken as feature words.
Continuing the example, suppose the preset threshold is 2 and the clustering result consists of the first, second, and third word vector sets. The first set contains 3 word vectors, while the second and third sets contain 1 each, so the only set whose number of word vectors exceeds the threshold of 2 is the first set. The words corresponding to the word vectors in the first set are taken as feature words; that is, liquid crystal, display, and device are the feature words of the current text.
Another way to determine the feature words from the clustering result is to sort the classes of word vector sets in descending order of the number of word vectors they contain and take the first m sets, where m is a preset value. The words corresponding to the word vectors in the selected sets are taken as feature words.
Continuing the example, suppose the preset value is m = 1. The first set contains 3 word vectors and the second and third sets contain 1 each, so sorting in descending order of size gives the order first, second, third. The words corresponding to the word vectors in the first (m = 1) set, namely liquid crystal, display, and device, are taken as feature words.
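Both selection rules reduce to a few lines once the clusters are available as lists of words. The sketch below is illustrative only; the function names are assumptions, and the cluster contents are taken from the running example.

```python
def select_by_threshold(clusters, threshold):
    """Feature words are the members of every cluster holding more than `threshold` vectors."""
    return [word for cluster in clusters if len(cluster) > threshold for word in cluster]

def select_top_m(clusters, m):
    """Feature words are the members of the m largest clusters."""
    ranked = sorted(clusters, key=len, reverse=True)  # stable sort: ties keep input order
    return [word for cluster in ranked[:m] for word in cluster]

clusters = [["liquid crystal", "display", "device"], ["flat panel"], ["illumination"]]
by_threshold = select_by_threshold(clusters, threshold=2)
by_rank = select_top_m(clusters, m=1)
print(by_threshold, by_rank)
```

With a threshold of 2 and m = 1, both rules pick the same three words, as in the worked examples.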
In this embodiment, the weight w_i of each feature word of the current text may be determined by formula (1-1):
$w_i = \log\left(1 + n_i / n_m\right)$ (1-1)
where w_i is the weight of the i-th feature word in the current text, n_i is the number of times the i-th feature word occurs in the current text (hereinafter, its word frequency), and n_m is the largest word frequency among the feature words.
For example, suppose the word frequencies of the feature words liquid crystal, display, and device are 10, 30, and 20 respectively. The word frequency of "display" is the largest, so n_m = 30. The weight of "liquid crystal" is then w_1 = log(1 + 10/30), the weight of "display" is w_2 = log(1 + 30/30), and the weight of "device" is w_3 = log(1 + 20/30).
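Formula (1-1) can be evaluated directly. The source does not state the base of the logarithm, so the natural logarithm is assumed in this sketch; a different base would only rescale all weights by the same factor.

```python
import math

def feature_weights(frequencies):
    """Apply w_i = log(1 + n_i / n_m), where n_m is the largest frequency."""
    max_freq = max(frequencies.values())
    # Assumption: natural logarithm, since the source leaves the base unspecified.
    return {word: math.log(1 + freq / max_freq) for word, freq in frequencies.items()}

weights = feature_weights({"liquid crystal": 10, "display": 30, "device": 20})
for word, weight in weights.items():
    print(word, round(weight, 4))
```

The most frequent feature word always receives log(2), and every other feature word receives a smaller positive weight.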
S105: Determine the text vector of the current text according to the word vector and weight of each feature word.
Specifically, a multi-dimensional vector made up of multiple elements is determined from the word vectors and weights of the feature words, and this multi-dimensional vector is taken as the text vector of the current text. Each element of the multi-dimensional vector consists of the word vector of one feature word and the weight of that feature word.
For example, the text vector of the current text can be expressed as {<F1:w1>, <F2:w2>, ..., <Fi:wi>, ...}, where i = 1, 2, 3, ... and Fi is the word vector corresponding to the i-th feature word.
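In code, the text vector is simply a sequence of (word vector, weight) pairs. The sketch below assumes both inputs are dictionaries keyed by feature word; the sample values reuse the digitized vectors and weights from the fingerprinting example later in this document.

```python
def build_text_vector(feature_vectors, feature_weights):
    """Pair each feature word's vector with its weight, in the order the weights are listed."""
    return [(feature_vectors[word], weight) for word, weight in feature_weights.items()]

feature_vectors = {"liquid crystal": "010", "display": "001", "device": "110"}
feature_weights = {"liquid crystal": 0.1, "display": 0.2, "device": 0.4}
text_vector = build_text_vector(feature_vectors, feature_weights)
print(text_vector)
```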
In the method shown in Fig. 1, the words that make up the current text are determined, the word vector of each word is determined, the word vectors are clustered, the feature words of the current text and their weights are determined according to the clustering result, and the text vector of the current text is determined according to the word vectors and weights of the feature words. Because words are represented by word vectors, and a word vector describes a word along multiple dimensions where the bare word cannot, the semantic information of each word is represented more accurately. In addition, the clustering process takes into account the semantics of the feature words within sentences and the correlation between sentences. Determining the feature words by clustering word vectors therefore improves the accuracy with which the feature words of the current text are determined, and in turn the accuracy of text processing.
Determining (for example, looking up) the word vector corresponding to each word in a preset word vector library requires that the library be set up in advance.
Referring to Fig. 2, in this embodiment the method for presetting the word vector library may comprise the following steps:
S201: Obtain multiple historical texts.
Multiple texts may be obtained from a corpus and used as the historical texts. The number of texts obtained may be several hundred, several thousand, and so on, and is not specifically limited here.
S202: Determine the words that make up each historical text.
The words that make up each historical text are determined in a way similar to the words that make up the current text. For example, each historical text may be segmented with a segmentation method to obtain its words.
Optionally, to reduce the server's computation and to avoid interference from certain words, each historical text may be preprocessed before it is segmented. Preprocessing may include, but is not limited to, removing HTML from the historical text, converting traditional Chinese characters to simplified Chinese characters, converting full-width characters to half-width characters, and deduplicating the historical texts.
To deduplicate the historical texts, a message digest of each historical text may be calculated with a message digest algorithm. For example, the Message-Digest Algorithm 5 (MD5) may be applied to each obtained historical text. Once the MD5 value of each historical text has been obtained, only one copy is kept of the historical texts that share the same MD5 value, which achieves deduplication.
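Digest-based deduplication can be sketched with Python's standard `hashlib` module. The UTF-8 encoding and the choice to keep the first occurrence are assumptions; the source only says that one copy is retained.

```python
import hashlib

def deduplicate_texts(texts):
    """Keep the first text for each distinct MD5 digest."""
    seen_digests = set()
    unique = []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen_digests:
            seen_digests.add(digest)
            unique.append(text)
    return unique

history = ["liquid crystal display", "lighting device", "liquid crystal display"]
unique_history = deduplicate_texts(history)
print(unique_history)
```

This removes only byte-identical texts; near-duplicates with small differences produce different digests and are kept.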
In practical application scenarios, the feature words used to represent a text are generally words with practical meaning. Optionally, therefore, after each historical text has been segmented, the words of the specified type that make up each historical text may be determined, where the words of the specified type may be words with practical meaning. This further reduces the server's computation.
S203: Represent each word in the historical texts as a multi-dimensional vector, and take this multi-dimensional vector as the initial word vector of the word. In this embodiment, the word2vec tool may likewise be used to determine the word vector of each word, which is not described again here.
S204: Apply digital fingerprint processing to each initial word vector to obtain the fingerprinted word vector.
Applying digital fingerprint processing to an initial word vector means digitizing it, for example converting it into a string of "0" and "1" digits of a certain length (such as 64 bits). This embodiment converts word vectors into strings of "0" and "1" digits with a locality-sensitive hashing (LSH) algorithm.
For example, if the word vector of the word "liquid crystal" is {0.175, 0.095, 0.185, 0.041}, the word vector obtained after digital fingerprint processing may be <000000000010>.
If the word vector of the word "display" is {0.123, 0.195, 0.085, 0.441}, the word vector obtained after digital fingerprint processing may be <100101010010>.
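The source names LSH but does not say which hash family is used for word vectors. One common choice, assumed here purely for illustration, is random-hyperplane hashing: each output bit records on which side of a random hyperplane the vector falls. The seed, the 12-bit length, and the function names are assumptions, and the bit strings this sketch produces will not match the sample fingerprints given above.

```python
import random

def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplane normals of dimension `dim` (fixed seed for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def fingerprint(vector, hyperplanes):
    """One bit per hyperplane: '1' if the vector lies on its positive side, else '0'."""
    return "".join(
        "1" if sum(h * x for h, x in zip(plane, vector)) > 0 else "0"
        for plane in hyperplanes
    )

planes = make_hyperplanes(dim=4, bits=12)
fp_liquid_crystal = fingerprint([0.175, 0.095, 0.185, 0.041], planes)
fp_display = fingerprint([0.123, 0.195, 0.085, 0.441], planes)
print(fp_liquid_crystal, fp_display)
```

Vectors separated by a small angle tend to agree on most bits, which is what makes the Hamming distance between fingerprints a usable proxy for similarity.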
S205: Build the preset word vector library from the fingerprinted word vectors.
In this embodiment, the preset word vector library is built from the fingerprinted word vectors, so after the words of the current text have been determined, the word vectors looked up for them in the preset library are fingerprinted word vectors. When the word vectors are clustered, it is therefore the fingerprinted word vectors that are clustered, and the similarity between word vectors during clustering can be determined by calculating the Hamming distance between them. The larger the Hamming distance between two word vectors, the lower their similarity; the smaller the Hamming distance, the greater their similarity. Clustering the digitized word vectors greatly reduces the server's computation and improves its processing efficiency.
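For fingerprints stored as bit strings, the Hamming distance is a count of differing positions. The sketch below applies it to the two sample fingerprints given earlier for "liquid crystal" and "display"; raising an error on unequal lengths is a design choice made here.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))

distance = hamming_distance("000000000010", "100101010010")
print(distance)  # 4
```

If the fingerprints were stored as integers instead, the same count could be obtained by XOR-ing them and counting the set bits.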
In this embodiment, to further prevent the initial word vectors from including vectors for words without practical meaning, the initial word vectors may be screened. Specifically, according to attributes such as part of speech, word frequency, and a stop-word list, the words without practical meaning are removed from the initial word vectors and only the words with practical meaning are kept. This reduces interference from words without practical meaning and in turn reduces the server's computation.
In this embodiment, once the text vector of the current text has been determined from the word vectors and weights of the feature words, text processing can be carried out on the basis of the text vector, for example text retrieval, text classification, text analysis, and text similarity calculation.
To reduce the server's computation during text processing and thereby improve its processing efficiency, the method in this embodiment further comprises: after the text vector of the current text has been determined from the word vectors and weights of the feature words, applying digital fingerprint processing to the text vector of the current text.
Digital fingerprint processing here again means digitization. Optionally, the present invention may apply digital fingerprint processing to the text vector with simhash, which is one of the LSH algorithms.
For example, suppose the digitized word vectors of the feature words liquid crystal, display, and device are <010>, <001>, and <110> respectively, and their weights are 0.1, 0.2, and 0.4 respectively. The text vector is then expressed as {<liquid crystal word vector:0.1>, <display word vector:0.2>, <device word vector:0.4>}.
Digitizing the text vector {<liquid crystal word vector:0.1>, <display word vector:0.2>, <device word vector:0.4>} proceeds as follows.
In each word vector, every "0" is replaced with -1 and every "1" is kept as 1, and the word vector is then multiplied by its weight to obtain a new vector. The first elements of the new vectors are summed to obtain the first value, the second elements are summed to obtain the second value, and the third elements are summed to obtain the third value.
Among the first to third values, each positive value is replaced with 1 and each negative value is replaced with 0. The resulting vector of 0s and 1s is the digitized vector.
In this example, replacing "0" with -1 and keeping "1" as 1 in <010>, <001>, and <110>, and multiplying by the weight of each word vector, gives the following vectors:
word vector <010> corresponds to vector 1: <-0.1, 0.1, -0.1>;
word vector <001> corresponds to vector 2: <-0.2, -0.2, 0.2>;
word vector <110> corresponds to vector 3: <0.4, 0.4, -0.4>.
Adding the first elements of vectors 1 to 3, namely -0.1, -0.2, and 0.4, gives a first value of 0.1, which is positive.
Adding the second elements of vectors 1 to 3, namely 0.1, -0.2, and 0.4, gives a second value of 0.3, which is positive.
Adding the third elements of vectors 1 to 3, namely -0.1, 0.2, and -0.4, gives a third value of -0.3, which is negative.
Replacing each positive value with 1 and each negative value with 0 among the first to third values gives the vector <110>, which is the digitized vector.
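The worked example above can be reproduced with a short Python sketch. The function name is an assumption, and mapping a sum of exactly zero to "0" is also an assumption, since the source only covers positive and negative sums.

```python
def weighted_fingerprint(bit_vectors, weights):
    """Combine weighted bit strings into one fingerprint, simhash style."""
    length = len(bit_vectors[0])
    totals = [0.0] * length
    for bits, weight in zip(bit_vectors, weights):
        for i, bit in enumerate(bits):
            # A "1" contributes +weight and a "0" contributes -weight.
            totals[i] += weight if bit == "1" else -weight
    # Positive sums become "1"; negative sums (and, by assumption, zero) become "0".
    return "".join("1" if total > 0 else "0" for total in totals)

result = weighted_fingerprint(["010", "001", "110"], [0.1, 0.2, 0.4])
print(result)  # 110
```

Because "device" carries the largest weight, its bits dominate the sums, and the text fingerprint ends up equal to its digitized word vector.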
Be the document representation method that the embodiment of the present invention provides above, based on same thinking, the embodiment of the present invention additionally provides a kind of text representation device, as shown in Figure 3, comprising:
First determination module 31, for determining each word forming current text;
Second determination module 32, for determining the term vector of each word;
Cluster module 33, for carrying out cluster to each term vector;
3rd determination module 34, for according to cluster result, determines the Feature Words of current text and the weight of this Feature Words in each word;
4th determination module 35, for according to the term vector of each Feature Words and the text vector of weight determination current text.
Optionally, described device also comprises:
Processing module 36, for carrying out digital finger-print process to the text vector of described current text.
Optionally, the second determination module 32 specifically for,
In the term vector storehouse of presetting, determine the term vector corresponding with each word.
Optionally, described device also comprises:
Preset term vector library module 37, for default term vector storehouse;
Described default term vector library module 37 specifically for, obtain multiple history text, determine the multiple words forming each history text, each word lists in described history text is shown as a multi-C vector, using the initial word vector of this multi-C vector as described word, each initial word vector is carried out digital finger-print process respectively, obtains the term vector after digital finger-print process, adopt the term vector after described digital finger-print process to form and preset term vector storehouse.
Optionally, the word vector library presetting module 37 is specifically configured to determine the words of specified types that make up each history text.
Optionally, the first determination module 31 is specifically configured to: segment the current text to obtain multiple words; determine the words of specified types among them; deduplicate the words of the specified types; and take the deduplicated words as the words that make up the current text.
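The filtering and deduplication performed by the first determination module can be sketched as follows. Segmentation itself is left out: the input is assumed to be the output of some segmenter that also labels each token with a type. Treating the "specified types" as part-of-speech labels such as noun and verb is an assumption, and the function name `select_words` is hypothetical.

```python
def select_words(tagged_tokens, keep_types):
    """Keep tokens whose type is in keep_types and drop repeats, preserving first-seen order."""
    seen = set()
    words = []
    for word, word_type in tagged_tokens:
        if word_type in keep_types and word not in seen:
            seen.add(word)
            words.append(word)
    return words


# Hypothetical output of a segmenter that tags each token with a type.
tagged = [
    ("market", "noun"),
    ("rises", "verb"),
    ("the", "other"),
    ("market", "noun"),
    ("quickly", "adverb"),
]
words = select_words(tagged, keep_types={"noun", "verb"})
print(words)
```

Here the second occurrence of "market" is dropped by the deduplication, and "the" and "quickly" are dropped by the type filter.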
Optionally, the clustering result comprises multiple classes of word vector sets, each of which contains several word vectors;
The third determination module 34 is specifically configured to: among the classes of word vector sets, determine the sets whose number of word vectors exceeds a preset threshold, or sort the sets in descending order of the number of word vectors they contain and determine the first m sets, where m is a preset value; and take the words corresponding to the word vectors in the determined sets as feature words.
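The two alternative selection rules of the third determination module can be illustrated with a short Python sketch. For simplicity each cluster is represented directly as a list of the words whose vectors it contains; the function name and the sample clusters are hypothetical.

```python
def select_feature_words(clusters, threshold=None, top_m=None):
    """Pick feature words from clusters, each given as a list of words.

    Pass exactly one of:
      threshold: keep clusters holding more than this many words
      top_m:     keep the m largest clusters
    """
    if (threshold is None) == (top_m is None):
        raise ValueError("give exactly one of threshold or top_m")
    if threshold is not None:
        chosen = [c for c in clusters if len(c) > threshold]
    else:
        # sorted() is stable, so clusters of equal size keep their original order.
        chosen = sorted(clusters, key=len, reverse=True)[:top_m]
    return [word for cluster in chosen for word in cluster]


clusters = [["stock", "market", "shares"], ["rain"], ["bank", "loan"]]
by_threshold = select_feature_words(clusters, threshold=1)
by_top_m = select_feature_words(clusters, top_m=1)
print(by_threshold)
print(by_top_m)
```

With a threshold of 1 the single-word cluster is discarded and the other two are kept; with m = 1 only the largest cluster contributes feature words.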
Optionally, the fourth determination module 35 is specifically configured to determine, from the word vector and weight of each feature word, a multi-dimensional vector made up of multiple elements, and to take this multi-dimensional vector as the text vector of the current text; each element of the multi-dimensional vector consists of the word vector of one feature word and the weight of that feature word.
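As a minimal sketch of this structure, the text vector can be held as a list in which each element pairs one feature word's vector with its weight. The word vectors and weights below are made-up values, and the function name `build_text_vector` is hypothetical.

```python
def build_text_vector(feature_words, word_vectors, weights):
    """Return one (word vector, weight) element per feature word, in the given order."""
    return [(word_vectors[word], weights[word]) for word in feature_words]


# Hypothetical fingerprinted word vectors and weights for two feature words.
word_vectors = {"stock": (1, 0, 1), "market": (1, 0, 0)}
weights = {"stock": 0.6, "market": 0.4}

text_vector = build_text_vector(["stock", "market"], word_vectors, weights)
print(text_vector)
```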
The embodiments of the present invention provide a text representation method and device. The method determines the words that make up the current text, determines the word vector of each word, clusters the word vectors, determines the feature words of the current text and their weights according to the clustering result, and determines the text vector of the current text from the word vectors and weights of the feature words. In the present invention, words are thus represented by word vectors. Compared with the word itself, a word vector describes the word along multiple dimensions and can represent its semantic information more accurately. In addition, the clustering process takes into account both the semantics of the feature words within sentences and the correlation between sentences. By clustering the word vectors to determine the feature words, the present invention can therefore effectively improve the accuracy of the feature words determined for the current text, and in turn the accuracy of text processing.
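The overall flow summarized above can be sketched end to end in Python. Two parts of this sketch are assumptions that the summary does not specify: the clustering rule (a greedy pass that groups binary word vectors by Hamming distance to a cluster's first member) and the weight (the cluster's share of all words). All function names and the toy library are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(words, library, max_distance):
    """Assumed rule: join the first cluster whose seed is within max_distance, else start a new one."""
    clusters = []  # each cluster is a list of words; its first word is the seed
    for word in words:
        for cluster in clusters:
            if hamming(library[word], library[cluster[0]]) <= max_distance:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def represent_text(words, library, max_distance=1, top_m=1):
    """Cluster the word vectors, keep the top_m largest clusters, and emit (vector, weight) pairs."""
    clusters = greedy_cluster(words, library, max_distance)
    largest = sorted(clusters, key=len, reverse=True)[:top_m]
    text_vector = []
    for cluster in largest:
        weight = len(cluster) / len(words)  # assumed weight: the cluster's share of all words
        for word in cluster:
            text_vector.append((library[word], weight))
    return text_vector


# Hypothetical preset library of fingerprinted word vectors.
library = {"stock": (1, 0, 1), "market": (1, 0, 0), "rain": (0, 1, 1), "shares": (1, 0, 1)}
result = represent_text(["stock", "market", "rain", "shares"], library)
print(result)
```

In this toy run "stock", "market" and "shares" fall into one cluster and "rain" into another, so the three related words become the feature words and the isolated word is left out of the text vector.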
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that comprises the element.
The above are merely embodiments of the present invention and are not intended to limit it. Those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of its claims.

Claims (10)

1. A text representation method, characterized by comprising:
Determining the words that make up a current text;
Determining the word vector of each word;
Clustering the word vectors;
Determining, among the words and according to the clustering result, the feature words of the current text and the weight of each feature word;
Determining the text vector of the current text according to the word vector and weight of each feature word.
2. The method of claim 1, characterized in that the method further comprises:
Performing digital fingerprint processing on the text vector of the current text.
3. The method of claim 1, characterized in that determining the word vector of each word specifically comprises: determining, in a preset word vector library, the word vector corresponding to each word;
Wherein the method of presetting the word vector library specifically comprises:
Obtaining multiple history texts;
Determining the words that make up each history text;
Representing each word in the history texts as a multi-dimensional vector, and taking this multi-dimensional vector as the initial word vector of the word;
Performing digital fingerprint processing on each initial word vector to obtain the fingerprinted word vectors;
Building the preset word vector library from the fingerprinted word vectors.
4. The method of claim 1, characterized in that the clustering result comprises multiple classes of word vector sets, each of which contains several word vectors;
Determining, among the words and according to the clustering result, the feature words of the current text specifically comprises:
Among the classes of word vector sets, determining the sets whose number of word vectors exceeds a preset threshold, or sorting the sets in descending order of the number of word vectors they contain and determining the first m sets, where m is a preset value;
Taking the words corresponding to the word vectors in the determined sets as feature words.
5. The method of claim 1, characterized in that determining the words that make up the current text specifically comprises: segmenting the current text to obtain multiple words; determining the words of specified types among them; deduplicating the words of the specified types, and taking the deduplicated words as the words that make up the current text;
And/or,
Determining the text vector of the current text according to the word vector and weight of each feature word specifically comprises: determining, from the word vector and weight of each feature word, a multi-dimensional vector made up of multiple elements, and taking this multi-dimensional vector as the text vector of the current text; wherein each element of the multi-dimensional vector consists of the word vector of one feature word and the weight of that feature word.
6. A text representation device, characterized by comprising:
A first determination module, configured to determine the words that make up a current text;
A second determination module, configured to determine the word vector of each word;
A clustering module, configured to cluster the word vectors;
A third determination module, configured to determine, among the words and according to the clustering result, the feature words of the current text and the weight of each feature word;
A fourth determination module, configured to determine the text vector of the current text according to the word vector and weight of each feature word.
7. The device of claim 6, characterized in that the device further comprises:
A processing module, configured to perform digital fingerprint processing on the text vector of the current text.
8. The device of claim 6, characterized in that the second determination module is specifically configured to determine, in a preset word vector library, the word vector corresponding to each word;
The device further comprises: a word vector library presetting module, configured to preset the word vector library;
The word vector library presetting module is specifically configured to: obtain multiple history texts; determine the words that make up each history text; represent each word in the history texts as a multi-dimensional vector and take that vector as the word's initial word vector; perform digital fingerprint processing on each initial word vector to obtain the fingerprinted word vectors; and build the preset word vector library from the fingerprinted word vectors.
9. The device of claim 6, characterized in that the clustering result comprises multiple classes of word vector sets, each of which contains several word vectors;
The third determination module is specifically configured to: among the classes of word vector sets, determine the sets whose number of word vectors exceeds a preset threshold, or sort the sets in descending order of the number of word vectors they contain and determine the first m sets, where m is a preset value; and take the words corresponding to the word vectors in the determined sets as feature words.
10. The device of claim 6, characterized in that the first determination module is specifically configured to: segment the current text to obtain multiple words; determine the words of specified types among them; deduplicate the words of the specified types; and take the deduplicated words as the words that make up the current text; and/or,
The fourth determination module is specifically configured to determine, from the word vector and weight of each feature word, a multi-dimensional vector made up of multiple elements, and to take this multi-dimensional vector as the text vector of the current text; wherein each element of the multi-dimensional vector consists of the word vector of one feature word and the weight of that feature word.
CN201510096570.XA 2015-03-04 2015-03-04 A kind of document representation method and device Active CN104778158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510096570.XA CN104778158B (en) 2015-03-04 2015-03-04 A kind of document representation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510096570.XA CN104778158B (en) 2015-03-04 2015-03-04 A kind of document representation method and device

Publications (2)

Publication Number Publication Date
CN104778158A true CN104778158A (en) 2015-07-15
CN104778158B CN104778158B (en) 2018-07-17

Family

ID=53619632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510096570.XA Active CN104778158B (en) 2015-03-04 2015-03-04 A kind of document representation method and device

Country Status (1)

Country Link
CN (1) CN104778158B (en)


Also Published As

Publication number Publication date
CN104778158B (en) 2018-07-17


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230315

Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193

Patentee after: Sina Technology (China) Co.,Ltd.

Address before: 100080, International Building, No. 58 West Fourth Ring Road, Haidian District, Beijing, 20 floor

Patentee before: Sina.com Technology (China) Co.,Ltd.