CN104778158A

CN104778158A - Method and device for representing text

Info

Publication number: CN104778158A
Application number: CN201510096570.XA
Authority: CN
Inventors: 刘洋
Original assignee: Sina Technology China Co Ltd
Current assignee: Sina Technology China Co Ltd
Priority date: 2015-03-04
Filing date: 2015-03-04
Publication date: 2015-07-15
Anticipated expiration: 2035-03-04
Also published as: CN104778158B

Abstract

The invention discloses a text representation method and device that improve the accuracy of text representation and thereby the accuracy of text processing. The method comprises: determining each word of a current text; determining a word vector for each word; clustering the word vectors; determining, from the words and according to the clustering result, the feature words of the current text and their weights; and determining the text vector of the current text from the word vectors and weights of the feature words. Because the feature words are determined by clustering, the semantics of the words within sentences and the correlation between sentences are taken into account, and the word vectors of the feature words can accurately represent the meaning of the text. The accuracy of text representation is thus improved, and with it the accuracy of text processing.

Description

A text representation method and device

Technical field

The present invention relates to information processing technology, and in particular to a text representation method and device.

Background art

Text processing is common in the field of information processing. Text processing refers to operations such as text retrieval, text classification and text analysis that are performed on the content of a text after the text has been represented. Text representation refers to converting the original text content into an internal representation structure that a computer program can analyze. For example, a vector structure that a computer can analyze may be formed from the words, phrases and so on in the text content.

The higher the accuracy of the text representation, the more accurately it expresses the meaning of the text, and the better and more efficient the text processing. Conversely, the lower the accuracy of the text representation, the further the expressed meaning departs from the actual meaning of the text, and the worse and less efficient the text processing.

In the prior art, text representation methods are mainly based on the vector space model. To represent a text with the vector space model, the text is first segmented into multiple words. Then, according to how often these words occur in the text, the words whose frequency exceeds a preset value are selected as the feature words of the text, and a weight is calculated for each feature word. Finally, the feature words and their weights form a text vector, which is the representation of the text. For example, if the i-th feature word of a text is fi and its weight is wi, the text is represented as: <f1:w1>, <f2:w2>, ..., <fi:wi>, ..., where i = 1, 2, 3, ...

The prior-art method above considers neither the semantics of a feature word within its sentence nor the correlation between sentences when selecting feature words; it mechanically extracts the words whose frequency exceeds the preset value. In addition, the feature words in the text vector are the literal words of the text, and an isolated word may carry several meanings, so it cannot accurately express the meaning of the text. The accuracy with which the text vector expresses the text is therefore low, and the accuracy of text processing is correspondingly low.

Summary of the invention

Embodiments of the present invention provide a text representation method and device for improving the accuracy of text representation and thereby the accuracy of text processing.

A text representation method provided by an embodiment of the present invention comprises:

determining each word that forms a current text;

determining the word vector of each word;

clustering the word vectors;

determining, among the words and according to the clustering result, the feature words of the current text and the weight of each feature word;

determining the text vector of the current text according to the word vector and weight of each feature word.

A text representation device provided by an embodiment of the present invention comprises:

a first determination module, configured to determine each word that forms a current text;

a second determination module, configured to determine the word vector of each word;

a clustering module, configured to cluster the word vectors;

a third determination module, configured to determine, among the words and according to the clustering result, the feature words of the current text and the weight of each feature word;

a fourth determination module, configured to determine the text vector of the current text according to the word vector and weight of each feature word.

In the text representation method and device provided by the embodiments of the present invention, the method determines each word that forms the current text, determines the word vector of each word, clusters the word vectors, determines the feature words of the current text and their weights according to the clustering result, and determines the text vector of the current text according to the word vectors and weights of the feature words. In the present invention a word is represented by its word vector. Compared with the word itself, a word vector describes the word along multiple dimensions and can represent its semantic information more accurately. In addition, the clustering process takes into account the semantics of the feature words within sentences and the correlation between sentences. By determining feature words through clustering of word vectors, the present invention can therefore effectively improve the accuracy with which the feature words of the current text are determined, and in turn the accuracy of text processing.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the present invention and form a part of it. The illustrative embodiments of the present invention and their description serve to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic flowchart of a text representation method provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a method for presetting a word vector library provided by an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a text representation device provided by an embodiment of the present invention.

Detailed description

To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Referring to Fig. 1, a schematic flowchart of a text representation method provided by an embodiment of the present invention, the method comprises:

S101: determine each word that forms the current text.

In the embodiments of the present invention, the current text is the text obtained by the server that needs to be represented. The text may be a sentence, paragraph or whole document in Chinese, and may be in a format such as txt, doc, pdf or wps.

In the embodiments of the present invention, the server may, without limitation, obtain a text from a preset storage area (such as a corpus) or obtain online a text uploaded by a user, and take the obtained text as the current text.

After obtaining the current text, the embodiment of the present invention may segment it to obtain each word that forms the current text. The segmentation methods that may be used include, but are not limited to, word-by-word traversal and mechanical word segmentation. For example, suppose the server obtains an article and takes it as the current text. The server preprocesses the article content and then segments the preprocessed content. The words obtained after segmentation are: display, flat panel, liquid crystal, illumination and device. These five words can be determined as the words that form the current text.

To reduce the server's computation during segmentation and to avoid interference from some words, the embodiment of the present invention may preprocess the current text before segmentation, for example by removing Hypertext Markup Language (HTML) markup from the current text, converting traditional Chinese characters to simplified Chinese characters, and converting full-width characters to half-width characters.
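The preprocessing described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `preprocess` and the tag-stripping regular expression are assumptions, Unicode NFKC normalization is used as one way to fold full-width characters to half-width (it also folds other compatibility characters), and traditional-to-simplified conversion is left out because it needs a conversion table or a dedicated library.

```python
import html
import re
import unicodedata

# Assumed pattern: anything between "<" and ">" is treated as an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")


def preprocess(text: str) -> str:
    """Strip HTML tags and fold full-width characters to half-width."""
    without_tags = TAG_RE.sub(" ", text)        # remove HTML tags
    decoded = html.unescape(without_tags)       # decode entities such as &amp;
    # NFKC maps full-width letters, digits and the ideographic space
    # to their half-width (ASCII) counterparts.
    half_width = unicodedata.normalize("NFKC", decoded)
    return re.sub(r"\s+", " ", half_width).strip()  # collapse whitespace


if __name__ == "__main__":
    print(preprocess("<p>ＬＣＤ　display１２３</p>"))
```

A production pipeline would likely use a real HTML parser instead of a regular expression; the regex keeps the sketch dependency-free.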

In practical application scenarios, the words obtained after segmentation may include, besides words that carry actual meaning, some words that carry no actual meaning, whereas feature words are generally words with actual meaning. Therefore, when determining the words that form the current text, the embodiment of the present invention may specifically segment the current text to obtain multiple words and then determine the words of a specified type among them. To avoid selecting the same word more than once, the words of the specified type may further be de-duplicated, and the words remaining after de-duplication are taken as the words that form the current text. The words of the specified type may specifically be words with actual meaning, which include but are not limited to nouns, verbs and adjectives; words without actual meaning are generally auxiliary words, adverbs, function words and the like.

S102: determine the word vector of each word.

In the embodiments of the present invention, to express the meaning (that is, the semantic information) of a word in more detail, a word may be represented by an N-dimensional vector of N elements; this N-dimensional vector is the word vector of the word. Each of the N elements is the weight of the word with respect to one text category, where the text categories may include: computer, traffic, education, economy, military, sports, medicine, art, politics, environment and so on.

For example, suppose the text categories of a word vector are expressed as the N-dimensional vector {computer, traffic, education, economy}₄, where N = 4, and suppose the five words display, flat panel, liquid crystal, illumination and device are the words that form the current text. The word vector of the word "liquid crystal" may then be expressed as {0.175, 0.095, 0.185, 0.041}₄, which means that the weights of "liquid crystal" with respect to the four text categories computer, traffic, education and economy are 0.175, 0.095, 0.185 and 0.041 respectively.

In the embodiments of the present invention, when determining the word vector of each word, the server may use a word vector tool to determine the word vectors directly online. Optionally, the server may use the word2vec tool to determine the word vector of each word.

To improve the efficiency of determining the word vector of each word, the word vectors may preferably be determined in advance. When the word vector of a word is needed, the corresponding word vector is determined (for example, looked up) in a preset word vector library. Determining the word vectors in a preset word vector library is quick and convenient and can effectively improve the processing efficiency of the server.

In the embodiments of the present invention, the word2vec tool may also be used when the word vectors are determined in advance.

S103: cluster the word vectors.

In the embodiments of the present invention, after the word vector of each word has been determined in step S102, the word vectors can be clustered.

The basic principle of clustering is that word vectors in the same class are highly similar to each other, while word vectors in different classes differ greatly. The word vectors can therefore be clustered by measuring the similarity between them. Specifically, the similarity between two word vectors is determined by calculating the cosine of the angle between them (the cosine similarity): the larger the cosine value, the more similar the word vectors; the smaller the cosine value, the less similar they are.
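The cosine measure described above can be sketched in a few lines. The function name `cosine_similarity` is an assumption for illustration; the two sample vectors are the "liquid crystal" and "display" word vectors that the description uses as examples, and returning 0.0 for a zero vector is a convention chosen here, not something the description specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for a zero vector
    return dot / (norm_a * norm_b)


liquid_crystal = [0.175, 0.095, 0.185, 0.041]
display = [0.123, 0.195, 0.085, 0.441]

if __name__ == "__main__":
    # A larger value means the two words are treated as more similar.
    print(cosine_similarity(liquid_crystal, display))
```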

In the embodiments of the present invention, the clustering algorithms that may be used include, but are not limited to: the Chinese Restaurant Process (CRP) algorithm, the K-means clustering algorithm, the K-medoids clustering algorithm, the CLARANS algorithm, the BIRCH algorithm, the CLIQUE algorithm and the DBSCAN algorithm.

In the embodiments of the present invention, clustering the word vectors yields multiple classes of word vector sets, which together are the clustering result; each class of word vector set contains a number of word vectors.

Continuing the example above, suppose the word vectors of the five words display, flat panel, liquid crystal, illumination and device are clustered and three classes of word vector sets are obtained. The first set contains the word vectors of liquid crystal, display and device, the second set contains only the word vector of flat panel, and the third set contains only the word vector of illumination. This shows that the word vectors of liquid crystal, display and device are the most similar and the most strongly correlated with each other. The word vectors of flat panel and illumination are only weakly correlated with each other, and each is only weakly correlated with liquid crystal, display and device. That is, among the three classes, the words of the first class best reflect the characteristics of the current text, followed by those of the second and third classes.
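To make the grouping step concrete, the sketch below clusters word vectors with a simple greedy "leader" scheme: each vector joins the first cluster whose first member is similar enough, otherwise it starts a new cluster. This is a deliberately simple stand-in, not one of the algorithms named above, and the function names, the 0.9 threshold and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def leader_cluster(vectors, threshold=0.9):
    """Greedy clustering: vectors maps word -> vector, returns lists of words."""
    clusters = []  # each entry is (leader_vector, [member words])
    for word, vec in vectors.items():
        for leader, members in clusters:
            if cosine(leader, vec) >= threshold:
                members.append(word)   # similar enough to this cluster's leader
                break
        else:
            clusters.append((vec, [word]))  # no match: start a new cluster
    return [members for _, members in clusters]


# Hypothetical toy vectors chosen so that three words point the same way.
toy_vectors = {
    "liquid crystal": [1.0, 0.1],
    "display": [0.9, 0.2],
    "device": [1.0, 0.0],
    "flat panel": [0.0, 1.0],
    "illumination": [-1.0, 0.1],
}

if __name__ == "__main__":
    print(leader_cluster(toy_vectors))
```

With these toy vectors the result has the same shape as the example in the text: one cluster of three words and two singleton clusters.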

S104: according to the clustering result, determine among the words the feature words of the current text and the weight of each feature word.

In the embodiments of the present invention, to determine the feature words of the current text according to the clustering result, the word vector sets that contain more word vectors than a preset threshold may be determined among all the word vector sets, and the words corresponding to the word vectors in the determined sets are taken as the feature words.

Continuing the example above, suppose the preset threshold is 2. Among the first, second and third word vector sets (the clustering result), the sets that contain more than 2 word vectors are determined. The first set contains 3 word vectors, while the second and third sets contain 1 word vector each, so the only set exceeding the threshold is the first. The words corresponding to the word vectors in the first set are taken as feature words; that is, liquid crystal, display and device are the feature words of the current text.

Alternatively, to determine the feature words of the current text according to the clustering result, the word vector sets may be sorted in descending order of the number of word vectors they contain and the first m sets determined, where m is a preset value; the words corresponding to the word vectors in the determined sets are taken as the feature words.

Continuing the example above, suppose the preset value is m = 1. The first, second and third word vector sets are sorted in descending order of the number of word vectors they contain. The first set contains 3 word vectors and the second and third contain 1 each, so the sorted order is: first, second, third. The words corresponding to the word vectors in the 1st (m = 1) set, that is, the first word vector set, are taken as feature words: liquid crystal, display and device.
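Both selection rules (size above a threshold, or the m largest sets) can be sketched as below. The function names are assumptions, and each cluster is represented simply as a list of its words rather than of word vectors.

```python
def select_by_threshold(clusters, threshold):
    """Keep the words of every cluster with more than `threshold` members."""
    return [word for cluster in clusters if len(cluster) > threshold
            for word in cluster]


def select_top_m(clusters, m):
    """Keep the words of the m largest clusters (ties keep original order)."""
    ranked = sorted(clusters, key=len, reverse=True)
    return [word for cluster in ranked[:m] for word in cluster]


clusters = [
    ["liquid crystal", "display", "device"],
    ["flat panel"],
    ["illumination"],
]

if __name__ == "__main__":
    print(select_by_threshold(clusters, 2))
    print(select_top_m(clusters, 1))
```

With the example clusters, both rules select liquid crystal, display and device, matching the two worked examples in the text.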

In the embodiments of the present invention, the weight $w_i$ of a feature word of the current text may be determined according to the clustering result by formula (1-1):

$$w_i = \log\left(1 + \frac{n_i}{n_m}\right) \tag{1-1}$$

where $w_i$ is the weight of the i-th feature word in the current text, $n_i$ is the number of times the i-th feature word occurs in the current text (hereinafter, the word frequency), and $n_m$ is the largest of the word frequencies of the feature words.

For example, if the word frequencies of the feature words liquid crystal, display and device are 10, 30 and 20 respectively, the word frequency of "display" is the largest, so $n_m = 30$. The weight of "liquid crystal" is then $w_1 = \log(1 + 10/30)$, the weight of "display" is $w_2 = \log(1 + 30/30)$, and the weight of "device" is $w_3 = \log(1 + 20/30)$.
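Formula (1-1) can be evaluated directly, as in the sketch below. The function name `feature_weights` is an assumption, and because the description does not state the base of the logarithm, the natural logarithm is assumed here; a different base only rescales all weights by a constant factor.

```python
import math


def feature_weights(frequencies):
    """Apply w_i = log(1 + n_i / n_m), with n_m the largest frequency."""
    n_max = max(frequencies.values())
    # Natural logarithm is an assumption; the source does not give the base.
    return {word: math.log(1 + n / n_max) for word, n in frequencies.items()}


frequencies = {"liquid crystal": 10, "display": 30, "device": 20}

if __name__ == "__main__":
    for word, weight in feature_weights(frequencies).items():
        print(word, round(weight, 4))
```

The most frequent feature word always receives the weight log(2), and less frequent words receive proportionally smaller weights.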

S105: determine the text vector of the current text according to the word vector and weight of each feature word.

Specifically, a multi-dimensional vector composed of multiple elements is determined according to the word vector and weight of each feature word, and this multi-dimensional vector is taken as the text vector of the current text. Each element of the multi-dimensional vector consists of the word vector of one feature word and the weight of that feature word.

For example, the text vector of the current text may be expressed as: {<F1:w_1>, <F2:w_2>, ..., <Fi:w_i>, ...}, where i = 1, 2, 3, ... and Fi is the word vector of the i-th feature word.
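As a rough illustration of this structure, the text vector can be held as a list of (word vector, weight) pairs. The function name `build_text_vector`, the dictionary layout, and the "device" vector and all weights below are hypothetical values chosen for the sketch; only the "liquid crystal" and "display" vectors come from the description.

```python
def build_text_vector(word_vectors, weights):
    """Pair each feature word's vector Fi with its weight w_i."""
    return [(tuple(word_vectors[word]), weights[word]) for word in weights]


word_vectors = {
    "liquid crystal": [0.175, 0.095, 0.185, 0.041],
    "display": [0.123, 0.195, 0.085, 0.441],
    "device": [0.150, 0.100, 0.120, 0.200],  # hypothetical vector
}
weights = {"liquid crystal": 0.29, "display": 0.69, "device": 0.51}  # hypothetical

if __name__ == "__main__":
    for vector, weight in build_text_vector(word_vectors, weights):
        print(vector, weight)
```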

In the method shown in Fig. 1, each word that forms the current text is determined, the word vector of each word is determined, the word vectors are clustered, the feature words of the current text and their weights are determined according to the clustering result, and the text vector of the current text is determined according to the word vectors and weights of the feature words. In the present invention a word is represented by its word vector. Compared with the word itself, a word vector describes the word along multiple dimensions and can represent its semantic information more accurately. In addition, the clustering process takes into account the semantics of the feature words within sentences and the correlation between sentences. By determining feature words through clustering of word vectors, the present invention can therefore effectively improve the accuracy with which the feature words of the current text are determined, and in turn the accuracy of text processing.

Determining (for example, looking up) the word vector of each word in the preset word vector library requires that the word vector library be preset first.

Referring to Fig. 2, in the embodiments of the present invention the method for presetting the word vector library may specifically comprise the following steps:

S201: obtain multiple history texts.

Multiple texts may be obtained from a corpus as the history texts. The number of texts obtained may be several hundred, several thousand and so on, and is not specifically limited here.

S202: determine the multiple words that form each history text.

The words that form each history text are determined in a way similar to the method described above for the current text; for example, each history text may be segmented by a segmentation method to obtain its words.

Optionally, to reduce the server's computation and to avoid interference from some words, each history text may be preprocessed before it is segmented. The preprocessing may include, but is not limited to: removing HTML from the history text, converting traditional Chinese characters to simplified Chinese characters, converting full-width characters to half-width characters, and de-duplicating the history texts.

To de-duplicate the history texts, a message digest of each history text may be calculated with a message digest algorithm. For example, each obtained history text is processed with Message-Digest Algorithm 5 (MD5) to obtain its MD5 value, and of the history texts that share the same MD5 value only one copy is kept, which achieves the de-duplication.
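The digest-based de-duplication can be sketched with the standard library's `hashlib`. The function name `dedupe_texts` and the choice of UTF-8 encoding are assumptions; the sketch keeps the first text seen for each MD5 value, so only byte-identical texts are treated as duplicates.

```python
import hashlib


def dedupe_texts(texts):
    """Keep one copy of each text, comparing texts by their MD5 digest."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:      # first time this digest appears
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    history = ["liquid crystal display", "flat panel", "liquid crystal display"]
    print(dedupe_texts(history))
```

Storing the fixed-length digests instead of the full texts keeps the `seen` set small when the history texts are long.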

In practical application scenarios, the feature words used to represent a text are generally words with actual meaning. Optionally, therefore, after each history text has been segmented, the words of a specified type that form each history text may be determined, where the words of the specified type may specifically be words with actual meaning. This further reduces the server's computation.

S203: represent each word in the history texts as a multi-dimensional vector, and take this multi-dimensional vector as the initial word vector of the word. In the embodiments of the present invention, the word2vec tool may likewise be used to determine the word vector of each word, which is not described again here.

S204: apply digital fingerprint processing to each initial word vector to obtain the fingerprinted word vectors.

Applying digital fingerprint processing to an initial word vector means digitizing it, for example converting the initial word vector into a string of "0" and "1" digits of a certain length (such as 64 bits). The embodiment of the present invention converts word vectors into strings of "0" and "1" digits with a locality-sensitive hashing (LSH) algorithm.

For example, if the word vector of the word "liquid crystal" is {0.175, 0.095, 0.185, 0.041}₄, the word vector obtained after digital fingerprint processing may be <000000000010>;

if the word vector of the word "display" is {0.123, 0.195, 0.085, 0.441}₄, the word vector obtained after digital fingerprint processing may be <100101010010>.
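The description names LSH only in general terms. One common LSH family for cosine similarity is random-hyperplane hashing, sketched below: each output bit records on which side of a random hyperplane the vector lies. The function name, the 12-bit length and the fixed seed are assumptions, and the bit strings this sketch produces will not match the example fingerprints given above.

```python
import random


def lsh_fingerprint(vector, n_bits=12, seed=0):
    """Random-hyperplane LSH: one bit per hyperplane, '1' if dot >= 0."""
    rng = random.Random(seed)  # same seed -> same hyperplanes for every vector
    bits = []
    for _ in range(n_bits):
        plane = [rng.gauss(0.0, 1.0) for _ in vector]   # random normal vector
        dot = sum(p * x for p, x in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


if __name__ == "__main__":
    print(lsh_fingerprint([0.175, 0.095, 0.185, 0.041]))
    print(lsh_fingerprint([0.123, 0.195, 0.085, 0.441]))
```

All vectors must be hashed with the same hyperplanes (here, the same seed) for their fingerprints to be comparable; vectors separated by a small angle then tend to differ in few bits.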

S205: form the preset word vector library from the fingerprinted word vectors.

In the embodiments of the present invention, the preset word vector library is formed from the fingerprinted word vectors, so after the words of the current text have been determined, the word vectors found for them in the preset word vector library are fingerprinted word vectors. Clustering the word vectors then means clustering the fingerprinted word vectors, and the similarity between word vectors during clustering can be determined by calculating the Hamming distance between two word vectors. The larger the Hamming distance between two word vectors, the lower the correlation between them; conversely, the smaller the Hamming distance, the more similar the two word vectors. Clustering digitized word vectors greatly reduces the server's computation and effectively improves its processing efficiency.
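The Hamming distance between two fingerprints is simply the number of positions at which they differ. In the sketch below the function name is an assumption, and the two bit strings are the example fingerprints given in the description for "liquid crystal" and "display".

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


liquid_crystal_fp = "000000000010"
display_fp = "100101010010"

if __name__ == "__main__":
    print(hamming_distance(liquid_crystal_fp, display_fp))  # prints 4
```

When fingerprints are stored as integers rather than strings, the same distance is the number of set bits in the XOR of the two values, which is one reason the comparison is cheap.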

In the embodiments of the present invention, to further avoid the initial word vectors including word vectors of words without actual meaning, the initial word vectors may be screened. Specifically, according to attributes such as part of speech, word frequency and a stop-word list, the words without actual meaning are removed from the initial word vectors and only the words with actual meaning are retained. This effectively reduces the interference from words without actual meaning and in turn the server's computation.

In the embodiments of the present invention, once the text vector of the current text has been determined from the word vectors and weights of the feature words, text processing can be carried out on the basis of the text vector, for example text retrieval, text classification, text analysis and text similarity calculation.

To reduce the server's computation during text processing and thereby effectively improve its processing efficiency, the method in the embodiments of the present invention further comprises: after the text vector of the current text has been determined from the word vectors and weights of the feature words, applying digital fingerprint processing to the text vector of the current text.

The digital fingerprint processing is again a digitization. Optionally, the present invention may use simhash, one of the LSH algorithms, to apply digital fingerprint processing to the text vector.

For example, suppose the digitized word vectors of the feature words liquid crystal, display and device are <010>, <001> and <110> respectively, and their weights are 0.1, 0.2 and 0.4 respectively. The text vector is then expressed as: {<liquid crystal word vector: 0.1>, <display word vector: 0.2>, <device word vector: 0.4>}.

The text vector {<liquid crystal word vector: 0.1>, <display word vector: 0.2>, <device word vector: 0.4>} is digitized as follows:

In each word vector, every "0" is replaced with "-1" and every "1" is kept as "1", and the word vector is then multiplied by its weight to obtain a new vector. The first elements of the new vectors are summed to obtain a first value, the second elements are summed to obtain a second value, and the third elements are summed to obtain a third value.

Among the first to third values, each positive value is replaced with 1 and each negative value is replaced with 0. The resulting vector of 0s and 1s is the digitized vector.

For example, replacing "0" with "-1" and keeping "1" as "1" in <010>, <001> and <110>, and multiplying by the weight of each word vector, gives the following vectors:

Word vector <010> corresponds to vector 1: <-0.1, 0.1, -0.1>;

Word vector <001> corresponds to vector 2: <-0.2, -0.2, 0.2>;

Word vector <110> corresponds to vector 3: <0.4, 0.4, -0.4>;

Adding the first elements of vectors 1 to 3 (-0.1, -0.2 and 0.4) gives a first value of 0.1, which is positive;

Adding the second elements of vectors 1 to 3 (0.1, -0.2 and 0.4) gives a second value of 0.3, which is positive;

Adding the third elements of vectors 1 to 3 (-0.1, 0.2 and -0.4) gives a third value of -0.3, which is negative;

Replacing each positive value among the first to third values with 1 and each negative value with 0 gives the vector <110>, which is the digitized vector.
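The worked example above can be reproduced with a short sketch. The function name `weighted_simhash` is an assumption, fingerprints are passed as bit strings, and a sum of exactly zero is mapped to 0 here because the description only covers positive and negative sums.

```python
def weighted_simhash(fingerprints, weights):
    """Combine weighted bit strings into one fingerprint, simhash-style."""
    n_bits = len(fingerprints[0])
    totals = [0.0] * n_bits
    for fingerprint, weight in zip(fingerprints, weights):
        for i, bit in enumerate(fingerprint):
            # A "1" bit contributes +weight, a "0" bit contributes -weight.
            totals[i] += weight if bit == "1" else -weight
    # Positive sums become 1; negative sums (and, by assumption, zero) become 0.
    return "".join("1" if total > 0 else "0" for total in totals)


if __name__ == "__main__":
    # liquid crystal, display, device with weights 0.1, 0.2, 0.4
    print(weighted_simhash(["010", "001", "110"], [0.1, 0.2, 0.4]))  # prints 110
```

Because the feature word with the largest weight dominates each column sum, the resulting fingerprint stays closest to the fingerprints of the most heavily weighted feature words.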

Be the document representation method that the embodiment of the present invention provides above, based on same thinking, the embodiment of the present invention additionally provides a kind of text representation device, as shown in Figure 3, comprising:

First determination module 31, for determining each word forming current text;

Second determination module 32, for determining the term vector of each word;

Cluster module 33, for carrying out cluster to each term vector;

3rd determination module 34, for according to cluster result, determines the Feature Words of current text and the weight of this Feature Words in each word;

4th determination module 35, for according to the term vector of each Feature Words and the text vector of weight determination current text.

Optionally, described device also comprises:

Processing module 36, for carrying out digital finger-print process to the text vector of described current text.

Optionally, the second determination module 32 specifically for,

In the term vector storehouse of presetting, determine the term vector corresponding with each word.

Optionally, described device also comprises:

Preset term vector library module 37, for default term vector storehouse;

Described default term vector library module 37 specifically for, obtain multiple history text, determine the multiple words forming each history text, each word lists in described history text is shown as a multi-C vector, using the initial word vector of this multi-C vector as described word, each initial word vector is carried out digital finger-print process respectively, obtains the term vector after digital finger-print process, adopt the term vector after described digital finger-print process to form and preset term vector storehouse.

Optionally, described default term vector library module 37 specifically for, determine the word of the multiple specified type forming each history text.

Optionally, described first determination module 31 specifically for, participle is carried out to described current text, obtain multiple word, in each word, determine the word of specified type, duplicate removal process is carried out to the word of described specified type, using each word after duplicate removal process as each word forming current text.

Optionally, the clustering result comprises multiple classes of word vector sets, each class of word vector set containing several word vectors;

The third determination module 34 is specifically configured to: among the word vector sets, determine those sets whose number of word vectors exceeds a preset threshold; or, sort the word vector sets in descending order of the number of word vectors they contain and determine the first m word vector sets, where m is a preset value; and take the word corresponding to each word vector in the determined word vector sets as a feature word.
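The two selection rules of the third determination module (size above a threshold, or the m largest sets) can be sketched as follows. The clustering result is assumed to be available as a mapping from a cluster identifier to the words whose vectors fall in that cluster. This representation and the name `select_feature_words` are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def select_feature_words(
    clusters: Dict[int, List[str]],
    threshold: Optional[int] = None,
    top_m: Optional[int] = None,
) -> List[str]:
    """Return the words of the clusters chosen by exactly one of the two rules."""
    if (threshold is None) == (top_m is None):
        raise ValueError("specify exactly one of threshold or top_m")
    if threshold is not None:
        # Rule 1: keep clusters whose size exceeds the preset threshold.
        chosen = [members for members in clusters.values() if len(members) > threshold]
    else:
        # Rule 2: sort clusters by size, largest first, and keep the first m.
        chosen = sorted(clusters.values(), key=len, reverse=True)[:top_m]
    return [word for members in chosen for word in members]


clusters = {
    0: ["stock", "market", "shares"],
    1: ["weather"],
    2: ["rises", "falls"],
}
by_threshold = select_feature_words(clusters, threshold=1)
by_top_m = select_feature_words(clusters, top_m=1)
```

With these toy clusters, the threshold rule keeps the two clusters that have more than one member, while the top-m rule with m = 1 keeps only the largest cluster.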

Optionally, the fourth determination module 35 is specifically configured to: determine, according to the word vector and weight of each feature word, a multi-dimensional vector made up of multiple elements, and take this multi-dimensional vector as the text vector of the current text; wherein each element of the multi-dimensional vector is formed from the word vector of one feature word and the weight of that feature word.
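One possible reading of this construction is sketched below. The text states only that each element is formed from a feature word's vector and its weight. The sketch assumes the two are combined by scaling the vector by the weight, which is an illustrative choice; keeping each element as a `(vector, weight)` pair would be an equally valid reading. The name `build_text_vector` is hypothetical.

```python
from typing import Dict, List, Sequence


def build_text_vector(
    feature_vectors: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[List[float]]:
    """Return one element per feature word: its word vector scaled by its weight."""
    return [
        [weights[word] * x for x in vector]
        for word, vector in feature_vectors.items()
    ]


feature_vectors = {"market": [1.0, 0.0, 2.0], "rises": [0.0, 4.0, 2.0]}
weights = {"market": 0.5, "rises": 0.25}
text_vector = build_text_vector(feature_vectors, weights)
```

The resulting `text_vector` has one element per feature word, in the insertion order of `feature_vectors`.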

The embodiments of the present invention provide a text representation method and device. The method determines the words that make up the current text, determines the word vector of each word, clusters the word vectors, determines the feature words of the current text and their weights according to the clustering result, and determines the text vector of the current text according to the word vectors and weights of the feature words. In the present invention, words are represented by word vectors. Compared with the word itself, a word vector describes the word along multiple dimensions and can therefore represent its semantic information more accurately. In addition, the clustering process takes into account both the semantics of the words within sentences and the correlation between sentences. By clustering word vectors to determine feature words, the present invention can therefore effectively improve the accuracy with which the feature words of the current text are determined, and in turn the accuracy of text processing.
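The overall flow summarized above can be sketched end to end as follows. Several choices here are assumptions rather than details given in this section: the clustering algorithm (a greedy single-pass grouping by cosine similarity), the weighting (the share of all words that fall in a feature word's cluster), and the combination of vector and weight by scaling. The names `cosine`, `cluster_words`, and `represent_text` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_words(
    vectors: Dict[str, Sequence[float]], min_similarity: float
) -> List[List[str]]:
    """Greedy clustering (assumption): join the first cluster whose seed is similar enough."""
    clusters: List[List[str]] = []
    for word, vec in vectors.items():
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= min_similarity:
                members.append(word)
                break
        else:
            clusters.append([word])  # no similar cluster found, start a new one
    return clusters


def represent_text(
    vectors: Dict[str, Sequence[float]],
    min_similarity: float = 0.8,
    top_m: int = 1,
) -> List[Tuple[str, List[float]]]:
    """Cluster, keep the top_m largest clusters, and emit weighted feature word vectors."""
    ranked = sorted(cluster_words(vectors, min_similarity), key=len, reverse=True)
    total = len(vectors)
    elements: List[Tuple[str, List[float]]] = []
    for members in ranked[:top_m]:
        weight = len(members) / total  # assumed weighting scheme
        for word in members:
            elements.append((word, [weight * x for x in vectors[word]]))
    return elements


vectors = {
    "stock": [1.0, 0.0],
    "shares": [0.9, 0.1],
    "weather": [0.0, 1.0],
    "market": [1.0, 0.2],
}
result = represent_text(vectors)
```

In this toy input the three finance-related words end up in one cluster and become the feature words, while the isolated word is discarded.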

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device comprises one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

Memory may include volatile memory in computer-readable media, in forms such as random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

The above are merely embodiments of the present invention and are not intended to limit it. Those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of the claims of the present invention.

Claims

1. A text representation method, characterized by comprising:

Determining the words that make up a current text;

Determining the word vector of each word;

Clustering the word vectors;

2. the method for claim 1, is characterized in that, described method also comprises:

Digital finger-print process is carried out to the text vector of described current text.

3. the method for claim 1, is characterized in that, the described term vector determining each word, specifically comprises: in the term vector storehouse of presetting, determine the term vector corresponding with each word;

Wherein, preset the method in term vector storehouse, specifically comprise:

Obtain multiple history text;

Determine the multiple words forming each history text;

Each word lists in described history text is shown as a multi-C vector, using the initial word vector of this multi-C vector as described word;

Each initial word vector is carried out digital finger-print process respectively, obtains the term vector after digital finger-print process;

Adopt the term vector after described digital finger-print process to form and preset term vector storehouse.

4. the method for claim 1, is characterized in that, described cluster result comprises the set of multiclass term vector, comprises several term vectors in each class term vector set;

Described according to cluster result, in each word, determine the Feature Words of current text, specifically comprise:

In all kinds of term vector set, determine that the quantity of the term vector comprised exceedes the term vector set of predetermined threshold value, or, all kinds of term vector set is sorted according to the order that the quantity comprising term vector is descending, determine front m term vector set, wherein, m is default value;

Using word corresponding for each term vector in the term vector set determined as Feature Words.

5. the method for claim 1, is characterized in that, the described each word determining formation current text, specifically comprises: carry out participle to described current text, obtain multiple word; In each word, determine the word of specified type; Duplicate removal process is carried out to the word of described specified type, using each word after duplicate removal process as each word forming current text;

And/or,

The text vector of the described term vector according to each Feature Words and weight determination current text, specifically comprises: according to term vector and the weight of each Feature Words, determine the multi-C vector be made up of multiple element, using the text vector of this multi-C vector as current text; Wherein, an element in described multi-C vector is made up of the term vector of a Feature Words and the weight of this Feature Words.

6. A text representation device, characterized by comprising:

A first determination module, configured to determine the words that make up a current text;

A second determination module, configured to determine the word vector of each word;

A clustering module, configured to cluster the word vectors;

7. The device of claim 6, characterized in that the device further comprises:

A processing module, configured to perform digital fingerprint processing on the text vector of the current text.

8. The device of claim 6, characterized in that the second determination module is specifically configured to determine, in a preset word vector library, the word vector corresponding to each word;

The device further comprises: a preset word vector library module, configured to preset the word vector library;

The preset word vector library module is specifically configured to: obtain multiple history texts; determine the words that make up each history text; represent each word in the history texts as a multi-dimensional vector and take that vector as the initial word vector of the word; perform digital fingerprint processing on each initial word vector to obtain a fingerprinted word vector; and form the preset word vector library from the fingerprinted word vectors.

9. The device of claim 6, characterized in that the clustering result comprises multiple classes of word vector sets, each class of word vector set containing several word vectors;

The third determination module is specifically configured to: among the word vector sets, determine those sets whose number of word vectors exceeds a preset threshold; or, sort the word vector sets in descending order of the number of word vectors they contain and determine the first m word vector sets, where m is a preset value; and take the word corresponding to each word vector in the determined word vector sets as a feature word.

10. The device of claim 6, characterized in that the first determination module is specifically configured to: perform word segmentation on the current text to obtain multiple words; determine, among those words, the words of a specified type; deduplicate the words of the specified type; and take the deduplicated words as the words that make up the current text; and/or,

The fourth determination module is specifically configured to: determine, according to the word vector and weight of each feature word, a multi-dimensional vector made up of multiple elements, and take this multi-dimensional vector as the text vector of the current text; wherein each element of the multi-dimensional vector is formed from the word vector of one feature word and the weight of that feature word.