CN105630928B

CN105630928B - Text identification method and device

Info

Publication number: CN105630928B
Application number: CN201510974385.6A
Authority: CN
Inventors: 张伸正; 魏少俊; 陈培军
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2015-12-22
Filing date: 2015-12-22
Publication date: 2019-06-21
Anticipated expiration: 2035-12-22
Also published as: CN105630928A

Abstract

This application provides a text identification method and a text identification device. The method comprises: selecting a first text to be identified; determining multiple characteristic fingerprints of the first text according to multiple benchmark texts, respectively; and identifying the first text according to the multiple characteristic fingerprints. In summary, the text identification method and device according to embodiments of the present invention generate multiple characteristic fingerprints for the text to be identified according to multiple benchmark texts, which increases the identifiability of the text and greatly reduces the space needed to identify it.

Description

Text identification method and device

Technical field

The present invention relates to the technical field of network information, and in particular to a text identification method and a text identification device.

Background technique

With the development of network technology, people can obtain a large amount of information through Internet communication platforms. Much of this information is provided in the form of text.

In order to store and identify massive amounts of text, many text identification techniques have been developed. For example, a widely known approach obtains the feature vector of a text with the TF-IDF algorithm and then compresses the vector information with the min-hash algorithm to obtain a characteristic fingerprint of the text, which greatly reduces the space needed to identify the text.

However, if two texts are similar, enough elements of the feature vector must be sampled to ensure that the characteristic fingerprints of the two texts differ, which in turn makes the space needed to identify a text large.

Summary of the invention

In view of the above problems, a text identification method and a text identification device are proposed that identify a text by means of multiple characteristic fingerprints.

According to one aspect of the invention, a text identification method is provided, comprising:

selecting a first text to be identified;

determining multiple characteristic fingerprints of the first text according to multiple benchmark texts, respectively; and

identifying the first text according to the multiple characteristic fingerprints.

Optionally, a characteristic fingerprint is obtained as follows:

obtaining a first feature vector of the first text;

determining, according to the benchmark text, the weight of each element in the first feature vector of the first text; and

obtaining the characteristic fingerprint of the first text according to the weights.

Optionally, obtaining the characteristic fingerprint of the first text according to the weights comprises:

establishing a second feature vector of the first text from the first feature vector according to the weights; and

generating the characteristic fingerprint of the first text according to the second feature vector.

Optionally, generating the characteristic fingerprint of the first text according to the second feature vector comprises:

generating the characteristic fingerprint of the first text according to the second feature vector, based on the distance between the first text and the benchmark text.

Optionally, the distance between the first text and the benchmark text is determined by a min-hash operation.

Optionally, the numbers of occurrences of the elements in the second feature vector satisfy the multiple relationship between the weights of those elements; that is, the counts stand in the same ratio as the weights.

Optionally, obtaining the first feature vector of the first text comprises:

arranging the words of a word sequence in descending order of frequency of occurrence, and taking a preset number of words from the front as the first feature vector of the first text.

Optionally, the first text is segmented into words and then subjected to garbage removal to form the word sequence before sorting.

Optionally, the feature vector is extracted from one or more of the following: the text title, the text summary, and the text body.

According to another aspect of the present invention, a text identification device is provided, comprising:

a selection module, configured to select a first text to be identified;

a determining module, configured to determine multiple characteristic fingerprints of the first text according to multiple benchmark texts, respectively; and

an identification module, configured to identify the first text according to the multiple characteristic fingerprints.

Optionally, the determining module obtains a characteristic fingerprint as follows:

obtaining a first feature vector of the first text;

Further, the determining module obtains the characteristic fingerprint of the first text as follows:

Optionally, the determining module generates the characteristic fingerprint of the first text as follows:

Optionally, the acquisition module is configured to arrange the words of a word sequence in descending order of frequency of occurrence, and to take a preset number of words from the front as the first feature vector of the first text.

Optionally, the acquisition module is configured to segment the first text into words and then perform garbage removal to form the word sequence before sorting.

In conclusion the identification method and identity device of text according to an embodiment of the present invention pass through according to multiple mark post texts This is that text to be identified produces multiple characteristic fingerprints to identify, to increase the identifiability of the text, is greatly reduced The space size of text.

The above description is only an overview of the technical scheme of the present invention, in order to better understand the technical means of the present invention, And it can be implemented in accordance with the contents of the specification, and in order to allow above and other objects of the present invention, feature and advantage can It is clearer and more comprehensible, the followings are specific embodiments of the present invention.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numbers denote the same parts. In the drawings:

Fig. 1 is a flow chart of the steps of a text identification method according to an embodiment of the present invention;

Fig. 2 is a flow chart of the steps of obtaining a characteristic fingerprint according to an embodiment of the present invention;

Fig. 3 is a structural schematic diagram of a text identification device according to an embodiment of the present invention.

Specific embodiment

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope fully conveyed to those skilled in the art.

Referring to Fig. 1, a flow chart of a text identification method according to an embodiment of the present invention is shown. As shown in the figure, the method comprises the following steps:

11. Select the first text to be identified.

Once the first text to be identified has been determined, its first feature vector can be obtained.

In general, the first text is first segmented into words, yielding multiple words. The words obtained from segmentation may still include garbage. The words are then arranged in descending order of the frequency with which they occur in the text, and a preset number of the top-ranked words is taken as the first feature vector of the first text.

Further, garbage occurring in the text can be removed, such as the Chinese structural particles "的", "地", and "得". Garbage can be divided into punctuation marks and meaningless vocabulary, such as the structural particles and function words of Chinese. Such words occur frequently in the text but usually carry no practical meaning, so they need to be ignored when generating the feature vector. That is, the first text is segmented into words and then subjected to garbage removal to form the word sequence before sorting.

Optionally, the words obtained after garbage removal can be used as the feature vector of a news item. Alternatively, representative words can be extracted from the words obtained after garbage removal to form the feature vector of the news item.

For example, for a news report web page, word segmentation and garbage removal yield a word sequence S = (s₁, s₂, s₃, ..., s_N), where s₁, s₂, s₃, and so on denote the words remaining after segmentation and garbage removal.

The same word may occur more than once in the word sequence S, so word-frequency statistics can be computed for the words in the sequence. The words are then arranged in descending order of number of occurrences, and a preset number of words is taken from the front as the feature vector of the news text.

It will be appreciated that the elements of the feature vector can be extracted from one or more of the following: the text title, the text summary, and the text body.
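The extraction of the first feature vector described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the garbage list `GARBAGE`, the whitespace tokenization, and the function name `first_feature_vector` are assumptions introduced here, and a real system for Chinese text would use a proper word segmenter.

```python
from collections import Counter

# Hypothetical garbage list: a few punctuation marks and function words.
GARBAGE = {",", ".", "the", "of", "a"}

def first_feature_vector(words, preset_quantity):
    """Return the `preset_quantity` most frequent non-garbage words."""
    # Garbage removal yields the word sequence S.
    sequence = [w for w in words if w not in GARBAGE]
    # Word-frequency statistics over S.
    counts = Counter(sequence)
    # most_common sorts by count, high to low; ties keep first-seen order.
    return [word for word, _ in counts.most_common(preset_quantity)]

# Whitespace splitting stands in for word segmentation (assumption).
words = "the exam of the school district , the exam result".split()
vector = first_feature_vector(words, 2)
```

Here "exam" occurs twice after garbage removal and every other word once, so the two-element vector starts with "exam".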

12. Determine multiple characteristic fingerprints of the first text according to multiple benchmark texts, respectively.

The first text obtains one characteristic fingerprint for each benchmark text, so there are as many characteristic fingerprints as there are benchmark texts.

The steps by which the first text obtains a characteristic fingerprint according to a benchmark text are as follows:

S121: obtain the first feature vector of the first text;

S122: determine, according to the benchmark text, the weight of each element in the first feature vector of the first text;

In one embodiment of the present invention, the weight can be determined as follows:

Term frequency (TF) denotes the frequency with which a word Ti occurs in a document Dj. The more often Ti occurs, the higher TFi is, which indicates that the word is more important to the document as a whole. For example, in a document Dj about the transition from primary school to junior middle school, the frequency TFi of the term "primary-to-junior-middle-school" is relatively high.

That is, the weight of each element in the feature vector is determined according to the term frequency of each word in the feature vector.

In another embodiment of the present invention, the weight can be determined as follows:

Document frequency (DF) denotes the number of documents that contain a given word Ti. The more documents contain Ti, that is, the larger DFi is, the less useful Ti is for distinguishing different documents, and Ti is then a non-key word.

Inverse document frequency (IDF) is inversely related to document frequency (DF). For example, but not limited to this, for a given word one can set IDFi = log(N / DFi), where N is the total number of documents. If a word occurs in only one document, that is, DFi is 1, then IDFi is log N, and the word then discriminates most strongly between documents.

That is, the weight of each element in the feature vector is determined according to the inverse document frequency of each word in the feature vector.
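As a small worked illustration of IDFi = log(N / DFi), the sketch below computes the inverse document frequency over a toy corpus. The function name, the use of the natural logarithm, and the representation of each document as a set of words are assumptions made for this example.

```python
import math

def inverse_document_frequency(word, documents):
    """IDF = log(N / DF), where DF counts the documents containing `word`."""
    n = len(documents)
    df = sum(1 for doc in documents if word in doc)
    if df == 0:
        # The formula is undefined for a word that occurs nowhere.
        raise ValueError("word does not occur in any document")
    return math.log(n / df)

# Toy corpus of N = 3 documents, each represented as a set of words.
documents = [
    {"exam", "school"},
    {"exam", "district"},
    {"exam", "class"},
]
idf_rare = inverse_document_frequency("school", documents)    # DF = 1
idf_common = inverse_document_frequency("exam", documents)    # DF = 3
```

A word found in a single document reaches the maximum value log N, while a word found in every document gets an IDF of zero and does nothing to distinguish documents.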

In yet another embodiment of the invention, the weight can be determined as follows:

The weight of each element in the feature vector is determined according to the term frequency and the inverse document frequency of each word in the feature vector. For example, but not limited to this, the product of TF and IDF can be used as a parameter to determine the weight of each element in the feature vector.
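The TF and IDF product can be sketched as below. The normalization of TF by document length, the zero weight for words absent from the corpus, and all names are assumptions for illustration; the text itself only states that the product can serve as a parameter for the weight.

```python
import math
from collections import Counter

def tf_idf_weights(feature_vector, document_words, corpus):
    """Weight each feature-vector element by TF * IDF."""
    counts = Counter(document_words)
    total = len(document_words)
    n = len(corpus)
    weights = {}
    for word in feature_vector:
        # TF normalized by document length (assumption).
        tf = counts[word] / total
        df = sum(1 for doc in corpus if word in doc)
        # Words absent from the corpus get weight 0 (assumption).
        idf = math.log(n / df) if df else 0.0
        weights[word] = tf * idf
    return weights

doc = ["exam", "exam", "school", "class"]
corpus = [set(doc), {"exam", "district"}, {"exam", "teacher"}, {"class", "teacher"}]
weights = tf_idf_weights(["exam", "school", "class"], doc, corpus)
```

In this toy corpus "exam" is the most frequent word in the document but also appears in three of four documents, so its weight ends up below that of the rarer "school".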

In a specific embodiment of the invention, the weight can be determined as follows:

A word that appears in different positions, such as the title, the text summary, or the text body, differs in importance and in how well it represents the text. Therefore, the weight of each element in the feature vector can be determined according to its position in the text; the position can include, but is not limited to, the text title, the text summary, and the text body.

Alternatively, the weight of each element in the feature vector is determined according to the word's position in the text together with its term frequency and/or inverse document frequency.
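A position-based weighting might look like the following sketch. The numeric factors in `POSITION_FACTOR` and the rule of taking the most prominent position are hypothetical, since the text gives no concrete values; the result could then be multiplied by TF and/or IDF.

```python
# Hypothetical position factors; the text does not give numeric values.
POSITION_FACTOR = {"title": 3.0, "summary": 2.0, "body": 1.0}

def position_weights(feature_vector, sections):
    """Weight each element by the most prominent section it appears in."""
    weights = {}
    for word in feature_vector:
        factors = [POSITION_FACTOR[name]
                   for name, words in sections.items() if word in words]
        # A word found in no section gets weight 0.
        weights[word] = max(factors, default=0.0)
    return weights

sections = {
    "title": ["school", "exam"],
    "summary": ["exam", "district"],
    "body": ["exam", "district", "class"],
}
weights = position_weights(["school", "district", "class", "teacher"], sections)
```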

S123: obtain the characteristic fingerprint of the first text according to the weights.

Specifically, a second feature vector of the first text is established from the first feature vector according to the weights, and the characteristic fingerprint of the first text is generated according to the second feature vector.

That is, the number of occurrences of each element in the newly generated second feature vector reflects the weight of that element in identifying the first text. For example, the more a word contributes to identifying the text, the larger its weight.

For example, suppose the first feature vector of the first text is (primary-to-junior-middle-school, school district, primary school, examination, class placement) and the benchmark text relates to (kindergarten-to-primary-school, school district, primary school, examination, class placement). The weight of "primary-to-junior-middle-school" in the first text should then be increased accordingly. If the weight of "primary-to-junior-middle-school" is set to 0.4, the weight of "school district" to 0.2, and the others to 0.1, the generated second feature vector is (primary-to-junior-middle-school, primary-to-junior-middle-school, primary-to-junior-middle-school, primary-to-junior-middle-school, school district, school district, primary school, examination, class placement).
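The construction of the second feature vector in this example can be sketched as repeating each element in proportion to its weight. The `unit` parameter (the weight that corresponds to one copy) and the rule that every element appears at least once are assumptions of this sketch, and the English tokens stand in for the segmented words.

```python
def second_feature_vector(first_vector, weights, unit=0.1):
    """Repeat each element round(weight / unit) times, at least once."""
    expanded = []
    for word in first_vector:
        copies = max(1, round(weights[word] / unit))
        expanded.extend([word] * copies)
    return expanded

first_vector = ["primary-to-junior", "district", "primary-school", "exam", "class-placement"]
weights = {
    "primary-to-junior": 0.4,
    "district": 0.2,
    "primary-school": 0.1,
    "exam": 0.1,
    "class-placement": 0.1,
}
expanded = second_feature_vector(first_vector, weights)
```

With weights of 0.4, 0.2, and 0.1, the counts come out as 4, 2, and 1, so the counts keep the same ratios as the weights.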

Further, the characteristic fingerprint of the first text can also be generated according to the second feature vector, based on the distance between the first text and the benchmark text. For example, the distance between the first text and the benchmark text is determined by a min-hash operation. Besides the min-hash operation, other algorithms can also be used to obtain the distance.
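One way a min-hash distance between two weighted feature vectors could be computed is sketched below. The text does not specify the hash family, the signature length, or how repeated elements are handled; here SHA-1 with an integer seed stands in for a family of hash functions, and the k-th copy of a word is tagged as (word, k) so that repetition changes the set being hashed. All names are hypothetical, and both vectors are assumed to be non-empty.

```python
import hashlib

def _hash(seed, item):
    """Seeded 64-bit hash; the seed selects one member of the hash family."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def occurrences(vector):
    """Tag the k-th copy of a word as (word, k) so repetition affects the set."""
    seen = {}
    tagged = set()
    for word in vector:
        seen[word] = seen.get(word, 0) + 1
        tagged.add((word, seen[word]))
    return tagged

def minhash_signature(vector, num_hashes=64):
    """Keep the minimum hash value of the set under each hash function."""
    items = occurrences(vector)
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def minhash_distance(vector_a, vector_b, num_hashes=64):
    """Estimate 1 - Jaccard similarity from the fraction of matching minima."""
    sig_a = minhash_signature(vector_a, num_hashes)
    sig_b = minhash_signature(vector_b, num_hashes)
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / num_hashes

a = ["exam", "exam", "district"]
b = ["football", "league"]
c = ["exam", "district"]
d_same = minhash_distance(a, a)
d_disjoint = minhash_distance(a, b)
d_partial = minhash_distance(a, c)
```

Identical vectors give a distance of 0, vectors with no shared elements give 1, and partially overlapping vectors fall in between, with accuracy improving as `num_hashes` grows.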

13. Identify the first text according to the multiple characteristic fingerprints.

Through the above steps, multiple characteristic fingerprints of the first text relative to multiple different benchmark texts are obtained, and the first text is identified with these characteristic fingerprints, which increases the identifiability of the text and greatly reduces the space needed to identify it.
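Steps 11 to 13 taken together can be sketched as computing one small value per benchmark text and using the resulting tuple as the identifier. In this sketch an exact Jaccard distance stands in for the min-hash estimate, and the quantization into `levels` integer steps is an assumption made here to show how each fingerprint could stay compact; neither choice is prescribed by the text.

```python
def jaccard_distance(a, b):
    """Exact 1 - |A ∩ B| / |A ∪ B|, used here in place of min-hash."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

def identify(first_vector, benchmark_vectors, levels=16):
    """Return one small integer fingerprint per benchmark text."""
    return tuple(
        round(jaccard_distance(first_vector, bench) * (levels - 1))
        for bench in benchmark_vectors
    )

first = ["exam", "district", "class"]
benchmarks = [
    ["exam", "district", "class"],    # identical words: distance 0
    ["exam", "teacher", "holiday"],   # one shared word out of five
    ["football", "league"],           # no shared words: distance 1
]
label = identify(first, benchmarks)
```

Each benchmark text contributes one coordinate, so two similar texts that happen to share a fingerprint against one benchmark can still be told apart by their fingerprints against the others.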

Fig. 3 shows a structural schematic diagram of a text identification device according to an embodiment of the present invention.

In Fig. 3, the text identification device 30 includes a selection module 31, a determining module 32, and an identification module 33. The selection module 31 is configured to select the first text to be identified; the determining module 32 determines multiple characteristic fingerprints of the first text according to multiple benchmark texts, respectively; and the identification module 33 identifies the first text according to the multiple characteristic fingerprints.
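The three-module structure of Fig. 3 could be mirrored in software roughly as follows. The class name, the method names, and the injected `fingerprint` function are hypothetical; the toy `shared_words` function only stands in for whatever fingerprint computation the determining module actually performs.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[str]

class TextIdentificationDevice:
    """Sketch of device 30 with modules 31, 32, and 33 as methods."""

    def __init__(self, benchmarks: Sequence[Vector],
                 fingerprint: Callable[[Vector, Vector], int]):
        self.benchmarks = list(benchmarks)
        self.fingerprint = fingerprint

    def select(self, texts: Sequence[Vector], index: int) -> Vector:
        """Module 31: select the first text to be identified."""
        return texts[index]

    def determine(self, text: Vector) -> List[int]:
        """Module 32: one fingerprint per benchmark text."""
        return [self.fingerprint(text, b) for b in self.benchmarks]

    def mark(self, fingerprints: List[int]) -> Tuple[int, ...]:
        """Module 33: use the fingerprints together as the identifier."""
        return tuple(fingerprints)

def shared_words(text: Vector, benchmark: Vector) -> int:
    # Toy stand-in fingerprint: number of words shared with the benchmark.
    return len(set(text) & set(benchmark))

device = TextIdentificationDevice([["exam", "class"], ["football"]], shared_words)
text = device.select([["exam", "class", "district"]], 0)
identifier = device.mark(device.determine(text))
```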

The determining module 32 is specifically configured to:

obtain the first feature vector of the first text; determine, according to the benchmark text, the weight of each element in the first feature vector of the first text; and obtain the characteristic fingerprint of the first text according to the weights.

For example, a second feature vector of the first text is established from the first feature vector according to the weights, and the characteristic fingerprint of the first text is generated according to the second feature vector.

Specifically, the characteristic fingerprint of the first text is generated according to the second feature vector, based on the distance between the first text and the benchmark text.

For example, the distance between the first text and the benchmark text can be determined by a min-hash operation.

Further, the numbers of occurrences of the elements in the second feature vector satisfy the multiple relationship between the weights of those elements.

The acquisition module 31 is specifically configured to:

arrange the words of a word sequence in descending order of frequency of occurrence, and take a preset number of words from the front as the first feature vector of the first text. Further, the first text is segmented into words and then subjected to garbage removal to form the word sequence before sorting.

The feature vector described here can be extracted from one or more of the following: the text title, the text summary, and the text body.

As above, in an embodiment of the present invention, the weight can be determined using the methods already described for the identification method.

In conclusion the identity device of text according to an embodiment of the present invention is by being to be identified according to multiple mark post texts Text produce multiple characteristic fingerprints to identify, to increase the identifiability of the text, greatly reduced the space of text Size.

The various component embodiments of the invention can be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components of the device for determining news recommendation weight based on comments according to embodiments of the present invention. The invention can also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described here. Such a program implementing the invention can be stored on a computer-readable medium, or can take the form of one or more signals. Such a signal can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

Numerous specific details are set forth in the description provided here. It is to be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail, so as not to obscure the understanding of this description.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order. These words can be interpreted as names.

Furthermore, it should be noted that the language used in this specification has been selected primarily for readability and instructional purposes, rather than to explain or define the subject matter of the invention. Therefore, many modifications and changes will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims

1. A text identification method, comprising:

selecting a first text to be identified;

obtaining, according to multiple benchmark texts, one characteristic fingerprint of the first text for each benchmark text, thereby determining multiple characteristic fingerprints of the first text, respectively; and identifying the first text according to the multiple characteristic fingerprints.

2. The text identification method according to claim 1, wherein a characteristic fingerprint is obtained as follows:

obtaining a first feature vector of the first text;

3. The text identification method according to claim 2, wherein obtaining the characteristic fingerprint of the first text according to the weights comprises:

establishing a second feature vector of the first text from the first feature vector according to the weights;

4. The text identification method according to claim 3, wherein generating the characteristic fingerprint of the first text according to the second feature vector comprises:

generating the characteristic fingerprint of the first text according to the second feature vector, based on the distance between the first text and the benchmark text.

5. The text identification method according to claim 4, wherein the distance between the first text and the benchmark text is determined by a min-hash operation.

6. The text identification method according to claim 5, wherein the numbers of occurrences of the elements in the second feature vector satisfy the multiple relationship between the weights of those elements.

7. The text identification method according to any one of claims 2 to 6, wherein obtaining the first feature vector of the first text comprises:

arranging the words of a word sequence in descending order of frequency of occurrence, and taking a preset number of words from the front as the first feature vector of the first text.

8. The text identification method according to claim 7, wherein the first text is segmented into words and then subjected to garbage removal to form the word sequence before sorting.

9. The text identification method according to any one of claims 2 to 6 or 8, wherein the feature vector is extracted from one or more of the following: the text title, the text summary, and the text body.

10. A text identification device, comprising:

a selection module, configured to select a first text to be identified;

a determining module, configured to obtain, according to multiple benchmark texts, one characteristic fingerprint of the first text for each benchmark text, thereby determining multiple characteristic fingerprints of the first text, respectively;