CN103885989B

CN103885989B - Method and device for estimating the document frequency of neologisms

Info

Publication number: CN103885989B
Application number: CN201210566103.5A
Authority: CN
Inventors: 蔡兵
Original assignee: Tencent Technology Wuhan Co Ltd
Current assignee: Tencent Technology Wuhan Co Ltd
Priority date: 2012-12-24
Filing date: 2012-12-24
Publication date: 2017-12-01
Anticipated expiration: 2032-12-24
Also published as: CN103885989A

Abstract

The present invention discloses a method and device for estimating the document frequency of neologisms. The method includes: obtaining a first document set and a second document set, where the documents in the first document set were generated earlier than those in the second document set; counting the document frequency of each preset common word in the first document set and in the second document set; counting the document frequency of each preset neologism in the second document set; obtaining the fit relation between the document frequencies of the preset common words in the first document set and in the second document set; and obtaining the document frequency of each preset neologism in the first document set according to the fit relation and the document frequency of that neologism in the second document set. The invention improves the accuracy of document frequency statistics for neologisms and remedies the large error that traditional statistical methods produce for neologisms. It is also of significance for applying neologisms in technical fields such as feature selection, keyword extraction, and vector space model representation.

Description

Method and device for estimating the document frequency of neologisms

Technical field

The present invention relates to the field of Internet technology, and in particular to a method and device for estimating the document frequency of neologisms.

Background

With the development of Internet technology, neologisms are becoming more numerous and have become an increasingly common phenomenon on the Internet. Neologisms, also called unregistered words (out-of-vocabulary words), are meaningful words that never appeared before and have become popular recently. They generally arise along with hot events and prominent figures, often carry a large amount of information, and are indispensable feature items for techniques such as text classification and keyword extraction. Document frequency (DF), a classical information measure, is also widely used in these related fields, for example in vector space models, feature selection, and feature weighting.

Generally, document frequency refers to the number of documents in a massive document collection in which a word occurs. Traditional methods of computing document frequency are based on statistics over such a collection. The basic approach is to first randomly select a large number of documents (for example, 1,000,000) from the full document collection as a document set, then perform word segmentation on every document in the set and count the number of documents in which each word occurs. The number of documents thus counted is taken as the document frequency of the word.

For common words, this method based on statistics over a massive document collection gives stable and fairly accurate document frequencies. Neologisms, however, occur only in a small number of highly time-sensitive documents, so the document frequency that this traditional method yields for a neologism has a large error and is typically far below its actual value.

Traditional document frequency computation based on statistics over a massive document set is therefore poorly suited to neologisms, and finding a better method of computing the document frequency of neologisms is particularly important.

Summary of the invention

The main object of the present invention is to provide a method and device for estimating the document frequency of neologisms, with the aim of improving the accuracy of document frequency statistics for neologisms.

To achieve the above object, the present invention proposes a method for estimating the document frequency of neologisms, including:

obtaining a first document set and a second document set, where the documents in the first document set were generated earlier than those in the second document set;

counting the document frequency of each preset common word in the first document set and in the second document set, and counting the document frequency of each preset neologism in the second document set;

obtaining the fit relation between the document frequencies of the preset common words in the first document set and in the second document set; and

obtaining the document frequency of each preset neologism in the first document set according to the fit relation and the document frequency of that neologism in the second document set.

The present invention also proposes a device for estimating the document frequency of neologisms, including:

a document set acquisition module, for obtaining a first document set and a second document set, where the documents in the first document set were generated earlier than those in the second document set;

a statistical module, for counting the document frequency of each preset common word in the first document set and in the second document set, and for counting the document frequency of each preset neologism in the second document set;

a fit relation acquisition module, for obtaining the fit relation between the document frequencies of the preset common words in the first document set and in the second document set; and

a neologism document frequency acquisition module, for obtaining the document frequency of each preset neologism in the first document set according to the fit relation and the document frequency of that neologism in the second document set.

The method and device proposed by the present invention determine a massive document set (the first document set) and a new document set (the second document set), count the document frequencies of common words in both sets, find the relation between these two document frequencies, and finally use the document frequency of a neologism in the new document set to estimate its document frequency in the massive document set. This improves the accuracy of document frequency statistics for neologisms and remedies the large error that traditional statistical methods produce for neologisms. The invention is also of significance for applying neologisms in technical fields such as feature selection, keyword extraction, and vector space model representation.

Brief description of the drawings

Fig. 1 is a flow chart of a preferred embodiment of the method of the present invention for estimating the document frequency of neologisms;

Fig. 2 is an example document frequency scatter plot from the preferred embodiment of the method;

Fig. 3 is a structural diagram of a preferred embodiment of the device of the present invention for estimating the document frequency of neologisms;

Fig. 4 is a structural diagram of the fit relation acquisition module in the preferred embodiment of the device.

To make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.

Detailed description

The solution of the embodiments of the present invention is mainly as follows: a massive document set (the first document set) and a new document set (the second document set) are determined; the document frequencies of common words in the massive document set and in the new document set are counted; the relation between these two document frequencies is found; and finally the document frequency of a neologism in the new document set is used to estimate its document frequency in the massive document set. This improves the accuracy of document frequency statistics for neologisms and remedies the large error that traditional statistical methods produce for neologisms.

As shown in Fig. 1, a preferred embodiment of the present invention proposes a method for estimating the document frequency of neologisms, including:

Step S101: obtain a first document set and a second document set, where the documents in the first document set were generated earlier than those in the second document set.

Because neologisms often occur only in highly time-sensitive pages, and traditional document frequency computation based on statistics over a massive document set therefore has a large error, this embodiment introduces the concept of a new document set and estimates the document frequency of a neologism in the massive document set on the basis of both the massive document set and the new document set.

Specifically, two document collections are first determined: a massive document set A (the first document set in this embodiment) and a new document set B (the second document set in this embodiment), where:

Preferably, the massive document set A contains about 1,000,000 documents in total, selected at random from the full document collection. The documents in A are essentially data from two or more years ago.

The new document set B contains about 50,000 documents in total, which may be crawled from the home pages of major portal websites. The documents in B are essentially data from within the most recent month.

It should be noted that the generation time of the documents in the massive document set A is not limited to two years ago and may, for example, be one year ago. Likewise, the generation time of the documents in the new document set B is not limited to the most recent month and may, for example, be within the most recent half month.
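The selection of the two document sets can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the record shape `(document id, generation date)`, the function name `split_document_sets`, and its parameters are assumptions introduced here. In particular, the sketch filters an in-memory list by date, whereas the embodiment obtains set B by crawling portal home pages.

```python
import random
from datetime import date, timedelta
from typing import List, Tuple

Doc = Tuple[str, date]  # hypothetical record: (document id, generation date)


def split_document_sets(
    docs: List[Doc],
    today: date,
    first_size: int,
    second_size: int,
    old_days: int = 730,   # "two years ago" in the embodiment; adjustable
    new_days: int = 30,    # "most recent month" in the embodiment; adjustable
    seed: int = 0,
) -> Tuple[List[Doc], List[Doc]]:
    """Return (set A, set B): old documents sampled at random, and recent documents."""
    old_cutoff = today - timedelta(days=old_days)
    new_cutoff = today - timedelta(days=new_days)
    old_docs = [d for d in docs if d[1] <= old_cutoff]
    new_docs = [d for d in docs if d[1] >= new_cutoff]
    rng = random.Random(seed)
    # Set A: random sample of old documents, capped at what is available.
    first = rng.sample(old_docs, min(first_size, len(old_docs)))
    # Set B: recent documents, truncated to the requested size.
    second = new_docs[:second_size]
    return first, second


docs = [
    ("a", date(2010, 1, 1)),
    ("b", date(2012, 12, 1)),
    ("c", date(2011, 6, 1)),
]
set_a, set_b = split_document_sets(docs, date(2012, 12, 24), first_size=10, second_size=10)
```

With these example dates, document "c" is neither old enough for A nor recent enough for B, so it falls into neither set.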

Step S102: count the document frequency of each preset common word in the first document set and in the second document set, and count the document frequency of each preset neologism in the second document set.

Here, a preset common word is a word that occurs frequently; about 70,000 common words are currently defined. A preset neologism is a word that has arisen with the development of Internet technology and appears in highly time-sensitive documents. Neologisms generally arise along with hot events and prominent figures and have existed for only a short time.

Let w denote a common word and t a neologism. After the two document sets A and B have been determined, the document frequency of each common word w is counted in A and in B, denoted DF_A_w and DF_B_w respectively. DF_A_w is the true document frequency of the common word w in the massive document set A, and DF_B_w is used later for comparison with the neologisms in the new document set B.

In addition, the document frequency DF_B_t of each neologism t in the new document set B is counted, so that once the fit relation between the document frequencies of the common words in A and in B has been obtained, the document frequency DF_A_t of the neologism in the massive document set A can be derived from DF_B_t.

The document frequencies of the common words w in A and B, and of the neologisms t in B, may be counted as follows:

First, word segmentation is performed on every document in the document set (A or B). Then the number of documents in which each word occurs is counted; this number is the document frequency of the word.
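The counting scheme above can be sketched in Python as follows. The sketch assumes the documents have already been word-segmented into token lists; the function name `document_frequency` and the sample data are illustrative only.

```python
from collections import Counter
from typing import Iterable, List


def document_frequency(tokenized_docs: Iterable[List[str]]) -> Counter:
    """Count, for every word, the number of documents that contain it."""
    df: Counter = Counter()
    for tokens in tokenized_docs:
        # set() ensures a word is counted once per document,
        # no matter how often it repeats inside that document.
        df.update(set(tokens))
    return df


docs_b = [
    ["stock", "market", "falls"],
    ["market", "market", "rally"],
    ["weather", "report"],
]
df_b = document_frequency(docs_b)
```

Here "market" occurs three times in total but in only two documents, so its document frequency is 2. A word that never occurs has a document frequency of 0, which `Counter` returns by default.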

Step S103: obtain the fit relation between the document frequencies of the preset common words in the first document set and in the second document set.

Step S104: obtain the document frequency of each preset neologism in the first document set according to the fit relation and the document frequency of that neologism in the second document set.

In steps S103 and S104, after the document frequency DF_A_w of each common word w in the massive document set A and its document frequency DF_B_w in the new document set B have been obtained, the relation between the document frequencies of the common words in A and in B is analyzed.

First, all common words are sorted in ascending order of their document frequency in the massive document set A, giving a sorted sequence. The sorted sequence is then segmented into groups. Here the segmentation interval is 100: 0-100 is one group, 101-200 is the next group, and so on.

Next, group by group, the average DF_B_w of all common words in each group is calculated. A plot is then drawn with each group's average DF_B_w as the abscissa and the sort value at the center of that group as the ordinate, giving the document frequency fit curve. The document frequency scatter plot obtained from the data of the first 50 groups is shown in Fig. 2.
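The sorting, grouping, and averaging steps can be sketched as follows. The text leaves some details open, so this sketch makes two labeled assumptions: groups are formed from consecutive positions in the sorted sequence (`group_size` words per group), and the "sort value at the center of the group" is taken to be the DF_A value of the word in the middle of the group. The function name `build_fit_points` is hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def build_fit_points(
    df_a: Dict[str, int],
    df_b: Dict[str, int],
    group_size: int = 100,
) -> List[Tuple[float, float]]:
    """Return one (average DF_B, central DF_A) point per group of common words."""
    # Sort common words by DF_A in ascending order; the word itself breaks ties
    # so that the result is deterministic.
    ranked = sorted(df_a, key=lambda w: (df_a[w], w))
    points = []
    for start in range(0, len(ranked), group_size):
        group = ranked[start:start + group_size]
        # Abscissa: average DF_B of the words in this group (0 if absent from B).
        x = mean(df_b.get(w, 0) for w in group)
        # Ordinate (assumption): DF_A of the word at the center of the group.
        y = df_a[group[len(group) // 2]]
        points.append((x, y))
    return points


df_a = {"a": 10, "b": 20, "c": 30, "d": 40}
df_b = {"a": 1, "b": 3, "c": 5, "d": 7}
points = build_fit_points(df_a, df_b, group_size=2)
```

With a group size of 2, the four words form two groups, ("a", "b") and ("c", "d"), which yield the points (2, 20) and (6, 40).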

As the scatter plot in Fig. 2 shows, the document frequencies of common words in the massive document set A and in the new document set B follow a nearly linear fit relation; that is, the document frequencies of a common word in the two document sets A and B are approximately linearly related.
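One way to make this near-linear relation explicit is an ordinary least-squares line through the scatter points. The text does not name a fitting technique, so least squares is an assumption of this sketch, as are the function name `fit_line` and the sample points.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept through (DF_B, DF_A) points."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are needed to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all abscissas are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical scatter points: (average DF_B of a group, DF_A at the group center).
slope, intercept = fit_line([(1, 20), (2, 40), (3, 60)])
```

For these perfectly collinear points the fit recovers a slope of 20 and an intercept of 0. Real data would scatter around the line, and the quality of the fit should be checked before relying on it.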

Since a neologism may eventually become a common word and stabilize, the document frequency DF_B_t of the neologism in the new document set B is taken as the abscissa, and the corresponding ordinate read from the scatter plot of Fig. 2 is the document frequency DF_A_t of the neologism in the massive document set A.
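Reading the ordinate off the curve can be sketched as piecewise-linear interpolation between neighboring scatter points. Interpolation, and clamping to the end points when DF_B_t lies outside the plotted range, are assumptions of this sketch; the text only states that the ordinate is looked up. The function name `lookup_df_a` is hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def lookup_df_a(df_b_t: float, points: List[Tuple[float, float]]) -> float:
    """Estimate DF_A_t by interpolating the (DF_B, DF_A) curve at abscissa DF_B_t."""
    if not points:
        raise ValueError("the fit curve has no points")
    pts = sorted(points)
    xs = [x for x, _ in pts]
    # Assumption: outside the plotted range, clamp to the nearest end point.
    if df_b_t <= xs[0]:
        return float(pts[0][1])
    if df_b_t >= xs[-1]:
        return float(pts[-1][1])
    # First point whose abscissa is >= df_b_t; its left neighbor is strictly smaller.
    i = bisect_left(xs, df_b_t)
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]
    return y0 + (y1 - y0) * (df_b_t - x0) / (x1 - x0)


curve = [(2, 20), (6, 40)]
estimate = lookup_df_a(4, curve)
```

With the two-point curve above, a neologism with DF_B_t = 4 lies halfway between the points and is assigned DF_A_t = 30. Extrapolating along a fitted line instead of clamping is an equally plausible choice for neologisms whose DF_B_t exceeds the plotted range.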

Compared with traditional document frequency computation, which relies solely on statistics over a massive document collection and therefore has a large error, the above scheme of this embodiment improves the accuracy of document frequency statistics for neologisms and so remedies the defect of the traditional statistical method. This embodiment is also of significance for applying neologisms in technical fields such as feature selection, keyword extraction, and vector space model representation.

As shown in Fig. 3, a preferred embodiment of the present invention proposes a device for estimating the document frequency of neologisms, including a document set acquisition module 201, a statistical module 202, a fit relation acquisition module 203, and a neologism document frequency acquisition module 204, where:

the document set acquisition module 201 is for obtaining a first document set and a second document set, where the documents in the first document set were generated earlier than those in the second document set;

the statistical module 202 is for counting the document frequency of each preset common word in the first document set and in the second document set, and for counting the document frequency of each preset neologism in the second document set;

the fit relation acquisition module 203 is for obtaining the fit relation between the document frequencies of the preset common words in the first document set and in the second document set; and

the neologism document frequency acquisition module 204 is for obtaining the document frequency of each preset neologism in the first document set according to the fit relation and the document frequency of that neologism in the second document set.

The document frequency of each preset common word in the first document set and in the second document set is then counted, and the document frequency of each preset neologism in the second document set is counted.

After the document frequency DF_A_w of each common word w in the massive document set A and its document frequency DF_B_w in the new document set B have been obtained, the relation between the document frequencies of the common words in A and in B is analyzed.

In a specific implementation, as shown in Fig. 4, the fit relation acquisition module 203 may include a sorting unit 2031, a segmenting unit 2032, a computing unit 2033, and a plotting unit 2034, where:

the sorting unit 2031 is for sorting all preset common words in ascending order of their document frequency in the first document set to obtain a sorted sequence;

the segmenting unit 2032 is for segmenting the sorted sequence into groups;

the computing unit 2033 is for calculating, for each group, the average document frequency in the second document set of all preset common words in the group; and

the plotting unit 2034 is for plotting, with the average document frequency of each group as the abscissa and the sort value at the center of that group as the ordinate, to obtain the document frequency fit curve.
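How modules 202 to 204 might fit together in software can be sketched as a single Python class. This is an illustration under stated assumptions, not the structure the patent prescribes: the class name `NeologismDfEstimator`, its method names, the rank-based grouping, the use of the central DF_A value as the ordinate, and the least-squares line standing in for the plotted curve are all introduced here.

```python
from collections import Counter
from statistics import mean
from typing import Iterable, List


class NeologismDfEstimator:
    """Hypothetical composition of modules 202 to 204 of the described device."""

    def __init__(self, docs_a: List[List[str]], docs_b: List[List[str]], group_size: int = 100):
        # Statistical module 202: document frequencies in sets A and B.
        self.df_a = self._count(docs_a)
        self.df_b = self._count(docs_b)
        self.group_size = group_size
        self.slope = 0.0
        self.intercept = 0.0

    @staticmethod
    def _count(docs: Iterable[List[str]]) -> Counter:
        df: Counter = Counter()
        for tokens in docs:
            df.update(set(tokens))  # one count per document
        return df

    def fit(self, common_words: Iterable[str]) -> "NeologismDfEstimator":
        # Fit relation module 203: sort (2031), segment (2032), average (2033).
        ranked = sorted(common_words, key=lambda w: (self.df_a[w], w))
        pts = []
        for i in range(0, len(ranked), self.group_size):
            group = ranked[i:i + self.group_size]
            x = mean(self.df_b[w] for w in group)
            y = self.df_a[group[len(group) // 2]]  # assumption: central DF_A value
            pts.append((x, y))
        if len(pts) < 2:
            raise ValueError("at least two groups are needed to fit a line")
        # In place of plotting (2034), fit a least-squares line (assumption).
        n = len(pts)
        mx = sum(x for x, _ in pts) / n
        my = sum(y for _, y in pts) / n
        sxx = sum((x - mx) ** 2 for x, _ in pts)
        if sxx == 0:
            raise ValueError("all groups share one abscissa; the slope is undefined")
        self.slope = sum((x - mx) * (y - my) for x, y in pts) / sxx
        self.intercept = my - self.slope * mx
        return self

    def estimate(self, neologism: str) -> float:
        # Neologism document frequency module 204: map DF_B_t to DF_A_t.
        return self.slope * self.df_b[neologism] + self.intercept


# Toy data in which DF_A is exactly twice DF_B for every common word.
docs_a = [[w for w, n in (("the", 8), ("of", 6), ("in", 4), ("to", 2)) if i < n] for i in range(8)]
docs_b = [[w for w, n in (("the", 4), ("of", 3), ("in", 2), ("to", 1), ("selfie", 2)) if i < n]
          for i in range(4)]
estimator = NeologismDfEstimator(docs_a, docs_b, group_size=1).fit(["the", "of", "in", "to"])
df_a_selfie = estimator.estimate("selfie")
```

In this toy data the neologism "selfie" occurs in 2 documents of B and in none of A, and the fitted relation assigns it an estimated document frequency of 4 in A.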

The method and device of the embodiments of the present invention determine a massive document set (the first document set) and a new document set (the second document set), count the document frequencies of common words in both sets, find the relation between these two document frequencies, and finally use the document frequency of a neologism in the new document set to estimate its document frequency in the massive document set. This improves the accuracy of document frequency statistics for neologisms and remedies the large error that traditional statistical methods produce for neologisms. The invention is also of significance for applying neologisms in technical fields such as feature selection, keyword extraction, and vector space model representation.

The above are only preferred embodiments of the present invention and do not limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of protection of the present invention.

Claims

1. A method for estimating the document frequency of neologisms, characterized by including:

obtaining a first document set and a second document set, where the documents in the first document set were generated earlier than those in the second document set;

counting the document frequency of each preset common word in the first document set and in the second document set, and counting the document frequency of each preset neologism in the second document set;

obtaining the fit relation between the document frequencies of the preset common words in the first document set and in the second document set; and

obtaining the document frequency of the preset neologism in the first document set according to the fit relation and the document frequency of the preset neologism in the second document set.

2. The method according to claim 1, characterized in that the step of obtaining the fit relation between the document frequencies of the preset common words in the first document set and in the second document set includes:

sorting all preset common words in ascending order of their document frequency in the first document set to obtain a sorted sequence;

segmenting the sorted sequence into groups;

calculating, for each group, the average document frequency in the second document set of all preset common words in the group; and

plotting, with the average document frequency of each group as the abscissa and the sort value at the center of that group as the ordinate, to obtain a document frequency fit curve.

3. The method according to claim 2, characterized in that the step of obtaining the document frequency of the preset neologism in the first document set according to the fit relation and the document frequency of the preset neologism in the second document set includes:

taking the document frequency of the preset neologism in the second document set as the abscissa and looking up the corresponding ordinate on the document frequency fit curve, the ordinate being the document frequency of the preset neologism in the first document set.

4. The method according to claim 1, 2 or 3, characterized in that the step of obtaining the first document set and the second document set includes:

randomly selecting a first predetermined number of documents from a given full document collection as the first document set; and crawling a second predetermined number of new documents from the home pages of predetermined portal websites as the second document set; the first predetermined number being greater than the second predetermined number.

5. The method according to claim 4, characterized in that the documents in the first document set were generated at least two years ago, and the documents in the second document set were generated within the last month.

6. A device for estimating the document frequency of neologisms, characterized by including:

a document set acquisition module, for obtaining a first document set and a second document set, where the documents in the first document set were generated earlier than those in the second document set;

a statistical module, for counting the document frequency of each preset common word in the first document set and in the second document set, and for counting the document frequency of each preset neologism in the second document set;

a fit relation acquisition module, for obtaining the fit relation between the document frequencies of the preset common words in the first document set and in the second document set; and

a neologism document frequency acquisition module, for obtaining the document frequency of the preset neologism in the first document set according to the fit relation and the document frequency of the preset neologism in the second document set.

7. The device according to claim 6, characterized in that the fit relation acquisition module includes:

a sorting unit, for sorting all preset common words in ascending order of their document frequency in the first document set to obtain a sorted sequence;

a segmenting unit, for segmenting the sorted sequence into groups;

a computing unit, for calculating, for each group, the average document frequency in the second document set of all preset common words in the group; and

a plotting unit, for plotting, with the average document frequency of each group as the abscissa and the sort value at the center of that group as the ordinate, to obtain a document frequency fit curve.

8. The device according to claim 7, characterized in that the neologism document frequency acquisition module is further configured to take the document frequency of the preset neologism in the second document set as the abscissa and look up the corresponding ordinate on the document frequency fit curve, the ordinate being the document frequency of the preset neologism in the first document set.

9. The device according to claim 6, 7 or 8, characterized in that the document set acquisition module is further configured to randomly select a first predetermined number of documents from a given full document collection as the first document set, and to crawl a second predetermined number of new documents from the home pages of predetermined portal websites as the second document set; the first predetermined number being greater than the second predetermined number.

10. The device according to claim 9, characterized in that the documents in the first document set were generated at least two years ago, and the documents in the second document set were generated within the last month.