CN110110328A

CN110110328A - Text processing method and device

Info

Publication number: CN110110328A
Application number: CN201910346113.XA
Authority: CN
Inventors: 靳彦召
Original assignee: Beijing Zero Seconds Technology Co Ltd
Current assignee: Beijing Zero Seconds Technology Co Ltd
Priority date: 2019-04-26
Filing date: 2019-04-26
Publication date: 2019-08-09
Anticipated expiration: 2039-04-26
Also published as: CN110110328B

Abstract

This application discloses a text processing method and device. The method includes: obtaining a short-text corpus, arranging each short text according to a preset format, and treating all the short texts as one target document; counting the word frequency of each word in the target document and the total word frequency of all words in the target document; and calculating the word weight of each word from the word frequency and the total word frequency. The application addresses the technical problem of poor short-text processing results and allows the key words in short texts to be identified more reliably. In addition, the application is suitable for natural-language text processing scenarios.

Description

Text processing method and device

Technical field

This application relates to the field of text processing, and in particular to a text processing method and device.

Background

In natural language processing, short texts are characterized by short sentences and small vocabularies.

The inventor found that existing processing of short texts performs poorly. In particular, the key words in a short text cannot be identified.

No effective solution has yet been proposed for the problem of poor short-text processing in the related art.

Summary of the invention

The main purpose of this application is to provide a text processing method and device that solve the problem of poor short-text processing.

To achieve the above purpose, according to one aspect of this application, a text processing method is provided.

The text processing method according to this application includes: obtaining a short-text corpus, arranging each short text according to a preset format, and treating all the short texts as one target document; counting the word frequency of each word in the target document and the total word frequency of all words in the target document; and calculating the word weight of the word from the word frequency and the total word frequency.

Further, the method is used to handle the weights of words that occur frequently in short texts but carry no meaning.

Further, obtaining a short-text corpus, arranging each short text according to a preset format, and treating all the short texts as one target document includes: obtaining a short-text corpus, arranging the short texts in a format of one short text per line, and treating all the short texts as one target document.

Further, counting the word frequency of each word in the target document and the total word frequency of all words in the target document includes: counting the word frequency WF of each word in the target document; and counting the total word frequency DF of all words in the target document. Calculating the word weight of the word from the word frequency and the total word frequency includes: calculating the word weight WW = ln(DF/WF).

Further, the words that occur frequently in short texts but carry no meaning include one or more of the following: modal particles, auxiliary words, and pronouns.

To achieve the above purpose, according to another aspect of this application, a text processing device is provided.

The text processing device according to this application includes: an acquisition module, configured to obtain a short-text corpus, arrange each short text according to a preset format, and treat all the short texts as one target document; a statistics module, configured to count the word frequency of each word in the target document and the total word frequency of all words in the target document; and a computing module, configured to calculate the word weight of the word from the word frequency and the total word frequency.

Further, the device is used to handle the weights of words that occur frequently in short texts but carry no meaning.

Further, the acquisition module is configured to obtain a short-text corpus, arrange the short texts in a format of one short text per line, and treat all the short texts as one target document.

Further, the statistics module is configured to count the word frequency WF of each word in the target document and to count the total word frequency DF of all words in the target document. Calculating the word weight of the word from the word frequency and the total word frequency includes: calculating the word weight WW = ln(DF/WF).

Further, the words that occur frequently in short texts but carry no meaning include one or more of the following: modal particles, auxiliary words, and pronouns.

In the text processing method and device of the embodiments of this application, a short-text corpus is obtained, each short text is arranged according to a preset format, and all the short texts are treated as one target document. By counting the word frequency of each word in the target document and the total word frequency of all words in the target document, the word weight of each word is calculated from the word frequency and the total word frequency. This achieves the technical effect of better identifying the key words in short texts and thereby solves the technical problem of poor short-text processing.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the application, so that its other features, objects, and advantages become more apparent. The drawings of the illustrative embodiments and their descriptions serve to explain the application and do not unduly limit it. In the drawings:

Fig. 1 is a schematic flowchart of a text processing method according to one embodiment of this application;

Fig. 2 is a schematic flowchart of a text processing method according to another embodiment of this application;

Fig. 3 is a schematic structural diagram of a text processing device according to an embodiment of this application.

Detailed description of the embodiments

To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without creative effort shall fall within the scope of protection of this application.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

In this application, the orientations or positional relationships indicated by terms such as "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "transverse", and "longitudinal" are based on the orientations or positional relationships shown in the drawings. These terms are used mainly to better describe the application and its embodiments, and are not intended to require that the indicated device, element, or component have a particular orientation or be constructed and operated in a particular orientation.

Moreover, in addition to indicating orientation or positional relationships, some of the above terms may also carry other meanings; for example, the term "upper" may in some cases indicate a dependency or connection relationship. A person of ordinary skill in the art can understand the specific meanings of these terms in this application according to the circumstances.

In addition, the terms "install", "arrange", "provided with", "connect", "connected", and "socketed" shall be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or an internal connection between two devices, elements, or components. A person of ordinary skill in the art can understand the specific meanings of the above terms in this application according to the circumstances.

It should be noted that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.

As shown in Fig. 1, the method includes the following steps S102 to S106:

Step S102: obtain a short-text corpus, arrange each short text according to a preset format, and treat all the short texts as one target document.

The short-text corpus is obtained as the text input. The short-text corpus may be collected in advance.

Arranging according to a preset format means laying out each short text of the short-text corpus in the configured format. At the same time, all the short texts are treated as one target document.

It should be noted that, when all the short texts are treated as one target document, each short text is not processed individually; instead, all the short texts are processed as a single text.

Step S104: count the word frequency of each word in the target document and the total word frequency of all words in the target document.

Within the target document, the word frequency of each word and the total word frequency of all words in the target document are counted.

It should be noted that the embodiments of this application do not specifically limit the method of counting the word frequency of each word in the target document, as long as it meets the word-frequency counting requirement.

It should also be noted that the embodiments of this application do not specifically limit the method of counting the total word frequency of all words in the target document, as long as it meets the requirement of counting the total word frequency.
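Since the counting method is left open, one possible way to obtain both quantities of step S104 is sketched below. It is only an illustration: the function name `count_frequencies` is hypothetical, and the sketch assumes that words in the target document are already separated by whitespace, which the application does not state.

```python
from collections import Counter
from typing import Tuple


def count_frequencies(target_document: str) -> Tuple[Counter, int]:
    """Return (per-word frequency, total word frequency) for one document.

    Assumption: words are whitespace-separated. A real system would
    replace the split() call with a proper word segmenter.
    """
    tokens = target_document.split()   # splits on spaces and line breaks alike
    word_freq = Counter(tokens)        # occurrences of each word
    total_freq = sum(word_freq.values())  # occurrences of all words together
    return word_freq, total_freq


word_freq, total_freq = count_frequencies("the cat sat\nthe dog ran\nthe end")
print(word_freq["the"], total_freq)  # 3 8
```

Because line breaks are treated like any other whitespace, the short texts are counted together as a single text rather than one by one.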

Step S106: calculate the word weight of the word from the word frequency and the total word frequency.

The word weight of the word is calculated from the word frequency and the total word frequency. The resulting weight is used as the word weight of a keyword in the short texts.

From the above description, it can be seen that this application achieves the following technical effects:

In the embodiments of this application, a short-text corpus is obtained, each short text is arranged according to a preset format, and all the short texts are treated as one target document. By counting the word frequency of each word in the target document and the total word frequency of all words in the target document, the word weight of each word is calculated from the word frequency and the total word frequency. This achieves the technical effect of better identifying the key words in short texts and thereby solves the technical problem of poor short-text processing.

According to an embodiment of this application, preferably, the method is used to handle the weights of words that occur frequently in short texts but carry no meaning. The embodiments of this application do not use the concept of the number of documents; taking the natural logarithm of the ratio of the total word frequency to the word frequency effectively solves the problem that some high-frequency but meaningless words receive relatively high weights.

According to an embodiment of this application, preferably, obtaining a short-text corpus, arranging each short text according to a preset format, and treating all the short texts as one target document includes: obtaining a short-text corpus, arranging the short texts in a format of one short text per line, and treating all the short texts as one target document. Specifically, the obtained short-text corpus is merged into one document with one short text on each line. Word segmentation is then performed.
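The merge-then-segment preparation described above might look like the following minimal sketch. Both function names are hypothetical, and the whitespace split is only a stand-in for a real word segmenter, which the application does not name.

```python
from typing import Iterable, List


def build_target_document(short_texts: Iterable[str]) -> str:
    """Merge short texts into one document with one short text per line."""
    # Collapse any internal line breaks or repeated spaces so that each
    # short text occupies exactly one line.
    lines = (" ".join(text.split()) for text in short_texts)
    # Skip short texts that are empty after cleaning.
    return "\n".join(line for line in lines if line)


def segment(target_document: str) -> List[str]:
    """Placeholder segmenter (assumption): split on whitespace."""
    return target_document.split()


document = build_target_document(["hello  world", "", "good\nmorning"])
print(document.splitlines())  # ['hello world', 'good morning']
print(segment(document))      # ['hello', 'world', 'good', 'morning']
```

Segmenting the merged document, rather than each short text separately, keeps the later counts tied to the single target document.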

According to an embodiment of this application, preferably, counting the word frequency of each word in the target document and the total word frequency of all words in the target document includes:

Step S202: count the word frequency WF of each word in the target document;

Step S204: count the total word frequency DF of all words in the target document;

Step S206: calculating the word weight of the word from the word frequency and the total word frequency includes calculating the word weight WW = ln(DF/WF).

Specifically, the word frequency WF of each word and the total word frequency DF of all words in the document are counted, and the word weight is WW = ln(DF/WF). That is, the word weight is calculated by taking the natural logarithm of the ratio of the total word frequency to the word frequency.
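The formula WW = ln(DF/WF) can be tried out on a toy token list. The sketch below is illustrative only: `word_weights` is a hypothetical name and the English tokens are made-up sample data, not taken from the application.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def word_weights(tokens: Iterable[str]) -> Dict[str, float]:
    """Compute WW = ln(DF / WF) for every distinct word in the token list."""
    wf = Counter(tokens)     # WF: frequency of each word
    df = sum(wf.values())    # DF: total frequency of all words
    return {word: math.log(df / count) for word, count in wf.items()}


tokens = "the cat sat the dog ran the end".split()
weights = word_weights(tokens)

# "the" occurs 3 times out of 8, so WW = ln(8/3); "cat" occurs once, so WW = ln(8).
print(weights["the"] < weights["cat"])  # True
```

The more often a word occurs, the closer DF/WF gets to 1 and the closer its weight gets to 0, which is how the formula pushes down frequent words.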

According to an embodiment of this application, preferably, the words that occur frequently in short texts but carry no meaning include one or more of the following: modal particles, auxiliary words, and pronouns.

Specifically, because short texts have short sentences and small vocabularies, a word may occur only once in a given sentence, so traditional word statistics designed for long texts have difficulty finding which word or words are the key words. Based on the idea of TF-IDF, this application adapts the algorithm and approach to obtain a word-weight processing method suited to short texts. In the embodiments of this application, all collected short-text corpora are regarded as one document, which discards the concept of the number of documents in existing TF-IDF and omits the TF calculation, so the amount of computation is smaller. In TF-IDF, IDF stands for inverse document frequency, a value calculated from the total number of documents and the number of documents in which a given word appears. This method has no concept of the number of documents; it takes the natural logarithm of the ratio of the total word frequency to the word frequency. For example, when only a single word appears in the document, its word frequency WF equals the total word frequency DF of all words, so WW = ln(DF/WF) = ln 1 = 0.
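For comparison, the two quantities can be written side by side. The IDF expression below is the common textbook form, not a formula given in this application, and the symbols N (total number of documents) and n_w (number of documents containing word w) are introduced here only for illustration. The counts in the worked example are assumed values.

```latex
% Conventional inverse document frequency: needs a document count
\mathrm{IDF}(w) = \ln\frac{N}{n_w}

% Word weight of this method: one target document, no document count
WW(w) = \ln\frac{DF}{WF(w)}

% Worked example with assumed counts DF = 8 and WF(w) = 3
WW(w) = \ln\frac{8}{3} \approx 0.981

% Limiting case from the text: a single word, so WF(w) = DF
WW(w) = \ln\frac{DF}{DF} = \ln 1 = 0
```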

It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.

According to an embodiment of this application, a text processing device for implementing the above method is also provided. As shown in Fig. 3, the device includes: an acquisition module 10, configured to obtain a short-text corpus, arrange each short text according to a preset format, and treat all the short texts as one target document; a statistics module 20, configured to count the word frequency of each word in the target document and the total word frequency of all words in the target document; and a computing module 30, configured to calculate the word weight of the word from the word frequency and the total word frequency.

The acquisition module 10 of the embodiments of this application obtains the short-text corpus as the text input. The short-text corpus may be collected in advance.

The statistics module 20 of the embodiments of this application counts, within the target document, the word frequency of each word and the total word frequency of all words in the target document.

The computing module 30 of the embodiments of this application calculates the word weight of the word from the word frequency and the total word frequency. The resulting weight is used as the word weight of a keyword in the short texts.
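The division of work among modules 10, 20, and 30 could be mirrored in software roughly as follows. This is a sketch under stated assumptions rather than the structure of the device itself: the class names, the `run` methods, the `process` helper, and the whitespace-based segmentation are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


class AcquisitionModule:
    """Module 10: arrange short texts one per line as a single target document."""

    def run(self, short_texts: Iterable[str]) -> str:
        return "\n".join(" ".join(t.split()) for t in short_texts if t.strip())


class StatisticsModule:
    """Module 20: count per-word frequency WF and total word frequency DF."""

    def run(self, target_document: str) -> Tuple[Counter, int]:
        wf = Counter(target_document.split())  # assumption: whitespace segmentation
        return wf, sum(wf.values())


class ComputingModule:
    """Module 30: compute WW = ln(DF / WF) for each word."""

    def run(self, wf: Counter, df: int) -> Dict[str, float]:
        return {word: math.log(df / count) for word, count in wf.items()}


def process(short_texts: Iterable[str]) -> Dict[str, float]:
    """Chain the three modules in the order acquisition, statistics, computing."""
    document = AcquisitionModule().run(short_texts)
    wf, df = StatisticsModule().run(document)
    return ComputingModule().run(wf, df)


result = process(["a b", "a c", "a"])
print(sorted(result, key=result.get))  # ['a', 'b', 'c']
```

Keeping the three stages separate means the segmenter or the weighting formula can be swapped without touching the other modules.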

According to an embodiment of this application, preferably, the text processing device is used to handle the weights of words that occur frequently in short texts but carry no meaning. The embodiments of this application do not use the concept of the number of documents; taking the natural logarithm of the ratio of the total word frequency to the word frequency effectively solves the problem that some high-frequency but meaningless words receive relatively high weights.

According to an embodiment of this application, preferably, the acquisition module 10 is configured to obtain a short-text corpus, arrange the short texts in a format of one short text per line, and treat all the short texts as one target document. Specifically, the obtained short-text corpus is merged into one document with one short text on each line. Word segmentation is then performed.

According to an embodiment of this application, preferably, the statistics module is configured to:

count the word frequency WF of each word in the target document;

count the total word frequency DF of all words in the target document.

Calculating the word weight of the word from the word frequency and the total word frequency includes:

calculating the word weight WW = ln(DF/WF).

According to an embodiment of this application, preferably, the words that occur frequently in short texts but carry no meaning, which the text processing device is used to handle, include one or more of the following: modal particles, auxiliary words, and pronouns. Specifically, because short texts have short sentences and small vocabularies, a word may occur only once in a given sentence, so traditional word statistics designed for long texts have difficulty finding which word or words are the key words. Based on the idea of TF-IDF, this application adapts the algorithm and approach to obtain a word-weight processing method suited to short texts. In the embodiments of this application, all collected short-text corpora are regarded as one document, which discards the concept of the number of documents in existing TF-IDF and omits the TF calculation, so the amount of computation is smaller. In TF-IDF, IDF stands for inverse document frequency, a value calculated from the total number of documents and the number of documents in which a given word appears. This method has no concept of the number of documents; it takes the natural logarithm of the ratio of the total word frequency to the word frequency. For example, when only a single word appears in the document, its word frequency WF equals the total word frequency DF of all words, so WW = ln(DF/WF) = ln 1 = 0.

Evidently, those skilled in the art should understand that the modules or steps of this application described above may be implemented with general-purpose computing devices; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they may each be made into a separate integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module. Thus, this application is not limited to any particular combination of hardware and software.

The above are only preferred embodiments of this application and are not intended to limit it. For those skilled in the art, various modifications and changes to this application are possible. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of this application shall fall within its scope of protection.

Claims

1. A text processing method, characterized by comprising:

obtaining a short-text corpus, arranging each short text according to a preset format, and treating all the short texts as one target document;

counting the word frequency of each word in the target document and the total word frequency of all words in the target document;

calculating the word weight of the word from the word frequency and the total word frequency.

2. The text processing method according to claim 1, characterized in that the method is used to handle the weights of words that occur frequently in short texts but carry no meaning.

3. The text processing method according to claim 1, characterized in that obtaining a short-text corpus, arranging each short text according to a preset format, and treating all the short texts as one target document includes:

obtaining a short-text corpus, arranging the short texts in a format of one short text per line, and treating all the short texts as one target document.

4. The text processing method according to claim 1, characterized in that counting the word frequency of each word in the target document and the total word frequency of all words in the target document includes:

calculating the word weight WW = ln(DF/WF).

5. The text processing method according to claim 1, characterized in that the words that occur frequently in short texts but carry no meaning include one or more of the following: modal particles, auxiliary words, and pronouns.

6. A text processing device, characterized by comprising:

an acquisition module, configured to obtain a short-text corpus, arrange each short text according to a preset format, and treat all the short texts as one target document;

a statistics module, configured to count the word frequency of each word in the target document and the total word frequency of all words in the target document;

a computing module, configured to calculate the word weight of the word from the word frequency and the total word frequency.

7. The text processing device according to claim 6, characterized in that the device is used to handle the weights of words that occur frequently in short texts but carry no meaning.

8. The text processing device according to claim 6, characterized in that the acquisition module is configured to obtain a short-text corpus, arrange the short texts in a format of one short text per line, and treat all the short texts as one target document.

9. The text processing device according to claim 6, characterized in that the statistics module is configured to:

calculate the word weight WW = ln(DF/WF).

10. The text processing device according to claim 6, characterized in that the words that occur frequently in short texts but carry no meaning include one or more of the following: modal particles, auxiliary words, and pronouns.