CN109388714A - Text marking method, apparatus, equipment and computer readable storage medium - Google Patents

Text marking method, apparatus, equipment and computer readable storage medium

Info

Publication number
CN109388714A
Authority
CN
China
Prior art keywords
processed
word
text
label
score
Prior art date
Legal status
Granted
Application number
CN201811240652.7A
Other languages
Chinese (zh)
Other versions
CN109388714B (en)
Inventor
申勇
Current Assignee
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201811240652.7A
Publication of CN109388714A
Application granted
Publication of CN109388714B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates


Abstract

The application provides a text marking method, apparatus, equipment and computer readable storage medium. The method includes: obtaining a text to be processed, the text to be processed including at least one word to be processed; determining, according to a preset weight dictionary, a label and a score of each of the at least one word to be processed, wherein the weight dictionary includes at least one word and each of the at least one word has a label and a score; and determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed. A method is thus provided for automatically labeling the words in a text to be processed and for classifying the text to be processed; large amounts of manual labor are no longer needed for annotation, so labor costs are saved, the cost of text marking is reduced, and annotation efficiency is improved.

Description

Text marking method, apparatus, equipment and computer readable storage medium
Technical field
This application relates to data processing technology, and in particular to a text marking method, apparatus, equipment and computer readable storage medium.
Background art
In the field of machine learning, it is often necessary to annotate the data of texts to be trained or to be recognized, that is, to label the words in the texts and to classify the texts.
In the prior art, data annotation of texts is usually performed manually. Specifically, a person manually inspects the words in a text, determines the label of each word and thereby classifies the words; the person also manually determines the category to which each text belongs, for example whether the text belongs to sports news or entertainment news.
However, in the prior art each text has to be annotated manually. Such an approach requires a large amount of labor, so the cost of text marking is high and the annotation efficiency is low.
Summary of the invention
The application provides a text marking method, apparatus, equipment and computer readable storage medium, so as to solve the problems of high text marking cost and low annotation efficiency.
In a first aspect, the application provides a text marking method, comprising:
obtaining a text to be processed, the text to be processed including at least one word to be processed;
determining, according to a preset weight dictionary, a label and a score of each of the at least one word to be processed, wherein the weight dictionary includes at least one word, and each of the at least one word has a label and a score;
determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
Further, determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed comprises:
determining, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label in the text to be processed;
determining the label with the largest score summation as the category of the text to be processed.
Further, if the number of occurrences of a word to be processed in the text to be processed is greater than 1, determining, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label in the text to be processed comprises:
determining, according to the score and the number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i·y_i, wherein N is the total number of words to be processed under each label (the total number may be the same or different for different labels), x_i is the score of the i-th word to be processed under the label, y_i is the number of occurrences of the i-th word to be processed under the label, i ∈ [1, N], and i is a positive integer;
alternatively, determining, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i;
alternatively, determining, using a nonlinear summation, according to the score and the number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i·ln(y_i + e − 1), where e is the natural constant and ln is the natural logarithm.
Further, determining the label with the largest score summation as the category of the text to be processed comprises:
determining the label with the largest score summation;
if it is determined that the score summation of the label with the largest score summation is within a preset threshold range, determining that the label with the largest score summation is the category of the text to be processed.
Further, before obtaining the text to be processed, the method further includes:
obtaining at least one text to be analyzed, wherein each of the at least one text to be analyzed includes at least one word to be analyzed, and the words to be analyzed are provided with labels;
extracting, according to the words to be analyzed in each text to be analyzed, the keywords of each text to be analyzed to obtain a keyword set, wherein the keyword set includes at least one keyword subset, and each of the at least one keyword subset includes keywords belonging to the same label;
determining the score of each keyword according to a preset set of degree-of-association coefficients, wherein the set includes a degree-of-association coefficient of at least one keyword, and a degree-of-association coefficient characterizes the degree of association between a keyword and a label;
constructing the weight dictionary according to the keyword set and the score of each keyword.
Further, before determining the score of each keyword according to the preset set of degree-of-association coefficients, the method further includes:
counting the word frequency information of the keywords in each keyword subset;
removing the keywords that appear repeatedly in each keyword subset, to obtain each processed keyword subset;
sorting, according to the word frequency information, the keywords in each processed keyword subset, to obtain a sorted keyword set.
Further, after determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed, the method further includes:
obtaining the words to be processed in the text to be processed whose category has been determined, wherein the words to be processed have labels and scores;
updating the weight dictionary according to the words to be processed.
Further, updating the weight dictionary according to the words to be processed comprises:
deleting the words in the weight dictionary that are identical to the words to be processed, to obtain a weight dictionary after deletion processing;
adding the words to be processed into the weight dictionary after deletion processing, to obtain an updated weight dictionary.
Further, deleting the words in the weight dictionary that are identical to the words to be processed, to obtain the weight dictionary after deletion processing, comprises:
determining the word in the weight dictionary that is identical to a word to be processed;
judging whether the score of the word to be processed is greater than the score of the word identical to the word to be processed;
if so, replacing the word identical to the word to be processed with the word to be processed, to obtain an updated weight dictionary.
In a second aspect, the application provides a text marking device, comprising:
a first obtaining module, configured to obtain a text to be processed, the text to be processed including at least one word to be processed;
a first determining module, configured to determine, according to a preset weight dictionary, a label and a score of each of the at least one word to be processed, wherein the weight dictionary includes at least one word, and each of the at least one word has a label and a score;
a second determining module, configured to determine the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
Further, the second determining module comprises:
a first determining submodule, configured to determine, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label in the text to be processed;
a second determining submodule, configured to determine the label with the largest score summation as the category of the text to be processed.
Further, if the number of occurrences of a word to be processed in the text to be processed is greater than 1, the first determining submodule is specifically configured to:
determine, according to the score and the number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i·y_i, wherein N is the total number of words to be processed under each label (the total number may be the same or different for different labels), x_i is the score of the i-th word to be processed under the label, y_i is the number of occurrences of the i-th word to be processed under the label, i ∈ [1, N], and i is a positive integer;
or determine, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i;
or determine, using a nonlinear summation, according to the score and the number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i·ln(y_i + e − 1), where e is the natural constant and ln is the natural logarithm.
Further, the second determining submodule is specifically configured to:
determine the label with the largest score summation;
if it is determined that the score summation of the label with the largest score summation is within a preset threshold range, determine that the label with the largest score summation is the category of the text to be processed.
Further, the device further includes:
a second obtaining module, configured to obtain at least one text to be analyzed before the first obtaining module obtains the text to be processed, wherein each of the at least one text to be analyzed includes at least one word to be analyzed, and the words to be analyzed are provided with labels;
an extraction module, configured to extract, according to the words to be analyzed in each text to be analyzed, the keywords of each text to be analyzed to obtain a keyword set, wherein the keyword set includes at least one keyword subset, and each of the at least one keyword subset includes keywords belonging to the same label;
a third determining module, configured to determine the score of each keyword according to a preset set of degree-of-association coefficients, wherein the set includes a degree-of-association coefficient of at least one keyword, and a degree-of-association coefficient characterizes the degree of association between a keyword and a label;
a construction module, configured to construct the weight dictionary according to the keyword set and the score of each keyword.
Further, the device further includes:
a statistics module, configured to count the word frequency information of the keywords in each keyword subset before the third determining module determines the score of each keyword according to the preset set of degree-of-association coefficients;
a removal module, configured to remove the keywords that appear repeatedly in each keyword subset, to obtain each processed keyword subset;
a sorting module, configured to sort, according to the word frequency information, the keywords in each processed keyword subset, to obtain a sorted keyword set.
Further, the device further includes:
a third obtaining module, configured to obtain, after the second determining module determines the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed, the words to be processed in the text to be processed whose category has been determined, wherein the words to be processed have labels and scores;
an update module, configured to update the weight dictionary according to the words to be processed.
Further, the update module comprises:
a deletion submodule, configured to delete the words in the weight dictionary that are identical to the words to be processed, to obtain a weight dictionary after deletion processing;
an addition submodule, configured to add the words to be processed into the weight dictionary after deletion processing, to obtain an updated weight dictionary.
Further, the deletion submodule is specifically configured to:
determine the word in the weight dictionary that is identical to a word to be processed;
judge whether the score of the word to be processed is greater than the score of the word identical to the word to be processed;
if so, replace the word identical to the word to be processed with the word to be processed, to obtain an updated weight dictionary.
In a third aspect, the application provides a text marking equipment, including units or means for executing each step of any method of the above first aspect.
In a fourth aspect, the application provides a text marking equipment, including a processor, a memory and a computer program, wherein the computer program is stored in the memory and is configured to be executed by the processor to implement any method of the first aspect.
In a fifth aspect, the application provides a text marking equipment, including at least one processing element or chip for executing any method of the above first aspect.
In a sixth aspect, the application provides a computer program which, when executed by a processor, performs any method of the above first aspect.
In a seventh aspect, the application provides a computer readable storage medium on which the computer program of the sixth aspect is stored.
According to the text marking method, apparatus, equipment and computer readable storage medium provided by the application, a text to be processed is obtained, the text to be processed including at least one word to be processed; a label and a score of each of the at least one word to be processed are determined according to a preset weight dictionary, wherein the weight dictionary includes at least one word and each of the at least one word has a label and a score; and the category of the text to be processed is determined according to the labels and scores of the words to be processed in the text to be processed. A method is thus provided for automatically labeling the words in a text to be processed and for classifying the text to be processed; large amounts of manual labor are no longer needed for annotation, so labor costs are saved, the cost of text marking is reduced, and annotation efficiency is improved.
Brief description of the drawings
The drawings herein are incorporated in and constitute a part of this specification; they illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a flow diagram of a text marking method provided by an embodiment of the application;
Fig. 2 is a flow diagram of another text marking method provided by an embodiment of the application;
Fig. 3 is a structural schematic diagram of a text marking device provided by an embodiment of the application;
Fig. 4 is a structural schematic diagram of another text marking device provided by an embodiment of the application;
Fig. 5 is a structural schematic diagram of a text marking equipment provided by an embodiment of the application.
The above drawings show specific embodiments of the application, which are described in more detail hereinafter. These drawings and the written description are not intended to limit the scope of the inventive concept in any way, but to illustrate the concept of the application to those skilled in the art by reference to specific embodiments.
Detailed description of embodiments
Example embodiments are described in detail here, and examples thereof are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following example embodiments do not represent all embodiments consistent with the application; rather, they are merely examples of devices and methods consistent with some aspects of the application as detailed in the appended claims.
Terms involved in the application are explained first:
Text: a text contains at least one word.
TextRank algorithm: TextRank is a prior-art algorithm; it is based on PageRank and is used to generate keywords and summaries for a text.
A specific application scenario of the application is as follows: in the field of machine learning, it is often necessary to annotate texts to be trained or to be recognized, that is, to label the words in the texts and to classify the texts. Data annotation mainly consists of operations such as classifying, framing, annotating and labeling the data in a text according to classification requirements. In the prior art, data annotation of texts is usually performed manually, in some cases with the help of convenient annotation frameworks. However, because in manual annotation every text sample is examined and then labeled by a person, the annotation efficiency is low and a large amount of extra labor is consumed, resulting in a high annotation cost.
The text marking method, apparatus, equipment and computer readable storage medium provided by the application are intended to solve the above technical problems of the prior art.
How the technical solution of the application solves the above technical problems is described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of the application are described below with reference to the drawings.
Fig. 1 is a flow diagram of a text marking method provided by an embodiment of the application. As shown in Fig. 1, the method includes:
Step 101: obtain a text to be processed, the text to be processed including at least one word to be processed.
In this embodiment, specifically, the executing subject of this embodiment may be a control device, a terminal device, a text marking device, or any other apparatus or device capable of executing the method of this embodiment.
The text to be processed is first input into the executing subject of this embodiment. The text to be processed is the text to be annotated; in this application, it is a text that conforms to natural language grammar. The text to be processed contains one or more words to be processed, and the language of the words is not limited.
Step 102: determine, according to a preset weight dictionary, a label and a score of each of the at least one word to be processed, wherein the weight dictionary includes at least one word, and each of the at least one word has a label and a score.
In this embodiment, specifically, a weight dictionary is preset. The weight dictionary contains one or more words; each word is provided with one or more labels, and each word has a score under each of its labels. The domain to which the weight dictionary belongs is the same as the domain of the text to be processed.
For example, the weight dictionary contains word 1, word 2, word 3 and word 4; word 1 has label A, and the score of word 1 under label A is a; word 2 has label B, and the score of word 2 under label B is b; word 3 has label C, and the score of word 3 under label C is c; word 4 has label D, and the score of word 4 under label D is d.
As another example, the weight dictionary contains word 1, word 2, word 3 and word 4; word 1 has labels A and B, with score a1 under label A and score a2 under label B; word 2 has labels B, C and D, with score b1 under label B, score b2 under label C and score b3 under label D; word 3 has labels C and D, with score c1 under label C and score c2 under label D; word 4 has label D, with score d1 under label D. In this case, the label with the largest score can be determined for each word and taken as the final label of the word. For example, word 1 has labels A and B, with score a1 under label A and score a2 under label B; if a1 is greater than a2, the final label of word 1 is label A.
Since the label and score of each word are provided in the weight dictionary, the label and score corresponding to each word to be processed in the text to be processed can be determined according to the weight dictionary.
For example, the weight dictionary contains word 1, word 2, word 3 and word 4; word 1 has label A with score a; word 2 has label B with score b; word 3 has label C with score c; word 4 has label D with score d. If word to be processed 1 in the text to be processed is word 1, then the label of word to be processed 1 is A and its score under label A is a; if word to be processed 2 in the text to be processed is word 3, then the label of word to be processed 2 is C and its score under label C is c.
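As an illustration of this lookup, the following is a minimal sketch in Python; the dictionary contents, the function name and the tokenized input are assumptions for illustration and are not part of the application:

```python
# A hypothetical weight dictionary: word -> list of (label, score) entries.
WEIGHT_DICT = {
    "goal":    [("sports", 3.0)],
    "concert": [("entertainment", 3.0), ("culture", 1.0)],
    "exam":    [("education", 2.0)],
}

def label_words(tokens):
    """Look up the label and score of each word to be processed.

    When a dictionary word carries several labels, the label with the
    largest score is taken as the word's final label, as described above.
    """
    labeled = []
    for token in tokens:
        entries = WEIGHT_DICT.get(token)
        if not entries:
            continue  # words absent from the weight dictionary are skipped
        label, score = max(entries, key=lambda e: e[1])
        labeled.append((token, label, score))
    return labeled

print(label_words(["goal", "concert", "unknown"]))
# [('goal', 'sports', 3.0), ('concert', 'entertainment', 3.0)]
```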
For example, the categories of labels may be: sports labels, entertainment labels, culture labels, travel labels, education labels, and so on.
Step 103: determine the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
Optionally, step 103 specifically includes the following steps:
Step 1031: determine, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label in the text to be processed.
Step 1032: determine the label with the largest score summation as the category of the text to be processed.
Optionally, if the number of occurrences of a word to be processed in the text to be processed is greater than 1, step 1031 specifically includes:
determining, according to the score and the number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i·y_i, wherein N is the total number of words to be processed under each label (which may be the same or different for different labels), x_i is the score of the i-th word to be processed under the label, y_i is the number of occurrences of the i-th word to be processed under the label, i ∈ [1, N], and i is a positive integer;
alternatively, determining, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i;
alternatively, determining, using a nonlinear summation, according to the score and the number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i·ln(y_i + e − 1), where e is the natural constant and ln is the natural logarithm.
Optionally, step 1032 specifically includes: determining the label with the largest score summation; and if it is determined that the score summation of that label is within a preset threshold range, determining that the label with the largest score summation is the category of the text to be processed.
In this embodiment, specifically, each word to be processed in the text to be processed is provided with a label, and each word to be processed is provided with a score under its label; the text to be processed therefore carries at least one label.
First, the words to be processed that belong to the same label are determined; for the words to be processed belonging to the same label, the score summation of those words is calculated from their respective scores, so that each label in the text to be processed is assigned a score summation.
Then the labels in the text to be processed are sorted according to their score summations, for example in descending order of score summation, and the label with the largest score summation is taken as the category of the text to be processed.
Specifically, when determining the score summation of the words to be processed under each label of the text to be processed, this step can be carried out in several implementations, as follows. The scores of the words to be processed under the same label may be the same or different.
First implementation: if the number of occurrences of every word to be processed in the text to be processed is equal to 1, then for each label the sum of the scores of the words to be processed under that label can be calculated directly, giving the score summation of the words to be processed under each label.
For example, the label of word to be processed 1 in the text to be processed is A and its score under label A is a1; the label of word to be processed 2 is A and its score under label A is a2; the label of word to be processed 3 is A and its score under label A is a3; the label of word to be processed 4 is B and its score under label B is b1; the label of word to be processed 5 is B and its score under label B is b2; the label of word to be processed 6 is C and its score under label C is c1; the label of word to be processed 7 is C and its score under label C is c2; the label of word to be processed 8 is C and its score under label C is c3. There are three labels in the text to be processed, namely label A, label B and label C, and the above scores are all different.
According to the first implementation, from the score a1 of word to be processed 1, the score a2 of word to be processed 2 and the score a3 of word to be processed 3 under label A, the score summation of the words to be processed under label A is determined as a1+a2+a3; from the score b1 of word to be processed 4 and the score b2 of word to be processed 5 under label B, the score summation under label B is determined as b1+b2; and from the score c1 of word to be processed 6, the score c2 of word to be processed 7 and the score c3 of word to be processed 8 under label C, the score summation under label C is determined as c1+c2+c3.
Second implementation: as long as at least one word to be processed occurs more than once in the text to be processed, the score summation of the words to be processed under each label of the text to be processed can be calculated by repeated linear summation. Specifically, from the score x_i and the number of occurrences y_i of each word to be processed under each label of the text to be processed, the score summation of the words to be processed under that label is calculated as Σ_{i=1}^{N} x_i·y_i, wherein N is the total number of words to be processed under each label (which may be the same or different for different labels), x_i is the score of the i-th word to be processed under the label, and y_i is the number of occurrences of the i-th word to be processed under the label. It is then judged whether the score summation of the label with the largest score summation is within a preset threshold range; if it is, that label can be taken as the category of the text to be processed; if it is not, the category of the text to be processed is marked manually, and the manually marked category is obtained.
For example, the label of word to be processed 1 in the text to be processed is A, its score under label A is a1 and its number of occurrences is m1; word to be processed 2 has label A, score a2 and m2 occurrences; word to be processed 3 has label A, score a3 and m3 occurrences; word to be processed 4 has label B, score b1 and m4 occurrences; word to be processed 5 has label B, score b2 and m5 occurrences; word to be processed 6 has label C, score c1 and m6 occurrences; word to be processed 7 has label C, score c2 and m7 occurrences; word to be processed 8 has label C, score c3 and m8 occurrences. There are three labels in the text to be processed, namely label A, label B and label C, and the above scores are all different.
According to the second implementation, the score summation of the words to be processed under label A is determined as a1*m1+a2*m2+a3*m3, the score summation under label B as b1*m4+b2*m5, and the score summation under label C as c1*m6+c2*m7+c3*m8.
Third implementation: as long as at least one word to be processed occurs more than once in the text to be processed, the score summation of the words to be processed under each label of the text to be processed can be calculated by single linear summation. Specifically, from the score x_i of each word to be processed under each label of the text to be processed, the score summation of the words to be processed under that label is calculated as Σ_{i=1}^{N} x_i. In this case, no matter how many times a word to be processed occurs, it is counted only once.
For example, following the example above, the score summation of the words to be processed under label A is determined as a1+a2+a3, the score summation under label B as b1+b2, and the score summation under label C as c1+c2+c3.
Fourth implementation: as long as at least one word to be processed occurs more than once in the text to be processed, the score summation of the words to be processed under each label of the text to be processed can be calculated by nonlinear summation. Specifically, from the score x_i and the number of occurrences y_i of each word to be processed under each label of the text to be processed, the score summation of the words to be processed under that label is calculated as Σ_{i=1}^{N} x_i·ln(y_i + e − 1).
For example, following the example above, the score summation of the words to be processed under label A is determined as a1*ln(m1+e−1)+a2*ln(m2+e−1)+a3*ln(m3+e−1), the score summation under label B as b1*ln(m4+e−1)+b2*ln(m5+e−1), and the score summation under label C as c1*ln(m6+e−1)+c2*ln(m7+e−1)+c3*ln(m8+e−1).
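The summation implementations reduce to simple accumulations over the distinct words under each label. A sketch, assuming the labeled words are given as (word, label, score) tuples with one tuple per occurrence (the helper name and data layout are assumptions):

```python
import math
from collections import Counter, defaultdict

def label_score_totals(labeled_words, mode="repeated"):
    """Sum per-label scores of the words to be processed.

    mode = "repeated":  sum of x_i * y_i            (repeated linear summation)
    mode = "single":    sum of x_i                  (each word counted once)
    mode = "nonlinear": sum of x_i * ln(y_i + e-1)  (nonlinear summation)
    """
    counts = Counter(word for word, _, _ in labeled_words)   # y_i per word
    seen, totals = set(), defaultdict(float)
    for word, label, score in labeled_words:
        if word in seen:
            continue                 # each distinct word contributes once
        seen.add(word)
        y = counts[word]
        if mode == "repeated":
            totals[label] += score * y
        elif mode == "single":
            totals[label] += score
        else:  # "nonlinear"
            totals[label] += score * math.log(y + math.e - 1)
    return dict(totals)
```

When every occurrence count equals 1, the "repeated" and "single" modes coincide, which matches the first implementation described above.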
When determining the category of the text to be processed, this step needs to judge whether the score summation of the label with the largest score summation is within a preset threshold range. If it is, that label can be taken as the category of the text to be processed; if it is not, the category of the text to be processed is marked manually, and the manually marked category is then obtained. The preset threshold range may be an interval, or it may simply require that the score summation be greater than a certain value.
For example, following the example above, if the score summation of label C is the largest and is within the preset threshold range, it can be determined that label C is the category of the text to be processed, that is, the category of the text to be processed is C.
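Continuing the sketch, the category decision with the threshold check could look as follows; the threshold value and the None fallback, which signals routing to manual annotation, are illustrative assumptions:

```python
def classify(totals, threshold=5.0):
    """Pick the label with the largest score summation, subject to a threshold.

    Returns the label if its score summation reaches the threshold, otherwise
    None, signalling that the text should be annotated manually instead.
    """
    if not totals:
        return None
    best_label = max(totals, key=totals.get)
    return best_label if totals[best_label] >= threshold else None

category = classify({"sports": 7.5, "entertainment": 3.0})
print(category)  # 'sports'
```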
In this embodiment, a text to be processed is obtained, the text to be processed including at least one word to be processed; a label and a score of each of the at least one word to be processed are determined according to a preset weight dictionary, wherein the weight dictionary includes at least one word and each of the at least one word has a label and a score; and the category of the text to be processed is determined according to the labels and scores of the words to be processed in the text. A method is thus provided for automatically labeling the words in a text to be processed and for classifying the text to be processed. Large amounts of manual labor are no longer needed for annotation, so labor costs are saved, the cost of text marking is reduced, and annotation efficiency is improved.
Fig. 2 is a flow diagram of another text marking method provided by an embodiment of the application. As shown in Fig. 2, the method includes:
Step 201: obtain at least one text to be analyzed, wherein each of the at least one text to be analyzed includes at least one word to be analyzed, and the words to be analyzed are provided with labels.
In this embodiment, specifically, the weight dictionary first needs to be constructed. Specifically, one or more texts to be analyzed are obtained, each text to be analyzed containing one or more words to be analyzed, and a label is set for each word to be analyzed, for example by assigning labels to the words to be analyzed through manual annotation.
For example, text to be analyzed 1 contains word to be analyzed 1, word to be analyzed 2, word to be analyzed 3, word to be analyzed 4 and word to be analyzed 5; the label of word to be analyzed 1 is a sports label, the label of word to be analyzed 2 is an entertainment label, the label of word to be analyzed 3 is an entertainment label, the label of word to be analyzed 4 is an education label, and the label of word to be analyzed 5 is a travel label.
Step 202: extract, according to the words to be analyzed in each text to be analyzed, the keywords of each text to be analyzed, so as to obtain a keyword set, wherein the keyword set includes at least one keyword subset, and each of the at least one keyword subset includes keywords belonging to the same label.
In this embodiment, specifically, each text to be analyzed is processed with the TextRank algorithm. According to the words to be analyzed in each text to be analyzed, the keywords of each text to be analyzed can be extracted, and the keywords of the texts to be analyzed are put into one keyword set. Each keyword has a label.
The keywords belonging to the same label in the keyword set are grouped into one keyword subset, so that one or more keyword subsets are obtained; each keyword subset contains one or more keywords, and the keywords in the same keyword subset have the same label.
For example, the keywords in text to be analyzed 1 are extracted with the TextRank algorithm, yielding keyword 1, keyword 2 and keyword 3, where the label of keyword 1 is A, the label of keyword 2 is B and the label of keyword 3 is C; the keywords in text to be analyzed 2 are extracted with the TextRank algorithm, yielding keyword 4, keyword 5 and keyword 6, where the label of keyword 4 is A, the label of keyword 5 is B and the label of keyword 6 is C. The keyword set then contains keyword subset 1, keyword subset 2 and keyword subset 3: keyword subset 1 contains keyword 1 and keyword 4, keyword subset 2 contains keyword 2 and keyword 5, and keyword subset 3 contains keyword 3 and keyword 6.
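A rough sketch of steps 201–202, under the assumption that a TextRank-style extractor is available; the extract_keywords stand-in below simply returns the most frequent tokens and is not the TextRank algorithm itself, and word_labels represents the manually assigned labels of the analyzed words:

```python
import re
from collections import Counter, defaultdict

def extract_keywords(text, top_n=5):
    """Stand-in for a TextRank-style extractor (illustration only): returns
    the most frequent tokens. A real TextRank implementation would be used
    in practice, and non-space-delimited languages would also need word
    segmentation."""
    tokens = re.findall(r"\w+", text.lower())
    return [w for w, _ in Counter(tokens).most_common(top_n)]

def build_keyword_subsets(texts, word_labels):
    """Group extracted keywords by label into keyword subsets.

    word_labels maps an analyzed word to its manually assigned label;
    the result maps each label to its keyword subset (a list of keywords).
    """
    subsets = defaultdict(list)
    for text in texts:
        for kw in extract_keywords(text):
            label = word_labels.get(kw)
            if label is not None:
                subsets[label].append(kw)
    return dict(subsets)
```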
Step 203: count the word frequency information of the keywords in each keyword subset.
In this embodiment, specifically, after keyword extraction is completed, the keywords in each keyword subset are sorted by word frequency. First, for each keyword subset, the word frequency information of the keywords is counted, where the word frequency information is the number of times a keyword occurs.
Step 204: remove the keywords that appear repeatedly in each keyword subset, to obtain each processed keyword subset.
In this embodiment, specifically, for each keyword subset, the keywords that appear repeatedly in the keyword subset are rejected, so that the duplicate keywords are removed.
Step 205: sort, according to the word frequency information, the keywords in each processed keyword subset, to obtain a sorted keyword set.
In this embodiment, specifically, for each processed keyword subset, the keywords are sorted in descending order of their word frequency information, giving each sorted keyword subset; all the sorted keyword subsets constitute the sorted keyword set. By sorting the keywords in descending order of word frequency, the top W keywords can be obtained, so that the more frequently used keywords are preliminarily screened out, where W is a positive integer.
For example, the keyword set contains keyword subset 1 and keyword subset 2; keyword subset 1 corresponds to label A and contains keyword 1, keyword 2, keyword 2, keyword 3, keyword 3 and keyword 3; keyword subset 2 corresponds to label B and contains keyword 4, keyword 4 and keyword 5. For keyword subset 1, it can be counted that the word frequency of keyword 1 is 1, the word frequency of keyword 2 is 2 and the word frequency of keyword 3 is 3; for keyword subset 2, the word frequency of keyword 4 is 2 and the word frequency of keyword 5 is 1. The repeated keyword 2 and keyword 3 in keyword subset 1 are removed, and the repeated keyword 4 in keyword subset 2 is removed. For keyword subset 1, sorting in descending order of word frequency gives the order keyword 3, keyword 2, keyword 1; for keyword subset 2, sorting in descending order of word frequency gives the order keyword 4, keyword 5.
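Steps 203–205 amount to a frequency count, de-duplication and sort within each keyword subset; a minimal sketch with the same {label: keywords} layout as above:

```python
from collections import Counter

def sort_keyword_subsets(subsets):
    """For each keyword subset: count word frequencies, drop duplicates,
    and sort the remaining keywords in descending order of frequency.

    Returns {label: [(keyword, frequency), ...]} sorted per subset.
    """
    sorted_subsets = {}
    for label, keywords in subsets.items():
        freq = Counter(keywords)                 # step 203: word frequency
        deduplicated = freq.keys()               # step 204: remove repeats
        sorted_subsets[label] = sorted(          # step 205: sort by frequency
            ((kw, freq[kw]) for kw in deduplicated),
            key=lambda item: item[1], reverse=True)
    return sorted_subsets

print(sort_keyword_subsets({"A": ["k1", "k2", "k2", "k3", "k3", "k3"]}))
# {'A': [('k3', 3), ('k2', 2), ('k1', 1)]}
```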
Step 206: determine the score of each keyword according to a preset set of degree-of-association coefficients, wherein the set includes a degree-of-association coefficient of at least one keyword, and a degree-of-association coefficient characterizes the degree of association between a keyword and a label.
In this embodiment, specifically, a set of degree-of-association coefficients is established in advance; the set contains a degree-of-association coefficient for each keyword, and the coefficient indicates the degree of association between the keyword and its label. For example, the degree of association between a keyword and the corresponding label can be determined by manually observing the relationship between them.
The score of a keyword can be determined according to the degree-of-association coefficient of the keyword; for example, the larger the degree-of-association coefficient of a keyword, the higher its score.
The scores of keywords can be divided into two or three grades. For example, with two grades, keyword 1 belongs to the first-grade score and keyword 2 belongs to the second-grade score; with three grades, keyword 1 belongs to the first-grade score, keyword 2 belongs to the second-grade score, and keyword 3 belongs to the third-grade score.
For example, for keyword subset 1 in the keyword set, the sorted order of the keywords is keyword 3, keyword 2, keyword 1, and the label of keyword subset 1 is A. It can be determined that the degree-of-association coefficient between keyword 3 and label A is 1, so the score of keyword 3 is a first-grade score; the degree-of-association coefficient between keyword 2 and label A is 2, so the score of keyword 2 is a second-grade score; and the degree-of-association coefficient between keyword 1 and label A is 3, so the score of keyword 1 is a third-grade score. For keyword subset 2, the sorted order of the keywords is keyword 4, keyword 5, and the label of keyword subset 2 is B. It can be determined that the degree-of-association coefficient between keyword 4 and label B is 2, so the score of keyword 4 is a second-grade score; and the degree-of-association coefficient between keyword 5 and label B is 1, so the score of keyword 5 is a first-grade score.
Step 207: construct the weight dictionary according to the keyword set and the score of each keyword.
In this embodiment, specifically, the keywords, the labels of the keywords and the scores of the keywords have been determined through the above steps, so the weight dictionary can be obtained from each keyword, the label of each keyword and the score of each keyword.
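A sketch of step 207, assuming the per-keyword scores have already been derived from the degree-of-association coefficients; the score values and the (keyword, label) keying are illustrative assumptions:

```python
def build_weight_dict(sorted_subsets, scores):
    """Assemble the weight dictionary: word -> list of (label, score).

    sorted_subsets maps label -> [(keyword, frequency), ...];
    scores maps (keyword, label) -> score derived from the
    degree-of-association coefficients.
    """
    weight_dict = {}
    for label, entries in sorted_subsets.items():
        for keyword, _freq in entries:
            weight_dict.setdefault(keyword, []).append(
                (label, scores[(keyword, label)]))
    return weight_dict

subsets = {"A": [("k3", 3), ("k2", 2), ("k1", 1)]}
scores = {("k3", "A"): 3.0, ("k2", "A"): 2.0, ("k1", "A"): 1.0}
print(build_weight_dict(subsets, scores))
# {'k3': [('A', 3.0)], 'k2': [('A', 2.0)], 'k1': [('A', 1.0)]}
```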
Step 208: obtain a text to be processed, the text to be processed including at least one word to be processed.
In this embodiment, specifically, reference may be made to step 101 of Fig. 1, which is not repeated here.
Step 209: determine, according to the preset weight dictionary, a label and a score of each of the at least one word to be processed, wherein the weight dictionary includes at least one word, and each of the at least one word has a label and a score.
In this embodiment, specifically, reference may be made to step 102 of Fig. 1, which is not repeated here.
Step 210: determine the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
In this embodiment, specifically, reference may be made to step 103 of Fig. 1, which is not repeated here.
Step 211: obtain the words to be processed in the text to be processed whose category has been determined, wherein the words to be processed have labels and scores.
In this embodiment, specifically, the texts to which labels have been added can be used to expand the weight dictionary.
First, after steps 209-210 are executed, a text to be processed to which labels and a category have been added is obtained; after steps 209-210 are executed repeatedly, multiple texts to be processed to which labels and categories have been added can be obtained.
The words to be processed in each such text to be processed are obtained; each word to be processed has been assigned a label and a score.
Step 212: update the weight dictionary according to the words to be processed.
Step 212 includes:
Step 2121: delete the words in the weight dictionary that are identical to the words to be processed, to obtain a weight dictionary after deletion processing.
Optionally, step 2121 specifically includes: determining the word in the weight dictionary that is identical to a word to be processed; judging whether the score of the word to be processed is greater than the score of the identical word; and if so, replacing the identical word with the word to be processed, to obtain an updated weight dictionary.
Step 2122: add the words to be processed into the weight dictionary after deletion processing, to obtain an updated weight dictionary.
In this embodiment, specifically, since each word to be processed has been assigned a label and a score, the words to be processed can be added to the weight dictionary.
First, a word to be processed in a text to which labels and a category have been added may be identical to a word in the weight dictionary, so the identical word in the weight dictionary needs to be deleted. Specifically, for each word to be processed in a text to which labels and a category have been added, the identical word in the weight dictionary is located. Since the word in the weight dictionary has a score and the word to be processed also has a score, it can be judged whether the score of the word to be processed is greater than the score of the word in the weight dictionary; if it is, the identical word in the weight dictionary can be directly replaced with the word to be processed; if it is less than or equal, no replacement is needed.
Then, the words to be processed in the texts to which labels and categories have been added are all added into the weight dictionary after deletion processing, and the update of the weight dictionary is thereby completed.
After steps 211-212 are executed repeatedly, the weight dictionary can be updated several times. After N updates, if the number of words that differ between the weight dictionary after the N-th update and the weight dictionary after the (N-1)-th update does not exceed a predetermined number, it is determined that the weight dictionary changes little after the update and does not need to be updated again.
For example, the weight dictionary contains word 1, word 2, word 3 and word 4; word 1 has label A with score a1; word 2 has label B with score b1; word 3 has label C with score c; word 4 has label D with score d. Word to be processed 1, word to be processed 2, word to be processed 5 and word to be processed 6 are obtained; word to be processed 1 has label A and score a2, word to be processed 2 has label B and score b2, word to be processed 5 has label E and score e, and word to be processed 6 has label F and score f. Word to be processed 1 is identical to word 1, and word to be processed 2 is identical to word 2. If it is determined that the score a2 of word to be processed 1 is greater than the score a1 of word 1, word 1 can be replaced with word to be processed 1; if it is determined that the score b2 of word to be processed 2 is not greater than the score b1 of word 2, no replacement is needed. The resulting updated weight dictionary contains word to be processed 1, word 2, word 3, word 4, word to be processed 5 and word to be processed 6, wherein word to be processed 1 has label A and score a2, word 2 has label B and score b1, word 3 has label C and score c, word 4 has label D and score d, word to be processed 5 has label E and score e, and word to be processed 6 has label F and score f.
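A sketch of the update rule in steps 2121–2122, using the same word → (label, score) layout as the earlier sketches: an existing entry is replaced only when the new score is higher, and words not yet present are simply added:

```python
def update_weight_dict(weight_dict, new_words):
    """Update the weight dictionary in place with labeled, scored words
    taken from texts whose category has already been determined.

    new_words is an iterable of (word, label, score) tuples.
    """
    for word, label, score in new_words:
        entries = weight_dict.setdefault(word, [])
        for idx, (old_label, old_score) in enumerate(entries):
            if old_label == label:
                if score > old_score:          # keep the higher score
                    entries[idx] = (label, score)
                break
        else:
            entries.append((label, score))     # word/label not present yet
    return weight_dict

wd = {"w1": [("A", 1.0)], "w2": [("B", 2.0)]}
update_weight_dict(wd, [("w1", "A", 3.0), ("w2", "B", 1.0), ("w5", "E", 2.0)])
print(wd)
# {'w1': [('A', 3.0)], 'w2': [('B', 2.0)], 'w5': [('E', 2.0)]}
```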
In this embodiment, a text to be processed is obtained, the text to be processed including at least one word to be processed; a label and a score of each of the at least one word to be processed are determined according to a preset weight dictionary, wherein the weight dictionary includes at least one word and each of the at least one word has a label and a score; and the category of the text to be processed is determined according to the labels and scores of the words to be processed in the text. A method is thus provided for automatically labeling the words in a text to be processed and for classifying the text to be processed. Large amounts of manual labor are no longer needed for annotation, so labor costs are saved, the cost of text marking is reduced, and annotation efficiency is improved. In addition, the weight dictionary can be iteratively updated according to the words to be processed obtained from the texts whose categories have been determined, which ensures that subsequent text marking maintains good precision.
Fig. 3 is a structural schematic diagram of a text marking device provided by an embodiment of the application. As shown in Fig. 3, the device of this embodiment may include:
a first obtaining module 31, configured to obtain a text to be processed, the text to be processed including at least one word to be processed;
a first determining module 32, configured to determine, according to a preset weight dictionary, a label and a score of each of the at least one word to be processed, wherein the weight dictionary includes at least one word, and each of the at least one word has a label and a score;
a second determining module 33, configured to determine the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
The text marking device of this embodiment can execute the text marking method provided by the embodiments of the application; the implementation principle and the technical effect are similar and are not repeated here.
Fig. 4 is a structural schematic diagram of another text marking device provided by an embodiment of the application. On the basis of the embodiment shown in Fig. 3, as shown in Fig. 4, in the device of this embodiment the second determining module 33 comprises:
a first determining submodule 331, configured to determine, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label in the text to be processed;
a second determining submodule 332, configured to determine the label with the largest score summation as the category of the text to be processed.
The first determining submodule 331 is specifically configured to: determine, according to the score and the number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i·y_i, wherein N is the total number of words to be processed under each label (which may be the same or different for different labels), x_i is the score of the i-th word to be processed under the label, y_i is the number of occurrences of the i-th word to be processed under the label, i ∈ [1, N], and i is a positive integer; or determine, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i; or determine, using a nonlinear summation, according to the score and the number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as Σ_{i=1}^{N} x_i·ln(y_i + e − 1), where e is the natural constant and ln is the natural logarithm.
The second determining submodule 332 is specifically configured to: determine the label with the largest score summation; and if it is determined that the score summation of that label is within a preset threshold range, determine that the label with the largest score summation is the category of the text to be processed.
The device provided in this embodiment further includes:
The second obtaining module 41 is configured to obtain, before the first obtaining module 31 obtains the text to be processed, at least one text to be analyzed, where each of the at least one text to be analyzed includes at least one word to be analyzed, and the words to be analyzed are provided with labels.
The extraction module 42 is configured to extract keywords of each text to be analyzed according to the words to be analyzed in that text, so as to obtain a keyword set, where the keyword set includes at least one keyword subset and each of the at least one keyword subset includes keywords belonging to the same label.
The third determining module 43 is configured to determine the score of each keyword according to a preset set of association-degree coefficients, where the set includes an association-degree coefficient of at least one keyword, and the association-degree coefficient characterizes the degree of association between a keyword and a label.
The construction module 44 is configured to construct the weight dictionary according to the keyword set and the score of each keyword.
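As an illustration of the construction module 44, the sketch below builds a weight dictionary from per-label keyword subsets and a preset table of association-degree coefficients. The application does not prescribe how a score is derived from a coefficient; multiplying a base score by the coefficient is only an assumed choice here, and the names build_weight_dictionary, keyword_subsets and association_coefficients are illustrative.

def build_weight_dictionary(keyword_subsets, association_coefficients, base_score=1.0):
    """Build {word: (label, score)} from per-label keyword subsets.

    keyword_subsets: {label: [keyword, ...]} extracted from the labeled texts.
    association_coefficients: {keyword: coefficient}, preset externally.
    The score is base_score * coefficient, an illustrative choice only.
    """
    weight_dictionary = {}
    for label, keywords in keyword_subsets.items():
        for keyword in keywords:
            coefficient = association_coefficients.get(keyword, 0.5)
            weight_dictionary[keyword] = (label, base_score * coefficient)
    return weight_dictionary

subsets = {"sports": ["goal", "match"], "politics": ["election"]}
coefficients = {"goal": 0.9, "match": 0.6, "election": 0.8}
print(build_weight_dictionary(subsets, coefficients))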
The device provided in this embodiment further includes:
The statistics module 45 is configured to count word-frequency information of the keywords in each keyword subset before the third determining module 43 determines the score of each keyword according to the preset set of association-degree coefficients.
The removal module 46 is configured to remove duplicate keywords from each keyword subset, to obtain each processed keyword subset.
The sorting module 47 is configured to sort the keywords in each processed keyword subset according to the word-frequency information, to obtain a sorted keyword set.
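A minimal sketch of the counting, de-duplication and sorting performed by modules 45-47 follows, assuming keywords are plain Python strings; the helper name preprocess_keyword_subset is illustrative and not part of the application.

from collections import Counter

def preprocess_keyword_subset(keywords):
    """Count word frequencies, drop duplicates, and sort by descending frequency."""
    frequencies = Counter(keywords)            # word-frequency information
    deduplicated = list(frequencies.keys())    # duplicates removed
    deduplicated.sort(key=lambda w: frequencies[w], reverse=True)
    return deduplicated, frequencies

keywords = ["goal", "match", "goal", "goal", "score"]
print(preprocess_keyword_subset(keywords))  # (['goal', 'match', 'score'], Counter({'goal': 3, ...}))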
The device provided in this embodiment further includes:
The third obtaining module 48 is configured to obtain, after the second determining module 33 determines the category of the text to be processed according to the label and score of each word to be processed in the text, the words to be processed in the text whose category has been determined, where each such word has a label and a score.
The updating module 49 is configured to update the weight dictionary according to the words to be processed.
The updating module 49 includes:
The deletion submodule 491 is configured to delete, from the weight dictionary, words identical to the words to be processed, to obtain a weight dictionary after deletion.
The addition submodule 492 is configured to add the words to be processed to the weight dictionary after deletion, to obtain an updated weight dictionary.
The deletion submodule 491 is specifically configured to: determine, in the weight dictionary, a word identical to a word to be processed; judge whether the score of the word to be processed is greater than the score of the identical word; and if so, replace the identical word with the word to be processed, to obtain an updated weight dictionary.
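The update logic of submodules 491 and 492 can be illustrated as follows, again assuming the dictionary is a dict of word to (label, score) pairs; the function name update_weight_dictionary is illustrative.

def update_weight_dictionary(weight_dictionary, processed_words):
    """Merge newly labeled words into the dictionary, keeping the higher-scoring entry.

    processed_words: iterable of (word, label, score) taken from a text whose
    category has already been determined.
    """
    for word, label, score in processed_words:
        existing = weight_dictionary.get(word)
        if existing is None or score > existing[1]:
            weight_dictionary[word] = (label, score)   # replace or add
    return weight_dictionary

dictionary = {"goal": ("sports", 0.9)}
print(update_weight_dictionary(dictionary, [("goal", "sports", 0.95), ("referee", "sports", 0.7)]))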
The text labeling device of this embodiment can perform another text labeling method provided by the embodiments of the present application; the implementation principle and technical effect are similar and are not described again here.
Fig. 5 is a structural schematic diagram of a text labeling equipment provided by the embodiments of the present application. As shown in Fig. 5, the embodiment of the present application provides a text labeling equipment that can be used to perform the actions or steps of the text labeling equipment in the embodiments shown in Fig. 1 or Fig. 2, and that specifically includes: a processor 2701, a memory 2702 and a communication interface 2703.
The memory 2702 is configured to store a computer program.
The processor 2701 is configured to execute the computer program stored in the memory 2702 to implement the actions of the text labeling equipment in the embodiments shown in Fig. 1 or Fig. 2, which are not repeated here.
Optionally, the text labeling equipment may further include a bus 2704. The processor 2701, the memory 2702 and the communication interface 2703 may be connected to one another through the bus 2704. The bus 2704 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 2704 may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation only one thick line is shown in Fig. 5, but this does not mean that there is only one bus or only one type of bus.
In the embodiments of the present application, the above embodiments may be referred to and learned from one another, and identical or similar steps and terms are not repeated one by one.
Alternatively, some or all of the above modules may be embedded, in the form of an integrated circuit, in a chip of the text labeling equipment. They may be implemented separately or integrated together. That is, the above modules may be configured as one or more integrated circuits implementing the above methods, for example: one or more Application Specific Integrated Circuits (ASIC), one or more Digital Signal Processors (DSP), or one or more Field Programmable Gate Arrays (FPGA).
In an exemplary embodiment, a non-transitory computer readable storage medium including instructions is further provided, for example the memory 2702 including instructions, where the instructions can be executed by the processor 2701 of the above text labeling equipment to complete the above methods. For example, the non-transitory computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer readable storage medium is provided such that, when the instructions in the storage medium are executed by the processor of the text labeling equipment, the text labeling equipment is able to perform the above text labeling methods.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, text labeling equipment, or data center to another website, computer, text labeling equipment, or data center in a wired manner (for example, coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, microwave). The computer readable storage medium may be any usable medium accessible by a computer, or a data storage device such as a text labeling equipment or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), or the like.
Those skilled in the art will appreciate that, in the one or more examples above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any usable medium accessible by a general-purpose or special-purpose computer.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (10)

1. A text labeling method, characterized by comprising:
obtaining a text to be processed, wherein the text to be processed includes at least one word to be processed;
determining, according to a preset weight dictionary, a label and a score of each of the at least one word to be processed, wherein the weight dictionary includes at least one word and each of the at least one word has a label and a score;
determining a category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
2. The method according to claim 1, wherein determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed comprises:
determining, according to the score of each word to be processed in the text to be processed, a score summation of the words to be processed under each label in the text to be processed;
determining the label with the largest score summation as the category of the text to be processed.
3. The method according to claim 2, wherein, if the number of occurrences of a word to be processed in the text to be processed is greater than 1, determining, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label in the text to be processed comprises:
determining, according to the score and the number of occurrences of each word to be processed in the text to be processed, the score summation S of the words to be processed under each label in the text to be processed as S = Σ_{i=1}^{N} x_i·y_i, wherein N is the total number of words to be processed under each label, the total number of words to be processed under different labels is the same or different, x_i is the score of the i-th word to be processed under each label, y_i is the number of occurrences of the i-th word to be processed under each label, i ∈ [1, N], and i is a positive integer;
or determining, using a non-linear summation method and according to the score and the number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label in the text to be processed, wherein e is the natural constant and ln is the natural logarithm.
4. The method according to claim 2, wherein determining the label with the largest score summation as the category of the text to be processed comprises:
determining the label with the largest score summation;
and, if the score summation of the label with the largest score summation is within a preset threshold range, determining the label with the largest score summation as the category of the text to be processed.
5. The method according to claim 1, wherein, before obtaining the text to be processed, the method further comprises:
obtaining at least one text to be analyzed, wherein each of the at least one text to be analyzed includes at least one word to be analyzed, and the words to be analyzed are provided with labels;
extracting keywords of each text to be analyzed according to the words to be analyzed in that text, to obtain a keyword set, wherein the keyword set includes at least one keyword subset and each of the at least one keyword subset includes keywords belonging to the same label;
determining a score of each keyword according to a preset set of association-degree coefficients, wherein the set of association-degree coefficients includes an association-degree coefficient of at least one keyword, and the association-degree coefficient characterizes the degree of association between a keyword and a label;
constructing the weight dictionary according to the keyword set and the score of each keyword.
6. The method according to claim 5, wherein, before determining the score of each keyword according to the preset set of association-degree coefficients, the method further comprises:
counting word-frequency information of the keywords in each keyword subset;
removing duplicate keywords from each keyword subset, to obtain each processed keyword subset;
sorting the keywords in each processed keyword subset according to the word-frequency information, to obtain a sorted keyword set.
7. The method according to any one of claims 1-6, wherein, after determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed, the method further comprises:
obtaining the words to be processed in the text to be processed whose category has been determined, wherein the words to be processed have labels and scores;
updating the weight dictionary according to the words to be processed.
8. A text labeling device, characterized by comprising:
a first obtaining module, configured to obtain a text to be processed, wherein the text to be processed includes at least one word to be processed;
a first determining module, configured to determine, according to a preset weight dictionary, a label and a score of each of the at least one word to be processed, wherein the weight dictionary includes at least one word and each of the at least one word has a label and a score;
a second determining module, configured to determine a category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
9. A text labeling equipment, characterized by comprising: a processor, a memory, and a computer program;
wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method according to any one of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, wherein the computer program is executed by a processor to implement the method according to any one of claims 1-7.
CN201811240652.7A 2018-10-23 2018-10-23 Text labeling method, device, equipment and computer readable storage medium Active CN109388714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811240652.7A CN109388714B (en) 2018-10-23 2018-10-23 Text labeling method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN109388714A (en) 2019-02-26
CN109388714B CN109388714B (en) 2020-11-24

Family

ID=65427659

Country Status (1)

Country Link
CN (1) CN109388714B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120303661A1 (en) * 2011-05-27 2012-11-29 International Business Machines Corporation Systems and methods for information extraction using contextual pattern discovery
CN103164471A (en) * 2011-12-15 2013-06-19 盛乐信息技术(上海)有限公司 Recommendation method and system of video text labels
CN108628875A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of extracting method of text label, device and server
CN107807920A (en) * 2017-11-17 2018-03-16 新华网股份有限公司 Construction method, device and the server of mood dictionary based on big data
CN108415959A (en) * 2018-02-06 2018-08-17 北京捷通华声科技股份有限公司 A kind of file classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant