CN109388714A - Text marking method, apparatus, equipment and computer readable storage medium - Google Patents
- Publication number
- CN109388714A CN109388714A CN201811240652.7A CN201811240652A CN109388714A CN 109388714 A CN109388714 A CN 109388714A CN 201811240652 A CN201811240652 A CN 201811240652A CN 109388714 A CN109388714 A CN 109388714A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The application provides a text labeling method, apparatus, device and computer-readable storage medium. The method includes: obtaining a text to be processed, the text to be processed containing at least one word to be processed; determining, according to a preset weight dictionary, the label and score of each of the at least one word to be processed, wherein the weight dictionary contains at least one word and each word has a label and a score; and determining the category of the text to be processed according to the label and score of each word to be processed in it. This provides a method that automatically labels the words in a text to be processed and assigns the text to a category, without requiring large amounts of human labor for labeling. Labor cost is thereby saved, the cost of text labeling is reduced, and labeling efficiency is improved.
Description
Technical field
This application relates to data processing technology, and in particular to a text labeling method, apparatus, device and computer-readable storage medium.
Background art
In the field of machine learning, it is often necessary to annotate the data of texts to be used for training or recognition: the words in a text are labeled, and the text is classified.
In the prior art, data annotation of text is usually performed manually. Specifically, a human annotator inspects the words in a text, determines a label for each word, and thereby classifies the words; the annotator also determines the category to which each text belongs, for example whether a text is sports news or entertainment news.
However, because each text must be annotated manually, this approach requires a large amount of labor, making text annotation costly, and annotation efficiency is low.
Summary of the invention
The application provides a text labeling method, apparatus, device and computer-readable storage medium, to solve the problems of high text labeling cost and low labeling efficiency.
In a first aspect, the application provides a text labeling method, comprising:
obtaining a text to be processed, the text to be processed including at least one word to be processed;
determining, according to a preset weight dictionary, the label and score of each of the at least one word to be processed, wherein the weight dictionary includes at least one word, and each of the at least one word has a label and a score;
determining the category of the text to be processed according to the label and score of each word to be processed in the text to be processed.
Further, determining the category of the text to be processed according to the label and score of each word to be processed in the text to be processed comprises:
determining, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label in the text to be processed;
determining the label with the largest score summation to be the category of the text to be processed.
Further, if the number of occurrences of a word to be processed in the text to be processed is greater than 1, determining the score summation of the words to be processed under each label according to the score of each word to be processed comprises:
determining, according to the score and number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label as S = Σ_{i=1}^{N} x_i·y_i, wherein N is the total number of words to be processed under each label (the total number of words to be processed under different labels may be the same or different), x_i is the score of the i-th word to be processed under each label, y_i is the number of occurrences of the i-th word to be processed under each label, i ∈ [1, N], and i is a positive integer;
alternatively, determining, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label as S = Σ_{i=1}^{N} x_i;
alternatively, determining, using a non-linear summation that involves the natural constant e and the natural logarithm ln, the score summation of the words to be processed under each label according to the score and number of occurrences of each word to be processed.
Further, determining the label with the largest score summation to be the category of the text to be processed comprises:
determining the label with the largest score summation;
if the score summation of the label with the largest score summation is within a preset threshold range, determining that label to be the category of the text to be processed.
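The classification procedure of the first aspect (look up each word's label and score in the weight dictionary, sum the scores per label, and take the label with the largest summation subject to a threshold check) can be sketched as follows. The dictionary contents, label names and threshold value are illustrative assumptions, not values from the application.

```python
# Sketch of the first-aspect method: weight-dictionary lookup, per-label
# score summation, and threshold-gated argmax. All data are hypothetical.
from collections import defaultdict

# Hypothetical weight dictionary: word -> (label, score).
WEIGHT_DICT = {
    "goal": ("sports", 0.9),
    "match": ("sports", 0.7),
    "movie": ("entertainment", 0.8),
}

def classify(words, weight_dict, threshold=1.0):
    """Sum the score of each dictionary word per label; return the label
    with the largest summation if it reaches the threshold, else None
    (in the application, such texts fall back to manual labeling)."""
    totals = defaultdict(float)
    for w in words:
        if w in weight_dict:
            label, score = weight_dict[w]
            totals[label] += score
    if not totals:
        return None
    best = max(totals, key=totals.get)
    return best if totals[best] >= threshold else None

print(classify(["goal", "match", "movie"], WEIGHT_DICT))  # "sports": 1.6 beats 0.8
```

Words absent from the dictionary are simply skipped here; the application does not specify how unknown words are handled, so that behavior is an assumption.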
Further, before obtaining the text to be processed, the method further comprises:
obtaining at least one text to be analyzed, wherein each of the at least one text to be analyzed includes at least one word to be analyzed, and the words to be analyzed are provided with labels;
extracting, according to the words to be analyzed in each text to be analyzed, the keywords of each text to be analyzed to obtain a keyword set, wherein the keyword set includes at least one keyword subset, and each keyword subset includes the keywords belonging to the same label;
determining the score of each keyword according to a preset relevance coefficient set, wherein the relevance coefficient set includes a relevance coefficient of at least one keyword, and a relevance coefficient characterizes the degree of relevance between a keyword and a label;
constructing the weight dictionary according to the keyword set and the score of each keyword.
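The dictionary-construction step above can be sketched as follows, under the assumption (not stated explicitly in the application) that a keyword's score is taken directly from its relevance coefficient. The sample subsets and coefficients are hypothetical.

```python
# Sketch of weight-dictionary construction from labeled keyword subsets.
# The relevance coefficients and the rule "score = coefficient" are
# illustrative assumptions.
def build_weight_dict(keyword_subsets, relevance):
    """keyword_subsets: label -> keywords extracted for that label.
    relevance: keyword -> relevance coefficient (degree of relevance
    between the keyword and its label). Returns word -> (label, score)."""
    weight_dict = {}
    for label, keywords in keyword_subsets.items():
        for kw in keywords:
            score = relevance.get(kw, 0.0)  # assumed scoring rule
            weight_dict[kw] = (label, score)
    return weight_dict

subsets = {"sports": ["goal", "match"], "entertainment": ["movie"]}
coeffs = {"goal": 0.9, "match": 0.7, "movie": 0.8}
print(build_weight_dict(subsets, coeffs)["goal"])
```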
Further, before determining the score of each keyword according to the preset relevance coefficient set, the method further comprises:
counting the word-frequency information of the keywords in each keyword subset;
removing duplicate keywords from each keyword subset, to obtain a processed keyword subset for each subset;
sorting the keywords in each processed keyword subset according to the word-frequency information, to obtain a sorted keyword set.
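The preprocessing steps above (count word frequencies, remove duplicates, sort by frequency) can be sketched for a single subset as follows; the sample keywords are hypothetical, and the descending sort order is an assumption since the application does not state a direction.

```python
# Sketch of keyword-subset preprocessing: count frequencies, deduplicate,
# then sort the remaining keywords by descending word frequency.
from collections import Counter

def preprocess_subset(keywords):
    """Return the unique keywords of one subset, sorted by descending
    word frequency (ties broken alphabetically for determinism)."""
    freq = Counter(keywords)     # word-frequency information
    unique = list(freq)          # duplicates removed
    return sorted(unique, key=lambda k: (-freq[k], k))

print(preprocess_subset(["goal", "match", "goal", "goal", "team", "match"]))
```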
Further, after determining the category of the text to be processed according to the label and score of each word to be processed in the text to be processed, the method further comprises:
obtaining the words to be processed in the text whose category has been determined, wherein the words to be processed have labels and scores;
updating the weight dictionary according to the words to be processed.
Further, updating the weight dictionary according to the words to be processed comprises:
deleting the words in the weight dictionary that are identical to the words to be processed, to obtain a weight dictionary after deletion;
adding the words to be processed to the weight dictionary after deletion, to obtain an updated weight dictionary.
Further, deleting the words in the weight dictionary that are identical to the words to be processed comprises:
determining the words in the weight dictionary that are identical to the words to be processed;
judging whether the score of a word to be processed is greater than the score of the identical word in the weight dictionary;
if so, replacing the identical word with the word to be processed, to obtain an updated weight dictionary.
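The update rule described above (replace a dictionary entry only when the newly labeled word carries a higher score) can be sketched as follows. Adding previously unseen words is an assumption read from the "add after deletion" step; the sample data are hypothetical.

```python
# Sketch of the weight-dictionary update: a word from a newly classified
# text replaces its existing entry only if its score is greater.
def update_weight_dict(weight_dict, processed_words):
    """weight_dict: word -> (label, score). processed_words: list of
    (word, label, score) from a text whose category has been determined."""
    for word, label, score in processed_words:
        if word in weight_dict:
            _, old_score = weight_dict[word]
            if score > old_score:               # keep the higher-scoring entry
                weight_dict[word] = (label, score)
        else:
            weight_dict[word] = (label, score)  # assumed: unseen words are added
    return weight_dict

d = {"goal": ("sports", 0.9)}
update_weight_dict(d, [("goal", "sports", 0.95), ("movie", "entertainment", 0.8)])
print(d["goal"])
```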
In a second aspect, the application provides a text labeling apparatus, comprising:
a first obtaining module, configured to obtain a text to be processed, the text to be processed including at least one word to be processed;
a first determining module, configured to determine, according to a preset weight dictionary, the label and score of each of the at least one word to be processed, wherein the weight dictionary includes at least one word, and each of the at least one word has a label and a score;
a second determining module, configured to determine the category of the text to be processed according to the label and score of each word to be processed in the text to be processed.
Further, the second determining module comprises:
a first determining submodule, configured to determine, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label in the text to be processed;
a second determining submodule, configured to determine the label with the largest score summation to be the category of the text to be processed.
Further, if the number of occurrences of a word to be processed in the text to be processed is greater than 1, the first determining submodule is specifically configured to:
determine, according to the score and number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label as S = Σ_{i=1}^{N} x_i·y_i, wherein N is the total number of words to be processed under each label (the total number of words to be processed under different labels may be the same or different), x_i is the score of the i-th word to be processed under each label, y_i is the number of occurrences of the i-th word to be processed under each label, i ∈ [1, N], and i is a positive integer;
alternatively, determine, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label as S = Σ_{i=1}^{N} x_i;
alternatively, determine, using a non-linear summation that involves the natural constant e and the natural logarithm ln, the score summation of the words to be processed under each label according to the score and number of occurrences of each word to be processed.
Further, the second determining submodule is specifically configured to:
determine the label with the largest score summation;
if the score summation of the label with the largest score summation is within a preset threshold range, determine that label to be the category of the text to be processed.
Further, the apparatus further comprises:
a second obtaining module, configured to obtain at least one text to be analyzed before the first obtaining module obtains the text to be processed, wherein each of the at least one text to be analyzed includes at least one word to be analyzed, and the words to be analyzed are provided with labels;
an extraction module, configured to extract, according to the words to be analyzed in each text to be analyzed, the keywords of each text to be analyzed to obtain a keyword set, wherein the keyword set includes at least one keyword subset, and each keyword subset includes the keywords belonging to the same label;
a third determining module, configured to determine the score of each keyword according to a preset relevance coefficient set, wherein the relevance coefficient set includes a relevance coefficient of at least one keyword, and a relevance coefficient characterizes the degree of relevance between a keyword and a label;
a construction module, configured to construct the weight dictionary according to the keyword set and the score of each keyword.
Further, the apparatus further comprises:
a statistics module, configured to count the word-frequency information of the keywords in each keyword subset before the third determining module determines the score of each keyword according to the preset relevance coefficient set;
a removal module, configured to remove duplicate keywords from each keyword subset, to obtain a processed keyword subset for each subset;
a sorting module, configured to sort the keywords in each processed keyword subset according to the word-frequency information, to obtain a sorted keyword set.
Further, the apparatus further comprises:
a third obtaining module, configured to obtain, after the second determining module determines the category of the text to be processed according to the label and score of each word to be processed, the words to be processed in the text whose category has been determined, wherein the words to be processed have labels and scores;
an update module, configured to update the weight dictionary according to the words to be processed.
Further, the update module comprises:
a deletion submodule, configured to delete the words in the weight dictionary that are identical to the words to be processed, to obtain a weight dictionary after deletion;
an addition submodule, configured to add the words to be processed to the weight dictionary after deletion, to obtain an updated weight dictionary.
Further, the deletion submodule is specifically configured to:
determine the words in the weight dictionary that are identical to the words to be processed;
judge whether the score of a word to be processed is greater than the score of the identical word in the weight dictionary;
if so, replace the identical word with the word to be processed, to obtain an updated weight dictionary.
In a third aspect, the application provides a text labeling device, including units or means for executing each step of the method of the first aspect above.
In a fourth aspect, the application provides a text labeling device, including a processor, a memory and a computer program, wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method of any one of the first aspect.
In a fifth aspect, the application provides a text labeling device, including at least one processing element or chip for executing the method of any one of the first aspect above.
In a sixth aspect, the application provides a computer program which, when executed by a processor, performs the method of any one of the first aspect above.
In a seventh aspect, the application provides a computer-readable storage medium on which the computer program of the sixth aspect is stored.
With the text labeling method, apparatus, device and computer-readable storage medium provided by the present application, a text to be processed is obtained, the text including at least one word to be processed; the label and score of each of the at least one word to be processed are determined according to a preset weight dictionary, wherein the weight dictionary includes at least one word and each word has a label and a score; and the category of the text to be processed is determined according to the label and score of each word to be processed. This provides a method that automatically labels the words in a text to be processed and assigns the text to a category, without requiring large amounts of human labor for labeling; labor cost is thereby saved, the cost of text labeling is reduced, and labeling efficiency is improved.
Brief description of the drawings
The drawings herein are incorporated into and constitute a part of this specification; they show embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flowchart of a text labeling method provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of another text labeling method provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a text labeling apparatus provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of another text labeling apparatus provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a text labeling device provided by an embodiment of the present application.
The above drawings show specific embodiments of the application, which are described in more detail hereinafter. The drawings and the accompanying description are not intended to limit the scope of the application's concept in any way, but to illustrate the concept of the application to those skilled in the art by reference to specific embodiments.
Specific embodiments
Exemplary embodiments are described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application; rather, they are merely examples of devices and methods consistent with some aspects of the application as detailed in the appended claims.
Terms used in this application are explained first:
Text: a text contains at least one word.
TextRank algorithm: TextRank is a prior-art algorithm based on PageRank, used to generate keywords and summaries for a text.
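A minimal sketch of the TextRank idea mentioned above: words are scored by PageRank-style iteration over a co-occurrence graph. The window size, damping factor and iteration count are conventional choices for illustration, not values taken from the application.

```python
# Minimal TextRank-style keyword scoring: build a co-occurrence graph over
# a token window, then run a few PageRank iterations. Illustrative only.
from collections import defaultdict

def textrank(tokens, window=2, d=0.85, iters=30):
    """Return a score per word; higher scores suggest better keywords."""
    neighbors = defaultdict(set)
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != tok:                  # skip self-loops
                neighbors[tok].add(tokens[j])
                neighbors[tokens[j]].add(tok)
    score = {t: 1.0 for t in neighbors}
    for _ in range(iters):
        score = {
            t: (1 - d) + d * sum(score[n] / len(neighbors[n]) for n in neighbors[t])
            for t in neighbors
        }
    return score

tokens = "the cat sat on the mat the cat ran".split()
scores = textrank(tokens)
print(max(scores, key=scores.get))
```

A production TextRank would also filter by part of speech and merge adjacent keywords into phrases; those steps are omitted here.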
A specific application scenario of the application is as follows. In the field of machine learning, it is often necessary to annotate the data of texts to be used for training or recognition: the words in the text are labeled, and the text is classified. Data annotation mainly consists of classifying the data in a text according to classification requirements, together with operations such as drawing bounding boxes, annotating and tagging. In the prior art, data annotation of text is usually performed manually, in some cases aided by convenient frameworks. However, because each text sample is examined and then labeled by a human annotator, annotation efficiency is low and a large amount of labor is consumed, making annotation costly.
The text labeling method, apparatus, device and computer-readable storage medium provided by the present application are intended to solve the above technical problems of the prior art.
The technical solution of the application, and how it solves the above technical problems, are described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the application are described below with reference to the drawings.
Fig. 1 is a schematic flowchart of a text labeling method provided by an embodiment of the present application. As shown in Fig. 1, the method includes:
Step 101: obtain a text to be processed, the text to be processed including at least one word to be processed.
In this embodiment, specifically, the executing entity of this embodiment may be a control device, a terminal device, a text labeling apparatus, or any other apparatus or device capable of executing the method of this embodiment.
First, the text to be processed is input into the executing entity of this embodiment. The text to be processed is a text awaiting labeling; in this application it is a text that conforms to natural-language grammar. The text to be processed contains one or more words to be processed, and the language of the words is not limited.
Step 102: determine, according to a preset weight dictionary, the label and score of each of the at least one word to be processed, wherein the weight dictionary includes at least one word, and each of the at least one word has a label and a score.
In this embodiment, specifically, a weight dictionary has been preset. The weight dictionary contains one or more words; each word is provided with one or more labels, and each word has a score under each of its labels. The domain to which the weight dictionary belongs is the same as the domain of the text to be processed.
For example, the weight dictionary contains word 1, word 2, word 3 and word 4. Word 1 has label A, and the score of word 1 under label A is a; word 2 has label B, with score b; word 3 has label C, with score c; word 4 has label D, with score d.
As another example, the weight dictionary contains word 1, word 2, word 3 and word 4. Word 1 has labels A and B, with score a1 under label A and score a2 under label B; word 2 has labels B, C and D, with scores b1, b2 and b3 respectively; word 3 has labels C and D, with scores c1 and c2; word 4 has label D, with score d1. In this case, the label under which a word has the largest score can be determined and taken as the final label of the word. For example, word 1 has labels A and B, with scores a1 and a2 respectively; if a1 is greater than a2, the final label of word 1 is label A.
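The rule in the example above, where a word carrying several labels keeps the label under which its score is largest, can be sketched as follows; the numeric scores are hypothetical.

```python
# Sketch: pick a word's final label as the label with its largest score.
def final_label(label_scores):
    """label_scores: dict mapping label -> score for one word."""
    return max(label_scores, key=label_scores.get)

word1 = {"A": 0.6, "B": 0.4}   # a1 > a2, so label A wins
print(final_label(word1))
```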
Since the label and score of each word are provided in the weight dictionary, the label and score corresponding to each word to be processed in the text to be processed can be determined from the weight dictionary.
For example, the weight dictionary contains word 1 (label A, score a), word 2 (label B, score b), word 3 (label C, score c) and word 4 (label D, score d). If the word 1 to be processed in the text to be processed is word 1 of the dictionary, then the label of the word 1 to be processed is A, and its score under label A is a; if the word 2 to be processed in the text is word 3 of the dictionary, then the label of the word 2 to be processed is C, and its score under label C is c.
For example, the categories of labels may be: sports labels, entertainment labels, culture labels, travel labels, education labels, and so on.
Step 103: determine the category of the text to be processed according to the label and score of each word to be processed in the text to be processed.
Optionally, step 103 specifically includes the following steps:
Step 1031: determine, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label in the text to be processed.
Step 1032: determine the label with the largest score summation to be the category of the text to be processed.
Optionally, if the number of occurrences of a word to be processed in the text to be processed is greater than 1, step 1031 specifically includes:
determining, according to the score and number of occurrences of each word to be processed in the text to be processed, the score summation of the words to be processed under each label as S = Σ_{i=1}^{N} x_i·y_i, wherein N is the total number of words to be processed under each label (the total number of words to be processed under different labels may be the same or different), x_i is the score of the i-th word to be processed under each label, y_i is the number of occurrences of the i-th word to be processed under each label, i ∈ [1, N], and i is a positive integer;
alternatively, determining, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label as S = Σ_{i=1}^{N} x_i;
alternatively, determining, using a non-linear summation that involves the natural constant e and the natural logarithm ln, the score summation of the words to be processed under each label according to the score and number of occurrences of each word to be processed.
Optionally, step 1032 specifically includes: determining the label with the largest score summation; if the score summation of that label is within a preset threshold range, determining that label to be the category of the text to be processed.
In this embodiment, specifically, each word to be processed in the text to be processed is provided with a label, and each word to be processed is provided with a score under its label; the text to be processed thus carries at least one label.
First, the words to be processed that belong to the same label are determined. For the words to be processed belonging to the same label, their score summation is calculated from their respective scores, so that each label appearing in the text to be processed is assigned a score summation.
Then, the labels in the text to be processed are sorted according to their score summations, for example in descending order of score summation, and the label with the largest score summation is taken as the category of the text to be processed.
Specifically, when determining the score summation of the words to be processed under each label in the text to be processed, this step can be carried out in the following several implementations. The scores of the words to be processed under the same label may be the same or different.
First implementation: if the number of occurrences of every word to be processed in the text to be processed is equal to 1, then for each label the sum of the scores of the words to be processed under that label can be calculated directly, giving the score summation of the words to be processed under each label.
For example, the word 1 to be processed in the text to be processed has label A, with score a1 under label A; word 2 has label A, with score a2; word 3 has label A, with score a3; word 4 has label B, with score b1; word 5 has label B, with score b2; word 6 has label C, with score c1; word 7 has label C, with score c2; word 8 has label C, with score c3. Three labels thus appear in the text to be processed: label A, label B and label C. The above scores are all different.
Then, according to the first implementation, from the scores a1, a2 and a3 of words 1, 2 and 3 under label A, the score summation of the words to be processed under label A is a1+a2+a3; from the scores b1 and b2 of words 4 and 5 under label B, the score summation under label B is b1+b2; from the scores c1, c2 and c3 of words 6, 7 and 8 under label C, the score summation under label C is c1+c2+c3.
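With hypothetical numeric values substituted for a1 through c3, the first implementation reduces to a plain per-label sum:

```python
# First implementation: every word occurs exactly once, so the score
# summation per label is the direct sum of word scores. Values are
# illustrative, not from the application.
from collections import defaultdict

words = [  # (label, score) for words 1..8, each occurring once
    ("A", 0.5), ("A", 0.3), ("A", 0.2),
    ("B", 0.6), ("B", 0.1),
    ("C", 0.4), ("C", 0.4), ("C", 0.3),
]
totals = defaultdict(float)
for label, score in words:
    totals[label] += score
print(max(totals, key=totals.get))  # label C has the largest summation
```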
Second implementation: if even one word to be processed occurs in the text to be processed more than once, the score summation of the words to be processed under each label can be calculated by repeated linear summation. Specifically, from the score x_i and number of occurrences y_i of each word to be processed under each label, the score summation of the words to be processed under each label is calculated as S = Σ_{i=1}^{N} x_i·y_i, wherein N is the total number of words to be processed under each label (the total number of words to be processed under different labels may be the same or different), x_i is the score of the i-th word to be processed under each label, and y_i is the number of occurrences of the i-th word to be processed under each label. It is then judged whether the score summation of the label with the largest score summation is within a preset threshold range. If it is, that label is taken as the category of the text to be processed; if not, the category of the text to be processed is labeled manually, and the manually labeled category is obtained.
For example, the word 1 to be processed in the text to be processed has label A, its score under label A is a1, and it occurs m1 times; word 2 has label A, score a2, and occurs m2 times; word 3 has label A, score a3, and occurs m3 times; word 4 has label B, score b1, and occurs m4 times; word 5 has label B, score b2, and occurs m5 times; word 6 has label C, score c1, and occurs m6 times; word 7 has label C, score c2, and occurs m7 times; word 8 has label C, score c3, and occurs m8 times. Three labels thus appear in the text to be processed: label A, label B and label C. The above scores are all different.
Then, according to the second implementation, the score summation of the words to be processed under label A is determined to be a1·m1 + a2·m2 + a3·m3, the score summation under label B is b1·m4 + b2·m5, and the score summation under label C is c1·m6 + c2·m7 + c3·m8.
The third implementation: even if some word to be processed occurs more than once in the text to be processed, the score summation of the words to be processed under each label of the text to be processed may be calculated by single linear summation. Specifically, according to the score x_i of each word to be processed under each label of the text to be processed, the score summation of the words to be processed under each label is calculated as ∑_{i=1}^{N} x_i. In this case, no matter how many times a word to be processed occurs, it is counted only once.
For example, following the example above, the score summation of the words to be processed under label A is determined to be a1 + a2 + a3, the score summation under label B is b1 + b2, and the score summation under label C is c1 + c2 + c3.
The fourth implementation: if any word to be processed occurs more than once in the text to be processed, the score summation of the words to be processed under each label of the text to be processed may be calculated by nonlinear summation. Specifically, according to the score x_i and occurrence count y_i of each word to be processed under each label of the text to be processed, the score summation of the words to be processed under each label is calculated as ∑_{i=1}^{N} x_i·ln(y_i + e − 1), where e is the natural constant and ln is the natural logarithm.
For example, following the example above, the score summation of the words to be processed under label A is determined to be a1·ln(m1+e−1) + a2·ln(m2+e−1) + a3·ln(m3+e−1), the score summation under label B is b1·ln(m4+e−1) + b2·ln(m5+e−1), and the score summation under label C is c1·ln(m6+e−1) + c2·ln(m7+e−1) + c3·ln(m8+e−1).
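A minimal sketch of the three summation modes above, with hypothetical scores x_i and occurrence counts y_i for the words under one label (not part of the patent text):

```python
import math

def weighted_linear_sum(scores, counts):
    # Second implementation: sum of x_i * y_i over the words under one label.
    return sum(x * y for x, y in zip(scores, counts))

def single_linear_sum(scores):
    # Third implementation: each word is counted once, however often it occurs.
    return sum(scores)

def nonlinear_sum(scores, counts):
    # Fourth implementation: sum of x_i * ln(y_i + e - 1); for y_i = 1 the
    # log factor is ln(e) = 1, so a single occurrence contributes exactly x_i.
    return sum(x * math.log(y + math.e - 1) for x, y in zip(scores, counts))

# Hypothetical scores a1..a3 and occurrence counts m1..m3 under label A.
scores_a = [0.9, 0.6, 0.3]
counts_a = [2, 1, 3]
print(weighted_linear_sum(scores_a, counts_a))  # 0.9*2 + 0.6*1 + 0.3*3
print(single_linear_sum(scores_a))
print(nonlinear_sum(scores_a, counts_a))
```

Note that the nonlinear mode dampens high occurrence counts: the ln term grows much more slowly than y_i itself, which is presumably why the offset e − 1 is chosen so that a single occurrence still contributes the full score.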
When determining the classification of the text to be processed, this step needs to judge whether the score summation of the label with the maximum score summation is within the preset threshold range. If it is, the label with the maximum score summation may be taken as the classification of the text to be processed; if it is not, the classification of the text to be processed is marked manually, and the manually marked classification is then obtained. The preset threshold range may be an interval, or it may simply require the score summation to be greater than a given value.
For example, following the example above, if the score summation of label C is the maximum and lies within the preset threshold range, label C may be determined to be the classification of the text to be processed, i.e. the classification of the text to be processed is C.
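The threshold judgment above can be sketched as follows; the label summations and threshold are hypothetical, and `None` stands in for the fall-back to manual marking:

```python
def classify(label_sums, threshold):
    # Pick the label with the maximum score summation; if that summation
    # reaches the preset threshold it becomes the text's classification,
    # otherwise the text is handed off to manual marking (None here).
    label, total = max(label_sums.items(), key=lambda kv: kv[1])
    return label if total >= threshold else None

sums = {"A": 1.2, "B": 0.8, "C": 2.5}  # hypothetical score summations
print(classify(sums, threshold=2.0))   # C is maximal and above threshold
print(classify(sums, threshold=3.0))   # below threshold: manual marking
```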
In this embodiment, a text to be processed is obtained, the text to be processed containing at least one word to be processed; according to a preset weight dictionary, the label and score of each of the at least one word to be processed are determined, the weight dictionary containing at least one word, each of which has a label and a score; and according to the label and score of each word to be processed in the text to be processed, the classification of the text to be processed is determined. This provides a method for automatically marking the words in a text to be processed and classifying the text to be processed, so that large amounts of human resources are not needed to mark texts; human costs can be saved, the cost of text marking is reduced, and annotation efficiency is improved.
Fig. 2 is a flow diagram of another text marking method provided by an embodiment of the present application. As shown in Fig. 2, the method includes:
Step 201: obtain at least one text to be analyzed, where each of the at least one text to be analyzed contains at least one word to be analyzed, and the words to be analyzed are provided with labels.

In this embodiment, a weight dictionary first needs to be constructed. Specifically, one or more texts to be analyzed are obtained, each text to be analyzed containing one or more words to be analyzed, and a label is set for each word to be analyzed; for example, the words to be analyzed in the texts to be analyzed are given labels by manual marking.
For example, text to be analyzed 1 contains words to be analyzed 1, 2, 3, 4 and 5; the label of word to be analyzed 1 is a sports label, the labels of words to be analyzed 2 and 3 are entertainment labels, the label of word to be analyzed 4 is an education label, and the label of word to be analyzed 5 is a travel label.
Step 202: according to the words to be analyzed in each text to be analyzed, extract the keywords of each text to be analyzed to obtain a keyword set, where the keyword set contains at least one keyword subset, and each of the at least one keyword subset contains keywords belonging to the same label.

In this embodiment, each text to be analyzed is processed with the TextRank algorithm: according to the words to be analyzed in each text to be analyzed, the keywords of each text to be analyzed can be extracted, and the keywords of all the texts to be analyzed are put into one keyword set. Each keyword has a label.
The keywords in the keyword set that belong to the same label are divided into one keyword subset, yielding one or more keyword subsets; each keyword subset contains one or more keywords, and the keywords in the same keyword subset have the same label.
For example, the TextRank algorithm is used to extract the keywords in text to be analyzed 1, obtaining keyword 1 with label A, keyword 2 with label B and keyword 3 with label C; the TextRank algorithm is used to extract the keywords in text to be analyzed 2, obtaining keyword 4 with label A, keyword 5 with label B and keyword 6 with label C. The keyword set thus contains keyword subset 1, keyword subset 2 and keyword subset 3: keyword subset 1 contains keywords 1 and 4 (label A), keyword subset 2 contains keywords 2 and 5 (label B), and keyword subset 3 contains keywords 3 and 6 (label C).
Step 203: count the word-frequency information of the keywords in each keyword subset.

In this embodiment, after keyword extraction is complete, the keywords in each keyword subset are sorted by word frequency. First, for each keyword subset, the word-frequency information of the keywords is counted, where the word-frequency information is the number of times a keyword occurs.
Step 204: remove the duplicated keywords in each keyword subset to obtain each processed keyword subset.

In this embodiment, for each keyword subset, the keywords that occur repeatedly in the keyword subset are rejected, so that the duplicates are removed.
Step 205: according to the word-frequency information, sort the keywords in each processed keyword subset to obtain the sorted keyword set.

In this embodiment, for each processed keyword subset, the keywords are sorted in descending order of word frequency according to the word-frequency information of each keyword, yielding each sorted keyword subset; all the sorted keyword subsets constitute the sorted keyword set. By sorting the keywords in descending order of word frequency, the top W keywords can be obtained, preliminarily filtering out the more frequently used keywords, where W is a positive integer.
For example, the keyword set contains keyword subset 1 and keyword subset 2. Keyword subset 1 corresponds to label A and contains keyword 1, keyword 2, keyword 2, keyword 3, keyword 3 and keyword 3; keyword subset 2 corresponds to label B and contains keyword 4, keyword 4 and keyword 5. For keyword subset 1, the word frequency of keyword 1 is counted as 1, that of keyword 2 as 2 and that of keyword 3 as 3; for keyword subset 2, the word frequency of keyword 4 is 2 and that of keyword 5 is 1. The duplicated keyword 2 and keyword 3 are removed from keyword subset 1, and the duplicated keyword 4 is removed from keyword subset 2. For keyword subset 1, sorting by descending word frequency gives the order keyword 3, keyword 2, keyword 1; for keyword subset 2, the sorted order is keyword 4, keyword 5.
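Steps 203-205 for a single keyword subset (counting frequencies, de-duplicating, sorting) can be sketched as:

```python
from collections import Counter

def dedupe_and_sort(subset):
    # Count each keyword's word frequency, drop duplicates, then sort the
    # remaining keywords in descending order of frequency (steps 203-205).
    freq = Counter(subset)
    return sorted(freq, key=lambda kw: freq[kw], reverse=True)

subset1 = ["kw1", "kw2", "kw2", "kw3", "kw3", "kw3"]  # hypothetical subset
print(dedupe_and_sort(subset1))  # ['kw3', 'kw2', 'kw1']
```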
Step 206: according to a preset degree-of-association coefficient set, determine the score of each keyword, where the degree-of-association coefficient set contains the degree-of-association coefficient of at least one keyword, and a degree-of-association coefficient characterizes the degree of association between a keyword and a label.

In this embodiment, a degree-of-association coefficient set has been established in advance; it contains the degree-of-association coefficient of each keyword, which indicates the degree of association between the keyword and its label. For example, the degree of association between a keyword and its corresponding label may be determined by manually observing the pairing between them. The score of each keyword can then be determined from its degree-of-association coefficient; for example, the larger the degree-of-association coefficient of a keyword, the higher its score.
The scores of keywords may be divided into two or three tiers. For example, when divided into two tiers, keyword 1 belongs to the first-tier score and keyword 2 belongs to the second-tier score; when divided into three tiers, keyword 1 belongs to the first-tier score, keyword 2 to the second-tier score and keyword 3 to the third-tier score.
For example, for keyword subset 1 in the keyword set, the sorted order of keywords is keyword 3, keyword 2, keyword 1, and the label of keyword subset 1 is A. The degree-of-association coefficient between keyword 3 and label A may be determined to be 1, so the score of keyword 3 is the first-tier score; the coefficient between keyword 2 and label A is 2, so the score of keyword 2 is the second-tier score; the coefficient between keyword 1 and label A is 3, so the score of keyword 1 is the third-tier score. For keyword subset 2 in the keyword set, the sorted order of keywords is keyword 4, keyword 5, and the label of keyword subset 2 is B; the coefficient between keyword 4 and label B may be determined to be 2, so the score of keyword 4 is the second-tier score, and the coefficient between keyword 5 and label B is 1, so the score of keyword 5 is the first-tier score.
Step 207: construct the weight dictionary according to the keyword set and the score of each keyword.

In this embodiment, the keywords, the labels of the keywords and the scores of the keywords have been determined in the above steps, so the weight dictionary can be obtained from each keyword, the label of each keyword and the score of each keyword.
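A weight dictionary of this shape can be sketched as a mapping from keyword to (label, score); the tier scores and degree-of-association coefficients below are hypothetical placeholders:

```python
def build_weight_dict(subsets, coefficients, tier_scores):
    # subsets: label -> keywords; coefficients: keyword -> degree-of-
    # association coefficient; tier_scores: coefficient -> tier score.
    weight_dict = {}
    for label, keywords in subsets.items():
        for keyword in keywords:
            weight_dict[keyword] = (label, tier_scores[coefficients[keyword]])
    return weight_dict

subsets = {"A": ["kw3", "kw2", "kw1"], "B": ["kw4", "kw5"]}
coefficients = {"kw3": 1, "kw2": 2, "kw1": 3, "kw4": 2, "kw5": 1}
tier_scores = {1: 1.0, 2: 0.6, 3: 0.3}  # first/second/third-tier scores
print(build_weight_dict(subsets, coefficients, tier_scores))
```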
Step 208: obtain a text to be processed, the text to be processed containing at least one word to be processed.

In this embodiment, this step may refer to step 101 of Fig. 1 and is not repeated here.

Step 209: according to the preset weight dictionary, determine the label and score of each of the at least one word to be processed, where the weight dictionary contains at least one word, each of which has a label and a score.

In this embodiment, this step may refer to step 102 of Fig. 1 and is not repeated here.

Step 210: according to the label and score of each word to be processed in the text to be processed, determine the classification of the text to be processed.

In this embodiment, this step may refer to step 103 of Fig. 1 and is not repeated here.
Step 211: obtain the words to be processed in a text to be processed whose classification has been determined, where the words to be processed have labels and scores.

In this embodiment, the texts that have been given labels can be used to expand the weight dictionary. First, after steps 209-210 are executed, a text to be processed that has been given a label and classification is obtained; after steps 209-210 are executed repeatedly, multiple texts to be processed that have been given labels and classifications are available. The words to be processed in each such text are obtained, each word to be processed having been assigned a label and a score.
Step 212: update the weight dictionary according to the words to be processed.

Step 212 includes:

Step 2121: delete the words in the weight dictionary that are identical to the words to be processed, obtaining the weight dictionary after deletion.

Optionally, step 2121 specifically includes: determining the words in the weight dictionary that are identical to the words to be processed; judging whether the score of a word to be processed is greater than the score of the identical word; and if so, replacing the identical word with the word to be processed to obtain the updated weight dictionary.

Step 2122: add the words to be processed into the weight dictionary after deletion, obtaining the updated weight dictionary.
In this embodiment, since each word to be processed has been assigned a label and a score, the words to be processed can be added into the weight dictionary.

First, a word to be processed in a text that has been given a label and classification may be identical to a word already in the weight dictionary, in which case the identical word in the weight dictionary needs to be deleted. Specifically, according to each word to be processed in each text that has been given a label and classification, the identical word is found in the weight dictionary; since the words in the weight dictionary have scores and the words to be processed also have scores, it can be judged whether the score of the word to be processed is greater than the score of the word in the dictionary. If it is greater, the identical word in the weight dictionary may be directly replaced by the word to be processed; if it is less than or equal, no replacement is needed.

Then, the words to be processed in the texts that have been given labels and classifications are all added into the weight dictionary after deletion, completing the update of the weight dictionary.
After steps 211-212 are executed repeatedly, the weight dictionary is updated multiple times. After N updates, if, comparing the weight dictionary after the N-th update with the weight dictionary after the (N-1)-th update, the number of updated words does not exceed a predetermined number, it is determined that the weight dictionary changes little between updates and no longer needs to be updated.
For example, the weight dictionary contains word 1, word 2, word 3 and word 4: word 1 has label A, and the score with which word 1 belongs to label A is a1; word 2 has label B with score b1; word 3 has label C with score c; word 4 has label D with score d. Words to be processed 1, 2, 5 and 6 are obtained: word to be processed 1 has label A and score a2, word to be processed 2 has label B and score b2, word to be processed 5 has label E and score e, and word to be processed 6 has label F and score f. Word to be processed 1 is identical to word 1, and word to be processed 2 is identical to word 2. It can be determined that the score a2 of word to be processed 1 is greater than the score a1 of word 1, so word 1 is replaced by word to be processed 1; it can be determined that the score b2 of word to be processed 2 is less than the score b1 of word 2, so no replacement is needed. The updated weight dictionary then contains word to be processed 1, word 2, word 3, word 4, word to be processed 5 and word to be processed 6, where word to be processed 1 has label A and score a2, word 2 has label B and score b1, word 3 has label C and score c, word 4 has label D and score d, word to be processed 5 has label E and score e, and word to be processed 6 has label F and score f.
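The update in steps 2121-2122 amounts to a score-gated merge; a minimal sketch, with hypothetical words and scores:

```python
def update_weight_dict(weight_dict, new_words):
    # Replace an existing entry only when the newly labeled word has a
    # higher score; words not yet in the dictionary are simply added.
    updated = dict(weight_dict)
    for word, (label, score) in new_words.items():
        if word not in updated or score > updated[word][1]:
            updated[word] = (label, score)
    return updated

wd = {"w1": ("A", 0.5), "w2": ("B", 0.9)}
new = {"w1": ("A", 0.8),  # higher score -> replaces the old entry
       "w2": ("B", 0.4),  # lower score  -> old entry kept
       "w5": ("E", 0.7)}  # new word     -> added
print(update_weight_dict(wd, new))
# {'w1': ('A', 0.8), 'w2': ('B', 0.9), 'w5': ('E', 0.7)}
```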
In this embodiment, a text to be processed is obtained, the text to be processed containing at least one word to be processed; according to a preset weight dictionary, the label and score of each of the at least one word to be processed are determined, the weight dictionary containing at least one word, each of which has a label and a score; and according to the label and score of each word to be processed in the text to be processed, the classification of the text to be processed is determined. This provides a method for automatically marking the words in a text to be processed and classifying the text to be processed, so that large amounts of human resources are not needed to mark texts; human costs can be saved, the cost of text marking is reduced, and annotation efficiency is improved. Moreover, according to the words to be processed obtained from texts whose classification has been determined, the weight dictionary can be iteratively updated, ensuring that subsequent text marking has good precision.
Fig. 3 is a structural schematic diagram of a text marking device provided by an embodiment of the present application. As shown in Fig. 3, the device of this embodiment may include:

a first obtaining module 31, configured to obtain a text to be processed, the text to be processed containing at least one word to be processed;

a first determining module 32, configured to determine, according to a preset weight dictionary, the label and score of each of the at least one word to be processed, where the weight dictionary contains at least one word, each of which has a label and a score;

a second determining module 33, configured to determine the classification of the text to be processed according to the label and score of each word to be processed in the text to be processed.
The text marking device of this embodiment can perform the text marking method provided by the embodiments of the present application; the implementation principles and technical effects are similar and are not repeated here.
Fig. 4 is a structural schematic diagram of another text marking device provided by an embodiment of the present application, based on the embodiment shown in Fig. 3. As shown in Fig. 4, in the device of this embodiment, the second determining module 33 comprises:

a first determining submodule 331, configured to determine, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed;

a second determining submodule 332, configured to determine that the label with the maximum score summation is the classification of the text to be processed.
The first determining submodule 331 is specifically configured to:

determine, according to the score and occurrence count of each word to be processed in the text to be processed, the score summation of the words to be processed under each label of the text to be processed as ∑_{i=1}^{N} x_i·y_i, where N is the total number of words to be processed under each label, the total numbers of words to be processed under different labels are identical or different, x_i is the score of the i-th word to be processed under each label, y_i is the occurrence count of the i-th word to be processed under each label, i ∈ [1, N], and i is a positive integer; or determine, according to the score of each word to be processed in the text to be processed, the score summation of the words to be processed under each label as ∑_{i=1}^{N} x_i; or determine, using nonlinear summation and according to the score and occurrence count of each word to be processed in the text to be processed, the score summation of the words to be processed under each label as ∑_{i=1}^{N} x_i·ln(y_i + e − 1), where e is the natural constant and ln is the natural logarithm.
The second determining submodule 332 is specifically configured to: determine the label with the maximum score summation; and if the score summation of that label is within the preset threshold range, determine that the label with the maximum score summation is the classification of the text to be processed.
The device provided in this embodiment further includes:

a second obtaining module 41, configured to obtain at least one text to be analyzed before the first obtaining module 31 obtains the text to be processed, where each of the at least one text to be analyzed contains at least one word to be analyzed, and the words to be analyzed are provided with labels;

an extraction module 42, configured to extract the keywords of each text to be analyzed according to the words to be analyzed in each text to be analyzed, to obtain a keyword set, where the keyword set contains at least one keyword subset, and each of the at least one keyword subset contains keywords belonging to the same label;

a third determining module 43, configured to determine the score of each keyword according to a preset degree-of-association coefficient set, where the degree-of-association coefficient set contains the degree-of-association coefficient of at least one keyword, and a degree-of-association coefficient characterizes the degree of association between a keyword and a label;

a construction module 44, configured to construct the weight dictionary according to the keyword set and the score of each keyword.
The device provided in this embodiment further includes:

a statistics module 45, configured to count the word-frequency information of the keywords in each keyword subset before the third determining module 43 determines the score of each keyword according to the preset degree-of-association coefficient set;

a removal module 46, configured to remove the duplicated keywords in each keyword subset to obtain each processed keyword subset;

a sorting module 47, configured to sort the keywords in each processed keyword subset according to the word-frequency information, to obtain the sorted keyword set.
The device provided in this embodiment further includes:

a third obtaining module 48, configured to obtain the words to be processed in a text to be processed whose classification has been determined, after the second determining module 33 determines the classification of the text to be processed according to the label and score of each word to be processed in the text to be processed, where the words to be processed have labels and scores;

an update module 49, configured to update the weight dictionary according to the words to be processed.

The update module 49 comprises:

a deletion submodule 491, configured to delete the words in the weight dictionary that are identical to the words to be processed, to obtain the weight dictionary after deletion;

an addition submodule 492, configured to add the words to be processed into the weight dictionary after deletion, to obtain the updated weight dictionary.

The deletion submodule 491 is specifically configured to: determine the words in the weight dictionary that are identical to the words to be processed; judge whether the score of a word to be processed is greater than the score of the identical word; and if so, replace the identical word with the word to be processed, obtaining the updated weight dictionary.
The text marking device of this embodiment can perform the other text marking method provided by the embodiments of the present application; the implementation principles and technical effects are similar and are not repeated here.
Fig. 5 is a structural schematic diagram of a text marking equipment provided by an embodiment of the present application. As shown in Fig. 5, an embodiment of the present application provides a text marking equipment that can be used to execute the actions or steps of the text marking equipment in the embodiments shown in Fig. 1 or Fig. 2, and that specifically includes: a processor 2701, a memory 2702 and a communication interface 2703.

The memory 2702 is configured to store a computer program.

The processor 2701 is configured to execute the computer program stored in the memory 2702 to implement the actions of the text marking equipment in the embodiments shown in Fig. 1 or Fig. 2, which are not repeated here.
Optionally, the text marking equipment may further include a bus 2704. The processor 2701, the memory 2702 and the communication interface 2703 may be connected to one another by the bus 2704; the bus 2704 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 2704 may be divided into an address bus, a data bus, a control bus and so on. For convenience, only one thick line is used to represent it in Fig. 5, but this does not mean that there is only one bus or one type of bus.
In the embodiments of the present application, the above embodiments may refer to and learn from one another, and identical or similar steps and terms are not repeated one by one.
Alternatively, some or all of the above modules may be implemented in the form of integrated circuits embedded on a chip of the text marking equipment. They may be implemented separately or integrated together. That is, the above modules may be configured as one or more integrated circuits implementing the above methods, for example: one or more application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), or one or more digital signal processors (Digital Signal Processor, DSP), or one or more field-programmable gate arrays (Field Programmable Gate Array, FPGA), etc.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 2702 including instructions, where the above instructions can be executed by the processor 2701 of the above text marking equipment to complete the above methods. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
There is provided a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by the processor of the text marking equipment, the text marking equipment is enabled to execute the above text marking method.
In the above embodiments, implementation may be wholly or partly by software, hardware, firmware or any combination thereof. When implemented in software, it may be wholly or partly realized in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are wholly or partly produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, text marking equipment or data center to another web site, computer, text marking equipment or data center in a wired manner (for example, coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (for example, infrared, wireless, microwave, etc.). The computer-readable storage medium may be any usable medium that the computer can access, or a data storage device such as a text marking equipment or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk or tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the embodiments of the present application may be realized with hardware, software, firmware or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any usable medium that a general-purpose or special-purpose computer can access.
It should be understood that the present application is not limited to the precise structures that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.
Claims (10)
1. A text marking method, characterized by comprising:

obtaining a text to be processed, the text to be processed containing at least one word to be processed;

determining, according to a preset weight dictionary, the label and score of each of the at least one word to be processed, wherein the weight dictionary contains at least one word, and each of the at least one word has a label and a score;

determining the classification of the text to be processed according to the label and score of each word to be processed in the text to be processed.
2. The method according to claim 1, wherein the determining of the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed comprises:
determining, according to the score of each word to be processed in the text to be processed, a score sum of the words to be processed under each label in the text to be processed;
determining the label with the largest score sum as the category of the text to be processed.
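A minimal Python sketch of the classification in claims 1 and 2. The weight dictionary, its words, labels, and scores are all illustrative assumptions, not values from the patent:

```python
from collections import defaultdict

# Hypothetical weight dictionary: word -> (label, score). How such a
# dictionary is actually built is the subject of claims 5 and 6.
WEIGHT_DICT = {
    "goal": ("sports", 0.9),
    "match": ("sports", 0.7),
    "stock": ("finance", 0.8),
    "market": ("finance", 0.6),
}

def classify(words):
    """Look up each word's label and score, sum the scores under each
    label, and return the label with the largest score sum (claims 1-2)."""
    totals = defaultdict(float)
    for w in words:
        if w in WEIGHT_DICT:
            label, score = WEIGHT_DICT[w]
            totals[label] += score
    if not totals:
        return None  # no known word: no category can be determined
    return max(totals, key=totals.get)

print(classify(["goal", "match", "stock"]))  # sports: 1.6 vs finance: 0.8
```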
3. The method according to claim 2, wherein, if the number of occurrences of a word to be processed in the text to be processed is greater than 1, the determining of the score sum of the words to be processed under each label in the text to be processed according to the score of each word to be processed comprises:
determining, according to the score and the number of occurrences of each word to be processed in the text to be processed, the score sum of the words to be processed under each label in the text to be processed as Σ_{i=1}^{N} x_i·y_i, wherein N is the total number of words to be processed under each label, the total numbers of words to be processed under different labels being identical or different, x_i is the score of the i-th word to be processed under each label, y_i is the number of occurrences of the i-th word to be processed under each label, i ∈ [1, N], and i is a positive integer;
alternatively, determining, using a non-linear summation method and according to the score and the number of occurrences of each word to be processed in the text to be processed, the score sum of the words to be processed under each label in the text to be processed as Σ_{i=1}^{N} x_i·ln(y_i + e), wherein e is the natural constant and ln is the natural logarithm.
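The two summation alternatives of claim 3 can be sketched as follows. The linear sum follows directly from the claim's definitions of x_i (score) and y_i (number of occurrences); the exact non-linear formula appears only as an image in the original publication, so ln(y_i + e) is one plausible reading assumed here, chosen because it uses the natural constant e and the natural logarithm ln named in the claim:

```python
import math

def linear_sum(scores_and_counts):
    # First alternative: sum of x_i * y_i over the words under a label.
    return sum(x * y for x, y in scores_and_counts)

def nonlinear_sum(scores_and_counts):
    # Second alternative (assumed form): sum of x_i * ln(y_i + e).
    # Compared with the linear sum, the logarithm damps the influence
    # of words that occur very many times.
    return sum(x * math.log(y + math.e) for x, y in scores_and_counts)

pairs = [(0.9, 3), (0.7, 1)]  # (score, occurrence count) per word
print(linear_sum(pairs))  # 0.9*3 + 0.7*1 = 3.4
```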
4. The method according to claim 2, wherein the determining of the label with the largest score sum as the category of the text to be processed comprises:
determining the label with the largest score sum;
if the score sum of the label with the largest score sum is within a preset threshold range, determining the label with the largest score sum as the category of the text to be processed.
5. The method according to claim 1, further comprising, before the obtaining of the text to be processed:
obtaining at least one text to be analyzed, wherein each of the at least one text to be analyzed comprises at least one word to be analyzed, and the word to be analyzed is provided with a label;
extracting, according to the words to be analyzed in each text to be analyzed, keywords of each text to be analyzed to obtain a keyword set, wherein the keyword set comprises at least one keyword subset, and each of the at least one keyword subset comprises keywords belonging to the same label;
determining a score for each keyword according to a preset set of association-degree coefficients, wherein the set of association-degree coefficients comprises an association-degree coefficient of at least one keyword, and the association-degree coefficient characterizes the degree of association between a keyword and a label;
constructing the weight dictionary according to the keyword set and the score of each keyword.
6. The method according to claim 5, further comprising, before the determining of the score of each keyword according to the preset set of association-degree coefficients:
counting word-frequency information of the keywords in each keyword subset;
removing duplicated keywords from each keyword subset to obtain processed keyword subsets;
sorting the keywords in each processed keyword subset according to the word-frequency information to obtain a sorted keyword set.
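Claims 5 and 6 describe building the weight dictionary from labeled texts. A minimal sketch under simplifying assumptions: keyword extraction is reduced to taking each text's tokens as given, and the association-degree coefficients are hypothetical values, since the patent leaves both open:

```python
from collections import Counter

# Toy labeled corpus: (label, words to be analyzed). Each label's words
# form one keyword subset (claim 5).
corpus = [
    ("sports", ["goal", "match", "goal"]),
    ("finance", ["stock", "market", "stock", "stock"]),
]

# Hypothetical association-degree coefficients: degree of association
# between each keyword and its label (claim 5).
assoc = {"goal": 0.9, "match": 0.7, "stock": 0.8, "market": 0.6}

def build_weight_dict(corpus, assoc):
    weight_dict = {}
    for label, words in corpus:
        freq = Counter(words)  # word-frequency information (claim 6)
        # most_common() both deduplicates and sorts by descending
        # frequency (claim 6); the coefficient supplies the score.
        for word, _count in freq.most_common():
            weight_dict[word] = (label, assoc.get(word, 0.0))
    return weight_dict

wd = build_weight_dict(corpus, assoc)
print(wd["goal"])  # ('sports', 0.9)
```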
7. The method according to any one of claims 1-6, further comprising, after the determining of the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed:
obtaining the words to be processed in the text to be processed whose category has been determined, wherein the words to be processed have labels and scores;
updating the weight dictionary according to the words to be processed.
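The dictionary update of claim 7 can be sketched as follows. The overwrite policy for already-known words is an assumption; the claim only requires that the dictionary be updated from the classified text's words:

```python
def update_weight_dict(weight_dict, classified_words):
    """Feed the (word, label, score) triples of a text whose category has
    been determined back into the weight dictionary (claim 7). New words
    are added; known words are overwritten with the latest values (an
    assumed policy, not fixed by the patent)."""
    for word, label, score in classified_words:
        weight_dict[word] = (label, score)
    return weight_dict

wd = {"goal": ("sports", 0.9)}
update_weight_dict(wd, [("league", "sports", 0.5)])
print(sorted(wd))  # ['goal', 'league']
```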
8. A text labeling apparatus, comprising:
a first obtaining module, configured to obtain a text to be processed, the text to be processed comprising at least one word to be processed;
a first determining module, configured to determine, according to a preset weight dictionary, a label and a score for each of the at least one word to be processed, wherein the weight dictionary comprises at least one word, and each of the at least one word has a label and a score;
a second determining module, configured to determine a category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
9. A text labeling device, comprising: a processor, a memory, and a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method according to any one of claims 1-7.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to implement the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811240652.7A CN109388714B (en) | 2018-10-23 | 2018-10-23 | Text labeling method, device, equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109388714A true CN109388714A (en) | 2019-02-26 |
CN109388714B CN109388714B (en) | 2020-11-24 |
Family
ID=65427659
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811240652.7A Active CN109388714B (en) | 2018-10-23 | 2018-10-23 | Text labeling method, device, equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109388714B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120303661A1 (en) * | 2011-05-27 | 2012-11-29 | International Business Machines Corporation | Systems and methods for information extraction using contextual pattern discovery |
CN103164471A (en) * | 2011-12-15 | 2013-06-19 | 盛乐信息技术(上海)有限公司 | Recommendation method and system for video text labels |
CN107807920A (en) * | 2017-11-17 | 2018-03-16 | 新华网股份有限公司 | Method, device, and server for constructing an emotion dictionary based on big data |
CN108415959A (en) * | 2018-02-06 | 2018-08-17 | 北京捷通华声科技股份有限公司 | Text classification method and device |
CN108628875A (en) * | 2017-03-17 | 2018-10-09 | 腾讯科技(北京)有限公司 | Text label extraction method, device, and server |
Also Published As
Publication number | Publication date |
---|---|
CN109388714B (en) | 2020-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108073568B (en) | Keyword extraction method and device | |
CN107463658B (en) | Text classification method and device | |
CN108363790A (en) | Method, apparatus, device, and storage medium for evaluation | |
CN107844559A (en) | File classification method, device, and electronic device | |
CN107506389B (en) | Method and device for extracting job skill requirements | |
CN109299271A (en) | Training sample generation, text data and public-opinion event classification methods and related devices | |
CN110688452B (en) | Text semantic similarity evaluation method, system, medium and device | |
CN110188047A (en) | Duplicate defect report detection method based on a dual-channel convolutional neural network | |
CN109388634B (en) | Address information processing method, terminal device and computer readable storage medium | |
CN107545038B (en) | Text classification method and equipment | |
CN112650923A (en) | Public opinion processing method and device for news events, storage medium and computer equipment | |
CN104796300B (en) | Packet feature extraction method and device | |
CN110334343B (en) | Method and system for extracting personal privacy information from contracts | |
CN107665221A (en) | Keyword classification method and device | |
CN109800309A (en) | Classroom discourse genre classification method and device | |
CN103942274B (en) | LDA-based labeling system and method for biomedical images | |
CN110134777A (en) | Question deduplication method, device, electronic device and computer-readable storage medium | |
CN111539612B (en) | Training method and system for a risk classification model | |
CN112487146A (en) | Method, device, and computer equipment for obtaining the focus of dispute in legal cases | |
CN113743079A (en) | Text similarity calculation method and device based on a co-occurrence entity interaction graph | |
CN113590809A (en) | Method and device for automatically generating judgment document abstracts | |
CN109871889B (en) | Public psychological assessment method in emergencies | |
CN111611781A (en) | Data labeling method, question answering method, device and electronic equipment | |
CN114281983B (en) | Hierarchical text classification method, system, electronic device and storage medium | |
CN113656575B (en) | Training data generation method and device, electronic equipment and readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||