CN109388714B - Text labeling method, device, equipment and computer readable storage medium - Google Patents


Info

Publication number
CN109388714B
CN109388714B (application number CN201811240652.7A)
Authority
CN
China
Prior art keywords
processed
word
text
label
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811240652.7A
Other languages
Chinese (zh)
Other versions
CN109388714A (en)
Inventor
申勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201811240652.7A priority Critical patent/CN109388714B/en
Publication of CN109388714A publication Critical patent/CN109388714A/en
Application granted granted Critical
Publication of CN109388714B publication Critical patent/CN109388714B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a text labeling method, a text labeling device, text labeling equipment and a computer-readable storage medium, wherein the method comprises the following steps: acquiring a text to be processed, wherein the text to be processed comprises at least one word to be processed; determining a label and a score of each word to be processed in at least one word to be processed according to a preset weight dictionary, wherein the weight dictionary comprises at least one word, and each word in the at least one word has the label and the score; and determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed. The method for automatically labeling the words in the text to be processed and classifying the text to be processed is provided, a large amount of human resources are not needed for labeling the text, the labor cost can be saved, the cost of text labeling is reduced, and the labeling efficiency is improved.

Description

Text labeling method, device, equipment and computer readable storage medium
Technical Field
The present application relates to data processing technologies, and in particular, to a text labeling method, apparatus, device, and computer-readable storage medium.
Background
In the field of machine learning, data labeling is often required for a text to be trained or recognized, and then words in the text are labeled and the text is classified.
In the prior art, when data labeling is performed on a text, a manual labeling method is generally adopted; specifically, the words in the text are distinguished manually, the label of each word is determined, and the words are then classified; the category to which each text belongs is also determined manually, e.g., whether the text belongs to sports news or entertainment news.
However, in the prior art, each text needs to be labeled manually, and such a manner needs a large amount of labor cost, so that the cost of text labeling is high; and labeling efficiency is low.
Disclosure of Invention
The application provides a text labeling method, a text labeling device, text labeling equipment and a computer readable storage medium, which are used for solving the problems of high text labeling cost and low labeling efficiency.
In a first aspect, the present application provides a text annotation method, including:
acquiring a text to be processed, wherein the text to be processed comprises at least one word to be processed;
determining a label and a score of each word to be processed in the at least one word to be processed according to a preset weight dictionary, wherein the weight dictionary comprises at least one word, and each word in the at least one word has the label and the score;
and determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
Further, determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed, including:
determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score of each word to be processed in the text to be processed;
and determining the label with the maximum score sum as the category of the text to be processed.
Further, if the number of occurrences of the to-be-processed word in the to-be-processed text is greater than 1, determining the sum of the scores of the to-be-processed words under each label in the to-be-processed text according to the score of each to-be-processed word in the to-be-processed text, including:
determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score and the number of occurrences of each word to be processed in the text to be processed, as

\sum_{i=1}^{N} x_i \cdot y_i

wherein N is the total number of the words to be processed under each label, the total number of the words to be processed under different labels is the same or different, x_i is the score of the i-th word to be processed under each label, y_i is the number of occurrences of the i-th word to be processed under each label, i ∈ [1, N], and i is a positive integer;

or determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score of each word to be processed in the text to be processed, as

\sum_{i=1}^{N} x_i

or determining, by a nonlinear summation method, the sum of the scores of the words to be processed under each label in the text to be processed according to the score and the number of occurrences of each word to be processed in the text to be processed, as

\sum_{i=1}^{N} x_i \cdot \ln(y_i + e - 1)

wherein e is a natural constant and ln is a natural logarithm.
Further, determining the label with the largest sum of scores as the category of the text to be processed includes:
determining the label with the largest score sum;
and if the total score of the labels with the maximum total score is determined to be within the preset threshold range, determining the labels with the maximum total score as the categories of the texts to be processed.
Further, before the obtaining the text to be processed, the method further includes:
the method comprises the steps of obtaining at least one text to be analyzed, wherein each text to be analyzed in the at least one text to be analyzed comprises at least one word to be analyzed, and the word to be analyzed is provided with a label;
extracting keywords of each text to be analyzed according to the words to be analyzed in each text to be analyzed to obtain a keyword set, wherein the keyword set comprises at least one keyword subset, and each keyword subset of the at least one keyword subset comprises keywords belonging to the same label;
determining the score of each keyword according to a preset association coefficient set, wherein the association coefficient set comprises an association coefficient of at least one keyword, and the association coefficient represents the association degree between the keyword and the label;
and constructing the weight dictionary according to the keyword set and the score of each keyword.
Further, before the determining the score of each keyword according to a preset association coefficient set, the method further includes:
counting word frequency information of the keywords in each keyword subset;
removing the keywords which repeatedly appear in each keyword subset to obtain each processed keyword subset;
and sequencing the keywords in each processed keyword subset according to the word frequency information to obtain a sequenced keyword set.
Further, after determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed, the method further includes:
acquiring words to be processed in the text to be processed with the determined category, wherein the words to be processed have labels and scores;
and updating the weight dictionary according to the words to be processed.
Further, according to the word to be processed, updating the weight dictionary, including:
deleting the words in the weight dictionary which are the same as the words to be processed to obtain a deleted weight dictionary;
and adding the words to be processed into the weight dictionary after deletion processing to obtain an updated weight dictionary.
Further, deleting the words in the weight dictionary which are the same as the words to be processed to obtain a weight dictionary after deletion processing, including:
determining words in the weight dictionary which are the same as the words to be processed;
judging whether the score of the word to be processed is larger than the score of the word same as the word to be processed;
and if so, replacing the words which are the same as the words to be processed with the words to be processed to obtain an updated weight dictionary.
In a second aspect, the present application provides a text labeling apparatus, including:
the device comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is used for acquiring a text to be processed, and the text to be processed comprises at least one word to be processed;
the first determining module is used for determining a label and a score of each word to be processed in the at least one word to be processed according to a preset weight dictionary, wherein the weight dictionary comprises at least one word, and each word in the at least one word has the label and the score;
and the second determining module is used for determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
Further, the second determining module includes:
the first determining submodule is used for determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score of each word to be processed in the text to be processed;
and the second determining submodule is used for determining the label with the maximum score sum as the category of the text to be processed.
Further, if the number of occurrences of the word to be processed in the text to be processed is greater than 1, the first determining sub-module is specifically configured to:
determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score and the number of occurrences of each word to be processed in the text to be processed, as

\sum_{i=1}^{N} x_i \cdot y_i

wherein N is the total number of the words to be processed under each label, the total number of the words to be processed under different labels is the same or different, x_i is the score of the i-th word to be processed under each label, y_i is the number of occurrences of the i-th word to be processed under each label, i ∈ [1, N], and i is a positive integer;

or determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score of each word to be processed in the text to be processed, as

\sum_{i=1}^{N} x_i

or determining, by a nonlinear summation method, the sum of the scores of the words to be processed under each label in the text to be processed according to the score and the number of occurrences of each word to be processed in the text to be processed, as

\sum_{i=1}^{N} x_i \cdot \ln(y_i + e - 1)

wherein e is a natural constant and ln is a natural logarithm.
Further, the second determining submodule is specifically configured to:
determining the label with the largest score sum;
and if the total score of the labels with the maximum total score is determined to be within the preset threshold range, determining the labels with the maximum total score as the categories of the texts to be processed.
Further, the apparatus further comprises:
the second acquisition module is used for acquiring at least one text to be analyzed before the first acquisition module acquires the text to be processed, wherein each text to be analyzed in the at least one text to be analyzed comprises at least one word to be analyzed, and the word to be analyzed is provided with a label;
the extraction module is used for extracting keywords of each text to be analyzed according to the words to be analyzed in each text to be analyzed so as to obtain a keyword set, wherein the keyword set comprises at least one keyword subset, and each keyword subset in the at least one keyword subset comprises keywords belonging to the same label;
the third determining module is used for determining the score of each keyword according to a preset association coefficient set, wherein the association coefficient set comprises an association coefficient of at least one keyword, and the association coefficient represents the association degree between the keyword and the label;
and the construction module is used for constructing the weight dictionary according to the keyword set and the score of each keyword.
Further, the apparatus further comprises:
the statistic module is used for counting the word frequency information of the keywords in each keyword subset before the third determining module determines the score of each keyword according to a preset association coefficient set;
the removing module is used for removing the repeated keywords in each keyword subset to obtain each processed keyword subset;
and the sorting module is used for sorting the keywords in each processed keyword subset according to the word frequency information to obtain a sorted keyword set.
Further, the apparatus further comprises:
the third obtaining module is used for obtaining the words to be processed in the texts to be processed with the determined categories after the second determining module determines the categories of the texts to be processed according to the labels and the scores of the words to be processed in the texts to be processed, wherein the words to be processed have the labels and the scores;
and the updating module is used for updating the weight dictionary according to the words to be processed.
Further, the update module includes:
the deleting submodule is used for deleting the words in the weight dictionary, wherein the words are the same as the words to be processed, and the weight dictionary after deletion processing is obtained;
and the adding submodule is used for adding the words to be processed into the weight dictionary after the deletion processing to obtain an updated weight dictionary.
Further, the delete sub-module is specifically configured to:
determining words in the weight dictionary which are the same as the words to be processed;
judging whether the score of the word to be processed is larger than the score of the word same as the word to be processed;
and if so, replacing the words which are the same as the words to be processed with the words to be processed to obtain an updated weight dictionary.
In a third aspect, the present application provides a text annotation apparatus comprising means or units for performing the steps of any of the methods of the first aspect above.
In a fourth aspect, the present application provides a text annotation apparatus comprising a processor, a memory and a computer program, wherein the computer program is stored in the memory and configured to be executed by the processor to implement any of the methods of the first aspect.
In a fifth aspect, the present application provides a text annotation apparatus comprising at least one processing element or chip for performing any of the methods of the first aspect above.
In a sixth aspect, the present application provides a computer program for performing any of the methods of the first aspect above when executed by a processor.
In a seventh aspect, the present application provides a computer readable storage medium having the computer program of the sixth aspect stored thereon.
According to the text labeling method, the text labeling device, the text labeling equipment and the computer readable storage medium, the text to be processed is obtained, and the text to be processed comprises at least one word to be processed; determining a label and a score of each word to be processed in at least one word to be processed according to a preset weight dictionary, wherein the weight dictionary comprises at least one word, and each word in the at least one word has the label and the score; and determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed. The method for automatically labeling the words in the text to be processed and classifying the text to be processed is provided, a large amount of human resources are not needed for labeling the text, the labor cost can be saved, the cost of text labeling is reduced, and the labeling efficiency is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flowchart of a text annotation method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another text annotation method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a text annotation device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of another text labeling apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a text annotation device according to an embodiment of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terms referred to in this application are explained first:
text: there is at least one word in the text.
The TextRank algorithm: TextRank is a prior-art algorithm; it is based on PageRank and is used to generate keywords and abstracts for texts.
A typical application scenario of the present application is as follows: in the field of machine learning, data labeling is often required for text to be trained or recognized, in which the words in the text are labeled and the text is classified. Data labeling mainly involves classifying, framing, annotating and marking the data in the text according to the classification requirements. In the prior art, text is usually labeled manually, in some cases with the help of convenient annotation frameworks. However, in that labeling mode each text sample is identified and labeled by hand, so the labeling efficiency is low, a large amount of labor is consumed, and the labeling cost is high.
The text labeling method, device, equipment and computer readable storage medium provided by the application aim to solve the technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a text annotation method according to an embodiment of the present application. As shown in fig. 1, the method includes:
step 101, a text to be processed is obtained, wherein the text to be processed comprises at least one word to be processed.
In this embodiment, specifically, the execution main body of this embodiment may be a control device, a terminal device, a text labeling device, other devices or apparatuses that can execute the method of this embodiment, and the like.
Firstly, a text to be processed is input into the execution main body of the embodiment, wherein the text to be processed is a text to be labeled. The text to be processed comprises one or more words to be processed, wherein the language of the words is not limited.
Step 102, determining a label and a score of each word to be processed in at least one word to be processed according to a preset weight dictionary, wherein the weight dictionary comprises at least one word, and each word in the at least one word has the label and the score.
In this embodiment, specifically, a weight dictionary has been set in advance, one or more words are included in the weight dictionary, each word is set with one or more labels, and each word has a score under each label. The domain to which the weight dictionary belongs is the same as the domain of the text to be processed.
For example, the weight dictionary includes word 1, word 2, word 3, and word 4; the word 1 has a label A, and the score of the word 1 belonging to the label A is a; the word 2 has a label B, and the score of the word 2 belonging to the label B is B; the word 3 has a label C, and the score of the word 3 belonging to the label C is C; word 4 has a label D, and the score for word 4 belonging to label D is D.
For another example, the weight dictionary includes word 1, word 2, word 3, and word 4; word 1 has label a and label B, the score of word 1 belonging to label a is a1, and the score of word 1 belonging to label B is a 2; word 2 has label B, label C and label D, the score for word 2 belonging to label B is B1, the score for word 2 belonging to label C is B2, and the score for word 2 belonging to label D is B3; word 3 has label C and label D, the score for word 3 belonging to label C is C1, and the score for word 3 belonging to label D is C2; word 4 has label D, and the score for word 4 belonging to label D is D1. In this manner, the maximum score of the tags to which each word belongs may be determined as the final tag of the word. For example, word 1 has label a and label B, word 1 belonging to label a has a score of a1, and word 1 belonging to label B has a score of a 2; a1 is greater than a2, then the final label for word 1 is label a.
Because the label and the score of each word are set in the weight dictionary, the label and the score corresponding to each word to be processed in the text to be processed can be analyzed according to the weight dictionary.
For example, the weight dictionary includes word 1, word 2, word 3, and word 4; the word 1 has a label A, and the score of the word 1 belonging to the label A is a; the word 2 has a label B, and the score of the word 2 belonging to the label B is B; the word 3 has a label C, and the score of the word 3 belonging to the label C is C; word 4 has a label D, and the score for word 4 belonging to label D is D. If the word 1 to be processed in the text to be processed is the word 1, the label of the word 1 to be processed is A, and the score of the word 1 to be processed belonging to the label A is a; and if the word 2 to be processed in the text to be processed is the word 3, the label of the word 2 to be processed is C, and the score of the word 2 to be processed belonging to the label C is C.
For example, the classification of the tags may be: sports type tags, entertainment type tags, cultural type tags, outing type tags, educational type tags, and the like.
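To make the lookup in step 102 concrete, the following is a minimal Python sketch of how a weight dictionary and the per-word labeling might be represented; the dictionary layout, the example words and the label names are illustrative assumptions, not structures defined by the patent.

```python
WEIGHT_DICT = {
    # word -> {label: score}; contents are purely illustrative
    "goal":    {"sports": 0.9, "entertainment": 0.2},
    "concert": {"entertainment": 0.8},
    "exam":    {"education": 0.7},
}

def label_words(words_to_process):
    """Return (word, label, score) for each word found in the weight dictionary.

    When a word carries several labels, the label with the highest score is
    taken as its final label, as described above.
    """
    labeled = []
    for word in words_to_process:
        label_scores = WEIGHT_DICT.get(word)
        if not label_scores:
            continue  # words absent from the weight dictionary are skipped here
        label, score = max(label_scores.items(), key=lambda kv: kv[1])
        labeled.append((word, label, score))
    return labeled

print(label_words(["goal", "concert", "goal", "exam", "unknown"]))
# [('goal', 'sports', 0.9), ('concert', 'entertainment', 0.8),
#  ('goal', 'sports', 0.9), ('exam', 'education', 0.7)]
```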
And 103, determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
Optionally, step 103 specifically includes the following steps:
step 1031, determining the total score of the words to be processed under each label in the text to be processed according to the score of each word to be processed in the text to be processed.
And step 1032, determining the label with the maximum score sum as the category of the text to be processed.
Optionally, if the number of occurrences of the word to be processed in the text to be processed is greater than 1, step 1031 specifically includes:
determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score and the number of occurrences of each word to be processed in the text to be processed, as

\sum_{i=1}^{N} x_i \cdot y_i

wherein N is the total number of the words to be processed under each label, the total number of the words to be processed under different labels is the same or different, x_i is the score of the i-th word to be processed under each label, y_i is the number of occurrences of the i-th word to be processed under each label, i ∈ [1, N], and i is a positive integer;

or determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score of each word to be processed in the text to be processed, as

\sum_{i=1}^{N} x_i

or determining, by a nonlinear summation method, the sum of the scores of the words to be processed under each label in the text to be processed according to the score and the number of occurrences of each word to be processed in the text to be processed, as

\sum_{i=1}^{N} x_i \cdot \ln(y_i + e - 1)

wherein e is a natural constant and ln is a natural logarithm.
Optionally, step 1032 specifically includes: determining the label with the largest score sum; and if the total score of the label with the maximum total score is determined to be within the preset threshold range, determining the label with the maximum total score as the category of the text to be processed.
In this embodiment, specifically, a label is set for each word to be processed in the text to be processed, and a score of the word to be processed belonging to the corresponding label is set for each word to be processed; the text to be processed has at least one tag therein.
Firstly, determining words to be processed belonging to the same label; and for the words to be processed belonging to the same label, calculating the total score of the words to be processed belonging to the same label according to the respective scores of the words to be processed belonging to the same label, thereby endowing each label in the text to be processed with a total score.
Then, sorting each label in the text to be processed according to the score sum of each label in the text to be processed, for example, sorting each label in the text to be processed according to the order of the score sums from large to small; and taking the label with the maximum score sum as the category of the text to be processed.
Specifically, this step is divided into the following several implementations when determining the total score of the words to be processed under each label in the text to be processed. The scores of the words to be processed under the same label can be the same or different.
The first implementation manner is as follows: if the number of occurrences of each word to be processed in the text to be processed is equal to 1, the sum of the scores of the words to be processed under each label can be calculated directly for each label, so as to obtain the sum of the scores of the words to be processed under each label.
For example, the label of the word to be processed 1 in the text to be processed is a, and the score of the word to be processed 1 belonging to the label a is a 1; the label of the word 2 to be processed in the text to be processed is A, and the score of the word 2 to be processed belonging to the label A is a 2; the label of the word 3 to be processed in the text to be processed is A, and the score of the word 3 to be processed belonging to the label A is a 3; the label of the word 4 to be processed in the text to be processed is B, and the score of the word 4 to be processed belonging to the label B is B1; the label of the word 5 to be processed in the text to be processed is B, and the score of the label B to which the word 5 to be processed belongs is B2; the label of the word 6 to be processed in the text to be processed is C, and the score of the word 6 to be processed belonging to the label C is C1; the label of the word 7 to be processed in the text to be processed is C, and the score of the word 7 to be processed belonging to the label C is C2; the label of the word 8 to be processed in the text to be processed is C, and the score of the word 8 to be processed belonging to the label C is C3. The text to be processed has 3 kinds of labels, namely a label A, a label B and a label C. Each of the above scores is different.
According to the first implementation manner, the sum of the scores of the words to be processed under the label A is determined to be a1 + a2 + a3 according to the score a1 of the word to be processed 1, the score a2 of the word to be processed 2 and the score a3 of the word to be processed 3 under the label A; the sum of the scores of the words to be processed under the label B is determined to be b1 + b2 according to the score b1 of the word to be processed 4 and the score b2 of the word to be processed 5 under the label B; and the sum of the scores of the words to be processed under the label C is determined to be c1 + c2 + c3 according to the score c1 of the word to be processed 6, the score c2 of the word to be processed 7 and the score c3 of the word to be processed 8 under the label C.
The second implementation manner is as follows: as long as the number of occurrences of any one word to be processed in the text to be processed is greater than 1, the sum of the scores of the words to be processed under each label in the text to be processed may be calculated in a multiple linear summation manner; specifically, according to the score x_i and the number of occurrences y_i of each word to be processed under each label in the text to be processed, the sum of the scores of the words to be processed under each label in the text to be processed is calculated as

\sum_{i=1}^{N} x_i \cdot y_i

wherein N is the total number of the words to be processed under each label, the total number of the words to be processed under different labels can be the same or different, x_i is the score of the i-th word to be processed under each label, and y_i is the number of occurrences of the i-th word to be processed under each label. Then it is judged whether the sum of scores of the label with the largest sum is within a preset threshold range; if it is within the preset threshold range, the label with the largest sum of scores can be used as the category of the text to be processed; if it is not within the preset threshold range, the category of the text to be processed is labeled manually, and the manually labeled category of the text to be processed is then acquired.
For example, the label of the word to be processed 1 in the text to be processed is a, the score of the word to be processed 1 belonging to the label a is a1, and the occurrence frequency of the word to be processed 1 is m 1; the label of the word 2 to be processed in the text to be processed is A, the score of the word 2 to be processed belonging to the label A is a2, and the occurrence frequency of the word 2 to be processed is m 2; the label of the word 3 to be processed in the text to be processed is A, the score of the word 3 to be processed belonging to the label A is a3, and the occurrence frequency of the word 3 to be processed is m 3; the label of the word 4 to be processed in the text to be processed is B, the score of the word 4 to be processed belonging to the label B is B1, and the occurrence frequency of the word 4 to be processed is m 4; the label of the word 5 to be processed in the text to be processed is B, the score of the label B to which the word 5 to be processed belongs is B2, and the occurrence frequency of the word 5 to be processed is m 5; the label of the word 6 to be processed in the text to be processed is C, the score of the word 6 to be processed belonging to the label C is C1, and the occurrence frequency of the word 6 to be processed is m 6; the label of the word 7 to be processed in the text to be processed is C, the score of the word 7 to be processed belonging to the label C is C2, and the occurrence frequency of the word 7 to be processed is m 7; the label of the word 8 to be processed in the text to be processed is C, the score of the word 8 to be processed belonging to the label C is C3, and the occurrence frequency of the word 8 to be processed is m 8. The text to be processed has 3 kinds of labels, namely a label A, a label B and a label C. Each of the above scores is different.
Then, according to the second implementation manner, the sum of the scores of the words to be processed under label A is determined to be a1×m1 + a2×m2 + a3×m3, the sum of the scores of the words to be processed under label B is determined to be b1×m4 + b2×m5, and the sum of the scores of the words to be processed under label C is determined to be c1×m6 + c2×m7 + c3×m8.
The third implementation manner is as follows: as long as the number of occurrences of any one word to be processed in the text to be processed is greater than 1, the sum of the scores of the words to be processed under each label in the text to be processed can be calculated in a single linear summation manner; specifically, according to the score x_i of each word to be processed under each label in the text to be processed, the sum of the scores of the words to be processed under each label in the text to be processed is calculated as

\sum_{i=1}^{N} x_i
It can be known that, at this time, no matter how many times the word to be processed appears, the calculation is performed only once according to the appearance of the word to be processed.
For example, according to the above illustration, the sum of the scores of the words to be processed under label A is determined to be a1 + a2 + a3, the sum of the scores of the words to be processed under label B is determined to be b1 + b2, and the sum of the scores of the words to be processed under label C is determined to be c1 + c2 + c3.
The fourth implementation manner is as follows: as long as the number of occurrences of any one word to be processed in the text to be processed is greater than 1, the sum of the scores of the words to be processed under each label in the text to be processed may be calculated in a nonlinear summation manner; specifically, according to the score x_i and the number of occurrences y_i of each word to be processed under each label in the text to be processed, the sum of the scores of the words to be processed under each label in the text to be processed is calculated as

\sum_{i=1}^{N} x_i \cdot \ln(y_i + e - 1)
For example, according to the above illustration, the sum of the scores of the words to be processed under label A is determined to be a1×ln(m1+e-1) + a2×ln(m2+e-1) + a3×ln(m3+e-1), the sum of the scores of the words to be processed under label B is determined to be b1×ln(m4+e-1) + b2×ln(m5+e-1), and the sum of the scores of the words to be processed under label C is determined to be c1×ln(m6+e-1) + c2×ln(m7+e-1) + c3×ln(m8+e-1).
In this step, when the category of the text to be processed is determined, it needs to be judged whether the sum of scores of the label with the largest sum is within a preset threshold range; if the sum is within the preset threshold range, the label with the largest sum of scores can be used as the category of the text to be processed; if the sum is not within the preset threshold range, the category of the text to be processed is labeled manually, and the manually labeled category of the text to be processed is then acquired. The preset threshold range may be an interval, or it may simply require that the sum of scores be greater than a certain value.
For example, according to the above illustration, if the total score of the tags C is the largest and the total score of the tags C is within a preset threshold range, the tags C can be determined to be the category of the text to be processed, that is, the category of the text to be processed is C.
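The following Python sketch shows how the sums described above and the threshold check might be computed under the formulas given; the mode names, the threshold value and the fallback return value (standing in for manual labeling) are illustrative assumptions rather than elements defined by the patent.

```python
import math
from collections import Counter, defaultdict

def classify(labeled_words, mode="multiple", threshold=1.0):
    """labeled_words: (word, label, score) tuples for one text to be processed."""
    counts = Counter((word, label) for word, label, _ in labeled_words)
    totals = defaultdict(float)
    seen = set()
    for word, label, score in labeled_words:
        if (word, label) in seen:
            continue                       # each distinct word contributes one term
        seen.add((word, label))
        y = counts[(word, label)]          # number of occurrences y_i
        if mode == "multiple":             # sum of x_i * y_i
            totals[label] += score * y
        elif mode == "single":             # sum of x_i, occurrences ignored
            totals[label] += score
        else:                              # nonlinear: sum of x_i * ln(y_i + e - 1)
            totals[label] += score * math.log(y + math.e - 1)
    if not totals:
        return None
    best_label, best_sum = max(totals.items(), key=lambda kv: kv[1])
    # "within the preset threshold range" is modeled here as a simple lower bound
    return best_label if best_sum >= threshold else None   # None: label manually

words = [("goal", "sports", 0.9), ("goal", "sports", 0.9), ("concert", "entertainment", 0.8)]
print(classify(words, mode="multiple"))   # 'sports' (0.9 * 2 = 1.8 vs 0.8)
```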
In the embodiment, a text to be processed is obtained, and the text to be processed comprises at least one word to be processed; determining a label and a score of each word to be processed in at least one word to be processed according to a preset weight dictionary, wherein the weight dictionary comprises at least one word, and each word in the at least one word has the label and the score; and determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed. The method for automatically labeling the words in the text to be processed and classifying the text to be processed is provided, a large amount of human resources are not needed for labeling the text, the labor cost can be saved, the cost of text labeling is reduced, and the labeling efficiency is improved.
Fig. 2 is a schematic flowchart of another text annotation method according to an embodiment of the present application. As shown in fig. 2, the method includes:
step 201, at least one text to be analyzed is obtained, wherein each text to be analyzed in the at least one text to be analyzed includes at least one word to be analyzed, and the word to be analyzed is provided with a label.
In this embodiment, specifically, a weight dictionary needs to be constructed first. Specifically, one or more texts to be analyzed are obtained, each text to be analyzed includes one or more words to be analyzed, and a label is already set for each word to be analyzed. For example, the words to be analyzed in the text to be analyzed are labeled by means of manual labeling.
For example, the text 1 to be analyzed has a word 1 to be analyzed, a word 2 to be analyzed, a word 3 to be analyzed, a word 4 to be analyzed, and a word 5 to be analyzed; the label of the word to be analyzed 1 is a sports label, the label of the word to be analyzed 2 is an entertainment label, the label of the word to be analyzed 3 is an entertainment label, the label of the word to be analyzed 4 is an education label, and the label of the word to be analyzed 5 is a tour label.
Step 202, extracting keywords of each text to be analyzed according to the words to be analyzed in each text to be analyzed to obtain a keyword set, wherein the keyword set comprises at least one keyword subset, and each keyword subset in the at least one keyword subset comprises keywords belonging to the same tag.
In this embodiment, specifically, the TextRank algorithm is used to process each text to be analyzed; according to the words to be analyzed in each text to be analyzed, the keywords in each text to be analyzed can be extracted, and the keywords from all the texts to be analyzed are put into one keyword set. Each keyword carries a label.
The method comprises the steps of dividing keywords belonging to the same label in a keyword set into a keyword subset to further obtain one or more keyword subsets, wherein each keyword subset comprises one or more keywords, and the keywords in the same keyword subset have the same label.
For example, the TextRank algorithm is used to extract the keywords in the text 1 to be analyzed, obtaining a keyword 1, a keyword 2 and a keyword 3, wherein the label of the keyword 1 is A, the label of the keyword 2 is B and the label of the keyword 3 is C; the TextRank algorithm is used to extract the keywords in the text 2 to be analyzed, obtaining a keyword 4, a keyword 5 and a keyword 6, wherein the label of the keyword 4 is A, the label of the keyword 5 is B and the label of the keyword 6 is C. Thus, the keyword set comprises a keyword subset 1, a keyword subset 2 and a keyword subset 3: the keyword subset 1 (label A) comprises the keywords 1 and 4, the keyword subset 2 (label B) comprises the keywords 2 and 5, and the keyword subset 3 (label C) comprises the keywords 3 and 6.
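A minimal sketch of steps 201 and 202 follows; extract_keywords() is a placeholder that stands in for any TextRank implementation, and the data layout is an illustrative assumption, not an interface defined by the patent.

```python
from collections import defaultdict

def extract_keywords(text_to_analyze):
    """Placeholder for a TextRank keyword extractor: returns (keyword, label) pairs
    derived from the labeled words to be analyzed of one text."""
    return text_to_analyze["keywords"]

def build_keyword_set(texts_to_analyze):
    keyword_set = defaultdict(list)            # label -> keyword subset
    for text in texts_to_analyze:
        for keyword, label in extract_keywords(text):
            keyword_set[label].append(keyword)
    return keyword_set

texts = [
    {"keywords": [("keyword 1", "A"), ("keyword 2", "B"), ("keyword 3", "C")]},
    {"keywords": [("keyword 4", "A"), ("keyword 5", "B"), ("keyword 6", "C")]},
]
print(dict(build_keyword_set(texts)))
# {'A': ['keyword 1', 'keyword 4'], 'B': ['keyword 2', 'keyword 5'], 'C': ['keyword 3', 'keyword 6']}
```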
And step 203, counting the word frequency information of the keywords in each keyword subset.
In this embodiment, specifically, after the extraction of the keywords is completed, the keywords in each keyword subset are subjected to word frequency sorting. First, for each keyword subset, word frequency information of the keywords is counted. And the word frequency information is the occurrence frequency of the keywords.
And 204, removing repeated keywords in each keyword subset to obtain each processed keyword subset.
In this embodiment, specifically, for each keyword subset, keywords that repeatedly appear in the keyword subset are removed. And then remove the repeated keywords.
And step 205, sequencing the keywords in each processed keyword subset according to the word frequency information to obtain a sequenced keyword set.
In this embodiment, specifically, for each processed keyword subset, the keywords are ranked in descending order of the word frequency information of each keyword in the subset, so as to obtain each ranked keyword subset; all the ranked keyword subsets form the ranked keyword set. By sorting the keywords in descending order of word frequency, the top-W keywords can be taken, preliminarily screening out the most frequently used keywords, where W is a positive integer.
For example, the keyword set includes a keyword subset 1 and a keyword subset 2; the keyword subset 1 corresponds to the label A, and the keyword subset 1 comprises a keyword 1, a keyword 2, a keyword 3 and a keyword 3; the keyword subset 2 corresponds to the tag B, and the keyword subset 2 includes a keyword 4, and a keyword 5. For the keyword subset 1, it can be counted that the word frequency of the keyword 1 is 1, the word frequency of the keyword 2 is 2, and the word frequency of the keyword 3 is 3; for the keyword subset 2, it can be counted that the word frequency of the keyword 4 is 2, and the word frequency of the keyword 5 is 1. The keywords 2 and 3 that repeatedly appear in the keyword subset 1 may be removed, and the keywords 4 that repeatedly appear in the keyword subset 2 may be removed. For the keyword subset 1, according to the order of the word frequency of the keywords from large to small, the order of the ordered keywords is the keyword 3, the keyword 2 and the keyword 1; for the keyword subset 2, according to the order of the word frequency of the keywords from large to small, the order of the obtained ordered keywords is the keywords 4 and the keywords 5.
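Steps 203 to 205 can be sketched as follows; the function name, the return layout and the optional top-W cutoff parameter are illustrative assumptions.

```python
from collections import Counter

def sort_keyword_set(keyword_set, top_w=None):
    """keyword_set: {label: [keyword, ...]} possibly containing repeats.
    Returns {label: [(keyword, frequency), ...]} sorted by descending frequency;
    top_w (a positive integer W) optionally keeps only the top-ranked keywords."""
    sorted_set = {}
    for label, keywords in keyword_set.items():
        frequencies = Counter(keywords)        # word frequency info; also removes repeats
        ranked = sorted(frequencies.items(), key=lambda kv: kv[1], reverse=True)
        sorted_set[label] = ranked[:top_w] if top_w else ranked
    return sorted_set

subsets = {"A": ["keyword 1", "keyword 2", "keyword 2",
                 "keyword 3", "keyword 3", "keyword 3"],
           "B": ["keyword 4", "keyword 4", "keyword 5"]}
print(sort_keyword_set(subsets))
# {'A': [('keyword 3', 3), ('keyword 2', 2), ('keyword 1', 1)],
#  'B': [('keyword 4', 2), ('keyword 5', 1)]}
```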
And step 206, determining the score of each keyword according to a preset association coefficient set, wherein the association coefficient set comprises the association coefficient of at least one keyword, and the association coefficient represents the association degree between the keyword and the label.
In this embodiment, specifically, a relevance coefficient set is established in advance, and the relevance coefficient set includes a relevance coefficient of each keyword, where the relevance coefficient indicates a relevance between the keyword and the tag. For example, the association degree between the keyword and the corresponding tag can be determined by manually observing the association degree between the keyword and the corresponding tag.
The scores of the keywords can be determined according to the relevancy coefficient of each keyword. For example, the greater the relevancy coefficient of a keyword, the higher the score of the keyword.
The scores of the keywords may be divided into two to three ranks. For example, when the score is divided into two grades, the keyword 1 belongs to the first grade score, and the keyword 2 belongs to the second grade score; for example, when the score is divided into three grades, keyword 1 belongs to the first grade score, keyword 2 belongs to the second grade score, and keyword 3 belongs to the third grade score.
For example, for the keyword subset 1 in the keyword set, the order of the sorted keywords is keyword 3, keyword 2, and keyword 1, and the label of the keyword subset 1 is a; the association degree coefficient of the keyword 3 and the label A can be determined to be 1, and the score of the keyword 3 is determined to be a first-grade score; determining that the association degree coefficient of the keyword 2 and the label A is 2, and determining that the score of the keyword 2 is a second-grade score; and determining the association degree coefficient of the keyword 1 and the label A as 3, and determining the score of the keyword 1 as a third grade score. For the keyword subset 2 in the keyword set, the sequence of the keywords is keyword 4 and keyword 5, and the label of the keyword subset 2 is B; the association degree coefficient of the keyword 4 and the label B can be determined to be 2, and the score of the keyword 4 is determined to be a second-grade score; and determining the association coefficient of the keyword 5 and the label B as 1, and determining the score of the keyword 5 as a first-grade score.
And step 207, constructing a weight dictionary according to the keyword set and the score of each keyword.
In this embodiment, specifically, through the above steps, the keywords, the tags of the keywords, and the scores of the keywords may be determined, so that the weight dictionary may be obtained according to each keyword, the tag of each keyword, and the score of each keyword.
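A possible sketch of steps 206 and 207 is given below, assuming an illustrative association-coefficient set and grade boundaries; the patent only states that the score follows from the association coefficient, so the concrete mapping here is an assumption.

```python
ASSOCIATION_COEFFICIENTS = {      # keyword -> association coefficient (assumed values)
    "keyword 1": 1.0, "keyword 2": 2.0, "keyword 3": 3.0,
    "keyword 4": 2.0, "keyword 5": 1.0,
}

def grade_score(coefficient):
    """Map an association coefficient to one of three score grades (assumed boundaries)."""
    if coefficient >= 3.0:
        return 3.0    # highest grade
    if coefficient >= 2.0:
        return 2.0    # middle grade
    return 1.0        # lowest grade

def build_weight_dictionary(sorted_keyword_set):
    """sorted_keyword_set: {label: [(keyword, frequency), ...]} from the previous step."""
    weight_dictionary = {}
    for label, ranked_keywords in sorted_keyword_set.items():
        for keyword, _frequency in ranked_keywords:
            coefficient = ASSOCIATION_COEFFICIENTS.get(keyword, 1.0)
            weight_dictionary.setdefault(keyword, {})[label] = grade_score(coefficient)
    return weight_dictionary

print(build_weight_dictionary({"A": [("keyword 3", 3), ("keyword 2", 2), ("keyword 1", 1)],
                               "B": [("keyword 4", 2), ("keyword 5", 1)]}))
```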
And 208, acquiring a text to be processed, wherein the text to be processed comprises at least one word to be processed.
In this embodiment, specifically, this step may refer to step 101 in fig. 1, and is not described again.
Step 209, determining a label and a score of each word to be processed in the at least one word to be processed according to a preset weight dictionary, wherein the weight dictionary comprises the at least one word, and each word in the at least one word has the label and the score.
In this embodiment, specifically, this step may refer to step 102 in fig. 1, and is not described again.
And step 210, determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
In this embodiment, specifically, this step may refer to step 103 in fig. 1, and is not described again.
And step 211, acquiring the words to be processed in the text to be processed with the determined category, wherein the words to be processed have labels and scores.
In this embodiment, specifically, the text added with the label may be used to expand the weight dictionary.
First, after step 209 and step 210 are executed, a to-be-processed text to which the mark and the category are added is obtained. After the steps 209 and 210 are repeatedly executed for a plurality of times, a plurality of texts to be processed with the marks and the categories added can be obtained.
And acquiring the words to be processed in each text to be processed, wherein each word to be processed is endowed with a label and a score.
And step 212, updating the weight dictionary according to the words to be processed.
Wherein step 212 comprises:
and step 2121, deleting the words in the weight dictionary which are the same as the words to be processed to obtain a weight dictionary after deletion processing.
Optionally, step 2121 specifically includes: determining words in the weight dictionary which are the same as the words to be processed; judging whether the score of the word to be processed is greater than the score of the word same as the word to be processed; if so, replacing the words which are the same as the words to be processed with the words to be processed to obtain an updated weight dictionary.
And step 2122, adding the words to be processed into the weight dictionary after deletion processing to obtain an updated weight dictionary.
In this embodiment, specifically, since each word to be processed has been given a label and a score, the word to be processed may be added to the weight dictionary.
First, a word to be processed in the text to be processed to which the labels and categories have been added may be the same as a word already in the weight dictionary, and thus the word in the weight dictionary that is the same as the word to be processed needs to be deleted. Specifically, according to the words to be processed in the texts to be processed to which labels and categories have been added, the words in the weight dictionary that are the same as the words to be processed are searched for; because the words in the weight dictionary have scores and the words to be processed also have scores, it can be judged whether the score of the word to be processed is greater than the score of the matching word; if the score of the word to be processed is greater, the word in the weight dictionary that is the same as the word to be processed can be directly replaced with the word to be processed; if it is less than or equal, no replacement is needed.
Then, the words to be processed in the text to be processed, to which the labels and the categories are added, are added to the weight dictionary after the deletion processing, and the updating of the weight dictionary can be completed.
After steps 211 and 212 are repeated many times, the weight dictionary is updated many times. After the N-th update, if the number of changed words compared with the weight dictionary after the (N-1)-th update does not exceed a preset number, it is determined that the dictionary no longer changes much over the whole service, and the weight dictionary does not need to be updated further.
For example, the weight dictionary includes word 1, word 2, word 3, and word 4; word 1 has label A, and the score of word 1 belonging to label A is a1; word 2 has label B, and the score of word 2 belonging to label B is b1; word 3 has label C, and the score of word 3 belonging to label C is c; word 4 has label D, and the score of word 4 belonging to label D is d. A word to be processed 1, a word to be processed 2, a word to be processed 5 and a word to be processed 6 are acquired; the word to be processed 1 has a label A and a score a2, the word to be processed 2 has a label B and a score b2, the word to be processed 5 has a label E and a score e, and the word to be processed 6 has a label F and a score f. The word 1 to be processed is the same as the word 1, and the word 2 to be processed is the same as the word 2. It may be determined that the score a2 of the word 1 to be processed is greater than the score a1 of the word 1, so the word 1 may be replaced with the word 1 to be processed; it may be determined that the score b2 of the word 2 to be processed is less than or equal to the score b1 of the word 2, so no replacement is needed. The obtained updated weight dictionary comprises the word 1 to be processed, the word 2, the word 3, the word 4, the word 5 to be processed and the word 6 to be processed, wherein the word 1 to be processed has the label A and the score a2, the word 2 has the label B and the score b1, the word 3 has the label C and the score c, the word 4 has the label D and the score d, the word 5 to be processed has the label E and the score e, and the word 6 to be processed has the label F and the score f.
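A minimal sketch of the update logic in steps 211 and 212, using a flat {word: (label, score)} layout for brevity; the convergence flag mirrors the rule described above, but the parameter name and its default are illustrative assumptions.

```python
def update_weight_dictionary(weight_dictionary, processed_words, min_changes=1):
    """processed_words: iterable of (word, label, score) taken from texts whose
    category has already been determined."""
    changes = 0
    for word, label, score in processed_words:
        existing = weight_dictionary.get(word)
        if existing is None:
            weight_dictionary[word] = (label, score)   # new word: add it
            changes += 1
        elif score > existing[1]:                      # replace only if the new score is larger
            weight_dictionary[word] = (label, score)
            changes += 1
    # False means the dictionary changed little and further updates can stop
    return weight_dictionary, changes >= min_changes

weights = {"word 1": ("A", 0.5), "word 2": ("B", 0.8)}
updated, keep_updating = update_weight_dictionary(
    weights, [("word 1", "A", 0.7), ("word 2", "B", 0.6), ("word 5", "E", 0.9)])
print(updated, keep_updating)
# word 1 is replaced (0.7 > 0.5), word 2 is kept (0.6 <= 0.8), word 5 is added
```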
In the embodiment, a text to be processed is obtained, and the text to be processed comprises at least one word to be processed; determining a label and a score of each word to be processed in at least one word to be processed according to a preset weight dictionary, wherein the weight dictionary comprises at least one word, and each word in the at least one word has the label and the score; and determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed. The method for automatically labeling the words in the text to be processed and classifying the text to be processed is provided, a large amount of human resources are not needed for labeling the text, the labor cost can be saved, the cost of text labeling is reduced, and the labeling efficiency is improved. Moreover, the weight dictionary can be updated iteratively according to the words to be processed in the text to be processed with the determined category, so that the subsequent text labeling has good precision.
Fig. 3 is a schematic structural diagram of a text annotation device according to an embodiment of the present application, and as shown in fig. 3, the device according to the embodiment may include:
the first obtaining module 31 is configured to obtain a to-be-processed text, where the to-be-processed text includes at least one to-be-processed word;
the first determining module 32 is configured to determine a label and a score of each to-be-processed word in at least one to-be-processed word according to a preset weight dictionary, where the weight dictionary includes at least one word, and each word in the at least one word has a label and a score;
and the second determining module 33 is configured to determine the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed.
The text labeling device of this embodiment can execute the text labeling method provided in this embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Fig. 4 is a schematic structural diagram of another text annotation device provided in an embodiment of the present application, and based on the embodiment shown in fig. 3, as shown in fig. 4, in the device of the present embodiment, the second determining module 33 includes:
the first determining submodule 331 is configured to determine, according to the score of each to-be-processed word in the to-be-processed text, a sum of the scores of the to-be-processed words under each tag in the to-be-processed text.
The second determining submodule 332 is configured to determine a label with the largest score sum, which is a category of the text to be processed.
The first determining submodule 331 is specifically configured to:
determining the total score of the words to be processed under each label in the text to be processed according to the score and the occurrence frequency of each word to be processed in the text to be processed, as

$\sum_{i=1}^{N} x_i \cdot y_i$

wherein N is the total number of the words to be processed under each label, the total number of the words to be processed under different labels is the same or different, $x_i$ is the score of the i-th word to be processed under each label, $y_i$ is the number of occurrences of the i-th word to be processed under each label, $i \in [1, N]$, and i is a positive integer; or determining the total score of the words to be processed under each label in the text to be processed according to the score of each word to be processed in the text to be processed, as

$\sum_{i=1}^{N} x_i$

or, adopting a nonlinear summation method over the score and the number of occurrences of each word to be processed in the text to be processed, determining the total score of the words to be processed under each label in the text to be processed, as

[nonlinear summation formula; original image Figure BDA0001839196610000183 not reproduced — it combines $x_i$ and $y_i$ using the natural constant e and the natural logarithm ln]

where e is a natural constant and ln is a natural logarithm.
The second determining submodule 332 is specifically configured to: determining the label with the largest total score; and if the total score of the label with the largest total score is determined to be within the preset threshold range, determining the label with the largest total score as the category of the text to be processed.
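A sketch of these two submodules under the linear summation form $\sum_i x_i \cdot y_i$ reconstructed above follows; it totals the scores per label, weighting each word's score by its number of occurrences, picks the label with the largest total, and applies a threshold check. The interpretation of the preset threshold range as a minimum value, along with all example data, is an assumption; the nonlinear variant is omitted because its exact formula appears only in the unreproduced image.

```python
from collections import Counter, defaultdict

def classify(labeled_words, threshold=1.0):
    """labeled_words: (word, label, score) tuples, possibly repeated.
    Returns the label with the largest total score, or None if that total
    does not reach the (assumed) preset threshold."""
    occurrences = Counter(labeled_words)            # y_i per distinct (word, label, score)
    totals = defaultdict(float)
    for (word, label, score), count in occurrences.items():
        totals[label] += score * count              # x_i * y_i
    if not totals:
        return None
    best = max(totals, key=totals.get)
    return best if totals[best] >= threshold else None

labeled = [("refund", "complaint", 0.9), ("refund", "complaint", 0.9),
           ("thanks", "praise", 0.7)]
print(classify(labeled))   # complaint: 0.9 * 2 = 1.8 >= 1.0 -> 'complaint'
```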
The apparatus provided in this embodiment further includes:
the second obtaining module 41 is configured to obtain at least one text to be analyzed before the first obtaining module 31 obtains the text to be processed, where each text to be analyzed in the at least one text to be analyzed includes at least one word to be analyzed, and the word to be analyzed is provided with a tag.
The extracting module 42 is configured to extract keywords of each text to be analyzed according to the words to be analyzed in each text to be analyzed, so as to obtain a keyword set, where the keyword set includes at least one keyword subset, and each keyword subset in the at least one keyword subset includes keywords belonging to the same tag.
The third determining module 43 is configured to determine a score of each keyword according to a preset association coefficient set, where the association coefficient set includes an association coefficient of at least one keyword, and the association coefficient represents an association degree between the keyword and the tag.
And the building module 44 is used for building a weight dictionary according to the keyword set and the score of each keyword.
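The following Python sketch shows one way the second obtaining module 41, the extracting module 42, the third determining module 43, and the building module 44 could combine: labeled texts are grouped into one keyword subset per label, and each keyword is scored from a preset association-coefficient table. The pre-tokenized input format, the set-based keyword extraction, and the default coefficient are assumptions, not details given in this application.

```python
from collections import defaultdict

# Texts to be analyzed: already tokenized, each carrying a label (assumed format).
texts_to_analyze = [
    {"label": "complaint", "words": ["refund", "broken", "refund"]},
    {"label": "praise",    "words": ["thanks", "great"]},
]

# Assumed association coefficients: degree of association between keyword and label.
association = {"refund": 0.9, "broken": 0.8, "thanks": 0.7, "great": 0.6}

def build_weight_dict(texts, association, default_coef=0.5):
    """Group keywords into one subset per label, then score each keyword."""
    keyword_sets = defaultdict(set)
    for text in texts:
        keyword_sets[text["label"]].update(text["words"])
    weight_dict = {}
    for label, keywords in keyword_sets.items():
        for keyword in keywords:
            weight_dict[keyword] = (label, association.get(keyword, default_coef))
    return weight_dict

print(build_weight_dict(texts_to_analyze, association))
# e.g. {'refund': ('complaint', 0.9), 'broken': ('complaint', 0.8),
#       'thanks': ('praise', 0.7), 'great': ('praise', 0.6)}
```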
The apparatus provided in this embodiment further includes:
a counting module 45, configured to count word frequency information of the keywords in each keyword subset before the third determining module 43 determines the score of each keyword according to the preset association coefficient set.
And a removing module 46, configured to remove the keywords that appear repeatedly in each keyword subset to obtain each processed keyword subset.
And the sorting module 47 is configured to sort the keywords in each processed keyword subset according to the word frequency information, so as to obtain a sorted keyword set.
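A small sketch of the counting module 45, the removing module 46, and the sorting module 47 follows: it counts word-frequency information per keyword subset, removes duplicate keywords, and sorts the remaining keywords by frequency. The example subsets and the descending sort order are assumptions.

```python
from collections import Counter

keyword_subsets = {
    "complaint": ["refund", "refund", "broken", "slow", "refund"],
    "praise":    ["thanks", "great", "thanks"],
}

def dedupe_and_sort(keyword_subsets):
    """Count word frequencies, drop duplicates, sort keywords by frequency (descending)."""
    sorted_subsets = {}
    for label, keywords in keyword_subsets.items():
        freq = Counter(keywords)    # word-frequency information
        unique = list(freq)         # duplicates removed, first-seen order kept
        sorted_subsets[label] = sorted(unique, key=freq.get, reverse=True)
    return sorted_subsets

print(dedupe_and_sort(keyword_subsets))
# -> {'complaint': ['refund', 'broken', 'slow'], 'praise': ['thanks', 'great']}
```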
The apparatus provided in this embodiment further includes:
and a third obtaining module 48, configured to, after the second determining module 33 determines the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed, obtain the word to be processed in the text to be processed of which the category is determined, where the word to be processed has the label and the score.
And the updating module 49 is used for updating the weight dictionary according to the words to be processed.
An update module 49, comprising:
and the deleting submodule 491 is used for deleting the words in the weight dictionary which are the same as the words to be processed to obtain the weight dictionary after deletion processing.
And the adding submodule 492 is used for adding the words to be processed into the weight dictionary after the deletion processing to obtain an updated weight dictionary.
Wherein, deleting submodule 491 is specifically configured to: determining words in the weight dictionary which are the same as the words to be processed; judging whether the score of the word to be processed is greater than the score of the word same as the word to be processed; if so, replacing the words which are the same as the words to be processed with the words to be processed to obtain an updated weight dictionary.
The text labeling apparatus of this embodiment can execute another text labeling method provided in this embodiment of the present application, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 5 is a schematic structural diagram of a text annotation device provided in an embodiment of the present application, and as shown in fig. 5, an embodiment of the present application provides a text annotation device, which can be used to execute actions or steps of the text annotation device in the embodiments shown in fig. 1 or fig. 2, and specifically includes: a processor 2701, memory 2702, and a communication interface 2703.
The memory 2702 is used to store computer programs.
The processor 2701 is configured to execute the computer program stored in the memory 2702 to implement the actions of the text labeling apparatus in the embodiment shown in fig. 1 or fig. 2, which is not described again.
Optionally, the text annotation device may also include a bus 2704. The processor 2701, the memory 2702, and the communication interface 2703 may be connected to each other via a bus 2704; the bus 2704 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 2704 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
In the embodiments of the present application, the above embodiments may be referred to and referred to by each other, and the same or similar steps and terms are not repeated.
Alternatively, part or all of the above modules may also be implemented by being embedded in a certain chip of the text labeling device in the form of an integrated circuit. They may be implemented separately or integrated together. That is, the above modules may be configured as one or more integrated circuits implementing the above methods, for example: one or more Application Specific Integrated Circuits (ASICs), or one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as the memory 2702 including instructions executable by the processor 2701 of the text annotation device to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein which, when executed by a processor of a text annotation apparatus, enable the text annotation apparatus to perform the text annotation method described above.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions can be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions can be transmitted from one website, computer, text annotation device, or data center to another website, computer, text annotation device, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a text annotation device, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (18)

1. A text labeling method is characterized by comprising the following steps:
acquiring a text to be processed, wherein the text to be processed comprises at least one word to be processed;
determining a label and a score of each word to be processed in the at least one word to be processed according to a preset weight dictionary, wherein the weight dictionary comprises at least one word, and each word in the at least one word has the label and the score;
determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed;
before the obtaining of the text to be processed, the method further comprises:
obtaining at least one text to be analyzed, wherein each text to be analyzed in the at least one text to be analyzed comprises at least one word to be analyzed, and the word to be analyzed is provided with a label;
extracting keywords of each text to be analyzed according to the words to be analyzed in each text to be analyzed to obtain a keyword set, wherein the keyword set comprises at least one keyword subset, and each keyword subset of the at least one keyword subset comprises keywords belonging to the same label;
determining the score of each keyword according to a preset association coefficient set, wherein the association coefficient set comprises an association coefficient of at least one keyword, and the association coefficient represents the association degree between the keyword and the label;
and constructing the weight dictionary according to the keyword set and the score of each keyword.
2. The method of claim 1, wherein determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed comprises:
determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score of each word to be processed in the text to be processed;
and determining the label with the maximum score sum as the category of the text to be processed.
3. The method of claim 2, wherein if the number of occurrences of the to-be-processed word in the to-be-processed text is greater than 1, determining the sum of the scores of the to-be-processed words under each label in the to-be-processed text according to the score of each to-be-processed word in the to-be-processed text comprises:
determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score and the occurrence frequency of each word to be processed in the text to be processed, as

$\sum_{i=1}^{N} x_i \cdot y_i$

wherein N is the total number of the words to be processed under each label, the total number of the words to be processed under different labels is the same or different, $x_i$ is the score of the i-th word to be processed under each label, $y_i$ is the number of occurrences of the i-th word to be processed under each label, $i \in [1, N]$, and i is a positive integer;

or determining, by adopting a nonlinear summation method, the sum of the scores of the words to be processed under each label in the text to be processed according to the score and the occurrence frequency of each word to be processed in the text to be processed, as

[nonlinear summation formula; original image Figure FDA0002668951560000021 not reproduced — it combines $x_i$ and $y_i$ using the natural constant e and the natural logarithm ln]

wherein e is a natural constant and ln is a natural logarithm.
4. The method of claim 2, wherein the determining the label with the largest total score is a category of the text to be processed, and comprises:
determining the label with the largest score sum;
and if the total score of the labels with the maximum total score is determined to be within the preset threshold range, determining the labels with the maximum total score as the categories of the texts to be processed.
5. The method according to claim 1, wherein before said determining the score of each of said keywords according to a preset relevance coefficient set, further comprising:
counting word frequency information of the keywords in each keyword subset;
removing the keywords which repeatedly appear in each keyword subset to obtain each processed keyword subset;
and sequencing the keywords in each processed keyword subset according to the word frequency information to obtain a sequenced keyword set.
6. The method according to any one of claims 1-5, further comprising, after determining the category of the text to be processed according to the label and score of each word to be processed in the text to be processed:
acquiring words to be processed in the text to be processed with the determined category, wherein the words to be processed have labels and scores;
and updating the weight dictionary according to the words to be processed.
7. The method of claim 6, wherein updating the weight dictionary according to the word to be processed comprises:
deleting the words in the weight dictionary which are the same as the words to be processed to obtain a deleted weight dictionary;
and adding the words to be processed into the weight dictionary after deletion processing to obtain an updated weight dictionary.
8. The method according to claim 7, wherein deleting a word in the weight dictionary that is the same as the word to be processed to obtain a deleted weight dictionary comprises:
determining words in the weight dictionary which are the same as the words to be processed;
judging whether the score of the word to be processed is larger than the score of the word same as the word to be processed;
and if so, replacing the words which are the same as the words to be processed with the words to be processed to obtain an updated weight dictionary.
9. A text labeling apparatus, comprising:
the device comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is used for acquiring a text to be processed, and the text to be processed comprises at least one word to be processed;
the first determining module is used for determining a label and a score of each word to be processed in the at least one word to be processed according to a preset weight dictionary, wherein the weight dictionary comprises at least one word, and each word in the at least one word has the label and the score;
the second determining module is used for determining the category of the text to be processed according to the label and the score of each word to be processed in the text to be processed;
the apparatus further comprises:
the second acquisition module is used for acquiring at least one text to be analyzed before the first acquisition module acquires the text to be processed, wherein each text to be analyzed in the at least one text to be analyzed comprises at least one word to be analyzed, and the word to be analyzed is provided with a label;
the extraction module is used for extracting keywords of each text to be analyzed according to the words to be analyzed in each text to be analyzed so as to obtain a keyword set, wherein the keyword set comprises at least one keyword subset, and each keyword subset in the at least one keyword subset comprises keywords belonging to the same label;
the third determining module is used for determining the score of each keyword according to a preset association coefficient set, wherein the association coefficient set comprises an association coefficient of at least one keyword, and the association coefficient represents the association degree between the keyword and the label;
and the construction module is used for constructing the weight dictionary according to the keyword set and the score of each keyword.
10. The apparatus of claim 9, wherein the second determining module comprises:
the first determining submodule is used for determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score of each word to be processed in the text to be processed;
and the second determining submodule is used for determining the label with the maximum score sum as the category of the text to be processed.
11. The apparatus according to claim 10, wherein if the number of occurrences of the to-be-processed word in the to-be-processed text is greater than 1, the first determining sub-module is specifically configured to:
determining the sum of the scores of the words to be processed under each label in the text to be processed according to the score and the occurrence frequency of each word to be processed in the text to be processed, as

$\sum_{i=1}^{N} x_i \cdot y_i$

wherein N is the total number of the words to be processed under each label, the total number of the words to be processed under different labels is the same or different, $x_i$ is the score of the i-th word to be processed under each label, $y_i$ is the number of occurrences of the i-th word to be processed under each label, $i \in [1, N]$, and i is a positive integer;

or determining, by adopting a nonlinear summation method, the sum of the scores of the words to be processed under each label in the text to be processed according to the score and the occurrence frequency of each word to be processed in the text to be processed, as

[nonlinear summation formula; original image Figure FDA0002668951560000042 not reproduced — it combines $x_i$ and $y_i$ using the natural constant e and the natural logarithm ln]

wherein e is a natural constant and ln is a natural logarithm.
12. The apparatus according to claim 10, wherein the second determining submodule is specifically configured to:
determining the label with the largest score sum;
and if the total score of the labels with the maximum total score is determined to be within the preset threshold range, determining the labels with the maximum total score as the categories of the texts to be processed.
13. The apparatus of claim 9, further comprising:
the statistic module is used for counting the word frequency information of the keywords in each keyword subset before the third determining module determines the score of each keyword according to a preset association coefficient set;
the removing module is used for removing the repeated keywords in each keyword subset to obtain each processed keyword subset;
and the sorting module is used for sorting the keywords in each processed keyword subset according to the word frequency information to obtain a sorted keyword set.
14. The apparatus of any one of claims 9-13, further comprising:
the third obtaining module is used for obtaining the words to be processed in the texts to be processed with the determined categories after the second determining module determines the categories of the texts to be processed according to the labels and the scores of the words to be processed in the texts to be processed, wherein the words to be processed have the labels and the scores;
and the updating module is used for updating the weight dictionary according to the words to be processed.
15. The apparatus of claim 14, wherein the update module comprises:
the deleting submodule is used for deleting the words in the weight dictionary, wherein the words are the same as the words to be processed, and the weight dictionary after deletion processing is obtained;
and the adding submodule is used for adding the words to be processed into the weight dictionary after the deletion processing to obtain an updated weight dictionary.
16. The apparatus according to claim 15, wherein the delete submodule is specifically configured to:
determining words in the weight dictionary which are the same as the words to be processed;
judging whether the score of the word to be processed is larger than the score of the word same as the word to be processed;
and if so, replacing the words which are the same as the words to be processed with the words to be processed to obtain an updated weight dictionary.
17. A text annotation apparatus, comprising: a processor, a memory, and a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-8.
18. A computer-readable storage medium, having stored thereon a computer program for execution by a processor to perform the method of any one of claims 1-8.
CN201811240652.7A 2018-10-23 2018-10-23 Text labeling method, device, equipment and computer readable storage medium Active CN109388714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811240652.7A CN109388714B (en) 2018-10-23 2018-10-23 Text labeling method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN109388714A CN109388714A (en) 2019-02-26
CN109388714B true CN109388714B (en) 2020-11-24

Family

ID=65427659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811240652.7A Active CN109388714B (en) 2018-10-23 2018-10-23 Text labeling method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109388714B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164471A (en) * 2011-12-15 2013-06-19 盛乐信息技术(上海)有限公司 Recommendation method and system of video text labels
CN107807920A (en) * 2017-11-17 2018-03-16 新华网股份有限公司 Construction method, device and the server of mood dictionary based on big data
CN108415959A (en) * 2018-02-06 2018-08-17 北京捷通华声科技股份有限公司 A kind of file classification method and device
CN108628875A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of extracting method of text label, device and server

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8630989B2 (en) * 2011-05-27 2014-01-14 International Business Machines Corporation Systems and methods for information extraction using contextual pattern discovery

Also Published As

Publication number Publication date
CN109388714A (en) 2019-02-26

Similar Documents

Publication Publication Date Title
CN109271512B (en) Emotion analysis method, device and storage medium for public opinion comment information
CN107577739B (en) Semi-supervised domain word mining and classifying method and equipment
CN110851598B (en) Text classification method and device, terminal equipment and storage medium
CN109460551B (en) Signature information extraction method and device
CN108376129B (en) Error correction method and device
CN112966081B (en) Method, device, equipment and storage medium for processing question and answer information
CN102411563A (en) Method, device and system for identifying target words
CN104834651B (en) Method and device for providing high-frequency question answers
CN110287409B (en) Webpage type identification method and device
CN106649557B (en) Semantic association mining method for defect report and mail list
CN111708909B (en) Video tag adding method and device, electronic equipment and computer readable storage medium
CN111563382A (en) Text information acquisition method and device, storage medium and computer equipment
CN112231555A (en) Recall method, apparatus, device and storage medium based on user portrait label
CN111914159A (en) Information recommendation method and terminal
CN116719997A (en) Policy information pushing method and device and electronic equipment
CN111581388A (en) User intention identification method and device and electronic equipment
CN114003725A (en) Information annotation model construction method and information annotation generation method
CN116032741A (en) Equipment identification method and device, electronic equipment and computer storage medium
CN107908649B (en) Text classification control method
CN111400516B (en) Label determining method, electronic device and storage medium
CN113204956A (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN112579729A (en) Training method and device for document quality evaluation model, electronic equipment and medium
CN112307210A (en) Document tag prediction method, system, medium and electronic device
CN109388714B (en) Text labeling method, device, equipment and computer readable storage medium
CN104424300A (en) Personalized search suggestion method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant