CN115114913B

CN115114913B - Labeling method, labeling device, labeling equipment and readable storage medium

Info

Publication number: CN115114913B
Application number: CN202110291600.8A
Authority: CN
Inventors: 李长林; 蒋宁; 王洪斌; 吴海英; 沈春泽
Original assignee: Mashang Consumer Finance Co Ltd
Current assignee: Mashang Consumer Finance Co Ltd
Priority date: 2021-03-18
Filing date: 2021-03-18
Publication date: 2024-02-06
Anticipated expiration: 2041-03-18
Also published as: CN115114913A

Abstract

The application discloses a labeling method, a labeling device, labeling equipment and a readable storage medium, and relates to the technical field of computers, with the aim of improving labeling accuracy. The method comprises the following steps: acquiring a text to be annotated; determining a first keyword information set of the text to be annotated; when it is determined according to the first keyword information set that keyword nesting exists in the text to be annotated, processing the keywords having nesting relations to obtain a second keyword information set; and labeling the text to be annotated according to the second keyword information set. The embodiments of the application can improve the accuracy of labeling.

Description

Labeling method, labeling device, labeling equipment and readable storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a labeling method, apparatus, device, and readable storage medium.

Background

Sequence labeling is the most basic task in NLP (Natural Language Processing) and is very widely applied, for example in word segmentation, part-of-speech labeling, named entity recognition (NER), keyword extraction, natural language processing, semantic role labeling, slot extraction and the like.

In sequence labeling, a label is assigned to each element of a sequence. Generally, a sequence refers to a sentence and an element refers to a word in the sentence. For example, an information extraction problem, such as extracting a meeting time or place, can be treated as a sequence labeling problem.

Currently, sequence labeling schemes include: a manual labeling scheme; a semi-automated scheme; a maximum matching scheme; a scheme combining a lexicon with reverse maximum matching, etc. However, these methods suffer from low labeling accuracy.

Disclosure of Invention

The embodiment of the application provides a labeling method, a labeling device, labeling equipment and a readable storage medium, so as to improve the labeling accuracy.

In a first aspect, an embodiment of the present application provides a labeling method, including:

acquiring a text to be annotated;

determining a first keyword information set of the text to be annotated;

when it is determined according to the first keyword information set that keyword nesting exists in the text to be annotated, processing the keywords having nesting relations to obtain a second keyword information set;

and labeling the text to be annotated according to the second keyword information set.

In a second aspect, an embodiment of the present application further provides a labeling device, including:

the first acquisition module is used for acquiring a text to be annotated;

the first determining module is used for determining a first keyword information set of the text to be annotated;

the first processing module is used for processing keywords having nesting relations to obtain a second keyword information set when it is determined according to the first keyword information set that keyword nesting exists in the text to be annotated;

and the first labeling module is used for labeling the text to be annotated according to the second keyword information set.

In a third aspect, embodiments of the present application further provide an electronic device, including: a transceiver, a memory, a processor and a program stored on the memory and executable on the processor, which when executed implements the steps in the labeling method as described above.

In a fourth aspect, embodiments of the present application further provide a readable storage medium, where a program is stored, the program, when executed by a processor, implementing the steps in the labeling method as described above.

In the embodiments of the application, when it is determined according to the first keyword information set of the text to be annotated that keyword nesting exists in the text, the keywords having nesting relations are processed to obtain the second keyword information set, and the text to be annotated is then labeled using the second keyword information set. Because the keywords having nesting relations are processed, each keyword can be accurately distinguished when the text is labeled, and the labeling accuracy can therefore be improved.

Drawings

FIG. 1 is a first flowchart of the labeling method provided in an embodiment of the present application;

FIG. 2 is a second flowchart of the labeling method provided in an embodiment of the present application;

FIG. 3 is a block diagram of a labeling device provided in an embodiment of the present application.

Detailed Description

In the embodiments of the application, the term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may represent: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

The term "plurality" in the embodiments of the present application means two or more, and other adjectives are similar thereto.

The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

Referring to fig. 1, fig. 1 is a flowchart of a labeling method provided in an embodiment of the present application, as shown in fig. 1, including the following steps:

Step 101, acquiring a text to be annotated.

The text to be annotated can be any text that requires annotation. It can also be text from different fields, such as the medical field, the computer field, etc.

Step 102, determining the first keyword information set of the text to be annotated.

In the embodiments of the application, the keyword information set includes information of a plurality of keywords, wherein the information of each keyword may include the keyword, the starting position of the keyword in the text to be annotated, and the length of the keyword. The keyword, the starting position of the keyword in the text to be annotated, and the length of the keyword may all be referred to as sub-attributes. The keyword information set may be represented in list form: the keywords, their starting positions in the text to be annotated, and their lengths may be stored in the Keyword, keyword_start and keyword_long lists, respectively.

Specifically, in practical application, a reference keyword set corresponding to the text to be annotated may be obtained first, and then information of keywords of the text to be annotated may be obtained by using the reference keyword set, where the keywords are located in the reference keyword set. And then, forming the first keyword information set by utilizing information of each keyword, wherein the information of the keyword comprises the following components: keywords, starting position of keywords in text to be annotated, and length of keywords.

The "keyword" is the keyword itself, for example, Zhang San, spring wind, etc. The "starting position of the keyword in the text to be annotated" is the position of the first character of the keyword in the text to be annotated. For example, the position of each character may be represented in the form of (row, column). The value of the starting position of the keyword in the text to be annotated can then be represented by the row number or the column number of the first character of the keyword. For example, if two keywords are in the same row, the column number may be used to distinguish their locations in the text to be annotated; if the two keywords are in different rows, the row number can be used to distinguish their locations. The "length of the keyword" is the total number of characters included in the keyword.

The reference keyword set stores a plurality of keywords in a certain field. These keywords may be given by experts, scholars, practitioners or technicians in the relevant field, or may be extended based on actual labeling experience. For example, keywords may be supplemented as needed, such as expanded synonyms (complaint-complaint), aliases (Chongqing-mountain city), abbreviations (Beijing university-North university), and the like.

And obtaining a reference keyword set of the field according to the field to which the text to be annotated belongs. For example, if the domain to which the text to be annotated belongs is a medical domain, the set of reference keywords may be a set of keywords for the medical domain.

In the embodiment of the application, the reference keyword sets in different fields can be obtained in advance. In order to improve the accuracy and efficiency of labeling, in the process of acquiring the reference keyword set corresponding to the text to be labeled, acquiring a first keyword set in the field to which the text to be labeled belongs, deleting repeated keywords in the first keyword set, and obtaining an updated first keyword set. By deleting the repeated keywords, the uniqueness of each keyword can be kept, the efficiency of the whole labeling process can be improved, and the accuracy of labeling results can be ensured. And classifying the keywords in the updated first keyword set to obtain the reference keyword set, wherein the reference keyword set comprises keywords of at least two categories, and different keywords belong to different categories.

The classified categories may include time, place, name, job, technical category, etc. When classifying, the uniqueness of classification and the accuracy of classification are ensured, namely, the keywords of each category are not repeated, and each keyword can be accurately classified into the category to which the keyword belongs. In this case, then, the total number of keywords in the different categories should be equal to the total number of keywords in the updated first keyword set. In this way, the first keyword set is processed, so that repeated keywords are deleted, and the accuracy of the labeling result can be ensured.
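The deduplication and classification described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_reference_keywords` and the `category_of` mapping are hypothetical, and the sample keywords and categories are made up.

```python
from collections import defaultdict


def build_reference_keywords(raw_keywords, category_of):
    """Deduplicate keywords, then group them by category.

    raw_keywords: iterable of keyword strings (may contain repeats).
    category_of:  hypothetical dict mapping each keyword to exactly one category.
    """
    # dict.fromkeys drops repeats while keeping first-seen order.
    unique = list(dict.fromkeys(raw_keywords))
    by_category = defaultdict(list)
    for kw in unique:
        by_category[category_of[kw]].append(kw)
    # Uniqueness check from the text: the category sizes must add up
    # to the size of the deduplicated keyword set.
    assert sum(len(v) for v in by_category.values()) == len(unique)
    return dict(by_category)


reference = build_reference_keywords(
    ["Chongqing", "2021", "Chongqing", "engineer"],
    {"Chongqing": "place", "2021": "time", "engineer": "job"},
)
```

Because each keyword maps to a single category, the size check holds by construction here; in practice it would guard against a keyword being filed under two categories.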

After the reference keyword set is obtained, the keywords in the text to be marked can be positioned by utilizing the reference keyword set, so that the information of the keywords of the text to be marked is obtained according to the positioning result. And the first keyword information set can be further formed by using the obtained information of each keyword.
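One possible way to locate keywords and fill the three lists is a plain substring search. The sketch below assumes a single-line text and 0-based character offsets (the description only requires that the starting position identify the first character); `locate_keywords` is a hypothetical helper name and the English sample sentence is made up.

```python
def locate_keywords(text, reference_keywords):
    """Return the parallel lists (Keyword, keyword_start, keyword_long)."""
    keyword, keyword_start, keyword_long = [], [], []
    for kw in reference_keywords:
        pos = text.find(kw)
        while pos != -1:
            keyword.append(kw)
            keyword_start.append(pos)      # 0-based index of the first character
            keyword_long.append(len(kw))   # number of characters in the keyword
            pos = text.find(kw, pos + 1)   # continue after this occurrence
    return keyword, keyword_start, keyword_long


kws, starts, lengths = locate_keywords(
    "the earthquake damaged many houses", ["earthquake", "quake"]
)
```

Both keywords are reported even though one lies inside the other; resolving such nesting is the job of the next step.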

Step 103, when it is determined according to the first keyword information set that keyword nesting exists in the text to be annotated, processing the keywords having nesting relations to obtain a second keyword information set.

Specifically, in this step, according to the starting position of the keyword in the text to be annotated and the length of the keyword, the keyword of the text to be annotated is subjected to nested relation analysis, so as to obtain an analysis result. And when the analysis result shows that the nested relation exists among the keywords, processing the information of the keywords with the nested relation in the first keyword information set to obtain the second keyword information set.

The second keyword information set comprises keywords, starting positions of the keywords in the text to be marked and lengths of the keywords. In a sense, the second set of keyword information may be considered as an update of the first set of keyword information.

The purpose of the nesting analysis is to find keywords that have nesting relations. In general, the nesting analysis is performed on keywords within the same sentence. Here, a nesting relation means that two keywords have overlapping content. For example, in "many houses are broken by large earthquake", both "earthquake" and "earthquake break" are keywords and they share characters, so the two can be considered to have a nesting relation. The nesting relation may be complete nesting, cross nesting, or mixed nesting.

In performing the nesting analysis, the analysis may be performed as follows:

for a first keyword, a second keyword and a third keyword in a sentence of a text to be annotated, determining that a nesting relationship exists between keywords of the text to be annotated when any one of the following first condition and second condition is satisfied, otherwise determining that no nesting relationship exists between keywords of the text to be annotated:

the first condition is: the lengths of the first keyword and the second keyword are not equal; alternatively, the lengths of the first keyword and the second keyword are not equal, and the lengths of the second keyword and the third keyword are not equal.

That is, for the first keyword i, the second keyword j and the third keyword p in the Keyword list: keyword_long[i] != keyword_long[j]; alternatively, keyword_long[i] != keyword_long[j] and keyword_long[p] != keyword_long[j].

The second condition includes:

the first sub-condition is: M ≤ N ≤ P;

wherein M represents a value of a start position of the second keyword, N represents a value of a start position of the first keyword, and P represents a sum of the value of the start position of the second keyword and a length of the second keyword;

namely:

keyword_start[j] <= keyword_start[i] <= (keyword_start[j] + keyword_long[j]).

the second sub-condition is: M < N ≤ P < Q;

wherein M represents a value of a start position of the second keyword, N represents a value of a start position of the first keyword, P represents a sum of the value of the start position of the second keyword and a length of the second keyword, and Q represents a sum of the value of the start position of the first keyword and the length of the first keyword.

Namely:

keyword_start[j] < keyword_start[i] <= (keyword_start[j] + keyword_long[j]) < (keyword_start[i] + keyword_long[i]).

the third sub-condition is: M ≤ N ≤ P, and M < O ≤ P < S;

wherein M represents a value of a start position of the second keyword, N represents a value of a start position of the first keyword, and P represents a sum of the value of the start position of the second keyword and a length of the second keyword; O represents the value of the starting position of the third keyword, and S represents the sum of the value of the starting position of the third keyword and the length of the third keyword. M, N, P, Q and S are positive integers.

Namely:

keyword_start[j] <= keyword_start[i] <= (keyword_start[j] + keyword_long[j]);

keyword_start[j] < keyword_start[n] <= (keyword_start[j] + keyword_long[j]) < (keyword_start[n] + keyword_long[n]).

wherein n is a third keyword.

Specifically, in the embodiment of the present application, when any sub-condition of the first condition and the second condition is satisfied, it is determined that a nesting relationship exists between the keywords of the text to be annotated.
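The pairwise conditions above translate almost directly into chained comparisons. The sketch below is illustrative only: the function names are hypothetical, positions are assumed to be 0-based offsets, and testing the cross case before the complete case is an added assumption, needed because any pair satisfying the second sub-condition also satisfies the first.

```python
def first_sub_condition(start, length, i, j):
    """Lengths differ and M <= N <= P (candidate complete nesting)."""
    return (length[i] != length[j]
            and start[j] <= start[i] <= start[j] + length[j])


def second_sub_condition(start, length, i, j):
    """Lengths differ and M < N <= P < Q (candidate cross nesting)."""
    return (length[i] != length[j]
            and start[j] < start[i] <= start[j] + length[j] < start[i] + length[i])


def classify_pair(start, length, i, j):
    """Return 'cross', 'complete', or None for keywords i and j."""
    # Check cross nesting first: M < N <= P < Q implies M <= N <= P.
    if second_sub_condition(start, length, i, j):
        return "cross"
    if first_sub_condition(start, length, i, j):
        return "complete"
    return None


# "earthquake" at offset 4 (length 10) and "quake" at offset 9 (length 5)
kind = classify_pair([4, 9], [10, 5], i=1, j=0)
```

Note that the inequalities are taken literally from the description, so a keyword that begins exactly where another ends (N equal to P) is still reported as nested.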

The manner of processing under the above different combinations of conditions is specifically described below.

First case: the first condition is that the lengths of the first keyword and the second keyword are not equal. When it is determined that a complete nesting relationship exists between the keywords of the text to be annotated (namely, the first condition and the first sub-condition of the second condition are both met), the information of the first keyword is deleted from the first keyword information set to obtain the second keyword information set.

That is, in this case, the lengths of the two keywords are not equal, and the following is satisfied: M ≤ N ≤ P; wherein M represents a value of a start position of the second keyword, N represents a value of a start position of the first keyword, and P represents a sum of the value of the start position of the second keyword and a length of the second keyword.

This situation may be referred to as complete nesting, i.e. the second keyword completely comprises the first keyword.

For example, in the sentence "People's Square is filled with the noise of gongs and drums", "People" and "People's Square" are both keywords, and "People's Square" includes "People", so the two are completely nested. In this case, the information of "People" may be deleted from the first keyword information set.
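Deleting the contained keyword in the complete-nesting case might look like the sketch below. The helper name `drop_contained_keywords` is hypothetical. As an added assumption, the containment test is stricter than the first sub-condition: it also requires that keyword i end inside keyword j, so that only the shorter, enclosed keyword is dropped.

```python
def drop_contained_keywords(keyword, keyword_start, keyword_long):
    """Remove every keyword that lies entirely inside a longer keyword."""
    n = len(keyword)
    contained = set()
    for i in range(n):
        for j in range(n):
            if i == j or keyword_long[i] == keyword_long[j]:
                continue
            i_end = keyword_start[i] + keyword_long[i]
            j_end = keyword_start[j] + keyword_long[j]
            # Assumption: i starts at or after j AND ends at or before j.
            if keyword_start[j] <= keyword_start[i] and i_end <= j_end:
                contained.add(i)
    keep = [k for k in range(n) if k not in contained]
    return ([keyword[k] for k in keep],
            [keyword_start[k] for k in keep],
            [keyword_long[k] for k in keep])


second_set = drop_contained_keywords(
    ["People", "People's Square"], [0, 0], [6, 15]
)
```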

Second case: the first condition is that the lengths of the first keyword and the second keyword are not equal, and the first condition and the second sub-condition of the second condition are both met. That is, it is determined that a cross nesting relationship exists between the keywords of the text to be annotated. In this case, the lengths of the two keywords are not equal, and the following is satisfied: M < N ≤ P < Q; wherein M represents a value of a start position of the second keyword, N represents a value of a start position of the first keyword, P represents a sum of the value of the start position of the second keyword and a length of the second keyword, and Q represents a sum of the value of the start position of the first keyword and the length of the first keyword.

This case may be referred to as cross nesting, i.e., the first keyword and the second keyword do not completely include each other, and both have only portions overlapping each other.

For example, in the sentence "the large earthquake breaks many houses", "earthquake" and "earthquake break" are both keywords and share part of their characters, so the two are cross-nested.

For this case, the following procedure may be included:

(1) Segmenting the sentence in which the keywords having the nesting relation are located to obtain segmented text.

Specifically, in this step, the sentence in which the keywords having the nesting relation are located is segmented to obtain a first segmented text and a second segmented text; wherein a first sub-portion of the first segmented text includes the content of the first keyword other than the overlapping portion, and a second sub-portion includes the second keyword; a first sub-portion of the second segmented text includes the first keyword, and a second sub-portion includes the content of the second keyword other than the overlapping portion; the overlapping portion is the portion shared by the first keyword and the second keyword.

That is, the position of the cut is the last character of the overlapping portion of the two keywords. For example, in "many houses are broken by a large earthquake" the splitting position is the earthquake. Then, the obtained texts are respectively: "that land is a large earthquake, many houses are broken" and "that land is a large earthquake, many houses are broken".

(2) Acquiring information of the keywords of the segmented text according to the segmented text.

Before the information of the keywords of the segmented text is acquired, the method further comprises: updating the text to be annotated according to the segmented text to obtain an updated text to be annotated. The information of the keywords of the updated text to be annotated, namely the information of the keywords of the segmented text, is then acquired.

Specifically, in this step, when the first keyword and the second keyword belong to the same category, scoring the first segmentation text and the second segmentation text, and replacing the sentence with the segmentation text with a higher score to obtain an updated text to be annotated; and when the first keyword and the second keyword belong to different categories, replacing the sentence by the first segmentation text and the second segmentation text to obtain an updated text to be annotated.

Here, the category is one of the categories used to classify the keywords in the reference keyword set. When the first keyword and the second keyword belong to the same category, the sentence structure of the obtained segmented texts can be scored (for example, using the KenLM method), so that the text with the more accurate structure is obtained, which facilitates subsequent labeling and further improves the labeling accuracy. When the first keyword and the second keyword belong to different categories, the sentence is replaced by the first segmented text and the second segmented text to obtain the updated text to be annotated. That is, the updated text to be annotated does not include the original sentence but includes the first segmented text and the second segmented text, whose positions in the text to be annotated are the same as the position of the original sentence.

After the segmentation processing, no keyword nesting remains among the keywords, so the information of the keywords of the updated text to be annotated can be determined again to obtain the information of the keywords of the segmented text. The acquisition works on the same principle as the keyword acquisition described above.

(3) Obtaining the second keyword information set according to the information of the keywords of the segmented text and the first keyword information set.

Specifically, in this step, the information of the first keyword and the information of the second keyword are first deleted from the first keyword information set to obtain an updated first keyword information set. The updated first keyword information set is then updated according to the information of the keywords of the updated text to be annotated to obtain the second keyword information set.

Because the segmentation removes the nesting between the keywords, the keyword information determined before segmentation may be inaccurate. Therefore, the information of the first keyword and the information of the second keyword needs to be deleted from the first keyword information set to ensure the accuracy of the obtained keywords. To ensure the accuracy of the labeling, the first keyword information set is then updated according to the information of the keywords of the updated text to be annotated, so as to obtain the second keyword information set.

The second keyword information set may be the same as or different from the first keyword information set.
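Steps (1) and (2) of the cross-nesting case can be sketched as below. This is one reading of the description, not a confirmed implementation: each candidate segmentation keeps one keyword whole and leaves the other with only its non-overlapping part. The function names, the "|" separator, the 0-based offsets, and the toy scorer are all assumptions; a real scorer (for example a KenLM language model, as the text suggests) would be passed in as `score`.

```python
def split_cross_nested(sentence, left_start, left_len, right_start, sep="|"):
    """Return two candidate segmentations of a sentence with two
    cross-nested keywords (left keyword starts before the right one)."""
    left_end = left_start + left_len  # exclusive end of the left keyword
    # Candidate 1: cut before the right keyword, so the right keyword stays whole.
    keep_right = sentence[:right_start] + sep + sentence[right_start:]
    # Candidate 2: cut after the left keyword, so the left keyword stays whole.
    keep_left = sentence[:left_end] + sep + sentence[left_end:]
    return keep_right, keep_left


def update_sentence(first_cut, second_cut, first_category, second_category, score):
    """Same category: keep the higher-scoring candidate.
    Different categories: keep both candidates in place of the sentence."""
    if first_category == second_category:
        return [max((first_cut, second_cut), key=score)]
    return [first_cut, second_cut]


def toy_score(text):
    # Placeholder scorer for the demo only: prefers an earlier cut point.
    return -text.index("|")


sentence = "the earthquake damage report"
cuts = split_cross_nested(sentence, left_start=4, left_len=10, right_start=9)
same = update_sentence(cuts[0], cuts[1], "event", "event", toy_score)
different = update_sentence(cuts[0], cuts[1], "event", "object", toy_score)
```

After the sentence is replaced, keyword location would be run again on the updated text, and the result merged into the first keyword information set as described in step (3).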

Third case: the lengths of the first keyword and the second keyword are not equal, and the lengths of the second keyword and the third keyword are not equal, and the following conditions are satisfied: M ≤ N ≤ P, and M < O ≤ P < S;

In this case, there is both a complete nesting of keywords and a cross-nesting of keywords, i.e. a mixed nesting.

For example, in the sentence "Labor People's Square is filled with gongs and drums, fluttering red flags, and a sea of people", "Labor People", "People" and "People's Square" are keywords. "People's Square" includes "People", so these two are completely nested, while "Labor People" and "People's Square" are cross-nested.

In this case, the processing may be performed as follows:

(1) When it is determined that a mixed nesting relationship exists among the keywords of the text to be annotated, selecting a first target keyword, namely the keyword with the longest length, from the first keyword, the second keyword and the third keyword.

(2) When there is one first target keyword, deleting the information of the second keyword and the information of the third keyword from the first keyword information set.

(3) When there are multiple first target keywords, selecting from them a second target keyword, namely the one nested with the most keywords, and deleting the information of the first keyword and the information of the third keyword from the first keyword information set.

(4) When the first target keyword or the second target keyword does not exist, deleting the information of the first keyword, the information of the second keyword and the information of the third keyword from the first keyword information set.

For example, in the sentence "Labor People's Square is filled with gongs and drums, fluttering red flags, and a sea of people", "Labor People's Square" is the keyword with the longest length and there is only one such keyword, so the information of the keywords "Labor People", "People" and "People's Square" is deleted. In this way, the obtained text fits the context more closely and the risk of missed labels is avoided.
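The selection rules for mixed nesting can be sketched as follows. This is an interpretation under stated assumptions: the nested keywords are passed as `(keyword, start, length)` tuples, `resolve_mixed` is a hypothetical name, "nested with the most keywords" is approximated by counting character-range overlaps, and a tie that cannot be broken is treated as the case where no target keyword exists.

```python
def resolve_mixed(group):
    """Return the keywords to keep from a group of mutually nested keywords.

    group: list of (keyword, start, length) tuples.
    """
    if not group:
        return []
    longest_len = max(length for _, _, length in group)
    longest = [k for k in group if k[2] == longest_len]
    if len(longest) == 1:
        return longest  # a single longest keyword: drop all the others

    def overlaps(a, b):
        # True when the two character ranges share at least one position.
        return a[1] < b[1] + b[2] and b[1] < a[1] + a[2]

    # Tie on length: prefer the candidate that overlaps the most other keywords.
    counts = [sum(overlaps(c, o) for o in group if o is not c) for c in longest]
    best = max(counts)
    winners = [c for c, n in zip(longest, counts) if n == best]
    # Assumption: no unique winner means every keyword in the group is dropped.
    return winners if len(winners) == 1 else []


kept = resolve_mixed([
    ("Labor People's Square", 0, 21),
    ("Labor People", 0, 12),
    ("People", 6, 6),
    ("People's Square", 6, 15),
])
```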

Step 104, labeling the text to be annotated according to the second keyword information set.

In this step, the text to be annotated can be annotated in at least two ways:

In the first manner, when a target sub-attribute of the second keyword information set is empty, the text to be annotated is labeled as a first type, where the target sub-attribute is any one of the keyword, the starting position of the keyword in the text to be annotated, and the length of the keyword. When the target sub-attribute of the second keyword information set is not empty, the keywords in the text to be annotated are labeled as a second type according to the second keyword information set, and the content other than the keywords in the text to be annotated is labeled as a third type.

That is, if one of the sub-attributes is null, it indicates that the second keyword information set is null, that is, no keywords exist in the text to be annotated. Then the text to be annotated may be annotated as a first type. For example, an identifier may be added to the text to be annotated to indicate that the text to be annotated is of a first type, i.e., no keywords are present. If not, then the existence of keywords in the text to be annotated is indicated. At this time, the keywords in the text to be marked can be marked according to the keywords, the positions of the keywords in the text to be marked and the lengths of the keywords. For example, a first identifier is added to the keyword, and a second identifier is added to the other content, where the first identifier and the second identifier are different to distinguish the keyword from the non-keyword.

In the second manner, the category of each keyword in the second keyword information set is obtained. When the category is empty, or when a target sub-attribute of the second keyword information set is empty, the text to be annotated is labeled as a first type, where the target sub-attribute is any one of the keyword, the starting position of the keyword in the text to be annotated, and the length of the keyword; when the category is not empty, or when the target sub-attribute of the second keyword information set is not empty, the keywords in the text to be annotated are labeled as a second type according to the second keyword information set, and the content other than the keywords in the text to be annotated is labeled as a third type.

Wherein the category is a category in the classification of keywords in the reference keyword set.

If the category is empty, then it is indicated that no keywords are present in the text to be annotated. Then the text to be annotated may be annotated as a first type. For example, an identifier may be added to the text to be annotated to indicate that the text to be annotated is of a first type, i.e., no keywords are present. If not, then the existence of keywords in the text to be annotated is indicated. At this time, the keywords in the text to be marked can be marked according to the keywords, the positions of the keywords in the text to be marked and the lengths of the keywords. For example, a first identifier is added to the keyword, and a second identifier is added to the other content, where the first identifier and the second identifier are different to distinguish the keyword from the non-keyword.

Or if one of the sub-attributes is null, the second keyword information set is null, that is, no keywords exist in the text to be annotated. Then the text to be annotated may be annotated as a first type. For example, an identifier may be added to the text to be annotated to indicate that the text to be annotated is of a first type, i.e., no keywords are present. If not, then the existence of keywords in the text to be annotated is indicated. At this time, the keywords in the text to be marked can be marked according to the keywords, the positions of the keywords in the text to be marked and the lengths of the keywords. For example, a first identifier is added to the keyword, and a second identifier is added to the other content, where the first identifier and the second identifier are different to distinguish the keyword from the non-keyword.

In the above process, the first type and the third type are used for indicating that the marked object is not a keyword, and the second type is used for indicating that the marked object is a keyword. In practical applications, any identifier may be used to represent the above three types. By the method, the keywords and the non-keywords in the text to be marked can be flexibly marked, so that the marking efficiency is improved.
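A character-level version of the first labeling manner might look like this. The identifiers "NO_KEYWORD", "HAS_KEYWORD", "K" and "O" are arbitrary placeholders chosen for the sketch (the description allows any identifiers), and offsets are assumed to be 0-based.

```python
def label_text(text, keyword, keyword_start, keyword_long):
    """Label a text from the second keyword information set.

    Returns (text_type, tags) where tags holds one label per character.
    """
    # First type: any empty sub-attribute means the text has no keywords.
    if not keyword or not keyword_start or not keyword_long:
        return "NO_KEYWORD", []
    tags = ["O"] * len(text)              # third type: non-keyword characters
    for start, length in zip(keyword_start, keyword_long):
        for pos in range(start, start + length):
            tags[pos] = "K"               # second type: keyword characters
    return "HAS_KEYWORD", tags


text_type, tags = label_text("the earthquake", ["earthquake"], [4], [10])
```

A per-category label (for example one tag per keyword category) could replace the single "K" tag when the second manner is used.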

Referring to fig. 2, fig. 2 is a flowchart of a labeling method provided in an embodiment of the present application, as shown in fig. 2, including the following steps:

Step 201, obtaining a keyword set in the field to which the text to be annotated belongs.

Wherein the keywords may be given by an expert, scholars, practitioner or technician in the relevant field. For keywords in some fields, synonyms (complaint-complaint), aliases (Chongqing-mountain city), abbreviations (Beijing university-North university) and the like are also required to be expanded.

Step 202, performing de-duplication processing on the keyword set to obtain a reference keyword set, so as to maintain the uniqueness of each keyword. By the method, the efficiency of the whole labeling process can be improved, and the accuracy of the labeling result can be ensured.

Step 203, classifying the keywords in the reference keyword set. In the embodiment of the present application, the classification principle is as follows:

1) The uniqueness of the classification, i.e., the keywords of each category are not repeated with each other;

2) The accuracy of the classification, which determines the quality of model training and prediction performance.

In accordance with the above principles, the number of keywords in the set of step 202 is equal to the sum of the number of keywords of each category in step 203.

Step 204, locating keywords in the text to be annotated.

In this step, the text to be annotated in the field is input, and the keywords contained in the text to be annotated and the starting position of the keywords, that is, the position of the first character of the keywords in the text to be annotated, are located in combination with the reference keyword set in step 202.

Step 205, recording the keywords in the text to be annotated, the starting position of each keyword, and the length of each keyword, and adding them to the lists Keyword, keyword_start, and keyword_long, respectively.
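Steps 204 and 205 could, for example, be sketched as below. This is only an illustrative sketch: the function name `locate_keywords` and the sample text are hypothetical, the use of 0-based character indices is an assumption of the sketch, and recording every occurrence of a keyword (rather than only the first) is likewise an assumption.

```python
def locate_keywords(text, reference_keywords):
    """Return the parallel lists Keyword, keyword_start, keyword_long (sketch)."""
    keyword, keyword_start, keyword_long = [], [], []
    for kw in reference_keywords:
        if not kw:
            continue  # skip empty entries
        pos = text.find(kw)
        while pos != -1:  # assumption: record every occurrence
            keyword.append(kw)
            keyword_start.append(pos)    # 0-based index of the first character
            keyword_long.append(len(kw))
            pos = text.find(kw, pos + 1)
    return keyword, keyword_start, keyword_long


# Hypothetical sample text and reference keyword set.
text = "labor people met in people square"
Keyword, keyword_start, keyword_long = locate_keywords(
    text, ["labor people", "people"]
)
```

The three lists stay aligned by index, so element k of each list describes the same keyword occurrence.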

Step 206, determining whether the keywords in Keyword are nested.

For any keyword i and a certain keyword j in Keyword, the nesting conditions are:

(1) The lengths of keywords i and j are not equal: keyword_long[i] != keyword_long[j].

(2) The value of the starting position of the keyword i is greater than or equal to the value of the starting position of the keyword j and less than or equal to the sum of the value of the starting position of the keyword j and the length of the keyword j. Namely:

keyword_start[j] <= keyword_start[i] <= (keyword_start[j] + keyword_long[j]).

(3) The value of the starting position of the keyword i is larger than the value of the starting position of the keyword j, the value of the starting position of the keyword i is smaller than or equal to the sum of the value of the starting position of the keyword j and the length, and the sum of the value of the starting position of the keyword j and the length is smaller than the sum of the value of the starting position of the keyword i and the length.

Namely: keyword_start[j] < keyword_start[i] <= (keyword_start[j] + keyword_long[j]) < (keyword_start[i] + keyword_long[i]).

(4) There exist keywords i, i+ … in Keyword which, with respect to keyword j, together satisfy both condition (2) and condition (3). For example, keyword i and keyword j satisfy condition (2) above, while keyword i+1 and keyword j satisfy condition (3) above.

In the above description, the length of the keyword, the value of the start position, and the sum obtained are all positive integers.
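For illustration only, the pairwise conditions (1) to (3) might be checked as in the following sketch. The function name `nesting_kind` and its return values are hypothetical. The comparisons mirror the inequalities literally as written above; because any pair satisfying condition (3) also satisfies condition (2), the sketch tests condition (3) first, which is a design choice of the sketch rather than something stated in the embodiment. An actual implementation might tighten the boundary comparisons.

```python
def nesting_kind(start_i, long_i, start_j, long_j):
    """Classify keyword i relative to keyword j as "cross", "full", or None (sketch)."""
    if long_i == long_j:
        return None  # condition (1) is not satisfied
    end_i = start_i + long_i
    end_j = start_j + long_j
    # Condition (3): i starts after j starts, no later than j's end sum,
    # and i's end sum lies beyond j's end sum.
    if start_j < start_i <= end_j < end_i:
        return "cross"
    # Condition (2): i starts between j's start and j's end sum.
    if start_j <= start_i <= end_j:
        return "full"
    return None
```

Condition (4) then corresponds to finding, for the same keyword j, one keyword for which this function returns "full" and another for which it returns "cross".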

Step 207, if no nesting exists among the keywords, keeping Keyword, keyword_start, and keyword_long unchanged.

Step 208, if nesting exists among the keywords, updating Keyword, keyword_start, and keyword_long according to the information of the nested keywords. The specific steps are as follows:

If conditions (1) and (2) are satisfied, the case is complete nesting. The position of keyword i is recorded and, according to that position, the elements of keyword i are deleted from Keyword, keyword_start, and keyword_long; that is, keyword i, its starting position, and its length are deleted, yielding the updated Keyword, keyword_start, and keyword_long.
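The deletion for complete nesting might look like the following sketch. The function name `drop_fully_nested` is hypothetical. In addition to conditions (1) and (2), the sketch assumes that keyword i must also end within keyword j, so that cross-nested pairs are left for separate handling; that extra check is an assumption of the sketch, not a statement of the embodiment.

```python
def drop_fully_nested(keyword, keyword_start, keyword_long):
    """Delete every keyword i that is completely nested in some keyword j (sketch)."""
    n = len(keyword)
    to_delete = set()
    for i in range(n):
        end_i = keyword_start[i] + keyword_long[i]
        for j in range(n):
            if i == j or keyword_long[i] == keyword_long[j]:
                continue  # condition (1): lengths must differ
            end_j = keyword_start[j] + keyword_long[j]
            # Condition (2), plus the assumed check that i ends inside j.
            if keyword_start[j] <= keyword_start[i] <= end_j and end_i <= end_j:
                to_delete.add(i)
    keep = [k for k in range(n) if k not in to_delete]
    return (
        [keyword[k] for k in keep],
        [keyword_start[k] for k in keep],
        [keyword_long[k] for k in keep],
    )


# Hypothetical input: "people" at position 6 lies inside "labor people".
updated = drop_fully_nested(
    ["labor people", "people", "people"], [0, 6, 20], [12, 6, 6]
)
```

Deleting by index from all three lists at once keeps them aligned, which is what the recorded position of keyword i is used for.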

If conditions (1) and (3) are satisfied, the case is cross nesting, and the nesting relationship is split. For example, in the sentence "that field is bombarded with a large number of people.", the two keywords "bombed" and "fried dead" are cross-nested.

In this case, the sentence is first split, and the split sentences are as follows: "big in that field, many people die." and "field is bombarded, and many people die.". Then, the categories of the two keywords are determined. If the categories of the two keywords are the same, the two split sentences are scored, the sentence with the higher score is selected for labeling, and the original text is replaced by that sentence, namely by "the big bomber to death many people.". If the categories of the two keywords are different, the original text is replaced by the two split sentences, namely by "a plurality of people die due to the fact that the word is bombarded.".

Finally, according to the positions of keywords i and j, the corresponding elements are deleted from Keyword, keyword_start, and keyword_long; that is, keywords i and j, their starting positions, and their lengths are deleted, yielding the updated Keyword, keyword_start, and keyword_long.
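The splitting and selection for cross nesting could be sketched as follows. Everything named here is hypothetical: the function `resolve_cross`, the separator `sep`, the sample sentence, and in particular the scoring function, since the embodiment does not specify how the split sentences are scored. The sketch assumes 0-based indices, with keyword a starting before keyword b and overlapping it.

```python
def resolve_cross(sentence, start_a, long_a, start_b, same_category, score, sep="|"):
    """Split a sentence around two cross-nested keywords a and b (sketch).

    One cut is placed just before keyword b (b stays whole), the other just
    after keyword a (a stays whole).
    """
    end_a = start_a + long_a
    cut_before_b = sentence[:start_b] + sep + sentence[start_b:]
    cut_after_a = sentence[:end_a] + sep + sentence[end_a:]
    if same_category:
        # Same category: keep only the higher-scoring split sentence.
        return [max((cut_before_b, cut_after_a), key=score)]
    # Different categories: keep both split sentences.
    return [cut_before_b, cut_after_a]


def toy_score(split_sentence):
    """Placeholder score; a real scorer is outside the scope of this sketch."""
    return 1 if "where" in split_sentence else 0


# "now" (start 0, length 3) and "where" (start 2, length 5) overlap on "w".
same = resolve_cross("nowhere", 0, 3, 2, True, toy_score)
both = resolve_cross("nowhere", 0, 3, 2, False, toy_score)
```

After the replacement, the keywords of the split text would be located again, and the entries of the two original keywords removed from the three lists as described above.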

If conditions (1) and (4) are satisfied, the case is mixed nesting, i.e., both cross nesting and complete nesting are present. For example: "people on the plaza of people in the labor people gong and blonde, red flag show, people mountain and sea.". With respect to the keyword "labor people", the keywords "labor" and "people" are completely nested in it, and the keyword "people square" is cross-nested with it.

In this case, first, if exactly one keyword has the largest length, that keyword is selected for labeling. That is, only the information of the longest keyword is retained, and the information of the other keywords is deleted. Otherwise, the keyword that nests the most keywords is selected for labeling. That is, only the information of the keyword nesting the most keywords is retained, and the information of the other keywords is deleted, so that the label fits the context more closely and the risk of missed labels can be avoided. In this example, "labor people" is labeled. If neither of the two conditions is met, the entries in Keyword, keyword_start, and keyword_long corresponding to "labor people", "labor", "people", and "people square" in the sentence are cleared.
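The selection rule for mixed nesting might be sketched as below. The function name `resolve_mixed` and the tuple layout are hypothetical, and the letters in the example merely stand in for characters: "ABCD" and "CDEF" tie for the longest length, "ABCD" contains two shorter keywords and "CDEF" only one, which mirrors the structure of the example above.

```python
def resolve_mixed(group):
    """Pick the keyword to keep from a mixed-nested group (sketch).

    group is a list of (keyword, start, length) tuples; the returned list holds
    the single retained entry, or is empty when all entries are cleared.
    """
    max_len = max(length for _, _, length in group)
    longest = [item for item in group if item[2] == max_len]
    if len(longest) == 1:
        return longest  # exactly one longest keyword: keep it

    def nested_count(item):
        _, start, length = item
        return sum(
            1
            for other in group
            if other is not item
            and start <= other[1]
            and other[1] + other[2] <= start + length
        )

    counts = [nested_count(item) for item in longest]
    best = max(counts)
    winners = [item for item, c in zip(longest, counts) if c == best]
    # Neither rule singles out one keyword: clear the whole group.
    return winners if len(winners) == 1 else []


kept = resolve_mixed([("ABCD", 0, 4), ("AB", 0, 2), ("CD", 2, 2), ("CDEF", 2, 4)])
```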

Step 209, determining and recording the category of each keyword in Keyword.

Specifically, in this step, the category of each Keyword is determined in combination with the Keyword category classification in step 203, and recorded in the keyword_type.

It is then determined whether keyword_type is null. If keyword_type is null, step 210 is performed; otherwise, step 211 is performed.

Step 210, if keyword_type is null, labeling the whole text to be annotated as one and the same type.

Step 211, if keyword_type is not null, labeling the keywords according to Keyword, keyword_start, keyword_long, and keyword_type, and labeling the remaining content as another single type so as to distinguish it from the keywords.

In this process, the judgment may instead be based on any one of Keyword, keyword_start, and keyword_long; the principle is the same as when it is based on keyword_type.
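As one possible illustration of steps 210 and 211, the following sketch assigns one tag per character. The function name `label_text`, the tag "O" for non-keyword content, and the category name "LOC" are hypothetical identifiers; the embodiment allows any identifiers. For simplicity, the sketch uses the same tag for a text without keywords and for non-keyword content, and it assumes 0-based starting positions.

```python
def label_text(text, keyword_start, keyword_long, keyword_type, other_tag="O"):
    """Return one tag per character of the text (sketch)."""
    tags = [other_tag] * len(text)
    if not keyword_type:
        return tags  # step 210: the whole text receives the same type
    # Step 211: characters inside a keyword receive that keyword's category.
    for start, length, kw_type in zip(keyword_start, keyword_long, keyword_type):
        for pos in range(start, start + length):
            tags[pos] = kw_type
    return tags


# Hypothetical sample: "here" starts at index 4 and has length 4.
tags = label_text("now here", [4], [4], ["LOC"])
no_keyword_tags = label_text("now here", [], [], [])
```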

According to the scheme, the automatic labeling can be completed by combining the keywords in a certain field and the texts in the corresponding field, the structure is simple, the efficiency is high, particularly, the texts in a big data scene are labeled, a large amount of manpower and material resources can be saved, and the keyword labeling in the related field can be completed quickly at low cost. In addition, the scheme of the embodiment of the application has more perfect functions and overcomes the defects of the regular method. By using the scheme of the embodiment of the application, because the keywords with nested relations are processed, each keyword can be accurately distinguished when the text is marked, and therefore the marking accuracy can be improved.

The embodiment of the application also provides a labeling device. As shown in fig. 3, the labeling device 300 includes:

a first obtaining module 301, configured to obtain a text to be annotated; a first determining module 302, configured to determine a first keyword information set of the text to be annotated; the first processing module 303 is configured to process, when it is determined that the text to be annotated has a keyword nesting according to the first keyword information set, keywords having a nesting relationship to obtain a second keyword information set; the first labeling module 304 is configured to label the text to be labeled according to the second keyword information set.

Optionally, the first determining module 302 includes:

the first acquisition sub-module is used for acquiring a reference keyword set corresponding to the text to be annotated;

the second acquisition sub-module is used for acquiring information of keywords of the text to be annotated by utilizing the reference keyword set, wherein the keywords are positioned in the reference keyword set;

a third obtaining sub-module, configured to form the first keyword information set by using information of each keyword, where the information of the keyword includes: the method comprises the steps of obtaining a keyword, a starting position of the keyword in the text to be marked and the length of the keyword.

Optionally, the first obtaining submodule includes:

the obtaining unit is used for obtaining a first keyword set in the field to which the text to be marked belongs;

the deleting unit is used for deleting repeated keywords in the first keyword set to obtain an updated first keyword set;

the classification unit is used for classifying the keywords in the updated first keyword set to obtain the reference keyword set, wherein the reference keyword set comprises keywords of at least two categories, and different keywords belong to different categories.

Optionally, the first processing module 303 includes:

the first processing sub-module is used for carrying out nested relation analysis on the keywords of the text to be marked according to the starting position of the keywords in the text to be marked and the length of the keywords, so as to obtain an analysis result;

and the second processing sub-module is used for processing the information of the keywords with the nested relation in the first keyword information set when the analysis result shows that the nested relation exists among the keywords, so as to obtain the second keyword information set.

Optionally, the first processing sub-module is configured to:

When the first condition and the second condition are met, determining that a nesting relationship exists between the keywords of the text to be annotated, otherwise, determining that no nesting relationship exists between the keywords of the text to be annotated;

the first condition is: the lengths of the first keyword and the second keyword are not equal, or the lengths of the first keyword and the second keyword are not equal, and the lengths of the second keyword and the third keyword are not equal.

The second condition includes:

the first sub-condition is: M ≤ N ≤ P;

wherein M represents a value of a start position of the second keyword, N represents a value of a start position of the first keyword, P represents a sum of the value of the start position of the second keyword and a length of the second keyword, and Q represents a sum of the value of the start position of the first keyword and the length of the first keyword;

wherein M represents a value of a start position of the second keyword, N represents a value of a start position of the first keyword, and P represents a sum of the value of the start position of the second keyword and a length of the second keyword; O represents the value of the starting position of the third keyword, and S represents the sum of the value of the starting position of the third keyword and the length of the third keyword; wherein M, N, P, Q, and S are positive integers;

optionally, the first processing sub-module is configured to determine that a nesting relationship exists between the keywords of the text to be annotated when any sub-condition of the first condition and the second condition is satisfied.

Optionally, the first condition is that the lengths of the first keyword and the second keyword are not equal; and the second processing submodule is used for deleting the information of the first keyword from the first keyword information set when the fact that the complete nested relation exists among the keywords of the text to be annotated is determined, so that the second keyword information set is obtained.

Optionally, the first condition is that the lengths of the first keyword and the second keyword are not equal; the second processing sub-module includes:

The segmentation unit is used for segmenting sentences where the keywords with nested relations exist when the cross nested relations exist among the keywords of the text to be annotated, so that segmented text is obtained;

the obtaining unit is used for obtaining the information of the keywords of the segmented text;

and the updating unit is used for obtaining the second keyword information set according to the information of the keywords of the segmented text and the first keyword information set.

Optionally, the segmentation unit is configured to segment a sentence where a keyword with a nesting relationship is located, so as to obtain a first segmentation text and a second segmentation text; wherein a first sub-portion of the first cut text includes content of the first keyword other than the overlapping portion, and a second sub-portion includes the second keyword; a first subsection of the second segmented text including the first keyword and a second subsection including content of the second keyword other than the overlapping portion; the overlapping portion is the same portion in the first keyword and the second keyword;

the first updating unit is configured to score the first segmented text and the second segmented text when the first keyword and the second keyword belong to the same category, and replace the sentence with the segmented text with a higher score to obtain an updated text to be annotated; and when the first keyword and the second keyword belong to different categories, replacing the sentence by the first segmentation text and the second segmentation text to obtain an updated text to be annotated.

Optionally, the first condition is: the lengths of the first keywords and the second keywords are not equal, and the lengths of the second keywords and the third keywords are not equal; the second processing sub-module includes:

the selecting unit is used for selecting a first target keyword with the longest length from the first keyword, the second keyword and the third keyword when the mixed nesting relationship exists among the keywords of the text to be marked;

a first deleting unit, configured to delete, when the first target keyword is one, information of the second keyword and information of the third keyword from the first keyword information set;

the second deleting unit is used for selecting a second target keyword which is nested with the most keywords from the first target keywords when the first target keywords are multiple, and deleting the information of the first keywords and the information of the third keywords from the first keyword information set;

and a third deleting unit configured to delete, when the first target keyword or the second target keyword does not exist, information of the first keyword, information of the second keyword, and information of the third keyword from the first keyword information set.

Optionally, the first labeling module 304 includes:

the first labeling sub-module is used for labeling the text to be labeled as a first type when the target sub-attribute of the second keyword information set is empty, wherein the target sub-attribute is any one of a keyword, a starting position of the keyword in the text to be labeled and the length of the keyword;

and the second labeling sub-module is used for labeling the keywords in the text to be labeled as a second type and labeling the contents except the keywords in the text to be labeled as a third type according to the second keyword information set when the target sub-attribute of the second keyword information set is not empty.

Optionally, the first labeling module 304 includes:

the first acquisition sub-module is used for acquiring the category of each keyword in the second keyword information set;

the third labeling sub-module is used for labeling the text to be labeled as a first type when the category is empty or when the target sub-attribute of the second keyword information set is empty, wherein the target sub-attribute is any one of a keyword, a starting position of the keyword in the text to be labeled and the length of the keyword;

and the fourth labeling sub-module is used for labeling the keywords in the text to be labeled as a second type and labeling the contents except the keywords in the text to be labeled as a third type according to the second keyword information set when the category is not empty or the target sub-attribute of the second keyword information set is not empty.

The device provided in the embodiment of the present application may execute the above method embodiment, and its implementation principle and technical effects are similar, and this embodiment will not be described herein again.

It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice. In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a processor-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution, in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The embodiment of the application also provides an electronic device, which comprises: a memory, a processor, and a program stored on the memory and executable on the processor; the processor is configured to read the program in the memory to implement the steps of the labeling method described above.

The embodiment of the application further provides a readable storage medium storing a program which, when executed by a processor, implements each process of the above labeling method embodiment and can achieve the same technical effects; to avoid repetition, details are not described here again. The readable storage medium may be any available medium or data storage device that can be accessed by a processor, including, but not limited to, magnetic memories (e.g., floppy disks, hard disks, magnetic tapes, magneto-optical disks (MO), etc.), optical memories (e.g., CD, DVD, BD, HVD, etc.), and semiconductor memories (e.g., ROM, EPROM, EEPROM, nonvolatile memory (NAND FLASH), solid state disks (SSD), etc.).

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. In light of such understanding, the technical solutions of the present application may be embodied essentially or in part in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and including instructions for causing a terminal (which may be a cell phone, computer, server, air conditioner, or network device, etc.) to perform the methods described in the embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims

1. A method of labeling, comprising:

acquiring a text to be marked;

determining a first keyword information set of the text to be annotated;

when the fact that the text to be annotated has keyword nesting is determined according to the first keyword information set, keywords with nesting relations are processed, and a second keyword information set is obtained; wherein, the nested relation refers to the content with overlapping between two keywords; the nesting relation is determined based on the starting position of the keyword in the text to be annotated and the length of the keyword;

2. The method of claim 1, wherein the determining the first set of keyword information for the text to be annotated comprises:

acquiring a reference keyword set corresponding to the text to be annotated;

acquiring information of keywords of the text to be annotated by using the reference keyword set, wherein the keywords are positioned in the reference keyword set;

and forming the first keyword information set by using information of each keyword, wherein the information of the keyword comprises: the method comprises the steps of obtaining a keyword, a starting position of the keyword in the text to be marked and the length of the keyword.

3. The method according to claim 2, wherein when determining that the text to be annotated has a keyword nest according to the first keyword information set, processing keywords having a nest relationship to obtain a second keyword information set includes:

according to the initial position of the keywords in the text to be marked and the length of the keywords, carrying out nested relation analysis on the keywords of the text to be marked to obtain an analysis result;

and when the analysis result shows that the nested relation exists among the keywords, processing the information of the keywords with the nested relation in the first keyword information set to obtain the second keyword information set.

4. The method of claim 3, wherein the performing nested relation analysis on the keywords of the text to be annotated according to the starting position of the keywords in the text to be annotated and the length of the keywords to obtain the analysis result includes:

when the first condition and the second condition are met, determining that a nesting relationship exists between the keywords of the text to be marked, otherwise determining that no nesting relationship exists between the keywords of the text to be marked, wherein the nesting relationship comprises complete nesting, cross nesting or mixed nesting;

The first condition is: the lengths of the first keywords and the second keywords are not equal; or the lengths of the first keyword and the second keyword are not equal, and the lengths of the second keyword and the third keyword are not equal.

5. The method of claim 4, wherein the second condition comprises:

the first sub-condition is: M ≤ N ≤ P;

wherein M represents a value of a start position of the second keyword, N represents a value of a start position of the first keyword, and P represents a sum of the value of the start position of the second keyword and a length of the second keyword; O represents the value of the starting position of the third keyword, and S represents the sum of the value of the starting position of the third keyword and the length of the third keyword; wherein M, N, P, Q, and S are positive integers;

And when any sub-condition of the first condition and the second condition is met, determining that a nested relation exists between the keywords of the text to be annotated.

6. The method of claim 4, wherein the first condition is that the lengths of the first keyword and the second keyword are not equal;

the processing the information of the keywords with the nested relation in the first keyword information set to obtain the second keyword information set includes:

and deleting the information of the first keyword from the first keyword information set when the fact that the complete nesting relationship exists among the keywords of the text to be annotated is determined, so that the second keyword information set is obtained.

7. The method of claim 4, wherein the first condition is that the lengths of the first keyword and the second keyword are not equal;

when determining that the cross nesting relationship exists among the keywords of the text to be marked, segmenting sentences in which the keywords with the nesting relationship exist to obtain segmented text;

Acquiring information of keywords of the segmented text according to the segmented text;

and obtaining the second keyword information set according to the information of the keywords of the segmented text and the first keyword information set.

8. The method of claim 7, wherein the segmenting the sentence in which the keyword with the nested relationship is located to obtain the segmented text comprises:

segmenting sentences in which keywords with nested relations are located to obtain a first segmented text and a second segmented text; wherein a first sub-portion of the first cut text includes content of the first keyword other than the overlapping portion, and a second sub-portion includes the second keyword; a first subsection of the second segmented text including the first keyword and a second subsection including content of the second keyword other than the overlapping portion; the overlapping portion is the same portion in the first keyword and the second keyword.

9. The method of claim 8, further comprising, prior to the obtaining information of the keywords of the segmented text: updating the text to be marked according to the segmented text to obtain the updated text to be marked, which specifically comprises the following steps:

When the first keyword and the second keyword belong to the same category, scoring the first segmentation text and the second segmentation text, and replacing the sentence by using the segmentation text with higher score to obtain an updated text to be marked;

and when the first keyword and the second keyword belong to different categories, replacing the sentence by the first segmentation text and the second segmentation text to obtain an updated text to be annotated.

10. The method of claim 4, wherein the first condition is: the lengths of the first keywords and the second keywords are not equal, and the lengths of the second keywords and the third keywords are not equal;

when determining that a mixed nesting relationship exists among keywords of the text to be annotated, selecting a first target keyword with the longest length from the first keyword, the second keyword and the third keyword;

when the first target keyword is one, deleting the information of the second keyword and the information of the third keyword from the first keyword information set;

When the first target keywords are multiple, selecting a second target keyword which is nested with the most keywords from the first target keywords, and deleting the information of the first keywords and the information of the third keywords from the first keyword information set;

and deleting the information of the first keyword, the information of the second keyword and the information of the third keyword from the first keyword information set when the first target keyword or the second target keyword does not exist.

11. The method according to claim 1, wherein labeling the text to be labeled according to the second keyword information set includes:

when the target sub-attribute of the second keyword information set is empty, marking the text to be marked as a first type, wherein the target sub-attribute is any one of a keyword, a starting position of the keyword in the text to be marked and the length of the keyword;

and when the target sub-attribute of the second keyword information set is not empty, marking keywords in the text to be marked as a second type according to the second keyword information set, and marking the content except the keywords in the text to be marked as a third type.

12. The method according to claim 1, wherein labeling the text to be labeled according to the second keyword information set includes:

acquiring the category of each keyword in the second keyword information set;

if the category is empty, or if the target sub-attribute of the second keyword information set is empty, marking the text to be marked as a first type, wherein the target sub-attribute is any one of a keyword, a starting position of the keyword in the text to be marked and a length of the keyword;

and marking the keywords in the text to be marked as a second type according to the second keyword information set, and marking the contents except the keywords in the text to be marked as a third type if the category is not null or if the target sub-attribute of the second keyword information set is not null.

13. A labeling device, comprising:

the first acquisition module is used for acquiring a text to be marked;

the first processing module is used for processing keywords with nesting relations to obtain a second keyword information set when determining that the keywords of the text to be annotated are nested according to the first keyword information set; wherein, the nested relation refers to the content with overlapping between two keywords; the nesting relation is determined based on the starting position of the keyword in the text to be annotated and the length of the keyword;

14. An electronic device, comprising: a memory, a processor, and a program stored on the memory and executable on the processor; characterized in that the processor is configured to read the program in the memory to implement the steps of the labeling method according to any one of claims 1 to 12.

15. A readable storage medium storing a program, wherein the program, when executed by a processor, implements the steps of the labeling method according to any one of claims 1 to 12.