CN115618824A - Data set labeling method and device, electronic equipment and medium - Google Patents


Info

Publication number
CN115618824A
Authority
CN
China
Prior art keywords
entity
text
dictionary
subsequence
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211344330.3A
Other languages
Chinese (zh)
Other versions
CN115618824B (en)
Inventor
张涵
刘星辰
陈晓峰
麻沁甜
张福缘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cangque Information Technology Co ltd
Original Assignee
Shanghai Cangque Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cangque Information Technology Co ltd filed Critical Shanghai Cangque Information Technology Co ltd
Priority to CN202211344330.3A priority Critical patent/CN115618824B/en
Publication of CN115618824A publication Critical patent/CN115618824A/en
Application granted granted Critical
Publication of CN115618824B publication Critical patent/CN115618824B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/169Annotation, e.g. comment data or footnotes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a data set labeling method and device, electronic equipment and a medium, wherein the method includes: determining an entity dictionary long text based on a pre-obtained entity dictionary of the target field; computing common subsequences of the entity dictionary long text and the text to be labeled, and determining the boundary intervals of the common subsequences; splicing the common subsequences whose boundaries overlap, based on those boundary intervals, to obtain a plurality of disjoint subsequences; and labeling the disjoint subsequences based on the entity names in the entity dictionary to obtain a labeled text. The invention reduces the workload of text labeling while also reducing the development cost.

Description

Data set labeling method and device, electronic equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a data set labeling method, a data set labeling device, electronic equipment and a medium.
Background
Named Entity Recognition (NER) is a commonly used information extraction technique in the field of natural language processing. Building an NER model generally requires labeled data, and the labeled data directly affects the performance of the model.
When facing texts of a new field, NER model training has a cold-start problem: an entity category label must be added to every character of the training text. For NER model training in a new field, however, the training data set can currently only be produced by manual labeling, which is labor-intensive and slow. Alternatively, an auxiliary labeling model can be built from an existing reference data set of the target field, for example by training a model, guided by a field knowledge graph, to predict the position of a masked entity character in a sentence; but this adds extra work and development cost, and field knowledge graphs are scarce, so the approach cannot be applied when no such graph exists. In summary, existing data set text labeling methods suffer from heavy workload and high development cost.
Disclosure of Invention
In view of this, the present invention provides a data set labeling method, device, electronic device, and medium, so as to reduce the workload of text labeling and the development cost.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:
in a first aspect, an embodiment of the present invention provides a data set labeling method, including: determining an entity dictionary long text based on a pre-obtained entity dictionary of the target field; computing common subsequences of the entity dictionary long text and the text to be labeled, and determining the boundary intervals of the common subsequences; splicing the common subsequences whose boundaries overlap, based on the boundary intervals of the common subsequences, to obtain a plurality of disjoint subsequences; and labeling the disjoint subsequences based on the entity names in the entity dictionary to obtain a labeled text.
In one embodiment, determining an entity dictionary long text based on a pre-obtained entity dictionary of the target field includes: acquiring the entity names in texts of the target field to obtain an entity dictionary; classifying the entity dictionary based on keywords of the entity names to obtain a plurality of entity categories; and sorting the entity names based on the entity categories to obtain the entity dictionary long text.
In one embodiment, computing the common subsequences of the entity dictionary long text and the text to be labeled and determining their boundary intervals includes: computing, as candidate common subsequences, the continuous identical character strings shared by the text to be labeled and the entity dictionary long text, and determining the length of each character string; eliminating, based on a predetermined common subsequence length threshold, the character strings whose length is smaller than the threshold, to obtain the common subsequences; and determining the boundary interval of each common subsequence based on the positions of its characters in the text to be labeled.
In one embodiment, before labeling the disjoint subsequences based on the entity names in the entity dictionary, the method further includes: determining a stop word dictionary based on the common subsequences.
In one embodiment, labeling the disjoint subsequences based on the entity names in the entity dictionary includes: labeling the disjoint subsequences based on the entity names in the entity dictionary to obtain a first labeling result; and filtering the stop words out of the first labeling result based on the stop word dictionary to obtain the labeling result of the text to be labeled.
In a second aspect, an embodiment of the present invention provides a data set labeling device, including: an entity dictionary long text determination module, configured to determine an entity dictionary long text based on a pre-obtained entity dictionary of the target field; a common subsequence determination module, configured to compute common subsequences of the entity dictionary long text and the text to be labeled and determine the boundary intervals of the common subsequences; a splicing module, configured to splice the common subsequences whose boundaries overlap, based on the boundary intervals of the common subsequences, to obtain a plurality of disjoint subsequences; and a labeling module, configured to label the disjoint subsequences based on the entity names in the entity dictionary to obtain a labeled text.
In one embodiment, the entity dictionary long text determination module is further configured to: acquire the entity names in texts of the target field to obtain an entity dictionary; classify the entity dictionary based on keywords of the entity names to obtain a plurality of entity categories; and sort the entity names based on the entity categories to obtain the entity dictionary long text.
In one embodiment, the common subsequence determination module is further configured to: compute, as candidate common subsequences, the continuous identical character strings shared by the text to be labeled and the entity dictionary long text, and determine the length of each character string; eliminate, based on a predetermined common subsequence length threshold, the character strings whose length is smaller than the threshold, to obtain the common subsequences; and determine the boundary interval of each common subsequence based on the positions of its characters in the text to be labeled.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a processor and a memory, where the memory stores computer-executable instructions capable of being executed by the processor, and the processor executes the computer-executable instructions to implement the steps of any one of the methods provided in the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of any one of the methods provided in the first aspect.
The embodiment of the invention has the following beneficial effects:
according to the data set labeling method and device, electronic equipment and medium provided by the embodiments of the present invention, an entity dictionary long text is first determined based on a pre-obtained entity dictionary of the target field; common subsequences of the entity dictionary long text and the text to be labeled are then computed, and their boundary intervals are determined; next, the common subsequences whose boundaries overlap are spliced, based on those boundary intervals, to obtain a plurality of disjoint subsequences; and finally, the disjoint subsequences are labeled based on the entity names in the entity dictionary to obtain a labeled text. By computing the common subsequences of the entity dictionary long text of the target field and the text to be labeled, and then labeling the text to be labeled according to those common subsequences, the method can quickly generate the labeled text required by an initial entity recognition model, reducing the workload of text labeling and the development cost.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a data set labeling method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of entity fuzzy labeling according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of splicing an entity dictionary according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of another data set labeling method according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a data set labeling device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, for NER model training in a new field, a training data set can only be produced by manual labeling, which is labor-intensive and slow. Alternatively, an auxiliary labeling model can be built from an existing reference data set of the target field, for example by training a model, guided by a field knowledge graph, to predict the position of a masked entity character in a sentence; but this adds extra work and development cost, and field knowledge graphs are scarce, so the approach cannot be applied when no such graph exists. In summary, existing data set text labeling methods suffer from heavy workload and high development cost.
Based on this, the data set labeling method and device, electronic device, and medium provided by the embodiments of the present invention can reduce the workload of text labeling and the development cost.
To facilitate understanding of the present embodiment, a data set labeling method disclosed in an embodiment of the present invention is first described in detail. The method can be executed by an electronic device such as a smartphone, a computer, or a tablet computer. Referring to the flowchart of a data set labeling method shown in Fig. 1, the method mainly includes the following steps S101 to S104:
Step S101: determining an entity dictionary long text based on a pre-obtained entity dictionary of the target field.
In one embodiment, an entity dictionary can be constructed in advance by collecting entity names of the target field from external sources such as the web, books, and newspapers; the entity dictionary is then roughly classified, dividing the entity names into different categories according to a shared prefix, suffix, or other keyword; finally, the entity names belonging to the same category are spliced into an entity dictionary long text.
Step S102: computing common subsequences of the entity dictionary long text and the text to be labeled, and determining the boundary intervals of the common subsequences.
In one embodiment, a common subsequence, i.e., the longest continuous common subsequence or longest common substring, refers to a continuous identical portion of the two character strings formed by the entity dictionary long text and the text to be labeled; for example, the longest continuous common subsequence of "helloworld" and "loop" is "lo". The boundary interval is the interval formed by the positions, in the text to be labeled, of the characters of the common subsequence; for example, assuming "helloworld" is the entity dictionary long text, "loop" is the text to be labeled, and "lo" is the common subsequence, the boundary interval is (1,2). On this basis, in the embodiment of the present invention, the continuous identical portions of the entity dictionary long text and the text to be labeled can be computed to obtain at least one common subsequence, and the boundary interval of each common subsequence is determined from the character positions.
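The patent gives no implementation of this step, so the following is a minimal Python sketch written for illustration only: the function name, the 0-based and end-exclusive boundary convention, and the quadratic dynamic program are assumptions made here (the worked example later in the description counts positions differently), and the minimum-length parameter anticipates the threshold described in a later embodiment.

```python
def common_substrings(dict_text: str, text: str, min_len: int = 2):
    """Maximal continuous common substrings of the entity dictionary long text
    and the text to be labeled, returned as (start, end, substring) tuples,
    where (start, end) is the boundary interval in `text` (end exclusive)."""
    n, m = len(dict_text), len(text)
    # dp[i][j]: length of the common suffix of dict_text[:i] and text[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    spans = set()
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if dict_text[i - 1] == text[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                # Record the match only if it cannot be extended to the right
                # and it meets the minimum length threshold.
                right_blocked = i == n or j == m or dict_text[i] != text[j]
                if dp[i][j] >= min_len and right_blocked:
                    spans.add((j - dp[i][j], j))
    return sorted((s, e, text[s:e]) for s, e in spans)
```

On the "helloworld" / "loop" example above, this sketch returns [(0, 2, 'lo')], i.e., the boundary interval of "lo" under the 0-based, end-exclusive convention; the quadratic scan is enough for short texts, while a suffix automaton over the dictionary long text would scale better.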
Step S103: splicing the common subsequences whose boundaries overlap, based on their boundary intervals, to obtain a plurality of disjoint subsequences.
In an embodiment, because the obtained common subsequences may have overlapping boundaries, and in order to simplify the text and obtain more accurate labeled data, the common subsequences whose boundaries overlap can be spliced end to end into longer sequences according to their boundary intervals, yielding a plurality of disjoint subsequences.
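Continuing the assumptions of the previous sketch, the end-to-end splicing of overlapping boundary intervals can be illustrated as follows; treating intervals that merely touch as overlapping is itself an assumption, chosen because it matches the spliced result in the worked fund example later in the description.

```python
def merge_overlapping(spans):
    """Splice common subsequences whose boundary intervals overlap or touch
    end-to-end, yielding a list of disjoint (start, end) intervals.
    `spans` holds the (start, end, substring) tuples from the sketch above."""
    merged = []
    for start, end, _ in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Boundaries overlap or touch: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```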
Step S104: labeling the disjoint subsequences based on the entity names in the entity dictionary to obtain a labeled text.
In an embodiment, referring to the schematic diagram of entity fuzzy labeling shown in Fig. 2, after the disjoint subsequences are obtained, they can be labeled according to local segments of the entity names in the entity dictionary, yielding a fuzzy labeling of the text to be labeled.
According to the data set labeling method described above, the common subsequences of the entity dictionary long text of the target field and the text to be labeled are computed, and the text to be labeled is then labeled according to those common subsequences, so that the labeled text required by an initial entity recognition model can be generated quickly, reducing the workload of text labeling and the development cost.
In a specific implementation, words that may appear in a common subsequence but carry no practical meaning should not be entity-labeled. Therefore, in the embodiment of the present invention, before the disjoint subsequences are labeled based on the entity names in the entity dictionary, the method further includes: determining a stop word dictionary based on the common subsequences. Specifically, the stop word dictionary contains words that occur at high frequency but have no practical meaning.
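The patent does not specify how the stop word dictionary is built beyond this description, so the sketch below shows one plausible, purely illustrative approach: count how often each fragment recurs across the common subsequences of many texts and surface the most frequent ones for manual review (the function name and the min_freq parameter are assumptions).

```python
from collections import Counter

def propose_stop_words(spans_per_text, min_freq=20):
    """Candidate stop words: fragments that recur across the common
    subsequences of many texts to be labeled.  `spans_per_text` is one list
    of (start, end, substring) tuples per text; the returned candidates
    would still be reviewed by hand before use."""
    counts = Counter(
        mention
        for spans in spans_per_text
        for _, _, mention in spans
    )
    return {word for word, freq in counts.items() if freq >= min_freq}
```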
Further, labeling the disjoint subsequences based on the entity names in the entity dictionary may, for example, proceed as follows: first, the disjoint subsequences are labeled based on the entity names in the entity dictionary to obtain a first labeling result; then, the stop words in the first labeling result are filtered out based on the stop word dictionary to obtain the labeling result of the text to be labeled. In a specific implementation, local segments of different entity names in the entity dictionary can be used to fuzzily label the entities in the text to be labeled, producing the first labeling result; the words that occur at high frequency but have no practical meaning are then filtered out of the obtained set of disjoint subsequences according to the stop word dictionary, and the remaining labeled subsequence set is the labeling result of the text to be labeled.
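A minimal sketch of this two-stage labeling, under the same assumptions as the earlier sketches; the single "ENTITY" label is an illustration only, whereas a full implementation would map each span back to the entity category of the matching dictionary names.

```python
def label_spans(text, disjoint_spans, stop_words, label="ENTITY"):
    """Produce the first labeling result from the disjoint subsequences, then
    filter out spans whose text is a stop word.  Returns
    (start, end, label, mention) tuples for the text to be labeled."""
    labeled = []
    for start, end in disjoint_spans:
        mention = text[start:end]
        if mention in stop_words:
            # Drop high-frequency fragments with no practical meaning.
            continue
        labeled.append((start, end, label, mention))
    return labeled
```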
In one embodiment, for the foregoing step S101, i.e., determining an entity dictionary long text based on a pre-obtained entity dictionary of the target field, the following approach may be adopted, including but not limited to:
First, the entity names in texts of the target field are acquired to obtain an entity dictionary.
In a specific implementation, the entity names of the target field can be collected from external sources to form the entity dictionary. For example, for place-name entities, the names of cities nationwide may be collected first; for organization-name entities, the names of national public institutions may be collected first. The collected entity names form a dictionary that serves as the raw material for labeling target entities; the entities to be labeled may be completely identical to, or only partially similar to, the collected entity names.
Then, the entity dictionary is classified based on keywords of the entity names to obtain a plurality of entity categories.
In a specific implementation, the entity names in the entity dictionary may be roughly divided into several categories by a shared keyword, such as a common prefix or suffix. For example, organization-name entities can be classified into different categories by a prefix common to their names.
Finally, the entity names are sorted based on the entity categories to obtain the entity dictionary long text.
In a specific implementation, referring to the schematic diagram of splicing an entity dictionary shown in Fig. 3, the entity dictionary may first be grouped by entity category, the entries within each category may then be sorted by entity name, and all entity names may be spliced into an entity dictionary long text. The sorting places similar entity names next to each other, which facilitates the subsequent computation of continuous common subsequences and the fuzzy labeling.
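A rough sketch of this preprocessing; the `category_keywords` mapping and the simple containment test are assumptions, since the patent only requires grouping by a shared prefix, suffix, or other keyword.

```python
from collections import defaultdict

def build_dictionary_long_text(entity_names, category_keywords):
    """Roughly classify entity names by keyword, sort the names inside each
    category, and splice everything into one entity dictionary long text so
    that similar names end up adjacent."""
    categories = defaultdict(list)
    for name in entity_names:
        for keyword, category in category_keywords.items():
            if keyword in name:
                categories[category].append(name)
                break
        else:
            # No keyword matched: fall back to a catch-all category.
            categories["other"].append(name)
    return "".join(
        "".join(sorted(categories[category])) for category in sorted(categories)
    )
```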
In one embodiment, for the foregoing step S102, i.e., computing the common subsequences of the entity dictionary long text and the text to be labeled and determining their boundary intervals, the following approach may be adopted, including but not limited to:
First, the continuous identical character strings shared by the text to be labeled and the entity dictionary long text are computed as candidate common subsequences, and the length of each character string is determined.
Then, based on a predetermined common subsequence length threshold, the character strings whose length is smaller than the threshold are eliminated to obtain the common subsequences.
Finally, the boundary interval of each common subsequence is determined based on the positions of its characters in the text to be labeled.
In a specific implementation, considering that the length of an entity name to be labeled is usually bounded, and in order to avoid labeling meaningless words and to reduce workload, a common subsequence length threshold can be determined in advance according to the characteristics of entity names in different fields. A character string shorter than this threshold does not belong to an entity name to be labeled; after such character strings are removed, the common subsequences are obtained. The boundary interval of each common subsequence is then determined from the positions of its characters in the text to be labeled.
The method provided by the embodiment of the invention can quickly generate the labeled text required by the initial entity recognition model, reduce the workload of text labeling and simultaneously reduce the development cost.
For ease of understanding, an embodiment of the present invention further provides a concrete data set labeling method, shown in Fig. 4, which consists of three parts (preprocessing the domain entity dictionary, solving for the longest continuous common subsequences, and generating the entity labeling result) and specifically comprises the following six steps:
The first step: acquiring an entity dictionary of the target field.
Specifically, taking the financial field as the target field, for public fund product name entities, the names of the main public fund companies and of the fund products under their flags can be searched for on the web and used as the entity dictionary. Taking the large fund company Yifangda as an example, its public funds include "Yifangda CSI Dividend ETF Connection A", "Yifangda CSI Military Industry (LOF) A", and so on; the required entity names can be collected to form the entity dictionary for later use.
The second step: roughly classifying the entity dictionary by keyword.
Specifically, the entity names in the dictionary are roughly divided into several categories according to shared keywords. For example, a publicly offered fund product usually carries the corresponding fund company name as a prefix, such as "Yifangda XXX"; fund product entities can therefore be divided into categories by fund company name, such as a Yifangda category, a Huaxia category, and so on.
The third step: splicing the whole entity dictionary in category order to obtain a dictionary long text.
Specifically, the entity dictionary may first be sorted by entity category, then by entity name within each category, and all the entity names spliced into a dictionary long text (i.e., the entity dictionary long text).
The fourth step: solving for the set of longest continuous common subsequences of the text to be labeled and the dictionary long text.
Specifically, the longest continuous common subsequences are computed from the entity dictionary long text obtained in the third step and the text to be labeled, yielding several common subsequences together with their boundary intervals in the text to be labeled. The computation takes the minimum entity name length as a parameter, which can be adjusted according to the characteristics of entity names in different fields; fund product names generally contain a fund company name, so the minimum length may be set to 3.
By way of example, assume the entity dictionary long text is "Yifangda preferred multi-asset three-month holding hybrid (FOF) A … Huatai Bairui dominant pilot hybrid … Guangfa robust preferred six-month holding hybrid A … Yingda flexible configuration hybrid …" and the text to be labeled is "On November 18, the Yifangda dominant pilot six-month holding period hybrid fund (FOF) (Class A 012652, Class C 012653) went on sale." The dictionary long text and the text to be labeled then yield 5 longest continuous common subsequences in total, where the boundary of "Yifangda preferred" is (8,12), of "dominant pilot" is (11,15), of "six-month holding period hybrid" is (15,23), of "hybrid" is (21,24), and of "FOF" is (30,33).
The fifth step: splicing the overlapping boundaries in the subsequence set to obtain several completely disjoint entity labels.
Specifically, boundary splicing is performed on the subsequence result obtained in the fourth step, that is, the 5 longest continuous common subsequences are spliced into "Yifangda dominant pilot six-month holding period hybrid". Because fund product names in this field partially coincide, even if an entity name in the text to be labeled is not in the domain entity dictionary, a complete labeled entity can still be obtained by splicing partial matches from other fund product names, thereby realizing fuzzy labeling of entities.
The sixth step: filtering out words without actual meaning according to the stop word dictionary.
Keywords such as the fifth common subsequence above, "FOF", occur frequently in fund entities and are generally contained in fund product names, but on their own they are not entity names; they can therefore be filtered out according to the stop word dictionary. The finally obtained common subsequence set is [(8, 24, "Yifangda dominant pilot six-month holding period hybrid")], and the text can be entity-labeled according to this boundary information.
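Putting the hypothetical helpers from the earlier sketches together on a small English stand-in for this example (the actual fund names, fund codes, and boundary values such as (8, 24) above belong to the original example and are not reproduced here):

```python
# Hypothetical dictionary for an imaginary fund company "Acme".
entity_names = [
    "Acme Dividend ETF Link A",
    "Acme Military Index (LOF) A",
    "Acme Pilot Six Month Holding Hybrid",
]
dict_long_text = build_dictionary_long_text(entity_names, {"Acme": "acme"})

text = "On Nov 18, the Acme Pilot Six Month Holding Hybrid (FOF) went on sale."
spans = common_substrings(dict_long_text, text, min_len=5)
disjoint = merge_overlapping(spans)
print(label_spans(text, disjoint, stop_words={"FOF", "ETF", "LOF"}))
# -> [(15, 50, 'ENTITY', 'Acme Pilot Six Month Holding Hybrid')]
```

The shorter match on "Acme " is absorbed into the full product-name match during splicing; in this toy dictionary no stop word survives as a standalone span, so the filter simply passes the result through.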
According to the data set labeling method provided by the embodiment of the present invention, preliminary labeling of entity recognition samples can be achieved with only a small amount of development and an easily obtained entity dictionary of the target field. This solves the cold-start problem of NER model training, saves a large amount of manual labeling work, and is simple and practical. The method also applies to different fields and different entity types, and can provide good assistance at the initial stage of NER model development in a new field.
For the foregoing data set labeling method, an embodiment of the present invention further provides a data set labeling device. Referring to the schematic structural diagram of a data set labeling device shown in Fig. 5, the device mainly includes the following parts:
an entity dictionary long text determination module 501, configured to determine an entity dictionary long text based on a pre-obtained entity dictionary of the target field;
a common subsequence determination module 502, configured to compute common subsequences of the entity dictionary long text and the text to be labeled and determine the boundary intervals of the common subsequences;
a splicing module 503, configured to splice the common subsequences whose boundaries overlap, based on the boundary intervals of the common subsequences, to obtain a plurality of disjoint subsequences;
and a labeling module 504, configured to label the disjoint subsequences based on the entity names in the entity dictionary to obtain a labeled text.
According to the data set labeling device provided by the embodiment of the present invention, the common subsequences of the entity dictionary long text of the target field and the text to be labeled are computed, and the text to be labeled is then labeled according to those common subsequences, so that the labeled text required by an initial entity recognition model can be generated quickly, reducing the workload of text labeling and the development cost.
In an embodiment, the entity dictionary long text determination module 501 is further configured to: acquire the entity names in texts of the target field to obtain an entity dictionary; classify the entity dictionary based on keywords of the entity names to obtain a plurality of entity categories; and sort the entity names based on the entity categories to obtain the entity dictionary long text.
In an embodiment, the common subsequence determination module 502 is further configured to: compute, as candidate common subsequences, the continuous identical character strings shared by the text to be labeled and the entity dictionary long text, and determine the length of each character string; eliminate, based on a predetermined common subsequence length threshold, the character strings whose length is smaller than the threshold, to obtain the common subsequences; and determine the boundary interval of each common subsequence based on the positions of its characters in the text to be labeled.
In one embodiment, the device further includes a stop word dictionary determination module, configured to determine a stop word dictionary based on the common subsequences.
In an embodiment, the labeling module 504 is further configured to: label the disjoint subsequences based on the entity names in the entity dictionary to obtain a first labeling result; and filter the stop words out of the first labeling result based on the stop word dictionary to obtain the labeling result of the text to be labeled.
The device provided by the embodiment of the present invention has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, where the device embodiment is not described in detail, reference may be made to the corresponding content in the method embodiments.
It should be noted that the specific values provided in the embodiments of the present invention are only exemplary and are not limiting.
The embodiment of the invention also provides electronic equipment, which specifically comprises a processor and a storage device; the storage means has stored thereon a computer program which, when executed by the processor, performs the method of any of the above embodiments.
Fig. 6 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present invention, where the electronic device 100 includes: a processor 60, a memory 61, a bus 62 and a communication interface 63, wherein the processor 60, the communication interface 63 and the memory 61 are connected through the bus 62; the processor 60 is arranged to execute executable modules, such as computer programs, stored in the memory 61.
The Memory 61 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 63 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used.
The bus 62 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 6, but this does not indicate only one bus or one type of bus.
The memory 61 is configured to store a program, and the processor 60 executes the program after receiving an execution instruction, where the method performed by the apparatus defined by the flow program disclosed in any embodiment of the present invention may be applied to the processor 60, or implemented by the processor 60.
The processor 60 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 60. The Processor 60 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; the Integrated Circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, or discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in a memory 61, and the processor 60 reads the information in the memory 61 and, in combination with its hardware, performs the steps of the above method.
The computer program product of the readable storage medium provided in the embodiment of the present invention includes a computer readable storage medium storing a program code, where instructions included in the program code may be used to execute the method described in the foregoing method embodiment, and specific implementation may refer to the foregoing method embodiment, which is not described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A data set labeling method, comprising:
determining an entity dictionary long text based on a pre-obtained entity dictionary of the target field;
computing common subsequences of the entity dictionary long text and the text to be labeled, and determining the boundary intervals of the common subsequences;
splicing the common subsequences whose boundaries overlap, based on the boundary intervals of the common subsequences, to obtain a plurality of disjoint subsequences;
and labeling the disjoint subsequences based on the entity names in the entity dictionary to obtain a labeled text.
2. The labeling method of claim 1, wherein determining the entity dictionary long text based on the pre-obtained entity dictionary of the target field comprises:
acquiring the entity names in texts of the target field to obtain an entity dictionary;
classifying the entity dictionary based on keywords of the entity names to obtain a plurality of entity categories;
and sorting the entity names based on the entity categories to obtain the entity dictionary long text.
3. The labeling method of claim 1, wherein computing the common subsequences of the entity dictionary long text and the text to be labeled and determining the boundary intervals of the common subsequences comprises:
computing, as candidate common subsequences, the continuous identical character strings shared by the text to be labeled and the entity dictionary long text, and determining the length of each character string;
eliminating, based on a predetermined common subsequence length threshold, the character strings whose length is smaller than the threshold, to obtain the common subsequences;
and determining the boundary interval of each common subsequence based on the positions of its characters in the text to be labeled.
4. The labeling method of claim 1, wherein before labeling the disjoint subsequences based on the entity names in the entity dictionary, the method further comprises:
determining a stop word dictionary based on the common subsequences.
5. The labeling method of claim 4, wherein labeling the disjoint subsequences based on the entity names in the entity dictionary comprises:
labeling the disjoint subsequences based on the entity names in the entity dictionary to obtain a first labeling result;
and filtering the stop words out of the first labeling result based on the stop word dictionary to obtain the labeling result of the text to be labeled.
6. A data set labeling device, comprising:
an entity dictionary long text determination module, configured to determine an entity dictionary long text based on a pre-obtained entity dictionary of the target field;
a common subsequence determination module, configured to compute common subsequences of the entity dictionary long text and the text to be labeled and determine the boundary intervals of the common subsequences;
a splicing module, configured to splice the common subsequences whose boundaries overlap, based on the boundary intervals of the common subsequences, to obtain a plurality of disjoint subsequences;
and a labeling module, configured to label the disjoint subsequences based on the entity names in the entity dictionary to obtain a labeled text.
7. The labeling device of claim 6, wherein the entity dictionary long text determination module is further configured to:
acquire the entity names in texts of the target field to obtain an entity dictionary;
classify the entity dictionary based on keywords of the entity names to obtain a plurality of entity categories;
and sort the entity names based on the entity categories to obtain the entity dictionary long text.
8. The labeling device of claim 6, wherein the common subsequence determination module is further configured to:
compute, as candidate common subsequences, the continuous identical character strings shared by the text to be labeled and the entity dictionary long text, and determine the length of each character string;
eliminate, based on a predetermined common subsequence length threshold, the character strings whose length is smaller than the threshold, to obtain the common subsequences;
and determine the boundary interval of each common subsequence based on the positions of its characters in the text to be labeled.
9. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to perform the steps of the method of any of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of the claims 1 to 5.
CN202211344330.3A 2022-10-31 2022-10-31 Data set labeling method and device, electronic equipment and medium Active CN115618824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211344330.3A CN115618824B (en) 2022-10-31 2022-10-31 Data set labeling method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211344330.3A CN115618824B (en) 2022-10-31 2022-10-31 Data set labeling method and device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN115618824A true CN115618824A (en) 2023-01-17
CN115618824B CN115618824B (en) 2023-10-27

Family

ID=84876572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211344330.3A Active CN115618824B (en) 2022-10-31 2022-10-31 Data set labeling method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN115618824B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460014A (en) * 2018-02-07 2018-08-28 百度在线网络技术(北京)有限公司 Recognition methods, device, computer equipment and the storage medium of business entity
CN109508458A (en) * 2018-10-31 2019-03-22 北京国双科技有限公司 The recognition methods of legal entity and device
CN109858040A (en) * 2019-03-05 2019-06-07 腾讯科技(深圳)有限公司 Name entity recognition method, device and computer equipment
CN110569328A (en) * 2019-07-31 2019-12-13 平安科技(深圳)有限公司 Entity linking method, electronic device and computer equipment
CN112257422A (en) * 2020-10-22 2021-01-22 京东方科技集团股份有限公司 Named entity normalization processing method and device, electronic equipment and storage medium
CN113051919A (en) * 2019-12-26 2021-06-29 中国电信股份有限公司 Method and device for identifying named entity
CN113761916A (en) * 2021-09-09 2021-12-07 竹间智能科技(上海)有限公司 Method for extracting information in text and electronic equipment
CN113779993A (en) * 2021-06-09 2021-12-10 北京理工大学 Medical entity identification method based on multi-granularity text embedding

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460014A (en) * 2018-02-07 2018-08-28 百度在线网络技术(北京)有限公司 Recognition methods, device, computer equipment and the storage medium of business entity
CN109508458A (en) * 2018-10-31 2019-03-22 北京国双科技有限公司 The recognition methods of legal entity and device
CN109858040A (en) * 2019-03-05 2019-06-07 腾讯科技(深圳)有限公司 Name entity recognition method, device and computer equipment
CN110569328A (en) * 2019-07-31 2019-12-13 平安科技(深圳)有限公司 Entity linking method, electronic device and computer equipment
CN113051919A (en) * 2019-12-26 2021-06-29 中国电信股份有限公司 Method and device for identifying named entity
CN112257422A (en) * 2020-10-22 2021-01-22 京东方科技集团股份有限公司 Named entity normalization processing method and device, electronic equipment and storage medium
US20220129632A1 (en) * 2020-10-22 2022-04-28 Boe Technology Group Co., Ltd. Normalized processing method and apparatus of named entity, and electronic device
CN113779993A (en) * 2021-06-09 2021-12-10 北京理工大学 Medical entity identification method based on multi-granularity text embedding
CN113761916A (en) * 2021-09-09 2021-12-07 竹间智能科技(上海)有限公司 Method for extracting information in text and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Yuan-Hao Lin et al.: "Mining Events through Activity Title Extraction and Venue Coupling", 2020 International Conference on Technologies and Applications of Artificial Intelligence, pages 136-141 *
Yuwei Fu: "DPRL: Labeling Relation Based on Distant Supervision and POS Rule", 2022 2nd International Conference on Consumer Electronics and Computer Engineering, pages 63-67 *
Ding Long: "Research on Information Extraction Technology for Electronic Medical Records", China Master's Theses Full-text Database (Information Science and Technology Series), no. 1, pages 138-2378 *
Chen Xiaohong; Chen Huanhuan; Fang Zhijia; Ruan Tong; Wang Haofen: "Research and Implementation of a Text Annotation Algorithm for Game Guides Based on Domain Ontology", Computer Applications and Software, vol. 34, no. 02, pages 80-86 *

Also Published As

Publication number Publication date
CN115618824B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
WO2019174132A1 (en) Data processing method, server and computer storage medium
WO2016180268A1 (en) Text aggregate method and device
CN110321466B (en) Securities information duplicate checking method and system based on semantic analysis
CN110413787B (en) Text clustering method, device, terminal and storage medium
WO2020259280A1 (en) Log management method and apparatus, network device and readable storage medium
WO2022116435A1 (en) Title generation method and apparatus, electronic device and storage medium
WO2021196825A1 (en) Abstract generation method and apparatus, and electronic device and medium
CN111737499A (en) Data searching method based on natural language processing and related equipment
US9270749B2 (en) Leveraging social media to assist in troubleshooting
CN113095076A (en) Sensitive word recognition method and device, electronic equipment and storage medium
CN104881458A (en) Labeling method and device for web page topics
CN113268615A (en) Resource label generation method and device, electronic equipment and storage medium
CN112214984A (en) Content plagiarism identification method, device, equipment and storage medium
CN112257413A (en) Address parameter processing method and related equipment
CN116719997A (en) Policy information pushing method and device and electronic equipment
CN113157871B (en) News public opinion text processing method, server and medium applying artificial intelligence
CN112989043B (en) Reference resolution method, reference resolution device, electronic equipment and readable storage medium
CN113434631A (en) Emotion analysis method and device based on event, computer equipment and storage medium
CN117272982A (en) Protocol text detection method and device based on large language model
CN112487293A (en) Method, device and medium for extracting safety accident case structured information
CN111639250A (en) Enterprise description information acquisition method and device, electronic equipment and storage medium
CN111160445A (en) Bid document similarity calculation method and device
CN115203758B (en) Data security storage method, system and cloud platform
CN115618824A (en) Data set labeling method and device, electronic equipment and medium
CN113656586A (en) Emotion classification method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant