CN116205235B - Data set dividing method and device and electronic equipment

Info

Publication number: CN116205235B
Application number: CN202310491927.9A
Authority: CN
Inventors: 宋洒; 卢文庆; 郭文萍
Original assignee: Beijing Velocity Insight Technology Co ltd
Current assignee: Beijing Velocity Insight Technology Co ltd
Priority date: 2023-05-05
Filing date: 2023-05-05
Publication date: 2023-08-01
Anticipated expiration: 2043-05-05
Also published as: CN116205235A

Abstract

The embodiments of the application disclose a data set dividing method, a data set dividing device and electronic equipment, relating to the medical technical field. The method comprises the following steps: acquiring a data set and labeling the entities in its text units; classifying the data set according to entity type, dividing it into a plurality of entity type groups; recording the starting position and the ending position of each entity in its text unit; dividing the data set into a plurality of entity location groups according to the location information; counting the proportion of each entity location group in the data set; and, for each entity type group, dividing the group into a plurality of subgroups according to the entity location groups to complete the grouping. The method can count the probability that entity types and their entity values appear at different positions in sentences in the data set, group the data set by entity type, further divide each group into subgroups according to the position information of the entities, and sample the groups according to the probabilities of the location groups, so that the training set and the test set are more evenly distributed in terms of entity type, context and position information.

Description

Data set dividing method and device and electronic equipment

Technical Field

The embodiment of the application relates to the technical field of medical treatment, in particular to a data set dividing method, a data set dividing device and electronic equipment.

Background

In named entity recognition tasks in the medical field, many entities are polysemous: entities with the same value (that is, the same entity name) may belong to different types. For example, the word "immune" may denote immunotherapy in some contexts and autoimmune disease in others. To recognize polysemous named entities in the medical field, it is important to distinguish entity types accurately and to divide the data set accurately.

Traditional data set division considers only the entity type, ensuring that each entity type is distributed in roughly equal proportion across the training set and the test set. However, because the same entity value may represent different entity types in different contexts, traditional methods may place one type of an entity value in the training set and another type of the same entity value in the test set, resulting in uneven data distribution.

The position of an entity value within a specific text may also differ. The position may be a paragraph of the article, such as the introduction or the conclusion; a sentence at the start or the end of the same paragraph; or the beginning or the end of the same sentence. Therefore, if the training and test sets are divided by entity type alone, without considering the relative position of entity values in the text, the data sets are unevenly distributed, which distorts the model evaluation metrics and makes it difficult to accomplish the polysemous named entity recognition task accurately.

Disclosure of Invention

The embodiment of the application provides a data set dividing method, a data set dividing device and electronic equipment, which can solve the problem of uneven data distribution.

In a first aspect, an embodiment of the present application provides a data set partitioning method, where the method includes:

acquiring a data set, wherein the data set comprises a plurality of text units, and labeling entities in the text units;

classifying the data set according to entity types, and dividing the data set into a plurality of entity type groups;

recording the starting position and the ending position of each entity in the text unit to obtain the position information of each entity;

dividing the dataset into a plurality of entity location groups according to the location information;

counting the proportion of each entity position group in the data set;

for each entity type group, dividing the plurality of entity location groups into a plurality of subgroups in combination to complete grouping.

In one possible design, the text unit includes: chapters, paragraphs, and sentences.

In one possible design, the location information for each entity includes a start location, an end location, and a length of an entity name for the entity.
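The position information described above can be sketched as a small record type. This is a minimal illustration, not the applicant's implementation: the class name `EntitySpan`, the helper `locate`, and the choice of character offsets with an exclusive end index are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """Position information for one labeled entity (hypothetical structure)."""
    name: str         # surface form of the entity, e.g. "checkpoint"
    entity_type: str  # annotated type, e.g. "immunotherapy"
    start: int        # offset of the first character in the text unit
    end: int          # offset one past the last character (assumed exclusive)

    @property
    def length(self) -> int:
        # Length of the entity name, derived from the two offsets.
        return self.end - self.start


def locate(text: str, name: str, entity_type: str) -> EntitySpan:
    """Record the span of the first occurrence of `name` in `text`."""
    start = text.index(name)  # raises ValueError if the entity is absent
    return EntitySpan(name, entity_type, start, start + len(name))


span = locate("Immune checkpoint inhibitors were given.", "checkpoint", "immunotherapy")
print(span.start, span.end, span.length)
```

Storing only the start and end and deriving the length keeps the three values consistent with each other.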

In one possible design, the dividing the data set into a plurality of entity location groups according to the location information includes:

grouping each entity according to the position information of the entity;

the set of entity locations for each text unit is determined based on the grouping of all entities that each text unit includes.

In one possible design, the grouping the entities includes:

dividing the entity into a beginning group if the position information indicates that the entity appears in the first third of the text unit;

dividing the entity into an intermediate group if the position information indicates that the entity appears between one third and two thirds of the text unit;

dividing the entity into an ending group if the position information indicates that the entity appears in the last third of the text unit.
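The three-way rule above can be sketched as follows. The text does not say which point of an entity's span decides its group, so this sketch assumes the midpoint of the span; the function name `position_label` and the label strings are likewise hypothetical.

```python
def position_label(start: int, end: int, unit_length: int) -> str:
    """Assign an entity to the beginning (head), intermediate (middle) or
    ending group of its text unit.

    Assumption: the midpoint of the span [start, end) decides the group.
    A midpoint exactly on a boundary falls into the intermediate group.
    """
    if unit_length <= 0:
        raise ValueError("unit_length must be positive")
    twice_mid = start + end  # 2 * midpoint, kept as an integer
    # midpoint / unit_length < 1/3  <=>  3 * twice_mid < 2 * unit_length
    if 3 * twice_mid < 2 * unit_length:
        return "beginning"
    # midpoint / unit_length <= 2/3  <=>  3 * twice_mid <= 4 * unit_length
    if 3 * twice_mid <= 4 * unit_length:
        return "intermediate"
    return "ending"


print(position_label(0, 6, 90))    # beginning
print(position_label(40, 50, 90))  # intermediate
print(position_label(80, 90, 90))  # ending
```

Integer comparisons are used instead of floating-point division so that positions on the one-third and two-thirds boundaries are classified deterministically.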

In one possible design, the entity location groups include: a beginning group, an intermediate group, an ending group, a beginning-intermediate group, a beginning-ending group, an intermediate-ending group, and a beginning-intermediate-ending group.
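The seven groups listed above are the non-empty combinations of the three per-entity groups. A minimal sketch of how a text unit's group could be derived from the labels of its entities is shown below; the function name `unit_location_group` and the hyphenated group names are assumptions for illustration.

```python
from itertools import combinations

BASE_GROUPS = ("beginning", "intermediate", "ending")


def unit_location_group(entity_labels):
    """Combine the per-entity labels of one text unit into its location group."""
    seen = set(entity_labels)
    unknown = seen - set(BASE_GROUPS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    if not seen:
        raise ValueError("a text unit needs at least one labeled entity")
    # Keep a fixed order so the same combination always gets the same name.
    return "-".join(label for label in BASE_GROUPS if label in seen)


# Every non-empty combination of the three base groups: 3 + 3 + 1 = 7.
ALL_GROUPS = [
    "-".join(combo)
    for size in (1, 2, 3)
    for combo in combinations(BASE_GROUPS, size)
]

print(unit_location_group(["ending", "beginning", "beginning"]))  # beginning-ending
print(len(ALL_GROUPS))  # 7
```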

In one possible design, the method further comprises:

for each entity type group, dividing a training set and a test set according to a preset proportion and the proportion of each entity location group, so that the proportion of text units of each entity location group is the same in the training set and the test set.

In a second aspect, embodiments of the present application provide an apparatus, the apparatus comprising:

the receiving module is used for acquiring a data set, wherein the data set comprises a plurality of text units and labeling entities in the text units;

the processing module is used for classifying the data set according to entity types and dividing the data set into a plurality of entity type groups; recording the starting position and the ending position of each entity in the text unit to obtain the position information of each entity; dividing the dataset into a plurality of entity location groups according to the location information; counting the proportion of each entity position group in the data set; for each entity type group, dividing the plurality of entity location groups into a plurality of subgroups in combination to complete grouping.

In a third aspect, embodiments of the present application provide an electronic device including a memory and one or more processors; wherein the memory is for storing computer program code, the computer program code comprising computer instructions; the computer instructions, when executed by the processor, cause the electronic device to perform part or all of the steps of the method of the first aspect or in various possible implementations of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer storage medium having instructions stored therein which, when executed on a computer, cause the computer to perform some or all of the steps of the method of the first aspect or in various possible implementations of the first aspect.

The application provides a data set dividing method, which comprises the following steps: acquiring a data set, wherein the data set comprises a plurality of text units, and labeling entities in the text units; classifying the data set according to entity types, and dividing the data set into a plurality of entity type groups; recording the starting position and the ending position of each entity in the text unit to obtain the position information of each entity; dividing the data set into a plurality of entity location groups according to the location information; counting the proportion of each entity location group in the data set; for each entity type group, dividing the group into a plurality of subgroups in combination with the plurality of entity location groups to complete grouping. By recording the positions of the entities, the method obtains position information for all entities and groups the data by position; by further dividing the data set into subgroups according to entity type, it provides a data set classified by both entity type and entity position. Each subgroup contains entities of a specific entity type at a specific position, which enables a stratified sampling strategy so that the data of the training set and the test set are distributed more evenly in terms of entity type, context and position information.

Drawings

In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

Fig. 1 is a flowchart of a data set partitioning method provided in an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a data set dividing apparatus according to an embodiment of the present application;

Fig. 3 is an exemplary structural schematic diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions of the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

The terminology used in the following embodiments of the application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that, although the terms first, second, etc. may be used in the following embodiments to describe certain types of objects, the objects should not be limited by these terms. These terms are only used to distinguish between specific objects of that class. For example, the terms first, second, etc. may be used in the following embodiments to describe an entity, but the entity should not be limited by these terms; they are only used to distinguish between different entities. Other classes of objects that may be described in the following embodiments using the terms first, second, etc. are not described here again.

Embodiments of the present application relate to the medical field.

The embodiment of the application provides a data set dividing method, a data set dividing device and electronic equipment.

The data set partitioning method according to the embodiments of the present application is described below through several implementations.

As shown in fig. 1, fig. 1 illustrates a data set partitioning method 100 (hereinafter referred to as method 100), and the method 100 includes the following steps:

step S101, a data set is obtained, wherein the data set comprises a plurality of text units, and the entities in the text units are marked.

In this embodiment, the data set is the set of data used to train and test the entity recognition model. To make the data distribution of the training set and the test set similar, the data set must first be classified, and the entities in it are labeled so that they can be classified. In addition, the entities of interest in this application are polysemous entities: only an entity with multiple meanings needs to be divided by type and position, while other entities can be divided with traditional data set division.

Step S102, classifying the data set according to entity types, and dividing the data set into a plurality of entity type groups.

In this embodiment, the data set is first grouped according to the conventional type classification of each entity.
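Grouping by entity type can be sketched with a dictionary of lists. The input layout (text unit id mapped to `(entity name, entity type)` pairs) is hypothetical, and the sketch assumes that a text unit containing entities of several types is placed in every matching type group; the text does not specify how such units are handled.

```python
from collections import defaultdict


def group_by_entity_type(units):
    """Map each entity type to the ids of the text units that contain it.

    `units` maps a text unit id to its labeled (entity name, entity type)
    pairs. Assumption: a unit joins the group of every type it contains.
    """
    groups = defaultdict(list)
    for unit_id, entities in units.items():
        for _name, entity_type in entities:
            if unit_id not in groups[entity_type]:
                groups[entity_type].append(unit_id)
    return dict(groups)


units = {
    "u1": [("immune", "immunotherapy")],
    "u2": [("immune", "autoimmune disease")],
    "u3": [("immune", "immunotherapy"), ("immune", "autoimmune disease")],
}
type_groups = group_by_entity_type(units)
print(type_groups["immunotherapy"])       # ['u1', 'u3']
print(type_groups["autoimmune disease"])  # ['u2', 'u3']
```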

Step S103, recording the starting position and the ending position of each entity in the text unit, and obtaining the position information of each entity.

In this embodiment, since an entity may carry different meanings at different positions in the text, the position of each entity must be recorded. The positions may be different paragraphs of an article, such as the introduction, the body, and the conclusion; different sentences in a paragraph; or the beginning, middle, and end of the same sentence.

Step S104, dividing the data set into a plurality of entity position groups according to the position information.

In this embodiment, each text unit may include at least one entity, and the location of each entity may be the same or different, and the text units are divided into entity location groups according to the location information of all the entities included in each text unit.

Step S105, counting the proportion of each entity position group in the data set.
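Counting the proportions in step S105 amounts to a frequency count over the location group assigned to each text unit. The sketch below assumes a hypothetical mapping from text unit id to group name and uses exact fractions so the proportions can be compared without rounding error.

```python
from collections import Counter
from fractions import Fraction


def group_proportions(unit_groups):
    """Return the share of the data set that falls into each location group."""
    counts = Counter(unit_groups.values())
    total = sum(counts.values())
    return {group: Fraction(n, total) for group, n in counts.items()}


# Hypothetical assignment of four text units to location groups.
unit_groups = {
    "u1": "beginning",
    "u2": "beginning",
    "u3": "ending",
    "u4": "beginning-ending",
}
proportions = group_proportions(unit_groups)
for group, share in proportions.items():
    print(group, share)
```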

Step S106, dividing the entity type groups into a plurality of subgroups according to the entity position groups, and finishing grouping.

In this embodiment, for example, for the "immunotherapy" entity type group, we can divide the data into different subgroups according to the entity location group, which gives us a data set classified by both entity type and entity location. The data set is thus divided into multiple strata, each containing the samples with a particular combination of entity type and entity location information. For example, one stratum may contain the samples of the "immunotherapy" entity type in which the entity occurs at the beginning of the text. In this way, we can apply a stratified sampling strategy to each stratum, ensuring a more balanced distribution of the training and test sets in terms of entity type, context and location information.
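The subgroups of step S106 can be represented as a dictionary keyed by the pair (entity type, entity location group). This is an illustrative sketch only; the sample tuples and the function name `build_strata` are assumptions.

```python
from collections import defaultdict


def build_strata(samples):
    """Group samples by the pair (entity type, entity location group).

    Each sample is a hypothetical (unit id, entity type, location group) tuple.
    """
    strata = defaultdict(list)
    for unit_id, entity_type, location_group in samples:
        strata[(entity_type, location_group)].append(unit_id)
    return dict(strata)


samples = [
    ("u1", "immunotherapy", "beginning"),
    ("u2", "immunotherapy", "ending"),
    ("u3", "autoimmune disease", "beginning"),
    ("u4", "immunotherapy", "beginning"),
]
strata = build_strata(samples)
print(strata[("immunotherapy", "beginning")])  # ['u1', 'u4']
print(len(strata))                             # 3
```

Each key then identifies one subgroup from which training and test samples can be drawn separately.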

In an alternative embodiment, the text unit includes: chapters, paragraphs, and sentences.

In this embodiment, the probability that an entity appears at different positions can be determined at the level of different text units. Entities that share a name but differ in type often appear at different positions in an article with different probabilities. For example, the same entity may represent a disease type or a symptom type; diseases tend to appear in the introduction and the conclusion, whereas symptoms appear more often in the body of the article. Other polysemous entities likewise appear with different probabilities at different positions in the article. The text units of the present application therefore include chapters, paragraphs and sentences, covering the influence of paragraph order within a chapter, of sentence order within a paragraph, and of position within a sentence, so that more comprehensive entity position information is obtained.

In an alternative embodiment, the location information of each entity includes a start location, an end location, and a length of an entity name of the entity.

In this embodiment, when the text unit is a chapter, the position information of the entity further includes which paragraph of the article the entity is in; when the text unit is a paragraph, it further includes which sentence of the paragraph the entity is in; when the text unit is a sentence, it further includes which byte of the sentence the entity is at.
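The three levels of position information can be illustrated with a small lookup. This sketch makes several assumptions that the text does not state: paragraphs are separated by blank lines, sentences end in `.`, `!` or `?`, and the position within a sentence is a character offset rather than a byte offset. The function name `locate_in_chapter` is hypothetical.

```python
import re


def locate_in_chapter(chapter: str, entity: str):
    """Return (paragraph index, sentence index, character offset) of the
    first occurrence of `entity`, or None if it does not occur."""
    for p_idx, paragraph in enumerate(chapter.split("\n\n")):
        # Split after sentence-ending punctuation followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        for s_idx, sentence in enumerate(sentences):
            offset = sentence.find(entity)
            if offset != -1:
                return p_idx, s_idx, offset
    return None


chapter = (
    "Background on therapy. Patients were enrolled.\n\n"
    "Immune response was measured. Results follow."
)
print(locate_in_chapter(chapter, "enrolled"))  # (0, 1, 14)
print(locate_in_chapter(chapter, "Immune"))    # (1, 0, 0)
```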

In an alternative embodiment, the dividing the data set into a plurality of entity location groups according to the location information includes:

grouping each entity according to the position information of the entity;

In this embodiment, the position of each entity in the text unit is first determined from the entity's position information. Because a text unit may include more than one entity, and the position groups of those entities may be the same or different, the position group of a text unit is determined by combining the position information of all entities in that text unit, so that the text unit is divided more accurately.

In an alternative embodiment, the grouping the entities includes:

In an alternative embodiment, the entity location groups include: a beginning group, an intermediate group, an ending group, a beginning-intermediate group, a beginning-ending group, an intermediate-ending group, and a beginning-intermediate-ending group.

In an alternative embodiment, the method further comprises:

In this embodiment, the proportion of each entity location group is counted first. When the training set and the test set are divided, these proportions must be maintained in both sets so that the model can learn the characteristics of each entity group. For example, if the seven entity location groups occur in the data set in the ratio 1:1:2:3:1:1:2, then after the training set and the test set are divided, the seven entity location groups of the same entity type should also occur in the ratio 1:1:2:3:1:1:2 in each of the two sets.
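The 1:1:2:3:1:1:2 example can be checked with a small stratified split. The sketch below is one possible way to preserve the ratio, not necessarily the procedure used by the applicant: it assumes 10 text units per ratio part (110 in total), a preset test share of 20%, and a single entity type. The function name `stratified_split` and the unit ids are hypothetical.

```python
import random
from collections import Counter

GROUPS = [
    "beginning", "intermediate", "ending",
    "beginning-intermediate", "beginning-ending",
    "intermediate-ending", "beginning-intermediate-ending",
]
RATIO = [1, 1, 2, 3, 1, 1, 2]  # example ratio from the text


def stratified_split(strata, test_fraction, seed=0):
    """Split every stratum separately so each keeps the same train/test share."""
    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(strata):
        units = list(strata[key])
        rng.shuffle(units)
        # round() may shift a unit when the product is not a whole number.
        n_test = round(len(units) * test_fraction)
        test.extend((key, unit) for unit in units[:n_test])
        train.extend((key, unit) for unit in units[n_test:])
    return train, test


# Hypothetical data: 10 text units per ratio part, all of one entity type.
strata = {
    ("immunotherapy", group): [f"{group}-{i}" for i in range(10 * part)]
    for group, part in zip(GROUPS, RATIO)
}
train, test = stratified_split(strata, test_fraction=0.2)

train_counts = Counter(key[1] for key, _ in train)
test_counts = Counter(key[1] for key, _ in test)
print([train_counts[g] for g in GROUPS])  # [8, 8, 16, 24, 8, 8, 16]
print([test_counts[g] for g in GROUPS])   # [2, 2, 4, 6, 2, 2, 4]
```

Both count lists reduce to 1:1:2:3:1:1:2, so the training set (88 units) and the test set (22 units) keep the ratio of the full data set.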

In summary, the data partitioning method of the embodiments of the application is applied to the medical field. It counts the probability that entity types and their entity values appear at different positions in sentences in the data set, groups the data set by entity type, and then divides each group into subgroups according to the position of the entity in the specific text. The groups are then sampled according to the counted probabilities, so that the training set and the test set are more evenly distributed in terms of entity type, context and position information.

Corresponding to the method 100, the embodiments of the application also provide a device for executing the method.

As shown in fig. 2, fig. 2 illustrates a data partitioning apparatus 200, the apparatus comprising:

a receiving module 201, configured to obtain a data set, where the data set includes a plurality of text units, and label entities in the text units;

a processing module 202, configured to classify the data set into a plurality of entity type groups according to entity types; recording the starting position and the ending position of each entity in the text unit to obtain the position information of each entity; dividing the dataset into a plurality of entity location groups according to the location information; counting the proportion of each entity position group in the data set; for each entity type group, dividing the plurality of entity location groups into a plurality of subgroups in combination to complete grouping.

It will be understood that the above division of each module/unit is merely a division of a logic function, and when actually implemented, the functions of each module may be integrated into a hardware entity, for example, the functions of the processing module may be integrated into a processor implementation, the functions of the receiving module may be integrated into a transceiver implementation, and a program and an instruction for implementing the functions of each module may be maintained in a memory. For example, fig. 3 provides an electronic device 300, the electronic device 300 comprising a processor 301, a transceiver 302, and a memory 303. Wherein the transceiver 302 is configured to perform the transceiving of data and signals in the method 100. The memory 303 may be used to store programs/code and the like required by the processor 301 to perform the method 100.

In a specific implementation, corresponding to the foregoing electronic device 300, the embodiments of the present application further provide a computer storage medium. The computer storage medium provided in the electronic device 300 may store a program which, when executed, implements some or all of the steps of the embodiments of the method 100. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like.

Those skilled in the art will appreciate that, for convenience and brevity, the specific working procedures of the above-described systems, apparatuses and units may refer to the corresponding procedures in the foregoing method embodiments, which are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed methods, apparatuses, and systems may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a control device for a cloud game, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

While alternative embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.

The foregoing embodiments are provided to illustrate the general principles of the present invention and are not intended to limit the scope of the invention.

Claims

1. A method of partitioning a data set, the method comprising:

counting the proportion of each entity position group in the data set;

for each entity type group, dividing into a plurality of subgroups in connection with the plurality of entity location groups, comprising: and dividing a training set and a testing set according to a preset proportion and the proportion of each entity position group aiming at each entity type group so that the proportion of text units of each entity position group in the training set and the testing set is the same, and finishing grouping.

2. The method of claim 1, wherein the text unit comprises: chapters, paragraphs, and sentences.

3. The method of claim 1, wherein the location information for each entity comprises a starting location, an ending location, and a length of an entity name for the entity.

4. The method of claim 1, wherein said dividing said data set into a plurality of entity location groups according to said location information comprises:

grouping each entity according to the position information of the entity;

5. The method of claim 4, wherein said grouping the entities comprises:

6. The method of claim 5, wherein the plurality of sets of entity locations comprises: a beginning group, an intermediate group, an ending group, a beginning intermediate group, a beginning ending group, an intermediate ending group, a beginning intermediate ending group.

7. A data set partitioning apparatus, the apparatus comprising:

8. An electronic device comprising a memory and one or more processors; wherein the memory is for storing computer program code, the computer program code comprising computer instructions; the computer instructions, when executed by the processor, cause the electronic device to perform the method of any one of claims 1 to 6.

9. A computer readable storage medium comprising a computer program which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 6.