CN115033691A - Training of text segment recognition model, text segment recognition method, text segment recognition device, and medium - Google Patents
- Publication number
- CN115033691A (application number CN202210626309.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- labeled
- text segment
- sample
- category
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a method, a device, and a medium for training a text segment recognition model and for recognizing text segments, and relates to the field of artificial intelligence. The model training method comprises: receiving a plurality of sample texts and their labeling information from a client, wherein the labeling information of one sample text comprises at least one labeled text segment in the sample text and the labeled category to which each labeled text segment belongs; generating a training sample data set according to the membership between the labeled text segments in the sample texts and the labeled categories to which they belong, and the peer relationship between different labeled text segments in the sample texts that belong to the same labeled category; and training the text segment recognition model to be trained by using the training sample data set. Because both the membership and the peer relationship are used during model training, the semantic features of the labeled text segments in the sample texts can be learned better, and a text segment recognition model with higher recognition performance can be trained even when few sample texts are available.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a medium for training a text segment recognition model and recognizing a text segment.
Background
Text segment Identification (Span Identification) is a type of natural language processing task that aims to find out text segments belonging to a certain category in the input text. The text segment categories are correspondingly formulated according to specific text segment identification tasks.
For a specific text segment recognition task, a text segment recognition model can be trained in advance and then used to perform the task. Generally, the text segment recognition model is trained as follows: a large number of pre-labeled sample texts are acquired, in which text segments and the categories to which they belong are labeled; the model to be trained is then trained using the membership between each text segment in the sample texts and its category, so that the trained model acquires text segment recognition capability.
The above training approach requires a large amount of labeled sample text. However, in some scenarios, a large amount of labeled sample texts may not be available, and thus a text segment recognition model with high recognition performance cannot be trained.
Disclosure of Invention
The embodiments of the application provide a method, a device, and a medium for training a text segment recognition model and for recognizing text segments, so that a text segment recognition model with high recognition performance can be trained even when the number of labeled sample texts is small.
In a first aspect, an embodiment of the present application provides a method for training a text segment recognition model, including:
receiving a plurality of sample texts from a client and the labeling information of each sample text, wherein the labeling information of one sample text comprises: at least one labeled text segment in the sample text and the label category to which each labeled text segment belongs;
generating a training sample data set according to the membership between the labeled text segments in the sample texts and the labeled categories to which the labeled text segments belong, and the peer relationship between different labeled text segments in the sample texts belonging to the same labeled category;
and training the text segment recognition model to be trained by utilizing the training sample data set to obtain the trained text segment recognition model.
In a possible implementation manner, generating a training sample data set according to the membership between labeled text segments in the plurality of sample texts and labeled categories to which the labeled text segments belong and the peer relationship between different labeled text segments in the plurality of sample texts belonging to the same labeled category, includes:
generating a plurality of groups of first training samples according to the peer relationship among different labeled text segments belonging to the same labeled category in the plurality of sample texts;
generating a plurality of groups of second training samples according to the membership between the labeled text segments in the plurality of sample texts and the labeled categories to which the labeled text segments belong;
and generating the training sample data set according to the multiple groups of first training samples and the multiple groups of second training samples.
In a possible implementation manner, generating multiple groups of first training samples according to a peer relationship between different labeled text segments belonging to the same labeled category in the multiple sample texts includes:
dividing each labeled text segment in the plurality of sample texts into at least one text segment set, wherein the labeled text segments in each text segment set belong to the same labeled category, and the labeled text segments in different text segment sets belong to different labeled categories;
generating a plurality of peer relationship data pairs according to the at least one text segment set; each peer relationship data pair comprises a first labeled text segment and a second labeled text segment, wherein the first labeled text segment and the second labeled text segment are two different labeled text segments in the same text segment set;
and generating at least one group of first training samples according to each peer relationship data pair to obtain the plurality of groups of first training samples.
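The grouping and pairing steps above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `build_span_sets` and `peer_pairs`, the example names, and the choice of ordered pairs (so that each segment can serve as either the first or the second labeled text segment) are all assumptions.

```python
from collections import defaultdict
from itertools import permutations

def build_span_sets(annotations):
    """Group labeled text segments by labeled category.

    annotations: iterable of (segment, category) tuples collected from
    all sample texts. Returns {category: [segment, ...]} without duplicates.
    """
    span_sets = defaultdict(list)
    for segment, category in annotations:
        if segment not in span_sets[category]:
            span_sets[category].append(segment)
    return dict(span_sets)

def peer_pairs(span_sets):
    """Form ordered (first, second) pairs of different segments within each set.

    Assumption: ordered pairs are used, so (a, b) and (b, a) both appear.
    """
    pairs = []
    for segments in span_sets.values():
        pairs.extend(permutations(segments, 2))
    return pairs

# Hypothetical annotations gathered from several sample texts.
annotations = [
    ("Zhang San", "person name"),
    ("Li Si", "person name"),
    ("Wang Wu", "person name"),
    ("Shijiazhuang", "place name"),
    ("Beijing", "place name"),
]
span_sets = build_span_sets(annotations)
pairs = peer_pairs(span_sets)
```

Pairs are only formed inside one text segment set, so segments of different labeled categories never end up in the same peer relationship data pair.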
In one possible implementation, generating at least one group of first training samples according to each peer relationship data pair includes:
generating a first query text according to the first labeled text segment in the peer relationship data pair, wherein the first query text comprises the first labeled text segment, and the first query text is used for querying a text segment having a peer relationship with the first labeled text segment;
determining at least one first sample text from the plurality of sample texts according to the second labeled text segment in the peer relationship data pair, the first sample text comprising the second labeled text segment;
and generating the at least one group of first training samples according to the first query text, the at least one first sample text and the second labeled text segment.
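A minimal sketch of how one peer relationship data pair could be turned into first training samples is given below. The query template, the dictionary layout of a training sample, and the function names are hypothetical; the source only states that the first query text contains the first labeled text segment and that each selected sample text contains the second labeled text segment.

```python
def first_query(segment):
    # Hypothetical query template; the source does not specify the wording.
    return f"Find text segments that are peers of: {segment}"

def first_training_samples(pair, samples):
    """Build one group of first training samples per matching sample text.

    pair: (first_segment, second_segment) from the same labeled category.
    samples: list of (sample_text, [(segment, category), ...]).
    """
    first, second = pair
    query = first_query(first)
    groups = []
    for text, spans in samples:
        # Keep only sample texts in which the second segment is labeled.
        if any(segment == second for segment, _ in spans):
            groups.append({"query": query, "context": text, "answer": second})
    return groups

# Hypothetical labeled sample texts.
samples = [
    ("Zhang San's ancestral home is Shijiazhuang",
     [("Zhang San", "person name"), ("Shijiazhuang", "place name")]),
    ("Li Si works in Beijing",
     [("Li Si", "person name"), ("Beijing", "place name")]),
]
groups = first_training_samples(("Zhang San", "Li Si"), samples)
```

Here the query mentions "Zhang San" while the expected answer "Li Si" comes from a different sample text, which is how the peer relationship yields training signal beyond the per-text category labels.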
In one possible implementation, generating a plurality of peer relationship data pairs according to the at least one text segment set includes:
pairing any two different labeled text segments in each text segment set, by permutation and combination, to obtain a plurality of candidate data pairs corresponding to the text segment set;
sampling the candidate data pairs corresponding to at least some of the text segment sets to obtain the peer relationship data pairs; the number of peer relationship data pairs is less than the number of candidate data pairs.
In a possible implementation manner, for any one first text segment set in the at least one text segment set, sampling the multiple candidate data pairs corresponding to the first text segment set, including:
determining the sampling number corresponding to the first text segment set;
sampling the plurality of candidate data pairs corresponding to the first text segment set according to the sampling number;
wherein the sampling number is related to one or more of: a preset sampling proportion; and the difference between the number of labeled text segments contained in a second text segment set and the number of labeled text segments contained in the first text segment set; the second text segment set is the text segment set that contains the largest number of labeled text segments among the at least one text segment set.
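The source names the two quantities that influence the sampling number but gives no formula. The sketch below therefore uses an invented rule purely for illustration: a base amount from the preset sampling proportion, plus a bonus equal to the size gap to the largest set, capped at the number of candidates. The function names, the default proportion, and the seed are all assumptions.

```python
import math
import random

def sampling_number(n_candidates, set_size, largest_set_size, ratio=0.5):
    """Illustrative rule only; the source does not define this formula."""
    base = math.ceil(ratio * n_candidates)   # preset sampling proportion
    bonus = largest_set_size - set_size      # gap to the largest text segment set
    return min(n_candidates, base + bonus)   # never exceed the candidate count

def sample_pairs(candidates, set_size, largest_set_size, ratio=0.5, seed=0):
    """Draw the computed number of candidate data pairs without replacement."""
    k = sampling_number(len(candidates), set_size, largest_set_size, ratio)
    return random.Random(seed).sample(candidates, k)

# Hypothetical candidate pairs for a set of three segments (3 * 2 ordered pairs).
candidates = [("a", "b"), ("a", "c"), ("b", "a"), ("b", "c"), ("c", "a"), ("c", "b")]
chosen = sample_pairs(candidates, set_size=3, largest_set_size=3)
```

Under this assumed rule, smaller text segment sets receive a larger bonus, which is one possible way to offset the fact that they produce fewer candidate data pairs.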
In a possible implementation manner, generating multiple groups of second training samples according to the membership between the labeled text segments in the multiple sample texts and the labeled categories to which the labeled text segments belong includes:
deduplicating the labeled categories to which the labeled text segments in the plurality of sample texts belong, to obtain a labeled category set;
generating a second query text corresponding to each labeling category in the labeling category set, wherein the second query text comprises the labeling category and is used for querying a text segment belonging to the labeling category;
and generating the multiple groups of second training samples according to the second query texts corresponding to the labeled categories in the labeled category set and the membership between each labeled text segment in the multiple sample texts and the labeled category to which the labeled text segment belongs.
In one possible implementation, generating the multiple groups of second training samples according to the second query texts corresponding to the labeled categories in the labeled category set and the membership between each labeled text segment in the multiple sample texts and the labeled category to which the labeled text segment belongs includes:
for each labeled category in the labeled category set, separately traversing each sample text in the plurality of sample texts:
if a labeled text segment belonging to the labeled category exists in the sample text, generating a group of second training samples according to the second query text corresponding to the labeled category, the sample text, and the labeled text segment in the sample text that belongs to the labeled category; or,
if no labeled text segment belonging to the labeled category exists in the sample text, generating a group of second training samples according to the second query text corresponding to the labeled category, the sample text, and an empty text segment.
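The traversal above can be sketched as follows. The query template, the sample layout, and the use of an empty list to stand for the empty text segment are assumptions made for illustration.

```python
def second_query(category):
    # Hypothetical query template; the source does not specify the wording.
    return f"Find text segments belonging to the category: {category}"

def second_training_samples(samples):
    """Build one group per (labeled category, sample text) combination.

    samples: list of (sample_text, [(segment, category), ...]).
    An empty "answers" list stands for the empty text segment.
    """
    # Deduplicate the labeled categories while keeping first-seen order.
    categories = []
    for _, spans in samples:
        for _, category in spans:
            if category not in categories:
                categories.append(category)

    groups = []
    for category in categories:
        query = second_query(category)
        for text, spans in samples:
            answers = [segment for segment, cat in spans if cat == category]
            groups.append({"query": query, "context": text, "answers": answers})
    return groups

# Hypothetical labeled sample texts; the second one has no place name.
samples = [
    ("Zhang San's ancestral home is Shijiazhuang",
     [("Zhang San", "person name"), ("Shijiazhuang", "place name")]),
    ("Li Si is a teacher", [("Li Si", "person name")]),
]
groups = second_training_samples(samples)
```

Keeping the groups with empty answers gives the model examples in which the queried category is absent from the text, matching the second branch of the traversal.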
In a second aspect, an embodiment of the present application provides a text segment identification method, including:
acquiring a query text and a target text, wherein the query text is used for querying a text segment belonging to a preset category in the target text;
inputting the query text and the target text into a trained text segment recognition model for processing, to obtain at least one target text segment in the target text or to obtain an empty text segment, wherein the target text segment belongs to the preset category;
wherein the text segment recognition model is trained by the method according to any one of the first aspect.
In a third aspect, an embodiment of the present application provides a text segment identification method, including:
acquiring a target text;
processing the target text through a trained text segment recognition model to obtain at least one target text segment in the target text and the category of each target text segment, or to obtain an empty text segment;
wherein the text segment recognition model is obtained by training according to the method of any one of the first aspect.
In a fourth aspect, an embodiment of the present application provides a training apparatus for a text segment recognition model, including:
the receiving module is used for receiving a plurality of sample texts from a client and the labeling information of each sample text, wherein the labeling information of one sample text comprises: at least one labeled text segment in the sample text and the label category to which each labeled text segment belongs;
the generating module is used for generating a training sample data set according to the membership between the labeled text segments in the sample texts and the labeled categories to which the labeled text segments belong, and the peer relationship between different labeled text segments in the sample texts belonging to the same labeled category;
and the training module is used for training the text segment recognition model to be trained by utilizing the training sample data set to obtain the trained text segment recognition model.
In a fifth aspect, an embodiment of the present application provides a text segment recognition apparatus, including:
the acquisition module is used for acquiring a query text and a target text, wherein the query text is used for querying a text segment belonging to a preset category in the target text;
the processing module is used for inputting the query text and the target text into a trained text segment recognition model for processing, to obtain at least one target text segment in the target text or to obtain an empty text segment, wherein the target text segment belongs to the preset category;
wherein the text segment recognition model is trained by the apparatus according to the fourth aspect.
In a sixth aspect, an embodiment of the present application provides a text segment recognition apparatus, including:
the acquisition module is used for acquiring a target text;
the processing module is used for processing the target text through the trained text segment recognition model to obtain at least one target text segment in the target text and the category to which each target text segment belongs, or to obtain an empty text segment;
wherein the text segment recognition model is trained by the apparatus according to the fourth aspect.
In a seventh aspect, an embodiment of the present application provides an electronic device, including: a memory and at least one processor, wherein the memory stores a computer program configured to be executed by the at least one processor to implement the method of any one of the first aspect, or the method of the second aspect, or the method of the third aspect.
In an eighth aspect, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method according to any one of the first aspect, or the method according to the second aspect, or the method according to the third aspect.
In a ninth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the first aspect, or the method according to the second aspect, or the method according to the third aspect.
The embodiments of the application provide a method, a device, and a medium for training a text segment recognition model and for recognizing text segments. The method for training the text segment recognition model comprises: receiving a plurality of sample texts from a client and the labeling information of each sample text, wherein the labeling information of one sample text comprises at least one labeled text segment in the sample text and the labeled category to which each labeled text segment belongs; generating a training sample data set according to the membership between the labeled text segments in the sample texts and the labeled categories to which the labeled text segments belong, and the peer relationship between different labeled text segments in the sample texts belonging to the same labeled category; and training the text segment recognition model to be trained by using the training sample data set to obtain the trained text segment recognition model. Because model training uses not only the membership between each text segment in the sample texts and the category to which it belongs, but also the peer relationship between different text segments of the same category, the semantic features of the labeled text segments can be learned better and the utilization rate of the sample texts is improved, so that a text segment recognition model with higher recognition performance can be trained even when the number of sample texts is small.
Drawings
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application;
Fig. 2 is a schematic flowchart of a training method for a text segment recognition model according to an embodiment of the present application;
Fig. 3 is a schematic diagram of membership and peer relationships provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of another training method for a text segment recognition model according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a generation process of a first training sample provided in an embodiment of the present application;
Fig. 6 is a schematic diagram of a generation process of a second training sample provided in an embodiment of the present application;
Fig. 7 is a schematic diagram of the input and output of a text segment recognition model provided by an embodiment of the present application;
Fig. 8A is a schematic flowchart of a text segment recognition method according to an embodiment of the present application;
Fig. 8B is a schematic flowchart of another text segment recognition method according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a training apparatus for a text segment recognition model according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a text segment recognition apparatus according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and in the claims, and in the drawings, of the embodiments of the application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than described or illustrated herein.
It should be understood that the terms "comprises" and "comprising," and any variations thereof, as used herein, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the description of the embodiments of the present application, the term "correspond" may indicate a direct or indirect correspondence between two items, may indicate an association between them, or may indicate a relationship such as indicating and being indicated, or configuring and being configured.
To facilitate understanding of the technical solutions of the present application, first, concepts and terms related to the embodiments of the present application are explained.
Text segment Identification (Span Identification) is a type of natural language processing task that aims to find out text segments belonging to a certain class in an input text. The text segment categories are correspondingly formulated according to specific text segment identification tasks. Some important text sections can be quickly positioned and classified through the text section identification task, so that a user can conveniently and quickly position key information, and the difficulty of reading complex texts by the user is greatly reduced. In practical scenarios, with the need for sophisticated specialized text processing, text segment recognition also comes in a variety of forms including, but not limited to: named entity identification, emotion type identification, contract text clause extraction, and the like.
Data Augmentation is a technique that artificially expands a training data set by making limited data yield more equivalent data. It is an effective means of overcoming insufficient training data and is widely applied in many areas of deep learning.
Machine Reading Comprehension is a type of model framework for natural language processing. Its input comprises two parts, a query and a related context text, and its output is a text segment in the context text that satisfies the input query.
Membership (Subordinate relationship): in text segment recognition, the relationship between a text segment and the category to which it belongs is referred to as membership, i.e., a text segment belongs to a category.
Peer relationship: in text segment recognition, two text segments of the same category have similar semantics, and the relationship between the two text segments is called a peer relationship.
Named Entity Recognition (Named Entity Recognition) is a text segment Recognition task that recognizes a text segment in a text that belongs to an Entity type, including but not limited to a person's name, place name, organization name, proper noun, etc.
Aspect-Based Sentiment Analysis is a text segment recognition task for recognizing text segments in a text that express a certain sentiment type, including but not limited to: positive, negative, happy, sad, etc.
Contract Clause Extraction (Contract Clause Extraction) is a text segment identification task that identifies some type of Clause in the Contract text, including but not limited to: time terms, renewal term terms, related legal terms, and the like.
Span-Based Propaganda Detection is a text segment recognition task for recognizing text segments in a text that correspond to a certain propaganda technique type, including but not limited to: repetition, slogans, etc.
For a specific text segment recognition task, a text segment recognition model can be trained in advance and then used to perform the task. Generally, the text segment recognition model is trained as follows: a large number of pre-labeled sample texts are acquired, in which text segments and the categories to which they belong are labeled; the model to be trained is then trained using the membership between each text segment in the sample texts and its category, so that the trained model acquires text segment recognition capability. This training approach requires a large number of pre-labeled sample texts. However, in some scenarios, a large number of labeled sample texts may not be available, and thus a text segment recognition model with high recognition performance cannot be trained.
To solve the above technical problem, in the technical solution of the present application, model training uses not only the membership between the text segments in the sample texts and the categories to which they belong, but also the peer relationship between different text segments in the sample texts that belong to the same category. The model can therefore learn both the membership between text segments and their categories and the peer relationship between different text segments of the same category, which improves the utilization rate of the sample texts. Even when the number of sample texts is small, the recognition performance of the trained model can be improved.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application. As shown in fig. 1, the application scenario includes: client and model training platform. The model training platform is pre-configured with a text segment recognition model to be trained and a training algorithm of the text segment recognition model.
As an example, an enterprise user can access the model training platform through a client and upload sample texts collected in advance, together with their labeling information, to the model training platform. The labeling information of a sample text comprises: one or more labeled text segments in the sample text, and the labeled category to which each labeled text segment belongs. The model training platform generates a training sample data set based on the sample texts and their labeling information. In generating the training sample data set, it uses not only the membership between each labeled text segment in the sample texts and the labeled category to which it belongs, but also the peer relationship between different labeled text segments of the same labeled category. The model training platform then trains the text segment recognition model to be trained, based on the training sample data set and the pre-configured training algorithm, to obtain the trained text segment recognition model. Finally, the trained text segment recognition model is output to the client.
Further, the enterprise user may deploy the trained text segment recognition model to the execution device. The execution device can complete the text segment recognition task by using the trained text segment recognition model.
Based on the application scenario, the technical solutions provided in the embodiments of the present application are described in detail below by specific embodiments. It should be noted that the technical solutions provided in the embodiments of the present application may include part or all of the following contents, and these specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a schematic flowchart of a training method for a text segment recognition model according to an embodiment of the present application. The method of the present embodiment may be performed by the model training platform of fig. 1. As shown in fig. 2, the method of the present embodiment includes:
S201: receiving a plurality of sample texts from a client and the labeling information of each sample text, wherein the labeling information of one sample text comprises: at least one labeled text segment in the sample text and the labeled category to which each labeled text segment belongs.
In the embodiment of the present application, one sample text may be one sentence or one paragraph. The labeling information of the sample text may include one or more labeled text segments. A labeled text segment is a sequence of language units that appear consecutively in the sample text. For example, the language units in Chinese may be Chinese characters, and the language units in English may be words. The labeling information of the sample text may also include the labeled category to which each labeled text segment belongs.
It should be understood that the labeling information of the sample text (i.e., each labeled text segment and its labeled category) is labeled according to the specific text segment recognition task. If the task is to recognize entities in the text, the labeling information of the sample text includes: at least one entity text segment and the entity category to which each entity text segment belongs. If the task is to recognize emotions in the text, the labeling information of the sample text includes: at least one emotion text segment and the emotion category to which each emotion text segment belongs. If the task is to recognize clauses in contract text, the labeling information of the sample text includes: at least one clause text segment and the clause category to which each clause text segment belongs.
For example, taking the entity identification task as an example, assuming that the sample text is "zhang san ancestor nationality", the labeling of the sample text may include: { Zhang three, name of person }, and { Shijiazhuang, name of place }.
S202: generating a training sample data set according to the membership between the labeled text segments in the plurality of sample texts and the labeled categories to which they belong, and the companion relationship between different labeled text segments in the plurality of sample texts that belong to the same labeled category.
In the embodiment of the present application, based on the plurality of sample texts and their labeling information input by the client, the model training platform can derive the membership between each labeled text segment in the plurality of sample texts and the labeled category to which it belongs, and the companion relationship between different labeled text segments in the plurality of sample texts that belong to the same labeled category. The platform then constructs the training sample data set from these memberships and companion relationships.
This is illustrated below with reference to Fig. 3. Fig. 3 is a schematic diagram of the membership and the companion relationship provided in an embodiment of the present application. As shown in Fig. 3, take the following three sample texts as an example:
Sample text 1: "Zhang San's ancestral home is in Shijiazhuang", whose labeling information includes: {Zhang San, person name}, {Shijiazhuang, place name};
Sample text 2: "Zhang San currently lives in Xi'an", whose labeling information includes: {Zhang San, person name}, {Xi'an, place name};
Sample text 3: "Li Si's work unit is in Luoyang", whose labeling information includes: {Li Si, person name}, {Luoyang, place name}.
Based on the three sample texts and their labeling information, the following memberships can be derived: "Zhang San" belongs to "person name", "Li Si" belongs to "person name", "Shijiazhuang" belongs to "place name", "Xi'an" belongs to "place name", and "Luoyang" belongs to "place name".
Based on the three sample texts and their labeling information, the following companion relationships can be derived: "Zhang San" and "Li Si" are in a companion relationship, "Shijiazhuang" and "Xi'an" are in a companion relationship, "Xi'an" and "Luoyang" are in a companion relationship, and "Shijiazhuang" and "Luoyang" are in a companion relationship.
Further, a training sample data set can be constructed from the memberships and companion relationships derived above. Illustratively, one part of the training samples can be constructed from the memberships and another part from the companion relationships, and the two parts are combined to obtain the training sample data set.
In an embodiment of the present application, each training sample in the training sample data set has the following triplet structure: {query text, text to be queried, text segment satisfying the query}. The sample text is the text to be queried, and the query text indicates the query condition applied to the sample text. The labeled text segment is the text segment in the sample text that satisfies the query condition. In the triplet, the query text and the sample text serve as the input of the text segment recognition model, and the labeled text segment serves as its expected output.
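The triplet structure described above can be pictured as a small record type. The following is only a minimal sketch: the class name `TrainingSample` and its field names are hypothetical and do not appear in the application, and the example strings are romanized renderings of the sample texts.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainingSample:
    """One {query text, text to be queried, text segment satisfying the query} triplet."""
    query_text: str      # query condition applied to the sample text
    sample_text: str     # text to be queried
    answer_segment: str  # labeled text segment that satisfies the query

    def model_input(self) -> Tuple[str, str]:
        # The first two elements are fed to the model; the third is the expected output.
        return (self.query_text, self.sample_text)


sample = TrainingSample(
    query_text="query text segments belonging to the person name category",
    sample_text="Zhang San's ancestral home is in Shijiazhuang",
    answer_segment="Zhang San",
)
```

Keeping the expected output separate from `model_input()` mirrors the split between model input and expected output stated in the text.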
Illustratively, for the above memberships, the following training samples may be generated:
{"query text segments belonging to the person name category", "Zhang San's ancestral home is in Shijiazhuang", "Zhang San"};
{"query text segments belonging to the person name category", "Zhang San currently lives in Xi'an", "Zhang San"};
{"query text segments belonging to the person name category", "Li Si's work unit is in Luoyang", "Li Si"};
{"query text segments belonging to the place name category", "Zhang San's ancestral home is in Shijiazhuang", "Shijiazhuang"};
{"query text segments belonging to the place name category", "Zhang San currently lives in Xi'an", "Xi'an"};
{"query text segments belonging to the place name category", "Li Si's work unit is in Luoyang", "Luoyang"}.
For the above companion relationships, the following training samples may be generated:
{"query text segments similar to Li Si", "Zhang San's ancestral home is in Shijiazhuang", "Zhang San"};
{"query text segments similar to Li Si", "Zhang San currently lives in Xi'an", "Zhang San"};
{"query text segments similar to Zhang San", "Li Si's work unit is in Luoyang", "Li Si"};
{"query text segments similar to Shijiazhuang", "Zhang San currently lives in Xi'an", "Xi'an"};
{"query text segments similar to Shijiazhuang", "Li Si's work unit is in Luoyang", "Luoyang"};
{"query text segments similar to Xi'an", "Zhang San's ancestral home is in Shijiazhuang", "Shijiazhuang"};
{"query text segments similar to Xi'an", "Li Si's work unit is in Luoyang", "Luoyang"};
{"query text segments similar to Luoyang", "Zhang San's ancestral home is in Shijiazhuang", "Shijiazhuang"};
{"query text segments similar to Luoyang", "Zhang San currently lives in Xi'an", "Xi'an"}.
The two parts of training samples together form the training sample data set.
In the embodiment of the present application, the companion relationship is a by-product of the membership and requires no additional labeling. Companion relationships are also plentiful: their number is on the order of the square of the number of memberships, so the number of training samples can be increased substantially. For example, using only the memberships in the above example yields 6 training samples, whereas using both the memberships and the companion relationships yields 15 training samples (6 + 9). In addition, the companion relationship supplies other text segments that are similar in content to a given text segment and thereby enriches the semantics of the category to which the text segment belongs. In other words, training samples constructed from companion relationships enhance the semantic representation of the labeled text segments, which can improve model training performance.
S203: training the text segment recognition model to be trained by using the training sample data set to obtain the trained text segment recognition model.
The embodiment of the present application does not limit the network structure of the text segment recognition model. In some possible implementations, the text segment recognition model may adopt a framework based on machine reading comprehension. Optionally, the text segment recognition model may be a Transformer-based multi-span extraction machine reading comprehension (Multi-span MRC) model, or another Transformer-based model. The Multi-span MRC model is able to query multiple text segments. Alternatively, the text segment recognition model may be a sequence-to-sequence model based on a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN).
The embodiment of the present application does not limit the training process of the text segment recognition model. For example, for each training sample {query text, text to be queried, text segment satisfying the query}, the first two elements of the triplet (i.e., the query text and the sample text) may be used as the input of the text segment recognition model to obtain the predicted text segment output by the model. A loss value is determined from the difference between the predicted text segment and the third element of the triplet (i.e., the labeled text segment), and the model parameters are updated with the goal of minimizing the loss value. It is then determined whether the updated model meets a preset convergence condition: if so, the updated model is taken as the trained text segment recognition model; if not, the training process is repeated until the preset convergence condition is met.
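Since the application fixes neither the loss function nor the optimizer, the following sketch shows only the control flow of the loop described above (predict, compare, update, check convergence). Everything in it is an assumption for illustration: the function `train`, the `predict` and `update` callables, the 0/1 loss, and the toy "model" that simply memorizes answers stand in for a real network and a gradient-based update.

```python
from typing import Callable, Dict, Iterable, Tuple

Sample = Tuple[str, str, str]  # (query text, sample text, labeled text segment)


def train(
    samples: Iterable[Sample],
    predict: Callable[[str, str], str],
    update: Callable[[str, str, str], None],
    max_epochs: int = 10,
    tolerance: float = 0.0,
) -> int:
    """Repeat passes over the samples until the mean loss is <= tolerance.

    Returns the number of passes that were run.
    """
    data = list(samples)
    if not data:
        return 0
    for epoch in range(1, max_epochs + 1):
        total_loss = 0.0
        for query, text, target in data:
            predicted = predict(query, text)
            loss = 0.0 if predicted == target else 1.0  # stand-in 0/1 loss (assumption)
            total_loss += loss
            if loss > 0.0:
                update(query, text, target)  # stand-in for a parameter update
        if total_loss / len(data) <= tolerance:  # stand-in convergence condition
            return epoch
    return max_epochs


# Toy "model": a lookup table that memorizes the expected segment for each input.
memory: Dict[Tuple[str, str], str] = {}
toy_samples = [
    ("query text segments similar to Zhang San", "Li Si's work unit is in Luoyang", "Li Si"),
    ("query text segments belonging to the place name category",
     "Li Si's work unit is in Luoyang", "Luoyang"),
]
epochs_run = train(
    toy_samples,
    predict=lambda q, t: memory.get((q, t), ""),
    update=lambda q, t, y: memory.__setitem__((q, t), y),
)
```

With the memorizing toy model, the first pass makes every prediction wrong and triggers an update for each sample, and the second pass meets the convergence condition.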
The training method for the text segment recognition model provided by the embodiment of the application comprises the following steps: receiving a plurality of sample texts from a client and the labeling information of each sample text, wherein the labeling information of one sample text comprises: at least one labeled text segment in the sample text and the label category to which each labeled text segment belongs; generating a training sample data set according to the membership between the labeled text segments in the sample texts and the labeled categories to which the labeled text segments belong and the companionship between different labeled text segments in the sample texts belonging to the same labeled category; and training the text segment recognition model to be trained by utilizing the training sample data set to obtain the trained text segment recognition model. In the model training, not only the membership between each text segment in the sample text and the category to which the text segment belongs, but also the companion relationship between different text segments belonging to the same category in the sample text are utilized, so that the semantic features of the labeled text segments in the sample text can be better learned, the utilization rate of the sample text is improved, and a text segment recognition model with higher recognition performance can be obtained through training even under the condition that the number of the sample texts is small.
On the basis of the above-mentioned embodiments, the following describes the present application in more detail with reference to a specific embodiment.
Fig. 4 is a flowchart illustrating another training method for a text segment recognition model according to an embodiment of the present application. The present embodiment is described in detail with respect to a generation process of a training sample data set. As shown in fig. 4, the method of the present embodiment includes:
S401: receiving a plurality of sample texts from a client and the labeling information of each sample text, wherein the labeling information of one sample text comprises: at least one labeled text segment in the sample text and the labeled category to which each labeled text segment belongs.
It should be understood that the specific implementation manner of S401 is similar to that of S201 in the embodiment shown in fig. 2, and is not described herein again.
S402: generating a plurality of groups of first training samples according to the companion relationship between different labeled text segments belonging to the same labeled category in the plurality of sample texts.
In this embodiment, any two labeled text segments that are in a companion relationship may form a data pair, referred to as a companion relationship data pair. A plurality of companion relationship data pairs can be generated according to the companion relationship between different labeled text segments belonging to the same labeled category in the plurality of sample texts, and a plurality of groups of first training samples are then generated based on the plurality of companion relationship data pairs.
In some possible implementations, the multiple sets of first training samples may be generated as follows:
(1) dividing each labeled text segment in the plurality of sample texts into at least one text segment set, wherein the labeled text segments in each text segment set belong to the same labeled category, and the labeled text segments in different text segment sets belong to different labeled categories.
That is, each labeled text segment belonging to the same label category is divided into one text segment set. In this way, assuming that N annotation categories coexist in the plurality of sample texts, N text segment sets are obtained by division.
This is illustrated below with reference to Fig. 5. Fig. 5 is a schematic diagram of a generation process of the first training samples provided in an embodiment of the present application. As shown in Fig. 5, take the following three sample texts as an example:
Sample text 1: "Zhang San's ancestral home is in Shijiazhuang", whose labeling information includes: {Zhang San, person name}, {Shijiazhuang, place name};
Sample text 2: "Zhang San currently lives in Xi'an", whose labeling information includes: {Zhang San, person name}, {Xi'an, place name};
Sample text 3: "Li Si's work unit is in Luoyang", whose labeling information includes: {Li Si, person name}, {Luoyang, place name}.
The labeled text segments belonging to the "person name" category in the 3 sample texts are divided into one text segment set, and those belonging to the "place name" category into another. Two text segment sets are thus obtained: {Zhang San, Li Si} and {Shijiazhuang, Xi'an, Luoyang}.
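Step (1) amounts to grouping the distinct labeled text segments by labeled category. A minimal sketch follows; the helper name `group_segments_by_category` and the `(sample text, [(segment, category), ...])` input layout are assumptions made for illustration, and the strings are romanized renderings of the three sample texts.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Annotation = Tuple[str, List[Tuple[str, str]]]  # (sample text, [(segment, category), ...])

ANNOTATIONS: List[Annotation] = [
    ("Zhang San's ancestral home is in Shijiazhuang",
     [("Zhang San", "person name"), ("Shijiazhuang", "place name")]),
    ("Zhang San currently lives in Xi'an",
     [("Zhang San", "person name"), ("Xi'an", "place name")]),
    ("Li Si's work unit is in Luoyang",
     [("Li Si", "person name"), ("Luoyang", "place name")]),
]


def group_segments_by_category(annotations: List[Annotation]) -> Dict[str, List[str]]:
    """Return one list of distinct labeled segments per labeled category."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for _text, labels in annotations:
        for segment, category in labels:
            if segment not in groups[category]:  # keep each segment once per set
                groups[category].append(segment)
    return dict(groups)


SEGMENT_SETS = group_segments_by_category(ANNOTATIONS)
```

Lists are used instead of sets only to keep the first-seen order stable for the later pairing step.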
(2) And generating a plurality of companion relationship data pairs according to the at least one text segment set, wherein each companion relationship data pair comprises a first labeled text segment and a second labeled text segment, and the first labeled text segment and the second labeled text segment are two different labeled text segments in the same text segment set.
Illustratively, all ordered pairs of two different labeled text segments in each text segment set can be formed to obtain the plurality of companion relationship data pairs.
With reference to Fig. 5, for the text segment set {Zhang San, Li Si} corresponding to the "person name" category, forming all ordered pairs of two different labeled text segments yields 2 companion relationship data pairs:
<Zhang San, Li Si>
<Li Si, Zhang San>
For the text segment set {Shijiazhuang, Xi'an, Luoyang} corresponding to the "place name" category, forming all ordered pairs of two different labeled text segments yields 6 companion relationship data pairs:
<Shijiazhuang, Xi'an>
<Shijiazhuang, Luoyang>
<Xi'an, Luoyang>
<Xi'an, Shijiazhuang>
<Luoyang, Shijiazhuang>
<Luoyang, Xi'an>
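The pairing in step (2) corresponds to taking all 2-permutations of each text segment set. The sketch below uses `itertools.permutations` from the Python standard library; the function name `companion_pairs` is hypothetical, and each input list is assumed to contain no duplicate segments.

```python
from itertools import permutations
from typing import List, Tuple


def companion_pairs(segments: List[str]) -> List[Tuple[str, str]]:
    """All ordered pairs <first, second> of two different segments in one set."""
    return list(permutations(segments, 2))


person_pairs = companion_pairs(["Zhang San", "Li Si"])
place_pairs = companion_pairs(["Shijiazhuang", "Xi'an", "Luoyang"])
```

A set of M segments yields M × (M − 1) ordered pairs, which gives 2 pairs for the "person name" set and 6 for the "place name" set.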
(3) generating at least one group of first training samples according to each companion relationship data pair, to obtain the plurality of groups of first training samples.
One or more groups of first training samples may be generated for each companion relationship data pair generated in step (2). Each group of first training samples has the following triplet structure: {query text, text to be queried, text segment satisfying the query}.
In one possible implementation, for each companion relationship data pair, one or more groups of first training samples may be generated as follows: generating a first query text according to the first labeled text segment in the companion relationship data pair, wherein the first query text comprises the first labeled text segment and is used for querying text segments that are in a companion relationship with the first labeled text segment; determining at least one first sample text among the plurality of sample texts according to the second labeled text segment in the companion relationship data pair, wherein the first sample text comprises the second labeled text segment; and generating at least one group of first training samples according to the first query text, the at least one first sample text and the second labeled text segment.
This is illustrated below with reference to fig. 5.
For the companion relationship data pair <Zhang San, Li Si>, "Zhang San" is taken as the first labeled text segment and "Li Si" as the second labeled text segment. The first query text "query text segments similar to Zhang San" is generated from "Zhang San", and the sample texts containing "Li Si" are found among the plurality of sample texts as first sample texts; here, the first sample text is "Li Si's work unit is in Luoyang". A group of first training samples is thus generated from the first query text, the first sample text and the second labeled text segment, as follows:
{"query text segments similar to Zhang San", "Li Si's work unit is in Luoyang", "Li Si"}.
For the companion relationship data pair <Li Si, Zhang San>, "Li Si" is taken as the first labeled text segment and "Zhang San" as the second labeled text segment. The first query text "query text segments similar to Li Si" is generated from "Li Si", and the sample texts containing "Zhang San" are found among the plurality of sample texts as first sample texts; here, two first sample texts are found, namely "Zhang San's ancestral home is in Shijiazhuang" and "Zhang San currently lives in Xi'an". Two groups of first training samples can thus be generated from the first query text, the two first sample texts and the second labeled text segment:
{"query text segments similar to Li Si", "Zhang San's ancestral home is in Shijiazhuang", "Zhang San"};
{"query text segments similar to Li Si", "Zhang San currently lives in Xi'an", "Zhang San"}.
In a similar manner, the following first training samples may be generated for the other companion relationship data pairs:
{"query text segments similar to Shijiazhuang", "Zhang San currently lives in Xi'an", "Xi'an"};
{"query text segments similar to Shijiazhuang", "Li Si's work unit is in Luoyang", "Luoyang"};
{"query text segments similar to Xi'an", "Li Si's work unit is in Luoyang", "Luoyang"};
{"query text segments similar to Xi'an", "Zhang San's ancestral home is in Shijiazhuang", "Shijiazhuang"};
{"query text segments similar to Luoyang", "Zhang San's ancestral home is in Shijiazhuang", "Shijiazhuang"};
{"query text segments similar to Luoyang", "Zhang San currently lives in Xi'an", "Xi'an"}.
It should be understood that the other companion relationship data pairs are processed in the same way as described above, and the details are not repeated here.
Referring to Fig. 5, a total of 9 groups of first training samples are generated from the companion relationships between the labeled text segments in the three sample texts.
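Steps (1) to (3) can be chained into one short routine. The following is a sketch under stated assumptions rather than the implementation of the application: the function name `first_training_samples`, the query wording, and the `(sample text, [(segment, category), ...])` input layout are all chosen here for illustration.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, List, Tuple

Annotation = Tuple[str, List[Tuple[str, str]]]  # (sample text, [(segment, category), ...])

ANNOTATIONS: List[Annotation] = [
    ("Zhang San's ancestral home is in Shijiazhuang",
     [("Zhang San", "person name"), ("Shijiazhuang", "place name")]),
    ("Zhang San currently lives in Xi'an",
     [("Zhang San", "person name"), ("Xi'an", "place name")]),
    ("Li Si's work unit is in Luoyang",
     [("Li Si", "person name"), ("Luoyang", "place name")]),
]


def first_training_samples(annotations: List[Annotation]) -> List[Tuple[str, str, str]]:
    # (1) category -> distinct segments, plus an index of the sample texts
    #     in which each (category, segment) is labeled.
    segments: Dict[str, List[str]] = defaultdict(list)
    texts_with: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for text, labels in annotations:
        for segment, category in labels:
            if segment not in segments[category]:
                segments[category].append(segment)
            texts_with[(category, segment)].append(text)

    samples = []
    for category, members in segments.items():
        # (2) every ordered pair <first, second> inside one category
        for first, second in permutations(members, 2):
            query = f"query text segments similar to {first}"
            # (3) one sample per sample text that contains the second segment
            for text in texts_with[(category, second)]:
                samples.append((query, text, second))
    return samples


FIRST_SAMPLES = first_training_samples(ANNOTATIONS)
```

On the three sample texts this yields 9 triplets: 3 from the "person name" set, because "Zhang San" is labeled in two sample texts, and 6 from the "place name" set.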
In the example shown in Fig. 5, for each text segment set, assuming that the set contains M labeled text segments, M × (M − 1) companion relationship data pairs can be obtained by forming all ordered pairs of labeled text segments in the set. Because each second labeled text segment appears in at least one sample text, the number of first training samples generated from these pairs is greater than or equal to M × (M − 1). Therefore, the more labeled text segments the plurality of sample texts contain, the more first training samples are generated from the companion relationships.
In the model training process, when the number of training samples is too large, the time consumption of the model training process is long, and the model training efficiency is reduced. For this reason, in some possible implementation manners of the embodiment of the present application, the step (2) may be implemented as follows:
All ordered pairs of two different labeled text segments in each text segment set are formed to obtain a plurality of candidate data pairs corresponding to that text segment set. For example, in the example shown in Fig. 5, the text segment set {Zhang San, Li Si} yields 2 candidate data pairs: <Zhang San, Li Si> and <Li Si, Zhang San>; the text segment set {Shijiazhuang, Xi'an, Luoyang} yields 6 candidate data pairs: <Shijiazhuang, Xi'an>, <Shijiazhuang, Luoyang>, <Xi'an, Luoyang>, <Xi'an, Shijiazhuang>, <Luoyang, Shijiazhuang> and <Luoyang, Xi'an>.
Further, the candidate data pairs corresponding to at least part of the text segment sets in the at least one text segment set are sampled to obtain the plurality of companion relationship data pairs; the number of companion relationship data pairs is less than the number of candidate data pairs. Here, sampling means selecting a part of the candidate data pairs corresponding to one or more text segment sets as companion relationship data pairs.
For example, in the example shown in Fig. 5, a part of the 2 candidate data pairs corresponding to the text segment set {Zhang San, Li Si} may be selected as companion relationship data pairs, and a part of the 6 candidate data pairs corresponding to the text segment set {Shijiazhuang, Xi'an, Luoyang} may be selected as companion relationship data pairs.
It should be noted that, in practical application, sampling processing may be performed on all candidate data pairs corresponding to all text segment sets, or sampling processing may be performed only on candidate data pairs corresponding to a part of text segment sets, which is not limited in this embodiment of the present application.
Optionally, for any first text segment set in the at least one text segment set, the candidate data pairs corresponding to the first text segment set may be sampled as follows: determining the sampling number corresponding to the first text segment set, and sampling the candidate data pairs corresponding to the first text segment set according to the sampling number. The sampling number is related to one or more of the following: a preset sampling proportion; and the difference between the number of labeled text segments contained in a second text segment set and the number of labeled text segments contained in the first text segment set. The second text segment set is the text segment set that contains the largest number of labeled text segments among the at least one text segment set.
Several ways of determining the sampling number are illustrated below.
In the first mode, assuming that the number of labeled text segments contained in the first text segment set is $S_1$, the sampling number $T$ may be determined using the following equation:
$T = \lambda \cdot S_1$
where $\lambda$ is a preset sampling proportion.
In this mode, the candidate data pairs corresponding to each text segment set are sampled with the same preset sampling proportion, so the sampling does not change the distribution of data pair counts across the text segment sets.
In the second mode, assuming that the number of labeled text segments contained in the first text segment set is $S_1$ and the number of labeled text segments contained in the second text segment set is $S_2$, the sampling number $T$ may be determined using the following equation:
$T = S_2 - S_1$
In this mode, the sampling number is determined by the difference between the number of labeled text segments contained in the second text segment set and the number contained in the first text segment set, so that the numbers of companion relationship data pairs corresponding to the text segment sets are relatively balanced after sampling. This in turn balances the recognition performance of the model across the labeled categories.
The third mode combines the first and second modes. Assuming that the number of labeled text segments contained in the first text segment set is $S_1$ and the number of labeled text segments contained in the second text segment set is $S_2$, the sampling number $T$ may be determined using the following equation:
$T = \lambda \cdot S_1 + (S_2 - S_1)$
where $\lambda$ is a preset sampling proportion.
It should be understood that sampling the candidate data pairs corresponding to at least part of the text segment sets reduces the number of companion relationship data pairs output in step (2) and hence the number of first training samples generated in step (3). This avoids degrading model training efficiency and strikes a balance between model recognition performance and model training efficiency.
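The three modes for the sampling number can be sketched as follows. The application does not say how a fractional value of T is rounded, how T relates to the number of available candidate pairs, or how the pairs are drawn, so truncating to an integer, clamping T to the candidate count, and drawing uniformly at random are assumptions of this sketch; the names `sampling_number` and `sample_pairs` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def sampling_number(s1: int, s2: int, mode: int, ratio: float = 0.5) -> int:
    """Sampling number T for a set with s1 segments; s2 is the size of the largest set."""
    if mode == 1:
        t = ratio * s1                 # T = lambda * S1
    elif mode == 2:
        t = s2 - s1                    # T = S2 - S1
    elif mode == 3:
        t = ratio * s1 + (s2 - s1)     # T = lambda * S1 + (S2 - S1)
    else:
        raise ValueError("mode must be 1, 2 or 3")
    return int(t)  # truncation is an assumption; the text does not specify rounding


def sample_pairs(
    candidates: Sequence[Tuple[str, str]], t: int, seed: int = 0
) -> List[Tuple[str, str]]:
    """Draw t candidate pairs uniformly without replacement (t is clamped to the valid range)."""
    k = max(0, min(t, len(candidates)))
    return random.Random(seed).sample(list(candidates), k)


place_candidates = [
    ("Shijiazhuang", "Xi'an"), ("Shijiazhuang", "Luoyang"), ("Xi'an", "Luoyang"),
    ("Xi'an", "Shijiazhuang"), ("Luoyang", "Shijiazhuang"), ("Luoyang", "Xi'an"),
]
chosen = sample_pairs(place_candidates, sampling_number(s1=3, s2=3, mode=1, ratio=0.7))
```

A fixed seed keeps the selection reproducible between runs, which is convenient when comparing training results across sampling settings.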
S403: generating a plurality of groups of second training samples according to the membership between the labeled text segments in the plurality of sample texts and the labeled categories to which the labeled text segments belong.
In this embodiment, for each labeling category, each sample text in the plurality of sample texts may be traversed, and a group of second training samples is generated based on the labeling category and the sample text, so as to obtain a plurality of groups of second training samples.
In one possible implementation, the plurality of sets of second training samples may be generated as follows:
(1) de-duplicating the labeled categories to which the labeled text segments in the plurality of sample texts belong, to obtain a labeled category set.
That is to say, the labeling category set includes all the appearing labeling categories in the labeling information of the plurality of sample texts, and the labeling categories in the labeling category set are not repeated.
This is illustrated below with reference to Fig. 6. Fig. 6 is a schematic diagram of a generation process of the second training samples according to an embodiment of the present application. As shown in Fig. 6, take the following three sample texts as an example:
Sample text 1: "Zhang San's ancestral home is in Shijiazhuang", whose labeling information includes: {Zhang San, person name}, {Shijiazhuang, place name};
Sample text 2: "Zhang San currently lives in Xi'an", whose labeling information includes: {Zhang San, person name}, {Xi'an, place name};
Sample text 3: "Li Si's work unit is in Luoyang", whose labeling information includes: {Li Si, person name}, {Luoyang, place name}.
The labeling information of the 3 sample texts contains 2 labeled categories, namely "person name" and "place name". The labeled category set is therefore {person name, place name}.
(2) And generating a second query text corresponding to each label category in the label category set, wherein the second query text comprises the label categories, and the second query text is used for querying text segments belonging to the label categories.
Illustratively, for the annotation category "person name", a second query text "query a text segment belonging to the person name category" is generated. For the label category "place name", a second query text "query a text segment belonging to the place name category" is generated.
(3) And generating the multiple groups of second training samples according to the second query texts corresponding to the labeling categories in the labeling category set and the membership between each labeling text segment in the multiple sample texts and the labeling category to which the labeling text segment belongs.
In this embodiment, for each labeling category in the labeling category set, each sample text in the plurality of sample texts may be traversed, and a group of second training samples is generated based on the labeling category and the sample text, so as to obtain a plurality of groups of second training samples. Each group of second training samples all satisfy the data structure of the following triples: { query text, text to be queried, text segment satisfying query }.
In a possible implementation manner, for each labeling category in a labeling category set, traversing each sample text in the plurality of sample texts respectively, and if a labeling text segment belonging to the labeling category exists in the sample text, generating a group of second training samples according to a second query text corresponding to the labeling category, the sample text, and the labeling text segment belonging to the labeling category in the sample text; or if the sample text does not have the labeled text segment belonging to the labeled category, generating a group of second training samples according to the second query text corresponding to the labeled category, the sample text and the empty text segment.
This is illustrated below with reference to fig. 6.
For the labeled category "person name" in the labeled category set, each sample text is traversed as follows:
for sample text 1, because the labeled text segment "Zhang San" belonging to the "person name" category exists in sample text 1, a group of second training samples is generated: {"query text segments belonging to the person name category", "Zhang San's ancestral home is in Shijiazhuang", "Zhang San"};
for sample text 2, because the labeled text segment "Zhang San" belonging to the "person name" category exists in sample text 2, a group of second training samples is generated: {"query text segments belonging to the person name category", "Zhang San currently lives in Xi'an", "Zhang San"};
for sample text 3, because the labeled text segment "Li Si" belonging to the "person name" category exists in sample text 3, a group of second training samples is generated: {"query text segments belonging to the person name category", "Li Si's work unit is in Luoyang", "Li Si"}.
For the labeled category "place name" in the labeled category set, each sample text is traversed as follows:
for sample text 1, because the labeled text segment "Shijiazhuang" belonging to the "place name" category exists in sample text 1, a group of second training samples is generated: {"query text segments belonging to the place name category", "Zhang San's ancestral home is in Shijiazhuang", "Shijiazhuang"};
for sample text 2, because the labeled text segment "Xi'an" belonging to the "place name" category exists in sample text 2, a group of second training samples is generated: {"query text segments belonging to the place name category", "Zhang San currently lives in Xi'an", "Xi'an"};
for sample text 3, because the labeled text segment "Luoyang" belonging to the "place name" category exists in sample text 3, a group of second training samples is generated: {"query text segments belonging to the place name category", "Li Si's work unit is in Luoyang", "Luoyang"}.
Referring to Fig. 6, a total of 6 groups of second training samples are generated from the membership between each labeled text segment in the three sample texts and the labeled category to which it belongs.
It should be noted that, in some scenarios, a sample text may contain no labeled text segment belonging to a given label category. For example, for the label category "place name", assume there is a sample text 4, "boy is three". No labeled text segment belonging to the "place name" category exists in sample text 4, so a group of second training samples { "query the text segment belonging to the place name category", "boy is three", "empty" } is generated. Through such second training samples, the model learns to recognize the case in which no text segment of a given label category exists, which improves the text segment recognition capability of the model.
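The traversal described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the applicant's implementation: the function name `build_second_samples`, the query template, the use of an empty string to stand for the empty text segment, and the choice to emit one triple per labeled segment when a text contains several segments of the same category are all hypothetical. The three sample texts are rendered in English.

```python
from typing import Dict, List, Tuple

# (query text, text to be queried, text segment satisfying the query)
Triple = Tuple[str, str, str]

EMPTY_SEGMENT = ""  # assumed encoding of the "empty text segment"


def build_second_samples(
    annotations: Dict[str, List[Tuple[str, str]]],
    query_template: str = "query the text segment belonging to the {category} category",
) -> List[Triple]:
    """annotations maps each sample text to its (labeled segment, category) pairs."""
    # De-duplicate the label categories while keeping first-seen order.
    categories: List[str] = []
    for pairs in annotations.values():
        for _, category in pairs:
            if category not in categories:
                categories.append(category)

    samples: List[Triple] = []
    for category in categories:
        query = query_template.format(category=category)
        for text, pairs in annotations.items():
            segments = [seg for seg, cat in pairs if cat == category]
            if segments:
                samples.extend((query, text, seg) for seg in segments)
            else:
                # No segment of this category: pair the query with the empty segment.
                samples.append((query, text, EMPTY_SEGMENT))
    return samples


corpus = {
    "Zhang San's ancestral home is in Shijiazhuang": [
        ("Zhang San", "person name"), ("Shijiazhuang", "place name")],
    "Zhang San currently lives in Xi'an": [
        ("Zhang San", "person name"), ("Xi'an", "place name")],
    "Li Si's work unit is in Luoyang": [
        ("Li Si", "person name"), ("Luoyang", "place name")],
}
second_samples = build_second_samples(corpus)
```

With the three sample texts above, every text contains a segment of both categories, so the sketch yields the 6 groups of second training samples described for fig. 6. A text lacking a category would instead contribute a triple whose third element is the empty segment.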
S404: generating the training sample data set according to the multiple groups of first training samples and the multiple groups of second training samples.
Illustratively, a training sample data set is generated from the 9 groups of first training samples generated in the example shown in fig. 5 and the 6 groups of second training samples generated in the example shown in fig. 6. The training sample data set comprises a total of 15 training samples.
In the embodiment of the application, the first training samples are constructed from the peer relationship between labeled text segments, which greatly increases the number of training samples in the training sample data set: the number of added training samples grows on the order of the square of the number of training samples obtained from the membership relationship alone. Therefore, even when only a small amount of labeled text is available, a high-quality training sample data set can be constructed automatically at a quadratic scale while introducing almost no noise.
S405: training the text segment recognition model to be trained by using the training sample data set to obtain the trained text segment recognition model.
In this embodiment, the first training sample and the second training sample have the same format and both satisfy the following triple data structure: { query text, text to be queried, text segment satisfying query }. That is, each training sample in the set of training sample data satisfies the above-described triple data structure.
For example, take the first training sample { "query a text segment similar to Zhang San", "Li Si's work unit is in Luoyang", "Li Si" }: "query a text segment similar to Zhang San" is the query text, "Li Si's work unit is in Luoyang" is the text to be queried, and "Li Si" is the text segment satisfying the query.
For example, take the second training sample { "query the text segment belonging to the person name category", "Li Si's work unit is in Luoyang", "Li Si" }: "query the text segment belonging to the person name category" is the query text, "Li Si's work unit is in Luoyang" is the text to be queried, and "Li Si" is the text segment satisfying the query.
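The shared triple structure can be written down as a small Python type. This is a sketch only; the class name `TrainingSample` and its field names are hypothetical and do not appear in the application.

```python
from typing import NamedTuple


class TrainingSample(NamedTuple):
    query_text: str          # what to look for
    text_to_be_queried: str  # where to look
    answer_segment: str      # text segment satisfying the query


first_sample = TrainingSample(
    "query a text segment similar to Zhang San",
    "Li Si's work unit is in Luoyang",
    "Li Si",
)
second_sample = TrainingSample(
    "query the text segment belonging to the person name category",
    "Li Si's work unit is in Luoyang",
    "Li Si",
)

# Both kinds of sample share one schema, so they can be stored in one list.
dataset = [first_sample, second_sample]
```

Because the two samples differ only in the wording of the query text, nothing in the data structure needs to record which relationship a sample was derived from.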
When the text segment recognition model is trained by using the training sample data set, the query text and the text to be queried in the first training sample/the second training sample are input into the text segment recognition model, and the predicted text segment output by the model is obtained.
This is illustrated in connection with fig. 7. Fig. 7 is a schematic diagram of the input and output of a text segment recognition model provided in an embodiment of the present application. As shown in fig. 7, it is assumed that the input format of the text segment recognition model is "[CLS] query text [SEP] text to be queried [SEP]" and that the output of the model is a predicted text segment, where [CLS] and [SEP] are preset separator characters.
For the first training sample, taking { "query a text segment similar to Zhang San", "Li Si's work unit is in Luoyang", "Li Si" } as an example, the query text "query a text segment similar to Zhang San" and the text to be queried "Li Si's work unit is in Luoyang" are spliced to obtain the model input text "[CLS] query a text segment similar to Zhang San [SEP] Li Si's work unit is in Luoyang [SEP]".
For the second training sample, taking { "query the text segment belonging to the person name category", "Li Si's work unit is in Luoyang", "Li Si" } as an example, the query text "query the text segment belonging to the person name category" and the text to be queried "Li Si's work unit is in Luoyang" are spliced to obtain the model input text "[CLS] query the text segment belonging to the person name category [SEP] Li Si's work unit is in Luoyang [SEP]".
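The splicing step can be sketched as a plain string operation. The helper name `build_model_input` is hypothetical, and a real pipeline would normally let a tokenizer add the separators; plain strings are used here only to show the layout.

```python
CLS = "[CLS]"
SEP = "[SEP]"


def build_model_input(query_text: str, text_to_be_queried: str) -> str:
    """Concatenate query and text as "[CLS] query [SEP] text [SEP]"."""
    return f"{CLS} {query_text} {SEP} {text_to_be_queried} {SEP}"


model_input = build_model_input(
    "query a text segment similar to Zhang San",
    "Li Si's work unit is in Luoyang",
)
```

The same helper serves first and second training samples, since only the wording of the query text differs between them.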
The model input text is input into the text segment recognition model, which outputs a predicted text segment after recognition processing. A loss function is then determined from the predicted text segment and the text segment satisfying the query in the training sample, and the model parameters of the text segment recognition model are updated with the goal of minimizing the loss function. The training process is repeated until the text segment recognition model reaches a preset convergence condition, yielding the trained text segment recognition model. It should be understood that, in the actual training process, one training sample or multiple training samples may be input into the model in each iteration, which is not limited in this embodiment. Fig. 7 illustrates a case where two training samples are input.
In this embodiment, because the first training samples and the second training samples have the same format, they need not be distinguished when the text segment recognition model is trained with the training sample data set. That is, the first training samples and the second training samples may be mixed together for model training. Because the model does not need to distinguish between them, neither the model structure nor the processing inside the model has to be modified, and the model can learn the membership relationship and the peer relationship at the same time, which reduces the implementation difficulty.
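A minimal sketch of mixing the two kinds of sample into training batches is shown below. The function name `mixed_batches`, the fixed seed, and the placeholder sample contents are assumptions made for illustration; the application does not prescribe a shuffling or batching scheme.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def mixed_batches(
    first_samples: Sequence[Triple],
    second_samples: Sequence[Triple],
    batch_size: int,
    seed: int = 0,
) -> Iterator[List[Triple]]:
    """Shuffle both kinds of sample together and yield batches of at most batch_size."""
    pool = list(first_samples) + list(second_samples)
    random.Random(seed).shuffle(pool)  # seeded only for reproducibility
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


# Placeholder data with the same counts as the example: 9 first and 6 second samples.
first = [(f"query a text segment similar to A{i}", f"text {i}", f"B{i}") for i in range(9)]
second = [(f"query the text segment belonging to category {i}", f"text {i}", f"C{i}") for i in range(6)]
batches = list(mixed_batches(first, second, batch_size=2))
```

With a batch size of 2, each iteration feeds two training samples to the model, matching the case drawn in fig. 7; the final batch holds the one remaining sample.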
Furthermore, in the technical solution of the embodiment of the application, constructing each training sample as a triple { query text, text to be queried, text segment satisfying the query } allows the training data of various types of text segment recognition tasks to be unified into the same data format, so that one model can handle various types of text segment recognition tasks at the same time, which improves the transferability and generality of the model.
In addition, the technical scheme of the application enhances the data based on the existing data without involving external resources and tools, and has wide applicability.
The training process of the text segment recognition model is described above with reference to fig. 2 to 7. The process of using the text segment recognition model is described below with reference to fig. 8A and 8B.
Fig. 8A is a flowchart illustrating a text segment recognition method according to an embodiment of the present application. As shown in fig. 8A, the method of the present embodiment includes:
S801: acquiring a query text and a target text, wherein the query text is used for querying a text segment belonging to a preset category in the target text.
S802: processing the query text and the target text through a trained text segment recognition model to obtain at least one target text segment in the target text, or to obtain an empty text segment, wherein the target text segment belongs to the preset category.
The text segment recognition model is obtained by training by adopting the model training method provided by any method embodiment.
Illustratively, the query text and the target text are input into the text segment recognition model. If the text segment recognition model finds a target text segment belonging to the preset category in the target text, it outputs the target text segment. If the text segment recognition model does not find a target text segment belonging to the preset category in the target text, it outputs an empty text segment.
It should be understood that the text segment recognition method of the present embodiment can be applied to various recognition tasks, including but not limited to: named entity identification, attribute-based sentiment analysis, contract term extraction, and the like.
The query text input to the model differs when the method is applied to different recognition tasks. Illustratively, when applied to named entity recognition, the query text is used to query the target text for text segments belonging to a preset entity category; when applied to attribute-based sentiment analysis, the query text is used to query the target text for text segments belonging to a preset sentiment category; when applied to contract term extraction, the query text is used to query the target text for text segments belonging to a preset term category.
Fig. 8B is a flowchart illustrating another text segment recognition method according to an embodiment of the present application. As shown in fig. 8B, the method of the present embodiment includes:
S811: acquiring a target text.
S812: processing the target text through the trained text segment recognition model to obtain at least one target text segment in the target text and the category of each target text segment, or to obtain an empty text segment.
The text segment recognition model is obtained by training by adopting the model training method provided by any method embodiment.
Illustratively, the target text is input into the text segment recognition model, and the text segment recognition model performs the following processing for each of a plurality of preset categories: querying the target text according to the query text corresponding to the preset category, determining whether the target text contains a target text segment belonging to the preset category, and, if such a target text segment exists, outputting the target text segment and the preset category. If the target text contains no text segment belonging to any of the plurality of preset categories, an empty text segment is output. The preset categories may be the categories trained in the model training process.
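The per-category loop described above can be sketched as follows. The names `recognize_all` and `toy_model`, the query template, and the convention that an empty list stands for the empty text segment are assumptions; `toy_model` is a dictionary lookup standing in for a trained model and is not part of the application.

```python
from typing import Callable, List, Tuple

# A model maps (query text, target text) to the segments it finds (possibly none).
Model = Callable[[str, str], List[str]]


def recognize_all(
    model: Model,
    target_text: str,
    preset_categories: List[str],
    query_template: str = "query the text segment belonging to the {category} category",
) -> List[Tuple[str, str]]:
    """Return (target segment, category) pairs; an empty list stands for the empty text segment."""
    results: List[Tuple[str, str]] = []
    for category in preset_categories:
        query = query_template.format(category=category)
        for segment in model(query, target_text):
            results.append((segment, category))
    return results


def toy_model(query_text: str, target_text: str) -> List[str]:
    """Stand-in for a trained model: a dictionary lookup, for demonstration only."""
    lexicon = {"person name": ["Zhang San", "Li Si"], "place name": ["Luoyang", "Xi'an"]}
    for category, words in lexicon.items():
        if category in query_text:
            return [w for w in words if w in target_text]
    return []


found = recognize_all(
    toy_model, "Li Si's work unit is in Luoyang", ["person name", "place name"]
)
```

Issuing one query per preset category keeps the inference path identical to the training format, so the same model serves both the single-query method of fig. 8A and the multi-category method of fig. 8B.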
In this embodiment, the training process of the text segment recognition model uses not only the membership relationship but also the peer relationship, so that the text segment recognition model has higher recognition performance. Consequently, when text segments are recognized with this model, the accuracy of the text segment recognition result is improved.
The above describes the training method and the recognition method of the text segment recognition model provided in the embodiment of the present application, and the following describes the training device and the recognition device of the text segment recognition model provided in the embodiment of the present application.
In the embodiment of the present application, the training device of the text segment recognition model and the text segment recognition device may be divided into functional modules according to the method embodiments, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be implemented in the form of hardware, and can also be implemented in the form of a software functional module.
It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation. The following description will be given by taking an example in which each functional module is divided by using a corresponding function.
Fig. 9 is a schematic structural diagram of a training apparatus for a text segment recognition model according to an embodiment of the present application. As shown in fig. 9, the training apparatus 900 for text segment recognition model provided in this embodiment includes: a receiving module 901, a generating module 902 and a training module 903. Wherein,
a receiving module 901, configured to receive multiple sample texts from a client and labeling information of each sample text, where the labeling information of one sample text includes: at least one labeled text segment in the sample text and the label category to which each labeled text segment belongs;
a generating module 902, configured to generate a training sample data set according to a membership relationship between labeled text segments in the multiple sample texts and labeled categories to which the labeled text segments belong, and a peer relationship between different labeled text segments in the multiple sample texts belonging to the same labeled category;
and the training module 903 is configured to train the text segment recognition model to be trained by using the training sample data set, so as to obtain a trained text segment recognition model.
In some possible implementations, the generating module 902 is specifically configured to:
generating a plurality of groups of first training samples according to the peer relationship among different labeled text segments belonging to the same labeled category in the plurality of sample texts;
generating a plurality of groups of second training samples according to the membership between the labeled text segments in the plurality of sample texts and the labeled categories to which the labeled text segments belong;
and generating the training sample data set according to the multiple groups of first training samples and the multiple groups of second training samples.
In some possible implementations, the generating module 902 is specifically configured to:
dividing each labeled text segment in the plurality of sample texts into at least one text segment set, wherein labeled text segments in each text segment set belong to the same labeled category, and labeled text segments in different text segment sets belong to different labeled categories;
generating a plurality of peer relationship data pairs according to the at least one text segment set; each companion relationship data pair comprises a first labeled text segment and a second labeled text segment, wherein the first labeled text segment and the second labeled text segment are two different labeled text segments in the same text segment set;
and generating at least one group of first training samples according to each companion relationship data pair to obtain the plurality of groups of first training samples.
In some possible implementations, the generating module 902 is specifically configured to:
generating a first query text according to the first labeled text segment in the peer relationship data pair, wherein the first query text comprises the first labeled text segment, and the first query text is used for querying a text segment having a peer relationship with the first labeled text segment;
determining at least one first sample text from the plurality of sample texts according to the second annotation text segment in the peer relationship data pair, the first sample text comprising the second annotation text segment;
and generating the at least one group of first training samples according to the first query text, the at least one first sample text and the second labeled text segment.
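The construction of first training samples from peer relationship data pairs can be sketched as follows. The function name `build_first_samples`, the query template, and the decision to locate the first sample texts through the annotations rather than by substring search are assumptions made for this illustration; the sample texts are rendered in English.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def build_first_samples(
    annotations: Dict[str, List[Tuple[str, str]]],
    query_template: str = "query a text segment similar to {segment}",
) -> List[Triple]:
    """annotations maps each sample text to its (labeled segment, category) pairs."""
    # Group distinct labeled segments by category (one text segment set per category).
    segment_sets: Dict[str, List[str]] = {}
    for pairs in annotations.values():
        for segment, category in pairs:
            bucket = segment_sets.setdefault(category, [])
            if segment not in bucket:
                bucket.append(segment)

    samples: List[Triple] = []
    for category, segments in segment_sets.items():
        # Every ordered pair of two different segments is a peer relationship data pair.
        for first_seg, second_seg in permutations(segments, 2):
            query = query_template.format(segment=first_seg)
            # Each sample text annotated with the second segment yields one sample.
            for text, pairs in annotations.items():
                if (second_seg, category) in pairs:
                    samples.append((query, text, second_seg))
    return samples


corpus = {
    "Zhang San's ancestral home is in Shijiazhuang": [
        ("Zhang San", "person name"), ("Shijiazhuang", "place name")],
    "Zhang San currently lives in Xi'an": [
        ("Zhang San", "person name"), ("Xi'an", "place name")],
    "Li Si's work unit is in Luoyang": [
        ("Li Si", "person name"), ("Luoyang", "place name")],
}
first_samples = build_first_samples(corpus)
```

On the three sample texts of the running example this sketch produces 9 triples: 3 from the two person names (the pair whose second segment is "Zhang San" matches two texts) and 6 from the three place names, consistent with the 9 groups of first training samples cited for fig. 5.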
In some possible implementations, the generating module 902 is specifically configured to:
respectively arranging and combining any two different labeled text segments in each text segment set to obtain a plurality of candidate data pairs corresponding to the text segment set;
sampling the candidate data pairs corresponding to at least part of the text segment set to obtain the peer relationship data pairs; the number of peer relationship data pairs is less than the number of candidate data pairs.
In some possible implementations, for any first text segment set in the at least one text segment set, the generating module 902 is specifically configured to:
determining the sampling number corresponding to the first text segment set;
sampling the plurality of candidate data pairs corresponding to the first text segment set according to the sampling number;
wherein the sampling number is related to one or more of the following: a preset sampling proportion, and the difference between the number of labeled text segments contained in the second text segment set and the number of labeled text segments contained in the first text segment set; the second text segment set is the text segment set that contains the largest number of labeled text segments among the at least one text segment set.
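The candidate generation and sampling steps can be sketched as follows. The text above does not give a formula for the sampling number, so this sketch uses only a preset sampling proportion and leaves out any adjustment based on the size difference from the largest set; the function name `sample_peer_pairs`, the rounding rule, and the fixed seed are all assumptions.

```python
import math
import random
from itertools import permutations
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def sample_peer_pairs(
    segment_set: Sequence[str],
    sampling_ratio: float,
    seed: int = 0,
) -> List[Pair]:
    """Sample peer relationship data pairs from all ordered candidate pairs of one set."""
    candidates = list(permutations(segment_set, 2))
    if not candidates:
        return []
    # Assumed rule: keep a fixed proportion of the candidates, but at least one.
    count = max(1, math.ceil(sampling_ratio * len(candidates)))
    count = min(count, len(candidates))
    return random.Random(seed).sample(candidates, count)


place_names = ["Shijiazhuang", "Xi'an", "Luoyang"]
pairs = sample_peer_pairs(place_names, sampling_ratio=0.5)
```

Three place names give 6 ordered candidate pairs, and a proportion of 0.5 keeps 3 of them. Sampling in this way bounds the quadratic growth of the first training samples when a category contains many labeled text segments.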
In some possible implementations, the generating module 902 is specifically configured to:
carrying out duplicate removal processing on the labeling categories to which the labeling text segments belong in the plurality of sample texts to obtain a labeling category set;
generating a second query text corresponding to each label category in the label category set, wherein the second query text comprises the label category and is used for querying a text segment belonging to the label category;
and generating the multiple groups of second training samples according to the second query texts corresponding to the labeling categories in the labeling category set and the membership between each labeling text segment in the multiple sample texts and the labeling category to which the labeling text segment belongs.
In some possible implementations, the generating module 902 is specifically configured to:
for each annotation class in the set of annotation classes, respectively traversing each sample text in the plurality of sample texts:
if the sample text has the label text segment belonging to the label category, generating a group of second training samples according to a second query text corresponding to the label category, the sample text and the label text segment belonging to the label category in the sample text; or,
and if the labeled text segment belonging to the labeled category does not exist in the sample text, generating a group of second training samples according to a second query text corresponding to the labeled category, the sample text and the empty text segment.
The training device for a text segment recognition model provided in this embodiment may execute the training method for a text segment recognition model provided in any method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Fig. 10 is a schematic structural diagram of a text segment recognition apparatus according to an embodiment of the present application. As shown in fig. 10, the text segment recognition apparatus 1000 provided in this embodiment includes: an obtaining module 1001 and a processing module 1002. Wherein,
an obtaining module 1001, configured to obtain a query text and a target text, where the query text is used to query a text segment belonging to a preset category in the target text; the processing module 1002 is configured to process the query text and the target text through a trained text segment recognition model to obtain at least one target text segment in the target text, or to obtain an empty text segment, where the target text segment belongs to the preset category.
Or,
an obtaining module 1001, configured to obtain a target text; the processing module 1002 is configured to process the target text through the trained text segment recognition model to obtain at least one target text segment in the target text and a category to which each target text segment belongs, or to obtain an empty text segment.
The text segment recognition model is obtained by training with a training device of the text segment recognition model shown in fig. 9.
The text segment recognition apparatus provided in this embodiment may execute the text segment recognition method provided in the foregoing method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 11, the electronic device 1100 provided in the present embodiment includes: a memory 1101 and at least one processor 1102; the memory 1101 stores a computer program configured to be executed by the processor 1102 to implement the training method of the text segment recognition model or the text segment recognition method provided in any one of the above method embodiments. The implementation principles and technical effects are similar and are not described herein again.
Optionally, the memory 1101 may be separate or integrated with the processor 1102. When the memory 1101 is a separate device from the processor 1102, the electronic device 1100 further comprises: a bus 1103 for connecting the memory 1101 and the processor 1102.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method for training a text segment recognition model or the method for recognizing a text segment provided in any of the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
An embodiment of the present application provides a computer program product, including a computer program, where the computer program is executed by a processor to implement the method for training a text segment recognition model or the method for text segment recognition provided in any of the foregoing method embodiments, and the implementation principle and technical effect are similar, which are not described herein again.
An embodiment of the present application further provides a chip, including a memory and a processor, where the memory stores a computer program and the processor runs the computer program to implement the training method of the text segment recognition model or the text segment recognition method provided by any one of the above method embodiments. The implementation principles and technical effects are similar and are not described herein again.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of hardware and software modules.
The memory may comprise a high-speed RAM, and may further comprise a non-volatile memory (NVM), such as at least one disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, or the like.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the bus in the figures of the present application is not limited to only one bus or one type of bus.
The storage medium may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit and scope of the present disclosure as defined by the appended claims.
Claims (13)
1. A training method of a text segment recognition model is characterized by comprising the following steps:
receiving a plurality of sample texts from a client and the labeling information of each sample text, wherein the labeling information of one sample text comprises: at least one labeled text segment in the sample text and the label category to which each labeled text segment belongs;
generating a training sample data set according to the membership between the labeled text segments in the sample texts and the labeled categories to which the labeled text segments belong and the peer relationship between different labeled text segments in the sample texts belonging to the same labeled category;
and training the text segment recognition model to be trained by utilizing the training sample data set to obtain the trained text segment recognition model.
2. The method of claim 1, wherein generating a training sample data set according to the membership between the labeled text segments in the sample texts and the labeled categories to which the labeled text segments belong and the peer relationship between different labeled text segments in the sample texts belonging to the same labeled category comprises:
generating a plurality of groups of first training samples according to the peer relationship among different labeled text segments belonging to the same labeled category in the plurality of sample texts;
generating a plurality of groups of second training samples according to the membership between the labeled text segments in the plurality of sample texts and the labeled categories to which the labeled text segments belong;
and generating the training sample data set according to the multiple groups of first training samples and the multiple groups of second training samples.
3. The method of claim 2, wherein generating a plurality of sets of first training samples according to a peer relationship between different labeled text segments belonging to a same labeled category in the plurality of sample texts comprises:
dividing each labeled text segment in the plurality of sample texts into at least one text segment set, wherein the labeled text segments in each text segment set belong to the same labeled category, and the labeled text segments in different text segment sets belong to different labeled categories;
generating a plurality of peer relationship data pairs according to the at least one text segment set; each peer relationship data pair comprises a first labeled text segment and a second labeled text segment, wherein the first labeled text segment and the second labeled text segment are two different labeled text segments in the same text segment set;
and generating at least one group of first training samples according to each peer relationship data pair to obtain the plurality of groups of first training samples.
4. The method of claim 3, wherein generating at least one set of first training samples from each peer relationship data pair comprises:
generating a first query text according to the first labeled text segment in the peer relationship data pair, wherein the first query text comprises the first labeled text segment, and the first query text is used for querying a text segment having a peer relationship with the first labeled text segment;
determining at least one first sample text from the plurality of sample texts according to the second annotation text segment in the peer relationship data pair, the first sample text comprising the second annotation text segment;
and generating the at least one group of first training samples according to the first query text, the at least one first sample text and the second labeled text segment.
5. The method of claim 3 or 4, wherein generating a plurality of peer relationship data pairs from the at least one set of text segments comprises:
respectively arranging and combining any two different labeled text segments in each text segment set to obtain a plurality of candidate data pairs corresponding to the text segment set;
sampling the plurality of candidate data pairs corresponding to at least part of the text segment sets to obtain the plurality of peer relationship data pairs; the number of peer relationship data pairs is less than the number of candidate data pairs.
6. The method according to claim 5, wherein, for any first text segment set in the at least one text segment set, sampling the plurality of candidate data pairs corresponding to the first text segment set comprises:
determining the sampling number corresponding to the first text segment set;
sampling the plurality of candidate data pairs corresponding to the first text segment set according to the sampling number;
wherein the sampling number is related to one or more of the following: a preset sampling proportion, and the difference between the number of labeled text segments contained in the second text segment set and the number of labeled text segments contained in the first text segment set; the second text segment set is the text segment set that contains the largest number of labeled text segments among the at least one text segment set.
7. The method according to any one of claims 2 to 6, wherein generating a plurality of sets of second training samples according to the membership between the labeled text segments in the plurality of sample texts and the labeled categories to which the labeled text segments belong comprises:
de-duplicating the labeled categories to which the labeled text segments in the plurality of sample texts belong to obtain a labeled category set;
generating a second query text corresponding to each labeled category in the labeled category set, wherein the second query text comprises the labeled category and is used for querying text segments belonging to the labeled category;
and generating the plurality of groups of second training samples according to the second query text corresponding to each labeled category in the labeled category set and the membership between each labeled text segment in the plurality of sample texts and the labeled category to which the labeled text segment belongs.
8. The method of claim 7, wherein generating the plurality of groups of second training samples according to the second query text corresponding to each labeled category in the labeled category set and the membership between each labeled text segment in the plurality of sample texts and the labeled category to which the labeled text segment belongs comprises:
for each labeled category in the labeled category set, separately traversing each sample text in the plurality of sample texts:
if the sample text has a labeled text segment belonging to the labeled category, generating a group of second training samples according to the second query text corresponding to the labeled category, the sample text, and the labeled text segment belonging to the labeled category in the sample text; or,
if the sample text has no labeled text segment belonging to the labeled category, generating a group of second training samples according to the second query text corresponding to the labeled category, the sample text, and an empty text segment.
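The traversal in claims 7 and 8 can be sketched as below. The input layout (a list of `(text, [(segment, category), ...])` tuples), the function name `build_second_training_samples`, and the query wording are assumptions made for illustration; an empty list stands in for the empty text segment.

```python
def build_second_training_samples(sample_texts):
    """Build (query, text, segments) groups for every category/text combination.

    sample_texts: list of (text, annotations), where annotations is a list of
    (labeled_segment, labeled_category) tuples. This layout is hypothetical.
    """
    # De-duplicate the labeled categories across all sample texts.
    categories = sorted(
        {category for _, annotations in sample_texts for _, category in annotations}
    )
    samples = []
    for category in categories:
        # Hypothetical query template; the claims only require that the
        # query text contain the labeled category.
        query = f"Find the text segments belonging to category: {category}"
        for text, annotations in sample_texts:
            segments = [seg for seg, cat in annotations if cat == category]
            # An empty list plays the role of the empty text segment.
            samples.append((query, text, segments))
    return samples
```

Every category is paired with every sample text, so texts without a matching segment become negative examples for that category's query.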
9. A method for text segment recognition, comprising:
acquiring a query text and a target text, wherein the query text is used for querying a text segment belonging to a preset category in the target text;
processing the query text and the target text through a trained text segment recognition model to obtain at least one target text segment in the target text, or to obtain an empty text segment, wherein the target text segment belongs to the preset category;
wherein the text segment recognition model is trained by the method of any one of claims 1 to 8.
10. A method for text segment recognition, comprising:
acquiring a target text;
processing the target text through a trained text segment recognition model to obtain at least one target text segment in the target text and the category of each target text segment, or to obtain an empty text segment;
wherein the text segment recognition model is trained by the method of any one of claims 1 to 8.
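The two recognition modes of claims 9 and 10 differ in their inputs and outputs, which the sketch below illustrates. It does not use a trained model: `TOY_LEXICON` is a hypothetical stand-in that matches fixed phrases, and the function names are assumptions; only the input and output shapes are meant to mirror the claims.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a trained model: category -> known phrases.
TOY_LEXICON: Dict[str, List[str]] = {
    "location": ["Beijing", "Shanghai"],
    "person": ["Alice"],
}


def recognize_with_query(query_text: str, target_text: str) -> List[str]:
    """Claim 9 style: segments of the category named in the query text.

    Returns an empty list (the empty text segment) when nothing matches.
    """
    found = []
    for category, phrases in TOY_LEXICON.items():
        if category in query_text:
            found.extend(p for p in phrases if p in target_text)
    return found


def recognize_all(target_text: str) -> List[Tuple[str, str]]:
    """Claim 10 style: every recognized segment together with its category."""
    return [
        (phrase, category)
        for category, phrases in TOY_LEXICON.items()
        for phrase in phrases
        if phrase in target_text
    ]
```

In the query-driven mode the caller fixes the category in advance, while in the query-free mode the category is part of the output.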
11. An electronic device, comprising: a memory and at least one processor; the memory has stored therein a computer program configured to be executed by the at least one processor to implement the method of any one of claims 1 to 8, or the method of claim 9 or 10.
12. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8, or the method of claim 9 or 10.
13. A computer program product, comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8, or the method of claim 9 or 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210626309.6A CN115033691A (en) | 2022-06-02 | 2022-06-02 | Training of text segment recognition model, text segment recognition method, text segment recognition device, and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210626309.6A CN115033691A (en) | 2022-06-02 | 2022-06-02 | Training of text segment recognition model, text segment recognition method, text segment recognition device, and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115033691A (en) | 2022-09-09 |
Family
ID=83122859
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210626309.6A Pending CN115033691A (en) | 2022-06-02 | 2022-06-02 | Training of text segment recognition model, text segment recognition method, text segment recognition device, and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115033691A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114764594A (en) * | 2022-04-02 | 2022-07-19 | 阿里巴巴(中国)有限公司 | Classification model feature selection method, device and equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019174423A1 (en) * | 2018-03-16 | 2019-09-19 | 北京国双科技有限公司 | Entity sentiment analysis method and related apparatus |
CN113344098A (en) * | 2021-06-22 | 2021-09-03 | 北京三快在线科技有限公司 | Model training method and device |
2022-06-02: CN application CN202210626309.6A, published as CN115033691A (en), status: active, Pending
Non-Patent Citations (2)
Title |
---|
ZARA NASAR ET AL.: "Named Entity Recognition and Relation Extraction: State-of-the-Art", 《ACM COMPUTING SURVEYS (CSUR), VOLUME 54, ISSUE 1》, 11 February 2021 (2021-02-11), pages 1 - 39 * |
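吴婷 (WU TING): "篇章级实体关系识别关键技术研究" [Research on Key Technologies for Document-Level Entity Relation Recognition], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology], 15 February 2021 (2021-02-15), pages 138 - 2956 * |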
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Bui et al. | Infercode: Self-supervised learning of code representations by predicting subtrees | |
CN107203468B (en) | AST-based software version evolution comparative analysis method | |
CN110580308B (en) | Information auditing method and device, electronic equipment and storage medium | |
CN111191275A (en) | Sensitive data identification method, system and device | |
CN110321437B (en) | Corpus data processing method and device, electronic equipment and medium | |
CN113051356A (en) | Open relationship extraction method and device, electronic equipment and storage medium | |
CN110866107A (en) | Method and device for generating material corpus, computer equipment and storage medium | |
CN113051914A (en) | Enterprise hidden label extraction method and device based on multi-feature dynamic portrait | |
Jiang et al. | Combining embedding-based and symbol-based methods for entity alignment | |
CN113254649A (en) | Sensitive content recognition model training method, text recognition method and related device | |
CN114153839B (en) | Integration method, device, equipment and storage medium of multi-source heterogeneous data | |
CN113190220A (en) | JSON file differentiation comparison method and device | |
CN111428513A (en) | False comment analysis method based on convolutional neural network | |
CN115186015A (en) | Network security knowledge graph construction method and system | |
Wu et al. | Deep learning models for spatial relation extraction in text | |
CN115033691A (en) | Training of text segment recognition model, text segment recognition method, text segment recognition device, and medium | |
CN115600605A (en) | Method, system, equipment and storage medium for jointly extracting Chinese entity relationship | |
WO2016093839A1 (en) | Structuring of semi-structured log messages | |
CN108830302B (en) | Image classification method, training method, classification prediction method and related device | |
CN114840642A (en) | Event extraction method, device, equipment and storage medium | |
CN111178701A (en) | Risk control method and device based on feature derivation technology and electronic equipment | |
CN111178080A (en) | Named entity identification method and system based on structured information | |
CN110765276A (en) | Entity alignment method and device in knowledge graph | |
CN103425795A (en) | Radar data analyzing method based on cloud calculation | |
CN114842982B (en) | Knowledge expression method, device and system for medical information system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||