CN117973525A - Complex sample relation extraction method, system, equipment and medium

Complex sample relation extraction method, system, equipment and medium

Info

Publication number: CN117973525A
Application number: CN202410230243.8A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 黄思东, 黄加胜, 彭俊伟
Assignee: Guangzhou Tanyu Technology Co ltd
Filing / priority date: 2024-02-29
Publication date: 2024-05-03
Legal status: Pending
Prior art keywords: sample, samples, relation extraction, complex, extraction model

Abstract

The invention provides a complex sample relation extraction method, system, equipment and medium. The method comprises the following steps: acquiring sample data and preprocessing the sample data; constructing a relation extraction model with a BERT model as the language encoder, and performing fine-tuning optimization on the relation extraction model; performing initial prediction on the sample data through the relation extraction model, marking the misrecognized samples as complex samples and outputting them, constructing positive and negative sample sets from the complex samples, and performing reinforcement training on the sample data through contrastive learning and back-propagation optimization until convergence; and constructing an evaluation index system for the relation extraction model and comprehensively evaluating the performance of the relation extraction model through the evaluation index system.

Description

Complex sample relation extraction method, system, equipment and medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a complex sample relation extraction method, system, equipment and medium.
Background
The relation extraction task is a key link in constructing knowledge graph data and knowledge base data for products, users, transactions and the like, and through such graphs and knowledge bases it supports recommendation systems, search optimization, user behavior analysis and so on. In the field of electronic commerce in particular, the data comprise a large number of product descriptions, user evaluations and transaction records, whose diversity and complexity increase the difficulty of the relation extraction task.
Early relation extraction relied on manually written rules and templates, which often performed poorly on large-scale and diversified e-commerce data. With the development of machine learning, feature-engineering-based methods began to be applied to relation extraction, but expert knowledge was still required to extract effective features. Deep learning neural network models can automatically learn data features and thus reduce the dependence on artificial feature engineering.
However, although current lightweight deep learning models are efficient at processing large-scale data, they still have limitations in identifying and handling complex samples in the data. E-commerce data in particular contain many complex samples with nonstandard expressions and obscure semantic relations, and current lightweight models cannot meet the high requirement of identifying such samples efficiently and universally.
Therefore, a complex sample relation extraction method is needed to solve the problem that the above models cannot process complex sample data efficiently and accurately.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a complex sample relation extraction method, system, equipment and medium.
The first aspect of the invention discloses a complex sample relation extraction method, which comprises the following steps:
S1: acquiring sample data and preprocessing the sample data;
S2: constructing a relation extraction model with a BERT model as the language encoder, and performing fine-tuning optimization on the relation extraction model;
S3: performing initial prediction on the sample data through the relation extraction model, marking the misrecognized samples as complex samples and outputting them, constructing positive and negative sample sets from the complex samples, and performing reinforcement training on the sample data through contrastive learning and back-propagation optimization until convergence;
S4: constructing an evaluation index system for the relation extraction model, and comprehensively evaluating the performance of the relation extraction model through the evaluation index system.
In an alternative embodiment, preprocessing the sample data comprises:
S11: performing word segmentation on the sample data through a tokenizer, decomposing each text sentence segment into word units;
S12: identifying the head entity and tail entity of each text sentence segment in the sample data, and marking the positions of the head and tail entities in the segmented sentence based on character indexes;
S13: acquiring a stop word list, and removing from the sample data the word units whose semantic contribution is below a threshold according to the stop word list;
S14: performing length standardization on the text sentence segments in the sample data according to the input requirements of the relation extraction model.
In an alternative embodiment, the fine-tuning optimization of the relation extraction model comprises:
S21: updating the pre-trained weights of the relation extraction model based on historical sample data;
S22: inserting a classification token for the relation classification task at the beginning of each sample data sequence, and inserting a segmentation token for separating different sentences at the end of each sentence segment;
S23: splicing entity relation templates, and converting the relation extraction task into a gap-filling task through prompt instructions to perform relation prediction;
S24: constructing a label mapping between relation predictive words and special tokens in the gap-filling task, and expressing the relation between entities through the mapped predictive words;
S25: presetting the learning rate, batch size, and training epochs of the relation extraction model, and training the fine-tuning process through a cross-entropy loss function until the model converges.
In an alternative embodiment, constructing positive and negative sample sets from the complex samples comprises:
S31: traversing all complex samples, and selecting for each complex sample the samples that have the same relation type and a correct relation prediction as positive samples;
S32: selecting for each complex sample the samples that have a different relation type as negative samples.
In an alternative embodiment, the reinforcement training of the sample data through contrastive learning and back-propagation optimization until convergence comprises:
S33: acquiring the feature representation of each sample through the relation extraction model;
S34: calculating the feature representation distance between the samples of each positive pair and each negative pair through cosine similarity;
S35: minimizing the feature representation distance between positive sample pairs and maximizing the feature representation distance between negative sample pairs using a contrastive learning loss function;
S36: performing reinforcement training on the complex samples through back propagation with the Adam optimizer until the contrastive loss function converges.
In an alternative embodiment, the expression of the contrastive learning loss function is:

$$L(\theta, Y, S_i, S_j) = \frac{1}{M}\sum_{i=1}^{M}\Big[\, y_i\, D_\theta(S_i)^2 + (1 - y_i)\,\max\big(0,\; m - D_\theta(S_j)\big)^2 \Big]$$

wherein L(θ, Y, S_i, S_j) represents the contrastive learning loss based on the model parameters θ, the output tensor Y, the positive sample pair S_i, and the negative sample pair S_j; M represents the total number of samples in the correct and incorrect prediction sets; y_i represents the relation label; m represents a trainable parameter controlling the distance limit between positive and negative sample pairs; and D_θ represents the distance calculation formula.
In an alternative embodiment, constructing an evaluation index system for the relation extraction model and comprehensively evaluating the performance of the relation extraction model through the evaluation index system comprises:
S41: constructing the evaluation index system from accuracy, recall, and F1 score;
S42: using accuracy to calculate the ratio between the number of samples correctly classified by the relation extraction model and the total number of samples, wherein the samples comprise correctly classified positive and negative samples and incorrectly classified positive and negative samples;
S43: using recall to calculate the ratio between the number of positive samples correctly classified by the relation extraction model and the total number of actual positive samples, and outputting it as the recognition sensitivity;
S44: calculating the F1 score of the relation extraction model from the precision and the recall, and evaluating the performance of the current relation extraction model according to the ranking of F1 scores.
In a second aspect, the invention discloses a complex sample relation extraction system, the system comprising:
a data preprocessing module for acquiring sample data and preprocessing the sample data;
a model fine-tuning module for constructing a relation extraction model with a BERT model as the language encoder and performing fine-tuning optimization on the relation extraction model;
a complex sample enhancement module for performing initial prediction on the sample data through the relation extraction model, marking the misrecognized samples as complex samples and outputting them, constructing positive and negative sample sets from the complex samples, and performing reinforcement training on the sample data through contrastive learning and back-propagation optimization until convergence;
an index evaluation module for constructing an evaluation index system for the relation extraction model and comprehensively evaluating the performance of the relation extraction model through the evaluation index system.
A third aspect of the present invention discloses a complex sample relation extraction device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the complex sample relation extraction method disclosed in any embodiment of the first aspect of the invention.
A fourth aspect of the present invention discloses a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the complex sample relation extraction method disclosed in any embodiment of the first aspect of the invention.
Compared with the prior art, the invention has the following advantages:
(1) By constructing positive and negative sample sets to enhance the complex samples, the invention can effectively handle the prediction and recognition of complex samples, providing more accurate and personalized recommendation and search results for an e-commerce platform, thereby improving user experience and attracting more users.
(2) By training the relation extraction model and outputting the optimal relation extraction result in combination with the evaluation index system, the method enables enterprises to better understand user demands and to improve user participation through more targeted services and recommendations.
(3) The complex sample extraction method improves the accuracy of the relation extraction task, meets the urgent requirements of the e-commerce field, and improves the performance of recommendation systems, search engines, and user behavior analysis.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of the complex sample relation extraction method of the present invention;
FIG. 2 is a schematic diagram of the complex sample relation extraction system of the present invention.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. All other embodiments, which can be made by a person of ordinary skill in the art based on the embodiments provided by the present application without making any inventive effort, are intended to fall within the scope of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and those of ordinary skill in the art can apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the described embodiments of the application can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," and similar referents in the context of the application are not to be construed as limiting the quantity, but rather as singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in connection with the present application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
Example 1
Referring to FIG. 1, an embodiment of the invention discloses a complex sample relation extraction method, which comprises the following steps:
S1: acquiring sample data and preprocessing the sample data;
S2: constructing a relation extraction model with a BERT model as the language encoder, and performing fine-tuning optimization on the relation extraction model;
S3: performing initial prediction on the sample data through the relation extraction model, marking the misrecognized samples as complex samples and outputting them, constructing positive and negative sample sets from the complex samples, and performing reinforcement training on the sample data through contrastive learning and back-propagation optimization until convergence;
It should be noted that complex samples are samples that the relation extraction model predicts and recognizes incorrectly; they are difficult for the model to classify correctly because of fuzzy semantics, complex sentence structure, or inconspicuous relations between entities.
S4: constructing an evaluation index system for the relation extraction model, and comprehensively evaluating the performance of the relation extraction model through the evaluation index system.
In an alternative embodiment, preprocessing the sample data comprises:
S11: performing word segmentation on the sample data through a tokenizer, decomposing each text sentence segment into word units;
It should be noted that the WordPiece tokenizer that comes with the BERT model can be used; it effectively decomposes sentences into sub-word units (subwords). This segmentation scheme can handle out-of-vocabulary words while preserving a certain degree of vocabulary integrity.
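As a minimal sketch of this step, using the WordPiece tokenizer from the Hugging Face transformers library (the checkpoint name and example sentence are illustrative assumptions):

```python
from transformers import BertTokenizer

# Load the WordPiece tokenizer that ships with a BERT checkpoint
# (checkpoint name is illustrative).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "The UltraPhone is manufactured by Acme Electronics."
tokens = tokenizer.tokenize(sentence)
print(tokens)
# Out-of-vocabulary words are decomposed into sub-word units, e.g.
# ['the', 'ultra', '##phone', 'is', 'manufactured', 'by', ...]
```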
S12: identifying head entities and tail entities of text sentence segments in the sample data, and marking the positions of the head entities and the tail entities of the text sentence segments after word segmentation based on character indexes;
Further, identifying the head entity and the tail entity in each sentence, if the sentence describes the relationship between the person and the organization, the two entities need to be identified, and marking the positions of the head entity and the tail entity in the sentence after word segmentation can be achieved by adding special marks (such as < head > and < tai >).
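A minimal sketch of character-index-based marking, assuming each entity span is given as (start, end) character indexes; the helper name follows the <head>/<tail> convention above and is otherwise illustrative:

```python
def mark_entities(text: str, head_span: tuple, tail_span: tuple) -> str:
    """Wrap the head and tail entities with special markers.
    head_span / tail_span are (start, end) character indexes."""
    spans = [(head_span, "<head>", "</head>"),
             (tail_span, "<tail>", "</tail>")]
    # Insert the rightmost span first so earlier indexes stay valid.
    for (start, end), open_tag, close_tag in sorted(
            spans, key=lambda s: s[0][0], reverse=True):
        text = text[:start] + open_tag + text[start:end] + close_tag + text[end:]
    return text

print(mark_entities("Acme Electronics makes the UltraPhone.", (0, 16), (27, 37)))
# <head>Acme Electronics</head> makes the <tail>UltraPhone</tail>.
```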
S13: acquiring a stop word list, and removing from the sample data the word units whose semantic contribution is below a threshold according to the stop word list;
Further, general-purpose Chinese and English stop word lists are used; common words with little semantic contribution, such as "the", "is", and "at", can be removed during the preprocessing stage.
S14: performing length standardization on the text sentence segments in the sample data according to the input requirements of the relation extraction model.
It should be noted that, according to the model's input requirements (such as BERT's maximum length limit), sentence lengths are normalized by truncating overly long sentences or padding overly short ones.
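Reusing the tokenizer from the sketch above, truncation and padding can both be delegated to the tokenizer call; the max_length value is an illustrative choice within BERT's 512-token limit:

```python
encoded = tokenizer(
    sentence,
    max_length=128,        # illustrative; must not exceed BERT's 512
    truncation=True,       # cut sentences that are too long
    padding="max_length",  # pad sentences that are too short
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 128])
```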
In an alternative embodiment, the fine-tuning optimization of the relation extraction model comprises:
S21: updating the pre-trained weights of the relation extraction model based on historical sample data;
S22: inserting a classification token for the relation classification task at the beginning of each sample data sequence, and inserting a segmentation token for separating different sentences at the end of each sentence segment;
S23: splicing entity relation templates, and converting the relation extraction task into a gap-filling task through prompt instructions to perform relation prediction;
S24: constructing a label mapping between relation predictive words and special tokens in the gap-filling task, and expressing the relation between entities through the mapped predictive words;
S25: presetting the learning rate, batch size, and training epochs of the relation extraction model, and training the fine-tuning process through a cross-entropy loss function until the model converges.
It should be noted that the BERT model is selected as the language encoder and its pre-trained weights are downloaded; BERT is pre-trained on large amounts of text and can understand complex language structures and semantics.
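A minimal sketch of the weight download, again via the transformers library with an illustrative checkpoint name:

```python
from transformers import BertModel

# Downloads and caches the pre-trained weights on first use.
encoder = BertModel.from_pretrained("bert-base-uncased")
```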
Further, to fit BERT's input format, a [CLS] token is added at the beginning of each input sample and a [SEP] token is added at the end of each sentence.
Further, to identify entity locations more clearly, special markers are added around each entity: the first entity is wrapped with "[E1]" and "[/E1]" before and after it, and the second entity with "[E2]" and "[/E2]".
Further, during template splicing, templates describing the relationship between entity 1 and entity 2 are processed. At this stage the algorithm converts the relation extraction task into a gap-filling form, in which a Prompt (hint phrase) guides the model to understand and predict the relation between the entities in the text, for example: "The relationship between [Product] and [Attribute] is [MASK]", where the model predicts the word that fills the [MASK] slot.
Specifically, when constructing the label mapping between relation predictive words and special tokens in the gap-filling task, special tokens such as [Label 1], [Label 2], ..., [Label N] are added, where [Label N] represents the N-th relation. If the model predicts that [Label N] has the highest probability of filling the blank, then the N-th relation holds between the entities in the original sentence.
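As a rough sketch of steps S22 to S24 (the checkpoint name, relation names, entity markers, and template wording are all illustrative assumptions, not taken from the patent), prompt-based prediction with a label-token mapping might look as follows:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One special token per relation type (relation names are illustrative).
relations = ["brand_of", "price_of", "made_by"]
label_tokens = [f"[Label{i + 1}]" for i in range(len(relations))]
tokenizer.add_special_tokens({"additional_special_tokens":
                              label_tokens + ["[E1]", "[/E1]", "[E2]", "[/E2]"]})
model.resize_token_embeddings(len(tokenizer))

# Template-spliced input: the relation slot is a [MASK] to be filled in.
prompt = ("[E1] UltraPhone [/E1] is made by [E2] Acme [/E2] . "
          "The relationship between UltraPhone and Acme is [MASK] .")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Label mapping: compare only the [Label i] tokens at the [MASK] position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
label_ids = tokenizer.convert_tokens_to_ids(label_tokens)
best = logits[0, mask_pos, label_ids].argmax().item()
# Before the fine-tuning of step S25 the new token embeddings are random,
# so this prediction is only a placeholder.
print("predicted relation:", relations[best])
```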
In an alternative embodiment, constructing positive and negative sample sets from the complex samples comprises:
S31: traversing all complex samples, and selecting for each complex sample the samples that have the same relation type and a correct relation prediction as positive samples;
S32: selecting for each complex sample the samples that have a different relation type as negative samples.
In an alternative embodiment, the reinforcement training of the sample data through contrastive learning and back-propagation optimization until convergence comprises:
S33: acquiring the feature representation of each sample through the relation extraction model;
S34: calculating the feature representation distance between the samples of each positive pair and each negative pair through cosine similarity;
S35: minimizing the feature representation distance between positive sample pairs and maximizing the feature representation distance between negative sample pairs using a contrastive learning loss function;
S36: performing reinforcement training on the complex samples through back propagation with the Adam optimizer until the contrastive loss function converges.
It should be noted that, when constructing positive and negative samples, for each complex sample (a mispredicted sample) a sample having the same true relation but predicted correctly is selected as the positive sample. For example, if the relation between a product and a manufacturer is mispredicted in one sample, another sample that also expresses the product-manufacturer relation but is predicted correctly is selected as the positive sample. A sample with a different relation is selected as the negative sample for each complex sample; if the complex sample expresses a product-manufacturer relation, the negative sample may express a product-price or another relation.
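A minimal sketch of this construction, assuming each sample is a dict with illustrative "gold" (true relation) and "pred" (predicted relation) fields:

```python
def build_pairs(samples):
    """Return (complex_sample, partner, y) triples: y = 1 for a positive
    pair (same relation, partner correctly predicted), y = 0 for a
    negative pair (different relation). Field names are illustrative."""
    correct = [s for s in samples if s["pred"] == s["gold"]]
    complex_samples = [s for s in samples if s["pred"] != s["gold"]]
    pairs = []
    for cs in complex_samples:
        for p in (s for s in correct if s["gold"] == cs["gold"]):
            pairs.append((cs, p, 1))  # positive pair
        for n in (s for s in samples if s["gold"] != cs["gold"]):
            pairs.append((cs, n, 0))  # negative pair
    return pairs
```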
In an alternative embodiment, the expression of the contrastive learning loss function is:

$$L(\theta, Y, S_i, S_j) = \frac{1}{M}\sum_{i=1}^{M}\Big[\, y_i\, D_\theta(S_i)^2 + (1 - y_i)\,\max\big(0,\; m - D_\theta(S_j)\big)^2 \Big]$$

wherein L(θ, Y, S_i, S_j) represents the contrastive learning loss based on the model parameters θ, the output tensor Y, the positive sample pair S_i, and the negative sample pair S_j; M represents the total number of samples in the correct and incorrect prediction sets; y_i represents the relation label; m represents a trainable parameter controlling the distance limit between positive and negative sample pairs; and D_θ represents the distance calculation formula.
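A PyTorch sketch of this loss under the formulation above, using cosine distance as D_θ (per step S34) and keeping the margin m trainable; class and variable names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    def __init__(self, init_margin: float = 0.5):
        super().__init__()
        # Trainable margin m controlling the negative-pair distance limit.
        self.m = nn.Parameter(torch.tensor(init_margin))

    def forward(self, z_a, z_b, y):
        """z_a, z_b: (batch, dim) feature representations of each pair;
        y: (batch,) with 1 for positive pairs and 0 for negative pairs."""
        d = 1.0 - F.cosine_similarity(z_a, z_b, dim=-1)  # D_theta
        pos = y * d.pow(2)                          # pull positives together
        neg = (1 - y) * F.relu(self.m - d).pow(2)   # push negatives past m
        return (pos + neg).mean()

# Usage in the reinforcement training loop of step S36 (sketch):
# loss = criterion(encoder(batch_a), encoder(batch_b), y)
# optimizer.zero_grad(); loss.backward(); optimizer.step()  # Adam
```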
In an alternative embodiment, constructing an evaluation index system for the relation extraction model and comprehensively evaluating the performance of the relation extraction model through the evaluation index system comprises:
S41: constructing the evaluation index system from accuracy, recall, and F1 score;
S42: using accuracy to calculate the ratio between the number of samples correctly classified by the relation extraction model and the total number of samples, wherein the samples comprise correctly classified positive and negative samples and incorrectly classified positive and negative samples;
S43: using recall to calculate the ratio between the number of positive samples correctly classified by the relation extraction model and the total number of actual positive samples, and outputting it as the recognition sensitivity;
S44: calculating the F1 score of the relation extraction model from the precision and the recall, and evaluating the performance of the current relation extraction model according to the ranking of F1 scores.
The Accuracy refers to the ratio between the number of samples correctly classified by the model and the total number of samples, and its calculation formula is expressed as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

wherein TP (True Positives) denotes the number of correctly classified positive cases, TN (True Negatives) denotes the number of correctly classified negative cases, FP (False Positives) denotes the number of negative cases incorrectly classified as positive, and FN (False Negatives) denotes the number of positive cases incorrectly classified as negative. Accuracy measures the proportion of correctly classified samples among all samples; however, when the classes are unbalanced, accuracy may not be a good indicator, since it can be dominated by the majority class.
Further, Recall refers to the ratio between the number of positive cases the model correctly recognizes and the number of all actual positive cases; it is also known as Sensitivity or the true positive rate (True Positive Rate), and its calculation formula is expressed as:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Recall measures the model's ability to recognize positive cases. High recall means the model captures more positive examples, but may come at the cost of more false positives.
Further, the F1 Score (F1-Score) is an index that considers precision and recall together, namely their harmonic mean, and its calculation formula is expressed as:

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

where Precision refers to the ratio between the number of correctly classified positive cases and the number of all samples classified as positive; it measures the accuracy of the model's positive predictions. The F1 score combines precision and recall and comprehensively evaluates the model's performance on positive and negative sample classification; it is particularly useful when dealing with unbalanced classes.
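As a brief sketch, the three indexes can be computed with scikit-learn on illustrative labels (macro averaging is one reasonable choice for multi-relation evaluation):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 2, 1, 0, 2, 1]  # illustrative gold relation labels
y_pred = [1, 0, 2, 0, 0, 2, 1]  # illustrative model predictions

acc = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  F1={f1:.3f}")
```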
By constructing positive and negative sample sets to enhance the complex samples and handling the prediction and recognition of complex samples, the invention can provide more accurate and personalized recommendation and search results for an e-commerce platform, thereby improving user experience and attracting more users. By training the relation extraction model and outputting the optimal relation extraction result in combination with the evaluation index system, enterprises can better understand user demands and improve user participation through more targeted services and recommendations. The complex sample extraction method improves the accuracy of the relation extraction task, meets the urgent requirements of the e-commerce field, and improves the performance of recommendation systems, search engines, and user behavior analysis.
As shown in FIG. 2, the second aspect of the present invention discloses a complex sample relation extraction system, which comprises:
a data preprocessing module for acquiring sample data and preprocessing the sample data;
a model fine-tuning module for constructing a relation extraction model with a BERT model as the language encoder and performing fine-tuning optimization on the relation extraction model;
a complex sample enhancement module for performing initial prediction on the sample data through the relation extraction model, marking the misrecognized samples as complex samples and outputting them, constructing positive and negative sample sets from the complex samples, and performing reinforcement training on the sample data through contrastive learning and back-propagation optimization until convergence;
an index evaluation module for constructing an evaluation index system for the relation extraction model and comprehensively evaluating the performance of the relation extraction model through the evaluation index system.
A third aspect of the present invention discloses a complex sample relation extraction device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the complex sample relation extraction method disclosed in any embodiment of the first aspect of the invention.
The computer device may be a terminal comprising a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program. The network interface of the computer device communicates with external terminals through a network connection. The computer program, when executed by the processor, implements the complex sample relation extraction method. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input device may be a touch layer covering the display screen, keys, a trackball, or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, mouse, or the like.
A fourth aspect of the present invention discloses a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the complex sample relation extraction method disclosed in any embodiment of the first aspect of the invention.
Those skilled in the art will appreciate that all or part of the above-described embodiment methods may be implemented by a computer program stored on a non-volatile computer-readable storage medium; when executed, the program may perform the steps of the embodiments of the complex sample relation extraction method described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Alternatively, the modules of the present invention may be stored in a computer-readable storage medium if implemented as software functional modules and sold or used as a stand-alone product. Based on this understanding, the part of the technical solution of the embodiments of the present invention that contributes beyond the related art may be embodied as a computer software product stored in a storage medium, including several instructions that cause a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, RAM, ROM, a magnetic disk, or an optical disk.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (10)

1. A complex sample relation extraction method, the method comprising:
S1: acquiring sample data and preprocessing the sample data;
S2: constructing a relation extraction model with a BERT model as the language encoder, and performing fine-tuning optimization on the relation extraction model;
S3: performing initial prediction on the sample data through the relation extraction model, marking the misrecognized samples as complex samples and outputting them, constructing positive and negative sample sets from the complex samples, and performing reinforcement training on the sample data through contrastive learning and back-propagation optimization until convergence;
S4: constructing an evaluation index system for the relation extraction model, and comprehensively evaluating the performance of the relation extraction model through the evaluation index system.
2. The complex sample relation extraction method of claim 1, wherein preprocessing the sample data comprises:
S11: performing word segmentation on the sample data through a tokenizer, decomposing each text sentence segment into word units;
S12: identifying the head entity and tail entity of each text sentence segment in the sample data, and marking the positions of the head and tail entities in the segmented sentence based on character indexes;
S13: acquiring a stop word list, and removing from the sample data the word units whose semantic contribution is below a threshold according to the stop word list;
S14: performing length standardization on the text sentence segments in the sample data according to the input requirements of the relation extraction model.
3. The complex sample relation extraction method of claim 1, wherein the fine-tuning optimization of the relation extraction model comprises:
S21: updating the pre-trained weights of the relation extraction model based on historical sample data;
S22: inserting a classification token for the relation classification task at the beginning of each sample data sequence, and inserting a segmentation token for separating different sentences at the end of each sentence segment;
S23: splicing entity relation templates, and converting the relation extraction task into a gap-filling task through prompt instructions to perform relation prediction;
S24: constructing a label mapping between relation predictive words and special tokens in the gap-filling task, and expressing the relation between entities through the mapped predictive words;
S25: presetting the learning rate, batch size, and training epochs of the relation extraction model, and training the fine-tuning process through a cross-entropy loss function until the model converges.
4. The complex sample relation extraction method of claim 1, wherein constructing positive and negative sample sets from the complex samples comprises:
S31: traversing all complex samples, and selecting for each complex sample the samples that have the same relation type and a correct relation prediction as positive samples;
S32: selecting for each complex sample the samples that have a different relation type as negative samples.
5. The complex sample relation extraction method of claim 4, wherein the reinforcement training of the sample data through contrastive learning and back-propagation optimization until convergence comprises:
S33: acquiring the feature representation of each sample through the relation extraction model;
S34: calculating the feature representation distance between the samples of each positive pair and each negative pair through cosine similarity;
S35: minimizing the feature representation distance between positive sample pairs and maximizing the feature representation distance between negative sample pairs using a contrastive learning loss function;
S36: performing reinforcement training on the complex samples through back propagation with the Adam optimizer until the contrastive loss function converges.
6. The complex sample relation extraction method of claim 5, wherein the expression of the contrastive learning loss function is:

$$L(\theta, Y, S_i, S_j) = \frac{1}{M}\sum_{i=1}^{M}\Big[\, y_i\, D_\theta(S_i)^2 + (1 - y_i)\,\max\big(0,\; m - D_\theta(S_j)\big)^2 \Big]$$

wherein L(θ, Y, S_i, S_j) represents the contrastive learning loss based on the model parameters θ, the output tensor Y, the positive sample pair S_i, and the negative sample pair S_j; M represents the total number of samples in the correct and incorrect prediction sets; y_i represents the relation label; m represents a trainable parameter controlling the distance limit between positive and negative sample pairs; and D_θ represents the distance calculation formula.
7. The complex sample relation extraction method of claim 1, wherein constructing an evaluation index system for the relation extraction model and comprehensively evaluating the performance of the relation extraction model through the evaluation index system comprises:
S41: constructing the evaluation index system from accuracy, recall, and F1 score;
S42: using accuracy to calculate the ratio between the number of samples correctly classified by the relation extraction model and the total number of samples, wherein the samples comprise correctly classified positive and negative samples and incorrectly classified positive and negative samples;
S43: using recall to calculate the ratio between the number of positive samples correctly classified by the relation extraction model and the total number of actual positive samples, and outputting it as the recognition sensitivity;
S44: calculating the F1 score of the relation extraction model from the precision and the recall, and evaluating the performance of the current relation extraction model according to the ranking of F1 scores.
8. A complex sample relation extraction system, the system comprising:
a data preprocessing module for acquiring sample data and preprocessing the sample data;
a model fine-tuning module for constructing a relation extraction model with a BERT model as the language encoder and performing fine-tuning optimization on the relation extraction model;
a complex sample enhancement module for performing initial prediction on the sample data through the relation extraction model, marking the misrecognized samples as complex samples and outputting them, constructing positive and negative sample sets from the complex samples, and performing reinforcement training on the sample data through contrastive learning and back-propagation optimization until convergence;
an index evaluation module for constructing an evaluation index system for the relation extraction model and comprehensively evaluating the performance of the relation extraction model through the evaluation index system.
9. A complex sample relation extraction apparatus, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the complex sample relation extraction method of any one of claims 1 to 7.
10. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the complex sample relation extraction method of any one of claims 1 to 7.

