CN117114013A - Semantic annotation method and device based on small sample

Info

Publication number: CN117114013A
Application number: CN202311319086.XA
Authority: CN
Inventors: 魏炜; 贺智辰
Original assignee: Peking University Shenzhen Graduate School
Current assignee: Peking University Shenzhen Graduate School
Priority date: 2023-10-12
Filing date: 2023-10-12
Publication date: 2023-11-24
Anticipated expiration: 2043-10-12
Also published as: CN117114013B

Abstract

The invention is applicable to the technical field of text processing and provides a semantic annotation method and device based on small samples. The method comprises the following steps: acquiring user-defined semantic tags, a document set annotated with the user-defined semantic tags, and a text to be annotated; using the user-defined semantic tags and the document set as a training set, training a model through a machine learning algorithm to generate an automatic annotation model; annotating the text to be annotated with the automatic annotation model; and outputting the annotated text and storing it in a database. Embodiments of the invention address the problems of high training cost, difficulty of personalization, and difficulty of ensuring accuracy.

Description

Semantic annotation method and device based on small sample

Technical Field

The invention relates to the field of text processing, in particular to a semantic annotation method and device based on a small sample.

Background

In both life and work, whether for collecting information, academic study, or writing, users need to mark text or other important content that they have learned or read, so that the relevant information can be found quickly when it is needed later.

How to annotate text quickly, efficiently, and accurately according to its semantics, so as to ease retrieval and improve working efficiency, has become a difficult problem. To address it, the prior art adopts either conventional approaches such as manual annotation, or automatic annotation by a model trained on a large number of samples. However, manual annotation is time-consuming and inefficient, while large-sample training is costly and difficult, and with it personalization and accuracy are hard to achieve.

Disclosure of Invention

An embodiment of the invention provides a semantic annotation method based on a small sample, which aims to solve the problems of high training cost, difficulty of personalization, and difficulty of guaranteeing accuracy.

The embodiment of the invention is realized in such a way that a semantic annotation method based on a small sample is provided, which comprises the following steps:

acquiring a user-defined semantic tag, a document set marked based on the user-defined semantic tag and a text to be marked;

using the self-defined semantic tags and the document set as a training set, training a model through a machine learning algorithm to generate an automatic labeling model;

marking the text to be marked by using the automatic marking model;

outputting the marked text, and storing the marked text into a database.

Still further, the method further comprises the steps of:

receiving text reviewed by a user;

comparing the text after review with the text after labeling, and judging whether labels labeled in the two texts are the same or not;

if the labels marked in the two texts are different, adding the reviewed text and the modified label into a training set, carrying out iterative updating on the automatic marking model, and simultaneously storing the reviewed text into a database to replace the original marked text.

Still further, the method further comprises the steps of:

receiving semantic search keywords input by a user;

performing tag matching according to the semantic search keywords and texts stored in the database;

if the labels of the semantic search keywords exist in the text of the database, displaying the content of the labels marked with the semantic search keywords;

and if the text of the database does not contain the label of the semantic search keyword, outputting a prompt of not retrieving the related information.

Still further, the content of the tag includes one or any combination of words, sentences, paragraphs, or documents.

Further, the document set comprises a plurality of documents marked on the basis of the self-defined semantic tags;

wherein each custom semantic tag corresponds to 3 to 5 documents.

The embodiment of the invention also provides a semantic annotation device based on the small sample, which comprises:

the labeling information acquisition unit is used for acquiring user-defined semantic labels, a document set labeled based on the user-defined semantic labels and texts to be labeled;

the automatic labeling model generating unit is used for establishing a training model by taking the self-defined semantic tag and the document set as training sets through a machine learning algorithm to generate an automatic labeling model;

the text labeling unit is used for labeling the text to be labeled by utilizing the automatic labeling model;

the text output unit is used for outputting the marked text and storing the marked text into the database.

Still further, the apparatus further comprises:

the first receiving unit is used for receiving the text reviewed by the user;

the label judging unit is used for comparing the received text after review with the text after labeling and judging whether labels labeled in the two texts are the same or not;

and the data updating unit is used for, when the judgment shows that the labels marked in the two texts are different, adding the reviewed text and the modified labels into the training set, iteratively updating the automatic marking model, and storing the reviewed text into the database to replace the original marked text.

Still further, the apparatus further comprises:

the second receiving unit is used for receiving semantic search keywords input by a user;

the keyword searching unit is used for performing tag matching on the semantic search keywords input by the user and the texts stored in the database;

the first display unit is used for displaying, when the label matching result shows that a label of the semantic search keyword exists in the text of the database, the content marked with that label;

and the second display unit is used for outputting, when the label matching result shows that no label of the semantic search keyword exists in the text of the database, a prompt that no related information was retrieved.

Because the documents are annotated with user-defined semantic tags, no semantic tag library needs to be established in advance; the tags are defined by the user, which meets the user's individual requirements. User-defined semantic tags increase the richness of the tags and personalize the annotation: the user can define tags according to his or her own understanding of the semantics, which makes retrieval more convenient and flexible.

In addition, the automatic labeling model is built with the self-defined semantic tags and the labeled documents as the training set, so training is based on small samples and the training cost is reduced. Because personalized tags need only a small number of training samples, training completes quickly and can be iterated repeatedly to improve accuracy, whereas repeated iteration is difficult when training on a large data set.

Drawings

FIG. 1 is a flow chart of one embodiment of a small sample-based semantic annotation method provided by the present invention;

FIG. 2 is a flow chart of another embodiment of a small sample-based semantic annotation method provided by the present invention;

FIG. 3 is a flow chart of yet another embodiment of a small sample-based semantic annotation method provided by the present invention;

FIG. 4 is a schematic structural diagram of one embodiment of a small sample-based semantic annotation apparatus provided by the present invention;

FIG. 5 is a schematic structural diagram of another embodiment of a small sample-based semantic annotation apparatus provided by the present invention;

FIG. 6 is a schematic structural diagram of still another embodiment of the small sample-based semantic annotation apparatus provided by the present invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

In the prior art, conventional approaches such as manual labeling are time-consuming and inefficient, while training on a large number of samples is costly and hard to iterate repeatedly, so that personalization and high accuracy are difficult to achieve. The invention therefore provides a semantic labeling method based on small samples, which personalizes labeling through user-defined tags and, because training uses small samples, allows rapid repeated iteration to improve accuracy.

Example 1

Referring to fig. 1, fig. 1 is a flowchart of an embodiment of a small sample-based semantic annotation method according to the present invention.

In step S101, a user-defined semantic tag, a document set labeled based on the user-defined semantic tag, and a text to be labeled are obtained;

it may be appreciated that in the embodiment of the present invention, a user-defined semantic tag is a tag whose semantics the user defines according to his or her own understanding or preference. Unlike general or public tags, which are defined in some fields to satisfy group needs, user-defined semantic tags are defined to satisfy individual needs.

It will be appreciated that in the embodiment of the present invention, the user generally defines such a tag for a word, sentence, paragraph, or one or more documents that he or she has read, based on the core semantics that the text expresses. That is, the tag represents the content of the labeled word, sentence, or paragraph: either the meaning of the tag is related to the meaning of the labeled content, or the tag carries a meaning assigned by the user, so that it serves as a prompt or plays another self-defined role that the user can understand.

For example, where some enterprise pollution abatement measures are introduced but no "environmental measures" are directly mentioned, a user may define the label "environmental measures" to indicate that the paragraph is relevant to environmental measure content;

for another example, a user may browse a document for authoring purposes and tag "important" parts that the user believes need to refer to/review again during authoring.

It will be appreciated that in embodiments of the present invention, a document set labeled based on custom semantic labels refers to a plurality of documents for which labeling has been completed according to self-defined labels.

It can be appreciated that in the embodiment of the present invention, the text to be annotated is an unlabeled document, that is, a document that the user wants annotated automatically by the machine learning model; there may be one or more such documents.
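By way of illustration only, the three inputs of step S101 may be represented as in the following minimal sketch. The class names Annotation and Document, the character-offset representation of a label, and the sample texts are assumptions made for this example; the disclosure does not prescribe a data structure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """One user-defined tag attached to a character span (hypothetical layout)."""
    tag: str    # user-defined semantic tag, e.g. "environmental measures"
    start: int  # inclusive character offset
    end: int    # exclusive character offset


@dataclass
class Document:
    """A text plus its annotations (empty while the text is still unlabeled)."""
    text: str
    annotations: list = field(default_factory=list)

    def tagged_content(self, tag):
        """Return every span of the text annotated with the given tag."""
        return [self.text[a.start:a.end] for a in self.annotations if a.tag == tag]


# Inputs of step S101: custom tags, an annotated document set, text to annotate.
custom_tags = ["environmental measures", "important"]
annotated = Document(
    "The plant installed new scrubbers. Revenue rose last year.",
    [Annotation("environmental measures", 0, 34)],
)
to_annotate = Document("The factory now recycles all of its waste water.")

print(annotated.tagged_content("environmental measures"))
```

Storing offsets rather than copies of the labeled text keeps a label tied to its position, which the comparison in step S202 relies on.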

In step S102, a self-defined semantic tag and a document set are used as training sets, a training model is established through a machine learning algorithm, and an automatic labeling model is generated;

it can be appreciated that in the embodiment of the present invention, the user-defined semantic tags and the plurality of documents that the user has completed annotating with the user-defined semantic tags are used as a training set for machine learning.

It will be appreciated that in embodiments of the invention, the machine learning process involves splitting the training set data, for example into a larger subset for training and a remaining smaller subset for testing, building predictive models on the training subset, and selecting the best model on the test subset. To obtain the best model, hyperparameter optimization may also be performed; hyperparameters are essentially parameters of the machine learning algorithm itself, and they directly affect the learning process and the predictive performance. In other embodiments, the training set data may also be divided into three parts, for training, validation, and testing respectively, without limitation.
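The split described above may be sketched as follows. This is an illustrative assumption rather than the disclosed implementation: the function name split_dataset, the 60/20/20 proportions, and the fixed shuffle seed are all hypothetical. Setting validation to 0 gives the two-way split of the main embodiment.

```python
import random


def split_dataset(samples, train=0.6, validation=0.2, seed=0):
    """Shuffle samples and split them into train / validation / test lists.

    The test share is whatever remains after train and validation.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Ten hypothetical (text, tag) samples.
samples = [(f"document {i}", "tag_a" if i % 2 else "tag_b") for i in range(10)]
train_set, val_set, test_set = split_dataset(samples)
print(len(train_set), len(val_set), len(test_set))
```

With only 3 to 5 documents per tag, a held-out subset is very small, so in practice the split would probably be made over labeled sentences or paragraphs rather than whole documents; that choice is left open here.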

It will be appreciated that in embodiments of the present invention, machine learning algorithms can be broadly divided into three types: supervised learning, unsupervised learning, and reinforcement learning. The machine learning algorithm may be any one of a decision tree algorithm, a random forest algorithm, a support vector machine algorithm, a deep learning algorithm, a logistic regression algorithm, a clustering algorithm, a Bayesian classifier, and a neural network algorithm. Taking the random forest algorithm as an example, when using the randomForest R package, two common hyperparameters are typically optimized: mtry and ntree. mtry (max_features in scikit-learn) is the number of variables randomly sampled as candidates at each split, and ntree (n_estimators in scikit-learn) is the number of trees to grow.
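One common way to optimize such hyperparameters is an exhaustive grid search, sketched below under stated assumptions. The function grid_search and the candidate values are hypothetical, and toy_score is only a stand-in for "train a forest with these settings and score it on held-out data"; no real random forest is trained here.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return (best_params, best_score).

    `evaluate` is a caller-supplied function that trains a model with the
    given hyperparameters and returns a validation score (higher is better).
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values, named after the R randomForest arguments in the text.
grid = {"ntree": [50, 100, 200], "mtry": [1, 2, 3]}


def toy_score(ntree, mtry):
    """Stand-in scoring function; NOT a real model evaluation."""
    return -abs(ntree - 100) / 100 - abs(mtry - 2)


best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

In a real pipeline, evaluate would fit the chosen algorithm on the training subset and return its accuracy on the validation or test subset.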

It can be appreciated that in the embodiment of the present invention, the automatic labeling model is generated by training on the features of the user-defined semantic labels and of the document set labeled with those labels, both of which are provided in the sample set.
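As a minimal sketch of building a model from a handful of labeled samples, the example below uses a nearest-profile classifier over word counts. This technique is swapped in for illustration because it needs no external library; the disclosure leaves the algorithm open (decision tree, random forest, and so on). All function names and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lower-case word counts; a deliberately simple feature extractor."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """Build one summed word-count profile per tag from (text, tag) pairs."""
    profiles = defaultdict(Counter)
    for text, tag in examples:
        profiles[tag].update(bag_of_words(text))
    return dict(profiles)


def predict(model, text):
    """Return the tag whose profile is most similar to the text."""
    features = bag_of_words(text)
    return max(model, key=lambda tag: cosine(model[tag], features))


# Three hypothetical labeled samples per user-defined tag.
training_set = [
    ("The plant installed filters to cut smoke emissions.", "environmental measures"),
    ("Waste water is treated before it leaves the factory.", "environmental measures"),
    ("The company recycles scrap metal and reduces emissions.", "environmental measures"),
    ("Quarterly revenue grew and profit margins improved.", "financial results"),
    ("The firm reported higher profit and lower debt.", "financial results"),
    ("Revenue from exports fell while costs grew.", "financial results"),
]
model = train(training_set)
print(predict(model, "New filters reduce factory emissions."))
```

Because each profile is just a sum of counts, adding a reviewed sample later only requires updating one profile, which suits the iterative update described in the second embodiment.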

In step S103, labeling the text to be labeled by using an automatic labeling model;

it can be understood that in the embodiment of the invention, the text to be annotated is input into the trained automatic annotation model, which computes a result: based on the features extracted from the provided groups of samples, it predicts which words, sentences, or paragraphs of the text need to be annotated.
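The labeling step may be sketched at sentence granularity as follows. The sentence-splitting rule, the (start, end, tag) span format, and keyword_classifier, which merely stands in for the trained automatic labeling model, are assumptions made for this example.

```python
import re


def annotate(text, classify):
    """Split text into sentences and tag each one the classifier accepts.

    `classify` maps a sentence to a tag, or to None when no tag applies.
    Returns a list of (start, end, tag) character spans.
    """
    spans = []
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        sentence = match.group().strip()
        if not sentence:
            continue
        tag = classify(sentence)
        if tag is not None:
            # Skip the leading whitespace that strip() removed.
            start = match.start() + match.group().index(sentence)
            spans.append((start, start + len(sentence), tag))
    return spans


def keyword_classifier(sentence):
    """Hypothetical stand-in for the trained automatic labeling model."""
    lowered = sentence.lower()
    if "emission" in lowered or "waste" in lowered:
        return "environmental measures"
    return None


text = "Sales rose in May. The plant cut its emissions by half."
for start, end, tag in annotate(text, keyword_classifier):
    print(tag, "->", text[start:end])
```

The same loop could run over words or paragraphs instead of sentences by changing only the segmentation pattern.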

In step S104, the annotated text is output and saved to the database.

It can be appreciated that in the embodiment of the present invention, after the model completes the automatic labeling, the labeled text is output, and at the same time it is saved to the database for later searching or retrieval.
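Saving the labeled text may be sketched with Python's built-in sqlite3 module as follows. The table name annotated_text, its columns, and the JSON encoding of the label spans are assumptions for illustration; the disclosure does not specify a database or schema.

```python
import json
import sqlite3


def open_database(path=":memory:"):
    """Open the database and make sure the (hypothetical) table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotated_text ("
        "doc_id TEXT PRIMARY KEY, body TEXT NOT NULL, spans TEXT NOT NULL)"
    )
    return conn


def save_annotated(conn, doc_id, text, spans):
    """Insert or replace one labeled document; spans are (start, end, tag)."""
    conn.execute(
        "INSERT OR REPLACE INTO annotated_text VALUES (?, ?, ?)",
        (doc_id, text, json.dumps(spans)),
    )
    conn.commit()


def load_annotated(conn, doc_id):
    """Return (text, spans) for a stored document, or None if absent."""
    row = conn.execute(
        "SELECT body, spans FROM annotated_text WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        return None
    return row[0], [tuple(span) for span in json.loads(row[1])]


conn = open_database()  # in-memory database, for illustration only
text = "The plant cut its emissions by half."
save_annotated(conn, "doc-1", text, [(0, 36, "environmental measures")])
print(load_annotated(conn, "doc-1"))
```

INSERT OR REPLACE is used so that a later reviewed version of the same document overwrites the original labeled text.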

Custom semantic tags increase the richness of the tags and personalize the labeling: the user can define tags according to his or her own understanding of the semantics, which makes retrieval more convenient and flexible.

Example 2

Referring to fig. 2, fig. 2 is a flowchart of another embodiment of a small sample-based semantic annotation method according to the present invention. This embodiment further optimizes the solution of Example 1; on the basis of the steps of Example 1, the method further includes the following steps:

in step S201, receiving text reviewed by the user;

it can be understood that, in the embodiment of the present invention, the user reviews the annotated text output in step S104 of Example 1. If the review finds an annotation that needs to be changed, the user modifies it; if no modification is needed, the machine learning model has annotated accurately and has correctly expressed what the user wanted to annotate.

It will be appreciated that in embodiments of the present invention, after a user review, the reviewed text is entered into the system, which receives the user's reviewed text.

In step S202, comparing the text after review with the text after labeling, and judging whether the labels labeled in the two texts are the same;

it will be appreciated that in embodiments of the present invention, the annotated text stored in the database is retrieved and compared with the reviewed text uploaded by the user.

It can be understood that in the embodiment of the present invention, determining whether the labels marked in the two texts are identical includes determining whether the positions of the labels correspond one to one, whether the labels used are identical, and whether the meanings represented by the labels are identical. That is, the two documents are compared for complete consistency, and if they differ at any one place, the labels marked in the two texts are determined not to be identical.
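The comparison may be sketched as follows, assuming each label is stored as a (start, end, tag) span. Equality of tag names is used here as an approximation of "the meanings represented by the labels are identical"; both function names are hypothetical.

```python
def labels_identical(machine_spans, reviewed_spans):
    """True only if both texts carry the same tags at the same positions.

    Order is ignored; any missing, extra, moved, or renamed label makes
    the two annotations differ.
    """
    return sorted(machine_spans) == sorted(reviewed_spans)


def label_differences(machine_spans, reviewed_spans):
    """Return (removed_by_user, added_by_user) as sorted lists of spans."""
    machine, reviewed = set(machine_spans), set(reviewed_spans)
    return sorted(machine - reviewed), sorted(reviewed - machine)


machine = [(0, 18, "important"), (19, 55, "environmental measures")]
reviewed = [(19, 55, "environmental measures"), (0, 18, "financial results")]

print(labels_identical(machine, reviewed))
print(label_differences(machine, reviewed))
```

The list of differences is not required by the judgment itself, but it identifies which modified labels are worth adding to the training set in step S203.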

In step S203, if the labels marked in the two texts are different, the reviewed text and the modified label are added into the training set, and the automatic marking model is iteratively updated, and meanwhile, the reviewed text is saved to the database to replace the original marked text.

It can be understood that in the embodiment of the present invention, if the labels marked in the two texts are different, that is, if there is a difference between the labels marked in the two texts, the reviewed text and the modified label are added as new training samples to the training set, and the automatic marking model is iteratively updated through a machine learning algorithm.

It can be appreciated that in the embodiment of the invention, the reviewed text is saved to the database at the same time to replace the original marked text, so as to update the data saved in the database.
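The update of step S203 may be sketched as follows. For brevity the database is a plain dictionary, the review is assumed to change only the labels and not the text, and retrain is a hypothetical stand-in for refitting the automatic labeling model.

```python
def apply_review(training_set, database, doc_id, reviewed_spans, retrain):
    """Fold a user's review back into the training set and the database.

    database maps doc_id -> (text, spans); spans are (start, end, tag).
    Returns a retrained model when the labels changed, otherwise None.
    """
    text, machine_spans = database[doc_id]
    if sorted(machine_spans) == sorted(reviewed_spans):
        return None  # labels agree: nothing to update
    changed = [span for span in reviewed_spans if span not in machine_spans]
    for start, end, tag in changed:
        training_set.append((text[start:end], tag))  # new training samples
    database[doc_id] = (text, list(reviewed_spans))  # replace stored labels
    return retrain(training_set)  # iterative model update


def retrain(samples):
    """Hypothetical stand-in: a real system would refit the model here."""
    return {"trained_on": len(samples)}


training_set = [("Filters cut smoke emissions.", "environmental measures")]
text = "Sales rose in May. The plant cut its emissions by half."
database = {"doc-1": (text, [(0, 18, "environmental measures")])}
reviewed = [(19, 55, "environmental measures")]

model = apply_review(training_set, database, "doc-1", reviewed, retrain)
print(model, training_set[-1])
```

A second call with the same reviewed labels returns None, because the stored text already matches the review.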

Because the automatic labeling model is built with the self-defined semantic labels and the labeled documents as the training set, training is based on small samples and the training cost is reduced. Personalized labels need only a small number of training samples, so training completes quickly and can be iterated repeatedly to improve accuracy, whereas repeated iteration is difficult when training on a large-scale data set.

Example 3

Referring to fig. 3, fig. 3 is a flowchart of still another embodiment of a small sample-based semantic annotation method according to the present invention. This embodiment further optimizes the solution of Example 2; on the basis of the steps of Example 2, the method further includes the following steps:

in step S301, a semantic search keyword input by a user is received;

it will be appreciated that in embodiments of the invention, a user searches the database by entering semantic search keywords, to obtain the annotated words, sentences, paragraphs, or texts, or any combination of them, that are associated with the keywords.

It will be appreciated that in embodiments of the present invention, after the system receives the semantic search keywords entered by the user, the data stored in the database is searched using the search keywords.

In step S302, tag matching is performed according to the semantic search keywords and the text stored in the database;

it can be appreciated that in the embodiment of the present invention, searching the data stored in the database by the search keyword includes comparing the keyword with the tags present in all files stored in the database, and determining whether a tag identical to the search keyword exists.

In step S303, if a tag of a semantic search keyword exists in the text of the database, displaying the content of the tag labeled with the semantic search keyword;

it may be appreciated that in the embodiment of the present invention, if the text of the database has the same tag as the search keyword, the content of the tag labeled with the semantic search keyword is displayed, where the content may be one or more words, sentences, paragraphs or documents, or other indicated content.

In step S304, if the text of the database does not have the tag of the semantic search keyword, a prompt for not retrieving the related information is output.

It may be appreciated that in the embodiment of the present invention, if no tag identical to the search keyword exists in the text of the database, the system prompts that no relevant information was retrieved.

The semantic keyword search step establishes a connection between the custom tags and the search keywords, checks the accuracy of the custom tags, enables their calibration, and makes semantic search more convenient.
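Steps S301 to S304 may be sketched as follows. The dictionary standing in for the database, the case-insensitive comparison, and the wording of the prompt are assumptions made for this example.

```python
NO_RESULT = "No related information was retrieved."


def search_by_tag(database, keyword):
    """Return every piece of content labeled with a tag equal to the keyword.

    database maps doc_id -> (text, spans); spans are (start, end, tag).
    Matching is an exact comparison with stored tags, ignoring case.
    """
    wanted = keyword.strip().lower()
    return [
        text[start:end]
        for text, spans in database.values()
        for start, end, tag in spans
        if tag.lower() == wanted
    ]


def show_results(hits):
    """Format the labeled content for display, or the no-result prompt."""
    return "\n".join(hits) if hits else NO_RESULT


database = {
    "doc-1": (
        "Sales rose in May. The plant cut its emissions by half.",
        [(19, 55, "environmental measures")],
    ),
}

print(show_results(search_by_tag(database, "Environmental Measures")))
print(show_results(search_by_tag(database, "budget")))
```

Returning an empty list for a miss, and turning it into a prompt only at display time, keeps the matching step independent of the user interface.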

Example 4

It will be appreciated that in embodiments of the invention, the content of the tag includes one or any combination of words, sentences, paragraphs, or documents.

Example 5

It can be appreciated that in the embodiment of the invention, the document set contains a plurality of documents labeled with custom semantic tags; wherein each custom semantic tag corresponds to 3 to 5 documents.

It can be understood that in the embodiment of the invention, when the user adds a personalized tag, only a small number of samples need to be provided as the training set; that is, each custom semantic tag appears in 3 to 5 uploaded training set documents. Training is therefore fast and can be iterated repeatedly, which greatly improves efficiency and allows accuracy to be improved through repeated iteration.
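A simple check that an uploaded document set respects the 3 to 5 documents per tag stated above might look like the following sketch. The function name and the (doc_id, tags) input format are hypothetical.

```python
from collections import defaultdict


def tags_outside_range(documents, low=3, high=5):
    """Map each tag to its document count when the count is not in [low, high].

    documents is a list of (doc_id, tags) pairs, where tags is the set of
    user-defined tags that appear in that document.
    """
    docs_per_tag = defaultdict(set)
    for doc_id, tags in documents:
        for tag in tags:
            docs_per_tag[tag].add(doc_id)
    return {
        tag: len(ids)
        for tag, ids in docs_per_tag.items()
        if not low <= len(ids) <= high
    }


documents = [
    ("d1", {"important", "environmental measures"}),
    ("d2", {"important"}),
    ("d3", {"important"}),
    ("d4", {"environmental measures"}),
]
print(tags_outside_range(documents))
```

Here "important" appears in three documents and passes, while "environmental measures" appears in only two and is reported.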

Example 6

Referring to fig. 4, fig. 4 is a schematic structural diagram of an embodiment of a small sample-based semantic annotation apparatus according to the present invention. As an implementation of the small sample-based semantic annotation method shown in fig. 1, this embodiment provides a small sample-based semantic annotation apparatus, where an embodiment of the apparatus corresponds to the method embodiment shown in fig. 1, and the apparatus includes:

the annotation information obtaining unit 101 is configured to obtain a semantic tag customized by a user, a document set annotated based on the customized semantic tag, and a text to be annotated;

the automatic labeling model generating unit 102 is configured to build a training model by using a self-defined semantic tag and a document set as a training set through a machine learning algorithm to generate an automatic labeling model;

a text labeling unit 103, configured to label a text to be labeled by using an automatic labeling model;

and the text output unit 104 is used for outputting the marked text and storing the marked text into a database.

The embodiment of the invention has the beneficial effects that custom semantic labels increase the richness of the labels and personalize the labeling: the user can define labels according to his or her own understanding of the semantics, which makes retrieval more convenient and flexible.

Example 7

Referring to fig. 5, fig. 5 is a schematic structural diagram of another embodiment of the small sample-based semantic annotation apparatus according to the present invention. As a further optimization of the device of Example 6, this embodiment provides a semantic labeling device based on a small sample, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device further includes:

a first receiving unit 201, configured to receive text reviewed by a user;

a tag judgment unit 202, configured to compare the received text after review with the text after labeling, and judge whether the tags labeled in the two texts are the same;

and the data updating unit 203 is configured to, according to the determination, add the reviewed text and the modified label to the training set if the labels marked in the two texts are different, and iteratively update the automatic marking model, and save the reviewed text to the database instead of the original marked text.

The embodiment has the advantages that the automatic labeling model is built with the self-defined semantic tags and the labeled documents as the training set, so training is based on small samples and the training cost is reduced. Because personalized tags need only a small number of training samples, training completes quickly and can be iterated repeatedly to improve accuracy, whereas repeated iteration is difficult when training on a large data set.

Example 8

Referring to fig. 6, fig. 6 is a schematic structural diagram of still another embodiment of the small sample-based semantic annotation apparatus according to the present invention. As a further optimization of the device of Example 7, this embodiment provides a semantic annotation device based on a small sample, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 3, and the device further includes:

a second receiving unit 301, configured to receive a semantic search keyword input by a user;

a keyword searching unit 302, configured to perform tag matching with a text stored in a database according to a semantic search keyword input by a user;

a first display unit 303, configured to determine, according to a result of the tag matching, that if a tag of a semantic search keyword exists in a text of the database, display content of the tag labeled with the semantic search keyword;

and the second display unit 304 is configured to determine, according to the result of the tag matching, that if the tag of the semantic search keyword does not exist in the text of the database, output a prompt that the related information is not retrieved.

The embodiment of the invention has the beneficial effects that the added keyword searching unit establishes the connection between the custom tags and the search keywords, checks the accuracy of the custom tags, enables their calibration, and makes semantic search more convenient.

Example 9

The embodiment provides a semantic annotation system based on a small sample, which comprises: a memory and a processor;

the memory is used for storing one or more programs which, when executed by the processor, cause the processor to implement the steps of the small sample-based semantic annotation method of any of the embodiments described above.

The semantic annotation system based on the small sample has the advantages that no semantic tag library needs to be established in advance; the semantic tags are defined by the user, which increases the richness of the tags and personalizes the annotation, since the user can define tags according to his or her own understanding of the semantics, and retrieval becomes more convenient and flexible. At the same time, because training is based on small samples, it completes quickly, and accuracy can be improved through repeated iteration.

Example 10

The embodiment of the invention also provides a computer readable storage medium, and the computer readable storage medium stores program instructions which, when executed by a processor, implement the steps of the small sample-based semantic annotation method of any of the embodiments.

The storage medium has the advantages that no semantic tag library needs to be established in advance; the semantic tags are defined by the user, which increases the richness of the tags and personalizes the labeling, since the user can define tags according to his or her own understanding of the semantics, and retrieval becomes more convenient and flexible. At the same time, because training is based on small samples, it completes quickly, and accuracy can be improved through repeated iteration.

The invention is operational with numerous general purpose or special purpose computer system environments or configurations.

For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

In summary, the semantic labeling method and device based on the small sample provided by the embodiment of the invention obtain the user-defined semantic labels, the document set labeled with the user-defined semantic labels, and the text to be labeled; use the self-defined semantic labels and the document set as a training set to train a model through a machine learning algorithm and generate an automatic labeling model; label the text to be labeled with the automatic labeling model; and output the labeled text and store it in a database. The problems of high training cost, difficulty of personalization, and difficulty of ensuring accuracy are thereby addressed: labeling is personalized according to user definitions, and training on small samples allows rapid repeated iteration to improve accuracy.

It is understood that, under the teachings of the above embodiments, those skilled in the art can combine the technical solutions of the various embodiments to obtain further technical solutions.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims

1. A semantic annotation method based on a small sample, comprising:

marking the text to be marked by using the automatic marking model;

outputting the marked text, and storing the marked text into a database.

2. The small sample-based semantic annotation method of claim 1, wherein the method further comprises the steps of:

receiving text reviewed by a user;

3. The small sample-based semantic annotation method of claim 2, wherein the method further comprises the steps of:

receiving semantic search keywords input by a user;

4. The small sample based semantic annotation method of claim 3, wherein the content of the tag comprises one or any combination of words, sentences, paragraphs, or documents.

5. The small sample-based semantic annotation method of any one of claims 1-4, wherein the set of documents comprises a plurality of documents annotated based on the custom semantic tags;

wherein each custom semantic tag corresponds to 3 to 5 documents.

6. A small sample-based semantic annotation apparatus comprising:

7. The small sample-based semantic annotation apparatus of claim 6, wherein the apparatus further comprises:

the first receiving unit is used for receiving the text reviewed by the user;

8. The small sample-based semantic annotation apparatus of claim 7, wherein the apparatus further comprises:

9. A small sample-based semantic annotation system comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the steps of the small sample-based semantic annotation method according to any one of claims 1 to 6.

10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the small sample based semantic annotation method according to any of claims 1 to 6.