CN115526183A - Semantic consistency recognition method, storage medium and device for term context - Google Patents

Semantic consistency recognition method, storage medium and device for term context

Info

Publication number
CN115526183A
Authority
CN
China
Prior art keywords
term
sentence
consistency
context
terms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211197928.4A
Other languages
Chinese (zh)
Inventor
裘思科
张坚毅
陈杰
谢周兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MIGU Culture Technology Co Ltd
Original Assignee
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MIGU Culture Technology Co Ltd filed Critical MIGU Culture Technology Co Ltd
Priority to CN202211197928.4A priority Critical patent/CN115526183A/en
Publication of CN115526183A publication Critical patent/CN115526183A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method, a storage medium and a device for recognizing the semantic consistency of a term with its context. The recognition method comprises the following steps: acquiring text data of a preset scale; segmenting the text data into a data set and labeling it according to the distinction between positive and negative examples; performing sentence masking on the data set and splicing the target term with the data set, each sentence being spliced after the target term to form a training sample set; identifying the sentences in the training sample set as positive or negative examples; and determining a term consistency recognition model according to the recognition result of each sentence in the training sample set. The invention adopts deep learning to recognize the semantic consistency between a term and its context at risk positions.

Description

Semantic consistency recognition method, storage medium and device for term context
Technical Field
The invention belongs to the technical field of language processing, relates to a semantic consistency recognition method, and particularly relates to a semantic consistency recognition method, a storage medium and a device for terms in context.
Background
At present, in the process of writing, an expression one intends to produce may be misspelled, mistyped, or written irregularly for various reasons, such as input errors or differences in habitual expression. For target terms expressed as words or phrases, identifying their semantic consistency with the surrounding sentence context plays a very important role in finding and correcting such errors.
In general, the semantic consistency of terms can be treated as a classification problem: for a given set of target terms, a set of sentences containing those terms is collected as labeled data, and either a text classifier is trained over all terms or a binary classifier is constructed for each target term.
Therefore, how to provide a semantic consistency recognition method, storage medium and device for terms in context that overcomes the inability of the prior art to further improve recognition accuracy has become a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention aims to provide a semantic consistency recognition method, a storage medium and a device for terms in context, which are used to solve the problem that the prior art cannot further improve semantic consistency recognition accuracy.
To achieve the above and other related objects, one aspect of the present invention provides a method for recognizing the semantic consistency of terms in context, which includes: acquiring text data of a preset scale; segmenting the text data into a data set and marking it according to the distinction between positive and negative examples; performing sentence masking on the data set and splicing the target term with the data set, each sentence being spliced after the target term to form a training sample set; identifying the sentences in the training sample set as positive or negative examples; and determining a term consistency recognition model according to the recognition result of each sentence in the training sample set.
In an embodiment of the present invention, the step of obtaining text data of a preset scale includes: obtaining text data of a preset scale from comment data about people and events, or capturing it with a web crawler.
In an embodiment of the present invention, the step of segmenting the text data into data sets and marking the data sets according to the distinction between positive examples and negative examples includes: carrying out data cleaning on the text data, and removing special symbols which are not beneficial to training; dividing the cleaned text data into sentence sets according to heuristic rules; scanning each sentence in the sentence set according to a given term set, and checking whether a completely matched term character string or a term approximate string exists in the sentence; in response to there being a completely matching term string, recording the sentence as a positive example into a training database; in response to the existence of the term approximation string, the sentence is recorded as a negative example into the training database.
In an embodiment of the present invention, the step of performing sentence masking processing on the data set and splicing the target term with the data set includes: in a context sentence of the data set, masking a character string corresponding to the target term or a deformed character string related to the target term by using a masking symbol; and splicing the character strings corresponding to the target terms and the character strings corresponding to the masked context sentences through separators.
In an embodiment of the present invention, the step of identifying the sentences in the training sample set according to the positive examples and the negative examples includes: in response to a current sentence in the set of training samples being a positive example, marking the current sentence as 0; in response to a current sentence in the set of training samples being a negative example, the current sentence is labeled as 1.
In an embodiment of the present invention, after the step of determining the term consistency recognition model according to the recognition result of each sentence in the training sample set, the semantic consistency recognition method further includes: randomly ordering the training sample set; generating different model parameters from each randomly reordered pass over the training sample set; and selecting the model parameters corresponding to the best result to generate the term consistency recognition model.
In an embodiment of the invention, after the step of determining the term consistency recognition model according to the recognition result of each sentence in the training sample set, the method for semantic consistency recognition of the terms in context further includes: inputting a sentence to be detected; identifying terms and deformed terms in the sentence to be detected through pattern matching; in the sentence to be detected, masking the term and the deformed term; splicing the masked sentence to be detected with the term and the deformation term; and inputting the spliced sentences to be detected into the term consistency identification model for consistency identification.
In an embodiment of the present invention, after the step of inputting the spliced sentences to be detected into the term consistency identification model for consistency identification, the semantic consistency identification method of the terms in context further includes: judging whether the consistency identification result is consistent with the expected result or not; in response to the consistency identification result not being consistent with the expected result, recording an inconsistency condition, and reporting that there is an error in the use of the term or the morphed term.
To achieve the above and other related objects, another aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above semantic consistency recognition method for terms in context.
To achieve the above and other related objects, a last aspect of the present invention provides an electronic device, comprising: a processor and a memory; the memory is configured to store a computer program, and the processor is configured to execute the computer program stored by the memory to cause the electronic device to perform the semantic consistency recognition method in the context of the term.
As described above, the semantic consistency recognition method, storage medium, and device for the terms in the context according to the present invention have the following advantages:
The method adopts deep learning: based on a basic language model, it forces the model to learn context-based discrimination by removing the risk characters and constructing context-term pairs through splicing, thereby recognizing the semantic consistency between a term and its context at risk positions. By distinguishing positive from negative examples, the invention constructs negative samples for target terms, which existing recognition approaches lack; by jointly recognizing and training on term strings and their deformed strings, it supports dynamic expansion of the term set and achieves a multi-term classification effect; and by sentence masking it prevents the term itself from interfering with the context, forcing the model to learn to judge term consistency from context alone. Potential errors in term writing can then be identified during text proofreading as a basis for judging whether a term is expressed irregularly, which solves the false-alarm problem of purely rule-based replacement in term detection.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a method for semantic consistency recognition in context of terms of the present invention in one embodiment.
FIG. 2 is a flow chart of training data construction in an embodiment of a semantic consistency recognition method in context of terms of the present invention.
FIG. 3 is a flow chart illustrating a training prediction process of the semantic consistency recognition method in context of terms of the present invention in one embodiment.
FIG. 4 is a diagram illustrating the term consistency recognition model of the semantic consistency recognition method in context of terms according to an embodiment of the present invention.
Fig. 5 is a schematic structural connection diagram of an electronic device according to an embodiment of the invention.
Description of the element reference
5. Electronic device
51. Processor
52. Memory
S11 to S15. Method steps
Detailed Description
The following embodiments of the present invention are provided by way of specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure herein. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the drawings only show the components related to the present invention rather than the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
The semantic consistency identification method, the storage medium and the equipment for the term context adopt a deep learning technology to realize the semantic consistency identification of the term context of the risk position.
The principle and implementation of the semantic consistency recognition method, storage medium and device in the context of the term of the present embodiment will be described in detail below with reference to fig. 1 to 5, so that those skilled in the art can understand the semantic consistency recognition method, storage medium and device in the context of the term of the present embodiment without creative work.
Referring now to FIG. 1, a schematic flow chart diagram illustrating a method for semantic consistency recognition in context of terms in accordance with the present invention is shown. As shown in fig. 1, the method for identifying semantic consistency of terms in context specifically includes the following steps:
and S11, acquiring text data of a preset scale.
In one embodiment, text data of a preset scale is obtained from comment data of people and events or by using a web crawler capture mode.
And S12, segmenting and processing the text data into data sets, and marking the data sets according to the distinction of positive examples and negative examples.
Please refer to fig. 2, which illustrates the training data construction flow chart of the semantic consistency recognition method according to an embodiment of the present invention. As shown in fig. 2, S12 specifically includes the following steps:
(1) Perform data cleaning on the text data and remove special symbols that are not conducive to training.
Specifically, the raw large-scale text is obtained and cleaned, removing special symbols, such as web-page tags, that are not conducive to training.
(2) Divide the cleaned text data into a sentence set according to existing heuristic rules.
(3) Scan each sentence in the sentence set against a given term set, checking whether the sentence contains an exactly matching term string or an approximate term string.
(4) In response to there being a completely matching term string, recording the sentence as a positive example into a training database; in response to the existence of the term approximation string, the sentence is recorded as a negative example into the training database.
Specifically, whether a string is an approximate string may be determined by whether its pinyin is identical or similar to that of the term, and by whether extra characters, missing characters, wrongly written characters, or swapped character order are present, alone or in combination.
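As an illustration, steps (3) and (4) can be sketched in a few lines of Python. The `is_approximate` heuristic below (edit distance plus a character-permutation check) is a simplified stand-in for the pinyin-based rules described above, which would additionally require a pinyin library:

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming (one-row variant)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def is_approximate(candidate: str, term: str) -> bool:
    """Toy stand-in for the pinyin / extra-char / missing-char / swap rules:
    'close to the term but not equal to it'."""
    if candidate == term:
        return False
    # one wrong/extra/missing character, or the same characters in swapped order
    return edit_distance(candidate, term) <= 1 or sorted(candidate) == sorted(term)

def label_sentences(sentences, terms):
    """Exact term match -> positive example; approximate string -> negative example."""
    positives, negatives = [], []
    for sent in sentences:
        for term in terms:
            if term in sent:
                positives.append((sent, term))
            else:
                # slide a window of the term's length looking for approximate strings
                n = len(term)
                for k in range(len(sent) - n + 1):
                    if is_approximate(sent[k:k + n], term):
                        negatives.append((sent, term))
                        break
    return positives, negatives
```

With a deliberately misspelled sentence, `label_sentences` files the clean sentence as a positive example and the misspelled one as a negative example, exactly as steps (3) and (4) require.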
Thus, to support dynamic expansion of the term set, there is no need to design a separate classifier for each term: with the term consistency recognition model of the invention, only positive and negative samples associated with a new term need to be prepared, in the form of term strings and approximate term strings, and training then gives the model the ability to recognize the new term on top of its existing capability. In a conventional classification task, by contrast, the whole sentence is fed to a pre-trained model and the full term set must be given in advance as the learning target, which obviously requires a large amount of recognition and training data.
S13, sentence masking is performed on the data set, and the target term is spliced with the data set; each sentence is spliced after the target term to form a training sample set.
In an embodiment, S13 specifically includes the following steps:
(1) And in the context sentence of the data set, masking the character string corresponding to the target term or the deformed character string related to the target term by using a masking symbol.
Specifically, to force the model to use the context to predict whether the term at the specified position is semantically consistent, the term or its deformed string is replaced with the mask symbol [MASK] at input time. The model therefore cannot over-memorize the term's own string within the sentence and can only use the term's context and its position in the sentence, ensuring that sentence-level term semantic consistency is learned fully from the contextual semantics.
(2) And splicing the character strings corresponding to the target terms and the character strings corresponding to the masked context sentences through separators.
Specifically, the context of the target term and the string of the target term are spliced with the separator [SEP].
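The masking and splicing of steps (1) and (2) reduce to simple string operations. A minimal sketch, assuming BERT-style [CLS]/[SEP]/[MASK] markers and one mask symbol per masked character (the description does not fix the exact masking granularity):

```python
MASK, SEP, CLS = "[MASK]", "[SEP]", "[CLS]"

def mask_and_splice(sentence: str, found: str, term: str) -> str:
    """Replace the term (or its deformed string) found in the context sentence
    with one [MASK] per character, then splice the canonical term after [SEP]."""
    masked = sentence.replace(found, MASK * len(found))
    # [CLS] + masked context + [SEP] is the first segment;
    # the target term + [SEP] is the second segment (cf. Fig. 4)
    return f"{CLS} {masked} {SEP} {term} {SEP}"
```

The resulting string is what the later Tokenizer step converts into Token and id sequences.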
And S14, identifying sentences in the training sample set according to the positive examples and the negative examples.
In an embodiment, S14 specifically includes the following steps:
in response to a current sentence in the set of training samples being a positive example, marking the current sentence as 0; in response to a current sentence in the set of training samples being a negative example, the current sentence is labeled as 1.
And S15, determining a term consistency recognition model according to the recognition result of each sentence in the training sample set.
In an embodiment, after step S15, the method for semantic consistency recognition of the term context further includes:
(1) Randomly order the training sample set.
(2) Generate different model parameters from each randomly reordered pass over the training sample set.
(3) Select the model parameters corresponding to the best result to generate the term consistency recognition model.
Specifically, during training, the labeled data set obtained in the previous steps is read, term masking is applied, the masked text is spliced with the associated term, and the learning label is set according to whether the sample is positive or negative. The spliced text is converted by a standard Tokenizer module into a Token sequence and the corresponding id sequence; position embedding and segment embedding are added to form the input vector of the pre-trained base model, and the Transformer layers of a RoBERTa network finally output a hidden vector for each original Token. For the [CLS] Token, the corresponding hidden vector is passed through a fully connected layer to obtain the class probability; the loss is computed from the output value and the actual value, and the network parameters are adjusted by back propagation and stochastic gradient descent, thereby fine-tuning the large-scale pre-trained model. After training for 10 epochs in sequence, the best result is selected as the final term semantic consistency recognition model. Here 1 epoch equals one pass of training over all samples in the training set, and the number of epochs is the number of times the whole data set, after random reordering, is trained over.
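Stripped of the network itself, the training procedure above is shuffling, epoch-wise updates, and best-checkpoint selection. A schematic sketch with stubbed `train_one_epoch`/`evaluate` callables standing in for the RoBERTa fine-tuning and evaluation (both stubs are hypothetical placeholders, not the actual training code):

```python
import random

def select_best_model(samples, train_one_epoch, evaluate, epochs=10, seed=0):
    """Shuffle the labeled samples each epoch, train, and keep the
    parameters that score best on evaluation (cf. the 10-epoch scheme)."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    params = None
    for _ in range(epochs):
        rng.shuffle(samples)              # one epoch = one pass over all samples
        params = train_one_epoch(samples, params)
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

In a real implementation `train_one_epoch` would run batched forward/backward passes over a RoBERTa classifier and `evaluate` would compute a held-out metric; only the selection logic is shown here.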
In an embodiment, after step S15, the method for semantic consistency recognition of the term context further includes:
(1) Input a sentence to be detected.
(2) Identify terms and deformed terms in the sentence to be detected by pattern matching.
(3) Mask the term and the deformed term in the sentence to be detected.
(4) Splice the masked sentence to be detected with the term and the deformed term.
(5) Input the spliced sentence to be detected into the term consistency recognition model for consistency recognition.
Further, after the step of inputting the spliced sentences to be detected into the term consistency identification model for consistency identification, the semantic consistency identification method of the terms in context further includes: judging whether the consistency identification result is consistent with the expected result or not; in response to the consistency identification result not being consistent with the expected result, recording an inconsistency condition, and reporting that there is an error in the use of the term or the morphed term.
Specifically, in the model application stage, for a given sentence to be detected, the terms and their deformations present in the sentence are found by pattern matching. When a term or a deformation is found, the original sentence is masked and spliced with the term, the embedding vector input to the network is generated in the same way as in the training stage, and the final consistency result is obtained by running the network. If the term appears in an irregular deformed form and the consistency judgment for the current context indicates that the term is not written in a sufficiently standard way, it can be replaced with the standard form. Judging the contextual semantic consistency of terms is thereby realized and applied to standardization checking in the text proofreading process.
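The application-stage flow just described can be sketched end to end; `find_term_or_variant` and `model` below are hypothetical placeholders for the pattern matcher and the trained network:

```python
def check_sentence(sentence, term_set, find_term_or_variant, model):
    """Detect a term or its deformation, mask it, splice the canonical term,
    and report an error whenever the model judges the pair inconsistent."""
    errors = []
    for term in term_set:
        found = find_term_or_variant(sentence, term)   # exact or approximate match
        if found is None:
            continue
        masked = sentence.replace(found, "[MASK]" * len(found))
        spliced = f"[CLS] {masked} [SEP] {term} [SEP]"
        if model(spliced) == 1:                        # 1 = negative = inconsistent
            errors.append((term, found))
    return errors
```

A returned `(term, found)` pair records an inconsistency case, i.e. a reported term-usage error, as in the step following the consistency recognition.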
For the model, a simpler task yields higher recognition accuracy. The invention converts multi-term classification into the semantic consistency problem of a term at a specified position in its context, which gives a better classification effect than multi-label classification, and the term data can be expanded at any time according to business requirements without adjusting the network architecture or deploying multiple recognition models.
Please refer to fig. 3, which illustrates the training and prediction flow of the semantic consistency recognition method according to an embodiment of the present invention. As shown in fig. 3, the complete flow is as follows. Training: label the data set; mask each sentence, replacing the term or its deformed approximate string with [MASK]; splice the target term onto the masked sentence; label the target term 0 if the current sentence is a positive example and 1 otherwise, forming a learning sample; randomly order the resulting learning sample set; train the model in batches, for example retaining 10 epochs; and select the model parameters corresponding to the best result to generate the term consistency model. Application: given a sentence to be detected, find the term and its deformations by pattern matching, mask the term or its deformation, splice the masked sentence with the correct target term, input the result into the term consistency model for consistency recognition, record any case inconsistent with the expected result, and report that the term is used incorrectly.
Please refer to fig. 4, which is a schematic diagram of the term consistency recognition model used by the semantic consistency recognition method for terms in context. As shown in FIG. 4, the context of the term and the string of the term are spliced with the separator [SEP], and a [CLS] mark is added at the beginning. In this way, the context and the target object to be identified are input into the pre-trained model, and the model only needs to output whether the target term is consistent with the context, converting the multi-label classification problem into a simpler binary classification problem.
In the input line, [CLS] denotes the Classification Token, used as the semantic-consistency discrimination token, and [SEP] denotes the Separation Token, used as the separator between input texts.
Position embedding represents the position vector of the input sequence, enabling the model to distinguish the semantic differences of words at different positions. The sin and cos functions are used in the calculation; since their value range is [-1, 1], the magnitude of the position encoding is well bounded and the training process is more stable. The calculation formulas are as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where d_model denotes the length of the position vector, pos is the position of the word, and i indexes the dimension; essentially, the word at position pos in a sentence is converted into a position vector of length d_model.
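Assuming the standard Transformer sinusoidal formulation matches the sin/cos scheme described above, the position vector can be computed directly in plain Python:

```python
import math

def position_encoding(pos: int, d_model: int):
    """Sinusoidal position vector: sin on even dimensions, cos on odd ones."""
    pe = []
    for k in range(d_model):
        i = k // 2                                   # paired sin/cos share a frequency
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe
```

Every component lies in [-1, 1], which is the bounding property the text attributes to the sin/cos choice.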
Segment embedding represents the paragraph (semantic-block) information in the input. As shown in fig. 4, the context sentence containing the term, together with [CLS] and the first [SEP] symbol, constitutes the first segment, while the string of the term itself and the second [SEP] symbol constitute the segment of the target term to be distinguished. Thus, in FIG. 4, the first 6 positions of the segment embedding row are represented by EA and the last three by EB; in the actual operation of the model, EA and EB are represented by 0 and 1, respectively.
Token Embedding represents the character-level embedding mapping of the input sentence. The model collects all encountered characters to form a vocabulary (vocab), and Token Embedding uses this vocabulary to map each Token character to a unique numeric code, so that it can be fed into the network in numeric form. The vocabulary also maintains mappings from special symbols to numeric ids; for example, symbols never encountered before are uniformly converted to the unknown symbol [UNK]. The [MASK] symbol is the masking symbol, indicating that a character is present at that position but hidden, requiring the model to infer it from the context.
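The vocabulary mapping and the EA/EB = 0/1 segment convention can be illustrated with a toy example; the id values below are arbitrary, not those of a real tokenizer:

```python
SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def build_vocab(corpus_chars):
    """Map every encountered character to a unique id, after the special symbols."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for ch in corpus_chars:
        vocab.setdefault(ch, len(vocab))
    return vocab

def encode(tokens, vocab):
    """Token ids (unknown characters fall back to [UNK]) plus 0/1 segment ids:
    segment 0 up to and including the first [SEP], segment 1 afterwards."""
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    seg, segments = 0, []
    for t in tokens:
        segments.append(seg)
        if t == "[SEP]" and seg == 0:
            seg = 1
    return ids, segments
```

The id and segment sequences produced here correspond to the Token Embedding and segment embedding rows of Fig. 4.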
The scope of the semantic consistency recognition method in the context of the terms described in the present invention is not limited to the execution sequence of the steps listed in the embodiment, and all the solutions implemented by the steps addition, subtraction and step replacement in the prior art according to the principles of the present invention are included in the scope of the present invention.
The present embodiments provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above semantic consistency recognition method for terms in context.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned computer-readable storage medium comprises: various computer storage media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Please refer to fig. 5, which is a schematic structural connection diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the present embodiment provides an electronic device 5, which specifically includes: a processor 51 and a memory 52. The memory 52 is configured to store a computer program, and the processor 51 is configured to execute the computer program stored in the memory 52, so as to enable the electronic device 5 to execute the steps of the semantic consistency identification method in the context of the term.
The processor 51 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 52 may include a Random Access Memory (RAM) and may further include non-volatile memory, such as at least one disk memory.
In practical applications, the electronic device may be a computer that includes components such as a memory, a memory controller, one or more processing units (CPU), peripheral interfaces, RF circuitry, audio circuitry, speakers, a microphone, an input/output (I/O) subsystem, a display screen, other output or control devices, and external ports; such computers include, but are not limited to, personal computers such as desktop computers, notebook computers, and tablet computers, as well as smart phones, Personal Digital Assistants (PDAs), and the like. In other embodiments, the electronic device may also be a server, where the server may be deployed on one or more physical servers according to factors such as function and load, or may be a cloud server formed by a distributed or centralized server cluster, which is not limited in this embodiment.
In summary, the term context semantic consistency recognition method, storage medium, and device of the present invention adopt deep learning: starting from a base language model, the model is forced to learn context-feature discrimination by masking risk characters and constructing term-and-context pairs through concatenation, thereby recognizing the semantic consistency of terms at risk positions. By distinguishing positive and negative examples, the invention constructs negative samples for target terms, which existing recognition approaches lack; by jointly recognizing and training on term strings and their variant strings, it supports dynamic expansion of the term set and achieves a multi-term classification effect; by masking the sentence, it avoids interference from the term itself on the context and forces the model to learn context-based consistency recognition of terms. Potential errors in the writing of terms can thus be identified during text proofreading and used as a basis for judging whether a term is expressed irregularly, which solves the false-alarm problem of purely rule-based replacement in term detection. The invention effectively overcomes various defects in the prior art and has high industrial utilization value.
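As an illustrative sketch of the masking-and-splicing construction described above (not the patent's actual implementation): the `[MASK]` and `[SEP]` symbols below are assumptions borrowed from BERT-style conventions, since the patent does not specify the exact tokens.

```python
# Hypothetical sketch of the term-and-context pair construction described above.
# "[MASK]" and "[SEP]" are assumed BERT-style symbols, not taken from the patent.

MASK, SEP = "[MASK]", "[SEP]"

def build_sample(sentence: str, term: str) -> str:
    """Mask the term inside its context sentence, then splice the term
    string and the masked sentence with a separator."""
    masked = sentence.replace(term, MASK)
    return f"{term} {SEP} {masked}"

print(build_sample("the referee blew the whistle", "referee"))
# → "referee [SEP] the [MASK] blew the whistle"
```

Because the term is hidden in the context, a classifier reading such a pair cannot simply copy the term from the sentence; it must judge from the surrounding context whether the spliced term fits.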
The foregoing embodiments are merely illustrative of the principles and utility of the present invention and are not intended to limit it. Those skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical concepts disclosed herein shall be covered by the claims of the present invention.

Claims (10)

1. A method for identifying semantic consistency of terms in context, the method comprising:
acquiring text data of a preset scale;
segmenting and processing the text data into a data set, and labeling the data set according to the distinction between positive and negative examples;
performing sentence masking on the data set, and splicing the target term with the data set by appending each sentence after the target term to form a training sample set;
identifying the sentences in the training sample set according to the distinction between positive and negative examples;
and determining a term consistency recognition model according to the recognition result of each sentence in the training sample set.
2. The method for semantic consistency recognition of terms in context according to claim 1, wherein the step of obtaining text data of a preset size comprises:
and acquiring text data of a preset scale from comment data on persons and events, or by means of web crawler capture.
3. The method of claim 1, wherein the step of segmenting and processing the text data into a data set and labeling the data set according to the distinction between positive and negative examples comprises:
performing data cleaning on the text data to remove special symbols that are unhelpful for training;
dividing the cleaned text data into sentence sets according to heuristic rules;
scanning each sentence in the sentence set against a given term set, and checking whether the sentence contains an exactly matching term string or an approximate term string;
in response to an exactly matching term string being present, recording the sentence as a positive example in a training database; in response to an approximate term string being present, recording the sentence as a negative example in the training database.
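A minimal sketch of the exact/approximate scan in this claim, assuming a character-level similarity ratio as the criterion for an approximate term string (the patent does not specify the matching criterion or threshold):

```python
import difflib

def label_sentence(sentence: str, terms: list[str], threshold: float = 0.8):
    """Return "positive" if the sentence contains an exact term match,
    "negative" if it contains only an approximate (variant) term string,
    and None otherwise. The 0.8 similarity threshold is an assumption."""
    for term in terms:
        if term in sentence:  # exact match -> positive example
            return "positive"
    for term in terms:
        n = len(term)
        # slide a term-length window over the sentence looking for near matches
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            if difflib.SequenceMatcher(None, term, window).ratio() >= threshold:
                return "negative"
    return None

label_sentence("the color is red", ["color"])    # exact match -> "positive"
label_sentence("the colour is red", ["color"])   # near match -> "negative"
```

In practice a production matcher would likely use a trie or Aho-Corasick automaton over the full term set rather than this quadratic window scan; the sketch only illustrates the positive/negative labeling logic.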
4. The method of claim 1, wherein the step of performing sentence masking on the data set and splicing the target term with the data set comprises:
in a context sentence of the data set, masking the character string corresponding to the target term, or a variant character string related to the target term, with a masking symbol;
and splicing the character string corresponding to the target term and the character string of the masked context sentence with a separator.
5. The method according to claim 1, wherein the step of identifying the sentences in the training sample set according to the distinction between positive and negative examples comprises:
in response to a current sentence in the set of training samples being a positive example, marking the current sentence as 0;
in response to a current sentence in the set of training samples being a negative example, marking the current sentence as 1.
6. The method according to claim 1, wherein after the step of determining a term consistency recognition model from the recognition results of the sentences in the training sample set, the method further comprises:
randomly ordering the training sample set;
generating different model parameters from each randomly ordered version of the training sample set;
and selecting the model parameters corresponding to the best result to generate the term consistency recognition model.
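The shuffle-train-select loop of this claim can be sketched as follows; `train_fn` and `eval_fn` are hypothetical callbacks standing in for the unspecified training and validation procedures:

```python
import random

def select_best_model(samples, train_fn, eval_fn, runs=3, seed=0):
    """Shuffle the training sample set before each run, train a model,
    and keep the parameters achieving the best evaluation score.
    train_fn/eval_fn are placeholders for the patent's unspecified model."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(runs):
        shuffled = list(samples)
        rng.shuffle(shuffled)        # random ordering of the sample set
        params = train_fn(shuffled)  # hypothetical training call
        score = eval_fn(params)      # hypothetical validation score
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Each shuffle changes the mini-batch order seen during training, so the runs yield different parameters; keeping only the best-scoring run is the selection step named in the claim.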
7. The method according to claim 1, wherein after the step of determining a term consistency recognition model from the recognition results of the sentences in the training sample set, the method further comprises:
inputting a sentence to be detected;
identifying terms and variant terms in the sentence to be detected through pattern matching;
in the sentence to be detected, masking the terms and variant terms;
splicing the masked sentence to be detected with the terms and variant terms;
and inputting the spliced sentence to be detected into the term consistency recognition model for consistency recognition.
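A hedged end-to-end sketch of the detection flow in this claim; `model` is a hypothetical callable returning the consistency label (0 for consistent, 1 for inconsistent, following the labeling convention of claim 5), and `[MASK]`/`[SEP]` are assumed symbols:

```python
def detect(sentence: str, term_set, model):
    """Find a term by simple substring pattern matching, mask it, splice
    the term with the masked sentence, and query the consistency model."""
    for term in term_set:
        if term in sentence:
            masked = sentence.replace(term, "[MASK]")
            spliced = f"{term} [SEP] {masked}"  # same splicing as in training
            return term, model(spliced)
    return None, None  # no known term found in the sentence

# usage with a stub model that always reports "consistent" (0)
term, label = detect("the referee blew the whistle", ["referee"], lambda s: 0)
```

The key point is that inference reuses exactly the mask-and-splice format used to build the training samples, so the model's context-based consistency judgment transfers directly to new sentences.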
8. The method according to claim 7, wherein after the step of inputting the spliced sentence to be detected into the term consistency recognition model for consistency recognition, the term context semantic consistency recognition method further comprises:
judging whether the consistency recognition result matches the expected result;
in response to the consistency recognition result not matching the expected result, recording the inconsistency and reporting an error in the use of the term or the variant term.
9. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the term context semantic consistency recognition method of any one of claims 1 to 8.
10. An electronic device, comprising: a processor and a memory;
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the term context semantic consistency recognition method of any one of claims 1 to 8.
CN202211197928.4A 2022-09-29 2022-09-29 Semantic consistency recognition method, storage medium and device for term context Pending CN115526183A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211197928.4A CN115526183A (en) 2022-09-29 2022-09-29 Semantic consistency recognition method, storage medium and device for term context

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211197928.4A CN115526183A (en) 2022-09-29 2022-09-29 Semantic consistency recognition method, storage medium and device for term context

Publications (1)

Publication Number Publication Date
CN115526183A true CN115526183A (en) 2022-12-27

Family

ID=84700073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211197928.4A Pending CN115526183A (en) 2022-09-29 2022-09-29 Semantic consistency recognition method, storage medium and device for term context

Country Status (1)

Country Link
CN (1) CN115526183A (en)

Similar Documents

Publication Publication Date Title
RU2661750C1 (en) Symbols recognition with the use of artificial intelligence
CN109902307B (en) Named entity recognition method, named entity recognition model training method and device
US20180267956A1 (en) Identification of reading order text segments with a probabilistic language model
CN107423278B (en) Evaluation element identification method, device and system
CN112070138B (en) Construction method of multi-label mixed classification model, news classification method and system
WO2021208727A1 (en) Text error detection method and apparatus based on artificial intelligence, and computer device
CN111241230A (en) Method and system for identifying string mark risk based on text mining
CN111489105B (en) Enterprise risk identification method, device and equipment
CN113657098B (en) Text error correction method, device, equipment and storage medium
CN110543920A (en) Performance detection method and device of image recognition model, server and storage medium
CN116304307A (en) Graph-text cross-modal retrieval network training method, application method and electronic equipment
CN112464927B (en) Information extraction method, device and system
CN113761867A (en) Address recognition method and device, computer equipment and storage medium
CN113626576A (en) Method and device for extracting relational characteristics in remote supervision, terminal and storage medium
CN117152770A (en) Handwriting input-oriented writing capability intelligent evaluation method and system
CN115526183A (en) Semantic consistency recognition method, storage medium and device for term context
CN116089605A (en) Text emotion analysis method based on transfer learning and improved word bag model
CN110399984B (en) Information prediction method and system and electronic equipment
CN112883717A (en) Wrongly written character detection method and device
Nagy et al. Adaptive and interactive approaches to document analysis
Nisa et al. Annotation of struck-out text in handwritten documents
CN114943234B (en) Enterprise name linking method, enterprise name linking device, computer equipment and storage medium
CN111353308A (en) Named entity recognition method, device, server and storage medium
CN114091456B (en) Intelligent positioning method and system for quotation contents
CN113988085B (en) Text semantic similarity matching method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination