WO2021238337A1 - Method and device for entity tagging - Google Patents

Method and device for entity tagging

Info

Publication number
WO2021238337A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
entity
sample
word
mask
Application number
PCT/CN2021/080402
Other languages
French (fr)
Chinese (zh)
Inventor
孟函可
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021238337A1 publication Critical patent/WO2021238337A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Definitions

  • This application relates to the field of artificial intelligence (AI), and more specifically to methods and devices for entity annotation in the AI field.
  • AI: artificial intelligence
  • NER: named entity recognition
  • NLP: natural language processing
  • NER can identify entities such as person names, place names, organization names, date and time, etc., so that the identified entities can be used for information extraction, information retrieval, syntactic analysis, semantic role labeling, etc.
  • the input sentence can be input to the sequence labeling model to output the label of each word.
  • a sequence labeling model trained on a specific corpus can only be applied to specific input sentences. For example, if the training sentences in the sample set used to train the sequence labeling model contain movie corpus, an input sentence must contain movies for its labels to be predicted; if the input sentence contains both movies and TV shows, only the movies can be predicted and the TV shows cannot. If input sentences of multiple different corpora need to be predicted, multiple sequence labeling models for different corpora or different corpus combinations need to be trained, which leads to high complexity. And in order to predict the labels of an input sentence, multiple sequence labeling models need to be run concurrently and the sequence labeling model suitable for the input sentence needs to be matched among them, resulting in a large amount of calculation and high complexity.
  • the embodiments of the present application provide a method and device for entity labeling, which can reduce complexity and help improve the performance of entity labeling.
  • a method for entity labeling is provided.
  • the method can be executed by a processor or a processing module.
  • the method includes: determining N mask vectors of N sample sets, where the N sample sets correspond to the N mask vectors one-to-one, the entity corpora corresponding to different sample sets among the N sample sets are different, each of the N sample sets includes multiple samples of at least one entity corpus, the M dimensions of each of the N mask vectors correspond to M named entities, and M and N are positive integers;
  • the first sequence labeling model is updated according to the partial samples in each sample set of the N sample sets and the N mask vectors to obtain the second sequence labeling model, and the second sequence labeling model is used for entity labeling.
  • one sample set corresponds to one mask vector
  • the entity corpus corresponding to different sample sets is different.
  • the mask vectors of sample sets of different corpora are different, and the processor can update the first sequence labeling model with the N mask vectors corresponding to the N sample sets.
  • since the M dimensions of each mask vector correspond to M named entities, each mask vector can reflect which named entities are attended to and which are not. In this way, when updating the sequence labeling model, the processor adjusts only the parameters corresponding to the attended named entities and does not adjust the parameters corresponding to the remaining named entities. After one or more updates, the second sequence labeling model can predict prediction sentences of different corpora, avoiding the need to train a separate entity annotation model for each of the N sample sets, which can reduce complexity and help improve the performance of entity annotation.
  • the N mask vectors are used to mask multiple loss vectors obtained from the N sample sets, and the multiple masked loss vectors are used to update the first sequence labeling model.
  • the processor inputs the words of the training sentences of each of the N sample sets into the first sequence labeling model before the update to obtain the weight vector of each word, and the processor inputs the weight vector of each word and the actual label of each word into the loss function to obtain multiple loss vectors.
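  • As an illustration only, the masking of loss vectors described above can be sketched as follows (hypothetical NumPy code; the variable names and the use of the natural logarithm are assumptions, not taken from the application):

```python
import numpy as np

def loss_vector(weight_vec, label_vec, eps=1e-12):
    # Element-wise cross-entropy between the weight vector output by the first
    # sequence labeling model for one word and that word's actual label vector.
    p = np.clip(weight_vec, eps, 1 - eps)
    return -(label_vec * np.log(p) + (1 - label_vec) * np.log(1 - p))

def masked_loss_vectors(weight_vecs, label_vecs, mask_vec):
    # weight_vecs / label_vecs: one M-dimensional vector per word of a training
    # sentence; mask_vec: the M-dimensional mask vector of that sample set.
    # Dimensions of unconcerned named entities are zeroed out, so only the
    # concerned parameters are adjusted during the update.
    return [loss_vector(w, y) * mask_vec for w, y in zip(weight_vecs, label_vecs)]
```
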
  • that the entity corpora corresponding to different sample sets in the N sample sets are different can be understood as: the entity corpora corresponding to different sample sets in the N sample sets are not completely the same. Specifically, the first sample set in the above N sample sets corresponds to a first entity corpus and the second sample set corresponds to a second entity corpus; the first entity corpus is either completely different from the second entity corpus, or the first entity corpus and the second entity corpus share a part of the corpus. In other words, the entity corpora corresponding to different sample sets in the N sample sets are either completely different or partly the same and partly different.
  • the numbers of entity corpora corresponding to different sample sets in the above N sample sets may be the same while the corpus types differ (at least one corpus type is different), or the numbers of entity corpora corresponding to different sample sets may be different, in which case the corpus types differ as well.
  • One of the above-mentioned N sample sets includes training sentences of at least one entity corpus, and different training sentences included in the same sample set correspond to the same entity corpus.
  • the dimensions of the above N mask vectors are the same: they are all M-dimensional vectors.
  • each dimension of each of the above N mask vectors corresponds to one named entity
  • the M-dimensional mask vector corresponds to M named entities one-to-one
  • the N mask vectors correspond to a total of M named entities.
  • Different entity corpus includes different named entities.
  • the first entity corpus includes the first named entity
  • the second entity corpus includes the second named entity
  • the first named entity and the second named entity are not exactly the same.
  • each mask vector consists of 0 and 1.
  • in the above solution, the first sequence labeling model can be updated one or more times. After each update of the first sequence labeling model, the updated model can continue to be called the first sequence labeling model; after the last update, the second sequence labeling model is obtained.
  • Each sample set of the aforementioned N sample sets consists of a test set and a training set.
  • the samples in the training set are used to update the first sequence labeling model, and the samples in the test set are used to test the stability of the second sequence labeling model.
  • the samples in each sample set are sentences that include entity words.
  • the sample in the test set can be called a test sentence, and the sample in the training set can be called a training sentence.
  • the first sequence labeling model is updated according to the partial samples in each sample set of the N sample sets and the N mask vectors, including:
  • the dimensions of the weight vector, the actual label vector, and the loss vector are M.
  • when updating the first sequence labeling model, taking the first word as an example, the first word can be input into the first sequence labeling model to obtain the weight vector of the first word.
  • the weight vector can reflect, to a certain extent, how likely the first word is to correspond to each named entity.
  • the weight vector and the actual label vector of the first word are used to calculate the loss vector, and the first mask vector is used to mask the loss vector.
  • the masked loss vector is used to update the first sequence labeling model: only the parameters of the named entities corresponding to the non-zero positions of the mask vector are adjusted, and the parameters of the named entities corresponding to the zero positions of the mask vector are not adjusted, so that the updated first sequence labeling model moves closer to a sequence labeling model for the named entities corresponding to the non-zero positions of the mask vector, which can improve the accuracy of the second sequence labeling model.
  • the dimension of the weight vector of the first word, the dimension of the actual label vector of the first word, the dimension of the loss vector, the dimension of each mask vector and the dimension of the masked loss vector are the same.
  • the aforementioned loss function is a cross-entropy function.
  • the multiplication of two vectors may be a dot multiplication operation
  • the dot multiplication operation is the multiplication of corresponding elements of two vectors.
  • the first word is an entity word in the first sample, rather than a non-entity word. In this way, the efficiency of updating the first sequence labeling model can be improved.
  • the method further includes: testing the stability of the second sequence annotation model according to the remaining samples in each sample set of the N sample sets.
  • part of the samples in each sample set can be used to train the first sequence labeling model to obtain the second sequence labeling model.
  • some samples in each sample set can be used to update the first sequence labeling model, and the remaining samples of each sample set can be used to test the stability of the second sequence labeling model.
  • the method further includes: inputting the second entity word in the prediction sentence into the second sequence labeling model, and outputting the prediction vector;
  • the prediction sentence is a sentence including the entity corpus corresponding to any sample set in the N sample sets;
  • the dimension of the prediction vector is M.
  • determining at least one label of the second entity word according to the prediction vector includes: determining whether the value of each dimension of the prediction vector is greater than a preset value, and determining the named entity tag corresponding to each dimension whose value in the prediction vector is greater than the preset value as the at least one tag of the second entity word.
  • the second sequence labeling model can be used to predict the label of the second entity word: according to whether the value of each element in the prediction vector output by the second sequence labeling model is greater than a preset value, the second entity word can be labeled with one or more tags. In this way, in this embodiment of the present application, one entity word can be marked with more than one tag.
  • determining the N mask vectors of the N sample sets includes: determining that the dimension of each of the N mask vectors is the total number of entity corpus types corresponding to the N sample sets; and determining the value of each of the N mask vectors according to the entity corpus corresponding to each sample set in the N sample sets.
  • a method for entity labeling including: inputting a second entity word in a prediction sentence into a second sequence labeling model, and outputting a prediction vector; and determining at least one label of the second entity word according to the prediction vector.
  • the above-mentioned second sequence labeling model is obtained after updating the first sequence labeling model according to the partial samples in each sample set of the N sample sets and the N mask vectors.
  • determining at least one label of the second entity word according to the prediction vector includes: determining whether the value of each dimension of the prediction vector is greater than a preset value, and determining the named entity tag corresponding to each dimension whose value in the prediction vector is greater than the preset value as the at least one tag of the second entity word.
  • a device for entity labeling is provided, and the device is configured to execute the foregoing first aspect or the method in any possible implementation manner of the first aspect.
  • the device may include a module for executing the first aspect or the method in any possible implementation manner of the first aspect.
  • a device for entity labeling is provided, and the device is configured to execute the foregoing second aspect or any possible implementation method of the second aspect.
  • the apparatus may include a module for executing the second aspect or the method in any possible implementation manner of the second aspect.
  • a device for entity labeling is provided, including a processor, the processor is coupled with a memory, the memory is used to store computer programs or instructions, and the processor is used to execute the computer programs or instructions stored in the memory, so that the method in the first aspect is executed.
  • the processor is used to execute a computer program or instruction stored in the memory, so that the device executes the method in the first aspect.
  • the device includes one or more processors.
  • the device may also include a memory coupled with the processor.
  • the device may include one or more memories.
  • the memory can be integrated with the processor or provided separately.
  • the device may also include a transceiver.
  • a device for entity labeling is provided, including a processor, the processor is coupled with a memory, the memory is used to store computer programs or instructions, and the processor is used to execute the computer programs or instructions stored in the memory, so that the method in the second aspect is executed.
  • the processor is used to execute a computer program or instruction stored in the memory, so that the device executes the method in the second aspect.
  • the device includes one or more processors.
  • the device may also include a memory coupled with the processor.
  • the device may include one or more memories.
  • the memory can be integrated with the processor or provided separately.
  • the device may also include a transceiver.
  • a computer-readable storage medium on which a computer program (also referred to as an instruction or code) for implementing the method in the first aspect is stored.
  • when the computer program is executed by a computer, the computer can execute the method in the first aspect.
  • a computer-readable storage medium on which a computer program (also referred to as an instruction or code) for implementing the method in the first aspect or the second aspect is stored.
  • when the computer program is executed by a computer, the computer can execute the method in the second aspect.
  • this application provides a chip including a processor.
  • the processor is used to read and execute the computer program stored in the memory to execute the method in the first aspect and any possible implementation manners thereof.
  • the chip further includes a memory, and the processor is connected to the memory through a circuit or a wire.
  • the chip further includes a communication interface.
  • this application provides a chip system including a processor.
  • the processor is used to read and execute the computer program stored in the memory to execute the method in the second aspect and any possible implementation manners thereof.
  • the chip further includes a memory, and the processor is connected to the memory through a circuit or a wire.
  • the chip further includes a communication interface.
  • the present application provides a computer program product.
  • the computer program product includes a computer program (also referred to as an instruction or code).
  • when the computer program is executed by a computer, the computer implements the above method.
  • the present application provides a computer program product.
  • the computer program product includes a computer program (also referred to as an instruction or code).
  • when the computer program is executed by a computer, the computer implements the above method.
  • Fig. 1 is a schematic diagram of a method for entity labeling provided by an embodiment of the present application.
  • Fig. 2 is a schematic diagram of a method for obtaining a second sequence labeling model provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of an example of updating a first sequence labeling model provided by an embodiment of the present application.
  • Fig. 4 is a schematic diagram of another example of updating the first sequence labeling model provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of using a second sequence labeling model for prediction provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an example of using a second sequence labeling model for prediction provided by an embodiment of the present application.
  • Fig. 7 is a schematic diagram of a possible application scenario provided by an embodiment of the present application.
  • Fig. 8 is a schematic block diagram of a device for entity labeling provided by an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of another apparatus for entity labeling provided by an embodiment of the present application.
  • AI services involve, for example, voice assistants, subtitle generation, voice input, chat robots, customer service robots, or spoken language evaluation.
  • other AI services can also be included, which is not limited in this embodiment of the application.
  • sample set consists of a test set and a training set.
  • the samples in the test set are test samples, which can also be called test sentences; the samples in the training set are training samples, and the training samples can also be called training sentences.
  • the samples in each sample set include the same corpus.
  • the samples in a sample set are composed of test sentences and training sentences that include the same corpus.
  • each sample in sample set 1 includes movie entities, and sample set 1 includes 3 samples as an example.
  • the 3 samples are: "I want to watch 'Romance of the Three Kingdoms'", "Show me 'Youth in Teen'", and "Please open 'Tangshan Earthquake'". For another example, at least part of the samples in sample set 2 may include movie and TV series entities, and the remaining samples may include movies or TV series.
  • taking sample set 2 including 3 training sentences as an example, the 3 training sentences are: "Show me 'Nezha' and 'Sansheng III'" (Nezha is a movie, Sansheng III is a TV series), "I want to watch 'The Romance of the Three Kingdoms'" (the Romance of the Three Kingdoms may be a movie or a TV series here), and "Play 'Nezha' for me" (Nezha is a movie).
  • the sequence labeling model can be a long short-term memory (LSTM)-conditional random field (CRF) model. LSTM is suitable for sequence modeling problems, and superimposing a CRF on the LSTM is conducive to planning the label path.
  • the sequence labeling model may also be a sequence-to-sequence (Seq2Seq) model or a transformer model.
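  • As a rough sketch of such a sequence labeling model (assuming PyTorch; a BiLSTM with a linear output layer is used here instead of a full LSTM-CRF, and a softmax output is chosen because the weight vector's dimensions are later described as summing to 1; both choices are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SequenceLabelingModel(nn.Module):
    """BiLSTM tagger that outputs an M-dimensional weight vector for each word."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_entities):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_entities)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> weight vectors: (batch, seq_len, M)
        hidden, _ = self.lstm(self.embed(token_ids))
        return torch.softmax(self.out(hidden), dim=-1)
```
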
  • a mask vector is a vector composed of 0s and 1s.
  • One dimension of the mask vector corresponds to a named entity.
  • a value of 1 in a dimension means the corresponding named entity is attended to, and a value of 0 means the corresponding named entity is not attended to.
  • a sequence labeling model trained on a specific corpus can only be applied to a specific input sentence.
  • for example, if the sample set used to train the sequence labeling model is the above sample set 2, the input sentence to be predicted by the sequence labeling model also needs to include movies and/or TV shows.
  • the entities of the input sentence predicted by the sequence labeling model need to be a subset of the entities of the training sentences and test sentences included in sample set 2 for the prediction to be accurate.
  • if the sample set used to train the sequence labeling model is the above sample set 1, then the input sentence to be predicted needs to include movies before its labels can be predicted. If the input sentence includes movies and TV series, only the movies can be predicted and the TV series cannot.
  • for example, if the input sentence is "I want to watch Romance of the Three Kingdoms", only the movie tag can be output for "Romance of the Three Kingdoms"; if "Romance of the Three Kingdoms" may be a TV series, the TV series tag cannot be output, so the entity labeling is not accurate. If multiple input sentences of multiple corpora need to be predicted, multiple sequence labeling models of different corpora or different corpus combinations need to be trained, which leads to high complexity. In addition, in order to predict the labels of an input sentence, multiple sequence labeling models need to be run concurrently, and the sequence labeling model suitable for the input sentence needs to be matched among the multiple sequence labeling models, resulting in a large amount of calculation and high complexity.
  • the method 100 may be executed by a processor, and the method 100 includes:
  • the processor determines N mask vectors of the N sample sets, and the N sample sets correspond to the N mask vectors one-to-one.
  • the entity corpora corresponding to different sample sets in the N sample sets are different, and each sample set in the N sample sets includes multiple samples of at least one entity corpus.
  • the M dimensions of each of the N mask vectors correspond to M named entities, and M and N are positive integers.
  • the different entity corpora corresponding to different sample sets in the N sample sets can be understood as: the entity corpora corresponding to different sample sets are not completely the same. Specifically, the partial entity corpus corresponding to different sample sets in the N sample sets are the same and the partial entity corpus is different, or the entity corpora corresponding to different sample sets in the N sample sets are completely different.
  • the dimension of each of the above N mask vectors is M, one dimension of the mask vector corresponds to a named entity, and the M-dimensional mask vector corresponds to M named entities one-to-one, and N The mask vector corresponds to a total of M named entities.
  • different sample sets may include at least one identical training sentence and/or test sentence, or every training sentence and/or test sentence included in different sample sets may be different; this is not limited in the embodiments of the present application.
  • the N mask vectors are all vectors of M dimensions.
  • for example, N = 6, that is, 6 sample sets (sample set 1, sample set 2, sample set 3, sample set 4, sample set 5, and sample set 6) correspond to 6 mask vectors
  • sample set 1 corresponds to Movie corpus
  • sample set 2 corresponds to TV series corpus
  • sample set 3 corresponds to variety show corpus
  • sample set 4 corresponds to animation corpus
  • sample set 5 corresponds to movie and TV series corpus
  • sample set 6 corresponds to TV series and variety show corpus.
  • the corresponding relationship between the dimensions of the mask vector and the named entities can be specified.
  • the first dimension of each mask vector corresponds to movies
  • the second dimension corresponds to TV series
  • the third dimension corresponds to variety shows
  • the fourth dimension corresponds to animation.
  • the 6 mask vectors corresponding to the 6 sample sets are [1 0 0 0], [0 1 0 0], [0 0 1 0], [0 0 0 1], [1 1 0 0], [0 1 1 0].
  • for a 4-dimensional mask vector, the possible values include: [0 0 0 0], in which the named entity of no dimension is attended to; [1 0 0 0], [0 1 0 0], [0 0 1 0] and [0 0 0 1], in which one named entity is attended to; [1 1 0 0], [0 1 1 0], [1 0 1 0], [1 0 0 1], [0 1 0 1] and [0 0 1 1], in which two named entities are attended to; [1 1 1 0], [0 1 1 1], [1 0 1 1] and [1 1 0 1], in which three named entities are attended to; and [1 1 1 1], in which all four named entities are attended to.
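  • The six mask vectors of this example could be built as in the following sketch (hypothetical names; the entity order movie / TV series / variety show / animation follows the correspondence specified above):

```python
ENTITIES = ["movie", "tv_series", "variety_show", "animation"]  # M = 4

def mask_vector(concerned_entities):
    # 1 for a named entity the sample set is concerned with, 0 otherwise.
    return [1 if e in concerned_entities else 0 for e in ENTITIES]

sample_set_entities = {
    "sample set 1": {"movie"},
    "sample set 2": {"tv_series"},
    "sample set 3": {"variety_show"},
    "sample set 4": {"animation"},
    "sample set 5": {"movie", "tv_series"},
    "sample set 6": {"tv_series", "variety_show"},
}
masks = {name: mask_vector(ents) for name, ents in sample_set_entities.items()}
# masks["sample set 1"] == [1, 0, 0, 0], masks["sample set 5"] == [1, 1, 0, 0], ...
```
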
  • S120 The processor updates the first sequence labeling model according to the partial samples in each sample set of the N sample sets and the N mask vectors to obtain a second sequence labeling model, and the second sequence labeling model is used for entity labeling.
  • one sample set corresponds to one mask vector, and different sample sets correspond to different entity corpora.
  • the mask vectors of sample sets of different corpora are different, and the processor can update the first sequence labeling model with the N mask vectors corresponding to the N sample sets.
  • since the M dimensions of each mask vector correspond to M named entities, each mask vector can reflect which named entities are attended to and which are not. In this way, when updating the sequence labeling model, the processor adjusts only the parameters corresponding to the attended named entities and does not adjust the parameters corresponding to the remaining named entities.
  • after one or more updates, the second sequence labeling model can predict input sentences of different corpora, avoiding the need to train a separate entity annotation model for each of the N sample sets, which can reduce complexity and help improve the performance of entity annotation.
  • S210 The processor obtains N sample sets.
  • the entity words in the training sentences and the test sentences included in each sample set of the N sample sets have corresponding actual labels.
  • a training sentence is "I want to watch Nezha and Sansheng III”
  • Nezha can be marked as a movie label, and Sansheng III as a TV series label.
  • this is a way of mixing and labeling multiple named entities, that is, a training sentence can be labeled with at least two tags. Alternatively, if the training sentence is still "I want to watch Nezha and Sansheng III", "Nezha" can be marked with a movie label while Sansheng III is not marked, or "Sansheng III" can be marked as a TV series while Nezha is not marked.
  • this is a way of labeling a single named entity; in this way, sample sets of different corpora (movie corpus, TV series corpus, movie + TV series corpus) can all include the training sentence "I want to watch Nezha and Sansheng III".
  • S220 The processor determines N mask vectors corresponding to the N sample sets, and S220 is equivalent to S110. Among them, N sample sets have a one-to-one correspondence with N mask vectors.
  • the processor determines the dimension and value of each mask vector according to the named entities included in each sample set of the N sample sets. For the specific determination method, refer to the description of S110.
  • one sample set corresponding to one mask vector can be understood as multiple samples in one sample set corresponding to one mask vector.
  • each word in each training sentence has an actual label vector.
  • the actual label vector and the mask vector have the same dimensions. Combined with the example in S110, the dimension of the mask vector is 4, and the dimension of the actual label vector is also 4.
  • the actual label vector The first dimension corresponds to movies, the second dimension corresponds to TV shows, the third dimension corresponds to variety shows, and the fourth dimension corresponds to animation.
  • for example, the actual label vector of an entity word that corresponds only to a movie is [1 0 0 0]
  • the actual label vector of "The Romance of the Three Kingdoms" in the training sentence "I want to watch the Romance of the Three Kingdoms” is [1 1 0 0], that is, the Romance of the Three Kingdoms may be a movie or a TV series.
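  • The actual label vectors use the same M-dimensional layout; a short illustrative sketch (hypothetical helper, with the same entity order as above):

```python
ENTITIES = ["movie", "tv_series", "variety_show", "animation"]

def actual_label_vector(tags_of_word):
    # Multi-label encoding: an entity word can carry more than one named entity tag.
    return [1 if e in tags_of_word else 0 for e in ENTITIES]

actual_label_vector({"movie"})                # [1, 0, 0, 0]
actual_label_vector({"movie", "tv_series"})   # [1, 1, 0, 0], e.g. "Romance of the Three Kingdoms"
```
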
  • the initial first sequence labeling model may be an LSTM-CRF model.
  • the sequence of S220 and S230 is not limited, and S220 can be performed before, after, or at the same time as S230.
  • the following takes the first word in the first sample (a training sentence) in the first sample set of the N sample sets as an example.
  • the words in the samples of the other sample sets are handled similarly to the first word, and detailed examples are not repeated here.
  • the updated first-sequence annotation model may also be referred to as the first-sequence annotation model.
  • the processor inputs the first word in the first sample in the first sample set of the N sample sets into the first sequence labeling model, and outputs a weight vector of the first word.
  • the physical meaning of the weight vector of the first word is: each dimension is the weight of the first word for the corresponding named entity tag, and the larger the value of a certain dimension of the weight vector, the more likely the first word is to carry the named entity tag corresponding to that dimension.
  • the dimensions of the first mask vector, the actual label vector of the first word, and the weight vector of the first word are the same, and the named entities corresponding to the same dimension of each vector are the same. For example, combining the example of S110, the first dimension of the first mask vector, the actual label vector of the first word, and the weight vector of the first word corresponds to movies, the second dimension corresponds to TV series, the third dimension corresponds to variety shows, and the fourth dimension corresponds to animation.
  • the first word is an entity word in the first training sentence; of course, it may also be a non-entity word, which is not limited in the embodiment of the present application.
  • S250 The processor inputs the actual label vector and weight vector of the first word into the loss function, and calculates the loss vector of the first word.
  • the loss function is a cross entropy function.
  • the processor needs to compare the actual label vector of the first word with the weight vector to determine the degree of deviation between the weight vector output by the first sequence model and the actual label vector of the first word.
  • S260 The processor multiplies the loss vector of the first word by the first mask vector corresponding to the first sample set to obtain a masked loss vector.
  • the masked loss vector is used to update the first sequence labeling model, therefore, S230 is executed.
  • the first mask vector corresponding to the first sample set only focuses on the named entity corresponding to the non-zero part.
  • after the processor multiplies the first mask vector by the loss vector of the first word, when the processor updates or adjusts the first sequence labeling model with the masked loss vector it obtains, only the named entities corresponding to the non-zero part of the first mask vector are attended to, and the named entities corresponding to the zero part are not attended to.
  • the processor adjusts the parameters of the first sequence labeling model corresponding to some named entities without affecting the parameters of the other named entities, which helps ensure that the adjusted first sequence labeling model can handle input sentences of different corpora.
  • S230-S260 are the execution process of the first word in the first training sentence in the first sample set, and any one word in any training sentence in any sample set can also perform a process similar to S230-S260.
  • details are not described again in the embodiments of the present application. Only the following two cases are discussed regarding the order in which the processor uses multiple training sentences of each sample set to update the first sequence labeling model:
  • Case 1 The processor inputs part of the samples in each sample set into the first sequence labeling model in batches, and the first sequence labeling model can be updated multiple times at the same time.
  • for example, each sample set includes 70 training samples and 30 test samples (the 30 test samples are used to test the stability of the second sequence labeling model), and the 70 training samples of each sample set are input into the first sequence labeling model in 7 batches; for example, the first batch of training samples includes 10 training samples from each training sample set in 3 training sample sets.
  • the processor can update the first sequence labeling model once according to one training sample in one training sample set, where a training sample includes one entity word
  • at the first time point, the processor can update the first sequence labeling model simultaneously according to S240-S260 for the 3 training samples, one from each of the 3 training sample sets; at the second time point, the processor can update the first sequence labeling model simultaneously according to S240-S260 for another 3 training samples from the 3 training sample sets; and so on, at the 10th time point, the processor can update the first sequence labeling model simultaneously according to S240-S260 for the last 3 training samples of the batch, completing the process of updating the first sequence labeling model with the first batch of training samples; and so on, the remaining 60 training samples from each training sample set in the 3 training sample sets are input into the first sequence labeling model to obtain the updated first sequence labeling model.
  • This example is just to better illustrate the process of updating the first sequence labeling model.
  • the processor can update the first sequence labeling model once according to one word in one training sample of one training sample set, or update the first sequence labeling model once according to multiple words in multiple training samples in a sample set.
  • Case 2 The processor mixes all the training samples included in each sample set of the N sample sets, and then inputs the mixed training samples into the first sequence labeling model in batches, and executes S240-S260 once for each training sample.
  • the first sequence labeling model can be updated in batches, where each training sample has a corresponding mask vector.
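  • A schematic training loop for this mixed-batch case might look as follows (a sketch only, assuming PyTorch and a model like the one sketched earlier; updating per sentence rather than per word, and the natural-logarithm loss, are simplifications):

```python
import torch

def train_mixed(model, mixed_samples, optimizer, epochs=1, eps=1e-12):
    # mixed_samples: (token_ids, label_vecs, mask_vec) triples drawn from all N
    # sample sets and shuffled together; mask_vec is the mask vector of the
    # sample set that the training sentence came from.
    for _ in range(epochs):
        for token_ids, label_vecs, mask_vec in mixed_samples:
            weights = model(token_ids.unsqueeze(0)).squeeze(0)        # (seq_len, M)
            p = weights.clamp(eps, 1 - eps)
            loss_vecs = -(label_vecs * torch.log(p)
                          + (1 - label_vecs) * torch.log(1 - p))      # (seq_len, M)
            loss = (loss_vecs * mask_vec).sum()   # mask out unconcerned entities, then reduce
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```
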
  • the sample set of the movie corpus includes training samples of "I want to watch the Romance of the Three Kingdoms", and the mask vector corresponding to the sample set of the movie corpus is [1 0 0 0].
  • the processor inputs the words of "I want to watch the Romance of the Three Kingdoms" into the first sequence labeling model. Among them, "I", "want" and "watch" are non-entity words, marked with "O" in the figure, and the processor inputs "Romance of the Three Kingdoms" into the first sequence labeling model; this step is the above-mentioned S240.
  • the output weight vector P is [0.5 0.4 0 0.1].
  • the first dimension of the weight vector corresponds to movies, and the second dimension corresponds to TV shows.
  • the third dimension corresponds to variety shows, and the fourth dimension corresponds to animation, that is, the probability that "Romance of the Three Kingdoms” may be a movie is 0.5, the probability that it may be a TV series is 0.4, the probability that it may be a variety show is 0, and the probability that it may be an animation is 0.1.
  • the values of the dimensions of the weight vector add up to 1.
  • the actual label vector Y of "Romance of the Three Kingdoms" is [1 0 0 0]. For example, the loss function is -(y_i log(p_i) + (1 - y_i) log(1 - p_i)), where y_i is the value of the i-th dimension of Y, p_i is the value of the i-th dimension of P, and i takes the values 1, 2, 3, 4.
  • the loss vector calculated by the processor according to P and Y is [0.3 0.2 0 0.04]. The processor multiplies the loss vector [0.3 0.2 0 0.04] by the mask vector [1 0 0 0] to obtain the masked loss vector [0.3 0 0 0], and then feeds the masked loss vector [0.3 0 0 0] back to the first sequence labeling model; the masked loss vector [0.3 0 0 0] is used to adjust only the parameters of the first sequence labeling model for the movie part, while the other parameters remain unchanged.
  • the multiplication of two vectors can also be a dot multiplication operation, that is, the corresponding positions of the two vectors are multiplied.
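  • The figures in this example can be reproduced roughly as follows; note that [0.3 0.2 0 0.04] matches the stated loss function only if the logarithm is taken to base 10, which is an assumption here:

```python
import numpy as np

P = np.array([0.5, 0.4, 0.0, 0.1])   # weight vector of "Romance of the Three Kingdoms"
Y = np.array([1.0, 0.0, 0.0, 0.0])   # actual label vector (movie corpus)
mask = np.array([1, 0, 0, 0])        # mask vector of the movie-corpus sample set

eps = 1e-12
p = np.clip(P, eps, 1 - eps)
loss = -(Y * np.log10(p) + (1 - Y) * np.log10(1 - p))
print(np.round(loss, 2))         # values about 0.30, 0.22, 0.00, 0.05, close to [0.3 0.2 0 0.04]
print(np.round(loss * mask, 2))  # values 0.3, 0, 0, 0: the masked loss vector [0.3 0 0 0]
```
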
  • the sample set of the movie + TV series corpus includes the training sample "I want to watch the Romance of the Three Kingdoms", and the mask vector corresponding to the sample set of the movie + TV series corpus is [1 1 0 0].
  • the processor inputs the words of "I want to watch the Romance of the Three Kingdoms" into the first sequence labeling model. Among them, "I", "want" and "watch" are non-entity words, marked with "O" in the figure, and the processor inputs "Romance of the Three Kingdoms" into the first sequence labeling model; this step is the above-mentioned S240.
  • the output weight vector P is [0.5 0.4 0 0.1].
  • the first dimension of the weight vector corresponds to movies, and the second dimension corresponds to TV shows.
  • the third dimension corresponds to variety shows, and the fourth dimension corresponds to animation, that is, the probability that "Romance of the Three Kingdoms” may be a movie is 0.5, the probability that it may be a TV series is 0.4, the probability that it may be a variety show is 0, and the probability that it may be an animation is 0.1.
  • the values of the dimensions of the weight vector add up to 1.
  • the actual label vector Y of "The Romance of the Three Kingdoms" is [1 1 0 0], that is, the Romance of the Three Kingdoms may be a TV series or a movie.
  • the loss function is -(y_i log(p_i) + (1 - y_i) log(1 - p_i)), where y_i is the value of the i-th dimension of Y, p_i is the value of the i-th dimension of P, and i takes the values 1, 2, 3, 4.
  • the loss vector calculated by the processor according to P and Y is [0.3 0.2 0 0.04]. The processor multiplies the loss vector [0.3 0.2 0 0.04] by the mask vector [1 1 0 0] to obtain the masked loss vector [0.3 0.2 0 0], and then feeds the masked loss vector [0.3 0.2 0 0] back to the first sequence labeling model; the masked loss vector [0.3 0.2 0 0] is used to adjust only the parameters of the first sequence labeling model for the movie and TV series parts, while the other parameters remain unchanged.
  • the multiplication of two vectors can also be a dot multiplication operation, that is, the corresponding positions of the two vectors are multiplied.
  • the mask vector corresponding to a sample set is equal to the actual label vector of the words in the training sample of the sample set.
  • for example, the mask vector corresponding to the movie sample set and the actual label vector corresponding to its training sample are both [1 0 0 0]; for another example, in Figure 4, the mask vector corresponding to the movie + TV series sample set and the actual label vector corresponding to its training sample are both [1 1 0 0].
  • the description in Figure 2 to Figure 4 above is to use part of the samples in each sample set of N sample sets (partial samples are also called training samples) to update the first sequence labeling model to obtain the second sequence labeling model.
  • the processor can use the remaining samples in each of the N sample sets (the remaining samples may also be referred to as test samples) to test the stability of the labeling model in the second sequence.
  • the remaining samples of each sample set are input into the second sequence labeling model, and the second sequence labeling model outputs the weight vector of each word of each sample, and the named entity label of each word is determined according to the weight vector.
  • the named entity label of the word is compared with the actual label.
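  • A minimal sketch of this test step (hypothetical names; label determination by comparing each dimension against a preset value, as in the prediction procedure described later):

```python
def test_stability(model, test_samples, preset=0.5):
    # test_samples: (token_ids, label_vecs) pairs taken from the test set of
    # each sample set; label_vecs holds the actual label vector of each word.
    correct, total = 0, 0
    for token_ids, label_vecs in test_samples:
        weights = model(token_ids.unsqueeze(0)).squeeze(0)   # (seq_len, M)
        predicted = (weights > preset).int()                 # predicted tags per word
        correct += (predicted == label_vecs.int()).all(dim=-1).sum().item()
        total += label_vecs.shape[0]
    return correct / total   # fraction of words whose tags match the actual labels
```
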
  • the way of continuing to execute the method in Figure 1 or Figure 2 above can be: updating the first sequence labeling model to the tested second sequence labeling model, re-collecting the sample sets, and continuing to execute the method shown in Figure 1 or Figure 2; or it can be: re-determining a first sequence labeling model that is unrelated to the second sequence labeling model, re-collecting the sample sets, and continuing to execute the method shown in Figure 1 or Figure 2, until the obtained second sequence labeling model is stable.
  • the second-sequence annotation model can be used for prediction.
  • the specific prediction process is shown in FIG. 5, which is executed by the processor, and the method 500 includes:
  • S510 The processor inputs the second entity word in the prediction sentence into the second sequence labeling model, and outputs the prediction vector.
  • the processor determines at least one label of the second entity word according to the prediction vector, and the prediction sentence is a sentence including the entity corpus corresponding to any sample set in the N sample sets. Among them, the dimension of the prediction vector is M.
  • S520 includes: determining a named entity tag corresponding to a dimension whose value is greater than a preset value in the prediction vector as the at least one tag of the second entity word.
  • for example, the preset value is 0.5.
  • the output prediction vector is [0.7 0.6 0 0.1].
  • the first dimension and the second dimension of the prediction vector are both greater than 0.5.
  • the first dimension corresponds to movies, and the second dimension corresponds to TV series.
  • the named entity labels of "Romance of the Three Kingdoms" are movie and TV series.
  • the sum of the values of the various dimensions of the prediction vector may be equal to 1 or may not be equal to 1, for example, may be greater than 1, which is not limited in the embodiment of the present application.
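  • The label decision in this example can be written out as a short sketch (hypothetical function; the preset value 0.5 and the entity order follow the example above):

```python
ENTITY_TAGS = ["movie", "tv_series", "variety_show", "animation"]

def tags_of(prediction_vector, preset=0.5):
    # Every dimension whose value is greater than the preset value contributes
    # a tag, so one entity word can be marked with more than one tag.
    return [tag for tag, value in zip(ENTITY_TAGS, prediction_vector) if value > preset]

tags_of([0.7, 0.6, 0.0, 0.1])   # ['movie', 'tv_series']
```
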
  • the processor may include a natural language understanding (NLU) module. As shown in Figure 7, the system consists of an automatic speech recognition (ASR) module, a dialog manager (DM) module, a natural language understanding (NLU) module, and a text-to-speech (TTS) module.
  • the NLU module executes the method 700. Specifically, the method 700 includes the following steps:
  • the ASR module receives the user's statement.
  • the ASR module converts the user's speech into text information.
  • the ASR module sends the text information to the DM module.
  • the DM module combines the context of the text information to determine the context information corresponding to the text information.
  • the user's statement in S701 is a verbal expression, and the user may have said other statements related to this conversation before the verbal expression. In this way, the user's other statements are contextual information.
  • the DM module sends the context information and text information to the NLU module.
  • the text information at this time can also be called a predicted input sentence.
  • the NLU module inputs the text information into the second sequence labeling model, and determines the intent information and slot information corresponding to the text information in combination with the context information.
  • S706 is related to the foregoing embodiment of the present application, using the second sequence labeling model to predict the named entity tags of the entity words in the input sentence, and determining the intent information and slot information corresponding to the text information according to the named entity tags.
  • the NLU module sends the intent information and the slot information to the DM module.
  • the DM module calls the voice result in the TTS module according to the intent information and the slot information.
  • the TTS module performs voice playback to the user according to the voice result.
  • for example, in S701 the user says "I want to check tomorrow's weather", and in S704 the context information of the text information is determined; for example, before the user says "I want to check tomorrow's weather", the user also said "I want to query the weather in Beijing".
  • the NLU module knows that the user's intention is to query the weather based on these two sentences, and the slot is to query the weather in Beijing tomorrow.
  • the voice result is the result of querying tomorrow's weather in Beijing, and in S709 the TTS module plays the weather in Beijing tomorrow to the user.
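  • As a toy illustration of this flow (stub modules with hypothetical interfaces; a real NLU module would run the second sequence labeling model over the text to tag the entity words before filling intent and slots):

```python
class NLUModule:
    def parse(self, text, context):
        # Stub: a real implementation would tag entity words with the second
        # sequence labeling model and derive intent and slots from the tags.
        slots = {"time": "tomorrow" if "tomorrow" in text else None,
                 "city": context.get("city")}
        return "query_weather", slots

class DialogManager:
    def __init__(self):
        self.history = []
    def context_for(self, text):                 # S704: earlier statements give context
        context = {"city": "Beijing"} if any("Beijing" in t for t in self.history) else {}
        self.history.append(text)
        return context
    def resolve(self, intent, slots):            # S708: produce the voice result
        return f"{intent}: {slots}"

dm, nlu = DialogManager(), NLUModule()
for utterance in ["I want to query the weather in Beijing",
                  "I want to check tomorrow's weather"]:
    context = dm.context_for(utterance)            # S701-S704 (ASR output assumed to be text)
    intent, slots = nlu.parse(utterance, context)  # S705-S706
    print(dm.resolve(intent, slots))               # S707-S709 (TTS playback omitted)
```
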
  • FIG. 7 only shows one possible application scenario, and the embodiment of the present application can also be applied to other application scenarios, such as playing video of a TV voice assistant.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered as going beyond the scope of protection of this application.
  • the embodiment of the present application may divide the electronic device into functional modules based on the foregoing method examples.
  • each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software function modules. It should be noted that the division of modules in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other feasible division methods in actual implementation. The following is an example of dividing each function module corresponding to each function as an example.
  • FIG. 8 is a schematic block diagram of an apparatus 800 for labeling named entities according to an embodiment of the application.
  • the device 800 includes a determining unit 810 and an updating unit 820.
  • the determining unit 810 is configured to perform operations related to the determination of the processor in the above embodiment.
  • the update unit 820 is configured to perform operations related to the update of the processor in the above embodiment.
  • the determining unit 810 is used to determine N mask vectors of N sample sets.
  • the N sample sets correspond to the N mask vectors in a one-to-one manner.
  • the entity corpus corresponding to different sample sets in the N sample sets is different.
  • each sample set in the N sample sets includes multiple samples of at least one entity corpus, the M dimensions of each mask vector in the N mask vectors correspond to M named entities, and M and N are positive integers;
  • the updating unit 820 is configured to update the first sequence labeling model according to the partial samples in each sample set of the N sample sets and the N mask vectors to obtain the second sequence labeling model, and the second sequence labeling model is used for entity labeling.
  • the determining unit 810 is specifically configured to:
  • the dimensions of the weight vector, the actual label vector, and the loss vector are M.
  • the first word is an entity word in the first sample.
  • the loss function is a cross-entropy function.
  • the device 800 further includes: a testing unit configured to test the stability of the second sequence annotation model according to the remaining samples in each sample set of the N sample sets.
  • the device 800 further includes: an input and output unit, configured to input the second entity word in the prediction sentence into the second sequence labeling model, and output the prediction vector; the determining unit is also configured to determine the second entity word according to the prediction vector At least one label of the entity word, the prediction sentence is a sentence including the entity corpus corresponding to any sample set in the N sample sets, and the dimension of the prediction vector is M.
  • the input and output unit can communicate with the outside.
  • the input and output unit may also be referred to as a communication interface or a communication unit.
  • the determining unit 810 is specifically configured to: determine whether the value of each dimension of the prediction vector is greater than a preset value; determine the named entity label corresponding to the dimension whose value is greater than the preset value in the prediction vector as the second At least one label of the entity word.
  • the determining unit 810 is specifically configured to: determine that the dimension of each mask vector in the N mask vectors is the total number of entity corpus types corresponding to the N sample sets; and determine, according to the entity corpus corresponding to each sample set in the N sample sets, the value corresponding to each of the N mask vectors.
  • FIG. 9 is a schematic structural diagram of an apparatus 900 for labeling named entities provided by an embodiment of the present application.
  • the communication device 900 includes a processor 910, a memory 920, a communication interface 930, and a bus 940.
  • the processor 910 in the device 900 shown in FIG. 9 may correspond to the determining unit 810 and the updating unit 820 in the device 800 in FIG. 8.
  • the communication interface 930 may correspond to an input and output unit in the device 800.
  • the processor 910 may be connected to the memory 920.
  • the memory 920 can be used to store program code and data. The memory 920 may be a storage unit inside the processor 910, or an external storage unit independent of the processor 910, or may include both a storage unit inside the processor 910 and an external storage unit independent of the processor 910.
  • the apparatus 900 may further include a bus 940.
  • the memory 920 and the communication interface 930 may be connected to the processor 910 through the bus 940.
  • the bus 940 may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus 940 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one line is used in FIG. 9, but this does not mean that there is only one bus or one type of bus.
  • the processor 910 may adopt a central processing unit (central processing unit, CPU).
  • the processor can also be another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the processor 910 adopts one or more integrated circuits to execute related programs to implement the technical solutions provided in the embodiments of the present application.
  • the memory 920 may include a read-only memory and a random access memory, and provides instructions and data to the processor 910.
  • a part of the processor 910 may also include a non-volatile random access memory.
  • the processor 910 may also store device type information.
  • the processor 910 executes the computer-executable instructions in the memory 920 to execute the operation steps of the foregoing method through the device 900.
  • the device 900 according to the embodiment of the present application may correspond to the device 800 in the embodiment of the present application, and the above and other operations and/or functions of each unit in the device 800 are used to implement the corresponding process of the method.
  • the embodiments of the present application also provide a computer-readable medium, the computer-readable medium stores program code, and when the computer program code runs on a computer, the computer executes the methods in the above aspects.
  • the embodiments of the present application also provide a computer program product, the computer program product includes computer program code, and when the computer program code runs on a computer, the computer executes the methods in the above aspects.
  • the terminal device or the network device includes a hardware layer, an operating system layer running on the hardware layer, and an application layer running on the operating system layer.
  • the hardware layer may include hardware such as a central processing unit (CPU), a memory management unit (MMU), and memory (also referred to as main memory).
  • the operating system of the operating system layer can be any one or more computer operating systems that implement business processing through processes, for example, Linux operating systems, Unix operating systems, Android operating systems, iOS operating systems, or windows operating systems.
  • the application layer can include applications such as browsers, address books, word processing software, and instant messaging software.
  • the embodiment of this application does not specifically limit the specific structure of the execution subject of the method provided in the embodiment of this application, as long as it can run a program that records the code of the method provided in the embodiment of this application to follow the method provided in the embodiment of this application.
  • the execution subject of the method provided in the embodiments of the present application may be a terminal device or a network device, or a functional module in the terminal device or the network device that can call and execute the program.
  • the various aspects or features of this application can be implemented as methods, devices, or products using standard programming and/or engineering techniques.
  • article of manufacture used herein can encompass a computer program accessible from any computer-readable device, carrier, or medium.
  • the computer-readable medium may include, but is not limited to: magnetic storage devices (for example, hard disks, floppy disks, or tapes), optical disks (for example, compact discs (CD), digital versatile discs (DVD)), smart cards and flash memory devices (for example, erasable programmable read-only memory (EPROM), cards, sticks or key drives).
  • the various storage media described herein may represent one or more devices and/or other machine-readable media for storing information.
  • the term "machine-readable medium” may include, but is not limited to: wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
  • the processor mentioned in the embodiments of the present application may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory mentioned in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • the volatile memory may be random access memory (RAM).
  • RAM can be used as an external cache.
  • RAM can include the following various forms: static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM).
  • when the processor is a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, the memory (storage module) may be integrated in the processor.
  • the memories described herein are intended to include, but are not limited to, these and any other suitable types of memories.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each functional unit in each embodiment of the present application may be integrated into one unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the essence of the technical solution of this application, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a computer software product, and the computer software product is stored in a storage medium.
  • the computer software product includes a number of instructions, which are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media may include, but are not limited to: U disks (USB flash drives), mobile hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A method and device for entity tagging, applied in the field of artificial intelligence. In the method, a processor can update a first sequence tagging model by using N mask vectors corresponding to N sample sets. The M dimensions of each mask vector correspond to M named entities; therefore, each mask vector can indicate that some of the named entities are attended to while the remaining named entities are not. Thus, in one update of the sequence tagging model, the processor can adjust the parameters corresponding to some of the named entities without adjusting the parameters corresponding to the remaining named entities; after being updated once or multiple times, a second sequence tagging model can predict prediction statements of different corpora, avoiding the need to train a different entity tagging model for each of the sample sets, so that complexity can be reduced and the performance of entity tagging is improved.

Description

Method and device for entity labeling
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 29, 2020, with application number 202010474348.X and the application name "Method and device for entity labeling", the entire content of which is incorporated by reference in this application.
Technical field
This application relates to the field of artificial intelligence (AI), and more specifically to a method and device for entity labeling in the AI field.
Background
Named entity recognition (NER) is a basic task in natural language processing (NLP). NER can identify entities of categories such as person names, place names, organization names, and dates and times, so that the identified entities can be used for information extraction, information retrieval, syntactic analysis, semantic role labeling, and so on.
In NER, an input sentence can be fed into a sequence labeling model, which outputs a label for each word. In the prior art, a sequence labeling model trained on a specific corpus is only applicable to specific input sentences. For example, if the training sentences in the sample set used to train the sequence labeling model come from a movie corpus, the input sentences to be predicted by the model must contain movies for their labels to be predicted; if an input sentence contains both movies and TV series, only the movies can be predicted and the TV series cannot. If there are multiple input sentences from multiple different corpora, multiple sequence labeling models for different corpora or different corpus combinations need to be trained, which leads to high complexity. Moreover, in order to predict the labels of an input sentence, multiple sequence labeling models need to run concurrently and the sequence labeling model suitable for the input sentence has to be matched among them, resulting in a large amount of calculation and high complexity.
Summary of the invention
The embodiments of the present application provide a method and device for entity labeling, which can reduce complexity and help improve the performance of entity labeling.
In a first aspect, a method for entity labeling is provided. The method may be executed by a processor or a processing module. The method includes: determining N mask vectors of N sample sets, where the N sample sets are in one-to-one correspondence with the N mask vectors, different sample sets in the N sample sets correspond to different entity corpora, each of the N sample sets includes multiple samples of at least one entity corpus, the M dimensions of each of the N mask vectors correspond to M named entities, and M and N are positive integers;
and updating a first sequence labeling model according to partial samples in each of the N sample sets and the N mask vectors to obtain a second sequence labeling model, where the second sequence labeling model is used for entity labeling.
In the above technical solution, one sample set corresponds to one mask vector, and different sample sets correspond to different entity corpora; in other words, sample sets of different corpora have different mask vectors, and the processor can update the first sequence labeling model by combining the N mask vectors corresponding to the N sample sets. Since the M dimensions of each mask vector correspond to M named entities, each mask vector can indicate that some named entities are attended to while the remaining named entities are not. In this way, in one update of the sequence labeling model, the processor can adjust the parameters corresponding to some named entities without adjusting the parameters corresponding to the remaining named entities. After one or more updates, the second sequence labeling model can predict prediction sentences of different corpora, which avoids the need to train a separate entity labeling model for each sample set, reduces complexity, and helps improve the performance of entity labeling.
Optionally, the N mask vectors are used to mask multiple loss vectors obtained from the N sample sets, and the masked loss vectors are used to update the first sequence labeling model. Optionally, the processor inputs the words of the training sentences of each of the N sample sets into the first sequence labeling model before the update to obtain the weight vector of each word, and the processor inputs the weight vector of each word and the actual label of each word into the loss function to obtain the multiple loss vectors.
That different sample sets in the N sample sets correspond to different entity corpora can be understood as: the entity corpora corresponding to different sample sets in the N sample sets are not completely the same. Specifically, a first sample set in the above N sample sets corresponds to a first entity corpus, and a second sample set corresponds to a second entity corpus; the first entity corpus is completely different from the second entity corpus, or the first entity corpus and the second entity corpus partially overlap. In other words, the entity corpora corresponding to different sample sets in the N sample sets are either completely different or partly the same and partly different.
In the above N sample sets, different sample sets may correspond to the same number of entity corpora but different corpus types (at least one corpus type is different); or different sample sets may correspond to different numbers of entity corpora with at least one corpus type in common; or different sample sets may correspond to different numbers of entity corpora and also different corpus types.
Each of the above N sample sets includes training sentences of at least one entity corpus, and different training sentences included in the same sample set correspond to the same entity corpus.
The dimensions of the above N mask vectors are the same: they are all M-dimensional vectors.
One dimension of each of the above N mask vectors corresponds to one named entity, the M-dimensional mask vector corresponds to the M named entities one-to-one, and the N mask vectors correspond to a total of M named entities.
Different entity corpora include different named entities. For example, the first entity corpus includes a first named entity, the second entity corpus includes a second named entity, and the first named entity and the second named entity are not exactly the same.
Optionally, each mask vector consists of 0s and 1s.
It should be noted that the first sequence labeling model can be updated one or more times in the above solution. After each update, the updated model can still be called the first sequence labeling model; after one or more such updates, the second sequence labeling model is obtained.
Each of the aforementioned N sample sets consists of a test set and a training set. The samples in the training set are used to update the first sequence labeling model, and the samples in the test set are used to test the stability of the second sequence labeling model. Each sample in a sample set is a sentence that includes entity words; a sample in the test set can be called a test sentence, and a sample in the training set can be called a training sentence.
In some possible implementations, updating the first sequence labeling model according to the partial samples in each of the N sample sets and the N mask vectors includes:
inputting a first word in a first sample of a first sample set of the N sample sets into the first sequence labeling model, and outputting a weight vector of the first word;
inputting the actual label vector and the weight vector of the first word into a loss function, and calculating a loss vector of the first word;
multiplying the loss vector by a first mask vector corresponding to the first sample set to obtain a masked loss vector; and updating the first sequence labeling model according to the masked loss vector;
where the dimensions of the weight vector, the actual label vector, and the loss vector are all M.
In the above solution, when updating the first sequence labeling model, taking the first word as an example, the first word can be input into the first sequence labeling model to obtain the weight vector of the first word; the weight vector reflects, to a certain extent, how likely the first word is to be marked with each label. The loss vector is calculated from the weight vector and the actual label vector of the first word, and the loss vector is masked with the first mask vector. In this way, when the masked loss vector is used to update the first sequence labeling model, only the parameters of the named entities corresponding to the non-zero positions of the mask vector are adjusted, and the parameters of the named entities corresponding to the zero positions of the mask vector are not adjusted, so that the updated first sequence labeling model can be closer to a sequence labeling model for the named entities corresponding to the non-zero positions of the mask vector, which can improve the accuracy of the second sequence labeling model.
The dimension of the weight vector of the first word, the dimension of the actual label vector of the first word, the dimension of the loss vector, the dimension of each mask vector, and the dimension of the masked loss vector are the same.
Optionally, the aforementioned loss function is a cross-entropy function.
It should be understood that, in this application, the multiplication of two vectors may be an element-wise (dot) multiplication, in which the corresponding elements of the two vectors are multiplied.
In some possible implementations, the first word is an entity word in the first sample rather than a non-entity word; in this way, the efficiency of updating the first sequence labeling model can be improved.
In some possible implementations, the method further includes: testing the stability of the second sequence labeling model according to the remaining samples in each of the N sample sets.
Specifically, part of the samples in each sample set can be used to train the first sequence labeling model to obtain the second sequence labeling model. In this way, part of the samples in each sample set can be used to update the first sequence labeling model, and the remaining samples in each sample set can be used to test the stability of the second sequence labeling model.
In some possible implementations, the method further includes: inputting a second entity word in a prediction sentence into the second sequence labeling model, and outputting a prediction vector;
and determining at least one label of the second entity word according to the prediction vector, where the prediction sentence is a sentence including the entity corpus corresponding to any one of the N sample sets;
where the dimension of the prediction vector is M.
In some possible implementations, determining at least one label of the second entity word according to the prediction vector includes: determining whether the value of each dimension of the prediction vector is greater than a preset value; and determining the named entity labels corresponding to the dimensions whose values are greater than the preset value as the at least one label of the second entity word.
In the above solution, the second sequence labeling model can be used to predict the label of the second entity word: according to whether the value of each element of the prediction vector output by the second sequence labeling model is greater than a preset value, one or more labels are assigned to the second entity word. In this way, in the embodiments of the present application, one entity word can be marked with more than one label.
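For illustration only, the following is a minimal Python sketch of the thresholding step just described, which maps a prediction vector to one or more named entity labels. The entity names, the threshold value 0.5, and the function name are assumptions made for this sketch and are not part of the claimed method.

```python
# Hypothetical illustration: map a prediction vector to entity labels by thresholding.
# The entity order (movie, TV series, variety show, animation) and the preset
# value 0.5 are assumptions for this sketch.
ENTITIES = ["movie", "TV series", "variety show", "animation"]

def labels_from_prediction(prediction, preset_value=0.5):
    """Return every named entity label whose dimension exceeds the preset value."""
    return [name for name, score in zip(ENTITIES, prediction) if score > preset_value]

# Example: a word predicted as both a movie and a TV series gets two labels.
print(labels_from_prediction([0.8, 0.7, 0.1, 0.0]))  # ['movie', 'TV series']
```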
In some possible implementations, determining the N mask vectors of the N sample sets includes: determining that the dimension of each of the N mask vectors is the total number of entity corpus types corresponding to the N sample sets; and determining the value of each of the N mask vectors according to the entity corpus corresponding to each of the N sample sets.
In a second aspect, a method for entity labeling is provided, including: inputting a second entity word in a prediction sentence into a second sequence labeling model, and outputting a prediction vector; and determining at least one label of the second entity word according to the prediction vector.
The above second sequence labeling model is obtained after updating a first sequence labeling model according to partial samples in each of N sample sets and N mask vectors.
In some possible implementations, determining at least one label of the second entity word according to the prediction vector includes: determining whether the value of each dimension of the prediction vector is greater than a preset value; and determining the named entity labels corresponding to the dimensions whose values are greater than the preset value as the at least one label of the second entity word.
In a third aspect, a device for entity labeling is provided, and the device is configured to execute the method in the foregoing first aspect or any possible implementation manner of the first aspect. Specifically, the device may include a module for executing the method in the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, a device for entity labeling is provided, and the device is configured to execute the method in the foregoing second aspect or any possible implementation manner of the second aspect. Specifically, the device may include a module for executing the method in the second aspect or any possible implementation manner of the second aspect.
In a fifth aspect, a device for entity labeling is provided. The device includes a processor, the processor is coupled with a memory, the memory is used to store computer programs or instructions, and the processor is used to execute the computer programs or instructions stored in the memory, so that the method in the first aspect is executed.
For example, the processor is used to execute a computer program or instruction stored in the memory, so that the device executes the method in the first aspect.
Optionally, the device includes one or more processors.
Optionally, the device may also include a memory coupled with the processor.
Optionally, the device may include one or more memories.
Optionally, the memory may be integrated with the processor or provided separately.
Optionally, the device may also include a transceiver.
In a sixth aspect, a device for entity labeling is provided. The device includes a processor, the processor is coupled with a memory, the memory is used to store computer programs or instructions, and the processor is used to execute the computer programs or instructions stored in the memory, so that the method in the second aspect is executed.
For example, the processor is used to execute a computer program or instruction stored in the memory, so that the device executes the method in the second aspect.
Optionally, the device includes one or more processors.
Optionally, the device may also include a memory coupled with the processor.
Optionally, the device may include one or more memories.
Optionally, the memory may be integrated with the processor or provided separately.
Optionally, the device may also include a transceiver.
In a seventh aspect, a computer-readable storage medium is provided, on which a computer program (also referred to as instructions or code) for implementing the method in the first aspect is stored.
For example, when the computer program is executed by a computer, the computer can execute the method in the first aspect.
In an eighth aspect, a computer-readable storage medium is provided, on which a computer program (also referred to as instructions or code) for implementing the method in the first aspect or the second aspect is stored.
For example, when the computer program is executed by a computer, the computer can execute the method in the second aspect.
In a ninth aspect, this application provides a chip including a processor. The processor is used to read and execute the computer program stored in the memory to execute the method in the first aspect and any possible implementation manner thereof.
Optionally, the chip further includes a memory, and the memory and the processor are connected through a circuit or a wire.
Further optionally, the chip further includes a communication interface.
In a tenth aspect, this application provides a chip system including a processor. The processor is used to read and execute the computer program stored in the memory to execute the method in the second aspect and any possible implementation manner thereof.
Optionally, the chip further includes a memory, and the memory and the processor are connected through a circuit or a wire.
Further optionally, the chip further includes a communication interface.
In an eleventh aspect, the present application provides a computer program product. The computer program product includes a computer program (also referred to as instructions or code), and when the computer program is executed by a computer, the computer implements the method in the first aspect.
In a twelfth aspect, the present application provides a computer program product. The computer program product includes a computer program (also referred to as instructions or code), and when the computer program is executed by a computer, the computer implements the method in the second aspect.
Description of the drawings
Fig. 1 is a schematic diagram of a method for entity labeling provided by an embodiment of the present application.
Fig. 2 is a schematic diagram of a method for obtaining a second sequence labeling model provided by an embodiment of the present application.
Fig. 3 is a schematic diagram of an example of updating a first sequence labeling model provided by an embodiment of the present application.
Fig. 4 is a schematic diagram of another example of updating the first sequence labeling model provided by an embodiment of the present application.
Fig. 5 is a schematic diagram of performing prediction with the second sequence labeling model provided by an embodiment of the present application.
Fig. 6 is a schematic diagram of an example of performing prediction with the second sequence labeling model provided by an embodiment of the present application.
Fig. 7 is a schematic diagram of a possible application scenario provided by an embodiment of the present application.
Fig. 8 is a schematic block diagram of a device for entity labeling provided by an embodiment of the present application.
Fig. 9 is a schematic block diagram of another device for entity labeling provided by an embodiment of the present application.
Detailed description of embodiments
The embodiments provided in this application can be applied to AI services in the AI field. AI services involve voice assistants, subtitle generation, voice input, chat robots, customer robots, or spoken language evaluation. Of course, in actual applications, other AI services may also be included, which is not limited in the embodiments of this application.
The following explains the terms used in the embodiments of the present application.
1. Sample set: a sample set consists of a test set and a training set. The samples in the test set are test samples, which may also be called test sentences; the samples in the training set are training samples, which may also be called training sentences. The samples in each sample set include the same corpus; in other words, the samples in one sample set are composed of test sentences and training sentences that include the same corpus. For example, each sample in sample set 1 includes movie entities. Taking sample set 1 including 3 samples as an example, the 3 samples are: I want to watch "Romance of the Three Kingdoms"; play "Youth in Youth" for me; please open "Tangshan Earthquake". For another example, at least some samples in sample set 2 may include both movie and TV series entities, and the remaining samples may include movies or TV series. Taking sample set 2 including 3 training sentences as an example, the 3 training sentences are: play "Nezha" (Nezha is a movie) and "Sansheng III" (Sansheng III is a TV series) for me; I want to watch "Romance of the Three Kingdoms" (here the Romance of the Three Kingdoms may be a movie or a TV series); "play Nezha for me" (Nezha is a movie).
2. Sequence labeling model: the sequence labeling model can be a long short-term memory (LSTM)-conditional random field (CRF) model. LSTM is suitable for sequence modeling problems, and stacking a CRF on top of the LSTM facilitates path planning. The sequence labeling model may also be a sequence-to-sequence (Seq2Seq) model or a transformer model.
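As a rough, assumption-laden sketch only, the following PyTorch snippet shows a BiLSTM token tagger that outputs an M-dimensional weight vector for each input word, which is the shape of output used in the later steps. It omits the CRF layer mentioned above, and all hyperparameter values and class names are placeholders rather than the model actually used in the embodiments.

```python
import torch
import torch.nn as nn

class BiLstmTagger(nn.Module):
    """Minimal BiLSTM tagger: one M-dimensional weight vector per input word.

    This sketch omits the CRF layer; vocab_size, embed_dim and hidden_dim are
    placeholder hyperparameters chosen only to make the example runnable.
    """
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, num_entities=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_entities)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(token_ids))   # (batch, seq_len, 2*hidden_dim)
        return torch.sigmoid(self.out(hidden))         # per-word M-dimensional weights
```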
3. Mask vector: a vector composed of 0s and 1s. One dimension of the mask vector corresponds to one named entity; a value of 1 in a dimension means that the corresponding named entity is attended to, and a value of 0 means that it is not attended to.
In the prior art, a sequence labeling model trained on a specific corpus is only applicable to specific input sentences. For example, if the sample set used to train the sequence labeling model is the above sample set 2, the input sentences to be predicted by the model also need to include movies and/or TV series; in other words, the entities of the input sentences fed into the model for prediction need to be a subset of the entities of the training sentences and test sentences included in sample set 2 in order to meet the prediction accuracy. For another example, if the sample set used to train the sequence labeling model is the above sample set 1, the input sentences to be predicted need to include movies before labels can be predicted; if an input sentence includes both movies and TV series, only the movies can be predicted and the TV series cannot. For example, if the input sentence to be predicted is "I want to watch the Romance of the Three Kingdoms", only a movie label can be output for "Romance of the Three Kingdoms"; since "Romance of the Three Kingdoms" may be a TV series, the TV series label cannot be output, which leads to inaccurate entity labeling. If multiple input sentences of multiple corpora need to be predicted, multiple sequence labeling models for different corpora or different corpus combinations need to be trained, which leads to high complexity. Moreover, in order to predict the labels of an input sentence, multiple sequence labeling models need to run concurrently, and the sequence labeling model suitable for the input sentence has to be matched among the multiple sequence labeling models, resulting in a large amount of calculation and high complexity.
The following describes the method 100 for entity labeling provided by the embodiments of the present application with reference to the accompanying drawings. The method 100 may be executed by a processor, and the method 100 includes:
S110. The processor determines N mask vectors of N sample sets, where the N sample sets correspond to the N mask vectors one-to-one, different sample sets in the N sample sets correspond to different entity corpora, each sample set in the N sample sets includes multiple samples of at least one entity corpus, the M dimensions of each of the N mask vectors correspond to M named entities, and M and N are positive integers.
Here, that different sample sets in the N sample sets correspond to different entity corpora can be understood as: the entity corpora corresponding to different sample sets are not completely the same. Specifically, the entity corpora corresponding to different sample sets in the N sample sets are partly the same and partly different, or the entity corpora corresponding to different sample sets in the N sample sets are completely different.
It is understandable that the dimension of each of the above N mask vectors is M, one dimension of a mask vector corresponds to one named entity, the M-dimensional mask vector corresponds to the M named entities one-to-one, and the N mask vectors correspond to a total of M named entities.
Optionally, in the N sample sets, different sample sets include at least one identical training sentence and/or test sentence, or every training sentence and/or test sentence included in different sample sets is different; this is not limited in the embodiments of this application.
To better explain the N sample sets and the N M-dimensional mask vectors, the following example is given. Assume N=6, that is, 6 sample sets (sample set 1, sample set 2, sample set 3, sample set 4, sample set 5, and sample set 6) correspond to 6 mask vectors: sample set 1 corresponds to the movie corpus, sample set 2 to the TV series corpus, sample set 3 to the variety show corpus, sample set 4 to the animation corpus, sample set 5 to the movie and TV series corpora, and sample set 6 to the TV series and variety show corpora. The 6 sample sets include a total of 4 types of corpus, so the dimension of the mask vectors is 4, that is, M=4, and the 4 dimensions correspond to the 4 named entities of movie, TV series, animation, and variety show. Specifically, the correspondence between the dimensions of the mask vector and the named entities can be specified; for example, the first dimension of each mask vector corresponds to movie, the second dimension to TV series, the third dimension to variety show, and the fourth dimension to animation. In this way, the 6 mask vectors corresponding to the 6 sample sets are [1 0 0 0], [0 1 0 0], [0 0 1 0], [0 0 0 1], [1 1 0 0], and [0 1 1 0], respectively.
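To make the correspondence concrete, the following Python sketch builds the 6 mask vectors of this example from the corpus types of each sample set. The dictionary contents simply restate the example above; the entity order and the function name are assumptions made for illustration.

```python
# Illustrative sketch: build mask vectors from the corpus types of each sample set.
# The entity order and sample set contents restate the example in the text.
ENTITY_ORDER = ["movie", "TV series", "variety show", "animation"]

SAMPLE_SET_CORPORA = {
    "sample set 1": ["movie"],
    "sample set 2": ["TV series"],
    "sample set 3": ["variety show"],
    "sample set 4": ["animation"],
    "sample set 5": ["movie", "TV series"],
    "sample set 6": ["TV series", "variety show"],
}

def mask_vector(corpora):
    """1 in the dimensions of the named entities the sample set attends to, 0 elsewhere."""
    return [1 if entity in corpora else 0 for entity in ENTITY_ORDER]

for name, corpora in SAMPLE_SET_CORPORA.items():
    print(name, mask_vector(corpora))
# sample set 1 [1, 0, 0, 0] ... sample set 6 [0, 1, 1, 0]
```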
It should be noted that M and N have the following relationship: N is less than or equal to C(M,0)+C(M,1)+C(M,2)+...+C(M,M) = 2^M, where C(M,k) denotes the number of ways of choosing k of the M named entities. Combined with the above example, if M=4, then N is a positive integer less than or equal to 16. Specifically, the term C(4,0)=1 in this sum means that the mask vector can be [0 0 0 0], that is, no masking is applied to the named entity of any dimension; the term C(4,1)=4 means that the mask vector can be [1 0 0 0], [0 1 0 0], [0 0 1 0], or [0 0 0 1]; the term C(4,2)=6 means that the mask vector can be [1 1 0 0], [0 1 1 0], [1 0 1 0], [1 0 0 1], [0 1 0 1], or [0 0 1 1]; the term C(4,3)=4 means that the mask vector can be [1 1 1 0], [0 1 1 1], [1 0 1 1], or [1 1 0 1]; and the term C(4,4)=1 means that the mask vector can be [1 1 1 1].
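The combinatorial bound above can be checked with a short sketch that enumerates all subsets of the M named entities; this is only a verification of the relationship between M and the number of possible mask vectors, not part of the method itself.

```python
from itertools import combinations
from math import comb

M = 4
# Sum of C(M, k) over k = 0..M equals 2**M, the number of possible mask vectors.
total = sum(comb(M, k) for k in range(M + 1))
assert total == 2 ** M == 16

# Enumerate the masks themselves: one 0/1 vector per subset of attended entities.
masks = [[1 if i in subset else 0 for i in range(M)]
         for k in range(M + 1) for subset in combinations(range(M), k)]
print(len(masks))  # 16
```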
S120. The processor updates the first sequence labeling model according to the partial samples in each of the N sample sets and the N mask vectors to obtain a second sequence labeling model, and the second sequence labeling model is used for entity labeling.
In the method 100, one sample set corresponds to one mask vector, and different sample sets correspond to different entity corpora; in other words, sample sets of different corpora have different mask vectors, and the processor can update the first sequence labeling model by combining the N mask vectors corresponding to the N sample sets. Since the M dimensions of each mask vector correspond to M named entities, each mask vector can indicate that some named entities are attended to while the remaining named entities are not. In this way, in one update of the sequence labeling model, the processor can adjust the parameters corresponding to some named entities without adjusting the parameters corresponding to the remaining named entities. After one or more updates, the second sequence labeling model can predict input sentences of different corpora, which avoids the need to train a separate entity labeling model for each sample set, reduces complexity, and helps improve the performance of entity labeling.
To better understand the above method 100, how the second sequence labeling model is obtained is described in detail below with reference to the method 200 of Fig. 2. The method 200 is executed by the processor in the method 100.
S210. The processor obtains the N sample sets.
Specifically, before S210, multiple training sentences and multiple test sentences including entity corpora need to be collected and manually labeled. Manual labeling is divided into single named entity labeling and mixed labeling of multiple named entities. The labeled sentences are classified according to named entity labels or corpora to obtain the N sample sets, which are input into the processor.
The entity words in the training sentences and test sentences included in each of the N sample sets have corresponding actual labels.
Exemplarily, if a training sentence is "I want to watch Nezha and Sansheng III", Nezha can be marked with a movie label and Sansheng III with a TV series label. This is the way of mixed labeling of multiple named entities, that is, one training sentence can be marked with at least two labels. If the training sentence is still "I want to watch Nezha and Sansheng III", "Nezha" can be marked with a movie label and Sansheng III left unmarked, or "Sansheng III" can be marked as a TV series and Nezha left unmarked; this is the way of labeling a single named entity. In this way, the sample sets of different corpora (the movie corpus, the TV series corpus, and the movie + TV series corpus) can all include the training sentence "I want to watch Nezha and Sansheng III".
S220. The processor determines the N mask vectors corresponding to the N sample sets; S220 is equivalent to S110. The N sample sets are in one-to-one correspondence with the N mask vectors.
Specifically, the processor determines the dimension and value of each mask vector according to the named entities included in each of the N sample sets; for the specific determination method, refer to the description of S110.
It should be noted that one sample set corresponding to one mask vector can be understood as multiple samples in one sample set corresponding to one mask vector.
The words in each training sentence also have actual label vectors, and the actual label vector has the same dimension as the mask vector. Combined with the example in S110, the dimension of the mask vector is 4, and the dimension of the actual label vector is also 4; the first dimension of the actual label vector corresponds to movie, the second dimension to TV series, the third dimension to variety show, and the fourth dimension to animation. Then, for "Nezha" in the training sentence "I want to watch Nezha", the actual label vector is [1 0 0 0]; for "Romance of the Three Kingdoms" in the training sentence "I want to watch the Romance of the Three Kingdoms", the actual label vector is [1 1 0 0], that is, the Romance of the Three Kingdoms may be a movie or a TV series.
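A small sketch of how such multi-hot actual label vectors could be formed, using the dimension order of this example; the word-to-entity mapping below is only illustrative.

```python
# Illustrative sketch: multi-hot actual label vectors in the order
# (movie, TV series, variety show, animation) used in this example.
ENTITY_ORDER = ["movie", "TV series", "variety show", "animation"]

def label_vector(entity_types):
    return [1 if entity in entity_types else 0 for entity in ENTITY_ORDER]

print(label_vector({"movie"}))               # "Nezha" -> [1, 0, 0, 0]
print(label_vector({"movie", "TV series"}))  # "Romance of the Three Kingdoms" -> [1, 1, 0, 0]
```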
S230. Determine the first sequence labeling model. For example, the initial first sequence labeling model may be an LSTM-CRF model.
It should be noted that the order of S220 and S230 is not limited; S220 can be performed before, after, or at the same time as S230.
The following takes as an example the first word in the first sample (the first sample is a training sentence) of the first sample set among the N sample sets. Words in the samples of other sample sets are handled similarly to the first word; to avoid redundancy, detailed examples are not given.
It should be noted that once the first sequence labeling model has been updated by the processor, the updated model may also be referred to as the first sequence labeling model.
S240. The processor inputs the first word in the first sample of the first sample set of the N sample sets into the first sequence labeling model, and outputs a weight vector of the first word.
It can be understood that the physical meaning of the weight vector of the first word is the weight of the first word being each named entity label: the larger the value of a certain dimension of the weight vector, the more likely the first word is to be the named entity label corresponding to that dimension.
It should be noted that, in the embodiments of this application, the dimensions of the first mask vector, the actual label vector of the first word, and the weight vector of the first word are the same, and the named entity corresponding to the same dimension of each vector is the same. For example, combined with the example of S110, the first dimension of the first mask vector, of the actual label vector of the first word, and of the weight vector of the first word corresponds to movie, the second dimension corresponds to TV series, the third dimension corresponds to variety show, and the fourth dimension corresponds to animation.
Optionally, the first word is an entity word in the first training sentence; of course, it may also be a non-entity word, which is not limited in the embodiments of the present application.
S250. The processor inputs the actual label vector and the weight vector of the first word into the loss function, and calculates the loss vector of the first word.
For example, the loss function is a cross-entropy function.
It should be noted that, in S250, the processor needs to compare the actual label vector of the first word with the weight vector to determine the degree of deviation between the weight vector output by the first sequence labeling model and the actual label vector of the first word.
S260. The processor multiplies the loss vector of the first word by the first mask vector corresponding to the first sample set to obtain a masked loss vector. The masked loss vector is used to update the first sequence labeling model; therefore, S230 is executed.
In S260, the first mask vector corresponding to the first sample set only attends to the named entities corresponding to its non-zero positions. After the processor multiplies the first mask vector by the loss vector of the first word, when the processor updates or adjusts the first sequence labeling model with the resulting masked loss vector, it only attends to the named entities corresponding to the non-zero positions of the first mask vector and does not attend to the named entities corresponding to the zero positions. In other words, when the processor adjusts the parameters corresponding to some named entities of the first sequence labeling model, the parameters of other named entities are not affected, which ensures that the adjusted first sequence labeling model can satisfy input sentences of different corpora.
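The masking step of S250-S260 can be sketched as follows in NumPy, using the per-dimension cross-entropy loss written out in the examples that follow. The function name, the epsilon clipping, and the choice of logarithm base are assumptions added only to keep the sketch numerically safe and runnable.

```python
import numpy as np

def masked_loss_vector(weight_vector, actual_label_vector, mask_vector, eps=1e-12):
    """Per-dimension cross-entropy loss, then element-wise multiplication by the mask.

    Only the dimensions where the mask is non-zero keep a loss contribution, so an
    update driven by this vector adjusts only the parameters of the attended entities.
    """
    p = np.clip(np.asarray(weight_vector, dtype=float), eps, 1.0 - eps)
    y = np.asarray(actual_label_vector, dtype=float)
    loss = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))   # per-dimension loss
    return loss * np.asarray(mask_vector, dtype=float)       # masked loss vector
```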
The above S230-S260 describe the execution process for the first word in the first training sentence of the first sample set; any word in any training sentence of any sample set can also go through a process similar to S230-S260, which is not described in detail here to avoid repetition. Only the following two cases are discussed regarding the order in which the processor uses the multiple training sentences of each sample set to update the first sequence labeling model:
Case 1: the processor inputs part of the samples of each sample set into the first sequence labeling model in batches, and the first sequence labeling model can be updated multiple times at the same time. For example, N=3 and each sample set includes 100 samples, of which 70 are training samples used to update the first sequence labeling model (the remaining 30 samples of each sample set are test samples used to test the stability of the second sequence labeling model). The 70 training samples of each sample set are input into the first sequence labeling model in 7 batches; for example, the first batch of training samples includes 10 training samples from each of the 3 training sample sets. Taking the case where the processor updates the first sequence labeling model once according to one training sample of one training sample set, and one training sample includes one entity word, as an example: at the first time point, the processor can update the first sequence labeling model with 3 training samples, one from each of the 3 training sample sets, according to S240-S260; at the second time point, the processor can update the first sequence labeling model with another 3 training samples from the 3 training sample sets according to S240-S260; and so on, until at the 10th time point the processor updates the first sequence labeling model with the last 3 training samples of the 3 training sample sets according to S240-S260, completing the process of updating the first sequence labeling model with the first batch of training samples. By analogy, the remaining 60 training samples of each of the 3 training sample sets are input into the first sequence labeling model to obtain the updated first sequence labeling model. This example is only intended to better illustrate the process of updating the first sequence labeling model; the above describes the processor updating the first sequence labeling model once according to one word in one training sample of one training sample set, and the first sequence labeling model may also be updated once according to multiple words in multiple training samples of one sample set.
Case 2: the processor mixes together all the training samples included in each of the N sample sets, and then inputs the mixed training samples into the first sequence labeling model in batches, executing S240-S260 once for each training sample, so that the first sequence labeling model is updated in batches, where each training sample has a corresponding mask vector.
To better illustrate the method 200, an example is given below with reference to Fig. 3. As shown in Fig. 3, the sample set of the movie corpus includes the training sample "I want to watch the Romance of the Three Kingdoms", and the mask vector corresponding to the sample set of the movie corpus is [1 0 0 0]. The processor inputs the words of "I want to watch the Romance of the Three Kingdoms" into the first sequence labeling model, where "I", "want" and "watch" are non-entity words marked with "O" in the figure. The processor inputs "Romance of the Three Kingdoms" into the first sequence labeling model, that is, this step is the above S240. For example, the weight vector P output for "Romance of the Three Kingdoms" is [0.5 0.4 0 0.1], where the first dimension of the weight vector corresponds to movie, the second dimension to TV series, the third dimension to variety show, and the fourth dimension to animation; that is, the probability that "Romance of the Three Kingdoms" is a movie is 0.5, the probability that it is a TV series is 0.4, the probability that it is a variety show is 0, and the probability that it is an animation is 0.1. At this time, the values of the dimensions of the weight vector add up to 1. In the movie corpus, the actual label vector Y of "Romance of the Three Kingdoms" is [1 0 0 0]. For example, the loss function is -(y_i log(p_i)+(1-y_i)log(1-p_i)), where y_i is the value of the corresponding dimension of Y, p_i is the value of the corresponding dimension of P, and i takes the values 1, 2, 3, 4. The loss vector calculated by the processor according to P and Y is then [0.3 0.2 0 0.04]. The processor multiplies the mask vector [1 0 0 0] by the loss vector [0.3 0.2 0 0.04] to obtain the masked loss vector [0.3 0 0 0], and then feeds the masked loss vector [0.3 0 0 0] back into the first sequence labeling model; with the masked loss vector [0.3 0 0 0], only the parameters of the first sequence labeling model related to the movie part are adjusted, and the other parameters remain unchanged. The multiplication of the two vectors may be a dot multiplication, that is, the corresponding positions of the two vectors are multiplied.
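A quick numeric check of this example: using a base-10 logarithm in the loss formula above appears to reproduce the illustrative numbers in the text (roughly [0.3 0.22 0 0.046] before rounding). This choice of logarithm base is an inference made only so the example can be reproduced, and the variable names are assumptions.

```python
import numpy as np

P = np.array([0.5, 0.4, 0.0, 0.1])     # weight vector output for "Romance of the Three Kingdoms"
Y = np.array([1.0, 0.0, 0.0, 0.0])     # actual label vector in the movie corpus
mask = np.array([1.0, 0.0, 0.0, 0.0])  # mask vector of the movie sample set

eps = 1e-12
p = np.clip(P, eps, 1.0 - eps)
loss = -(Y * np.log10(p) + (1.0 - Y) * np.log10(1.0 - p))  # base-10 log (inferred)
print(np.round(loss, 3))         # ~[0.301 0.222 0.    0.046], close to [0.3 0.2 0 0.04]
print(np.round(loss * mask, 3))  # masked loss vector ~[0.301 0.    0.    0.   ]
```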
To better illustrate method 200, another example is given below with reference to FIG. 4. As shown in FIG. 4, the sample set of the movie + TV drama corpus includes the training sample "我要看三国演义" ("I want to watch Romance of the Three Kingdoms"), and the mask vector corresponding to the sample set of the movie + TV drama corpus is [1 1 0 0]. The processor inputs the words of "我要看三国演义" into the first sequence labeling model separately; "我", "要" and "看" are non-entity words, marked "O" in the figure. The processor inputs "三国演义" into the first sequence labeling model, that is, this step is S240 above. For example, the weight vector P output for "三国演义" is [0.5 0.4 0 0.1], where the first dimension corresponds to movie, the second to TV series, the third to variety show and the fourth to animation; that is, the probability that "三国演义" is a movie is 0.5, a TV series 0.4, a variety show 0 and an animation 0.1, and the values of the dimensions of the weight vector sum to 1. In the movie + TV drama corpus the actual label vector Y of "三国演义" is [1 1 0 0], that is, "三国演义" may be either a movie or a TV series. For example, the loss function is -(y_i log(p_i) + (1-y_i) log(1-p_i)), where y_i is the value of the corresponding dimension of Y, p_i is the value of the corresponding dimension of P, and i takes the values 1, 2, 3, 4. The loss vector the processor computes from P and Y is then [0.3 0.4 0 0.04]. The processor multiplies the mask vector [1 1 0 0] by the loss vector [0.3 0.4 0 0.04] to obtain the masked loss vector [0.3 0.4 0 0], and feeds the masked loss vector [0.3 0.4 0 0] back into the first sequence labeling model; with the masked loss vector [0.3 0.4 0 0], only the parameters of the first sequence labeling model related to movies and TV series are adjusted, and the other parameters remain unchanged. The multiplication of the two vectors may also be a dot (element-wise) multiplication, that is, the values at corresponding positions of the two vectors are multiplied.
It can be understood that, to a certain extent, the mask vector corresponding to a sample set is equal to the actual label vector of the words in the training samples of that sample set. For example, in FIG. 3 the mask vector corresponding to the movie sample set and the actual label vector corresponding to the training sample are both [1 0 0 0]; likewise, in FIG. 4 the mask vector corresponding to the movie + TV drama sample set and the actual label vector corresponding to the training sample are both [1 1 0 0].
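The masked-loss computation of FIG. 3 and FIG. 4 can be sketched as follows. The sketch assumes a base-10 logarithm, which is what reproduces the example's numbers (approximately 0.30, 0.22, 0, 0.04 before masking in FIG. 3, and approximately 0.40 in the second dimension in FIG. 4); the weight vector is simply taken as given rather than produced by a real model, and the function names are made up for the illustration.

```python
import math

def loss_vector(p, y, eps=1e-12):
    """Per-dimension cross-entropy -(y_i*log(p_i) + (1-y_i)*log(1-p_i)).
    Values are clipped by a tiny epsilon to avoid log(0)."""
    clipped = [min(max(pi, eps), 1 - eps) for pi in p]
    return [-(yi * math.log10(pi) + (1 - yi) * math.log10(1 - pi))
            for pi, yi in zip(clipped, y)]

def apply_mask(loss, mask):
    """Element-wise (dot/Hadamard) product of the loss vector and the mask vector."""
    return [l * m for l, m in zip(loss, mask)]

p = [0.5, 0.4, 0.0, 0.1]            # weight vector output for "三国演义"

y_movie = [1, 0, 0, 0]              # actual label vector in the movie corpus (FIG. 3)
mask_movie = [1, 0, 0, 0]
print(apply_mask(loss_vector(p, y_movie), mask_movie))        # ≈ [0.30, 0, 0, 0]

y_movie_tv = [1, 1, 0, 0]           # movie + TV drama corpus (FIG. 4)
mask_movie_tv = [1, 1, 0, 0]
print(apply_mask(loss_vector(p, y_movie_tv), mask_movie_tv))  # ≈ [0.30, 0.40, 0, 0]
```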
FIG. 2 to FIG. 4 above describe updating the first sequence labeling model with part of the samples (also called training samples) of each of the N sample sets to obtain the second sequence labeling model. Once the second sequence labeling model is obtained, the processor may use the remaining samples (also called test samples) of each of the N sample sets to test the stability of the second sequence labeling model. For example, the remaining samples of each sample set are input into the second sequence labeling model, the second sequence labeling model outputs a weight vector for each word of each sample, the named entity label of each word is determined from the weight vector, and the named entity label of each word is compared with the actual label; if they are consistent, the number of qualified samples is increased by one, otherwise the number of unqualified samples is increased by one, and so on, until the processor has input all the remaining samples into the second sequence labeling model and determines the pass rate of the samples, the pass rate being the number of qualified samples divided by the total number of samples. If the pass rate satisfies a threshold, the second sequence labeling model is stable; otherwise the second sequence labeling model is unstable, and the methods of FIG. 1 to FIG. 2 above continue to be executed. Continuing to execute the methods of FIG. 1 to FIG. 2 may mean: updating the first sequence labeling model to the tested second sequence labeling model, re-collecting sample sets, and continuing to execute the method shown in FIG. 1 or FIG. 2; or re-determining a first sequence labeling model unrelated to the second sequence labeling model, re-collecting sample sets, and continuing to execute the method shown in FIG. 1 or FIG. 2, until the obtained second sequence labeling model is stable.
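A minimal sketch of this pass-rate test is given below. The function name, the shape of the test data and the concrete threshold of 0.95 are assumptions made for illustration; the embodiment only requires that the pass rate satisfy some threshold.

```python
def test_stability(model_predict, test_samples, threshold=0.95):
    """Pass-rate test for the second sequence labeling model.
    `model_predict(words)` is assumed to return one named entity label per word;
    `test_samples` is a list of (words, actual_labels) pairs built from the
    samples reserved for testing in each of the N sample sets."""
    qualified = 0
    unqualified = 0
    for words, actual_labels in test_samples:
        predicted = model_predict(words)
        if predicted == actual_labels:
            qualified += 1      # every word's predicted label matches the actual label
        else:
            unqualified += 1
    pass_rate = qualified / (qualified + unqualified)
    return pass_rate >= threshold   # True: stable; False: retrain or re-collect samples
```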
After the second sequence labeling model obtained through the above process is stable, the second sequence labeling model can be used for prediction. The specific prediction process is shown in FIG. 5 and is executed by the processor; method 500 includes:

S510: The processor inputs a second entity word in a prediction sentence into the second sequence labeling model and outputs a prediction vector.

S520: The processor determines at least one label of the second entity word according to the prediction vector, the prediction sentence being a sentence that includes the entity corpus corresponding to any one of the N sample sets. The dimension of the prediction vector is M.
Optionally, S520 includes: determining the named entity label corresponding to a dimension of the prediction vector whose value is greater than a preset value as the at least one label of the second entity word. For example, the preset value is 0.5. As shown in FIG. 6, "我要看三国演义" is input into the second sequence labeling model, where "我", "要" and "看" are non-entity words. The prediction vector output for "三国演义" is [0.7 0.6 0 0.1]; the first and second dimensions of the prediction vector are both greater than 0.5, the first dimension corresponds to movie and the second to TV series, so the named entity labels of "三国演义" are movie and TV series.
It should be noted that the values of the dimensions of the prediction vector may or may not sum to 1 (for example, the sum may be greater than 1); this is not limited in the embodiments of this application.
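A sketch of this threshold rule follows. The English label names and their dimension order are assumptions used only for illustration; the preset value of 0.5 matches the example above.

```python
M_LABELS = ["movie", "TV series", "variety show", "animation"]  # assumed dimension order

def labels_from_prediction(prediction, preset=0.5, labels=M_LABELS):
    """Return every named entity label whose dimension value exceeds the preset value.
    The dimensions need not sum to 1, so more than one label may be returned."""
    return [name for value, name in zip(prediction, labels) if value > preset]

# FIG. 6 example: "三国演义" with prediction vector [0.7, 0.6, 0, 0.1]
print(labels_from_prediction([0.7, 0.6, 0.0, 0.1]))   # ['movie', 'TV series']
```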
A possible voice assistant scenario of the embodiments of this application is described below with reference to FIG. 7; it is completed by the modules of FIG. 7 in coordination. In the above method embodiments, the processor may include a natural language understanding (NLU) module. As shown in FIG. 7, method 700 is executed by an automatic speech recognition (ASR) module, a dialog manager (DM) module, a natural language understanding (NLU) module and a text to speech (TTS) module. Specifically, method 700 includes the following steps:
S701: The ASR module receives the user's utterance.

S702: The ASR module converts the user's speech into text information.

S703: The ASR module sends the text information to the DM module.

S704: The DM module determines, in combination with the context of the text information, the context information corresponding to the text information.

It should be noted that the user's utterance in S701 is a single verbal expression; before this expression the user may also have said other things related to this dialog, and those other utterances of the user constitute the context information.

S705: The DM module sends the context information and the text information to the NLU module.

The text information at this point may also be called the input sentence for prediction.

S706: The NLU module inputs the text information into the second sequence labeling model and determines, in combination with the context information, the intent information and slot information corresponding to the text information.

It should be noted that S706 is where, in relation to the above embodiments of this application, the second sequence labeling model is used to predict the named entity labels of the entity words in the input sentence, and the intent information and slot information corresponding to the text information are determined according to the named entity labels.

S707: The NLU module sends the intent information and the slot information to the DM module.

S708: The DM module calls the speech result in the TTS module according to the intent information and the slot information.

S709: The TTS module plays the speech result to the user.
To better understand method 700, for example, in S701 the user says "I want to check tomorrow's weather", and in S704 the context of the text information is determined; for example, before saying "I want to check tomorrow's weather" the user also said "I want to check the weather in Beijing". In this way, in S706 the NLU module knows from these two sentences that the user's intent is to query the weather and the slot is tomorrow's weather in Beijing; in S708 the speech result is the retrieved weather in Beijing for tomorrow; and in S709 the TTS module plays tomorrow's weather in Beijing to the user.
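The module flow of FIG. 7 can be sketched as the following pipeline. The class and method names are hypothetical and only illustrate the order of S701-S709; they do not correspond to a real product API, and the ASR, DM, NLU and TTS objects are assumed to be supplied by the caller.

```python
class VoiceAssistant:
    """Minimal sketch of the coordination between the modules in FIG. 7."""

    def __init__(self, asr, dm, nlu, tts):
        self.asr, self.dm, self.nlu, self.tts = asr, dm, nlu, tts

    def handle_utterance(self, audio):
        text = self.asr.transcribe(audio)               # S701-S702: speech to text
        context = self.dm.get_context(text)             # S703-S704: context information
        intent, slots = self.nlu.parse(text, context)    # S705-S706: the second sequence
                                                         # labeling model tags entity words;
                                                         # intent and slots follow from the tags
        speech = self.dm.fetch_result(intent, slots)     # S707-S708: fetch the speech result
        return self.tts.speak(speech)                    # S709: play the result to the user
```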
It should be noted that the logical division of the modules shown in FIG. 7 is only intended to make the scenario easier to understand; in actual applications the modules are not limited to the above, and other division methods are possible, which is not limited in the embodiments of this application. In addition, FIG. 7 only shows one possible application scenario; the embodiments of this application can also be applied to other application scenarios, such as video playback by a TV voice assistant.

It should also be noted that the examples in the embodiments of this application use samples described in Chinese; the embodiments of this application can also be applied to any possible language, for example English, French or German, which is not restricted by this application.
The embodiments described herein may be independent solutions or may be combined according to their internal logic, and all of these solutions fall within the protection scope of this application.

It can be understood that the methods and operations implemented by the electronic device in the foregoing method embodiments may also be implemented by a component (for example, a chip or a circuit) usable in the electronic device.

The method embodiments provided by this application are described above, and the apparatus embodiments provided by this application are described below. It should be understood that the description of the apparatus embodiments corresponds to the description of the method embodiments; therefore, for content not described in detail, reference may be made to the method embodiments above, and details are not repeated here for brevity.

A person skilled in the art should be aware that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the protection scope of this application.

The embodiments of this application may divide the electronic device into functional modules based on the foregoing method examples; for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of this application is illustrative and is only a logical function division; other feasible division methods may exist in actual implementation. The following description takes the division of functional modules corresponding to functions as an example.
FIG. 8 is a schematic block diagram of an apparatus 800 for named entity labeling provided by an embodiment of this application. The apparatus 800 includes a determining unit 810 and an updating unit 820. The determining unit 810 is configured to perform the determination-related operations of the processor in the embodiments above; the updating unit 820 is configured to perform the update-related operations of the processor in the embodiments above.

The determining unit 810 is configured to determine N mask vectors of N sample sets, where the N sample sets are in one-to-one correspondence with the N mask vectors, the entity corpora corresponding to different sample sets in the N sample sets are different, each of the N sample sets includes multiple samples of at least one entity corpus, the M dimensions of each of the N mask vectors correspond to M named entities, and M and N are positive integers.

The updating unit 820 is configured to update the first sequence labeling model according to the partial samples in each of the N sample sets and the N mask vectors to obtain the second sequence labeling model, where the second sequence labeling model is used for entity labeling.

As an optional embodiment, the determining unit 810 is specifically configured to:

input a first word in a first sample of a first sample set of the N sample sets into the first sequence labeling model and output a weight vector of the first word;

input the actual label vector of the first word and the weight vector into a loss function and calculate a loss vector of the first word;

multiply the loss vector by the first mask vector corresponding to the first sample set to obtain a masked loss vector;

update the first sequence labeling model according to the masked loss vector;

where the dimensions of the weight vector, the actual label vector and the loss vector are M.

As an optional embodiment, the first word is an entity word in the first sample.

As an optional embodiment, the loss function is a cross-entropy function.

As an optional embodiment, the apparatus 800 further includes a testing unit configured to test the stability of the second sequence labeling model according to the remaining samples in each of the N sample sets.

As an optional embodiment, the apparatus 800 further includes an input/output unit configured to input a second entity word in a prediction sentence into the second sequence labeling model and output a prediction vector; the determining unit is further configured to determine at least one label of the second entity word according to the prediction vector, the prediction sentence being a sentence that includes the entity corpus corresponding to any one of the N sample sets, and the dimension of the prediction vector being M.

The input/output unit can communicate with the outside; the input/output unit may also be called a communication interface or a communication unit.

As an optional embodiment, the determining unit 810 is specifically configured to: determine whether the value of each dimension of the prediction vector is greater than a preset value; and determine the named entity label corresponding to a dimension whose value is greater than the preset value as the at least one label of the second entity word.

As an optional embodiment, the determining unit 810 is specifically configured to: determine that the dimension of each of the N mask vectors is the total number of entity corpus types corresponding to the N sample sets; and determine the value corresponding to each of the N mask vectors according to the entity corpus corresponding to each of the N sample sets.
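A minimal sketch of this determination is shown below. The corpus names and the M=4 entity types are assumptions made for the example; the only point illustrated is that the mask dimension equals the total number of entity corpus types across the N sample sets and that a dimension is set to 1 only for the corpora a sample set covers.

```python
def build_mask_vectors(sample_set_corpora, all_entity_types):
    """Derive one mask vector per sample set: the vector length equals the total
    number of entity corpus types, and a dimension is 1 only when the sample set
    includes that entity corpus."""
    return {
        name: [1 if t in corpora else 0 for t in all_entity_types]
        for name, corpora in sample_set_corpora.items()
    }

entity_types = ["movie", "TV series", "variety show", "animation"]   # M = 4
corpora_per_set = {
    "movie_set": {"movie"},
    "movie_tv_set": {"movie", "TV series"},
}
print(build_mask_vectors(corpora_per_set, entity_types))
# {'movie_set': [1, 0, 0, 0], 'movie_tv_set': [1, 1, 0, 0]}
```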
FIG. 9 is a schematic structural diagram of an apparatus 900 for named entity labeling provided by an embodiment of this application. The apparatus 900 includes a processor 910, a memory 920, a communication interface 930 and a bus 940.

The processor 910 in the apparatus 900 shown in FIG. 9 may correspond to the determining unit 810 and the updating unit 820 in the apparatus 800 in FIG. 8. The communication interface 930 may correspond to the input/output unit in the apparatus 800.

The processor 910 may be connected to the memory 920. The memory 920 may be used to store program code and data. Therefore, the memory 920 may be a storage unit inside the processor 910, an external storage unit independent of the processor 910, or a component including both a storage unit inside the processor 910 and an external storage unit independent of the processor 910.

Optionally, the apparatus 900 may further include the bus 940. The memory 920 and the communication interface 930 may be connected to the processor 910 through the bus 940. The bus 940 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus 940 may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one line is used in FIG. 9, but this does not mean that there is only one bus or only one type of bus.

It should be understood that, in the embodiments of this application, the processor 910 may be a central processing unit (CPU). The processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. Alternatively, the processor 910 uses one or more integrated circuits to execute related programs so as to implement the technical solutions provided by the embodiments of this application.

The memory 920 may include a read-only memory and a random access memory, and provides instructions and data to the processor 910. A part of the processor 910 may also include a non-volatile random access memory. For example, the processor 910 may also store device type information.

When the apparatus 900 runs, the processor 910 executes the computer-executable instructions in the memory 920 so that the apparatus 900 performs the operation steps of the above methods.

It should be understood that the apparatus 900 according to the embodiments of this application may correspond to the apparatus 800 in the embodiments of this application, and the above and other operations and/or functions of the units in the apparatus 800 are respectively intended to implement the corresponding procedures of the methods; for brevity, details are not repeated here.
Optionally, in some embodiments, the embodiments of this application further provide a computer-readable medium that stores program code; when the computer program code runs on a computer, the computer is caused to perform the methods in the above aspects.

Optionally, in some embodiments, the embodiments of this application further provide a computer program product including computer program code; when the computer program code runs on a computer, the computer is caused to perform the methods in the above aspects.

In the embodiments of this application, a terminal device or a network device includes a hardware layer, an operating system layer running on the hardware layer, and an application layer running on the operating system layer. The hardware layer may include hardware such as a central processing unit (CPU), a memory management unit (MMU) and a memory (also referred to as a main memory). The operating system of the operating system layer may be any one or more computer operating systems that implement service processing through processes, for example, a Linux operating system, a Unix operating system, an Android operating system, an iOS operating system or a Windows operating system. The application layer may include applications such as a browser, an address book, word processing software and instant messaging software.

The embodiments of this application do not specifically limit the specific structure of the execution body of the methods provided by the embodiments of this application, as long as it can communicate according to the methods provided by the embodiments of this application by running a program that records the code of those methods. For example, the execution body of the methods provided by the embodiments of this application may be a terminal device or a network device, or a functional module in the terminal device or the network device that can call and execute the program.

Various aspects or features of this application may be implemented as methods, apparatuses or articles of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" used herein may cover a computer program accessible from any computer-readable device, carrier or medium. For example, the computer-readable medium may include, but is not limited to, magnetic storage devices (for example, hard disks, floppy disks or magnetic tapes), optical discs (for example, compact discs (CD) or digital versatile discs (DVD)), smart cards and flash memory devices (for example, erasable programmable read-only memory (EPROM), cards, sticks or key drives).

The various storage media described herein may represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable media" may include, but is not limited to, wireless channels and various other media capable of storing, containing and/or carrying instructions and/or data.

It should be understood that the processor mentioned in the embodiments of this application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

It should also be understood that the memory mentioned in the embodiments of this application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory. The volatile memory may be a random access memory (RAM). For example, the RAM can be used as an external cache. By way of example and not limitation, the RAM may include the following forms: static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM) and direct rambus random access memory (DR RAM).

It should be noted that when the processor is a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, the memory (storage module) may be integrated in the processor.

It should also be noted that the memories described herein are intended to include, but not be limited to, these and any other suitable types of memories.
A person of ordinary skill in the art may be aware that the units and steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the protection scope of this application.

A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of this application may be integrated into one unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.

If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of this application essentially, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a computer software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit this application.

The above descriptions are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily think of changes or substitutions within the technical scope disclosed in this application, and such changes or substitutions shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (18)

1. A method for entity labeling, characterized by comprising:

determining N mask vectors of N sample sets, wherein the N sample sets are in one-to-one correspondence with the N mask vectors, entity corpora corresponding to different sample sets in the N sample sets are different, each of the N sample sets comprises multiple samples of at least one entity corpus, M dimensions of each of the N mask vectors correspond to M named entities, and M and N are positive integers; and

updating a first sequence labeling model according to partial samples in each of the N sample sets and the N mask vectors to obtain a second sequence labeling model, wherein the second sequence labeling model is used for entity labeling.

2. The method according to claim 1, characterized in that the updating a first sequence labeling model according to partial samples in each of the N sample sets and the N mask vectors comprises:

inputting a first word in a first sample of a first sample set of the N sample sets into the first sequence labeling model, and outputting a weight vector of the first word;

inputting an actual label vector of the first word and the weight vector into a loss function, and calculating a loss vector of the first word;

multiplying the loss vector by a first mask vector corresponding to the first sample set to obtain a masked loss vector; and

updating the first sequence labeling model according to the masked loss vector;

wherein dimensions of the weight vector, the actual label vector and the loss vector are M.

3. The method according to claim 2, characterized in that the first word is an entity word in the first sample.

4. The method according to claim 2 or 3, characterized in that the loss function is a cross-entropy function.

5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:

testing stability of the second sequence labeling model according to remaining samples in each of the N sample sets.

6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:

inputting a second entity word in a prediction sentence into the second sequence labeling model, and outputting a prediction vector; and

determining at least one label of the second entity word according to the prediction vector, wherein the prediction sentence is a sentence comprising an entity corpus corresponding to any one of the N sample sets;

wherein a dimension of the prediction vector is M.

7. The method according to claim 6, characterized in that the determining at least one label of the second entity word according to the prediction vector comprises:

determining whether a value of each dimension of the prediction vector is greater than a preset value; and

determining a named entity label corresponding to a dimension of the prediction vector whose value is greater than the preset value as the at least one label of the second entity word.

8. The method according to any one of claims 1 to 7, characterized in that the determining N mask vectors of N sample sets comprises:

determining that a dimension of each of the N mask vectors is a total number of entity corpus types corresponding to the N sample sets; and

determining a value corresponding to each of the N mask vectors according to the entity corpus corresponding to each of the N sample sets.
9. An apparatus for entity labeling, characterized by comprising:

a determining unit, configured to determine N mask vectors of N sample sets, wherein the N sample sets are in one-to-one correspondence with the N mask vectors, entity corpora corresponding to different sample sets in the N sample sets are different, each of the N sample sets comprises multiple samples of at least one entity corpus, M dimensions of each of the N mask vectors correspond to M named entities, and M and N are positive integers; and

an updating unit, configured to update a first sequence labeling model according to partial samples in each of the N sample sets and the N mask vectors to obtain a second sequence labeling model, wherein the second sequence labeling model is used for entity labeling.

10. The apparatus according to claim 9, characterized in that the determining unit is specifically configured to:

input a first word in a first sample of a first sample set of the N sample sets into the first sequence labeling model, and output a weight vector of the first word;

input an actual label vector of the first word and the weight vector into a loss function, and calculate a loss vector of the first word;

multiply the loss vector by a first mask vector corresponding to the first sample set to obtain a masked loss vector; and

update the first sequence labeling model according to the masked loss vector;

wherein dimensions of the weight vector, the actual label vector and the loss vector are M.

11. The apparatus according to claim 10, characterized in that the first word is an entity word in the first sample.

12. The apparatus according to claim 10 or 11, characterized in that the loss function is a cross-entropy function.

13. The apparatus according to any one of claims 9 to 12, characterized in that the apparatus further comprises:

a testing unit, configured to test stability of the second sequence labeling model according to remaining samples in each of the N sample sets.

14. The apparatus according to any one of claims 9 to 13, characterized in that the apparatus further comprises:

an input/output unit, configured to input a second entity word in a prediction sentence into the second sequence labeling model and output a prediction vector;

wherein the determining unit is further configured to determine at least one label of the second entity word according to the prediction vector, and the prediction sentence is a sentence comprising an entity corpus corresponding to any one of the N sample sets;

wherein a dimension of the prediction vector is M.

15. The apparatus according to claim 14, characterized in that the determining unit is specifically configured to:

determine whether a value of each dimension of the prediction vector is greater than a preset value; and

determine a named entity label corresponding to a dimension of the prediction vector whose value is greater than the preset value as the at least one label of the second entity word.

16. The apparatus according to any one of claims 9 to 15, characterized in that the determining unit is specifically configured to:

determine that a dimension of each of the N mask vectors is a total number of entity corpus types corresponding to the N sample sets; and

determine a value corresponding to each of the N mask vectors according to the entity corpus corresponding to each of the N sample sets.

17. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when run, implements the method according to any one of claims 1 to 8.

18. A chip, comprising a processor, wherein the processor is connected to a memory, the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the chip performs the method according to any one of claims 1 to 8.
PCT/CN2021/080402 2020-05-29 2021-03-12 Method and device for entity tagging WO2021238337A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010474348.XA CN113743117B (en) 2020-05-29 2020-05-29 Method and device for entity labeling
CN202010474348.X 2020-05-29

Publications (1)

Publication Number Publication Date
WO2021238337A1 true WO2021238337A1 (en) 2021-12-02

Family

ID=78724593

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/080402 WO2021238337A1 (en) 2020-05-29 2021-03-12 Method and device for entity tagging

Country Status (2)

Country Link
CN (1) CN113743117B (en)
WO (1) WO2021238337A1 (en)
