WO2022022421A1 - Language representation model system, pre-training method, apparatus, device and medium - Google Patents

Language representation model system, pre-training method, apparatus, device and medium

Info

Publication number
WO2022022421A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
granularity
model
language representation
word segmentation
Prior art date
Application number
PCT/CN2021/108194
Other languages
English (en)
French (fr)
Inventor
张新松 (Zhang Xinsong)
李鹏帅 (Li Pengshuai)
李航 (Li Hang)
Original Assignee
北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Priority to US17/923,316 (published as US20230244879A1)
Priority to JP2023504177A (published as JP2023535709A)
Priority to EP21848640.5A (published as EP4134865A4)
Publication of WO2022022421A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06F40/40 Processing or translation of natural language

Definitions

  • the embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a language representation model system, a language representation model pre-training method, a natural language processing method, an apparatus, an electronic device, and a storage medium.
  • Embodiments of the present disclosure provide a language representation model system, a pre-training method for a language representation model, a natural language processing method, an apparatus, an electronic device, and a storage medium, and provide
  • a mixed-granularity language representation model that represents natural language both at character granularity, in units of characters, and at word granularity, in units of words. This provides a model basis for downstream natural language processing tasks, helps to improve the processing accuracy of downstream natural language processing tasks, and improves the transfer effect of the language representation model.
  • an embodiment of the present disclosure provides a language representation model system, the system comprising:
  • a character-granularity language representation sub-model taking the character as the segmentation unit, configured to output, based on a sentence segmented with the character as the unit, a first semantic vector corresponding to the semantics expressed by each segment in the sentence; and
  • a word-granularity language representation sub-model taking the word as the segmentation unit, configured to output, based on a sentence segmented with the word as the unit, a second semantic vector corresponding to the semantics expressed by each segment in the sentence.
  • an embodiment of the present disclosure further provides a pre-training method for a language representation model, the method comprising:
  • determining corpus samples for pre-training, segmenting the corpus samples in units of characters and in units of words respectively, and pre-training the character-granularity and word-granularity language representation sub-models in the language representation model with the corresponding segmentation results, to obtain pre-trained language representation sub-models of both granularities.
  • an embodiment of the present disclosure further provides a natural language processing method, the method comprising:
  • determining a fine-tuning sample corpus based on a natural language processing task; segmenting the fine-tuning sample corpus in units of characters and in units of words, respectively, to obtain a fine-tuning character segmentation result and a fine-tuning word segmentation result; fine-tuning the pre-trained language representation model using the fine-tuning character segmentation result and the fine-tuning word segmentation result; and
  • processing the natural language to be processed through the fine-tuned language representation model.
  • an embodiment of the present disclosure further provides an apparatus for pre-training a language representation model, the apparatus comprising:
  • a segmentation module, configured to segment the corpus samples in units of characters and in units of words respectively, to obtain a character segmentation result and a word segmentation result;
  • a first pre-training module, configured to use the word segmentation result to pre-train the word-granularity language representation sub-model, which takes the word as the unit, in the language representation model, to obtain a pre-trained word-granularity language representation sub-model;
  • a second pre-training module, configured to use the character segmentation result to pre-train the character-granularity language representation sub-model, which takes the character as the unit, in the language representation model, to obtain a pre-trained character-granularity language representation sub-model.
  • an embodiment of the present disclosure further provides a natural language processing device, the device comprising:
  • a determination module for determining fine-tuning sample corpora based on natural language processing tasks
  • a segmentation module, configured to segment the fine-tuning sample corpus in units of characters and in units of words respectively, to obtain a fine-tuning character segmentation result and a fine-tuning word segmentation result;
  • a fine-tuning module, configured to fine-tune the pre-trained language representation model using the fine-tuning character segmentation result and the fine-tuning word segmentation result;
  • the processing module is used to process the natural language to be processed through the fine-tuned language representation model.
  • an embodiment of the present disclosure further provides a device, the device comprising:
  • one or more processors; and
  • a storage apparatus configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for pre-training a language representation model according to any one of the embodiments of the present disclosure.
  • An embodiment of the present disclosure further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the pre-training method for a language representation model according to any one of the embodiments of the present disclosure.
  • An embodiment of the present disclosure further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the natural language processing method according to any one of the embodiments of the present disclosure.
  • An embodiment of the present disclosure further provides a computer program product, including a computer program carried on a non-transitory computer-readable medium, the computer program comprising program code for executing the pre-training method for a language representation model according to any one of the embodiments of the present disclosure.
  • An embodiment of the present disclosure further provides a computer program product, including a computer program carried on a non-transitory computer-readable medium, the computer program comprising program code for executing the natural language processing method according to any one of the embodiments of the present disclosure.
  • An embodiment of the present disclosure further provides a computer program carried on a non-transitory computer-readable medium, the computer program comprising program code for executing the pre-training method for a language representation model according to any one of the embodiments of the present disclosure.
  • An embodiment of the present disclosure further provides a computer program carried on a non-transitory computer-readable medium, the computer program comprising program code for executing the natural language processing method according to any one of the embodiments of the present disclosure.
  • In the technical solution of the embodiments of the present disclosure, a language representation model system includes a character-granularity language representation sub-model taking the character as the segmentation unit and a word-granularity language representation sub-model taking the word as the segmentation unit, wherein the character-granularity language representation sub-model is configured to output, based on a sentence segmented with the character as the unit, the first semantic vector corresponding to the semantics expressed by each character in the sentence, and the word-granularity language representation sub-model is configured to output, based on a sentence segmented with the word as the unit, the second semantic vector corresponding to the semantics expressed by each word segment in the sentence. These technical means provide a mixed-granularity language representation model that represents natural language both at character granularity, in units of characters, and at word granularity, in units of words.
  • The mixed-granularity language representation model provides a model basis for downstream natural language processing tasks, helps to improve the processing accuracy of downstream natural language processing tasks, and improves the transfer effect of language representation models.
  • FIG. 1 is a schematic structural diagram of a language representation model system according to Embodiment 1 of the present disclosure
  • FIG. 2 is a schematic structural diagram of another language representation model system provided by Embodiment 2 of the present disclosure.
  • FIG. 3 is a schematic flowchart of a method for pre-training a language representation model according to Embodiment 3 of the present disclosure
  • FIG. 4 is a schematic flowchart of another method for pre-training a language representation model according to Embodiment 3 of the present disclosure
  • FIG. 5 is a schematic flowchart of a natural language processing method according to Embodiment 4 of the present disclosure.
  • FIG. 6 is a schematic flowchart of another natural language processing method provided by Embodiment 4 of the present disclosure.
  • FIG. 7 is a schematic structural diagram of a pre-training apparatus for a language representation model according to Embodiment 5 of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a natural language processing apparatus according to Embodiment 6 of the present disclosure.
  • FIG. 9 is a schematic structural diagram of an electronic device according to Embodiment 7 of the present disclosure.
  • the term “including” and variations thereof are open-ended inclusions, ie, "including but not limited to”.
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • Currently, commonly used language representation models are usually single-granularity language representation models, such as a character-granularity language representation model in units of characters or a word-granularity language representation model in units of words.
  • The inventors found that the above character-granularity language representation model in units of characters does not consider the problem of incorrect attention weights between characters.
  • For example, a character may have no relationship with another character in a specific context, but because the two characters frequently co-occur in the training sample corpus, they are given unreasonable attention weights, which in turn causes the semantics of the characters to be misrepresented in that specific context.
  • The above word-granularity language representation model in units of words, in turn, can produce incorrect semantic parsing due to incorrect segmentation.
  • FIG. 1 is a schematic structural diagram of a language representation model system according to Embodiment 1 of the present disclosure. As shown in FIG. 1, the system includes: a character-granularity language representation sub-model 110 taking the character as the segmentation unit and a word-granularity language representation sub-model 120 taking the word as the segmentation unit.
  • The character-granularity language representation sub-model 110 is configured to output, based on a sentence segmented with the character as the unit, the first semantic vector corresponding to the semantics expressed by each segment in the sentence.
  • The word-granularity language representation sub-model 120 is configured to output, based on a sentence segmented with the word as the unit, the second semantic vector corresponding to the semantics expressed by each segment in the sentence.
  • For example, for the sentence "商店里的乒乓球拍卖完了" (ambiguous between "the ping-pong paddles in the shop are sold out" and "the ping-pong balls in the shop have been auctioned off"), segmenting with the character as the unit gives: 商/店/里/的/乒/乓/球/拍/卖/完/了.
  • Segmenting the same sentence with the word as the unit gives: 商店/里/的/乒乓球/拍卖/完/了 ("shop / in / [DE] / ping-pong ball / auction / finished").
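The two segmentations above can be reproduced with a few lines of code. The sketch below is illustrative only and not from the patent: character-granularity segmentation simply splits the sentence into characters, while word-granularity segmentation is approximated here by greedy forward maximum matching over a hypothetical toy lexicon (a real system would use a full word segmenter).

```python
# Illustrative sketch (not from the patent). LEXICON is a hypothetical toy dictionary.
SENTENCE = "商店里的乒乓球拍卖完了"
LEXICON = {"商店", "乒乓球", "拍卖", "球拍", "里", "的", "卖", "完", "了"}

def char_segments(sentence: str) -> list[str]:
    """Character-granularity segmentation: every character is a unit."""
    return list(sentence)

def word_segments(sentence: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Word-granularity segmentation via greedy forward maximum matching."""
    out, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in lexicon:
                out.append(piece)
                i += length
                break
    return out

print(char_segments(SENTENCE))           # ['商', '店', '里', '的', '乒', '乓', '球', '拍', '卖', '完', '了']
print(word_segments(SENTENCE, LEXICON))  # ['商店', '里', '的', '乒乓球', '拍卖', '完', '了']
```

Note how the word segmenter commits to one reading of the ambiguous sentence, grouping 拍 with 卖 into 拍卖 ("auction") rather than with 球 into 球拍 ("paddle"), which is exactly the kind of segmentation decision a word-granularity model alone is exposed to, while the character-level segmentation makes no such commitment.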
  • In this embodiment, the language representation model includes both a character-granularity language representation sub-model taking the character as the segmentation unit and a word-granularity language representation sub-model taking the word as the segmentation unit.
  • It can therefore express both the first semantic vector corresponding to the semantics expressed by each character in the sentence and the second semantic vector corresponding to the semantics expressed by each word segment in the sentence. This provides a semantic-understanding basis for downstream natural language processing tasks; in specific language processing tasks, features at the character and word granularities correct and enrich each other, which helps to improve the processing accuracy of downstream natural language processing tasks and improves the transfer effect of the language representation model.
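As a concrete illustration of the system just described, the following sketch wires a character-granularity sub-model and a word-granularity sub-model into one mixed-granularity model that returns the first and second semantic vectors for the same sentence. It is an assumption-laden simplification, not the patent's reference implementation; all class names, dimensions, and vocabulary sizes are invented. For clarity each sub-model here owns its own encoder; the parameter sharing described later in Embodiment 3 is sketched separately.

```python
# Minimal sketch of a mixed-granularity language representation model (assumptions only).
import torch
import torch.nn as nn

class GranularitySubModel(nn.Module):
    """One sub-model: input layer (token embeddings) + encoding layers + output."""
    def __init__(self, vocab_size: int, dim: int = 256, layers: int = 2, heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Returns one semantic vector per segment (character or word segment).
        return self.encoder(self.embed(token_ids))

class MixedGranularityModel(nn.Module):
    def __init__(self, char_vocab: int, word_vocab: int):
        super().__init__()
        self.char_sub_model = GranularitySubModel(char_vocab)   # character as unit
        self.word_sub_model = GranularitySubModel(word_vocab)   # word as unit

    def forward(self, char_ids: torch.Tensor, word_ids: torch.Tensor):
        first_vectors = self.char_sub_model(char_ids)    # per-character semantics
        second_vectors = self.word_sub_model(word_ids)   # per-word-segment semantics
        return first_vectors, second_vectors

# Usage: the same sentence is fed twice, once as character ids and once as word ids.
model = MixedGranularityModel(char_vocab=8000, word_vocab=30000)
chars = torch.randint(0, 8000, (1, 11))   # e.g. 11 characters
words = torch.randint(0, 30000, (1, 7))   # e.g. 7 word segments
first, second = model(chars, words)
print(first.shape, second.shape)          # torch.Size([1, 11, 256]) torch.Size([1, 7, 256])
```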
  • FIG. 2 is a schematic structural diagram of another language representation model provided by Embodiment 2 of the present disclosure.
  • This embodiment further describes the structures of the character-granularity language representation sub-model and the word-granularity language representation sub-model respectively, and specifically provides the structure of each sub-model.
  • The character-granularity language representation sub-model 210 includes: a character-granularity input layer 211, a character-granularity encoding layer 212, and a character-granularity output layer 213, with the character as the segmentation unit.
  • The character-granularity input layer 211 is connected to the character-granularity encoding layer 212 and is configured to receive sentences segmented with the character as the unit and, combining the position of each character in the sentence and the paragraph in which the sentence is located,
  • convert each character into a corresponding character vector and send the character vectors to the character-granularity encoding layer 212. The character-granularity encoding layer 212 is connected to the character-granularity output layer 213 and is configured to determine, based on the received character vectors, the first semantic vector corresponding to the semantics expressed by each character in the sentence and output the first semantic vector to the character-granularity output layer 213. The character-granularity output layer 213 is configured to output the received first semantic vectors.
  • The character-granularity encoding layer 212 comprises at least two layers, each with a Transformer structure.
  • The word-granularity language representation sub-model 220 includes: a word-granularity input layer 221, a word-granularity encoding layer 222, and a word-granularity output layer 223, with the word as the segmentation unit.
  • The word-granularity input layer 221 is connected to the word-granularity encoding layer 222 and is configured to receive sentences segmented with the word as the unit and, combining the position of each word segment in the sentence and the paragraph in which the sentence is located, convert each word segment into a corresponding word vector. The word vector corresponding to each word segment therefore includes the position vector of the segment, the paragraph vector, and the pre-learned embedding vector of the corresponding segment.
  • The word-granularity encoding layer 222 is connected to the word-granularity output layer 223 and is configured to determine, based on the received word vectors, the second semantic vector corresponding to the semantics expressed by each word segment in the sentence and output the second semantic vector to the word-granularity output layer 223. The word-granularity output layer 223 is configured to output the received second semantic vectors.
  • The word-granularity encoding layer 222 comprises at least two layers, each with a Transformer structure.
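The input layers described above combine, for each segment, a pre-learned token embedding with a position vector and a paragraph vector. A minimal sketch of that combination, under assumed module names and dimensions, might look like this:

```python
# Sketch of a granularity input layer (assumed layout): the vector for each segment is
# the sum of its pre-learned token embedding, a position vector, and a paragraph vector.
import torch
import torch.nn as nn

class GranularityInputLayer(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256, max_pos: int = 512, max_para: int = 2):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)      # per-granularity table
        self.position_embed = nn.Embedding(max_pos, dim)      # position vectors
        self.paragraph_embed = nn.Embedding(max_para, dim)    # paragraph vectors

    def forward(self, token_ids: torch.Tensor, paragraph_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token_embed(token_ids)
                + self.position_embed(positions)[None, :, :]
                + self.paragraph_embed(paragraph_ids))
```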
  • The Transformer structure includes an encoding component, a decoding component, and the connections between them.
  • The encoding component is usually composed of at least two encoders, and the decoding component is composed of the same number of decoders.
  • Each encoder consists of two layers: a self-attention layer and a feed-forward neural network layer.
  • The self-attention layer helps the current node attend not only to the current word, so that contextual semantics can be captured.
  • Each decoder also contains a self-attention layer and a feed-forward neural network layer, but with an additional attention layer between the two that helps the current node focus on the key content that needs attention.
  • The Transformer structure is widely used in the field of natural language processing (NLP), such as machine translation, question answering systems, text summarization, and speech recognition, where it shows superior performance.
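For reference, a single encoder block of the kind stacked in the encoding layers above consists of a self-attention layer followed by a feed-forward network. The sketch below assumes a standard residual-plus-LayerNorm layout and illustrative dimensions; it is not taken from the patent.

```python
# Sketch of one Transformer encoder block (assumed standard layout).
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, ff_dim: int = 1024):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.feed_forward = nn.Sequential(nn.Linear(dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention lets each position attend to the whole sentence, not just itself.
        attended, _ = self.self_attention(x, x, x)
        x = self.norm1(x + attended)
        # Position-wise feed-forward network applied after attention.
        return self.norm2(x + self.feed_forward(x))
```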
  • In the technical solution of this embodiment, the character-granularity language representation sub-model includes a character-granularity input layer, a character-granularity encoding layer, and a character-granularity output layer, with the character as the segmentation unit, and the word-granularity language representation sub-model includes a word-granularity input layer, a word-granularity encoding layer, and a word-granularity output layer, with the word as the segmentation unit.
  • This achieves the purpose of giving, for the same sentence, a character-granularity semantic representation of each character and a word-granularity semantic representation of each word segment. It provides an understanding basis for downstream natural language processing tasks, helps to improve the processing accuracy of downstream natural language processing tasks, and improves the transfer effect of language representation models.
  • FIG. 3 shows a pre-training method for a language representation model according to Embodiment 3 of the present disclosure, which is used for pre-training the language representation model described in the above embodiments so that the model can perform both character-granularity and word-granularity semantic representation of input sentences, providing an understanding basis for downstream natural language processing tasks. As shown in FIG. 3, the method includes the following steps:
  • Step 310 Determine corpus samples for pre-training.
  • The corpus samples may be obtained from commonly used websites, or collected and organized manually for sentences that are prone to ambiguity, such as "the table tennis balls in the store are sold out" or "we want to hang a portrait in the drawing room".
  • Step 320: Segment the corpus samples in units of characters and in units of words, respectively, to obtain a character segmentation result and a word segmentation result.
  • For example, for the sentence "商店里的乒乓球拍卖完了", segmenting with the character as the unit gives: 商/店/里/的/乒/乓/球/拍/卖/完/了.
  • Segmenting the same sentence with the word as the unit gives: 商店/里/的/乒乓球/拍卖/完/了.
  • Step 330: Using the word segmentation result, pre-train the word-granularity language representation sub-model, which takes the word as the unit, in the language representation model, to obtain a pre-trained word-granularity language representation sub-model.
  • Exemplarily, using the word segmentation result to pre-train the word-granularity language representation sub-model includes:
  • masking a set proportion of the word segments in the word segmentation result;
  • inputting the masked word segments and the unmasked word segments to the word-granularity input layer of the word-granularity language representation sub-model, so that the word-granularity input layer, combining the position of each word segment in the corpus and the paragraph in which the corpus is located, converts each word segment into a corresponding word vector and sends the word vectors to the word-granularity encoding layer of the word-granularity language representation sub-model;
  • determining, through the word-granularity encoding layer, the second semantic vector corresponding to the semantics expressed by each word segment in the corpus, and outputting the second semantic vector to the word-granularity output layer of the word-granularity language representation sub-model;
  • outputting, through the word-granularity output layer, the masked word segments based on the second semantic vector of each word segment; when the accuracy of the word segments output by the word-granularity output layer reaches a set threshold, the pre-training ends.
  • The set proportion is usually 15%: 15% of the word segments are randomly selected from the word segmentation result and masked (MASK).
  • For each word segment, whether it is masked or not, a position identifier and a paragraph identifier are added; the position identifier indicates the position of the current segment in the current sentence, and the paragraph identifier indicates the paragraph in which that sentence is located.
  • The word-granularity output layer is specifically configured to predict each masked word segment based on its position identifier and paragraph identifier, combined with the semantics of the segments before and after it; specifically, it calculates the probability that the masked segment is each candidate segment and outputs the segment with the highest probability.
  • For example, for the word-granularity language representation sub-model, the word-granularity output layer finally outputs the masked word segment "商店" ("shop").
  • For the character-granularity language representation sub-model, the character-granularity output layer finally outputs the masked characters "商" and "店".
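The masking procedure described in this step can be sketched as follows. The details are illustrative assumptions (a 15% masking ratio and a literal "[MASK]" placeholder); the same routine applies unchanged to the character segmentation used in step 340.

```python
# Sketch of the segment-masking step (assumptions: [MASK] placeholder, 15% rate).
import random

MASK_TOKEN = "[MASK]"

def mask_segments(segments: list[str], ratio: float = 0.15, seed: int = 0):
    """Randomly mask a set proportion of segments; return masked input and targets."""
    rng = random.Random(seed)
    n = max(1, int(len(segments) * ratio))
    masked_positions = sorted(rng.sample(range(len(segments)), n))
    masked_input = list(segments)
    targets = {}
    for pos in masked_positions:
        targets[pos] = segments[pos]        # the sub-model must recover this segment
        masked_input[pos] = MASK_TOKEN
    return masked_input, targets

words = ["商店", "里", "的", "乒乓球", "拍卖", "完", "了"]
masked, targets = mask_segments(words)
print(masked)    # e.g. ['[MASK]', '里', '的', '乒乓球', '拍卖', '完', '了']
print(targets)   # e.g. {0: '商店'}, which the word-granularity output layer should predict
```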
  • Step 340: Using the character segmentation result, pre-train the character-granularity language representation sub-model, which takes the character as the unit, in the language representation model, to obtain a pre-trained character-granularity language representation sub-model.
  • The pre-training process for the character-granularity language representation sub-model is similar to that for the word-granularity language representation sub-model. Exemplarily, using the character
  • segmentation result to pre-train the character-granularity language representation sub-model, which takes the character as the unit, in the language representation model includes:
  • masking a set proportion of the characters in the character segmentation result, and inputting the masked characters and the unmasked characters to the character-granularity input layer of the character-granularity language representation sub-model, so that the character-granularity input layer, combining the position of each character in the corpus and the paragraph in which the corpus is located, converts each character into a corresponding character vector and sends the character vectors to the character-granularity encoding layer of the character-granularity language representation sub-model;
  • determining, through the character-granularity encoding layer, the first semantic vector corresponding to the semantics expressed by each character in the corpus, and outputting the first semantic vector to the character-granularity output layer of the character-granularity language representation sub-model;
  • outputting, through the character-granularity output layer, the masked characters based on the first semantic vector of each character; when the accuracy of the characters output by the character-granularity output layer reaches a set threshold, the pre-training ends.
  • The character-granularity encoding layer comprises at least two layers, each with a Transformer structure; the word-granularity encoding layer likewise comprises at least two layers, each with a Transformer structure.
  • The parameters of the language representation model include: character vector parameters corresponding to the character-granularity language representation sub-model, word vector parameters corresponding to the word-granularity language representation sub-model, and the associated Transformer parameters, position vectors, and paragraph vectors shared by the character-granularity and word-granularity language representation sub-models.
  • The character vector parameters may specifically refer to the pre-learned embedding vector of each character, or a correlation matrix used to determine that embedding vector.
  • The word vector parameters may specifically refer to the pre-learned embedding vector of each word segment, or a correlation matrix used to determine that embedding vector.
  • the position vector specifically represents the position of the segmented word or character in the sentence, and the paragraph vector represents the information of the paragraph where the sentence to be segmented is located.
  • The position vectors and paragraph vectors can be shared between the character-granularity language representation sub-model and the word-granularity language representation sub-model, so only one set needs to be kept; there is no need to store them separately for
  • the two sub-models, which reduces the number of model parameters and the complexity of the language representation model.
  • The associated parameters of the Transformer refer to the parameters of the Transformer learned during pre-training, which are likewise shared by the two sub-models.
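The parameter layout described above can be made concrete with a short sketch (module names and sizes are assumptions): only the character and word token-embedding tables exist per sub-model, while the position vectors, paragraph vectors, and Transformer encoder parameters exist once and are shared.

```python
# Sketch of shared vs. per-granularity parameters (assumed sizes and names).
import torch.nn as nn

dim = 256
shared_position = nn.Embedding(512, dim)          # one set of position vectors
shared_paragraph = nn.Embedding(2, dim)           # one set of paragraph vectors
shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)

char_token_embed = nn.Embedding(8000, dim)        # character vector parameters
word_token_embed = nn.Embedding(30000, dim)       # word vector parameters

def count(params):
    return sum(p.numel() for p in params)

# Only the token tables are duplicated; everything else is stored once.
print(count(shared_encoder.parameters()),
      count(char_token_embed.parameters()),
      count(word_token_embed.parameters()))
```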
  • It should be noted that there is no timing restriction between step 330 and step 340.
  • The operation of step 330 may be performed first and then the operation of step 340, or the operation of step 340 may be performed first and then the operation of step 330.
  • The operations of step 330 and step 340 may also be performed in parallel.
  • In the pre-training method for a language representation model provided by the embodiments of the present disclosure, the same sentence is segmented into units of different granularities, and the language representation sub-models of the corresponding granularities are pre-trained on those segmentations. The result is a mixed-granularity language representation model that can simultaneously
  • represent an input sentence as the semantic vectors of its character-granularity segments and the semantic vectors of its word-granularity segments, providing a semantic-understanding basis for downstream natural language processing tasks. Features at the character and word granularities correct and enrich each other, which helps to improve the processing accuracy of downstream natural language processing tasks and improves the transfer effect of language representation models.
  • FIG. 5 is a schematic flowchart of a natural language processing method according to Embodiment 4 of the present disclosure. Specifically, a specific natural language processing task is performed based on the language representation model disclosed in the foregoing embodiments. The natural language processing task includes at least one of the following types: semantic similarity calculation, text classification, language inference, keyword recognition, and reading comprehension.
  • the natural language processing method includes the following steps:
  • Step 510 Determine the fine-tuning sample corpus based on the natural language processing task.
  • the language representation model disclosed in the above embodiments is only used for semantic representation of natural language sentences with different granularities, and provides an understanding basis for all natural language processing tasks. When a specific natural language processing task needs to be performed, further fine-tuning training of the language representation model needs to be performed according to the specific natural language processing task.
  • For example, if the natural language processing task is semantic similarity calculation, the fine-tuning sample corpus takes the form: sentence 1 : sentence 2 : similarity label, where "0" indicates that sentence 1 and sentence 2 have different meanings (their semantics are not similar) and "1" indicates that they have the same meaning (their semantics are similar).
  • the natural language processing task is text classification, and the fine-tuning sample corpus is in the form of: classification label: classification name: text to be classified.
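The two corpus formats just described can be represented, for example, as records like the following. The field names are taken from the AFQMC and TNEWS examples later in this embodiment; the values are placeholders.

```python
# Illustrative fine-tuning records (field names assumed from the examples below).
similarity_example = {
    "sentence1": "Where is the credit card limit for Double Eleven",
    "sentence2": "You can withdraw credit card limit",
    "label": "0",                         # "0": different meanings, "1": same meaning
}

classification_example = {
    "label": "102",                       # classification label (category id)
    "label_des": "news_entertainment",    # classification name
    "sentence": "text to be classified",
}
```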
  • Step 520: Segment the fine-tuning sample corpus in units of characters and in units of words, respectively, to obtain a fine-tuning character segmentation result and a fine-tuning word segmentation result.
  • Step 530: Fine-tune the pre-trained language representation model using the fine-tuning character segmentation result and the fine-tuning word segmentation result.
  • the pre-trained language representation model is retrained, so that the language representation model learns a specific task processing strategy.
  • Step 540 Process the natural language to be processed through the fine-tuned language representation model.
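A fine-tuning step over such records might look like the sketch below, which reuses the hypothetical MixedGranularityModel from the earlier sketch and adds a task head. The pooling strategy and the head are assumptions for illustration, not the patent's prescribed architecture.

```python
# Minimal fine-tuning sketch (assumptions: MixedGranularityModel as above, mean pooling,
# a linear task head for a two-class similarity label).
import torch
import torch.nn as nn

def fine_tune_step(model, task_head, optimizer, char_ids, word_ids, labels):
    """One retraining step on fine-tuning data for a specific downstream task."""
    first_vectors, second_vectors = model(char_ids, word_ids)
    # Pool the character-granularity and word-granularity semantic vectors and combine
    # them, so that features of the two granularities can complement each other.
    pooled = torch.cat([first_vectors.mean(dim=1), second_vectors.mean(dim=1)], dim=-1)
    loss = nn.functional.cross_entropy(task_head(pooled), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: task_head maps the pooled 2*dim vector to the task's label space.
# task_head = nn.Linear(2 * 256, 2)
# optimizer = torch.optim.AdamW(list(model.parameters()) + list(task_head.parameters()), lr=2e-5)
```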
  • reference numeral 610 denotes the pre-training stage of the mixed-granularity language representation model, which corresponds to the content disclosed in Embodiment 3 of the present disclosure.
  • The language representation model is the language representation model described in any of the above embodiments; because it includes a character-granularity language representation sub-model taking the character as the segmentation unit and a word-granularity language representation sub-model taking the word as the segmentation unit, it is called a mixed-granularity language representation model.
  • Reference numeral 620 represents the secondary training of the pre-trained language representation model, that is, the fine-tuning stage, specifically, fine-tuning the pre-trained language representation model according to a specific natural language processing task.
  • Reference numeral 630 represents the downstream application scenarios of the mixed-granularity language representation model, including the semantic similarity calculation task AFQMC, the short text classification task TNEWS, the long text classification task IFLYTEK, the language inference task CMNLI, the Chinese coreference resolution task CLUEWSC2020, the paper keyword recognition task CSL, the simplified Chinese reading comprehension task CMRC2018, the Chinese idiom reading comprehension fill-in-the-blank task CHID, and the Chinese multiple-choice reading comprehension task C3.
  • the semantic similarity calculation task AFQMC specifically refers to: for example: ⁇ "sentence1": “Where is the credit card limit for Double Eleven", “sentence2”: “You can withdraw credit card limit”, “label”: "0" ⁇ .
  • Each piece of data has three attributes, which are, from front to back, sentence 1, sentence 2, and a sentence-similarity label.
  • For the label, "1" means that sentence1 and sentence2 have similar meanings, and "0" means that sentence1 and sentence2 have different meanings.
  • For a specific task, the pre-trained language representation model described in the above embodiments is fine-tuned using a pre-collected corpus related to that task; that is, the pre-trained language representation model is retrained to adapt it to the specific task application.
  • Taking a semantic similarity calculation task as an example, corpora with similar semantics and with dissimilar semantics are first collected. The corpus is represented as {"sentence1": "Where is the credit card limit for Double Eleven", "sentence2": "You can withdraw credit card limit", "label": "0"}; each piece of data has three attributes, which are, from front to back, sentence 1, sentence 2, and a similarity label ("1" means sentence1 and sentence2 have similar meanings, "0" means they have different meanings). The collected corpus is then used to retrain the pre-trained language representation model.
  • After fine-tuning, the language representation model can express both the first semantic vector corresponding to the semantics expressed by each character in the sentence and the second semantic vector corresponding to the semantics expressed by each word segment in the sentence, so that features at the character and word granularities correct and enrich each other. This reduces the probability of misunderstanding the semantics of a sentence and of incorrect parsing caused by wrong segmentation, improves the learning effect on the specific natural language processing task, and thereby improves task processing accuracy.
  • The fine-tuned language representation model also has better model transfer characteristics.
  • the short text classification task TNEWS specifically refers to: Example: ⁇ "label”: “102”, “label_des”: “news_entertainment”, “sentence”: "Donut selfie, the angle of mystery is so beautiful, beauty attracts everything” ⁇ .
  • Each piece of data has three attributes, from front to back are category ID, category name, news string (title only).
  • The long text classification task IFLYTEK specifically refers to: Example {"label": "110", "label_des": "Community Supermarket", "sentence": "Happy Express Supermarket was founded in 2016, focusing on creating a one-stop mobile shopping platform with instant delivery within 30 minutes; product categories include fruits, vegetables, meat, poultry, eggs and milk, seafood and aquatic products, grain and oil seasoning, drinks and beverages, snack foods, daily necessities, takeaway, etc. Happy Company hopes to use a new business model, together with a more efficient and fast warehousing and distribution model, to become a faster, better, more comprehensive and more economical online retail platform, bring consumers a better consumption experience, promote the process of food safety, and become an Internet company respected by society."}.
  • Each piece of data has three attributes, which are category ID, category name, and text content from front to back.
  • The language inference task CMNLI specifically refers to: Example {"sentence1": "the new rights are good enough", "sentence2": "everyone likes the latest benefits", "label": "neutral"}.
  • Each piece of data has three attributes, from front to back, sentence 1, sentence 2, and entailment relationship label.
  • The Chinese coreference resolution task CLUEWSC2020 specifically refers to determining which noun a pronoun refers to. For example: {"target": {"span2_index": 37, "span1_index": 5, "span1_text": "bed", "span2_text": "it"}, "idx": 261, "label": "false", "text": "At this time the phone next to the pillow on the bed rang; I was surprised because it had been suspended for two months because of arrears, and now it suddenly rang."}. "true" means that the pronoun indeed refers to the noun in span1_text, and "false" means that it does not refer to the noun in span1_text.
  • The paper keyword recognition task CSL specifically refers to excerpts from Chinese paper abstracts and their keywords; the papers are selected from Chinese social science and natural science core journals.
  • For example, one abstract describes optimizing the FFT calculation process to reduce the computation of a new algorithm so that it can meet the real-time requirements of 3D imaging sonar, with simulation and experimental results showing that the imaging resolution of the subregional FFT beamforming algorithm is significantly improved compared with the traditional uniform FFT beamforming algorithm while meeting real-time requirements.
  • Its keyword field is ["Water Acoustics", "FFT Beamforming", "3D Imaging Sonar"], with "label": "1".
  • Each piece of data has four attributes, which are data ID, paper abstract, keywords, and true and false labels from front to back.
  • The simplified Chinese reading comprehension task CMRC2018 specifically refers to: given an article and a question, the answer must be found in the article.
  • The Chinese idiom reading comprehension fill-in-the-blank task CHID specifically refers to: given an article with some idioms removed and a set of candidate idioms, the language representation model selects the appropriate idiom to fill in each blank.
  • The Chinese multiple-choice reading comprehension task C3 specifically refers to: given a dialogue, answer multiple-choice questions based on the content of the dialogue.
  • the task processing effect of the language representation model provided by the implementation of the present disclosure greatly exceeds the task processing effect of the single-granularity language representation model trained on the same corpus.
  • This is because the language representation model provided by the embodiments of the present disclosure is a mixed-granularity language representation model, including a character-granularity language representation sub-model taking the character as the segmentation unit and a word-granularity language representation sub-model taking the word as the segmentation unit.
  • It can therefore express both the first semantic vector corresponding to the semantics expressed by each character in the sentence and the second semantic vector corresponding to the semantics expressed by each word segment in the sentence.
  • Features at the character and word granularities thus correct and enrich each other, which reduces the probability of misunderstanding the semantics of a sentence and of incorrect parsing caused by wrong segmentation, and improves the processing accuracy of natural language processing tasks.
  • "BERT reproduction" refers to the task processing results obtained by the applicant by evaluating BERT on the above processing tasks.
  • The mixed-granularity language representation model refers to a language representation model that includes both a character-granularity language representation sub-model taking the character as the segmentation unit and a word-granularity language representation sub-model taking the word as the segmentation unit.
  • It embodies the idea of mixed-granularity language representation, namely expressing, for the same natural sentence, both the semantic vector of each character and the semantic vector of each word segment. A mixed-granularity language representation model can be obtained by extending any
  • single-granularity language representation model, for example BERT, RoBERTa, or ALBERT. It can be seen from Table 1 that the performance and average score of the mixed-granularity language representation model in most task processing applications far exceed those of the existing single-granularity language representation models, and the processing performance of the mixed-granularity language representation model is also much better than that of the single-granularity language representation models.
  • Table 1 Comparison table of task processing performance of each language representation model
  • fine-tuning training is performed on the hybrid-granularity language representation model disclosed in the above embodiments in combination with specific processing tasks, and then natural language task processing is performed based on the fine-tuned and trained hybrid-granularity language representation model.
  • In this way, features at the character and word granularities correct and enrich each other, improving the processing accuracy of natural language processing tasks.
  • FIG. 7 provides a pre-training apparatus for a language representation model according to Embodiment 5 of the present disclosure, which specifically includes: a determination module 710, a word segmentation module 720, a first pre-training module 730, and a second pre-training module 740;
  • The determination module 710 is configured to determine corpus samples for pre-training; the segmentation module 720 is configured to segment the corpus samples in units of characters and in units of words, respectively, to obtain a character segmentation result and a word segmentation result;
  • the first pre-training module 730 is configured to use the word segmentation result to pre-train the word-granularity language representation sub-model, which takes the word as the unit, in the language representation model, to obtain a pre-trained word-granularity language representation sub-model;
  • the second pre-training module 740 is configured to use the character segmentation result to pre-train the character-granularity language representation sub-model, which takes the character as the unit, in the language representation model, to obtain a pre-trained character-granularity language representation sub-model.
  • The first pre-training module 730 includes:
  • a masking unit, configured to mask a set proportion of the word segments in the word segmentation result obtained in units of words;
  • an input unit, configured to input the masked word segments and the unmasked word segments to the word-granularity input layer of the word-granularity language representation sub-model, so that the word-granularity input layer, combining the position of each word segment in the corpus
  • and the paragraph in which the corpus is located, converts each word segment into a corresponding word vector and sends the word vectors to the word-granularity encoding layer of the word-granularity language representation sub-model;
  • a determining unit, configured to determine, through the word-granularity encoding layer, the second semantic vector corresponding to the semantics expressed by each word segment in the corpus, and output the second semantic vector to the word-granularity output layer of the word-granularity language representation sub-model;
  • an output unit, configured to output, through the word-granularity output layer, the masked word segments based on the second semantic vector of each word segment; when the accuracy of the word segments output by the word-granularity output layer reaches a second threshold, the pre-training ends.
  • The second pre-training module 740 includes:
  • a masking unit, configured to mask a set proportion of the characters in the character segmentation result;
  • an input unit, configured to input the masked characters and the unmasked characters to the character-granularity input layer of the character-granularity language representation sub-model, so that the character-granularity input layer, combining the position of each character in the corpus and the paragraph in which the corpus is located, converts each character into a corresponding character vector and sends the character vectors to the character-granularity encoding layer of the character-granularity language representation sub-model;
  • a determining unit, configured to determine, through the character-granularity encoding layer, the first semantic vector corresponding to the semantics expressed by each character in the corpus, and output the first semantic vector to the character-granularity output layer of the character-granularity language representation sub-model;
  • an output unit, configured to output, through the character-granularity output layer, the masked characters based on the first semantic vector of each character; when the accuracy of the characters output by the character-granularity output layer reaches a first threshold, the pre-training ends.
  • The parameters of the language representation model include: character vector parameters corresponding to the character-granularity language representation sub-model, word vector parameters corresponding to the word-granularity language representation sub-model, and the associated Transformer parameters, position vectors, and paragraph vectors shared by the character-granularity and word-granularity language representation sub-models.
  • In the technical solution of the embodiments of the present disclosure, the same sentence is segmented into units of different granularities, and the language representation sub-models of the corresponding granularities are pre-trained on those segmentations. This yields a mixed-granularity language representation model that can simultaneously represent an input sentence as the semantic vectors of its character-granularity segments and the semantic vectors of its word-granularity segments, providing a semantic-understanding basis for downstream natural language processing tasks.
  • Features at the character and word granularities correct and enrich each other, which helps to improve the processing accuracy of downstream natural language processing tasks and improves the transfer effect of language representation models.
  • the pre-training device for a language representation model provided by the embodiment of the present disclosure can execute the pre-training method for a language representation model provided by any embodiment of the present disclosure, and has functional modules and beneficial effects corresponding to the execution method.
  • FIG. 8 provides a natural language processing apparatus according to Embodiment 6 of the present disclosure, which specifically includes: a determination module 810, a word segmentation module 820, a fine-tuning module 830, and a processing module 840;
  • the determination module 810 is used to determine the fine-tuning sample corpus based on the natural language processing task;
  • the segmentation module 820 is configured to segment the fine-tuning sample corpus in units of characters and in units of words, respectively, to obtain a fine-tuning character segmentation result and a fine-tuning word segmentation result;
  • the fine-tuning module 830 is configured to fine-tune the pre-trained language representation model using the fine-tuning character segmentation result and the fine-tuning word segmentation result;
  • the processing module 840 is configured to process the natural language to be processed through the fine-tuned language representation model.
  • the natural language processing task includes at least one of the following: semantic similarity calculation, text classification, language reasoning, keyword recognition and reading comprehension.
  • In the technical solution of the embodiments of the present disclosure, the mixed-granularity language representation model disclosed in the above embodiments is fine-tuned in combination with a specific processing task, and natural language task processing is then performed based on the fine-tuned mixed-granularity language representation model. Features at the character and word granularities correct and enrich each other, improving the processing accuracy of natural language processing tasks.
  • The natural language processing apparatus provided by the embodiments of the present disclosure can execute the natural language processing method provided by any embodiment of the present disclosure, and has functional modules and beneficial effects corresponding to the executed method.
  • Referring to FIG. 9, it shows a schematic structural diagram of an electronic device (e.g., the terminal device or server in FIG. 9) 400 suitable for implementing an embodiment of the present disclosure.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), vehicle-mounted terminals (eg, mobile terminals such as in-vehicle navigation terminals), etc., and stationary terminals such as digital TVs, desktop computers, and the like.
  • the electronic device shown in FIG. 9 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • The electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 406 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the electronic device 400.
  • the processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to bus 404 .
  • The following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 407 including, for example, a liquid crystal display (LCD), speakers, vibrators, etc.; storage devices 406 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 409.
  • Communication means 409 may allow electronic device 400 to communicate wirelessly or by wire with other devices to exchange data.
  • FIG. 9 shows electronic device 400 having various means, it should be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 409, or from the storage device 406, or from the ROM 402.
  • When the computer program is executed by the processing apparatus 401, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • The terminal provided by the embodiments of the present disclosure, and the pre-training method for a language representation model and the natural language processing method provided by the above embodiments, belong to the same inventive concept; for technical details not described in detail in the embodiments of the present disclosure, reference may be made to the above embodiments, and the embodiments of the present disclosure have the same beneficial effects as the above embodiments.
  • Embodiments of the present disclosure provide a computer storage medium on which a computer program is stored, and when the program is executed by a processor, implements the pre-training method for a language representation model and the natural language processing method provided by the foregoing embodiments.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: electric wire, optical cable, RF (Radio Frequency, radio frequency), etc., or any suitable combination of the above.
  • Clients and servers can communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device:
  • pre-train the character-granularity and word-granularity language representation sub-models in the language representation model with the corresponding segmentation results, to obtain pre-trained language representation sub-models of both granularities; or
  • process the natural language to be processed through the fine-tuned language representation model.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in the flowcharts or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself; for example, an editable-content display unit may also be described as an "editing unit".
  • Embodiments of the present disclosure also include a computer program that, when run on an electronic device or executed by a processor, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
  • exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • more specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • Example 1 provides a language representation model system, which includes: a character-granularity language representation sub-model that uses characters as word segmentation units, and a word-granularity language representation sub-model that uses words as word segmentation units;
  • the character-granularity language representation sub-model is configured to output, based on the sentence segmented with characters as word segmentation units, a first semantic vector corresponding to the semantics expressed by each character in the sentence;
  • the word-granularity language representation sub-model is configured to output, based on the sentence segmented with words as word segmentation units, a second semantic vector corresponding to the semantics expressed by each word segment in the sentence (a minimal sketch of this dual-granularity arrangement follows below).
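The dual sub-model system of Example 1 can be pictured with a short PyTorch sketch. Everything below — the class names, dimensions, and the use of `nn.TransformerEncoder` — is an illustrative assumption rather than the patented implementation; the point is only that the same sentence, segmented two ways, yields one semantic vector per character and one per word.

```python
import torch
from torch import nn

class GranularEncoder(nn.Module):
    """One sub-model: token embeddings followed by a small Transformer encoder stack."""
    def __init__(self, vocab_size: int, d_model: int = 128, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.encoder(self.embed(token_ids))   # (batch, seq_len, d_model)

class MixedGranularityModel(nn.Module):
    """Character-granularity and word-granularity sub-models over the same sentence."""
    def __init__(self, char_vocab: int, word_vocab: int):
        super().__init__()
        self.char_submodel = GranularEncoder(char_vocab)
        self.word_submodel = GranularEncoder(word_vocab)

    def forward(self, char_ids: torch.Tensor, word_ids: torch.Tensor):
        first_semantic = self.char_submodel(char_ids)    # one vector per character
        second_semantic = self.word_submodel(word_ids)   # one vector per word
        return first_semantic, second_semantic

# toy usage: a batch of one sentence, 11 characters / 7 words after segmentation
model = MixedGranularityModel(char_vocab=6000, word_vocab=30000)
chars = torch.randint(0, 6000, (1, 11))
words = torch.randint(0, 30000, (1, 7))
first, second = model(chars, words)
print(first.shape, second.shape)   # torch.Size([1, 11, 128]) torch.Size([1, 7, 128])
```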
  • Example 2 provides a language representation model system; optionally, the character-granularity language representation sub-model includes: a character-granularity input layer, a character-granularity encoding layer, and a character-granularity output layer, with characters as word segmentation units;
  • the character-granularity input layer is connected to the character-granularity encoding layer, and is configured to receive the sentence segmented with characters as word segmentation units, convert each character into a corresponding character vector in combination with the position of the character in the sentence and the paragraph in which the sentence is located, and send the character vector to the character-granularity encoding layer;
  • the character-granularity encoding layer is connected to the character-granularity output layer, and is configured to determine, based on the received character vectors, the first semantic vector corresponding to the semantics expressed by each character in the sentence, and to output the first semantic vector to the character-granularity output layer;
  • the character-granularity output layer is configured to output the received first semantic vector.
  • Example 3 provides a language representation model system, optionally,
  • the number of the character-granularity encoding layers is at least two, and the structure of the character-granularity encoding layer is a Transformer structure.
  • Example 4 provides a language representation model system, optionally,
  • the word-granularity language representation sub-model includes: a word-granularity input layer, a word-granularity encoding layer and a word-granularity output layer with words as word segmentation units;
  • the word-granularity input layer is connected to the word-granularity encoding layer, and is configured to receive the sentence segmented with words as word segmentation units, convert each word segment into a corresponding word vector in combination with the position of the word segment in the sentence and the paragraph in which the sentence is located, and send the word vector to the word-granularity encoding layer;
  • the word-granularity encoding layer is connected to the word-granularity output layer, and is configured to determine, based on the received word vectors, the second semantic vector corresponding to the semantics expressed by each word segment in the sentence, and to output the second semantic vector to the word-granularity output layer;
  • the word granularity output layer is used for outputting the received second semantic vector.
  • Example 5 provides a language representation model system; optionally, the number of the word-granularity encoding layers is at least two, and the structure of the word-granularity encoding layer is a Transformer structure (the embedding-plus-encoder arrangement of Examples 2–5 is sketched below).
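Examples 2–5 describe each sub-model as an input layer that combines a token embedding with the token's position in the sentence and the paragraph the sentence belongs to, followed by at least two Transformer encoder layers. Below is a hedged sketch of that input layer; the sizes, the `max_len` limit, and the name `SubModelInputLayer` are assumptions for illustration only.

```python
import torch
from torch import nn

class SubModelInputLayer(nn.Module):
    """Input layer of one granularity sub-model: token + position + paragraph embeddings."""
    def __init__(self, vocab_size: int, d_model: int = 128,
                 max_len: int = 512, num_paragraphs: int = 2):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.position_embed = nn.Embedding(max_len, d_model)
        self.paragraph_embed = nn.Embedding(num_paragraphs, d_model)

    def forward(self, token_ids, paragraph_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        # each token vector combines its identity, its position in the sentence,
        # and the paragraph the sentence is located in, as Examples 2 and 4 describe
        return (self.token_embed(token_ids)
                + self.position_embed(positions)
                + self.paragraph_embed(paragraph_ids))

# the encoding layer on top: at least two Transformer layers (Examples 3 and 5)
d_model = 128
input_layer = SubModelInputLayer(vocab_size=6000, d_model=d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

token_ids = torch.randint(0, 6000, (1, 11))
paragraph_ids = torch.zeros_like(token_ids)
semantic_vectors = encoder(input_layer(token_ids, paragraph_ids))
print(semantic_vectors.shape)   # torch.Size([1, 11, 128])
```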
  • Example 6 provides a pre-training method for a language representation model, the method comprising:
  • determining corpus samples for pre-training;
  • performing word segmentation on the corpus samples with words as the unit and with characters as the unit respectively, to obtain a word segmentation result and a character segmentation result (a segmentation sketch follows below);
  • pre-training the word-granularity language representation sub-model, which uses words as the unit, in the language representation model by using the word segmentation result, to obtain a pre-trained word-granularity language representation sub-model; and
  • pre-training the character-granularity language representation sub-model, which uses characters as the unit, in the language representation model by using the character segmentation result, to obtain a pre-trained character-granularity language representation sub-model.
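The first two steps of Example 6 — segmenting the same corpus once into characters and once into words — can be sketched as follows. The use of `jieba` as the word segmenter is purely an assumption for illustration; the disclosure does not name a particular segmentation tool, and a different segmenter may split the sentence differently.

```python
# pip install jieba  -- an off-the-shelf Chinese word segmenter, used here only as a stand-in
import jieba

sentence = "商店里的乒乓球拍卖完了"   # the ambiguous example discussed in the description

char_result = list(sentence)          # character-granularity segmentation
word_result = jieba.lcut(sentence)    # word-granularity segmentation (segmenter-dependent)

print(char_result)   # ['商', '店', '里', '的', '乒', '乓', '球', '拍', '卖', '完', '了']
print(word_result)   # e.g. ['商店', '里', '的', '乒乓球', '拍卖', '完', '了']
```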
  • Example 7 provides a pre-training method for a language representation model, optionally,
  • the pre-training of the word-granularity language representation sub-model, which uses words as the unit, in the language representation model by using the word segmentation result includes:
  • masking a set proportion of the word segments in the word segmentation result that uses words as the unit;
  • inputting the masked word segments and the unmasked word segments to the word-granularity input layer of the word-granularity language representation sub-model, so that the word-granularity input layer converts each word segment into a corresponding word vector in combination with the position of the word segment in the corpus and the paragraph in which the corpus is located, and sends the word vector to the word-granularity encoding layer of the word-granularity language representation sub-model;
  • determining, by the word-granularity encoding layer, the second semantic vector corresponding to the semantics expressed by each word segment in the corpus, and outputting the second semantic vector to the word-granularity output layer of the word-granularity language representation sub-model;
  • outputting, by the word-granularity output layer, the masked word segments based on the second semantic vector of each word segment; and
  • ending the pre-training when the accuracy of the word segments output by the word-granularity output layer reaches a second threshold.
  • Example 8 provides a pre-training method for a language representation model, optionally,
  • the pre-training of the character-granularity language representation sub-model, which uses characters as the unit, in the language representation model by using the character segmentation result includes:
  • masking a set proportion of the characters in the character segmentation result that uses characters as the unit;
  • inputting the masked characters and the unmasked characters to the character-granularity input layer of the character-granularity language representation sub-model, so that the character-granularity input layer converts each character into a corresponding character vector in combination with the position of the character in the corpus and the paragraph in which the corpus is located, and sends the character vector to the character-granularity encoding layer of the character-granularity language representation sub-model;
  • determining, by the character-granularity encoding layer, the first semantic vector corresponding to the semantics expressed by each character in the corpus, and outputting the first semantic vector to the character-granularity output layer of the character-granularity language representation sub-model;
  • outputting, by the character-granularity output layer, the masked characters based on the first semantic vector of each character; and
  • ending the pre-training when the accuracy of the characters output by the character-granularity output layer reaches a first threshold (a sketch of this masked-prediction step follows below).
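Examples 7 and 8 apply the same masked-prediction routine to the word-granularity and character-granularity sub-models respectively: a set proportion of segments is masked (the description cites 15% as a typical value), the sub-model predicts the masked segments from the surrounding context, and pre-training stops once prediction accuracy reaches a threshold. A rough sketch of one such training step follows; the `[MASK]` id, the masking strategy, and all hyperparameters are assumptions, not the claimed procedure.

```python
import torch
from torch import nn

MASK_ID = 0          # id reserved for the [MASK] token -- an assumption
MASK_RATIO = 0.15    # "set proportion" of segments to mask; 15% per the description

def mask_tokens(token_ids: torch.Tensor):
    """Randomly mask a set proportion of tokens; return masked inputs and the original labels."""
    labels = token_ids.clone()
    masked = torch.rand_like(token_ids, dtype=torch.float) < MASK_RATIO
    inputs = token_ids.masked_fill(masked, MASK_ID)
    labels[~masked] = -100            # positions ignored by the loss
    return inputs, labels

def pretraining_step(submodel, output_head, token_ids, optimizer):
    """One masked-prediction step for either the word- or character-granularity sub-model."""
    inputs, labels = mask_tokens(token_ids)
    semantic_vectors = submodel(inputs)                  # (batch, seq, d_model)
    logits = output_head(semantic_vectors)               # (batch, seq, vocab)
    loss = nn.functional.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage with a stand-in embedding encoder; in practice the sub-model would be
# a Transformer stack like the one sketched after Example 5
encoder = nn.Sequential(nn.Embedding(6000, 128))
head = nn.Linear(128, 6000)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
print(pretraining_step(encoder, head, torch.randint(1, 6000, (2, 16)), opt))
```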
  • Example 9 provides a pre-training method for a language representation model, optionally,
  • the number of the character-granularity encoding layers is at least two, and the structure of the character-granularity encoding layer is a Transformer structure;
  • the number of the word-granularity encoding layers is at least two, and the structure of the word-granularity encoding layer is a Transformer structure.
  • Example 10 provides a pre-training method for a language representation model, optionally,
  • the parameters of the language representation model include: character vector parameters corresponding to the character-granularity language representation sub-model, word vector parameters corresponding to the word-granularity language representation sub-model, and the association parameters of the Transformer, the position vectors, and the paragraph vectors shared by the character-granularity language representation sub-model and the word-granularity language representation sub-model (see the parameter-sharing sketch below).
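Example 10 lists the model's parameter groups: a character-vector table for one sub-model, a word-vector table for the other, and Transformer parameters, position vectors and paragraph vectors that the two sub-models share. One way to picture that sharing is sketched below; the class name, sizes, and the degree of sharing of the encoder stack are assumptions for illustration.

```python
import torch
from torch import nn

class SharedBackboneMixedModel(nn.Module):
    """Separate char/word vocab tables; shared position, paragraph and Transformer parameters."""
    def __init__(self, char_vocab=6000, word_vocab=30000, d_model=128,
                 max_len=512, num_paragraphs=2, num_layers=2):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab, d_model)            # character vector parameters
        self.word_embed = nn.Embedding(word_vocab, d_model)            # word vector parameters
        self.position_embed = nn.Embedding(max_len, d_model)           # shared position vectors
        self.paragraph_embed = nn.Embedding(num_paragraphs, d_model)   # shared paragraph vectors
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)  # shared Transformer

    def _encode(self, token_embed, token_ids, paragraph_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = (token_embed(token_ids)
             + self.position_embed(positions)
             + self.paragraph_embed(paragraph_ids))
        return self.encoder(x)

    def forward(self, char_ids, word_ids, char_par, word_par):
        first_semantic = self._encode(self.char_embed, char_ids, char_par)
        second_semantic = self._encode(self.word_embed, word_ids, word_par)
        return first_semantic, second_semantic

m = SharedBackboneMixedModel()
f, s = m(torch.randint(0, 6000, (1, 11)), torch.randint(0, 30000, (1, 7)),
         torch.zeros(1, 11, dtype=torch.long), torch.zeros(1, 7, dtype=torch.long))
print(f.shape, s.shape)
```

Keeping a single copy of the position and paragraph tables is what the description credits with reducing the number of model parameters and the complexity of the language representation model.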
  • Example 11 provides a natural language processing method, including:
  • determining a fine-tuning sample corpus based on a natural language processing task;
  • performing word segmentation on the fine-tuning sample corpus with words as the unit and with characters as the unit respectively, to obtain a fine-tuning word segmentation result and a fine-tuning character segmentation result;
  • fine-tuning the pre-trained language representation model by using the fine-tuning word segmentation result and the fine-tuning character segmentation result; and
  • processing the natural language to be processed through the fine-tuned language representation model.
  • Example 12 provides a natural language processing method, optionally,
  • the natural language processing task includes at least one of the following: semantic similarity calculation, text classification, language inference, keyword recognition, and reading comprehension (a fine-tuning head for the semantic-similarity case is sketched below).
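Examples 11 and 12 fine-tune the pre-trained model on a concrete downstream task. As one illustration (not the claimed method), a sentence-pair semantic-similarity head could pool the character-granularity and word-granularity vectors at the sentence-start position ("cls" in the figures) and feed their concatenation to a binary classifier. Every name and dimension below is an assumption.

```python
import torch
from torch import nn

class SimilarityHead(nn.Module):
    """Binary similar/dissimilar classifier on top of the two granularity outputs."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        self.classifier = nn.Linear(2 * d_model, 2)   # label 0 = different meaning, 1 = same meaning

    def forward(self, first_semantic: torch.Tensor, second_semantic: torch.Tensor):
        # take the sentence-start ("cls") vector from each sub-model and concatenate
        pooled = torch.cat([first_semantic[:, 0], second_semantic[:, 0]], dim=-1)
        return self.classifier(pooled)

# toy usage with random sub-model outputs of shape (batch, seq_len, d_model)
head = SimilarityHead()
logits = head(torch.randn(4, 11, 128), torch.randn(4, 7, 128))
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1, 1, 0]))
print(logits.shape, loss.item())
```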
  • Example 13 provides a pre-training apparatus for a language representation model, including:
  • a determination module, configured to determine corpus samples for pre-training;
  • a word segmentation module, configured to perform word segmentation on the corpus samples with words as the unit and with characters as the unit respectively, to obtain a word segmentation result and a character segmentation result;
  • a first pre-training module, configured to pre-train the word-granularity language representation sub-model, which uses words as the unit, in the language representation model by using the word segmentation result, to obtain a pre-trained word-granularity language representation sub-model; and
  • a second pre-training module, configured to pre-train the character-granularity language representation sub-model, which uses characters as the unit, in the language representation model by using the character segmentation result, to obtain a pre-trained character-granularity language representation sub-model.
  • Example 14 provides a natural language processing apparatus, including: a determination module, configured to determine a fine-tuning sample corpus based on a natural language processing task;
  • a word segmentation module, configured to perform word segmentation on the fine-tuning sample corpus with words as the unit and with characters as the unit respectively, to obtain a fine-tuning word segmentation result and a fine-tuning character segmentation result;
  • a fine-tuning module, configured to fine-tune the pre-trained language representation model by using the fine-tuning word segmentation result and the fine-tuning character segmentation result; and
  • a processing module, configured to process the natural language to be processed through the fine-tuned language representation model (an illustrative orchestration of these modules follows below).
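Examples 13 and 14 split the same workflow into modules: determine the corpus, segment it at both granularities, pre-train each sub-model, then fine-tune and process. The orchestration below is purely illustrative; every callable name is assumed and each stands in for one of the modules described above.

```python
def run_pipeline(corpus, task_corpus, model, segment_words, segment_chars,
                 pretrain, finetune, process):
    """Illustrative orchestration only; each callable stands in for a module of Examples 13-14."""
    word_result = [segment_words(s) for s in corpus]    # word segmentation module
    char_result = [segment_chars(s) for s in corpus]
    pretrain(model.word_submodel, word_result)          # first pre-training module
    pretrain(model.char_submodel, char_result)          # second pre-training module
    finetune(model, task_corpus)                        # fine-tuning module
    return [process(model, s) for s in task_corpus]     # processing module
```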
  • Example 15 provides an electronic device, the electronic device includes:
  • one or more processors;
  • a storage device, configured to store one or more programs,
  • wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the following pre-training method for a language representation model and natural language processing method:
  • pre-training the word-granularity language representation sub-model, which uses words as the unit, in the language representation model by using the word segmentation result, and pre-training the character-granularity language representation sub-model, which uses characters as the unit, by using the character segmentation result, to obtain the pre-trained sub-models; and
  • processing the natural language to be processed through the fine-tuned language representation model.
  • Example 16 provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the following pre-training method for a language representation model and natural language processing method:
  • pre-training the word-granularity language representation sub-model, which uses words as the unit, in the language representation model by using the word segmentation result, and pre-training the character-granularity language representation sub-model, which uses characters as the unit, by using the character segmentation result, to obtain the pre-trained sub-models; and
  • processing the natural language to be processed through the fine-tuned language representation model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

本公开实施例公开了一种语言表示模型系统、语言表示模型的预训练方法、自然语言处理方法、装置、设备及介质,所述语言表示模型系统包括:以字为分词单位的字粒度语言表示子模型和以词为分词单位的词粒度语言表示子模型;其中,字粒度语言表示子模型用于基于以字为分词单位的语句,输出每个分字在所述语句中所表达的语义对应的第一语义向量;词粒度语言表示子模型用于基于以词为分词单位的所述语句,输出每个分词在所述语句中所表达的语义对应的第二语义向量。本公开实施例的技术方案,提供了一种混合粒度语言表示模型,为下游自然语言处理任务提供了模型基础,有助于提升下游自然语言处理任务的处理精度,提升了语言表示模型的迁移效果。

Description

语言表示模型系统、预训练方法、装置、设备及介质
相关申请的交叉引用
本申请要求于2020年07月29日提交的申请号为202010746066.0、名称为“语言表示模型系统、预训练方法、装置、设备及介质”的中国专利申请的优先权,此申请的内容通过引用并入本文。
技术领域
本公开实施例涉及计算机技术领域,尤其涉及一种语言表示模型系统、语言表示模型预训练方法、自然语言处理方法、装置、电子设备及存储介质。
背景技术
随着人工智能技术的发展,自然语言处理的相关应用已经变得无处不在,常见的自然语言处理的相关应用例如有机器翻译、智能问答机器人以及机器阅读理解等。自然语言处理的相关应用能够快速发展的原因很大程度上归功于通过语言表示模型实现迁移学习的理念。在自然语言处理领域,迁移学习的本质是首先通过一个样本集对语言表示模型进行预训练,然后再根据具体的自然语言处理任务对预训练得到语言表示模型进行二次微调训练,使得训练后的语言表示模型可以执行不同功能的自然语言处理任务,例如语义分析、文本分类或者语言推理等处理任务。
发明内容
本公开实施例提供一种语言表示模型系统、语言表示模型的预训练方法、自然语言处理方法、装置、电子设备及存储介质,提供了一种可以分别以词为单位对自然语言进行词粒度表示,以及以字为单位对自然语言进行字粒度表示的混合粒度语言表示模型,为下游的自然语言处理任务提供了模型基础,有助于提升下游自然语言处理任务的处理精度,提升了语言表示模型的迁移效果。
第一方面,本公开实施例提供了一种语言表示模型系统,该系统包括:
以字为分词单位的字粒度语言表示子模型,和以词为单位的词粒度语言表示子模型;
其中,所述字粒度语言表示子模型用于基于以字为分词单位的语句,输出每个分字在所述语句中所表达的语义对应的第一语义向量;
所述词粒度语言表示子模型用于基于以词为分词单位的所述语句,输出每个分词在所述语句中所表达的语义对应的第二语义向量。
第二方面,本公开实施例还提供了一种语言表示模型的预训练方法,该方法包括:
确定用于预训练的语料样本;
对所述语料样本分别以词为单位以及以字为单位进行分词,获得分词结果和分字结果;
利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,获得预训练好的词粒度语言表示子模型;
利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,获得预训练好的字粒度语言表示子模型。
第三方面,本公开实施例还提供了一种自然语言处理方法,该方法包括:
基于自然语言处理任务确定微调样本语料;
对所述微调样本语料分别以词为单位以及以字为单位进行分词,获得微调分词结果和微调分字结果;
利用所述微调分词结果以及所述微调分字结果,对预训练好的语言表示模型进行微调;
通过微调后的语言表示模型对待处理自然语言进行处理。
第四方面,本公开实施例还提供了一种语言表示模型的预训练装置,该装置包括:
确定模块,用于确定用于预训练的语料样本;
分词模块,用于对所述语料样本分别以词为单位以及以字为单位进行分词,获得分词结果和分字结果;
第一预训练模块,用于利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,获得预训练好的词粒度语言表示子模型;
第二预训练模块,用于利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,获得预训练好的字粒度语言表示子模型。
第五方面,本公开实施例还提供了一种自然语言处理装置,该装置包括:
确定模块,用于基于自然语言处理任务确定微调样本语料;
分词模块,用于对所述微调样本语料分别以词为单位以及以字为单位进行分词,获得微调分词结果和微调分字结果;
微调模块,用于利用所述微调分词结果以及所述微调分字结果对预训练好的语言表示模型进行微调;
处理模块,用于通过微调后的语言表示模型对待处理自然语言进行处理。
第六方面,本公开实施例还提供了一种设备,所述设备包括:
一个或多个处理器;
存储装置,用于存储一个或多个程序,
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如本公开实施例任一所述的语言表示模型的预训练方法。
第七方面,本公开实施例还提供了一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行如本公开实施例任一所述的语言表示模型的预训练方法。
第八方面,本公开实施例还提供了一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行如本公开实施例任一所述的自然语言处理方法。
第九方面,本公开实施例还提供了一种计算机程序产品,包括承载在非暂态计算机可读介质上的计算机程序,所述计算机程序包含程序代码,用于执行如本公开实施例任一项所述的语言表示模型的预训练方法。
第十方面,本公开实施例还提供了一种计算机程序产品,包括承载在非暂态计算机可读介质上的计算机程序,所述计算机程序包含程序代码,用于执行如本公开实施例任一项所述的自然语言处理方法。
第十一方面,本公开实施例还提供了一种计算机程序,包括承载在非暂态计算机可读介质上的计算机程序,所述计算机程序包含程序代码,用于执行如本公开实施例任一项所述的语言表示模型的预训练方法。
第十二方面,本公开实施例还提供了一种计算机程序,包括承载在非暂态计算机可读介质上的计算机程序,所述计算机程序包含程序代码,用于执行如本公开实施例任一项所述的自然语言处理方法。本公开实施例提供的一种语言表示模型系统包括:以字为分词单位的字粒度语言表示子模型和以词为单位的词粒度语言表示子模型;其中,所述字粒度语言表示子模型用于基于以字为分词单位的语句,输出每个分字在所述语句中所表达的语义对应的第一语义向量;所述词粒度语言表示子模型用于基于以词为分词单位的所述语句输出每个分词在所述语句中所表达的语义对应的第二语义向量的技术手段,提供了一种可以分别以词为单位对自然语言进行词粒度表示,以及以字为单位对自然语言进行字粒度表示的混合粒度语言表示模型,为下游的自然语言处理任务提供了模型基础,有助于提升下游自然语言处理任务的处理精度,提升了语言表示模型的迁移效果。
附图说明
结合附图并参考以下具体实施方式,本公开各实施例的上述和其他特征、优点及方面将变得更加明显。贯穿附图中,相同或相似的附图标记表示相同或相似的元素。应当理解附图是示意性的,元件和元素不一定按照比例绘制。
图1为本公开实施例一所提供的一种语言表示模型系统的结构示意图;
图2为本公开实施例二所提供的另一种语言表示模型系统的结构示意图;
图3为本公开实施例三所提供的一种语言表示模型的预训练方法的流程示意图;
图4为本公开实施例三所提供的另一种语言表示模型的预训练方法的流程示意图;
图5为本公开实施例四所提供的一种自然语言处理方法的流程示意图;
图6为本公开实施例四所提供的另一种自然语言处理方法的流程示意图;
图7为本公开实施例五所提供的一种语言表示模型的预训练装置的结构示意图;
图8为本公开实施例六所提供的一种自然语言处理装置的结构示意图;
图9为本公开实施例七所提供的一种电子设备结构示意图。
具体实施方式
下面将参照附图更详细地描述本公开的实施例。虽然附图中显示了本公开的某些实施例,然而应当理解的是,本公开可以通过各种形式来实现,而且不应该被解释为限于这里阐述的实施例,相反提供这些实施例是为了更加透彻和完整地理解本公开。应当理解的是,本公开的附图及实施例仅用于示例性作用,并非用于限制本公开的保护范围。
应当理解,本公开的方法实施方式中记载的各个步骤可以按照不同的顺序执行,和/或并行执行。此外,方法实施方式可以包括附加的步骤和/或省略执行示出的步骤。本公开的范围在此方面不受限制。
本文使用的术语“包括”及其变形是开放性包括,即“包括但不限于”。术语“基于”是“至少部分地基于”。术语“一个实施例”表示“至少一个实施例”;术语“另一实施例”表 示“至少一个另外的实施例”;术语“一些实施例”表示“至少一些实施例”。其他术语的相关定义将在下文描述中给出。
需要注意,本公开中提及的“第一”、“第二”等概念仅用于对不同的装置、模块或单元进行区分,并非用于限定这些装置、模块或单元所执行的功能的顺序或者相互依存关系。
需要注意,本公开中提及的“一个”、“多个”的修饰是示意性而非限制性的,本领域技术人员应当理解,除非在上下文另有明确指出,否则应该理解为“一个或多个”。
目前常用的语言表示模型通常是单粒度语言表示模型,例如以字为单位的字粒度语言表示模型或者以词为单位的词粒度语言表示模型。
在实现本申请的过程中,发明人发现上述以字为单位的字粒度语言表示模型没有考虑字符的错误注意力权重问题,当一个字符在特定语境中与另外一个字符没有关系,但是却因为这两个字符在训练样本语料中关系比较紧密,而被给予了不合理的注意力权重,进而导致该字符的语义在该特定语境中被错误表示的问题。上述以词为单位的词粒度语言表示模型又会因为错误的分割带来错误的语义解析。
实施例一
图1为本公开实施例一所提供的一种语言表示模型系统的结构示意图。如图1所示,所述系统包括:以字为分词单位的字粒度语言表示子模型110和以词为单位的词粒度语言表示子模型120。
其中,所述字粒度语言表示子模型110用于基于以字为分词单位的语句,输出每个分字在所述语句中所表达的语义对应的第一语义向量;
所述词粒度语言表示子模型120用于基于以词为分词单位的所述语句,输出每个分词在所述语句中所表达的语义对应的第二语义向量。
以语句“商店里的乒乓球拍卖完了”为例,以字为分词单位对所述语句进行分词,获得的分字结果为:商/店/里/的/乒/乓/球/拍/卖/完/了。以词为分词单位对所述语句进行分词,获得的分词结果为:商店/里/的/乒乓球/拍卖/完/了。可以看出在上述语句的语境中,如果以词为分词单位,则会对所述语句的语义产生错误的理解,其原因是对“乒乓球拍”的“拍”与“卖”进行了错误的关联与切分。因此,如果单独采用词粒度语言表示子模型120,则无法对类似上述语句的语义进行准确地理解,通常会因为错误的分词分割带来错误的解析。
以语句“we want to hang a portrait in drawing room”为例,以字为分词单位对所述语句进行分词,获得的分字结果为:we/want/to/hang/a/portrait/in drawing/room。以词为分词单位对所述语句进行分词,获得的分词结果为:we/want to/hang/a/portrait/in drawing room。可以看出在上述语句的语境中,如果以字为分词单位,则可能会对所述语句的语义产生错误的理解,其原因是对“drawing”进行了错误的理解,因为“drawing”的前面出现了“portrait”,因此容易使系统将“drawing”理解为与“portrait”关联的含义,即:绘画,然而所述“drawing”在上述语句中与“room”关联,理解为“客厅”。因此,如果单独采用字粒度语言表示子模型110,则无法对类似上语句“we want to hang a portrait in drawing room”的语义进行准确地理解。
针对上述问题,本实施例提供的一种语言表示模型既包括以字为分词单位的字粒度语言表示子模型,又包括以词为单位的词粒度语言表示子模型,针对同一语句,既可以表达出每个分字在语句中所表达的语义对应的第一语义向量,又可以表达出每个分词在语句中所表达 的语义对应的第二语义向量,为下游的自然语言处理任务提供了语义理解基础,在具体的语言处理任务中通过字粒度特征矫正词粒度特征,通过词粒度特征丰富字粒度特征的表达,有助于提升下游自然语言处理任务的处理精度,提升了语言表示模型的迁移效果。
实施例二
图2为本公开实施例二所提供的另一种语言表示模型的结构示意图。在上述实施例的基础上,本公开实施例分别对词粒度语言表示子模型以及字粒度语言表示子模型的结构进行了进一步说明,具体是给出了所述词粒度语言表示子模型以及字粒度语言表示子模型的结构。
如图2所示,所述字粒度语言表示子模型210包括:以字为分词单位的字粒度输入层211、字粒度编码层212和字粒度输出层213。其中,所述字粒度输入层211与所述字粒度编码层212相连,用于接收以字为分词单位的语句,并结合每个分字在所述语句中的位置和所述语句所处的段落,将每个分字转换为对应的字向量,将所述字向量发送至所述字粒度编码层212;所述字粒度编码层212与所述字粒度输出层213相连,用于基于接收到字向量确定每个分字在所述语句中所表达的语义对应的第一语义向量,并将所述第一语义向量输出至所述字粒度输出层213;所述字粒度输出层213用于输出接收到的所述第一语义向量。所述字粒度编码层212的数量为至少两层,所述字粒度编码层的结构为Transformer结构。
所述词粒度语言表示子模型220包括:以词为分词单位的词粒度输入层221、词粒度编码层222和词粒度输出层223;
其中,所述词粒度输入层221与所述词粒度编码层222相连,用于接收以词为分词单位的语句,并结合每个分词在所述语句中的位置和所述语句所处的段落,将每个分词转换为对应的词向量,因此,每个分词对应的词向量包括所述分词的位置向量、段落向量以及预先学习到的对应的分词的嵌向量embedding。将所述词向量发送至所述词粒度编码层222;所述词粒度编码层222与所述词粒度输出层223相连,用于基于接收到词向量确定每个分词在所述语句中所表达的语义对应的第二语义向量,并将所述第二语义向量输出至所述词粒度输出层223;所述词粒度输出层223用于输出接收到的所述第二语义向量。所述词粒度编码层222的数量为至少两层,所述词粒度编码层222的结构为Transformer结构。Transformer结构包括编码组件、解码组件以及它们之间的连接关系,编码组件通常由至少两个编码器encoder构成,解码组件由与编码器的数量相同的解码器构成。编码器包含两层,一个自注意力self-attention层和一个前馈神经网络层,自注意力层帮助当前节点不仅仅只关注当前的词,以获取到上下文的语义。解码器decoder也包含一个自注意力层和一个前馈神经网络层,但在这两个层之间还有一层注意力attention层,帮助当前节点获取到当前需要关注的重点内容。Transformer结构广泛应用于自然语言处理NLP领域,例如机器翻译、问答系统、文本摘要和语音识别等方向,所表现出的性能较优越。
其中,如图2所示的“cls”表示语句的开头,“sep”表示语句的结尾。
本公开实施例的技术方案,所述字粒度语言表示子模型包括:以字为分词单位的字粒度输入层、字粒度编码层和字粒度输出层;所述词粒度语言表示子模型包括:以词为分词单位的词粒度输入层、词粒度编码层和词粒度输出层;实现了针对同一语句给出词粒度的每个分词的语义表示,以及字粒度的每个字的语义表示的目的,为下游的自然语言处理任务提供了理解基础,有助于提升下游自然语言处理任务的处理精度,提升了语言表示模型的迁移效果。
实施例三
图3为本公开实施例三提供的一种语言表示模型的预训练方法,用于对上述实施例所述的语言表示模型进行预训练,以使所述语言表示模型具备对输入语句进行词粒度以及字粒度语义表示的功能,为下游的自然语言处理任务提供理解基础。如图3所示,该方法包括如下步骤:
步骤310、确定用于预训练的语料样本。
其中,所述语料样本可从常用网站获取,或者针对容易产生歧义的语句,通过人工的方式进行收集与整理。所述容易产生歧义的语句例如“商店里的乒乓球拍卖完了”或者“we want to hang a portrait in drawing room”等。
步骤320、对所述语料样本分别以词为单位以及以字为单位进行分词,获得分词结果和分字结果。
以语句“商店里的乒乓球拍卖完了”为例,以字为分词单位对所述语句进行分词,获得的分字结果为:商/店/里/的/乒/乓/球/拍/卖/完/了。以词为分词单位对所述语句进行分词,获得的分词结果为:商店/里/的/乒乓球/拍卖/完/了。
以语句“we want to hang a portrait in drawing room”为例,以字为分词单位对所述语句进行分词,获得的分字结果为:we/want/to/hang/a/portrait/in drawing/room。以词为分词单位对所述语句进行分词,获得的分词结果为:we/want to/hang/a/portrait/in drawing room。
步骤330、利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,获得预训练好的词粒度语言表示子模型。
示例性的,参考图4所示的另一种语言表示模型的预训练方法的流程示意图,所述利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练包括:
对以词为单位的分词结果中设定比例的分词进行掩码MASK;
将被掩码的分词以及未被掩码的分词输入至所述词粒度语言表示子模型的词粒度输入层,以通过所述词粒度输入层结合每个所述分词在语料中的位置和所述语料所处的段落,将每个所述分词转换为对应的词向量,将所述词向量发送至所述词粒度语言表示子模型的词粒度编码层;
通过所述词粒度编码层确定每个分词在所述语料中所表达的语义对应的第二语义向量,并将所述第二语义向量输出至所述词粒度语言表示子模型的词粒度输出层;
通过所述词粒度输出层基于每个所述分词的所述第二语义向量,输出被掩码的分词;
当所述词粒度输出层所输出分词的准确度达到第二阈值时,预训练结束。
其中,所述设定比例通常为15%,从分词结果中随机确定15%的分词,对这些分词进行掩码MASK,具体的,利用设定的不表示任何含义的嵌向量表示需要被掩码的分词,但需要针对每个分词,不管是需要被掩码的,还是不需要被掩码的,添加位置标识与段落标识,所述位置标识表示当前分词在当前语句中的位置,所述段落标识表示当前分词所在语句所处的段落。
所述词粒度输出层具体用于基于被掩码的分词的位置标识以及段落标识,结合被掩码分词的前后分词的语义预测被掩码的分词,具体是计算被掩码分词为某特定分词的概率,将概 率最大的分词输出。如图4所示,词粒度输出层最终输出被掩码的分词“商店”,字粒度输出层最终输出被掩码的字“商”、“店”。
步骤340、利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,获得预训练好的字粒度语言表示子模型。
对字粒度语言表示子模型的预训练过程与对词粒度语言表示子模型的预训练过程类似,示例性的,
所述利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,包括:
对以字为单位的分字结果中设定比例的分字进行掩码;
将被掩码的分字以及未被掩码的分字输入至所述字粒度语言表示子模型的字粒度输入层,以通过所述字粒度输入层结合每个所述分字在语料中的位置和所述语料所处的段落,将每个所述分字转换为对应的字向量,将所述字向量发送至所述字粒度语言表示子模型的字粒度编码层;
通过所述字粒度编码层,确定每个分字在所述语料中所表达的语义对应的第一语义向量,并将所述第一语义向量输出至所述字粒度语言表示子模型的字粒度输出层;
通过所述字粒度输出层,基于每个所述分字的所述第一语义向量,输出被掩码的分字;
当所述字粒度输出层所输出分字的准确度达到第一阈值时,预训练结束。
进一步的,所述字粒度编码层的数量为至少两层,所述字粒度编码层的结构为Transformer结构;所述词粒度编码层的数量为至少两层,所述词粒度编码层的结构为Transformer结构。需要说明的是,所述语言表示模型的参数包括:与所述字粒度语言表示子模型对应的字向量参数、与所述词粒度语言表示子模型对应的词向量参数、所述字粒度语言表示子模型与所述词粒度语言表示子模型共享的Transformer的关联参数、位置向量和段落向量。所述字向量参数具体可指预先学习到的每个分字的嵌向量embedding,或者用于确定所述嵌向量的相关矩阵。所述词向量参数具体可指预先学习到的每个分词的嵌向量embedding,或者用于确定所述嵌向量的相关矩阵。所述位置向量具体表示分割出的词或者字在语句中所处的位置,所述段落向量表示被分词的语句所处的段落的信息。所述位置向量与所述段落向量,所述字粒度语言表示子模型与所述词粒度语言表示子模型可以共享,因此保留一套即可,无需针对所述字粒度语言表示子模型与所述词粒度语言表示子模型分别保存,从而可以降低模型参数的数量级,并降低语言表示模型的复杂度。所述Transformer的关联参数指在预训练过程中学习到的参数,例如全连接层的参数等。
可以理解的是,步骤330与步骤340之间并没有时序先后的限定,可以先执行步骤330的操作,后执行步骤340的操作,也可以先执行步骤340的操作,后执行步骤330的操作,还可以步骤330的操作与步骤340的操作并行执行。
本公开实施例提供的语言表示模型的预训练方法,通过将同一语句切割为不同粒度的分词单元,然后分别基于不同粒度的分词对不同粒度的语言表示子模型进行预训练,获得了能够同时将输入语句表示为词粒度的分词的语义向量以及字粒度的分词的语义向量的混合粒度语言表示模型,为下游的自然语言处理任务提供了语义理解基础,在具体的语言处理任务中通过字粒度特征矫正词粒度特征,通过词粒度特征丰富字粒度特征的表达,有助于提升下游自然语言处理任务的处理精度,提升了语言表示模型的迁移效果。
实施例四
图5为本公开实施例四提供的一种自然语言处理方法的流程示意图,具体是基于上述实施例公开的语言表示模型执行具体的自然语言处理任务,所述自然语言处理任务包括下述至少一种:语义相似度计算、文本分类、语言推理、关键词识别以及阅读理解。
如图5所示,所述自然语言处理方法包括如下步骤:
步骤510、基于自然语言处理任务确定微调样本语料。
其中,上述实施例公开的语言表示模型仅用于对自然语言的语句进行不同粒度的语义表示,为所有的自然语言处理任务提供理解基础。当需要进行具体的自然语言处理任务时,还需根据具体的自然语言处理任务对所述语言表示模型进行进一步的微调训练。
例如,所述自然语言处理任务为语义相似度计算,所述微调样本语料的形式为:语句1:语句2:相似度标签;其中,“0”表示语句1与语句2的含义不同,即两者的语义不相似,“1”表示语句1与语句2的含义相同,即两者的语义相似。
所述自然语言处理任务为文本分类,所述微调样本语料的形式为:分类标签:分类名称:待分类文本。
步骤520、对所述微调样本语料分别以词为单位以及以字为单位进行分词,获得微调分词结果和微调分字结果。
步骤530、利用所述微调分词结果以及所述微调分字结果,对预训练好的语言表示模型进行微调。
即利用所述微调分词结果以及所述微调分字结果,对预训练好的语言表示模型进行二次训练,使所述语言表示模型学习具体的任务处理策略。
步骤540、通过微调后的语言表示模型,对待处理自然语言进行处理。
进一步的,参考图6所示的另一种自然语言处理方法的流程示意图,其中,标号610表示混合粒度语言表示模型的预训练阶段,与本公开实施例三公开的内容对应,所述混合粒度语言表示模型即为上述任意实施例所述的语言表示模型,其包括以字为分词单位的字粒度语言表示子模型和以词为分词单位的词粒度语言表示子模型,故将其称为混合粒度语言表示模型。标号620表示对预训练好的语言表示模型进行二次训练,即微调阶段,具体是根据具体的自然语言处理任务,对预训练好的语言表示模型进行微调。标号630表示混合粒度的语言表示模型的下游应用场景,具体包括语义相似度计算任务AFQMC、短文本分类任务TNEWS、长文本分类任务IFLYTEK、语言推理任务CMNLI、中文指代消解任务CLUEWSC2020、论文关键词识别任务CSL、简体中文阅读理解任务CMRC2018、中文成语阅读理解填空任务CHID以及中文多选阅读理解任务C3。
其中,语义相似度计算任务AFQMC具体指:例如:{"sentence1":"双十一信用卡提额在哪","sentence2":"里可以提信用卡额度","label":"0"}。每一条数据有三个属性,从前往后分别是句子1,句子2,句子相似度标签。其中label标签,1表示sentence1和sentence2的含义类似,0表示sentence1和sentence2的含义不同。在具体任务的应用中,首先利用预先收集的与具体任务相关的语料对上述实施例所述的经过预训练后的语言表示模型进行微调,即对预训练后的语言表示模型进行再次训练,以使其适应具体的任务应用。以所述具体任务为语义相似度计算任务为例,首先收集语义相似的以及语义不相似的语料,语料的表示形式为 {"sentence1":"双十一信用卡提额在哪","sentence2":"里可以提信用卡额度","label":"0"},每一条数据有三个属性,从前往后分别是句子1,句子2,句子相似度标签(即其中的label标签,1表示sentence1和sentence2的含义类似,0表示sentence1和sentence2的含义不同)。然后利用收到的语料,对预训练后的语言表示模型进行再次训练,在再次训练的过程中,针对同一语料的语句,由于所述训练后的语言表示模型可分别表达出每个分字在语句中所表达的语义对应的第一语义向量,以及每个分词在语句中所表达的语义对应的第二语义向量,通过字粒度特征矫正词粒度特征,通过词粒度特征丰富字粒度特征的表达,从而可降低对语句的语义产生错误的理解的概率,以及因为错误的分词分割带来错误解析的概率,可提升具体的自然语言处理任务的学习效果,进而提高任务处理精度,所述预训练后的语言表示模型具备较高的模型迁移特性。短文本分类任务TNEWS具体指:例子:{"label":"102","label_des":"news_entertainment","sentence":"甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物"}。每一条数据有三个属性,从前往后分别是分类ID、分类名称、新闻字符串(仅含标题)。
长文本分类任务IFLYTEK具体指:例子{"label":"110","label_des":"社区超市","sentence":"开心快送超市创立于2016年,专注于打造移动端30分钟即时配送一站式购物平台,商品品类包含水果、蔬菜、肉禽蛋奶、海鲜水产、粮油调味、酒水饮料、休闲食品、日用品、外卖等。开心公司希望能以全新的商业模式,更高效快捷的仓储配送模式,致力于成为更快、更好、更多、更省的在线零售平台,带给消费者更好的消费体验,同时推动食品安全进程,成为一家让社会尊敬的互联网公司。开心一下,又好又快,1.配送时间提示更加清晰友好2.保障用户隐私的一些优化3.其他提高使用体验的调整4.修复了一些已知bug"}。每一条数据有三个属性,从前往后分别是类别ID、类别名称、文本内容。
语言推理任务CMNLI具体指,例子{"sentence1":"新的权利已经足够好了","sentence2":"每个人都很喜欢最新的福利","label":"neutral"}。每一条数据有三个属性,从前往后分别是句子1、句子2、蕴含关系标签。其中label标签有三种:neutral,entailment,contradiction。
中文指代消解任务CLUEWSC2020具体指,语言表示模型判断代词指代的具体目标是什么。例如:{"target":{"span2_index":37,"span1_index":5,"span1_text":"床","span2_text":"它"},"idx":261,"label":"false","text":"这时候放在床上枕头旁边的手机响了,我感到奇怪,因为欠费已被停机两个月,现在它突然响了。"}"true"表示代词确实是指代span1_text中的名词,"false"代表不是指代span1_text中的名词。
论文关键词识别任务CSL具体指:取自中文论文摘要及其关键词,论文选自部分中文社会科学和自然科学核心期刊。例子:{"id":1,"abst":"为解决传统均匀FFT波束形成算法引起的3维声呐成像分辨率降低的问题,该文提出分区域FFT波束形成算法.远场条件下,以保证成像分辨率为约束条件,以划分数量最少为目标,采用遗传算法作为优化手段将成像区域划分为多个区域.在每个区域内选取一个波束方向,获得每一个接收阵元收到该方向回波时的解调输出,以此为原始数据在该区域内进行传统均匀FFT波束形成。对FFT计算过程进行优化,降低新算法的计算量,使其满足3维成像声呐实时性的要求。仿真与实验结果表明,采用分区域FFT波束形成算法的成像分辨率较传统均匀FFT波束形成算法有显著提高,且满足实时性要求。","keyword":["水声学","FFT","波束形成","3维成像声呐"],"label":"1"}。每一条数据有四个属性,从前往后分别是数据ID、论文摘要、关键词、真假标签。
简体中文阅读理解任务CMRC2018具体指:给一段文章和一个问题,需要在文章中找到答案。
中文成语阅读理解填空任务CHID具体指:给一段文章以及挖去一些成语,并给出候选成语,让语言表示模型选择合适成语进行填空。
中文多选阅读理解任务C3具体指:根据一段对话,来完成结合对话内容的选择题目。
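The AFQMC-style record quoted above (sentence 1, sentence 2, a 0/1 similarity label) has to be segmented at both granularities before it can be used for fine-tuning. A small sketch of that preprocessing step is given below; `jieba` is again only a stand-in word segmenter and the field handling is an assumption for illustration.

```python
import json
import jieba   # stand-in word segmenter, not prescribed by the disclosure

record = json.loads(
    '{"sentence1": "双十一信用卡提额在哪", "sentence2": "里可以提信用卡额度", "label": "0"}')

def dual_segment(text: str):
    """Return the character-granularity and word-granularity segmentations of one sentence."""
    return list(text), jieba.lcut(text)

chars_1, words_1 = dual_segment(record["sentence1"])
chars_2, words_2 = dual_segment(record["sentence2"])
label = int(record["label"])   # 0: different meaning, 1: same meaning

# both granularities of both sentences, plus the label, form one fine-tuning example
example = {"chars": (chars_1, chars_2), "words": (words_1, words_2), "label": label}
print(example["label"], example["chars"][0][:3], example["words"][0][:2])
```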
在以上九个任务的测评中,本公开实施提供的语言表示模型的任务处理效果都大幅超过了相同语料训练的单粒度语言表示模型的任务处理效果,以及公开发表的同规模甚至更大规模的语言表示模型的任务处理效果。这是因为本公开实施例提供的语言表示模型是一种混合粒度语言表示模型,既包括以字为分词单位的字粒度语言表示子模型和以词为分词单位的词粒度语言表示子模型。针对同一自然语言的语句,通过本公开实施提供的语言表示模型,既可以表达出每个分字在语句中所表达的语义对应的第一语义向量,又可以表达出每个分词在语句中所表达的语义对应的第二语义向量,通过字粒度特征矫正词粒度特征,通过词粒度特征丰富字粒度特征的表达,从而可降低对语句的语义产生错误的理解的概率,以及因为错误的分词分割带来错误解析的概率,提升了自然语言处理任务的处理精度。
假设Acc.表示精确度,EM表示最大精确匹配,各语言表示模型的任务处理性能对照表如表1所示。模型BERT、RoBERTa、ALBERT-xlarge、XLNet-Mid和ERNIE是已公开的测评结果,部分处理任务(例如短文本分类任务TNEWS、中文指代消解任务CLUEWSC2020以及论文关键词识别任务CSL)使用数据增强后,利用本公开实施例提供的语言表示模型进行了重新测评,BERT复现指申请人通过使用BERT对上述处理任务进行评测获得的任务处理结果;混合粒度语言表示模型指本公开实施例所提供的既包括以字为分词单位的字粒度语言表示子模型又包括以词为分词单位的词粒度语言表示子模型的语言表示模型。其中,混合粒度语言表示代表一种混合粒度语言表示的思想,即对同一自然语句分别表达出其中每个字的语义向量,以及每个词的语义向量,混合粒度语言表示模型可以以任何一种单粒度语言表示模型为基础进行扩展获得,所述单粒度语言表示模型例如可以是BERT、RoBERTa或者ALBERT。通过表1可以看出,混合粒度语言表示模型在大多数任务处理应用所表现出的性能以及平均得分都远远超过了其它现有的单粒度语言表示模型,同时在预训练数据相同的情况下,混合粒度的语言表示模型的处理性能也远好于单粒度的语言表示模型的处理性能。
表1:各语言表示模型的任务处理性能对照表
Figure PCTCN2021108194-appb-000001
本公开实施例提供的自然语言处理方法,通过对上述实施例公开的混合粒度语言表示模型结合具体的处理任务进行微调训练,而后基于微调训练后的混合粒度语言表示模型进行自然语言任务处理,通过字粒度特征矫正词粒度特征,通过词粒度特征丰富字粒度特征的表达,提升了自然语言处理任务的处理精度。
实施例五
图7为本公开实施例五提供的一种语言表示模型的预训练装置,具体包括:确定模块710、分词模块720、第一预训练模块730和第二预训练模块740;
其中,确定模块710,用于确定用于预训练的语料样本;分词模块720,用于对所述语料样本分别以词为单位以及以字为单位进行分词,获得分词结果和分字结果;第一预训练模块730,用于利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,获得预训练好的词粒度语言表示子模型;第二预训练模块740,用于利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,获得预训练好的字粒度语言表示子模型。
在上述技术方案的基础上,第一预训练模块730包括:
掩码单元,用于对以词为单位的分词结果中设定比例的分词进行掩码;
输入单元,用于将被掩码的分词以及未被掩码的分词输入至所述词粒度语言表示子模型的词粒度输入层,以通过所述词粒度输入层结合每个所述分词在语料中的位置和所述语料所处的段落,将每个所述分词转换为对应的词向量,将所述词向量发送至所述词粒度语言表示子模型的词粒度编码层;
确定单元,用于通过所述词粒度编码层确定每个分词在所述语料中所表达的语义对应的第二语义向量,并将所述第二语义向量输出至所述词粒度语言表示子模型的词粒度输出层;
输出单元,用于通过所述词粒度输出层基于每个所述分词的所述第二语义向量输出被掩码的分词;当所述词粒度输出层所输出分词的准确度达到第二阈值时,预训练结束。
在上述技术方案的基础上,第二预训练模块740,包括:
掩码单元,用于对以字为单位的分字结果中设定比例的分字进行掩码;
输出单元,用于将被掩码的分字以及未被掩码的分字输入至所述字粒度语言表示子模型的字粒度输入层,以通过所述字粒度输入层结合每个所述分字在语料中的位置和所述语料所处的段落,将每个所述分字转换为对应的字向量,将所述字向量发送至所述字粒度语言表示子模型的字粒度编码层;
确定单元,用于通过所述字粒度编码层确定每个分字在所述语料中所表达的语义对应的第一语义向量,并将所述第一语义向量输出至所述字粒度语言表示子模型的字粒度输出层;
输出单元,用于通过所述字粒度输出层基于每个所述分字的所述第一语义向量输出被掩码的分字;当所述字粒度输出层所输出分字的准确度达到第一阈值时,预训练结束。
在上述技术方案的基础上,所述语言表示模型的参数包括:与所述字粒度语言表示子模型对应的字向量参数、与所述词粒度语言表示子模型对应的词向量参数、所述字粒度语言表示子模型与所述词粒度语言表示子模型共享的Transformer的关联参数、位置向量和段落向量。
本公开实施例的技术方案,通过将同一语句切割为不同粒度的分词单元,然后分别基于不同粒度的分词对不同粒度的语言表示子模型进行预训练,获得了能够同时将输入语句表示 为词粒度的分词的语义向量以及字粒度的分词的语义向量的混合粒度语言表示模型,为下游的自然语言处理任务提供了语义理解基础,在具体的语言处理任务中通过字粒度特征矫正词粒度特征,通过词粒度特征丰富字粒度特征的表达,有助于提升下游自然语言处理任务的处理精度,提升了语言表示模型的迁移效果。
本公开实施例所提供的一种语言表示模型的预训练装置可执行本公开任意实施例所提供的一种语言表示模型的预训练方法，具备执行方法相应的功能模块和有益效果。
值得注意的是,上述装置所包括的各个单元和模块只是按照功能逻辑进行划分的,但并不局限于上述的划分,只要能够实现相应的功能即可;另外,各功能单元的具体名称也只是为了便于相互区分,并不用于限制本公开实施例的保护范围。
实施例六
图8为本公开实施例六提供的一种自然语言处理装置,具体包括:确定模块810、分词模块820、微调模块830和处理模块840;
其中,确定模块810,用于基于自然语言处理任务确定微调样本语料;分词模块820,用于对所述微调样本语料分别以词为单位以及以字为单位进行分词,获得微调分词结果和微调分字结果;微调模块830,用于利用所述微调分词结果以及所述微调分字结果对预训练好的语言表示模型进行微调;处理模块840,用于通过微调后的语言表示模型对待处理自然语言进行处理。
在上述技术方案的基础上,所述自然语言处理任务包括下述至少一种:语义相似度计算、文本分类、语言推理、关键词识别以及阅读理解。
本公开实施例的技术方案,通过对上述实施例公开的混合粒度语言表示模型结合具体的处理任务进行微调训练,而后基于微调训练后的混合粒度语言表示模型进行自然语言任务处理,通过字粒度特征矫正词粒度特征,通过词粒度特征丰富字粒度特征的表达,提升了自然语言处理任务的处理精度。
本公开实施例所提供的一种自然语言处理装置可执行本公开任意实施例所提供的一种自然语言处理方法，具备执行方法相应的功能模块和有益效果。
值得注意的是,上述装置所包括的各个单元和模块只是按照功能逻辑进行划分的,但并不局限于上述的划分,只要能够实现相应的功能即可;另外,各功能单元的具体名称也只是为了便于相互区分,并不用于限制本公开实施例的保护范围。
实施例七
下面参考图9,其示出了适于用来实现本公开实施例的电子设备(例如图9中的终端设备或服务器)400的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图9示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图9所示,电子设备400可以包括处理装置(例如中央处理器、图形处理器等)401,其可以根据存储在只读存储器(ROM)402中的程序或者从存储装置406加载到随机访问存 储器(RAM)403中的程序而执行各种适当的动作和处理。在RAM 403中,还存储有电子设备400操作所需的各种程序和数据。处理装置401、ROM 402以及RAM 403通过总线404彼此相连。输入/输出(I/O)接口405也连接至总线404。
通常,以下装置可以连接至I/O接口405:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置406;包括例如液晶显示器(LCD)、扬声器、振动器等的输出装置407;包括例如磁带、硬盘等的存储装置406;以及通信装置409。通信装置409可以允许电子设备400与其他设备进行无线或有线通信以交换数据。虽然图9示出了具有各种装置的电子设备400,但是应理解的是,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。
特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置409从网络上被下载和安装,或者从存储装置406被安装,或者从ROM 402被安装。在该计算机程序被处理装置401执行时,执行本公开实施例的方法中限定的上述功能。
本公开实施例提供的终端与上述实施例提供的一种语言表示模型的预训练方法、自然语言处理方法属于同一发明构思,未在本公开实施例中详尽描述的技术细节可参见上述实施例,并且本公开实施例与上述实施例具有相同的有益效果。
实施例八
本公开实施例提供了一种计算机存储介质,其上存储有计算机程序,该程序被处理器执行时实现上述实施例所提供的一种语言表示模型的预训练方法以及自然语言处理方法。
需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(Random Access Memory,RAM)、只读存储器(Read-Only Memory,ROM)、可擦式可编程只读存储器(Erasable Programmable ROM,EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(Compact Disk ROM,CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(Radio Frequency,射频)等等,或者上述的任意合适的组合。
在一些实施方式中,客户端、服务器可以利用诸如HTTP(HyperText Transfer Protocol,超文本传输协议)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(Local Area Network,LAN),广域网(Wide Area Network,WAN),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。
上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备:
确定用于预训练的语料样本;
对所述语料样本分别以词为单位以及以字为单位进行分词,获得分词结果和分字结果;
利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,获得预训练好的词粒度语言表示子模型;
利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,获得预训练好的字粒度语言表示子模型。
以及,基于自然语言处理任务确定微调样本语料;
对所述微调样本语料分别以词为单位以及以字为单位进行分词,获得微调分词结果和微调分字结果;
利用所述微调分词结果以及所述微调分字结果对预训练好的语言表示模型进行微调;
通过微调后的语言表示模型对待处理自然语言进行处理。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法、计算机程序和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,单元的名称在某种情况下并不构成对该单元本身的限定,例如,可编辑内 容显示单元还可以被描述为“编辑单元”。本公开的实施例还包括一种计算机程序,当其在电子设备上运行或被处理器执行时,执行本公开实施例的方法中限定的上述功能。
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(Field Programmable Gate Array,FPGA)、专用集成电路(Application Specific Integrated Circuit,ASIC)、专用标准产品(Application Specific Standard Product,ASSP)、片上系统(System on Chip,SOC)、复杂可编程逻辑设备(Complex Programming Logic Device,CPLD)等等。
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。
根据本公开的一个或多个实施例,【示例一】提供了一种语言表示模型系统,该系统包括:
以字为分词单位的字粒度语言表示子模型和以词为分词单位的词粒度语言表示子模型;
其中,所述字粒度语言表示子模型用于基于以字为分词单位的语句,输出每个分字在所述语句中所表达的语义对应的第一语义向量;
所述词粒度语言表示子模型用于基于以词为分词单位的所述语句,输出每个分词在所述语句中所表达的语义对应的第二语义向量。
根据本公开的一个或多个实施例,【示例二】提供了一种语言表示模型系统,可选的,所述字粒度语言表示子模型包括:以字为分词单位的字粒度输入层、字粒度编码层和字粒度输出层;
其中,所述字粒度输入层与所述字粒度编码层相连,用于接收以字为分词单位的语句,并结合每个分字在所述语句中的位置和所述语句所处的段落,将每个分字转换为对应的字向量,将所述字向量发送至所述字粒度编码层;
所述字粒度编码层与所述字粒度输出层相连,用于基于接收到字向量确定每个分字在所述语句中所表达的语义对应的第一语义向量,并将所述第一语义向量输出至所述字粒度输出层;
所述字粒度输出层用于输出接收到的所述第一语义向量。
根据本公开的一个或多个实施例,【示例三】提供了一种语言表示模型系统,可选的,所述字粒度编码层的数量为至少两层,所述字粒度编码层的结构为Transformer结构。
根据本公开的一个或多个实施例,【示例四】提供了一种语言表示模型系统,可选的,
所述词粒度语言表示子模型包括:以词为分词单位的词粒度输入层、词粒度编码层和词粒度输出层;
其中,所述词粒度输入层与所述词粒度编码层相连,用于接收以词为分词单位的语句,并结合每个分词在所述语句中的位置和所述语句所处的段落,将每个分词转换为对应的词向量,将所述词向量发送至所述词粒度编码层;
所述词粒度编码层与所述词粒度输出层相连,用于基于接收到词向量确定每个分词在所述语句中所表达的语义对应的第二语义向量,并将所述第二语义向量输出至所述词粒度输出层;
所述词粒度输出层用于输出接收到的所述第二语义向量。
根据本公开的一个或多个实施例,【示例五】提供了一种语言表示模型系统,可选的,所述词粒度编码层的数量为至少两层,所述词粒度编码层的结构为Transformer结构。
根据本公开的一个或多个实施例,【示例六】提供了一种语言表示模型的预训练方法,该方法包括:
确定用于预训练的语料样本;
对所述语料样本分别以词为单位以及以字为单位进行分词,获得分词结果和分字结果;
利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,获得预训练好的词粒度语言表示子模型;
利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,获得预训练好的字粒度语言表示子模型。
根据本公开的一个或多个实施例,【示例七】提供了一种语言表示模型的预训练方法,可选的,
所述利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练包括:
对以词为单位的分词结果中设定比例的分词进行掩码;
将被掩码的分词以及未被掩码的分词输入至所述词粒度语言表示子模型的词粒度输入层,以通过所述词粒度输入层结合每个所述分词在语料中的位置和所述语料所处的段落,将每个所述分词转换为对应的词向量,将所述词向量发送至所述词粒度语言表示子模型的词粒度编码层;
通过所述词粒度编码层确定每个分词在所述语料中所表达的语义对应的第二语义向量,并将所述第二语义向量输出至所述词粒度语言表示子模型的词粒度输出层;
通过所述词粒度输出层基于每个所述分词的所述第二语义向量输出被掩码的分词;
当所述词粒度输出层所输出分词的准确度达到第二阈值时,预训练结束。
根据本公开的一个或多个实施例,【示例八】提供了一种语言表示模型的预训练方法,可选的,
所述利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,包括:
对以字为单位的分字结果中设定比例的分字进行掩码;
将被掩码的分字以及未被掩码的分字输入至所述字粒度语言表示子模型的字粒度输入层,以通过所述字粒度输入层结合每个所述分字在语料中的位置和所述语料所处的段落,将每个所述分字转换为对应的字向量,将所述字向量发送至所述字粒度语言表示子模型的字粒度编码层;
通过所述字粒度编码层确定每个分字在所述语料中所表达的语义对应的第一语义向量,并将所述第一语义向量输出至所述字粒度语言表示子模型的字粒度输出层;
通过所述字粒度输出层基于每个所述分字的所述第一语义向量输出被掩码的分字;
当所述字粒度输出层所输出分字的准确度达到第一阈值时,预训练结束。
根据本公开的一个或多个实施例,【示例九】提供了一种语言表示模型的预训练方法,可选的,
所述字粒度编码层的数量为至少两层,所述字粒度编码层的结构为Transformer结构;
所述词粒度编码层的数量为至少两层,所述词粒度编码层的结构为Transformer结构。
根据本公开的一个或多个实施例,【示例十】提供了一种语言表示模型的预训练方法,可选的,
所述语言表示模型的参数包括:与所述字粒度语言表示子模型对应的字向量参数、与所述词粒度语言表示子模型对应的词向量参数、所述字粒度语言表示子模型与所述词粒度语言表示子模型共享的Transformer的关联参数、位置向量和段落向量。
根据本公开的一个或多个实施例,【示例十一】提供了一种自然语言处理方法,包括:
基于自然语言处理任务确定微调样本语料;
对所述微调样本语料分别以词为单位以及以字为单位进行分词,获得微调分词结果和微调分字结果;
利用所述微调分词结果以及所述微调分字结果对预训练好的语言表示模型进行微调;
通过微调后的语言表示模型对待处理自然语言进行处理。
根据本公开的一个或多个实施例,【示例十二】提供了一种自然语言处理方法,可选的,所述自然语言处理任务包括下述至少一种:语义相似度计算、文本分类、语言推理、关键词识别以及阅读理解。
根据本公开的一个或多个实施例,【示例十三】提供了一种语言表示模型的预训练装置,包括:
确定模块,用于确定用于预训练的语料样本;
分词模块,用于对所述语料样本分别以词为单位以及以字为单位进行分词,获得分词结果和分字结果;
第一预训练模块,用于利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,获得预训练好的词粒度语言表示子模型;
第二预训练模块,用于利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,获得预训练好的字粒度语言表示子模型。
根据本公开的一个或多个实施例,【示例十四】提供了一种自然语言处理装置,包括:确定模块,用于基于自然语言处理任务确定微调样本语料;
分词模块,用于对所述微调样本语料分别以词为单位以及以字为单位进行分词,获得微调分词结果和微调分字结果;
微调模块,用于利用所述微调分词结果以及所述微调分字结果对预训练好的语言表示模型进行微调;
处理模块,用于通过微调后的语言表示模型对待处理自然语言进行处理。
根据本公开的一个或多个实施例,【示例十五】提供了一种电子设备,所述电子设备包括:
一个或多个处理器;
存储装置,用于存储一个或多个程序,
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如下所述的一种语言表示模型的预训练方法以及自然语言处理方法:
确定用于预训练的语料样本;
对所述语料样本分别以词为单位以及以字为单位进行分词,获得分词结果和分字结果;
利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,获得预训练好的词粒度语言表示子模型;
利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,获得预训练好的字粒度语言表示子模型。
以及,基于自然语言处理任务确定微调样本语料;
对所述微调样本语料分别以词为单位以及以字为单位进行分词,获得微调分词结果和微调分字结果;
利用所述微调分词结果以及所述微调分字结果对预训练好的语言表示模型进行微调;
通过微调后的语言表示模型对待处理自然语言进行处理。
根据本公开的一个或多个实施例,【示例十六】提供了一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行下述一种语言表示模型的预训练方法以及自然语言处理方法:
确定用于预训练的语料样本;
对所述语料样本分别以词为单位以及以字为单位进行分词,获得分词结果和分字结果;
利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,获得预训练好的词粒度语言表示子模型;
利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,获得预训练好的字粒度语言表示子模型。
以及,基于自然语言处理任务确定微调样本语料;
对所述微调样本语料分别以词为单位以及以字为单位进行分词,获得微调分词结果和微调分字结果;
利用所述微调分词结果以及所述微调分字结果对预训练好的语言表示模型进行微调;
通过微调后的语言表示模型对待处理自然语言进行处理。
以上描述仅为本公开的较佳实施例以及对所运用技术原理的说明。本领域技术人员应当理解,本公开中所涉及的公开范围,并不限于上述技术特征的特定组合而成的技术方案,同时也应涵盖在不脱离上述公开构思的情况下,由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本公开中公开的(但不限于)具有类似功能的技术特征进行互相替换而形成的技术方案。
此外,虽然采用特定次序描绘了各操作,但是这不应当理解为要求这些操作以所示出的特定次序或以顺序次序执行来执行。在一定环境下,多任务和并行处理可能是有利的。同样地,虽然在上面论述中包含了若干具体实现细节,但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实施例中。相 反地,在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适的子组合的方式实现在多个实施例中。
尽管已经采用特定于结构特征和/或方法逻辑动作的语言描述了本主题,但是应当理解所附权利要求书中所限定的主题未必局限于上面描述的特定特征或动作。相反,上面所描述的特定特征和动作仅仅是实现权利要求书的示例形式。

Claims (22)

  1. 一种语言表示模型系统,其特征在于,包括:以字为分词单位的字粒度语言表示子模型和以词为分词单位的词粒度语言表示子模型;
    其中,所述字粒度语言表示子模型用于基于以字为分词单位的语句,输出每个分字在所述语句中所表达的语义对应的第一语义向量;
    所述词粒度语言表示子模型用于基于以词为分词单位的所述语句,输出每个分词在所述语句中所表达的语义对应的第二语义向量。
  2. 根据权利要求1所述的系统,其特征在于,所述字粒度语言表示子模型包括:以字为分词单位的字粒度输入层、字粒度编码层和字粒度输出层;
    其中,所述字粒度输入层与所述字粒度编码层相连,用于接收以字为分词单位的语句,并结合每个分字在所述语句中的位置和所述语句所处的段落,将每个分字转换为对应的字向量,将所述字向量发送至所述字粒度编码层;
    所述字粒度编码层与所述字粒度输出层相连,用于基于接收到的字向量确定每个分字在所述语句中所表达的语义对应的第一语义向量,并将所述第一语义向量输出至所述字粒度输出层;
    所述字粒度输出层用于输出接收到的所述第一语义向量。
  3. 根据权利要求2所述的系统,其特征在于,所述字粒度编码层的数量为至少两层,所述字粒度编码层的结构为Transformer结构。
  4. 根据权利要求1-3任一项所述的系统,其特征在于,所述词粒度语言表示子模型包括:以词为分词单位的词粒度输入层、词粒度编码层和词粒度输出层;
    其中,所述词粒度输入层与所述词粒度编码层相连,用于接收以词为分词单位的语句,并结合每个分词在所述语句中的位置和所述语句所处的段落,将每个分词转换为对应的词向量,将所述词向量发送至所述词粒度编码层;
    所述词粒度编码层与所述词粒度输出层相连,用于基于接收到的词向量确定每个分词在所述语句中所表达的语义对应的第二语义向量,并将所述第二语义向量输出至所述词粒度输出层;
    所述词粒度输出层用于输出接收到的所述第二语义向量。
  5. 根据权利要求4所述的系统,其特征在于,所述词粒度编码层的数量为至少两层,所述词粒度编码层的结构为Transformer结构。
  6. 一种语言表示模型的预训练方法,其特征在于,包括:
    确定用于预训练的语料样本;
    对所述语料样本分别以词为单位以及以字为单位进行分词,获得分词结果和分字结果;
    利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,获得预训练好的词粒度语言表示子模型;
    利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,获得预训练好的字粒度语言表示子模型。
  7. 根据权利要求6所述的方法,其特征在于,所述利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,包括:
    对以字为单位的分字结果中设定比例的分字进行掩码;
    将被掩码的分字以及未被掩码的分字输入至所述字粒度语言表示子模型的字粒度输入层,以通过所述字粒度输入层结合每个所述分字在语料中的位置和所述语料所处的段落,将每个所述分字转换为对应的字向量,将所述字向量发送至所述字粒度语言表示子模型的字粒度编码层;
    通过所述字粒度编码层确定每个分字在所述语料中所表达的语义对应的第一语义向量,并将所述第一语义向量输出至所述字粒度语言表示子模型的字粒度输出层;
    通过所述字粒度输出层基于每个所述分字的所述第一语义向量输出被掩码的分字;
    当所述字粒度输出层所输出分字的准确度达到第一阈值时,预训练结束。
  8. 根据权利要求7所述的方法,其特征在于,所述利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,包括:
    对以词为单位的分词结果中设定比例的分词进行掩码;
    将被遮挡的分词以及未被掩码的分词输入至所述词粒度语言表示子模型的词粒度输入层,以通过所述词粒度输入层结合每个所述分词在语料中的位置和所述语料所处的段落,将每个所述分词转换为对应的词向量,将所述词向量发送至所述词粒度语言表示子模型的词粒度编码层;
    通过所述词粒度编码层确定每个分词在所述语料中所表达的语义对应的第二语义向量,并将所述第二语义向量输出至所述词粒度语言表示子模型的词粒度输出层;
    通过所述词粒度输出层基于每个所述分词的所述第二语义向量输出被掩码的分词;
    当所述词粒度输出层所输出分词的准确度达到第二阈值时,预训练结束。
  9. 根据权利要求8所述的方法,其特征在于,所述字粒度编码层的数量为至少两层,所述字粒度编码层的结构为Transformer结构;
    所述词粒度编码层的数量为至少两层,所述词粒度编码层的结构为Transformer结构。
  10. 根据权利要求9所述的方法,其特征在于,所述语言表示模型的参数包括:与所述字粒度语言表示子模型对应的字向量参数、与所述词粒度语言表示子模型对应的词向量参数、所述字粒度语言表示子模型与所述词粒度语言表示子模型共享的Transformer的关联参数、位置向量和段落向量。
  11. 一种自然语言处理方法,其特征在于,包括:
    基于自然语言处理任务确定微调样本语料;
    对所述微调样本语料分别以词为单位以及以字为单位进行分词,获得微调分词结果和微调分字结果;
    利用所述微调分词结果以及所述微调分字结果对预训练好的语言表示模型进行微调;
    通过微调后的语言表示模型对待处理自然语言进行处理。
  12. 根据权利要求11所述的方法,其特征在于,所述自然语言处理任务包括下述至少一种:语义相似度计算、文本分类、语言推理、中文指代消解、关键词识别以及阅读理解。
  13. 一种语言表示模型的预训练装置,其特征在于,包括:
    确定模块,用于确定用于预训练的语料样本;
    分词模块,用于对所述语料样本分别以词为单位以及以字为单位进行分词,获得分词结果和分字结果;
    第一预训练模块,用于利用所述分词结果,对语言表示模型中以词为单位的词粒度语言表示子模型进行预训练,获得预训练好的词粒度语言表示子模型;
    第二预训练模块,用于利用所述分字结果,对语言表示模型中以字为单位的字粒度语言表示子模型进行预训练,获得预训练好的字粒度语言表示子模型。
  14. 一种自然语言处理装置,其特征在于,包括:
    确定模块,用于基于自然语言处理任务确定微调样本语料;
    分词模块,用于对所述微调样本语料分别以词为单位以及以字为单位进行分词,获得微调分词结果和微调分字结果;
    微调模块,用于利用所述微调分词结果以及所述微调分字结果对预训练好的语言表示模型进行微调;
    处理模块,用于通过微调后的语言表示模型对待处理自然语言进行处理。
  15. 一种电子设备,其特征在于,所述电子设备包括:
    一个或多个处理器;
    存储装置,用于存储一个或多个程序,
    当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如权利要求6-10任一项所述的语言表示模型的预训练方法。
  16. 一种电子设备,其特征在于,所述电子设备包括:
    一个或多个处理器;
    存储装置,用于存储一个或多个程序,
    当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如权利要求11或12所述的自然语言处理方法。
  17. 一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行如权利要求6-10任一项所述的语言表示模型的预训练方法。
  18. 一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行如权利要求11或12所述的自然语言处理方法。
  19. 一种计算机程序产品,包括承载在非暂态计算机可读介质上的计算机程序,所述计算机程序包含程序代码,用于执行如权利要求6-10任一项所述的语言表示模型的预训练方法。
  20. 一种计算机程序产品,包括承载在非暂态计算机可读介质上的计算机程序,所述计算机程序包含程序代码,用于执行如权利要求11或12所述的自然语言处理方法。
  21. 一种计算机程序,当处理装置执行所述计算机程序时,执行如权利要求6-10任一项所述的语言表示模型的预训练方法。
  22. 一种计算机程序,当处理装置执行所述计算机程序时,执行如权利要求11或12所述的自然语言处理方法。
PCT/CN2021/108194 2020-07-29 2021-07-23 语言表示模型系统、预训练方法、装置、设备及介质 WO2022022421A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US17/923,316 US20230244879A1 (en) 2020-07-29 2021-07-23 Language representation model system, pre-training method and apparatus, device, and medium
JP2023504177A JP2023535709A (ja) 2020-07-29 2021-07-23 言語表現モデルシステム、事前訓練方法、装置、機器及び媒体
EP21848640.5A EP4134865A4 (en) 2020-07-29 2021-07-23 LANGUAGE REPRESENTATION MODEL SYSTEM, PRE-LEARNING METHOD AND APPARATUS, DEVICE AND MEDIUM

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010746066.0 2020-07-29
CN202010746066.0A CN111914551B (zh) 2020-07-29 2020-07-29 自然语言处理方法、装置、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022022421A1 true WO2022022421A1 (zh) 2022-02-03

Family

ID=73286790

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/108194 WO2022022421A1 (zh) 2020-07-29 2021-07-23 语言表示模型系统、预训练方法、装置、设备及介质

Country Status (5)

Country Link
US (1) US20230244879A1 (zh)
EP (1) EP4134865A4 (zh)
JP (1) JP2023535709A (zh)
CN (1) CN111914551B (zh)
WO (1) WO2022022421A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114626363A (zh) * 2022-05-16 2022-06-14 天津大学 一种基于翻译的跨语言短语结构分析方法及装置
CN114661904A (zh) * 2022-03-10 2022-06-24 北京百度网讯科技有限公司 文档处理模型的训练方法、装置、设备、存储介质及程序
CN114970666A (zh) * 2022-03-29 2022-08-30 北京百度网讯科技有限公司 一种口语处理方法、装置、电子设备及存储介质
CN115630646A (zh) * 2022-12-20 2023-01-20 粤港澳大湾区数字经济研究院(福田) 一种抗体序列预训练模型的训练方法及相关设备

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914551B (zh) * 2020-07-29 2022-05-20 北京字节跳动网络技术有限公司 自然语言处理方法、装置、电子设备及存储介质
CN112560497B (zh) * 2020-12-10 2024-02-13 中国科学技术大学 语义理解方法、装置、电子设备和存储介质
CN113011126B (zh) * 2021-03-11 2023-06-30 腾讯科技(深圳)有限公司 文本处理方法、装置、电子设备及计算机可读存储介质
CN113326693B (zh) * 2021-05-28 2024-04-16 智者四海(北京)技术有限公司 一种基于词粒度的自然语言模型的训练方法与系统
CN113239705B (zh) * 2021-07-12 2021-10-29 北京百度网讯科技有限公司 语义表示模型的预训练方法、装置、电子设备和存储介质
CN114817295B (zh) * 2022-04-20 2024-04-05 平安科技(深圳)有限公司 多表Text2sql模型训练方法、系统、装置和介质
CN115017915B (zh) * 2022-05-30 2023-05-30 北京三快在线科技有限公司 一种模型训练、任务执行的方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160293161A1 (en) * 2009-12-15 2016-10-06 At&T Intellectual Property I, L.P. System and Method for Combining Geographic Metadata in Automatic Speech Recognition Language and Acoustic Models
CN109858041A (zh) * 2019-03-07 2019-06-07 北京百分点信息科技有限公司 一种半监督学习结合自定义词典的命名实体识别方法
CN110032644A (zh) * 2019-04-03 2019-07-19 人立方智能科技有限公司 语言模型预训练方法
CN110489555A (zh) * 2019-08-21 2019-11-22 创新工场(广州)人工智能研究有限公司 一种结合类词信息的语言模型预训练方法
CN111914551A (zh) * 2020-07-29 2020-11-10 北京字节跳动网络技术有限公司 语言表示模型系统、预训练方法、装置、设备及介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180055189A (ko) * 2016-11-16 2018-05-25 삼성전자주식회사 자연어 처리 방법 및 장치와 자연어 처리 모델을 학습하는 방법 및 장치
CN111354333B (zh) * 2018-12-21 2023-11-10 中国科学院声学研究所 一种基于自注意力的汉语韵律层级预测方法及系统
CN111435408B (zh) * 2018-12-26 2023-04-18 阿里巴巴集团控股有限公司 对话纠错方法、装置和电子设备
CN110674639B (zh) * 2019-09-24 2022-12-09 识因智能科技有限公司 一种基于预训练模型的自然语言理解方法
CN110717339B (zh) * 2019-12-12 2020-06-30 北京百度网讯科技有限公司 语义表示模型的处理方法、装置、电子设备及存储介质
CN111078842A (zh) * 2019-12-31 2020-04-28 北京每日优鲜电子商务有限公司 查询结果的确定方法、装置、服务器及存储介质
CN111310438B (zh) * 2020-02-20 2021-06-08 齐鲁工业大学 基于多粒度融合模型的中文句子语义智能匹配方法及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160293161A1 (en) * 2009-12-15 2016-10-06 At&T Intellectual Property I, L.P. System and Method for Combining Geographic Metadata in Automatic Speech Recognition Language and Acoustic Models
CN109858041A (zh) * 2019-03-07 2019-06-07 北京百分点信息科技有限公司 一种半监督学习结合自定义词典的命名实体识别方法
CN110032644A (zh) * 2019-04-03 2019-07-19 人立方智能科技有限公司 语言模型预训练方法
CN110489555A (zh) * 2019-08-21 2019-11-22 创新工场(广州)人工智能研究有限公司 一种结合类词信息的语言模型预训练方法
CN111914551A (zh) * 2020-07-29 2020-11-10 北京字节跳动网络技术有限公司 语言表示模型系统、预训练方法、装置、设备及介质

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114661904A (zh) * 2022-03-10 2022-06-24 北京百度网讯科技有限公司 文档处理模型的训练方法、装置、设备、存储介质及程序
CN114970666A (zh) * 2022-03-29 2022-08-30 北京百度网讯科技有限公司 一种口语处理方法、装置、电子设备及存储介质
CN114970666B (zh) * 2022-03-29 2023-08-29 北京百度网讯科技有限公司 一种口语处理方法、装置、电子设备及存储介质
CN114626363A (zh) * 2022-05-16 2022-06-14 天津大学 一种基于翻译的跨语言短语结构分析方法及装置
CN115630646A (zh) * 2022-12-20 2023-01-20 粤港澳大湾区数字经济研究院(福田) 一种抗体序列预训练模型的训练方法及相关设备
CN115630646B (zh) * 2022-12-20 2023-05-16 粤港澳大湾区数字经济研究院(福田) 一种抗体序列预训练模型的训练方法及相关设备

Also Published As

Publication number Publication date
CN111914551B (zh) 2022-05-20
JP2023535709A (ja) 2023-08-21
CN111914551A (zh) 2020-11-10
EP4134865A4 (en) 2023-09-27
US20230244879A1 (en) 2023-08-03
EP4134865A1 (en) 2023-02-15

Similar Documents

Publication Publication Date Title
WO2022022421A1 (zh) 语言表示模型系统、预训练方法、装置、设备及介质
US11249774B2 (en) Realtime bandwidth-based communication for assistant systems
CN112164391B (zh) 语句处理方法、装置、电子设备及存储介质
WO2020177282A1 (zh) 一种机器对话方法、装置、计算机设备及存储介质
CN109657054B (zh) 摘要生成方法、装置、服务器及存储介质
US10592607B2 (en) Iterative alternating neural attention for machine reading
CN111602147A (zh) 基于非局部神经网络的机器学习模型
CN110728298A (zh) 多任务分类模型训练方法、多任务分类方法及装置
CN111931513A (zh) 一种文本的意图识别方法及装置
CN112988979B (zh) 实体识别方法、装置、计算机可读介质及电子设备
US11636272B2 (en) Hybrid natural language understanding
CN113239169B (zh) 基于人工智能的回答生成方法、装置、设备及存储介质
US11263400B2 (en) Identifying entity attribute relations
CN114330354B (zh) 一种基于词汇增强的事件抽取方法、装置及存储介质
CN116127020A (zh) 生成式大语言模型训练方法以及基于模型的搜索方法
US20230073602A1 (en) System of and method for automatically detecting sarcasm of a batch of text
CN110678882A (zh) 使用机器学习从电子文档选择回答跨距
CN112085120B (zh) 多媒体数据的处理方法、装置、电子设备及存储介质
CN114491077A (zh) 文本生成方法、装置、设备及介质
CN113761190A (zh) 文本识别方法、装置、计算机可读介质及电子设备
CN114298055B (zh) 基于多级语义匹配的检索方法、装置、计算机设备和存储介质
CN113961679A (zh) 智能问答的处理方法、系统、电子设备及存储介质
CN112307738A (zh) 用于处理文本的方法和装置
CN110717316A (zh) 字幕对话流的主题分割方法及装置
CN116821781A (zh) 分类模型的训练方法、文本分析方法及相关设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21848640

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021848640

Country of ref document: EP

Effective date: 20221107

ENP Entry into the national phase

Ref document number: 2023504177

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE