CN114528919A - Natural language processing method and device and computer equipment

Info

Publication number
CN114528919A
CN114528919A (application CN202210044925.0A)
Authority
CN
China
Prior art keywords
training
task
natural language
language processing
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210044925.0A
Other languages
Chinese (zh)
Inventor
侯盼盼
黄明星
王福钋
张航飞
徐华韫
曹富康
沈鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Absolute Health Ltd
Original Assignee
Beijing Absolute Health Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Absolute Health Ltd filed Critical Beijing Absolute Health Ltd
Priority to CN202210044925.0A priority Critical patent/CN114528919A/en
Publication of CN114528919A publication Critical patent/CN114528919A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a natural language processing method and apparatus and a computer device, relating to the technical field of artificial intelligence. It addresses the problems that, for different natural language processing tasks, corresponding models currently have to be custom-built and heavily modified ("magic-changed"), so that task processing is inefficient and costly, the representation capability of general-domain models is limited, and the fitting capability on vertical downstream tasks is poor. The method comprises the following steps: performing incremental pre-training on a general-domain BERT model according to preset training tasks to obtain a natural language processing model, wherein the preset training tasks comprise a word-level first training task and a task-level second training task; acquiring text data to be subjected to natural language processing and preprocessing it, wherein the preprocessing comprises at least one of data cleaning and stop-word filtering; and inputting the preprocessed text data into the natural language processing model to obtain a natural language processing result.

Description

Natural language processing method and device and computer equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a natural language processing method and apparatus, and a computer device.
Background
In medical insurance scenarios there is a wide variety of natural language processing tasks, including text classification, sentiment analysis, text clustering, entity recognition, text similarity calculation, information extraction and the like. Each kind of task has many subdivided scenes; for example, entity recognition covers extraction of disease names and person names, extraction of drugs and time, extraction of key entities from medical diagnosis certificates, and so on. In sentiment analysis, besides common classification of user sentiment, multi-dimensional sentiment analysis of the same event is needed, such as agent utterances and agent-user interaction turns, so that business personnel can conveniently perform in-depth competitive analysis, performance prediction and the like. The NLP tasks in the insurance-medical field are therefore numerous, complicated and heterogeneous.
At present, in common solutions for natural language processing tasks, an algorithm engineer generally customizes a corresponding model for each type of problem and applies various ad-hoc "magic changes" to it. A great deal of time and effort is therefore spent on model selection and testing. Meanwhile, problems in the insurance-medical vertical field often face a low-resource dilemma with two main aspects: few samples, i.e. data collection is costly, because many problems are tied to specific business scenarios and the total amount of data that can be collected is limited; and few labels, i.e. data annotation is costly, because for problems in the insurance field data annotation often requires deep participation of professional medical teams, which greatly increases annotation cost.
Disclosure of Invention
In view of this, the application discloses a natural language processing method and apparatus and a computer device, which can be used to solve the technical problems that, in current natural language task processing, corresponding models need to be customized for different processing tasks and various "magic changes" applied to them, so that task processing is inefficient and costly, the representation capability of general-domain models is limited, and the fitting capability on vertical downstream tasks is poor.
According to an aspect of the present application, there is provided a natural language processing method including:
performing incremental pre-training on a BERT model in the general field according to a preset training task to obtain a natural language processing model, wherein the preset training task comprises a first training task at a word level and a second training task at a task level;
acquiring text data to be subjected to natural language processing, and preprocessing the text data, wherein the preprocessing comprises at least one of data cleaning processing and stop word filtering processing;
and inputting the preprocessed text data into the natural language processing model to obtain a natural language processing result.
Optionally, the performing incremental pre-training on the BERT model in the general field according to a preset training task to obtain a natural language processing model includes:
acquiring a first sample corpus corresponding to the first training task and a second sample corpus corresponding to the second training task;
performing word-level first pre-training on the BERT model according to a first training task and the first sample corpus;
performing second pre-training of task level on the BERT model according to a second training task and the second sample corpus;
and after the BERT model is judged to finish the first pre-training and the second pre-training, determining the BERT model as a natural language processing model.
Optionally, the first training task includes a whole-word Mask task and a sentence order prediction task, and performing the word-level first pre-training on the BERT model according to the first training task and the first sample corpus includes:
performing word segmentation processing on the first sample corpus to obtain a text sequence containing each character, and extracting, from the text sequence, characters or words composed of at least two characters that co-occur with a preset dictionary, so as to perform whole-word Mask pre-training on the BERT model; and,
performing sentence division on the first sample corpus according to preset character identifiers to obtain a sentence sequence containing each sentence, constructing a positive-example sentence pair of the sentence order prediction task from two consecutive sentences in the sentence sequence, constructing a negative-example sentence pair of the sentence order prediction task by swapping the order of the two consecutive sentences, and performing sentence order prediction pre-training on the BERT model by using the positive-example sentence pair and the negative-example sentence pair.
Optionally, before extracting, from the text sequence, characters or words composed of at least two characters that co-occur with a preset dictionary and performing whole-word Mask pre-training on the BERT model, the method further includes:
extracting industry keywords corresponding to the preset training task from standard industry files based on a TF-IDF algorithm;
acquiring associated words of each industry keyword according to the intra-language association relations, in a corpus, of the industry keyword and the language it belongs to, wherein the associated words include at least one of synonyms, near-synonyms and similar words;
and constructing the preset dictionary based on the industry keywords and the associated words.
Optionally, performing sentence order prediction pre-training on the BERT model by using the positive-example sentence pair and the negative-example sentence pair includes:
respectively inputting the positive-example sentence pair and the negative-example sentence pair into the BERT model, and obtaining a first sentence vector and a second sentence vector corresponding to the two sentences in the positive-example sentence pair, and a third sentence vector and a fourth sentence vector corresponding to the two sentences in the negative-example sentence pair;
calculating a first vector feature distance between the first sentence vector and the second sentence vector and a second vector feature distance between the third sentence vector and the fourth sentence vector, and updating model parameters of the BERT model according to the first vector feature distance and the second vector feature distance so that the first vector feature distance is smaller than a first preset threshold and the second vector feature distance is larger than a second preset threshold, wherein the second preset threshold is larger than the first preset threshold.
Optionally, the second training task includes a classification task and an entity recognition task of the dialog scene object;
and performing second pre-training of task level on the BERT model according to a second training task and the second sample corpus, wherein the second pre-training comprises the following steps:
configuring a task label for the second sample corpus, wherein the task label comprises an object label and an entity label;
taking the second sample corpus as an input feature of the BERT model, and taking the object label or the entity label as a training label to train the BERT model to obtain a task training result;
calculating a loss function of the BERT model according to the task label and the task training result;
if the loss function meets the requirement of model convergence, judging that the BERT model completes second pre-training of the classification task;
and if the loss function is judged not to meet the model convergence requirement, updating the model parameters of the BERT model, and performing iterative training on the updated BERT model until the loss function meets the model convergence requirement.
Optionally, before the inputting the preprocessed text data into the natural language processing model and obtaining a natural language processing result, the method further includes:
determining a target downstream task corresponding to the text data, and performing fine tuning processing on the natural language processing model by using adaptive data matched with the target downstream task;
inputting the preprocessed text data into the natural language processing model to obtain a natural language processing result, wherein the natural language processing result comprises:
and inputting the preprocessed text data into the natural language processing model after fine tuning processing, and acquiring a natural language processing result corresponding to the target downstream task.
According to another aspect of the present application, there is provided a natural language processing apparatus including:
the training module is used for carrying out incremental pre-training on the BERT model in the general field according to a preset training task to obtain a natural language processing model, wherein the preset training task comprises a first training task at a word level and a second training task at a task level;
the processing module is used for acquiring text data to be subjected to natural language processing and preprocessing the text data, wherein the preprocessing comprises at least one of data cleaning processing and stop word filtering processing;
and the input module is used for inputting the preprocessed text data into the natural language processing model to obtain a natural language processing result.
Optionally, the training module comprises: the device comprises an acquisition unit, a first training unit, a second training unit and a first determination unit;
the obtaining unit is configured to obtain a first sample corpus corresponding to the first training task and a second sample corpus corresponding to the second training task;
the first training unit is used for carrying out word-level first pre-training on the BERT model according to a first training task and the first sample corpus;
the second training unit is used for performing second pre-training of task levels on the BERT model according to a second training task and the second sample corpus;
the first determining unit is used for determining the BERT model as a natural language processing model after judging that the BERT model completes the first pre-training and the second pre-training.
Optionally, the first training task includes a whole-word Mask task and a sentence order prediction task;
the first training unit is used for performing word segmentation processing on the first sample corpus to obtain a text sequence containing each character, extracting, from the text sequence, characters or words composed of at least two characters that co-occur with a preset dictionary, and performing whole-word Mask pre-training on the BERT model; and for performing sentence division on the first sample corpus according to preset character identifiers to obtain a sentence sequence containing each sentence, constructing a positive-example sentence pair of the sentence order prediction task from two consecutive sentences in the sentence sequence, constructing a negative-example sentence pair of the sentence order prediction task by swapping the order of the two consecutive sentences, and performing sentence order prediction pre-training on the BERT model by using the positive-example sentence pair and the negative-example sentence pair.
Optionally, the training module further comprises: a building unit;
the building unit is used for extracting industry keywords corresponding to the preset training task from standard industry files based on a TF-IDF algorithm; obtaining associated words of each industry keyword according to the intra-language association relations, in a corpus, of the industry keyword and the language it belongs to, wherein the associated words include at least one of synonyms, near-synonyms and similar words; and constructing the preset dictionary based on the industry keywords and the associated words.
Optionally, the first training unit is configured to input the positive-example sentence pair and the negative-example sentence pair into the BERT model respectively, and obtain a first sentence vector and a second sentence vector corresponding to the two sentences in the positive-example sentence pair, and a third sentence vector and a fourth sentence vector corresponding to the two sentences in the negative-example sentence pair; calculate a first vector feature distance between the first sentence vector and the second sentence vector and a second vector feature distance between the third sentence vector and the fourth sentence vector; and update model parameters of the BERT model according to the first vector feature distance and the second vector feature distance so that the first vector feature distance is smaller than a first preset threshold and the second vector feature distance is larger than a second preset threshold, wherein the second preset threshold is larger than the first preset threshold.
Optionally, the second training task includes a classification task and an entity recognition task of the dialog scene object;
the second training unit is configured to configure a task label for the second sample corpus, where the task label includes an object label and an entity label; taking the second sample corpus as an input feature of the BERT model, and taking the object label or the entity label as a training label to train the BERT model to obtain a task training result; calculating a loss function of the BERT model according to the task label and the task training result; if the loss function meets the requirement of model convergence, judging that the BERT model completes second pre-training of the classification task; and if the loss function is judged not to meet the requirement of model convergence, updating the model parameters of the BERT model, and performing iterative training on the updated BERT model until the loss function meets the requirement of model convergence.
Optionally, the apparatus further comprises: a fine tuning module;
the fine tuning module is used for determining a target downstream task corresponding to the text data and performing fine tuning processing on the natural language processing model by using adaptive data matched with the target downstream task;
and the input module is used for inputting the preprocessed text data into the natural language processing model after fine tuning processing, and acquiring a natural language processing result corresponding to the target downstream task.
According to yet another aspect of the present application, a storage medium is provided on which a computer program is stored, wherein the program, when executed by a processor, implements the above natural language processing method.
According to yet another aspect of the present application, a computer device is provided, comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the above natural language processing method when executing the program.
The application provides a natural language processing method and apparatus and a computer device. Incremental pre-training can be performed on a general-domain BERT model according to preset training tasks to obtain a natural language processing model, wherein the preset training tasks include a word-level first training task and a task-level second training task; after the text data to be processed is acquired, a series of preprocessing operations such as data cleaning and data standardization can be performed on it; finally, the preprocessed text data is input into the natural language processing model to obtain a natural language processing result. With this technical solution, a BERT model that performs well in the general domain can be obtained via transfer learning, and incremental pre-training for the specific task domain is then carried out on top of it, further improving the applicability of the model to downstream tasks. In addition, no complex model "magic change" work is needed, and the incrementally pre-trained model performs well on downstream tasks.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
The invention will be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a natural language processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another method for natural language processing according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating natural language processing provided by an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a natural language processing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of another natural language processing apparatus provided in an embodiment of the present invention;
fig. 6 shows a physical structure diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
An embodiment of the present invention provides a natural language processing method, as shown in fig. 1, the method includes:
101. and performing incremental pre-training on the BERT model in the general field according to a preset training task to obtain a natural language processing model, wherein the preset training task comprises a first training task at a word level and a second training task at a task level.
Deep-learning-era natural language processing is generally considered to have two milestones. The first is the word vector technology represented by Word2Vec, which gradually developed from 2013; the second is the deep pre-trained language model represented by BERT in 2018. On one hand, deep pre-trained models represented by BERT have brought new advances to almost all sub-fields, including text classification, named entity recognition and question answering; on the other hand, as a general pre-trained model, the appearance of BERT also significantly reduces the heavy work of NLP algorithm engineers in specific applications: instead of building heavily modified ("magic-changed") networks, a baseline model with excellent performance can be quickly obtained by simply fine-tuning BERT. In various application scenarios there are various natural language processing tasks, including text classification, sentiment analysis, text clustering, entity recognition, text similarity calculation, information extraction and the like. Accordingly, the technical solution in the present application can be applied to natural language processing tasks in any vertical industry field. In this embodiment and the following steps, the technical solution is described by taking natural language task processing in an insurance-medical scenario as an example, but this does not limit the application scenarios to which the technical solution applies.
In the insurance-medical scenario, each kind of task has many subdivided scenes; for example, entity recognition in the insurance field covers extraction of disease names and person names, extraction of drugs and time, extraction of key entities from medical diagnosis certificates, and so on. In sentiment analysis, besides common classification of user sentiment, multi-dimensional sentiment analysis of the same event is needed, such as agent utterances and agent-user interaction turns, so that business personnel can conveniently perform in-depth competitive analysis, performance prediction and the like. The NLP tasks in the insurance-medical field are therefore numerous, complicated and heterogeneous. In a common solution, an algorithm engineer typically customizes a corresponding model for each type of problem and applies various "magic changes", so a great deal of time and effort is spent on model selection and testing. Meanwhile, problems in the insurance-medical vertical field often face a low-resource dilemma with two main aspects: few samples, i.e. data collection is costly, because many problems are tied to specific business scenarios and the total amount of data that can be collected is limited; and few labels, i.e. data annotation is costly, because for problems in the insurance field data annotation often requires deep participation of professional medical teams. To address these two general problems in the insurance field, in the present application, based on domain data accumulated over many years and starting from an open-source Chinese BERT model, incremental pre-training is performed on the general-domain BERT model according to preset training tasks, as shown in fig. 3. The preset training tasks specifically comprise a word-level pre-training task and a task-level pre-training task, and by executing the incremental pre-training the resulting model is made more suitable for downstream tasks in the field.
The execution subject of the application may be a system supporting natural language processing, which may be deployed at a client or a server. In the system, incremental pre-training is performed on the general-domain BERT model according to the preset training tasks and the resulting natural language processing model is deployed; after the text data to be processed is obtained and preprocessed, the preprocessed text data can be input into the natural language processing model to obtain a natural language processing result.
102. The method comprises the steps of obtaining text data to be subjected to natural language processing, and preprocessing the text data, wherein the preprocessing comprises at least one of data cleaning processing and stop word filtering processing.
For this embodiment, in a specific application scenario, the text data to be processed may contain a large amount of noise and irrelevant data alongside the valid data. To ensure the accuracy of the natural language processing result, the text data to be input into the natural language processing model therefore needs to be preprocessed, so that the preprocessed data contains no missing values, abnormal values or stop words and can be directly handled by the computer. In a specific application scenario, the preprocessing may specifically include data cleaning and stop-word filtering. Data cleaning includes filling missing values and deleting abnormal values. Missing-value filling means filling in missing items of the data; one filling method is mean filling, which specifically comprises: determining the feature of the missing item, calculating the average of the data under that feature, and taking the average as the filling value. Abnormal values are data that contradict the actual situation or data collected during a system fault, and such data is deleted. Stop-word filtering may include word segmentation and stop-word removal. Word segmentation can use existing segmentation techniques to split each piece of text data into independent words; however, the resulting words often contain redundant or low-value terms, so in order to improve processing efficiency and save running space, words that contribute little to natural language processing can be screened out and filtered away.
When performing stop-word filtering, the text data can be segmented with an existing word segmentation tool (such as a CRF-based segmenter) to obtain independent first words and a word sequence in the form [word 1, word 2, word 3, ..., word N], wherein each first word is tagged with its target part of speech; second words matching preset stop parts of speech are then determined based on the target parts of speech, so that the second words can be removed from the word sequence. The preset stop parts of speech can be auxiliary words, adverbs, prepositions, conjunctions and the like, which usually carry no definite meaning on their own and only play a role when placed in a complete sentence. Since such words rarely express document-relevant information by themselves and contribute little to natural language processing, these low-information words can be filtered out in advance to improve processing efficiency and save storage space. Specifically, recognition and filtering of stop words can be realized with an existing stop-word list, such as the Baidu stop-word list or the Harbin Institute of Technology (HIT) stop-word list.
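To make the preprocessing step concrete, the following is a minimal Python sketch of data cleaning and stop-word filtering as described above; the jieba tokenizer, the stop-word file name and the record format are illustrative assumptions rather than elements required by this application.

```python
import jieba


def load_stopwords(path="stopwords.txt"):
    """Load a stop-word list (e.g. a Baidu- or HIT-style list), one word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def fill_missing_with_mean(records, field):
    """Data cleaning: fill missing numeric values of `field` with the column mean."""
    values = [r[field] for r in records if r.get(field) is not None]
    mean = sum(values) / len(values) if values else 0.0
    for r in records:
        if r.get(field) is None:
            r[field] = mean
    return records


def filter_stopwords(text, stopwords):
    """Stop-word filtering: segment the text and drop stop words."""
    words = jieba.lcut(text)          # word sequence [word 1, word 2, ..., word N]
    return [w for w in words if w.strip() and w not in stopwords]
```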
103. And inputting the preprocessed text data into a natural language processing model to obtain a natural language processing result.
After the incremental pre-training, the obtained natural language processing model, InsBERT, has better applicability to the specific vertical field, so after the text data to be processed has been preprocessed, the InsBERT model can be applied to specific downstream tasks. The way it is used is the same as for the Chinese BERT model: the pre-trained model is first fine-tuned for the respective task. For a sentence-pair classification task, two sentences are input together, for example "[CLS] Do you want to buy insurance [SEP] Can someone with diabetes buy insurance", and the output representation of the first token [CLS] is finally taken and fed into a softmax layer to obtain the classification output. Single-sentence classification is simpler: a single sentence is input and the output of the first token [CLS] is again used for classification. For Q&A problems, "[CLS] Question [SEP] Answer" is input at the same time, and the span between the predicted Start and End positions in the output is taken as the answer. For single-sentence tagging tasks (e.g. NER), the final-layer transformer outputs of all tokens are taken and fed into a softmax layer for classification. In short, different types of tasks need different modifications to the model, but the modifications are very simple: at most one layer of neural network is added. For this embodiment, the target downstream task corresponding to the text data can be determined, the natural language processing model can then be fine-tuned with adaptive data matched to the target downstream task, and after fine-tuning the preprocessed text data is input into the natural language processing model to obtain a natural language processing result.
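As an illustration of applying the incrementally pre-trained model to a downstream sentence-pair classification task, the following hedged sketch uses the Hugging Face transformers API; the local checkpoint path "./insbert-checkpoint", the label count and the example sentences are hypothetical placeholders, not artifacts published with this application.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical local checkpoint produced by the incremental pre-training.
tokenizer = BertTokenizer.from_pretrained("./insbert-checkpoint")
model = BertForSequenceClassification.from_pretrained("./insbert-checkpoint", num_labels=2)
model.eval()

# Sentence-pair input: [CLS] sentence A [SEP] sentence B [SEP];
# the classification head sits on top of the [CLS] representation.
inputs = tokenizer("你想买保险吗", "得了糖尿病能买保险吗",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)     # softmax over the classification output
```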
Compared with current natural language processing approaches, the natural language processing method provided by this embodiment of the application can perform incremental pre-training on a general-domain BERT model according to preset training tasks to obtain a natural language processing model, wherein the preset training tasks include a word-level first training task and a task-level second training task; after the text data to be processed is acquired, a series of preprocessing operations such as data cleaning and data standardization can be performed on it; finally, the preprocessed text data is input into the incrementally pre-trained natural language processing model to obtain a natural language processing result. With this technical solution, a BERT model that performs well in the general domain can be obtained via transfer learning, and incremental pre-training for the specific task domain is then carried out on top of it, further improving the applicability of the model to downstream tasks. In addition, no complex model "magic change" work is needed, and the incrementally pre-trained model performs well on downstream tasks.
Further, in order to better explain the processing procedure of the natural language, as a refinement and extension of the above embodiment, an embodiment of the present invention provides another natural language processing method, as shown in fig. 2, the method includes:
201. and acquiring a first sample corpus corresponding to the first training task and a second sample corpus corresponding to the second training task.
The first sample corpus and the second sample corpus may be the same or different, and the sample corpus may be determined according to an actual application scenario.
202. Performing word-level first pre-training on the BERT model according to the first training task and the first sample corpus, wherein the first training task comprises a whole-word Mask task and a sentence order prediction task.
In a specific application scenario, the word-level pre-training first includes two types of subtasks, namely an Insurance Whole Word Mask (IWWM) task and a Sentence Order Prediction (SOP) task. During training, to save resources, a two-stage pre-training approach similar to Google's can be used, with a maximum sentence length of 128 in the first pre-training stage and 512 in the second. The two types of tasks are specifically formed as follows:
for a Whole Word Mask task, the White Word Mask (WWM) is generally translated into a Whole Word Mask or a Whole Word Mask, and the departure is that Google issues an upgraded version of BERT in 5 months in 2019, and the training sample generation strategy in the original pre-training stage is mainly changed. In short, an original word segmentation method based on WordPiece can segment a complete word into a plurality of sub-words, and the segmented sub-words can be masked randomly when training samples are generated. In the full-word Mask, if a WordPiece subword of a complete word is masked, other parts of the word belonging to the same genus are also masked, i.e. the full-word Mask. In Google native Chinese BERT, the input is segmented by taking characters as granularity, and the relation between co-occurring words or phrases in the field is not considered, so that the prior knowledge hidden in the field cannot be learned, and the learning effect of the model is reduced. For the implementation, the method of the whole word Mask can be applied to corpus pre-training in the field of insurance medical treatment, namely, all Chinese characters forming the same word are masked. Firstly, a dictionary in the insurance medical field can be constructed by combining automatic mining with manual checking from insurance, medical dictionary and insurance academic articles, and about 20 words exist. And then extracting the words or phrases which are co-occurring in the pre-linguistic data and the insurance medical dictionary to perform whole-word Mask pre-training, so that the model learns the prior knowledge in the field, such as the correlation between insurance concepts and medical concepts, and the learning effect of the model is further enhanced.
For the Sentence Order Prediction (SOP) task: in order to give the model a stronger ability to recognize adjacent sentences within the same topic, a sentence-relationship prediction task is introduced. The specific approach can follow the original ALBERT paper, whose results show that this simple task is very beneficial for question answering and natural language inference, and that replacing the NSP task with the SOP task during pre-training slightly improves the model. This application therefore adopts the SOP-style sentence order prediction task in its pre-training. The learning rate uses 2e-5 as officially recommended by Google, with 10,000 warmup steps.
Correspondingly, for this embodiment, as an optional manner, step 202 of the embodiment may specifically include: performing word segmentation processing on the first sample corpus to obtain a text sequence containing each character, and extracting, from the text sequence, characters or words composed of at least two characters that co-occur with a preset dictionary, so as to perform whole-word Mask pre-training on the BERT model; and performing sentence division on the first sample corpus according to preset character identifiers to obtain a sentence sequence containing each sentence, constructing a positive-example sentence pair of the sentence order prediction task from two consecutive sentences in the sentence sequence, constructing a negative-example sentence pair of the sentence order prediction task by swapping the order of the two consecutive sentences, and performing sentence order prediction pre-training on the BERT model by using the positive-example sentence pair and the negative-example sentence pair.
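The following is a minimal sketch of whole-word masking over a character-level text sequence, assuming the preset dictionary is available as a Python set of multi-character domain words; the masking probability and the maximum matched word length are illustrative choices, not values fixed by this application.

```python
import random

MASK = "[MASK]"


def whole_word_mask(chars, domain_dict, mask_prob=0.15, max_word_len=4):
    """chars: character-level token list; domain_dict: set of multi-character domain words."""
    masked = list(chars)
    i = 0
    while i < len(chars):
        span = 1
        # greedily match the longest dictionary word starting at position i
        for length in range(min(max_word_len, len(chars) - i), 1, -1):
            if "".join(chars[i:i + length]) in domain_dict:
                span = length
                break
        if random.random() < mask_prob:
            for j in range(i, i + span):      # mask every character of the matched word
                masked[j] = MASK
        i += span
    return masked
```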
In a specific application scenario, before characters or words composed of at least two characters that co-occur with a preset dictionary are extracted from the text sequence for whole-word Mask pre-training of the BERT model, the corresponding preset dictionary also needs to be created. Correspondingly, this step may specifically include: extracting industry keywords corresponding to the preset training task from standard industry files based on a TF-IDF algorithm; acquiring associated words of each industry keyword according to the intra-language association relations, in a corpus, of the industry keyword and the language it belongs to, wherein the associated words include at least one of synonyms, near-synonyms and similar words; and constructing the preset dictionary based on the industry keywords and the associated words.
The TF-IDF algorithm is a statistical method for evaluating the importance of a word to one document in a corpus or document collection. The importance of a word increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to the frequency with which it appears across the corpus. The main idea of TF-IDF is that if a word occurs with high frequency TF in one article but rarely occurs in other articles, the word or phrase is considered to have good discriminative capability and is suitable for classification. In this application, the TF-IDF algorithm can be used to calculate, for each word contained in a standard industry file, its frequency of occurrence in that file, i.e. the term frequency TF_ij, and its frequency of occurrence across all standard industry files, i.e. the inverse document frequency IDF_i. When a word occurs with high frequency TF_ij in one standard industry file but rarely appears in the other standard industry files, the word can be considered to have good category-distinguishing capability and is suitable to serve as an industry keyword for the preset training task. For this embodiment, in a specific application scenario, the TF-IDF algorithm may include a first calculation formula and a second calculation formula, and in order to extract the industry keywords corresponding to the preset training task, step 202 of the embodiment may specifically include: calculating the term frequency of each word contained in the standard industry files according to the first calculation formula; calculating the inverse document frequency of each word according to the second calculation formula; determining, based on the term frequency and the inverse document frequency, the relevance of each word to the industry field to which the preset training task belongs; and determining the words whose relevance is greater than a preset threshold as the industry keywords. The standard industry files may be insurance and medical dictionaries, insurance-related academic articles, and the like.
Specifically, the first calculation formula is:

$$TF_{ij} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $TF_{ij}$ is the term frequency of word $t_i$ in standard industry file $d_j$, $n_{i,j}$ is the number of times word $t_i$ appears in $d_j$, and $\sum_{k} n_{k,j}$ is the total number of words in $d_j$. The second calculation formula is:

$$IDF_i = \log\frac{|D|}{\left|\{\, j : t_i \in d_j \,\}\right| + 1}$$

where $IDF_i$ is the inverse document frequency of word $t_i$, $|D|$ is the total number of standard industry files, and $\left|\{\, j : t_i \in d_j \,\}\right| + 1$ is the number of standard industry files containing $t_i$, incremented by one. Correspondingly, determining the relevance of each word to the industry field of the preset training task based on the term frequency and the inverse document frequency may specifically include: calculating, for each word, the product of its term frequency and its inverse document frequency, and taking that product as the relevance of the word to the industry field of the preset training task.
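A small sketch of keyword extraction with the two formulas above follows; the relevance threshold and the assumption that the standard industry files are already segmented into token lists are illustrative.

```python
import math
from collections import Counter


def extract_industry_keywords(documents, threshold=0.01):
    """documents: standard industry files, each already segmented into a list of words."""
    doc_counts = [Counter(doc) for doc in documents]
    num_docs = len(documents)
    keywords = set()
    for counts in doc_counts:
        total = sum(counts.values())                         # sum of all words in d_j
        for word, n_ij in counts.items():
            tf = n_ij / total                                # TF_ij
            df = sum(1 for c in doc_counts if word in c)     # files containing t_i
            idf = math.log(num_docs / (df + 1))              # IDF_i
            if tf * idf > threshold:                         # relevance = TF_ij * IDF_i
                keywords.add(word)
    return keywords
```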
Correspondingly, when acquiring the associated words of each industry keyword according to the intra-language association relations, in the corpus, of the industry keyword and the language it belongs to, the association relations may include a synonymy relation, a near-synonymy relation, a pronunciation-similarity threshold and a structural-similarity threshold. Specifically, this step may include: obtaining synonyms of each industry keyword according to the synonymy relations of the keyword in the corpus; obtaining near-synonyms of each industry keyword according to its near-synonymy relations in the corpus; obtaining similarly pronounced words of each industry keyword according to the pronunciation-similarity threshold; and obtaining structurally similar words of each industry keyword according to the structural-similarity threshold.
In a specific application scenario, when performing sentence order prediction pre-training on the BERT model using the positive-example and negative-example sentence pairs, this embodiment specifically includes: respectively inputting the positive-example sentence pair and the negative-example sentence pair into the BERT model, and obtaining a first sentence vector and a second sentence vector corresponding to the two sentences in the positive-example pair, and a third sentence vector and a fourth sentence vector corresponding to the two sentences in the negative-example pair; calculating a first vector feature distance between the first and second sentence vectors and a second vector feature distance between the third and fourth sentence vectors, and updating the model parameters of the BERT model according to these distances so that the first vector feature distance becomes smaller than a first preset threshold and the second vector feature distance becomes larger than a second preset threshold, wherein the second preset threshold is larger than the first. The feature distance can be computed with any suitable distance function, such as the Euclidean distance, Manhattan distance, Jaccard distance or Mahalanobis distance, chosen according to the actual application scenario; no specific limitation is made here.
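A minimal sketch of this training signal follows: consecutive sentences form positive pairs, swapped sentences form negative pairs, and the parameters are pushed so that positive-pair distances fall below one margin while negative-pair distances exceed a larger one. The encoder interface, the margin values and the choice of Euclidean distance are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def build_sop_pairs(sentences):
    """Consecutive sentences give positive pairs; swapping their order gives negatives."""
    pos, neg = [], []
    for a, b in zip(sentences, sentences[1:]):
        pos.append((a, b))      # original order -> positive example
        neg.append((b, a))      # swapped order  -> negative example
    return pos, neg


def sop_distance_loss(encode, pos_pair, neg_pair, margin_pos=0.5, margin_neg=1.5):
    """encode: a function mapping one (already tokenized) sentence to a vector,
    e.g. the BERT [CLS] representation."""
    v1, v2 = encode(pos_pair[0]), encode(pos_pair[1])   # first / second sentence vectors
    v3, v4 = encode(neg_pair[0]), encode(neg_pair[1])   # third / fourth sentence vectors
    d_pos = F.pairwise_distance(v1, v2)   # should end up below the first threshold
    d_neg = F.pairwise_distance(v3, v4)   # should end up above the second threshold
    # hinge-style terms push d_pos down and d_neg up during parameter updates
    return torch.clamp(d_pos - margin_pos, min=0).mean() + \
           torch.clamp(margin_neg - d_neg, min=0).mean()
```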
203. And performing second pre-training of the task level on the BERT model according to the second training task and the second sample corpus.
In the medical insurance scenario there is a rich variety of natural language processing tasks, including text classification, sentiment analysis, text clustering, entity recognition, text similarity calculation, information extraction, classification of dialogue-scene objects, entity recognition and the like. In a specific application scenario, in order to let the model better learn semantic-level knowledge of the insurance-medical field and learn the feature distribution of domain words and sentences more comprehensively, several types of supervised learning tasks can be introduced at the same time, i.e. the task-level second pre-training. In this embodiment, the technical solution is described by taking as examples a user/agent classification task in online real dialogue scenes and an entity recognition task on medical diagnosis results, which does not limit the technical solution of this application. For the classification task of dialogue-scene objects: since the dialogue content between users and agents in online real scenes naturally carries good industry attributes, a large amount of corpus with industry labels can be generated automatically from it, and a document-level supervised industry-classification task is constructed from that corpus. For the medical-diagnosis entity recognition task: similar to the user/agent classification task, a named-entity-recognition task corpus for the insurance-medical field is constructed from existing information such as patient case reports and medical diagnosis certificates, comprising approximately 1,000,000 supervised samples in total. Overall, in order to let the natural language processing model InsBERT learn semantic knowledge of the insurance field more fully, the following improvements are made over the pre-training of the native BERT model: the training time is longer and the training process more thorough — to obtain a better learning effect, the second-stage pre-training of the model can be extended until its total number of tokens is consistent with that of the first stage; insurance-field knowledge is fused in — phrase-level and semantic-level tasks are introduced, domain proper nouns and phrases are extracted, and pre-training uses the whole-word Mask masking scheme together with the two types of supervised tasks; and the NSP task is replaced by the SOP task, giving the model stronger discrimination ability on fine-grained classification tasks.
As an optional way for the present embodiment, the steps of the embodiment may specifically include: configuring task tags for the second sample corpus, wherein the task tags comprise object tags and entity tags; taking the second sample corpus as an input feature of the BERT model, and taking the object label or the entity label as a training label to train the BERT model, so as to obtain a task training result; calculating a loss function of the BERT model according to the task label and the task training result; if the loss function meets the requirement of model convergence, judging that the BERT model is finished with second pre-training of a classification task; and if the loss function does not meet the model convergence requirement, updating the model parameters of the BERT model, and performing iterative training on the updated BERT model until the loss function meets the model convergence requirement.
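The following hedged sketch outlines the task-level second pre-training loop described above, using a classification head, a cross-entropy loss against the task labels and a simple convergence check; the dataset format, the base checkpoint name, the hyperparameters and the convergence threshold are assumptions, not values specified by this application.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification


def task_level_pretrain(dataset, num_labels, epochs=3, converge_at=0.05):
    """dataset is assumed to yield dicts with input_ids, attention_mask and labels."""
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-chinese", num_labels=num_labels)   # base checkpoint is an assumption
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    for _ in range(epochs):
        for batch in loader:
            outputs = model(**batch)        # labels in the batch give a cross-entropy loss
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            if loss.item() < converge_at:   # convergence check on the loss, as described
                return model
    return model
```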
204. And after judging that the BERT model completes the first pre-training and the second pre-training, determining the BERT model as a natural language processing model.
For this embodiment, after the BERT model has gone through the word-level first training task and the task-level second training task of embodiment steps 202 to 203, a natural language processing model InsBERT applicable to downstream tasks in the insurance field is obtained, and this natural language processing model InsBERT can be applied to specific downstream tasks.
205. The method comprises the steps of obtaining text data to be subjected to natural language processing, and preprocessing the text data, wherein the preprocessing comprises at least one of data cleaning processing and stop word filtering processing.
For this embodiment, when performing preprocessing on text data, the specific implementation process may refer to the related description in step 102 of the embodiment, and is not described herein again.
206. Determining a target downstream task corresponding to the text data, and performing fine-tuning on the natural language processing model by using adaptive data matched with the target downstream task.
In a specific application scenario, as an optional mode, the preprocessed text data can be directly input into a natural language processing model, and the natural language processing model is used for determining a natural language processing result of the text data corresponding to a target downstream task. Correspondingly, in order to better improve the processing effect, as another optional mode, before the natural language processing model is used to determine the natural language processing result of the text data under the target downstream task, the natural language processing model may be further fine-tuned based on the specific target downstream task, and further the pre-trained model may be used to improve the processing effect for the target downstream task, where the reason for improving the effect is: the parameters of the pre-training model are well learned, a part of previously learned text information is contained, and the model can be finely adjusted by using a small amount of adaptive data without learning from the beginning.
When fine-tuning the natural language processing model with adaptive data matched to the target downstream task, a preset number of sample texts can be extracted in advance to generate sample text sequences; 15% of the tokens are then randomly selected from the sample text sequences, and of these selected tokens 80% are replaced by the [MASK] symbol, 10% are replaced by a random Chinese character from the vocabulary, and 10% remain unchanged, so that the masked words are predicted from their context. Correspondingly, this step may specifically include: performing word segmentation on the adaptive data to determine the sample text sequences; randomly selecting 15% of the tokens as a first text sequence; masking 80% of the tokens in the first text sequence, replacing 10% with characters from the vocabulary and keeping 10% unchanged to obtain a second text sequence; adjusting the model parameters of the natural language processing model with the sample texts containing the first and second text sequences, and computing an objective function; and if the objective function is judged to exceed a preset threshold, determining that fine-tuning of the natural language processing model is complete. The objective function is a log-likelihood; the training goal of this application is to maximize its value, and when the objective function value is maximized, fine-tuning is judged successful.
Accordingly, the expression of the objective function is as follows:
L(θ; X) = ∑_{x ∈ X} log p(x_mask | x_\mask; θ)

X = {x_1, x_2, ..., x_n}

wherein L(θ; X) represents the objective function of the natural language processing model, X represents the set of all sample text sequences, x_n represents the n-th sample text sequence, θ represents the model parameters of the pre-trained BERT model, x_mask represents the 15% of tokens in a sample text sequence x that are masked, and x_\mask represents the remaining unmasked 85% of tokens in x.
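The following sketch shows one way the log-likelihood objective could be accumulated over the masked positions, assuming per-position output probabilities are already available from the prediction head; the array layout and function names are assumptions for illustration only.

    import numpy as np

    def masked_log_likelihood(pred_probs, token_ids, mask_positions):
        """Sum over masked positions of log p(x_mask | x_\\mask; theta).

        pred_probs     : (seq_len, vocab_size) array of model output probabilities
                         (a stand-in for the BERT prediction head)
        token_ids      : (seq_len,) array of the original token ids
        mask_positions : indices that were masked and must be predicted
        """
        log_probs = np.log(pred_probs[mask_positions, token_ids[mask_positions]] + 1e-12)
        return float(log_probs.sum())

    def corpus_objective(batch):
        """L(theta; X): sum the per-sequence log-likelihoods over all samples in X."""
        return sum(masked_log_likelihood(p, t, m) for p, t, m in batch)

    # Toy example with a 5-token sequence and a 4-word vocabulary.
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(4), size=5)       # fake model output probabilities
    tokens = np.array([1, 3, 0, 2, 1])              # original token ids
    masked_at = np.array([1, 4])                    # the 15% masked positions
    print(corpus_objective([(probs, tokens, masked_at)]))

Fine-tuning then maximizes this quantity, and training is stopped once it exceeds the preset threshold, as described above.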
207. And inputting the preprocessed text data into the fine-tuned natural language processing model, and acquiring a natural language processing result corresponding to the target downstream task.
By means of the natural language processing method, incremental pre-training can be performed on a general-field BERT model according to a preset training task to obtain a natural language processing model, wherein the preset training task comprises a first training task at the word level and a second training task at the task level; after the text data to be subjected to natural language processing is acquired, a series of preprocessing operations such as data cleaning and data standardization can be performed on the text data; finally, the preprocessed text data is input into the incrementally pre-trained natural language processing model to obtain the natural language processing result. With this technical scheme, a BERT model that already performs well in the general field is obtained through transfer learning, incremental pre-training for the specific task field is then carried out on top of it, and the applicability of the model to downstream tasks is thereby further improved. In addition, no complex structural modification ("magic change") of the model is needed, and the incrementally pre-trained model performs well on downstream tasks.
Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention provides a natural language processing apparatus, as shown in fig. 4, the apparatus includes: training module 31, processing module 32, input module 33.
The training module 31 is configured to perform incremental pre-training on the BERT model in the general field according to a preset training task to obtain a natural language processing model, where the preset training task includes a first training task at a word level and a second training task at a task level;
the processing module 32 is configured to obtain text data to be subjected to natural language processing, and perform preprocessing on the text data, where the preprocessing includes at least one of data cleaning processing and stop word filtering processing;
the input module 33 may be configured to input the preprocessed text data into the natural language processing model, so as to obtain a natural language processing result.
In a specific application scenario, as shown in fig. 5, the training module 31 includes: an obtaining unit 311, a first training unit 312, a second training unit 313, and a first determining unit 314;
an obtaining unit 311, configured to obtain a first sample corpus corresponding to a first training task and a second sample corpus corresponding to a second training task;
a first training unit 312, configured to perform a first pre-training at a word level on the BERT model according to a first training task and a first sample corpus;
the second training unit 313 is configured to perform second pre-training at a task level on the BERT model according to a second training task and a second sample corpus;
the first determining unit 314 may be configured to determine the BERT model as the natural language processing model after determining that the BERT model completes the first pre-training and the second pre-training.
In a specific application scenario, the first training task comprises a whole-word Mask task and a sentence order prediction task; the first training unit 312 is specifically configured to perform word segmentation processing on the first sample corpus to obtain a text sequence containing each character, and extract, from the text sequence, characters or words composed of at least two characters that co-occur with a preset dictionary, so as to perform whole-word Mask pre-training on the BERT model; and to perform sentence division on the first sample corpus according to preset character identifications to obtain a sentence sequence containing each sentence, construct a positive example sample sentence pair of the sentence order prediction task by using two consecutive sentences in the sentence sequence, construct a negative example sample sentence pair of the sentence order prediction task by exchanging the order of the two consecutive sentences, and perform sentence order prediction pre-training on the BERT model by using the positive example sample sentence pair and the negative example sample sentence pair.
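As an illustration of how the positive and negative example sample sentence pairs and the whole-word mask spans might be constructed, a minimal sketch follows; the sentence delimiters, the greedy dictionary matching and the toy insurance sentences are assumptions, not the claimed implementation.

    import re

    def build_sop_pairs(corpus_text, sentence_delimiters="。！？"):
        """Build positive/negative sample sentence pairs for sentence order
        prediction: a positive pair is two consecutive sentences in their
        original order, a negative pair is the same two sentences swapped."""
        # Split on the preset sentence-ending marks (assumption: Chinese punctuation).
        sentences = [s for s in re.split(f"[{sentence_delimiters}]", corpus_text) if s.strip()]
        positives, negatives = [], []
        for a, b in zip(sentences, sentences[1:]):
            positives.append((a, b, 1))   # label 1: correct order
            negatives.append((b, a, 0))   # label 0: swapped order
        return positives + negatives

    def whole_word_spans(char_sequence, preset_dictionary):
        """Greedily mark spans of the character sequence that co-occur with the
        preset dictionary, so all characters of one word are masked together."""
        spans, i = [], 0
        max_len = max(len(w) for w in preset_dictionary)
        while i < len(char_sequence):
            for l in range(min(max_len, len(char_sequence) - i), 0, -1):
                word = "".join(char_sequence[i:i + l])
                if l > 1 and word in preset_dictionary:
                    spans.append((i, i + l))
                    i += l
                    break
            else:
                i += 1
        return spans

    pairs = build_sop_pairs("被保险人应如实告知。保险人有权解除合同。犹豫期内可全额退保。")
    print(pairs[:2])
    print(whole_word_spans(list("犹豫期内可全额退保"), {"犹豫期", "退保"}))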
In a specific application scenario, as shown in fig. 5, the training module 31 further includes: a construction unit 315;
the construction unit 315 is configured to extract an industry keyword corresponding to a preset training task from a standard industry file based on a TF-IDF algorithm; acquiring related words of each industry keyword according to the industry keyword and the intra-language related relation of the language of the industry keyword in the corpus, wherein the related words comprise at least one of synonyms, similar words and similar words; and constructing a preset dictionary based on the industry keywords and the associated words.
In a specific application scenario, when performing sentence order prediction pre-training on the BERT model by using the positive example sample sentence pair and the negative example sample sentence pair, the first training unit 312 may be configured to input the positive example sample sentence pair and the negative example sample sentence pair into the BERT model respectively, and obtain a first sentence vector and a second sentence vector corresponding to the two sentences in the positive example sample sentence pair, and a third sentence vector and a fourth sentence vector corresponding to the two sentences in the negative example sample sentence pair; and to calculate a first vector feature distance between the first sentence vector and the second sentence vector and a second vector feature distance between the third sentence vector and the fourth sentence vector, and update the model parameters of the BERT model according to the first vector feature distance and the second vector feature distance, so that the first vector feature distance becomes smaller than a first preset threshold value and the second vector feature distance becomes larger than a second preset threshold value, wherein the second preset threshold value is larger than the first preset threshold value.
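One possible reading of these distance constraints is a hinge-style loss that pulls the positive-pair distance below the first preset threshold and pushes the negative-pair distance above the second; the sketch below uses cosine distance over stand-in sentence vectors, and the thresholds, loss form and tensor shapes are assumptions rather than the claimed training procedure.

    import torch
    import torch.nn.functional as F

    def sop_distance_loss(v1, v2, v3, v4, t1=0.3, t2=0.7):
        """Hinge-style loss sketch for the distance constraints described above.

        v1, v2 : sentence vectors of the positive example sample pair
        v3, v4 : sentence vectors of the negative example sample pair
        t1, t2 : first and second preset thresholds (t2 > t1)
        Cosine distance is one possible choice of vector feature distance.
        """
        d_pos = 1.0 - F.cosine_similarity(v1, v2, dim=-1)   # first vector feature distance
        d_neg = 1.0 - F.cosine_similarity(v3, v4, dim=-1)   # second vector feature distance
        loss = torch.clamp(d_pos - t1, min=0.0) + torch.clamp(t2 - d_neg, min=0.0)
        return loss.mean(), d_pos.detach(), d_neg.detach()

    # Toy check with random 768-dim "sentence vectors" standing in for BERT outputs.
    v1, v2, v3, v4 = (torch.randn(2, 768, requires_grad=True) for _ in range(4))
    loss, d_pos, d_neg = sop_distance_loss(v1, v2, v3, v4)
    loss.backward()   # in real training the gradients would update the BERT parameters
    print(float(loss), d_pos.tolist(), d_neg.tolist())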
In a specific application scenario, the second training task comprises a classification task and an entity recognition task for dialog scene objects; the second training unit 313 is specifically configured to configure task labels for the second sample corpus, where the task labels include object labels and entity labels; to take the second sample corpus as the input feature of the BERT model and the object label or the entity label as the training label to train the BERT model, so as to obtain a task training result; to calculate a loss function of the BERT model according to the task labels and the task training result; if the loss function meets the model convergence requirement, to judge that the BERT model has completed the second pre-training of the classification task; and if the loss function does not meet the model convergence requirement, to update the model parameters of the BERT model and perform iterative training on the updated BERT model until the loss function meets the model convergence requirement.
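The task-level second pre-training loop might look like the following sketch, where a simple linear classification head over pooled features stands in for the BERT encoder plus object-label classifier; the convergence threshold, optimizer choice and toy data are assumptions for illustration only.

    import torch
    import torch.nn as nn

    hidden_size, num_labels = 768, 4
    head = nn.Linear(hidden_size, num_labels)             # stand-in classifier head
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    convergence_threshold = 0.05                          # "model convergence requirement" placeholder

    features = torch.randn(32, hidden_size)               # pooled corpus features (toy data)
    task_labels = torch.randint(0, num_labels, (32,))     # object labels of the samples

    for step in range(200):
        logits = head(features)                           # task training result
        loss = criterion(logits, task_labels)             # loss from labels vs. results
        if loss.item() < convergence_threshold:           # convergence requirement met
            print(f"second pre-training of the classification task done at step {step}")
            break
        optimizer.zero_grad()
        loss.backward()                                   # otherwise update the parameters
        optimizer.step()                                  # and continue iterative training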
In a specific application scenario, as shown in fig. 5, the apparatus further includes: a fine-tuning module 34;
the fine-tuning module 34 is configured to determine a target downstream task corresponding to the text data, and perform fine-tuning processing on the natural language processing model by using adaptation data matched with the target downstream task;
and the input module 33 may be configured to input the preprocessed text data into the natural language processing model after the fine-tuning processing, and obtain a natural language processing result corresponding to the target downstream task.
Based on the methods shown in fig. 1 and fig. 2, correspondingly, the embodiment of the invention further provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the methods shown in fig. 1 to fig. 2.
Based on the above embodiments of the method shown in fig. 1 and the apparatus shown in fig. 4, an embodiment of the present invention further provides an entity structure diagram of a computer device, as shown in fig. 6, where the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor, wherein the memory 42 and the processor 41 are connected via a bus 43, and the methods shown in fig. 1 to fig. 2 are implemented when the processor 41 executes the program.
According to the technical scheme, incremental pre-training can be performed on a general-field BERT model according to a preset training task to obtain a natural language processing model, wherein the preset training task comprises a first training task at the word level and a second training task at the task level; after the text data to be subjected to natural language processing is acquired, a series of preprocessing operations such as data cleaning and data standardization can be performed on the text data; finally, the preprocessed text data is input into the incrementally pre-trained natural language processing model to obtain the natural language processing result. With this technical scheme, a BERT model that already performs well in the general field is obtained through transfer learning, incremental pre-training for the specific task field is then carried out on top of it, and the applicability of the model to downstream tasks is thereby further improved. In addition, no complex structural modification ("magic change") of the model is needed, and the incrementally pre-trained model performs well on downstream tasks.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The method and system of the present invention may be implemented in a number of ways. For example, the methods and systems of the present invention may be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustrative purposes only, and the steps of the method of the present invention are not limited to the order specifically described above unless specifically indicated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the invention through various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A natural language processing method, comprising:
performing incremental pre-training on a BERT model in the general field according to a preset training task to obtain a natural language processing model, wherein the preset training task comprises a first training task at a word level and a second training task at a task level;
acquiring text data to be subjected to natural language processing, and preprocessing the text data, wherein the preprocessing comprises at least one of data cleaning processing and stop word filtering processing;
and inputting the preprocessed text data into the natural language processing model to obtain a natural language processing result.
2. The method according to claim 1, wherein the performing incremental pre-training on the BERT model in the general field according to the preset training task to obtain the natural language processing model comprises:
acquiring a first sample corpus corresponding to the first training task and a second sample corpus corresponding to the second training task;
performing word-level first pre-training on the BERT model according to a first training task and the first sample corpus;
performing second pre-training of task level on the BERT model according to a second training task and the second sample corpus;
and after the BERT model is judged to finish the first pre-training and the second pre-training, determining the BERT model as a natural language processing model.
3. The method of claim 2, wherein the first training task comprises a full-word Mask task and a sentence order prediction task, and the performing a first pre-training on the BERT model at a word level according to the first training task and the first sample corpus comprises:
performing word segmentation processing on the first sample corpus to obtain a text sequence containing each character, and extracting, from the text sequence, characters or words consisting of at least two characters that co-occur with a preset dictionary, to perform whole-word Mask pre-training on the BERT model; and
performing sentence division on the first sample corpus according to preset character identifications to obtain a sentence sequence containing each sentence, constructing a positive example sample sentence pair of the sentence order prediction task by using two consecutive sentences in the sentence sequence, constructing a negative example sample sentence pair of the sentence order prediction task by exchanging the order of the two consecutive sentences, and performing sentence order prediction pre-training on the BERT model by using the positive example sample sentence pair and the negative example sample sentence pair.
4. The method of claim 3, wherein before extracting characters co-occurring with a preset dictionary or words consisting of at least two characters in the text sequence to perform whole-word Mask pre-training on the BERT model, the method further comprises:
extracting an industry keyword corresponding to the preset training task from a standard industry file based on a TF-IDF algorithm;
acquiring associated words of the industry keywords according to the association relations of the industry keywords within the corpus, wherein the associated words comprise at least one of synonyms, near-synonyms, and similar words;
and constructing a preset dictionary based on the industry keywords and the associated words.
5. The method of claim 3, wherein the performing sentence order prediction pre-training on the BERT model by using the positive example sample sentence pair and the negative example sample sentence pair comprises:
respectively inputting the positive example sample sentence pair and the negative example sample sentence pair into the BERT model, and obtaining a first sentence vector and a second sentence vector corresponding to the two sentences in the positive example sample sentence pair, and a third sentence vector and a fourth sentence vector corresponding to the two sentences in the negative example sample sentence pair;
calculating a first vector feature distance between the first sentence vector and the second sentence vector, and a second vector feature distance between the third sentence vector and the fourth sentence vector, and updating the model parameters of the BERT model according to the first vector feature distance and the second vector feature distance, so that the first vector feature distance is smaller than a first preset threshold value and the second vector feature distance is larger than a second preset threshold value, wherein the second preset threshold value is larger than the first preset threshold value.
6. The method of claim 2, wherein the second training task comprises a classification task and an entity recognition task for dialog scene objects;
and performing second pre-training of task level on the BERT model according to a second training task and the second sample corpus, wherein the second pre-training comprises the following steps:
configuring a task label for the second sample corpus, wherein the task label comprises an object label and an entity label;
training the BERT model by taking the second sample corpus as an input feature of the BERT model and taking the object label or the entity label as a training label to obtain a task training result;
calculating a loss function of the BERT model according to the task label and the task training result;
if the loss function meets the requirement of model convergence, judging that the BERT model completes second pre-training of the classification task;
and if the loss function is judged not to meet the requirement of model convergence, updating the model parameters of the BERT model, and performing iterative training on the updated BERT model until the loss function meets the requirement of model convergence.
7. The method according to claim 1, wherein before the inputting the preprocessed text data into the natural language processing model to obtain a natural language processing result, the method further comprises:
determining a target downstream task corresponding to the text data, and performing fine tuning processing on the natural language processing model by using adaptive data matched with the target downstream task;
inputting the preprocessed text data into the natural language processing model to obtain a natural language processing result, wherein the natural language processing result comprises:
and inputting the preprocessed text data into the natural language processing model after fine tuning processing, and acquiring a natural language processing result corresponding to the target downstream task.
8. A natural language processing apparatus, comprising:
the training module is used for carrying out incremental pre-training on the BERT model in the general field according to a preset training task to obtain a natural language processing model, wherein the preset training task comprises a first training task at a word level and a second training task at a task level;
the processing module is used for acquiring text data to be subjected to natural language processing and preprocessing the text data, wherein the preprocessing comprises at least one of data cleaning processing and stop word filtering processing;
and the input module is used for inputting the preprocessed text data into the natural language processing model to obtain a natural language processing result.
9. A storage medium on which a computer program is stored, characterized in that the program realizes the natural language processing method of any one of claims 1 to 7 when executed by a processor.
10. A computer device comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the natural language processing method of any one of claims 1 to 7 when executing the program.
CN202210044925.0A 2022-01-14 2022-01-14 Natural language processing method and device and computer equipment Pending CN114528919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210044925.0A CN114528919A (en) 2022-01-14 2022-01-14 Natural language processing method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210044925.0A CN114528919A (en) 2022-01-14 2022-01-14 Natural language processing method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN114528919A true CN114528919A (en) 2022-05-24

Family

ID=81620530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210044925.0A Pending CN114528919A (en) 2022-01-14 2022-01-14 Natural language processing method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN114528919A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168856A (en) * 2022-07-29 2022-10-11 山东省计算中心(国家超级计算济南中心) Binary code similarity detection method and Internet of things firmware vulnerability detection method
CN115168856B (en) * 2022-07-29 2023-04-21 山东省计算中心(国家超级计算济南中心) Binary code similarity detection method and Internet of things firmware vulnerability detection method
CN116738974A (en) * 2023-05-10 2023-09-12 济南云微软件科技有限公司 Language model generation method, device and medium based on generalization causal network
CN116738974B (en) * 2023-05-10 2024-01-23 济南云微软件科技有限公司 Language model generation method, device and medium based on generalization causal network
CN116775639A (en) * 2023-08-08 2023-09-19 阿里巴巴(中国)有限公司 Data processing method, storage medium and electronic device
CN117094383A (en) * 2023-10-19 2023-11-21 成都数之联科技股份有限公司 Joint training method, system, equipment and storage medium for language model
CN117094383B (en) * 2023-10-19 2024-02-02 成都数之联科技股份有限公司 Joint training method, system, equipment and storage medium for language model
CN117251555A (en) * 2023-11-17 2023-12-19 深圳须弥云图空间科技有限公司 Language generation model training method and device
CN117251555B (en) * 2023-11-17 2024-04-16 深圳须弥云图空间科技有限公司 Language generation model training method and device

Similar Documents

Publication Publication Date Title
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
Saha et al. Proposed approach for sarcasm detection in twitter
US8676730B2 (en) Sentiment classifiers based on feature extraction
EP3346394A1 (en) Question answering system training device and computer program therefor
CN114528919A (en) Natural language processing method and device and computer equipment
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
Nerabie et al. The impact of Arabic part of speech tagging on sentiment analysis: A new corpus and deep learning approach
Zhao et al. Classification of natural language processing techniques for requirements engineering
Sahu et al. Sarcasm detection: a review, synthesis and future research agenda
CN111159405B (en) Irony detection method based on background knowledge
Sen et al. Bangla natural language processing: A comprehensive review of classical machine learning and deep learning based methods
CN110765762A (en) System and method for extracting optimal theme of online comment text under big data background
Das et al. Analysis of bangla transformation of sentences using machine learning
Shalinda et al. Hate words detection among sri lankan social media text messages
Lee Natural Language Processing: A Textbook with Python Implementation
Bruchansky Political footprints: Political discourse analysis using pre-trained word vectors
CN113536802A (en) Method, device, equipment and storage medium for judging emotion of text data in languages
Ptaszynski et al. Detecting emotive sentences with pattern-based language modelling
CN112347786A (en) Artificial intelligence scoring training method and device
Chenal et al. Predicting sentential semantic compatibility for aggregation in text-to-text generation
Amjadian Representation Learning for Information Extraction
CN116226677B (en) Parallel corpus construction method and device, storage medium and electronic equipment
CN116976290B (en) Multi-scene information abstract generation method and device based on autoregressive model
Hira et al. A Systematic Review of Sentiment Analysis from Bengali Text using NLP
Shirko Part of speech tagging for wolaita language using transformation based learning (tbl) approach

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100102 201 / F, block C, 2 lizezhong 2nd Road, Chaoyang District, Beijing

Applicant after: Beijing Shuidi Technology Group Co.,Ltd.

Address before: 100102 201, 2 / F, block C, No.2 lizezhong 2nd Road, Chaoyang District, Beijing

Applicant before: Beijing Health Home Technology Co.,Ltd.

CB02 Change of applicant information