CN115293133A - Vehicle insurance fraud behavior identification method based on extracted text factor enhancement - Google Patents

Vehicle insurance fraud behavior identification method based on extracted text factor enhancement Download PDF

Info

Publication number
CN115293133A
Authority
CN
China
Prior art keywords
accident
text
word
vehicle
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210564739.XA
Other languages
Chinese (zh)
Inventor
陈奎
那崇宁
丁锴
杨佳熹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202210564739.XA priority Critical patent/CN115293133A/en
Publication of CN115293133A publication Critical patent/CN115293133A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a vehicle insurance fraud identification method enhanced by extracted text factors. The invention fuses part-of-speech and syntactic information and designs a combined framework built around a pre-trained language model. Accident trigger words are extracted with the knowledge of the pre-trained language model, an attention mechanism learns syntactic relation weights, and part-of-speech filtering refines the extraction result. The invention further designs an accident cause translation template to help extract accident causes, which effectively reduces the error propagated between model stages. Finally, the extracted text factors are integrated, the discrete structured text is encoded with a Transformer encoder, and an ensemble learning model predicts whether vehicle insurance fraud exists.

Description

Vehicle insurance fraud behavior identification method based on extracted text factor enhancement
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to an insurance fraud behavior identification method based on extracted text factor enhancement.
Background
Traditional vehicle insurance fraud identification methods mainly target manually labeled structured data and cannot effectively handle rich natural language information. In addition, natural language description texts contain a large amount of redundant noise, which may confuse a model and keep it from capturing important information. How to extract structured key factors from vehicle-insurance-related description texts and use them to help a model identify vehicle insurance fraud is therefore of great significance.
Text factor extraction is a natural language processing task. Traditional text extraction methods include template matching and recurrent neural networks, but these methods cannot effectively capture the semantics in the text, so their extraction accuracy is unsatisfactory. Most state-of-the-art text extraction models are now based on pre-trained language models and rely on their semantic understanding to assist extraction. The typical process is that the pre-trained language model encodes the text to obtain lemma representations, a lemma classification module then predicts the type of each lemma, and the content to be extracted is finally determined from the predicted types. A pre-trained language model not only carries a large amount of semantic knowledge from its original training corpus but also attends to the textual context during extraction. However, existing pre-trained models still have shortcomings: first, models with huge numbers of parameters are very large and not well suited to processing semi-structured text; second, existing pre-trained language models are not fully adapted to downstream tasks such as text extraction, which causes errors to propagate between model stages and prevents the pre-trained language model from reaching its full performance.
Traditional prediction models either use simple one-hot encoding of discrete data, which directly discards all semantic information, or use word2vec encoding, where encoding each lemma independently cannot capture context information. Moreover, at the prediction stage a single prediction model is very sensitive to the data, lacks robustness, and predicts poorly.
Disclosure of Invention
The invention aims to provide a vehicle insurance fraud identification method enhanced by extracted text factors, addressing the difficulty in the prior art of extracting structured key factors from insurance-related description texts. The invention designs a combined pre-trained language model framework that fuses part-of-speech and syntactic information, and proposes an accident cause translation template to address the error propagation between a pre-trained language model and its downstream task; the extracted text factors are then integrated, the discrete structured text is encoded with a Transformer encoder, and an ensemble learning model predicts whether vehicle insurance fraud exists.
The purpose of the invention is realized by the following technical scheme: a vehicle insurance fraud behavior identification method based on extraction text factor enhancement comprises the following steps:
1) Extracting structured address information from the address text of the vehicle accident;
2) Constructing a vehicle accident description keyword vocabulary to assist in vehicle accident description and accident survey description text segmentation;
3) Extracting accident trigger words in the vehicle accident description by using a pre-training language model, and acquiring an accident object by combining part-of-speech analysis and syntactic analysis;
4) Designing an accident reason translation template, and extracting accident reasons in an accident survey description text by combining a pre-training language model;
5) Performing text classification on the accident survey description text by using a pre-training language model to obtain accident result classification;
6) Integrating the structured data extracted from the text and constructing a text encoder to encode the structured text;
7) Using an ensemble learning model to learn to identify vehicle insurance fraud behaviors.
Further, in step 1):
writing a corresponding regular matching template and using it to extract structured address information from the natural language description text of the vehicle accident address; handling format inconsistencies uniformly, and replacing erroneous or missing texts with the most similar text; the text similarity is measured by the edit distance, and for texts T_a and T_b the edit distance recurrence is as follows:

lev_{T_a,T_b}(i, j) = max(i, j), if min(i, j) = 0
lev_{T_a,T_b}(i, j) = min( lev_{T_a,T_b}(i-1, j) + 1, lev_{T_a,T_b}(i, j-1) + 1, lev_{T_a,T_b}(i-1, j-1) + [T_a[i] ≠ T_b[j]] ), otherwise
Further, in step 2):
the vehicle accident description keyword vocabulary is constructed by building 2-gram, 3-gram and 4-gram vocabularies from the vehicle accident description and accident survey description texts and ranking them by word frequency to obtain the final vehicle accident description keyword vocabulary;
the vehicle accident description and accident survey description texts are segmented according to the domain-specific high-frequency words in the vehicle accident description keyword vocabulary; the word segmentation process covers two cases:
a) If a lemma obtained by segmentation belongs to the keywords in the vocabulary, the surrounding lemmas are searched to judge whether a word formed by combining them also belongs to the keywords; if so, the combined lemma is taken as a new segmentation lemma;
b) If a lemma obtained by segmentation does not belong to the keywords in the vocabulary, it is treated as an independent lemma.
Further, in step 3):
a trigger word extractor is formed by adding a multi-class classifier on top of the Chinese pre-trained language model BERT-Base Chinese; for an input segmentation result X = {x_1, x_2, …, x_n}, where x_i′ denotes the index of the corresponding lemma, the processing flow of the trigger word extractor is:

H = BERT_Chinese(X)
O = softmax(tanh(H·W_hidden + b_hidden)·W + b)

where BERT_Chinese(·) denotes the BERT-Base Chinese pre-trained language model, H is the intermediate variable output by BERT-Base Chinese, W is a learnable matrix, b is a bias, and O, the output of the multi-class classifier combined with a multi-layer perceptron, is output as the prediction probability;
the StanfordCoreNLP integrated framework is used to annotate part-of-speech tags and syntactic parses of the text, obtain the syntactic relation between each lemma and the trigger word, and build the relation weight corresponding to each syntactic relation:

S_r = [S_{r,1}, S_{r,2}, …, S_{r,m}]

where S_{r,i″} denotes the learnable relation weight corresponding to a syntactic relation;

according to the intermediate variable H output by the trigger word extractor, the attention score between the trigger word and every other lemma is obtained:

attention(x_{i′}, x_{j′}) = exp(H_{i′} · H_{j′}) / Σ_k exp(H_{i′} · H_k)

where H_{i′} denotes the intermediate variable of the i′-th lemma;

the relation weight and the trigger word attention are then used to compute the degree of association between the trigger word and every other lemma:

S = S_{r,i″} · attention(trigger word, current lemma)

where S_{r,i″} is the learnable relation weight of the syntactic relation between the trigger word and the current lemma, attention(·) is the attention score between the trigger word and the current lemma, their product gives the degree of association, and the lemmas are ranked by this degree;

finally, lemmas with anomalous parts of speech are filtered out according to the lemma part-of-speech features, and the lemma with the highest degree of association with the trigger word is taken as the final accident object;
accident trigger word extraction and accident object extraction are divided into two independent sub-modules, so the two modules are optimized separately; for an input text X = {x_1, x_2, …, x_n}, the training loss function of the trigger word extraction module is designed as:

loss_1 = - Σ_{i′=1}^{n} y_{i′}^{(1)} · log ŷ_{i′}^{(1)}

where ŷ_{i′}^{(1)} is the output label probability distribution of the i′-th lemma from the accident trigger word extraction module, and y_{i′}^{(1)} is the corresponding actual label probability distribution.
Further, in step 4):
an accident cause translation template is designed, and the template converts the named entity recognition task into an input suitable for the pre-trained model; meaningless punctuation is removed, the [CLS] placeholder in the model input marks the beginning of the sentence, and the [SEP] placeholders mark the separation points and the end of the sentence; model training is similar to the text extraction process of step 3), with the translated text fed directly into the pre-trained language model; note that in the model prediction phase the input also goes through accident cause translation, except that the accident cause to be predicted is replaced by the placeholder [MASK], and the output corresponding to [MASK] is predicted directly as the result.
Further, in step 5), the accident result classification is obtained by text classification:
a) First, the original text is cleaned and redundant text with no practical meaning is deleted;
b) The text representation is obtained from the [CLS] token of the Chinese pre-trained language model;
c) A multi-layer perceptron classifies the text, with probability distribution p_y given by:

p_y = softmax(MLP(H_[CLS]))

the class with the highest probability is taken as the final accident result classification.
Further, in step 6):
before learning the vehicle insurance fraud model, the structured data extracted from the text, such as address information, accident trigger words, accident objects, accident causes, accident results, accident road types and vehicle repair shop grades, is integrated; in addition, information in the text that may affect the final recognition performance is extracted, including hit-and-run (escape) information, vehicle damage conditions and personal injury conditions; different encoding methods are used for the different characteristics of the structured data: the province and city information in the address, the accident result, the accident road type, the vehicle repair shop grade, the hit-and-run information, the vehicle damage condition and the personal injury condition are treated as discrete categorical data, while the accident trigger word, the accident object and the accident cause are treated as text data; the categorical data is encoded by one-hot encoding, and the text data is encoded by a Transformer encoder based on the self-attention mechanism; the Transformer encoder is formed by stacking several Transformer blocks, and the lemma x_i′ is encoded as follows:

h_{i′}^{(0)} = word2vec(x_{i′}) + position(x_{i′})
h_{i′}^{(l)} = Transformer(h_{i′}^{(l-1)}), l = 1, …, L

where h_{i′}^{(L)} denotes the output of lemma x_i′ at the L-th Transformer layer, h_{i′}^{(0)} denotes the Transformer encoder input, and word2vec(·) and position(·) denote the word embedding and position embedding of lemma x_i′, respectively.
Further, in step 7):
the essence of the ensemble learning model is that the final prediction is voted on by several sub-models; six sub-models, CatBoost, LightGBMLarge, LightGBMXT, LightGBM, XGBoost and NeuralNetMXNet, are selected, and a set of learnable voting weights is designed:

y_p = λ_1·O_CatBoost + λ_2·O_LightGBM + λ_3·O_LightGBMLarge + λ_4·O_LightGBMXT + λ_5·O_XGBoost + λ_6·O_NeuralNetMXNet

the model loss function is:

loss = - (1/|N|) Σ_{i″′ ∈ N} y_{i″′} · log y_{p,i″′}
the invention has the beneficial effects that:
(1) The problems of default and error in the original data can be effectively solved by using the text similarity;
(2) The vehicle accident description keyword vocabulary constructed by knowledge in the field of vehicle accidents can obviously improve the word segmentation accuracy, which lays a solid foundation for subsequent extraction tasks, and compared with a simple word segmentation result, a plurality of special word elements related to vehicle insurance can be accurately segmented;
(3) The pre-training language model combined framework model fused with the part-of-speech syntactic information can accurately extract accident trigger words and accident objects, the accuracy rates respectively reach 97% and 86.5%, and the extraction accuracy rate of long sentence patterns is improved by more than 7.4%;
(4) The event translation template greatly reduces error transfer among models, the test accuracy of the accident cause unit extraction reaches 83%, and the performance of the event translation template is obviously improved compared with that of a traditional named entity recognition model;
(5) The transform coding mode can convert discrete data into vectors which are easy to understand and process by a machine, and related domain knowledge is reserved in the coding learning process;
(6) The method has the advantages that the integrated learning model is adopted to predict the vehicle insurance fraud behavior, and the accuracy is higher compared with a single simple model;
(7) Ablation experiment results prove that the text extraction factor can obviously improve the prediction accuracy, help a prediction model to more accurately identify the fraudulent behavior and achieve the accuracy rate of more than 87%.
Drawings
FIG. 1 is a flow chart of the method for identifying vehicle insurance fraud based on text factor extraction enhancement according to the present invention;
FIG. 2 is a schematic diagram of accident trigger extraction and accident object extraction;
FIG. 3 is a diagram illustrating an example of an accident cause translation template.
Detailed Description
The invention is further described in the following with reference to the accompanying drawings.
The invention discloses a vehicle insurance fraud identification method enhanced by extracted text factors. The description texts related to vehicle insurance fall into three main categories: the vehicle accident address, the vehicle accident description text, and the accident survey description text. The related texts also yield structured data such as address information, accident trigger words, accident objects, accident causes, accident results, accident road types, vehicle repair shop grades, hit-and-run (escape) information, vehicle damage conditions and personal injury conditions. First, text factors are extracted from the vehicle accident address by template matching, and missing and erroneous values in the original data are resolved by measuring text similarity. Factor extraction for the vehicle accident description and accident survey description texts is based on a pre-trained language model, and a vehicle accident description keyword vocabulary is first built from vehicle insurance domain knowledge. The two differ in that part-of-speech and syntactic information is fused when extracting factors from the vehicle accident description text, where a combined pre-trained language model fused with this information extracts the accident trigger word and accident object factors; when extracting factors from the accident survey description text, an accident cause translation template is proposed to help extract the accident cause, which effectively reduces the error propagated between model stages. Finally, the extracted text factors are integrated, the discrete structured text is encoded with a Transformer encoder, and an ensemble learning model predicts whether vehicle insurance fraud exists.
As shown in fig. 1, the present invention mainly comprises the following steps:
1) Extracting structured address information from the natural language description text of the vehicle accident address.

The natural language description of a vehicle accident address is semi-structured data, such as "Jiangning District (district/county/town) ×× Road ××"; the invention therefore extracts the structured address information directly by template matching. Specifically: a regular expression module is first imported, a corresponding regular matching template is written, and the template is used to extract structured address information from the natural language description of the address where the vehicle accident occurred, for example { province: "Jiangsu", city: "Nanjing", district/county/town: "Jiangning District", detailed location: "××" }.

The original text suffers from inconsistent and even erroneous formats. Format inconsistencies are handled uniformly, and erroneous or missing texts are replaced by the most similar text. The text similarity is computed with the edit distance; for texts T_a and T_b, the edit distance recurrence is:

lev_{T_a,T_b}(i, j) = max(i, j), if min(i, j) = 0
lev_{T_a,T_b}(i, j) = min( lev_{T_a,T_b}(i-1, j) + 1, lev_{T_a,T_b}(i, j-1) + 1, lev_{T_a,T_b}(i-1, j-1) + [T_a[i] ≠ T_b[j]] ), otherwise

where i and j denote the character positions in T_a and T_b respectively, and lev_{T_a,T_b}(i, j) denotes the edit distance between the first i characters of T_a and the first j characters of T_b.
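As an illustrative sketch of step 1), the following Python snippet combines a regular matching template with edit-distance-based similar-text substitution; the regular expression, the field names and the reference district list are hypothetical examples, not the exact template used by the invention.

```python
import re

def edit_distance(a: str, b: str) -> int:
    """Levenshtein edit distance following the recurrence above."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if min(i, j) == 0:
                dp[i][j] = max(i, j)
            else:
                dp[i][j] = min(dp[i - 1][j] + 1,                               # deletion
                               dp[i][j - 1] + 1,                               # insertion
                               dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))      # substitution
    return dp[len(a)][len(b)]

# Hypothetical regular matching template for a semi-structured Chinese address.
ADDRESS_PATTERN = re.compile(r"(?P<province>.+?省)?(?P<city>.+?市)?(?P<district>.+?[区县镇])?(?P<detail>.*)")

def extract_address(text: str, known_districts: list[str]) -> dict:
    """Extract structured address fields; repair a garbled district name by
    substituting the most similar known district (similar-text substitution)."""
    match = ADDRESS_PATTERN.match(text)
    fields = {k: (v or "") for k, v in match.groupdict().items()}
    if fields["district"] and known_districts:
        fields["district"] = min(known_districts,
                                 key=lambda d: edit_distance(fields["district"], d))
    return fields
```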
2) Constructing a vehicle accident description keyword vocabulary to assist word segmentation of the vehicle accident description and accident survey description texts.
The vehicle accident description keyword vocabulary is constructed as follows: first, 2-gram, 3-gram and 4-gram vocabularies are built from the vehicle accident description and accident survey description texts and ordered by word frequency; the final vehicle accident description keyword vocabulary is then obtained by manual screening. The natural language texts of the vehicle accident description and accident survey description are segmented on the basis of the domain-specific high-frequency words in this vocabulary, which improves segmentation accuracy.

In a common word segmentation process the following two cases may occur (see the sketch after this list):

2.1) If a lemma obtained by segmentation belongs to the keywords in the vehicle accident description keyword vocabulary, the surrounding lemmas are searched to judge whether the word formed by combining them also belongs to the keywords. If so, the combined lemma is taken as a new segmentation lemma, since the combined lemma has better semantic integrity and is better processed as a single lemma. If not, the lemma is processed as a separate lemma. For example, in the text "While driving, the subject vehicle was scratched by an electric bike", the words for "electric" and "bike" are first split into two independent lemmas, but the whole lemma "electric bike" has semantic integrity; this optimization effectively improves segmentation accuracy.

2.2) If a lemma obtained by segmentation does not belong to the keywords in the vehicle accident description keyword vocabulary, it is treated as an independent lemma.
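The sketch below illustrates, under assumed data structures, how the n-gram keyword vocabulary of step 2) could be built and how adjacent lemmas might be merged when their combination also appears in the vocabulary; the function names and the cut-off are illustrative only.

```python
from collections import Counter

def build_keyword_vocab(texts: list[str], top_k: int = 500) -> set[str]:
    """Collect 2-gram, 3-gram and 4-gram strings from the accident description
    texts and keep the most frequent ones as candidate keywords
    (manual screening would follow in practice)."""
    counter = Counter()
    for text in texts:
        for n in (2, 3, 4):
            for i in range(len(text) - n + 1):
                counter[text[i:i + n]] += 1
    return {gram for gram, _ in counter.most_common(top_k)}

def merge_segments(tokens: list[str], vocab: set[str]) -> list[str]:
    """Merge a lemma with its right neighbour when the combined string is also
    a keyword in the vocabulary (case 2.1); otherwise keep lemmas as-is (case 2.2)."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] in vocab and tokens[i] + tokens[i + 1] in vocab:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```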
3) As shown in fig. 2, the pre-training language model is used to extract accident trigger words in the vehicle accident description, and the accident object is obtained by combining part-of-speech analysis and syntactic analysis.
The trigger word extractor aims to predict whether a lemma in the text triggers an event. It is built by adding a multi-class classifier on top of the Chinese pre-trained language model BERT-Base Chinese.

For an input segmentation result X = {x_1, x_2, …, x_n}, where x_i′ denotes the index of the corresponding lemma, the processing flow of the trigger word extractor is:

H = BERT_Chinese(X)
O = softmax(tanh(H·W_hidden + b_hidden)·W + b)

where BERT_Chinese(·) denotes the BERT-Base Chinese pre-trained language model and H is the intermediate variable it outputs; W is a learnable matrix, b is a bias, W_hidden and b_hidden are the learnable matrix and bias of the model's final output layer, tanh and softmax are activation functions, and O is the output of the multi-class classifier combined with a multi-layer perceptron, used as the prediction probability.

Trigger words are extracted from the text according to the per-lemma predicted category. For example, in the text "While driving, the subject vehicle was scratched by an electric bike", the lemma "scratched" has the highest prediction probability in O and is extracted as the keyword that triggers the event.
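A minimal sketch of such a trigger word extractor, using the Hugging Face transformers implementation of bert-base-chinese; the hidden size, label set and the decision rule in the last line are assumptions for illustration rather than the exact configuration of the invention.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class TriggerWordExtractor(nn.Module):
    """BERT-Base Chinese encoder followed by a per-lemma multi-class classifier:
    O = softmax(tanh(H W_hidden + b_hidden) W + b)."""
    def __init__(self, num_labels: int = 2, hidden: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.hidden_layer = nn.Linear(self.bert.config.hidden_size, hidden)  # W_hidden, b_hidden
        self.classifier = nn.Linear(hidden, num_labels)                      # W, b

    def forward(self, input_ids, attention_mask):
        H = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(torch.tanh(self.hidden_layer(H)))
        return torch.softmax(logits, dim=-1)   # prediction probability O for every lemma

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = TriggerWordExtractor()
enc = tokenizer("行驶中标的车被电动车剐蹭", return_tensors="pt")
probs = model(enc["input_ids"], enc["attention_mask"])   # shape: (1, seq_len, num_labels)
trigger_index = probs[0, :, 1].argmax().item()           # lemma most likely to trigger the event
```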
The invention adopts the StanfordCoreNLP integrated framework to annotate the part-of-speech tags and syntactic parse of the text, obtains the syntactic relation r_{i″} between each lemma and the trigger word, and constructs a relation weight S_r for the syntactic relations:

S_r = [S_{r,1}, S_{r,2}, …, S_{r,m}]

where S_r denotes the learnable relation weights of all syntactic relations, S_{r,i″} denotes the learnable relation weight corresponding to syntactic relation r_{i″}, m denotes the number of syntactic relations, and the subscript i″ is used to distinguish the different syntactic relations.

From the intermediate variable H output by the trigger word extractor, the attention score between the trigger word and every other lemma is obtained as:

attention(x_{i′}, x_{j′}) = exp(H_{i′} · H_{j′}) / Σ_k exp(H_{i′} · H_k)

where H_{i′} denotes the intermediate variable of the i′-th lemma (the trigger word), H_{j′} denotes the intermediate variable of another, non-trigger lemma x_{j′} in X, and the sum over H_k normalizes the scores.

Using the relation weight and the trigger word attention, the degree of association between the trigger word and every other lemma is computed as:

S = S_{r,i″} · attention(trigger word, current lemma)

where S_{r,i″} is the learnable weight of the syntactic relation r_{i″} between the trigger word and the current lemma, attention(·) is the attention score between them, their product gives the degree of association, and the lemmas are ranked by this degree.
Finally, according to the part-of-speech characteristics of the lemma, the lemma with abnormal part-of-speech is filtered, and the lemma with the highest degree of association with the trigger word is used as a final accident object.
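The snippet below sketches this association computation (relation weight times attention score, followed by part-of-speech filtering); the softmax attention form, the relation-weight lookup and the allowed part-of-speech set are assumptions made for illustration.

```python
import math

def association_scores(H: list[list[float]], trigger_idx: int,
                       relations: list[str], relation_weight: dict[str, float],
                       pos_tags: list[str], allowed_pos: frozenset = frozenset({"NN", "NR"})):
    """Rank candidate accident objects: attention(trigger, lemma) * syntactic relation weight,
    keeping only lemmas whose part of speech is plausible for an accident object."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # softmax attention of the trigger word over all lemmas
    logits = [dot(H[trigger_idx], H[k]) for k in range(len(H))]
    z = sum(math.exp(x) for x in logits)
    attn = [math.exp(x) / z for x in logits]

    scored = []
    for j in range(len(H)):
        if j == trigger_idx or pos_tags[j] not in allowed_pos:
            continue   # filter out lemmas with unsuitable parts of speech
        s = relation_weight.get(relations[j], 0.0) * attn[j]
        scored.append((j, s))
    return sorted(scored, key=lambda t: t[1], reverse=True)   # highest association first
```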
Accident trigger word extraction and accident object extraction are divided into two independent sub-modules, so the two modules are optimized separately. For an input text X = {x_1, x_2, …, x_n}, the training loss function of the accident trigger word extraction module is designed as follows:

loss_1 = - Σ_{i′=1}^{n} y_{i′}^{(1)} · log ŷ_{i′}^{(1)}

where ŷ_{i′}^{(1)} is the output label probability distribution of the i′-th lemma from the accident trigger word extraction module, y_{i′}^{(1)} is the corresponding actual (one-hot) label probability distribution, and p_1 denotes the probability distribution similarity between the output labels of the accident trigger word extraction module and the actual labels, used to measure prediction accuracy.

The training loss function of the accident object extraction module has the same form as that of the accident trigger word extraction module:

loss_2 = - Σ_{i′=1}^{n} y_{i′}^{(2)} · log ŷ_{i′}^{(2)}

where ŷ_{i′}^{(2)} is the output label probability distribution of the i′-th lemma from the accident object extraction module, y_{i′}^{(2)} is the corresponding actual label probability distribution, and p_2 denotes the probability distribution similarity between the output labels of the accident object extraction module and the actual labels, used to measure prediction accuracy.
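A brief sketch of how the two cross-entropy style losses above could be computed, with each sub-module trained separately as the description states; the optimizer choice and the per-batch flow are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def extraction_loss(pred_probs: torch.Tensor, gold_labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between per-lemma predicted label distributions, shape (n, num_labels),
    and integer gold labels, shape (n,); matches the loss_1 / loss_2 form above."""
    return F.nll_loss(torch.log(pred_probs + 1e-12), gold_labels)

def train_step(model, optimizer, input_ids, attention_mask, gold_labels):
    """One optimization step for one sub-module (trigger word or accident object extractor);
    the two sub-modules are trained separately, each with its own optimizer."""
    probs = model(input_ids, attention_mask)[0]          # (seq_len, num_labels)
    loss = extraction_loss(probs, gold_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```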
4) Designing an accident cause translation template and extracting accident causes from the accident survey description text with a pre-trained language model.
Extracting the accident cause from the accident survey description text is essentially a named entity recognition task in natural language processing. Unlike the traditional approach of pre-training a model and then fine-tuning it on the downstream task, the method designs an accident cause translation template and uses it to convert the named entity recognition task into an input suited to the pre-trained model, avoiding the errors introduced while adapting the model to the task.

The accident cause translation template is defined as follows: the [CLS] placeholder marks the beginning of the sentence, and two [SEP] placeholders mark the division and the end of the sentence; the text (with meaningless punctuation removed) is placed between [CLS] and the first [SEP], and the accident cause lemma is placed between the first [SEP] and the second [SEP]. For example:

[CLS] text with punctuation removed [SEP] The accident was caused by <accident cause lemma> [SEP]

As shown in fig. 3, take the texts "The subject vehicle changed lanes and collided with a vehicle going straight." and "The subject vehicle carelessly rolled into a ditch while driving and the vehicle was seriously damaged.", where "changed lanes" and "carelessly" are annotated as the respective accident causes. The accident cause translation template converts them into the pre-trained model inputs: "[CLS] The subject vehicle changed lanes and collided with a vehicle going straight [SEP] The accident was caused by changing lanes [SEP]" and "[CLS] The subject vehicle carelessly rolled into a ditch while driving and the vehicle was seriously damaged [SEP] The accident was caused by carelessness [SEP]".
Training the accident reason extraction model is similar to the text extraction process in the step 3), and the text translated by the accident reason translation template is directly used as pre-training language model input to extract the accident reason lemma in the text.
4.1 Translate the cause of the incident into a pre-trained language model input in a particular format using the translation template.
Note that the input at the model prediction stage is also translated through the accident cause translation template, except that the accident cause to be predicted is replaced by the placeholder [MASK], and the output corresponding to [MASK] is predicted directly. For example:

[CLS] description text [SEP] The accident was caused by [MASK] [SEP]
The accident cause translation template minimizes the error transfer between the language model and the named entity recognition task, retains the text's semantic information, and effectively improves named entity recognition accuracy.
4.2) The pre-trained language model obtains text semantic representation vectors through its multi-head self-attention mechanism and the text word order, and obtains the [MASK] semantic representation through a fine-tuning module.
4.3) The probability distribution of the accident cause is output from the model's final [MASK] semantic representation.
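A small sketch of steps 4.1–4.3 using the masked-language-model head of bert-base-chinese; the Chinese template string is a back-translation assumption, and decoding the [MASK] positions directly to a cause lemma is a simplification for illustration.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
mlm = BertForMaskedLM.from_pretrained("bert-base-chinese")

def build_training_input(text: str, cause: str) -> str:
    """Accident cause translation template used at training time (4.1)."""
    return f"{text}[SEP]事故是由{cause}导致的"

def predict_cause(text: str) -> str:
    """At prediction time the cause slot is replaced by [MASK] and the output
    at the [MASK] position is decoded directly (4.2, 4.3)."""
    prompt = f"{text}[SEP]事故是由[MASK]导致的"
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**enc).logits
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    pred_ids = logits[0, mask_pos].argmax(dim=-1)
    return tokenizer.decode(pred_ids)
```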
5) Using a pre-trained language model to classify the accident survey description text and obtain the accident result classification.
The accident survey description text roughly describes the final outcome of the accident. The invention sorts and analyzes the accident results and, according to their characteristics, divides them into six categories, including "collision with a moving object", "collision with a fixed object", "vehicle-to-vehicle collision", "damage while driving" and "damage while parked".
The invention obtains the accident result classification by text classification as follows:

5.1) The original accident survey description text is first cleaned, and redundant text with no practical meaning is deleted;

5.2) The text representation H_[CLS] is obtained from the [CLS] token of the Chinese pre-trained language model;

5.3) A multi-layer perceptron classifies the text representation H_[CLS], with probability distribution p_y:

p_y = softmax(MLP(H_[CLS]))

where softmax is the activation function and MLP denotes the multi-layer perceptron. The class with the highest probability is taken as the final accident result classification.
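An illustrative sketch of the [CLS]-based accident result classifier in step 5); the six-way label set and the MLP sizes are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class AccidentResultClassifier(nn.Module):
    """p_y = softmax(MLP(H_[CLS])) over the accident result categories."""
    def __init__(self, num_classes: int = 6, hidden: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.mlp = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        # H_[CLS] is the hidden state of the first ([CLS]) position
        h_cls = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return torch.softmax(self.mlp(h_cls), dim=-1)   # p_y

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
clf = AccidentResultClassifier()
enc = tokenizer("事故造成车辆受损，无人员伤亡", return_tensors="pt")
result_class = clf(enc["input_ids"], enc["attention_mask"]).argmax(dim=-1)  # predicted accident result
```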
6) Integrating the structured data extracted from the text and constructing a text encoder to encode the structured text.
Before learning the vehicle insurance fraud model, the structured data extracted from the text in the preceding steps is integrated, including address information, accident trigger words, accident objects, accident causes, accident results, accident road types, vehicle repair shop grades and so on. In addition, the invention extracts information from the text that may influence the final recognition performance, including hit-and-run (escape) information, vehicle damage conditions and personal injury conditions. The accident road type, vehicle repair shop grade, hit-and-run information, vehicle damage condition and personal injury condition are extracted by template matching.

Different encoding methods are used for the different characteristics of the structured data. The province and city information in the address, the accident result, the hit-and-run information, the vehicle damage condition and the personal injury condition are treated as discrete categorical data; the accident trigger word, the accident object and the accident cause are treated as text data. The categorical data is encoded by one-hot encoding, and the text data is encoded by a Transformer encoder based on the self-attention mechanism.

The Transformer encoder is formed by stacking several Transformer blocks, and the lemma x_i′ is encoded as follows:

h_{i′}^{(0)} = word2vec(x_{i′}) + position(x_{i′})
h_{i′}^{(l)} = Transformer(h_{i′}^{(l-1)}), l = 1, …, L

where h_{i′}^{(L)} denotes the output of lemma x_i′ at the L-th Transformer layer, h_{i′}^{(0)} denotes the Transformer encoder input, and word2vec(·) and position(·) denote the word embedding and position embedding of lemma x_i′, respectively.
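A sketch of the mixed encoding in step 6): one-hot vectors for the categorical fields and a small self-attention Transformer encoder over word plus positional embeddings for the text fields; the dimensions, vocabulary size, example token ids and the mean pooling are assumptions.

```python
import torch
import torch.nn as nn

def one_hot(index: int, num_categories: int) -> torch.Tensor:
    """One-hot encoding of a discrete categorical field (e.g. accident road type)."""
    v = torch.zeros(num_categories)
    v[index] = 1.0
    return v

class TextFactorEncoder(nn.Module):
    """Stacked Transformer blocks over word + position embeddings, as in
    h^(0) = word2vec(x) + position(x), h^(l) = Transformer(h^(l-1))."""
    def __init__(self, vocab_size: int, dim: int = 128, layers: int = 2, max_len: int = 64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)      # plays the role of word2vec(·)
        self.pos_emb = nn.Embedding(max_len, dim)          # position(·)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h0 = self.word_emb(token_ids) + self.pos_emb(positions)
        hL = self.encoder(h0)
        return hL.mean(dim=1)   # pooled representation of the text factor

# Example: encode one text factor and concatenate with one-hot categorical fields
encoder = TextFactorEncoder(vocab_size=21128)
text_vec = encoder(torch.tensor([[101, 2342, 3300, 102]]))         # illustrative token ids
features = torch.cat([text_vec[0], one_hot(2, 5), one_hot(0, 3)])  # text + categorical features
```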
7) According to the characteristics and complexity of the data, an ensemble learning model is used to learn to identify vehicle insurance fraud behaviors.

The essence of the ensemble learning model is that the final prediction is voted on by several sub-models; the sub-models respond quickly and are individually accurate, so the ensemble achieves high prediction accuracy together with strong robustness and flexibility. The embodiment of the invention selects six sub-models, CatBoost, LightGBMLarge, LightGBMXT, LightGBM, XGBoost and NeuralNetMXNet, and designs a set of learnable voting weights λ_1 to λ_6:

y_p = λ_1·O_CatBoost + λ_2·O_LightGBM + λ_3·O_LightGBMLarge + λ_4·O_LightGBMXT + λ_5·O_XGBoost + λ_6·O_NeuralNetMXNet

where O_CatBoost, O_LightGBM, O_LightGBMLarge, O_LightGBMXT, O_XGBoost and O_NeuralNetMXNet are the prediction probabilities output by the respective sub-models, and y_p is the predicted probability distribution.

The loss function of the ensemble learning model is:

loss = - (1/|N|) Σ_{i″′ ∈ N} y_{i″′} · log y_{p,i″′}

where N is the training data set and |N| is the number of training samples, used to compute the average loss; y_{p,i″′} is the output label probability distribution of the i″′-th training sample, y_{i″′} is the corresponding actual (one-hot) label probability distribution, and p denotes the probability distribution similarity between the output labels and the actual labels, used to measure prediction accuracy.
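A minimal sketch of the learnable weighted voting over the six sub-model outputs; treating the weights as a softmax-normalized parameter vector trained with a cross-entropy criterion is an assumption about how the λ values would be learned.

```python
import torch
import torch.nn as nn

class WeightedVotingEnsemble(nn.Module):
    """y_p = Σ_k λ_k · O_k over the fraud probabilities of the six sub-models
    (CatBoost, LightGBM, LightGBMLarge, LightGBMXT, XGBoost, NeuralNetMXNet)."""
    def __init__(self, num_models: int = 6):
        super().__init__()
        self.raw_weights = nn.Parameter(torch.zeros(num_models))  # learnable λ_1 … λ_6

    def forward(self, sub_model_probs: torch.Tensor) -> torch.Tensor:
        # sub_model_probs: (batch, num_models) fraud probabilities from the sub-models
        lam = torch.softmax(self.raw_weights, dim=0)   # keep the voting weights normalized
        return (sub_model_probs * lam).sum(dim=-1)     # predicted fraud probability y_p

ensemble = WeightedVotingEnsemble()
optimizer = torch.optim.Adam(ensemble.parameters(), lr=1e-2)
criterion = nn.BCELoss()                               # cross-entropy over the binary fraud label

probs = torch.rand(32, 6)                              # placeholder sub-model outputs for one batch
labels = torch.randint(0, 2, (32,)).float()
loss = criterion(ensemble(probs), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```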
The experimental results show that the method breaks through the limitations of traditional vehicle insurance fraud identification methods: it extracts important information from the related description texts, reduces the interference of redundant information and noise on the prediction task, and converts the extracted factors into a more effective representation that enhances vehicle insurance fraud identification performance. The constructed vehicle accident description keyword vocabulary effectively improves word segmentation accuracy and gives the subsequent steps better semantic integrity of the text; the proposed method of extracting the accident object from the accident trigger word by combining part-of-speech and syntactic analysis with an attention mechanism effectively improves accident object extraction accuracy; the designed accident cause translation template breaks through the limitation between the traditional language model and its downstream task, allowing the pre-trained language model's strengths to be exploited to the fullest; and the constructed text encoder effectively converts text into information a computer can easily process, improving information utilization. Comparison experiments and unit tests show that the various extracted text factors markedly enhance vehicle insurance fraud identification performance.
The above examples are only preferred embodiments of the present invention. It should be noted that various modifications and equivalents can be made by those skilled in the art without departing from the spirit of the invention, and all such modifications and equivalents are intended to fall within the scope of the invention as defined in the claims.

Claims (8)

1. A vehicle insurance fraud behavior identification method based on extracted text factor enhancement is characterized by comprising the following steps:
1) Extracting structured address information from the vehicle accident occurrence address text;
2) Constructing a vehicle accident description keyword vocabulary to assist in vehicle accident description and accident survey description text segmentation;
3) Extracting accident trigger words in the vehicle accident description by using a pre-training language model, and acquiring an accident object by combining part-of-speech analysis and syntactic analysis;
4) Designing an accident reason translation template, and extracting accident reasons in an accident investigation description text by combining a pre-training language model;
5) Performing text classification on the accident survey description text by using a pre-training language model to obtain accident result classification;
6) Integrating the structured data extracted from the text and constructing a text encoder to encode the structured text;
7) Using an ensemble learning model to learn to identify vehicle insurance fraud behaviors.
2. The vehicle insurance fraud recognition method based on extracted text factor enhancement according to claim 1, characterized in that in step 1):
writing a corresponding regular matching template and using it to extract structured address information from the natural language description text of the vehicle accident address; handling format inconsistencies uniformly, and replacing erroneous or missing texts with the most similar text; the text similarity is measured by the edit distance, and for texts T_a and T_b the edit distance recurrence is as follows:

lev_{T_a,T_b}(i, j) = max(i, j), if min(i, j) = 0
lev_{T_a,T_b}(i, j) = min( lev_{T_a,T_b}(i-1, j) + 1, lev_{T_a,T_b}(i, j-1) + 1, lev_{T_a,T_b}(i-1, j-1) + [T_a[i] ≠ T_b[j]] ), otherwise.
3. the vehicle insurance fraud recognition method based on extracted text factor enhancement according to claim 1, characterized in that in step 2):
the vehicle accident description keyword vocabulary is constructed by building 2-gram, 3-gram and 4-gram vocabularies from the vehicle accident description and accident survey description texts and ranking them by word frequency to obtain the final vehicle accident description keyword vocabulary;
the vehicle accident description and accident survey description texts are segmented according to the domain-specific high-frequency words in the vehicle accident description keyword vocabulary; the word segmentation process covers two cases:
2.1) If a lemma obtained by segmentation belongs to the keywords in the vocabulary, the surrounding lemmas are searched to judge whether a word formed by combining them also belongs to the keywords; if so, the combined lemma is taken as a new segmentation lemma;
2.2) If a lemma obtained by segmentation does not belong to the keywords in the vocabulary, it is treated as an independent lemma.
4. The vehicle insurance fraud recognition method based on extracted text factor enhancement according to claim 1, characterized in that in step 3):
adding a multi-class classifier on top of the Chinese pre-trained language model BERT-Base Chinese to form a trigger word extractor; for an input segmentation result X = {x_1, x_2, …, x_n}, where x_i′ denotes the index of the corresponding lemma, the processing flow of the trigger word extractor is:

H = BERT_Chinese(X)
O = softmax(tanh(H·W_hidden + b_hidden)·W + b)

where BERT_Chinese(·) denotes the BERT-Base Chinese pre-trained language model, H is the intermediate variable output by BERT-Base Chinese, W is a learnable matrix, b is a bias, and O, the output of the multi-class classifier combined with a multi-layer perceptron, is output as the prediction probability;

the StanfordCoreNLP integrated framework is used to annotate part-of-speech tags and syntactic parses of the text, obtain the syntactic relation between each lemma and the trigger word, and construct the relation weight corresponding to each syntactic relation:

S_r = [S_{r,1}, S_{r,2}, …, S_{r,m}]

where S_{r,i″} denotes the learnable relation weight corresponding to a syntactic relation;

according to the intermediate variable H output by the trigger word extractor, the attention score between the trigger word and every other lemma is obtained:

attention(x_{i′}, x_{j′}) = exp(H_{i′} · H_{j′}) / Σ_k exp(H_{i′} · H_k)

where H_{i′} denotes the intermediate variable of the i′-th lemma;

the relation weight and the trigger word attention are used to compute the degree of association between the trigger word and every other lemma:

S = S_{r,i″} · attention(trigger word, current lemma)

where S_{r,i″} is the learnable relation weight of the syntactic relation between the trigger word and the current lemma, attention(·) is the attention score between the trigger word and the current lemma, their product gives the degree of association, and the lemmas are ranked by this degree;

finally, lemmas with anomalous parts of speech are filtered out according to the lemma part-of-speech features, and the lemma with the highest degree of association with the trigger word is taken as the final accident object;

accident trigger word extraction and accident object extraction are divided into two independent sub-modules, so the two modules are optimized separately; for an input text X = {x_1, x_2, …, x_n}, the training loss function of the accident trigger word extraction module is designed as:

loss_1 = - Σ_{i′=1}^{n} y_{i′}^{(1)} · log ŷ_{i′}^{(1)}

where ŷ_{i′}^{(1)} is the output label probability distribution of the i′-th lemma from the accident trigger word extraction module, and y_{i′}^{(1)} is the corresponding actual label probability distribution.
5. The vehicle insurance fraud recognition method based on extracted text factor enhancement according to claim 1, characterized in that in step 4):
an accident cause translation template is designed, and the template converts the named entity recognition task into an input suitable for the pre-trained model; meaningless punctuation is removed, the [CLS] placeholder in the model input marks the beginning of the sentence, and the [SEP] placeholders mark the separation points and the end of the sentence; model training is similar to the text extraction process of step 3), with the translated text fed directly into the pre-trained language model; note that in the model prediction stage the input also goes through accident cause translation, except that the accident cause to be predicted is replaced by the placeholder [MASK], and the output corresponding to [MASK] is predicted directly as the result.
6. The vehicle insurance fraud recognition method based on extracted text factor enhancement as claimed in claim 1, wherein in step 5), the accident result classification is obtained by text classification:
5.1) The original text is first cleaned, and redundant text with no practical meaning is deleted;
5.2) The text representation H_[CLS] is obtained from the [CLS] token of the Chinese pre-trained language model;
5.3) A multi-layer perceptron classifies the text, with probability distribution p_y:

p_y = softmax(MLP(H_[CLS]))

the class with the highest probability is taken as the final accident result classification.
7. The vehicle insurance fraud recognition method based on extracted text factor enhancement according to claim 1, characterized in that in step 6):
before learning the vehicle insurance fraud model, the structured data extracted from the text, such as address information, accident trigger words, accident objects, accident causes, accident results, accident road types and vehicle repair shop grades, is integrated; in addition, information in the text that may affect the final recognition performance is extracted, including hit-and-run (escape) information, vehicle damage conditions and personal injury conditions; different encoding methods are used for the different characteristics of the structured data: the province and city information in the address, the accident result, the accident road type, the vehicle repair shop grade, the hit-and-run information, the vehicle damage condition and the personal injury condition are treated as discrete categorical data, while the accident trigger word, the accident object and the accident cause are treated as text data; the categorical data is encoded by one-hot encoding, and the text data is encoded by a Transformer encoder based on the self-attention mechanism; the Transformer encoder is formed by stacking several Transformer blocks, and the lemma x_i′ is encoded as follows:

h_{i′}^{(0)} = word2vec(x_{i′}) + position(x_{i′})
h_{i′}^{(l)} = Transformer(h_{i′}^{(l-1)}), l = 1, …, L

where h_{i′}^{(L)} denotes the output of lemma x_i′ at the L-th Transformer layer, h_{i′}^{(0)} denotes the Transformer encoder input, and word2vec(·) and position(·) denote the word embedding and position embedding of lemma x_i′, respectively.
8. The vehicle insurance fraud recognition method based on extracted text factor enhancement according to claim 1, characterized in that in step 7):
the essence of the ensemble learning model is that the final prediction is voted on by several sub-models; six sub-models, CatBoost, LightGBMLarge, LightGBMXT, LightGBM, XGBoost and NeuralNetMXNet, are selected, and a set of learnable voting weights is designed:

y_p = λ_1·O_CatBoost + λ_2·O_LightGBM + λ_3·O_LightGBMLarge + λ_4·O_LightGBMXT + λ_5·O_XGBoost + λ_6·O_NeuralNetMXNet

the model loss function is:

loss = - (1/|N|) Σ_{i″′ ∈ N} y_{i″′} · log y_{p,i″′}.
CN202210564739.XA 2022-05-23 2022-05-23 Vehicle insurance fraud behavior identification method based on extracted text factor enhancement Pending CN115293133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210564739.XA CN115293133A (en) 2022-05-23 2022-05-23 Vehicle insurance fraud behavior identification method based on extracted text factor enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210564739.XA CN115293133A (en) 2022-05-23 2022-05-23 Vehicle insurance fraud behavior identification method based on extracted text factor enhancement

Publications (1)

Publication Number Publication Date
CN115293133A true CN115293133A (en) 2022-11-04

Family

ID=83820310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210564739.XA Pending CN115293133A (en) 2022-05-23 2022-05-23 Vehicle insurance fraud behavior identification method based on extracted text factor enhancement

Country Status (1)

Country Link
CN (1) CN115293133A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117455518A (en) * 2023-12-25 2024-01-26 连连银通电子支付有限公司 Fraudulent transaction detection method and device
CN117455518B (en) * 2023-12-25 2024-04-19 连连银通电子支付有限公司 Fraudulent transaction detection method and device

Similar Documents

Publication Publication Date Title
CN109766544B (en) Document keyword extraction method and device based on LDA and word vector
US20230069935A1 (en) Dialog system answering method based on sentence paraphrase recognition
CN112231472B (en) Judicial public opinion sensitive information identification method integrated with domain term dictionary
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN111339750B (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
CN110347787B (en) Interview method and device based on AI auxiliary interview scene and terminal equipment
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
CN113191148A (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114372470B (en) Chinese law text entity identification method based on boundary detection and prompt learning
CN113223509A (en) Fuzzy statement identification method and system applied to multi-person mixed scene
Singh et al. HINDIA: a deep-learning-based model for spell-checking of Hindi language
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN112270187A (en) Bert-LSTM-based rumor detection model
CN113919366A (en) Semantic matching method and device for power transformer knowledge question answering
CN108536781B (en) Social network emotion focus mining method and system
CN115510863A (en) Question matching task oriented data enhancement method
CN116796740A (en) Bad information identification method based on textCNN-Bert fusion model algorithm
CN114997169A (en) Entity word recognition method and device, electronic equipment and readable storage medium
CN107992468A (en) A kind of mixing language material name entity recognition method based on LSTM
CN111368035A (en) Neural network-based Chinese dimension-dimension Chinese organization name dictionary mining system
CN113343701B (en) Extraction method and device for text named entities of power equipment fault defects
CN115293133A (en) Vehicle insurance fraud behavior identification method based on extracted text factor enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination