CN116611432A - Drunk driving risk identification method and device, computer equipment and storage medium - Google Patents

Drunk driving risk identification method and device, computer equipment and storage medium

Info

Publication number
CN116611432A
Authority
CN
China
Prior art keywords
sample
drunk driving
word segmentation
word
training
Prior art date
Legal status
Pending
Application number
CN202310389078.6A
Other languages
Chinese (zh)
Inventor
杨逢
Current Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd filed Critical Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN202310389078.6A
Publication of CN116611432A
Legal status: Pending


Classifications

    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N20/00 Machine learning
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G06Q40/08 Insurance

Abstract

The application discloses a drunk driving risk identification method, a drunk driving risk identification device, computer equipment and a storage medium, and belongs to the technical field of artificial intelligence. Training data are obtained from a preset database and training samples are constructed from them, with drunk driving data as positive samples and non-drunk driving data as negative samples; sample equalization processing is performed on the training samples to obtain equalized samples; word segmentation processing is performed on the equalized samples and the word frequencies of the word segments are calculated; drunk driving feature word segments are determined based on the word frequencies; the equalized samples are structured, and the structured samples are used for training; target data whose drunk driving risk is to be identified are imported into the drunk driving risk identification model, and a drunk driving risk identification result is output. In addition, the application also relates to blockchain technology, and the target data may be stored in a blockchain. By extracting drunk driving feature word segments from the word frequencies and structuring the samples, the application addresses the problem of context semantics and significantly improves the prediction accuracy of the drunk driving risk identification model.

Description

Drunk driving risk identification method and device, computer equipment and storage medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a drunk driving risk identification method, a drunk driving risk identification device, computer equipment and a storage medium.
Background
Claim settlement is a very important link in the insurance industry: after purchasing insurance, a user needs to settle a claim through an appropriate channel when an accident occurs. At present, the risk of fraud commonly exists in many claim settlement scenarios, and drunk driving is one of the fraud types with a relatively high probability.
How to quickly identify fraudulent behavior of a user during claim processing is therefore a very important task for large insurance companies. In the prior art, drunk driving risk identification is generally realized with a deep-learning BERT-structure model, but such a model has shortcomings: it cannot identify the large number of data escape errors produced during escaping, so the context semantics cannot be learned well, which affects the drunk driving risk prediction effect.
Disclosure of Invention
The embodiments of the application aim to provide a drunk driving risk identification method, a drunk driving risk identification device, computer equipment and a storage medium, so as to solve the technical problem that existing drunk driving risk identification schemes adopting a BERT model cannot identify the large number of data escape errors produced during escaping, which results in a poor drunk driving risk prediction effect.
In order to solve the technical problems, the embodiment of the application provides a drunk driving risk identification method, which adopts the following technical scheme:
a drunk driving risk identification method comprises the following steps:
acquiring training data from a preset database, wherein the training data comprises drunk driving data and non-drunk driving data;
constructing a training sample based on the training data, wherein the training sample comprises a positive sample and a negative sample, the drunk driving data is used as the positive sample, and the non-drunk driving data is used as the negative sample;
performing sample equalization on the training samples to obtain equalized samples;
performing word segmentation processing on the balanced sample, and calculating word segmentation word frequency;
determining drunk driving feature word segmentation based on the word segmentation word frequency, and carrying out structural processing on the balanced sample based on the drunk driving feature word segmentation to obtain a structural sample;
training a preset gradient decision tree model by using the structured sample to obtain a drunk driving risk identification model;
and receiving a drunk driving risk identification instruction, acquiring target data of drunk driving risk to be identified, importing the target data of drunk driving risk to be identified into the drunk driving risk identification model, and outputting drunk driving risk identification results.
Further, the sample equalization processing is performed on the training sample to obtain an equalized sample, which specifically includes:
oversampling is carried out on the positive samples in the training samples to obtain a capacity-expanding sample;
the equalized sample is constructed based on the positive sample, the negative sample, and the expanded sample.
Further, the sample equalization processing is performed on the training sample to obtain an equalized sample, which specifically includes:
undersampling a negative sample in the training sample to obtain a volume-reduced sample;
the equalized sample is constructed based on the positive sample and the reduced volume sample.
Further, the word segmentation processing is performed on the equalization sample, and word segmentation word frequency is calculated, which specifically includes:
performing word segmentation processing on all samples in the balanced samples to obtain a plurality of sample word segmentation;
determining a sample where each sample word is located, and obtaining a target sample;
counting the occurrence times of each sample word in the target sample to obtain a first word segmentation number;
counting the sum of the occurrence times of each sample word in the balanced sample to obtain a second word number;
and calculating word segmentation word frequency of each sample word segmentation based on the first word segmentation number and the second word segmentation number.
Further, the determining drunk driving feature word segmentation based on the word segmentation word frequency specifically includes:
acquiring word segmentation word frequency of sample word segmentation corresponding to the positive sample in the balanced sample, and acquiring the word segmentation word frequency of the positive sample;
obtaining word segmentation word frequency of sample word segmentation corresponding to the negative sample in the balanced sample, and obtaining the word segmentation word frequency of the negative sample;
calculating word frequency ratio of positive and negative sample word segmentation based on the positive sample word segmentation word frequency and the negative sample word segmentation word frequency;
and determining drunk driving feature word segmentation based on the word frequency ratio of the positive and negative sample word segmentation.
Further, the structuring processing is performed on the equalization sample based on the drunk driving feature word segmentation to obtain a structured sample, which specifically comprises:
traversing the equalization sample, and judging whether the equalization sample contains the drunk driving characteristic word;
if the equalization sample contains the drunk driving feature word, assigning the equalization sample containing the drunk driving feature word as a preset first value;
and if the equalization sample does not contain the drunk driving feature word, assigning the equalization sample which does not contain the drunk driving feature word as a preset second value.
Further, before determining the drunk driving feature word based on the word segmentation word frequency and performing structural processing on the equalization sample based on the drunk driving feature word, the method further comprises the following steps:
Calculating the feature weights corresponding to the balanced samples, and sequencing the feature weights obtained by calculation;
and carrying out sample screening on the balanced samples based on the sorting result of the characteristic weights.
In order to solve the technical problems, the embodiment of the application also provides a drunk driving risk identification device, which adopts the following technical scheme:
a drunk driving risk identification device, comprising:
the data acquisition module is used for acquiring training data from a preset database, wherein the training data comprises drunk driving data and non-drunk driving data;
the sample construction module is used for constructing training samples based on the training data, wherein the training samples comprise positive samples and negative samples, the drunk driving data are used as positive samples, and the non-drunk driving data are used as negative samples;
the sample equalization module is used for carrying out sample equalization processing on the training samples to obtain equalized samples;
the word segmentation processing module is used for carrying out word segmentation processing on the balanced sample and calculating word segmentation word frequency;
the structuring processing module is used for determining drunk driving feature word segmentation based on the word segmentation word frequency, and carrying out structuring processing on the balanced sample based on the drunk driving feature word segmentation to obtain a structuring sample;
The model training module is used for training a preset gradient decision tree model by utilizing the structural sample to obtain a drunk driving risk identification model;
the risk identification module is used for receiving drunk driving risk identification instructions, acquiring target data of drunk driving risk to be identified, importing the target data of drunk driving risk to be identified into the drunk driving risk identification model, and outputting drunk driving risk identification results.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
a computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which when executed by the processor implement the steps of the drunk driving risk identification method according to any one of the preceding claims.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the drunk driving risk identification method according to any of the preceding claims.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
the application discloses a drunk driving risk identification method, a drunk driving risk identification device, computer equipment and a storage medium, and belongs to the technical field of artificial intelligence. Acquiring training data from a preset database, wherein the training data comprises drunk driving data and non-drunk driving data; training samples are constructed based on training data, wherein the training samples comprise positive samples and negative samples, drunk driving data are used as positive samples, and non-drunk driving data are used as negative samples; sample equalization processing is carried out on the training samples, and equalized samples are obtained; performing word segmentation processing on the balanced sample, and calculating word segmentation word frequency; determining drunk driving feature word segmentation based on word segmentation word frequency, and carrying out structural processing on the balanced sample based on the drunk driving feature word segmentation to obtain a structural sample; training a preset gradient decision tree model by using a structured sample to obtain a drunk driving risk identification model; and receiving drunk driving risk identification instructions, acquiring target data of drunk driving risk to be identified, importing the target data of drunk driving risk to be identified into a drunk driving risk identification model, and outputting drunk driving risk identification results. The method solves the problems of escape errors, contextual semantic deletion, confusion and the like on the global level well based on the statistical sample word segmentation word frequency, extracts key drunk driving feature word segmentation by analyzing the sample word segmentation word frequency, carries out structural processing on the sample based on the drunk driving feature word segmentation, well reserves and extracts key feature information of a training sample, and remarkably improves the prediction precision of the drunk driving risk recognition model. In addition, the application realizes the equalization of the positive sample and the negative sample through sample equalization processing, trains the drunk driving risk identification model through sample structuring processing and adopting a lightweight gradient decision tree model, simplifies the complexity of an identification algorithm and the complexity of a model framework, and improves the timeliness of drunk driving risk identification processing.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.
FIG. 1 illustrates an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 illustrates a flow chart of one embodiment of a drunk driving risk identification method according to the present application;
FIG. 3 shows a schematic structural diagram of one embodiment of a drunk driving risk identification device according to the present application;
fig. 4 shows a schematic structural diagram of an embodiment of a computer device according to the application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop and desktop computers, and the like.
The server 105 may be a server that provides various services, such as a background server that provides support for pages displayed on the terminal devices 101, 102, 103, and may be a stand-alone server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
It should be noted that, the drunk driving risk identification method provided by the embodiment of the application is generally executed by a server, and correspondingly, the drunk driving risk identification device is generally arranged in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow chart of one embodiment of a drunk driving risk identification method according to the present application is shown. The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
In existing drunk driving risk identification schemes, identification is usually realized with a deep-learning BERT-structure model, but such a model has its own shortcomings. First, the input of the BERT model is limited to at most 512 units, so it cannot handle the telephone communication content generated during claim settlement, whose text length may reach thousands or even tens of thousands of words. Second, the BERT model cannot identify the large number of data escape errors produced during escaping, so the context semantics cannot be learned well, which affects the prediction effect. Finally, because of its algorithmic complexity, the deep-learning BERT-structure model consumes a long time during processing, and its processing speed can hardly meet the real-time requirements of the case peak period.
In order to solve the above technical problems, the application discloses a drunk driving risk identification method, a drunk driving risk identification device, computer equipment and a storage medium, belonging to the technical field of artificial intelligence. By counting the word frequencies of the sample word segments, the application addresses, at the global level, problems such as escape errors and missing or confused context semantics; key drunk driving feature word segments are extracted by analyzing these word frequencies, and the samples are structured based on them, which preserves and extracts the key feature information of the training samples and significantly improves the prediction accuracy of the drunk driving risk identification model. In addition, the application balances the positive and negative samples through sample equalization processing, which further improves the prediction accuracy; sample structuring reduces the length of the model input; and training the drunk driving risk identification model with a lightweight gradient decision tree model simplifies the identification algorithm and the model architecture and improves the timeliness of drunk driving risk identification processing.
The drunk driving risk identification method comprises the following steps:
s201, training data is obtained from a preset database, wherein the training data comprises drunk driving data and non-drunk driving data.
In this embodiment, the preset database stores all historical drunk driving report data, the main content of which is the call records between the reporting personnel and the claim settlement agents; the historical drunk driving report data already carry labels indicating whether drunk driving occurred. The server acquires the historical drunk driving report data from the preset database as training data, where data carrying a drunk driving label are the drunk driving data and data not carrying a drunk driving label are the non-drunk driving data. It should be noted that, since drunk driving fraud occurs in only a small proportion of claim settlement cases, the amount of drunk driving data in the database is far smaller than that of non-drunk driving data.
S202, constructing training samples based on training data, wherein the training samples comprise positive samples and negative samples, drunk driving data are used as positive samples, and non-drunk driving data are used as negative samples.
In this embodiment, after the server acquires the training data from the preset database, training samples are constructed from the acquired training data. The training samples include positive samples and negative samples, with the drunk driving data taken as positive samples and the non-drunk driving data taken as negative samples; because the amount of drunk driving data is far smaller than that of non-drunk driving data, the positive and negative samples in the training samples are unbalanced.
S203, sample equalization processing is carried out on the training samples, and equalized samples are obtained.
When the class distribution of the raw data is extremely unbalanced, modeling directly on such data is obviously problematic, and the problems caused by class imbalance are even more prominent when the minority class is the one of greater concern, as in insurance fraud, credit card fraud and case analysis. On unbalanced data, a prediction model built with a machine learning algorithm may fail to make accurate predictions, because in essence the algorithm summarizes patterns from a large data set to judge whether a sample is normal; a model trained on unbalanced training data obviously tends to predict the majority class, and the minority class may be treated as noise points or ignored, which leads to large deviations in the prediction results.
In this embodiment, sample equalization processing is performed on the training samples before model training, so as to balance the positive and negative samples and increase the proportion of positive samples, that is, the proportion of samples carrying drunk driving labels, thereby improving the prediction accuracy of the drunk driving risk identification model. Sample equalization of the training samples can be done by oversampling or undersampling: oversampling generates new data samples for the minority class to participate in training, thereby expanding the minority class, while undersampling selects only part of the samples of the majority class to participate in training, thereby reducing the majority class.
In a specific embodiment of the application, considering that undersampling may cause loss of negative-sample feature data, oversampling is adopted to expand the positive samples so as to balance the positive and negative samples and ensure the final prediction accuracy.
Further, sample equalization processing is performed on the training samples to obtain equalized samples, which specifically includes:
oversampling is carried out on the positive sample in the training sample to obtain a capacity-expanding sample;
an equalization sample is constructed based on the positive, negative and expanded samples.
In this embodiment, the server oversamples the positive samples in the training samples to obtain expanded samples, and constructs the equalized samples from the positive samples, the negative samples and the expanded samples. In a specific embodiment of the application, the positive samples in the training samples are randomly resampled with a RandomOverSampler function: sampling with replacement is repeated within the minority class until the number of positive samples plus the number of expanded samples equals the number of negative samples, so that sample equalization is achieved.
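A minimal sketch of this oversampling step, assuming the imbalanced-learn library and an illustrative text/label representation of the training samples (neither is specified in the application):

```python
# Sketch only: library choice (imbalanced-learn) and variable names are assumptions.
import numpy as np
from imblearn.over_sampling import RandomOverSampler

# texts: call-record strings; labels: 1 = drunk driving (positive), 0 = non-drunk driving (negative)
texts = np.array(["...call record A...", "...call record B...", "...call record C..."], dtype=object)
labels = np.array([1, 0, 0])

# RandomOverSampler resamples the minority (positive) class with replacement
# until the two classes contain the same number of samples.
ros = RandomOverSampler(random_state=42)
texts_balanced, labels_balanced = ros.fit_resample(texts.reshape(-1, 1), labels)
texts_balanced = texts_balanced.ravel()
```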
Further, sample equalization processing is performed on the training samples to obtain equalized samples, which specifically includes:
undersampling a negative sample in the training sample to obtain a volume-reduced sample;
An equalized sample is constructed based on the positive sample and the reduced-volume sample.
In this embodiment, the negative samples in the training samples are undersampled to obtain volume-reduced samples, and the equalized samples are constructed from the positive samples and the volume-reduced samples. In a specific embodiment of the present application, samples of the majority class are deleted with a RandomUnderSampler function or a TomekLinks function until the number of volume-reduced samples equals the number of positive samples, so that sample equalization is achieved.
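A minimal sketch of the random undersampling variant, continuing the illustrative variables from the previous sketch (the sampler class is again an imbalanced-learn assumption, and TomekLinks would additionally require numeric feature vectors):

```python
# Sketch only: RandomUnderSampler randomly drops majority-class (negative) samples
# until the classes are balanced; TomekLinks is an alternative that needs vectorized features.
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
texts_reduced, labels_reduced = rus.fit_resample(texts.reshape(-1, 1), labels)
texts_reduced = texts_reduced.ravel()   # positive samples plus the volume-reduced negative samples
```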
In another specific embodiment of the present application, undersampling is performed with the EasyEnsemble method or the BalanceCascade method. EasyEnsemble randomly divides the majority-class samples into n subsets, each containing as many samples as the minority class; each subset is combined with the minority-class samples to train a separate model, and the n models are then integrated, so that although each subset contains fewer samples than the whole, the total amount of information is not reduced after integration. BalanceCascade combines supervised learning with a Boosting approach (Boosting improves the accuracy of a weak classification algorithm by constructing a series of prediction functions and then combining them in a certain way into one prediction function): in the n-th round of training, a subset sampled from the majority class is combined with the minority-class samples to train a base learner H, and the majority-class samples that H classifies correctly are then removed; in the (n+1)-th round, a new subset is generated from the remaining majority-class samples and trained together with the minority-class samples, and finally the different base learners are integrated.
In the above embodiment, the present application performs sample equalization processing on the training samples in an oversampling or undersampling manner, so as to achieve positive and negative sample equalization.
S204, word segmentation processing is carried out on the balanced sample, and word segmentation word frequency is calculated.
In this embodiment, the server performs word segmentation processing on all samples in the equalized samples to obtain a plurality of sample word segments, and calculates the word frequency of each sample word segment. In a specific embodiment of the present application, the TF-IDF algorithm may be used to count the word frequency of each sample word segment. TF-IDF (term frequency-inverse document frequency) is a word-frequency statistic used to evaluate how important a word is to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but is offset by the frequency of its occurrence in the corpus.
Further, word segmentation processing is carried out on the balanced sample, and word segmentation word frequency is calculated, specifically comprising the following steps:
performing word segmentation processing on all samples in the balanced samples to obtain a plurality of sample word segmentation;
determining a sample where each sample word is located, and obtaining a target sample;
Counting the occurrence times of each sample word in a target sample to obtain a first word segmentation number;
counting the sum of the occurrence times of each sample word in the balanced sample to obtain a second word number;
based on the first word segmentation number and the second word segmentation number, word segmentation word frequency of each sample word segmentation is calculated.
In this embodiment, the server performs word segmentation processing on all samples in the equalized samples to obtain a plurality of sample word segments. For each sample word segment, the sample in which it appears is determined as the target sample; the number of occurrences of the word segment in the target sample is counted to obtain a first word count, and the sum of its occurrences across the equalized samples is counted to obtain a second word count; the word frequency of the sample word segment is then calculated from the first and second word counts. In this way the word frequency of every sample word segment is obtained.
In this embodiment, by performing word segmentation on the equalized samples and counting the word frequency of each sample word segment with the TF-IDF algorithm, the application addresses, at the global level, problems such as escape errors and missing or confused context semantics.
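A minimal sketch of this word-frequency statistic, assuming jieba for Chinese word segmentation (the application does not name a toolkit) and taking the frequency of a word segment as its first word count divided by its second word count, which is an assumption since the exact formula is not given:

```python
# Sketch only: jieba tokenization plus the first/second word counts described above.
from collections import Counter
import jieba

samples = list(texts_balanced)            # equalized samples from the equalization step
sample_labels = list(labels_balanced)     # 1 = positive (drunk driving), 0 = negative
tokenized = [jieba.lcut(text) for text in samples]                # word segmentation of every sample

total_counts = Counter(w for words in tokenized for w in words)   # second word count
per_sample_counts = [Counter(words) for words in tokenized]       # first word count

def word_frequency(sample_index, word):
    """Word frequency of a word segment within one target sample."""
    return per_sample_counts[sample_index][word] / total_counts[word]
```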
S205, determining drunk driving characteristic word segmentation based on word segmentation word frequency, and carrying out structural processing on the balanced sample based on the drunk driving characteristic word segmentation to obtain a structural sample.
In this embodiment, after the word frequency of each sample word segment is obtained, the positive-sample word frequencies and the negative-sample word frequencies are obtained separately, the word-frequency ratio of positive to negative sample word segments is calculated from them, and the drunk driving feature word segments are determined based on this ratio. The higher the ratio, the more characteristic the word segment is of the positive samples, that is, the richer its drunk driving features; in a specific embodiment of the present application, the word segments with the top-500 word-frequency ratios are selected as the drunk driving feature word segments.
Further, determining drunk driving feature word segmentation based on word segmentation word frequency specifically comprises the following steps:
acquiring word segmentation word frequency of sample word segmentation corresponding to the positive sample in the balanced sample, and acquiring the word segmentation word frequency of the positive sample;
obtaining word segmentation word frequency of sample word segmentation corresponding to the negative sample in the balanced sample, and obtaining the word segmentation word frequency of the negative sample;
calculating word frequency ratio of positive and negative sample word segmentation based on the positive sample word segmentation word frequency and the negative sample word segmentation word frequency;
and determining drunk driving feature word segmentation based on the word frequency ratio of the positive and negative sample word segmentation.
In this embodiment, the server obtains the word frequencies of the sample word segments corresponding to the positive samples in the equalized samples to obtain the positive-sample word frequencies, and obtains the word frequencies of the sample word segments corresponding to the negative samples to obtain the negative-sample word frequencies; the word-frequency ratio of positive to negative sample word segments is calculated by dividing the positive-sample word frequency by the negative-sample word frequency, and the drunk driving feature word segments are determined based on the magnitude of this ratio.
In this embodiment, by calculating the positive-sample and negative-sample word frequencies and determining the drunk driving feature word segments from their ratio, the subsequent structuring of the equalized samples is made convenient.
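A minimal sketch of the positive/negative word-frequency ratio and the top-500 selection, continuing the variables from the previous sketch (the +1 smoothing is an added assumption to avoid division by zero for word segments absent from the negative samples):

```python
# Sketch only: per-class word frequencies, their ratio, and the top-500 cut-off.
from collections import Counter

pos_counts, neg_counts = Counter(), Counter()
for words, label in zip(tokenized, sample_labels):
    (pos_counts if label == 1 else neg_counts).update(words)

pos_total = sum(pos_counts.values())
neg_total = sum(neg_counts.values())

ratio = {}
for word in set(pos_counts) | set(neg_counts):
    pos_freq = pos_counts[word] / pos_total
    neg_freq = (neg_counts[word] + 1) / (neg_total + 1)   # smoothed negative-sample word frequency
    ratio[word] = pos_freq / neg_freq

# Drunk driving feature word segments: the 500 word segments with the highest ratio.
feature_words = [w for w, _ in sorted(ratio.items(), key=lambda kv: kv[1], reverse=True)[:500]]
```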
Further, the equalization sample is subjected to structural processing based on drunk driving feature word segmentation to obtain a structural sample, and the method specifically comprises the following steps:
traversing the equalization sample, and judging whether the equalization sample contains drunk driving feature word segmentation;
if the equalization sample contains drunk driving feature words, assigning the equalization sample containing drunk driving feature words to a preset first value;
if the equalization sample does not contain drunk driving feature words, assigning the equalization sample which does not contain drunk driving feature words to a preset second value.
In this embodiment, after obtaining the drunk driving feature word segments, the server traverses the equalized samples and judges whether each equalized sample contains a drunk driving feature word segment; an equalized sample containing a drunk driving feature word segment is assigned a preset first value, and an equalized sample not containing one is assigned a preset second value. In a specific embodiment of the present application, the first value is "1" and the second value is "0"; the equalized samples are encoded with the first and second values, thereby realizing the structuring of the equalized samples.
In this embodiment, by structuring the samples based on the drunk driving feature word segments, the application preserves and extracts the key feature information of the training samples and significantly improves the prediction accuracy of the drunk driving risk identification model.
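A minimal sketch of this structuring step. Encoding each equalized sample as a 0/1 vector with one entry per drunk driving feature word segment is an assumption; the application only states that samples containing a feature word segment are assigned the first value "1" and the others the second value "0":

```python
# Sketch only: one binary indicator per drunk driving feature word segment.
import numpy as np

def structure_sample(words, feature_words):
    present = set(words)
    # 1 (first value) if the sample contains the feature word segment, 0 (second value) otherwise.
    return np.array([1 if w in present else 0 for w in feature_words], dtype=np.int8)

X_structured = np.vstack([structure_sample(words, feature_words) for words in tokenized])
y = np.array(sample_labels)
```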
S206, training a preset gradient decision tree model by using the structured sample to obtain a drunk driving risk identification model.
In this embodiment, the server trains a preset gradient decision tree model with the structured samples to obtain the drunk driving risk identification model. In a specific embodiment of the application, an xgboost model is adopted as the gradient decision tree model; the xgboost model is trained with the structured samples, parameter tuning is performed with the Grid Search technique, the trained drunk driving risk identification model is obtained, and the model is deployed in the production environment.
The xgboost model is essentially a GBDT (gradient boosted decision tree) and is a lightweight gradient decision tree model: one tree is trained, then the next tree is trained to predict the difference between the current ensemble and the true distribution, and trees compensating for this difference are trained continuously, so that the combination of trees finally approximates the true distribution; the prediction for a sample is obtained simply by adding up the score of each tree. The Grid Search technique is a model parameter tuning method: by exhaustive search, every possibility among all candidate parameter values is tried through loop traversal, and the best-performing parameters are selected as the final result.
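A minimal sketch of this training step, assuming the xgboost Python package and scikit-learn's GridSearchCV (the parameter grid and scoring metric are illustrative):

```python
# Sketch only: XGBoost classifier tuned by grid search over an illustrative parameter grid.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(XGBClassifier(eval_metric="logloss"), param_grid, cv=5, scoring="f1")
search.fit(X_structured, y)                 # structured samples from the previous step
risk_model = search.best_estimator_         # drunk driving risk identification model
```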
In the embodiment, the drunk driving risk identification model is trained by adopting the lightweight gradient decision tree model, so that the complexity of an identification algorithm and the complexity of a model architecture are simplified, and the timeliness of drunk driving risk identification processing is improved.
S207, receiving drunk driving risk identification instructions, acquiring target data of drunk driving risks to be identified, importing the target data of drunk driving risks to be identified into a drunk driving risk identification model, and outputting drunk driving risk identification results.
In this embodiment, after training of the drunk driving risk identification model is finished and a drunk driving risk identification instruction is received, the server acquires the target data whose drunk driving risk is to be identified, performs word segmentation on the target data to obtain target word segments, counts the word frequencies of the target word segments, structures the target data according to these word frequencies, imports the structured target data into the drunk driving risk identification model, and outputs the drunk driving risk identification result.
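A minimal sketch of this identification step, continuing the variables from the previous sketches (structuring the target data with the training-time feature word segments and the 0.5 decision threshold are assumptions):

```python
# Sketch only: segment, structure and score one piece of target data.
import jieba

def identify_drunk_driving_risk(target_text, feature_words, model):
    target_words = jieba.lcut(target_text)                     # target word segments
    x = structure_sample(target_words, feature_words).reshape(1, -1)
    prob = float(model.predict_proba(x)[0, 1])                 # probability of drunk driving risk
    return {"drunk_driving_risk": prob >= 0.5, "score": prob}

result = identify_drunk_driving_risk("...call record to be identified...", feature_words, risk_model)
```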
In this embodiment, the electronic device (for example, the server shown in fig. 1) on which the drunk driving risk identification method operates may receive the drunk driving risk identification instruction through a wired connection manner or a wireless connection manner. It should be noted that the wireless connection may include, but is not limited to, 3G/4G connections, wiFi connections, bluetooth connections, wiMAX connections, zigbee connections, UWB (ultra wideband) connections, and other now known or later developed wireless connection means.
Further, before determining drunk driving feature word segmentation based on word segmentation word frequency and performing structural processing on the equalization sample based on drunk driving feature word segmentation, the method further comprises the following steps:
calculating the feature weights corresponding to the balanced samples, and sequencing the feature weights obtained by calculation;
and carrying out sample screening on the balanced samples based on the sorting result of the characteristic weights.
In this embodiment, after performing word segmentation on the equalized samples, the server may also calculate the feature weight corresponding to each equalized sample, rank the calculated feature weights, and screen the equalized samples based on the ranking result, for example selecting the top-5000 samples in the feature-weight ranking for training the drunk driving risk identification model.
In a specific embodiment of the present application, the weights of the sample word segments in each equalized sample may be calculated with the TF-IDF algorithm and accumulated to obtain the feature weight of the equalized sample. In another specific embodiment of the present application, the samples may first be equalized and structured, the structured data placed into xgboost for weight ranking, the top-50 word-segmentation weight features selected, and the redundant samples in the equalized samples deleted according to these weight features.
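A minimal sketch of the first screening embodiment, taking the feature weight of a sample as the sum of the TF-IDF weights of its word segments and keeping the top-5000 samples (scikit-learn's TfidfVectorizer and the identity analyzer are assumptions):

```python
# Sketch only: per-sample feature weights from accumulated TF-IDF word-segment weights.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer=lambda words: words)   # samples are already tokenized lists
tfidf = vectorizer.fit_transform(tokenized)                  # rows correspond to equalized samples
sample_weights = np.asarray(tfidf.sum(axis=1)).ravel()       # accumulated word-segment weights

keep = np.argsort(sample_weights)[::-1][:5000]               # indices of the top-5000 samples
screened_tokens = [tokenized[i] for i in keep]
screened_labels = [sample_labels[i] for i in keep]
```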
In the embodiment, the method and the device perform sample screening on the balanced samples through the feature weight calculation of the balanced samples, reduce the training data amount, and simultaneously ensure that key feature information of the training samples is well reserved and extracted.
In this embodiment, the application discloses a drunk driving risk identification method, belonging to the technical field of artificial intelligence. Training data comprising drunk driving data and non-drunk driving data are acquired from a preset database; training samples are constructed from the training data, with the drunk driving data as positive samples and the non-drunk driving data as negative samples; sample equalization processing is performed on the training samples to obtain equalized samples; word segmentation processing is performed on the equalized samples and the word frequencies of the word segments are calculated; drunk driving feature word segments are determined based on the word frequencies, and the equalized samples are structured based on the drunk driving feature word segments to obtain structured samples; a preset gradient decision tree model is trained with the structured samples to obtain a drunk driving risk identification model; and when a drunk driving risk identification instruction is received, the target data whose drunk driving risk is to be identified are acquired, imported into the drunk driving risk identification model, and the drunk driving risk identification result is output. By counting the word frequencies of the sample word segments, the method addresses, at the global level, problems such as escape errors and missing or confused context semantics; key drunk driving feature word segments are extracted by analyzing these word frequencies, and the samples are structured based on them, which preserves and extracts the key feature information of the training samples and significantly improves the prediction accuracy of the drunk driving risk identification model. In addition, the application balances the positive and negative samples through sample equalization processing, and, by structuring the samples and training the drunk driving risk identification model with a lightweight gradient decision tree model, simplifies the identification algorithm and the model architecture and improves the timeliness of drunk driving risk identification processing.
It is emphasized that the target data may also be stored in a blockchain node in order to further ensure the privacy and security of the target data.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
Those skilled in the art will appreciate that implementing all or part of the processes of the methods of the embodiments described above may be accomplished by way of computer readable instructions, stored on a computer readable storage medium, which when executed may comprise processes of embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a drunk driving risk identification device, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device may be specifically applied to various electronic devices.
As shown in fig. 3, the drunk driving risk identification device 300 according to the present embodiment includes:
the data acquisition module 301 is configured to acquire training data from a preset database, where the training data includes drunk driving data and non-drunk driving data;
The sample construction module 302 is configured to construct a training sample based on training data, where the training sample includes a positive sample and a negative sample, drunk driving data is used as the positive sample, and non-drunk driving data is used as the negative sample;
the sample equalization module 303 is configured to perform sample equalization processing on the training sample to obtain an equalized sample;
the word segmentation processing module 304 is used for performing word segmentation processing on the balanced sample and calculating word segmentation word frequency;
the structuring processing module 305 is configured to determine drunk driving feature word segmentation based on word segmentation word frequency, and perform structuring processing on the equalization sample based on the drunk driving feature word segmentation to obtain a structured sample;
the model training module 306 is configured to train a preset gradient decision tree model by using a structured sample to obtain a drunk driving risk identification model;
the risk recognition module 307 is configured to receive a drunk driving risk recognition instruction, obtain target data of drunk driving risk to be recognized, import the target data of drunk driving risk to be recognized into a drunk driving risk recognition model, and output a drunk driving risk recognition result.
Further, the sample equalization module 303 specifically includes:
the capacity expansion unit is used for oversampling the positive samples in the training samples to obtain capacity expansion samples;
And a first sample construction unit for constructing an equalized sample based on the positive sample, the negative sample, and the expanded sample.
Further, the sample equalization module 303 may further include:
the volume reduction unit is used for undersampling the negative samples in the training samples to obtain volume reduction samples;
and a second sample construction unit for constructing an equalized sample based on the positive sample and the volume-reduced sample.
Further, the word segmentation processing module 304 specifically includes:
the word segmentation processing unit is used for carrying out word segmentation processing on all samples in the balanced samples to obtain a plurality of sample word segments;
the sample confirming unit is used for confirming the sample where each sample word is located and obtaining a target sample;
the first statistics unit is used for counting the occurrence times of each sample word in the target sample to obtain a first word segmentation number;
the second statistical unit is used for counting the sum of the occurrence times of each sample word in the balanced sample to obtain a second word number;
and the word frequency calculation unit is used for calculating word frequency of each sample word based on the first word segmentation number and the second word segmentation number.
Further, the structural processing module 305 specifically includes:
the first word frequency acquisition unit is used for acquiring word segmentation word frequencies of sample word segmentation corresponding to the positive samples in the balanced samples to obtain positive sample word segmentation word frequencies;
The second word frequency acquisition unit is used for acquiring word segmentation word frequencies of sample word segmentation corresponding to the negative samples in the balanced samples to obtain negative sample word segmentation word frequencies;
the word frequency ratio calculation unit is used for calculating the word frequency ratio of positive and negative sample word segmentation based on the positive sample word segmentation word frequency and the negative sample word segmentation word frequency;
and the characteristic word segmentation confirming unit is used for determining drunk driving characteristic word segmentation based on the word frequency ratio of the positive and negative sample word segmentation.
Further, the structured processing module 305 further includes:
the word segmentation judging unit is used for traversing the equalization sample and judging whether the equalization sample contains drunk driving characteristic word segmentation;
the first judgment result unit is used for assigning the equalization sample containing drunk driving feature words to a preset first numerical value when the equalization sample contains drunk driving feature words;
and the second judging result unit is used for assigning the equalization sample which does not contain drunk driving feature words to a preset second value when the equalization sample does not contain drunk driving feature words.
Further, the drunk driving risk recognition device 300 further includes:
the weight calculation module is used for calculating the feature weights corresponding to the balanced samples and sequencing the feature weights obtained by calculation;
and the sample screening module is used for carrying out sample screening on the balanced samples based on the sorting result of the characteristic weights.
In this embodiment, the application discloses a drunk driving risk identification device, belonging to the technical field of artificial intelligence. Training data comprising drunk driving data and non-drunk driving data are acquired from a preset database; training samples are constructed from the training data, with the drunk driving data as positive samples and the non-drunk driving data as negative samples; sample equalization processing is performed on the training samples to obtain equalized samples; word segmentation processing is performed on the equalized samples and the word frequencies of the word segments are calculated; drunk driving feature word segments are determined based on the word frequencies, and the equalized samples are structured based on the drunk driving feature word segments to obtain structured samples; a preset gradient decision tree model is trained with the structured samples to obtain a drunk driving risk identification model; and when a drunk driving risk identification instruction is received, the target data whose drunk driving risk is to be identified are acquired, imported into the drunk driving risk identification model, and the drunk driving risk identification result is output. By counting the word frequencies of the sample word segments, the device addresses, at the global level, problems such as escape errors and missing or confused context semantics; key drunk driving feature word segments are extracted by analyzing these word frequencies, and the samples are structured based on them, which preserves and extracts the key feature information of the training samples and significantly improves the prediction accuracy of the drunk driving risk identification model. In addition, the application balances the positive and negative samples through sample equalization processing, and, by structuring the samples and training the drunk driving risk identification model with a lightweight gradient decision tree model, simplifies the identification algorithm and the model architecture and improves the timeliness of drunk driving risk identification processing.
In order to solve the above technical problems, an embodiment of the application further provides a computer device. Referring specifically to fig. 4, which is a basic structural block diagram of the computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 that are communicatively connected to each other via a system bus. It should be noted that only a computer device 4 having components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components need be implemented, and more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing in accordance with preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or other computing device. The computer device may perform human-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad, a voice control device or the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the computer device 4. Of course, the memory 41 may also include both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is generally used to store the operating system and the various application software installed on the computer device 4, such as the computer readable instructions of the drunk driving risk identification method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may, in some embodiments, be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer readable instructions stored in the memory 41 or to process data, for example to run the computer readable instructions of the drunk driving risk identification method.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
The application discloses a computer device, which belongs to the technical field of artificial intelligence. Training data are acquired from a preset database, the training data comprising drunk driving data and non-drunk driving data; training samples are constructed based on the training data, the drunk driving data serving as positive samples and the non-drunk driving data serving as negative samples; sample equalization processing is performed on the training samples to obtain equalized samples; word segmentation processing is performed on the equalized samples, and word segmentation word frequencies are calculated; drunk driving feature word segmentations are determined based on the word frequencies, and the equalized samples are structured based on these feature word segmentations to obtain structured samples; a preset gradient decision tree model is trained with the structured samples to obtain a drunk driving risk identification model; and, upon receipt of a drunk driving risk identification instruction, the target data of the drunk driving risk to be identified are acquired, imported into the drunk driving risk identification model, and a drunk driving risk identification result is output. By counting sample word segmentation word frequencies at the global level, the method well alleviates problems such as escape errors, missing contextual semantics and confusion; key drunk driving feature word segmentations are extracted by analysing the sample word frequencies, and the samples are structured based on them, thereby preserving and extracting the key feature information of the training samples and significantly improving the prediction accuracy of the drunk driving risk identification model. In addition, the application balances the positive and negative samples through sample equalization processing and, by structuring the samples and adopting a lightweight gradient decision tree model, trains the drunk driving risk identification model, which simplifies both the recognition algorithm and the model architecture and improves the timeliness of drunk driving risk identification.
The present application also provides another embodiment, namely a computer readable storage medium storing computer readable instructions that are executable by at least one processor, so that the at least one processor performs the steps of the drunk driving risk identification method described above.
The application discloses a storage medium, which belongs to the technical field of artificial intelligence. Training data are acquired from a preset database, the training data comprising drunk driving data and non-drunk driving data; training samples are constructed based on the training data, the drunk driving data serving as positive samples and the non-drunk driving data serving as negative samples; sample equalization processing is performed on the training samples to obtain equalized samples; word segmentation processing is performed on the equalized samples, and word segmentation word frequencies are calculated; drunk driving feature word segmentations are determined based on the word frequencies, and the equalized samples are structured based on these feature word segmentations to obtain structured samples; a preset gradient decision tree model is trained with the structured samples to obtain a drunk driving risk identification model; and, upon receipt of a drunk driving risk identification instruction, the target data of the drunk driving risk to be identified are acquired, imported into the drunk driving risk identification model, and a drunk driving risk identification result is output. By counting sample word segmentation word frequencies at the global level, the method well alleviates problems such as escape errors, missing contextual semantics and confusion; key drunk driving feature word segmentations are extracted by analysing the sample word frequencies, and the samples are structured based on them, thereby preserving and extracting the key feature information of the training samples and significantly improving the prediction accuracy of the drunk driving risk identification model. In addition, the application balances the positive and negative samples through sample equalization processing and, by structuring the samples and adopting a lightweight gradient decision tree model, trains the drunk driving risk identification model, which simplifies both the recognition algorithm and the model architecture and improves the timeliness of drunk driving risk identification.
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) that includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the method according to the embodiments of the present application.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It is apparent that the above-described embodiments are only some, rather than all, of the embodiments of the present application; the preferred embodiments are shown in the drawings, which do not limit the scope of the claims. The application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features. All equivalent structures made using the contents of the specification and the drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the application.

Claims (10)

1. A drunk driving risk identification method, characterized by comprising the following steps:
acquiring training data from a preset database, wherein the training data comprises drunk driving data and non-drunk driving data;
constructing a training sample based on the training data, wherein the training sample comprises a positive sample and a negative sample, the drunk driving data is used as the positive sample, and the non-drunk driving data is used as the negative sample;
performing sample equalization processing on the training samples to obtain equalized samples;
performing word segmentation processing on the equalized sample, and calculating word segmentation word frequency;
determining drunk driving feature word segmentation based on the word segmentation word frequency, and performing structuring processing on the equalized sample based on the drunk driving feature word segmentation to obtain a structured sample;
training a preset gradient decision tree model by using the structured sample to obtain a drunk driving risk identification model;
and receiving a drunk driving risk identification instruction, acquiring target data of drunk driving risk to be identified, importing the target data of drunk driving risk to be identified into the drunk driving risk identification model, and outputting drunk driving risk identification results.
2. The drunk driving risk identification method of claim 1, wherein the performing sample equalization processing on the training samples to obtain equalized samples specifically comprises:
performing oversampling on the positive samples in the training samples to obtain an expanded sample;
and constructing the equalized sample based on the positive sample, the negative sample and the expanded sample.
3. The drunk driving risk identification method of claim 1, wherein the performing sample equalization processing on the training samples to obtain equalized samples specifically comprises:
performing undersampling on the negative samples in the training samples to obtain a reduced sample;
and constructing the equalized sample based on the positive sample and the reduced sample.
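By way of illustration, the resampling in claims 2 and 3 could be realized as plain random oversampling of the positive class and random undersampling of the negative class; the claims do not mandate a specific resampling technique, so the sketch below is only one simple reading.

```python
# One simple reading of claims 2 and 3: random oversampling of positives and
# random undersampling of negatives (assumes negatives outnumber positives).
import numpy as np

def oversample_positive(pos, neg, rng):
    """Claim 2 reading: expand the positive sample to the size of the negative sample."""
    idx = rng.integers(0, len(pos), size=len(neg) - len(pos))
    return np.concatenate([pos, pos[idx]]), neg          # originals plus random duplicates

def undersample_negative(pos, neg, rng):
    """Claim 3 reading: shrink the negative sample to the size of the positive sample."""
    idx = rng.choice(len(neg), size=len(pos), replace=False)
    return pos, neg[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = np.arange(10)          # stand-in positive (drunk driving) samples
    neg = np.arange(100, 200)    # stand-in negative samples
    p1, n1 = oversample_positive(pos, neg, rng)
    p2, n2 = undersample_negative(pos, neg, rng)
    print(len(p1), len(n1), len(p2), len(n2))            # 100 100 10 10
```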
4. The drunk driving risk identification method of claim 1, wherein the performing word segmentation processing on the equalized sample and calculating the word segmentation word frequency specifically comprises:
performing word segmentation processing on all samples in the equalized sample to obtain a plurality of sample word segmentations;
determining the sample where each sample word segmentation is located to obtain a target sample;
counting the number of occurrences of each sample word segmentation in the target sample to obtain a first word segmentation number;
counting the total number of occurrences of each sample word segmentation in the equalized sample to obtain a second word segmentation number;
and calculating the word segmentation word frequency of each sample word segmentation based on the first word segmentation number and the second word segmentation number.
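A sketch of the word-frequency computation in claim 4 is given below, under one plausible reading in which the word segmentation word frequency of a token is its first word segmentation number divided by its second word segmentation number; whitespace tokenization stands in for a real Chinese word segmenter such as jieba.

```python
# Assumed reading of claim 4: for each token, word frequency = occurrences within its
# own sample (first number) / total occurrences across all equalized samples (second number).
from collections import Counter

def word_frequencies(samples):
    corpus_counts = Counter(tok for text in samples for tok in text.split())   # second number
    freqs = []
    for i, text in enumerate(samples):
        sample_counts = Counter(text.split())                                  # first number
        for tok, first in sample_counts.items():
            freqs.append((i, tok, first / corpus_counts[tok]))
    return freqs

if __name__ == "__main__":
    samples = ["driver drunk alcohol test failed", "vehicle scratch no alcohol involved"]
    for sample_idx, token, freq in word_frequencies(samples):
        print(sample_idx, token, round(freq, 2))
```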
5. The drunk driving risk identification method of claim 1, wherein the determining drunk driving feature word segmentation based on the word segmentation word frequency specifically comprises:
acquiring the word segmentation word frequency of the sample word segmentation corresponding to the positive sample in the equalized sample to obtain a positive sample word segmentation word frequency;
acquiring the word segmentation word frequency of the sample word segmentation corresponding to the negative sample in the equalized sample to obtain a negative sample word segmentation word frequency;
calculating a word frequency ratio of the positive and negative sample word segmentations based on the positive sample word segmentation word frequency and the negative sample word segmentation word frequency;
and determining the drunk driving feature word segmentation based on the word frequency ratio of the positive and negative sample word segmentations.
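For claim 5, one assumed realization compares how often each token occurs in positive versus negative samples and keeps tokens whose positive-to-negative ratio exceeds a threshold; the smoothing constant, threshold and example texts below are illustrative choices, not values taken from the claim.

```python
# Assumed reading of claim 5: keep tokens whose positive/negative frequency ratio
# exceeds a threshold as drunk-driving feature words.
from collections import Counter

def select_feature_words(pos_samples, neg_samples, threshold=3.0, smooth=1e-6):
    pos_freq = Counter(tok for text in pos_samples for tok in text.split())
    neg_freq = Counter(tok for text in neg_samples for tok in text.split())
    feature_words = []
    for tok, p in pos_freq.items():
        ratio = p / (neg_freq.get(tok, 0) + smooth)      # positive/negative word-frequency ratio
        if ratio >= threshold:
            feature_words.append(tok)
    return feature_words

if __name__ == "__main__":
    pos = ["drunk driving alcohol breath test", "alcohol smell drunk driver"]
    neg = ["normal driving rear end collision", "parking scrape no injury"]
    print(select_feature_words(pos, neg))    # tokens concentrated in the positive samples
```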
6. The drunk driving risk identification method of claim 1, wherein the performing structuring processing on the equalized sample based on the drunk driving feature word segmentation to obtain a structured sample specifically comprises:
traversing the equalized sample, and judging whether the equalized sample contains the drunk driving feature word segmentation;
if the equalized sample contains the drunk driving feature word segmentation, assigning the equalized sample containing the drunk driving feature word segmentation as a preset first value;
and if the equalized sample does not contain the drunk driving feature word segmentation, assigning the equalized sample that does not contain the drunk driving feature word segmentation as a preset second value.
7. The drunk driving risk identification method according to any one of claims 1 to 6, wherein before the determining drunk driving feature word segmentation based on the word segmentation word frequency and performing structuring processing on the equalized sample based on the drunk driving feature word segmentation to obtain a structured sample, the method further comprises:
calculating feature weights corresponding to the equalized samples, and ranking the calculated feature weights;
and performing sample screening on the equalized samples based on the ranking result of the feature weights.
8. A drunk driving risk identification device, characterized by comprising:
the data acquisition module is used for acquiring training data from a preset database, wherein the training data comprises drunk driving data and non-drunk driving data;
the sample construction module is used for constructing training samples based on the training data, wherein the training samples comprise positive samples and negative samples, the drunk driving data are used as positive samples, and the non-drunk driving data are used as negative samples;
the sample equalization module is used for carrying out sample equalization processing on the training samples to obtain equalized samples;
the word segmentation processing module is used for carrying out word segmentation processing on the balanced sample and calculating word segmentation word frequency;
the structuring processing module is used for determining drunk driving feature word segmentation based on the word segmentation word frequency, and carrying out structuring processing on the balanced sample based on the drunk driving feature word segmentation to obtain a structuring sample;
the model training module is used for training a preset gradient decision tree model by utilizing the structural sample to obtain a drunk driving risk identification model;
the risk identification module is used for receiving drunk driving risk identification instructions, acquiring target data of drunk driving risk to be identified, importing the target data of drunk driving risk to be identified into the drunk driving risk identification model, and outputting drunk driving risk identification results.
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which when executed by the processor implement the steps of the drunk driving risk identification method of any one of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the drunk driving risk identification method according to any of claims 1 to 7.
CN202310389078.6A 2023-04-03 2023-04-03 Drunk driving risk identification method and device, computer equipment and storage medium Pending CN116611432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310389078.6A CN116611432A (en) 2023-04-03 2023-04-03 Drunk driving risk identification method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116611432A true CN116611432A (en) 2023-08-18

Family

ID=87682559

Country Status (1)

Country Link
CN (1) CN116611432A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination