CN114861661A - Entity identification method, device, equipment and storage medium - Google Patents


Info

Publication number
CN114861661A
Authority
CN
China
Prior art keywords
data
illegal
sequence
clue
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110077227.6A
Other languages
Chinese (zh)
Inventor
贺敏
王秀文
董琳
郭富民
杨菁林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN202110077227.6A
Publication of CN114861661A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 - Finance; Insurance; Tax strategies; Processing of corporate or income taxes

Abstract

The application relates to an entity identification method, apparatus, device and storage medium. The method comprises: obtaining illegal fundraising clue data; determining a word vector sequence corresponding to the illegal fundraising clue data; performing inference on the word vector sequence using a pre-trained BiLSTM-CRF model to obtain a label sequence corresponding to the illegal fundraising clue data; and extracting a target label belonging to the entity labels from the label sequence, and taking the data corresponding to the target label as an illegal fundraising clue entity in the illegal fundraising clue data. The technical solution of the application thereby identifies illegal fundraising clue entities automatically, which both raises identification efficiency and enables real-time identification.

Description

Entity identification method, device, equipment and storage medium
Technical Field
The present application relates to the field of computers, and in particular, to a method, an apparatus, a device, and a storage medium for entity identification.
Background
At present, internet-integrated industries, represented by internet finance, are developing vigorously and producing good economic and social benefits; at the same time, the security risks that accompany the rapid development of internet finance cannot be ignored.
Illegal fundraising, a typical kind of illegal case in internet finance, has become a focus of regulation. Currently, the extraction of illegal fundraising clue entities is an important component of an illegal fundraising monitoring platform. An illegal fundraising clue entity is the name of the enterprise or platform where the illegal fundraising occurs.
In the related art, illegal fundraising clue entities are identified from illegal fundraising clue data manually; manual identification is not only inefficient but also difficult to perform in real time.
Disclosure of Invention
The present application provides an entity identification method, apparatus, device and storage medium, to solve the problem that manual identification is inefficient and difficult to perform in real time.
In a first aspect, an entity identification method is provided, including:
obtaining illegal fundraising clue data;
determining a word vector sequence corresponding to the illegal fundraising clue data;
performing inference on the word vector sequence using a pre-trained BiLSTM-CRF model to obtain a label sequence corresponding to the illegal fundraising clue data;
and extracting a target label belonging to the entity labels from the label sequence, and taking the data corresponding to the target label as an illegal fundraising clue entity in the illegal fundraising clue data.
Optionally, determining the word vector sequence corresponding to the illegal fundraising clue data includes:
performing word segmentation on the illegal fundraising clue data to obtain at least one text character;
obtaining a word vector of each text character of the at least one text character;
and generating the word vector sequence from the word vectors of the text characters.
Optionally, obtaining the word vector of each text character of the at least one text character includes:
performing character vector mapping on each text character of the at least one text character using a character-level word vector model to obtain the word vector.
Optionally, the BiLSTM-CRF model includes a bidirectional LSTM layer and a CRF layer;
performing inference on the word vector sequence using the pre-trained BiLSTM-CRF model to obtain the label sequence corresponding to the illegal fundraising clue data includes:
inputting the word vector sequence into the bidirectional LSTM layer to obtain a sentence feature matrix of the illegal fundraising clue data;
and performing inference on the sentence feature matrix using the CRF layer to obtain the label sequence corresponding to the illegal fundraising clue data.
Optionally, performing inference on the sentence feature matrix using the CRF layer to obtain the label sequence corresponding to the illegal fundraising clue data includes:
performing label decoding on the sentence feature matrix to obtain at least one label sequence;
determining, for each label sequence of the at least one label sequence, a probability that the illegal fundraising clue data corresponds to that label sequence;
and taking the label sequence with the highest probability among the at least one label sequence as the label sequence corresponding to the illegal fundraising clue data.
Optionally, determining the probability that the illegal fundraising clue data corresponds to the label sequence includes:
determining a prediction probability that the illegal fundraising clue data corresponds to the label sequence;
and normalizing the prediction probability to obtain the probability.
Optionally, the bidirectional LSTM layer includes a forward LSTM layer and a backward LSTM layer;
inputting the word vector sequence into the bidirectional LSTM layer to obtain the sentence feature matrix of the illegal fundraising clue data includes:
inputting the word vector sequence into the forward LSTM to obtain a first hidden state sequence;
inputting the word vector sequence into the backward LSTM to obtain a second hidden state sequence;
splicing the first hidden state sequence and the second hidden state sequence to obtain a complete hidden state sequence;
and performing dimension mapping on the complete hidden state sequence to obtain the sentence feature matrix.
Optionally, obtaining the illegal fundraising clue data includes:
acquiring M pieces of original data;
screening N pieces of suspected clue data from the M pieces of original data;
and, for each of the N pieces of suspected clue data, determining the category of the suspected clue data using a preset classifier, and determining the suspected clue data to be illegal fundraising clue data when the category is clue information.
Optionally, the preset classifier includes a first classifier, a second classifier and a third classifier;
determining the category of the suspected clue data using the preset classifier includes:
determining a first category of the suspected clue data with the first classifier, a second category of the suspected clue data with the second classifier, and a third category of the suspected clue data with the third classifier;
counting, among the first category, the second category and the third category, a first number of clue-information results and a second number of non-clue-information results;
when the first number is larger than the second number, determining the category of the suspected clue data to be clue information;
and when the first number is smaller than the second number, determining the category of the suspected clue data to be non-clue information.
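The three-classifier decision described above amounts to a majority vote. The following is a minimal sketch of that vote only; the three keyword-rule functions `clf_a`, `clf_b` and `clf_c` are hypothetical stand-ins, since the text does not say which classifiers are used.

```python
from collections import Counter
from typing import Callable, List

CLUE = "clue"
NON_CLUE = "non-clue"


def vote(text: str, classifiers: List[Callable[[str], str]]) -> str:
    """Return the category predicted by the majority of the classifiers."""
    counts = Counter(clf(text) for clf in classifiers)
    # With three binary classifiers the two counts can never be equal.
    return CLUE if counts[CLUE] > counts[NON_CLUE] else NON_CLUE


# Hypothetical stand-in classifiers (simple keyword rules), for illustration only.
def clf_a(text: str) -> str:
    return CLUE if "guaranteed return" in text else NON_CLUE


def clf_b(text: str) -> str:
    return CLUE if "invest" in text else NON_CLUE


def clf_c(text: str) -> str:
    return CLUE if "%" in text else NON_CLUE


category = vote("invest now, guaranteed return of 30%", [clf_a, clf_b, clf_c])
```

Using an odd number of binary classifiers is what makes the "larger than" and "smaller than" cases exhaustive.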
In a second aspect, an entity identification apparatus is provided, including:
an obtaining unit, configured to obtain illegal fundraising clue data;
a determining unit, configured to determine a word vector sequence corresponding to the illegal fundraising clue data;
an inference unit, configured to perform inference on the word vector sequence using a pre-trained BiLSTM-CRF model to obtain a label sequence corresponding to the illegal fundraising clue data;
and an identification unit, configured to extract a target label belonging to the entity labels from the label sequence and take the data corresponding to the target label as an illegal fundraising clue entity in the illegal fundraising clue data.
In a third aspect, an electronic device is provided, including a processor, a memory and a communication bus, wherein the processor and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
the processor is configured to execute the program stored in the memory to implement the entity identification method according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the entity identification method of the first aspect.
Compared with the prior art, the technical solutions provided by the embodiments of the present application have the following advantages: illegal fundraising clue data is obtained; a word vector sequence corresponding to the illegal fundraising clue data is determined; inference is performed on the word vector sequence using a pre-trained BiLSTM-CRF model to obtain a label sequence corresponding to the illegal fundraising clue data; a target label belonging to the entity labels is extracted from the label sequence; and the data corresponding to the target label is taken as an illegal fundraising clue entity in the illegal fundraising clue data. The technical solution of the present application thereby identifies illegal fundraising clue entities automatically, which both raises identification efficiency and enables real-time identification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below; it is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flowchart of an entity identification method in an embodiment of the present application;
FIG. 2 is a schematic flowchart of another entity identification method in an embodiment of the present application;
FIG. 3 is a schematic flowchart of another entity identification method in an embodiment of the present application;
FIG. 4 is a schematic flowchart of another entity identification method in an embodiment of the present application;
FIG. 5 is a schematic flowchart of another entity identification method in an embodiment of the present application;
FIG. 6 is a schematic flowchart of another entity identification method in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an entity identification apparatus in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the related art, an illegal fundraising clue entity is identified from illegal fundraising clue data by the following process:
the illegal fundraising clue data is input into an electronic device piece by piece, and the electronic device displays each piece of illegal fundraising clue data on a display interface;
the user manually judges whether the displayed illegal fundraising clue data contains an illegal fundraising clue entity;
if it is determined that an illegal fundraising clue entity is contained, the entity is manually marked in a preset color, so as to distinguish clue entity data from non-clue entity data in the illegal fundraising clue data;
if it is determined that no illegal fundraising clue entity is contained, no marking is performed.
In the manual marking approach of the related art, illegal fundraising clue data is therefore marked by hand one piece at a time, which is inefficient and makes real-time identification difficult.
To solve the above problems in the related art, embodiments of the present application provide an entity identification method that can be applied to any electronic device. As shown in FIG. 1, the method may include the following steps:
Step 101, obtaining illegal fundraising clue data.
In this embodiment, the illegal fundraising clue data may include advertisement-type clue data and exposé-type clue data.
In practical applications, advertisement-type clue data may be data from the early and development stages of illegal fundraising; exposé-type clue data may be data from the later stages of illegal fundraising, namely the period in which the capital chain breaks and the period in which the scheme collapses.
Step 102, determining a word vector sequence corresponding to the illegal fundraising clue data.
In this embodiment, word segmentation may be performed on the illegal fundraising clue data to obtain text characters, and the word vector sequence may be obtained from the word vector of each text character.
Optionally, as shown in FIG. 2, step 102 may include the following steps:
Step 201, performing word segmentation on the illegal fundraising clue data to obtain at least one text character.
Optionally, a word segmentation tool may be used to segment the illegal fundraising clue data; for example, the jieba word segmenter may be used to segment the illegal fundraising clue data to obtain the at least one text character.
Optionally, before word segmentation, the illegal fundraising clue data may be preprocessed so that non-text characters in it do not affect the accuracy of the segmentation result.
Illustratively, the preprocessing may remove special characters (such as punctuation and special symbols), HTML tags and the like from the illegal fundraising clue data.
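The preprocessing and segmentation of step 201 can be sketched as follows. This is a minimal illustration under stated assumptions: the regular expressions are one possible choice of "special characters" and "HTML tags", the sample string is hypothetical, and a plain per-character split stands in for the jieba segmenter named in the text because the later steps operate on individual text characters.

```python
import re
from typing import List

TAG_RE = re.compile(r"<[^>]+>")   # HTML tags such as <p> or </p>
NON_TEXT_RE = re.compile(r"[\W_]+")  # punctuation, symbols, whitespace, underscore


def preprocess(raw: str) -> str:
    """Strip HTML tags first, then every remaining non-text character."""
    without_tags = TAG_RE.sub("", raw)
    return NON_TEXT_RE.sub("", without_tags)


def segment(text: str) -> List[str]:
    """Character-level segmentation: one text character per element."""
    return list(text)


# Hypothetical input; "某某" is a placeholder for a masked name.
clean = preprocess("<p>某某金融,非法集资!</p>")
chars = segment(clean)
```

Tags are removed before punctuation so that the angle brackets are still present when the tag pattern runs.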
Step 202, obtaining a word vector of each text character of the at least one text character.
Optionally, a character-level word vector model may be used to perform character vector mapping on the text character to obtain the word vector of the text character.
In this embodiment, the character-level word vector model includes, but is not limited to, the word2vec model.
The word2vec model uses a single-layer neural network to map one-hot encoded text characters to word vectors in distributed form.
The input of the word2vec model is the discrete (one-hot) representation of a character or word; the hidden layer has no activation function, that is, it is a linear unit; and the output layer gives the corresponding word vector.
Step 203, generating the word vector sequence from the word vectors of the text characters.
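Because the hidden layer is linear, multiplying a one-hot input by the weight matrix simply selects one row of that matrix. The sketch below shows this for steps 202 and 203; the vocabulary and the 2-dimensional weights are hypothetical toy values, not trained word2vec parameters.

```python
from typing import Dict, List


def one_hot(index: int, size: int) -> List[float]:
    """Discrete representation: 1.0 at the character's index, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(one_hot_vec: List[float], weights: List[List[float]]) -> List[float]:
    """Linear hidden layer: (1 x V) one-hot vector times (V x d) weight matrix."""
    dim = len(weights[0])
    return [sum(one_hot_vec[r] * weights[r][c] for r in range(len(weights)))
            for c in range(dim)]


# Hypothetical toy vocabulary and weights (illustration only).
vocab: Dict[str, int] = {"金": 0, "融": 1, "非": 2, "法": 3}
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]


def to_vector_sequence(chars: List[str]) -> List[List[float]]:
    """Step 203: one word vector per text character, in input order."""
    return [project(one_hot(vocab[ch], len(vocab)), weights) for ch in chars]


sequence = to_vector_sequence(["非", "法", "金", "融"])
```

In practice the multiplication is replaced by a direct row lookup, which gives the same vectors.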
Step 103, performing inference on the word vector sequence using a pre-trained BiLSTM-CRF model to obtain a label sequence corresponding to the illegal fundraising clue data.
The label sequence has the same dimension (length) as the word vector sequence.
For example, when the word vector sequence is n-dimensional, the label sequence is also n-dimensional.
The label sequence contains two types of labels: entity labels and non-entity labels.
Optionally, the entity labels may include the B-ORG label and the I-ORG label of the BIO labels.
The B-ORG label marks the starting position of an illegal fundraising clue entity, and the I-ORG label marks a non-starting position of an illegal fundraising clue entity.
In addition, the non-entity label in the label sequence may be the O label of the BIO labels.
The O label marks a non-entity part of the illegal fundraising clue data.
Optionally, in this embodiment, the BiLSTM-CRF model includes a bidirectional LSTM layer and a CRF layer.
The bidirectional LSTM layer extracts the sentence features of the obtained illegal fundraising clue data, that is, a sentence feature matrix; the CRF layer performs sentence-level sequence labeling to obtain the label sequence corresponding to the illegal fundraising clue data.
Specifically, when inference is performed on the word vector sequence using the pre-trained BiLSTM-CRF model, the word vector sequence may be input into the bidirectional LSTM layer to obtain the sentence feature matrix of the illegal fundraising clue data, and the CRF layer then performs inference on the sentence feature matrix to obtain the label sequence corresponding to the illegal fundraising clue data.
Optionally, as shown in FIG. 3, obtaining the label sequence corresponding to the illegal fundraising clue data may include the following steps:
Step 301, performing label decoding on the sentence feature matrix to obtain at least one label sequence.
Each label sequence contains a predicted label for each text character, and the predicted labels for the same text character may differ between label sequences.
Step 302, determining, for each label sequence of the at least one label sequence, the probability that the illegal fundraising clue data corresponds to that label sequence.
Specifically, determining the probability includes: determining a prediction probability that the illegal fundraising clue data corresponds to the label sequence; and normalizing the prediction probability to obtain the probability.
The prediction probability that the illegal fundraising clue data corresponds to the label sequence can be determined by a scoring formula (formula image BDA0002908020010000081, not reproduced in this text),
where x is the illegal fundraising clue data, y is a label sequence, and A is a transition matrix of dimension (k+2) × (k+2) whose entry for the (i-1)-th and i-th labels gives the transition between the labels of consecutive text characters; A thus represents the label transition probabilities of all text characters in the illegal fundraising clue data x.
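The formula image is not available in this text. For orientation only, the scoring and normalization commonly used in BiLSTM-CRF models take the following form; that the application uses exactly this form is an assumption. Here P denotes the n × k sentence feature matrix, A the (k+2) × (k+2) transition matrix, and y_0 and y_{n+1} hypothetical start and end labels, which would account for the two extra rows and columns of A.

```latex
\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i,\,y_i} + \sum_{i=1}^{n+1} A_{y_{i-1},\,y_i}

P(y \mid x) = \frac{\exp\bigl(\mathrm{score}(x, y)\bigr)}{\sum_{y'} \exp\bigl(\mathrm{score}(x, y')\bigr)}
```

The second line is a softmax over all candidate label sequences y', which matches the normalization step described for step 302.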
Step 303, taking the label sequence with the highest probability among the at least one label sequence as the label sequence corresponding to the illegal fundraising clue data.
Optionally, the label sequence corresponding to the illegal fundraising clue data may be selected from the at least one label sequence as the candidate with the highest probability (formula image BDA0002908020010000083, not reproduced in this text),
where the candidates range over the at least one label sequence (symbol image BDA0002908020010000084, not reproduced in this text), X is the illegal fundraising clue data, and y* is the label sequence corresponding to the illegal fundraising clue data.
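Steps 301 to 303 can be sketched by enumerating every candidate label sequence, scoring each one, normalizing the scores with a softmax, and keeping the most probable candidate. This is a minimal illustration under stated assumptions: the emission and transition scores are hypothetical toy values, start and end transitions are omitted for brevity, and exhaustive enumeration is used only because it mirrors the three steps directly (practical decoders usually use the Viterbi algorithm instead).

```python
import math
from itertools import product
from typing import Dict, List, Sequence, Tuple

LABELS = ("B-ORG", "I-ORG", "O")


def sequence_score(emissions: List[Dict[str, float]],
                   transitions: Dict[Tuple[str, str], float],
                   labels: Sequence[str]) -> float:
    """Sum of per-character label scores plus label-to-label transition scores."""
    emit = sum(emissions[i][lab] for i, lab in enumerate(labels))
    trans = sum(transitions.get(pair, 0.0) for pair in zip(labels, labels[1:]))
    return emit + trans


def decode(emissions: List[Dict[str, float]],
           transitions: Dict[Tuple[str, str], float]) -> Tuple[Tuple[str, ...], float]:
    # Step 301: every possible label sequence is a candidate.
    candidates = list(product(LABELS, repeat=len(emissions)))
    # Step 302: score each candidate, then normalize with a softmax.
    scores = [sequence_score(emissions, transitions, c) for c in candidates]
    norm = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / norm for s in scores]
    # Step 303: keep the candidate with the highest probability.
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best], probs[best]


# Hypothetical scores for a three-character input (illustration only).
emissions = [
    {"B-ORG": 2.0, "I-ORG": 0.5, "O": 0.1},
    {"B-ORG": 0.2, "I-ORG": 1.5, "O": 0.3},
    {"B-ORG": 0.1, "I-ORG": 0.2, "O": 2.0},
]
transitions = {("B-ORG", "I-ORG"): 1.0, ("O", "I-ORG"): -5.0}

best_labels, best_prob = decode(emissions, transitions)
```

The negative score for the O to I-ORG transition illustrates how the transition matrix can discourage label sequences in which an entity continues without having started.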
Optionally, the bidirectional LSTM layer includes a forward LSTM layer and a backward LSTM layer, and as shown in FIG. 4, obtaining the sentence feature matrix may include the following steps:
Step 401, inputting the word vector sequence into the forward LSTM to obtain a first hidden state sequence.
Illustratively, the word vector sequence may be $(x_1, x_2, \ldots, x_n)$, and the first hidden state sequence may be $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$.
Step 402, inputting the word vector sequence into the backward LSTM to obtain a second hidden state sequence.
Illustratively, the second hidden state sequence may be $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$.
Step 403, splicing the first hidden state sequence and the second hidden state sequence to obtain a complete hidden state sequence.
Illustratively, the complete hidden state sequence may be $(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$.
Step 404, performing dimension mapping on the complete hidden state sequence to obtain the sentence feature matrix.
Optionally, the complete hidden state sequence may be mapped from m dimensions to k dimensions, giving the sentence feature matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$.
In short, the sentence feature matrix may be obtained as follows:
the word vector sequence $(x_1, x_2, \ldots, x_n)$ is used as the input at each time step of the bidirectional LSTM; the hidden state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$ output by the forward LSTM and the hidden state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$ output by the backward LSTM are spliced position by position, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden state sequence $(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$;
a linear layer then maps the complete hidden state sequence from m dimensions to k dimensions, giving the sentence feature matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$.
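The splicing and dimension-mapping steps can be sketched as below. The LSTM cells themselves are not shown: the forward and backward hidden states, the weight matrix and the bias are hypothetical toy values chosen so that the shapes (n = 2 positions, m = 4, k = 3) are easy to follow.

```python
from typing import List

Vector = List[float]


def concat_states(forward: List[Vector], backward: List[Vector]) -> List[Vector]:
    """Position-wise splice: h_t = [forward_t ; backward_t]."""
    return [f + b for f, b in zip(forward, backward)]


def linear(states: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Map each m-dimensional state to k dimensions: p_t = W h_t + b."""
    return [[sum(w * h for w, h in zip(row, state)) + b
             for row, b in zip(weight, bias)]
            for state in states]


# Hypothetical hidden states: n = 2 positions, 2 dimensions per direction (m = 4).
forward = [[1.0, 0.0], [0.0, 1.0]]
backward = [[0.5, 0.5], [1.0, 1.0]]
# Hypothetical k x m weight matrix (k = 3) and bias.
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

hidden = concat_states(forward, backward)   # n x m
features = linear(hidden, weight, bias)     # n x k sentence feature matrix
```

Each row of `features` holds one score per label for the corresponding text character, which is what the CRF layer consumes.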
Step 104, extracting a target label belonging to the entity labels from the label sequence, and taking the data corresponding to the target label as an illegal fundraising clue entity in the illegal fundraising clue data.
Taking the illegal fundraising clue data "**金融非法集资诈骗" ("** Finance illegal fundraising fraud", where "**" masks the name) as an example, the corresponding label sequence, with one label per character, may be (B-ORG, I-ORG, I-ORG, I-ORG, O, O, O, O, O, O), that is:
* * 金 融 非 法 集 资 诈 骗
B-ORG I-ORG I-ORG I-ORG O O O O O O
Since B-ORG and I-ORG are entity labels, the corresponding "**金融" ("** Finance") is taken as the illegal fundraising clue entity of the illegal fundraising clue data "**金融非法集资诈骗".
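Step 104 can be sketched as a single pass over the characters and their labels. The function below is an illustrative sketch, not the application's implementation; the ten-character example string is a reconstruction assumed from the ten labels shown above, with "**" standing for the masked name.

```python
from typing import List


def extract_entities(chars: List[str], labels: List[str]) -> List[str]:
    """Collect each run that starts at B-ORG and continues through I-ORG."""
    entities: List[str] = []
    current: List[str] = []
    for ch, lab in zip(chars, labels):
        if lab == "B-ORG":
            if current:                      # close a directly preceding entity
                entities.append("".join(current))
            current = [ch]
        elif lab == "I-ORG" and current:     # continue the open entity
            current.append(ch)
        else:                                # O label, or I-ORG with no open entity
            if current:
                entities.append("".join(current))
            current = []
    if current:                              # entity that runs to the end of the text
        entities.append("".join(current))
    return entities


# Assumed reconstruction of the example text (see lead-in).
chars = list("**金融非法集资诈骗")
labels = ["B-ORG", "I-ORG", "I-ORG", "I-ORG"] + ["O"] * 6
entities = extract_entities(chars, labels)
```

An I-ORG label that does not follow a B-ORG is ignored here; other policies, such as treating it as the start of an entity, are equally possible.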
According to the technical solutions provided by the embodiments of the present application, illegal fundraising clue data is obtained; a word vector sequence corresponding to the illegal fundraising clue data is determined; inference is performed on the word vector sequence using a pre-trained BiLSTM-CRF model to obtain a label sequence corresponding to the illegal fundraising clue data; a target label belonging to the entity labels is extracted from the label sequence; and the data corresponding to the target label is taken as an illegal fundraising clue entity in the illegal fundraising clue data. The technical solution of the present application thereby identifies illegal fundraising clue entities automatically, which both raises identification efficiency and enables real-time identification.
The training process of the BiLSTM-CRF model is described below.
First, each training text in the training dataset is labeled.
Optionally, each training text may be labeled with the BIO label set.
The BIO label set includes the B-ORG label, the I-ORG label and the O label.
The B-ORG label marks the starting position of an illegal fundraising clue entity, the I-ORG label marks a non-starting position of an illegal fundraising clue entity, and the O label is the non-entity label.
The B-ORG label and the I-ORG label are entity labels.
Taking the training text "**金融非法集资诈骗" ("** Finance illegal fundraising fraud", where "**" masks the name) as an example, after the training text is labeled with the BIO label set, the corresponding label sequence, with one label per character, may be (B-ORG, I-ORG, I-ORG, I-ORG, O, O, O, O, O, O), that is:
* * 金 融 非 法 集 资 诈 骗
B-ORG I-ORG I-ORG I-ORG O O O O O O
Optionally, each training text may be labeled as follows:
performing word segmentation on the training text to obtain at least one text character of the training text;
and labeling each text character of the at least one text character with the BIO label set.
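The per-character labeling can be sketched as below, assuming the entity string of each training text has already been identified by an annotator; the helper name `bio_labels` and the example string (a reconstruction with "**" as the masked name) are assumptions for illustration.

```python
from typing import List


def bio_labels(text: str, entity: str) -> List[str]:
    """Label every character of text: B-ORG/I-ORG inside the entity, O elsewhere."""
    labels = ["O"] * len(text)
    start = text.find(entity) if entity else -1
    while start != -1:
        labels[start] = "B-ORG"                       # first character of the entity
        for i in range(start + 1, start + len(entity)):
            labels[i] = "I-ORG"                       # remaining entity characters
        start = text.find(entity, start + len(entity))
    return labels


# Assumed reconstruction of the example training text (see lead-in).
training_labels = bio_labels("**金融非法集资诈骗", "**金融")
```

Every occurrence of the entity in the text is labeled, so a name that is repeated in one training text yields several B-ORG runs.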
Optionally, a word segmentation tool may be used to segment the illegal fundraising clue data; for example, the jieba word segmenter may be used to segment the illegal fundraising clue data to obtain the at least one text character.
For example, taking the training text "**金融非法集资诈骗" as an example, the text characters obtained may be: *, *, 金, 融, 非, 法, 集, 资, 诈, 骗.
Optionally, before word segmentation, the illegal fundraising clue data may be preprocessed so that non-text characters in it do not affect the accuracy of the segmentation result.
Illustratively, the preprocessing may remove special characters (such as punctuation and special symbols), HTML tags and the like from the illegal fundraising clue data.
Secondly, character vector mapping is performed on the training text using the pre-trained character-level word vector model, generating a word vector for each text character in the training text. The mapped word vectors serve as the input of the subsequent BiLSTM-CRF neural network.
Optionally, the character-level word vector model includes, but is not limited to, the word2vec model.
The word2vec model uses a single-layer neural network to map one-hot encoded text character vectors to word vectors in distributed form.
Thirdly, the word vector sequence $(x_1, x_2, \ldots, x_n)$ corresponding to the training text is used as the input at each time step of the bidirectional LSTM; the hidden state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$ output by the forward LSTM and the hidden state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$ output by the backward LSTM are spliced position by position, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden state sequence $(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$. This output is then passed to a linear layer, which maps the hidden state vectors from m dimensions to k dimensions, yielding the automatically extracted sentence feature matrix, denoted $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$.
Fourthly, sentence-level sequence labeling is performed in the CRF layer using the sentence feature matrix.
The parameter of the CRF layer is a (k+2) × (k+2) matrix A. For a label sequence $y = (y_1, y_2, \ldots, y_n)$ whose length equals that of the training text, the model's score for labeling the training text with y can be computed. Normalizing with the softmax function gives the probability of the correct label sequence y of the training text. Taking the logarithm of the probability of the correct label sequence y gives the log-likelihood function to be maximized for the BiLSTM-CRF model, that is, the loss function, with which the parameters of the transition probability matrix A and of the BiLSTM are trained.
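Written out, the objective described above would take the following form, assuming the scoring function commonly used for BiLSTM-CRF models (emission scores from P plus transition scores from A); the symbols score and the sum over all candidate label sequences y' are assumptions, since the application's own formula is not reproduced in this text.

```latex
\log P(y \mid x) = \mathrm{score}(x, y) - \log \sum_{y'} \exp\bigl(\mathrm{score}(x, y')\bigr)

\mathcal{L} = -\log P(y \mid x)
```

Maximizing the log-likelihood in the first line is equivalent to minimizing the negative log-likelihood in the second, which is the form usually passed to a gradient-based optimizer.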
Optionally, the BiLSTM in the BiLSTM-CRF model is trained with backpropagation through time (BPTT). BPTT differs from ordinary backpropagation in that parameter updates consider not only the vertical propagation of gradients between upper and lower layers but also their horizontal propagation along the sequence. After the input passes through the BiLSTM, the forward and backward hidden state results are combined to produce the output of the BiLSTM, which then serves as the input of the CRF, thereby forming the BiLSTM-CRF model.
In another embodiment of the present application, building on steps 101 to 104 of the foregoing embodiments, obtaining the illegal fundraising clue data may include, as shown in FIG. 5, steps 501 to 503:
Step 501, acquiring M pieces of original data;
where M is a positive integer.
Illustratively, the original data may be internet financial data.
Step 502, screening N pieces of suspected clue data from the M pieces of original data;
where N is a positive integer less than or equal to M.
Optionally, the suspected clue data may include suspected advertisement-type clue data and suspected exposé-type clue data.
Optionally, in this embodiment, a pre-constructed advertisement-type clue feature rule base and an explosive-type clue feature rule base may be adopted, and a fast multi-modal matching algorithm (DAT) is combined to screen M pieces of raw data to obtain N pieces of suspected clue data.
Specifically, the advertisement clue feature rule base comprises at least one advertisement clue keyword, when suspected advertisement clue data are screened from M pieces of original data by combining a rapid multi-mode matching algorithm, the original data are matched with the advertisement keywords in the advertisement clue feature rule base, and when the original data are determined to contain the advertisement keywords in the advertisement clue feature rule base, the original data are determined to be the suspected advertisement clue data.
Similarly, the explosive class clue characteristic rule base comprises at least one explosive class keyword, when suspected explosive class clue data are screened from the M pieces of original data by combining a rapid multi-mode matching algorithm, the original data are matched with the explosive class keywords in the explosive class clue characteristic rule base, and when the original data are determined to contain the explosive class keywords in the explosive class clue rule base, the original data are determined to be the suspected explosive class clue data.
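The rule-base screening can be sketched as follows. This is a simplified illustration only: plain substring matching is swapped in for the fast multi-pattern matcher (DAT presumably denotes a double-array trie, which is an assumption), and the rule-base names, keywords, and sample records are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def screen_suspected_clues(
    records: Iterable[str], rule_bases: Dict[str, Set[str]]
) -> List[Tuple[str, List[str]]]:
    """Return each record that hits at least one rule base, with the bases it hit.

    Substring matching replaces the multi-pattern matcher for brevity; it gives
    the same matches but scans the text once per keyword.
    """
    suspected = []
    for record in records:
        hits = [
            name
            for name, keywords in rule_bases.items()
            if any(keyword in record for keyword in keywords)
        ]
        if hits:
            suspected.append((record, hits))
    return suspected


# Hypothetical rule bases and original data.
rule_bases = {
    "advertisement": {"guaranteed return", "high yield"},
    "tip-off": {"report", "scam"},
}
records = [
    "Join now for a guaranteed return",
    "Weather is nice today",
    "I want to report this platform as a scam",
]
suspected = screen_suspected_clues(records, rule_bases)
```

Here M is 3 and N is 2: the second record matches no keyword and is dropped before classification.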
Optionally, the advertisement-type clue feature rule base and the tip-off-type clue feature rule base may be built as follows:
acquiring a first quantity of advertisement-type clue texts and a second quantity of tip-off-type clue texts;
Optionally, an advertisement-type clue text may be an advertisement-type article, and a tip-off-type clue text may be a tip-off-type article.
In practice, articles extracted from Internet data can be manually labeled to obtain the advertisement-type articles and the tip-off-type articles.
The relationship between the first quantity and the second quantity is not specifically limited in this embodiment; that is, the first quantity and the second quantity may be the same or different.
When the first quantity and the second quantity are different, the first quantity may be greater than or smaller than the second quantity.
extracting advertisement-type keywords from each advertisement-type clue text and tip-off-type keywords from each tip-off-type clue text by using the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm;
Optionally, before the advertisement-type keywords (or tip-off-type keywords) are extracted, the advertisement-type clue text (or tip-off-type clue text) may be preprocessed.
Illustratively, special characters (such as punctuation and special symbols) may be removed from the advertisement-type clue text (or tip-off-type clue text), the text may be segmented into words, and stop words may be removed.
integrating and de-duplicating all advertisement-type keywords to obtain the advertisement-type clue feature rule, and integrating and de-duplicating all tip-off-type keywords to obtain the tip-off-type clue feature rule.
The process of generating the advertisement-type clue feature rule from all the advertisement-type keywords may be:
calculating the TF-IDF value of each advertisement-type keyword, sorting all the advertisement-type keywords in descending order of TF-IDF value, taking the first S keywords from the sorted keywords, de-duplicating these S keywords, and generating the advertisement-type clue feature rule from the de-duplicated keywords.
The process of generating the tip-off-type clue feature rule from all the tip-off-type keywords may be:
calculating the TF-IDF value of each tip-off-type keyword, sorting all the tip-off-type keywords in descending order of TF-IDF value, taking the first S keywords from the sorted keywords, de-duplicating these S keywords, and generating the tip-off-type clue feature rule from the de-duplicated keywords.
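The rule-generation procedure (score, sort in descending order, take the first S, de-duplicate) can be sketched in plain Python. The source does not state which TF-IDF formula is used, so the smoothed variant below is an assumption, as are the function names and the tiny pre-segmented sample texts.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def tfidf_scores(docs: Sequence[Sequence[str]]) -> List[Tuple[str, float]]:
    """Score every (document, term) pair with TF-IDF.

    Assumed variant: tf = count / document length,
    idf = ln((1 + N) / (1 + df)) + 1.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scored = []
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            scored.append((term, tf * idf))
    return scored


def build_clue_rule(docs: Sequence[Sequence[str]], top_s: int) -> List[str]:
    """Sort keywords by TF-IDF (descending), keep the first S, then de-duplicate."""
    ranked = sorted(tfidf_scores(docs), key=lambda pair: pair[1], reverse=True)
    rule: List[str] = []
    for term, _ in ranked[:top_s]:
        if term not in rule:  # the same keyword may come from several texts
            rule.append(term)
    return rule


# Hypothetical clue texts, already segmented and stripped of stop words.
texts = [
    ["high", "return", "return", "invest"],
    ["return", "join", "now"],
]
rule = build_clue_rule(texts, top_s=6)
```

Because de-duplication happens after the cut to S keywords, the resulting rule may contain fewer than S entries.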
Step 503, for each piece of suspected clue data among the N pieces of suspected clue data, determining the category of the suspected clue data by using a preset classifier, and determining the suspected clue data to be illegal fundraising clue data when the category is clue information.
Optionally, the preset classifier may include a first classifier, a second classifier, and a third classifier.
Illustratively, the preset classifier may comprise a naive Bayes (NB) model classifier, a support vector machine (SVM) model classifier, and a logistic regression (LR) model classifier.
Optionally, as shown in fig. 6, the process of determining the category of the suspected clue data by using the preset classifier may include the following steps:
Step 601, determining a first category of the suspected clue data by using the first classifier, determining a second category of the suspected clue data by using the second classifier, and determining a third category of the suspected clue data by using the third classifier;
wherein, the first category can be clue information or non-clue information;
the second category may be clue information or non-clue information;
the third category may be clue information or non-clue information.
Step 602, counting, among the first category, the second category and the third category, a first number of categories that are clue information and a second number of categories that are non-clue information;
Step 603, when the first number is larger than the second number, determining the category of the suspected clue data to be clue information;
Step 604, when the first number is smaller than the second number, determining the category of the suspected clue data to be non-clue information.
The method uses a multi-classifier combined voting algorithm to classify each piece of suspected clue data as clue information or non-clue information. In this embodiment, three classifiers are selected: naive Bayes, support vector machine, and logistic regression. The final result is determined by a vote over the results of these classifiers. The vote uses a hard voting mechanism: each classifier casts one vote for the category it predicts, and the category with the most votes is taken as the result of the multi-classifier combined voting algorithm. Combining the classifiers in this way draws on the strengths of each model and can improve the accuracy of clue identification.
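The hard voting of steps 601 to 604 can be sketched as below. The three rule functions are hypothetical stand-ins for trained naive Bayes, SVM, and logistic regression classifiers; only the voting logic is the point of the example.

```python
from collections import Counter
from typing import Callable, Sequence

CLUE = "clue"
NON_CLUE = "non-clue"


def hard_vote(text: str, classifiers: Sequence[Callable[[str], str]]) -> str:
    """Give each classifier one vote and return the majority category.

    With three binary classifiers a tie cannot occur; for an even number of
    classifiers this sketch resolves ties as non-clue (an assumption).
    """
    votes = Counter(classifier(text) for classifier in classifiers)
    return CLUE if votes[CLUE] > votes[NON_CLUE] else NON_CLUE


# Hypothetical stand-ins for the three trained classifiers.
def keyword_rule(text: str) -> str:
    return CLUE if "return" in text else NON_CLUE


def length_rule(text: str) -> str:
    return CLUE if len(text) > 10 else NON_CLUE


def digit_rule(text: str) -> str:
    return CLUE if any(ch.isdigit() for ch in text) else NON_CLUE


classifiers = [keyword_rule, length_rule, digit_rule]
decision = hard_vote("guaranteed 20% return", classifiers)
```

A piece of suspected clue data is kept as illegal fundraising clue data only when the vote returns the clue category.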
Based on the same concept, an embodiment of the present application provides an entity identification apparatus. For the specific implementation of the apparatus, reference may be made to the description of the method embodiment, and repeated details are omitted. As shown in fig. 7, the apparatus mainly includes:
an obtaining unit 701, configured to obtain illegal fundraising clue data;
a determining unit 702, configured to determine a word vector sequence corresponding to the illegal fundraising clue data;
an inference unit 703, configured to perform inference on the word vector sequence by using a pre-trained BiLSTM-CRF model to obtain a label sequence corresponding to the illegal fundraising clue data;
an identifying unit 704, configured to extract a target label belonging to the entity labels from the label sequence, and to use the data corresponding to the target label as an illegal fundraising clue entity in the illegal fundraising clue data.
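The extraction performed by the last unit can be sketched as follows. The sketch assumes a BIO labeling scheme (`B-` begins an entity, `I-` continues it, `O` marks non-entity positions) and a hypothetical entity type `ORG`; neither is confirmed by this section, and the function name is illustrative.

```python
from typing import List, Optional, Sequence, Tuple


def extract_entities(
    tokens: Sequence[str], labels: Sequence[str]
) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from a BIO label (tag) sequence."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Store the entity collected so far, if any, and reset the buffer.
        nonlocal current, current_type
        if current and current_type is not None:
            entities.append(("".join(current), current_type))
        current, current_type = [], None

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current and label[2:] == current_type:
            current.append(token)
        else:
            # "O" or an inconsistent "I-" label ends the current entity.
            flush()
    flush()
    return entities


# Hypothetical character-level input and predicted label sequence.
tokens = ["A", "B", "C", "x", "y"]
labels = ["B-ORG", "I-ORG", "I-ORG", "O", "O"]
entities = extract_entities(tokens, labels)
```

Tokens are joined without a separator because the sketch assumes character-level input; word-level input would need a space or another joiner.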
Based on the same concept, an embodiment of the present application further provides an electronic device, as shown in fig. 8, the electronic device mainly includes: a processor 801, a memory 802, and a communication bus 803, wherein the processor 801 and the memory 802 communicate with each other via the communication bus 803. The memory 802 stores a program executable by the processor 801, and the processor 801 executes the program stored in the memory 802, so as to implement the following steps:
obtaining illegal fundraising clue data; determining a word vector sequence corresponding to the illegal fundraising clue data; performing inference on the word vector sequence by using a pre-trained BiLSTM-CRF model to obtain a label sequence corresponding to the illegal fundraising clue data; and extracting a target label belonging to the entity labels from the label sequence, and using the data corresponding to the target label as an illegal fundraising clue entity in the illegal fundraising clue data.
The communication bus 803 mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 803 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
The memory 802 may include a random access memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may be at least one storage device located remotely from the processor 801.
The Processor 801 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc., and may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic devices, discrete gates or transistor logic devices, and discrete hardware components.
In yet another embodiment of the present application, there is also provided a computer-readable storage medium having stored therein a computer program which, when run on a computer, causes the computer to execute the entity identification method described in the above embodiment.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a network of computers, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state disks), among others.
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. An entity identification method, comprising:
obtaining illegal fundraising clue data;
determining a word vector sequence corresponding to the illegal fundraising clue data;
performing inference on the word vector sequence by using a pre-trained BiLSTM-CRF model to obtain a label sequence corresponding to the illegal fundraising clue data;
and extracting a target label belonging to the entity labels from the label sequence, and using the data corresponding to the target label as an illegal fundraising clue entity in the illegal fundraising clue data.
2. The method of claim 1, wherein determining a word vector sequence corresponding to the illegal fundraising clue data comprises:
segmenting the illegal fundraising clue data to obtain at least one text character;
obtaining a word vector of each text character in the at least one text character;
and generating the word vector sequence from the word vectors of the text characters.
3. The method of claim 2, wherein obtaining a word vector of each text character in the at least one text character comprises:
mapping each text character in the at least one text character to a vector by using a character-level word vector model to obtain the word vector.
4. The method of claim 1, wherein the BiLSTM-CRF model comprises a bi-directional LSTM layer and a CRF layer;
wherein performing inference on the word vector sequence by using a pre-trained BiLSTM-CRF model to obtain a label sequence corresponding to the illegal fundraising clue data comprises:
inputting the word vector sequence into the bidirectional LSTM layer to obtain a sentence feature matrix of the illegal fundraising clue data;
and performing inference on the sentence feature matrix by using the CRF layer to obtain the label sequence corresponding to the illegal fundraising clue data.
5. The method of claim 4, wherein performing inference on the sentence feature matrix by using the CRF layer to obtain the label sequence corresponding to the illegal fundraising clue data comprises:
performing label decoding on the sentence feature matrix to obtain at least one label sequence;
determining, for each of the at least one label sequence, a probability that the illegal fundraising clue data corresponds to the label sequence;
and taking the label sequence with the highest probability among the at least one label sequence as the label sequence corresponding to the illegal fundraising clue data.
6. The method of claim 5, wherein determining the probability that the illegal fundraising clue data corresponds to the label sequence comprises:
determining a predicted probability that the illegal fundraising clue data corresponds to the label sequence;
and normalizing the predicted probability to obtain the probability.
7. The method of claim 4, wherein the bidirectional LSTM layer comprises a forward LSTM layer and a backward LSTM layer;
wherein inputting the word vector sequence into the bidirectional LSTM layer to obtain a sentence feature matrix of the illegal fundraising clue data comprises:
inputting the word vector sequence into the forward LSTM layer to obtain a first hidden state sequence;
inputting the word vector sequence into the backward LSTM layer to obtain a second hidden state sequence;
concatenating the first hidden state sequence and the second hidden state sequence to obtain a complete hidden state sequence;
and performing dimension mapping on the complete hidden state sequence to obtain the sentence feature matrix.
8. The method of claim 1, wherein obtaining illegal fundraising clue data comprises:
acquiring M pieces of original data;
screening N pieces of suspected clue data from the M pieces of original data;
and determining, by using a preset classifier, the category of each piece of suspected clue data among the N pieces of suspected clue data, and determining the suspected clue data to be the illegal fundraising clue data when the category is clue information.
9. The method of claim 8, wherein the preset classifier comprises a first classifier, a second classifier, and a third classifier;
wherein determining the category of the suspected clue data by using the preset classifier comprises:
determining a first category of the suspected clue data with the first classifier, a second category of the suspected clue data with the second classifier, and a third category of the suspected clue data with the third classifier;
counting, among the first category, the second category and the third category, a first number of categories that are clue information and a second number of categories that are non-clue information;
when the first number is larger than the second number, determining the category of the suspected clue data to be the clue information;
and when the first number is smaller than the second number, determining the category of the suspected clue data to be the non-clue information.
10. An entity identification apparatus, comprising:
an obtaining unit, configured to obtain illegal fundraising clue data;
a determining unit, configured to determine a word vector sequence corresponding to the illegal fundraising clue data;
an inference unit, configured to perform inference on the word vector sequence by using a pre-trained BiLSTM-CRF model to obtain a label sequence corresponding to the illegal fundraising clue data;
and an identifying unit, configured to extract a target label belonging to the entity labels from the label sequence and to use the data corresponding to the target label as an illegal fundraising clue entity in the illegal fundraising clue data.
11. An electronic device, comprising: a processor, a memory and a communication bus, wherein the processor and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
the processor is configured to execute the program stored in the memory to implement the entity identification method of any one of claims 1-9.
12. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the entity identification method of any one of claims 1-9.
CN202110077227.6A 2021-01-20 2021-01-20 Entity identification method, device, equipment and storage medium Pending CN114861661A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110077227.6A CN114861661A (en) 2021-01-20 2021-01-20 Entity identification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110077227.6A CN114861661A (en) 2021-01-20 2021-01-20 Entity identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114861661A true CN114861661A (en) 2022-08-05

Family

ID=82623098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110077227.6A Pending CN114861661A (en) 2021-01-20 2021-01-20 Entity identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114861661A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination