CN113505221B - Enterprise false propaganda risk identification method, equipment and storage medium - Google Patents

Enterprise false propaganda risk identification method, equipment and storage medium

Info

Publication number
CN113505221B
CN113505221B (granted publication of application CN202010214386.1A)
Authority
CN
China
Prior art keywords
risk
texts
enterprise
text
preset
Prior art date
Legal status
Active
Application number
CN202010214386.1A
Other languages
Chinese (zh)
Other versions
CN113505221A (en)
Inventor
贺敏
张东雷
杜慧
柳力多
董琳
彭鑫
王秀文
罗引
王磊
赵菲菲
曹家
张西娜
郭富民
Current Assignee
Beijing Zhongke Wenge Technology Co ltd
Guoke Zhian Beijing Technology Co ltd
National Computer Network and Information Security Management Center
Original Assignee
Beijing Zhongke Wenge Technology Co ltd
Guoke Zhian Beijing Technology Co ltd
National Computer Network and Information Security Management Center
Priority date
Filing date
Publication date
Application filed by Beijing Zhongke Wenge Technology Co ltd, Guoke Zhian Beijing Technology Co ltd, and National Computer Network and Information Security Management Center
Priority to CN202010214386.1A
Publication of CN113505221A
Application granted
Publication of CN113505221B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G06Q30/0185 Product, service or business identity fraud
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an enterprise false propaganda risk identification method, equipment and storage medium. The method comprises the following steps: extracting suspected risk texts from a plurality of enterprise public opinion texts corresponding to a target enterprise; extracting risk features of the corresponding types from each suspected risk text to form a risk feature vector for each suspected risk text; sequentially inputting the risk feature vectors corresponding to the suspected risk texts into a pre-trained risk recognition model, so that the model recognizes each suspected risk text and determines those recognized as having a false propaganda risk as risk texts; determining a false propaganda risk intensity value for the target enterprise according to the information of all determined risk texts; and, if the false propaganda risk intensity value is greater than a preset risk threshold, determining that the target enterprise has a false propaganda risk. The method avoids the limitations of manual matching rules and improves the accuracy of false propaganda risk identification.

Description

Enterprise false propaganda risk identification method, equipment and storage medium
Technical Field
The present invention relates to the field of computer technology, and in particular to an enterprise false propaganda risk identification method, equipment and storage medium.
Background
With economic and social development, the number of financing platforms has grown explosively, and the integration of the Internet and finance has produced a series of new financing modes. Meanwhile, corresponding supervision measures are not yet sound, leading to disorder in financing markets and frequent illegal violations, which limit the healthy development of the industry and bring great risks and hidden dangers to economic construction. For example, illegal fund-raising causes great harm to society. Because the Internet has no regional boundaries, spreads information quickly, and reaches a wide audience, it poses great challenges to the prevention and disposal of illegal fund-raising. In the early stage of fund-raising, enterprises suspected of illegal fund-raising generally publish false statements through the Internet to exaggerate their strength and induce the public to invest in the products they issue. Therefore, identifying enterprise false propaganda is significant for the early judgment of illegal fund-raising risk, and can effectively help prevent, combat and dispose of illegal behaviors such as illegal fund-raising.
Currently, whether an enterprise has a false propaganda risk is mainly identified through manual rule matching. This identification mode specifically comprises: acquiring text data related to the enterprise from massive Internet data by matching enterprise keywords; querying whether false propaganda risk keywords exist in the acquired text data; and, if the text data contains false propaganda risk keywords, judging that the enterprise has a false propaganda risk, otherwise judging that it does not. However, the limitations of manually set matching rules result in low recognition accuracy that cannot satisfy practical applications, and manually set matching rules cannot keep up with a dynamically changing financial environment, so their adaptability is poor.
Disclosure of Invention
The main purpose of the embodiments of the invention is to provide an enterprise false propaganda risk identification method, equipment and storage medium, to solve the problem that identifying whether an enterprise has a false propaganda risk through manual rule matching yields low identification accuracy due to the limitations of manually set matching rules.
The embodiments of the invention solve the above technical problem through the following technical scheme:
An embodiment of the invention provides an enterprise false propaganda risk identification method, which comprises the following steps: obtaining a plurality of enterprise public opinion texts corresponding to a target enterprise from the Internet; extracting, from the plurality of enterprise public opinion texts corresponding to the target enterprise, those that include preset risk keywords as suspected risk texts; extracting risk features of the corresponding types from each suspected risk text according to preset types of risk features, and forming a risk feature vector corresponding to each suspected risk text; sequentially inputting the risk feature vectors corresponding to the suspected risk texts into a pre-trained risk recognition model, so that the model performs false propaganda risk recognition on each suspected risk text, and determining the suspected risk texts recognized as having a false propaganda risk as risk texts; determining a false propaganda risk intensity value corresponding to the target enterprise according to the information of all determined risk texts; and, if the false propaganda risk intensity value is greater than a preset risk threshold, determining that the target enterprise has a false propaganda risk.
Before extracting the enterprise public opinion texts that include the preset risk keywords, the method further comprises: extracting a plurality of enterprise false propaganda texts and a plurality of financial-field texts from the Internet; preprocessing each enterprise false propaganda text and each financial-field text; using a preset Latent Dirichlet Allocation (LDA) document topic generation model to extract a plurality of false propaganda topic keywords from the preprocessed enterprise false propaganda texts, and setting these topic keywords as initial risk keywords; extracting context information from each preprocessed financial-field text by using a preset Word2Vec model, and generating a plurality of vocabulary semantic vectors according to the context information; for each vocabulary semantic vector, if the vector similarity between the semantic vector of at least one initial risk keyword and the vocabulary semantic vector is greater than a preset vector similarity threshold, setting the word corresponding to that vocabulary semantic vector as an expanded risk keyword; and setting each initial risk keyword and each expanded risk keyword as a risk keyword.
Extracting the enterprise public opinion texts that include the preset risk keywords from the plurality of enterprise public opinion texts corresponding to the target enterprise as suspected risk texts comprises: for each enterprise public opinion text, performing clause segmentation to obtain a plurality of clauses corresponding to the text; performing similarity calculation between each clause and each of a plurality of preset risk keywords; and, if the similarity between at least one clause in the enterprise public opinion text and one of the risk keywords is greater than a preset keyword similarity threshold, determining that enterprise public opinion text to be a suspected risk text.
Before sequentially inputting the risk feature vectors corresponding to the suspected risk texts into the pre-trained risk recognition model, the method further comprises: step 1, obtaining a plurality of samples, and partitioning them into a training data set, a validation data set and a test data set; step 2, extracting risk features of the preset types from each sample by using a preset vector space model, to form a risk feature vector corresponding to each sample; step 3, training the parameters of the risk identification model with the risk feature vectors of the samples in the training data set; step 4, verifying whether the risk identification model has converged with the risk feature vectors of the samples in the validation data set, executing step 5 if so, and returning to step 3 otherwise; step 5, determining a recognition effect index of the risk identification model with the risk feature vectors of the samples in the test data set, finishing training if the recognition effect index is greater than a preset effect threshold, and returning to step 3 otherwise.
Before extracting the risk features of the preset types from each sample, the method further comprises: obtaining the plurality of samples in the training data set; using a preset feature evaluation function to calculate, for each word appearing in the samples of the training data set, its evaluation value as a risk feature; taking the first N words in descending order of evaluation value, where N is greater than or equal to 1; and constructing the types of risk features from the first N words and a preset vector space model.
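The top-N word selection above can be sketched in a few lines. The patent does not specify the evaluation function, so document frequency is used here as an illustrative stand-in:

```python
from collections import Counter

def top_n_feature_words(docs, evaluate, n):
    """Score every word appearing in the training samples and keep the
    top-n words (descending evaluation value) as feature dimensions."""
    vocab = {w for doc in docs for w in doc}
    return sorted(vocab, key=evaluate, reverse=True)[:n]

# Tokenised training samples (toy data; a real corpus would be segmented text).
docs = [["refund", "guaranteed", "invest"],
        ["guaranteed", "return", "invest"],
        ["weather", "invest"]]

# Stand-in evaluation function: document frequency of the word.
df = Counter(w for doc in docs for w in set(doc))
features = top_n_feature_words(docs, lambda w: df[w], n=2)  # N = 2
```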
The risk identification model is a Support Vector Machine (SVM) model.
Determining the false propaganda risk intensity value corresponding to the target enterprise according to the information of all determined risk texts comprises: determining, from the information of all risk texts, the number of risk texts, the ratio of the number of risk texts to the number of enterprise public opinion texts, the number of source sites of the risk texts, the number of source accounts of the risk texts, the credibility of the source sites of the risk texts, and/or the credibility of the source accounts of the risk texts; normalizing the number of risk texts, the number of source sites, the number of source accounts, the credibility of the source sites, and/or the credibility of the source accounts; and taking a weighted average of the ratio and the normalized quantities as the false propaganda risk intensity value corresponding to the target enterprise.
Normalizing the number of risk texts, the number of source sites of the risk texts, the number of source accounts of the risk texts, the credibility of the source sites of the risk texts, and/or the credibility of the source accounts of the risk texts comprises: normalizing the number of risk texts and/or the credibility of the source accounts of the risk texts by using a preset logarithmic Min-Max normalization method; and normalizing the number of source sites of the risk texts and/or the credibility of the source sites of the risk texts by using a preset Min-Max normalization method.
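The two normalization methods and the weighted average can be sketched as follows. The indicator values and equal weights are illustrative assumptions, not values from the patent:

```python
import math

def min_max(values):
    """Plain Min-Max normalisation to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def log_min_max(values):
    """Logarithmic Min-Max: compress heavy-tailed counts before scaling."""
    return min_max([math.log1p(v) for v in values])

def risk_intensity(indicators, weights):
    """Weighted average of the normalised indicators for one enterprise."""
    return sum(w * x for w, x in zip(weights, indicators)) / sum(weights)

# Toy indicator values across three enterprises:
risk_text_counts = [3, 40, 400]    # number of risk texts (heavy-tailed)
site_counts = [1, 5, 9]            # number of source sites

norm_counts = log_min_max(risk_text_counts)  # logarithmic Min-Max
norm_sites = min_max(site_counts)            # plain Min-Max

# Intensity for the third enterprise, equal weights (illustrative):
intensity = risk_intensity([norm_counts[2], norm_sites[2]], [1.0, 1.0])
```

The logarithmic variant keeps a few enterprises with very large text counts from flattening everyone else to near zero.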
An embodiment of the invention also provides enterprise false propaganda risk identification equipment, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of any one of the enterprise false propaganda risk identification methods above.
An embodiment of the invention also provides a computer-readable storage medium storing an enterprise false propaganda risk identification program which, when executed by a processor, implements the steps of the enterprise false propaganda risk identification method above.
The embodiment of the invention has the following beneficial effects:
according to the method, firstly, suspected risk texts of suspected false propagations related to a target enterprise are screened out based on risk keyword recognition, then, whether the suspected risk texts have risks of false propagations or not is recognized by adopting a risk recognition model according to each suspected risk text, finally, the false propaganda risk intensity of the target enterprise is comprehensively evaluated in a mode of quantifying the false propaganda risk intensity, and further whether the target enterprise has risks of false propagations or not is determined. The enterprise false propaganda risk identification method provided by the embodiment of the invention can be used for accurately and comprehensively evaluating the risk intensity of false propaganda of the target enterprise. The embodiment of the invention can avoid the limitation of manual matching rules, and can improve the intelligence, effectiveness and accuracy of false propaganda risk identification by matching the risk keywords, identifying the risk identification model and quantifying the false propaganda risk intensity of the public opinion text of enterprises, and can well adapt to the dynamic change of financial environment so as to meet the actual application requirements.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of a method for identifying corporate false promotion risk in accordance with one embodiment of the present invention;
FIG. 2 is a flowchart of steps for setting risk keywords, according to one embodiment of the present invention;
FIG. 3 is a flowchart of training steps for a risk identification model according to an embodiment of the present invention;
FIG. 4 is a flowchart of steps for determining a false propaganda risk intensity value, according to an embodiment of the present invention;
fig. 5 is a block diagram of an enterprise false promotion risk identification apparatus in accordance with an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments.
An embodiment of the invention provides an enterprise false propaganda risk identification method. FIG. 1 shows a flow chart of the enterprise false propaganda risk identification method according to one embodiment of the present invention.
Step S110, a plurality of enterprise public opinion texts corresponding to the target enterprise are obtained from the Internet.
A target enterprise is preset, and a plurality of target enterprise keywords are set according to the target enterprise; a plurality of enterprise public opinion texts corresponding to the target enterprise are then acquired from the Internet by using these keywords.
Target enterprise keywords include, but are not limited to: the enterprise's short name, platform name, and product name.
The platform name refers to the financial platform name operated by the target enterprise.
The product name refers to the name of the product to which the target business relates.
Further, the plurality of enterprise public opinion texts corresponding to the target enterprise can be obtained from a plurality of data sources, including but not limited to: microblogs, post bars (Tieba), forums, and news sites.
Step S120, extracting enterprise public opinion texts including preset risk keywords from a plurality of enterprise public opinion texts corresponding to the target enterprise as suspected risk texts.
The risk keywords reflect the core content of enterprise public opinion texts and can be used to measure whether an enterprise public opinion text is suspected of false propaganda. The risk keywords may be keywords related to false propaganda.
The suspected risk text refers to enterprise public opinion text suspected to have false propaganda risk.
The following operations are performed for each enterprise public opinion text: the text is segmented into clauses to obtain a plurality of clauses corresponding to the text; similarity is calculated between each clause and each of the plurality of preset risk keywords; and, if the similarity between at least one clause and one of the risk keywords is greater than a preset keyword similarity threshold, the enterprise public opinion text is determined to be a suspected risk text. The similarity calculation here is a word similarity calculation. The keyword similarity threshold may be an empirical value or a value obtained through experiments.
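The clause-level filtering can be sketched as follows. The patent does not define the word-similarity measure or threshold, so character-level Jaccard similarity and a threshold of 0.3 are used here purely for illustration:

```python
import re

def split_clauses(text):
    """Split a public-opinion text into clauses on common punctuation."""
    return [c for c in re.split(r"[。！？!?.;；,，]", text) if c.strip()]

def jaccard(a, b):
    """Character-level Jaccard similarity (a stand-in for the patent's
    unspecified word-similarity calculation)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def is_suspected_risk_text(text, risk_keywords, threshold=0.3):
    """A text is suspect if any clause is similar enough to any keyword."""
    return any(jaccard(clause, kw) > threshold
               for clause in split_clauses(text)
               for kw in risk_keywords)

doc = "Guaranteed returns every month. Weather is nice today."
flagged = is_suspected_risk_text(doc, ["guaranteed return"], threshold=0.3)
```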
Step S130, extracting risk features of corresponding types from each suspected risk text according to the types of preset risk features, and forming risk feature vectors corresponding to each suspected risk text.
Risk features reflect the core content of false propaganda texts.
The types of risk features may be preset. In this embodiment, they include but are not limited to: the frequency of a preset word in the suspected risk text, the inverse document frequency of the preset word across all suspected risk texts, and the weight of the preset word in the suspected risk text. There may be one or more preset words.
The type of risk feature corresponding to each element position in the risk feature vector is preset; after a plurality of risk features are extracted from a suspected risk text, the risk feature vector corresponding to that text is generated by placing each risk feature at its corresponding element position.
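The three feature types named above (term frequency, inverse document frequency, weight) can be sketched with standard TF-IDF formulas; the layout of three elements per preset word is an illustrative choice, not the patent's fixed layout:

```python
import math

def tf(word, doc):
    """Frequency of the preset word in one suspected risk text."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency of the word across all suspected texts."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (1 + df))

def risk_feature_vector(doc, docs, preset_words):
    """One fixed element position per preset word and feature type:
    [tf(w1), idf(w1), tf*idf(w1), tf(w2), idf(w2), tf*idf(w2), ...]"""
    vec = []
    for w in preset_words:
        t, i = tf(w, doc), idf(w, docs)
        vec.extend([t, i, t * i])  # frequency, inverse doc freq, weight
    return vec

# Tokenised suspected risk texts (toy data):
docs = [["high", "return", "no", "risk"],
        ["risk", "free", "profit"],
        ["market", "news"]]
vec = risk_feature_vector(docs[0], docs, ["risk", "free"])
```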
Step S140, sequentially inputting risk feature vectors corresponding to the multiple suspected risk texts into a pre-trained risk recognition model, so that the risk recognition model recognizes the false propaganda risk of each suspected risk text, and determines the suspected risk text recognized as having the false propaganda risk as a risk text.
The risk identification model identifies, from an input risk feature vector, whether the suspected risk text corresponding to that vector carries a false propaganda risk. The risk recognition model may be a Support Vector Machine (SVM) model.
A risk text is an enterprise public opinion text carrying a false propaganda risk, that is, a text at risk of being false propaganda.
Further, the recognition result of the risk recognition model may be the probability that a false propaganda risk exists: if the risk probability is greater than 0.5, the suspected risk text corresponding to the input risk feature vector is judged to carry a (false propaganda) risk; if the risk probability is less than or equal to 0.5, it is judged to carry no risk.
Further, the recognition result of the risk recognition model may also be a binary value: if the result is 1, the suspected risk text corresponding to the input risk feature vector is judged to carry a risk; if the result is 0, it is judged to carry no risk.
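For a linear SVM, both output modes reduce to the decision function and a sigmoid of its margin. A minimal sketch follows; the weights `w`, bias `b`, and feature vectors are illustrative assumptions, not trained values:

```python
import math

def decision(w, b, x):
    """Linear SVM decision function f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def risk_probability(w, b, x):
    """Platt-style sigmoid mapping of the margin to a risk probability."""
    return 1.0 / (1.0 + math.exp(-decision(w, b, x)))

def is_risk_text(w, b, x, threshold=0.5):
    """Binary recognition result: 1 = false propaganda risk, 0 = no risk."""
    return 1 if risk_probability(w, b, x) > threshold else 0

w, b = [0.8, -0.4, 1.2], -0.3   # assumed trained parameters
risky_vec = [1.0, 0.1, 0.9]     # feature vector of a suspect text
label = is_risk_text(w, b, risky_vec)
```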
Step S150, determining a false propaganda risk intensity value corresponding to the target enterprise according to the information of all determined risk texts.
The higher the risk intensity value, the higher the risk that the enterprise is engaging in false propaganda.
Because different enterprise public opinion texts differ in credibility, the risk reflected by a single enterprise public opinion text is insufficient to show that the target enterprise has a false propaganda risk. Therefore, in this embodiment, the false propaganda risk intensity value corresponding to the target enterprise is determined from the information of all determined risk texts.
The information of all risk texts includes, but is not limited to: the number of risk texts, the ratio of the number of risk texts to the number of enterprise public opinion texts, the number of source sites of the risk texts, the number of source accounts of the risk texts, the credibility of the source sites, and the credibility of the source accounts. How this information is determined is described later and is not detailed here.
Step S160, judging whether the false propaganda risk intensity value is larger than a preset risk threshold value; if yes, go to step S170; if not, step S180 is performed.
The risk threshold is used to gauge whether the target enterprise carries a false propaganda risk.
The risk threshold may be an empirical value, or a value obtained experimentally.
Step S170, if the false propaganda risk intensity value is greater than the preset risk threshold, determining that the target enterprise has a false propaganda risk.
Step S180, if the false propaganda risk intensity value is less than or equal to the risk threshold, determining that the target enterprise does not have a false propaganda risk.
In this method, suspected risk texts suspected of false propaganda related to the target enterprise are first screened out based on risk keyword recognition; a risk recognition model is then used to recognize, for each suspected risk text, whether it carries a false propaganda risk; finally, the false propaganda risk intensity of the target enterprise is comprehensively evaluated by quantifying that intensity, and it is thereby determined whether the target enterprise has a false propaganda risk. The enterprise false propaganda risk identification method provided by this embodiment can accurately and comprehensively evaluate the false propaganda risk intensity of the target enterprise. It avoids the limitations of manual matching rules; by combining risk keyword matching on enterprise public opinion texts, risk recognition model identification, and quantification of false propaganda risk intensity, it improves the intelligence, effectiveness and accuracy of false propaganda risk identification, adapts well to a dynamically changing financial environment, and meets practical application requirements.
Further, in the field of data mining, this enterprise false propaganda risk identification method can be used to evaluate whether a target enterprise carries a false propaganda risk. By matching risk keywords in enterprise public opinion texts, applying the risk recognition model, and quantifying false propaganda risk intensity, it can support the early judgment of an enterprise's illegal fund-raising risk.
The steps in the enterprise false propaganda risk identification method according to the embodiment of the invention will be further described below.
In this embodiment, before extracting the enterprise public opinion texts that include the preset risk keywords, the risk keywords may also be set according to real text data.
Keyword extraction selects from a text the words most relevant to its meaning, and plays an important role in text clustering, classification, summarization and related fields. This embodiment extracts risk keywords by combining an LDA (Latent Dirichlet Allocation) document topic generation model with a Word2Vec (word-to-vector) model. This method avoids the problem that traditional keyword extraction can only extract initial keywords and cannot automatically expand them.
As shown in fig. 2, there is a flowchart of steps for setting risk keywords according to an embodiment of the present invention.
Step S210, extracting a plurality of enterprise false propaganda texts and a plurality of financial field texts from the Internet.
A plurality of texts are extracted from the Internet, and enterprise false propaganda texts are identified among them. Further, several independent business analysts may each label whether false propaganda exists in these texts, and a plurality of enterprise false propaganda texts are finally determined from the labeled texts by majority voting to form a risk corpus.
A plurality of texts including financial-domain keywords are extracted from the Internet; each such text is taken as one financial-field text, and the plurality of financial-field texts form a financial corpus. Financial-domain keywords are finance-related words.
Further, the enterprise false propaganda texts and financial-field texts may be extracted from a plurality of data sources, including but not limited to: microblogs, post bars (Tieba), forums, and news sites. For example: 500 enterprise false propaganda texts are extracted and labeled from each of microblogs, post bars, news sites and forums to form the risk corpus; and 10,000 texts matching financial-domain keywords are extracted from each of microblogs, post bars, news sites and forums to form the financial corpus.
Step S220, preprocessing each enterprise false propaganda text and each financial-field text.
Preprocessing each enterprise false propaganda text in the risk corpus includes: deleting special symbols, performing word segmentation, and removing stop words.
Preprocessing each financial-field text in the financial corpus includes the same operations: deleting special symbols, performing word segmentation, and removing stop words.
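The three preprocessing operations can be sketched as follows. For real Chinese text, word segmentation would use a dedicated tool such as jieba; a whitespace split and a toy stop-word list stand in here for illustration:

```python
import re

STOP_WORDS = {"the", "a", "of", "is"}   # toy stop-word list (assumption)

def preprocess(text):
    """Delete special symbols, segment into words, and drop stop words.
    (A production pipeline for Chinese would replace the whitespace
    split with a segmenter such as jieba.lcut.)"""
    text = re.sub(r"[^\w\s]", " ", text)   # delete special symbols
    tokens = text.lower().split()          # word segmentation (stand-in)
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("Guaranteed 30% return, zero risk!!!")
```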
Step S230, extracting a plurality of false propaganda topic keywords from the preprocessed plurality of enterprise false propaganda texts by using a preset LDA model, and setting the plurality of false propaganda topic keywords as initial risk keywords.
False promotional topic keywords refer to topic keywords related to false promotions.
The number of topics and the number of iterations are preset for the LDA model, and may be empirical values. The risk corpus is modeled with the LDA model: given the preset number of topics and iterations, the model extracts that many topics from the risk corpus, together with a plurality of topic keywords corresponding to each topic, and a plurality of false propaganda topic keywords are then determined among the topic keywords. The false propaganda topic keywords may be determined manually from the topic keywords; alternatively, topic keywords that match preset false propaganda feature words may be taken as false propaganda topic keywords, where the false propaganda feature words are words set according to the characteristics of false propaganda.
And step S240, extracting context information from each preprocessed financial field text by using a preset Word2Vec model, and generating a plurality of vocabulary semantic vectors according to the context information.
The following is performed for each financial domain text: and extracting the context information of each Word in the text of the financial field by using a Word2Vec model, and generating Word semantic vectors corresponding to each Word according to the context information of each Word.
The context information of the vocabulary refers to the vocabulary and the front and rear vocabularies of the vocabulary.
Step S250, for each vocabulary semantic vector, if there is a vector similarity between the semantic vector of at least one initial risk keyword and the vocabulary semantic vector that is greater than a preset vector similarity threshold, setting the vocabulary corresponding to the vocabulary semantic vector as an expanded risk keyword.
Each initial risk keyword is converted into a semantic vector. A preset vector similarity algorithm is then used to find the vocabulary semantic vectors whose similarity to the semantic vector of an initial risk keyword exceeds the threshold, and the vocabularies corresponding to those vocabulary semantic vectors are taken as expanded risk keywords.
Further, the Word2Vec model may be utilized to convert each initial risk keyword into a semantic vector. The vocabulary which is the same as the initial risk keywords can be queried in the financial field text, and after the vocabulary which is the same as the initial risk keywords is queried, the word semantic vector corresponding to the vocabulary is directly used as the semantic vector of the initial risk keywords.
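A minimal sketch of step S250, assuming the vocabulary semantic vectors and keyword semantic vectors are already available as plain lists of floats (the toy 2-D vectors below are purely illustrative stand-ins for trained Word2Vec vectors), with cosine similarity as the preset vector similarity algorithm:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_keywords(initial_vecs, vocab_vecs, threshold=0.9):
    # A vocabulary becomes an expanded risk keyword if its vector is close
    # enough to the vector of at least one initial risk keyword.
    return [word for word, vec in vocab_vecs.items()
            if any(cosine(vec, kv) > threshold for kv in initial_vecs.values())]

initial = {"guaranteed_return": [1.0, 0.1]}
vocab = {"risk_free": [0.9, 0.15], "weather": [0.0, 1.0]}
print(expand_keywords(initial, vocab))  # → ['risk_free']
```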
Step S260, setting each of the set initial risk keywords and each of the set expanded risk keywords as risk keywords.
And taking the set formed by the initial risk keywords and the expanded risk keywords as a risk keyword library.
In this embodiment, the risk corpus is modeled with an LDA model, and the initial risk keywords are extracted from it; the financial domain texts are semantically modeled with a Word2Vec model, and vocabularies in the financial domain texts with high semantic similarity to the initial risk keywords are computed and taken as expanded risk keywords, thereby expanding the risk keywords. As the risk corpus and the financial corpus accumulate, the initial risk keywords and the expanded risk keywords can grow continuously, realizing adaptive expansion of the risk features.
In this embodiment, before the risk recognition model is used, model training needs to be performed, so that the risk recognition model can accurately recognize whether the suspected risk text has a risk of false propaganda.
In this embodiment, the risk recognition model may be an SVM model. The core of the SVM model is to find the optimal hyperplane that divides the positive and negative training samples; the support vectors are the points nearest to the hyperplane, and the distance from these points to the hyperplane is the margin. The goal of SVM training is to make the margin between the hyperplane and the support vectors as large as possible, so that the hyperplane separates the two classes of samples accurately, thereby keeping the error of the risk class decision classifier as small as possible and the classifier as robust as possible. The objective function of the SVM model (the soft-margin form with hinge loss) can be expressed as:

min_{w,b} (1/2)||w||^2 + C · Σ_{i=1}^{m} max(0, 1 - y_i(w^T x_i + b))

where min_{w,b} denotes taking the values of the parameters w and b that minimize the output of the objective function; w is the weight vector and ||w|| is the modulus of w; b is the bias; C is a regularization constant; m is the number of training samples; max is the maximum function; y_i is the risk labeling result of the i-th training sample (namely, whether the labeled i-th training sample has false propaganda risk); x_i is the risk feature vector of the input i-th training sample; and w^T is the transpose of the weight vector. w and b are the parameters to be learned in the risk identification model.
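The objective above can be transcribed literally; the sketch below evaluates the soft-margin objective for one fixed candidate (w, b) on toy data (in practice an optimizer searches for the minimizing w and b):

```python
def svm_objective(w, b, samples, C=1.0):
    # samples: list of (x, y) pairs with label y in {-1, +1}.
    margin_term = 0.5 * sum(wi * wi for wi in w)  # (1/2)·||w||^2
    hinge = sum(max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
                for x, y in samples)              # sum of hinge losses
    return margin_term + C * hinge

samples = [([2.0, 0.0], 1), ([-2.0, 0.0], -1)]
print(svm_objective([1.0, 0.0], 0.0, samples))  # → 0.5 (both margins satisfied)
```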
The training process of the risk identification model is further described below.
As shown in fig. 3, a flowchart of training steps for a risk identification model according to an embodiment of the present invention is shown.
Step S310, a plurality of samples are obtained; wherein the training data set, the validation data set and the test data set are partitioned according to the plurality of samples.
Each of the plurality of samples is pre-labeled with a marker. Further, a plurality of positive samples and a plurality of negative samples are included in the plurality of samples. Positive samples refer to samples that are marked as having a risk of false promotions. Negative examples refer to examples that are marked as not having a risk of false promotions.
A portion of the positive samples and a portion of the negative samples may be partitioned among the plurality of samples as a training data set; taking the remaining positive samples and negative samples in the plurality of samples as a test data set; in the training dataset, a part of positive samples and a part of negative samples are partitioned as verification dataset.
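The split in step S310 can be sketched as follows; the 60/20/20 proportions and the fixed seed are illustrative choices, not prescribed by the embodiment:

```python
import random

def split(samples, test_frac=0.2, val_frac=0.25, seed=0):
    # Shuffle, carve off the test set, then carve the validation set out of
    # the remaining training portion (0.25 of 0.8 gives a 60/20/20 split).
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = int(len(rest) * val_frac)
    return rest[n_val:], rest[:n_val], test  # train, validation, test

train, val, test = split(range(100))
print(len(train), len(val), len(test))  # → 60 20 20
```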
Step S320, extracting risk features of a preset type from each sample by using a preset VSM (Vector Space Model ) to form a risk feature vector corresponding to each sample.
In this embodiment, the vector space model may be a TF-IDF (Term Frequency-Inverse Document Frequency) algorithm model.
Acquiring a plurality of samples in the training dataset before extracting risk features of preset types from each sample respectively; respectively calculating each vocabulary appearing in a plurality of samples of the training data set as an evaluation value of risk characteristics by using a preset characteristic evaluation function; sequentially acquiring the first N vocabularies according to the sequence from the large evaluation value to the small evaluation value, wherein N is more than or equal to 1; and constructing the types of risk features according to the first N words and a preset vector space model.
In particular, the feature evaluation function may be a chi-square test function. First, each sample in the training data set is preprocessed, mainly including: deleting special symbols, word segmentation, and removing stop words. Then the evaluation value of each vocabulary appearing in the training data set is calculated with the chi-square test function, all vocabularies are sorted by evaluation value from large to small, and the top N (TopN) vocabularies are selected as the optimal vocabularies.
The evaluation value of each vocabulary appearing in the training samples is calculated using the chi-square test function:

chi^2(word, F) = M · (A·D - B·C)^2 / ((A + B) · (C + D) · (A + C) · (B + D))

where word represents a vocabulary appearing in the training data set; F refers to the class-F samples, namely the samples in the training data set marked as having false propaganda risk; M represents the total number of samples in the training data set; A represents the number of class-F samples containing the word; B represents the number of class-F samples not containing the word; C represents the number of other-class samples containing the word; and D represents the number of other-class samples not containing the word. The other class refers to the samples in the training data set marked as not having false propaganda risk. The evaluation value of each vocabulary appearing in the training data set is calculated using the above formula.
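The chi-square evaluation is a direct transcription of its contingency-table form, assuming the four counts A, B, C, D have already been tallied for a word:

```python
def chi_square(A, B, C, D):
    # chi2 = M*(A*D - B*C)^2 / ((A+B)*(C+D)*(A+C)*(B+D)), with M = A+B+C+D.
    M = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return M * (A * D - B * C) ** 2 / denom if denom else 0.0

# A word appearing in 40 of 50 risky samples but only 10 of 50 other samples
# scores high, so it would rank near the top of the TopN list.
print(chi_square(40, 10, 10, 40))  # → 36.0
```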
The TopN vocabulary and the TF-IDF algorithm model are utilized to construct the types of risk features, namely, each sample in the training data set is expressed by vectors based on the TopN vocabulary through the TF-IDF algorithm model.
For example: the TopN vocabularies and TF-IDF algorithm model are used to construct the following three risk features:
First risk feature: TF_ij, where TF_ij represents the frequency of occurrence of vocabulary i in sample j of the training data set, with 1 ≤ i ≤ N and 1 ≤ j ≤ M.

Second risk feature: IDF_i = log(|D| / |{j : i ∈ d_j}|), where IDF_i represents the inverse document frequency of vocabulary i, |D| represents the total number of samples, and |{j : i ∈ d_j}| represents the number of samples in which vocabulary i appears.

Third risk feature: TFIDF_ij = TF_ij × IDF_i, where TFIDF_ij represents the weight of vocabulary i in sample j of the training data set.
Therefore, whether during training of the risk identification model or during enterprise false propaganda identification, the TF-IDF algorithm model can be used to extract the above three types of risk features from a sample or a suspected risk text, thereby obtaining the risk feature vector of the sample or suspected risk text. When performing enterprise false propaganda identification, sample j in the three risk features is replaced by suspected risk text j.
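The three features can be sketched directly from their definitions over a toy corpus of tokenized samples; the natural-log base below is an illustrative choice, as the embodiment does not fix the logarithm base:

```python
import math

def tf(word, doc):
    # Frequency of the word within one tokenized sample.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency: log(|D| / number of samples with the word).
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

docs = [["risk", "free", "returns"], ["weather", "news"]]
print(tfidf("risk", docs[0], docs))  # → log(2)/3 ≈ 0.231
```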
Step S330, training parameters in the risk identification model by using risk feature vectors corresponding to each sample in the training dataset.
In the training stage, the risk recognition model is trained with manually labeled positive and negative samples (with or without false propaganda risk), and the optimal parameters are learned.
Sequentially inputting risk feature vectors corresponding to each sample in the training data set into a risk recognition model, acquiring a recognition result output by the risk recognition model, comparing the recognition result with a mark marked by the sample, inputting the next sample into the risk recognition model if the recognition result is the same as the mark, and adjusting parameters in the risk recognition model if the recognition result is different from the mark. By continuously adjusting the parameters, the value of the objective function of the risk identification model can be gradually minimized, i.e. the interval between the support vector and the hyperplane is gradually maximized.
In this embodiment, the recognition result may be a probability of whether the sample has a false propaganda risk, and if the probability is greater than 0.5, it is determined that the sample has a false propaganda risk, that is, it is determined as a risk text.
Step S340, verifying whether the risk identification model converges by using the risk feature vector corresponding to each sample in the verification dataset, if so, executing step S350, otherwise, jumping to step S330.
Multiple rounds of verification operations are performed. Each round includes: sequentially inputting the risk feature vectors corresponding to the samples in the verification data set into the risk identification model, so that the risk identification model performs false propaganda risk identification on the sample corresponding to each risk feature vector. After the first round of verification, the value of the objective function of the risk identification model is determined; if the value of the objective function in the previous round is larger than that in the next round, training with the training data set continues; otherwise, the risk identification model has converged, at which point the value of its objective function is minimal, i.e., the margin between the support vectors and the hyperplane is maximal.
And step S350, determining the identification effect index of the risk identification model by using the risk characteristic vector corresponding to each sample in the test data set.
The identification effect index of the risk identification model may be an F1 value of the risk identification model.
Step S360, judging whether the identification effect index is larger than a preset effect threshold value; if yes, go to step S370; if not, step S330 is performed.
And step S370, if the identification effect index is larger than a preset effect threshold value, finishing training of the risk identification model.
Because the credibility of different enterprise public opinion texts differs, the risk reflected by a single enterprise public opinion text cannot indicate whether the risk really exists for the enterprise. In this embodiment, the intensity of the enterprise's false propaganda risk is quantified to determine whether the enterprise really has a false propaganda risk.
Considering the influence of the dimensions such as enterprise risk information quantity, enterprise risk information source credibility and the like on the enterprise false propaganda risk intensity, the enterprise false propaganda risk intensity quantification algorithm provided by the embodiment can comprehensively consider the indexes of the dimensions such as enterprise risk information quantity, enterprise risk information quantity ratio, enterprise risk information site/account number, enterprise risk information site credibility, enterprise risk information account credibility and the like, so that the false propaganda risk intensity of an enterprise is calculated.
The manner in which the corresponding false hype risk intensity value for the target business is determined will be described further below.
FIG. 4 is a flowchart of the steps for determining false hype risk intensity values, in accordance with an embodiment of the present invention.
Step S410, determining the number of risk texts, the ratio of the number of risk texts to the number of enterprise public opinion texts, the number of source sites of the risk texts, the number of source account numbers of the risk texts, the credibility of the source sites of the risk texts and/or the credibility of the source account numbers of the risk texts according to the determined information of all the risk texts.
Step S420, performing normalization processing on the number of the risk texts, the number of source sites of the risk texts, the number of source accounts of the risk texts, the credibility of the source sites of the risk texts, and/or the credibility of the source accounts of the risk texts.
Step S430, determining a ratio of the number of the risk texts to the number of the public opinion texts of the enterprise, the normalized number of the risk texts, the number of source sites of the risk texts, the number of source accounts of the risk texts, the credibility of the source sites of the risk texts, and/or a weighted average of the credibility of the source accounts of the risk texts, and taking the weighted average as a false propaganda risk intensity value corresponding to the target enterprise.
In this embodiment, the following influencing factors of the false promotional risk intensity value may be determined: the enterprise risk information amount, the enterprise risk information amount ratio, the enterprise risk information site/account number, the enterprise risk information site credibility and the enterprise risk information account number credibility.
The amount of risk information (the number of risk texts) of the enterprise refers to the number of risk texts (suspected risk texts with false publicity risk) identified by the risk identification model. The more the risk information amount is, the larger the false propaganda risk intensity of the enterprise is, and the dimension index is positively related to the false propaganda risk intensity of the enterprise.
The enterprise risk information amount ratio (ratio of the number of risk texts to the number of enterprise public opinion texts) refers to the ratio of the number of risk texts identified by the risk identification model to the number of enterprise public opinion texts acquired from the internet. If the more risk texts about false propaganda risks in the enterprise public opinion texts, the more intensity that the enterprise has the false propaganda risks, the dimension index is positively correlated with the enterprise risk intensity.
The number of enterprise risk information sites (the number of source sites of the risk text) refers to the number of source sites identified as the risk text by the risk identification model. The source site of the risk text can be extracted from the risk text, and a data source for acquiring the risk text can also be used as the source site. The more the enterprise risk information sites are, the more data sources for reporting that the enterprise has false propaganda risks are, the higher the credibility of the risk information is, so that the higher the intensity of the false propaganda risks of the enterprise is, and the dimension index is positively related to the enterprise risk intensity.
The enterprise risk information account number (source account number of the risk text) refers to the number of source accounts identified as the risk text by the risk identification model. The source account number of the risk text can be extracted from the risk text, and the source account number of the risk text can be queried in a data source for acquiring the risk text. The more the enterprise risk information account number is, the more the data sources of the false propaganda risk exist in the reporting enterprise, the higher the reliability of the risk information is, so that the higher the intensity of the false propaganda risk exists in the enterprise, and the dimension index is positively related to the enterprise risk intensity.
The enterprise risk information site credibility (credibility of source sites of risk texts) refers to the comprehensive corresponding credibility of each source site of the risk texts identified by the risk identification model. A plurality of site levels are preset, and the types of the site levels include but are not limited to: the confidence level is reduced in turn at the central level, provincial level, municipal level, county level. Each source site may correspond to a respective site level. A weight value may be set for each site level correspondence. The greater the influence degree of the enterprise risk information site is, the greater the credibility of the risk information is, the greater the intensity of the risk of the enterprise is, and the dimension index is positively related to the false propaganda risk intensity of the enterprise.
The source site of each risk text is determined among all the risk texts, together with the site level corresponding to each source site, so that the enterprise risk information site credibility I can be expressed using the following formula:

I = Σ_{k=1}^{K} θ_k · count_k

where K is the total number of source sites involved in all risk texts, θ_k is the weight of the site level corresponding to the k-th source site, count_k is the number of risk texts corresponding to the k-th source site, and Σ is the sum function. A weight is preset for each site level, decreasing from the central level to the county level.
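A sketch of the site-credibility score I; the numeric level weights below are illustrative placeholders, since the embodiment presets a weight per site level but the specific values are not reproduced here:

```python
# Illustrative site-level weights, decreasing from central to county level.
LEVEL_WEIGHTS = {"central": 1.0, "provincial": 0.8, "municipal": 0.6, "county": 0.4}

def site_credibility(sites):
    # sites: list of (site_level, number of risk texts from that site).
    return sum(LEVEL_WEIGHTS[level] * count for level, count in sites)

print(site_credibility([("central", 2), ("county", 5)]))  # → 4.0
```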
the credibility of the enterprise risk information account (credibility of the source account of the risk text) refers to the credibility of the comprehensive correspondence of each source account of the risk text identified by the risk identification model. Further, the number of vermicelli of the source account numbers can be used for measuring the comprehensive corresponding credibility of each source account number, and the larger the number of vermicelli is, the larger the credibility of the account number source is, the larger the credibility of risk information is, and the larger the intensity of false propaganda risks of enterprises is. The dimension index is positively correlated with the enterprise risk intensity.
The formula for determining the enterprise risk information account credibility S is as follows:

S = Σ_{t=1}^{T} fans_t · count_t

where T is the total number of source accounts involved in all risk texts, fans_t is the follower count of the t-th source account, and count_t is the number of risk texts corresponding to the t-th source account.
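Analogously to the site score, the account-credibility score S can be sketched as:

```python
def account_credibility(accounts):
    # accounts: list of (follower count, number of risk texts from account).
    return sum(fans * count for fans, count in accounts)

print(account_credibility([(10000, 3), (50, 1)]))  # → 30050
```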
Carrying out normalization processing on the quantity of the risk texts and/or the credibility of the source account numbers of the risk texts by using a preset logarithmic Min-Max normalization method; and carrying out normalization processing on the number of source sites of the risk text, the number of source account numbers of the risk text and the credibility of the source sites of the risk text by using a preset Min-Max normalization method.
Aiming at two dimensions of enterprise risk information quantity and enterprise risk information account credibility, the data distribution accords with the power law distribution. The power law distribution is characterized in that the data distribution is mostly concentrated in a range with smaller values, and the data distribution is smaller as the values are larger. Because of the extreme non-uniformity of the power law distribution, extreme data can cause larger interference on the normalization result, so that extreme abnormal values need to be eliminated, the data distribution is converted into linear distribution by utilizing a logarithmic function, and the quantity of the risk texts and the credibility of source accounts of the risk texts can be unified to be between 0 and 1 respectively by adopting a logarithmic Min-Max normalization method.
The expression of the logarithmic Min-Max normalization method is as follows:

norm_score = (log X - log X_min) / (log X_max - log X_min)

where norm_score is the logarithmic normalization result, X is the first original feature value, X_min is the minimum among the first-class feature values, and X_max is the maximum among the first-class feature values; norm_score takes a value between 0 and 1.
The first original feature value refers to the current normalized object, namely the number of the risk texts or the credibility of the source account numbers of the risk texts.
The minimum value in the first class feature values refers to determining the number of the risk texts or the credibility of source accounts of the risk texts corresponding to each target enterprise after the false propaganda risk identification method of the enterprise is executed for a plurality of times, and taking the minimum number of the risk texts or the credibility of the source accounts of the risk texts in the target enterprises as Xmin.
The maximum value in the first class of feature values refers to that after the false propaganda risk identification method of enterprises is executed for many times, the number of the corresponding risk texts or the credibility of source accounts of the risk texts of each target enterprise is determined, and the maximum number of the risk texts or the credibility of the source accounts of the risk texts in the target enterprises is taken as Xmax.
The data distribution of the enterprise risk information sites, the account numbers and the enterprise risk information site credibility accords with the linear distribution, and the Min-Max normalization method can be adopted to normalize the enterprise risk information sites, the account numbers and the enterprise risk information site credibility to be between 0 and 1 respectively.
The expression of the Min-Max normalization method is as follows:

norm_score' = (Y - Y_min) / (Y_max - Y_min)

where norm_score' is the normalization result, Y is the second original feature value, Y_min is the minimum among the second-class feature values, and Y_max is the maximum among the second-class feature values; norm_score' takes a value between 0 and 1.
The second original feature value refers to a current normalized object, namely the number of enterprise risk information sites, the number of accounts or the credibility of the enterprise risk information sites.
The minimum value in the second class of feature values refers to determining the number of enterprise risk information sites, the number of accounts and the reliability of the enterprise risk information sites corresponding to each target enterprise after the enterprise false propaganda risk identification method is executed for a plurality of times, and taking the minimum enterprise risk information sites, the number of accounts or the reliability of the enterprise risk information sites in a plurality of target enterprises as Ymin.
The maximum value in the second class of feature values refers to that after the false propaganda risk identification method of the enterprise is executed for multiple times, the number of enterprise risk information sites, the number of accounts and the reliability of the enterprise risk information sites corresponding to each target enterprise are determined, and the maximum enterprise risk information sites, the number of accounts or the reliability of the enterprise risk information sites in the multiple target enterprises are taken as Ymax.
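Both normalization methods can be sketched together; each maps a feature value into [0, 1] given the observed minimum and maximum of its feature class:

```python
import math

def log_min_max(x, x_min, x_max):
    # Logarithmic Min-Max for power-law features (requires x, x_min, x_max > 0).
    return (math.log(x) - math.log(x_min)) / (math.log(x_max) - math.log(x_min))

def min_max(y, y_min, y_max):
    # Plain Min-Max for linearly distributed features.
    return (y - y_min) / (y_max - y_min)

print(min_max(5, 0, 10))        # → 0.5
print(log_min_max(10, 1, 100))  # → 0.5 (log 10 is halfway between log 1 and log 100)
```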
For the enterprise risk information quantity occupying ratio dimension, the numerical range is between 0 and 1, and no further normalization processing is needed.
Thus, a risk intensity influence vector of 6-dimensional features can be formed, including: the ratio of the number of risk texts to the number of enterprise public opinion texts, the normalized number of risk texts, the number of source sites of the risk texts, the number of source accounts of the risk texts, the credibility of the source sites of the risk texts, and the credibility of the source accounts of the risk texts. Of course, the number of features in the risk intensity influence vector may be increased or decreased, and is not limited to the above 6 features.
After the normalization processing, the false propaganda risk intensity value avgScore corresponding to the target enterprise is determined, according to the weight preset for each dimension feature of the risk intensity influence vector, using the following calculation formula:

avgScore = Σ_{l=1}^{L} θ_l · num_l

where L is the dimension of the risk intensity influence vector, θ_l is the weight corresponding to the l-th dimension feature, and num_l is the value or normalization result corresponding to the l-th dimension feature.
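A sketch of the final weighted average; the six feature values and their weights below are illustrative (the weights are chosen to sum to 1 so that avgScore stays between 0 and 1):

```python
def risk_intensity(features, weights):
    # Weighted average of the normalized dimension features.
    return sum(w * f for w, f in zip(weights, features))

feats = [0.8, 0.5, 0.6, 0.4, 0.7, 0.9]    # six normalized dimension features
wts = [0.25, 0.2, 0.15, 0.15, 0.15, 0.1]  # illustrative weights summing to 1
print(risk_intensity(feats, wts))  # ≈ 0.645
```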
Further, the ratio of the number of the risk texts to the number of the enterprise public opinion texts, the number of the normalized risk texts, the number of source sites and the number of source accounts of the risk texts, the credibility of the source sites of the risk texts and the credibility of the source accounts of the risk texts respectively correspond to weights, which can be an empirical value or a value obtained through experiments, or a value determined through an expert scoring method.
In this embodiment, the enterprise false propaganda risk intensity value is normalized to between 0 and 1 by the quantification algorithm; the larger the value, the greater the intensity of the target enterprise's false propaganda risk, so the intensity is reflected quantitatively. A risk threshold is set according to the specific business application scenario, and a target enterprise whose false propaganda risk intensity value is greater than the risk threshold is determined to be an enterprise with false propaganda risk.
Based on machine learning, data mining, natural language processing, data source management, and related fields, the embodiments identify enterprises with false propaganda risk through risk keyword matching, risk identification model recognition, and false propaganda risk intensity quantification. The false propaganda risk identification method can accurately and comprehensively evaluate the risk intensity of an enterprise's false propaganda: the higher the false propaganda risk intensity value, the higher the risk that the enterprise engages in false propaganda. By comprehensively combining risk keyword matching, risk identification model recognition, and the false propaganda risk intensity quantification algorithm, the embodiments overcome the disadvantages of manual rule-matching schemes, such as low accuracy, poor intelligence, and poor adaptability, and greatly improve the accuracy, effectiveness, and intelligence of false propaganda risk identification.
The embodiment provides enterprise false propaganda risk identification equipment. FIG. 5 is a block diagram of an enterprise false promotion risk identification apparatus according to one embodiment of the present invention.
In this embodiment, the enterprise false propaganda risk identification device includes, but is not limited to: a processor 510, a memory 520.
The processor 510 is configured to execute the enterprise false propaganda risk identification program stored in the memory 520 to implement the enterprise false propaganda risk identification method described above.
Specifically, the processor 510 is configured to execute the enterprise false hype risk identification program stored in the memory 520, so as to implement the following steps:
obtaining a plurality of enterprise public opinion texts corresponding to a target enterprise from the Internet; extracting enterprise public opinion texts comprising preset risk keywords from a plurality of enterprise public opinion texts corresponding to the target enterprise to serve as suspected risk texts; extracting risk features of corresponding types from each suspected risk text according to the types of preset risk features, and forming risk feature vectors corresponding to each suspected risk text; sequentially inputting risk feature vectors corresponding to the multiple suspected risk texts into a pre-trained risk recognition model, enabling the risk recognition model to recognize false propaganda risks of each suspected risk text, and determining the suspected risk text recognized as the risk text with the false propaganda risks; according to the determined information of all risk texts, determining false propaganda risk intensity values corresponding to the target enterprises; and if the false propaganda risk intensity value is larger than a preset risk threshold value, determining that the target enterprise has false propaganda risk.
Before the extracting of the enterprise public opinion texts including the preset risk keywords, the method further includes: extracting a plurality of enterprise false propaganda texts and a plurality of financial field texts from the Internet; preprocessing each enterprise false propaganda text and each financial field text respectively; extracting a plurality of false propaganda topic keywords from the preprocessed enterprise false propaganda texts by using a preset Latent Dirichlet Allocation (LDA) document topic model, and setting the plurality of false propaganda topic keywords as initial risk keywords; extracting context information from each preprocessed financial field text by using a preset Word2Vec model, and generating a plurality of vocabulary semantic vectors according to the context information; for each vocabulary semantic vector, if the vector similarity between it and the semantic vector of at least one initial risk keyword is greater than a preset vector similarity threshold, setting the vocabulary corresponding to that vocabulary semantic vector as an expanded risk keyword; and setting the initial risk keywords and the expanded risk keywords together as the risk keywords.
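The keyword expansion step above can be sketched without any particular library: given semantic vectors (Word2Vec output in the patent; hand-written toy vectors here), a vocabulary word becomes an expanded risk keyword when it is close enough to at least one initial risk keyword. The 0.8 similarity threshold and the toy vectors are assumptions.

```python
import math

VECTOR_SIM_THRESHOLD = 0.8  # assumed preset vector similarity threshold

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_keywords(initial_keywords, vocab_vectors, threshold=VECTOR_SIM_THRESHOLD):
    # vocab_vectors: word -> semantic vector. A word is promoted to an
    # expanded risk keyword when its vector is more similar than the
    # threshold to at least one initial risk keyword's vector.
    expanded = set()
    for word, vec in vocab_vectors.items():
        if word in initial_keywords:
            continue
        if any(cosine(vocab_vectors[k], vec) > threshold
               for k in initial_keywords if k in vocab_vectors):
            expanded.add(word)
    return set(initial_keywords) | expanded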
Extracting the enterprise public opinion texts that include preset risk keywords from the plurality of enterprise public opinion texts corresponding to the target enterprise as suspected risk texts includes: for each enterprise public opinion text, performing clause segmentation on the text to obtain a plurality of clauses; calculating the similarity between each clause and each of the plurality of preset risk keywords; and if the similarity between at least one clause of the enterprise public opinion text and one of the risk keywords is greater than a preset keyword similarity threshold, determining that the enterprise public opinion text is a suspected risk text.
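A minimal sketch of this clause-level filter follows. The delimiter set, the character-overlap (Jaccard) similarity, and the 0.5 threshold are all assumptions; the patent fixes neither the segmentation rule nor the similarity measure.

```python
import re

KEYWORD_SIM_THRESHOLD = 0.5  # assumed preset keyword similarity threshold

def split_clauses(text):
    # Split on common sentence/clause delimiters (Chinese and ASCII).
    return [c for c in re.split(r"[。！？；!?;,，.]", text) if c.strip()]

def clause_similarity(clause, keyword):
    # Toy character-overlap (Jaccard) similarity; a stand-in for the
    # unspecified similarity calculation of the method.
    a, b = set(clause), set(keyword)
    return len(a & b) / len(a | b) if a | b else 0.0

def is_suspected_risk_text(text, keywords, threshold=KEYWORD_SIM_THRESHOLD):
    # A text is suspected as soon as any clause matches any risk keyword
    # above the preset keyword similarity threshold.
    return any(clause_similarity(c, k) > threshold
               for c in split_clauses(text) for k in keywords)
```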
Before sequentially inputting the risk feature vectors corresponding to the suspected risk texts into the pre-trained risk recognition model, the method further comprises: step 1, obtaining a plurality of samples and partitioning them into a training data set, a validation data set, and a test data set; step 2, extracting risk features of the preset types from each sample by using a preset vector space model to form a risk feature vector for each sample; step 3, training the parameters of the risk identification model using the risk feature vector of each sample in the training data set; step 4, verifying whether the risk identification model has converged using the risk feature vector of each sample in the validation data set, and if so, executing step 5, otherwise returning to step 3; step 5, determining a recognition effect index of the risk identification model using the risk feature vector of each sample in the test data set, and if the recognition effect index is greater than a preset effect threshold, finishing training the risk identification model, otherwise returning to step 3.
Before extracting the risk features of the preset types from each sample, the method further comprises: obtaining a plurality of samples in the training data set; calculating, with a preset feature evaluation function, an evaluation value of each vocabulary word appearing in the samples of the training data set as a candidate risk feature; acquiring the first N words in descending order of evaluation value, where N is greater than or equal to 1; and constructing the risk feature types from the first N words and a preset vector space model.
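The chi-square statistic is one plausible choice for the unspecified "preset feature evaluation function"; a sketch of scoring every vocabulary word against the risk label and keeping the top N follows. Both the choice of chi-square and the toy data are assumptions.

```python
def chi_square(word, docs, labels):
    # 2x2 chi-square statistic of word presence vs. the risk label (1 = risk).
    n = len(docs)
    a = sum(1 for d, y in zip(docs, labels) if word in d and y == 1)
    b = sum(1 for d, y in zip(docs, labels) if word in d and y == 0)
    c = sum(1 for d, y in zip(docs, labels) if word not in d and y == 1)
    d_ = n - a - b - c
    denom = (a + c) * (b + d_) * (a + b) * (c + d_)
    return n * (a * d_ - b * c) ** 2 / denom if denom else 0.0

def top_n_feature_words(docs, labels, n):
    # docs: tokenized samples. Score every vocabulary word with the
    # evaluation function and keep the N highest-scoring words.
    vocab = set(w for d in docs for w in d)
    ranked = sorted(vocab, key=lambda w: chi_square(w, docs, labels), reverse=True)
    return ranked[:n]
```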
The risk identification model is a Support Vector Machine (SVM) model.
Determining the false propaganda risk intensity value corresponding to the target enterprise according to the information of all determined risk texts includes: determining, from that information, the number of risk texts, the ratio of the number of risk texts to the number of enterprise public opinion texts, the number of source sites of the risk texts, the number of source accounts of the risk texts, the credibility of the source sites of the risk texts, and/or the credibility of the source accounts of the risk texts; normalizing the number of risk texts, the number of source sites, the number of source accounts, the credibility of the source sites, and/or the credibility of the source accounts; and computing a weighted average of the ratio and of the normalized quantities, the weighted average being used as the false propaganda risk intensity value corresponding to the target enterprise.
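The weighted average itself is a one-liner; the description fixes neither the weights nor the exact feature set, so equal weights are assumed here for illustration.

```python
def risk_intensity(ratio, normalized_features, weights):
    # Weighted average of the risk-text ratio and the normalized
    # source/credibility features. The weights are not specified by the
    # description; callers must supply one weight per value.
    values = [ratio] + list(normalized_features)
    if len(values) != len(weights):
        raise ValueError("one weight per feature is required")
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```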
The normalizing of the number of the risk texts, the number of source sites of the risk texts, the number of source accounts of the risk texts, the credibility of the source sites of the risk texts, and/or the credibility of the source accounts of the risk texts includes: normalizing the number of risk texts and/or the credibility of the source accounts of the risk texts with a preset logarithmic Min-Max normalization method; and normalizing the number of source sites of the risk texts and/or the credibility of the source sites of the risk texts with a preset Min-Max normalization method.
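Both normalizations can be sketched in a few lines. Using `log1p` before scaling is an assumption; the description only says "logarithmic Min-Max normalization", which log-compresses heavy-tailed counts (such as risk-text counts) before mapping them to [0, 1].

```python
import math

def min_max(values):
    # Plain Min-Max scaling into [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def log_min_max(values):
    # Logarithmic Min-Max: compress with log1p first, then scale.
    return min_max([math.log1p(v) for v in values])
```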
The embodiment of the invention also provides a storage medium storing one or more programs. The storage medium may comprise volatile memory, such as random access memory; non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; or a combination of the above types of memory.
The one or more programs, when executed by one or more processors, implement the enterprise false propaganda risk identification method described above.
Specifically, when executed by the processor, the program implements the same steps described above for the processor 510, including the risk keyword construction, suspected risk text extraction, risk recognition model training and application, and false propaganda risk intensity quantification.
The above description provides only examples of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the scope of the claims of the present invention.

Claims (9)

1. A method for identifying false propaganda risks of enterprises, comprising the steps of:
obtaining a plurality of enterprise public opinion texts corresponding to a target enterprise from the Internet;
Extracting enterprise public opinion texts comprising preset risk keywords from a plurality of enterprise public opinion texts corresponding to the target enterprise to serve as suspected risk texts;
extracting risk features of corresponding types from each suspected risk text according to the types of preset risk features, and forming risk feature vectors corresponding to each suspected risk text;
sequentially inputting risk feature vectors corresponding to the multiple suspected risk texts into a pre-trained risk recognition model, enabling the risk recognition model to recognize false propaganda risks of each suspected risk text, and determining the suspected risk text recognized as the risk text with the false propaganda risks;
according to the determined information of all risk texts, determining false propaganda risk intensity values corresponding to the target enterprises; if the false propaganda risk intensity value is larger than a preset risk threshold value, determining that the target enterprise has false propaganda risk;
the determining the false propaganda risk intensity value corresponding to the target enterprise according to the determined information of all risk texts comprises the following steps:
determining the number of risk texts, the ratio of the number of the risk texts to the number of enterprise public opinion texts, the number of source sites of the risk texts, the number of source account numbers of the risk texts, the credibility of the source sites of the risk texts and/or the credibility of the source account numbers of the risk texts according to the determined information of all the risk texts;
Performing normalization processing on the number of the risk texts, the number of the source sites of the risk texts, the number of the source account numbers of the risk texts, the credibility of the source sites of the risk texts, and/or the credibility of the source account numbers of the risk texts;
computing a weighted average of the ratio of the number of the risk texts to the number of the enterprise public opinion texts and of the normalized number of the risk texts, number of source sites of the risk texts, number of source accounts of the risk texts, credibility of the source sites of the risk texts, and/or credibility of the source accounts of the risk texts, the weighted average being used as the false propaganda risk intensity value corresponding to the target enterprise.
2. The method of claim 1, further comprising, prior to the extracting the corporate public opinion text including the preset risk keywords:
extracting a plurality of enterprise false propaganda texts and a plurality of financial field texts from the Internet;
preprocessing each enterprise false propaganda text and each financial field text respectively;
extracting a plurality of false propaganda topic keywords from the preprocessed plurality of enterprise false propaganda texts by using a preset Latent Dirichlet Allocation (LDA) document topic model, and setting the plurality of false propaganda topic keywords as initial risk keywords;
Extracting context information from each preprocessed text in the financial field by using a preset Word2Vec model, and generating a plurality of vocabulary semantic vectors according to the context information;
for each vocabulary semantic vector, if the vector similarity between the semantic vector of at least one initial risk keyword and the vocabulary semantic vector is greater than a preset vector similarity threshold, setting the vocabulary corresponding to the vocabulary semantic vector as an expanded risk keyword;
setting each set initial risk keyword and each set expanded risk keyword as risk keywords.
3. The method according to claim 1 or 2, wherein extracting, from a plurality of enterprise public opinion texts corresponding to the target enterprise, the enterprise public opinion text including a preset risk keyword as a suspected risk text includes:
aiming at each enterprise public opinion text, carrying out clause processing on the enterprise public opinion text to obtain a plurality of clauses corresponding to the enterprise public opinion text;
respectively carrying out similarity calculation on each clause and a plurality of preset risk keywords;
and if the similarity between at least one clause of the enterprise public opinion text and one of the risk keywords is greater than a preset keyword similarity threshold, determining that the enterprise public opinion text is a suspected risk text.
4. The method according to claim 1, further comprising, before sequentially inputting risk feature vectors corresponding to the plurality of the suspected risk texts into a pre-trained risk recognition model:
step 1, obtaining a plurality of samples; wherein the training data set, the validation data set and the test data set are partitioned according to the plurality of samples;
step 2, extracting risk features of preset types from each sample by using a preset vector space model to form risk feature vectors corresponding to each sample;
step 3, training parameters in the risk identification model by using risk feature vectors corresponding to each sample in the training data set;
step 4, verifying whether the risk identification model is converged by using the risk feature vector corresponding to each sample in the verification data set, if so, executing step 5, otherwise, jumping to step 3;
step 5, determining an identification effect index of the risk identification model by using the risk feature vector corresponding to each sample in the test data set; and if the identification effect index is larger than a preset effect threshold value, finishing training the risk identification model, otherwise, jumping to the step 3.
5. The method of claim 4, further comprising, prior to said extracting the risk features of the predetermined category in each of said samples, respectively:
obtaining a plurality of samples in the training dataset;
respectively calculating each vocabulary appearing in a plurality of samples of the training data set as an evaluation value of risk characteristics by using a preset characteristic evaluation function;
acquiring the first N words in descending order of evaluation value, wherein N is greater than or equal to 1;
and constructing the types of risk features according to the first N words and a preset vector space model.
6. The method of claim 4, wherein the risk identification model is a support vector machine, SVM, model.
7. The method according to claim 1, wherein normalizing the number of risk texts, the number of source sites of risk texts, the number of source account numbers of risk texts, the credibility of source sites of risk texts, and/or the credibility of source account numbers of risk texts comprises:
carrying out normalization processing on the quantity of the risk texts and/or the credibility of the source account numbers of the risk texts by using a preset logarithmic Min-Max normalization method;
And carrying out normalization processing on the number of source sites of the risk text and/or the credibility of the source sites of the risk text by using a preset Min-Max normalization method.
8. An enterprise false propaganda risk identification device, characterized in that the enterprise false propaganda risk identification device comprises: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of the enterprise false promotion risk identification method of any one of claims 1 to 7.
9. A computer readable storage medium, wherein an enterprise false propaganda risk identification program is stored on the computer readable storage medium, which when executed by a processor, implements the steps of the enterprise false propaganda risk identification method according to any one of claims 1-7.
CN202010214386.1A 2020-03-24 2020-03-24 Enterprise false propaganda risk identification method, equipment and storage medium Active CN113505221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010214386.1A CN113505221B (en) 2020-03-24 2020-03-24 Enterprise false propaganda risk identification method, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113505221A CN113505221A (en) 2021-10-15
CN113505221B true CN113505221B (en) 2024-03-12

Family

ID=78008263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010214386.1A Active CN113505221B (en) 2020-03-24 2020-03-24 Enterprise false propaganda risk identification method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113505221B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193796A (en) * 2016-03-14 2017-09-22 北大方正集团有限公司 A kind of public sentiment event detecting method and device
CN107239439A (en) * 2017-04-19 2017-10-10 同济大学 Public sentiment sentiment classification method based on word2vec
CN109523153A (en) * 2018-11-12 2019-03-26 平安科技(深圳)有限公司 Acquisition methods, device, computer equipment and the storage medium of illegal fund collection enterprise
CN109543985A (en) * 2018-11-15 2019-03-29 李志东 Business risk appraisal procedure, system and medium
CN109670837A (en) * 2018-11-30 2019-04-23 平安科技(深圳)有限公司 Recognition methods, device, computer equipment and the storage medium of bond default risk
CN109885675A (en) * 2019-02-25 2019-06-14 合肥工业大学 Method is found based on the text sub-topic for improving LDA
CN109993448A (en) * 2019-04-08 2019-07-09 湖北风口网络科技有限公司 A kind of appraisal procedure and system of enterprise network public sentiment potential risk
CN110493190A (en) * 2019-07-15 2019-11-22 平安科技(深圳)有限公司 Processing method, device, computer equipment and the storage medium of data information
CN110704572A (en) * 2019-09-04 2020-01-17 北京航空航天大学 Suspected illegal fundraising risk early warning method, device, equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Feature extension for Chinese short text classification based on LDA and Word2vec;Sun,Fanke 等;《2018 13th IEEE Conference on Industrial Electronics and Applications(ICIEA)》;1189-1194 *
Microblog sentiment classification based on sentiment word vectors; Du Hui et al.; Journal of Chinese Information Processing; Vol. 31, No. 3; 170-176 *
Research on Chinese short text classification based on word vector feature extension; Lei Shuo et al.; Computer Applications and Software; Vol. 35, No. 8; 269-274 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100029 Beijing city Chaoyang District Yumin Road No. 3

Applicant after: NATIONAL COMPUTER NETWORK AND INFORMATION SECURITY MANAGEMENT CENTER

Applicant after: BEIJING ZHONGKE WENGE TECHNOLOGY Co.,Ltd.

Applicant after: Guoke Zhian (Beijing) Technology Co.,Ltd.

Address before: 100029 Beijing city Chaoyang District Yumin Road No. 3

Applicant before: NATIONAL COMPUTER NETWORK AND INFORMATION SECURITY MANAGEMENT CENTER

Applicant before: BEIJING ZHONGKE WENGE TECHNOLOGY Co.,Ltd.

Applicant before: Beijing Zhongke Wenge Zhian Technology Co.,Ltd.

GR01 Patent grant