CN112906384A - Data processing method, device and equipment based on BERT model and readable storage medium - Google Patents

Data processing method, device and equipment based on BERT model and readable storage medium

Info

Publication number
CN112906384A
Authority
CN
China
Prior art keywords
data
text
word segmentation
bert model
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110259634.9A
Other languages
Chinese (zh)
Other versions
CN112906384B (en
Inventor
苏雪琦
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110259634.9A priority Critical patent/CN112906384B/en
Publication of CN112906384A publication Critical patent/CN112906384A/en
Application granted granted Critical
Publication of CN112906384B publication Critical patent/CN112906384B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06Asset management; Financial planning or analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Operations Research (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence and provides a data processing method and device based on a BERT model, an electronic device and a computer-readable storage medium. The method comprises the following steps: acquiring text data related to target data within a preset data extraction range by using a Scrapy crawler tool; performing word segmentation and vectorization on the text data to obtain word segmentation vectors of the text data; processing the word segmentation vectors through a pre-trained BERT model to obtain the emotion polarity of the text data, wherein the emotion polarity is either positive or negative; constructing a data change index of the target data according to the emotion polarity of the text data; acquiring data change information of the target data over preset working days according to the data change index; and classifying the data change information with a logistic regression function to obtain the data change trend of the target data. The main purpose of the invention is to support correct investment decisions through sentiment analysis of stock comments.

Description

Data processing method, device and equipment based on BERT model and readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data processing method and device based on a BERT model, electronic equipment and a computer readable storage medium.
Background
Currently, when making investment decisions, stock investors are easily influenced by personal factors such as emotion and psychology. The stock market serves as a barometer of the macro economy and is influenced by policies, news and public opinion, so it is prone to severe fluctuation. With the development of internet technology, people increasingly express themselves and communicate on online platforms, and real-time stock comments often contain rich financial information that reflects the emotional and psychological changes of investors. At present, most stock investors use simple general-purpose models to analyze and process financial information; because these models are oversimplified, they cannot fully analyze such rich financial information, and the resulting investment decisions are inaccurate.
Therefore, in order to solve the above problems by mining and analyzing stock comment information from the stock market and performing sentiment analysis on it to obtain accurate investment decisions, it is desirable to provide a data processing method, apparatus, electronic device and computer-readable storage medium based on a BERT model.
Disclosure of Invention
The invention provides a data processing method, a data processing device, electronic equipment and a computer readable storage medium based on a BERT model, and mainly aims to obtain a correct investment decision through sentiment analysis on stock evaluation.
In order to achieve the above object, the present invention provides a method for processing data based on a BERT model, which is applied to an electronic device, the method comprising:
acquiring text data related to target data in a preset data extraction range by using a Scrapy crawler tool;
performing word segmentation and vectorization processing on the text data to obtain word segmentation vectors of the text data;
training the word segmentation vectors through a pre-trained BERT model to obtain the emotion polarity of the text data; wherein the emotion polarities comprise positive emotion direction and negative emotion direction;
constructing a data change index of the target data according to the emotion polarity of the text data;
acquiring data change information of the target data in a preset working day according to the data change index;
and classifying the data change information by adopting a logistic regression function to acquire the data change trend of the target data.
Optionally, acquiring text data related to the target data within a preset data extraction range by using the Scrapy crawler tool includes the following steps:
constructing a social network crawler system with the Scrapy framework, and obtaining stock comment information, timestamps and commenter information of a target stock in a social network with a distributed crawler algorithm; the target data is the target stock, and the text data related to the target data is the stock comment information, the timestamps and the commenter information;
and storing the crawled stock comment information, timestamps and commenter information of the target stock in a MongoDB database.
Optionally, the performing word segmentation and vectorization processing on the text data to obtain a word segmentation vector of the text data includes the following steps:
performing word segmentation on the acquired text data according to parts of speech to obtain text word segments, wherein the parts of speech comprise verbs, nouns, adjectives and adverbs;
and matching the obtained text word segments with entries in the dictionary of the pre-trained BERT model to obtain a word segmentation vector for each text word segment in the text data.
Optionally, processing the word segmentation vectors through a pre-trained BERT model to obtain the emotion polarity of the text data includes the following steps:
preprocessing the word segmentation vectors by using run_classifier.py to obtain preprocessed data;
training the pre-trained BERT model to obtain a trained BERT model that outputs the emotion category corresponding to the text data;
and inputting the preprocessed data into the trained BERT model to obtain the emotion polarity of the text data.
Optionally, the formula of the data change index is as follows:
Figure BDA0002969291660000021
wherein $r_p$ represents the influence contribution rate of positive emotion polarity comments;
$c_p$ represents the degree of influence of positive emotion polarity;
$r_n$ represents the influence contribution rate of negative emotion polarity comments;
$c_n$ represents the degree of influence of negative emotion polarity;
$T$ represents the online duration of the reviewer, $p$ denotes positive emotion, and $n$ denotes negative emotion.
Optionally, the classifying the data change information by using a logistic regression function to obtain the data change trend of the target data includes the following steps:
classifying the fall and rise information of the target stock in a preset working day by adopting a logistic regression function, wherein,
and when the data change index of the target stock is more than 0.5, determining the change trend of the target stock.
In order to solve the above problem, the present invention further provides a BERT model-based data processing apparatus, comprising:
the text data acquisition module is used for acquiring text data related to the target data in a preset data extraction range through a Scrapy crawler tool;
the word segmentation vector acquisition module is used for carrying out word segmentation and vectorization processing on the text data to acquire word segmentation vectors of the text data;
the emotion polarity acquisition module is used for training the word segmentation vectors through a pre-trained BERT model to acquire the emotion polarity of the text data; wherein the emotion polarities comprise positive emotion direction and negative emotion direction;
the data change index construction module is used for constructing a data change index of the target data according to the emotion polarity of the text data;
the data change information acquisition module is used for acquiring data change information of the target data in a preset working day according to the data change index;
and the data change trend acquisition module is used for classifying the data change information by adopting a logistic regression function to acquire the data change trend of the target data.
Optionally, the word segmentation vector obtaining module comprises a text word segmentation obtaining module and a text word segmentation vector obtaining module, wherein,
the text word segmentation acquisition module is used for performing word segmentation on the acquired text data according to parts of speech to obtain text word segments, wherein the parts of speech comprise verbs, nouns, adjectives and adverbs;
and the text word segmentation vector acquisition module is used for matching the obtained text word segments with entries in the dictionary of the pre-trained BERT model to obtain a word segmentation vector for each text word segment in the text data.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the BERT model-based data processing method described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, in which at least one instruction is stored, and the at least one instruction is executed by a processor in an electronic device to implement the BERT model-based data processing method described above.
The embodiment of the invention acquires text data through a Scrapy crawler tool; performs word segmentation and vectorization on the text data to obtain word segmentation vectors of the text data; processes the word segmentation vectors through a pre-trained BERT model to obtain the emotion polarity of the text data; constructs a data change index of the target data according to the emotion polarity of the text data; acquires data change information of the target data over preset working days according to the data change index; and classifies the data change information with a logistic regression function to obtain the data change trend of the target data. The main purpose of the invention is to support correct investment decisions through sentiment analysis of stock comments.
Drawings
Fig. 1 is a schematic flow chart of a BERT model-based data processing method according to an embodiment of the present invention;
Fig. 2 is a block diagram of a data processing apparatus based on a BERT model according to an embodiment of the present invention;
Fig. 3 is a schematic internal structural diagram of an electronic device implementing a BERT model-based data processing method according to an embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a data processing method based on a BERT model. Fig. 1 is a schematic flow chart of a BERT model-based data processing method according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the method for processing data based on the BERT model includes:
S1: acquiring text data related to target data in a preset data extraction range by using a Scrapy crawler tool;
S2: performing word segmentation and vectorization on the text data to obtain word segmentation vectors of the text data;
S3: processing the word segmentation vectors through a pre-trained BERT model to obtain the emotion polarity of the text data, wherein the emotion polarity is either positive or negative;
S4: constructing a data change index of the target data according to the emotion polarity of the text data;
S5: acquiring data change information of the target data over preset working days according to the data change index;
S6: classifying the data change information with a logistic regression function to obtain the data change trend of the target data.
In step S1 of the BERT model-based data processing method of the present invention, acquiring text data related to target data within a preset data extraction range by using a Scrapy crawler tool includes the following steps:
S11: constructing a microblog crawler system with the Scrapy framework, and using a distributed crawler algorithm to obtain stock comment information, timestamps and commenter information of a target stock in a microblog social network; the target data is the target stock, and the text data comprises the stock comment information, timestamps and commenter information of the target stock;
S12: storing the crawled stock comment information, timestamps and commenter information of the target stock in a MongoDB database.
In the embodiment of the invention, a Scrapy-based crawler is constructed; comment information, timestamps and commenter information (age, influence, number of posts, etc.) for a given stock over the last 3 months are acquired from social media such as microblogs, stock bars and Snowball, or from professional stock forums, and are stored in MongoDB.
Scrapy is a commonly used crawler tool whose main purpose is to crawl information from web pages. It is a packaged framework comprising a downloader, a parser, logging and an exception handling module, and it is well suited to developing crawlers for a fixed, single website.
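The preset data extraction range described above (comments on one stock from the last 3 months) can be illustrated with a small filter over already-crawled records. This is a minimal sketch using only the standard library: the `StockComment` fields, the `within_range` helper and the 90-day approximation of "3 months" are assumptions for illustration, and the crawling and MongoDB storage steps themselves are not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StockComment:
    """One crawled record; all field names are hypothetical."""
    stock_code: str
    text: str
    timestamp: datetime
    online_days: int     # reviewer's online duration
    forward_count: int   # number of times the comment was forwarded


def within_range(comments, stock_code, now, days=90):
    """Keep comments on `stock_code` posted within the last `days` days."""
    start = now - timedelta(days=days)
    return [
        c for c in comments
        if c.stock_code == stock_code and start <= c.timestamp <= now
    ]


now = datetime(2021, 3, 1)
comments = [
    StockComment("000001", "strong earnings", datetime(2021, 2, 20), 400, 12),
    StockComment("000001", "old news", datetime(2020, 10, 1), 90, 3),
    StockComment("600000", "other stock", datetime(2021, 2, 25), 30, 1),
]
recent = within_range(comments, "000001", now)
```

Only the first record survives the filter: the second is older than the window and the third belongs to a different stock.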
In step S2, the performing word segmentation and vectorization processing on the text data to obtain a word segmentation vector of the text data includes the following steps:
S21: performing word segmentation on the acquired text data according to parts of speech to obtain text word segments;
S22: matching the obtained text word segments with entries in the dictionary of the pre-trained BERT model to obtain a word segmentation vector for each text word segment in the text data.
In the embodiment of the present invention, in step S21, the text data is segmented according to parts of speech such as verbs, nouns, adjectives and adverbs; in step S22, the word segments obtained in S21 are matched with entries in the dictionary of the pre-trained text processing model to obtain a vectorized representation of each word segment in the text data.
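The dictionary matching in step S22 can be sketched as a lookup from word segment to vocabulary index to vector. The toy `VOCAB` and `EMBEDDINGS` tables below are hypothetical; a real BERT vocabulary has tens of thousands of entries and much higher-dimensional vectors, and the fallback to an `[UNK]` entry for unmatched segments is an assumption about how misses are handled.

```python
# Hypothetical toy vocabulary and embedding table for illustration only.
VOCAB = {"[UNK]": 0, "stock": 1, "rise": 2, "fall": 3}
EMBEDDINGS = [
    [0.0, 0.0],    # [UNK]
    [0.1, 0.3],    # stock
    [0.9, 0.2],    # rise
    [-0.8, 0.1],   # fall
]


def tokens_to_vectors(tokens):
    """Match each word segment against the vocabulary and return its vector."""
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    return [EMBEDDINGS[i] for i in ids]


vectors = tokens_to_vectors(["stock", "rise", "sharply"])
```

Here "sharply" is not in the toy vocabulary, so it maps to the `[UNK]` vector.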
In step S3, the training process is performed on the word segmentation vector through a pre-training BERT model obtained in advance to obtain the emotion polarity of the text data, including the following steps:
S31: preprocessing the word segmentation vectors by using run_classifier.py to obtain preprocessed data;
S32: training the pre-trained BERT model to obtain a trained BERT model that outputs the emotion category corresponding to the text data;
S33: inputting the preprocessed data into the trained BERT model to obtain the emotion polarity of the text data.
In an embodiment of the present invention, in step S31, the training data is preprocessed: the labeled text data is split into training, validation and test sets (for example, in a ratio of 8:1:1), and the text data is preprocessed using run_classifier.py.
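The 8:1:1 split mentioned above can be sketched as follows. The function name `split_8_1_1`, the fixed shuffle seed and the toy data are assumptions; the source does not say whether the data is shuffled before splitting.

```python
import random


def split_8_1_1(examples, seed=0):
    """Shuffle labeled examples and split them into train/validation/test."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


data = [(f"comment {i}", i % 2) for i in range(100)]
train, val, test = split_8_1_1(data)
```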
The BERT model is applied in two stages: a training stage and an inference stage. Step S32 is the training stage: the input of the BERT model is text data and the output is the emotion category corresponding to the text data (taken from the labeled data); by training the mapping from input to output, the model learns the correspondence between text data and emotion categories.
After the BERT model is trained, it can be used as an inference engine to process data. In step S33, the input to the BERT model is text data and the output is the emotion category. Unlike the training stage, the input may be text the model did not see during training, and the output is the result of the model's inference.
In practice, a logistic function is applied to the output of the model so that a probability between 0 and 1 describes the final emotion polarity result; the higher the probability, the more positive the emotion.
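The logistic mapping from a raw model score to a probability can be written in a few lines. This is a sketch: the `polarity` helper and its 0.5 cut-off for labeling a comment positive are assumptions for illustration.

```python
import math


def logistic(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def polarity(z, threshold=0.5):
    """Return an assumed polarity label together with its probability."""
    p = logistic(z)
    return ("positive" if p > threshold else "negative"), p


label, prob = polarity(2.0)
```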
In an embodiment of the present invention, in steps S32 and 33, the BERT model is modified, trained, and model trimmed.
Modifying the model: a subclass of the DataProcessor class, the emotion classification processing class custorProcessors, is written; this class reads in the prepared text and label data according to the required format, and custorProcessors is registered in the def main(_) function in run_classifier.py under the bert folder.
Like other deep learning models in the field of artificial intelligence, the BERT model can be understood as a black box that processes its input and outputs a result. During training, the model must be modified according to the task so that it can complete that task. In the embodiment of the invention, the BERT model classifies data into positive and negative emotion, so the model needs to be told that the current task is positive/negative emotion classification; if the task were instead classification of speaking intent, the model would need to be told that. This "telling process" is the process of modifying the model to adapt it to the corresponding task.
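A processor class of the kind described above might look like the following sketch. To keep it self-contained, `InputExample` and `DataProcessor` are stand-ins that mirror the interface of the classes of the same names in BERT's run_classifier.py; the `SentimentProcessor` name, the label encoding and the "label, tab, text" file layout are assumptions, not the patent's actual code.

```python
import csv
import os


class InputExample:
    """Stand-in for InputExample in BERT's run_classifier.py."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class DataProcessor:
    """Stand-in base class; a real run would import it from run_classifier.py."""

    def get_train_examples(self, data_dir):
        raise NotImplementedError

    def get_dev_examples(self, data_dir):
        raise NotImplementedError

    def get_labels(self):
        raise NotImplementedError


class SentimentProcessor(DataProcessor):
    """Hypothetical processor for two-class comment sentiment."""

    def get_labels(self):
        return ["0", "1"]  # assumed encoding: 0 = negative, 1 = positive

    def _read_tsv(self, path):
        # Assumed file layout: one "label<TAB>text" row per comment.
        with open(path, encoding="utf-8", newline="") as f:
            return list(csv.reader(f, delimiter="\t"))

    def _create_examples(self, rows, set_type):
        return [
            InputExample(guid=f"{set_type}-{i}", text_a=text, label=label)
            for i, (label, text) in enumerate(rows)
        ]

    def get_train_examples(self, data_dir):
        rows = self._read_tsv(os.path.join(data_dir, "train.tsv"))
        return self._create_examples(rows, "train")

    def get_dev_examples(self, data_dir):
        rows = self._read_tsv(os.path.join(data_dir, "dev.tsv"))
        return self._create_examples(rows, "dev")


examples = SentimentProcessor()._create_examples(
    [["1", "earnings beat expectations"], ["0", "shares look overvalued"]],
    "train",
)
```

In the original script, registration is typically done by adding the new class to the `processors` dictionary inside `main(_)` under a task name of your choosing.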
Training the model: the run_classifier.py script is run in the directory where the BERT pre-trained model is located to perform model training and obtain a preliminary trained model.
Fine-tuning the model: max_seq_length is the maximum sequence length; because stock comments are short, it can be set to 128. num_train_epochs is the number of training passes over the whole training set and can be increased as appropriate. After these adjustments, the model is trained again, and the AUC, recall and precision values are checked to see whether they improve after the parameter changes.
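Two of the metrics named above, precision and recall, can be computed directly from predicted and true labels. This is a minimal sketch with hypothetical toy labels (1 = positive, 0 = negative); AUC is omitted because it needs predicted probabilities rather than hard labels.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
precision, recall = precision_recall(y_true, y_pred)
```

Comparing these values before and after changing `max_seq_length` or `num_train_epochs` shows whether the adjustment helped.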
In the embodiment of the invention, the annotation can be carried out according to an emotion dictionary, for example, BosonNLP is an emotion dictionary constructed based on data sources such as microblogs, news and forums, provides a positive emotion vocabulary and a negative emotion vocabulary, and expands the positive emotion vocabulary and the negative emotion vocabulary to serve as a judgment standard of the annotation.
In step S4, the data change index is the expanding index of the target stock; the specific formula is as follows:
Figure BDA0002969291660000081
$r_p$: the influence contribution rate of positive emotion polarity comments;
$c_p$: a variable parameter representing the degree of influence of positive emotion polarity, usually set to 0.5;
$r_n$: the influence contribution rate of negative emotion polarity comments;
$c_n$: a variable parameter representing the degree of influence of negative emotion polarity, usually set to 0.5;
$r_i = \sum T_i f_i / \sum T f$: the influence contribution rate of positive ($i = p$) or negative ($i = n$) emotion polarity comments;
$T$ and $f$ respectively represent the online duration of the reviewer and the number of times the comment was forwarded; $p$ denotes positive emotion and $n$ denotes negative emotion.
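The contribution rate $r_i$ can be sketched in code under one reading of the definition above: the numerator sums $T \cdot f$ over comments of polarity $i$, and the denominator sums $T \cdot f$ over all comments. That reading, the `contribution_rates` name and the toy values are assumptions; the index formula itself appears only as an image in the source and is not reproduced here.

```python
def contribution_rates(comments):
    """comments: list of (polarity, T, f) tuples with polarity 'p' or 'n'."""
    total = sum(T * f for _, T, f in comments)
    if total == 0:
        return {"p": 0.0, "n": 0.0}
    return {
        pol: sum(T * f for c, T, f in comments if c == pol) / total
        for pol in ("p", "n")
    }


# (polarity, online duration T, forward count f) for three toy comments
comments = [("p", 100, 3), ("p", 50, 2), ("n", 200, 1)]
rates = contribution_rates(comments)
```

With these toy values the positive comments account for 400 of the 600 weighted units, so `rates["p"]` is two thirds and `rates["n"]` is one third.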
In steps S5 and S6, data change information of the target data over preset working days is obtained according to the data change index, and a logistic regression function is used to classify the data change information to obtain the data change trend of the target data. In the embodiment of the invention, a sliding window of 5 trading days is used: the opening price on the first day of the window is compared with the closing price on the last day to judge whether the stock rose, labeled 1 for a rise and 0 otherwise. This label is taken as the explained variable, and the proxy index PI over the 5 trading days, the stock turnover rate, and the mean and standard deviation of the capital trading volume data are each input as explanatory variables.
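The sliding-window labeling can be sketched as below. It assumes that a window is labeled 1 when the closing price on its last day is strictly higher than the opening price on its first day, and 0 otherwise; the function name and the price series are hypothetical.

```python
def window_labels(opens, closes, window=5):
    """Label each sliding window: 1 if last close > first open, else 0."""
    labels = []
    for start in range(len(opens) - window + 1):
        end = start + window - 1
        labels.append(1 if closes[end] > opens[start] else 0)
    return labels


opens = [10.0, 10.2, 10.1, 10.4, 10.6, 10.5, 10.3]
closes = [10.1, 10.0, 10.3, 10.5, 10.7, 10.2, 10.0]
labels = window_labels(opens, closes)
```

Seven trading days yield three 5-day windows, so `labels` has three entries.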
The logistic function takes inputs in (-∞, +∞) and produces outputs in (0, 1), which exactly matches the requirement that a classifier described by probabilities output values in (0, 1). The function is monotonically increasing and continuous with no discontinuities, so logistic regression is selected to train the model. When the output probability of the model is greater than 0.5, the stock is judged to have an upward trend, and the model favors buying on the next trading day.
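As an illustration of the classification step, the sketch below fits a logistic regression by batch gradient descent and applies the 0.5 threshold. The single toy feature, the learning rate, the epoch count and the function names are all assumptions; the embodiment's actual features and training procedure are not specified at this level of detail.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_up(w, b, x, threshold=0.5):
    """Return True when the predicted probability of a rise exceeds 0.5."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > threshold


# Toy data: one feature (for example a sentiment index), label 1 = rise.
X = [[-1.0], [-0.5], [0.5], [1.0]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
```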
In step S6, classifying the data change information with a logistic regression function to obtain the data change trend of the target data includes: classifying the rise and fall information of the target stock over preset working days with a logistic regression function, wherein the change trend of the target stock is determined when the data change index of the target stock is greater than 0.5.
In the embodiment of the invention, the rich financial information contained in stock comments, which reflects the emotional and psychological changes of investors, is acquired in real time through a crawler tool; the stock comments are mined and analyzed through the BERT model to obtain their sentiment, thereby supporting investment decisions based on the BERT model.
The embodiment of the invention acquires text data through a Scrapy crawler tool; performs word segmentation and vectorization on the text data to obtain word segmentation vectors of the text data; processes the word segmentation vectors through a pre-trained BERT model to obtain the emotion polarity of the text data; constructs a data change index of the target data according to the emotion polarity of the text data; acquires data change information of the target data over preset working days according to the data change index; and classifies the data change information with a logistic regression function to obtain the data change trend of the target data. The main purpose of the invention is to support correct investment decisions through sentiment analysis of stock comments.
Fig. 2 is a functional block diagram of a BERT model-based data processing apparatus according to the present invention.
The BERT model-based data processing apparatus 100 according to the present invention may be installed in an electronic device. According to an implemented function, the BERT model-based data processing apparatus may include: the system comprises a text data acquisition module 101, a word segmentation vector acquisition module 102, an emotion polarity acquisition module 103, a data change index construction module 104, a data change information acquisition module 105 and a data change trend acquisition module 106. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the text data acquisition module 101 is used for acquiring text data related to the target data within a preset data extraction range through a Scrapy crawler tool;
a word segmentation vector acquisition module 102, configured to perform word segmentation and vectorization on the text data to acquire a word segmentation vector of the text data;
an emotion polarity acquisition module 103, configured to perform training processing on the word segmentation vector through a pre-trained BERT model, and acquire an emotion polarity of the text data; wherein the emotion polarities comprise positive emotion direction and negative emotion direction;
a data change index construction module 104, configured to construct a data change index of the target data according to the emotion polarity of the text data;
a data change information obtaining module 105, configured to obtain, according to the data change index, data change information of the target data in a preset workday;
and the data change trend acquisition module 106 is configured to perform classification processing on the data change information by using a logistic regression function, and acquire a data change trend of the target data.
In addition, the word segmentation vector acquisition module comprises a text word segmentation acquisition module and a text word segmentation vector acquisition module, wherein,
the text word segmentation acquisition module is used for performing word segmentation on the acquired text data according to parts of speech to obtain text word segments, wherein the parts of speech comprise verbs, nouns, adjectives and adverbs;
and the text word segmentation vector acquisition module is used for matching the obtained text word segments with entries in the dictionary of the pre-trained BERT model to obtain a word segmentation vector for each text word segment in the text data.
In the embodiment of the invention, text data is acquired through a Scrapy crawler tool; word segmentation and vectorization are performed on the text data to obtain word segmentation vectors of the text data; the word segmentation vectors are processed through a pre-trained BERT model to obtain the emotion polarity of the text data; a data change index of the target data is constructed according to the emotion polarity of the text data; data change information of the target data over preset working days is acquired according to the data change index; and the data change information is classified with a logistic regression function to obtain the data change trend of the target data. The main purpose of the invention is to support correct investment decisions through sentiment analysis of stock comments.
Fig. 3 is a schematic structural diagram of an electronic device implementing the BERT model-based data processing method according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a BERT model based data processing program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, such as flash memory, a removable hard disk, a multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, a magnetic disk, or an optical disk. In some embodiments the memory 11 may be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. In other embodiments the memory 11 may be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the BERT model-based data processing program 12, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or a plurality of packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the control unit of the electronic device; it connects the components of the whole electronic device through various interfaces and lines, and executes the functions of the electronic device 1 and processes its data by running or executing programs or modules stored in the memory 11 (e.g., the BERT model-based data processing program 12) and by calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device with certain components. It will be understood by a person skilled in the art that the structure shown in Fig. 3 does not constitute a limitation of the electronic device 1, which may comprise fewer or more components than shown, a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that charge management, discharge management, power consumption management, and similar functions are implemented through the power management device. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and any other such components. The electronic device 1 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described in detail here.
Further, the electronic device 1 may further include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The memory 11 in the electronic device 1 stores a BERT model-based data processing program 12 that is a combination of instructions that, when executed in the processor 10, enable:
acquiring text data related to target data in a preset data extraction range by using a Scrapy crawler tool;
performing word segmentation and vectorization processing on the text data to obtain word segmentation vectors of the text data;
training the word segmentation vectors through a pre-trained BERT model to obtain the emotion polarity of the text data; wherein the emotion polarities comprise positive emotion direction and negative emotion direction;
constructing a data change index of the target data according to the emotion polarity of the text data;
acquiring data change information of the target data in a preset working day according to the data change index;
and classifying the data change information by adopting a logistic regression function to acquire the data change trend of the target data.
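As an illustration of how polarity labels might be organised per working day before an index is computed, the following sketch groups labelled comments by date and skips weekends. The function name, the `(timestamp, polarity)` input shape, and the Monday-to-Friday definition of a working day are assumptions; public holidays are not handled, and the sketch does not attempt to compute the data change index itself.

```python
from collections import defaultdict
from datetime import datetime


def polarity_counts_by_workday(comments):
    """Count positive and negative comments per working day.

    comments: iterable of (iso_timestamp, polarity) pairs, where polarity
    is 'positive' or 'negative'. Saturdays and Sundays are skipped.
    """
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for iso_timestamp, polarity in comments:
        day = datetime.fromisoformat(iso_timestamp).date()
        if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            continue
        counts[day.isoformat()][polarity] += 1
    return dict(counts)


# Hypothetical sample data for illustration only.
sample = [
    ("2021-03-10T09:30:00", "positive"),
    ("2021-03-10T14:05:00", "negative"),
    ("2021-03-10T15:00:00", "positive"),
    ("2021-03-13T10:00:00", "positive"),  # a Saturday, so it is skipped
]
daily = polarity_counts_by_workday(sample)
```

Per-day counts of this kind could then feed whatever weighting the index applies to positive and negative comments.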
For the specific manner in which the processor 10 implements the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to Fig. 1, which is not repeated here.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable medium may include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
In an embodiment of the present invention, a computer-readable storage medium stores a computer program, which when executed by a processor implements the steps of a BERT model-based data processing method, the method specifically including:
acquiring text data related to target data in a preset data extraction range by using a Scrapy crawler tool;
performing word segmentation and vectorization processing on the text data to obtain word segmentation vectors of the text data;
training the word segmentation vectors through a pre-trained BERT model to obtain the emotion polarity of the text data; wherein the emotion polarities comprise positive emotion direction and negative emotion direction;
constructing a data change index of the target data according to the emotion polarity of the text data;
acquiring data change information of the target data in a preset working day according to the data change index;
and classifying the data change information by adopting a logistic regression function to acquire the data change trend of the target data.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative; the division of the modules is only one logical functional division, and other divisions may be used in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A BERT model-based data processing method, applied to an electronic device, characterized by comprising the following steps:
acquiring text data related to target data in a preset data extraction range by using a Scrapy crawler tool;
performing word segmentation and vectorization processing on the text data to obtain word segmentation vectors of the text data;
training the word segmentation vectors through a pre-trained BERT model to obtain the emotion polarity of the text data; wherein the emotion polarities comprise positive emotion direction and negative emotion direction;
constructing a data change index of the target data according to the emotion polarity of the text data;
acquiring data change information of the target data in a preset working day according to the data change index;
and classifying the data change information by adopting a logistic regression function to acquire the data change trend of the target data.
2. The BERT model-based data processing method as claimed in claim 1, wherein acquiring the text data related to the target data within the preset data extraction range by using the Scrapy crawler tool comprises the steps of:
constructing a social network crawler system using the Scrapy framework, and obtaining stock comment information, timestamps and commenter data for a target stock on a social network using a distributed crawler algorithm; wherein the target data is the target stock, and the text data related to the target data is the stock comment information, the timestamps and the commenter data;
and storing the crawled stock comment information, timestamps and commenter data of the target stock in a MongoDB database.
3. The BERT model-based data processing method as claimed in claim 1, wherein the performing word segmentation and vectorization processing on the text data to obtain word segmentation vectors of the text data comprises the steps of:
performing word segmentation on the acquired text data according to parts of speech to obtain text word segments, wherein the parts of speech comprise verbs, nouns, adjectives and adverbs;
and matching the obtained text word segments against entries in the dictionary of the pre-trained BERT model to obtain a text word segmentation vector for each text word segment in the text data.
4. The BERT model-based data processing method of claim 1,
wherein training the word segmentation vectors through the pre-trained BERT model to obtain the emotion polarity of the text data comprises the following steps:
preprocessing the word segmentation vectors by using the run_classifier.py script to obtain preprocessed data; and, at the same time,
pre-training the BERT model to obtain a trained BERT model with emotion types corresponding to the text data;
and inputting the obtained preprocessed data into the trained BERT model with emotion types corresponding to the text data for processing, to obtain the emotion polarity of the text data.
5. The BERT model-based data processing method of claim 1,
the formula of the data change index is as follows:
Figure FDA0002969291650000021
wherein r_p represents the influence contribution rate of positive emotion polarity comments;
c_p represents the degree of influence of positive emotion polarity;
r_n represents the influence contribution rate of negative emotion polarity comments;
c_n represents the degree of influence of negative emotion polarity;
t represents the online time length of the reviewer, p represents positive emotion, and n represents negative emotion.
6. The BERT model-based data processing method of claim 5,
the method for classifying the data change information by adopting the logistic regression function to acquire the data change trend of the target data comprises the following steps:
classifying the fall and rise information of the target stock in a preset working day by adopting a logistic regression function, wherein,
and when the data change index of the target stock is more than 0.5, determining the change trend of the target stock.
7. A BERT model-based data processing apparatus, the apparatus comprising:
the text data acquisition module is used for acquiring text data related to the target data in a preset data extraction range through a Scrapy crawler tool;
the word segmentation vector acquisition module is used for carrying out word segmentation and vectorization processing on the text data to acquire word segmentation vectors of the text data;
the emotion polarity acquisition module is used for training the word segmentation vectors through a pre-trained BERT model to acquire the emotion polarity of the text data; wherein the emotion polarities comprise positive emotion direction and negative emotion direction;
the data change index construction module is used for constructing a data change index of the target data according to the emotion polarity of the text data;
the data change information acquisition module is used for acquiring data change information of the target data in a preset working day according to the data change index;
and the data change trend acquisition module is used for classifying the data change information by adopting a logistic regression function to acquire the data change trend of the target data.
8. The BERT model-based data processing apparatus of claim 7,
the word segmentation vector acquisition module comprises a text word segmentation acquisition module and a text word segmentation vector acquisition module, wherein,
the text word segmentation acquisition module is used for performing word segmentation on the acquired text data according to parts of speech to obtain text word segments, wherein the parts of speech comprise verbs, nouns, adjectives and adverbs;
and the text word segmentation vector acquisition module is used for matching the obtained text word segments against entries in the dictionary of the pre-trained BERT model to obtain a text word segmentation vector for each text word segment in the text data.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the BERT model based data processing method as claimed in any one of claims 1 to 6.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the BERT model-based data processing method according to any one of claims 1 to 6.
CN202110259634.9A 2021-03-10 2021-03-10 BERT model-based data processing method, BERT model-based data processing device, BERT model-based data processing equipment and readable storage medium Active CN112906384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110259634.9A CN112906384B (en) 2021-03-10 2021-03-10 BERT model-based data processing method, BERT model-based data processing device, BERT model-based data processing equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110259634.9A CN112906384B (en) 2021-03-10 2021-03-10 BERT model-based data processing method, BERT model-based data processing device, BERT model-based data processing equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112906384A true CN112906384A (en) 2021-06-04
CN112906384B CN112906384B (en) 2024-02-02

Family

ID=76108691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110259634.9A Active CN112906384B (en) 2021-03-10 2021-03-10 BERT model-based data processing method, BERT model-based data processing device, BERT model-based data processing equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112906384B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124464A1 (en) * 2015-10-28 2017-05-04 Fractal Industries, Inc. Rapid predictive analysis of very large data sets using the distributed computational graph
CN108647823A (en) * 2018-05-10 2018-10-12 北京航空航天大学 Stock certificate data analysis method based on deep learning and device
CN111209401A (en) * 2020-01-03 2020-05-29 西安电子科技大学 System and method for classifying and processing sentiment polarity of online public opinion text information
CN111984793A (en) * 2020-09-03 2020-11-24 平安国际智慧城市科技股份有限公司 Text emotion classification model training method and device, computer equipment and medium
CN112231483A (en) * 2020-11-06 2021-01-15 中国水利水电科学研究院 Disaster tracking method, disaster tracking system, disaster tracking device and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113447820A (en) * 2021-06-29 2021-09-28 国网北京市电力公司 Electric quantity monitoring method and device, intelligent ammeter and processor
CN113591475A (en) * 2021-08-03 2021-11-02 美的集团(上海)有限公司 Unsupervised interpretable word segmentation method and device and electronic equipment
CN114297382A (en) * 2021-12-28 2022-04-08 杭州电子科技大学 Controllable text generation method based on parameter fine adjustment of generative pre-training model
CN114297382B (en) * 2021-12-28 2022-06-10 杭州电子科技大学 Controllable text generation method based on parameter fine adjustment of generative pre-training model
CN114386433A (en) * 2022-01-12 2022-04-22 中国农业银行股份有限公司 Data processing method, device and equipment based on emotion analysis and storage medium
CN116882412A (en) * 2023-06-29 2023-10-13 易方达基金管理有限公司 Semantic reasoning method and system based on NLP classification

Also Published As

Publication number Publication date
CN112906384B (en) 2024-02-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant