CN116956896A - Text analysis method, system, electronic equipment and medium based on artificial intelligence

Publication number
CN116956896A
Authority
CN
China
Prior art keywords
text
model
target
analysis
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310972661.XA
Other languages
Chinese (zh)
Inventor
陈飞
卢林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tewei Kechuang Information Technology Co ltd
Original Assignee
Shenzhen Tewei Kechuang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Tewei Kechuang Information Technology Co ltd
Priority to CN202310972661.XA
Publication of CN116956896A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of data analysis, and provides a text analysis method, system, electronic equipment and medium based on artificial intelligence. When a target text is analyzed, original text related to the target text is acquired and subjected to data enhancement processing. This ensures the quality of the training data, expands the volume of training data, avoids overfitting of the model, and improves the analysis effect on the target text. A plurality of model frames corresponding to the application field of the target text are obtained from a preset model library so that the models are trained in a targeted manner, which improves model performance. The plurality of model frames are trained on the original text and the enhanced text to obtain a plurality of text analysis models, a target text analysis model is selected based on a plurality of evaluation indexes, and data analysis is performed on the target text with the target text analysis model, which further improves the analysis effect on the target text.

Description

Text analysis method, system, electronic equipment and medium based on artificial intelligence
Technical Field
The application relates to the technical field of data analysis, in particular to a text analysis method, a system, electronic equipment and a medium based on artificial intelligence.
Background
With the progress of society and the development of science and technology, a large amount of comment information expressing views and feelings is generated on the internet. This comment information is of great significance for understanding user demands, trends in public opinion, social expectations and the like. NLP-based emotion analysis is a technology that analyzes the viewpoints, emotions, moods, evaluations and attitudes expressed in people's comment texts on products, services, organizations, individuals, problems, events, topics and the like.
NLP-based emotion analysis usually requires training an artificial intelligence model. Because the samples are unbalanced, however, the model tends to overfit and has poor robustness, so it performs poorly when analyzing texts. In addition, text analysis in different application fields emphasizes different aspects, so analyzing texts from different application fields with the same model gives low analysis accuracy.
Disclosure of Invention
In view of the above, the application provides a text analysis method, a system, an electronic device and a medium based on artificial intelligence, so as to solve the technical problem of poor text analysis accuracy.
A first aspect of the present application provides an artificial intelligence based text analysis method, the method comprising:
responding to an analysis instruction of a user on a target text, and acquiring an original text related to the target text from a text library;
performing data enhancement processing on the original text to obtain an enhanced text;
identifying the application field of the target text, and acquiring a plurality of model frames corresponding to the application field from a preset artificial intelligence model library;
training based on the original text and the enhanced text by using the model frames to obtain a plurality of text analysis models;
selecting a target text analysis model from a plurality of the text analysis models based on a plurality of evaluation indexes;
and carrying out data analysis on the target text by using the target text analysis model.
In one possible implementation manner, the performing data enhancement processing on the original text to obtain enhanced text includes:
word segmentation processing is carried out on the original text to obtain a plurality of text keywords;
calculating a first weight of the text keyword in the text library, and calculating a second weight of the text keyword in the original text;
and carrying out enhancement processing on the original text according to the first weight and the second weight of the keyword to obtain the enhanced text.
In one possible implementation manner, the enhancing the original text according to the first weight and the second weight of the keyword, to obtain the enhanced text includes:
comparing the first weight with a first preset weight threshold value, and comparing the second weight with a second preset weight threshold value;
when the first weight is smaller than the first preset weight threshold and the second weight is larger than the second preset weight threshold, a first random probability is obtained from a first preset random probability array, and masking is carried out on the keywords according to the first random probability;
when the first weight is smaller than the first preset weight threshold value and the second weight is smaller than the second preset weight threshold value, acquiring a second random probability from a second preset random probability array, and deleting the keyword according to the second random probability;
and when the first weight is greater than the first preset weight threshold and the second weight is greater than the second preset weight threshold, acquiring a third random probability from a third preset random probability array, and replacing the keyword according to the third random probability.
In one possible implementation manner, the obtaining the original text related to the target text from the text library includes:
acquiring a preset number of stored texts from the text library;
obtaining the similarity between each acquired stored text and the target text;
acquiring, from the acquired stored texts, a target stored text whose similarity is greater than a preset similarity threshold;
performing cluster analysis on all the stored texts in the text library to obtain a plurality of text clusters;
determining a text cluster comprising the target stored text as a target text cluster;
and determining the stored text in the target text cluster as the original text related to the target text.
In one possible implementation, the training based on the original text and the enhanced text using the plurality of model frames to obtain a plurality of text analysis models includes:
extracting the topic information of the original text and the enhanced text;
obtaining a combined feature vector according to the topic information;
training based on the combined feature vectors by using the model frames to obtain a plurality of text analysis models.
In one possible implementation manner, the extracting the topic information of the original text and the enhanced text includes:
extracting topics from the original text and the enhanced text by using a hierarchical Dirichlet process algorithm to obtain the topic information, wherein the topic information comprises a text-topic distribution and a topic-word distribution.
In one possible implementation manner, the selecting a target text analysis model from a plurality of the text analysis models based on a plurality of evaluation indexes includes:
displaying each text analysis model and a plurality of corresponding evaluation index values, and taking the text analysis model selected by a user as the target text analysis model; or
calculating a weighted average of the plurality of evaluation index values corresponding to each text analysis model, and taking the text analysis model with the largest weighted average as the target text analysis model, or taking a text analysis model whose weighted average is greater than the mean of the weighted averages as the target text analysis model.
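The weighted-average selection described above can be sketched as follows. This is a minimal illustration only: the index names (`accuracy`, `recall`, `f1`), the weights, and the function names are hypothetical, since the text above speaks only of "a plurality of evaluation indexes".

```python
from statistics import mean

def weighted_score(metrics, weights):
    """Weighted average of evaluation index values; weights need not sum to 1."""
    total = sum(weights[name] for name in metrics)
    return sum(metrics[name] * weights[name] for name in metrics) / total

def select_best(models, weights):
    """Return the name of the model with the largest weighted average."""
    return max(models, key=lambda name: weighted_score(models[name], weights))

def select_above_average(models, weights):
    """Return the names of models whose weighted average exceeds the mean score."""
    scores = {name: weighted_score(m, weights) for name, m in models.items()}
    avg = mean(scores.values())
    return [name for name, score in scores.items() if score > avg]

# Hypothetical evaluation index values for three trained models.
models = {
    "lstm": {"accuracy": 0.90, "recall": 0.80, "f1": 0.85},
    "rnn": {"accuracy": 0.80, "recall": 0.70, "f1": 0.75},
    "cnn": {"accuracy": 0.85, "recall": 0.90, "f1": 0.90},
}
weights = {"accuracy": 0.5, "recall": 0.2, "f1": 0.3}

best = select_best(models, weights)
candidates = select_above_average(models, weights)
```

Returning a list from `select_above_average` reflects that more than one model can exceed the mean; how a single model is then picked from that list is not specified in the text.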
A second aspect of the present application provides an artificial intelligence based text analysis system, the system comprising:
the text acquisition module is used for responding to an analysis instruction of a user on a target text and acquiring an original text related to the target text from a text library;
the enhancement processing module is used for carrying out data enhancement processing on the original text to obtain an enhanced text;
the model acquisition module is used for identifying the application field of the target text and acquiring a plurality of model frames corresponding to the application field from a preset model library;
the model training module is used for training based on the original text and the enhanced text by using the plurality of model frames to obtain a plurality of text analysis models;
an index evaluation module for selecting a target text analysis model from a plurality of the text analysis models based on a plurality of evaluation indexes;
and the text analysis module is used for carrying out data analysis on the target text by using the target text analysis model.
A third aspect of the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the artificial intelligence based text analysis method when executing the computer program.
A fourth aspect of the application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the artificial intelligence based text analysis method.
According to the text analysis method, system, electronic equipment and medium based on artificial intelligence, when the target text is analyzed, original text related to the target text is acquired and subjected to data enhancement processing. This ensures the quality of the training data, expands the volume of training data, avoids overfitting of the model, and improves the analysis effect on the target text. A plurality of model frames corresponding to the application field of the target text are obtained from a preset model library so that the models are trained in a targeted manner, which improves model performance. The plurality of model frames are trained on the original text and the enhanced text to obtain a plurality of text analysis models, a target text analysis model is selected based on a plurality of evaluation indexes, and data analysis is performed on the target text with the target text analysis model, which further improves the analysis effect on the target text.
Drawings
FIG. 1 is a flow chart of an artificial intelligence based text analysis method shown in an embodiment of the application;
FIG. 2 is a functional block diagram of an artificial intelligence based text analysis system according to an embodiment of the present application;
FIG. 3 is a block diagram of an electronic device shown in an embodiment of the application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
The artificial intelligence based text analysis method is executed by an electronic device; accordingly, the artificial intelligence based text analysis system runs in the electronic device.
FIG. 1 is a flowchart of an artificial intelligence based text analysis method according to an embodiment of the present application. The method specifically comprises the following steps. The order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
S11, responding to an analysis instruction of a user on a target text, and acquiring an original text related to the target text from a text library.
The target text refers to data needing text analysis.
The electronic equipment is preset with a text library, and a plurality of texts are stored in the text library. For ease of description, the text stored in the text library will be referred to as stored text.
When the user uploads the target text to the electronic device, an analysis instruction for analyzing the target text is triggered. In response, the electronic device acquires a plurality of stored texts from the text library, the acquired stored texts being correlated with the target text.
In one possible implementation manner, the obtaining the original text related to the target text from the text library includes:
acquiring a preset number of stored texts from the text library;
obtaining the similarity between each acquired stored text and the target text;
acquiring, from the acquired stored texts, a target stored text whose similarity is greater than a preset similarity threshold;
performing cluster analysis on all the stored texts in the text library to obtain a plurality of text clusters;
determining a text cluster comprising the target stored text as a target text cluster;
and determining the stored text in the target text cluster as the original text related to the target text.
Because the text library stores a large number of stored texts, the original text related to the target text can be obtained quickly by first randomly acquiring a small, preset number of stored texts from the text library. The similarity between each acquired stored text and the target text is then calculated from the Euclidean distance, and the correlation between the acquired stored text and the target text is judged based on that similarity.
When the similarity of an acquired stored text is greater than the preset similarity threshold, the stored text is strongly correlated with the target text and is kept as a target stored text. When the similarity is smaller than the preset similarity threshold, the stored text is only weakly correlated with the target text.
Next, all stored texts in the text library are analyzed with a cluster analysis algorithm and divided into a plurality of text clusters. Stored texts in the same text cluster are strongly correlated, while stored texts in different text clusters are weakly correlated. If a text cluster includes a target stored text, the stored texts in that cluster are all strongly correlated with the target text, so the text cluster is determined as the target text cluster and the stored texts in it are determined as the original text related to the target text.
For example, assume that all stored texts in the text library are divided into 5 text clusters and that 50 stored texts are randomly acquired from the text library. If the similarity between stored text D1 and the target text is greater than 0.8, D1 is taken as a target stored text and added to the target text set. If the 1st text cluster includes D1, the 1st text cluster is taken as the target text cluster, and the stored texts in the 1st text cluster are determined as the original text related to the target text.
In an alternative embodiment, after the text clusters including target stored texts are determined as target text clusters, the number of target stored texts in each target text cluster may be counted, and the stored texts in the target text cluster with the largest count are determined as the original text related to the target text. Alternatively, the stored texts in the target text clusters whose count exceeds the average are determined as the original text related to the target text.
In the above alternative embodiment, a preset number of stored texts is acquired, and the target stored texts strongly correlated with the target text are determined among them according to the similarity. Relying on the principle that stored texts in the same text cluster are strongly correlated, all stored texts in the text library are divided into a plurality of text clusters, the text cluster including a target stored text is determined as the target text cluster, and the stored texts in the target text cluster are determined as the original text related to the target text. This avoids calculating the similarity between every stored text in the text library and the target text, which reduces the amount of computation and improves the efficiency of acquiring the original text related to the target text.
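The sample-then-cluster retrieval described above might be sketched as follows. Several points are assumptions made for illustration: each stored text is already represented as a numeric vector, cluster labels have already been produced by some clustering algorithm, and the Euclidean distance d is mapped to a similarity with 1 / (1 + d), a conversion the text does not specify. The function names are hypothetical. The sketch follows the variant that keeps the cluster containing the most target stored texts.

```python
import math
import random
from collections import Counter

def similarity(a, b):
    """Map Euclidean distance to (0, 1]; identical vectors give 1.0 (assumed mapping)."""
    return 1.0 / (1.0 + math.dist(a, b))

def find_original_texts(vectors, cluster_of, target_vec, sample_size, threshold, rng):
    """Return indices of stored texts in the cluster holding the most target stored texts."""
    # Step 1: randomly sample a preset number of stored texts.
    sampled = rng.sample(range(len(vectors)), min(sample_size, len(vectors)))
    # Step 2: keep the sampled texts whose similarity exceeds the threshold.
    hits = [i for i in sampled if similarity(vectors[i], target_vec) > threshold]
    if not hits:
        return []
    # Step 3: choose the cluster containing the most target stored texts.
    target_cluster = Counter(cluster_of[i] for i in hits).most_common(1)[0][0]
    # Step 4: every stored text in that cluster becomes original text.
    return [i for i, c in enumerate(cluster_of) if c == target_cluster]

# Toy data: five stored texts as 2-D vectors, with precomputed cluster labels.
vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
cluster_of = [0, 0, 1, 1, 0]
original = find_original_texts(vectors, cluster_of, (0.0, 0.05), 5, 0.8, random.Random(0))
```

Only the sampled texts are compared with the target text, so the number of similarity computations is bounded by `sample_size` rather than by the size of the text library.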
In one possible implementation, before the original text related to the target text is obtained from the text library, the target text may also be cleaned. The cleaning process may include, but is not limited to: deleting irrelevant information in the target text, removing redundant punctuation marks, filtering out short texts, correcting typos, filling missing values, normalizing and the like.
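A few of the cleaning steps listed above could look like the following sketch. The concrete rules are assumptions chosen for illustration (treating HTML-like tags as irrelevant information, lowercasing as normalization, a minimum length of 5 characters); typo correction and missing-value filling are left out.

```python
import re

def clean_text(text, min_length=5):
    """Return a cleaned text, or None when the result is too short to keep."""
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML-like tags
    text = re.sub(r"([!?.,;:])\1+", r"\1", text)  # collapse repeated punctuation
    text = re.sub(r"\s+", " ", text).strip()      # normalize whitespace
    text = text.lower()                           # normalize case
    return text if len(text) >= min_length else None

cleaned = clean_text("<b>Great</b>   product!!!  ")
too_short = clean_text("ok??")
```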
And S12, carrying out data enhancement processing on the original text to obtain an enhanced text.
A small volume of original text can cause the text analysis model to overfit. To avoid this, the original text can be expanded with a data enhancement technique to increase its volume. Training the text analysis model on a large number of texts improves its generalization capability and makes it suitable for more application scenarios.
In one possible implementation manner, the performing data enhancement processing on the original text to obtain enhanced text includes:
word segmentation processing is carried out on the original text to obtain a plurality of text keywords;
calculating a first weight of the text keyword in the text library, and calculating a second weight of the text keyword in the original text;
and carrying out enhancement processing on the original text according to the first weight and the second weight of the keyword to obtain the enhanced text.
Word segmentation is one of the basic operations of natural language processing, namely, the segmentation of continuous text into individual tokens. Most natural language processing tools and language models process and analyze language text at the word level, while most raw corpus is presented in the form of string text. Therefore, word segmentation processing is required for the original text before analyzing the text.
A word segmenter based on preset rules can identify word boundaries and segment the original text into a series of words, yielding a plurality of text keywords of the original text. The text keywords provide summary information or important features of the text content and help subsequent text analysis and processing.
Each stored text in the text library corresponds to a plurality of text keywords, and a text-library word set can be obtained from the text keywords of all stored texts. The first weight of a text keyword in the text library is obtained by calculating the TF-IDF value of the keyword over the text-library word set. The second weight of the text keyword in the original text is obtained by calculating the TF-IDF value of the keyword over the text keywords of the original text.
The keywords in the original text are then enhanced based on the first weight and the second weight to obtain the enhanced text.
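One possible reading of the two weights can be sketched with a plain TF-IDF function. It is an assumption of this sketch that both weights share the same term frequency and differ only in the corpus used for the inverse document frequency (the whole text library for the first weight, the original texts for the second). The smoothed IDF formula and the function names are likewise illustrative choices, not taken from the text.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of `term` in `doc_tokens` relative to `corpus` (a list of token lists)."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
    return tf * idf

def keyword_weights(term, doc_tokens, library, originals):
    """Return (first weight over the text library, second weight over the original texts)."""
    return tf_idf(term, doc_tokens, library), tf_idf(term, doc_tokens, originals)

# Toy text library of four tokenized stored texts; the first two are the original texts.
library = [
    ["battery", "life", "good"],
    ["battery", "poor"],
    ["screen", "bright"],
    ["price", "fair"],
]
originals = library[:2]
first_weight, second_weight = keyword_weights("battery", library[0], library, originals)
```

Here "battery" appears in every original text but in only half of the library, so its second weight is lower than its first weight under this reading.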
In one possible implementation manner, the enhancing the original text according to the first weight and the second weight of the keyword, to obtain the enhanced text includes:
comparing the first weight with a first preset weight threshold value, and comparing the second weight with a second preset weight threshold value;
when the first weight is smaller than the first preset weight threshold and the second weight is larger than the second preset weight threshold, a first random probability is obtained from a first preset random probability array, and masking processing is carried out on the keywords according to the first random probability;
when the first weight is smaller than the first preset weight threshold value and the second weight is smaller than the second preset weight threshold value, acquiring a second random probability from a second preset random probability array, and deleting the keyword according to the second random probability;
and when the first weight is greater than the first preset weight threshold and the second weight is greater than the second preset weight threshold, acquiring a third random probability from a third preset random probability array, and replacing the keyword according to the third random probability.
When a keyword is processed, its first weight is compared with the first preset weight threshold to judge how important the keyword is in the text library, and its second weight is compared with the second preset weight threshold to judge how important the keyword is in the original text.
When the first weight is smaller than the first preset weight threshold and the second weight is larger than the second preset weight threshold, the keyword is unimportant in the text library but important in the original text. A probability value is randomly selected from the first preset random probability array as the first random probability, and the keyword is masked according to the first random probability, that is, one or some of the characters in the keyword are masked to form a new keyword.
When the first weight is smaller than the first preset weight threshold and the second weight is smaller than the second preset weight threshold, the keyword is unimportant both in the text library and in the original text. A probability value is randomly selected from the second preset random probability array as the second random probability, and the keyword is deleted according to the second random probability, that is, the keyword is removed from the original text.
When the first weight is greater than the first preset weight threshold and the second weight is greater than the second preset weight threshold, the keyword is important both in the text library and in the original text. A probability value is randomly selected from the third preset random probability array as the third random probability, and the keyword is replaced according to the third random probability, that is, the original keyword is replaced by a synonym.
It should be understood that when the first weight is greater than the first preset weight threshold, the case in which the second weight is less than the second preset weight threshold does not arise.
In the above alternative embodiment, random masking and synonym replacement introduce new vocabulary, which helps the model generalize to words absent from the training set. This is equivalent to injecting a certain amount of noise into the original text and helps prevent the text analysis model from overfitting. Choosing the enhancement operation according to the importance of the keyword in the text library and in the original text makes the enhancement more targeted, which improves the effect of data enhancement and benefits the accuracy of the text analysis model. The enhanced text keeps the category label of the original text, yet an enhancement operation may change the meaning of a sentence and so produce a mislabeled sentence, which would degrade the text analysis model. Applying each operation only with a random probability limits such changes and helps ensure that the performance of the text analysis model is not reduced.
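The three threshold rules above can be sketched as a single pass over the keywords. The sketch makes several simplifying assumptions: a whole keyword is replaced by a hypothetical `[MASK]` token instead of masking individual characters, one probability per operation is drawn for the whole text, synonyms come from a caller-supplied dictionary, and keywords without both weights are left unchanged. All names and parameters are illustrative.

```python
import random

MASK = "[MASK]"  # hypothetical mask token

def augment(tokens, w1, w2, t1, t2, p_mask, p_delete, p_replace, synonyms, rng):
    """Mask, delete, or synonym-replace each keyword according to its two weights."""
    # Draw one probability per operation from its preset probability array.
    pm, pd, pr = rng.choice(p_mask), rng.choice(p_delete), rng.choice(p_replace)
    out = []
    for tok in tokens:
        if tok not in w1 or tok not in w2:
            out.append(tok)               # no weights known: keep unchanged
            continue
        a, b = w1[tok], w2[tok]
        if a < t1 and b > t2:             # unimportant in library, important in original
            out.append(MASK if rng.random() < pm else tok)
        elif a < t1 and b < t2:           # unimportant in both: maybe delete
            if rng.random() >= pd:
                out.append(tok)
        elif a > t1 and b > t2 and tok in synonyms:  # important in both: maybe replace
            out.append(synonyms[tok] if rng.random() < pr else tok)
        else:
            out.append(tok)
    return out

tokens = ["the", "battery", "drains", "quickly"]
w1 = {"the": 0.1, "battery": 0.9, "drains": 0.2, "quickly": 0.1}
w2 = {"the": 0.1, "battery": 0.8, "drains": 0.7, "quickly": 0.2}
synonyms = {"battery": "cell"}
always = augment(tokens, w1, w2, 0.5, 0.5, [1.0], [1.0], [1.0], synonyms, random.Random(0))
never = augment(tokens, w1, w2, 0.5, 0.5, [0.0], [0.0], [0.0], synonyms, random.Random(0))
```

With all probabilities set to 1.0 every rule fires, and with 0.0 the text is returned unchanged; values in between yield different enhanced texts on each call.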
S13, identifying the application field of the target text, and acquiring a plurality of model frames corresponding to the application field from a preset artificial intelligence model library.
The preset artificial intelligence model library stores a plurality of model frames, each model frame corresponds to one or more application fields, and one application field can also correspond to one or more model frames.
Because it cannot be known in advance which frame yields a text analysis model that suits the target text, or which frame yields the best results, the application analyzes the application field of the target text, such as the financial, legal, emotion or medical field, and selects a plurality of model frames related to that field from the preset artificial intelligence model library for building the subsequent text analysis models.
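The many-to-many relation between model frames and application fields could be held in a simple registry, as in the sketch below. The frame names and the field assignments are purely hypothetical and are not taken from the application.

```python
# Hypothetical model library: each model frame lists the application fields it serves.
MODEL_LIBRARY = {
    "lstm": {"finance", "sentiment"},
    "rnn": {"sentiment", "legal"},
    "cnn": {"finance", "medical", "sentiment"},
}

def frames_for(field, library=MODEL_LIBRARY):
    """Return the sorted names of all model frames registered for `field`."""
    return sorted(name for name, fields in library.items() if field in fields)

sentiment_frames = frames_for("sentiment")
```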
S14, training based on the original text and the enhanced text by using the plurality of model frameworks to obtain a plurality of text analysis models.
A text analysis model is built on each artificial intelligence model framework and trained with the original text and the enhanced text, so that a plurality of text analysis models based on different frameworks are obtained.
For example, a text analysis model may be built based on a long short-term memory (LSTM) network, a recurrent neural network (RNN), or a convolutional neural network (CNN).
In one possible implementation, the training based on the original text and the enhanced text by using the plurality of model frameworks to obtain a plurality of text analysis models includes:
extracting topic information of the original text and the enhanced text;
obtaining a combined feature vector according to the topic information;
training based on the combined feature vector by using the plurality of model frameworks to obtain the plurality of text analysis models.
The topic information of the original text and of the enhanced text may be extracted using a topic model, where each topic comprises a series of related words.
Feature vectors are generated from the extracted topic information; that is, each text is represented as a vector in which each dimension corresponds to a topic and the value is the number of occurrences or the weight of that topic in the text.
Each word in the original text and the enhanced text is converted into a word embedding vector using a pre-trained word embedding model, and the word embedding vectors are combined with the topic information to obtain a combined feature vector that represents the whole sentence. Relationships between different words may be taken into account when generating the combined feature vector. For example, a plurality of word embedding vectors may be merged into one feature vector by simple summation and averaging, so that a sentence with a plurality of words is represented by a vector of fixed dimension.
For example, when the topic model has 10 topics, each text can be represented as a 10-dimensional vector in which each dimension corresponds to a topic; the number of occurrences or the weight of each topic in the text is calculated to obtain the topic part of the combined feature vector.
The combined feature vector is used as the input of each model framework, and a text analysis model is obtained by training.
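The construction of the combined feature vector can be illustrated with a minimal sketch. The text does not state how the word embeddings and the topic information are merged, so concatenating the averaged word embedding with the text-topic weights is an assumption, as are the function names and the toy data.

```python
from typing import Dict, List, Sequence


def mean_embedding(tokens: Sequence[str],
                   embeddings: Dict[str, List[float]],
                   dim: int) -> List[float]:
    """Average the embeddings of the known tokens; all zeros if none is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def combined_feature_vector(tokens: Sequence[str],
                            embeddings: Dict[str, List[float]],
                            dim: int,
                            topic_weights: Sequence[float]) -> List[float]:
    """Concatenate the averaged sentence embedding with the text-topic weights.

    Concatenation is an assumed way of combining the two parts.
    """
    return mean_embedding(tokens, embeddings, dim) + list(topic_weights)


# Hypothetical toy data: 2-dimensional embeddings and a 3-topic model.
toy_embeddings = {"loan": [1.0, 0.0], "rate": [0.0, 1.0]}
toy_topics = [0.7, 0.2, 0.1]
features = combined_feature_vector(["loan", "rate", "unknown"],
                                   toy_embeddings, 2, toy_topics)
print(features)  # [0.5, 0.5, 0.7, 0.2, 0.1]
```

Whatever the sentence length, the resulting vector has a fixed dimension (embedding dimension plus number of topics), which is what allows the same input to be fed to each model framework.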
In one possible implementation, the extracting topic information of the original text and the enhanced text includes:
extracting topics from the original text and the enhanced text by using a hierarchical Dirichlet process algorithm to obtain the topic information, wherein the topic information comprises a text-topic distribution and a topic-word distribution.
Topic extraction is performed on the original text and the enhanced text using the hierarchical Dirichlet process (HDP) algorithm to obtain their text-topic distribution and topic-word distribution. Each text may contain multiple topics, and each topic is in turn represented by a set of words. The text-topic distribution indicates which topics a text contains and the weight of each, that is, the importance and contribution of the different topics in the text. The topic-word distribution is the probability distribution over words within each topic; it shows which concept a topic represents in the corpus and which words dominate the topic.
In the above optional implementation, extracting the topic information helps in understanding the topic structure of the original text and the enhanced text, which facilitates the subsequent tasks.
S15, selecting a target text analysis model from the plurality of text analysis models based on a plurality of evaluation indexes.
After the plurality of text analysis models are obtained by training, each trained model may be tested on a test set, its performance evaluated with a plurality of evaluation indexes such as accuracy, precision, recall, and F1 score, and the text analysis model with the best performance selected as the target text analysis model.
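The four evaluation indexes named above can be computed as in the following sketch. It assumes a binary classification task with labels 0 and 1, which the text does not specify; the function name and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical test-set labels and the predictions of one trained model.
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(labels, preds)
print(metrics)
```

In this toy case there are 2 true positives, 2 true negatives, 1 false positive, and 1 false negative, so all four indexes equal 2/3.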
In an optional implementation, each text analysis model may be displayed together with its evaluation index values, and the user selects one or more text analysis models as the target text analysis model according to actual requirements.
In an optional implementation, a weighted average of the evaluation index values of each text analysis model may be calculated, and either the text analysis model with the largest weighted average, or every text analysis model whose weighted average exceeds the mean of all the weighted averages, is taken as the target text analysis model.
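A minimal sketch of the weighted-average selection follows. The index weights, the model names, and the index values are hypothetical; the text does not say how the weights are chosen.

```python
from typing import Dict, List


def weighted_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the evaluation index values of one model."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total


def select_models(results: Dict[str, Dict[str, float]],
                  weights: Dict[str, float],
                  mode: str = "best") -> List[str]:
    """Return the best model, or every model scoring above the mean score."""
    scores = {name: weighted_score(m, weights) for name, m in results.items()}
    if mode == "best":
        return [max(scores, key=scores.get)]
    mean_score = sum(scores.values()) / len(scores)
    return [name for name, s in scores.items() if s > mean_score]


# Hypothetical evaluation results for three trained models.
results = {
    "lstm": {"accuracy": 0.90, "f1": 0.80},
    "rnn": {"accuracy": 0.80, "f1": 0.70},
    "cnn": {"accuracy": 0.85, "f1": 0.83},
}
weights = {"accuracy": 0.5, "f1": 0.5}

print(select_models(results, weights, mode="best"))        # ['lstm']
print(select_models(results, weights, mode="above_mean"))  # ['lstm', 'cnn']
```

The two modes correspond to the two alternatives described above: a single target model, or several target models that all beat the average.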
S16, performing data analysis on the target text by using the target text analysis model.
The target text is input into the target text analysis model, which performs data analysis on it and outputs an analysis result such as a sentiment category or a topic classification.
When analyzing the target text, the method and the device acquire original text related to the target text and perform data enhancement on it, which ensures the quality of the training data, enlarges the amount of training data, avoids overfitting, and improves the analysis effect on the target text. A plurality of model frameworks corresponding to the application field of the target text are acquired from the preset model library so that the models are trained in a targeted manner, which improves model performance. The plurality of model frameworks are trained on the original text and the enhanced text to obtain a plurality of text analysis models, the target text analysis model is selected based on a plurality of evaluation indexes, and the target text is analyzed with it, which further improves the analysis effect on the target text.
Fig. 2 is a block diagram of an artificial intelligence based text analysis system according to a second embodiment of the present invention.
In some embodiments, the artificial intelligence based text analysis system 20 may include a plurality of functional modules composed of computer program segments. The computer program of each program segment in the artificial intelligence based text analysis system 20 may be stored in a memory of an electronic device and executed by at least one processor to perform the functions of artificial intelligence based text analysis (see fig. 1 for details).
In this embodiment, the artificial intelligence based text analysis system 20 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a text acquisition module 201, an enhancement processing module 202, a model acquisition module 203, a model training module 204, an index evaluation module 205, and a text analysis module 206. A module in the present invention refers to a series of computer program segments that are stored in a memory, can be executed by at least one processor, and perform a fixed function. The functions of the respective modules are described in detail below.
The text acquisition module 201 is configured to respond to a user's analysis instruction for target text data and acquire original text data related to the target text data from a text database.
In response to a user's analysis instruction for the target text, original text related to the target text is acquired from a text library.
The target text refers to the data on which text analysis is to be performed.
A text library is preset in the electronic device, and a plurality of texts are stored in it. For ease of description, a text stored in the text library is referred to as a stored text.
When the user uploads the target text to the electronic device, an analysis instruction for analyzing the target text is triggered, and a plurality of stored texts that are correlated with the target text are acquired from the text library.
In one possible implementation, the acquiring original text related to the target text from the text library includes:
acquiring a preset number of stored texts from the text library;
obtaining the similarity between each acquired stored text and the target text;
taking, from the acquired stored texts, each stored text whose similarity is greater than a preset similarity threshold as a target stored text;
performing cluster analysis on all the stored texts in the text library to obtain a plurality of text clusters;
determining a text cluster that includes the target stored text as a target text cluster;
and determining the stored texts in the target text cluster as the original text related to the target text.
Because the text library holds a large number of stored texts, a small number of stored texts, equal to the preset number, may first be acquired at random from the text library so that the original text related to the target text can be found quickly. The similarity between each acquired stored text and the target text is then calculated from the Euclidean distance, and the correlation between the acquired stored text and the target text is judged from the similarity.
When the similarity of an acquired stored text is greater than the preset similarity threshold, the stored text is strongly correlated with the target text and is saved as a target stored text. When the similarity is smaller than the preset similarity threshold, the stored text is only weakly correlated with the target text.
Next, all the stored texts in the text library are analyzed with a cluster analysis algorithm and divided into a plurality of text clusters. Stored texts in the same text cluster are strongly correlated with one another, whereas stored texts in different text clusters are weakly correlated. Therefore, if a text cluster includes a target stored text, the stored texts in that cluster are all considered strongly correlated with the target text; the cluster is determined to be the target text cluster, and its stored texts are determined to be the original text related to the target text.
For example, assume that all the stored texts in the text library are divided into 5 text clusters and that 50 stored texts are acquired at random from the text library. If the similarity between stored text D1 and the target text is greater than 0.8, D1 is taken as a target stored text and saved in the target text set. If the 1st text cluster includes D1, the 1st text cluster is taken as the target text cluster, and the stored texts in it are determined to be the original text related to the target text.
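The retrieval procedure can be sketched as follows. Several details are assumptions: texts are taken to be already vectorized, the similarity is derived from the Euclidean distance as 1 / (1 + distance), and the cluster assignment of every stored text is supplied as a precomputed mapping because the text does not name a clustering algorithm. All identifiers are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Map the Euclidean distance to (0, 1]; identical vectors score 1.0."""
    return 1.0 / (1.0 + math.dist(a, b))


def related_original_texts(target_vec: Sequence[float],
                           stored_vecs: Dict[str, Sequence[float]],
                           cluster_of: Dict[str, int],
                           sample_size: int,
                           threshold: float,
                           rng: random.Random) -> List[str]:
    """Return the ids of all stored texts in clusters hit by a similar sample."""
    # Step 1: randomly acquire a preset number of stored texts.
    sampled = rng.sample(sorted(stored_vecs), min(sample_size, len(stored_vecs)))
    # Steps 2-3: keep the sampled texts whose similarity exceeds the threshold.
    target_stored = [d for d in sampled
                     if similarity(target_vec, stored_vecs[d]) > threshold]
    # Steps 4-6: every cluster containing a target stored text is a target cluster.
    target_clusters = {cluster_of[d] for d in target_stored}
    return sorted(d for d, c in cluster_of.items() if c in target_clusters)


# Hypothetical toy library: four vectorized stored texts in two clusters.
stored_vecs = {"D1": [0.0, 0.1], "D2": [0.1, 0.0],
               "D3": [5.0, 5.0], "D4": [5.1, 4.9]}
cluster_of = {"D1": 0, "D2": 0, "D3": 1, "D4": 1}

originals = related_original_texts([0.0, 0.0], stored_vecs, cluster_of,
                                   sample_size=4, threshold=0.8,
                                   rng=random.Random(0))
print(originals)  # ['D1', 'D2']
```

Only the sampled texts are compared with the target text; the remaining members of a target cluster are added without any further similarity calculation.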
In an optional embodiment, after the text clusters that include a target stored text are determined as target text clusters, the number of target stored texts in each target text cluster may be counted, and the stored texts in the target text cluster with the largest count are determined as the original text related to the target text. Alternatively, the stored texts in every target text cluster whose count exceeds the average are determined as the original text related to the target text.
In the above optional embodiment, a preset number of stored texts is acquired, and the target stored texts that are strongly correlated with the target text are identified among them by similarity. Relying on the principle that stored texts in the same text cluster are strongly correlated, all the stored texts in the text library are divided into a plurality of text clusters, the text cluster that includes a target stored text is determined as the target text cluster, and the stored texts in the target text cluster are determined as the original text related to the target text. This avoids calculating the similarity between every stored text in the text library and the target text, which reduces the amount of computation and improves the efficiency of acquiring the original text related to the target text.
In one possible implementation, before the original text related to the target text is obtained from the text library, a cleaning process may be further performed on the target text. The cleaning process may include, but is not limited to: deleting irrelevant information in the target text, removing redundant punctuation marks, screening short texts, correcting wrongly written characters, filling missing values, normalizing and the like.
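A few of the listed cleaning operations can be sketched with regular expressions. Treating HTML-like tags as the "irrelevant information", the minimum length of 5 characters, and the function names are all assumptions made for illustration; typo correction and missing-value filling are not covered.

```python
import re
from typing import Iterable, List


def clean_text(text: str) -> str:
    """Drop tag-like markup, collapse repeated punctuation, normalize spaces."""
    text = re.sub(r"<[^>]+>", " ", text)          # assumed irrelevant information
    text = re.sub(r"([!?.,;:])\1+", r"\1", text)  # redundant punctuation marks
    text = re.sub(r"\s+", " ", text)              # whitespace normalization
    return text.strip()


def clean_corpus(texts: Iterable[str], min_chars: int = 5) -> List[str]:
    """Clean every text and screen out those that end up too short."""
    cleaned = (clean_text(t) for t in texts)
    return [t for t in cleaned if len(t) >= min_chars]


print(clean_text("<p>Great   service!!!</p>"))         # Great service!
print(clean_corpus(["ok", "<b>Rates rose...</b>"]))    # ['Rates rose.']
```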
The enhancement processing module 202 is configured to perform data enhancement processing on the original text data to obtain enhanced text data.
Data enhancement processing is performed on the original text to obtain enhanced text.
To keep the text analysis model from overfitting because too little original text was acquired, the original text may be expanded with a data enhancement technique to increase its volume. Training the text analysis model on a larger amount of text improves its generalization ability and makes it suitable for more application scenarios.
In one possible implementation, the performing data enhancement processing on the original text to obtain enhanced text includes:
performing word segmentation on the original text to obtain a plurality of text keywords;
calculating a first weight of each text keyword in the text library, and calculating a second weight of the text keyword in the original text;
and performing enhancement processing on the original text according to the first weight and the second weight of the text keyword to obtain the enhanced text.
Word segmentation, the splitting of continuous text into individual tokens, is one of the basic operations of natural language processing. Most natural language processing tools and language models process and analyze text at the word level, whereas most raw corpora are available only as character strings. The original text therefore needs to be segmented before it is analyzed.
A rule-based word segmenter may be used to identify word boundaries and split the original text into a series of words, from which a plurality of text keywords of the original text are obtained. The text keywords provide summary information or important features of the text content and help subsequent text analysis and processing.
The text library stores, for each stored text, a plurality of corresponding text keywords, and a text-library vocabulary can be built from the text keywords of all the stored texts. The first weight of a text keyword in the text library is obtained by calculating the TF-IDF value of the text keyword over the text-library vocabulary. The second weight of the text keyword in the original text is obtained by calculating the TF-IDF value of the text keyword over the text keywords of the original text.
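The two weights can be illustrated with a small TF-IDF sketch. The text does not give the exact formula, so the smoothed IDF variant used here, and the reading of the first weight as the mean TF-IDF value of the keyword over all stored texts, are assumptions; the function names and the toy library are hypothetical.

```python
import math
from typing import List, Sequence


def tf(term: str, tokens: Sequence[str]) -> float:
    """Relative frequency of the term within one tokenized text."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def idf(term: str, corpus: Sequence[Sequence[str]]) -> float:
    """Smoothed inverse document frequency (an assumed variant)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0


def tf_idf(term: str, tokens: Sequence[str],
           corpus: Sequence[Sequence[str]]) -> float:
    return tf(term, tokens) * idf(term, corpus)


def library_weight(term: str, corpus: Sequence[Sequence[str]]) -> float:
    """Assumed first weight: mean TF-IDF of the term over all stored texts."""
    return sum(tf_idf(term, doc, corpus) for doc in corpus) / len(corpus)


# Hypothetical keyword lists of three stored texts; the first is the original text.
library: List[List[str]] = [["loan", "rate", "bank"],
                            ["bank", "fee"],
                            ["weather", "rain"]]
original = library[0]

first_weight = library_weight("loan", library)     # weight in the text library
second_weight = tf_idf("loan", original, library)  # weight in the original text
print(round(first_weight, 4), round(second_weight, 4))
```

A keyword such as "bank", which appears in more stored texts than "loan", receives a lower IDF and therefore a lower weight for the same term frequency.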
The keywords in the original text are then enhanced based on the first weight and the second weight to obtain the enhanced text.
In one possible implementation, the performing enhancement processing on the original text according to the first weight and the second weight of the keyword to obtain the enhanced text includes:
comparing the first weight with a first preset weight threshold, and comparing the second weight with a second preset weight threshold;
when the first weight is smaller than the first preset weight threshold and the second weight is greater than the second preset weight threshold, acquiring a first random probability from a first preset random probability array, and masking the keyword according to the first random probability;
when the first weight is smaller than the first preset weight threshold and the second weight is smaller than the second preset weight threshold, acquiring a second random probability from a second preset random probability array, and deleting the keyword according to the second random probability;
and when the first weight is greater than the first preset weight threshold and the second weight is greater than the second preset weight threshold, acquiring a third random probability from a third preset random probability array, and replacing the keyword according to the third random probability.
When a keyword is processed, its first weight is compared with the first preset weight threshold to judge how important the keyword is in the text library, and its second weight is compared with the second preset weight threshold to judge how important the keyword is in the original text.
When the first weight is smaller than the first preset weight threshold and the second weight is greater than the second preset weight threshold, the keyword is unimportant in the text library but important in the original text. A probability value is randomly selected from the first preset random probability array as the first random probability, and the keyword is masked according to the first random probability; that is, one or some of the words in the keyword are masked to form a new keyword.
When the first weight is smaller than the first preset weight threshold and the second weight is smaller than the second preset weight threshold, the keyword is unimportant both in the text library and in the original text. A probability value is randomly selected from the second preset random probability array as the second random probability, and the keyword is deleted according to the second random probability; that is, the keyword is removed from the original text.
When the first weight is greater than the first preset weight threshold and the second weight is greater than the second preset weight threshold, the keyword is important both in the text library and in the original text. A probability value is randomly selected from the third preset random probability array as the third random probability, and the keyword is replaced according to the third random probability; that is, the original keyword is replaced with a synonym.
It should be understood that the case in which the first weight is greater than the first preset weight threshold while the second weight is smaller than the second preset weight threshold does not occur.
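The three branches above can be sketched as follows. The sketch reads "according to the random probability" as "apply the operation with that probability", masks the whole keyword with a placeholder token rather than part of it, and takes synonyms from a supplied dictionary; these choices, along with all names and data, are assumptions.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

MASK = "[MASK]"  # hypothetical placeholder token


def choose_operation(first_weight: float, second_weight: float,
                     first_threshold: float,
                     second_threshold: float) -> Optional[str]:
    """Pick the enhancement operation from the two weights and thresholds."""
    if first_weight < first_threshold and second_weight > second_threshold:
        return "mask"
    if first_weight < first_threshold and second_weight < second_threshold:
        return "delete"
    if first_weight > first_threshold and second_weight > second_threshold:
        return "replace"
    return None  # remaining case: leave the keyword unchanged


def augment(tokens: Sequence[str],
            weights: Dict[str, Tuple[float, float]],
            thresholds: Tuple[float, float],
            prob_arrays: Dict[str, List[float]],
            synonyms: Dict[str, str],
            rng: random.Random) -> List[str]:
    out: List[str] = []
    for tok in tokens:
        op = choose_operation(*weights[tok], *thresholds) if tok in weights else None
        # Draw a probability from the operation's preset array, then apply
        # the operation only with that probability.
        if op is None or rng.random() >= rng.choice(prob_arrays[op]):
            out.append(tok)
        elif op == "mask":
            out.append(MASK)
        elif op == "replace":
            out.append(synonyms.get(tok, tok))
        # op == "delete": the keyword is dropped from the text.
    return out


# Hypothetical keywords with (first weight, second weight).
tokens = ["the", "bank", "raised", "rates", "today"]
weights = {"bank": (0.9, 0.8), "rates": (0.1, 0.7), "today": (0.1, 0.1)}
always = {"mask": [1.0], "delete": [1.0], "replace": [1.0]}

enhanced = augment(tokens, weights, (0.5, 0.5), always,
                   {"bank": "lender"}, random.Random(0))
print(enhanced)  # ['the', 'lender', 'raised', '[MASK]']
```

With probability arrays containing values below 1.0, only some keywords are modified on each pass, so repeated passes over the same original text yield different enhanced texts.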
In the above optional embodiment, random masking and synonym replacement introduce new vocabulary, so that the model can generalize to words that do not appear in the training set. This is equivalent to adding a certain amount of noise to the original text and helps prevent the text analysis model from overfitting. During enhancement, a different operation is selected according to the importance of each keyword in the text library and in the original text, which makes the enhancement more targeted, improves the effect of data enhancement, and helps improve the accuracy of the text analysis model. The newly generated enhanced text keeps the category label of the original text, but an enhancement operation may change the meaning of a sentence and thereby produce a mislabeled sentence, which would degrade the performance of the text analysis model. The original text is therefore enhanced only with a random probability, which limits the number of such sentences and helps keep the performance of the text analysis model from degrading.
The model acquisition module 203 is configured to identify the application field of the target text data and acquire a plurality of model frameworks corresponding to the application field from a preset model library.
The application field of the target text is identified, and a plurality of model frameworks corresponding to the application field are acquired from a preset artificial intelligence model library.
The preset artificial intelligence model library stores a plurality of model frameworks. Each model framework corresponds to one or more application fields, and one application field may correspond to one or more model frameworks.
Because it cannot be known in advance which framework yields a text analysis model that suits the target text, or which yields the best results, the application identifies the application field of the target text, such as the financial, legal, sentiment, or medical field, and selects a plurality of model frameworks related to that field from the preset artificial intelligence model library for building the subsequent text analysis models.
The model training module 204 is configured to train based on the original text data and the enhanced text data by using the plurality of model frameworks to obtain a plurality of text data analysis models.
Training is performed based on the original text and the enhanced text by using the plurality of model frameworks to obtain a plurality of text analysis models.
A text analysis model is built on each artificial intelligence model framework and trained with the original text and the enhanced text, so that a plurality of text analysis models based on different frameworks are obtained.
For example, a text analysis model may be built based on a long short-term memory (LSTM) network, a recurrent neural network (RNN), or a convolutional neural network (CNN).
In one possible implementation, the training based on the original text and the enhanced text by using the plurality of model frameworks to obtain a plurality of text analysis models includes:
extracting topic information of the original text and the enhanced text;
obtaining a combined feature vector according to the topic information;
training based on the combined feature vector by using the plurality of model frameworks to obtain the plurality of text analysis models.
The topic information of the original text and of the enhanced text may be extracted using a topic model, where each topic comprises a series of related words.
Feature vectors are generated from the extracted topic information; that is, each text is represented as a vector in which each dimension corresponds to a topic and the value is the number of occurrences or the weight of that topic in the text.
Each word in the original text and the enhanced text is converted into a word embedding vector using a pre-trained word embedding model, and the word embedding vectors are combined with the topic information to obtain a combined feature vector that represents the whole sentence. Relationships between different words may be taken into account when generating the combined feature vector. For example, a plurality of word embedding vectors may be merged into one feature vector by simple summation and averaging, so that a sentence with a plurality of words is represented by a vector of fixed dimension.
For example, when the topic model has 10 topics, each text can be represented as a 10-dimensional vector in which each dimension corresponds to a topic; the number of occurrences or the weight of each topic in the text is calculated to obtain the topic part of the combined feature vector.
The combined feature vector is used as the input of each model framework, and a text analysis model is obtained by training.
In one possible implementation, the extracting topic information of the original text and the enhanced text includes:
extracting topics from the original text and the enhanced text by using a hierarchical Dirichlet process algorithm to obtain the topic information, wherein the topic information comprises a text-topic distribution and a topic-word distribution.
Topic extraction is performed on the original text and the enhanced text using the hierarchical Dirichlet process (HDP) algorithm to obtain their text-topic distribution and topic-word distribution. Each text may contain multiple topics, and each topic is in turn represented by a set of words. The text-topic distribution indicates which topics a text contains and the weight of each, that is, the importance and contribution of the different topics in the text. The topic-word distribution is the probability distribution over words within each topic; it shows which concept a topic represents in the corpus and which words dominate the topic.
In the above optional implementation, extracting the topic information helps in understanding the topic structure of the original text and the enhanced text, which facilitates the subsequent tasks.
The index evaluation module 205 is configured to select a target text data analysis model from the plurality of text data analysis models based on a plurality of evaluation indexes.
A target text analysis model is selected from the plurality of text analysis models based on a plurality of evaluation indexes.
After the plurality of text analysis models are obtained by training, each trained model may be tested on a test set, its performance evaluated with a plurality of evaluation indexes such as accuracy, precision, recall, and F1 score, and the text analysis model with the best performance selected as the target text analysis model.
In an optional implementation, each text analysis model may be displayed together with its evaluation index values, and the user selects one or more text analysis models as the target text analysis model according to actual requirements.
In an optional implementation, a weighted average of the evaluation index values of each text analysis model may be calculated, and either the text analysis model with the largest weighted average, or every text analysis model whose weighted average exceeds the mean of all the weighted averages, is taken as the target text analysis model.
The text analysis module 206 is configured to perform data analysis on the target text data by using the target text data analysis model.
Data analysis is performed on the target text by using the target text analysis model.
The target text is input into the target text analysis model, which performs data analysis on it and outputs an analysis result such as a sentiment category or a topic classification.
When analyzing the target text, the method and the device acquire original text related to the target text and perform data enhancement on it, which ensures the quality of the training data, enlarges the amount of training data, avoids overfitting, and improves the analysis effect on the target text. A plurality of model frameworks corresponding to the application field of the target text are acquired from the preset model library so that the models are trained in a targeted manner, which improves model performance. The plurality of model frameworks are trained on the original text and the enhanced text to obtain a plurality of text analysis models, the target text analysis model is selected based on a plurality of evaluation indexes, and the target text is analyzed with it, which further improves the analysis effect on the target text.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. In a preferred embodiment of the application, the electronic device 3 comprises a memory 31, at least one processor 32, and at least one communication bus 33.
It will be appreciated by those skilled in the art that the configuration of the electronic device shown in fig. 3 does not limit the embodiments of the present application; it may be either a bus-type or a star-type configuration, and the electronic device 3 may include more or fewer hardware or software components than shown, or a different arrangement of components.
In some embodiments, the electronic device 3 is a device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction, and its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 3 may further include other electronic devices, including but not limited to any electronic product that can interact with a user by means of a keyboard, a mouse, a remote control, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the electronic device 3 is only an example; other existing or future electronic products that are adaptable to the present application are also included in the scope of protection of the present application and are incorporated herein by reference.
In some embodiments, the memory 31 stores a computer program which, when executed by the at least one processor 32, performs all or part of the steps of the artificial intelligence based text analysis method described above. The memory 31 includes read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
In some embodiments, the at least one processor 32 is the control unit of the electronic device 3. It connects the components of the entire electronic device 3 through various interfaces and lines, and performs the functions of the electronic device 3 and processes data by running or executing the programs or modules stored in the memory 31 and invoking the data stored in the memory 31. For example, when executing the computer program stored in the memory, the at least one processor 32 implements all or part of the steps of the artificial intelligence based text analysis method described in the embodiments of the present application, or all or part of the functions of the artificial intelligence based text analysis system. The at least one processor 32 may consist of integrated circuits, for example a single packaged integrated circuit or a plurality of packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 33 is configured to enable connection and communication between the memory 31, the at least one processor 32, and other components. Although not shown, the electronic device 3 may further include a power supply (such as a battery) for powering the various components. Preferably, the power supply is logically connected to the at least one processor 32 via a power management system, so that charging, discharging, and power consumption are managed by the power management system. The power supply may also include any one or more of a direct current or alternating current power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like. The electronic device 3 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which are not described in detail herein.
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium and include instructions for causing an electronic device (which may be a personal computer, an electronic device, or a network device, etc.) or a processor (processor) to perform portions of the methods described in the various embodiments of the application.
In the several embodiments provided by the present application, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the modules is merely a logical function division, and other manners of division may be implemented in practice.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this disclosure encompasses any and all possible combinations of one or more of the listed items. The terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or as implicitly indicating the number of technical features referred to. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the application, unless otherwise indicated, "a plurality" means two or more.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive of falls within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1. A text analysis method based on artificial intelligence, the method comprising:
responding to an analysis instruction of a user on a target text, and acquiring an original text related to the target text from a text library;
performing data enhancement processing on the original text to obtain an enhanced text;
identifying the application field of the target text, and acquiring a plurality of model frames corresponding to the application field from a preset model library;
training based on the original text and the enhanced text by using the model frames to obtain a plurality of text analysis models;
selecting a target text analysis model from a plurality of the text analysis models based on a plurality of evaluation indexes;
and carrying out data analysis on the target text by using the target text analysis model.
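The six steps of claim 1 can be read as a pipeline. The following is a minimal sketch under stated assumptions: the `Candidate` type, the `analyze` function, and the callables passed into it are hypothetical placeholders introduced for illustration, and none of them is named in the application.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Candidate:
    """A trained model plus its evaluation index values (hypothetical type)."""
    name: str
    predict: Callable[[str], str]
    scores: dict[str, float]


def analyze(
    target_text: str,
    retrieve: Callable[[str], list[str]],
    augment: Callable[[list[str]], list[str]],
    frames_for: Callable[[str], Sequence[Callable[[list[str]], Candidate]]],
    select: Callable[[list[Candidate]], Candidate],
) -> str:
    # Step 1: acquire original texts related to the target text.
    original = retrieve(target_text)
    # Step 2: data enhancement of the original texts.
    enhanced = augment(original)
    # Step 3: model frames for the application field of the target text.
    frames = frames_for(target_text)
    # Step 4: train one model per frame on original + enhanced texts.
    candidates = [train(original + enhanced) for train in frames]
    # Step 5: select the target model from the evaluation index values.
    best = select(candidates)
    # Step 6: analyze the target text with the selected model.
    return best.predict(target_text)
```

Each stage is passed in as a callable so that the sketch stays neutral about how retrieval, enhancement, training, and selection are actually implemented.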
2. The artificial intelligence based text analysis method of claim 1, wherein the performing data enhancement processing on the original text to obtain enhanced text comprises:
performing word segmentation processing on the original text to obtain a plurality of text keywords;
calculating a first weight of the text keyword in the text library, and calculating a second weight of the text keyword in the original text;
and performing enhancement processing on the original text according to the first weight and the second weight of the text keyword to obtain the enhanced text.
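The claim does not define how the two weights are computed. As a rough illustration only, the sketch below assumes that the first weight is the fraction of library texts containing the keyword and that the second weight is the keyword's relative frequency in the original text. Whitespace splitting stands in for word segmentation, which for Chinese text would need a real segmenter. The function name `keyword_weights` is hypothetical.

```python
from collections import Counter


def keyword_weights(original: str, library: list[str]) -> dict[str, tuple[float, float]]:
    """Return {keyword: (first_weight, second_weight)}.

    Assumptions (not from the source): first weight = share of library texts
    that contain the keyword; second weight = relative frequency of the
    keyword in the original text; tokens are split on whitespace.
    """
    tokens = original.split()
    counts = Counter(tokens)
    library_sets = [set(text.split()) for text in library]
    weights = {}
    for word, n in counts.items():
        if library_sets:
            first = sum(word in s for s in library_sets) / len(library_sets)
        else:
            first = 0.0
        second = n / len(tokens)
        weights[word] = (first, second)
    return weights
```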
3. The artificial intelligence based text analysis method of claim 2, wherein the performing enhancement processing on the original text according to the first weight and the second weight of the text keyword to obtain the enhanced text comprises:
comparing the first weight with a first preset weight threshold value, and comparing the second weight with a second preset weight threshold value;
when the first weight is smaller than the first preset weight threshold value and the second weight is larger than the second preset weight threshold value, acquiring a first random probability from a first preset random probability array, and masking the keyword according to the first random probability;
when the first weight is smaller than the first preset weight threshold value and the second weight is smaller than the second preset weight threshold value, acquiring a second random probability from a second preset random probability array, and deleting the keyword according to the second random probability;
and when the first weight is larger than the first preset weight threshold value and the second weight is larger than the second preset weight threshold value, acquiring a third random probability from a third preset random probability array, and replacing the keyword according to the third random probability.
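The three branches of claim 3 can be sketched as follows. This is an illustration under assumptions, not the applicant's implementation: the `[MASK]` token, the `synonyms` replacement table, and the function name `augment_tokens` are hypothetical, since the claim does not say what a keyword is masked or replaced with. Cases the claim leaves open (for example a large first weight with a small second weight) keep the token unchanged here.

```python
import random

MASK = "[MASK]"  # placeholder token; the claim does not name one


def augment_tokens(tokens, weights, t1, t2, p_mask, p_delete, p_replace,
                   synonyms, rng=None):
    """Apply the mask / delete / replace rule to each token.

    weights maps token -> (first_weight, second_weight). p_mask, p_delete and
    p_replace are the three preset random probability arrays. synonyms is a
    hypothetical replacement table.
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        first, second = weights[tok]
        if first < t1 and second > t2:
            # Rare in the library, frequent in the text: mask with probability p.
            out.append(MASK if rng.random() < rng.choice(p_mask) else tok)
        elif first < t1 and second < t2:
            # Rare in both: delete with probability p (keep otherwise).
            if rng.random() >= rng.choice(p_delete):
                out.append(tok)
        elif first > t1 and second > t2:
            # Frequent in both: replace with probability p.
            swap = rng.random() < rng.choice(p_replace)
            out.append(synonyms.get(tok, tok) if swap else tok)
        else:
            out.append(tok)
    return out
```

Passing a seeded `random.Random` as `rng` makes the enhancement reproducible, which is convenient when the same enhanced text has to be regenerated for several model frames.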
4. The artificial intelligence based text analysis method of any one of claims 1 to 3, wherein the acquiring the original text related to the target text from the text library comprises:
acquiring a preset number of storage texts from the text library;
obtaining the similarity between the storage text and the target text;
acquiring a target storage text with similarity greater than a preset similarity threshold value from the storage text;
performing cluster analysis on all the stored texts in the text library to obtain a plurality of text clusters;
determining a text cluster comprising the target storage text as a target text cluster;
and determining the stored text in the target text cluster as the original text related to the target text.
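The retrieval steps of claim 4 can be sketched as below. The similarity measure and the clustering algorithm are not specified in the claim, so this sketch assumes token-set Jaccard similarity and takes the cluster labels as a precomputed input. The names `jaccard`, `related_texts`, and `cluster_of` are hypothetical, and taking the first `sample_size` texts as the "preset number" of stored texts is likewise an assumption.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for the unspecified measure."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def related_texts(target, library, cluster_of, sample_size, threshold):
    """Return the stored texts that share a cluster with a similar sample.

    library: list of stored texts.
    cluster_of: cluster label for each library index, assumed to come from a
        cluster analysis of all stored texts.
    """
    # Sampled stored texts whose similarity to the target exceeds the threshold.
    hits = [
        i for i in range(min(sample_size, len(library)))
        if jaccard(library[i], target) > threshold
    ]
    # Clusters that contain at least one such target stored text.
    target_clusters = {cluster_of[i] for i in hits}
    # Every stored text in those clusters counts as an original text.
    return [t for i, t in enumerate(library) if cluster_of[i] in target_clusters]
```

Comparing the target only against a sample and then expanding through clusters avoids computing a similarity against every stored text in the library.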
5. The artificial intelligence based text analysis method of claim 4, wherein training based on the original text and the enhanced text using the plurality of model frames to obtain a plurality of text analysis models comprises:
extracting the topic information of the original text and the enhanced text;
obtaining a combined feature vector according to the topic information;
training based on the combined feature vectors by using the model frames to obtain a plurality of text analysis models.
6. The artificial intelligence based text analysis method of claim 5, wherein the extracting topic information of the original text and the enhanced text comprises:
and extracting the topics from the original text and the enhanced text by using a hierarchical Dirichlet process algorithm to obtain the topic information, wherein the topic information comprises text-topic distribution and topic-word distribution.
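Claims 5 and 6 do not say how the text-topic distribution and the topic-word distribution are combined into a feature vector. One possible reading, shown purely as an assumption, is to concatenate a text's topic distribution with the word distribution it implies under the topic mixture. The function name `combined_feature_vector` is hypothetical, and the two distributions are taken as given inputs rather than computed by a hierarchical Dirichlet process here.

```python
def combined_feature_vector(text_topic, topic_word):
    """Concatenate a text's topic distribution with its implied word distribution.

    text_topic: list of K topic probabilities for one text.
    topic_word: K rows, each a list of V word probabilities for one topic.
    The combination rule is an assumption; the claims do not define it.
    """
    vocab_size = len(topic_word[0])
    # Topic mixture: p(word | text) = sum_k p(topic k | text) * p(word | topic k)
    word_dist = [
        sum(theta * row[v] for theta, row in zip(text_topic, topic_word))
        for v in range(vocab_size)
    ]
    return list(text_topic) + word_dist
```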
7. The artificial intelligence based text analysis method of claim 6, wherein the selecting a target text analysis model from a plurality of the text analysis models based on a plurality of evaluation indexes comprises:
displaying each text analysis model and a plurality of corresponding evaluation index values, and taking the text analysis model selected by a user as the target text analysis model; or
calculating a weighted average value of the plurality of evaluation index values corresponding to each text analysis model, and taking the text analysis model with the largest weighted average value as the target text analysis model, or taking a text analysis model whose weighted average value is larger than the mean of all the weighted average values as the target text analysis model.
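The automatic branch of claim 7 reduces to a weighted average per model followed by one of two selection rules. A minimal sketch, assuming every evaluation index is "higher is better" and the index weights are positive; the function name `select_models` and the `mode` parameter are hypothetical.

```python
def select_models(scores, weights, mode="max"):
    """Pick target models by weighted average of their evaluation index values.

    scores: {model_name: {index_name: value}}
    weights: {index_name: weight}
    mode="max" returns the single best model; mode="above_mean" returns every
    model whose weighted average exceeds the mean of all weighted averages.
    """
    total = sum(weights.values())
    averages = {
        name: sum(weights[k] * vals[k] for k in weights) / total
        for name, vals in scores.items()
    }
    if mode == "max":
        return [max(averages, key=averages.get)]
    mean = sum(averages.values()) / len(averages)
    return [name for name, avg in averages.items() if avg > mean]
```

With `mode="above_mean"` several models can qualify at once, which matches the second alternative in the claim.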
8. An artificial intelligence based text analysis system, the system comprising:
a text acquisition module, configured to respond to an analysis instruction of a user on a target text and acquire an original text related to the target text from a text library;
an enhancement processing module, configured to perform data enhancement processing on the original text to obtain an enhanced text;
a model acquisition module, configured to identify the application field of the target text and acquire a plurality of model frames corresponding to the application field from a preset model library;
a model training module, configured to train based on the original text and the enhanced text by using the plurality of model frames to obtain a plurality of text analysis models;
an index evaluation module, configured to select a target text analysis model from a plurality of the text analysis models based on a plurality of evaluation indexes;
and a text analysis module, configured to perform data analysis on the target text by using the target text analysis model.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the artificial intelligence based text analysis method according to any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the artificial intelligence based text analysis method according to any one of claims 1 to 7.
CN202310972661.XA 2023-08-03 2023-08-03 Text analysis method, system, electronic equipment and medium based on artificial intelligence Pending CN116956896A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310972661.XA CN116956896A (en) 2023-08-03 2023-08-03 Text analysis method, system, electronic equipment and medium based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310972661.XA CN116956896A (en) 2023-08-03 2023-08-03 Text analysis method, system, electronic equipment and medium based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN116956896A true CN116956896A (en) 2023-10-27

Family

ID=88452900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310972661.XA Pending CN116956896A (en) 2023-08-03 2023-08-03 Text analysis method, system, electronic equipment and medium based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN116956896A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236315A (en) * 2023-11-13 2023-12-15 湖南快乐阳光互动娱乐传媒有限公司 Text data intelligent analysis method, device and equipment
CN117370809A (en) * 2023-11-02 2024-01-09 快朵儿(广州)云科技有限公司 Artificial intelligence model construction method, system and storage medium based on deep learning

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370809A (en) * 2023-11-02 2024-01-09 快朵儿(广州)云科技有限公司 Artificial intelligence model construction method, system and storage medium based on deep learning
CN117370809B (en) * 2023-11-02 2024-04-12 快朵儿(广州)云科技有限公司 Artificial intelligence model construction method, system and storage medium based on deep learning
CN117236315A (en) * 2023-11-13 2023-12-15 湖南快乐阳光互动娱乐传媒有限公司 Text data intelligent analysis method, device and equipment
CN117236315B (en) * 2023-11-13 2024-01-30 湖南快乐阳光互动娱乐传媒有限公司 Text data intelligent analysis method, device and equipment

Similar Documents

Publication Publication Date Title
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
Orabi et al. Deep learning for depression detection of twitter users
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN116956896A (en) Text analysis method, system, electronic equipment and medium based on artificial intelligence
CN111984793A (en) Text emotion classification model training method and device, computer equipment and medium
CN112149409B (en) Medical word cloud generation method and device, computer equipment and storage medium
CN111563158B (en) Text ranking method, ranking apparatus, server and computer-readable storage medium
Altheneyan et al. Big data ML-based fake news detection using distributed learning
CN112288337B (en) Behavior recommendation method, behavior recommendation device, behavior recommendation equipment and behavior recommendation medium
CN113806493B (en) Entity relationship joint extraction method and device for Internet text data
CN113592605B (en) Product recommendation method, device, equipment and storage medium based on similar products
CN114648392B (en) Product recommendation method and device based on user portrait, electronic equipment and medium
CN113378970A (en) Sentence similarity detection method and device, electronic equipment and storage medium
CN113704410A (en) Emotion fluctuation detection method and device, electronic equipment and storage medium
CN113591489B (en) Voice interaction method and device and related equipment
CN114840684A (en) Map construction method, device and equipment based on medical entity and storage medium
CN114662477A (en) Stop word list generating method and device based on traditional Chinese medicine conversation and storage medium
CN112579781B (en) Text classification method, device, electronic equipment and medium
CN110019556B (en) Topic news acquisition method, device and equipment thereof
CN113918704A (en) Question-answering method and device based on machine learning, electronic equipment and medium
CN113515593A (en) Topic detection method and device based on clustering model and computer equipment
Ding et al. Leveraging text and knowledge bases for triple scoring: an ensemble approach-the Bokchoy triple scorer at WSDM Cup 2017
CN115221323A (en) Cold start processing method, device, equipment and medium based on intention recognition model
CN114186028A (en) Consult complaint work order processing method, device, equipment and storage medium
CN113870478A (en) Rapid number-taking method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination