CN110705268A - Article subject extraction method and device based on artificial intelligence and computer-readable storage medium - Google Patents


Info

Publication number
CN110705268A
CN110705268A (application CN201910826795.4A; granted as CN110705268B)
Authority
CN
China
Prior art keywords
word
artificial intelligence
subject
text data
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910826795.4A
Other languages
Chinese (zh)
Other versions
CN110705268B (en)
Inventor
陈一峰
周骏红
汪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910826795.4A priority Critical patent/CN110705268B/en
Priority to PCT/CN2019/116936 priority patent/WO2021042517A1/en
Publication of CN110705268A publication Critical patent/CN110705268A/en
Application granted granted Critical
Publication of CN110705268B publication Critical patent/CN110705268B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses an artificial-intelligence-based article subject extraction method comprising the following steps: receiving a text data set and performing word segmentation and merging operations on it to obtain a word text set; encoding the word text set to convert it into a word matrix set; inputting the word matrix set into a word vector conversion model for training to obtain a word vector set; performing a dimension reduction operation on the word vector set and inputting it into a convolutional neural network model for training; and converting text data input by a user into word vectors, inputting the word vectors into the trained convolutional neural network model to obtain the article subject, and outputting the article subject. The invention also provides an artificial-intelligence-based article subject extraction device and a computer-readable storage medium. The invention can realize an accurate and efficient artificial-intelligence-based article subject extraction function.

Description

Article subject extraction method and device based on artificial intelligence and computer-readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for extracting article themes based on artificial intelligence and a computer-readable storage medium.
Background
At present, the subjects of most articles are identified by industry professionals. For example, enterprise development reports are read, researched, and summarized manually so that senior leaders can make decisions, and academic reports are condensed by specialists for others to study. This approach is time-consuming and labor-intensive. In addition, article subject extraction has been performed with the traditional naive Bayes algorithm, but that algorithm consumes substantial computational resources and produces subjects with a high error rate, so it cannot meet practical requirements.
Disclosure of Invention
The invention provides an article subject extraction method and device based on artificial intelligence and a computer readable storage medium, and mainly aims to perform intelligent subject extraction according to articles input by a user.
In order to achieve the above object, the invention provides an article theme extraction method based on artificial intelligence, which comprises the following steps:
receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;
the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;
performing a dimension reduction operation on the word vector set, inputting it into a convolutional neural network model to obtain a training value, and comparing the training value with a preset threshold: if the training value is greater than the preset threshold, training of the convolutional neural network model continues; if the training value is less than the preset threshold, training of the convolutional neural network model is finished;
and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
Optionally, the merging operation includes:
traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;
presetting words that occur at least twice in the plurality of paragraphs as hypothesis subjects, and constructing a conditional probability model relating each sentence in the plurality of paragraphs to the hypothesis subjects;
and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.
Optionally, the conditional probability model is:
(Formula rendered as an image in the source.)
wherein y1, …, yN are the hypothesis subjects (yi denoting the i-th), N is the number of hypothesis subjects, D is the paragraph and j is the paragraph number, s is a sentence in the paragraph, P(yi|s) is the probability that hypothesis subject yi is the subject of sentence s, and s(i, yi) denotes that the hypothesis subject of sentence i is yi.
Optionally, the encoding operation comprises:
numbering each word in the word text set by a number to obtain the maximum number;
creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;
and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.
Optionally, the dimension reduction operation comprises:
calculating the covariance of each word vector in the word vector set;
and removing the word vectors with the absolute values larger than the preset covariance threshold value in the covariance to obtain a word vector set after dimension reduction.
In order to achieve the above object, the present invention further provides an artificial intelligence-based article theme extraction device, including a memory and a processor, wherein the memory stores an artificial intelligence-based article theme extraction program executable on the processor, and the artificial intelligence-based article theme extraction program implements the following steps when executed by the processor:
receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;
the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;
inputting the word vector set into a convolutional neural network model after the dimensionality reduction operation to obtain a training value, judging the size of the training value and a preset threshold value, continuing training the convolutional neural network model if the training value is larger than the preset threshold value, and finishing the training of the convolutional neural network model if the training value is smaller than the preset threshold value;
and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
Optionally, the merging operation includes:
traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;
presetting words with the occurrence frequency more than or equal to two times in the plurality of paragraphs as a hypothesis subject, and constructing a conditional probability model of each sentence in the plurality of paragraphs and the hypothesis subject;
and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.
Optionally, the conditional probability model is:
(Formula rendered as an image in the source.)
wherein y1, …, yN are the hypothesis subjects (yi denoting the i-th), N is the number of hypothesis subjects, D is the paragraph and j is the paragraph number, s is a sentence in the paragraph, P(yi|s) is the probability that hypothesis subject yi is the subject of sentence s, and s(i, yi) denotes that the hypothesis subject of sentence i is yi.
Optionally, the encoding operation comprises:
numbering each word in the word text set by a number to obtain the maximum number;
creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;
and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having an artificial intelligence based article subject matter extraction program stored thereon, the artificial intelligence based article subject matter extraction program being executable by one or more processors to implement the steps of the artificial intelligence based article subject matter extraction method as described above.
The method first performs word segmentation and merging operations on a text data set to obtain a word text set, preventing erroneous words from affecting the subject of the whole article. The word text set is then encoded and converted into word vectors to obtain a word vector set; the encoding and word vector conversion reduce word dimensionality while amplifying characteristic attributes. Therefore, the artificial-intelligence-based article subject extraction method and device and the computer-readable storage medium of the invention can produce accurate article subject output.
Drawings
Fig. 1 is a schematic flowchart of an article subject extraction method based on artificial intelligence according to an embodiment of the present invention;
fig. 2 is a schematic internal structural diagram of an article theme extraction device based on artificial intelligence according to an embodiment of the present invention;
fig. 3 is a block diagram illustrating an article subject extraction program based on artificial intelligence in an article subject extraction device based on artificial intelligence according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an article subject extraction method based on artificial intelligence. Referring to fig. 1, a flow chart of an article subject extraction method based on artificial intelligence according to an embodiment of the present invention is shown. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the article subject extraction method based on artificial intelligence includes:
and S1, receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set.
Preferably, the text data set includes multiple types of text, such as news, social, academic, government development planning, enterprise investment, and the like.
The cleaning removes stop words, Arabic numerals, and other anomalous words from the text data set, since anomalous words with no actual significance degrade the text classification effect. Stop words are words that have no practical meaning and no influence on text analysis but occur with high frequency, such as common pronouns and prepositions. Specifically, the cleaning constructs an anomalous-word table in advance and traverses the words in the text data set in turn; any word that also appears in the anomalous-word table is removed, until the traversal is complete.
The word segmentation splits the text data set into individual words; segmentation is essential because written Chinese has no explicit separator between words. Preferably, the word segmentation of the present invention may be processed with the jieba segmentation library, which is available for programming languages such as Python and Java. jieba is built on the part-of-speech characteristics of Chinese: it converts the occurrence count of each word in the text data set into a frequency, searches for the maximum-probability path by dynamic programming, and finds the maximum segmentation combination based on word frequency. For example, a continuous text passage from the text data set is split by the jieba library into individual words separated by spaces.
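The maximum-probability-path segmentation described above can be sketched in a few lines of Python. The frequency dictionary `FREQ`, the smoothing value for unknown single characters, and the word-length cap are illustrative assumptions; a production segmenter ships a large dictionary learned from corpora.

```python
import math

# Toy frequency dictionary (illustrative values, not from the patent).
FREQ = {"自然": 10, "语言": 12, "处理": 8, "自然语言": 15, "自然语言处理": 1}
TOTAL = sum(FREQ.values())

def segment(sentence):
    """Maximum-probability-path word segmentation via dynamic programming."""
    n = len(sentence)
    # best[i] = (log-probability of the best segmentation of sentence[:i],
    #            start index of the last word in that segmentation)
    best = [(-math.inf, 0) for _ in range(n + 1)]
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - 6), i):  # cap candidate words at 6 chars
            word = sentence[j:i]
            # Unknown single characters get a small smoothed count;
            # unknown multi-character strings are not candidate words.
            freq = FREQ.get(word, 0.5 if len(word) == 1 else 0.0)
            if freq == 0.0:
                continue
            score = best[j][0] + math.log(freq / TOTAL)
            if score > best[i][0]:
                best[i] = (score, j)
    # Backtrack along the recorded split points.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(sentence[j:i])
        i = j
    return words[::-1]

print(segment("自然语言处理"))  # the split "自然语言 / 处理" scores highest here
```

With these toy frequencies, the two-word split outscores both the single dictionary entry for the whole string and any character-by-character segmentation.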
Further, since several sentences may share the same subject, the merging combines sentences having the same subject, greatly reducing the number of words in the text data set. Preferably, the merging comprises: traversing each text in the text data set and dividing it into paragraphs; presetting words that occur at least twice in each paragraph as hypothesis subjects; constructing a conditional probability model relating each sentence in each paragraph to the hypothesis subjects; constructing a log-likelihood function and optimizing the conditional probability model with it to obtain the subject of each sentence; and merging sentences with the same subject into one sentence, completing the merging operation.
Specifically, the conditional probability model is:
(Formula rendered as an image in the source.)
wherein y1, …, yN are the hypothesis subjects (yi denoting the i-th), N is the number of hypothesis subjects, D is the paragraph and j is the paragraph number (for example, D1 is the first paragraph of the text), s is a sentence in the paragraph, P(yi|s) is the probability that hypothesis subject yi is the subject of sentence s, and s(i, yi) denotes that the hypothesis subject of sentence i is yi.
Preferably, the log-likelihood function is:
(Formula rendered as an image in the source.)
wherein argmax solves the conditional probability model for the hypothesis subject corresponding to the maximum partial derivative among all the hypothesis subjects.
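Because the conditional probability model and log-likelihood function appear only as images in the source, the sketch below substitutes a simple frequency heuristic for them: words occurring at least twice in the paragraph are the hypothesis subjects, each sentence is assigned the most frequent hypothesis subject it contains, and sentences sharing a subject are merged. All function and variable names are illustrative.

```python
from collections import Counter

def merge_by_subject(paragraph):
    """paragraph: list of sentences, each a list of words.
    Returns the sentences merged by shared hypothesis subject."""
    counts = Counter(word for sent in paragraph for word in sent)
    # Hypothesis subjects: words occurring at least twice in the paragraph.
    subjects = {w for w, c in counts.items() if c >= 2}
    merged, order = {}, []
    for sent in paragraph:
        candidates = [w for w in sent if w in subjects]
        # Stand-in for the conditional probability model: pick the most
        # frequent candidate subject (None if the sentence has no subject).
        key = max(candidates, key=lambda w: counts[w]) if candidates else None
        if key not in merged:
            merged[key] = []
            order.append(key)
        merged[key].extend(sent)
    return [merged[k] for k in order]

paragraph = [["猫", "跑"], ["猫", "叫"], ["天", "晴"]]
print(merge_by_subject(paragraph))  # first two sentences share the subject "猫"
```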
And S2, converting the word text set into a word matrix set after the encoding operation is carried out on the word text set, and inputting the word matrix set into a word vector conversion model to train to obtain a word vector set.
Preferably, the encoding takes the one-hot form. The one-hot encoding assigns a numerical index to each word in the word text set to obtain the maximum index, creates an encoding matrix whose dimension equals the maximum index, traverses each sentence in the word text set in turn, maps each sentence onto the encoding matrix, and completes the encoding operation according to the index of each word, obtaining the word matrix set. For example, if the words of a segmented text are numbered 1 through 15, the maximum index is 15 and a 15-dimensional encoding matrix is created; a traversed sentence containing only the word with index 15 is then encoded as [0, 0, …, 0, 1].
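A minimal sketch of this encoding step: each distinct word receives an index, and every sentence becomes a binary vector whose dimension equals the vocabulary size. The 0-based indexing is an implementation choice, not from the patent.

```python
def one_hot_encode(segmented_sentences):
    """Number each word, then encode each sentence as a binary vector
    whose length equals the maximum index (the vocabulary size)."""
    vocab = {}
    for sent in segmented_sentences:
        for word in sent:
            vocab.setdefault(word, len(vocab))  # assign indices in order seen
    dim = len(vocab)
    matrices = []
    for sent in segmented_sentences:
        vec = [0] * dim
        for word in sent:
            vec[vocab[word]] = 1  # mark the position of each present word
        matrices.append(vec)
    return vocab, matrices

sents = [["这", "是", "真实"], ["真实", "的", "自己"]]
vocab, mats = one_hot_encode(sents)
print(mats)  # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```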
Preferably, the word vector conversion model includes assuming a weight relationship between a word matrix in the word matrix set and a word vector in the word vector set, and calculating the weight based on the weight relationship to complete a conversion process from the word matrix set to the word vector set.
Specifically, the weight relationship is:
d = {(t1, w1), (t2, w2), …, (ti, wi), …, (tn, wn)}
where d is the word matrix set, t1, t2, …, tn are the word matrices in the word matrix set (such as the one-hot vector [0, 0, …, 0, 1] above), and w1, w2, …, wn are the weights of the corresponding word matrices.
Further, the weight calculation method comprises:
(Formula rendered as an image in the source.)
wherein fi represents the number of occurrences of a word matrix in the word matrix set, N is the total number of texts in the text data set, nj represents the total number of words in the text data set, ni represents the number of occurrences of the word i in the text data set, and Fm is a weighting factor, generally less than 1.
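The weight formula itself is only an image in the source, but the variables listed (occurrence count, corpus word total, text total, per-word count, damping factor Fm) match a TF-IDF-style weighting. The following sketch is therefore an assumption standing in for the patent's exact formula, not a reproduction of it.

```python
import math

def word_weight(f_i, n_j, N, n_i, F_m=0.9):
    """TF-IDF-style stand-in for the patent's image-only weight formula.
    f_i: occurrences of the word matrix in the word matrix set
    n_j: total number of words in the text data set
    N:   total number of texts in the text data set
    n_i: occurrences of word i in the text data set
    F_m: weighting (damping) factor, assumed < 1
    """
    tf = f_i / n_j                 # term frequency
    idf = math.log(N / (1 + n_i))  # inverse document frequency (smoothed)
    return tf * idf * F_m

print(round(word_weight(5, 100, 10, 1), 4))  # 0.0724
```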
S3, performing dimensionality reduction operation on the word vector set, inputting the word vector set into a convolutional neural network model for training to obtain a training value, judging the size of the training value and a preset threshold value, continuing training of the convolutional neural network model if the training value is larger than the preset threshold value, and finishing training of the convolutional neural network model if the training value is smaller than the preset threshold value.
Preferably, the dimension reduction operation includes calculating covariance of each word vector in the word vector set, and removing the word vector of which the absolute value is greater than a preset covariance threshold value in the covariance to obtain the word vector set after dimension reduction.
Further, the covariance is:
cov(xi, xj) = Σk (xik − x̄i)(xjk − x̄j) / (n − 1)
wherein xi and xj represent word vectors of the word vector set, x̄i and x̄j are their means, n is the number of components summed over, and cov(xi, xj) represents the covariance between xi and xj (reconstructed here as the standard sample covariance; the source renders the formula as an image). If the calculated covariance cov(xi, xj) is not 0, a value greater than 0 represents a positive correlation and a value less than 0 represents a negative correlation.
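One reading of the dimension-reduction step, sketched below: compute pairwise sample covariances and drop any word vector whose absolute covariance with another vector exceeds the threshold, since highly correlated vectors carry redundant information. The pairwise interpretation is an assumption; the patent does not state which covariance each vector is measured against.

```python
def covariance(x, y):
    """Standard sample covariance of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def reduce_by_covariance(vectors, threshold):
    """Keep only vectors whose absolute covariance with every other
    vector stays within the preset threshold."""
    kept = []
    for i, v in enumerate(vectors):
        covs = [covariance(v, w) for j, w in enumerate(vectors) if j != i]
        if all(abs(c) <= threshold for c in covs):
            kept.append(v)
    return kept

vectors = [[1, 2, 3], [2, 4, 6], [1, 0, 1]]
print(reduce_by_covariance(vectors, 1.0))  # [[1, 0, 1]]
```

The first two vectors are strongly correlated (covariance 2.0), so both are dropped at a threshold of 1.0; the third survives.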
In a preferred embodiment of the present invention, the convolutional neural network model includes an input layer, a convolutional layer, a pooling layer, a fully-connected layer, and an output layer, the input layer receives the word vector set, and the convolutional layer, the pooling layer, and the fully-connected layer are trained in combination with an activation function to obtain a training value and output the training value through the output layer.
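The forward pass through the layers just described can be sketched with the standard library only, with words reduced to scalar embeddings for brevity. Kernel sizes, layer widths, and the random weights are illustrative assumptions, not the patent's architecture.

```python
import math
import random

random.seed(42)  # reproducible illustrative weights

def conv1d(seq, kernel):
    """Valid 1-D convolution of a scalar sequence with one kernel."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def forward(word_scalars, kernels, dense):
    """Input -> convolutional layer -> max pooling -> fully connected -> output."""
    feature_maps = [relu(conv1d(word_scalars, k)) for k in kernels]
    pooled = [max(fm) for fm in feature_maps]            # max-pooling layer
    logits = [sum(p * w for p, w in zip(pooled, row))    # fully-connected layer
              for row in dense]
    return softmax(logits)                               # output layer

kernels = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
dense = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
probs = forward([0.2, -0.1, 0.4, 0.3, 0.0], kernels, dense)
print(probs)
```

The output is a probability distribution over the (here, two) candidate subjects, which is what the Softmax activation below produces.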
In a preferred embodiment of the present invention, the activation function may comprise a Softmax function, and the loss function is a least squares function. The Softmax function is:
Oj = e^Ij / Σu e^Iu, u = 1, …, t
wherein Oj represents the output value of the j-th neuron of the output layer, Ij represents the input value of the j-th neuron of the output layer (fed from the fully-connected layer), t represents the total number of neurons of the output layer, and e is Euler's number, an infinite non-repeating decimal;
the least squares method L(s) is:
wherein s is the training value, k is the number of the word vector set after dimension reduction, yiIs the set of word vectors, y'iAnd the predicted value of the convolutional neural network model is obtained.
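The training value of step S3 can be read as this loss compared against the preset threshold. A minimal sketch, with helper names of our own choosing:

```python
def least_squares_loss(y_true, y_pred):
    """L(s) = sum over i of (y_i - y'_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))

def training_finished(training_value, threshold):
    """Per S3: training continues while the value exceeds the preset threshold."""
    return training_value < threshold

loss = least_squares_loss([1.0, 0.0], [0.8, 0.1])
print(loss, training_finished(loss, 0.1))
```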
And S4, receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting article themes.
For example, if an article input by the user describes the literary inquisitions of ancient times, the trained convolutional neural network model outputs the article's subject: the article exposes a system of harsh persecution of literati, expressing the author's deep sympathy for the persecuted intellectuals and strong indignation at the perpetrators' crimes.
The invention also provides an article theme extraction device based on artificial intelligence. Fig. 2 is a schematic diagram illustrating an internal structure of an article theme extraction device based on artificial intelligence according to an embodiment of the present invention.
In the present embodiment, the article theme extraction device 1 based on artificial intelligence may be a PC (personal computer), a terminal device such as a smart phone, a tablet computer, or a mobile computer, or may be a server. The article subject extraction device 1 based on artificial intelligence at least comprises a memory 11, a processor 12, a communication bus 13 and a network interface 14.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may be an internal storage unit of the artificial intelligence based article subject matter extracting apparatus 1 in some embodiments, for example, a hard disk of the artificial intelligence based article subject matter extracting apparatus 1. The memory 11 may also be an external storage device of the article theme extracting apparatus 1 based on artificial intelligence in other embodiments, such as a plug-in hard disk provided on the article theme extracting apparatus 1 based on artificial intelligence, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 11 may also include both an internal storage unit and an external storage device of the artificial intelligence based article subject matter extracting apparatus 1. The memory 11 can be used not only to store application software installed in the artificial intelligence-based article theme extraction device 1 and various types of data, such as the code of the artificial intelligence-based article theme extraction program 01, but also to temporarily store data that has been output or is to be output.
The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip in some embodiments, and is used for executing program codes stored in the memory 11 or Processing data, such as executing the artificial intelligence-based article theme extraction program 01.
The communication bus 13 is used to realize connection communication between these components.
The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), typically used to establish a communication link between the apparatus 1 and other electronic devices.
Optionally, the apparatus 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. Among them, the display may also be appropriately referred to as a display screen or a display unit for displaying information processed in the artificial intelligence based article theme extraction apparatus 1 and for displaying a visualized user interface.
While fig. 2 shows only the artificial intelligence based article subject matter extraction apparatus 1 having the components 11-14 and the artificial intelligence based article subject matter extraction program 01, those skilled in the art will appreciate that the configuration shown in fig. 2 does not constitute a limitation of the artificial intelligence based article subject matter extraction apparatus 1, and may include fewer or more components than shown, or combine certain components, or a different arrangement of components.
In the embodiment of the apparatus 1 shown in fig. 2, the memory 11 stores an artificial intelligence-based article theme extraction program 01; the processor 12 implements the following steps when executing the artificial intelligence based article theme extraction program 01 stored in the memory 11:
the method comprises the steps of firstly, receiving a text data set, and carrying out word segmentation and merging operations on the text data set to obtain a word text set.
Preferably, the text data set includes multiple types of text, such as news, social, academic, government development planning, enterprise investment, and the like.
The cleaning removes stop words, Arabic numerals, and other anomalous words from the text data set, since anomalous words with no actual significance degrade the text classification effect. Stop words are words that have no practical meaning and no influence on text analysis but occur with high frequency, such as common pronouns and prepositions. Specifically, the cleaning constructs an anomalous-word table in advance and traverses the words in the text data set in turn; any word that also appears in the anomalous-word table is removed, until the traversal is complete.
The word segmentation splits the text data set into individual words; segmentation is essential because written Chinese has no explicit separator between words. Preferably, the word segmentation of the present invention may be processed with the jieba segmentation library, which is available for programming languages such as Python and Java. jieba is built on the part-of-speech characteristics of Chinese: it converts the occurrence count of each word in the text data set into a frequency, searches for the maximum-probability path by dynamic programming, and finds the maximum segmentation combination based on word frequency. For example, a continuous text passage from the text data set is split by the jieba library into individual words separated by spaces.
Further, since several sentences may share the same subject, the merging combines sentences having the same subject, greatly reducing the number of words in the text data set. Preferably, the merging comprises: traversing each text in the text data set and dividing it into paragraphs; presetting words that occur at least twice in each paragraph as hypothesis subjects; constructing a conditional probability model relating each sentence in each paragraph to the hypothesis subjects; constructing a log-likelihood function and optimizing the conditional probability model with it to obtain the subject of each sentence; and merging sentences with the same subject into one sentence, completing the merging operation.
Specifically, the conditional probability model is:
(Formula rendered as an image in the source.)
wherein y1, …, yN are the hypothesis subjects (yi denoting the i-th), N is the number of hypothesis subjects, D is the paragraph and j is the paragraph number (for example, D1 is the first paragraph of the text), s is a sentence in the paragraph, P(yi|s) is the probability that hypothesis subject yi is the subject of sentence s, and s(i, yi) denotes that the hypothesis subject of sentence i is yi.
Preferably, the log-likelihood function is:
(Formula rendered as an image in the source.)
wherein argmax solves the conditional probability model for the hypothesis subject corresponding to the maximum partial derivative among all the hypothesis subjects.
And step two, converting the word text set into a word matrix set through the encoding operation, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set.
Preferably, the encoding takes the one-hot form: each word in the word text set is given a numeric number, yielding a maximum number; an encoding matrix whose dimension equals the maximum number is created; each sentence in the word text set is traversed in turn and mapped onto the encoding matrix; and the encoding operation is completed according to the number of each word, producing the word matrix set. For example, if the words of the word text set are numbered 1 through 15, the maximum number is 15 and a 15-dimensional encoding matrix is created; a traversed sentence consisting of the word numbered 15 is then encoded as [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1].
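The numbering-and-mapping procedure can be sketched as follows; the function name and the three-word example are illustrative only.

```python
def one_hot_encode(words):
    """Number each distinct word (1-based, in order of first appearance),
    then map every word to a vector whose length equals the maximum
    number, with a 1 at the word's own position."""
    numbering = {}
    for w in words:
        if w not in numbering:
            numbering[w] = len(numbering) + 1
    dim = len(numbering)                 # the "maximum number"
    matrix = []
    for w in words:
        vec = [0] * dim
        vec[numbering[w] - 1] = 1        # set the word's own position
        matrix.append(vec)
    return numbering, matrix

numbering, matrix = one_hot_encode(["this", "is", "true"])
print(matrix[-1])   # → [0, 0, 1]
```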
Preferably, the word vector conversion model assumes a weight relationship between the word matrices in the word matrix set and the word vectors in the word vector set, and calculates the weights based on that relationship to complete the conversion from the word matrix set to the word vector set.
Specifically, the weight relationship is:
d = {(t_1, w_1), (t_2, w_2), …, (t_i, w_i), …, (t_n, w_n)}
where d is the word matrix set; t_1, t_2, …, t_n are the word matrices in the word matrix set, i.e., one-hot vectors such as [0, 0, …, 0, 1] above; and w_1, w_2, …, w_n are the weights of the corresponding word matrices.
Further, the weight calculation method comprises:
(weight calculation formula, rendered as an image in the original publication)
where f_i represents the number of occurrences of a word matrix in the word matrix set, N is the total number of texts in the text data set, N_j represents the total number of words in the text data set, N_i represents the number of occurrences of word i in the text data set, and F_m is the weighting factor, which is generally less than 1.
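The weight formula itself is rendered as an image in the original publication; the sketch below therefore assumes a standard TF-IDF-style combination of the variables defined above (term frequency times log inverse document frequency times the weighting factor F_m). This form is an assumption, not the patent's confirmed formula.

```python
import math

def word_weight(f_i, n_j, n_docs, n_i, f_m=0.5):
    """Assumed TF-IDF-style weight: term frequency (f_i / n_j) scaled by
    inverse document frequency log(n_docs / n_i) and a factor f_m < 1."""
    return (f_i / n_j) * math.log(n_docs / n_i) * f_m

# A word occurring 3 times among 100 words, appearing in 5 of 50 texts:
w = word_weight(f_i=3, n_j=100, n_docs=50, n_i=5)
print(round(w, 4))   # → 0.0345
```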
And step three, inputting the word vector set into a convolutional neural network model after the dimension reduction operation to obtain a training value, comparing the training value with a preset threshold, continuing training the convolutional neural network model if the training value is greater than the preset threshold, and finishing the training of the convolutional neural network model if the training value is less than the preset threshold.
Preferably, the dimension reduction operation includes calculating the covariance of each word vector in the word vector set and removing every word vector whose covariance exceeds a preset covariance threshold in absolute value, to obtain the word vector set after dimension reduction.
Further, the covariance is:
cov(x_i, x_j) = Σ_{k=1}^{n} (x_{ik} − x̄_i)(x_{jk} − x̄_j) / (n − 1)
where x_i and x_j are word vectors in the word vector set, n is the number of word vectors in the set, and cov(x_i, x_j) denotes the covariance between x_i and x_j. If the calculated covariance cov(x_i, x_j) is not 0, a value greater than 0 indicates a positive correlation and a value less than 0 indicates a negative correlation.
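A toy sketch of the covariance-based filtering: compute pairwise sample covariances and drop any vector whose covariance with an already-kept vector exceeds the threshold in absolute value. The greedy keep/drop policy and all names are assumptions about how the described step might be realized, not the patent's exact procedure.

```python
def covariance(x, y):
    """Sample covariance of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def drop_redundant(vectors, threshold=0.9):
    """Keep each vector only if its covariance with every already-kept
    vector stays within the threshold in absolute value."""
    kept = []
    for v in vectors:
        if all(abs(covariance(v, k)) <= threshold for k in kept):
            kept.append(v)
    return kept

# The second vector duplicates the first (covariance 1.0) and is dropped.
vecs = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [3.0, 1.0, 2.0]]
reduced = drop_redundant(vecs, threshold=0.9)
```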
In a preferred embodiment of the present invention, the convolutional neural network model includes an input layer, a convolutional layer, a pooling layer, a fully-connected layer, and an output layer, the input layer receives the word vector set, and the convolutional layer, the pooling layer, and the fully-connected layer are trained in combination with an activation function to obtain a training value and output the training value through the output layer.
In a preferred embodiment of the present invention, the activation function may comprise a Softmax function, and the loss function is a least squares function. The Softmax function is:
O_j = e^{I_j} / Σ_{k=1}^{t} e^{I_k}
where O_j represents the output value of the j-th neuron of the fully-connected layer, I_j represents the input value of the j-th neuron of the output layer, t represents the total number of neurons in the output layer, and e is Euler's number, an infinite non-repeating decimal;
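The Softmax activation above can be sketched numerically as follows (the max-subtraction is a standard numerical-stability trick, not part of the formula itself; names are illustrative):

```python
import math

def softmax(inputs):
    """Softmax over the output-layer inputs I_j: each output O_j is
    e**I_j divided by the sum of e**I_k over all t neurons."""
    m = max(inputs)                      # subtract max for numerical stability
    exps = [math.exp(i - m) for i in inputs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
# probabilities sum to 1, with the largest mass on the largest input
```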
the least squares method L(s) is:
L(s) = Σ_{i=1}^{k} (y_i − y′_i)²
where s is the training value, k is the number of word vectors in the set after dimension reduction, y_i is the i-th word vector in the word vector set, and y'_i is the predicted value of the convolutional neural network model.
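A minimal sketch of the least-squares comparison between targets and predictions, assuming the plain sum-of-squared-differences form (names are illustrative):

```python
def least_squares_loss(y_true, y_pred):
    """Sum of squared differences between the word-vector targets y_i
    and the network's predictions y'_i."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))

loss = least_squares_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
print(loss)   # → 1.25
```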
And step four, receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article subject.
For example, if an article input by the user describing the literary inquisitions of ancient times is received, the trained convolutional neural network model outputs the theme of the article as follows: the article describes the literary inquisition system's harsh persecution of the literati, expressing the author's deep sympathy for the persecuted intellectuals and strong indignation at the persecution.
Alternatively, in other embodiments, the article theme extraction program based on artificial intelligence can be further divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.
For example, referring to fig. 3, a schematic diagram of program modules of an artificial intelligence-based article theme extraction program in an embodiment of the artificial intelligence-based article theme extraction apparatus according to the present invention is shown, in this embodiment, the artificial intelligence-based article theme extraction program may be divided into a data receiving module 10, a word vector solving module 20, a model training module 30, and an article theme output module 40, which exemplarily:
the data receiving module 10 is configured to: receiving a text data set, and carrying out word segmentation and merging operations on the text data set to obtain a word text set.
The word vector solving module 20 is configured to: and converting the word text set into a word matrix set after encoding operation, and inputting the word matrix set into a word vector conversion model to train so as to obtain a word vector set.
The model training module 30 is configured to: perform the dimension reduction operation on the word vector set and input it into a convolutional neural network model for training to obtain a training value; compare the training value with a preset threshold; continue training the convolutional neural network model if the training value is greater than the preset threshold; and finish training the convolutional neural network model if the training value is less than the preset threshold.
The article subject output module 40 is configured to: and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
The functions or operation steps implemented by the data receiving module 10, the word vector solving module 20, the model training module 30, the article subject output module 40 and the other program modules when executed are substantially the same as those of the above embodiments, and are not repeated here.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium, on which an artificial intelligence-based article subject matter extraction program is stored, where the artificial intelligence-based article subject matter extraction program is executable by one or more processors to implement the following operations:
receiving a text data set, and carrying out word segmentation and merging operations on the text data set to obtain a word text set.
And converting the word text set into a word matrix set after encoding operation, and inputting the word matrix set into a word vector conversion model to train so as to obtain a word vector set.
And after the dimension reduction operation is performed on the word vector set, inputting the word vector set into a convolutional neural network model for training to obtain a training value, comparing the training value with a preset threshold, continuing training the convolutional neural network model if the training value is greater than the preset threshold, and finishing training the convolutional neural network model if the training value is less than the preset threshold.
And receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
It should be noted that the above numbering of the embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. The terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An article theme extraction method based on artificial intelligence is characterized by comprising the following steps:
receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;
the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;
inputting the word vector set into a convolutional neural network model after the dimension reduction operation to obtain a training value, comparing the training value with a preset threshold, continuing training the convolutional neural network model if the training value is greater than the preset threshold, and finishing the training of the convolutional neural network model if the training value is less than the preset threshold;
and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
2. The artificial intelligence based article subject matter extraction method of claim 1, wherein the merging operation comprises:
traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;
presetting words with the occurrence frequency more than or equal to two times in the plurality of paragraphs as a hypothesis subject, and constructing a conditional probability model of each sentence in the plurality of paragraphs and the hypothesis subject;
and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.
3. The method for extracting an article theme based on artificial intelligence as claimed in claim 2, wherein the conditional probability model is:
(conditional probability model formula, rendered as an image in the original publication)
wherein y_1, …, y_N are the hypothesis subjects, N is the number of hypothesis subjects, D is the paragraph, j is the number of the paragraph, s is a sentence in the paragraph, P(y_i|s) is the probability that the hypothesis subject y_i is the subject of sentence s, and s(i, y_i) denotes that the hypothesis subject of sentence i is y_i.
4. An artificial intelligence based article theme extraction method as claimed in any one of claims 1 to 3, wherein the encoding operation comprises:
numbering each word in the word text set by a number to obtain the maximum number;
creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;
and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.
5. The artificial intelligence based article subject matter extraction method of claim 4, wherein the dimension reduction operation comprises:
calculating the covariance of each word vector in the word vector set;
and removing the word vectors with the absolute values larger than the preset covariance threshold value in the covariance to obtain a word vector set after dimension reduction.
6. An artificial intelligence based article subject matter extraction apparatus, the apparatus comprising a memory and a processor, the memory having stored thereon an artificial intelligence based article subject matter extraction program executable on the processor, the artificial intelligence based article subject matter extraction program when executed by the processor implementing the steps of:
receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;
the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;
inputting the word vector set into a convolutional neural network model after the dimension reduction operation to obtain a training value, comparing the training value with a preset threshold, continuing training the convolutional neural network model if the training value is greater than the preset threshold, and finishing the training of the convolutional neural network model if the training value is less than the preset threshold;
and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
7. The artificial intelligence based article subject matter extraction apparatus of claim 6, wherein the merging operation comprises:
traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;
presetting words with the occurrence frequency more than or equal to two times in the plurality of paragraphs as a hypothesis subject, and constructing a conditional probability model of each sentence in the plurality of paragraphs and the hypothesis subject;
and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.
8. The artificial intelligence based article theme extraction device of claim 7, wherein the conditional probability model is:
(conditional probability model formula, rendered as an image in the original publication)
wherein y_1, …, y_N are the hypothesis subjects, N is the number of hypothesis subjects, D is the paragraph, j is the number of the paragraph, s is a sentence in the paragraph, P(y_i|s) is the probability that the hypothesis subject y_i is the subject of sentence s, and s(i, y_i) denotes that the hypothesis subject of sentence i is y_i.
9. An artificial intelligence based article theme extraction apparatus as claimed in any one of claims 6 to 8, wherein the encoding operation comprises:
numbering each word in the word text set by a number to obtain the maximum number;
creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;
and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.
10. A computer-readable storage medium having an artificial intelligence based article subject matter extraction program stored thereon, the artificial intelligence based article subject matter extraction program being executable by one or more processors to implement the steps of the artificial intelligence based article subject matter extraction method according to any one of claims 1 to 5.
CN201910826795.4A 2019-09-02 2019-09-02 Article subject matter extraction method and device based on artificial intelligence and computer readable storage medium Active CN110705268B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910826795.4A CN110705268B (en) 2019-09-02 2019-09-02 Article subject matter extraction method and device based on artificial intelligence and computer readable storage medium
PCT/CN2019/116936 WO2021042517A1 (en) 2019-09-02 2019-11-10 Artificial intelligence-based article gist extraction method and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910826795.4A CN110705268B (en) 2019-09-02 2019-09-02 Article subject matter extraction method and device based on artificial intelligence and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110705268A true CN110705268A (en) 2020-01-17
CN110705268B CN110705268B (en) 2024-06-25

Family

ID=69193514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910826795.4A Active CN110705268B (en) 2019-09-02 2019-09-02 Article subject matter extraction method and device based on artificial intelligence and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110705268B (en)
WO (1) WO2021042517A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651652A (en) * 2020-04-30 2020-09-11 中国平安财产保险股份有限公司 Emotional tendency recognition method, device, equipment and medium based on artificial intelligence

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293978A1 (en) * 2017-04-07 2018-10-11 Conduent Business Services, Llc Performing semantic analyses of user-generated textual and voice content
CN109086340A (en) * 2018-07-10 2018-12-25 太原理工大学 Evaluation object recognition methods based on semantic feature
CN109871532A (en) * 2019-01-04 2019-06-11 平安科技(深圳)有限公司 Text subject extracting method, device and storage medium
CN110019793A (en) * 2017-10-27 2019-07-16 阿里巴巴集团控股有限公司 A kind of text semantic coding method and device
CN110059191A (en) * 2019-05-07 2019-07-26 山东师范大学 A kind of text sentiment classification method and device
CN110110330A (en) * 2019-04-30 2019-08-09 腾讯科技(深圳)有限公司 Text based keyword extracting method and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509413A (en) * 2018-03-08 2018-09-07 平安科技(深圳)有限公司 Digest extraction method, device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293978A1 (en) * 2017-04-07 2018-10-11 Conduent Business Services, Llc Performing semantic analyses of user-generated textual and voice content
CN110019793A (en) * 2017-10-27 2019-07-16 阿里巴巴集团控股有限公司 A kind of text semantic coding method and device
CN109086340A (en) * 2018-07-10 2018-12-25 太原理工大学 Evaluation object recognition methods based on semantic feature
CN109871532A (en) * 2019-01-04 2019-06-11 平安科技(深圳)有限公司 Text subject extracting method, device and storage medium
CN110110330A (en) * 2019-04-30 2019-08-09 腾讯科技(深圳)有限公司 Text based keyword extracting method and computer equipment
CN110059191A (en) * 2019-05-07 2019-07-26 山东师范大学 A kind of text sentiment classification method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651652A (en) * 2020-04-30 2020-09-11 中国平安财产保险股份有限公司 Emotional tendency recognition method, device, equipment and medium based on artificial intelligence
CN111651652B (en) * 2020-04-30 2023-11-10 中国平安财产保险股份有限公司 Emotion tendency identification method, device, equipment and medium based on artificial intelligence

Also Published As

Publication number Publication date
CN110705268B (en) 2024-06-25
WO2021042517A1 (en) 2021-03-11

Similar Documents

Publication Publication Date Title
CN107291795B (en) Text classification method combining dynamic word embedding and part-of-speech tagging
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
WO2020224213A1 (en) Sentence intent identification method, device, and computer readable storage medium
US20230195773A1 (en) Text classification method, apparatus and computer-readable storage medium
CN110442857B (en) Emotion intelligent judging method and device and computer readable storage medium
CN110263323A (en) Keyword abstraction method and system based on the long Memory Neural Networks in short-term of fence type
CN110532381B (en) Text vector acquisition method and device, computer equipment and storage medium
CN108009148A (en) Text emotion classification method for expressing based on deep learning
CN106649853A (en) Short text clustering method based on deep learning
CN110427480B (en) Intelligent personalized text recommendation method and device and computer readable storage medium
CN109919175B (en) Entity multi-classification method combined with attribute information
CN111241828A (en) Intelligent emotion recognition method and device and computer readable storage medium
CN109977402B (en) Named entity identification method and system
CN113553510B (en) Text information recommendation method and device and readable medium
CN110489545A (en) File classification method and device, storage medium, computer equipment
US11373043B2 (en) Technique for generating and utilizing virtual fingerprint representing text data
CN110413773A (en) Intelligent text classification method, device and computer readable storage medium
CN111783471A (en) Semantic recognition method, device, equipment and storage medium of natural language
CN111400494A (en) Sentiment analysis method based on GCN-Attention
CN112000778A (en) Natural language processing method, device and system based on semantic recognition
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN114091452A (en) Adapter-based transfer learning method, device, equipment and storage medium
WO2023134085A1 (en) Question answer prediction method and prediction apparatus, electronic device, and storage medium
CN116049387A (en) Short text classification method, device and medium based on graph convolution
CN112560504A (en) Method, electronic equipment and computer readable medium for extracting information in form document

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40019636

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant