CN110705268A - Article subject extraction method and device based on artificial intelligence and computer-readable storage medium - Google Patents


Info

Publication number
CN110705268A
CN110705268A (application CN201910826795.4A; granted as CN110705268B)
Authority
CN
China
Prior art keywords
word
artificial intelligence
subject
text data
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910826795.4A
Other languages
Chinese (zh)
Other versions
CN110705268B (en)
Inventor
陈一峰
周骏红
汪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910826795.4A priority Critical patent/CN110705268B/en
Priority to PCT/CN2019/116936 priority patent/WO2021042517A1/en
Publication of CN110705268A publication Critical patent/CN110705268A/en
Application granted granted Critical
Publication of CN110705268B publication Critical patent/CN110705268B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses an artificial-intelligence-based article subject extraction method comprising the following steps: receiving a text data set and performing word segmentation and merging operations on it to obtain a word text set; encoding the word text set to convert it into a word matrix set; inputting the word matrix set into a word vector conversion model for training to obtain a word vector set; performing a dimension reduction operation on the word vector set and inputting it into a convolutional neural network model for training; and converting text data input by a user into word vectors, inputting the word vectors into the trained convolutional neural network model to obtain the article subject, and outputting the article subject. The invention also provides an artificial-intelligence-based article subject extraction device and a computer-readable storage medium. The invention can realize an accurate and efficient artificial-intelligence-based article subject extraction function.

Description

Article subject extraction method and device based on artificial intelligence and computer-readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for extracting article themes based on artificial intelligence and a computer-readable storage medium.
Background
At present, the subjects of most articles are identified by industry professionals. For example, enterprise development reports are read, researched, and summarized manually so that senior leaders can make decisions, and academic reports are condensed by specialists for others to study. This approach is time-consuming and labor-intensive. In addition, article subject extraction has been performed with the traditional naive Bayes algorithm, but that algorithm consumes substantial computational resources and produces subjects with a high error rate, so it cannot meet practical requirements.
Disclosure of Invention
The invention provides an article subject extraction method and device based on artificial intelligence and a computer readable storage medium, and mainly aims to perform intelligent subject extraction according to articles input by a user.
In order to achieve the above object, the invention provides an article theme extraction method based on artificial intelligence, which comprises the following steps:
receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;
the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;
performing a dimension reduction operation on the word vector set, inputting it into a convolutional neural network model to obtain a training value, and comparing the training value with a preset threshold: if the training value is greater than the preset threshold, training of the convolutional neural network model continues; if the training value is less than the preset threshold, training of the convolutional neural network model is finished;
and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
Optionally, the merging operation includes:
traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;
presetting words that occur at least twice in the plurality of paragraphs as hypothesis subjects, and constructing a conditional probability model relating each sentence in the plurality of paragraphs to the hypothesis subjects;
and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.
Optionally, the conditional probability model is:
(Formula rendered as an image in the source.)
wherein y1, …, yN are the hypothesis subjects (yi denoting the i-th), N is the number of hypothesis subjects, D is the paragraph and j is the paragraph number, s is a sentence in the paragraph, P(yi|s) is the probability that hypothesis subject yi is the subject of sentence s, and s(i, yi) denotes that the hypothesis subject of sentence i is yi.
Optionally, the encoding operation comprises:
numbering each word in the word text set by a number to obtain the maximum number;
creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;
and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.
Optionally, the dimension reduction operation comprises:
calculating the covariance of each word vector in the word vector set;
and removing the word vectors with the absolute values larger than the preset covariance threshold value in the covariance to obtain a word vector set after dimension reduction.
In order to achieve the above object, the present invention further provides an artificial intelligence-based article theme extraction device, including a memory and a processor, wherein the memory stores an artificial intelligence-based article theme extraction program executable on the processor, and the artificial intelligence-based article theme extraction program implements the following steps when executed by the processor:
receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;
the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;
inputting the word vector set into a convolutional neural network model after the dimensionality reduction operation to obtain a training value, judging the size of the training value and a preset threshold value, continuing training the convolutional neural network model if the training value is larger than the preset threshold value, and finishing the training of the convolutional neural network model if the training value is smaller than the preset threshold value;
and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
Optionally, the merging operation includes:
traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;
presetting words with the occurrence frequency more than or equal to two times in the plurality of paragraphs as a hypothesis subject, and constructing a conditional probability model of each sentence in the plurality of paragraphs and the hypothesis subject;
and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.
Optionally, the conditional probability model is:
(Formula rendered as an image in the source.)
wherein y1, …, yN are the hypothesis subjects (yi denoting the i-th), N is the number of hypothesis subjects, D is the paragraph and j is the paragraph number, s is a sentence in the paragraph, P(yi|s) is the probability that hypothesis subject yi is the subject of sentence s, and s(i, yi) denotes that the hypothesis subject of sentence i is yi.
Optionally, the encoding operation comprises:
numbering each word in the word text set by a number to obtain the maximum number;
creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;
and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having an artificial intelligence based article subject matter extraction program stored thereon, the artificial intelligence based article subject matter extraction program being executable by one or more processors to implement the steps of the artificial intelligence based article subject matter extraction method as described above.
The method first performs word segmentation and merging operations on a text data set to obtain a word text set, preventing erroneous words from affecting the subject of the whole article. The word text set is then encoded and converted into word vectors to obtain a word vector set; the encoding and word vector conversion reduce word dimensionality while amplifying characteristic attributes. Therefore, the artificial-intelligence-based article subject extraction method and device and the computer-readable storage medium of the invention can produce accurate article subject output.
Drawings
Fig. 1 is a schematic flowchart of an article subject extraction method based on artificial intelligence according to an embodiment of the present invention;
fig. 2 is a schematic internal structural diagram of an article theme extraction device based on artificial intelligence according to an embodiment of the present invention;
fig. 3 is a block diagram illustrating an article subject extraction program based on artificial intelligence in an article subject extraction device based on artificial intelligence according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an article subject extraction method based on artificial intelligence. Referring to fig. 1, a flow chart of an article subject extraction method based on artificial intelligence according to an embodiment of the present invention is shown. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the article subject extraction method based on artificial intelligence includes:
and S1, receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set.
Preferably, the text data set includes multiple types of text, such as news, social, academic, government development planning, enterprise investment, and the like.
The cleaning removes stop words, Arabic numerals, and other anomalous words from the text data set, since anomalous words with no actual significance degrade the text classification effect. Stop words are words that have no practical meaning and no influence on text analysis but occur with high frequency, such as common pronouns and prepositions. Specifically, the cleaning constructs an anomalous-word table in advance and traverses the words in the text data set in turn; any word that also appears in the anomalous-word table is removed, until the traversal is complete.
The word segmentation splits the text data set into individual words; segmentation is essential because written Chinese has no explicit separator between words. Preferably, the word segmentation of the present invention may be processed with the jieba segmentation library, which is available for programming languages such as Python and Java. jieba is built on the part-of-speech characteristics of Chinese: it converts the occurrence count of each word in the text data set into a frequency, searches for the maximum-probability path by dynamic programming, and finds the maximum segmentation combination based on word frequency. For example, a continuous text passage from the text data set is split by the jieba library into individual words separated by spaces.
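The maximum-probability-path segmentation described above can be sketched in a few lines of Python. The frequency dictionary `FREQ`, the smoothing value for unknown single characters, and the word-length cap are illustrative assumptions; a production segmenter ships a large dictionary learned from corpora.

```python
import math

# Toy frequency dictionary (illustrative values, not from the patent).
FREQ = {"自然": 10, "语言": 12, "处理": 8, "自然语言": 15, "自然语言处理": 1}
TOTAL = sum(FREQ.values())

def segment(sentence):
    """Maximum-probability-path word segmentation via dynamic programming."""
    n = len(sentence)
    # best[i] = (log-probability of the best segmentation of sentence[:i],
    #            start index of the last word in that segmentation)
    best = [(-math.inf, 0) for _ in range(n + 1)]
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - 6), i):  # cap candidate words at 6 chars
            word = sentence[j:i]
            # Unknown single characters get a small smoothed count;
            # unknown multi-character strings are not candidate words.
            freq = FREQ.get(word, 0.5 if len(word) == 1 else 0.0)
            if freq == 0.0:
                continue
            score = best[j][0] + math.log(freq / TOTAL)
            if score > best[i][0]:
                best[i] = (score, j)
    # Backtrack along the recorded split points.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(sentence[j:i])
        i = j
    return words[::-1]

print(segment("自然语言处理"))  # the split "自然语言 / 处理" scores highest here
```

With these toy frequencies, the two-word split outscores both the single dictionary entry for the whole string and any character-by-character segmentation.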
Further, since several sentences may share the same subject, the merging combines sentences having the same subject, greatly reducing the number of words in the text data set. Preferably, the merging comprises: traversing each text in the text data set and dividing it into paragraphs; presetting words that occur at least twice in each paragraph as hypothesis subjects; constructing a conditional probability model relating each sentence in each paragraph to the hypothesis subjects; constructing a log-likelihood function and optimizing the conditional probability model with it to obtain the subject of each sentence; and merging sentences with the same subject into one sentence, completing the merging operation.
Specifically, the conditional probability model is:
(Formula rendered as an image in the source.)
wherein y1, …, yN are the hypothesis subjects (yi denoting the i-th), N is the number of hypothesis subjects, D is the paragraph and j is the paragraph number (for example, D1 is the first paragraph of the text), s is a sentence in the paragraph, P(yi|s) is the probability that hypothesis subject yi is the subject of sentence s, and s(i, yi) denotes that the hypothesis subject of sentence i is yi.
Preferably, the log-likelihood function is:
(Formula rendered as an image in the source.)
wherein argmax solves the conditional probability model for the hypothesis subject corresponding to the maximum partial derivative among all the hypothesis subjects.
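Because the conditional probability model and log-likelihood function appear only as images in the source, the sketch below substitutes a simple frequency heuristic for them: words occurring at least twice in the paragraph are the hypothesis subjects, each sentence is assigned the most frequent hypothesis subject it contains, and sentences sharing a subject are merged. All function and variable names are illustrative.

```python
from collections import Counter

def merge_by_subject(paragraph):
    """paragraph: list of sentences, each a list of words.
    Returns the sentences merged by shared hypothesis subject."""
    counts = Counter(word for sent in paragraph for word in sent)
    # Hypothesis subjects: words occurring at least twice in the paragraph.
    subjects = {w for w, c in counts.items() if c >= 2}
    merged, order = {}, []
    for sent in paragraph:
        candidates = [w for w in sent if w in subjects]
        # Stand-in for the conditional probability model: pick the most
        # frequent candidate subject (None if the sentence has no subject).
        key = max(candidates, key=lambda w: counts[w]) if candidates else None
        if key not in merged:
            merged[key] = []
            order.append(key)
        merged[key].extend(sent)
    return [merged[k] for k in order]

paragraph = [["猫", "跑"], ["猫", "叫"], ["天", "晴"]]
print(merge_by_subject(paragraph))  # first two sentences share the subject "猫"
```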
And S2, converting the word text set into a word matrix set after the encoding operation is carried out on the word text set, and inputting the word matrix set into a word vector conversion model to train to obtain a word vector set.
Preferably, the encoding takes the one-hot form. The one-hot encoding assigns a numerical index to each word in the word text set to obtain the maximum index, creates an encoding matrix whose dimension equals the maximum index, traverses each sentence in the word text set in turn, maps each sentence onto the encoding matrix, and completes the encoding operation according to the index of each word, obtaining the word matrix set. For example, if the words of a segmented text are numbered 1 through 15, the maximum index is 15 and a 15-dimensional encoding matrix is created; a traversed sentence containing only the word with index 15 is then encoded as [0, 0, …, 0, 1].
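A minimal sketch of this encoding step: each distinct word receives an index, and every sentence becomes a binary vector whose dimension equals the vocabulary size. The 0-based indexing is an implementation choice, not from the patent.

```python
def one_hot_encode(segmented_sentences):
    """Number each word, then encode each sentence as a binary vector
    whose length equals the maximum index (the vocabulary size)."""
    vocab = {}
    for sent in segmented_sentences:
        for word in sent:
            vocab.setdefault(word, len(vocab))  # assign indices in order seen
    dim = len(vocab)
    matrices = []
    for sent in segmented_sentences:
        vec = [0] * dim
        for word in sent:
            vec[vocab[word]] = 1  # mark the position of each present word
        matrices.append(vec)
    return vocab, matrices

sents = [["这", "是", "真实"], ["真实", "的", "自己"]]
vocab, mats = one_hot_encode(sents)
print(mats)  # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```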
Preferably, the word vector conversion model includes assuming a weight relationship between a word matrix in the word matrix set and a word vector in the word vector set, and calculating the weight based on the weight relationship to complete a conversion process from the word matrix set to the word vector set.
Specifically, the weight relationship is:
d = {(t1, w1), (t2, w2), …, (ti, wi), …, (tn, wn)}
where d is the word matrix set, t1, t2, …, tn are the word matrices in the word matrix set (such as the one-hot vector [0, 0, …, 0, 1] above), and w1, w2, …, wn are the weights of the corresponding word matrices.
Further, the weight calculation method comprises:
(Formula rendered as an image in the source.)
wherein fi represents the number of occurrences of a word matrix in the word matrix set, N is the total number of texts in the text data set, nj represents the total number of words in the text data set, ni represents the number of occurrences of the word i in the text data set, and Fm is a weighting factor, generally less than 1.
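The weight formula itself is only an image in the source, but the variables listed (occurrence count, corpus word total, text total, per-word count, damping factor Fm) match a TF-IDF-style weighting. The following sketch is therefore an assumption standing in for the patent's exact formula, not a reproduction of it.

```python
import math

def word_weight(f_i, n_j, N, n_i, F_m=0.9):
    """TF-IDF-style stand-in for the patent's image-only weight formula.
    f_i: occurrences of the word matrix in the word matrix set
    n_j: total number of words in the text data set
    N:   total number of texts in the text data set
    n_i: occurrences of word i in the text data set
    F_m: weighting (damping) factor, assumed < 1
    """
    tf = f_i / n_j                 # term frequency
    idf = math.log(N / (1 + n_i))  # inverse document frequency (smoothed)
    return tf * idf * F_m

print(round(word_weight(5, 100, 10, 1), 4))  # 0.0724
```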
S3, performing dimensionality reduction operation on the word vector set, inputting the word vector set into a convolutional neural network model for training to obtain a training value, judging the size of the training value and a preset threshold value, continuing training of the convolutional neural network model if the training value is larger than the preset threshold value, and finishing training of the convolutional neural network model if the training value is smaller than the preset threshold value.
Preferably, the dimension reduction operation includes calculating covariance of each word vector in the word vector set, and removing the word vector of which the absolute value is greater than a preset covariance threshold value in the covariance to obtain the word vector set after dimension reduction.
Further, the covariance is:
cov(xi, xj) = Σk (xik − x̄i)(xjk − x̄j) / (n − 1)
wherein xi and xj represent word vectors of the word vector set, x̄i and x̄j are their means, n is the number of components summed over, and cov(xi, xj) represents the covariance between xi and xj (reconstructed here as the standard sample covariance; the source renders the formula as an image). If the calculated covariance cov(xi, xj) is not 0, a value greater than 0 represents a positive correlation and a value less than 0 represents a negative correlation.
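One reading of the dimension-reduction step, sketched below: compute pairwise sample covariances and drop any word vector whose absolute covariance with another vector exceeds the threshold, since highly correlated vectors carry redundant information. The pairwise interpretation is an assumption; the patent does not state which covariance each vector is measured against.

```python
def covariance(x, y):
    """Standard sample covariance of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def reduce_by_covariance(vectors, threshold):
    """Keep only vectors whose absolute covariance with every other
    vector stays within the preset threshold."""
    kept = []
    for i, v in enumerate(vectors):
        covs = [covariance(v, w) for j, w in enumerate(vectors) if j != i]
        if all(abs(c) <= threshold for c in covs):
            kept.append(v)
    return kept

vectors = [[1, 2, 3], [2, 4, 6], [1, 0, 1]]
print(reduce_by_covariance(vectors, 1.0))  # [[1, 0, 1]]
```

The first two vectors are strongly correlated (covariance 2.0), so both are dropped at a threshold of 1.0; the third survives.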
In a preferred embodiment of the present invention, the convolutional neural network model includes an input layer, a convolutional layer, a pooling layer, a fully-connected layer, and an output layer, the input layer receives the word vector set, and the convolutional layer, the pooling layer, and the fully-connected layer are trained in combination with an activation function to obtain a training value and output the training value through the output layer.
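The forward pass through the layers just described can be sketched with the standard library only, with words reduced to scalar embeddings for brevity. Kernel sizes, layer widths, and the random weights are illustrative assumptions, not the patent's architecture.

```python
import math
import random

random.seed(42)  # reproducible illustrative weights

def conv1d(seq, kernel):
    """Valid 1-D convolution of a scalar sequence with one kernel."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def forward(word_scalars, kernels, dense):
    """Input -> convolutional layer -> max pooling -> fully connected -> output."""
    feature_maps = [relu(conv1d(word_scalars, k)) for k in kernels]
    pooled = [max(fm) for fm in feature_maps]            # max-pooling layer
    logits = [sum(p * w for p, w in zip(pooled, row))    # fully-connected layer
              for row in dense]
    return softmax(logits)                               # output layer

kernels = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
dense = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
probs = forward([0.2, -0.1, 0.4, 0.3, 0.0], kernels, dense)
print(probs)
```

The output is a probability distribution over the (here, two) candidate subjects, which is what the Softmax activation below produces.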
In a preferred embodiment of the present invention, the activation function may comprise a Softmax function, and the loss function is a least squares function. The Softmax function is:
Oj = e^Ij / Σu e^Iu, u = 1, …, t
wherein Oj represents the output value of the j-th neuron of the output layer, Ij represents the input value of the j-th neuron of the output layer (fed from the fully-connected layer), t represents the total number of neurons of the output layer, and e is Euler's number, an infinite non-repeating decimal;
the least squares method L(s) is:
wherein s is the training value, k is the number of the word vector set after dimension reduction, yiIs the set of word vectors, y'iAnd the predicted value of the convolutional neural network model is obtained.
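The training value of step S3 can be read as this loss compared against the preset threshold. A minimal sketch, with helper names of our own choosing:

```python
def least_squares_loss(y_true, y_pred):
    """L(s) = sum over i of (y_i - y'_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))

def training_finished(training_value, threshold):
    """Per S3: training continues while the value exceeds the preset threshold."""
    return training_value < threshold

loss = least_squares_loss([1.0, 0.0], [0.8, 0.1])
print(loss, training_finished(loss, 0.1))
```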
And S4, receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting article themes.
For example, if an article input by the user describes the literary inquisitions of ancient times, the trained convolutional neural network model outputs the article's subject: the article exposes a system of harsh persecution of literati, expressing the author's deep sympathy for the persecuted intellectuals and strong indignation at the perpetrators' crimes.
The invention also provides an article theme extraction device based on artificial intelligence. Fig. 2 is a schematic diagram illustrating an internal structure of an article theme extraction device based on artificial intelligence according to an embodiment of the present invention.
In the present embodiment, the article theme extraction device 1 based on artificial intelligence may be a PC (personal computer), a terminal device such as a smart phone, a tablet computer, or a mobile computer, or may be a server. The article subject extraction device 1 based on artificial intelligence at least comprises a memory 11, a processor 12, a communication bus 13 and a network interface 14.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may be an internal storage unit of the artificial intelligence based article subject matter extracting apparatus 1 in some embodiments, for example, a hard disk of the artificial intelligence based article subject matter extracting apparatus 1. The memory 11 may also be an external storage device of the article theme extracting apparatus 1 based on artificial intelligence in other embodiments, such as a plug-in hard disk provided on the article theme extracting apparatus 1 based on artificial intelligence, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 11 may also include both an internal storage unit and an external storage device of the artificial intelligence based article subject matter extracting apparatus 1. The memory 11 can be used not only to store application software installed in the artificial intelligence-based article theme extraction device 1 and various types of data, such as the code of the artificial intelligence-based article theme extraction program 01, but also to temporarily store data that has been output or is to be output.
The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip in some embodiments, and is used for executing program codes stored in the memory 11 or Processing data, such as executing the artificial intelligence-based article theme extraction program 01.
The communication bus 13 is used to realize connection communication between these components.
The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), typically used to establish a communication link between the apparatus 1 and other electronic devices.
Optionally, the apparatus 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. Among them, the display may also be appropriately referred to as a display screen or a display unit for displaying information processed in the artificial intelligence based article theme extraction apparatus 1 and for displaying a visualized user interface.
While fig. 2 shows only the artificial intelligence based article subject matter extraction apparatus 1 having the components 11-14 and the artificial intelligence based article subject matter extraction program 01, those skilled in the art will appreciate that the configuration shown in fig. 2 does not constitute a limitation of the artificial intelligence based article subject matter extraction apparatus 1, and may include fewer or more components than shown, or combine certain components, or a different arrangement of components.
In the embodiment of the apparatus 1 shown in fig. 2, the memory 11 stores an artificial intelligence-based article theme extraction program 01; the processor 12 implements the following steps when executing the artificial intelligence based article theme extraction program 01 stored in the memory 11:
the method comprises the steps of firstly, receiving a text data set, and carrying out word segmentation and merging operations on the text data set to obtain a word text set.
Preferably, the text data set includes multiple types of text, such as news, social, academic, government development planning, enterprise investment, and the like.
The cleaning removes stop words, Arabic numerals, and other anomalous words from the text data set, since anomalous words with no actual significance degrade the text classification effect. Stop words are words that have no practical meaning and no influence on text analysis but occur with high frequency, such as common pronouns and prepositions. Specifically, the cleaning constructs an anomalous-word table in advance and traverses the words in the text data set in turn; any word that also appears in the anomalous-word table is removed, until the traversal is complete.
The word segmentation splits the text data set into individual words; segmentation is essential because written Chinese has no explicit separator between words. Preferably, the word segmentation of the present invention may be processed with the jieba segmentation library, which is available for programming languages such as Python and Java. jieba is built on the part-of-speech characteristics of Chinese: it converts the occurrence count of each word in the text data set into a frequency, searches for the maximum-probability path by dynamic programming, and finds the maximum segmentation combination based on word frequency. For example, a continuous text passage from the text data set is split by the jieba library into individual words separated by spaces.
Further, since several sentences may share the same subject, the merging combines sentences having the same subject, greatly reducing the number of words in the text data set. Preferably, the merging comprises: traversing each text in the text data set and dividing it into paragraphs; presetting words that occur at least twice in each paragraph as hypothesis subjects; constructing a conditional probability model relating each sentence in each paragraph to the hypothesis subjects; constructing a log-likelihood function and optimizing the conditional probability model with it to obtain the subject of each sentence; and merging sentences with the same subject into one sentence, completing the merging operation.
Specifically, the conditional probability model is:
(Formula rendered as an image in the source.)
wherein y1, …, yN are the hypothesis subjects (yi denoting the i-th), N is the number of hypothesis subjects, D is the paragraph and j is the paragraph number (for example, D1 is the first paragraph of the text), s is a sentence in the paragraph, P(yi|s) is the probability that hypothesis subject yi is the subject of sentence s, and s(i, yi) denotes that the hypothesis subject of sentence i is yi.
Preferably, the log-likelihood function is:
(Formula rendered as an image in the source.)
wherein argmax solves the conditional probability model for the hypothesis subject corresponding to the maximum partial derivative among all the hypothesis subjects.
And step two, converting the word text set into a word matrix set through the encoding operation, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set.
Preferably, the encoding takes the one-hot form: each word in the word text set is given a numeric number, yielding a maximum number; an encoding matrix whose dimension equals the maximum number is created; each sentence in the word text set is traversed in turn and mapped onto the encoding matrix; and the encoding operation is completed according to the number of each word, producing the word matrix set. For example, if the words of the word text set are numbered 1 through 15, the maximum number is 15 and a 15-dimensional encoding matrix is created; a traversed sentence consisting of the word numbered 15 is then encoded as [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1].
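The numbering-and-mapping procedure can be sketched as follows; the function name and the three-word example are illustrative only.

```python
def one_hot_encode(words):
    """Number each distinct word (1-based, in order of first appearance),
    then map every word to a vector whose length equals the maximum
    number, with a 1 at the word's own position."""
    numbering = {}
    for w in words:
        if w not in numbering:
            numbering[w] = len(numbering) + 1
    dim = len(numbering)                 # the "maximum number"
    matrix = []
    for w in words:
        vec = [0] * dim
        vec[numbering[w] - 1] = 1        # set the word's own position
        matrix.append(vec)
    return numbering, matrix

numbering, matrix = one_hot_encode(["this", "is", "true"])
print(matrix[-1])   # → [0, 0, 1]
```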
Preferably, the word vector conversion model assumes a weight relationship between the word matrices in the word matrix set and the word vectors in the word vector set, and calculates the weights based on that relationship to complete the conversion from the word matrix set to the word vector set.
Specifically, the weight relationship is:
d = {(t_1, w_1), (t_2, w_2), …, (t_i, w_i), …, (t_n, w_n)}
where d is the word matrix set; t_1, t_2, …, t_n are the word matrices in the word matrix set, i.e., one-hot vectors such as [0, 0, …, 0, 1] above; and w_1, w_2, …, w_n are the weights of the corresponding word matrices.
Further, the weight calculation method comprises:
(weight calculation formula, rendered as an image in the original publication)
where f_i represents the number of occurrences of a word matrix in the word matrix set, N is the total number of texts in the text data set, N_j represents the total number of words in the text data set, N_i represents the number of occurrences of word i in the text data set, and F_m is the weighting factor, which is generally less than 1.
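The weight formula itself is rendered as an image in the original publication; the sketch below therefore assumes a standard TF-IDF-style combination of the variables defined above (term frequency times log inverse document frequency times the weighting factor F_m). This form is an assumption, not the patent's confirmed formula.

```python
import math

def word_weight(f_i, n_j, n_docs, n_i, f_m=0.5):
    """Assumed TF-IDF-style weight: term frequency (f_i / n_j) scaled by
    inverse document frequency log(n_docs / n_i) and a factor f_m < 1."""
    return (f_i / n_j) * math.log(n_docs / n_i) * f_m

# A word occurring 3 times among 100 words, appearing in 5 of 50 texts:
w = word_weight(f_i=3, n_j=100, n_docs=50, n_i=5)
print(round(w, 4))   # → 0.0345
```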
And step three, inputting the word vector set into a convolutional neural network model after the dimension reduction operation to obtain a training value, comparing the training value with a preset threshold, continuing training the convolutional neural network model if the training value is greater than the preset threshold, and finishing the training of the convolutional neural network model if the training value is less than the preset threshold.
Preferably, the dimension reduction operation includes calculating the covariance of each word vector in the word vector set and removing every word vector whose covariance exceeds a preset covariance threshold in absolute value, to obtain the word vector set after dimension reduction.
Further, the covariance is:
cov(x_i, x_j) = Σ_{k=1}^{n} (x_{ik} − x̄_i)(x_{jk} − x̄_j) / (n − 1)
where x_i and x_j are word vectors in the word vector set, n is the number of word vectors in the set, and cov(x_i, x_j) denotes the covariance between x_i and x_j. If the calculated covariance cov(x_i, x_j) is not 0, a value greater than 0 indicates a positive correlation and a value less than 0 indicates a negative correlation.
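A toy sketch of the covariance-based filtering: compute pairwise sample covariances and drop any vector whose covariance with an already-kept vector exceeds the threshold in absolute value. The greedy keep/drop policy and all names are assumptions about how the described step might be realized, not the patent's exact procedure.

```python
def covariance(x, y):
    """Sample covariance of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def drop_redundant(vectors, threshold=0.9):
    """Keep each vector only if its covariance with every already-kept
    vector stays within the threshold in absolute value."""
    kept = []
    for v in vectors:
        if all(abs(covariance(v, k)) <= threshold for k in kept):
            kept.append(v)
    return kept

# The second vector duplicates the first (covariance 1.0) and is dropped.
vecs = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [3.0, 1.0, 2.0]]
reduced = drop_redundant(vecs, threshold=0.9)
```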
In a preferred embodiment of the present invention, the convolutional neural network model includes an input layer, a convolutional layer, a pooling layer, a fully-connected layer, and an output layer, the input layer receives the word vector set, and the convolutional layer, the pooling layer, and the fully-connected layer are trained in combination with an activation function to obtain a training value and output the training value through the output layer.
In a preferred embodiment of the present invention, the activation function may comprise a Softmax function, and the loss function is a least squares function. The Softmax function is:
O_j = e^{I_j} / Σ_{k=1}^{t} e^{I_k}
where O_j represents the output value of the j-th neuron of the fully-connected layer, I_j represents the input value of the j-th neuron of the output layer, t represents the total number of neurons in the output layer, and e is Euler's number, an infinite non-repeating decimal;
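The Softmax activation above can be sketched numerically as follows (the max-subtraction is a standard numerical-stability trick, not part of the formula itself; names are illustrative):

```python
import math

def softmax(inputs):
    """Softmax over the output-layer inputs I_j: each output O_j is
    e**I_j divided by the sum of e**I_k over all t neurons."""
    m = max(inputs)                      # subtract max for numerical stability
    exps = [math.exp(i - m) for i in inputs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
# probabilities sum to 1, with the largest mass on the largest input
```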
the least squares method L(s) is:
L(s) = Σ_{i=1}^{k} (y_i − y′_i)²
where s is the training value, k is the number of word vectors in the set after dimension reduction, y_i is the i-th word vector in the word vector set, and y'_i is the predicted value of the convolutional neural network model.
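A minimal sketch of the least-squares comparison between targets and predictions, assuming the plain sum-of-squared-differences form (names are illustrative):

```python
def least_squares_loss(y_true, y_pred):
    """Sum of squared differences between the word-vector targets y_i
    and the network's predictions y'_i."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))

loss = least_squares_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
print(loss)   # → 1.25
```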
And step four, receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article subject.
For example, if an article input by the user describing the literary inquisitions of ancient times is received, the trained convolutional neural network model outputs the theme of the article as follows: the article describes the literary inquisition system's harsh persecution of the literati, expressing the author's deep sympathy for the persecuted intellectuals and strong indignation at the persecution.
Alternatively, in other embodiments, the article theme extraction program based on artificial intelligence can be further divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.
For example, referring to fig. 3, a schematic diagram of program modules of an artificial intelligence-based article theme extraction program in an embodiment of the artificial intelligence-based article theme extraction apparatus according to the present invention is shown, in this embodiment, the artificial intelligence-based article theme extraction program may be divided into a data receiving module 10, a word vector solving module 20, a model training module 30, and an article theme output module 40, which exemplarily:
the data receiving module 10 is configured to: receiving a text data set, and carrying out word segmentation and merging operations on the text data set to obtain a word text set.
The word vector solving module 20 is configured to: and converting the word text set into a word matrix set after encoding operation, and inputting the word matrix set into a word vector conversion model to train so as to obtain a word vector set.
The model training module 30 is configured to: perform the dimension reduction operation on the word vector set and input it into a convolutional neural network model for training to obtain a training value; compare the training value with a preset threshold; continue training the convolutional neural network model if the training value is greater than the preset threshold; and finish training the convolutional neural network model if the training value is less than the preset threshold.
The article subject output module 40 is configured to: and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
The functions or operation steps implemented by the data receiving module 10, the word vector solving module 20, the model training module 30, the article subject output module 40 and the other program modules when executed are substantially the same as those of the above embodiments, and are not repeated here.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium, on which an artificial intelligence-based article subject matter extraction program is stored, where the artificial intelligence-based article subject matter extraction program is executable by one or more processors to implement the following operations:
receiving a text data set, and carrying out word segmentation and merging operations on the text data set to obtain a word text set.
And converting the word text set into a word matrix set after encoding operation, and inputting the word matrix set into a word vector conversion model to train so as to obtain a word vector set.
And after the dimension reduction operation is performed on the word vector set, inputting the word vector set into a convolutional neural network model for training to obtain a training value, comparing the training value with a preset threshold, continuing training the convolutional neural network model if the training value is greater than the preset threshold, and finishing training the convolutional neural network model if the training value is less than the preset threshold.
And receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
It should be noted that the above numbering of the embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. The terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An article theme extraction method based on artificial intelligence is characterized by comprising the following steps:
receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;
the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;
inputting the word vector set into a convolutional neural network model after the dimension reduction operation to obtain a training value, comparing the training value with a preset threshold, continuing training the convolutional neural network model if the training value is greater than the preset threshold, and finishing the training of the convolutional neural network model if the training value is less than the preset threshold;
and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
2. The artificial intelligence based article subject matter extraction method of claim 1, wherein the merging operation comprises:
traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;
presetting words with the occurrence frequency more than or equal to two times in the plurality of paragraphs as a hypothesis subject, and constructing a conditional probability model of each sentence in the plurality of paragraphs and the hypothesis subject;
and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.
3. The method for extracting an article theme based on artificial intelligence as claimed in claim 2, wherein the conditional probability model is:
(conditional probability model formula, rendered as an image in the original publication)
wherein y_1, …, y_N are the hypothesis subjects, N is the number of hypothesis subjects, D is the paragraph, j is the number of the paragraph, s is a sentence in the paragraph, P(y_i|s) is the probability that the hypothesis subject y_i is the subject of sentence s, and s(i, y_i) denotes that the hypothesis subject of sentence i is y_i.
4. An artificial intelligence based article theme extraction method as claimed in any one of claims 1 to 3, wherein the encoding operation comprises:
numbering each word in the word text set by a number to obtain the maximum number;
creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;
and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.
5. The artificial intelligence based article subject matter extraction method of claim 4, wherein the dimension reduction operation comprises:
calculating the covariance of each word vector in the word vector set;
and removing the word vectors with the absolute values larger than the preset covariance threshold value in the covariance to obtain a word vector set after dimension reduction.
6. An artificial intelligence based article subject matter extraction apparatus, the apparatus comprising a memory and a processor, the memory having stored thereon an artificial intelligence based article subject matter extraction program executable on the processor, the artificial intelligence based article subject matter extraction program when executed by the processor implementing the steps of:
receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;
the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;
inputting the word vector set into a convolutional neural network model after the dimension reduction operation to obtain a training value, comparing the training value with a preset threshold, continuing training the convolutional neural network model if the training value is greater than the preset threshold, and finishing the training of the convolutional neural network model if the training value is less than the preset threshold;
and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.
7. The artificial intelligence based article subject matter extraction apparatus of claim 6, wherein the merging operation comprises:
traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;
presetting words with the occurrence frequency more than or equal to two times in the plurality of paragraphs as a hypothesis subject, and constructing a conditional probability model of each sentence in the plurality of paragraphs and the hypothesis subject;
and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.
8. The artificial intelligence based article theme extraction device of claim 7, wherein the conditional probability model is:
(conditional probability model formula, rendered as an image in the original publication)
wherein y_1, …, y_N are the hypothesis subjects, N is the number of hypothesis subjects, D is the paragraph, j is the number of the paragraph, s is a sentence in the paragraph, P(y_i|s) is the probability that the hypothesis subject y_i is the subject of sentence s, and s(i, y_i) denotes that the hypothesis subject of sentence i is y_i.
9. An artificial intelligence based article theme extraction apparatus as claimed in any one of claims 6 to 8, wherein the encoding operation comprises:
numbering each word in the word text set by a number to obtain the maximum number;
creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;
and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.
10. A computer-readable storage medium having an artificial intelligence based article subject matter extraction program stored thereon, the artificial intelligence based article subject matter extraction program being executable by one or more processors to implement the steps of the artificial intelligence based article subject matter extraction method according to any one of claims 1 to 5.
CN201910826795.4A 2019-09-02 2019-09-02 Article subject matter extraction method and device based on artificial intelligence and computer readable storage medium Active CN110705268B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910826795.4A CN110705268B (en) 2019-09-02 2019-09-02 Article subject matter extraction method and device based on artificial intelligence and computer readable storage medium
PCT/CN2019/116936 WO2021042517A1 (en) 2019-09-02 2019-11-10 Artificial intelligence-based article gist extraction method and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910826795.4A CN110705268B (en) 2019-09-02 2019-09-02 Article subject matter extraction method and device based on artificial intelligence and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110705268A true CN110705268A (en) 2020-01-17
CN110705268B CN110705268B (en) 2024-06-25

Family

ID=69193514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910826795.4A Active CN110705268B (en) 2019-09-02 2019-09-02 Article subject matter extraction method and device based on artificial intelligence and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110705268B (en)
WO (1) WO2021042517A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651652A (en) * 2020-04-30 2020-09-11 中国平安财产保险股份有限公司 Emotional tendency recognition method, device, equipment and medium based on artificial intelligence

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293978A1 (en) * 2017-04-07 2018-10-11 Conduent Business Services, Llc Performing semantic analyses of user-generated textual and voice content
CN109086340A (en) * 2018-07-10 2018-12-25 太原理工大学 Evaluation object recognition methods based on semantic feature
CN109871532A (en) * 2019-01-04 2019-06-11 平安科技(深圳)有限公司 Text subject extracting method, device and storage medium
CN110019793A (en) * 2017-10-27 2019-07-16 阿里巴巴集团控股有限公司 A kind of text semantic coding method and device
CN110059191A (en) * 2019-05-07 2019-07-26 山东师范大学 A kind of text sentiment classification method and device
CN110110330A (en) * 2019-04-30 2019-08-09 腾讯科技(深圳)有限公司 Text based keyword extracting method and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509413A (en) * 2018-03-08 2018-09-07 平安科技(深圳)有限公司 Digest extraction method, device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293978A1 (en) * 2017-04-07 2018-10-11 Conduent Business Services, Llc Performing semantic analyses of user-generated textual and voice content
CN110019793A (en) * 2017-10-27 2019-07-16 阿里巴巴集团控股有限公司 A kind of text semantic coding method and device
CN109086340A (en) * 2018-07-10 2018-12-25 太原理工大学 Evaluation object recognition methods based on semantic feature
CN109871532A (en) * 2019-01-04 2019-06-11 平安科技(深圳)有限公司 Text subject extracting method, device and storage medium
CN110110330A (en) * 2019-04-30 2019-08-09 腾讯科技(深圳)有限公司 Text based keyword extracting method and computer equipment
CN110059191A (en) * 2019-05-07 2019-07-26 山东师范大学 A kind of text sentiment classification method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651652A (en) * 2020-04-30 2020-09-11 中国平安财产保险股份有限公司 Emotional tendency recognition method, device, equipment and medium based on artificial intelligence
CN111651652B (en) * 2020-04-30 2023-11-10 中国平安财产保险股份有限公司 Emotion tendency identification method, device, equipment and medium based on artificial intelligence

Also Published As

Publication number Publication date
CN110705268B (en) 2024-06-25
WO2021042517A1 (en) 2021-03-11

Similar Documents

Publication Publication Date Title
CN107291795B (en) Text classification method combining dynamic word embedding and part-of-speech tagging
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
WO2020224213A1 (en) Sentence intent identification method, device, and computer readable storage medium
US20230195773A1 (en) Text classification method, apparatus and computer-readable storage medium
CN110442857B (en) Emotion intelligent judging method and device and computer readable storage medium
CN110263323A (en) Keyword abstraction method and system based on the long Memory Neural Networks in short-term of fence type
CN110532381B (en) Text vector acquisition method and device, computer equipment and storage medium
CN108009148A (en) Text emotion classification method for expressing based on deep learning
CN106649853A (en) Short text clustering method based on deep learning
CN110427480B (en) Intelligent personalized text recommendation method and device and computer readable storage medium
CN109919175B (en) Entity multi-classification method combined with attribute information
CN111241828A (en) Intelligent emotion recognition method and device and computer readable storage medium
CN109977402B (en) Named entity identification method and system
CN113553510B (en) Text information recommendation method and device and readable medium
CN110489545A (en) File classification method and device, storage medium, computer equipment
US11373043B2 (en) Technique for generating and utilizing virtual fingerprint representing text data
CN110413773A (en) Intelligent text classification method, device and computer readable storage medium
CN111783471A (en) Semantic recognition method, device, equipment and storage medium of natural language
CN111400494A (en) Sentiment analysis method based on GCN-Attention
CN112000778A (en) Natural language processing method, device and system based on semantic recognition
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN114091452A (en) Adapter-based transfer learning method, device, equipment and storage medium
WO2023134085A1 (en) Question answer prediction method and prediction apparatus, electronic device, and storage medium
CN116049387A (en) Short text classification method, device and medium based on graph convolution
CN112560504A (en) Method, electronic equipment and computer readable medium for extracting information in form document

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40019636

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant