WO2021042517A1 - Artificial intelligence-based article gist extraction method and device, and storage medium - Google Patents

Artificial intelligence-based article gist extraction method and device, and storage medium

Info

Publication number
WO2021042517A1
WO2021042517A1 (PCT/CN2019/116936, CN2019116936W)
Authority
WO
WIPO (PCT)
Prior art keywords
word
subject
matrix
text
artificial intelligence
Prior art date
Application number
PCT/CN2019/116936
Other languages
French (fr)
Chinese (zh)
Inventor
陈一峰
周骏红
汪伟
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021042517A1 publication Critical patent/WO2021042517A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device and computer-readable storage medium for extracting the subject matter of articles based on artificial intelligence.
  • This application provides an artificial intelligence-based article subject extraction method, device, and computer-readable storage medium, the main purpose of which is to perform intelligent subject extraction based on the article input by the user.
  • an artificial intelligence-based article subject extraction method includes: receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;
  • converting the word text set into a word matrix set through an encoding operation, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set;
  • performing a dimensionality reduction operation on the word vector set and then inputting it into a convolutional neural network model for training to obtain a training value, and comparing the training value with a preset threshold: if the training value is greater than the preset threshold, the convolutional neural network model continues training, and if the training value is less than the preset threshold, the convolutional neural network model completes training;
  • receiving text data input by the user, converting the text data into word vectors, and inputting them into the trained convolutional neural network model to obtain and output the subject of the article.
  • this application also provides an artificial intelligence-based article subject extraction device, which includes a memory and a processor, and the memory stores an artificial intelligence-based article subject extraction program that can run on the processor.
  • when the artificial intelligence-based article subject extraction program is executed by the processor, the following steps are implemented: receiving a text data set, and performing operations including word segmentation and merging on the text data set to obtain a word text set;
  • converting the word text set into a word matrix set through an encoding operation, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set;
  • performing a dimensionality reduction operation on the word vector set and then inputting it into a convolutional neural network model for training to obtain a training value, and comparing the training value with a preset threshold: if the training value is greater than the preset threshold, the convolutional neural network model continues training, and if the training value is less than the preset threshold, the convolutional neural network model completes training;
  • receiving text data input by the user, converting the text data into word vectors, and inputting them into the trained convolutional neural network model to obtain and output the subject of the article.
  • the present application also provides a computer-readable storage medium on which an artificial intelligence-based article subject extraction program is stored; the program can be executed by one or more processors to implement the steps of the above-mentioned artificial intelligence-based article subject extraction method.
  • This application first performs word segmentation and merging operations on the text data set to obtain a word text set, which can avoid the influence of wrong words on the subject of the entire article.
  • the word text set is encoded and word vector transformed to obtain a word vector set.
  • the encoding operation and the word vector transformation reduce the dimension of the word while amplifying the feature attributes.
  • the convolutional neural network model has excellent feature extraction capabilities, can efficiently identify word features, and improves the accuracy of the output article subject. Therefore, the artificial intelligence-based article subject extraction method, device, and computer-readable storage medium proposed in this application can achieve accurate article subject output results.
  • FIG. 1 is a schematic flowchart of an artificial intelligence-based article subject extraction method provided by an embodiment of the application
  • FIG. 2 is a schematic diagram of the internal structure of an artificial intelligence-based article subject extraction device provided by an embodiment of the application;
  • FIG. 3 is a schematic diagram of modules of an artificial intelligence-based article subject extraction program in an artificial intelligence-based article subject extraction device provided by an embodiment of the application.
  • Referring to FIG. 1, it is a schematic flowchart of an artificial intelligence-based article subject extraction method provided by an embodiment of this application.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
  • the method for extracting the subject matter of an article based on artificial intelligence includes:
  • S1 Receive a text data set, and perform operations including word segmentation and merging on the text data set to obtain a word text set.
  • the text data set includes multiple types of texts, such as news, social, academic, government development planning, and corporate investment.
  • the cleaning removes stop words, Arabic letters, and other heteromorphic words from the text data set, because heteromorphic words with no actual meaning reduce the text classification effect.
  • the stop words have no practical meaning and have no effect on text analysis, but are frequently used words, such as commonly used pronouns and prepositions.
  • the cleaning is to construct a table of heteromorphic words in advance, sequentially traverse the words in the text data set, and if the words are the same as those in the table of heteromorphic words, remove them until the traversal is completed.
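The table-traversal cleaning described above can be sketched in a few lines of Python; the heteromorphic-word table and the sample tokens below are invented for illustration and are not from the application.

```python
# Hypothetical heteromorphic-word table: stop words, stray Latin letters, etc.
HETEROMORPHIC_WORDS = {"的", "了", "和", "a", "b"}

def clean(tokens):
    """Traverse the token list and drop every token found in the table."""
    return [t for t in tokens if t not in HETEROMORPHIC_WORDS]

print(clean(["机器", "的", "学习", "a", "模型"]))  # → ['机器', '学习', '模型']
```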
  • the word segmentation is to segment each sentence in the text data set to obtain a single word. Because there is no clear separation mark between words in Chinese representation, word segmentation is indispensable.
  • the word segmentation described in this application can be performed with the jieba ("stutter") word segmentation library, which is available for programming languages such as Python and Java.
  • the jieba library is developed around Chinese part-of-speech features: it converts the number of occurrences of each word in the text data set into a frequency, finds the maximum-probability path by dynamic programming, and thereby obtains the maximum segmentation combination based on word frequency.
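The frequency-plus-dynamic-programming idea attributed to the jieba library can be illustrated with a toy maximum-probability segmenter. The dictionary and its frequencies below are invented for the example; the real library ships its own dictionary and is simply called as `jieba.lcut(sentence)`.

```python
import math

# Toy dictionary with invented word frequencies (real jieba ships its own).
FREQ = {"研究": 10, "生命": 8, "研究生": 6, "命": 2, "研": 1, "究": 1, "生": 3}
TOTAL = sum(FREQ.values())

def segment(sentence):
    """Find the segmentation whose words maximise the product of word
    frequencies, via dynamic programming over cut positions."""
    n = len(sentence)
    best = [(-math.inf, 0)] * (n + 1)  # best[i] = (log-prob, previous cut)
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):       # consider words up to 4 chars
            word = sentence[j:i]
            if word in FREQ:
                score = best[j][0] + math.log(FREQ[word] / TOTAL)
                if score > best[i][0]:
                    best[i] = (score, j)
    words, i = [], n
    while i > 0:                                 # backtrack along the best cuts
        j = best[i][1]
        words.append(sentence[j:i])
        i = j
    return words[::-1]

print(segment("研究生命"))  # → ['研究', '生命']
```

Here "研究/生命" beats "研究生/命" because the product of frequencies 10×8 exceeds 6×2, exactly the word-frequency criterion the passage describes.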
  • the merging is to merge multiple sentences with the same subject to achieve the purpose of greatly reducing the words in the text data set.
  • the merging includes: traversing each text in the text data set, dividing each text into paragraphs, taking words that appear more than twice in a paragraph as hypothetical subjects, constructing a conditional probability model of each sentence in the paragraph given the hypothetical subjects, constructing a log-likelihood function from that model, optimizing the conditional probability model with the log-likelihood function to obtain the subject of each sentence, and merging sentences with the same subject into one sentence to complete the merging operation.
  • conditional probability model is:
  • y 1 , ..., y N are the hypothetical subjects, with y i denoting the i-th hypothetical subject
  • N is the number of the hypothetical subjects
  • D is the paragraph
  • j is the index of the paragraph (e.g., D 1 )
  • s is a sentence in the paragraph
  • P(y i | s) is the probability that the hypothetical subject y i is the subject of the sentence s
  • s(i, y i ) indicates that the hypothetical subject of the sentence i is y i
  • the log likelihood function is:
  • argmax denotes the hypothetical subject corresponding to the maximum of the conditional probability model over all the hypothetical subjects.
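As a concrete (and deliberately simplified) stand-in for the conditional-probability machinery above, the sketch below takes words occurring more than twice in a paragraph as hypothetical subjects, assigns each sentence the candidate it mentions most often in place of the argmax over the model, and concatenates sentences that share a subject. All names and data are illustrative, not from the application.

```python
from collections import Counter

def merge_paragraph(sentences):
    """Merge sentences that share the same hypothetical subject."""
    words = [w for s in sentences for w in s.split()]
    # Words appearing more than twice in the paragraph become candidates.
    candidates = {w for w, c in Counter(words).items() if c > 2}
    merged = {}
    for s in sentences:
        counts = Counter(w for w in s.split() if w in candidates)
        subject = counts.most_common(1)[0][0] if counts else None
        merged.setdefault(subject, []).append(s)
    return [" ".join(group) for group in merged.values()]

para = ["cats eat fish", "cats chase mice", "cats sleep", "dogs bark"]
print(merge_paragraph(para))
# → ['cats eat fish cats chase mice cats sleep', 'dogs bark']
```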
  • the word text set is converted into a word matrix set after an encoding operation, and the word matrix set is input into a word vector conversion model for training to obtain a word vector set.
  • the encoding adopts a one-hot form: each word in the word text set is first assigned a numeric index, and the largest index is recorded; an encoding matrix whose dimension equals that largest index is then created.
  • each sentence in the word text set is traversed in turn and mapped onto the encoding matrix according to the numeric index of each of its words, which completes the encoding operation and yields the word matrix set.
  • for example, a sentence in the word text set reads: when people know how to exchange with the system, they can tell their true self and the truth. This is reality.
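The indexing-and-mapping procedure above amounts to standard one-hot encoding; a minimal sketch follows, with an invented two-sentence vocabulary.

```python
def one_hot_encode(sentences):
    """Number each distinct word, then map every sentence to a 0/1 vector
    over the vocabulary, as in the encoding operation described above."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))   # assign numeric indices
    matrices = []
    for sentence in sentences:
        row = [0] * len(vocab)
        for word in sentence:
            row[vocab[word]] = 1
        matrices.append(row)
    return vocab, matrices

vocab, mats = one_hot_encode([["i", "like", "cats"], ["i", "like", "dogs"]])
print(mats)  # → [[1, 1, 1, 0], [1, 1, 0, 1]]
```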
  • the word vector conversion model assumes a weight relationship between each word matrix in the word matrix set and the corresponding word vector in the word vector set, and calculates the weights based on this relationship to complete the conversion from the word matrix set to the word vector set.
  • the weight relationship is:
  • d is the word matrix set
  • t 1 , t 2 , ..., t n are word matrices in the word matrix set, such as the vector [0,0,0,0,0,0,0,0,0,0,0,0,1,1] given above
  • w 1 , w 2 , ..., w n are the weights of the corresponding word matrices
  • f i represents the number of occurrences of the word matrix in the word matrix set
  • N is the total number of texts in the text data set
  • N j represents the total number of words in the text data set
  • N i represents the number of occurrences of the word i in the text data set
  • F m is the weighting factor, generally taking a value less than 1.
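The published text defines the quantities (f_i, N, N_i, F_m) but the weight formula itself appears only as an image in the original publication. Those quantities resemble a damped TF-IDF weighting, so the function below is an assumption for illustration, not the formula claimed in the application.

```python
import math

def word_weight(f_i, N, N_i, F_m=0.85):
    """Assumed TF-IDF-style weight: term count f_i damped by F_m and
    scaled by the rarity of word i across the N texts (N_i occurrences)."""
    return F_m * f_i * math.log(N / (1 + N_i))

# A word occurring 3 times, in a corpus of 100 texts where it appears 4 times:
print(round(word_weight(3, 100, 4), 3))
```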
  • the dimensionality reduction operation includes calculating the covariance of each word vector in the word vector set, and removing word vectors whose covariance exceeds a preset covariance threshold in absolute value, to obtain a dimensionality-reduced word vector set.
  • x i , x j represent word vectors in the word vector set
  • n is the number of word vectors in the word vector set
  • cov(x i , x j ) denotes the covariance between x i and x j . When the calculated covariance cov(x i , x j ) is not 0, a value greater than 0 indicates a positive correlation and a value less than 0 indicates a negative correlation.
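A minimal NumPy sketch of the described reduction: pairwise covariances are computed and a word vector is discarded when its covariance with an already-kept vector exceeds the threshold in absolute value, i.e. when it carries largely redundant information. The threshold and data are illustrative.

```python
import numpy as np

def reduce_by_covariance(vectors, threshold=0.9):
    """Keep a vector only if |cov| with every previously kept vector
    stays within the preset covariance threshold."""
    kept = []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        if all(abs(np.cov(v, k)[0, 1]) <= threshold for k in kept):
            kept.append(v)
    return kept

vecs = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 2.0]]
print(len(reduce_by_covariance(vecs)))  # → 2 (the second vector is redundant)
```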
  • the convolutional neural network model includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer.
  • the input layer receives the word vector set, and the convolutional layer, pooling layer, and fully connected layer, combined with an activation function, are trained to obtain training values that are output through the output layer.
  • the activation function in the preferred embodiment of the present application may include a Softmax function, and the loss function is a least square function.
  • the Softmax function is:
  • O j represents the output value of the jth neuron in the fully connected layer
  • I j represents the input value of the jth neuron in the output layer
  • t represents the total number of neurons in the output layer
  • e is the base of the natural logarithm, an infinite non-repeating decimal
  • the least square method L(s) is:
  • s is the training value
  • k is the number of word vectors in the dimensionality-reduced word vector set
  • y i is a word vector in that set
  • y′ i is the predicted value of the convolutional neural network model.
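The Softmax activation and least-squares loss named in the embodiment can be written directly. The exact expression of L(s) appears only as an image in the original publication, so the ½·Σ form below is one common convention, not necessarily the claimed one.

```python
import numpy as np

def softmax(o):
    """Softmax over fully-connected outputs O_j; the max is subtracted
    first for numerical stability (it cancels in the ratio)."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def least_squares(y, y_pred):
    """One common least-squares form: half the sum of squared errors."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    return 0.5 * np.sum((y - y_pred) ** 2)

p = softmax(np.array([1.0, 2.0, 3.0]))
print(round(float(p.sum()), 6))  # → 1.0 (outputs form a probability distribution)
```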
  • for example, after training is completed, the convolutional neural network model outputs the article subject: by describing ancient literary inquisitions, the article exposes the feudal rulers' cruel tyranny against literati, and shows the author's deep sympathy for intellectuals and strong resentment of the brutal rule.
  • the invention also provides an article subject extraction device based on artificial intelligence.
  • Referring to FIG. 2, it is a schematic diagram of the internal structure of an artificial intelligence-based article subject extraction device provided by an embodiment of the present application.
  • the artificial intelligence-based article subject extraction device 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer, or a server.
  • the artificial intelligence-based article subject extraction device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, and the like.
  • the memory 11 may be an internal storage unit of the article subject extraction device 1 based on artificial intelligence, such as the hard disk of the artificial intelligence-based article subject extraction device 1.
  • the memory 11 may also be an external storage device of the artificial intelligence-based article subject extraction device 1, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the device.
  • the memory 11 may also include both an internal storage unit of the article subject extraction device 1 based on artificial intelligence and an external storage device.
  • the memory 11 can be used not only to store application software and various data installed in the artificial intelligence-based article subject extraction device 1, such as the code of the artificial intelligence-based article subject extraction program 01, but also to temporarily store data that has been or will be output.
  • the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run program code stored in the memory 11 or process data, for example to execute the artificial intelligence-based article subject extraction program 01.
  • the communication bus 13 is used to realize the connection and communication between these components.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the apparatus 1 and other electronic devices.
  • the device 1 may also include a user interface.
  • the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • the optional user interface may also include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, etc.
  • the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the artificial intelligence-based article subject extraction device 1 and to display a visualized user interface.
  • Figure 2 only shows an artificial intelligence-based article subject extraction device 1 with components 11-14 and an artificial intelligence-based article subject extraction program 01.
  • the structure shown in FIG. 2 does not constitute a limitation on the artificial intelligence-based article subject extraction device 1, which may include fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
  • the memory 11 stores an artificial intelligence-based article subject extraction program 01; when the processor 12 executes the artificial intelligence-based article subject extraction program 01 stored in the memory 11, the following steps are implemented:
  • Step 1 Receive a text data set, and perform operations including word segmentation and merging on the text data set to obtain a word text set.
  • the text data set includes multiple types of texts, such as news, social, academic, government development planning, and corporate investment.
  • the cleaning removes stop words, Arabic letters, and other heteromorphic words from the text data set, because heteromorphic words with no actual meaning reduce the text classification effect.
  • the stop words have no practical meaning and have no effect on text analysis, but are frequently used words, such as commonly used pronouns and prepositions.
  • the cleaning is to construct a table of heteromorphic words in advance, sequentially traverse the words in the text data set, and if the words are the same as those in the table of heteromorphic words, remove them until the traversal is completed.
  • the word segmentation is to segment each sentence in the text data set to obtain a single word. Because there is no clear separation mark between words in Chinese representation, word segmentation is indispensable.
  • the word segmentation described in this application can be performed with the jieba ("stutter") word segmentation library, which is available for programming languages such as Python and Java.
  • the jieba library is developed around Chinese part-of-speech features: it converts the number of occurrences of each word in the text data set into a frequency, finds the maximum-probability path by dynamic programming, and thereby obtains the maximum segmentation combination based on word frequency.
  • the merging is to merge multiple sentences with the same subject to achieve the purpose of greatly reducing the words in the text data set.
  • the merging includes: traversing each text in the text data set, dividing each text into paragraphs, taking words that appear more than twice in a paragraph as hypothetical subjects, constructing a conditional probability model of each sentence in the paragraph given the hypothetical subjects, constructing a log-likelihood function from that model, optimizing the conditional probability model with the log-likelihood function to obtain the subject of each sentence, and merging sentences with the same subject into one sentence to complete the merging operation.
  • conditional probability model is:
  • y 1 , ..., y N are the hypothetical subjects, with y i denoting the i-th hypothetical subject
  • N is the number of the hypothetical subjects
  • D is the paragraph
  • j is the index of the paragraph (e.g., D 1 )
  • s is a sentence in the paragraph
  • P(y i | s) is the probability that the hypothetical subject y i is the subject of the sentence s
  • s(i, y i ) indicates that the hypothetical subject of the sentence i is y i
  • the log likelihood function is:
  • argmax denotes the hypothetical subject corresponding to the maximum of the conditional probability model over all the hypothetical subjects.
  • Step 2 Perform an encoding operation on the word text set and turn it into a word matrix set, and input the word matrix set into a word vector conversion model for training to obtain a word vector set.
  • the encoding adopts a one-hot form: each word in the word text set is first assigned a numeric index, and the largest index is recorded; an encoding matrix whose dimension equals that largest index is then created.
  • each sentence in the word text set is traversed in turn and mapped onto the encoding matrix according to the numeric index of each of its words, which completes the encoding operation and yields the word matrix set.
  • for example, a sentence in the word text set reads: when people know how to exchange with the system, they can tell their true self and the truth. This is reality.
  • the word vector conversion model assumes a weight relationship between each word matrix in the word matrix set and the corresponding word vector in the word vector set, and calculates the weights based on this relationship to complete the conversion from the word matrix set to the word vector set.
  • the weight relationship is:
  • d is the word matrix set
  • t 1 , t 2 , ..., t n are word matrices in the word matrix set, such as the vector [0,0,0,0,0,0,0,0,0,0,0,0,1,1] given above
  • w 1 , w 2 , ..., w n are the weights of the corresponding word matrices
  • f i represents the number of occurrences of the word matrix in the word matrix set
  • N is the total number of texts in the text data set
  • N j represents the total number of words in the text data set
  • N i represents the number of occurrences of the word i in the text data set
  • F m is the weighting factor, generally taking a value less than 1.
  • Step 3: After performing the dimensionality reduction operation on the word vector set, input it into the convolutional neural network model for training to obtain a training value, and compare the training value with a preset threshold: if the training value is greater than the preset threshold, the convolutional neural network model continues training, and if the training value is less than the preset threshold, the convolutional neural network model completes training.
  • the dimensionality reduction operation includes calculating the covariance of each word vector in the word vector set, and removing word vectors whose covariance exceeds a preset covariance threshold in absolute value, to obtain a dimensionality-reduced word vector set.
  • x i , x j represent word vectors in the word vector set
  • n is the number of word vectors in the word vector set
  • cov(x i , x j ) denotes the covariance between x i and x j . When the calculated covariance cov(x i , x j ) is not 0, a value greater than 0 indicates a positive correlation and a value less than 0 indicates a negative correlation.
  • the convolutional neural network model includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer.
  • the input layer receives the word vector set, and the convolutional layer, pooling layer, and fully connected layer, combined with an activation function, are trained to obtain training values that are output through the output layer.
  • the activation function in the preferred embodiment of the present application may include a Softmax function, and the loss function is a least square function.
  • the Softmax function is:
  • O j represents the output value of the jth neuron in the fully connected layer
  • I j represents the input value of the jth neuron in the output layer
  • t represents the total number of neurons in the output layer
  • e is the base of the natural logarithm, an infinite non-repeating decimal
  • the least square method L(s) is:
  • s is the training value
  • k is the number of word vectors in the dimensionality-reduced word vector set
  • y i is a word vector in that set
  • y′ i is the predicted value of the convolutional neural network model.
  • Step 4 Receive text data input by the user, convert the text data input by the user into a word vector, and input it into the trained convolutional neural network model to obtain and output the subject of the article.
  • for example, after training is completed, the convolutional neural network model outputs the article subject: by describing ancient literary inquisitions, the article exposes the feudal rulers' cruel tyranny against literati, and shows the author's deep sympathy for intellectuals and strong resentment of the brutal rule.
  • the artificial intelligence-based article subject extraction program can also be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to complete this application.
  • the module referred to in this application is a series of computer program instruction segments capable of completing specific functions, used to describe the execution process of the artificial intelligence-based article subject extraction program in the article subject extraction device.
  • Referring to FIG. 3, it is a schematic diagram of program modules of an artificial intelligence-based article subject extraction program in an embodiment of an artificial intelligence-based article subject extraction device of this application.
  • the artificial intelligence-based article subject extraction program can be divided into a data receiving module 10, a word vector solving module 20, a model training module 30, and an article subject output module 40. Illustratively:
  • the data receiving module 10 is used for receiving a text data set, and performing operations including word segmentation and merging on the text data set to obtain a word text set.
  • the word vector solving module 20 is configured to: perform an encoding operation on the word text set and convert it into a word matrix set, and input the word matrix set into a word vector conversion model for training to obtain a word vector set.
  • the model training module 30 is configured to: perform a dimensionality reduction operation on the word vector set and input it into a convolutional neural network model for training to obtain a training value, and compare the training value with a preset threshold. If the training value is greater than the preset threshold, the convolutional neural network model continues training, and if the training value is less than the preset threshold, the convolutional neural network model completes training.
  • the article subject output module 40 is configured to receive text data input by a user, convert the text data input by the user into a word vector and input it into the trained convolutional neural network model to obtain and output the article subject.
  • the embodiment of the present application also proposes a computer-readable storage medium that stores an artificial intelligence-based article subject extraction program, and the program can be executed by one or more processors to implement the following operations:
  • a text data set is received, and operations including word segmentation and merging are performed on the text data set to obtain a word text set.
  • the word text set is converted into a word matrix set after an encoding operation, and the word matrix set is input into a word vector conversion model for training to obtain a word vector set.
  • after the dimensionality reduction operation is performed on the word vector set, it is input into a convolutional neural network model for training to obtain a training value, and the training value is compared with a preset threshold. If the training value is greater than the preset threshold, the convolutional neural network model continues training, and if the training value is less than the preset threshold, the convolutional neural network model completes training.
  • the text data input by the user is received, and the text data input by the user is converted into a word vector and then input into the trained convolutional neural network model to obtain and output the subject matter of the article.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An artificial intelligence-based article gist extraction method, comprising: receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set; performing an encoding operation on the word text set and then converting same into a word matrix set, and inputting the word matrix set into a word vector transformation model for training to obtain a word vector set; performing a dimensionality reduction operation on the word vector set, and then inputting same into a convolutional neural network model for training; and converting text data inputted by the user into word vectors, and then inputting same into the trained convolutional neural network model so as to obtain an article gist and outputting same. Also provided are an artificial intelligence-based article gist extraction device, and a computer-readable storage medium. The method can achieve a precise and efficient article gist extraction function based on artificial intelligence.

Description

Article subject extraction method, device and storage medium based on artificial intelligence
This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on September 2, 2019, with application number 201910826795.4 and the invention title "Artificial intelligence-based article subject extraction method, device and computer-readable storage medium", the entire content of which is incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to an artificial intelligence-based article gist extraction method and device, and a computer-readable storage medium.
Background
At present, the gist of most articles is obtained by professional analysts: enterprise development reports are read and studied manually and their gist summarized for senior management to act on, academic reports are condensed by specialists so that others can study their gist, and so on. This mode is particularly time-consuming and labor-intensive. Article gist extraction based on the traditional Naive Bayes algorithm also exists, but because that algorithm demands substantial computing resources and the extracted gist has a high error rate, it cannot meet practical requirements.
Summary
This application provides an artificial intelligence-based article gist extraction method and device, and a computer-readable storage medium, whose main purpose is to extract the gist of an article input by a user in an intelligent manner.
To achieve the above purpose, an artificial intelligence-based article gist extraction method provided by this application includes: receiving a text data set, and performing operations including word segmentation and merging on the text data set to obtain a word text set; performing an encoding operation on the word text set to convert it into a word matrix set, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set; performing a dimensionality reduction operation on the word vector set and inputting it into a convolutional neural network model for training to obtain a training value, and comparing the training value with a preset threshold, where if the training value is greater than the preset threshold the convolutional neural network model continues training, and if the training value is less than the preset threshold the convolutional neural network model completes training; and receiving text data input by a user, converting the text data into word vectors, and inputting them into the trained convolutional neural network model to obtain and output the gist of the article.
In addition, to achieve the above purpose, this application also provides an artificial intelligence-based article gist extraction device, which includes a memory and a processor. The memory stores an artificial intelligence-based article gist extraction program that can run on the processor, and when the program is executed by the processor, the following steps are implemented: receiving a text data set, and performing operations including word segmentation and merging on the text data set to obtain a word text set; performing an encoding operation on the word text set to convert it into a word matrix set, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set; performing a dimensionality reduction operation on the word vector set and inputting it into a convolutional neural network model for training to obtain a training value, and comparing the training value with a preset threshold, where if the training value is greater than the preset threshold the convolutional neural network model continues training, and if the training value is less than the preset threshold the convolutional neural network model completes training; and receiving text data input by a user, converting the text data into word vectors, and inputting them into the trained convolutional neural network model to obtain and output the gist of the article.
In addition, to achieve the above purpose, this application also provides a computer-readable storage medium on which an artificial intelligence-based article gist extraction program is stored, where the program can be executed by one or more processors to implement the steps of the artificial intelligence-based article gist extraction method described above.
This application first performs word segmentation and merging operations on the text data set to obtain a word text set, which avoids the influence of erroneous words on the gist of the entire article. The word text set is then encoded and transformed into word vectors; the encoding operation and the word vector transformation reduce the word dimensionality while amplifying the feature attributes. Further, the convolutional neural network model has excellent feature extraction capabilities, can efficiently identify word features, and improves the accuracy of the output gist. Therefore, the artificial intelligence-based article gist extraction method and device and the computer-readable storage medium proposed in this application can produce accurate gist extraction results.
Description of the Drawings
FIG. 1 is a schematic flowchart of an artificial intelligence-based article gist extraction method provided by an embodiment of this application;

FIG. 2 is a schematic diagram of the internal structure of an artificial intelligence-based article gist extraction device provided by an embodiment of this application;

FIG. 3 is a schematic diagram of the modules of the artificial intelligence-based article gist extraction program in an artificial intelligence-based article gist extraction device provided by an embodiment of this application.
The realization of the objectives, functional features, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
This application provides an artificial intelligence-based article gist extraction method. FIG. 1 is a schematic flowchart of an artificial intelligence-based article gist extraction method provided by an embodiment of this application. The method may be executed by a device, and the device may be implemented by software and/or hardware.
In this embodiment, the artificial intelligence-based article gist extraction method includes:
S1. Receive a text data set, and perform operations including word segmentation and merging on the text data set to obtain a word text set.
Preferably, the text data set includes multiple types of text, such as news, social media, academic, government development planning, and corporate investment texts.
The cleaning operation removes stop words, Arabic letters, and other irregular tokens from the text data set, because tokens without actual meaning degrade the text classification effect. Stop words are words that carry no practical meaning and have little effect on text analysis yet appear with high frequency, such as common pronouns and prepositions. Specifically, the cleaning pre-builds a table of irregular tokens and traverses the words in the text data set in turn; any word that also appears in the table is removed, until the traversal is complete.
The word segmentation splits each sentence in the text data set into individual words; because written Chinese places no explicit separator between words, segmentation is indispensable. Preferably, the segmentation in this application may be processed with the jieba ("stutter") segmentation library available for programming languages such as Python and Java. The jieba library was developed specifically for Chinese part-of-speech characteristics: it converts the occurrence count of each word in the text data set into a frequency, searches for the maximum-probability path based on dynamic programming, and finds the maximum segmentation combination based on word frequency. For example, the text data set contains the fragment: 当人懂得和体制交换的时候，他们可以将真实的自己和盘托出，因为他们的眼里，在与体制作出等价交换以前，真实对他们什么也不是。 After processing by the jieba library, the same sentence carries a space between each pair of adjacent segmented words, the spaces representing the processing result of the library.
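The frequency-plus-dynamic-programming segmentation described above can be sketched in a few lines of Python. This is a minimal illustration of the maximum-probability-path idea, not the jieba implementation; the toy dictionary and its frequencies are invented for the example.

```python
import math

# Toy word-frequency dictionary, invented for illustration; jieba ships a
# large dictionary learned from real corpora.
FREQ = {"当": 10, "人": 20, "懂得": 5, "和": 30, "体制": 4,
        "交换": 6, "的": 50, "时候": 8}
TOTAL = sum(FREQ.values())

def segment(sentence, max_word_len=4):
    """Find the maximum-probability segmentation path by dynamic programming."""
    n = len(sentence)
    # best[i] = (log-probability of the best segmentation of sentence[:i],
    #            start index of the last word on that path)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = sentence[j:i]
            # Unknown single characters get a smoothing count of 1.
            freq = FREQ.get(word, 1 if len(word) == 1 else 0)
            if freq == 0:
                continue
            score = best[j][0] + math.log(freq / TOTAL)
            if score > best[i][0]:
                best[i] = (score, j)
    # Walk the split points backwards to recover the words.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(sentence[j:i])
        i = j
    return list(reversed(words))

print(segment("当人懂得和体制交换的时候"))
# → ['当', '人', '懂得', '和', '体制', '交换', '的', '时候']
```

Because dictionary words carry far higher frequencies than the smoothed unknown characters, the path through whole dictionary words wins, which is the "maximum segmentation combination based on word frequency" the text describes.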
Further, since multiple sentences may share the same subject, the merging combines multiple sentences that have the same subject, greatly reducing the number of words in the text data set. Preferably, the merging includes: traversing each text in the text data set; dividing the text by paragraph to obtain several paragraphs; presetting the words that appear at least twice in each paragraph as hypothetical subjects; constructing a conditional probability model between each sentence in each paragraph and the hypothetical subjects; constructing a log-likelihood function and optimizing the conditional probability model based on it to obtain the subject of each sentence; and combining the sentences that share a subject into one sentence, completing the merging operation.
Specifically, the conditional probability model is:

[equation rendered as image PCTCN2019116936-appb-000001 in the original publication]

where y_1, …, y_N are the hypothetical subjects and y_i denotes one of them, N is the number of hypothetical subjects, D is a paragraph and j is the paragraph number (for example, D_1 is the first paragraph of the text), s is a sentence within the paragraph, P(y_i|s) is the probability that the hypothetical subject y_i is the subject of sentence s, and s(i, y_i) indicates that the hypothetical subject of sentence i is y_i.
Preferably, the log-likelihood function is:

[equation rendered as image PCTCN2019116936-appb-000002 in the original publication]

where argmax denotes solving for the hypothetical subject at which the partial derivative of the conditional probability model over all the hypothetical subjects is largest.
S2. Perform an encoding operation on the word text set to convert it into a word matrix set, and input the word matrix set into a word vector conversion model for training to obtain a word vector set.
Preferably, the encoding takes the one-hot form. The one-hot encoding first assigns a number to each word in the word text set and records the largest number assigned, then creates an encoding matrix whose dimension equals that largest number, traverses each sentence in the word text set in turn, and maps each sentence onto the encoding matrix according to the number of each of its words, completing the encoding operation and yielding the word matrix set. For example, suppose the word text set is: 当人懂得和体制交换的时候，他们可以将真实的自己和盘托出，这就是现实。 After numbering, the text reads: 当(1) 人(2) 懂得(3) 和(4) 体制(5) 交换的(6) 时候(7) 他们(8) 可以(9) 将(10) 真实的(11) 自己(12) 和盘托出(13) 这就是(14) 现实(15), and the largest number obtained is 15, so a 15-dimensional encoding matrix is created. Further, if the traversed sentence is 这就是现实 ("this is reality"), it encodes to [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1].
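The numbering-and-mapping procedure above can be sketched as follows; the helper name and the English toy corpus are illustrative only.

```python
def one_hot_encode(corpus_sentences, sentence):
    """Number every distinct word in the word text set (1-based, in order of
    first appearance), then map a sentence to a binary vector whose dimension
    is the largest number assigned."""
    index = {}
    for s in corpus_sentences:
        for word in s.split():
            index.setdefault(word, len(index) + 1)
    vector = [0] * len(index)
    for word in sentence.split():
        vector[index[word] - 1] = 1
    return vector

corpus = ["people understand exchange", "this is reality"]
print(one_hot_encode(corpus, "this is reality"))
# → [0, 0, 0, 1, 1, 1]
```

As in the patent's example, the encoded sentence sets exactly the positions of its own words, which here are the last entries of the vector.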
Preferably, the word vector conversion model includes hypothesizing a weight relationship between the word matrices in the word matrix set and the word vectors in the word vector set, and computing the weights based on that relationship to complete the conversion from the word matrix set to the word vector set.
Specifically, the weight relationship is:

d = {(t_1, w_1), (t_2, w_2), …, (t_i, w_i), …, (t_n, w_n)}

where d is the word matrix set, t_1, t_2, …, t_n are the word matrices in the word matrix set, such as [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1] above, and w_1, w_2, …, w_n are the weights of the corresponding word matrices.
Further, the weights are calculated as:

[equation rendered as image PCTCN2019116936-appb-000003 in the original publication]

where f_i denotes the number of times a word matrix appears in the word matrix set, N is the total number of texts in the text data set, N_j denotes the total number of words in the text data set, N_i denotes the number of occurrences of word i in the text data set, and F_m is a weighting factor, generally taking a value less than 1.
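The weight formula itself survives only as an image in this publication. From the quantities it names — word count f_i, text total N, word totals N_j and N_i, and a weighting factor F_m below 1 — it resembles a TF-IDF-style weighting, which the following sketch assumes; the exact functional form is an assumption, not the patent's formula.

```python
import math

def word_weight(f_i, total_texts, total_words, n_i, f_m=0.5):
    """TF-IDF-style weight (assumed shape, not the patent's exact formula):
    term frequency scaled by inverse document frequency and the factor F_m."""
    tf = f_i / total_words                   # how often the word appears
    idf = math.log(total_texts / (1 + n_i))  # rarer words weigh more
    return f_m * tf * idf

# A word occurring often but in few texts outweighs a rarer, widespread one.
print(word_weight(8, 100, 50, 4) > word_weight(2, 100, 50, 40))
# → True
```

Under any weighting of this family, frequent words concentrated in few texts receive the largest weights, which matches the role the weights play in the word vector conversion.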
S3. Perform a dimensionality reduction operation on the word vector set and input it into a convolutional neural network model for training to obtain a training value, and compare the training value with a preset threshold. If the training value is greater than the preset threshold, the convolutional neural network model continues training; if the training value is less than the preset threshold, the convolutional neural network model completes training.
Preferably, the dimensionality reduction operation includes computing the covariance of the word vectors in the word vector set and removing the word vectors whose covariance exceeds a preset covariance threshold in absolute value, obtaining a dimensionality-reduced word vector set.

Further, the covariance is:

cov(x_i, x_j) = E[(x_i − E[x_i]) (x_j − E[x_j])]

where x_i and x_j denote word vectors in the word vector set, n is the number of word vectors in the set, and cov(x_i, x_j) denotes the covariance between x_i and x_j. When the computed covariance cov(x_i, x_j) is not 0, a value greater than 0 indicates a positive correlation and a value less than 0 indicates a negative correlation.
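A minimal sketch of this covariance-based filtering, using the standard sample covariance; the threshold value and the tiny vectors are illustrative.

```python
def covariance(x, y):
    """Sample covariance of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def reduce_word_vectors(vectors, threshold):
    """Drop every word vector whose covariance with an already-kept vector
    exceeds the threshold in absolute value: strongly correlated vectors
    are treated as redundant."""
    kept = []
    for v in vectors:
        if not any(abs(covariance(v, w)) > threshold for w in kept):
            kept.append(v)
    return kept

vecs = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 2.0]]
print(len(reduce_word_vectors(vecs, 1.5)))
# → 2
```

The second vector is an exact multiple of the first (covariance 2.0, above the threshold), so it is removed, while the weakly correlated third vector survives.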
In a preferred embodiment of this application, the convolutional neural network model includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer. The input layer receives the word vector set, and the convolutional layer, pooling layer, and fully connected layer are trained together with an activation function to obtain the training value, which is output through the output layer.
In a preferred embodiment of this application, the activation function may include the Softmax function, and the loss function is a least squares function. The Softmax function is:

O_j = e^(I_j) / Σ_{k=1..t} e^(I_k)

where O_j denotes the output value of the j-th neuron of the fully connected layer, I_j denotes the input value of the j-th neuron of the output layer, t denotes the total number of neurons in the output layer, and e is the base of the natural logarithm, an infinite non-repeating decimal;
The least squares function L(s) is:

L(s) = Σ_{i=1..k} (y_i − y′_i)²

where s is the training value, k is the number of word vectors after dimensionality reduction, y_i is a vector in the word vector set, and y′_i is the predicted value of the convolutional neural network model.
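Both functions above are standard and can be checked with a small sketch; the summation form of L(s) is inferred from the variable descriptions, since the publication renders the formula only as an image.

```python
import math

def softmax(inputs):
    """O_j = e^(I_j) / sum_k e^(I_k); the maximum is subtracted first for
    numerical stability, which leaves the result unchanged."""
    m = max(inputs)
    exps = [math.exp(i - m) for i in inputs]
    total = sum(exps)
    return [e / total for e in exps]

def least_squares(y_true, y_pred):
    """Training value L(s) = sum_i (y_i - y'_i)^2 over the k reduced vectors."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))

probs = softmax([1.0, 2.0, 3.0])
print(round(sum(probs), 6))                   # probabilities sum to 1
print(least_squares([1.0, 0.0], [0.5, 0.5]))  # → 0.5
```

The softmax output is a probability distribution over the output-layer neurons, and the squared-error training value shrinks toward 0 as the predictions approach the targets, which is why training stops once it falls below the preset threshold.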
S4. Receive text data input by the user, convert the text data into word vectors, and input them into the trained convolutional neural network model to obtain and output the gist of the article.
For example, upon receiving a user-input article describing the literary inquisitions of ancient times, the trained convolutional neural network model outputs the gist: the article exposes the harsh tyranny inflicted on men of letters under feudal rule, expressing the author's deep sympathy for intellectuals and strong indignation at the brutal rule.
This application also provides an artificial intelligence-based article gist extraction device. FIG. 2 is a schematic diagram of the internal structure of an artificial intelligence-based article gist extraction device provided by an embodiment of this application.
In this embodiment, the artificial intelligence-based article gist extraction device 1 may be a PC (Personal Computer), a terminal device such as a smartphone, tablet computer, or portable computer, or a server. The artificial intelligence-based article gist extraction device 1 includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disks, optical discs, and the like. In some embodiments, the memory 11 may be an internal storage unit of the artificial intelligence-based article gist extraction device 1, for example its hard disk. In other embodiments, the memory 11 may also be an external storage device of the device 1, for example a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the device 1. The memory 11 can be used not only to store application software installed in the device 1 and various types of data, such as the code of the artificial intelligence-based article gist extraction program 01, but also to temporarily store data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run the program code or process the data stored in the memory 11, for example to execute the artificial intelligence-based article gist extraction program 01.
The communication bus 13 is used to realize the connection and communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
Optionally, the device 1 may also include a user interface. The user interface may include a display and an input unit such as a keyboard, and the optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be appropriately called a display screen or display unit, and is used to display the information processed in the artificial intelligence-based article gist extraction device 1 and to display a visualized user interface.
FIG. 2 shows only the artificial intelligence-based article gist extraction device 1 with the components 11-14 and the artificial intelligence-based article gist extraction program 01. Those skilled in the art will understand that the structure shown in FIG. 2 does not constitute a limitation on the device 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
In the embodiment of the device 1 shown in FIG. 2, the memory 11 stores the artificial intelligence-based article gist extraction program 01, and the processor 12 implements the following steps when executing the program 01 stored in the memory 11:
Step 1: Receive a text data set, and perform operations including word segmentation and merging on the text data set to obtain a word text set.
Preferably, the text data set includes multiple types of text, such as news, social media, academic, government development planning, and corporate investment texts.
The cleaning operation removes stop words, Arabic letters, and other irregular tokens from the text data set, because tokens without actual meaning degrade the text classification effect. Stop words are words that carry no practical meaning and have little effect on text analysis yet appear with high frequency, such as common pronouns and prepositions. Specifically, the cleaning pre-builds a table of irregular tokens and traverses the words in the text data set in turn; any word that also appears in the table is removed, until the traversal is complete.
The word segmentation splits each sentence in the text data set into individual words; because written Chinese places no explicit separator between words, segmentation is indispensable. Preferably, the segmentation in this application may be processed with the jieba ("stutter") segmentation library available for programming languages such as Python and Java. The jieba library was developed specifically for Chinese part-of-speech characteristics: it converts the occurrence count of each word in the text data set into a frequency, searches for the maximum-probability path based on dynamic programming, and finds the maximum segmentation combination based on word frequency. For example, the text data set contains the fragment: 当人懂得和体制交换的时候，他们可以将真实的自己和盘托出，因为他们的眼里，在与体制作出等价交换以前，真实对他们什么也不是。 After processing by the jieba library, the same sentence carries a space between each pair of adjacent segmented words, the spaces representing the processing result of the library.
Further, since multiple sentences may share the same subject, the merging combines multiple sentences that have the same subject, greatly reducing the number of words in the text data set. Preferably, the merging includes: traversing each text in the text data set; dividing the text by paragraph to obtain several paragraphs; presetting the words that appear at least twice in each paragraph as hypothetical subjects; constructing a conditional probability model between each sentence in each paragraph and the hypothetical subjects; constructing a log-likelihood function and optimizing the conditional probability model based on it to obtain the subject of each sentence; and combining the sentences that share a subject into one sentence, completing the merging operation.
Specifically, the conditional probability model is:

[equation rendered as image PCTCN2019116936-appb-000007 in the original publication]

where y_1, …, y_N are the hypothetical subjects and y_i denotes one of them, N is the number of hypothetical subjects, D is a paragraph and j is the paragraph number (for example, D_1 is the first paragraph of the text), s is a sentence within the paragraph, P(y_i|s) is the probability that the hypothetical subject y_i is the subject of sentence s, and s(i, y_i) indicates that the hypothetical subject of sentence i is y_i.
Preferably, the log-likelihood function is:

[equation rendered as image PCTCN2019116936-appb-000008 in the original publication]

where argmax denotes solving for the hypothetical subject at which the partial derivative of the conditional probability model over all the hypothetical subjects is largest.
Step 2: Perform an encoding operation on the word text set to convert it into a word matrix set, and input the word matrix set into a word vector conversion model for training to obtain a word vector set.
Preferably, the encoding takes the one-hot form. The one-hot encoding first assigns a number to each word in the word text set and records the largest number assigned, then creates an encoding matrix whose dimension equals that largest number, traverses each sentence in the word text set in turn, and maps each sentence onto the encoding matrix according to the number of each of its words, completing the encoding operation and yielding the word matrix set. For example, suppose the word text set is: 当人懂得和体制交换的时候，他们可以将真实的自己和盘托出，这就是现实。 After numbering, the text reads: 当(1) 人(2) 懂得(3) 和(4) 体制(5) 交换的(6) 时候(7) 他们(8) 可以(9) 将(10) 真实的(11) 自己(12) 和盘托出(13) 这就是(14) 现实(15), and the largest number obtained is 15, so a 15-dimensional encoding matrix is created. Further, if the traversed sentence is 这就是现实 ("this is reality"), it encodes to [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1].
Preferably, the word vector conversion model assumes a weight relationship between the word matrices in the word matrix set and the word vectors in the word vector set, and computes the weights on the basis of that relationship to complete the conversion from the word matrix set to the word vector set.
Specifically, the weight relationship is:
d = {(t_1, w_1), (t_2, w_2), ..., (t_i, w_i), ..., (t_n, w_n)}
where d is the word matrix set; t_1, t_2, ..., t_n are the word matrices in the set, such as [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1] above; and w_1, w_2, ..., w_n are the weights of the corresponding word matrices.
Further, the weights are calculated as:
Figure PCTCN2019116936-appb-000009
where f_i is the number of occurrences of the word matrix in the word matrix set, N is the total number of texts in the text data set, N_j is the total number of words in the text data set, N_i is the number of occurrences of word i in the text data set, and F_m is a weighting factor whose value is generally less than 1.
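The weight formula itself survives only as an image in the source, but the quantities it names (f_i, N, N_i, and a damping factor F_m < 1) suggest a TF-IDF-style weighting. The sketch below is an assumption built on that reading; `tfidf_weight` and its exact combination of terms are illustrative, not the patent's formula:

```python
import math

def tfidf_weight(f_i, N, N_i, F_m=0.5):
    """TF-IDF-style weight built from the quantities named in the text.

    f_i : occurrences of the word matrix in the word matrix set
    N   : total number of texts in the text data set
    N_i : occurrences of word i in the text data set
    F_m : weighting factor, generally < 1 (assumed here to act as damping)

    The exact combination of terms is an assumption, not the patent formula.
    """
    return F_m * f_i * math.log(N / (1 + N_i))

# Pair each word matrix with its weight, matching d = {(t_i, w_i)}:
matrices = [[0]*13 + [1, 1], [1, 1] + [0]*13]
counts = [(3, 100, 10), (7, 100, 40)]  # (f_i, N, N_i) per matrix, made up
d = [(t, tfidf_weight(f, N, Ni)) for t, (f, N, Ni) in zip(matrices, counts)]
```

As in standard TF-IDF, a word that appears in fewer texts receives a larger weight for the same raw frequency.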
Step 3: Perform a dimensionality reduction operation on the word vector set and input the result into a convolutional neural network model for training to obtain a training value; compare the training value with a preset threshold: if the training value is greater than the preset threshold, the convolutional neural network model continues training; if the training value is less than the preset threshold, the model completes training.
Preferably, the dimensionality reduction operation includes calculating the covariance between the word vectors in the word vector set and removing word vectors whose covariance exceeds a preset covariance threshold in absolute value, yielding the reduced word vector set.
Further, the covariance is:
Figure PCTCN2019116936-appb-000010
where x_i and x_j are word vectors in the word vector set, n is the number of word vectors in the set, and cov(x_i, x_j) denotes the covariance between x_i and x_j. When the calculated covariance cov(x_i, x_j) is nonzero, a value greater than 0 indicates a positive correlation and a value less than 0 indicates a negative correlation.
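A minimal sketch of the covariance screening; since the patent does not say which vector of an offending pair is removed, the rule below (keep the earlier vector, drop the later one) is an assumption:

```python
def covariance(x, y):
    """Sample covariance of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def reduce_by_covariance(vectors, threshold):
    """Drop vectors whose |cov| with an already-kept vector exceeds threshold."""
    kept = []
    for v in vectors:
        if all(abs(covariance(v, k)) <= threshold for k in kept):
            kept.append(v)
    return kept

vecs = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, -1.0, 2.0]]
reduced = reduce_by_covariance(vecs, threshold=1.5)
# The second vector is strongly correlated with the first and is removed.
```

The effect is to thin out highly correlated (redundant) word vectors before the set reaches the convolutional neural network.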
In a preferred embodiment of the present application, the convolutional neural network model includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer. The input layer receives the word vector set; the convolutional, pooling, and fully connected layers, combined with an activation function, are trained to produce the training value, which is output through the output layer.
In a preferred embodiment of the present application, the activation function may include a Softmax function, and the loss function is a least squares function. The Softmax function is:
O_j = e^(I_j) / Σ_{k=1}^{t} e^(I_k)
where O_j is the output value of the j-th neuron of the fully connected layer, I_j is the input value of the j-th neuron of the output layer, t is the total number of neurons in the output layer, and e is the base of the natural logarithm, an infinite non-repeating decimal.
The least squares loss L(s) is:
L(s) = Σ_{i=1}^{k} (y_i - y'_i)^2
where s is the training value, k is the number of word vectors after dimensionality reduction, y_i are the word vectors of the word vector set, and y'_i are the predicted values of the convolutional neural network model.
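Under the definitions above, the Softmax activation and least squares loss can be sketched as follows; the surrounding threshold check mirrors step 3, and the 0.05 threshold is purely illustrative:

```python
import math

def softmax(inputs):
    """Softmax over the output-layer inputs I_1..I_t."""
    exps = [math.exp(i) for i in inputs]
    total = sum(exps)
    return [e / total for e in exps]

def least_squares(y_true, y_pred):
    """L(s): sum of squared residuals over the k reduced word vectors."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))

# Threshold test from step 3: keep training while the value stays above it.
loss = least_squares([1.0, 0.0, 0.0], softmax([2.0, 0.5, 0.1]))
keep_training = loss > 0.05  # 0.05 stands in for the preset threshold
```

Softmax turns the output-layer inputs into a probability distribution (the outputs sum to 1), and the least squares loss then measures how far that distribution is from the target.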
Step 4: Receive text data input by the user, convert the text data into word vectors, and input the word vectors into the trained convolutional neural network model to obtain and output the gist of the article.
For example, upon receiving a user-submitted article describing the literary inquisitions of ancient times, the trained convolutional neural network model outputs the article gist: the article exposes the harsh tyranny imposed on men of letters under feudal rule, expressing the author's deep sympathy for intellectuals and strong resentment of the brutal regime.
Optionally, in other embodiments, the artificial intelligence-based article gist extraction program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application. A module referred to in this application is a series of computer program instruction segments capable of completing a specific function, used to describe the execution process of the artificial intelligence-based article gist extraction program in the artificial intelligence-based article gist extraction device.
For example, referring to FIG. 3, a schematic diagram of the program modules of the artificial intelligence-based article gist extraction program in an embodiment of the artificial intelligence-based article gist extraction device of the present application, the program may be divided into a data receiving module 10, a word vector solving module 20, a model training module 30, and an article gist output module 40. Illustratively:
The data receiving module 10 is configured to: receive a text data set, and perform operations including word segmentation and merging on the text data set to obtain a word text set.
The word vector solving module 20 is configured to: perform an encoding operation on the word text set to convert it into a word matrix set, and input the word matrix set into a word vector conversion model for training to obtain a word vector set.
The model training module 30 is configured to: perform a dimensionality reduction operation on the word vector set and input the result into a convolutional neural network model for training to obtain a training value, and compare the training value with a preset threshold: if the training value is greater than the preset threshold, the convolutional neural network model continues training; if the training value is less than the preset threshold, the model completes training.
The article gist output module 40 is configured to: receive text data input by a user, convert the text data into word vectors, and input the word vectors into the trained convolutional neural network model to obtain and output the article gist.
The functions and operation steps implemented when the program modules such as the data receiving module 10, the word vector solving module 20, the model training module 30, and the article gist output module 40 are executed are substantially the same as those of the foregoing embodiment and are not repeated here.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing an artificial intelligence-based article gist extraction program, which can be executed by one or more processors to implement the following operations:
receiving a text data set, and performing operations including word segmentation and merging on the text data set to obtain a word text set;
performing an encoding operation on the word text set to convert it into a word matrix set, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set;
performing a dimensionality reduction operation on the word vector set and inputting the result into a convolutional neural network model for training to obtain a training value, and comparing the training value with a preset threshold, wherein if the training value is greater than the preset threshold the convolutional neural network model continues training, and if the training value is less than the preset threshold the convolutional neural network model completes training;
receiving text data input by a user, converting the text data into word vectors, and inputting the word vectors into the trained convolutional neural network model to obtain and output the gist of the article.
It should be noted that the serial numbers of the foregoing embodiments of the present application are for description only and do not indicate the relative merits of the embodiments. Moreover, the terms "comprise", "include", and any variants thereof herein are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. In the absence of further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes that element.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored on a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the application. Any equivalent structural or process transformation made using the contents of the specification and drawings of the application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. An artificial intelligence-based article gist extraction method, characterized in that the method comprises:
    receiving a text data set, and performing operations including word segmentation and merging on the text data set to obtain a word text set;
    performing an encoding operation on the word text set to convert it into a word matrix set, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set;
    performing a dimensionality reduction operation on the word vector set and inputting the result into a convolutional neural network model for training to obtain a training value, and comparing the training value with a preset threshold, wherein if the training value is greater than the preset threshold the convolutional neural network model continues training, and if the training value is less than the preset threshold the convolutional neural network model completes training;
    receiving text data input by a user, converting the text data input by the user into word vectors, and inputting the word vectors into the trained convolutional neural network model to obtain and output the gist of the article.
  2. The artificial intelligence-based article gist extraction method according to claim 1, characterized in that the merging operation comprises:
    traversing each piece of text data in the text data set, and dividing the text data by paragraph to obtain a number of paragraphs;
    presetting words that appear two or more times in the paragraphs as hypothetical subjects, and constructing a conditional probability model relating each sentence in the paragraphs to the hypothetical subjects;
    constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, and merging sentences having the same subject into one sentence, thereby completing the merging operation.
  3. The artificial intelligence-based article gist extraction method according to claim 2, characterized in that the conditional probability model is:
    Figure PCTCN2019116936-appb-100001
    where y_1, ..., y_N and y_i are the hypothetical subjects, N is the number of hypothetical subjects, D is the paragraph, j is the number of the paragraph, s is a sentence within the paragraph, P(y_i|s) is the probability that the hypothetical subject y_i is the subject of sentence s, and s(i, y_i) indicates that the hypothetical subject of sentence i is y_i.
  4. The artificial intelligence-based article gist extraction method according to claim 1, characterized in that the encoding operation comprises:
    digitally numbering each word in the word text set and obtaining the largest numeric index;
    creating an encoding matrix with the same dimension as the largest numeric index, traversing the sentences in the word text set in turn, and mapping the sentences onto the encoding matrix;
    processing the encoding matrix according to the numeric index of each word in the word text set to obtain the word matrix set.
  5. The artificial intelligence-based article gist extraction method according to claim 2, characterized in that the encoding operation comprises:
    digitally numbering each word in the word text set and obtaining the largest numeric index;
    creating an encoding matrix with the same dimension as the largest numeric index, traversing the sentences in the word text set in turn, and mapping the sentences onto the encoding matrix;
    processing the encoding matrix according to the numeric index of each word in the word text set to obtain the word matrix set.
  6. The artificial intelligence-based article gist extraction method according to claim 3, characterized in that the encoding operation comprises:
    digitally numbering each word in the word text set and obtaining the largest numeric index;
    creating an encoding matrix with the same dimension as the largest numeric index, traversing the sentences in the word text set in turn, and mapping the sentences onto the encoding matrix;
    processing the encoding matrix according to the numeric index of each word in the word text set to obtain the word matrix set.
  7. The artificial intelligence-based article gist extraction method according to any one of claims 4 to 6, characterized in that the dimensionality reduction operation comprises:
    calculating the covariance between the word vectors in the word vector set;
    removing word vectors whose covariance exceeds a preset covariance threshold in absolute value, to obtain the reduced word vector set.
  8. An artificial intelligence-based article gist extraction device, characterized in that the device comprises a memory and a processor, the memory storing an artificial intelligence-based article gist extraction program executable on the processor, wherein the program, when executed by the processor, implements the following steps:
    receiving a text data set, and performing operations including word segmentation and merging on the text data set to obtain a word text set;
    performing an encoding operation on the word text set to convert it into a word matrix set, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set;
    performing a dimensionality reduction operation on the word vector set and inputting the result into a convolutional neural network model for training to obtain a training value, and comparing the training value with a preset threshold, wherein if the training value is greater than the preset threshold the convolutional neural network model continues training, and if the training value is less than the preset threshold the convolutional neural network model completes training;
    receiving text data input by a user, converting the text data input by the user into word vectors, and inputting the word vectors into the trained convolutional neural network model to obtain and output the gist of the article.
  9. The artificial intelligence-based article gist extraction device according to claim 8, characterized in that the merging operation comprises:
    traversing each piece of text data in the text data set, and dividing the text data by paragraph to obtain a number of paragraphs;
    presetting words that appear two or more times in the paragraphs as hypothetical subjects, and constructing a conditional probability model relating each sentence in the paragraphs to the hypothetical subjects;
    constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, and merging sentences having the same subject into one sentence, thereby completing the merging operation.
  10. The artificial intelligence-based article gist extraction device according to claim 9, characterized in that the conditional probability model is:
    Figure PCTCN2019116936-appb-100002
    where y_1, ..., y_N and y_i are the hypothetical subjects, N is the number of hypothetical subjects, D is the paragraph, j is the number of the paragraph, s is a sentence within the paragraph, P(y_i|s) is the probability that the hypothetical subject y_i is the subject of sentence s, and s(i, y_i) indicates that the hypothetical subject of sentence i is y_i.
  11. The artificial intelligence-based article gist extraction device according to claim 8, characterized in that the encoding operation comprises:
    digitally numbering each word in the word text set and obtaining the largest numeric index;
    creating an encoding matrix with the same dimension as the largest numeric index, traversing the sentences in the word text set in turn, and mapping the sentences onto the encoding matrix;
    processing the encoding matrix according to the numeric index of each word in the word text set to obtain the word matrix set.
  12. The artificial intelligence-based article gist extraction device according to claim 9, characterized in that the encoding operation comprises:
    digitally numbering each word in the word text set and obtaining the largest numeric index;
    creating an encoding matrix with the same dimension as the largest numeric index, traversing the sentences in the word text set in turn, and mapping the sentences onto the encoding matrix;
    processing the encoding matrix according to the numeric index of each word in the word text set to obtain the word matrix set.
  13. The artificial intelligence-based article gist extraction device according to claim 10, characterized in that the encoding operation comprises:
    digitally numbering each word in the word text set and obtaining the largest numeric index;
    creating an encoding matrix with the same dimension as the largest numeric index, traversing the sentences in the word text set in turn, and mapping the sentences onto the encoding matrix;
    processing the encoding matrix according to the numeric index of each word in the word text set to obtain the word matrix set.
  14. The artificial intelligence-based article gist extraction device according to any one of claims 11 to 13, characterized in that the dimensionality reduction operation comprises:
    calculating the covariance between the word vectors in the word vector set;
    removing word vectors whose covariance exceeds a preset covariance threshold in absolute value, to obtain the reduced word vector set.
  15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores an artificial intelligence-based article gist extraction program, the program being executable by one or more processors to implement the following steps:
    receiving a text data set, and performing operations including word segmentation and merging on the text data set to obtain a word text set;
    performing an encoding operation on the word text set to convert it into a word matrix set, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set;
    performing a dimensionality reduction operation on the word vector set and inputting the result into a convolutional neural network model for training to obtain a training value, and comparing the training value with a preset threshold, wherein if the training value is greater than the preset threshold the convolutional neural network model continues training, and if the training value is less than the preset threshold the convolutional neural network model completes training;
    receiving text data input by a user, converting the text data input by the user into word vectors, and inputting the word vectors into the trained convolutional neural network model to obtain and output the gist of the article.
  16. The computer-readable storage medium according to claim 15, characterized in that the merging operation comprises:
    traversing each piece of text data in the text data set, and dividing the text data by paragraph to obtain a number of paragraphs;
    presetting words that appear two or more times in the paragraphs as hypothetical subjects, and constructing a conditional probability model relating each sentence in the paragraphs to the hypothetical subjects;
    constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, and merging sentences having the same subject into one sentence, thereby completing the merging operation.
  17. The computer-readable storage medium according to claim 16, characterized in that the conditional probability model is:
    Figure PCTCN2019116936-appb-100003
    where y_1, ..., y_N and y_i are the hypothetical subjects, N is the number of hypothetical subjects, D is the paragraph, j is the number of the paragraph, s is a sentence within the paragraph, P(y_i|s) is the probability that the hypothetical subject y_i is the subject of sentence s, and s(i, y_i) indicates that the hypothetical subject of sentence i is y_i.
  18. The computer-readable storage medium according to claim 15, characterized in that the encoding operation comprises:
    digitally numbering each word in the word text set and obtaining the largest numeric index;
    creating an encoding matrix with the same dimension as the largest numeric index, traversing the sentences in the word text set in turn, and mapping the sentences onto the encoding matrix;
    processing the encoding matrix according to the numeric index of each word in the word text set to obtain the word matrix set.
  19. The computer-readable storage medium according to claim 16 or 17, characterized in that the encoding operation comprises:
    digitally numbering each word in the word text set and obtaining the largest numeric index;
    creating an encoding matrix with the same dimension as the largest numeric index, traversing the sentences in the word text set in turn, and mapping the sentences onto the encoding matrix;
    processing the encoding matrix according to the numeric index of each word in the word text set to obtain the word matrix set.
  20. The computer-readable storage medium according to claim 19, characterized in that the dimensionality reduction operation comprises:
    calculating the covariance between the word vectors in the word vector set;
    removing word vectors whose covariance exceeds a preset covariance threshold in absolute value, to obtain the reduced word vector set.
PCT/CN2019/116936 2019-09-02 2019-11-10 Artificial intelligence-based article gist extraction method and device, and storage medium WO2021042517A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910826795.4 2019-09-02
CN201910826795.4A CN110705268A (en) 2019-09-02 2019-09-02 Article subject extraction method and device based on artificial intelligence and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021042517A1 true WO2021042517A1 (en) 2021-03-11

Family

ID=69193514

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116936 WO2021042517A1 (en) 2019-09-02 2019-11-10 Artificial intelligence-based article gist extraction method and device, and storage medium

Country Status (2)

Country Link
CN (1) CN110705268A (en)
WO (1) WO2021042517A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651652B (en) * 2020-04-30 2023-11-10 中国平安财产保险股份有限公司 Emotion tendency identification method, device, equipment and medium based on artificial intelligence

Citations (3)

Publication number Priority date Publication date Assignee Title
CN108509413A (en) * 2018-03-08 2018-09-07 平安科技(深圳)有限公司 Digest extraction method, device, computer equipment and storage medium
CN109086340A (en) * 2018-07-10 2018-12-25 太原理工大学 Evaluation object recognition methods based on semantic feature
CN110110330A (en) * 2019-04-30 2019-08-09 腾讯科技(深圳)有限公司 Text based keyword extracting method and computer equipment

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US10216724B2 (en) * 2017-04-07 2019-02-26 Conduent Business Services, Llc Performing semantic analyses of user-generated textual and voice content
CN110019793A (en) * 2017-10-27 2019-07-16 阿里巴巴集团控股有限公司 A kind of text semantic coding method and device
CN109871532B (en) * 2019-01-04 2022-07-08 平安科技(深圳)有限公司 Text theme extraction method and device and storage medium
CN110059191A (en) * 2019-05-07 2019-07-26 山东师范大学 A kind of text sentiment classification method and device

Also Published As

Publication number Publication date
CN110705268A (en) 2020-01-17

Similar Documents

Publication Publication Date Title
WO2021068329A1 (en) Chinese named-entity recognition method, device, and computer-readable storage medium
WO2020224213A1 (en) Sentence intent identification method, device, and computer readable storage medium
CN109190120B (en) Neural network training method and device and named entity identification method and device
US20230195773A1 (en) Text classification method, apparatus and computer-readable storage medium
WO2021169116A1 (en) Intelligent missing data filling method, apparatus and device, and storage medium
WO2020237856A1 (en) Smart question and answer method and apparatus based on knowledge graph, and computer storage medium
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN111222305A (en) Information structuring method and device
WO2020253042A1 (en) Intelligent sentiment judgment method and device, and computer readable storage medium
WO2020147409A1 (en) Text classification method and apparatus, computer device, and storage medium
US11599727B2 (en) Intelligent text cleaning method and apparatus, and computer-readable storage medium
WO2021056710A1 (en) Multi-round question-and-answer identification method, device, computer apparatus, and storage medium
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN113378970B (en) Sentence similarity detection method and device, electronic equipment and storage medium
WO2020248366A1 (en) Text intention intelligent classification method and device, and computer-readable storage medium
CN114612921B (en) Form recognition method and device, electronic equipment and computer readable medium
CN112287069A (en) Information retrieval method and device based on voice semantics and computer equipment
WO2021051934A1 (en) Method and apparatus for extracting key contract term on basis of artificial intelligence, and storage medium
CN114547315A (en) Case classification prediction method and device, computer equipment and storage medium
CN113627797A (en) Image generation method and device for employee enrollment, computer equipment and storage medium
CN114780746A (en) Knowledge graph-based document retrieval method and related equipment thereof
CN108268629B (en) Image description method and device based on keywords, equipment and medium
CN113947095A (en) Multilingual text translation method and device, computer equipment and storage medium
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19944319

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19944319

Country of ref document: EP

Kind code of ref document: A1