CN117540747A - Book publishing intelligent question selecting system based on artificial intelligence - Google Patents

Info

Publication number
CN117540747A
Authority
CN
China
Prior art keywords
topic
candidate
book publishing
words
candidate word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410028055.7A
Other languages
Chinese (zh)
Other versions
CN117540747B (en)
Inventor
马驰
宋宁
赵小萱
谢天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National New Bibliography Magazine Co ltd
Original Assignee
National New Bibliography Magazine Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National New Bibliography Magazine Co ltd
Priority to CN202410028055.7A
Publication of CN117540747A
Application granted
Publication of CN117540747B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of text processing, in particular to a book publishing intelligent question selecting system based on artificial intelligence, which comprises: a data acquisition module, which acquires hot spot comment data of each month as the document set of that month; a data processing module, which selects entity words in each document set, screens the entity words to obtain candidate words, obtains the intrinsic book publishing topic importance of each candidate word according to the characteristics of the candidate word, iterates over the candidate words with a PageRank algorithm to obtain their book publishing topic importance, and obtains a book publishing topic index according to the topic importance change sequence of each candidate word; and a topic selection recommendation module, which obtains a semantic matching topic index sequence of the candidate words according to the semantic relevance between the candidate words and the entity words in the topic selection requirement input by the user, and takes the first r candidate words in the sequence as keywords for topic recommendation to the user. The accuracy of keyword recommendation for the user's topic selection requirement is thereby improved.

Description

Book publishing intelligent question selecting system based on artificial intelligence
Technical Field
The application relates to the technical field of text processing, in particular to a book publishing intelligent question selecting system based on artificial intelligence.
Background
With the development of the internet and digital technology, a large amount of text data is generated and stored, and the conventional approach to book topic selection is challenged by this mass of information. An artificial intelligence-based book publishing intelligent question selecting system can extract valuable information from huge volumes of data and help editors select topics quickly and accurately. Artificial intelligence technology can also analyze readers' behavioral preferences and reading interests, so that topic selection follows reader demand more closely, books meeting market demand are provided, and sales and reader satisfaction are improved.
Since an intelligent topic selection system generally needs to understand and analyze a large amount of text data, natural language processing technology is well suited to the task. Intelligent topic selection often involves techniques such as text classification, entity recognition, and sentiment analysis to help editors better understand and process large amounts of text data. Text data is often highly diverse and complex, comes from different fields and different sources, and may also contain noise and errors. Conventional methods use a subject word extraction algorithm to obtain hot topics from information such as word frequency, and recommend book publishing topics according to these hot spots. The keywords obtained in this way are hot topics under current discussion, but they may carry negative sentiment, lack finer-grained timeliness, or offer insufficient writable content, and so do not match the properties of a book publishing topic.
Disclosure of Invention
In order to solve the above technical problems, the invention aims to provide an artificial intelligence-based book publishing intelligent question selecting system, which adopts the following technical scheme:
The invention provides a book publishing intelligent question selecting system based on artificial intelligence, which comprises:
a data acquisition module: acquiring hot spot comment data of each month as the document set of that month;
a data processing module: selecting entity words in the document set and labeling their parts of speech; screening the entity words in the document set to obtain candidate words and the co-occurrence counts of their co-occurrence relationships; constructing an undirected graph from the candidate words in the document set and the corresponding co-occurrence counts; obtaining the intrinsic book publishing topic importance of each candidate word according to features such as its length and occurrence frequency in the document set;
obtaining, with a PageRank algorithm, the book publishing topic importance of each candidate word according to the intrinsic book publishing topic importance of the candidate word at each node of the undirected graph and the book publishing topic importance of the candidate words with which it co-occurs; constructing a topic importance change sequence for each candidate word from its book publishing topic importance in all the document sets; obtaining the book publishing topic index of each candidate word according to the distribution of the elements in the topic importance change sequence;
a topic selection recommendation module: obtaining the semantic matching topic index of each candidate word according to the semantic relevance between the candidate word and the entity words in the topic selection requirement input by the user and according to the book publishing topic index, wherein the entity words in the user's requirement are obtained with the same entity word selection method used for the document set;
sorting the candidate words by semantic matching topic index from largest to smallest to obtain a topic sequence, and outputting the first r candidate words in the topic sequence as keywords for topic recommendation to the user.
Preferably, the selecting entity words in the document set and labeling corresponding parts of speech includes:
identifying entity words in the document set by adopting a BERT-BiLSTM-CRF model;
tagging each entity word with a hidden Markov model to obtain its part of speech, the parts of speech including but not limited to: nouns, verbs, adjectives.
Preferably, the screening of the entity words in the document set to obtain candidate words and the co-occurrence counts of their co-occurrence relationships includes:
taking the top N words in the document set, ranked by term frequency-inverse document frequency (TF-IDF), as candidate words, wherein N is a preset number of candidate words;
treating two candidate words that appear in the same sentence as having a co-occurrence relationship, and counting the number of times the two candidate words appear together in the document set as their co-occurrence count.
Preferably, the constructing of an undirected graph from the candidate words in the document set and the corresponding co-occurrence counts includes:
taking the candidate words in the document set as nodes of the undirected graph, and taking the co-occurrence count between two nodes as the weight of the edge connecting them.
Preferably, the obtaining of the intrinsic book publishing topic importance of each candidate word according to features such as its length and occurrence frequency in the document set includes:
for each candidate word in the document set, acquiring the length and the occurrence frequency of the candidate word, and obtaining the expression richness of the candidate word;
multiplying the ratio of the occurrence frequency to the length by the expression richness to obtain the intrinsic book publishing topic importance of the candidate word.
Preferably, the obtaining of the expression richness of the candidate word includes:
dividing a neighborhood window for the candidate word, wherein the neighborhood window comprises the current candidate word and the u candidate words before and after it;
counting the number of distinct parts of speech among all the candidate words in the neighborhood window, and taking that number as the expression richness of the candidate word.
Preferably, the obtaining, with the PageRank algorithm, of the book publishing topic importance of each candidate word according to the intrinsic book publishing topic importance of the candidate word at each node of the undirected graph and the book publishing topic importance of the candidate words with which it co-occurs includes:
for each candidate word having a co-occurrence relationship with the current candidate word, obtaining the number of times that candidate word and the current candidate word co-occur in the same sentence;
calculating the product of the co-occurrence count and the book publishing topic importance of that candidate word, and calculating the sum of these products over all the candidate words having a co-occurrence relationship with the current candidate word;
taking the sum of this value and the intrinsic book publishing topic importance of the current candidate word as the book publishing topic importance of the current candidate word;
iteratively calculating the book publishing topic importance of each candidate word with the PageRank algorithm until a stopping condition is met, so as to obtain the book publishing topic importance of each candidate word after iteration.
Preferably, the constructing of a topic importance change sequence for each candidate word from its book publishing topic importance in all the document sets includes:
forming a total candidate word set from the candidate words in all the document sets;
for each candidate word in the total candidate word set, forming the topic importance change sequence of the candidate word from its book publishing topic importance in each month.
Preferably, the obtaining of the book publishing topic index of the candidate word according to the distribution of the elements in the topic importance change sequence includes:
for each element in the topic importance change sequence of the candidate word, calculating the product of the index of the month in which the element is located and the book publishing topic importance for that month;
taking the average of the products of all elements as the book publishing topic index of the candidate word.
Preferably, the obtaining of the semantic matching topic index of the candidate word according to the semantic relevance between the candidate word and each entity word in the topic selection requirement input by the user and according to the book publishing topic index includes:
converting each entity word in the user's topic selection requirement and each candidate word in the total candidate word set into a corresponding semantic vector with a BERT language model;
for each entity word in the user's topic selection requirement, acquiring the cosine of the angle between the semantic vectors of the entity word and the candidate word;
calculating the product of the cosine value and the book publishing topic index of the candidate word, and taking the sum of these products over all entity words in the user's topic selection requirement as the semantic matching topic index of the candidate word.
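The semantic matching step can be illustrated with a minimal Python sketch. It assumes the semantic vectors have already been produced (toy two-dimensional vectors stand in for BERT embeddings here), and the function names, the sample words, and the sample index values are hypothetical, chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def semantic_matching_index(candidate_vec, topic_index, user_entity_vecs):
    """Sum, over the user's entity words, of cosine * book publishing topic index."""
    return sum(cosine(e, candidate_vec) * topic_index for e in user_entity_vecs)

def recommend(candidates, user_entity_vecs, r):
    """candidates maps word -> (semantic vector, topic index); returns the top r words."""
    scores = {
        word: semantic_matching_index(vec, index, user_entity_vecs)
        for word, (vec, index) in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:r]

# Toy vectors and index values (illustrative only, not real embeddings).
candidates = {
    "machine learning": ([1.0, 0.0], 2.0),
    "smartphone": ([0.0, 1.0], 5.0),
    "neural network": ([1.0, 1.0], 1.0),
}
user_entities = [[1.0, 0.0]]
top = recommend(candidates, user_entities, r=2)
```

In this toy case "smartphone" has the largest topic index but is orthogonal to the user's entity word, so it is not recommended; the cosine term acts as a relevance gate on the topic index.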
The invention has at least the following beneficial effects:
By analyzing data from the past 12 months, the invention obtains book publishing topic keywords based on the user's description. The heat of each word is calculated to determine the hot words that serve as candidate words of the book topic selection system; the heat in each month is calculated according to the part-of-speech distribution around the candidate words; the book publishing topic index of each candidate word is calculated by analyzing its overall performance over the past 12 months; and finally the semantic matching topic index is calculated according to the description input by the user, and the 6 words with the highest scores are output as keywords for the book topic. The invention thus considers not only the heat of the keywords when selecting book topics, but also semantics, part-of-speech distribution, and similar conditions, so as to provide users with keywords of stronger writability and meet their requirements.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a book publishing intelligent choice question system based on artificial intelligence according to an embodiment of the invention;
FIG. 2 is a user topic keyword recommendation process.
Detailed Description
In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following is a detailed description of a book publishing intelligent question selecting system based on artificial intelligence, which is provided by the invention, with reference to the attached drawings and the preferred embodiment. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The invention provides a book publishing intelligent choice system based on artificial intelligence, which is concretely described below with reference to the accompanying drawings.
Referring to FIG. 1, which shows a flowchart of a book publishing intelligent question selecting system based on artificial intelligence according to an embodiment of the invention, the system includes: a data acquisition module 101, a data processing module 102, and a topic recommendation module 103.
Data acquisition module 101: when selecting topics, it is very important to know the target readership, and data from different fields must be selected and analyzed for different audience groups so as to obtain different hot topics. For example, if the readers are science and technology enthusiasts, topics related to technological innovation, artificial intelligence, and the like may be more attractive. This embodiment takes an audience in the science and technology field as an example; other fields can be analyzed by the same method.
In order to analyze the preferences of science and technology enthusiasts, this embodiment obtains permission to use the data of the relevant websites and collects news reports, articles, and hot spot comments from the past 12 months in the relevant field from platforms such as technology media, technology communities, and news websites, for example GeekPark, V2EX, learning, Sina Weibo, and IT, and updates the data every month so that it stays current.
The data is then preprocessed as follows:
1) The acquired data usually contains a great deal of noise, so this embodiment uses the Pandas, NumPy, and SciPy libraries in Python to remove HTML tags, emoticons, special symbols, irrelevant characters, and the like from the data.
2) The Chinese text data must then be segmented into words. The jieba word segmentation tool is used to segment all the data; its input is a sentence of text data, and its output is the sequence of words after segmentation, separated by spaces. The jieba word segmentation tool is a known technique and is not described further in this embodiment.
3) To analyze hot spot information in the data, the frequencies of different words are usually used as one of the reference bases. To avoid interference with subsequent analysis, words without practical meaning (stop words) must be removed; in this embodiment, words that occur with high frequency but carry no practical meaning are removed according to the Harbin Institute of Technology stop word list.
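The three preprocessing steps can be sketched in Python as follows. This is a minimal illustration that uses only the standard library rather than the libraries named in the embodiment; the regular expressions, the function names, and the default whitespace tokenizer are assumptions. For Chinese text, a segmenter such as `jieba.lcut` could be passed as the `segment` argument, assuming jieba is installed.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # HTML tags
# Anything that is not a Latin letter, digit, CJK ideograph, or whitespace
# (covers emoticons, punctuation, and other special symbols).
NOISE_RE = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff\s]")

def clean_text(raw):
    """Strip HTML tags, decode HTML entities, and drop special symbols."""
    text = html.unescape(TAG_RE.sub(" ", raw))
    text = NOISE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(raw, segment=str.split, stop_words=frozenset()):
    """Clean the text, segment it into words, and remove stop words."""
    tokens = segment(clean_text(raw))
    return [t for t in tokens if t not in stop_words]

sample = "<p>AI is &amp; the future 😀!</p>"
tokens = preprocess(sample, stop_words={"is", "the"})
```

Because this sketch removes all punctuation, sentence boundaries would have to be recorded before cleaning if they are needed later, for example when counting co-occurrences within a sentence.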
Data processing module 102: artificial intelligence technology is used to analyze the historical data of the past 12 months and find hot spots, providing technical support for intelligent topic selection in book publishing. Because the text data is very large and complex, changes in real time, and contains hot spots of differing duration, more detailed analysis and prediction of the data are needed.
In order to analyze how hot spots rise over time, the preprocessed 12 months of text from the technological innovation field is divided into 12 document sets ordered by distance from the current month, each document set containing all the documents of one month.
Each document set is analyzed separately to obtain the keywords of the document set and their heat, as follows:
In book publishing topic selection, entity words such as "artificial intelligence" and "machine learning" are usually the objects of selection. For each document set, a BERT-BiLSTM-CRF model is used for named entity recognition, and all recognized entities are taken as objects to be selected for book publishing topics. At the same time, a hidden Markov model is used to tag the parts of speech of the words, giving the part of speech of each word, including nouns, verbs, adjectives, adverbs, prepositions, and the like. The BERT-BiLSTM-CRF model and the hidden Markov model are known techniques and are not described further in this embodiment.
The book publishing topic importance of each entity word is then analyzed and calculated by drawing on the idea of the TextRank algorithm.
Hot words usable for book publishing must be among the high-frequency words. To simplify subsequent calculation, the top N entity words with the highest scores under the term frequency-inverse document frequency (TF-IDF) algorithm are first taken as the candidate words of the current month; TF-IDF is a well-known technique and is not described further in this embodiment. In this embodiment N takes the empirical value 200, which the practitioner may set as needed, so the 200 highest-scoring words in each document set are retained.
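A minimal Python sketch of the top-N selection is given below. The embodiment does not state which TF-IDF variant is used, so the choices here are assumptions: term frequency is pooled over all documents of the month, the inverse document frequency is smoothed, and ties are broken alphabetically. The function name and the sample data are hypothetical.

```python
import math
from collections import Counter

def top_n_candidates(documents, n):
    """documents: list of token lists (the entity words of each document).

    Returns the n words with the highest pooled TF-IDF score.
    """
    num_docs = len(documents)
    term_freq = Counter()  # occurrences of each word across the document set
    doc_freq = Counter()   # number of documents containing each word
    for tokens in documents:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    total_terms = sum(term_freq.values())

    scores = {}
    for word, count in term_freq.items():
        tf = count / total_terms
        idf = math.log((1 + num_docs) / (1 + doc_freq[word])) + 1  # smoothed IDF
        scores[word] = tf * idf
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n]

docs = [["ai", "chip", "ai"], ["ai", "robot"], ["chip", "ai"]]
candidates = top_n_candidates(docs, 2)
```

With N set to 200, as in this embodiment, the same call would return the 200 highest-scoring entity words of a monthly document set.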
The co-occurrence between words is then counted statistically: whenever two candidate words appear in the same sentence of the current month's hot spot comment data, one co-occurrence is recorded for that pair. Finally, different weights are assigned to the words, and the book publishing topic importance is analyzed and obtained by drawing on the idea of the TextRank algorithm.
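The co-occurrence statistics can be sketched in Python as follows. The sketch assumes that each sentence contributes at most one co-occurrence per pair of candidate words, and it stores the result as a weighted adjacency map suitable for an undirected graph; the function names and sample sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences, candidates):
    """sentences: list of token lists; candidates: set of candidate words.

    Returns a Counter keyed by alphabetically ordered word pairs.
    """
    counts = Counter()
    for tokens in sentences:
        present = sorted(set(tokens) & candidates)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts

def build_graph(counts):
    """Turn pair counts into a symmetric adjacency map: node -> {neighbor: weight}."""
    graph = {}
    for (a, b), weight in counts.items():
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph

sentences = [["ai", "chip", "news"], ["chip", "ai"], ["robot", "ai"]]
counts = cooccurrence_counts(sentences, {"ai", "chip", "robot"})
graph = build_graph(counts)
```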
The TextRank algorithm mainly comprises the steps of text preprocessing, graph structure construction, node importance calculation, node ordering and the like.
Text preprocessing refers to operations on the original text such as word segmentation, stop word removal, and part-of-speech tagging, all of which have been completed in the steps above. Graph construction refers to building one undirected graph for each preprocessed document set; to limit computational complexity and to select keywords that suit the book publishing scenario, this embodiment takes the top N entity words ranked by TF-IDF as nodes, that is, the candidate words of each document set, and the weight of an edge represents the co-occurrence count between two candidate words. Node importance calculation uses an iterative algorithm to compute an importance score for each candidate word; the score is computed from the book publishing topic importance of the nodes connected to the current node and from the current node's intrinsic book publishing topic importance, and the iteration continues until the book publishing topic importance converges. Node ordering sorts the candidate words by book publishing topic importance from largest to smallest.
Before calculating the book publishing topic importance, the intrinsic book publishing topic importance of each candidate word is determined from the specific distribution of the candidate word. The intrinsic importance represents the likelihood that the candidate word can serve as a keyword for a book publishing topic.
In book publishing topic selection, attention is paid not only to the heat of a keyword but also to whether it has strong writability. Such keywords are generally hot topics, are specialized and specific enough to highlight the value of a book, are attractive and expressive, and are easy to understand and express. The specific calculation is as follows:
$$S_i = \frac{f_i}{l_i} \times D_i$$
wherein $S_i$ represents the intrinsic book publishing topic importance of the $i$-th candidate word; $f_i$ represents the occurrence frequency of the $i$-th candidate word; $l_i$ represents the length of the $i$-th candidate word; $D_i$ represents the expression richness of the $i$-th candidate word, obtained as follows: a neighborhood window is divided for each candidate word by taking the 3 candidate words before and after it, and, according to the part-of-speech tagging result, the number of distinct parts of speech in the neighborhood of the candidate word is recorded as $D_i$.
The higher the occurrence frequency of a candidate word, the more likely it is a hot topic and the higher its intrinsic book publishing topic importance should be; conversely, the lower its heat, the lower its intrinsic importance. The longer the word, the more likely it is cumbersome to describe and hard to understand, and thus unsuitable as a keyword for a book publishing topic, so it should have lower intrinsic importance; conversely, a shorter word should have higher intrinsic importance. Meanwhile, the higher the expression richness of the words around the candidate word, the more complete and even the part-of-speech distribution around it; such a word can better express viewpoints and present topic content, is more extensible, and is convenient for writing a book, so it should have higher intrinsic importance. Otherwise, the word is probably monotonous in expression and lacks writability, and should have lower intrinsic importance.
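The three factors can be combined as in the following Python sketch. It assumes, consistent with the directions described above, that the intrinsic importance is the occurrence frequency divided by the length, multiplied by the expression richness. How richness is aggregated when a word occurs many times is not stated, so the sketch evaluates a single occurrence; the function names and the sample tag sequence are hypothetical.

```python
def expression_richness(pos_tags, index, u=3):
    """Number of distinct parts of speech in the neighborhood window.

    pos_tags: part-of-speech tags of the candidate words in order of appearance.
    index: position of the current candidate word in that sequence.
    u: number of candidate words taken before and after the current one.
    """
    start = max(0, index - u)
    window = pos_tags[start:index + u + 1]  # includes the current word
    return len(set(window))

def intrinsic_importance(frequency, length, richness):
    """Frequency-to-length ratio scaled by expression richness."""
    return frequency / length * richness

tags = ["n", "v", "n", "adj", "n", "adv", "v", "n"]
richness = expression_richness(tags, index=3, u=3)
score = intrinsic_importance(frequency=10, length=4, richness=richness)
```

Here the window around position 3 contains nouns, verbs, an adjective, and an adverb, so the richness is 4, and a word of length 4 that occurs 10 times receives an intrinsic importance of 10.0.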
When calculating the book publishing topic importance, the importance of each candidate word is determined both by the intrinsic importance of the current word and by the importance of the words that co-occur with it. That is, when the current word co-occurs with words of higher book publishing topic importance, the current word should also have higher book publishing topic importance.
This is therefore an iterative process. The PageRank algorithm is used for the iteration: the undirected graph is input into the PageRank algorithm, and the candidate words of all nodes in the undirected graph are iteratively updated until the algorithm converges, so that the output of the algorithm is the book publishing topic importance of all the candidate words. The PageRank algorithm is a known technique and is not described further in this embodiment. The book publishing topic importance of a candidate word is calculated as follows:
$$I_i = S_i + \sum_{j=1}^{n} c_{ij} \, I_j$$
wherein $I_i$ represents the book publishing topic importance of the $i$-th candidate word; $S_i$ represents the intrinsic book publishing topic importance of the $i$-th candidate word; $n$ represents the number of candidate words having a co-occurrence relationship with the $i$-th candidate word; $I_j$ represents the book publishing topic importance of the $j$-th candidate word having a co-occurrence relationship with the $i$-th candidate word; $c_{ij}$ represents the number of times the $i$-th candidate word and the $j$-th candidate word co-occur in the same sentence.
The larger the intrinsic book publishing topic importance of the i-th candidate word, the more likely the candidate word is a keyword for a book publishing topic, and the larger its book publishing topic importance should be; conversely, the less likely the candidate word is such a keyword, the smaller its book publishing topic importance should be. When the candidate words co-occurring with the current candidate word have larger book publishing topic importance and larger co-occurrence counts, the current candidate word is related to more important candidate words and should therefore also have higher book publishing topic importance; otherwise, the current candidate word is related to less important candidate words, and its book publishing topic importance should be lower.
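The iteration can be sketched in Python as follows. Applied literally with raw co-occurrence counts as weights, the update (intrinsic importance plus the count-weighted sum of neighbor importance) can grow without bound, so the sketch swaps in a damped, weight-normalized variant in the style of TextRank. The damping factor 0.85, the normalization, the tolerance, and the function name are all assumptions and are not taken from the embodiment.

```python
def propagate_importance(graph, intrinsic, damping=0.85, tol=1e-8, max_iter=200):
    """Damped, weight-normalized importance propagation (assumed variant).

    graph: symmetric adjacency map, node -> {neighbor: co-occurrence count}.
    intrinsic: node -> intrinsic importance (must cover every node in graph).
    """
    total = sum(intrinsic.values())
    base = {w: s / total for w, s in intrinsic.items()}  # normalized intrinsic scores
    # Total edge weight at each node, used to share a node's score among neighbors.
    out_weight = {w: sum(graph.get(w, {}).values()) for w in intrinsic}

    score = dict(base)
    for _ in range(max_iter):
        new = {}
        for w in intrinsic:
            spread = sum(
                count / out_weight[nb] * score[nb]
                for nb, count in graph.get(w, {}).items()
            )
            new[w] = (1 - damping) * base[w] + damping * spread
        delta = max(abs(new[w] - score[w]) for w in intrinsic)
        score = new
        if delta < tol:  # stopping condition: scores have stabilized
            break
    return score

graph = {"ai": {"chip": 2, "robot": 1}, "chip": {"ai": 2}, "robot": {"ai": 1}}
intrinsic = {"ai": 1.0, "chip": 1.0, "robot": 1.0}
scores = propagate_importance(graph, intrinsic)
```

With equal intrinsic scores, the word with the most and heaviest co-occurrence links ("ai") ends up most important, and "chip" outranks "robot" because it co-occurs with "ai" more often.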
Based on the book publishing topic importance of each candidate word in different months, the change in hot words is analyzed, providing a basis for the final selection of book publishing topic keywords.
Selecting currently hot topics when choosing books for publication does appeal to more readers, but such heat may be short-lived. If the selected topic is only a temporary hot spot, sales of the book may peak only in the short term. Words whose heat lasts longer should therefore be preferred; in general, words that stay at a high level of heat for longer tend to have higher writing value.
Through the above steps, the book publishing topic importance of the top 200 TF-IDF candidate words in the document set of each of the past 12 months can be calculated. All the candidate words of the 12 months form the total candidate word set for book publishing topics, in which each candidate word represents a keyword for book publishing topic selection.
A topic importance change sequence is constructed for each candidate word, the elements of which are the book publishing topic importance values of the candidate word in each of the 12 months; for months in which the word was not taken as a node, its book publishing topic importance is set to 0. On this basis, the book publishing topic index of each candidate word in the total candidate word set is calculated as follows:
wherein,book publishing topic indexes for representing the ith candidate word in the total candidate word set; />Representing the month of the selected data, the value of this example is 12, i.e., the last 12 months are selectedThe data is based, and the implementer can set the data according to the actual situation; />Indicating the j-th month in the past 12 months, the larger the month is, the closer the month is to the current time, namely the book publishing and selecting question importance of the book is higher in weight; />And the book publishing topic importance of the ith candidate word in the jth month in the total candidate word set is represented.
The more recent a month, the larger the weight of its book publishing topic importance in the book publishing topic index. A word whose importance is high in recent months therefore obtains a larger index and is more suitable as a keyword; conversely, a word whose importance is low in recent months obtains a smaller index and is less suitable. The larger the average book publishing topic importance over all months, the longer the heat lasts and the more suitable the word is as a keyword for book publishing topic selection; conversely, the shorter the heat lasts, the less suitable the word is.
Thus far, this embodiment has obtained the book publishing topic index of each word in the total candidate word set.
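The sequence construction and the month-weighted average described above can be sketched as follows. This is a minimal illustration that assumes the monthly importance values are already available as one dictionary per month, oldest month first; the function names and the toy values are hypothetical.

```python
def importance_sequence(monthly_importance, word):
    """Build the topic importance change sequence of one word.

    monthly_importance: list of dicts (oldest month first), each mapping
    candidate word -> book publishing topic importance in that month.
    Months in which the word is not a candidate contribute 0.
    """
    return [month.get(word, 0.0) for month in monthly_importance]


def topic_index(sequence):
    """Average of (month number j) * (importance in month j), j = 1..M."""
    if not sequence:
        raise ValueError("sequence must cover at least one month")
    m = len(sequence)
    return sum(j * value for j, value in enumerate(sequence, start=1)) / m


# Hypothetical three-month example: "ai" stays hot, "fad" was hot only once.
monthly = [{"ai": 0.5, "fad": 0.9}, {"ai": 0.5}, {"ai": 0.5}]
steady = topic_index(importance_sequence(monthly, "ai"))
fading = topic_index(importance_sequence(monthly, "fad"))
```

Although "fad" has the single highest monthly importance, its heat appears only in the oldest month, so its index is lower than that of the steadily hot word "ai".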
The topic recommendation module 103 uses the named entity recognition model BERT-BiLSTM-CRF to find the entity words in the topic requirement input by the user, and converts these entity words and all candidate words in the total candidate word set into corresponding semantic vectors with a BERT language model; the BERT language model is a well-known technique and is not described in detail in this embodiment. The semantic vectors of the entity words input by the user are matched against the semantic vector of each candidate word and, together with the book publishing topic index of the candidate word, the semantic matching topic index $S_i$ of each candidate word relative to the user is determined as follows:

$$S_i = \sum_{j=1}^{n} T_i \cdot \frac{\mathbf{u}_j \cdot \mathbf{c}_i}{\lVert \mathbf{u}_j \rVert \, \lVert \mathbf{c}_i \rVert}$$

where $S_i$ is the semantic matching topic index of the i-th candidate word in the total candidate word set; $n$ is the number of entity words in the topic requirement input by the user; $T_i$ is the book publishing topic index of the i-th candidate word; $\mathbf{u}_j$ is the semantic vector of the j-th entity word in the topic requirement input by the user; and $\mathbf{c}_i$ is the semantic vector of the i-th candidate word in the total candidate word set.
The closer the semantic vectors of a candidate word and the user's entity words are, the better the word meets the user's needs; conversely, the further apart they are, the less the word meets the user's needs.
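The matching step can be sketched as a sum of cosine similarities scaled by the candidate word's book publishing topic index. This is a minimal sketch that assumes the semantic vectors have already been produced elsewhere (in the embodiment, by a BERT language model); the short toy vectors and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_matching_index(candidate_vec, candidate_topic_index, entity_vecs):
    """Sum over the user's entity words of topic index * cosine similarity."""
    return sum(
        candidate_topic_index * cosine(entity_vec, candidate_vec)
        for entity_vec in entity_vecs
    )


# Toy 2-dimensional vectors standing in for real semantic vectors.
user_entities = [[1.0, 0.0], [0.0, 1.0]]
match = semantic_matching_index([1.0, 0.0], 2.0, user_entities)
```

Here the candidate vector coincides with the first entity vector and is orthogonal to the second, so only the first entity word contributes to the index.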
The candidate words in the total candidate word set are sorted by semantic matching topic index in descending order to obtain a topic sequence; the top r words in the topic sequence are selected as the topic keywords and output to the user for review. In this embodiment r is set to 6, and the practitioner can set the value according to the actual situation. The process of recommending topic keywords to the user is shown in fig. 2.
Thus, the intelligent book publishing topic selection is completed.
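The final ranking step amounts to a descending sort followed by a cut at r. A minimal sketch, assuming the semantic matching topic indexes are held in a dictionary; the function name and the sample scores are hypothetical.

```python
def recommend_keywords(matching_index, r=6):
    """Return the r candidate words with the largest semantic matching index.

    matching_index: dict candidate word -> semantic matching topic index.
    r defaults to 6, the value used in the embodiment.
    """
    ranked = sorted(matching_index.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:r]]


# Hypothetical scores for three candidate words.
sample = {"travel": 0.2, "education": 0.9, "reading": 0.5}
top_two = recommend_keywords(sample, r=2)
```

If fewer than r candidate words exist, the slice simply returns all of them in ranked order.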
In summary, by analyzing the data of the past 12 months, the embodiment of the invention obtains book publishing topic keywords based on the user's description. The heat of each word is calculated and the hot words are used as the candidate words of the book topic selection system; the heat in each month is calculated according to the part-of-speech distribution of the candidate words; the book publishing topic index of each candidate word is then calculated by analyzing its overall performance over the past 12 months; finally, the semantic matching topic index is calculated according to the description input by the user, and the 6 words with the highest scores are output as the keywords for book topic selection. When selecting book topics, the embodiment of the invention not only considers the heat of the keywords but also, according to semantics, part-of-speech distribution and other conditions, provides the user with keywords of stronger writeability, better meeting the user's needs.
It should be noted that the order of the embodiments of the present invention is for description only and does not represent the relative merits of the embodiments. The foregoing describes specific embodiments of this specification. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments.
The foregoing description of the preferred embodiments of the present invention is not intended to limit the invention; any modifications, equivalents, improvements, etc. that fall within the principles of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A book publishing intelligent question selecting system based on artificial intelligence, characterized in that the system comprises:
a data acquisition module: acquiring hot spot comment data in each month as the document set of each month;
a data processing module: selecting entity words in a document set and labeling the corresponding parts of speech; screening the entity words in the document set to obtain candidate words and the co-occurrence counts of the corresponding co-occurrence relationships; constructing an undirected graph according to the candidate words in the document set and the corresponding co-occurrence counts; obtaining the own book publishing topic importance of each candidate word according to features such as the length and the occurrence frequency of the candidate word in the document set;
obtaining, by a PageRank algorithm, the book publishing topic importance of each candidate word according to its own book publishing topic importance and the book publishing topic importance of the candidate words having a co-occurrence relationship with it in the undirected graph; constructing a topic importance change sequence of each candidate word according to the book publishing topic importance of the candidate word in all the document sets; obtaining the book publishing topic index of the candidate word according to the distribution of elements in the topic importance change sequence;
a topic recommendation module: obtaining the semantic matching topic index of each candidate word according to the book publishing topic index and the semantic relevance between the candidate word and each entity word in the topic requirement input by the user, wherein the entity words in the topic requirement input by the user are obtained by the same method used to select entity words in the document set;
sorting the candidate words by semantic matching topic index in descending order to obtain a topic sequence, and outputting the first r candidate words in the topic sequence as the keywords recommended to the user for topic selection.
2. The book publishing intelligent question selecting system based on artificial intelligence of claim 1, wherein the selecting entity words in the document set and labeling the corresponding parts of speech comprises:
identifying entity words in the document set by a BERT-BiLSTM-CRF model;
tagging the part of speech of each entity word by a hidden Markov model to obtain the part of speech of each entity word, the parts of speech including but not limited to: nouns, verbs, adjectives.
3. The book publishing intelligent question selecting system based on artificial intelligence of claim 2, wherein the screening the entity words in the document set to obtain candidate words and the co-occurrence counts of the corresponding co-occurrence relationships comprises:
using term frequency-inverse document frequency (TF-IDF) to take the top N words in the document set as candidate words, wherein N is a preset number of candidate words;
treating two candidate words that appear in the same sentence as having a co-occurrence relationship, and counting the number of times the two candidate words appear together in the document set as the co-occurrence count.
4. The book publishing intelligent question selecting system based on artificial intelligence of claim 3, wherein the constructing an undirected graph according to the candidate words in the document set and the corresponding co-occurrence counts comprises:
taking the candidate words in the document set as the nodes of the undirected graph, and taking the co-occurrence counts between nodes as the weights of the edges connecting them.
5. The book publishing intelligent question selecting system based on artificial intelligence of claim 1, wherein the obtaining the own book publishing topic importance of each candidate word according to features such as the length and the occurrence frequency of the candidate word in the document set comprises:
for each candidate word in the document set, acquiring the length and the occurrence frequency of the candidate word; obtaining the expression richness of the candidate word;
multiplying the ratio of the length to the occurrence frequency by the expression richness to obtain the own book publishing topic importance of the candidate word.
6. The book publishing intelligent question selecting system based on artificial intelligence of claim 5, wherein the obtaining the expression richness of the candidate word comprises:
defining a neighborhood window for the candidate word, wherein the neighborhood window comprises the current candidate word and the u candidate words before and after it;
counting the number of distinct parts of speech that occur among all the candidate words in the neighborhood window, and taking that number as the expression richness of the candidate word.
7. The book publishing intelligent question selecting system based on artificial intelligence of claim 5, wherein the obtaining, by a PageRank algorithm, the book publishing topic importance of each candidate word according to its own book publishing topic importance and the book publishing topic importance of the candidate words having a co-occurrence relationship with it in the undirected graph comprises:
for each candidate word having a co-occurrence relationship with the current candidate word, obtaining the number of times the candidate word and the current candidate word co-occur in the same sentence;
calculating the product of the co-occurrence count and the book publishing topic importance of that candidate word, and calculating the sum of the products over all candidate words having a co-occurrence relationship with the current candidate word;
taking the sum of that sum value and the own book publishing topic importance of the current candidate word as the book publishing topic importance of the current candidate word;
iteratively calculating the book publishing topic importance of each candidate word by the PageRank algorithm until a stopping condition is met, so as to obtain the book publishing topic importance of each candidate word after iteration.
8. The book publishing intelligent question selecting system based on artificial intelligence of claim 7, wherein the constructing a topic importance change sequence of each candidate word according to the book publishing topic importance of the candidate word in all the document sets comprises:
forming a total candidate word set from the candidate words in all the document sets;
for each candidate word in the total candidate word set, forming the topic importance change sequence of the candidate word from its book publishing topic importance in each month.
9. The book publishing intelligent question selecting system based on artificial intelligence of claim 8, wherein the obtaining the book publishing topic index of the candidate word according to the distribution of elements in the topic importance change sequence comprises:
for each element in the topic importance change sequence of the candidate word, calculating the product of the month number of the element and the book publishing topic importance in that month;
taking the average of the products over all elements as the book publishing topic index of the candidate word.
10. The book publishing intelligent question selecting system based on artificial intelligence of claim 9, wherein the obtaining the semantic matching topic index of each candidate word according to the book publishing topic index and the semantic relevance between the candidate word and each entity word in the topic requirement input by the user comprises:
converting each entity word in the topic requirement input by the user and each candidate word in the total candidate word set into a corresponding semantic vector by a BERT language model;
for each entity word in the topic requirement input by the user, acquiring the cosine of the angle between the semantic vectors of the entity word and the candidate word;
calculating the product of the cosine value and the book publishing topic index of the candidate word, and taking the sum of the products over all entity words in the topic requirement input by the user as the semantic matching topic index of the candidate word.
CN202410028055.7A 2024-01-09 2024-01-09 Book publishing intelligent question selecting system based on artificial intelligence Active CN117540747B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410028055.7A CN117540747B (en) 2024-01-09 2024-01-09 Book publishing intelligent question selecting system based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN117540747A true CN117540747A (en) 2024-02-09
CN117540747B CN117540747B (en) 2024-04-16

Family

ID=89788429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410028055.7A Active CN117540747B (en) 2024-01-09 2024-01-09 Book publishing intelligent question selecting system based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN117540747B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011150538A (en) * 2010-01-21 2011-08-04 Nippon Telegr & Teleph Corp <Ntt> Important keyword extraction device, method and program
US20150066917A1 (en) * 2013-08-29 2015-03-05 Fujitsu Limited Item selection in curation learning
US20150134652A1 (en) * 2013-11-11 2015-05-14 Lg Cns Co., Ltd. Method of extracting an important keyword and server performing the same
CN105183718A (en) * 2015-09-25 2015-12-23 苏州天梯卓越传媒有限公司 Hotspot topic obtaining method for publishing industry and system thereof
CN107766318A (en) * 2016-08-17 2018-03-06 北京金山安全软件有限公司 Keyword extraction method and device and electronic equipment
CN109902230A (en) * 2019-02-13 2019-06-18 北京航空航天大学 A kind of processing method and processing device of news data
CN115186050A (en) * 2022-09-08 2022-10-14 粤港澳大湾区数字经济研究院(福田) Method, system and related equipment for recommending selected questions based on natural language processing
CN117333037A (en) * 2023-10-16 2024-01-02 山东出版数字融合产业研究院有限公司 Industrial brain construction method and device for publishing big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张曼玲: "误加权评分对比法在优化图书选题中的应用", 情报科学, no. 03, 25 March 2001 (2001-03-25), pages 85 - 87 *
简兴明;游进国;梁月明;贾连印;: "社交网络中基于影响力的紧密子图发现算法", 小型微型计算机系统, no. 06, 15 June 2018 (2018-06-15), pages 224 - 230 *

Also Published As

Publication number Publication date
CN117540747B (en) 2024-04-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant