CN109871433A - Method, apparatus, device and medium for computing the relevance between a document and a topic - Google Patents

Method, apparatus, device and medium for computing the relevance between a document and a topic

Info

Publication number
CN109871433A
CN109871433A CN201910131086.4A
Authority
CN
China
Prior art keywords
document
topic
dictionary
correlation
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910131086.4A
Other languages
Chinese (zh)
Other versions
CN109871433B (en)
Inventor
王文超
乔静静
阳任科
牛文娟
李建国
刘浩洋
关扬
郏昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201910131086.4A priority Critical patent/CN109871433B/en
Publication of CN109871433A publication Critical patent/CN109871433A/en
Application granted granted Critical
Publication of CN109871433B publication Critical patent/CN109871433B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method for computing the relevance between a document and a topic. The method obtains a document collection and a dictionary corresponding to a preset topic, where the dictionary is built by learning from topic data with a semi-supervised learning algorithm and contains multiple words semantically related to the preset topic. For any document in the document collection, the relevance between that document and the preset topic corresponding to the dictionary is computed according to the hits of the dictionary words within the document collection. The relevance between a document and a preset topic reflects how closely the document content relates to that topic, and can serve as a basis for judging whether the document is suitable for adaptation into film and television works related to hot topics. The invention further provides corresponding apparatus, equipment and media to guarantee the application and implementation of the above method in practice.

Description

Method, apparatus, device and medium for computing the relevance between a document and a topic
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and apparatus for computing the relevance between a document and a topic.
Background art
At present, many film and television works are adaptations of literary works; for example, the series "High Aspirations" and "Grave Robbery Notes" were both adapted from novels. Literary works are numerous and varied, so deciding which works to adapt into film or television requires evaluating their adaptation value.
Popular literary works have large readerships, and film and television works adapted from them can attract many viewers. One existing way of evaluating the adaptation value of a literary work is therefore to assess works with a high reading volume by their numbers of comments, likes and paid subscriptions, and to select works with high assessment scores as adaptation material.
However, it has been found that some film and television works adapted from literary works related to social hot topics also have adaptation value; for example, "In the Name of the People", "Humble Abode" and "I Am Not the God of Medicine" achieved very high viewing figures. Yet the reading volume of such literary works is very low, so their adaptation value cannot be identified by the above method.
Summary of the invention
In view of this, the present invention provides a method for computing the relevance between a document and a topic, so as to determine the degree of relevance between them and thereby provide a basis for selecting documents that fit a topic.
To achieve the above object, embodiments of the present invention provide the following technical solutions:
In a first aspect, the present invention provides a method for computing the relevance between a document and a topic, comprising:
obtaining a document collection;
obtaining a dictionary corresponding to a preset topic, wherein the dictionary is built by learning from topic data with a semi-supervised learning algorithm and comprises multiple words semantically related to the preset topic;
for any document in the document collection, computing the relevance between that document and the preset topic corresponding to the dictionary according to the hits of the words in the dictionary within the document collection.
In a second aspect, the present invention provides an apparatus for computing the relevance between a document and a topic, comprising:
a document acquisition module, configured to obtain a document collection;
a dictionary acquisition module, configured to obtain a dictionary corresponding to a preset topic, wherein the dictionary is built by learning from topic data with a semi-supervised learning algorithm and comprises multiple words semantically related to the preset topic; and
a relevance computation module, configured to, for any document in the document collection, compute the relevance between that document and the preset topic corresponding to the dictionary according to the hits of the words in the dictionary within the document collection.
In a third aspect, the present invention provides a computing device for computing the relevance between a document and a topic, comprising a processor and a memory, the processor running a software program stored in the memory and calling data stored in the memory to perform at least the following steps:
obtaining a document collection;
obtaining a dictionary corresponding to a preset topic, wherein the dictionary is built by learning from topic data with a semi-supervised learning algorithm and comprises multiple words semantically related to the preset topic;
for any document in the document collection, computing the relevance between that document and the preset topic corresponding to the dictionary according to the hits of the words in the dictionary within the document collection.
In a fourth aspect, the present invention provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the above method for computing the relevance between a document and a topic.
As can be seen from the above technical solutions, the present invention provides a method for computing the relevance between a document and a topic. The method obtains a document collection and a dictionary corresponding to a preset topic, and computes the relevance between any document in the collection and the preset topic according to the hits of the words in the preset-topic dictionary within the document collection. The relevance between a document and a preset topic reflects how closely the document content relates to that topic, and can serve as a basis for judging whether the document is suitable for adaptation into film and television works related to hot topics.
Detailed description of the invention
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for computing the relevance between a document and a topic provided by the present invention;
Fig. 2 is a flowchart of a process for building a topic dictionary provided by the present invention;
Fig. 3 shows the output of an LDA topic model in the method for computing the relevance between a document and a topic provided by the present invention;
Fig. 4 is a schematic structural diagram of an apparatus for computing the relevance between a document and a topic provided by the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the present invention, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
Referring to Fig. 1, the present invention provides a method for computing the relevance between a document and a topic, including steps S101 to S103.
S101: Obtain a document collection.
Specifically, documents may be acquired in various ways, such as crawling from the web, scanning books, or other means. For documents such as web novels, a crawler tool is used to crawl data from websites such as Qidian and 17k; the crawled data include the document title, the author, the document synopsis, the first five chapters of the document, comments, and so on. It should be noted that a document is an article composed of text and may take any form; the document type may also vary, for example a literary work such as a novel.
The crawled data are cleaned: garbled characters are removed and the data format is unified. The processed document data can then be written into a MySQL database. For example, crawling novel content from Qidian yields the following Table 1:
Novel crawler data table hot_social_novel_crawl:
Field name      Annotation              Field type     Attribute               Remarks
novel_id        Novel number            int(11)        Non-null primary key    Auto-increment
novel_date      Novel publication time  varchar(50)    Non-null
novel_title     Novel title             varchar(100)   Non-null
novel_author    Author                  varchar(255)   Non-null
novel_centent   Novel body text         text           Nullable
novel_summary   Novel synopsis          text           Nullable
Table 1
As shown in Table 1, each acquired novel is numbered and the number is assigned to novel_id, where novel_id is an 11-digit integer field. The other fields are defined in the same way, and the fields in the tables below can be read accordingly.
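As an illustrative sketch of how the processed data could be written into MySQL, the following Python snippet creates the table described in Table 1 using pymysql. The connection parameters and the exact column definitions are assumptions; the patent only specifies field names, types and nullability.

```python
# Illustrative sketch only: creates the novel crawl table described in Table 1.
# Connection parameters are placeholders; the column list mirrors Table 1.
import pymysql

DDL = """
CREATE TABLE IF NOT EXISTS hot_social_novel_crawl (
    novel_id      INT(11)      NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- novel number
    novel_date    VARCHAR(50)  NOT NULL,   -- novel publication time
    novel_title   VARCHAR(100) NOT NULL,   -- novel title
    novel_author  VARCHAR(255) NOT NULL,   -- author
    novel_centent TEXT NULL,               -- novel body text (field name as in Table 1)
    novel_summary TEXT NULL                -- novel synopsis
)
"""

conn = pymysql.connect(host="localhost", user="root", password="***", database="hot_social")
with conn.cursor() as cur:
    cur.execute(DDL)
conn.commit()
conn.close()
```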
In addition, the novel comments crawled by the crawler tool can also be stored in a data table, namely the novel comment data table hot_social_comments_crawl shown in Table 2:
Field name       Annotation     Field type     Attribute               Remarks
novel_id         Novel number   int(11)        Non-null primary key
comment_date     Comment time   varchar(50)    Non-null
comment_content  Comment text   varchar(255)   Non-null
Table 2
In practical applications, a specific implementation of step S101 for obtaining the document collection may include: using a crawler tool to extract, from each document, the content data that can represent the theme of that document, and then combining the content data extracted from all the documents into a document collection.
Specifically, document resources on the Internet are abundant. To mine documents with adaptation value, a large number of documents need to be obtained and evaluated. A crawler tool is used to acquire the documents, which is convenient and fast for large-scale acquisition. The crawler tool extracts part of the content data from each document, namely the part that can represent the document's theme, such as the document title, the author, the synopsis, the first five chapters and the comments. Of course, the crawled data are not limited to these parts and may include others. To obtain the relevance between a large number of documents and a topic, multiple documents must be crawled and combined into a document collection, which is then fed into the algorithm model for screening.
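A minimal Python sketch of this acquisition step is given below. The URL list, the cleaning behaviour and the choice of requests as the crawler library are assumptions for illustration; the patent only states that the title, author, synopsis, first five chapters and comments are crawled and then cleaned.

```python
# Minimal sketch of assembling a document collection from crawled novel pages.
# The URLs and the clean_text heuristic are illustrative assumptions.
import re
import requests

def clean_text(raw_html: str) -> str:
    """Strip tags and collapse whitespace as a crude stand-in for data cleaning."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", text).strip()

def fetch_document(url: str) -> str:
    """Fetch one novel page and return the text used to represent the document."""
    resp = requests.get(url, timeout=10)
    resp.encoding = resp.apparent_encoding  # guard against garbled characters
    return clean_text(resp.text)

urls = ["https://example.com/novel/1", "https://example.com/novel/2"]  # hypothetical
document_collection = [fetch_document(u) for u in urls]
```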
A document collection can be obtained in the above way. Besides the document collection, a dictionary related to the topic also needs to be obtained.
S102: Obtain a dictionary corresponding to a preset topic, where the dictionary is built by learning from topic data with a semi-supervised learning algorithm and contains multiple words semantically related to the preset topic.
The topics are preset and may include, for example, various hot topics: people's livelihood, economy, culture, education, health, sport, popular science, and so on. The preset topics are used during the construction of the dictionary; the detailed process is described below.
The dictionary corresponding to a topic is extracted from topic data. For example, topic data are obtained with a crawler tool and fed into a topic-extraction algorithm model, which segments the topic data into words and divides the words into multiple clusters, each cluster representing a different theme. It should be noted that these themes are latent themes, i.e. groupings produced automatically by the algorithm; the model needs no manual supervision and the clustering is completed automatically by the topic-extraction algorithm model. Topic data here refer to content expressing public opinions on some topic; they may be content data in any medium, such as the Internet or newspapers and periodicals, and may take various forms, such as news data or labeled hot topics.
After the topic-extraction algorithm model produces multiple theme clusters, words semantically related to a preset topic are extracted from the clusters according to that topic, yielding the dictionary corresponding to the topic. If there are multiple topics, a corresponding dictionary can be obtained for each topic in this way.
S103: For any document in the document collection, compute the relevance between that document and the preset topic corresponding to the dictionary according to the hits of the words in the dictionary within the document collection.
Using the document collection and the topic-related dictionary obtained in the above steps, the hits of the words in the topic's dictionary within the document collection are counted. For example, for the topic "people's livelihood", the dictionary corresponding to this topic is obtained through the above steps, and for each document in the acquired document collection, the occurrences of each word of the people's-livelihood dictionary are counted. It should be noted that a "hit" means that a word in the dictionary appears in a document of the document collection.
The occurrences of dictionary words in a document reflect the semantic relevance between the document and the dictionary, so the relevance between the document and the topic corresponding to the dictionary can be computed from these hits.
In practical applications, a specific implementation of step S103 is as follows.
Computing the relevance between a document and a topic requires first computing the frequency with which the dictionary appears in a single document, i.e. the term frequency (TF), then computing the frequency with which the dictionary appears across the document collection, i.e. the inverse document frequency (IDF), and finally computing the relevance between the document and the topic from the term frequency and the inverse document frequency. Therefore:
First, the term frequency of the dictionary for any given document is computed from the number of times the dictionary words occur in that document.
Specifically, the dictionary contains multiple words. The number of occurrences of each dictionary word in the document and the total number of words in the document are counted, and the term frequency of the dictionary in the document is computed from these two counts.
In practical applications, a specific implementation of this step includes the following steps A1 to A2.
A1: Count the total number of times each word in the dictionary occurs in the document, and count the total number of words in the document.
A2: Take the ratio of the total number of occurrences of the dictionary words in the document to the total number of words in the document as the term frequency of the dictionary for the document.
Specifically, the total number of occurrences of all dictionary words in the document is counted. For example, suppose the dictionary corresponding to the people's-livelihood topic contains the two words "poverty" and "employment"; if "poverty" occurs twice and "employment" once in a document, the words of the people's-livelihood dictionary occur three times in that document.
The frequency with which the topic's dictionary appears in the document is computed by the formula TF = (number of occurrences of the dictionary words in the document) / (total number of words in the document), where TF is the term frequency of the dictionary in the document. Continuing the example above, if the document contains 100 words in total, the term frequency of the people's-livelihood dictionary in this document is 0.03.
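A minimal Python sketch of this term-frequency step, assuming documents have already been segmented into word lists (for example with a Chinese tokenizer), could look as follows; the helper name dictionary_tf is illustrative.

```python
# Term frequency of a dictionary in one document: total occurrences of dictionary
# words divided by the total word count of the document.
def dictionary_tf(dictionary_words, document_tokens):
    vocab = set(dictionary_words)
    hits = sum(1 for token in document_tokens if token in vocab)
    return hits / len(document_tokens) if document_tokens else 0.0

# Example from the text: "poverty" twice and "employment" once in a 100-word document.
doc = ["poverty", "poverty", "employment"] + ["other"] * 97
print(dictionary_tf(["poverty", "employment"], doc))  # 0.03
```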
Next, the inverse document frequency of the dictionary for the document collection is computed from the occurrences of the dictionary words across all documents in the collection.
The occurrences of the dictionary words in all documents of the collection and the total number of documents in the collection are counted, and the inverse document frequency of the dictionary for the document collection is computed from these counts.
In practical applications, a specific implementation of this step includes the following steps B1 to B2.
B1: Count the occurrences of the dictionary words in each document of the document collection, and treat the documents whose occurrence count meets a preset threshold as documents containing the dictionary.
For every document in the collection, the occurrences of the dictionary words in that document are counted. If the count is large, the document is considered to contain the dictionary; if the count is small, it is not. The present invention therefore presets a threshold as the criterion: a document containing the dictionary is required to contain at least log2(n) occurrences of the dictionary words, where n is the total number of words in the dictionary and log2(n) is the preset threshold.
B2: Compute the ratio of the total number of documents in the collection to the number of documents containing the dictionary, and take this ratio as the inverse document frequency of the dictionary for the document collection.
Specifically, the total number of documents in the collection is counted, and the inverse document frequency is computed from this total and the number of documents containing the dictionary, for example as follows:
The frequency with which the dictionary occurs across the document collection is computed by the formula IDF = log10(total number of documents in the collection / (number of documents containing the dictionary + 1)), where IDF is the frequency with which the dictionary occurs across all documents, i.e. the inverse document frequency of the dictionary. In this formula, 1 is added to the counted number of documents containing the dictionary to prevent the denominator of the logarithm from being zero.
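A minimal Python sketch of this inverse-document-frequency step, combining the log2(n) containment threshold from B1 with the base-10 logarithm implied by the worked example below, could look as follows; the helper name dictionary_idf is illustrative.

```python
# Inverse document frequency of a dictionary over a tokenized document collection.
# A document "contains" the dictionary only if dictionary words occur at least
# log2(n) times in it, where n is the number of words in the dictionary.
import math

def dictionary_idf(dictionary_words, tokenized_documents):
    vocab = set(dictionary_words)
    threshold = math.log2(len(vocab))
    containing = sum(
        1 for tokens in tokenized_documents
        if sum(1 for t in tokens if t in vocab) >= threshold
    )
    return math.log10(len(tokenized_documents) / (containing + 1))
```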
Finally, the product of the term frequency and the inverse document frequency is computed, and this product is taken as the relevance between the document and the preset topic corresponding to the dictionary.
Specifically, from the term frequency and inverse document frequency obtained in the above steps, the relevance TF_IDF between the document and the topic corresponding to the dictionary is computed by the formula TF_IDF = TF · IDF.
For example, suppose the dictionary D corresponding to the topic topic1 contains 32 words, a document collection of 2000 documents is obtained, and prior statistics show that 19 documents contain dictionary D. One document is chosen from the 2000; it has 10000 words, and the words of dictionary D occur 500 times in it. Based on these statistics, the relevance between this document and topic1 is computed as follows:
TF_IDF_D = TF_D · IDF_D = 0.05 × 2 = 0.1
The computed TF_IDF_D is the relevance between this document and the topic topic1 corresponding to dictionary D.
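The worked example can be reproduced with the two sketches above, or directly as in the following snippet:

```python
# Reproducing the worked example: TF = 500/10000, IDF = log10(2000/(19+1)),
# relevance = TF * IDF. All numbers come from the example above.
import math

tf = 500 / 10000                    # 0.05
idf = math.log10(2000 / (19 + 1))   # 2.0
print(tf * idf)                     # 0.1
```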
The relevance between a document and a topic can be stored. For example, if the documents are novels, the relevance can be written into the database using field types such as those in Table 3 below; the data table stored in the database is the novel-topic relevance table hot_social_novel_topic_correlation_degree, which contains the fields shown in Table 3.
Table 3
As can be seen from the above computation, the relevance between a document and a preset topic involves two factors, namely the term frequency and the inverse document frequency. It should be noted that when computing the inverse document frequency of the dictionary, a situation may arise in which the number of dictionary-word occurrences in a given document does not reach the threshold, i.e. that document does not contain the dictionary while other documents in the collection may contain it. The computation given above would still yield a value for the inverse document frequency, but in this case the dictionary can be regarded as unrelated to that document, so the inverse document frequency of the dictionary may be set directly to 0.
Therefore, in practical applications, when computing the inverse document frequency of the dictionary for the document collection, it can first be judged whether the occurrences of the dictionary words in the document currently being processed meet the preset threshold. If not, the inverse document frequency of the dictionary for the document collection is set directly to 0; if so, it is computed in the manner described above.
Since the inverse document frequency is set to 0, and the relevance is the product of the term frequency and the inverse document frequency, the relevance between that document and the preset topic is also 0.
As can be seen from the above technical solutions, the present invention provides a method for computing the relevance between a document and a topic. The method obtains a document collection and a dictionary corresponding to a topic, and computes the relevance between any document in the collection and the topic according to the hits of the words in the topic dictionary within the document collection. The relevance between a document and a topic reflects how closely the document content relates to the topic, and can serve as a basis for judging whether the document is suitable for adaptation into film and television works related to hot topics.
Referring to Fig. 2, an embodiment of the present invention provides a way of building a topic dictionary, including steps S201 to S203.
S201: Grab topic data with a crawler tool.
Specifically, news content can be collected to obtain the topic data; for example, news content from media such as People's Daily Online, China Daily and China Youth Daily, as well as from self-media platforms, is crawled to obtain the topic data.
The crawled data are cleaned: garbled characters are removed and the data format is unified. The processed topic data can then be written into a MySQL database.
A table is designed for each website, for example the social topic data table hot_social_news_crawl below:
Field name     Annotation          Field type     Attribute              Remarks
news_tag       Tag                 varchar(100)   Non-null primary key   Concatenation of time and title
news_date      News release time   varchar(50)    Non-null
news_topic     News theme          varchar(100)   Nullable               News site section
news_title     News headline       varchar(255)   Non-null               Dictionary relevance
news_content   News content        text           Nullable
news_url       Current URL         varchar(255)   Nullable
from_url       Parent URL          varchar(255)   Nullable               Related news
Table 4
In the table, news_url is the current web address and from_url is the parent web address of the current one, used to determine the connections between related news items.
Besides the social news table, there are also tables such as China Daily's hot_social_news_crawl_Chinese_daily and People's Daily Online's hot_social_news_crawl_renmin_net; the crawled data are written into the database with the field types shown in the table.
The crawled topic data serve as the input of the topic model.
S202: Input the topic data into a topic model tool to extract classes of topic words from the topic data.
Specifically, the topic model tool is an existing tool for extracting themes; it can classify the content data fed into it, with each class representing a theme. In this step the topic data are input into the topic model tool, which classifies them to obtain classes of topic words.
The process by which the topic model tool performs theme classification can be described as follows: for the topic data, a theme is drawn from the theme distribution; a word is randomly drawn from the word distribution corresponding to the drawn theme; and this process is repeated until every word in the entire topic data has been traversed. This treatment is based on an analysis of the document generation process: before a document is generated, the themes it should contain are determined first, and the wording is then chosen around words related to those themes, producing the document. Based on this document-generation principle, the topic model tool treats the given topic data as one document and infers the theme distribution of the topic data according to the above process.
A specific example of a topic model tool is the LDA (Latent Dirichlet Allocation) topic model, also called a three-layer Bayesian probability model comprising the three layers of words, themes and documents. LDA is an unsupervised learning method that uses a bag-of-words representation: each document is treated as a word-frequency vector, and the bag of words does not consider the order between words. The classification process of the topic model tool is briefly described below using this tool as an example.
Specifically, the topic data, a theme count K and a word count N are input into the LDA topic model tool, where K and N are preset parameters of the tool. The theme count K indicates how many theme classes the tool should divide the topic data into, each theme class containing multiple topic words, and the word count N indicates how many topic words should be selected for each theme class. It should be noted that the LDA topic model tool can compute, for each topic word, a probability value relative to its theme class, indicating the probability that the topic word belongs to that class; topic words can therefore be selected in descending order of probability.
With the above preset parameters, after M pieces of topic data are input into the LDA topic model, it automatically divides the topic data into K theme clusters, each representing an independent theme. The concrete content of a theme cannot be obtained from the LDA topic model, so the theme clusters can be called latent themes. Each theme cluster contains N topic words.
Suppose the theme count K in the LDA topic model tool is set to 30 and the word count N is set to 10, and some topic data are input into the tool configured in this way; the output shown in Fig. 3 is obtained. As shown in Fig. 3, the LDA topic model tool outputs 30 clusters of words, each cluster representing one theme; the first column on the left is the theme number, and each theme contains 10 topic words.
It should be noted that the LDA topic model is an algorithm model obeying the Dirichlet distribution: the K themes obey a Dirichlet distribution with parameter α, and the N words within each theme obey a Dirichlet distribution with parameter β. Each word has a probability indicating its degree of relation to the latent theme. When assigning words to a latent theme, the LDA topic model selects the top N words by probability as the class of latent topic words, and arranges them in descending order of probability.
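A sketch of this topic-extraction step using gensim's LDA implementation is shown below, with K = 30 and N = 10 as in the example. The tiny placeholder corpus and the choice of gensim are assumptions; the patent does not name a specific LDA library.

```python
# Sketch of the topic-extraction step with gensim's LDA, using the example
# parameters K = 30 topics and N = 10 words per topic. The corpus here is a
# placeholder for the crawled, tokenized topic data.
from gensim import corpora
from gensim.models import LdaModel

tokenized_news = [["employment", "income", "housing"],
                  ["health", "hospital", "insurance"]]  # placeholder topic data
id2word = corpora.Dictionary(tokenized_news)
corpus = [id2word.doc2bow(tokens) for tokens in tokenized_news]

lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=30, passes=10)
for topic_id, words in lda.show_topics(num_topics=30, num_words=10, formatted=False):
    print(topic_id, [(w, round(p, 4)) for w, p in words])  # words sorted by probability
```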
S203: Input the preset topic and the classes of topic words into a word-vector generation model to obtain multiple words semantically similar to the topic; these words form the dictionary.
Specifically, one or more topics can be preset. If there are multiple topics, each topic is input together with the classes of topic words into the word-vector generation model, so that the dictionary corresponding to each topic is obtained separately.
The word-vector generation model is used to obtain, from the classes of topic words, multiple words semantically similar to the topic; these words form the topic's dictionary. An example of a word-vector generation model is the word2vec word-vector model, and the dictionary generation process is briefly described below using this model as an example.
For example, the "people's livelihood" topic and the classes of topic words are input into the word2vec word-vector model. According to the input people's-livelihood topic, word2vec finds the words semantically related to that topic among the classes of topic words, i.e. it computes, for each topic word, the probability that the word relates to the preset topic, and then selects a certain number of words (the number can be preset) in descending order of probability as the dictionary corresponding to the topic.
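A sketch of this dictionary-generation step with gensim's word2vec implementation is shown below. The training sentences, the dictionary size of 32 and the topic word "民生" (people's livelihood) are illustrative assumptions; in practice the model would be trained on the full topic data or the topic-word classes, and similarity scores take the place of the probabilities described above.

```python
# Sketch of generating a topic dictionary with word2vec: train word vectors and
# take the words most similar to the preset topic word. gensim 4.x is assumed
# (vector_size instead of the older size parameter).
from gensim.models import Word2Vec

sentences = [["民生", "就业", "收入"], ["医疗", "保险", "民生"]]  # placeholder topic-word clusters
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=2)

# Words most similar to the preset topic form the dictionary (size is illustrative).
topic_dictionary = [w for w, score in model.wv.most_similar("民生", topn=32)]
```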
In practical applications, after the relevance between documents and topics has been computed, the results can also be ranked.
In one case there are multiple preset topics: the relevance between a document and each topic can be computed in the manner described above, so that for any document the relevances to multiple topics are obtained, and the relevances between that document and the topics can be ranked. The ranking may be in descending order of relevance; the topic with the highest relevance to the document is taken as the document's topic, which can then serve as the adaptation direction when the document is adapted into a film or television work.
In another case the document collection contains multiple documents, and the relevances between each document in the collection and the same topic can be ranked, also in descending order of relevance; the document with the highest relevance to the topic is the document most related to that topic.
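Both ranking modes reduce to sorting precomputed relevance scores, for example as in the following sketch (the score dictionary and the document and topic names are illustrative):

```python
# Sketch of the two ranking modes described above, assuming relevance scores have
# already been computed into a dict keyed by (document_id, topic).
scores = {("doc1", "people's livelihood"): 0.10, ("doc1", "health"): 0.02,
          ("doc2", "people's livelihood"): 0.07}

# 1) Rank all preset topics for one document; the top topic suggests the adaptation direction.
topics_for_doc1 = sorted(
    ((topic, s) for (doc, topic), s in scores.items() if doc == "doc1"),
    key=lambda x: x[1], reverse=True)

# 2) Rank all documents against one topic; the top document is most relevant to that topic.
docs_for_topic = sorted(
    ((doc, s) for (doc, topic), s in scores.items() if topic == "people's livelihood"),
    key=lambda x: x[1], reverse=True)
```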
Referring to Fig. 4, an embodiment of the present invention provides the structure of an apparatus for computing the relevance between a document and a topic. As shown in Fig. 4, the apparatus may specifically include: a document acquisition module 401, a dictionary acquisition module 402 and a relevance computation module 403.
The document acquisition module 401 is configured to obtain a document collection.
The dictionary acquisition module 402 is configured to obtain a dictionary corresponding to a preset topic, where the dictionary is built by learning from topic data with a semi-supervised learning algorithm and contains multiple words semantically related to the preset topic.
The relevance computation module 403 is configured to, for any document in the document collection, compute the relevance between that document and the preset topic corresponding to the dictionary according to the hits of the words in the dictionary within the document collection.
In one embodiment, the relevance computation module 403 may specifically include: a term frequency computation submodule, an inverse document frequency computation submodule and a relevance computation submodule.
The term frequency computation submodule is configured to compute the term frequency of the dictionary according to the occurrences of all the words in the dictionary within the document.
The inverse document frequency computation submodule is configured to compute the inverse document frequency of the dictionary according to the occurrences of all the words in the dictionary within all documents of the document collection.
The relevance computation submodule is configured to compute the product of the term frequency and the inverse document frequency, and take the product as the relevance between the document and the preset topic corresponding to the dictionary.
In one embodiment, the term frequency computation submodule may specifically include: a single-document dictionary counting unit, a document word counting unit and a term frequency computation unit.
The single-document dictionary counting unit is configured to count the total number of times each word in the dictionary occurs in the document.
The document word counting unit is configured to count the total number of words in the document.
The term frequency computation unit is configured to take the ratio of the total number of occurrences of the dictionary words in the document to the total number of words in the document as the term frequency of the dictionary.
In one embodiment, the inverse document frequency computation submodule may specifically include: a multi-document dictionary counting unit and an inverse document frequency computation unit.
The multi-document dictionary counting unit is configured to count the occurrences of all the words in the dictionary within each document of the document collection, and to treat the documents whose occurrence count meets a preset threshold as target documents.
The inverse document frequency computation unit is configured to compute the ratio of the total number of documents in the document collection to the number of target documents, and to take this ratio as the inverse document frequency of the dictionary.
In one embodiment, the inverse document frequency computation submodule may specifically include: a preset threshold unit.
The preset threshold unit is configured to set the inverse document frequency of the dictionary to 0 if the occurrences of all the words in the dictionary within the document do not meet the preset threshold.
In one embodiment, the apparatus for computing the relevance between a document and a topic may further include a dictionary construction module, configured to construct the dictionary corresponding to the preset topic.
The dictionary construction module may specifically include: a news grabbing submodule, a word classification submodule, a word generation submodule and a dictionary generation submodule.
The news grabbing submodule is configured to grab topic data with a crawler tool.
The word classification submodule is configured to input the topic data into a topic model tool so as to extract classes of topic words from the topic data.
The word generation submodule is configured to input the preset topic and the classes of topic words into a word-vector generation model so as to obtain multiple words semantically similar to the preset topic.
The dictionary generation submodule is configured to form the dictionary of the preset topic from the multiple words semantically similar to the preset topic.
In one embodiment, the document acquisition module may specifically include: a document grabbing submodule and a document combining submodule.
The document grabbing submodule is configured to crawl multiple documents from the network with a crawler tool, where what is crawled from each document is the content data that can represent the theme of that document.
The document combining submodule is configured to combine the multiple documents into a document collection.
In one embodiment, the apparatus for computing the relevance between a document and a topic may further include a ranking module.
The ranking module is configured to rank the relevances between the document and each preset topic if there are multiple preset topics, or to rank the relevances between each document in the document collection and the same preset topic.
As can be seen from the above technical solutions, the present invention provides an apparatus for computing the relevance between a document and a topic. The apparatus obtains a document collection and a dictionary corresponding to a preset topic, and computes the relevance between any document in the collection and the preset topic according to the hits of the words in the topic dictionary within the document collection. The relevance between a document and a preset topic reflects how closely the document content relates to that topic, and can serve as a basis for judging whether the document is suitable for adaptation into film and television works related to hot topics.
In addition, the present invention further provides a computing device for computing the relevance between a document and a topic, specifically including a processor and a memory, the processor running a software program stored in the memory and calling data stored in the memory to perform at least the following steps:
obtaining a document collection;
obtaining a dictionary corresponding to a preset topic, wherein the dictionary is built by learning from topic data with a semi-supervised learning algorithm and comprises multiple words semantically related to the preset topic;
for any document in the document collection, computing the relevance between that document and the preset topic corresponding to the dictionary according to the hits of the words in the dictionary within the document collection.
In addition, the present invention further provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for computing the relevance between a document and a topic provided by any of the above embodiments.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the system or apparatus embodiments are basically similar to the method embodiments and are therefore described relatively simply; for the relevant points, refer to the description of the method embodiments. The systems and system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
Those skilled in the art will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for computing the relevance between a document and a topic, characterized by comprising:
obtaining a document collection;
obtaining a dictionary corresponding to a preset topic, wherein the dictionary is built by learning from topic data with a semi-supervised learning algorithm, and the dictionary comprises multiple words semantically related to the preset topic;
for any document in the document collection, computing the relevance between the document and the preset topic corresponding to the dictionary according to the hits of the words in the dictionary within the document collection.
2. The method for computing the relevance between a document and a topic according to claim 1, characterized in that computing the relevance between the document and the preset topic corresponding to the dictionary according to the hits of the words in the dictionary within the document collection comprises:
computing the term frequency of the dictionary according to the occurrences of all the words in the dictionary within the document;
computing the inverse document frequency of the dictionary according to the occurrences of all the words in the dictionary within all documents of the document collection;
computing the product of the term frequency and the inverse document frequency, and taking the product as the relevance between the document and the preset topic corresponding to the dictionary.
3. The method for computing the relevance between a document and a topic according to claim 2, characterized in that computing the term frequency of the dictionary according to the occurrences of all the words in the dictionary within the document comprises:
counting the total number of times each word in the dictionary occurs in the document;
counting the total number of words in the document;
taking the ratio of the total number of occurrences of the dictionary words in the document to the total number of words in the document as the term frequency of the dictionary.
4. The method for computing the relevance between a document and a topic according to claim 2, characterized in that computing the inverse document frequency of the dictionary according to the occurrences of all the words in the dictionary within all documents of the document collection comprises:
counting the occurrences of all the words in the dictionary within each document of the document collection, and treating the documents whose occurrence count meets a preset threshold as target documents;
computing the ratio of the total number of documents in the document collection to the number of target documents, and taking the ratio as the inverse document frequency of the dictionary.
5. The method for computing the relevance between a document and a topic according to claim 2, characterized in that computing the inverse document frequency of the dictionary according to the occurrences of all the words in the dictionary within all documents of the document collection comprises:
setting the inverse document frequency of the dictionary to 0 if the occurrences of all the words in the dictionary within the document do not meet a preset threshold.
6. The method for computing the relevance between a document and a topic according to claim 1, characterized in that the dictionary corresponding to the preset topic is built by:
grabbing topic data with a crawler tool;
inputting the topic data into a topic model tool to extract classes of topic words from the topic data;
inputting the preset topic and the classes of topic words into a word-vector generation model to obtain multiple words semantically similar to the preset topic;
forming the dictionary of the preset topic from the multiple words semantically similar to the preset topic.
7. The method for computing the relevance between a document and a topic according to claim 1, characterized in that obtaining the document collection comprises:
crawling multiple documents from the network with a crawler tool, wherein what is crawled from each document is the content data that can represent the theme of the document;
combining the multiple documents into a document collection.
8. The method for computing the relevance between a document and a topic according to claim 1, characterized by further comprising:
ranking the relevances between the document and each preset topic if there are multiple preset topics;
or,
ranking the relevances between each document in the document collection and the same preset topic.
9. An apparatus for computing the relevance between a document and a topic, characterized by comprising:
a document acquisition module, configured to obtain a document collection;
a dictionary acquisition module, configured to obtain a dictionary corresponding to a preset topic, wherein the dictionary is built by learning from topic data with a semi-supervised learning algorithm, and the dictionary comprises multiple words semantically related to the preset topic;
a relevance computation module, configured to, for any document in the document collection, compute the relevance between the document and the preset topic corresponding to the dictionary according to the hits of the words in the dictionary within the document collection.
10. The apparatus for computing the relevance between a document and a topic according to claim 9, characterized by further comprising a dictionary construction module, configured to construct the dictionary corresponding to the preset topic;
the dictionary construction module comprising:
a news grabbing submodule, configured to grab topic data with a crawler tool;
a word classification submodule, configured to input the topic data into a topic model tool so as to extract classes of topic words from the topic data;
a dictionary generation submodule, configured to input the preset topic and the classes of topic words into a word-vector generation model so as to obtain multiple words semantically similar to the preset topic, wherein the multiple words form the dictionary of the preset topic.
11. A computing device for computing the relevance between a document and a topic, characterized by comprising a processor and a memory, the processor running a software program stored in the memory and calling data stored in the memory to perform at least the following steps:
obtaining a document collection;
obtaining a dictionary corresponding to a preset topic, wherein the dictionary is built by learning from topic data with a semi-supervised learning algorithm, and the dictionary comprises multiple words semantically related to the preset topic;
for any document in the document collection, computing the relevance between the document and the preset topic corresponding to the dictionary according to the hits of the words in the dictionary within the document collection.
12. A storage medium on which a computer program is stored, characterized in that, when executed by a processor, the computer program implements the method for computing the relevance between a document and a topic according to any one of claims 1 to 8.
CN201910131086.4A 2019-02-21 2019-02-21 Method, device, equipment and medium for calculating relevance between document and topic Active CN109871433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910131086.4A CN109871433B (en) 2019-02-21 2019-02-21 Method, device, equipment and medium for calculating relevance between document and topic

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910131086.4A CN109871433B (en) 2019-02-21 2019-02-21 Method, device, equipment and medium for calculating relevance between document and topic

Publications (2)

Publication Number Publication Date
CN109871433A true CN109871433A (en) 2019-06-11
CN109871433B CN109871433B (en) 2021-07-23

Family

ID=66919047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910131086.4A Active CN109871433B (en) 2019-02-21 2019-02-21 Method, device, equipment and medium for calculating relevance between document and topic

Country Status (1)

Country Link
CN (1) CN109871433B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143506A (en) * 2019-12-27 2020-05-12 汉海信息技术(上海)有限公司 Topic content ordering method, device, server and storage medium
CN111553144A (en) * 2020-04-28 2020-08-18 深圳壹账通智能科技有限公司 Topic mining method and device based on artificial intelligence and electronic equipment
CN112597283A (en) * 2021-03-04 2021-04-02 北京数业专攻科技有限公司 Notification text information entity attribute extraction method, computer equipment and storage medium
CN112926297A (en) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN113034057A (en) * 2021-05-10 2021-06-25 中国工商银行股份有限公司 Risk identification method, device and equipment
CN113656695A (en) * 2021-08-18 2021-11-16 北京奇艺世纪科技有限公司 Hot data generation method and device, data processing method and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298622A (en) * 2011-08-11 2011-12-28 中国科学院自动化研究所 Search method for focused web crawler based on anchor text and system thereof
US20120166414A1 (en) * 2008-08-11 2012-06-28 Ultra Unilimited Corporation (dba Publish) Systems and methods for relevance scoring
CN103049568A (en) * 2012-12-31 2013-04-17 武汉传神信息技术有限公司 Method for classifying documents in mass document library
CN105912528A (en) * 2016-04-18 2016-08-31 深圳大学 Question classification method and system
CN108228555A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Article treating method and apparatus based on column theme
CN108829889A (en) * 2018-06-29 2018-11-16 国信优易数据有限公司 A kind of newsletter archive classification method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120166414A1 (en) * 2008-08-11 2012-06-28 Ultra Unilimited Corporation (dba Publish) Systems and methods for relevance scoring
CN102298622A (en) * 2011-08-11 2011-12-28 中国科学院自动化研究所 Search method for focused web crawler based on anchor text and system thereof
CN103049568A (en) * 2012-12-31 2013-04-17 武汉传神信息技术有限公司 Method for classifying documents in mass document library
CN105912528A (en) * 2016-04-18 2016-08-31 深圳大学 Question classification method and system
CN108228555A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Article treating method and apparatus based on column theme
CN108829889A (en) * 2018-06-29 2018-11-16 国信优易数据有限公司 A kind of newsletter archive classification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨春明 (YANG Chunming): "Construction of a Chinese sentiment dictionary with co-occurrence relations", Computer Engineering and Applications *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143506A (en) * 2019-12-27 2020-05-12 汉海信息技术(上海)有限公司 Topic content ordering method, device, server and storage medium
CN111143506B (en) * 2019-12-27 2023-11-03 汉海信息技术(上海)有限公司 Topic content ordering method, topic content ordering device, server and storage medium
CN111553144A (en) * 2020-04-28 2020-08-18 深圳壹账通智能科技有限公司 Topic mining method and device based on artificial intelligence and electronic equipment
CN112926297A (en) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN112926297B (en) * 2021-02-26 2023-06-30 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN112597283A (en) * 2021-03-04 2021-04-02 北京数业专攻科技有限公司 Notification text information entity attribute extraction method, computer equipment and storage medium
CN113034057A (en) * 2021-05-10 2021-06-25 中国工商银行股份有限公司 Risk identification method, device and equipment
CN113656695A (en) * 2021-08-18 2021-11-16 北京奇艺世纪科技有限公司 Hot data generation method and device, data processing method and electronic equipment

Also Published As

Publication number Publication date
CN109871433B (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN109871433A (en) Calculation method, device, equipment and the medium of document and the topic degree of correlation
CN103258000B (en) Method and device for clustering high-frequency keywords in webpages
US11023471B1 (en) Scalable natural language processing for large and dynamic text environments
US7716207B2 (en) Search engine methods and systems for displaying relevant topics
Pattaniyil et al. Combining TF-IDF Text Retrieval with an Inverted Index over Symbol Pairs in Math Expressions: The Tangent Math Search Engine at NTCIR 2014.
US20060004732A1 (en) Search engine methods and systems for generating relevant search results and advertisements
CN103309960B (en) The method and device that a kind of multidimensional information of network public sentiment event is extracted
CN108170692A (en) A kind of focus incident information processing method and device
CN106383887A (en) Environment-friendly news data acquisition and recommendation display method and system
Leong et al. Going beyond text: A hybrid image-text approach for measuring word relatedness
US20180349360A1 (en) Systems and methods for automatically generating news article
Tran et al. Joint graphical models for date selection in timeline summarization
CN111177559A (en) Text travel service recommendation method and device, electronic equipment and storage medium
Logeswari et al. Biomedical document clustering using ontology based concept weight
Knittel et al. PyramidTags: context-, time-and word order-aware tag maps to explore large document collections
US20180349352A1 (en) Systems and methods for identifying news trends
Shatnawi et al. Text stream mining for Massive Open Online Courses: review and perspectives
El-Roby et al. Sapphire: Querying RDF data made simple
CN104615685B (en) A kind of temperature evaluation method of network-oriented topic
Li et al. Towards retrieving relevant information graphics
CN107807964B (en) Digital content ordering method, apparatus and computer readable storage medium
Faisal et al. A novel framework for social web forums’ thread ranking based on semantics and post quality features
JP4359075B2 (en) Concept extraction system, concept extraction method, concept extraction program, and storage medium
Ramanathan et al. Creating user profiles using wikipedia
Bolognesi Flickr® distributional tagspace: Evaluating the semantic spaces emerging from Flickr® tag distributions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant