CN106598999A - Method and device for calculating text theme membership degree - Google Patents

Info

Publication number
CN106598999A
Authority
CN
China
Prior art keywords
node
topic model
sentence
text
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510680602.0A
Other languages
Chinese (zh)
Other versions
CN106598999B (en)
Inventor
侯明午
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201510680602.0A priority Critical patent/CN106598999B/en
Publication of CN106598999A publication Critical patent/CN106598999A/en
Application granted granted Critical
Publication of CN106598999B publication Critical patent/CN106598999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification

Abstract

The invention discloses a method and device for calculating the theme membership degree of a text, and relates to the field of computer technology. It addresses the relatively large error in membership degree calculation that arises when a theme keyword appearing in a text is unrelated to the theme of that text. The main technical scheme is as follows: selecting, according to a business type, a corresponding topic model having a tree structure, wherein the nodes of the topic model divide the theme keywords into categories, each node contains at least one theme keyword, and each node is assigned a node weight value; splitting a text under test into sentences to obtain a sentence list; counting, according to the theme keywords of the nodes and the sentence list, the number of sentences of the text under test contained in each node of the topic model; and calculating the theme membership degree of the text under test according to the node weight values of the nodes and the sentence counts. The method and device are mainly used for calculating the theme membership degree of a text.

Description

Method and device for calculating the theme membership degree of a text
Technical Field
The present invention relates to the field of computer technology, and in particular to a method and device for calculating the theme membership degree of a text.
Background
Against the background of big data, extracting relevant information is an important subject. Information extraction technology does not attempt to understand an entire document comprehensively; it only analyzes the parts of the document that contain relevant information. The theme expressed by an article is determined from characteristic keywords extracted from it.
Most existing algorithms for extracting relevant information check whether an article contains characteristic keywords related to a given theme, and on that basis judge whether the content of the article belongs to the theme. Using the mere presence of keywords as the feature captures the relevant information in an article fairly comprehensively, but the extracted information may contain a large amount of noise, because not every word in an article is closely associated with its theme. As a result, the final judgment about the theme of the article may be the opposite of the correct one, which increases the error of subsequent analysis.
Summary of the Invention
In view of this, the present invention provides a method and device for calculating the theme membership degree of a text. Its main purpose is to assign weights to categories of theme keywords through a preset theme membership model, so that the theme membership degree of the text under test is calculated comprehensively and the accuracy of the judgment is improved.
To achieve the above purpose, the present invention mainly provides the following technical scheme:
selecting, according to a business type, a corresponding topic model having a tree structure, wherein the nodes of the topic model divide the theme keywords into categories, each node of the topic model contains at least one theme keyword, and each node is assigned a node weight value that represents the degree of association between the node and its parent node;
splitting the text under test into sentences to obtain a sentence list;
counting, according to the theme keywords of each node of the topic model and the sentence list, the number of sentences of the text under test contained in each node of the topic model;
calculating the theme membership degree of the text under test according to the node weight value and the sentence count of each node in the topic model.
In another aspect, the present invention further provides a device for calculating the theme membership degree of a text, the device comprising:
a selection unit, configured to select, according to a business type, a corresponding topic model having a tree structure, wherein the nodes of the topic model divide the theme keywords into categories, each node of the topic model contains at least one theme keyword, and each node is assigned a node weight value that represents the degree of association between the node and its parent node;
a sentence splitting unit, configured to split the text under test into sentences to obtain a sentence list;
a counting unit, configured to count, according to the theme keywords of each node of the topic model selected by the selection unit and the sentence list obtained by the sentence splitting unit, the number of sentences of the text under test contained in each node of the topic model;
a calculation unit, configured to calculate the theme membership degree of the text under test according to the node weight value of each node in the topic model and the sentence counts obtained by the counting unit.
According to the method and device proposed above, the theme membership degree of the text under test is calculated by selecting a preset topic model. In the topic model, theme keywords are classified by category, different nodes are created according to the categories and the relations between them, and different weight values are set for the nodes. To calculate the theme membership degree, the text is split into sentences, and the node weight value in the topic model that corresponds to each sentence is determined from the theme keywords the sentence contains. Once each sentence has been assigned a node weight value, the structure of the topic model is used to compute the number of sentences attributed to the root node from the sentence counts of the individual nodes. The ratio of that number to the total number of sentences in the text under test is the theme membership degree of the text relative to the topic model. Compared with existing methods for calculating theme membership, the present invention classifies theme keywords and assigns them different weight values by building a topic model, which refines the degree of correlation between theme keywords and the tested theme. Matching against the text under test then yields a combined, weighted share of the keywords contained in the text, so that the calculation is tied both to the weight values of the theme keywords and to how often they occur in the text, which improves its accuracy. In addition, the result of the calculation is a probability value. Unlike the overly absolute results of existing binary methods, it expresses the degree of association between the text under test and the tested theme more intuitively and clearly.
Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a flowchart of a method for calculating the theme membership degree of a text according to an embodiment of the present invention;
Fig. 2 is a flowchart of another method for calculating the theme membership degree of a text according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a topic model according to an embodiment of the present invention;
Fig. 4 is a block diagram of a device for calculating the theme membership degree of a text according to an embodiment of the present invention;
Fig. 5 is a block diagram of another device for calculating the theme membership degree of a text according to an embodiment of the present invention;
Fig. 6 is a block diagram of a third device for calculating the theme membership degree of a text according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments, it should be understood that the present invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present invention can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
An embodiment of the present invention provides a method for calculating the theme membership degree of a text. As shown in Fig. 1, the specific steps include:
101. Select, according to the business type, a corresponding topic model having a tree structure.
When the theme membership degree of a text is calculated, a predetermined theme and a number of theme keywords related to it are usually given, and the correlation between the text and the predetermined theme is judged by whether the text contains those keywords. The predetermined themes are distinguished according to industry, discipline or business scope. In this embodiment, themes are divided according to business type, and for each business type a corresponding topic model can be selected to test the text under test and calculate its correlation with the theme.
The topic model is built as a tree structure, that is, a data structure in which "one-to-many" relations exist between data elements. Accordingly, the topic model contains a number of nodes that branch outward from the root node. The relation between a parent node and its child nodes is one of inclusion and subordination. Each node of the topic model contains at least one theme keyword, and the correspondence between theme keywords and nodes is determined by the categories of the theme keywords and the inclusion relations between those categories.
Besides the corresponding theme keywords, each node of the topic model in this embodiment is also assigned a node weight value that represents the degree of association between the node and the theme. It should be emphasized that the node weight values are relative: the weight value of a node is its weight relative to its parent node. For example, suppose node 1 has two child nodes, node 2 and node 3. The node weight values of node 2 and node 3 are both set relative to node 1: if the total weight of node 1 is defined as 1, the weight values of node 2 and node 3 sum to 1, and within that constraint they can be set freely according to how closely each is associated with node 1. In this way the business scope of the topic model can be expanded through the setting of node weight values, and related theme keywords can be added, so that the calculation of the theme membership degree is more comprehensive.
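The tree structure with relative node weights can be sketched in code. The following is a minimal illustration, not the patented implementation: the class name `TopicNode`, the helper `check_model` and the weights 0.7 and 0.3 are hypothetical, and treating "sibling weights sum to 1" as a rule to validate is an assumption drawn from the node 1, node 2, node 3 example above.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class TopicNode:
    """One node of the tree-structured topic model (hypothetical layout)."""
    name: str
    keywords: List[str]              # at least one theme keyword per node
    weight: float = 1.0              # weight relative to the parent node
    children: List["TopicNode"] = field(default_factory=list)
    parent: Optional["TopicNode"] = field(default=None, repr=False)

    def add_child(self, child: "TopicNode") -> "TopicNode":
        child.parent = self
        self.children.append(child)
        return child


def check_model(node: TopicNode) -> bool:
    """True if every node has a keyword and sibling weights sum to 1."""
    if not node.keywords:
        return False
    if node.children:
        total = sum(child.weight for child in node.children)
        if not math.isclose(total, 1.0):
            return False
    return all(check_model(child) for child in node.children)


# Node 1 with two children whose weights are set relative to node 1.
node1 = TopicNode("node 1", ["keyword-a"])
node1.add_child(TopicNode("node 2", ["keyword-b"], weight=0.7))
node1.add_child(TopicNode("node 3", ["keyword-c"], weight=0.3))
print(check_model(node1))  # prints True
```

Storing the weight on the child rather than on the edge keeps the "relative to the parent" meaning explicit: the same number would mean something different if the node were moved under another parent.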
102. Split the text under test into sentences to obtain a sentence list.
Once the topic model has been selected, the theme membership degree of the text under test can be calculated. The first step is to split the text into sentences, producing a sentence list. Compared with the word segmentation applied to text in the prior art, the sentence splitting used by the present invention is simpler to implement and faster to execute. Moreover, word segmentation of Chinese text suffers from inaccurate segmentation, whereas accurate sentence splitting only requires splitting on a fixed set of punctuation characters. Sentence splitting is therefore simpler to implement and more efficient than word segmentation.
The number of sentences in the resulting sentence list is counted for use in the subsequent theme membership calculation.
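Splitting on a fixed set of punctuation characters can be sketched as follows. The source does not list the punctuation characters, so the set used here (Chinese and Western full stops, exclamation marks, question marks and semicolons) and the function name `split_sentences` are assumptions for illustration only.

```python
import re

# Assumed sentence-ending punctuation; the source only says "fixed punctuation".
_SENTENCE_END = re.compile(r"[。！？；.!?;]+")


def split_sentences(text):
    """Split text on sentence-ending punctuation and drop empty pieces."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


sentences = split_sentences("颐和园很美。门票多少钱？We met many visitors!")
print(sentences)       # ['颐和园很美', '门票多少钱', 'We met many visitors']
print(len(sentences))  # 3, the total later used as the denominator
```

A plain character class like this also splits on the dot in decimals and abbreviations; a real deployment would likely need a more careful rule for Western text.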
103. Count, according to the theme keywords of each node of the topic model and the sentence list, the number of sentences of the text under test contained in each node of the topic model.
After the text under test has been split, the sentences in the sentence list are fed one by one into the selected topic model and matched against its theme keywords to check whether each sentence contains a theme keyword. If it does, the node holding that keyword is determined and the counter of that node is incremented by 1. The counter records the number of sentences of the text under test that fall in the node: whenever a sentence contains a theme keyword of the node, the sentence is recorded in that node, that is, its counter is incremented by 1.
Note in particular that a sentence fed into the topic model is matched to only one node, so that a sentence is never recorded more than once. That is, when a sentence contains several theme keywords, the topic model selects one main keyword among them to determine the node corresponding to the sentence, and increments the counter of that node by 1.
Through this step, every sentence in the sentence list that contains a keyword is matched to a single node of the topic model, so the distribution of the sentences of the text under test over the nodes can be read from the model.
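The per-node counters can be sketched as below, assuming simple substring matching. The names `matching_nodes` and `count_sentences`, the flat `node_keywords` mapping and the sample data are all hypothetical. The default `choose` callback, which takes the first matching node, is only a placeholder for the "main keyword" selection; it is not the selection rule of the method.

```python
from collections import Counter


def matching_nodes(sentence, node_keywords):
    """Names of all nodes that have at least one keyword in the sentence."""
    return [node for node, keywords in node_keywords.items()
            if any(keyword in sentence for keyword in keywords)]


def count_sentences(sentences, node_keywords, choose=lambda nodes: nodes[0]):
    """Record each matching sentence in exactly one node."""
    counts = Counter()
    for sentence in sentences:
        nodes = matching_nodes(sentence, node_keywords)
        if nodes:                       # sentences without keywords are skipped
            counts[choose(nodes)] += 1  # one increment per sentence, never more
    return counts


node_keywords = {
    "visitor": ["visitor", "tourist"],
    "hotel": ["hotel"],
}
sentences = ["The hotel was full", "Every visitor liked the hotel", "It rained"]
counts = count_sentences(sentences, node_keywords)
print(counts["hotel"], counts["visitor"])  # prints: 1 1
```

Because each sentence increments at most one counter, the counters never add up to more than the number of sentences in the list.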
104. Calculate the theme membership degree of the text under test according to the node weight value and the sentence count of each node in the topic model.
After the number of sentences of the text under test in each node of the topic model has been counted, the relative nature of the node weight values described above allows the sentence count of a node to be converted into a sentence count for its parent node. Proceeding in this way yields the number of sentences of the text under test attributed to the root node of the topic model. The ratio of that number to the total number of sentences in the sentence list gives the theme membership degree of the text under test relative to the topic model.
Further, as an extension of the above calculation, the topic model can be split: a parent node, together with its child nodes and the nodes below them, can form a topic model of its own, against which the membership degree of the text under test with respect to that parent node is calculated. Therefore, when the topic model is created, a theme name can also be assigned to each node, so that the theme membership degrees for several related themes can be calculated under the same topic model as required.
As can be seen from the above implementation, the method adopted by the embodiment of the present invention calculates the theme membership degree of the text under test by selecting a preset topic model. In the topic model, theme keywords are classified by category, different nodes are created according to the categories and the relations between them, and different weight values are set for the nodes. To calculate the theme membership degree, the text is split into sentences, and the node weight value in the topic model that corresponds to each sentence is determined from the theme keywords the sentence contains. Once each sentence has been assigned a node weight value, the structure of the topic model is used to compute the number of sentences attributed to the root node from the sentence counts of the individual nodes. The ratio of that number to the total number of sentences in the text under test is the theme membership degree of the text relative to the topic model. Compared with existing methods for calculating theme membership, the present invention classifies theme keywords and assigns them different weight values by building a topic model, which refines the degree of correlation between theme keywords and the tested theme. Matching against the text under test then yields a combined, weighted share of the keywords contained in the text, so that the calculation is tied both to the weight values of the theme keywords and to how often they occur in the text, which improves its accuracy. In addition, the result of the calculation is a probability value. Unlike the overly absolute results of existing binary methods, it expresses the degree of association between the text under test and the tested theme more intuitively and clearly.
To explain the proposed method in more detail, the embodiment of the present invention is illustrated below through a concrete implementation. As shown in Fig. 2, the steps of the method for calculating the theme membership degree of a text are:
201. Create a topic model having a tree structure.
As described in step 101, different businesses have different themes. To create a topic model, the relevant theme keywords are therefore first obtained according to the business scope to which the theme belongs, and the topic model with a tree structure is then created according to the categories of those keywords. This embodiment takes the theme "tourism" as an example, as shown in Fig. 3. First, theme keywords related to tourism are obtained, including: sight spot, destination, hotel, visitor, admission fee and so on. Tourism is then made the root node of the model; its child nodes include sight spot, hotel, visitor and so on, and the sight spot node in turn has child nodes such as sight spot name and consumption. Once the nodes are set, the obtained theme keywords are assigned to the corresponding nodes, ensuring that each node contains at least one theme keyword; this completes the main framework of the topic model. Afterwards, a node weight value must be set for each node. Note that the node weight value expresses the degree of correlation between the node and its parent node, not between the node and the theme: it is a weight relative to the parent node, not an absolute weight with respect to the theme.
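The tourism example can be written down as a nested structure. This is only a sketch of the node layout described above: Fig. 3 is not reproduced in this text, so every weight value, the root keywords, and the placement of "destination", "Summer Palace" and "admission fee" in particular nodes are assumptions made for illustration.

```python
# Hypothetical encoding of the tourism topic model; all weights are made up.
TOURISM_MODEL = {
    "name": "tourism", "weight": 1.0, "keywords": ["tourism", "destination"],
    "children": [
        {"name": "sight spot", "weight": 0.5, "keywords": ["sight spot"],
         "children": [
             {"name": "sight spot name", "weight": 0.6,
              "keywords": ["Summer Palace"], "children": []},
             {"name": "consumption", "weight": 0.4,
              "keywords": ["admission fee"], "children": []},
         ]},
        {"name": "hotel", "weight": 0.3, "keywords": ["hotel"], "children": []},
        {"name": "visitor", "weight": 0.2, "keywords": ["visitor"], "children": []},
    ],
}


def walk(node, depth=0):
    """Yield (node, depth) pairs in depth-first order, root first."""
    yield node, depth
    for child in node["children"]:
        yield from walk(child, depth + 1)


def nodes_without_keywords(model):
    """Names of nodes that break the 'at least one keyword' requirement."""
    return [node["name"] for node, _ in walk(model) if not node["keywords"]]


for node, depth in walk(TOURISM_MODEL):
    print("  " * depth + f'{node["name"]} (weight {node["weight"]})')
print(nodes_without_keywords(TOURISM_MODEL))  # prints []
```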
Note that node weight values can be assigned automatically by a computer according to some algorithm, or set manually from experience; this embodiment does not limit the specific way they are set.
202. Select, according to the business type, a corresponding topic model having a tree structure.
Before the membership degree is calculated, a topic model belonging to the theme to be tested must first be selected. The selection can be made by a computer according to a specific algorithm, or a specific topic model can be specified manually; this embodiment does not limit the way the selection is made.
203. Split the text under test into sentences to obtain a sentence list.
This step is the same as step 102 above; see step 102 for details, which are not repeated here.
204. Count, according to the theme keywords of each node of the topic model and the sentence list, the number of sentences of the text under test contained in each node of the topic model.
To calculate the theme membership degree with the topic model, each sentence of the text under test is first fed into the topic model to judge whether it contains any theme keyword of the model. One implementation is to segment the sentence into words and then match the words one by one against all theme keywords in the topic model. Alternatively, the theme keywords can be compared against the sentence one by one to judge whether the sentence contains each keyword. Both approaches are widely used in existing technology, so their implementation details are not described further in this embodiment.
Next, the judgment determines whether the sentence contains theme keywords. When the sentence contains one theme keyword, the topic model determines the node holding that keyword and increments the sentence count recorded for that node by 1. When the sentence contains several theme keywords, the topic model first determines the nodes holding those keywords, selects one of them according to the positions of the nodes, and updates the sentence count of the selected node. The selection proceeds as follows. If the theme keywords all lie in one node, that node is the node of the sentence. If the theme keywords belong to different nodes, it is further judged whether those nodes are child nodes of the same parent node. If they are, the node with the larger node weight value is selected as the node of the sentence, because among nodes at the same level a larger node weight value indicates a higher degree of association with the parent node and the root node. If they are not, the node closest to the root node is selected, because among nodes at different levels the one closer to the root node is more strongly associated with the theme. Taking the topic model of Fig. 3 as an example: when a sentence contains the keywords "Summer Palace" and "admission fee", the sentence is added to the count of whichever of the two nodes has the larger node weight value; when a sentence contains the keywords "Summer Palace" and "visitor", the sentence is added to the count of the node holding "visitor".
This matching rule avoids counting a sentence more than once when it contains several keywords, so that each sentence in the sentence list that contains theme keywords corresponds to a single node in the topic model.
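The selection rule for sentences with keywords in several nodes can be sketched as a small function. The `Node` tuple, the function name `select_node`, the weights and the depths are hypothetical. The text does not say how ties (equal weights or equal depths) are resolved; this sketch simply keeps the first candidate in that case, which is an assumption.

```python
from collections import namedtuple

# depth is the distance from the root node (root = 0); values are illustrative.
Node = namedtuple("Node", "name parent weight depth")


def select_node(candidates):
    """Pick the single node in which a sentence is recorded."""
    unique = list({node.name: node for node in candidates}.values())
    if len(unique) == 1:
        return unique[0]                      # all keywords lie in one node
    if len({node.parent for node in unique}) == 1:
        # Siblings: the larger weight relative to the shared parent wins.
        return max(unique, key=lambda node: node.weight)
    # Different parents: the node closest to the root wins.
    return min(unique, key=lambda node: node.depth)


spot_name = Node("sight spot name", "sight spot", 0.6, 2)   # holds "Summer Palace"
consumption = Node("consumption", "sight spot", 0.4, 2)     # holds "admission fee"
visitor = Node("visitor", "tourism", 0.2, 1)                # holds "visitor"

print(select_node([spot_name, consumption]).name)  # prints: sight spot name
print(select_node([spot_name, visitor]).name)      # prints: visitor
```

With these made-up weights the "Summer Palace" and "admission fee" sentence lands in the sight spot name node; with the weights reversed it would land in the consumption node instead.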
205. Calculate the theme membership degree of the text under test according to the node weight value and the sentence count of each node in the topic model.
Once the sentence count recorded by each node of the topic model has been determined, the sentence count of a node can be converted, using the node's weight value, into a count for its parent node. The total sentence count of the parent node is the sum of the parent's own count and the converted counts of all its child nodes. The specific calculation formula is:
where Fre_j is the total sentence count of node J, SentFre_j is the sentence count of node J, Weight_j is the node weight value of node J, SentFre_i is the sentence count of node I, Weight_i is the node weight value of node I, and node I is a child node of node J.
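The formula itself does not appear in this text. One reading that is consistent with the surrounding description (a node's total is its own count plus the weighted counts converted from its children, applied recursively up to the root) is given below as an assumption, not as the patent's exact expression. Here N, the number of sentences in the sentence list, and the membership symbol M are notation introduced for this sketch.

```latex
\[
  Fre_j \;=\; SentFre_j \;+\; \sum_{i \,\in\, \mathrm{children}(j)} Weight_i \cdot Fre_i ,
  \qquad
  Fre_i = SentFre_i \ \text{when node } I \text{ is a leaf},
\]
\[
  M \;=\; \frac{Fre_{\mathrm{root}}}{N} .
\]
```

Under this reading, Weight_j enters one level higher, when the total of node J is itself converted into the count of J's parent node.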
With the above formula, the total sentence count of the root node of the topic model can be calculated. The ratio of this total to the total number of sentences is defined as the theme membership degree of the text under test relative to the topic model. The theme membership degree is a probability value that expresses how closely the theme or central idea of the text under test approximates the theme defined by the topic model. By combining nodes at different levels of the topic model with the different weight values of those nodes, the degree of correlation between the text under test and the topic model is analyzed comprehensively, which greatly improves the accuracy of the judgment.
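The roll-up to the root node and the final ratio can be sketched recursively. This is an illustration under one assumed reading of the calculation, in which a node's total is its own sentence count plus each child's total multiplied by that child's weight; the function names, the dictionary layout, the weights and the counts are all hypothetical.

```python
def rolled_up_count(node):
    """Own sentence count plus each child's total scaled by the child's weight."""
    total = node["count"]
    for child in node.get("children", []):
        total += child["weight"] * rolled_up_count(child)
    return total


def membership(root, total_sentences):
    """Root total divided by the number of sentences in the sentence list."""
    if total_sentences == 0:
        return 0.0
    return rolled_up_count(root) / total_sentences


# Illustrative counts: 7 of 10 sentences matched some node.
model = {
    "name": "tourism", "weight": 1.0, "count": 2,
    "children": [
        {"name": "sight spot", "weight": 0.5, "count": 1,
         "children": [
             {"name": "sight spot name", "weight": 0.5, "count": 2},
             {"name": "consumption", "weight": 0.5, "count": 0},
         ]},
        {"name": "visitor", "weight": 0.5, "count": 2},
    ],
}

print(rolled_up_count(model))  # prints 4.0
print(membership(model, 10))   # prints 0.4
```

If every weight lies between 0 and 1 and each sentence is counted in at most one node, the root total cannot exceed the number of sentences, so the result stays between 0 and 1, which fits the description of the result as a probability value.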
Further, as an implementation of the above method, an embodiment of the present invention also provides a device for calculating the theme membership degree of a text, as shown in Fig. 4. This device embodiment corresponds to the foregoing method embodiment. For ease of reading, the details of the method embodiment are not repeated one by one here, but it should be understood that the device of this embodiment can implement everything in the foregoing method embodiment. The device comprises:
a selection unit 41, configured to select, according to a business type, a corresponding topic model having a tree structure, wherein the nodes of the topic model divide the theme keywords into categories, each node of the topic model contains at least one theme keyword, and each node is assigned a node weight value that represents the degree of association between the node and its parent node;
a sentence splitting unit 42, configured to split the text under test into sentences to obtain a sentence list;
a counting unit 43, configured to count, according to the theme keywords of each node of the topic model selected by the selection unit 41 and the sentence list obtained by the sentence splitting unit 42, the number of sentences of the text under test contained in each node of the topic model;
a calculation unit 44, configured to calculate the theme membership degree of the text under test according to the node weight value of each node in the topic model and the sentence counts obtained by the counting unit 43.
Further, as shown in Fig. 5, the device also comprises:
an acquisition unit 45, configured to obtain the corresponding theme keywords according to the business type before the selection unit 41 selects the corresponding topic model having a tree structure;
a creation unit 46, configured to create the topic model having a tree structure according to the categories of the theme keywords obtained by the acquisition unit 45;
a setting unit 47, configured to set, according to the degree of correlation between each node of the topic model created by the creation unit 46 and its parent node, the node weight value of that node relative to its parent node.
Further, as shown in fig. 6, the statistic unit 43 includes:
Judge module 431, for judging the sentence list in subordinate sentence whether contain the topic model In subject key words;
Determining module 432, for containing theme key when the judgement subordinate sentence of the judge module 431 During word, the node in the topic model of the subject key words place is determined;
Statistical module 433, for the subordinate sentence to be counted in the subordinate sentence quantity that the node contains, more The subordinate sentence quantity that the node that the new determining module 432 determines contains.
Further, as shown in fig. 6, the determining module 432 includes:
Judging submodule 4321, for when the subject key words containing multiple different nodes in the subordinate sentence When, judge that whether the plurality of different nodes are the child node of same father node;
Select submodule 4322, for when the judged result of the judging submodule 4321 for belong to when, The node that the node for selecting node weight weight values big is located as the subordinate sentence;
Submodule 4322 is selected to be additionally operable to, when the judged result of the judging submodule 4321 is not belong to Yu Shi, selects the node being located as the subordinate sentence closest to the node of root node.
Further, as shown in fig. 6, the judge module 431 includes:
Participle submodule 4311, for the sentence to be carried out into word segmentation processing;
Matched sub-block 4312, for the participle and the theme that obtain the participle submodule 4311 Subject key words in model are matched one by one.
Further, as shown in Fig. 6, the computing unit 44 of the device includes:
Conversion module 441, configured to convert the sentence count of each child node into a sentence count of its parent node according to the node weight value of each node;
Computing module 442, configured to calculate the sentence count of the root node of the topic model using a recursive algorithm, and then calculate the quotient of the root node's sentence count and the number of sentences in the sentence list, obtaining the topic membership degree of the text under test relative to the topic model.
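The conversion and recursive calculation can be sketched as follows. The source does not give the exact conversion formula, so this sketch assumes that a child's converted count is its own rolled-up count multiplied by its node weight value, and that this is added to the parent's own count. The function names and the dictionary-based tree layout are likewise hypothetical.

```python
from typing import Dict, List


def rolled_up_count(node: str,
                    children: Dict[str, List[str]],
                    weight: Dict[str, float],
                    own_count: Dict[str, int]) -> float:
    """Sentence count of `node` after folding in its weighted descendants."""
    total = float(own_count.get(node, 0))
    for child in children.get(node, []):
        # Assumed conversion: child count scaled by the child's weight toward its parent.
        total += weight[child] * rolled_up_count(child, children, weight, own_count)
    return total


def membership_degree(root: str,
                      children: Dict[str, List[str]],
                      weight: Dict[str, float],
                      own_count: Dict[str, int],
                      n_sentences: int) -> float:
    """Root sentence count divided by the number of sentences in the text."""
    if n_sentences == 0:
        return 0.0
    return rolled_up_count(root, children, weight, own_count) / n_sentences


children = {"automobile": ["engine", "tyre"], "engine": ["piston"]}
weight = {"engine": 0.8, "tyre": 0.5, "piston": 0.5}
own_count = {"automobile": 2, "engine": 3, "tyre": 2, "piston": 2}

degree = membership_degree("automobile", children, weight, own_count, n_sentences=10)
```

With these illustrative numbers the engine node rolls up to 3 + 0.5 × 2 = 4, the root to 2 + 0.8 × 4 + 0.5 × 2 = 6.2, and the membership degree comes out at about 0.62. Because each sentence is counted under one node and the weights do not exceed 1, the result stays between 0 and 1.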
In summary, the method and device for calculating text topic membership degree adopted by the embodiments of the present invention calculate the topic membership degree of a text under test by selecting a preset topic model. In the topic model, the topic keywords are classified by category; nodes are created in the topic model according to the categories and the relations between them, and each node is assigned its own weight value. To calculate the topic membership degree of the text under test, the text is first split into sentences, and the topic keywords contained in each sentence determine which node of the topic model, and hence which node weight value, the sentence corresponds to. Once every sentence has been assigned, the structure of the topic model is used to compute the number of sentences contained in the root node from the sentence counts of the individual nodes. The ratio of that number to the total number of sentences in the text under test is the topic membership degree of the text relative to the topic model. Compared with existing methods for calculating topic membership degree, the present invention builds a topic model that classifies the topic keywords and assigns them different weight values, thereby refining the degree of correlation between each topic keyword and the test topic. Matching against the text under test then yields a combined measure of the weighted share of keywords in the text, so that the calculated membership degree depends on both the weight values of the topic keywords and the number of times they occur in the text, which improves the accuracy of the calculation. In addition, the result of the calculation is a probability value. This avoids the drawback of existing binary classification methods, whose results are overly absolute, and expresses the degree of association between the text under test and the test topic more intuitively and clearly.
The device for calculating text topic membership degree includes a processor and a memory. The above selecting unit, sentence-splitting unit, statistic unit, computing unit and so on are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided; the topic membership degree of the test text relative to the topic model is calculated by adjusting kernel parameters, thereby improving the accuracy of the topic membership judgment.
The memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: selecting, according to the business type, a corresponding topic model with a tree structure, where the nodes of the topic model are used to divide the topic keywords into categories, each node of the topic model contains at least one topic keyword, and each node is provided with a node weight value that represents the degree of association between the node and its parent node; splitting the text under test into sentences to obtain a sentence list; counting, according to the topic keywords of each node of the topic model and the sentence list, the number of sentences of the text under test contained in each node of the topic model; and calculating the topic membership degree of the text under test according to the node weight value and the sentence count of each node in the topic model.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The above are merely embodiments of the present application and are not intended to limit the present application. Those skilled in the art may make various modifications and variations to the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (10)

1. A method for calculating text topic membership degree, characterized in that the method comprises:
Selecting, according to the business type, a corresponding topic model with a tree structure, wherein the nodes of the topic model are used to divide the topic keywords into categories, each node of the topic model contains at least one topic keyword, and each node is provided with a node weight value that represents the degree of association between the node and its parent node;
Splitting the text under test into sentences to obtain a sentence list;
Counting, according to the topic keywords of each node of the topic model and the sentence list, the number of sentences of the text under test contained in each node of the topic model;
Calculating the topic membership degree of the text under test according to the node weight value and the sentence count of each node in the topic model.
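The sentence-splitting step of claim 1 can be sketched as below. The source does not say which delimiters are used, so the terminator set and the function name `split_sentences` are assumptions; splitting on every "." is a simplification that would also break decimals and abbreviations.

```python
import re
from typing import List

# Assumed terminators: common Chinese and Western sentence-ending punctuation.
_TERMINATORS = r"[。！？!?；;.]+"


def split_sentences(text: str) -> List[str]:
    """Split a text under test into a list of non-empty sentences."""
    parts = re.split(_TERMINATORS, text)
    return [p.strip() for p in parts if p.strip()]


sentence_list = split_sentences("发动机很好。轮胎一般！Is it fast? Yes.")
```

The length of the resulting list is the denominator used later when the root node's sentence count is turned into a membership degree.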
2. The method according to claim 1, characterized in that, before selecting the corresponding topic model with a tree structure according to the business type, the method further comprises:
Obtaining corresponding topic keywords according to the business type;
Creating a topic model with a tree structure according to the categories of the topic keywords;
Setting, according to the degree of correlation between each node in the topic model and its parent node, a node weight value of the node relative to its parent node.
3. The method according to claim 2, characterized in that counting, according to the topic keywords of each node of the topic model and the sentence list, the number of sentences of the text under test contained in each node of the topic model comprises:
Judging whether a sentence in the sentence list contains a topic keyword of the topic model;
If so, determining the node of the topic model in which the topic keyword is located;
Counting the sentence toward the number of sentences contained in that node, updating the sentence count of the node.
4. The method according to claim 3, characterized in that determining the node of the topic model in which the topic keyword is located comprises:
When the sentence contains topic keywords of multiple different nodes, judging whether those nodes are child nodes of the same parent node;
If they are, selecting the node with the largest node weight value as the node to which the sentence belongs;
If they are not, selecting the node closest to the root node as the node to which the sentence belongs.
5. The method according to claim 3, characterized in that judging whether a sentence in the sentence list contains a topic keyword of the topic model comprises:
Performing word segmentation on the sentence;
Matching the segmented words one by one against the topic keywords in the topic model.
6. The method according to any one of claims 1-5, characterized in that calculating the topic membership degree of the text under test according to the node weight value and the sentence count of each node in the topic model comprises:
Converting the sentence count of each child node into a sentence count of its parent node according to the node weight value of each node;
Calculating the sentence count of the root node of the topic model using a recursive algorithm, and then calculating the quotient of the root node's sentence count and the number of sentences in the sentence list, to obtain the topic membership degree of the text under test relative to the topic model.
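The calculation in claim 6 can be written compactly as a recurrence. The notation is introduced here for illustration and does not appear in the source: c(v) is the number of sentences counted directly under node v, w_u is the node weight value of child u relative to its parent, r is the root node, and N is the number of sentences in the sentence list. That a child's converted count is w_u times its rolled-up count, added to the parent's own count, is an assumed reading of the conversion step.

```latex
% Assumed recurrence for the rolled-up sentence count S(v) and the membership degree D
S(v) = c(v) + \sum_{u \in \mathrm{children}(v)} w_u \, S(u),
\qquad
D = \frac{S(r)}{N}

% Worked example: a root with c(r) = 1 and two leaf children
% with counts 4 and 2 and weights 0.5 and 0.25, in a text of N = 5 sentences
S(r) = 1 + 0.5 \cdot 4 + 0.25 \cdot 2 = 3.5,
\qquad
D = \frac{3.5}{5} = 0.7
```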
7. A device for calculating text topic membership degree, characterized in that the device comprises:
Selecting unit, configured to select, according to the business type, a corresponding topic model with a tree structure, wherein the nodes of the topic model are used to divide the topic keywords into categories, each node of the topic model contains at least one topic keyword, and each node is provided with a node weight value that represents the degree of association between the node and its parent node;
Sentence-splitting unit, configured to split the text under test into sentences to obtain a sentence list;
Statistic unit, configured to count, according to the topic keywords of each node of the topic model selected by the selecting unit and the sentence list obtained by the sentence-splitting unit, the number of sentences of the text under test contained in each node of the topic model;
Computing unit, configured to calculate the topic membership degree of the text under test according to the node weight value of each node in the topic model and the sentence counts obtained by the statistic unit.
8. The device according to claim 7, characterized in that the device further comprises:
Acquiring unit, configured to obtain corresponding topic keywords according to the business type before the selecting unit selects the corresponding topic model with a tree structure according to the business type;
Creating unit, configured to create a topic model with a tree structure according to the categories of the topic keywords obtained by the acquiring unit;
Setting unit, configured to set, according to the degree of correlation between each node in the topic model created by the creating unit and its parent node, a node weight value of the node relative to its parent node.
9. The device according to claim 8, characterized in that the statistic unit comprises:
Judging module, configured to judge whether a sentence in the sentence list contains a topic keyword of the topic model;
Determining module, configured to determine, when the judging module judges that the sentence contains a topic keyword, the node of the topic model in which the topic keyword is located;
Statistical module, configured to count the sentence toward the number of sentences contained in that node, updating the sentence count of the node determined by the determining module.
10. The device according to any one of claims 7-9, characterized in that the computing unit comprises:
Conversion module, configured to convert the sentence count of each child node into a sentence count of its parent node according to the node weight value of each node;
Computing module, configured to calculate the sentence count of the root node of the topic model using a recursive algorithm, and then calculate the quotient of the root node's sentence count and the number of sentences in the sentence list, to obtain the topic membership degree of the text under test relative to the topic model.
CN201510680602.0A 2015-10-19 2015-10-19 Method and device for calculating text theme attribution degree Active CN106598999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510680602.0A CN106598999B (en) 2015-10-19 2015-10-19 Method and device for calculating text theme attribution degree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510680602.0A CN106598999B (en) 2015-10-19 2015-10-19 Method and device for calculating text theme attribution degree

Publications (2)

Publication Number Publication Date
CN106598999A true CN106598999A (en) 2017-04-26
CN106598999B CN106598999B (en) 2020-02-04

Family

ID=58554937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510680602.0A Active CN106598999B (en) 2015-10-19 2015-10-19 Method and device for calculating text theme attribution degree

Country Status (1)

Country Link
CN (1) CN106598999B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030130837A1 (en) * 2001-07-31 2003-07-10 Leonid Batchilo Computer based summarization of natural language documents
CN101315624A (en) * 2007-05-29 2008-12-03 阿里巴巴集团控股有限公司 Text subject recommending method and device
CN101727487A (en) * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Network criticism oriented viewpoint subject identifying method and system
CN102254011A (en) * 2011-07-18 2011-11-23 哈尔滨工业大学 Method for modeling dynamic multi-document abstracts
CN103226580A (en) * 2013-04-02 2013-07-31 西安交通大学 Interactive-text-oriented topic detection method
CN103744953A (en) * 2014-01-02 2014-04-23 中国科学院计算机网络信息中心 Network hotspot mining method based on Chinese text emotion recognition

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193973A (en) * 2017-05-25 2017-09-22 百度在线网络技术(北京)有限公司 The field recognition methods of semanteme parsing information and device, equipment and computer-readable recording medium
CN107193973B (en) * 2017-05-25 2021-07-20 百度在线网络技术(北京)有限公司 Method, device and equipment for identifying field of semantic analysis information and readable medium
US10777192B2 (en) 2017-05-25 2020-09-15 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus of recognizing field of semantic parsing information, device and readable medium
CN107247707B (en) * 2017-06-27 2020-08-04 鼎富智能科技有限公司 Enterprise association relation information extraction method and device based on completion strategy
CN107247707A (en) * 2017-06-27 2017-10-13 北京神州泰岳软件股份有限公司 Enterprise's incidence relation information extracting method and device based on completion strategy
CN107392433A (en) * 2017-06-27 2017-11-24 北京神州泰岳软件股份有限公司 A kind of method and apparatus for extracting enterprise's incidence relation information
CN107562854A (en) * 2017-08-28 2018-01-09 云南大学 A kind of modeling method of quantitative analysis Party building data
CN107562854B (en) * 2017-08-28 2020-09-22 云南大学 Modeling method for quantitatively analyzing party building data
CN110209829A (en) * 2018-02-12 2019-09-06 百度在线网络技术(北京)有限公司 Information processing method and device
CN110209829B (en) * 2018-02-12 2021-06-29 百度在线网络技术(北京)有限公司 Information processing method and device
CN110659655B (en) * 2018-06-28 2021-03-02 北京三快在线科技有限公司 Index classification method and device and computer readable storage medium
CN110659655A (en) * 2018-06-28 2020-01-07 北京三快在线科技有限公司 Index classification method and device and computer readable storage medium
WO2020140373A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Intention recognition method, recognition device and computer-readable storage medium
CN110705308A (en) * 2019-09-18 2020-01-17 平安科技(深圳)有限公司 Method and device for recognizing field of voice information, storage medium and electronic equipment
CN112100360A (en) * 2020-10-30 2020-12-18 北京淇瑀信息科技有限公司 Dialog response method, device and system based on vector retrieval
CN112100360B (en) * 2020-10-30 2024-02-02 北京淇瑀信息科技有限公司 Dialogue response method, device and system based on vector retrieval

Also Published As

Publication number Publication date
CN106598999B (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN106598999A (en) Method and device for calculating text theme membership degree
CN105389349B (en) Dictionary update method and device
CN110019396A (en) A kind of data analysis system and method based on distributed multidimensional analysis
CN106919689A (en) Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge
US20150302433A1 (en) Automatic Generation of Custom Intervals
CN107203774A (en) The method and device that the belonging kinds of data are predicted
CN106649464A (en) Method of building Chinese address tree and device
BR112012011091B1 (en) method and apparatus for extracting and evaluating word quality
CN105446952B (en) For handling the method and system of semantic segment
CN112463774B (en) Text data duplication eliminating method, equipment and storage medium
CN107391682B (en) Knowledge verification method, knowledge verification apparatus, and storage medium
CN106649250A (en) Method and device for identifying emotional new words
CN109726758A (en) A kind of data fusion publication algorithm based on difference privacy
CN106598997A (en) Method and device for computing membership degree of text subject
CN106610931A (en) Extraction method and device for topic names
CN108763536A (en) Data bank access method and device
CN111143685A (en) Recommendation system construction method and device
US11288266B2 (en) Candidate projection enumeration based query response generation
CN110968564A (en) Data processing method and training method of data state prediction model
US9324041B2 (en) Function stream based analysis
CN105787004A (en) Text classification method and device
CN106815320B (en) Investigation big data visual modeling method and system based on expanded three-dimensional histogram
US20160357795A1 (en) Method and apparatus for data mining
CN107861950A (en) The detection method and device of abnormal text
CN108108379A (en) Keyword opens up the method and device of word

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant