CN106598997A - Method and device for computing membership degree of text subject

Info

Publication number: CN106598997A
Authority: CN (China)
Prior art keywords: sentence, text, measured, keyword, theme
Legal status: Granted (current status: Active)
Application number: CN201510680277.8A
Other languages: Chinese (zh)
Other versions: CN106598997B (en)
Inventor: 侯明午
Current Assignee: Beijing Gridsum Technology Co Ltd
Original Assignee: Beijing Gridsum Technology Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification

Abstract

The invention discloses a method and a device for computing the subject membership degree of a text, relates to the technical field of computers, and addresses the problem that membership degree computations are error-prone when subject keywords appearing in a text are unrelated to the subject of the text. The main technical scheme comprises: segmenting the text to be tested into sentences to obtain a sentence list; finding, according to a preset subject keyword set, the sentences in the sentence list that contain keywords from the set; determining position weight values of the sentences according to their positions in the text to be tested; and computing the subject membership degree of the text to be tested from the position weight values of the sentences and the number of sentences containing keywords. The method and the device are mainly used for computing the membership degree of a text.

Description

A method and device for calculating the subject membership degree of a text
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for calculating the subject membership degree of a text.
Background
In the era of big data, extracting relevant information is an important problem. Information extraction techniques do not attempt to understand an entire document comprehensively; they analyze only the parts of the document that contain relevant information. The subject matter expressed by an article is determined from the characteristic keywords extracted from it.
Most existing relevant-information extraction algorithms check whether an article contains characteristic keywords related to a given subject, and on that basis judge whether the content of the article belongs to that subject. Using the mere presence of keywords as the feature does capture the relevant information in the article fairly comprehensively, but the extracted information may contain a great deal of noise, because not every word in an article is closely related to its subject. As a result, the final judgment about the subject expressed by the article may be the opposite of the correct one, which leads to larger errors in subsequent analysis.
Summary of the invention
In view of this, the present invention provides a method and device for calculating the subject membership degree of a text. Its main purpose is to calculate the subject membership degree of the text from the positions and frequency at which subject keywords appear in the text, thereby improving the accuracy of membership judgments.
To achieve the above purpose, the present invention mainly provides the following technical solutions:
In one aspect, the present invention provides a method for calculating the subject membership degree of a text, the method comprising:
performing sentence segmentation on the text to be tested to obtain a sentence list;
searching the sentence list, according to a preset subject keyword set, for sentences that contain a keyword from the subject keyword set;
determining the position weight value of each such sentence according to its position in the text to be tested;
calculating the subject membership degree of the text to be tested according to the position weight values of the sentences and the number of sentences containing keywords.
In another aspect, the present invention also provides a device for calculating the subject membership degree of a text, the device comprising:
a sentence segmentation unit, configured to perform sentence segmentation on the text to be tested to obtain a sentence list;
a searching unit, configured to search, according to a preset subject keyword set, the sentence list obtained by the sentence segmentation unit for sentences that contain a keyword from the subject keyword set;
a determining unit, configured to determine the position weight value of each sentence found by the searching unit according to its position in the text to be tested;
a computing unit, configured to calculate the subject membership degree of the text to be tested according to the position weight values determined by the determining unit and the number of sentences containing keywords found by the searching unit.
With the method and device proposed by the present invention, the text to be tested is first segmented into sentences; each sentence is checked for subject keywords and the number of sentences containing subject keywords is recorded; the position weight value of each such sentence is then determined from its position in the text to be tested; and the subject membership degree of the text is obtained as the ratio of the sum of the position weight values of the keyword-containing sentences to the sum of the position weight values of all sentences. Compared with the prior art, the present invention expresses the degree of correlation between the text to be tested and the subject through the size of this ratio, which avoids the overly absolute conclusions of a binary (yes/no) judgment. In addition, because the membership degree is computed from the position weight values of sentences, the positions at which keywords appear in the text are quantified and incorporated into the calculation. This reduces the analysis errors caused by the noise described in the background section and thereby improves the accuracy of membership judgments.
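The ratio described in the preceding paragraph can be written compactly. The notation is introduced here only for illustration and does not appear in the original text: $S$ is the set of all sentences in the text to be tested, $K \subseteq S$ is the subset of sentences that contain at least one subject keyword, $w(s)$ is the position weight value of sentence $s$, and $D$ is the subject membership degree. Assuming all position weight values are positive:

```latex
\[
D \;=\; \frac{\sum_{s \in K} w(s)}{\sum_{s \in S} w(s)},
\qquad 0 \le D \le 1 .
\]
```

Under this reading, $D = 0$ when no sentence contains a keyword and $D = 1$ when every sentence does, with values in between expressing a graded rather than binary judgment.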
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for calculating the subject membership degree of a text proposed by an embodiment of the present invention;
Fig. 2 shows a flowchart of another method for calculating the subject membership degree of a text proposed by an embodiment of the present invention;
Fig. 3 shows a block diagram of a device for calculating the subject membership degree of a text proposed by an embodiment of the present invention;
Fig. 4 shows a block diagram of another device for calculating the subject membership degree of a text proposed by an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
An embodiment of the present invention provides a method for calculating the subject membership degree of a text. As shown in Fig. 1, the specific steps include:
101. Perform sentence segmentation on the text to be tested to obtain a sentence list.
The texts to be tested in the embodiments of the present invention are mostly Chinese texts, and a text that expresses a particular topic is generally a long or multi-paragraph article. Existing methods for analyzing the subject of an article mainly segment the article into words and then check whether the resulting words include keywords related to the subject. Existing word segmentation methods are relatively complex, however, and their accuracy leaves room for improvement. Therefore, to avoid analysis errors caused by word segmentation, the embodiment performs sentence segmentation on the text to be tested, decomposing it into a number of sentences that together form a sentence list.
102. According to a preset subject keyword set, search the sentence list for sentences that contain a keyword from the subject keyword set.
While the text to be tested is being decomposed, a subject keyword set also needs to be created; this set contains a number of keywords related to the subject. After the sentence list of the text to be tested has been obtained, the sentences in the list are matched against the keywords in the subject keyword set, and the sentences containing keywords are selected.
When matching a sentence against the keywords, the sentence may be decomposed into words that are then matched against the keywords one by one, or each keyword may be taken in turn and matched against the characters or words in the sentence. This embodiment does not limit the specific matching method; the main purpose is to pick out the sentences in the sentence list that contain keywords.
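The second matching option, taking each keyword in turn and looking for it inside the sentence, can be sketched as a plain substring test. This is a minimal illustration rather than the patent's implementation; the function name `sentences_with_keywords` and the sample data are hypothetical.

```python
def sentences_with_keywords(sentences, keywords):
    """Return the sentences that contain at least one keyword as a substring."""
    # any() stops at the first keyword found, so a sentence is kept
    # as soon as one keyword matches.
    return [s for s in sentences if any(k in s for k in keywords)]


# Hypothetical sample data, for illustration only.
sentences = [
    "the stock market rose today",
    "the weather is sunny",
    "investors watch the stock index",
]
keywords = {"stock", "fund"}
matched = sentences_with_keywords(sentences, keywords)
```

Substring matching needs no word segmentation, which fits the stated goal of avoiding segmentation errors, but it can also match a keyword that happens to occur inside a longer, unrelated word.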
103. Determine the position weight value of the sentence according to its position in the text to be tested.
The position weight value of a sentence reflects the importance of the sentence's position in the text. In an article, keywords related to the subject tend to appear in relatively prominent and important positions, such as the title, the first paragraph, or the closing paragraph. The position of a sentence in an article is therefore correlated with the subject of the article. In the embodiments of the present invention, the position weight value derived from this correlation can be used to indicate the degree of correlation between the sentence and the subject.
It should be noted that the embodiments of the present invention do not limit how the position weight value is chosen: it may be obtained by a fixed algorithm or set manually based on experience.
104. Calculate the subject membership degree of the text to be tested according to the position weight values of the sentences and the number of sentences containing keywords.
Once the position weight values of the sentences and the number of sentences containing keywords have been determined, the sum of the position weight values of all sentences containing keywords can be obtained by accumulation. After the position weight values of all sentences in the text to be tested have been calculated, the proportion of the keyword sentences' weight sum within the weight sum of all sentences can be derived. This proportion is the subject membership degree of the text to be tested relative to the subject. The higher the proportion, the more closely the subject of the text to be tested is related to the tested subject.
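A small worked example may make the proportion concrete. The numbers are made up for illustration and are not taken from the patent: suppose the text to be tested has four sentences with position weight values 4, 2, 1 and 1, and only the first and third sentences contain a subject keyword.

```latex
\[
\text{membership degree}
= \frac{4 + 1}{4 + 2 + 1 + 1}
= \frac{5}{8}
= 0.625
\]
```

If the keyword had instead appeared only in the two sentences of weight 1, the result would be $2/8 = 0.25$, even though the number of keyword sentences is the same. This is how position, and not just frequency, enters the result.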
As can be seen from the above implementation, the method adopted by the embodiment of the present invention first segments the text to be tested into sentences, checks each sentence for subject keywords, and records the number of sentences containing subject keywords. It then determines the position weight value of each such sentence from its position in the text to be tested, and obtains the subject membership degree of the text as the ratio of the sum of the position weight values of the keyword-containing sentences to the sum of the position weight values of all sentences. Compared with the prior art, the present invention expresses the degree of correlation between the text to be tested and the subject through the size of this ratio, which avoids the overly absolute conclusions of a binary (yes/no) judgment. In addition, because the membership degree is computed from the position weight values of sentences, the positions at which keywords appear in the text are quantified and incorporated into the calculation. This reduces the analysis errors caused by the noise described in the background section and thereby improves the accuracy of membership judgments.
To explain the proposed method in more detail, an embodiment is described below through a concrete implementation. As shown in Fig. 2, the steps of the method for calculating the subject membership degree of a text are:
201. Perform sentence segmentation on the text to be tested to obtain a sentence list.
Sentence segmentation of the text to be tested is simple and easy compared with word segmentation: the sentences are obtained simply by breaking the text at fixed punctuation marks. In Chinese punctuation, the end of a sentence is usually marked with a full stop, question mark, exclamation mark, or the like. In this embodiment, these punctuation marks can therefore be selected in advance, and the text to be tested is then compared byte by byte. When a byte in the text is determined to be one of the preset punctuation marks, the content between that point and the previous sentence break is extracted as one sentence and stored in the sentence list.
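The scan described above can be sketched as follows. This is a minimal illustration under stated assumptions: the terminator set `SENTENCE_END` and the name `split_sentences` are hypothetical, the scan walks Unicode characters rather than raw bytes (Python strings are sequences of code points), and each sentence keeps its closing punctuation mark, which the text does not specify.

```python
# Assumed terminator set: Chinese full stop, question mark, and exclamation
# mark, plus the ASCII question and exclamation marks.
SENTENCE_END = "。？！?!"


def split_sentences(text, terminators=SENTENCE_END):
    """Split text into sentences at any of the preset punctuation marks."""
    sentences = []
    start = 0  # index just after the previous sentence break
    for i, ch in enumerate(text):
        if ch in terminators:
            piece = text[start:i + 1].strip()
            if piece:
                sentences.append(piece)
            start = i + 1
    # Keep any trailing text that has no closing punctuation mark.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "今天股市上涨。投资者高兴吗？是的！"
result = split_sentences(sample)
```

The ASCII full stop is deliberately left out of the assumed terminator set, since it would also split decimal numbers such as "3.5".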
202. According to a preset subject keyword set, search the sentence list for sentences that contain a keyword from the subject keyword set.
Before this step is performed, a number of subject keywords should first be obtained. These subject keywords should be chosen according to a ranking of their correlation with the subject: given a fixed number of keywords, those most strongly correlated with the subject should be selected. Selecting highly correlated subject keywords also improves the accuracy of the subject membership judgment.
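Selecting a fixed number of keywords by correlation ranking amounts to a top-k selection. The sketch below assumes that a correlation score per candidate keyword is already available; the text does not say how such scores are obtained, so the `scores` dictionary and the name `select_keywords` are purely hypothetical.

```python
import heapq


def select_keywords(scores, k):
    """Return the k candidate keywords with the highest correlation scores."""
    top = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return [word for word, _ in top]


# Hypothetical correlation scores, for illustration only.
scores = {"stock": 0.92, "fund": 0.81, "market": 0.40, "today": 0.05}
top_keywords = select_keywords(scores, 2)
```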
After the subject keywords have been determined, the sentences in the sentence list obtained in step 201 must be screened to select those containing subject keywords. Specifically, a sentence is selected from the sentence list and segmented into words, yielding the words that make up the sentence. These words are then matched against all the subject keywords; if a word is identical to a keyword, the sentence is recorded as a sentence containing a subject keyword. There may be multiple subject keywords, but the main purpose of this step is to find the sentences that contain subject keywords, regardless of how many subject keywords a sentence contains. Therefore, when matching, it is not necessary to match every word in the sentence against every subject keyword one by one: as soon as a word is found to be identical to a subject keyword, the remaining matching for that sentence is terminated and the sentence is recorded directly as a sentence containing a subject keyword. Any existing word segmentation method may be used to segment the sentence; the details are not repeated here.
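The early-termination rule can be sketched as below. The names `contains_keyword` and `find_keyword_sentences` are hypothetical, and whitespace splitting (`str.split`) is only a stand-in tokenizer for the English sample; Chinese text would need a real word segmenter, which the text leaves to existing methods.

```python
def contains_keyword(tokens, keywords):
    """Return True as soon as one token equals a subject keyword."""
    for token in tokens:
        if token in keywords:
            return True  # stop here: remaining tokens are not examined
    return False


def find_keyword_sentences(sentences, keywords, tokenize):
    """Return the indices of the sentences that contain a subject keyword."""
    keyword_set = set(keywords)  # set membership avoids a scan per token
    return [
        index
        for index, sentence in enumerate(sentences)
        if contains_keyword(tokenize(sentence), keyword_set)
    ]


# Hypothetical sample data; str.split is a placeholder tokenizer.
sentences = ["the stock fund rose", "sunny weather today", "a new fund opened"]
hits = find_keyword_sentences(sentences, ["stock", "fund"], str.split)
```

Recording indices rather than the sentences themselves keeps each hit tied to its place in the sentence list, which the later position-weight step relies on.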
203. Determine the position weight value of the sentence according to its position in the text to be tested.
Sentence positions in an article can be roughly divided into title positions, first-paragraph positions, and paragraph head and tail sentence positions. The position weight values for the different positions are set according to the probability that subject keywords appear there. In general, the position weight values in descending order are the title sentence weight value, the first-paragraph sentence weight value, the head-and-tail sentence weight value, and the common sentence weight value.
The position of a sentence in the text to be tested can be marked during sentence segmentation, with the position determined from fixed marker characters. For example, title sentences of different levels can be distinguished by the style of the text; a carriage return character identifies the sentence after it as the first sentence of a paragraph and the sentence before it as the last sentence of a paragraph; and first-paragraph sentences are the sentences between the title sentence and the first carriage return character after it. Sentence positions can be marked using this strategy, although the embodiments of the present invention do not limit how sentence positions are marked. The main purpose is to assign different position weight values to sentences according to their positions.
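One way to realize this marking strategy is sketched below. It rests on several assumptions that are not in the original: the first non-empty line is the title, each later line is one paragraph, a sentence in the first paragraph is labeled as a first-paragraph sentence even if it is also a head or tail sentence, and the values in `WEIGHTS` are illustrative only. The label strings reuse the subscripts of the formulas in this document.

```python
import re

TITLE, FIRST_PARA, HEAD_TAIL, COMMON = "title", "para-first", "first-last", "common"

# Illustrative weights only; the text requires just the descending order.
WEIGHTS = {TITLE: 4.0, FIRST_PARA: 3.0, HEAD_TAIL: 2.0, COMMON: 1.0}

# A sentence is a run of non-terminator characters plus an optional terminator.
SENTENCE_RE = re.compile(r"[^。？！.?!]+[。？！.?!]?")


def label_positions(text):
    """Return (sentence, position label) pairs for a title-plus-paragraphs text."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    if not lines:
        return []
    labeled = [(lines[0], TITLE)]
    for para_index, paragraph in enumerate(lines[1:]):
        sentences = [s.strip() for s in SENTENCE_RE.findall(paragraph) if s.strip()]
        last = len(sentences) - 1
        for sent_index, sentence in enumerate(sentences):
            if para_index == 0:
                position = FIRST_PARA
            elif sent_index in (0, last):
                position = HEAD_TAIL
            else:
                position = COMMON
            labeled.append((sentence, position))
    return labeled


sample = (
    "Stock market report\n"
    "Stocks rose today. Volume was high.\n"
    "Banks led the gains. Tech lagged behind. Analysts expect more."
)
labeled = label_positions(sample)
position_weights = [WEIGHTS[position] for _, position in labeled]
```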
204. Accumulate the position weight values of the sentences containing keywords to obtain the subject sentence weight value of the text to be tested.
The subject sentence weight value is the sum of the position weight values of the sentences containing keywords. To calculate it, the number of sentences with each position weight value can be counted, and the products of each position weight value and its count are then summed. The subject sentence weight value shows how frequently the subject keywords occur in the key positions of the text to be tested, which gives a relatively intuitive view of how close the subject of the text is to the subject expressed by the subject keywords.
The calculation can follow the formula below:
$$B = N_{\text{title}} \cdot \mathrm{Weight}_{\text{title}} + N_{\text{first-last}} \cdot \mathrm{Weight}_{\text{first-last}} + N_{\text{para-first}} \cdot \mathrm{Weight}_{\text{para-first}} + N_{\text{common}} \cdot \mathrm{Weight}_{\text{common}}$$
where $B$ is the subject sentence weight value; $N_{\text{title}}$ is the number of title sentences containing keywords and $\mathrm{Weight}_{\text{title}}$ is the title sentence weight value; $N_{\text{first-last}}$ is the number of head and tail sentences containing keywords and $\mathrm{Weight}_{\text{first-last}}$ is the head-and-tail sentence weight value; $N_{\text{para-first}}$ is the number of first-paragraph sentences containing keywords and $\mathrm{Weight}_{\text{para-first}}$ is the first-paragraph sentence weight value; $N_{\text{common}}$ is the number of common sentences containing keywords and $\mathrm{Weight}_{\text{common}}$ is the common sentence weight value.
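The count-times-weight form of the formula can be sketched as follows. The weights and counts are made-up values for illustration, and `weighted_sum` is a hypothetical helper name.

```python
# Illustrative weights only; the text does not prescribe concrete values.
WEIGHTS = {"title": 4.0, "para-first": 3.0, "first-last": 2.0, "common": 1.0}


def weighted_sum(counts, weights=WEIGHTS):
    """Sum count * weight over all position types; missing counts are zero."""
    return sum(counts.get(position, 0) * weight for position, weight in weights.items())


# Hypothetical counts of keyword-containing sentences per position type.
keyword_counts = {"title": 1, "para-first": 2, "first-last": 3, "common": 4}
B = weighted_sum(keyword_counts)  # 1*4 + 2*3 + 3*2 + 4*1
```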
205. Calculate the weight total value of the text to be tested.
The weight total value is the sum of the position weight values of all sentences in the text to be tested.
The calculation formula is as follows:
$$B_{\text{all}} = N_{\text{title-all}} \cdot \mathrm{Weight}_{\text{title}} + N_{\text{first-last-all}} \cdot \mathrm{Weight}_{\text{first-last}} + N_{\text{para-first-all}} \cdot \mathrm{Weight}_{\text{para-first}} + N_{\text{common-all}} \cdot \mathrm{Weight}_{\text{common}}$$
where $B_{\text{all}}$ is the weight total value, $N_{\text{title-all}}$ is the number of all title sentences, $N_{\text{first-last-all}}$ is the number of all head and tail sentences, $N_{\text{para-first-all}}$ is the number of all first-paragraph sentences, and $N_{\text{common-all}}$ is the number of all common sentences.
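When each sentence already carries a position label, the per-position counts in this formula can be derived with a counter. The sketch below is illustrative only: the weights are made up, and `weight_total` is a hypothetical name.

```python
from collections import Counter

# Illustrative weights only; the text does not prescribe concrete values.
WEIGHTS = {"title": 4.0, "para-first": 3.0, "first-last": 2.0, "common": 1.0}


def weight_total(positions, weights=WEIGHTS):
    """Compute the weight total value from the position labels of all sentences."""
    counts = Counter(positions)  # position label -> number of sentences
    return sum(counts[position] * weights[position] for position in counts)


# Hypothetical text: 1 title, 3 first-paragraph, 6 head/tail, 10 common sentences.
positions = ["title"] + ["para-first"] * 3 + ["first-last"] * 6 + ["common"] * 10
B_all = weight_total(positions)  # 1*4 + 3*3 + 6*2 + 10*1
```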
206. Calculate the quotient of the subject sentence weight value and the weight total value to obtain the subject membership degree of the text to be tested.
The subject membership degree of the text to be tested is a coefficient measuring how similar the subject or central idea of the text is to the tested subject. The subject sentence weight value reflects the amount of content in the text that is related to the tested subject, so the quotient of the subject sentence weight value and the weight total value gives the proportion of the text's total content that is related to the tested subject. This proportion, $B / B_{\text{all}}$, is the subject membership degree of the text to be tested.
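Steps 204 to 206 can be combined into one small function. This is a sketch under assumptions: each sentence is represented as a hypothetical `(position label, contains keyword)` pair, the weights are illustrative, and returning 0.0 for an empty text is a choice made here, not something the text specifies.

```python
# Illustrative weights only; the text does not prescribe concrete values.
WEIGHTS = {"title": 4.0, "para-first": 3.0, "first-last": 2.0, "common": 1.0}


def subject_membership(sentences, weights=WEIGHTS):
    """Return B / B_all for a list of (position, has_keyword) pairs."""
    b = sum(weights[pos] for pos, has_keyword in sentences if has_keyword)
    b_all = sum(weights[pos] for pos, _ in sentences)
    # Guard against an empty text (assumed behavior).
    return b / b_all if b_all else 0.0


# Hypothetical labeled text, for illustration only.
doc = [
    ("title", True),
    ("para-first", True),
    ("para-first", False),
    ("first-last", False),
    ("common", True),
    ("common", False),
]
degree = subject_membership(doc)  # B = 4 + 3 + 1 = 8, B_all = 14
```

The result is a graded value that a caller can compare against a threshold of its own choosing, rather than a fixed yes/no answer.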
Further, as an implementation of the above method, an embodiment of the present invention also provides a device for calculating the subject membership degree of a text, as shown in Fig. 3. This device embodiment corresponds to the foregoing method embodiment. For ease of reading, the device embodiment does not repeat the details of the method embodiment one by one, but it should be clear that the device in this embodiment can implement everything in the method embodiment. The device includes:
a sentence segmentation unit 31, configured to perform sentence segmentation on the text to be tested to obtain a sentence list;
a searching unit 32, configured to search, according to a preset subject keyword set, the sentence list obtained by the sentence segmentation unit 31 for sentences that contain a keyword from the subject keyword set;
a determining unit 33, configured to determine the position weight value of each sentence found by the searching unit 32 according to its position in the text to be tested;
a computing unit 34, configured to calculate the subject membership degree of the text to be tested according to the position weight values determined by the determining unit 33 and the number of sentences containing keywords found by the searching unit 32.
Further, as shown in Fig. 4, the determining unit 33 of the device includes:
a first determining module 331, configured to determine the position of the sentence in the text to be tested, the position being one of title, first paragraph, head or tail sentence, and other common positions;
a second determining module 332, configured to determine the position weight value corresponding to the sentence according to the position determined by the first determining module 331, the position weight values being ordered by their degree of correlation with the subject and including the title sentence weight value, the first-paragraph sentence weight value, the head-and-tail sentence weight value, and the common sentence weight value.
Further, as shown in Fig. 4, the searching unit 32 of the device includes:
a word segmentation module 321, configured to segment the sentences in the sentence list into words;
a matching module 322, configured to match the words obtained by the word segmentation module 321 against the keywords;
a recording module 323, configured to record the sentence as a sentence containing a keyword when the matching module 322 finds a match.
Further, the matching module 322 is configured to match the words one by one against the keywords in the subject keyword set, and to stop matching the other words in a sentence once one word in the sentence has matched a keyword.
Further, as shown in Fig. 4, the computing unit 34 includes:
a first computing module 341, configured to accumulate the position weight values of the sentences containing keywords to obtain the subject sentence weight value of the text to be tested;
a second computing module 342, configured to calculate the weight total value of the text to be tested, the weight total value being the sum of the position weight values of all sentences;
a third computing module 343, configured to calculate the quotient of the subject sentence weight value obtained by the first computing module 341 and the weight total value obtained by the second computing module 342, to obtain the subject membership degree of the text to be tested.
Further, the sentence segmentation unit 31 of the device is also configured to perform sentence segmentation on the text to be tested according to predetermined punctuation marks.
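The division of the device into units can be mirrored in software by composing four small callables. The sketch below is only one possible reading: the class name `MembershipDevice`, its field names, and the toy lambdas in the usage example are all hypothetical, and for simplicity the weighing step is applied to every sentence so that the weight total value is available.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class MembershipDevice:
    split: Callable[[str], List[str]]              # role of segmentation unit 31
    find: Callable[[Sequence[str]], List[bool]]    # role of searching unit 32
    weigh: Callable[[Sequence[str]], List[float]]  # role of determining unit 33

    def compute(self, text: str) -> float:
        """Role of computing unit 34: keyword weight sum over total weight sum."""
        sentences = self.split(text)
        flags = self.find(sentences)
        weights = self.weigh(sentences)
        total = sum(weights)
        hit = sum(w for w, flag in zip(weights, flags) if flag)
        return hit / total if total else 0.0


# Toy stand-ins for the three units, for illustration only.
device = MembershipDevice(
    split=lambda text: [s.strip() for s in text.split(".") if s.strip()],
    find=lambda sentences: ["stock" in s for s in sentences],
    weigh=lambda sentences: [2.0 if i == 0 else 1.0 for i in range(len(sentences))],
)
degree = device.compute("stock prices rose. rain fell. stock funds grew.")
```

Passing the units in as callables keeps each one replaceable, for example swapping in a different matching method without touching the calculation.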
In summary, the method and device for calculating the subject membership degree of a text adopted by the embodiments of the present invention first segment the text to be tested into sentences, check each sentence for subject keywords, and record the number of sentences containing subject keywords. They then determine the position weight value of each such sentence from its position in the text to be tested, and obtain the subject membership degree of the text as the ratio of the sum of the position weight values of the keyword-containing sentences to the sum of the position weight values of all sentences. Compared with the prior art, the present invention expresses the degree of correlation between the text to be tested and the subject through the size of this ratio, which avoids the overly absolute conclusions of a binary (yes/no) judgment. In addition, because the membership degree is computed from the position weight values of sentences, the positions at which keywords appear in the text are quantified and incorporated into the calculation. This reduces the analysis errors caused by the noise described in the background section and thereby improves the accuracy of membership judgments.
The device for calculating the subject membership degree of a text includes a processor and a memory. The sentence segmentation unit, searching unit, determining unit, computing unit, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program units from the memory. One or more kernels may be provided, and the subject membership degree of the text under test relative to the preset subject is calculated by adjusting kernel parameters, thereby improving the accuracy of the subject membership judgment.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: performing sentence segmentation on the text to be tested to obtain a sentence list; searching the sentence list, according to a preset subject keyword set, for sentences that contain a keyword from the subject keyword set; determining the position weight value of each such sentence according to its position in the text to be tested; and calculating the subject membership degree of the text to be tested according to the position weight values of the sentences and the number of sentences containing keywords.
Those skilled in the art it should be appreciated that embodiments herein can be provided as method, system, Or computer program.Therefore, the application can be implemented using complete hardware embodiment, complete software Example or with reference to the form of the embodiment in terms of software and hardware.And, the application can be adopted at one Or it is multiple wherein include computer usable program code computer-usable storage medium (including but not Be limited to magnetic disc store, CD-ROM, optical memory etc.) on the computer program implemented Form.
The application is with reference to the method according to the embodiment of the present application, equipment (system) and computer program The flow chart and/or block diagram of product is describing.It should be understood that can be realized flowing by computer program instructions In each flow process and/or square frame and flow chart and/or block diagram in journey figure and/or block diagram Flow process and/or square frame combination.Can provide these computer program instructions to all-purpose computer, specially With the processor of computer, Embedded Processor or other programmable data processing devices producing one Machine so that produced by the instruction of computer or the computing device of other programmable data processing devices It is raw to be used to realize in one flow process of flow chart or one square frame of multiple flow processs and/or block diagram or multiple sides The device of the function of specifying in frame.
These computer program instructions may be alternatively stored in can guide computer or other programmable datas to process In the computer-readable memory that equipment works in a specific way so that be stored in the computer-readable and deposit Instruction in reservoir is produced and includes the manufacture of command device, and command device realization is in flow chart one The function of specifying in flow process or one square frame of multiple flow processs and/or block diagram or multiple square frames.
These computer program instructions can also be loaded into computer or other programmable data processing devices On so that series of operation steps is performed on computer or other programmable devices to produce computer The process of realization, so as to the instruction performed on computer or other programmable devices is provided for realizing Specify in one flow process of flow chart or one square frame of multiple flow processs and/or block diagram or multiple square frames The step of function.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The above are merely embodiments of the present application and are not intended to limit it. Those skilled in the art may make various modifications and variations to the application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the application shall fall within the scope of its claims.

Claims (10)

1. A method for calculating the topic membership degree of a text, characterized in that the method comprises:
splitting a text to be measured into sentences to obtain a sentence list;
searching the sentence list, according to a preset topic keyword set, for sentences that contain a keyword from the topic keyword set;
determining a position weight value for each sentence according to the position of the sentence in the text to be measured;
calculating the topic membership degree of the text to be measured according to the position weight values of the sentences and the number of sentences that contain a keyword.
2. The method according to claim 1, characterized in that determining the position weight value of the sentence according to the position of the sentence in the text to be measured comprises:
determining the position of the sentence in the text to be measured, the position being one of a title sentence position, a first-paragraph sentence position, a head-and-tail sentence position, and an ordinary sentence position;
determining the position weight value corresponding to the sentence according to the position of the sentence in the text to be measured, wherein the position weight values are set according to how closely each position relates to the topic and comprise a title sentence weight value, a first-paragraph sentence weight value, a head-and-tail sentence weight value, and an ordinary sentence weight value.
3. The method according to claim 2, characterized in that searching the sentence list, according to the preset topic keyword set, for sentences that contain a keyword from the topic keyword set comprises:
performing word segmentation on the sentences in the sentence list;
matching the segmented words against the keywords;
if a match succeeds, recording the sentence as a sentence containing a keyword.
4. The method according to claim 3, characterized in that matching the segmented words against the keywords comprises:
matching the segmented words one by one against the keywords in the topic keyword set;
once a segmented word in the sentence matches a keyword, no longer matching the other segmented words in that sentence.
5. The method according to any one of claims 1-4, characterized in that calculating the topic membership degree of the text to be measured according to the position weight values of the sentences and the number of sentences that contain a keyword comprises:
accumulating the position weight values of the sentences that contain a keyword to obtain a topic sentence weight value of the text to be measured;
calculating a total weight value of the text to be measured, the total weight value being the sum of the position weight values of all sentences;
calculating the quotient of the topic sentence weight value and the total weight value to obtain the topic membership degree of the text to be measured.
6. The method according to claim 1, characterized in that splitting the text to be measured into sentences to obtain the sentence list comprises:
splitting the text to be measured into sentences according to predetermined punctuation marks.
7. A device for calculating the topic membership degree of a text, characterized in that the device comprises:
a sentence splitting unit, configured to split a text to be measured into sentences to obtain a sentence list;
a searching unit, configured to search the sentence list obtained by the sentence splitting unit, according to a preset topic keyword set, for sentences that contain a keyword from the topic keyword set;
a determining unit, configured to determine the position weight value of each sentence according to the position, in the text to be measured, of the sentence found by the searching unit;
a calculating unit, configured to calculate the topic membership degree of the text to be measured according to the position weight values of the sentences determined by the determining unit and the number of sentences containing a keyword found by the searching unit.
8. The device according to claim 7, characterized in that the determining unit comprises:
a first determining module, configured to determine the position of the sentence in the text to be measured, the position comprising title, first paragraph, head-and-tail sentence, and other ordinary positions;
a second determining module, configured to determine the position weight value corresponding to the sentence according to the position of the sentence in the text to be measured as determined by the first determining module, wherein the position weight values are set according to how closely each position relates to the topic and comprise a title sentence weight value, a first-paragraph sentence weight value, a head-and-tail sentence weight value, and an ordinary sentence weight value.
9. The device according to claim 7 or 8, characterized in that the searching unit comprises:
a word segmentation module, configured to perform word segmentation on the sentences in the sentence list;
a matching module, configured to match the segmented words obtained by the word segmentation module against the keywords;
a recording module, configured to record the sentence as a sentence containing a keyword when the matching module matches successfully.
10. The device according to any one of claims 7-9, characterized in that the calculating unit comprises:
a first calculating module, configured to accumulate the position weight values of the sentences that contain a keyword to obtain a topic sentence weight value of the text to be measured;
a second calculating module, configured to calculate a total weight value of the text to be measured, the total weight value being the sum of the position weight values of all sentences;
a third calculating module, configured to calculate the quotient of the topic sentence weight value obtained by the first calculating module and the total weight value obtained by the second calculating module to obtain the topic membership degree of the text to be measured.
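
The steps recited in claims 1-6 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The weight values in `POSITION_WEIGHTS`, the punctuation set in `SENTENCE_END`, the reading of "head-and-tail sentence" as the first or last sentence of a later paragraph, and the regex-based `tokenize` stand-in for a real word segmenter are all assumptions made for illustration; every function name is hypothetical.

```python
import re

# Hypothetical weights: the claims only require that each position category
# has a weight reflecting how closely it relates to the topic.
POSITION_WEIGHTS = {"title": 4.0, "first_paragraph": 3.0,
                    "head_tail": 2.0, "ordinary": 1.0}

# Assumed set of "predetermined punctuation marks" (claim 6).
SENTENCE_END = re.compile(r"[。！？!?.;；]+")


def split_sentences(text):
    """Split text on the assumed punctuation marks and drop empty pieces."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def tokenize(sentence):
    """Stand-in for word segmentation: lowercase runs of word characters.
    Chinese text would need a real segmenter here."""
    return re.findall(r"\w+", sentence.lower())


def weighted_sentences(title, paragraphs):
    """Return (sentence, position weight) pairs for the whole text."""
    pairs = [(s, POSITION_WEIGHTS["title"]) for s in split_sentences(title)]
    for p_idx, paragraph in enumerate(paragraphs):
        sentences = split_sentences(paragraph)
        for s_idx, sentence in enumerate(sentences):
            if p_idx == 0:
                position = "first_paragraph"
            elif s_idx in (0, len(sentences) - 1):
                position = "head_tail"  # assumed meaning, see lead-in
            else:
                position = "ordinary"
            pairs.append((sentence, POSITION_WEIGHTS[position]))
    return pairs


def contains_keyword(sentence, keywords):
    """any() stops at the first matching token, mirroring claim 4."""
    return any(token in keywords for token in tokenize(sentence))


def topic_membership(title, paragraphs, keywords):
    """Weight of keyword sentences divided by weight of all sentences (claim 5)."""
    pairs = weighted_sentences(title, paragraphs)
    total = sum(weight for _, weight in pairs)
    if total == 0:
        return 0.0  # empty text: no sentences to score
    topic = sum(weight for s, weight in pairs if contains_keyword(s, keywords))
    return topic / total


title = "Football results"
paragraphs = [
    "The football match ended late. Fans went home.",
    "Weather was mild. The football team celebrated. "
    "Traffic was heavy. Shops stayed open.",
]
print(topic_membership(title, paragraphs, {"football"}))  # 0.5
```

In this example the keyword sentences carry weights 4 + 3 + 1 = 8 and all sentences carry 4 + 3 + 3 + 2 + 1 + 1 + 2 = 16, so the quotient is 0.5. Because the score is a ratio of weights, it stays in the range 0 to 1 regardless of the weight values chosen.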
CN201510680277.8A 2015-10-19 2015-10-19 Method and device for calculating text theme attribution degree Active CN106598997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510680277.8A CN106598997B (en) 2015-10-19 2015-10-19 Method and device for calculating text theme attribution degree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510680277.8A CN106598997B (en) 2015-10-19 2015-10-19 Method and device for calculating text theme attribution degree

Publications (2)

Publication Number Publication Date
CN106598997A true CN106598997A (en) 2017-04-26
CN106598997B CN106598997B (en) 2021-05-18

Family

ID=58555102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510680277.8A Active CN106598997B (en) 2015-10-19 2015-10-19 Method and device for calculating text theme attribution degree

Country Status (1)

Country Link
CN (1) CN106598997B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704763A (en) * 2017-09-04 2018-02-16 中国移动通信集团广东有限公司 Multi-source heterogeneous leak information De-weight method, stage division and device
CN109657202A (en) * 2017-10-10 2019-04-19 北京国双科技有限公司 The method and device of text-processing
CN111369294A (en) * 2020-03-06 2020-07-03 中国铁塔股份有限公司 Software cost estimation method and device
CN111581358A (en) * 2020-04-08 2020-08-25 北京百度网讯科技有限公司 Information extraction method and device and electronic equipment
CN111950037A (en) * 2020-08-25 2020-11-17 北京天融信网络安全技术有限公司 Detection method, detection device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN101315624A (en) * 2007-05-29 2008-12-03 阿里巴巴集团控股有限公司 Text subject recommending method and device
CN103136300A (en) * 2011-12-05 2013-06-05 北京百度网讯科技有限公司 Recommendation method and device of text related subject
CN103744953A (en) * 2014-01-02 2014-04-23 中国科学院计算机网络信息中心 Network hotspot mining method based on Chinese text emotion recognition
US20150341751A1 (en) * 2013-11-07 2015-11-26 a la mode technologies, inc. System and Method for Gathering Information about a Subject in Close Proximity to a User

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315624A (en) * 2007-05-29 2008-12-03 阿里巴巴集团控股有限公司 Text subject recommending method and device
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN103136300A (en) * 2011-12-05 2013-06-05 北京百度网讯科技有限公司 Recommendation method and device of text related subject
US20150341751A1 (en) * 2013-11-07 2015-11-26 a la mode technologies, inc. System and Method for Gathering Information about a Subject in Close Proximity to a User
CN103744953A (en) * 2014-01-02 2014-04-23 中国科学院计算机网络信息中心 Network hotspot mining method based on Chinese text emotion recognition

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704763A (en) * 2017-09-04 2018-02-16 中国移动通信集团广东有限公司 Multi-source heterogeneous leak information De-weight method, stage division and device
CN109657202A (en) * 2017-10-10 2019-04-19 北京国双科技有限公司 The method and device of text-processing
CN109657202B (en) * 2017-10-10 2022-10-28 北京国双科技有限公司 Text processing method and device
CN111369294A (en) * 2020-03-06 2020-07-03 中国铁塔股份有限公司 Software cost estimation method and device
CN111369294B (en) * 2020-03-06 2023-06-23 中国铁塔股份有限公司 Software cost estimation method and device
CN111581358A (en) * 2020-04-08 2020-08-25 北京百度网讯科技有限公司 Information extraction method and device and electronic equipment
CN111581358B (en) * 2020-04-08 2023-08-18 北京百度网讯科技有限公司 Information extraction method and device and electronic equipment
CN111950037A (en) * 2020-08-25 2020-11-17 北京天融信网络安全技术有限公司 Detection method, detection device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106598997B (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN105389349B (en) Dictionary update method and device
CN107229668B (en) Text extraction method based on keyword matching
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN106598997A (en) Method and device for computing membership degree of text subject
CN108920456B (en) Automatic keyword extraction method
CN103336766B (en) Short text garbage identification and modeling method and device
CN104199965B (en) Semantic information retrieval method
CN107463658B (en) Text classification method and device
CN104077407B (en) A kind of intelligent data search system and method
CN106570128A (en) Mining algorithm based on association rule analysis
CN106598999A (en) Method and device for calculating text theme membership degree
CN103309852A (en) Method for discovering compound words in specific field based on statistics and rules
CN107229627B (en) Text processing method and device and computing equipment
CN106446124A (en) Website classification method based on network relation graph
CN103324641B (en) Information record recommendation method and device
CN111860981B (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN114997288A (en) Design resource association method
CN114385775A (en) Sensitive word recognition method based on big data
CN107357794A (en) Optimize the method and apparatus of the data store organisation of key value database
CN106919576A (en) Using the method and device of two grades of classes keywords database search for application now
CN107992402A (en) Blog management method and log management apparatus
CN106815209B (en) Uygur agricultural technical term identification method
CN107861950A (en) The detection method and device of abnormal text
CN106934024A (en) A kind of data processing method and device
CN102789466A (en) Question title quality judgment method and device and question guiding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant