CN106598997B - Method and device for calculating text theme attribution degree - Google Patents

Info

Publication number
CN106598997B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510680277.8A
Other languages
Chinese (zh)
Other versions
CN106598997A (en)
Inventor
侯明午
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510680277.8A
Publication of CN106598997A
Application granted
Publication of CN106598997B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification

Abstract

The invention discloses a method and a device for calculating the topic attribution degree of a text, relates to the technical field of computers, and addresses the problem that attribution degree calculations have large errors when a topic keyword appearing in a text is irrelevant to the topic of the text. The main technical scheme of the invention is as follows: performing sentence segmentation on the text to be tested to obtain a sentence list; searching the sentence list, according to a preset topic keyword set, for sentences containing keywords from that set; determining a position weight value for each sentence according to its position in the text to be tested; and calculating the topic attribution degree of the text to be tested according to the position weight values of the sentences and the number of sentences containing the keywords. The method is mainly used for calculating the topic attribution degree of a text.

Description

Method and device for calculating text theme attribution degree
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for calculating text theme attribution degree.
Background
In the context of big data, relevant information extraction is an important issue. Information extraction techniques do not attempt to fully understand the entire document, but rather analyze portions of the document that contain relevant information. The topic content expressed by the article is determined by extracting the characteristic keywords in the article.
Most existing information extraction algorithms decide whether the content of an article belongs to a certain topic by checking whether the article contains feature keywords related to that topic. Using keyword occurrence as the feature captures the relevant information in the article fairly comprehensively, but the extracted information may contain a lot of noise, because not all words in the article are closely related to its subject. As a result, the final judgment of the article's topic may be the opposite of the correct one, which causes larger errors in subsequent analysis.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for calculating text topic attribution, and mainly aims to calculate the topic attribution of a text according to the positions and frequencies of topic keywords appearing in the text, thereby improving the accuracy of attribution judgment.
To achieve this purpose, the invention mainly provides the following technical solutions:
In one aspect, the present invention provides a method for calculating a text topic attribution degree, including:
carrying out sentence dividing processing on the text to be detected to obtain a sentence list;
searching sentences containing keywords in the topic keyword set in the sentence list according to a preset topic keyword set;
determining a position weight value of the sentence according to the position of the sentence in the text to be detected;
and calculating the topic attribution degree of the text to be detected according to the position weight value of the sentences and the number of the sentences containing the keywords.
In another aspect, the invention also provides a device for calculating a text topic attribution degree, which comprises:
the sentence dividing processing unit is used for carrying out sentence dividing processing on the text to be detected to obtain a sentence list;
a searching unit, configured to search, according to a preset topic keyword set, a sentence containing the keywords in the topic keyword set in the sentence list obtained by the sentence segmentation processing unit;
the determining unit is used for determining the position weight value of the sentence according to the position of the sentence searched by the searching unit in the text to be detected;
and the calculating unit is used for calculating the topic attribution degree of the text to be detected according to the position weight value of the sentence determined by the determining unit and the number of the sentences containing the keywords searched by the searching unit.
According to the method and the device for calculating the text topic attribution degree provided by the invention, after the text to be tested is divided into sentences, it is determined whether each sentence contains a topic keyword, and the number of sentences containing topic keywords is recorded. The position weight value of each such sentence is then determined according to its position in the text to be tested, and the topic attribution degree of the text is obtained as the ratio of the sum of the position weight values of the sentences containing keywords to the sum of the position weight values of all sentences. Compared with the prior art, the invention expresses the degree of correlation between the text and the topic through the size of this ratio, which avoids the overly absolute conclusion of a binary (belongs or does not belong) judgment. In addition, because the position at which a keyword appears is quantified and included in the calculation, the analysis error caused by the noise mentioned in the Background can be reduced, and the accuracy of the attribution judgment is improved.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a method for calculating text topic attribution according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for calculating text topic attribution according to an embodiment of the invention;
FIG. 3 is a block diagram illustrating an apparatus for calculating attribution of a text topic according to an embodiment of the present invention;
FIG. 4 is a block diagram illustrating another apparatus for calculating attribution of a text topic according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The embodiment of the invention provides a method for calculating a text topic attribution degree. As shown in FIG. 1, the specific steps are as follows:
101. Perform sentence segmentation on the text to be tested to obtain a sentence list.
The text to be tested in the embodiment of the invention is Chinese text, and texts that express a topic are generally long or multi-paragraph articles. Existing methods for analyzing an article's topic mainly perform word segmentation on the article and check whether the resulting words include keywords related to the topic. Existing word segmentation methods are relatively complex, and their accuracy still needs improvement. Therefore, to avoid analysis errors caused by word segmentation, the embodiment divides the text to be tested into a plurality of sentences through sentence segmentation and assembles them into a sentence list.
102. Search the sentence list for sentences containing keywords from a preset topic keyword set.
In addition to decomposing the text to be tested, a topic keyword set must be created; this set comprises a plurality of keywords related to the topic. After the sentence list of the text to be tested is obtained, the sentences in the list are matched against the keywords in the topic keyword set, and the sentences containing keywords are screened out.
When matching a sentence against the keywords, the sentence can be decomposed into words that are matched with the keywords one by one, or each keyword can be checked against the characters or words of the sentence in turn. This embodiment does not specifically limit the matching mode; the main aim is to screen out the sentences in the sentence list that contain keywords.
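The two matching modes described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names and the whitespace-based default segmenter are assumptions, and for Chinese text the caller would supply a real word segmenter.

```python
from typing import Callable, Iterable, List


def match_by_tokens(
    sentence: str,
    keywords: Iterable[str],
    segment: Callable[[str], List[str]] = str.split,  # assumed segmenter
) -> bool:
    """Decompose the sentence into words and compare each word with the keywords."""
    keyword_set = set(keywords)
    return any(token in keyword_set for token in segment(sentence))


def match_by_substring(sentence: str, keywords: Iterable[str]) -> bool:
    """Check each keyword directly against the characters of the sentence."""
    return any(keyword in sentence for keyword in keywords)
```

The two modes can disagree: substring matching finds the keyword "class" inside "classification", whereas token matching does not, so the choice affects how much noise the screening step lets through.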
103. Determine the position weight value of each sentence according to its position in the text to be tested.
The position weight value of a sentence expresses the importance of the sentence's position in the text. In an article, keywords related to the subject often appear in relatively prominent and important positions, such as the title, the first paragraph, or the last paragraph. The position of a sentence is therefore correlated with the topic of the article, and in the embodiment of the present invention the position weight value of a sentence can be used to represent the degree of correlation between the sentence and the topic.
It should be noted that the embodiment of the present invention does not limit how the position weight value is obtained: it may be computed by a fixed algorithm or set manually according to the user's experience.
104. Calculate the topic attribution degree of the text to be tested according to the position weight values of the sentences and the number of sentences containing the keywords.
After the position weight values of the sentences and the number of sentences containing keywords are determined, the position weight values of all sentences containing keywords are accumulated to obtain their sum. Once the position weight values of all sentences in the text have been calculated, the proportion of that sum in the sum of the position weight values of all sentences can be obtained. This proportion is the topic attribution degree of the text relative to the topic: the higher the proportion, the more closely the topic of the text is related to the test topic.
As can be seen from the foregoing implementation, in the method for calculating a text topic attribution degree adopted in the embodiment of the present invention, after the text to be tested is divided into sentences, it is determined whether each sentence contains a topic keyword, and the number of sentences containing topic keywords is recorded. The position weight value of each such sentence is then determined according to its position in the text, and the topic attribution degree is obtained as the ratio of the sum of the position weight values of the sentences containing keywords to the sum of the position weight values of all sentences. Compared with the prior art, the embodiment expresses the degree of correlation between the text and the topic through the size of this ratio, which avoids the overly absolute conclusion of a binary judgment. In addition, because the position at which a keyword appears is quantified and included in the calculation, the analysis error caused by the noise mentioned in the Background can be reduced, and the accuracy of the attribution judgment is improved.
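Steps 101 to 104 can be sketched end to end as below. This is only an illustration under stated assumptions: the set of sentence-ending marks, the substring matching, and the `simple_weight` rule (first and last sentences count double) are hypothetical choices, since the patent leaves the matching mode and the weight values open.

```python
import re
from typing import Callable, Iterable, List

SENTENCE_END = "。？！.?!"  # assumed set of sentence-ending marks


def split_sentences(text: str) -> List[str]:
    """Step 101: split the text into a sentence list at the end marks."""
    parts = re.split(f"[{re.escape(SENTENCE_END)}]", text)
    return [part.strip() for part in parts if part.strip()]


def simple_weight(index: int, count: int) -> float:
    """Step 103 (hypothetical rule): first and last sentences weigh 2, others 1."""
    return 2.0 if index in (0, count - 1) else 1.0


def attribution_degree(
    text: str,
    keywords: Iterable[str],
    weight_of: Callable[[int, int], float] = simple_weight,
) -> float:
    """Steps 102 and 104: weighted share of sentences that contain a keyword."""
    keyword_list = list(keywords)
    sentences = split_sentences(text)
    weights = [weight_of(i, len(sentences)) for i in range(len(sentences))]
    total = sum(weights)
    if total == 0:
        return 0.0  # empty text: no sentences to weigh
    hit = sum(
        weight
        for sentence, weight in zip(sentences, weights)
        if any(keyword in sentence for keyword in keyword_list)
    )
    return hit / total
```

For example, `attribution_degree("Cats are pets. Dogs bark. Cats purr. Birds fly.", ["Cats"])` weighs the four sentences as 2, 1, 1, 2 and returns 3/6 = 0.5, a graded value rather than a yes/no answer.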
To describe the method for calculating the text topic attribution degree in more detail, a specific implementation is described below. As shown in FIG. 2, the method includes the following steps:
201. Perform sentence segmentation on the text to be tested to obtain a sentence list.
Sentence segmentation is simpler than word segmentation: clauses are obtained just by breaking the text at fixed punctuation marks. In Chinese, the end of a sentence is usually indicated by a period, a question mark, an exclamation mark, or the like. Therefore, in this embodiment, the punctuation marks may be preset, and the text to be tested is then compared byte by byte; when a byte is determined to be one of the preset punctuation marks, the content between that mark and the end of the previous clause is extracted and stored in the sentence list as a clause.
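The scanning procedure of step 201 might look like the following sketch. It is written under two assumptions that are not in the original: the scan runs over decoded characters rather than raw bytes (a Chinese punctuation mark spans several bytes in UTF-8), and the default end marks are the full-width period, question mark, and exclamation mark.

```python
from typing import List


def split_into_sentences(text: str, end_marks: str = "。？！") -> List[str]:
    """Scan the text once and cut a clause at every preset end mark."""
    sentences: List[str] = []
    start = 0  # index just after the end of the previous clause
    for index, char in enumerate(text):
        if char in end_marks:
            clause = text[start:index + 1].strip()
            if clause:
                sentences.append(clause)
            start = index + 1
    # Keep any trailing text that has no closing punctuation mark.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Each clause keeps its closing mark, so the original text can be reassembled from the sentence list if needed.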
202. Search the sentence list for sentences containing keywords from a preset topic keyword set.
Before this step is executed, a plurality of topic keywords should be acquired. Candidate keywords are ranked by their degree of correlation with the topic, and, for a fixed number of keywords, those with the highest correlation are selected. Selecting highly correlated topic keywords improves the accuracy of the attribution judgment.
After the topic keywords are determined, the sentences in the sentence list from step 201 are filtered to select those containing topic keywords. Specifically, a sentence is taken from the sentence list and word segmentation is performed on it to obtain the segmented words that make up the sentence. The segmented words are then matched against all topic keywords, and if any of them is identical to a keyword, the sentence is recorded as a sentence containing a topic keyword. There may be several topic keywords, but the purpose of this step is only to find sentences that contain a topic keyword, regardless of how many topic keywords a sentence contains. It is therefore unnecessary to match every segmented word in the sentence against every topic keyword: as soon as one segmented word matches a topic keyword, the matching process for that sentence is terminated and the sentence is recorded as a sentence containing a topic keyword. Word segmentation of the sentence may use any existing method, and the specific process is not described here.
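The early-termination matching of step 202 can be sketched as follows. The function names are illustrative, and the whitespace-based default segmenter is a stand-in assumption; a real Chinese word segmenter would be passed in as `segment`.

```python
from typing import Callable, Iterable, List, Set


def contains_topic_keyword(
    sentence: str,
    keywords: Set[str],
    segment: Callable[[str], List[str]] = str.split,  # assumed segmenter
) -> bool:
    """Return True at the first segmented word that equals a topic keyword."""
    for word in segment(sentence):
        if word in keywords:
            return True  # stop here: remaining words need not be checked
    return False


def filter_sentences(
    sentences: Iterable[str],
    keywords: Iterable[str],
    segment: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Keep only the sentences that contain at least one topic keyword."""
    keyword_set = set(keywords)
    return [s for s in sentences if contains_topic_keyword(s, keyword_set, segment)]
```

Holding the keywords in a set makes each lookup constant time on average, so the cost per sentence is bounded by the number of words before the first hit.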
203. Determine the position weight value of each sentence according to its position in the text to be tested.
The positions of sentences in an article can be roughly divided into the title position, the first-paragraph position, the first-and-last-sentence position (the first or last sentence of a paragraph), and so on. Position weight values are set for the different positions according to the probability that topic keywords appear there; in general, from high to low, they are the title sentence weight value, the first-paragraph sentence weight value, the first-and-last sentence weight value, and the common sentence weight value.
The position of a sentence in the text to be tested may be determined by marking sentences during sentence segmentation and identifying positions from fixed marker characters. For example, title sentences of different levels may be distinguished by the styles of the text; using the line-break (carriage return) character, the sentence after a line break can be identified as a first sentence and the sentence before it as a last sentence; and first-paragraph sentences are those between the title sentence and the first line break that follows them. Sentence positions can be marked using the strategy above, but the embodiment of the present invention does not specifically limit how positions are marked. The main purpose is to assign different position weight values to sentences according to their positions.
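One possible labeling strategy along these lines is sketched below. The input layout is an assumption made for illustration: the title sits on the first line, each later line is one paragraph, and the position labels (`title`, `para_first`, `first_last`, `common`) are hypothetical names for the four categories.

```python
import re
from typing import List, Tuple


def label_positions(text: str) -> List[Tuple[str, str]]:
    """Label each sentence as title, para_first, first_last, or common."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    if not lines:
        return []
    labeled = [(lines[0], "title")]  # assumed: the first line is the title
    for paragraph_index, paragraph in enumerate(lines[1:]):
        # A sentence is a run of text up to and including one end mark.
        sentences = [
            s.strip()
            for s in re.findall(r"[^。？！]+[。？！]?", paragraph)
            if s.strip()
        ]
        for sentence_index, sentence in enumerate(sentences):
            if paragraph_index == 0:
                position = "para_first"  # every sentence of the first paragraph
            elif sentence_index in (0, len(sentences) - 1):
                position = "first_last"  # first or last sentence of a paragraph
            else:
                position = "common"
            labeled.append((sentence, position))
    return labeled
```

When a sentence qualifies for two categories, for instance the opening sentence of the first paragraph, this sketch gives it the higher-weighted label, which is a design choice rather than something the patent specifies.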
204. Accumulate the position weight values of the sentences containing the keywords to obtain the topic sentence weight value of the text to be tested.
The topic sentence weight value is the sum of the position weight values of the sentences containing keywords. In the calculation, the number of sentences sharing each position weight value can be counted, and the products of each position weight value and its count are then accumulated to obtain the topic sentence weight value. The topic sentence weight value reflects how often topic keywords appear at the key positions of the text to be tested, so it shows relatively intuitively how close the topic of the text is to the topic expressed by the topic keywords.
The specific calculation method can refer to the following calculation formula:
$$B = N_{\text{title}} \cdot \text{Weight}_{\text{title}} + N_{\text{first-last}} \cdot \text{Weight}_{\text{first-last}} + N_{\text{para-first}} \cdot \text{Weight}_{\text{para-first}} + N_{\text{common}} \cdot \text{Weight}_{\text{common}}$$
where $B$ is the topic sentence weight value; $N_{\text{title}}$ is the number of title sentences and $\text{Weight}_{\text{title}}$ is the title sentence weight value; $N_{\text{first-last}}$ is the number of first and last sentences and $\text{Weight}_{\text{first-last}}$ is the first-and-last sentence weight value; $N_{\text{para-first}}$ is the number of first-paragraph sentences and $\text{Weight}_{\text{para-first}}$ is the first-paragraph sentence weight value; $N_{\text{common}}$ is the number of common sentences and $\text{Weight}_{\text{common}}$ is the common sentence weight value.
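The count-times-weight accumulation can be sketched as below. The numeric weights are hypothetical placeholders chosen only to follow the high-to-low ordering described earlier; the patent does not prescribe values, and the position labels are illustrative names.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical weights: title > first paragraph > first/last sentence > common.
POSITION_WEIGHTS: Dict[str, float] = {
    "title": 4.0,
    "para_first": 3.0,
    "first_last": 2.0,
    "common": 1.0,
}


def topic_sentence_weight(
    positions: Iterable[str],
    weights: Dict[str, float] = POSITION_WEIGHTS,
) -> float:
    """Compute B: count the keyword sentences per position, then sum N * Weight."""
    counts = Counter(positions)
    return sum(count * weights[position] for position, count in counts.items())
```

Here `positions` holds one label per sentence that contains a keyword. Passing the labels of all sentences instead yields the total over the whole text with the same function.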
205. Calculate the total weight value of the text to be tested.
The total weight value is the sum of the position weight values of all sentences in the text to be tested.
The specific calculation formula is as follows:
$$B_{\text{all}} = N_{\text{title-all}} \cdot \text{Weight}_{\text{title}} + N_{\text{first-last-all}} \cdot \text{Weight}_{\text{first-last}} + N_{\text{para-first-all}} \cdot \text{Weight}_{\text{para-first}} + N_{\text{common-all}} \cdot \text{Weight}_{\text{common}}$$
where $B_{\text{all}}$ is the total weight value; $N_{\text{title-all}}$ is the number of all title sentences; $N_{\text{first-last-all}}$ is the number of all first and last sentences; $N_{\text{para-first-all}}$ is the number of all first-paragraph sentences; and $N_{\text{common-all}}$ is the number of all common sentences.
206. Calculate the quotient of the topic sentence weight value and the total weight value to obtain the topic attribution degree of the text to be tested.
The topic attribution degree of the text to be tested is a similarity coefficient between the topic or central idea of the text and the test topic. The topic sentence weight value is a parameter reflecting the content of the text that is related to the test topic. Dividing the topic sentence weight value by the total weight value therefore gives the proportion of topic-related content in the total content of the text, that is, the topic attribution degree of the text to be tested, which is the value of $B / B_{\text{all}}$.
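As a worked illustration of steps 204 to 206, assume hypothetical weight values of 4 (title), 2 (first and last sentences), 3 (first paragraph), and 1 (common), and a text with 1 title sentence, 6 first and last sentences, 3 first-paragraph sentences, and 10 common sentences, of which 1, 1, 2, and 2 respectively contain a keyword. None of these numbers come from the patent.

```latex
\begin{aligned}
B_{\text{all}} &= 1 \cdot 4 + 6 \cdot 2 + 3 \cdot 3 + 10 \cdot 1 = 35 \\
B &= 1 \cdot 4 + 1 \cdot 2 + 2 \cdot 3 + 2 \cdot 1 = 14 \\
B / B_{\text{all}} &= 14 / 35 = 0.4
\end{aligned}
```

Only 6 of the 20 sentences (30%) contain a keyword, yet the attribution degree is 0.4, because the hits are concentrated in the highly weighted title and first-paragraph positions.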
Further, as an implementation of the foregoing method, an embodiment of the present invention further provides a device for calculating a text topic attribution degree, as shown in fig. 3, where the embodiment of the device corresponds to the foregoing method embodiment, and for convenience of reading, details in the foregoing method embodiment are not repeated in this device embodiment again, but it should be clear that the device in this embodiment can correspondingly implement all the contents in the foregoing method embodiment. The device includes:
a sentence dividing processing unit 31, configured to perform sentence dividing processing on the text to be detected, so as to obtain a sentence list;
a searching unit 32, configured to search, according to a preset topic keyword set, a sentence containing the keywords in the topic keyword set in the sentence list obtained by the sentence segmentation processing unit 31;
a determining unit 33, configured to determine a position weight value of the sentence according to the position of the sentence found by the searching unit 32 in the text to be detected;
a calculating unit 34, configured to calculate a topic attribution degree of the text to be tested according to the position weight value of the sentence determined by the determining unit 33 and the number of the sentences containing the keywords searched by the searching unit 32.
Further, as shown in fig. 4, the determination unit 33 of the apparatus includes:
a first determining module 331, configured to determine a position of the sentence in the text to be tested, where the position includes a title, a first paragraph, a first or last sentence, and other common positions;
a second determining module 332, configured to determine a position weight value corresponding to the sentence according to the position of the sentence in the text to be tested, where the position weight value is set according to the degree of correlation between the position and the topic, and includes a title sentence weight value, a first-paragraph sentence weight value, a first-and-last sentence weight value, and a common sentence weight value.
Further, as shown in fig. 4, the search unit 32 of the apparatus includes:
a word segmentation module 321, configured to perform word segmentation on the sentences in the sentence list;
a matching module 322, configured to match the word obtained by the word segmentation module 321 with the keyword;
a recording module 323, configured to record the sentence as a sentence with keywords when the matching module 322 succeeds in matching.
Further, the matching module 322 is configured to match the segmented words with the keywords in the topic keyword set one by one, and once a segmented word in the sentence is successfully matched with a keyword, the other segmented words in the sentence are no longer matched.
Further, as shown in fig. 4, the calculating unit 34 includes:
a first calculating module 341, configured to accumulate position weight values of sentences containing the keywords to obtain a subject sentence weight value of the text to be tested;
the second calculating module 342 is configured to calculate a total weight value of the text to be tested, where the total weight value is a sum of position weight values of all sentences;
the third calculating module 343 is configured to calculate a quotient between the weight value of the topic sentence obtained by the first calculating module 341 and the total weight value obtained by the second calculating module 342, so as to obtain the topic attribution degree of the text to be detected.
Further, the sentence segmentation processing unit 31 of the device is further configured to perform sentence segmentation processing on the text to be tested according to a predetermined punctuation mark.
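The division of the device into units can be mirrored in code as in the sketch below, where each unit is a pluggable callable. The class name, field names, and the sample lambdas are all hypothetical; the patent describes functional units, not this interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class AttributionDevice:
    split: Callable[[str], List[str]]             # sentence segmentation unit 31
    has_keyword: Callable[[str, Set[str]], bool]  # searching unit 32
    weight_at: Callable[[int, int], float]        # determining unit 33

    def degree(self, text: str, keywords: Set[str]) -> float:
        """Calculating unit 34: topic weight divided by total weight."""
        sentences = self.split(text)
        weights = [self.weight_at(i, len(sentences)) for i in range(len(sentences))]
        total = sum(weights)  # module 342: total weight value
        hit = sum(            # module 341: topic sentence weight value
            weight
            for sentence, weight in zip(sentences, weights)
            if self.has_keyword(sentence, keywords)
        )
        return hit / total if total else 0.0  # module 343: quotient


# Illustrative wiring with simple stand-in units.
device = AttributionDevice(
    split=lambda text: [s.strip() for s in text.split(".") if s.strip()],
    has_keyword=lambda sentence, kws: any(w in kws for w in sentence.split()),
    weight_at=lambda index, count: 2.0 if index == 0 else 1.0,
)
```

Keeping the units as separate callables lets the segmentation rule, the matching mode, or the weighting scheme be swapped without touching the calculation.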
In summary, with the method and apparatus for calculating a text topic attribution degree adopted in the embodiments of the present invention, after the text to be tested is divided into sentences, it is determined whether each sentence contains a topic keyword, and the number of sentences containing topic keywords is recorded. The position weight value of each such sentence is then determined according to its position in the text, and the topic attribution degree is obtained as the ratio of the sum of the position weight values of the sentences containing keywords to the sum of the position weight values of all sentences. Compared with the prior art, the embodiments express the degree of correlation between the text and the topic through the size of this ratio, which avoids the overly absolute conclusion of a binary judgment. In addition, because the position at which a keyword appears is quantified and included in the calculation, the analysis error caused by the noise mentioned in the Background can be reduced, and the accuracy of the attribution judgment is improved.
The device for calculating the text topic attribution degree comprises a processor and a memory. The sentence segmentation processing unit, the searching unit, the determining unit, the calculating unit, and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. The kernel can be set to be one or more than one, and the topic attribution degree of the test text relative to the preset topic is calculated by adjusting the kernel parameters, so that the accuracy of judging the topic attribution degree is improved.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: performing sentence segmentation on the text to be tested to obtain a sentence list; searching the sentence list, according to a preset topic keyword set, for sentences containing keywords from that set; determining a position weight value of each sentence according to its position in the text to be tested; and calculating the topic attribution degree of the text to be tested according to the position weight values of the sentences and the number of sentences containing the keywords.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (9)

1. A method for calculating a text topic attribution degree, the method comprising:
performing sentence splitting on a text to be detected to obtain a sentence list;
searching the sentence list, according to a preset topic keyword set, for sentences containing keywords in the topic keyword set;
determining a position weight value of the sentence according to the position of the sentence in the text to be detected;
calculating the topic attribution degree of the text to be detected according to the position weight value of the sentences and the number of the sentences containing the keywords;
wherein searching the sentence list, according to the preset topic keyword set, for sentences containing keywords in the topic keyword set comprises:
performing word segmentation on the sentences in the sentence list;
matching the segmented words against the keywords in the topic keyword set one by one;
when a segmented word in the sentence successfully matches a keyword, performing no matching on the other segmented words in the sentence, and recording the sentence as a sentence containing the keywords;
wherein calculating the topic attribution degree of the text to be detected according to the position weight value of the sentences and the number of the sentences containing the keywords comprises:
counting the number of sentences having the same position weight value, and accumulating the products of the different position weight values and the corresponding numbers to obtain a topic sentence weight value.
2. The method according to claim 1, wherein determining the position weight value of the sentence according to the position of the sentence in the text to be detected comprises:
determining the position of the sentence in the text to be detected, wherein the position comprises a title sentence position, a first and last sentence position and a common sentence position;
and determining a position weight value corresponding to the sentence according to the position of the sentence in the text to be detected, wherein the position weight value is set according to the degree of correlation between the position and the topic, and comprises a title sentence weight value, a first paragraph sentence weight value, a first and last sentence weight value and a common sentence weight value.
3. The method according to claim 1 or 2, wherein calculating the topic attribution degree of the text to be detected according to the position weight value of the sentences and the number of the sentences containing the keywords comprises:
calculating a total weight value of the text to be detected, wherein the total weight value is the sum of the position weight values of all sentences;
and calculating the quotient of the topic sentence weight value and the total weight value to obtain the topic attribution degree of the text to be detected.
4. The method according to claim 1, wherein performing sentence splitting on the text to be detected to obtain the sentence list comprises:
performing sentence splitting on the text to be detected according to preset punctuation marks.
5. An apparatus for calculating a text topic attribution degree, the apparatus comprising:
a sentence splitting unit, configured to perform sentence splitting on a text to be detected to obtain a sentence list;
a searching unit, configured to search, according to a preset topic keyword set, the sentence list obtained by the sentence splitting unit for sentences containing keywords in the topic keyword set;
a determining unit, configured to determine the position weight value of the sentence according to the position, in the text to be detected, of the sentence found by the searching unit;
a calculating unit, configured to calculate the topic attribution degree of the text to be detected according to the position weight value of the sentence determined by the determining unit and the number of the sentences containing the keywords found by the searching unit;
wherein the searching unit comprises:
a word segmentation module, configured to perform word segmentation on the sentences in the sentence list;
a matching module, configured to match the segmented words obtained by the word segmentation module against the keywords;
a recording module, configured to perform no matching on the other segmented words in the sentence when the matching module matches successfully, and to record the sentence as a sentence containing the keywords;
wherein the calculating unit comprises:
a first calculation module, configured to count the number of sentences having the same position weight value and to accumulate the products of the different position weight values and the corresponding numbers to obtain a topic sentence weight value.
6. The apparatus of claim 5, wherein the determining unit comprises:
a first determining module, configured to determine the position of the sentence in the text to be detected, wherein the position comprises a title, a first paragraph, first and last sentences, and other common positions;
and a second determining module, configured to determine a position weight value corresponding to the sentence according to the position, determined by the first determining module, of the sentence in the text to be detected, wherein the position weight value is set according to the degree of correlation between the position and the topic, and comprises a title sentence weight value, a first paragraph sentence weight value, a first and last sentence weight value and a common sentence weight value.
7. The apparatus according to claim 5 or 6, wherein the calculating unit comprises:
a second calculation module, configured to calculate the total weight value of the text to be detected, wherein the total weight value is the sum of the position weight values of all sentences;
and a third calculation module, configured to calculate the quotient of the topic sentence weight value obtained by the first calculation module and the total weight value obtained by the second calculation module to obtain the topic attribution degree of the text to be detected.
8. A storage medium, comprising a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to execute the method for calculating a text topic attribution degree according to any one of claims 1 to 4.
9. A processor, configured to run a program, wherein the program, when run, executes the method for calculating a text topic attribution degree according to any one of claims 1 to 4.
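
The method of claims 1 to 4 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the numeric weights in POSITION_WEIGHTS, the punctuation set, the whitespace-style segment() stand-in for a real word segmenter, the rule for assigning position labels, and all function names are hypothetical, since the claims name the position categories but give no concrete values.

```python
import re
from collections import Counter

# Hypothetical weights: the claims name these categories but give no values.
POSITION_WEIGHTS = {
    "title": 5.0,
    "first_paragraph": 3.0,
    "first_or_last": 2.0,
    "common": 1.0,
}

# Assumed "preset punctuation marks" used for sentence splitting (claim 4).
SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]?")


def split_sentences(text):
    """Split text into sentences at the preset punctuation marks."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def segment(sentence):
    """Stand-in word segmenter (assumption): lowercase word tokens."""
    return re.findall(r"\w+", sentence.lower())


def contains_keyword(sentence, keywords):
    """Match segmented words one by one and stop at the first hit (claim 1)."""
    for word in segment(sentence):
        if word in keywords:
            return True  # remaining words in the sentence are not examined
    return False


def labelled_sentences(title, paragraphs):
    """Yield (sentence, position label) pairs; the labelling rule is assumed."""
    for s in split_sentences(title):
        yield s, "title"
    for i, paragraph in enumerate(paragraphs):
        sentences = split_sentences(paragraph)
        for j, s in enumerate(sentences):
            if i == 0:
                yield s, "first_paragraph"
            elif j == 0 or j == len(sentences) - 1:
                yield s, "first_or_last"
            else:
                yield s, "common"


def topic_attribution(title, paragraphs, keywords, weights=POSITION_WEIGHTS):
    """Return topic sentence weight divided by total weight (claims 1 and 3)."""
    keywords = {k.lower() for k in keywords}
    hits = Counter()      # number of keyword sentences per position label
    total_weight = 0.0    # sum of position weights over all sentences
    for sentence, label in labelled_sentences(title, paragraphs):
        total_weight += weights[label]
        if contains_keyword(sentence, keywords):
            hits[label] += 1
    # Accumulate (position weight x number of keyword sentences at that position).
    topic_weight = sum(weights[label] * n for label, n in hits.items())
    return topic_weight / total_weight if total_weight else 0.0


if __name__ == "__main__":
    score = topic_attribution(
        "Football final",
        [
            "The football match was tense. Fans cheered loudly.",
            "Weather was mild. The stadium was full. "
            "A late goal decided the football game.",
        ],
        {"football", "goal"},
    )
    print(score)
```

With these assumed weights, the sample text has a total weight of 16 (5 + 3 + 3 + 2 + 1 + 2) and a topic sentence weight of 10 (5 + 3 + 2), so the sketch yields 0.625. Counting keyword sentences per position label is equivalent to counting them per distinct weight value, as claim 1 describes, whenever each label has its own weight.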

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510680277.8A CN106598997B (en) 2015-10-19 2015-10-19 Method and device for calculating text theme attribution degree

Publications (2)

Publication Number Publication Date
CN106598997A CN106598997A (en) 2017-04-26
CN106598997B true CN106598997B (en) 2021-05-18

Family

ID=58555102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510680277.8A Active CN106598997B (en) 2015-10-19 2015-10-19 Method and device for calculating text theme attribution degree

Country Status (1)

Country Link
CN (1) CN106598997B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704763A (en) * 2017-09-04 2018-02-16 中国移动通信集团广东有限公司 Multi-source heterogeneous leak information De-weight method, stage division and device
CN109657202B (en) * 2017-10-10 2022-10-28 北京国双科技有限公司 Text processing method and device
CN111369294B (en) * 2020-03-06 2023-06-23 中国铁塔股份有限公司 Software cost estimation method and device
CN111581358B (en) * 2020-04-08 2023-08-18 北京百度网讯科技有限公司 Information extraction method and device and electronic equipment
CN111950037A (en) * 2020-08-25 2020-11-17 北京天融信网络安全技术有限公司 Detection method, detection device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN101315624A (en) * 2007-05-29 2008-12-03 阿里巴巴集团控股有限公司 Text subject recommending method and device
CN103136300A (en) * 2011-12-05 2013-06-05 北京百度网讯科技有限公司 Recommendation method and device of text related subject
CN103744953A (en) * 2014-01-02 2014-04-23 中国科学院计算机网络信息中心 Network hotspot mining method based on Chinese text emotion recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015069652A1 (en) * 2013-11-07 2015-05-14 a la mode technologies, inc. Gathering subject information in close proximity to a user

Also Published As

Publication number Publication date
CN106598997A (en) 2017-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant