CN113836886A - News title similarity identification method - Google Patents
- Publication number
- CN113836886A (application number CN202110948184.4A)
- Authority
- CN
- China
- Prior art keywords
- characters
- title
- titles
- identification method
- judging
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
Abstract
The invention discloses a news title similarity identification method comprising the following steps: 1. inputting two titles; 2. removing special characters from the two titles; 3. counting the characters that the two cleaned titles have in common to obtain the number of shared characters; 4. calculating the ratio of the number of shared characters to the length of the shorter title, and judging the titles to be similar if the ratio is greater than 0.5 and dissimilar otherwise. The method is simple, fast, and highly portable.
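The decision rule in step 4 can be written compactly as follows. The symbols are notation introduced here for illustration and do not appear in the patent: a' and b' denote the two titles after special-character removal, |x| the number of characters in x, and s(a', b') the shared-character count from step 3.

```latex
r = \frac{s(a', b')}{\min\left(|a'|,\, |b'|\right)},
\qquad
\text{similar}(a, b) \iff r > 0.5
```

Because the comparison is strict, a ratio of exactly 0.5 is judged dissimilar.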
Description
Technical Field
The invention relates to the technical field of text recognition, in particular to a news title similarity recognition method.
Background
Existing similar-text recognition techniques mainly compute text similarity using dictionaries or feature engineering, so the accuracy of the dictionary or of the feature engineering largely determines the accuracy of the algorithm.
However, for short texts with a small vocabulary and little semantic information, such as news headlines, it is difficult to build an accurate dictionary or effective features. As a result, existing techniques struggle to capture the key information in short texts, the similarity calculation performs poorly, and the recognition rate for similar texts is low.
In other words, existing similar-text recognition techniques have a low recognition rate for short texts such as news titles.
Disclosure of Invention
To achieve this purpose, the invention adopts the following technical scheme:
A news title similarity identification method comprising the following steps:
1. inputting two titles;
2. removing special characters from the two titles;
3. counting the characters that the two cleaned titles have in common to obtain the number of shared characters;
4. calculating the ratio of the number of shared characters to the length of the shorter title, and judging the titles to be similar if the ratio is greater than 0.5 and dissimilar otherwise.
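Step 3 does not say whether a repeated character is counted once or once per occurrence. As a minimal sketch, assuming the titles have already been cleaned and are non-empty, the following contrasts two plausible readings. The function names are illustrative and do not come from the patent.

```python
from collections import Counter


def shared_distinct(a: str, b: str) -> int:
    """Reading 1: count each distinct character that occurs in both strings once."""
    return len(set(a) & set(b))


def shared_with_repeats(a: str, b: str) -> int:
    """Reading 2: count shared characters, matching each occurrence at most once."""
    # Counter & Counter keeps, for every character, the smaller of the two counts.
    return sum((Counter(a) & Counter(b)).values())


def ratio(shared: int, a: str, b: str) -> float:
    """Shared-character count divided by the length of the shorter string."""
    return shared / min(len(a), len(b))


a, b = "aaab", "aaac"
print(ratio(shared_distinct(a, b), a, b))      # 0.25
print(ratio(shared_with_repeats(a, b), a, b))  # 0.75
```

With the 0.5 threshold, the first reading judges this pair dissimilar and the second judges it similar, so the choice matters for titles that repeat characters.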
The working principle and beneficial effects are as follows: the method is simple, fast, and highly portable.
Detailed Description
The invention will be better understood from the following examples.
A news title similarity identification method comprising the following steps:
1. inputting two titles, for example a: "more than half of our country infected with helicobacter pylori!" and b: "more than half of people in China have been infected by helicobacter pylori bacteria";
2. removing special characters from the two titles;
3. counting the characters that the two cleaned titles have in common to obtain the number of shared characters;
4. calculating the ratio of the number of shared characters to the length of the shorter title, and judging the titles to be similar if the ratio is greater than 0.5 and dissimilar otherwise.
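The four steps above can be sketched end to end as follows. This is one possible reading, not the patent's reference implementation. It assumes that "special characters" means everything other than letters and digits, that shared characters are counted per occurrence, and that an empty title is never similar. The helper names are hypothetical. The example titles are the English renderings given above, standing in for the original Chinese titles.

```python
from collections import Counter


def strip_special(title: str) -> str:
    """Step 2: keep only letters and digits (assumed meaning of 'special characters')."""
    # str.isalnum() is Unicode-aware, so CJK characters are kept while
    # punctuation, whitespace, and symbols are dropped.
    return "".join(ch for ch in title if ch.isalnum())


def shared_char_count(a: str, b: str) -> int:
    """Step 3: number of characters common to both strings, counted per occurrence."""
    return sum((Counter(a) & Counter(b)).values())


def titles_similar(title_a: str, title_b: str, threshold: float = 0.5) -> bool:
    """Step 4: compare the shared-character ratio against the threshold."""
    a, b = strip_special(title_a), strip_special(title_b)
    shortest = min(len(a), len(b))
    if shortest == 0:
        return False  # assumption: an empty title is never judged similar
    return shared_char_count(a, b) / shortest > threshold


a = "more than half of our country infected with helicobacter pylori!"
b = "more than half of people in China have been infected by helicobacter pylori bacteria"
print(titles_similar(a, b))  # prints True
```

The comparison is case-sensitive as written; lowercasing both titles first would be a further assumption.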
Claims (1)
1. A news title similarity identification method, characterized by comprising the following steps:
1. inputting two titles;
2. removing special characters from the two titles;
3. counting the characters that the two cleaned titles have in common to obtain the number of shared characters;
4. calculating the ratio of the number of shared characters to the length of the shorter title, and judging the titles to be similar if the ratio is greater than 0.5 and dissimilar otherwise.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110948184.4A | 2021-08-18 | 2021-08-18 | News title similarity identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113836886A (en) | 2021-12-24 |
Family
ID=78960744
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110948184.4A (pending) | News title similarity identification method | 2021-08-18 | 2021-08-18 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113836886A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101350032A (en) * | 2008-09-23 | 2009-01-21 | 胡辉 | Method for judging whether web page content is identical or not |
CN102693279A (en) * | 2012-04-28 | 2012-09-26 | 合一网络技术(北京)有限公司 | Method, device and system for fast calculating comment similarity |
CN107329947A (en) * | 2017-05-15 | 2017-11-07 | 中国移动通信集团湖北有限公司 | Determination method, device and the equipment of Similar Text |
CN107688661A (en) * | 2017-08-17 | 2018-02-13 | 广州酷狗计算机科技有限公司 | Lyrics similarity calculating method, terminal device and computer-readable recording medium |
CN110245275A (en) * | 2019-06-18 | 2019-09-17 | 中电科大数据研究院有限公司 | A kind of extensive similar quick method for normalizing of headline |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20190043504A1 (en) | Speech recognition method and device | |
CN110866399B (en) | Chinese short text entity recognition and disambiguation method based on enhanced character vector | |
CN101334768B (en) | Method and system for eliminating ambiguity for word meaning by computer, and search method | |
CN102693279B (en) | Method, device and system for fast calculating comment similarity | |
AUPR824301A0 (en) | Methods and systems (npw001) | |
CN110276052B (en) | Ancient Chinese automatic word segmentation and part-of-speech tagging integrated method and device | |
HK1100586A1 (en) | Apparatus and method for handwriting recognition | |
CN106383814A (en) | Word segmentation method of English social media short text | |
CN105630822A (en) | Method for marking similar contents in patent retrieval in red color | |
CN112052319A (en) | Intelligent customer service method and system based on multi-feature fusion | |
CN111724766A (en) | Language identification method, related equipment and readable storage medium | |
CN107368466A (en) | A kind of name recognition methods and its system towards elementary mathematics field | |
CN113836886A (en) | News title similarity identification method | |
CN105955986A (en) | Character converting method and apparatus | |
CN105573981A (en) | Method and device for extracting Chinese names of people and places | |
Oprean et al. | Using the Web to create dynamic dictionaries in handwritten out-of-vocabulary word recognition | |
CN104699662B (en) | The method and apparatus for identifying overall symbol string | |
Celebi et al. | Segmenting hashtags using automatically created training data | |
CN112989839A (en) | Keyword feature-based intent recognition method and system embedded in language model | |
CN113139050B (en) | Text abstract generation method based on named entity identification additional label and priori knowledge | |
CN1037598A (en) | Eight first sounds (fool) code Chinese character input method | |
CN111538893B (en) | Method for extracting network security new words from unstructured data | |
Nghiem et al. | Improving vietnamese pos tagging by integrating a rich feature set and support vector machines | |
CN112632259A (en) | Automatic dialog intention recognition system based on linguistic rule generation | |
CN113435218A (en) | Regular expression-based speech translation text information extraction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||