CN113836886A - News title similarity identification method - Google Patents

Publication number
CN113836886A
Authority
CN
China
Prior art keywords
characters
title
titles
identification method
judging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110948184.4A
Other languages
Chinese (zh)
Inventor
王欢
马云腾
夏茂晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingbo Intelligent Technology Co ltd
Original Assignee
Beijing Qingbo Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingbo Intelligent Technology Co ltd
Priority to CN202110948184.4A
Publication of CN113836886A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Abstract

The invention discloses a news title similarity identification method comprising the following steps: 1. inputting two titles; 2. removing special characters from the two titles; 3. counting the characters that the two cleaned titles have in common to obtain the number of shared characters; 4. calculating the ratio of the number of shared characters to the length of the shorter title, and judging the titles to be similar if the ratio is greater than 0.5 and dissimilar otherwise. The method is simple, fast and highly portable.

Description

News title similarity identification method
Technical Field
The invention relates to the technical field of text recognition, in particular to a news title similarity recognition method.
Background
When computing text similarity, existing similar-text recognition techniques mainly rely on dictionaries or feature engineering, so the accuracy of the dictionary or of the engineered features largely determines the accuracy of the algorithm.
However, for short texts with a small vocabulary and little semantic information, such as news headlines, it is difficult to build an accurate dictionary or feature set. As a result, existing techniques struggle to capture the key information in short texts, the similarity calculation performs poorly, and the recognition rate for similar texts is low.
In other words, existing similar-text recognition techniques suffer from a low recognition rate on short texts such as news titles.
Disclosure of Invention
To solve the above problem, the invention adopts the following technical scheme:
A news title similarity identification method comprises the following steps:
1. inputting two titles;
2. removing special characters from the two titles;
3. counting the characters that the two cleaned titles have in common to obtain the number of shared characters;
4. calculating the ratio of the number of shared characters to the length of the shorter title, and judging the titles to be similar if the ratio is greater than 0.5 and dissimilar otherwise.
Working principle and beneficial effects: the method is simple, fast and highly portable.
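As a compact restatement of step 4, the decision rule can be written as below. The symbols a', b', n_x, c and r are introduced here for illustration only and do not appear in the patent text. The definition of c as a multiset overlap is one plausible reading of the character statistics in step 3, not something the source specifies.

```latex
% a', b' : the two titles after special characters are removed
% n_x(s) : number of occurrences of character x in string s
c(a', b') = \sum_{x} \min\bigl(n_x(a'),\, n_x(b')\bigr)
\qquad
r = \frac{c(a', b')}{\min\bigl(|a'|,\, |b'|\bigr)}
\qquad
\text{similar} \iff r > 0.5
```

Under this reading r always lies between 0 and 1, and a ratio of exactly 0.5 is judged dissimilar because the comparison is strict.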
Detailed Description
The invention will be better understood from the following examples.
A news title similarity identification method comprises the following steps:
1. inputting two titles, for example: a: "more than half of our country infected with helicobacter pylori!", b: "more than half of people in China have been infected by helicobacter pylori bacteria";
2. removing special characters from the two titles;
3. counting the characters that the two cleaned titles have in common to obtain the number of shared characters;
4. calculating the ratio of the number of shared characters to the length of the shorter title, and judging the titles to be similar if the ratio is greater than 0.5 and dissimilar otherwise.
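The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the patent does not define "special characters" or say how shared characters are counted, so the function names, the use of `str.isalnum()` to decide what to keep, the multiset-style count via `collections.Counter`, and the handling of an empty title are all illustrative choices. The example titles are the English renderings given in the text; the original titles are Chinese, where a single character is a more natural unit of comparison.

```python
from collections import Counter

THRESHOLD = 0.5  # from the text: similar only if the ratio is greater than 0.5


def strip_special(title: str) -> str:
    """Step 2 (assumed definition): keep letters and digits, drop the rest.

    str.isalnum() is True for CJK characters as well as ASCII letters and
    digits, so punctuation, symbols and whitespace are removed.
    """
    return "".join(ch for ch in title if ch.isalnum())


def shared_char_count(a: str, b: str) -> int:
    """Step 3 (assumed definition): size of the multiset intersection.

    A character occurring m times in a and n times in b contributes min(m, n).
    """
    return sum((Counter(a) & Counter(b)).values())


def similarity_ratio(title_a: str, title_b: str) -> float:
    """Step 4: shared characters divided by the length of the shorter title."""
    a, b = strip_special(title_a), strip_special(title_b)
    shorter = min(len(a), len(b))
    if shorter == 0:
        # Assumption: a title with no usable characters matches nothing.
        return 0.0
    return shared_char_count(a, b) / shorter


def titles_similar(title_a: str, title_b: str) -> bool:
    """Judge two titles similar when the ratio is strictly above 0.5."""
    return similarity_ratio(title_a, title_b) > THRESHOLD


# Step 1: the two example titles from the description (English rendering).
title_a = "more than half of our country infected with helicobacter pylori!"
title_b = ("more than half of people in China have been infected by "
           "helicobacter pylori bacteria")

print(titles_similar(title_a, title_b))
```

For the two example titles the words they have in common already account for well over half of the shorter title's characters, so they are judged similar. Because the count ignores character order, two titles made of the same characters in a different order also score as similar; that is a property of this reading of step 3.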

Claims (1)

1. A news title similarity identification method, characterized by comprising the following steps:
1. inputting two titles;
2. removing special characters from the two titles;
3. counting the characters that the two cleaned titles have in common to obtain the number of shared characters;
4. calculating the ratio of the number of shared characters to the length of the shorter title, and judging the titles to be similar if the ratio is greater than 0.5 and dissimilar otherwise.

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN202110948184.4A | CN113836886A (en) | 2021-08-18 | 2021-08-18 | News title similarity identification method

Applications Claiming Priority (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN202110948184.4A | CN113836886A (en) | 2021-08-18 | 2021-08-18 | News title similarity identification method

Publications (1)

Publication Number | Publication Date
CN113836886A (en) | 2021-12-24

Family

ID=78960744

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202110948184.4A | Pending | CN113836886A (en) | 2021-08-18 | 2021-08-18 | News title similarity identification method

Country Status (1)

Country Link
CN (1) CN113836886A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101350032A (en) * | 2008-09-23 | 2009-01-21 | 胡辉 | Method for judging whether web page content is identical or not
CN102693279A (en) * | 2012-04-28 | 2012-09-26 | 合一网络技术(北京)有限公司 | Method, device and system for fast calculating comment similarity
CN107329947A (en) * | 2017-05-15 | 2017-11-07 | 中国移动通信集团湖北有限公司 | Method, device and equipment for determining similar text
CN107688661A (en) * | 2017-08-17 | 2018-02-13 | 广州酷狗计算机科技有限公司 | Lyrics similarity calculating method, terminal device and computer-readable recording medium
CN110245275A (en) * | 2019-06-18 | 2019-09-17 | 中电科大数据研究院有限公司 | Fast normalization method for large-scale similar news headlines

Similar Documents

Publication Publication Date Title
US20190043504A1 (en) Speech recognition method and device
CN110866399B (en) Chinese short text entity recognition and disambiguation method based on enhanced character vector
CN101334768B (en) Method and system for eliminating ambiguity for word meaning by computer, and search method
CN102693279B (en) Method, device and system for fast calculating comment similarity
AUPR824301A0 (en) Methods and systems (npw001)
CN110276052B (en) Ancient Chinese automatic word segmentation and part-of-speech tagging integrated method and device
HK1100586A1 (en) Apparatus and method for handwriting recognition
CN106383814A (en) Word segmentation method of English social media short text
CN105630822A (en) Method for marking similar contents in patent retrieval in red color
CN112052319A (en) Intelligent customer service method and system based on multi-feature fusion
CN111724766A (en) Language identification method, related equipment and readable storage medium
CN107368466A (en) A kind of name recognition methods and its system towards elementary mathematics field
CN113836886A (en) News title similarity identification method
CN105955986A (en) Character converting method and apparatus
CN105573981A (en) Method and device for extracting Chinese names of people and places
Oprean et al. Using the Web to create dynamic dictionaries in handwritten out-of-vocabulary word recognition
CN104699662B (en) The method and apparatus for identifying overall symbol string
Celebi et al. Segmenting hashtags using automatically created training data
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN113139050B (en) Text abstract generation method based on named entity identification additional label and priori knowledge
CN1037598A (en) Eight first sounds (fool) code Chinese character input method
CN111538893B (en) Method for extracting network security new words from unstructured data
Nghiem et al. Improving vietnamese pos tagging by integrating a rich feature set and support vector machines
CN112632259A (en) Automatic dialog intention recognition system based on linguistic rule generation
CN113435218A (en) Regular expression-based speech translation text information extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination