CN112733537A - Text deduplication method and apparatus, electronic device, and computer-readable storage medium - Google Patents

Text deduplication method and apparatus, electronic device, and computer-readable storage medium

Publication number
CN112733537A
Authority
CN
China
Prior art keywords
text
texts
deduplicated
word
preliminary
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011637850.4A
Other languages
English (en)
Chinese (zh)
Other versions
CN112733537B (zh)
Inventor
何友鑫
彭琛
汪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-12-31
Filing date
2020-12-31
Publication date
2021-04-30
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202011637850.4A (CN112733537B)
Priority claimed from CN202011637850.4A (CN112733537B)
Priority to PCT/CN2021/083711 (WO2022141860A1)
Publication of CN112733537A
Application granted
Publication of CN112733537B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
CN202011637850.4A 2020-12-31 2020-12-31 Text deduplication method and apparatus, electronic device, and computer-readable storage medium Active CN112733537B (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011637850.4A CN112733537B (zh) 2020-12-31 Text deduplication method and apparatus, electronic device, and computer-readable storage medium
PCT/CN2021/083711 WO2022141860A1 (fr) 2020-12-31 2021-03-30 Text deduplication method and apparatus, electronic device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011637850.4A CN112733537B (zh) 2020-12-31 Text deduplication method and apparatus, electronic device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN112733537A (zh) 2021-04-30
CN112733537B (zh) 2024-10-22

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190294588A1 (en) * 2017-04-07 2019-09-26 Tencent Technology (Shenzhen) Company Limited Text deduplication method and apparatus, and storage medium
CN111159996A (zh) * 2019-12-31 2020-05-15 福建福诺移动通信技术有限公司 Short-text set similarity comparison method and system based on an improved text fingerprint algorithm

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
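CN114386423A (zh) * 2022-01-18 2022-04-22 Ping An Technology Shenzhen Co Ltd Text deduplication method and apparatus, electronic device, and storage medium
CN114386423B (zh) * 2022-01-18 2023-07-14 Ping An Technology Shenzhen Co Ltd Text deduplication method and apparatus, electronic device, and storage medium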

Also Published As

Publication number Publication date
WO2022141860A1 (fr) 2022-07-07

Similar Documents

Publication Publication Date Title
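CN112541338A (zh) Similar text matching method and apparatus, electronic device, and computer storage medium
CN113157927B (zh) Text classification method and apparatus, electronic device, and readable storage medium
WO2022160449A1 (fr) Text classification method and apparatus, electronic device, and storage medium
CN113051356A (zh) Open relation extraction method and apparatus, electronic device, and storage medium
CN113095076A (zh) Sensitive word recognition method and apparatus, electronic device, and storage medium
CN111460797B (zh) Keyword extraction method and apparatus, electronic device, and readable storage medium
CN113449187A (zh) Product recommendation method, apparatus, device, and storage medium based on dual profiles
CN113033198B (zh) Similar text pushing method and apparatus, electronic device, and computer storage medium
CN112380859A (zh) Public opinion information recommendation method and apparatus, electronic device, and computer storage medium
CN112883730B (zh) Similar text matching method and apparatus, electronic device, and storage medium
CN115146865A (zh) Artificial-intelligence-based task optimization method and related device
CN113268615A (zh) Resource label generation method and apparatus, electronic device, and storage medium
CN113886708A (zh) Product recommendation method, apparatus, device, and storage medium based on user information
CN114612194A (zh) Product recommendation method and apparatus, electronic device, and storage medium
CN112632264A (zh) Intelligent question answering method and apparatus, electronic device, and storage medium
CN113722472B (zh) Technical literature information extraction method, system, and storage medium
CN113435308B (zh) Multi-label text classification method, apparatus, device, and storage medium
CN113505117A (zh) Data quality assessment method, apparatus, device, and medium based on data indicators
CN113688239A (zh) Few-shot text classification method and apparatus, electronic device, and storage medium
CN112579781A (zh) Text categorization method and apparatus, electronic device, and medium
CN115409041B (zh) Unstructured data extraction method, apparatus, device, and storage medium
CN114708073A (zh) Intelligent bid-rigging and collusive-bidding detection method and apparatus, electronic device, and storage medium
CN112733537B (zh) Text deduplication method and apparatus, electronic device, and computer-readable storage medium
CN112733537A (zh) Text deduplication method and apparatus, electronic device, and computer-readable storage medium
CN113434413A (zh) Data testing method, apparatus, device, and storage medium based on data differences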

Legal Events

Code Title
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40041501; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant