CN112733537A - Text deduplication method and apparatus, electronic device, and computer-readable storage medium - Google Patents
- Publication number
- CN112733537A (application CN202011637850.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- texts
- deduplicated
- word
- preliminary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Concepts
Concept | Type | Sections | Count |
---|---|---|---|
method | Methods | title, claims, abstract, description | 41 |
segmentation | Effects | claims, description | 173 |
processing | Methods | claims, description | 36 |
algorithm | Methods | claims, description | 30 |
analysis | Methods | claims, description | 19 |
screening | Methods | claims, description | 10 |
calculation | Methods | claims, description | 8 |
chemical reaction | Methods | claims, description | 6 |
computer program | Methods | claims, description | 5 |
construction | Methods | claims, description | 5 |
engineering process | Methods | abstract, description | 2 |
semantic technology | Methods | abstract, description | 2 |
extraction | Methods | description | 8 |
function | Effects | description | 8 |
elimination | Effects | description | 7 |
elimination reaction | Methods | description | 7 |
management method | Methods | description | 5 |
diagram | Methods | description | 4 |
communication | Methods | description | 2 |
correction | Methods | description | 2 |
data capture | Methods | description | 2 |
liquid crystal related substance | Substances | description | 2 |
natural language processing | Methods | description | 2 |
optical effect | Effects | description | 2 |
process | Effects | description | 2 |
retained effect | Effects | description | 2 |
biological transmission | Effects | description | 1 |
data storage | Methods | description | 1 |
detection method | Methods | description | 1 |
filtration | Methods | description | 1 |
mechanism | Effects | description | 1 |
modification | Methods | description | 1 |
modification | Effects | description | 1 |
peripheral effect | Effects | description | 1 |
substitution reaction | Methods | description | 1 |
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011637850.4A CN112733537B (zh) | 2020-12-31 | | Text deduplication method and apparatus, electronic device, and computer-readable storage medium |
PCT/CN2021/083711 WO2022141860A1 (fr) | 2020-12-31 | 2021-03-30 | Text deduplication method and apparatus, electronic device, and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011637850.4A CN112733537B (zh) | 2020-12-31 | | Text deduplication method and apparatus, electronic device, and computer-readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112733537A (zh) | 2021-04-30 |
CN112733537B (zh) | 2024-10-22 |
Family
ID=
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114386423A (zh) * | 2022-01-18 | 2022-04-22 | 平安科技(深圳)有限公司 | Text deduplication method and apparatus, electronic device, and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190294588A1 (en) * | 2017-04-07 | 2019-09-26 | Tencent Technology (Shenzhen) Company Limited | Text deduplication method and apparatus, and storage medium |
CN111159996A (zh) * | 2019-12-31 | 2020-05-15 | 福建福诺移动通信技术有限公司 | Method and system for comparing the similarity of short-text sets based on an improved text fingerprint algorithm |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190294588A1 (en) * | 2017-04-07 | 2019-09-26 | Tencent Technology (Shenzhen) Company Limited | Text deduplication method and apparatus, and storage medium |
CN111159996A (zh) * | 2019-12-31 | 2020-05-15 | 福建福诺移动通信技术有限公司 | Method and system for comparing the similarity of short-text sets based on an improved text fingerprint algorithm |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
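CN114386423A (zh) * | 2022-01-18 | 2022-04-22 | 平安科技(深圳)有限公司 | Text deduplication method and apparatus, electronic device, and storage medium |
CN114386423B (zh) * | 2022-01-18 | 2023-07-14 | 平安科技(深圳)有限公司 | Text deduplication method and apparatus, electronic device, and storage medium |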
Also Published As
Publication number | Publication date |
---|---|
WO2022141860A1 (fr) | 2022-07-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
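CN112541338A (zh) | | Similar text matching method and apparatus, electronic device, and computer storage medium |
CN113157927B (zh) | | Text classification method and apparatus, electronic device, and readable storage medium |
WO2022160449A1 (fr) | | Text classification method and apparatus, electronic device, and storage medium |
CN113051356A (zh) | | Open relation extraction method and apparatus, electronic device, and storage medium |
CN113095076A (zh) | | Sensitive word recognition method and apparatus, electronic device, and storage medium |
CN111460797B (zh) | | Keyword extraction method and apparatus, electronic device, and readable storage medium |
CN113449187A (zh) | | Dual-profile-based product recommendation method, apparatus, and device, and storage medium |
CN113033198B (zh) | | Similar text pushing method and apparatus, electronic device, and computer storage medium |
CN112380859A (zh) | | Public opinion information recommendation method and apparatus, electronic device, and computer storage medium |
CN112883730B (zh) | | Similar text matching method and apparatus, electronic device, and storage medium |
CN115146865A (zh) | | Artificial intelligence-based task optimization method and related device |
CN113268615A (zh) | | Resource label generation method and apparatus, electronic device, and storage medium |
CN113886708A (zh) | | User information-based product recommendation method, apparatus, and device, and storage medium |
CN114612194A (zh) | | Product recommendation method and apparatus, electronic device, and storage medium |
CN112632264A (zh) | | Intelligent question answering method and apparatus, electronic device, and storage medium |
CN113722472B (zh) | | Technical literature information extraction method and system, and storage medium |
CN113435308B (zh) | | Multi-label text classification method, apparatus, and device, and storage medium |
CN113505117A (zh) | | Data indicator-based data quality evaluation method, apparatus, device, and medium |
CN113688239A (zh) | | Few-shot text classification method and apparatus, electronic device, and storage medium |
CN112579781A (zh) | | Text categorization method and apparatus, electronic device, and medium |
CN115409041B (zh) | | Unstructured data extraction method, apparatus, and device, and storage medium |
CN114708073A (zh) | | Intelligent detection method and apparatus for bid rigging and collusive bidding, electronic device, and storage medium |
CN112733537B (zh) | | Text deduplication method and apparatus, electronic device, and computer-readable storage medium |
CN112733537A (zh) | | Text deduplication method and apparatus, electronic device, and computer-readable storage medium |
CN113434413A (zh) | | Data difference-based data testing method, apparatus, and device, and storage medium |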
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40041501; Country of ref document: HK |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |