WO2017107566A1 - Retrieval method and system based on word vector similarity - Google Patents

Retrieval method and system based on word vector similarity

Info

Publication number
WO2017107566A1
WO2017107566A1 (PCT/CN2016/098234)
Authority
WO
WIPO (PCT)
Prior art keywords
search
word vector
training
matching
file
Prior art date
Application number
PCT/CN2016/098234
Other languages
English (en)
Chinese (zh)
Inventor
李贤
Original Assignee
广州视源电子科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州视源电子科技股份有限公司 filed Critical 广州视源电子科技股份有限公司
Publication of WO2017107566A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3347: Query execution using vector based model
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413: Classification techniques based on distances to training or reference patterns
    • G06F18/24133: Distances to prototypes

Definitions

  • The invention relates to the field of information retrieval technology, and in particular to a retrieval method and a retrieval system based on word vector similarity.
  • Existing techniques for the resume search matching process usually search by multiple keywords.
  • The user provides a set of keywords to search the search library; the number of keyword hits is used as the matching score, the search results are output ranked by matching score from high to low, and the top-ranked result is by default taken to best meet the user's requirements.
  • This search method has the following disadvantages:
  • The existing keyword-based retrieval method suffers from poor recall and poor result accuracy, as well as poor robustness and adaptability.
  • the present invention provides a retrieval method and system based on word vector similarity, which can improve retrieval accuracy and robustness.
  • An aspect of the present invention provides a retrieval method based on word vector similarity, including:
  • Performing word vector training on the search library comprises:
  • the pre-processing includes data cleaning and data description extraction;
  • the word vector training for the search library includes:
  • performing word vector training on the search library based on the training sample file.
  • The data cleaning comprises at least one of unifying case, eliminating extra spaces, unifying punctuation, and unifying the full-width format;
  • the data description extraction includes word segmentation after adding a user dictionary.
  • Performing word vector training on the search library comprises:
  • performing word vector training on the training sample file with word2vec.
  • Searching and matching the search library with the related words, and separately counting the matching score of each file in the search library with the related words according to the similarity, includes:
  • taking the similarity corresponding to each related word as a cumulative weighting value, and combining the matching results to obtain the matching score of each file with the related words.
  • Another aspect of the present invention provides a retrieval system based on word vector similarity, comprising:
  • a model training unit configured to perform word vector training on the search library, and establish a training model corresponding to the search library
  • a result output unit configured to sort files in the search library according to the matching score from high to low, and output a search result according to the sorting result.
  • The model training unit is further configured to pre-process each file in the search library before performing word vector training on the search library, and to store the pre-processed data of each file into a corresponding training sample file;
  • the pre-processing includes data cleaning and data description extraction;
  • the word vector training for the search library includes:
  • performing word vector training on the search library based on the training sample file.
  • The data cleaning comprises at least one of unifying case, eliminating extra spaces, unifying punctuation, and unifying the full-width format;
  • the data description extraction includes word segmentation after adding a user dictionary.
  • Performing word vector training on the search library comprises:
  • performing word vector training on the training sample file with word2vec.
  • The search matching unit comprises:
  • a matching module configured to search and match each file in the search library with the related words, obtaining the matching result of each file with the related words; and
  • a statistic module configured to take the similarity corresponding to each related word as a cumulative weighting value and combine the matching results to obtain the matching score of each file with the related words.
  • The retrieval method and system based on word vector similarity of the above technical solution establish a training model corresponding to the search library by performing word vector training on the search library; receive an input search keyword and obtain, through the training model, related words of the keyword and the similarity between each related word and the search keyword; search and match the search library with the related words and separately count the matching score of each file in the search library with the related words according to the similarity; and sort the files in the search library by matching score from high to low, outputting the search result according to the sorting result.
  • Because the training model is trained on the search library, it reflects the characteristics of the search library well, which helps improve retrieval accuracy.
  • Because the keywords are expressed as word vectors and related words are used for search matching, the ability to match related words is increased, improving retrieval robustness.
  • FIG. 1 is a schematic flowchart of a retrieval method based on word vector similarity according to an embodiment of the present invention;
  • FIG. 2 is a schematic structural diagram of a word vector similarity-based retrieval system according to an embodiment of the present invention.
  • Embodiments provided by the present invention include a retrieval method embodiment based on word vector similarity, and a corresponding retrieval system embodiment based on word vector similarity. The details are described below separately.
  • FIG. 1 is a schematic flowchart of a retrieval method based on word vector similarity according to an embodiment of the present invention.
  • The word vector similarity-based retrieval method of the present embodiment includes the following steps S1 to S4, detailed as follows:
  • To turn a natural language understanding problem into a machine learning problem,
  • the first step is to find a way to mathematize the word symbols, for example by expressing each word as a unique vector.
  • "Word vector" is the common Chinese term for "word representation" or "word embedding".
  • The word vector in this embodiment should have the following feature: related or similar words are closer in distance; for example, the distance between "Mike" and "Microphone" will be much smaller than the distance between "Mike" and "Weather".
  • The distance between vectors can be measured by the traditional Euclidean distance or by the cosine of the angle between them.
  • the word vector may be a word vector represented by a Distributed Representation.
  • the word vector represented by Distributed Representation is a low-dimensional real number vector.
  • The general form of such a vector is [0.792, -0.177, -0.17, 0.109, -0.542, ...]; 50 and 100 dimensions are the most common.
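As an illustrative sketch of the two distance measures just mentioned (the words and vector values below are toy assumptions, not taken from the patent):

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two word vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine of the angle between two word vectors; 1.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional word vectors (illustrative values only)
mike = [0.792, -0.177, -0.107, 0.109]
microphone = [0.800, -0.150, -0.100, 0.120]
weather = [-0.500, 0.600, 0.300, -0.200]

# Related words should be more similar (and closer) than unrelated ones
assert cosine_similarity(mike, microphone) > cosine_similarity(mike, weather)
assert euclidean_distance(mike, microphone) < euclidean_distance(mike, weather)
```

Either measure serves the purpose described above; cosine similarity is the one the later steps of this document use for scoring.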
  • each file in the search library may be separately preprocessed, and the preprocessed data of each file is stored in a corresponding training sample file.
  • said pre-processing comprises data cleaning and extracting data descriptions.
  • The data cleaning is mainly used to make the data in the search library consistent, and may specifically include at least one of unifying case, eliminating extra spaces, unifying punctuation, and unifying the full-width format;
  • the data description extraction includes word segmentation after adding a user dictionary.
  • Specifically, a user dictionary can be added and word segmentation performed with NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).
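The data cleaning steps above can be sketched as follows (a minimal illustration; the exact normalization rules are assumptions, not specified by the patent):

```python
import re
import unicodedata

def clean_text(text):
    """Data cleaning as described: unify case, eliminate extra spaces,
    unify punctuation, and unify full-width characters to half-width."""
    # Full-width -> half-width: NFKC maps full-width letters, digits and
    # full-width ASCII punctuation to their half-width equivalents
    text = unicodedata.normalize("NFKC", text)
    # Unify case
    text = text.lower()
    # Unify punctuation: map a few Chinese marks NFKC leaves alone (illustrative set)
    for zh, ascii_ in {"。": ".", "：": ":", "；": ";"}.items():
        text = text.replace(zh, ascii_)
    # Eliminate extra spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text
```

For example, clean_text applied to a string containing full-width letters, ideographic spaces, and the ideographic full stop yields a single-spaced, lower-case, half-width string.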
  • word vector training is performed on the search library based on the training sample file to establish a training model corresponding to the search library.
  • Specifically, word2vec may be used to perform word vector training on the training sample file, with the following training settings:
  • -negative specifies the number of negative samples for the negative sampling method (0 means the method is not used);
  • -hs indicates whether to use the hierarchical softmax (HS) method: 0 means not used, 1 means used;
  • -sample 1e-3 sets the sampling threshold to 10⁻³: the more frequently a word appears in the training sample, the more it will be down-sampled;
  • -binary indicates whether the output is a binary file: 0 means no, 1 means yes;
  • -min_count sets the minimum frequency, 5 by default; a word appearing fewer times than this threshold in the corpus is discarded.
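Assuming these settings refer to the flags of the original word2vec command-line tool, the training invocation might look like the following (file names are illustrative; note that the actual tool spells the last flag -min-count):

```shell
# Hypothetical invocation matching the settings listed above:
# hierarchical softmax on, negative sampling off, sampling threshold 1e-3,
# binary output, minimum word frequency 5.
./word2vec -train training_sample.txt -output vectors.bin \
    -size 100 -hs 1 -negative 0 -sample 1e-3 -binary 1 -min-count 5
```

The resulting vectors.bin is the trained model file that the later related-word lookup step queries.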
  • S2: receiving an input search keyword, and obtaining, through the training model, related words of the search keyword and the similarity between each related word and the search keyword;
  • The similarity of two word vectors refers to their cosine similarity, with a maximum of 1 and a minimum of 0. Since the training model is trained on the search library, the related words obtained from it reflect the wording characteristics of the search library well. Specifically, the related words and similarities can be generated with the ./distance vectors.bin command and automated with an sh script and an expect script.
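Conceptually, the related-word lookup performed by the distance tool ranks every word in the model by cosine similarity to the query keyword. A toy sketch (the vector table and words are illustrative assumptions):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def related_words(model, keyword, top_n=2):
    """model: {word: vector}. Returns (word, similarity) pairs, most similar first."""
    query = model[keyword]
    candidates = [(w, cosine(query, v)) for w, v in model.items() if w != keyword]
    return sorted(candidates, key=lambda p: p[1], reverse=True)[:top_n]

# Illustrative 3-dimensional vector table standing in for vectors.bin
model = {
    "java":    [0.9, 0.1, 0.0],
    "jvm":     [0.8, 0.2, 0.1],
    "spring":  [0.7, 0.3, 0.1],
    "weather": [0.0, 0.1, 0.9],
}
top = related_words(model, "java")
assert top[0][0] == "jvm"          # most similar word comes first
assert top[0][1] > top[1][1]       # similarities sorted high to low
```

The (word, similarity) pairs returned here are exactly the inputs the matching step below consumes.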
  • The related words obtained in the above step are used to search and match each file in the search library, yielding the matching result of each file with the related words; the similarity corresponding to each related word is then taken as a cumulative weighting value and combined with the matching results to obtain the matching score of each file with the related words.
  • A score threshold can be set so that only search results whose matching scores exceed the threshold are sorted and output in descending order of matching score. Screening the search results with a score threshold makes it easier for the user to review them.
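A minimal sketch of the threshold filtering and high-to-low sorting just described (the threshold value and scores are illustrative):

```python
def ranked_results(scores, score_threshold=0.0):
    # Keep only files whose matching score exceeds the threshold,
    # then sort from high to low score.
    kept = [(f, s) for f, s in scores.items() if s > score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

scores = {"a.txt": 1.71, "b.txt": 1.0, "c.txt": 0.0}
assert ranked_results(scores, score_threshold=0.5) == [("a.txt", 1.71), ("b.txt", 1.0)]
```

Files scoring at or below the threshold are dropped entirely rather than ranked last.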
  • The word vector similarity-based retrieval method of the above embodiment establishes a training model corresponding to the search library by performing word vector training on the search library; receives an input search keyword and obtains, through the training model, related words of the search keyword and the similarity between each related word and the search keyword; searches and matches the search library with the related words and separately counts the matching score of each file in the search library with the related words according to the similarity; and sorts the files in the search library by matching score from high to low, outputting the search result according to the sorting result.
  • Because the training model is trained on the search library, it reflects the characteristics of the search library well, which helps improve retrieval accuracy.
  • Because the keywords are expressed as word vectors and related words are used for search matching, the ability to match related words is increased, improving retrieval robustness.
  • The word vector similarity-based retrieval system of the present embodiment includes a model training unit 210, a related word generating unit 220, a search matching unit 230, and a result output unit 240, detailed as follows:
  • the model training unit 210 is configured to perform word vector training on the search library, and establish a training model corresponding to the search library;
  • The word vector in this embodiment should have the following feature: related or similar words are closer in distance; for example, the distance between "Mike" and "Microphone" will be much smaller than the distance between "Mike" and "Weather".
  • The distance between vectors can be measured by the traditional Euclidean distance or by the cosine of the angle between them.
  • the word vector may be a word vector represented by a Distributed Representation.
  • the word vector represented by Distributed Representation is a low-dimensional real number vector.
  • The general form of such a vector is [0.792, -0.177, -0.17, 0.109, -0.542, ...]; 50 and 100 dimensions are the most common.
  • The model training unit 210 is further configured to pre-process each file in the search library before performing word vector training on the search library, and to store the pre-processed data of each file into a corresponding training sample file;
  • word vector training is then performed on the search library based on the training sample file.
  • The pre-processing includes data cleaning and data description extraction.
  • The data cleaning includes at least one of unifying case, eliminating extra spaces, unifying punctuation, and unifying the full-width format;
  • the data description extraction includes word segmentation after adding a user dictionary; specifically, a user dictionary can be added and word segmentation performed with NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).
  • The training sample file can be trained into word vectors with word2vec, with the following training settings:
  • -negative specifies the number of negative samples for the negative sampling method (0 means the method is not used);
  • -hs indicates whether to use the hierarchical softmax (HS) method: 0 means not used, 1 means used;
  • -sample 1e-3 sets the sampling threshold to 10⁻³;
  • -binary indicates whether the output is a binary file: 0 means no, 1 means yes;
  • -min_count sets the minimum frequency, 5 by default.
  • The related word generating unit 220 is configured to receive the input search keyword, and to obtain, through the training model, related words of the search keyword and the similarity between each related word and the search keyword;
  • The similarity of two word vectors refers to their cosine similarity, with a maximum of 1 and a minimum of 0. Since the training model is trained on the search library, the related words obtained from it reflect the wording characteristics of the search library well.
  • the search matching unit 230 is configured to perform search matching on the search library by using the related words, and separately calculate matching scores of each file in the search library and the related words according to the similarity;
  • The search matching unit 230 may specifically include:
  • a matching module configured to search and match each file in the search library with the related words, obtaining the matching result of each file with the related words; and
  • a statistic module configured to take the similarity corresponding to each related word as a cumulative weighting value and combine the matching results to obtain the matching score of each file with the related words.
  • the result output unit 240 is configured to sort files in the search library according to the matching score from high to low, and output a search result according to the sorting result.
  • A score threshold may also be set so that only search results whose matching scores exceed the threshold are sorted and output in descending order of matching score. Screening the search results with a score threshold makes it easier for the user to review them.
  • The division into the functional modules above is merely an example; in a practical application, the above functions may be allocated to different functional modules as required, for example according to the configuration of the corresponding hardware or the convenience of software implementation. That is, the internal structure of the word vector similarity-based retrieval system may be divided into different functional modules to complete all or part of the functions described above.
  • The functional modules may be integrated into one processing module, each module may exist physically separately, or two or more modules may be integrated into one.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • The integrated modules, if implemented in the form of software functional modules and sold or used as separate products, may be stored in a computer-readable storage medium.
  • All or part of the steps may be implemented by a program instructing the related hardware (a personal computer, server, network device, etc.).
  • The program can be stored in a computer-readable storage medium.
  • When executed, the program may perform all or part of the steps of the method of any of the above embodiments.
  • the foregoing storage medium may include any medium that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A retrieval method and system based on word vector similarity. The method comprises: performing word vector training on a search library, and establishing a training model corresponding to the search library (S1); receiving an input search keyword, and obtaining related words of the search keyword and the similarity between each related word and the search keyword by means of the training model (S2); searching and matching the search library using the related words, and separately counting the matching scores between each file in the search library and the related words according to the similarity (S3); and sorting the files in the search library by matching score from high to low, and outputting a search result according to the sorting result (S4). With this method, the search-matching capability for related words can be improved in combination with the lexical characteristics of different search libraries, thereby improving retrieval accuracy and robustness.
PCT/CN2016/098234 2015-12-25 2016-09-06 Retrieval method and system based on word vector similarity WO2017107566A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201511003865.4 2015-12-25
CN201511003865.4A CN105631009A (zh) 2015-12-25 2015-12-25 基于词向量相似度的检索方法和系统

Publications (1)

Publication Number Publication Date
WO2017107566A1 true WO2017107566A1 (fr) 2017-06-29

Family

ID=56045942

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/098234 WO2017107566A1 (fr) 2015-12-25 2016-09-06 Retrieval method and system based on word vector similarity

Country Status (2)

Country Link
CN (1) CN105631009A (fr)
WO (1) WO2017107566A1 (fr)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165382A (zh) * 2018-08-03 2019-01-08 南京工业大学 一种加权词向量和潜在语义分析结合的相似缺陷报告推荐方法
CN109582771A (zh) * 2018-11-26 2019-04-05 国网湖南省电力有限公司 面向电力领域基于移动应用的智能客户交互方法
CN109933779A (zh) * 2017-12-18 2019-06-25 苏宁云商集团股份有限公司 用户意图识别方法及系统
CN110084658A (zh) * 2018-01-26 2019-08-02 北京京东尚科信息技术有限公司 物品匹配的方法和装置
CN111104488A (zh) * 2019-12-30 2020-05-05 广州广电运通信息科技有限公司 检索和相似度分析一体化的方法、装置和存储介质
CN111625468A (zh) * 2020-06-05 2020-09-04 中国银行股份有限公司 一种测试案例去重方法及装置
CN112711648A (zh) * 2020-12-23 2021-04-27 航天信息股份有限公司 一种数据库字符串密文存储方法、电子设备和介质
CN113515621A (zh) * 2021-04-02 2021-10-19 中国科学院深圳先进技术研究院 数据检索方法、装置、设备及计算机可读存储介质
CN113569006A (zh) * 2021-06-17 2021-10-29 国家电网有限公司 一种基于数据特征的大规模数据质量异常检测方法
CN116431838A (zh) * 2023-06-15 2023-07-14 北京墨丘科技有限公司 文献检索方法、装置、系统及存储介质

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631009A (zh) * 2015-12-25 2016-06-01 广州视源电子科技股份有限公司 基于词向量相似度的检索方法和系统
CN106407311B (zh) * 2016-08-30 2020-07-24 北京百度网讯科技有限公司 获取搜索结果的方法和装置
CN106886567B (zh) * 2017-01-12 2019-11-08 北京航空航天大学 基于语义扩展的微博突发事件检测方法及装置
CN107330023B (zh) * 2017-06-21 2021-02-12 北京百度网讯科技有限公司 基于关注点的文本内容推荐方法和装置
DE112019001497T5 (de) * 2018-03-23 2021-01-07 Semiconductor Energy Laboratory Co., Ltd. System zur Dokumentensuche, Verfahren zur Dokumentensuche, Programm und nicht-transitorisches, von einem Computer lesbares Speichermedium
CN110610695B (zh) * 2018-05-28 2022-05-17 宁波方太厨具有限公司 一种基于孤立词的语音识别方法及应用有该方法的吸油烟机
CN109190046A (zh) * 2018-09-18 2019-01-11 北京点网聚科技有限公司 内容推荐方法、装置及内容推荐服务器
CN110110333A (zh) * 2019-05-08 2019-08-09 上海数据交易中心有限公司 一种互联对象的检索方法及系统
CN110309278B (zh) * 2019-05-23 2021-11-16 泰康保险集团股份有限公司 关键词检索方法、装置、介质及电子设备
CN110609952B (zh) * 2019-08-15 2024-04-26 中国平安财产保险股份有限公司 数据采集方法、系统和计算机设备
CN110674087A (zh) * 2019-09-03 2020-01-10 平安科技(深圳)有限公司 文件查询方法、装置及计算机可读存储介质
CN110909789A (zh) * 2019-11-20 2020-03-24 精硕科技(北京)股份有限公司 声量预测方法和装置、电子设备及存储介质
CN111625621B (zh) * 2020-04-27 2023-05-09 中国铁道科学研究院集团有限公司电子计算技术研究所 一种文档检索方法、装置、电子设备及存储介质
CN112650833A (zh) * 2020-12-25 2021-04-13 哈尔滨工业大学(深圳) Api匹配模型建立方法及跨城市政务api匹配方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778161A (zh) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 基于Word2Vec和Query log抽取关键词方法
US20150248608A1 (en) * 2014-02-28 2015-09-03 Educational Testing Service Deep Convolutional Neural Networks for Automated Scoring of Constructed Responses
CN104933183A (zh) * 2015-07-03 2015-09-23 重庆邮电大学 一种融合词向量模型和朴素贝叶斯的查询词改写方法
CN105005589A (zh) * 2015-06-26 2015-10-28 腾讯科技(深圳)有限公司 一种文本分类的方法和装置
CN105631009A (zh) * 2015-12-25 2016-06-01 广州视源电子科技股份有限公司 基于词向量相似度的检索方法和系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150248608A1 (en) * 2014-02-28 2015-09-03 Educational Testing Service Deep Convolutional Neural Networks for Automated Scoring of Constructed Responses
CN104778161A (zh) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 基于Word2Vec和Query log抽取关键词方法
CN105005589A (zh) * 2015-06-26 2015-10-28 腾讯科技(深圳)有限公司 一种文本分类的方法和装置
CN104933183A (zh) * 2015-07-03 2015-09-23 重庆邮电大学 一种融合词向量模型和朴素贝叶斯的查询词改写方法
CN105631009A (zh) * 2015-12-25 2016-06-01 广州视源电子科技股份有限公司 基于词向量相似度的检索方法和系统

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933779A (zh) * 2017-12-18 2019-06-25 苏宁云商集团股份有限公司 用户意图识别方法及系统
CN110084658A (zh) * 2018-01-26 2019-08-02 北京京东尚科信息技术有限公司 物品匹配的方法和装置
CN110084658B (zh) * 2018-01-26 2024-01-16 北京京东尚科信息技术有限公司 物品匹配的方法和装置
CN109165382B (zh) * 2018-08-03 2022-08-23 南京工业大学 一种加权词向量和潜在语义分析结合的相似缺陷报告推荐方法
CN109165382A (zh) * 2018-08-03 2019-01-08 南京工业大学 一种加权词向量和潜在语义分析结合的相似缺陷报告推荐方法
CN109582771A (zh) * 2018-11-26 2019-04-05 国网湖南省电力有限公司 面向电力领域基于移动应用的智能客户交互方法
CN111104488A (zh) * 2019-12-30 2020-05-05 广州广电运通信息科技有限公司 检索和相似度分析一体化的方法、装置和存储介质
CN111104488B (zh) * 2019-12-30 2023-10-24 广州广电运通信息科技有限公司 检索和相似度分析一体化的方法、装置和存储介质
CN111625468A (zh) * 2020-06-05 2020-09-04 中国银行股份有限公司 一种测试案例去重方法及装置
CN111625468B (zh) * 2020-06-05 2024-04-16 中国银行股份有限公司 一种测试案例去重方法及装置
CN112711648A (zh) * 2020-12-23 2021-04-27 航天信息股份有限公司 一种数据库字符串密文存储方法、电子设备和介质
CN113515621A (zh) * 2021-04-02 2021-10-19 中国科学院深圳先进技术研究院 数据检索方法、装置、设备及计算机可读存储介质
CN113515621B (zh) * 2021-04-02 2024-03-29 中国科学院深圳先进技术研究院 数据检索方法、装置、设备及计算机可读存储介质
CN113569006A (zh) * 2021-06-17 2021-10-29 国家电网有限公司 一种基于数据特征的大规模数据质量异常检测方法
CN116431838A (zh) * 2023-06-15 2023-07-14 北京墨丘科技有限公司 文献检索方法、装置、系统及存储介质
CN116431838B (zh) * 2023-06-15 2024-01-30 北京墨丘科技有限公司 文献检索方法、装置、系统及存储介质

Also Published As

Publication number Publication date
CN105631009A (zh) 2016-06-01

Similar Documents

Publication Publication Date Title
WO2017107566A1 (fr) Retrieval method and system based on word vector similarity
CN111177365B (zh) 一种基于图模型的无监督自动文摘提取方法
CN109101479B (zh) 一种用于中文语句的聚类方法及装置
CN109960724B (zh) 一种基于tf-idf的文本摘要方法
JP5608817B2 (ja) 指定特性値を使用するターゲット単語の認識
US8073877B2 (en) Scalable semi-structured named entity detection
US9087297B1 (en) Accurate video concept recognition via classifier combination
Zhang et al. Extractive document summarization based on convolutional neural networks
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN108052500B (zh) 一种基于语义分析的文本关键信息提取方法及装置
US11928875B2 (en) Layout-aware, scalable recognition system
CN108038099B (zh) 基于词聚类的低频关键词识别方法
CN112989802A (zh) 一种弹幕关键词提取方法、装置、设备及介质
US20190318191A1 (en) Noise mitigation in vector space representations of item collections
CN109446299B (zh) 基于事件识别的搜索电子邮件内容的方法及系统
CN115098690B (zh) 一种基于聚类分析的多数据文档分类方法及系统
CN112527958A (zh) 用户行为倾向识别方法、装置、设备及存储介质
CN116738988A (zh) 文本检测方法、计算机设备和存储介质
Twinandilla et al. Multi-document summarization using k-means and latent dirichlet allocation (lda)–significance sentences
CN110705285B (zh) 一种政务文本主题词库构建方法、装置、服务器及可读存储介质
CN110413985B (zh) 一种相关文本片段搜索方法及装置
WO2021227951A1 (fr) Dénomination d'élément de page d'extrémité avant
US20050149846A1 (en) Apparatus, method, and program for text classification using frozen pattern
CN113761125A (zh) 动态摘要确定方法和装置、计算设备以及计算机存储介质
JP5389764B2 (ja) マイクロブログテキスト分類装置及び方法及びプログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16877377

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16877377

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 15/11/2018)

122 Ep: pct application non-entry in european phase

Ref document number: 16877377

Country of ref document: EP

Kind code of ref document: A1