WO2017107566A1 - Retrieval method and system based on word vector similarity - Google Patents
Retrieval method and system based on word vector similarity
- Publication number: WO2017107566A1
- Application: PCT/CN2016/098234
- Authority: WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
Definitions
- the invention relates to the field of information retrieval technology, in particular to a retrieval method based on word vector similarity and a retrieval system based on word vector similarity.
- existing techniques for the resume search matching process usually search by multiple keywords.
- the user provides a set of keywords to search in the search library; the number of keyword hits is taken as the matching score, the search results are output ranked by matching score from high to low, and by default the top-ranked result is assumed to best match the user's requirements.
- this search method has the following disadvantages:
- the existing keyword-based retrieval method has a poor recall rate and poor accuracy of retrieval results, as well as poor robustness and adaptability.
- the present invention provides a retrieval method and system based on word vector similarity, which can improve retrieval accuracy and robustness.
- An aspect of the present invention provides a retrieval method based on word vector similarity, including:
- the performing word vector training on the search library comprises:
- the pre-processing includes data cleaning and extracting data description
- the word vector training for the search library includes:
- Word vector training is performed on the search library based on the training sample file.
- the data cleaning comprises at least one of uniform case, elimination of extra spaces, unified punctuation, and unified full-width format;
- extracting the data description includes word segmentation with a user dictionary added.
- the performing word vector training on the search library comprises:
- Word vector training is performed on the training sample file by word2vec.
- searching and matching the search library with the related words, and counting, according to the similarity, the matching score of each file in the search library against the related words, includes:
- the similarity corresponding to each related word is taken as a cumulative weighting value, and the matching score of each file against the related words is obtained by combining the matching results.
- Another aspect of the present invention provides a retrieval system based on word vector similarity, comprising:
- a model training unit configured to perform word vector training on the search library, and establish a training model corresponding to the search library
- a result output unit configured to sort files in the search library according to the matching score from high to low, and output a search result according to the sorting result.
- the model training unit is further configured to perform pre-processing on each file in the search library before performing word vector training on the search library, and store the pre-processed data of each file into a corresponding training sample file;
- the pre-processing includes data cleaning and extracting data descriptions;
- the word vector training for the search library includes:
- Word vector training is performed on the search library based on the training sample file.
- the data cleaning comprises at least one of uniform case, elimination of extra spaces, unified punctuation, and unified full-width format;
- the extracting the data description includes segmentation by adding a user dictionary.
- the performing word vector training on the search library comprises:
- Word vector training is performed on the training sample file by word2vec.
- the search matching unit comprises:
- a matching module configured to perform a search and match on each file in the search library by using the related words, to obtain a matching result of each file and the related words
- a statistics module configured to use the similarity corresponding to each related word as a cumulative weighting value and to combine the matching results to obtain the matching score of each file against the related words.
- the retrieval method and system based on word vector similarity of the above technical solution establish a training model corresponding to the search library by performing word vector training on the search library; receive an input search keyword and obtain, through the training model, related words of the keyword and the similarity between each related word and the keyword; search and match the search library with the related words, counting the matching score of each file in the search library against the related words according to the similarity; and sort the files in the search library by matching score from high to low, outputting the search result according to the sorted order.
- the training model is based on the search library training, it can reflect the characteristics of the search library well, which is beneficial to improve the search accuracy.
- because the keywords are represented in the form of word vectors, related words can also be searched and matched, which improves retrieval robustness.
- FIG. 1 is a schematic flowchart of a retrieval method based on word vector similarity according to an embodiment of the present invention;
- FIG. 2 is a schematic structural diagram of a word vector similarity-based retrieval system according to an embodiment of the present invention.
- Embodiments provided by the present invention include a retrieval method embodiment based on word vector similarity, and a corresponding retrieval system embodiment based on word vector similarity. The details are described below separately.
- FIG. 1 is a schematic flowchart of the retrieval method based on word vector similarity according to an embodiment of the present invention.
- the word vector similarity-based retrieval method of the present embodiment includes the following steps S1 to S4, and the steps are detailed as follows:
- to turn a natural language understanding problem into a machine learning problem, the first step is to find a way to represent the language symbols mathematically, for example by expressing each word as a unique vector.
- "word vector" is the common Chinese name for "Word Representation" or "Word Embedding".
- the word vector in this embodiment should have the following feature: related or similar words should be closer in distance; for example, the distance between "Mike" and "Microphone" will be much smaller than the distance between "Mike" and "Weather".
- the distance between vectors can be measured by the traditional Euclidean distance or by the cosine of the angle between them.
- the word vector may be a word vector represented by a Distributed Representation.
- the word vector represented by Distributed Representation is a low-dimensional real number vector.
- the general form of such a vector is [0.792, -0.177, -0.17, 0.109, -0.542, ...]; dimensions of 50 and 100 are most common.
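As a concrete illustration of the two distance measures mentioned above, the following sketch compares toy word vectors; the values and words are invented for illustration, not taken from a trained model:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); near 1 means same direction
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# toy 4-dimensional "word vectors" (illustrative values only)
mike       = [0.79, -0.18, 0.11, -0.54]
microphone = [0.75, -0.20, 0.15, -0.50]
weather    = [-0.40, 0.62, -0.30, 0.10]

print(cosine_similarity(mike, microphone))  # high: related words
print(cosine_similarity(mike, weather))     # much lower: unrelated words
```

Under either measure, the related pair ends up far closer than the unrelated one, which is exactly the property the training step is meant to produce.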
- each file in the search library may be separately preprocessed, and the preprocessed data of each file is stored in a corresponding training sample file.
- said pre-processing comprises data cleaning and extracting data descriptions.
- the data cleaning is mainly used to implement the consistency of the data in the search library, and may specifically include at least one of unified case, eliminating extra spaces, unified punctuation, and unified full-width format;
- extracting the data description includes word segmentation with a user dictionary added.
- specifically, a user dictionary can be added and segmentation performed with NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).
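A minimal sketch of the data cleaning steps named above (unified case, extra spaces, punctuation, full-width format), assuming Python and only the standard library; the exact rules in practice would be tuned to the search library's data:

```python
import re
import unicodedata

def clean_text(text):
    # NFKC normalization folds full-width letters/digits/punctuation
    # into their half-width (ASCII) equivalents
    text = unicodedata.normalize("NFKC", text)
    # unify case
    text = text.lower()
    # unify punctuation NFKC does not cover (e.g. the ideographic full stop)
    text = text.replace("\u3002", ".")
    # eliminate extra spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("Ｊａｖａ  Ｅｎｇｉｎｅｅｒ"))
```

Each cleaned file would then be written out to its corresponding training sample file for the word2vec step.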
- word vector training is performed on the search library based on the training sample file to establish a training model corresponding to the search library.
- the specific manner may be: word vector training is performed on the training sample file with word2vec, with the training settings as follows:
- -negative indicates whether the negative sampling method is used (0 = not used, 1 = used);
- -hs indicates whether the hierarchical softmax (HS) method is used (0 = not used, 1 = used);
- -sample 1e-3 indicates that the down-sampling threshold is 10^-3; the more frequently a word appears in the training samples, the more strongly it is down-sampled;
- -binary indicates whether the output is a binary file (0 = not used, 1 = used);
- -min_count indicates the minimum frequency threshold, 5 by default; if a word appears in the corpus fewer times than the threshold, the word is discarded.
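The settings above can be collected into a single training command. The sketch below assembles such an invocation; the binary path and file names are assumptions, and the chosen 0/1 values for -negative and -hs are illustrative since the description leaves them open. Note that the actual word2vec tool spells the last flag -min-count:

```python
# hedged sketch: builds the word2vec command line implied by the settings above
train_file = "training_samples.txt"   # hypothetical pre-processed sample file
model_file = "vectors.bin"            # hypothetical output model file

cmd = [
    "./word2vec",
    "-train", train_file,     # training sample file built by pre-processing
    "-output", model_file,    # trained word vector model
    "-negative", "1",         # use the negative sampling method
    "-hs", "0",               # do not use the hierarchical softmax method
    "-sample", "1e-3",        # down-sampling threshold 10^-3
    "-binary", "1",           # output a binary model file
    "-min-count", "5",        # discard words seen fewer than 5 times
]
print(" ".join(cmd))
```

In a real pipeline this list would be passed to subprocess.run(cmd) once the word2vec binary is built and the training sample file exists.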
- S2 receiving an input search keyword, obtaining, by the training model, a related word of the search keyword, and a similarity between each related word and the search keyword;
- the similarity of two word vectors refers to their cosine similarity, at most 1 and at least 0. Since the training model is trained on the search library, the related words obtained from it reflect the wording characteristics of the search library well. Specifically, the related words and similarities can be generated by the ./distance vectors.bin command, automated with an sh script and an expect script.
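The ./distance tool essentially ranks every vocabulary word by cosine similarity to the query word. A sketch of that behavior over a tiny embedding table; the table and its values are invented for illustration, not real trained output:

```python
import math

# hypothetical, hand-made embedding table standing in for vectors.bin
embeddings = {
    "java":    [0.9, 0.1, 0.0],
    "jvm":     [0.8, 0.2, 0.1],
    "spring":  [0.7, 0.3, 0.0],
    "weather": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def related_words(keyword, top_n=3):
    # rank all other vocabulary words by cosine similarity to the keyword
    query = embeddings[keyword]
    scored = [(w, cosine(query, v)) for w, v in embeddings.items() if w != keyword]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_n]

print(related_words("java"))  # "jvm" and "spring" rank above "weather"
```

The (word, similarity) pairs returned here are exactly the inputs the next step needs for similarity-weighted matching.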
- the related words obtained in the above steps are used to search and match each file in the search library, yielding the matching result of each file against the related words; the similarity corresponding to each related word is then used as a cumulative weighting value and combined with the matching results to obtain the matching score of each file against the related words.
- a score threshold can be set so that only search results whose matching scores exceed the threshold are sorted and output, ranked by matching score from high to low. Further screening the search results with a score threshold makes it easier for the user to review them.
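The similarity-weighted accumulation and threshold screening described above can be sketched as follows; the related-word list, file contents, and threshold value are illustrative assumptions:

```python
# (word, similarity) pairs as produced by the related-word step (assumed values)
related = [("java", 1.0), ("jvm", 0.98), ("spring", 0.96)]

# hypothetical search library: file name -> cleaned text
files = {
    "resume_a.txt": "senior java developer, jvm tuning, spring framework",
    "resume_b.txt": "java teacher",
    "resume_c.txt": "pastry chef",
}

def match_scores(files, related, threshold=0.5):
    scores = {}
    for name, text in files.items():
        # each related word contributes its similarity once per occurrence,
        # i.e. the similarity acts as a cumulative weighting value
        score = sum(sim * text.count(word) for word, sim in related)
        # screen out files at or below the score threshold
        if score > threshold:
            scores[name] = score
    # sort the surviving files by matching score from high to low
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

print(match_scores(files, related))
```

Here resume_a accumulates weight from all three related words, resume_b only from the exact keyword, and resume_c is screened out by the threshold.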
- the word vector similarity-based retrieval method of the above embodiment performs word vector training on the search library to establish a corresponding training model; receives an input search keyword and obtains, through the training model, related words of the keyword and the similarity between each related word and the keyword; searches and matches the search library with the related words, counting the matching score of each file in the search library against the related words according to the similarity; and sorts the files in the search library by matching score from high to low, outputting the search result according to the sorted order.
- the training model is based on the search library training, it can reflect the characteristics of the search library well, which is beneficial to improve the search accuracy.
- because the keywords are represented in the form of word vectors, related words can also be searched and matched, which improves retrieval robustness.
- the word vector similarity-based retrieval system of the present embodiment includes a model training unit 210, a related word generation unit 220, a search matching unit 230, and a result output unit 240, detailed as follows:
- the model training unit 210 is configured to perform word vector training on the search library, and establish a training model corresponding to the search library;
- the word vector in this embodiment should have the following feature: related or similar words should be closer in distance; for example, the distance between "Mike" and "Microphone" will be much smaller than the distance between "Mike" and "Weather".
- the distance between vectors can be measured by the traditional Euclidean distance or by the cosine of the angle between them.
- the word vector may be a word vector represented by a Distributed Representation.
- the word vector represented by Distributed Representation is a low-dimensional real number vector.
- the general form of such a vector is [0.792, -0.177, -0.17, 0.109, -0.542, ...]; dimensions of 50 and 100 are most common.
- the model training unit 210 is further configured to perform pre-processing on each file in the search library before performing word vector training on the search library, and to store the pre-processed data of each file into a corresponding training sample file;
- word vector training is performed on the search library based on the training sample file.
- the pre-processing includes data cleaning and extracting data description.
- the data cleaning includes at least one of unified case, eliminating extra spaces, unified punctuation, and unified full-width format;
- extracting the data description includes word segmentation with a user dictionary added; specifically, a user dictionary may be added and segmentation performed with NLPIR (also known as ICTCLAS2013, a Chinese word segmentation system).
- word vector training can be performed on the training sample file with word2vec, with the training settings as follows:
- -negative indicates whether the negative sampling method is used (0 = not used, 1 = used);
- -hs indicates whether the hierarchical softmax (HS) method is used (0 = not used, 1 = used);
- -sample 1e-3 indicates that the down-sampling threshold is 10^-3;
- -binary indicates whether the output is a binary file (0 = not used, 1 = used);
- -min_count indicates the minimum frequency threshold, 5 by default.
- the related word generation unit 220 is configured to receive the input search keyword and obtain, through the training model, the related words of the search keyword and the similarity between each related word and the search keyword;
- the similarity of the two word vectors refers to the cosine similarity, and the highest can be 1, and the lowest can be 0. Since the training model is based on the search library training, the related words obtained based on the training model can well reflect the wording characteristics of the search library.
- the search matching unit 230 is configured to perform search matching on the search library by using the related words, and separately calculate matching scores of each file in the search library and the related words according to the similarity;
- the search matching unit 230 may specifically include a matching module and a statistics module.
- the matching module is configured to search and match each file in the search library using the related words, obtaining the matching result of each file against the related words.
- the statistics module is configured to use the similarity corresponding to each related word as a cumulative weighting value and, combining the matching results, obtain the matching score of each file against the related words.
- the result output unit 240 is configured to sort files in the search library according to the matching score from high to low, and output a search result according to the sorting result.
- a score threshold may also be set so that only search results whose matching scores exceed the threshold are sorted and output, ranked by matching score from high to low. Further screening the search results with a score threshold makes it easier for the user to review them.
- the division into functional modules above is merely an example; in practical applications, the functions may be allocated to different functional modules as required, for example according to the configuration of the corresponding hardware or the convenience of software implementation; that is, the internal structure of the word vector similarity-based retrieval system may be divided into different functional modules to complete all or part of the functions described above.
- each functional module may be integrated into one processing module, or each module may exist physically separately, or two or more modules may be integrated into one.
- the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
- the integrated modules, if implemented in the form of software functional modules and sold or used as separate products, may be stored in a computer readable storage medium.
- all or part of the steps may be implemented by a program instructing related hardware (a personal computer, a server, a network device, or the like).
- the program can be stored in a computer readable storage medium.
- when executed, the program may perform all or part of the steps of the method specified in any of the above embodiments.
- the foregoing storage medium may be any medium that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Applications Claiming Priority (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201511003865.4 | 2015-12-25 | 2015-12-25 | Retrieval method and system based on word vector similarity (published as CN105631009A) |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| WO2017107566A1 | 2017-06-29 |

Family ID: 56045942

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2016/098234 (WO2017107566A1) | Retrieval method and system based on word vector similarity | 2015-12-25 | 2016-09-06 |

Country Status (2)

| Country | Link |
|---|---|
| CN | CN105631009A |
| WO | WO2017107566A1 |
Application Events

- 2015-12-25: CN application CN201511003865.4 filed; published as CN105631009A (status: pending)
- 2016-09-06: PCT application PCT/CN2016/098234 filed as WO2017107566A1 (application filing)
Also Published As

| Publication Number | Publication Date |
|---|---|
| CN105631009A | 2016-06-01 |
Legal Events

- 121: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16877377; Country: EP; Kind code: A1)
- NENP: Non-entry into the national phase (Ref country code: DE)
- 122: PCT application non-entry in European phase (Ref document number: 16877377; Country: EP; Kind code: A1)
- 32PN: Public notification in the EP bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 15/11/2018))