CN111488429A - Short text clustering system based on search engine and short text clustering method thereof - Google Patents

Info

Publication number
CN111488429A
Authority
CN
China
Prior art keywords
search engine
data
text
similarity
short text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010194422.2A
Other languages
Chinese (zh)
Inventor
赵粉玉
徐鹏波
陈尚武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Xujian Science And Technology Co ltd
Original Assignee
Hangzhou Xujian Science And Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xujian Science And Technology Co ltd
Priority to CN202010194422.2A
Publication of CN111488429A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention provides a search-engine-based short text clustering system and a corresponding short text clustering method. The system comprises a data preprocessing module, a search engine data matching module, a short text similarity calculation module and a data processing module. The data preprocessing module preprocesses the text data according to actual business needs, the sample data being short texts. The invention addresses the drawbacks of current clustering approaches, namely slow computation, short text clustering quality that is hard to control, and the inability to cluster an individual text in real time: it can place a single text into its set of similar data in real time, and it achieves efficient clustering by exploiting the high-concurrency capability of a search engine database.

Description

Short text clustering system based on search engine and short text clustering method thereof
Technical Field
The invention relates to the technical field of big data processing, and in particular to a search-engine-based short text clustering system and a corresponding short text clustering method.
Background
Common short text clustering methods include partition-based algorithms, represented by k-means, and hierarchy-based algorithms, represented by hierarchical clustering. The k-means algorithm requires the number of clusters to be fixed in advance, and a suitable value is hard to choose when the data set is large. Hierarchical algorithms require a stopping condition for splitting to be defined, and they are slow to compute. Unsupervised short text clustering performs poorly when the data contain a large amount of noise, and it cannot place an individual data item into its set of similar data in real time.
Disclosure of Invention
To solve the above technical problems, the invention provides a search-engine-based short text clustering system and a corresponding short text clustering method. Various unsupervised short text clustering methods, such as k-means and DBSCAN, are increasingly mature. However, current clustering approaches are slow to compute, give short text clustering results that are hard to control, and cannot cluster an individual text in real time.
To achieve this aim, the invention provides a search-engine-based short text clustering system comprising a data preprocessing module (1), a search engine data matching module (2), a short text similarity calculation module (3) and a data processing module (4).
Short text generally refers to text of limited length, in principle no more than 300 characters, such as a microblog post, a news topic, an opinion comment, a mobile phone short message or a document abstract.
The data preprocessing module (1) preprocesses the text data according to actual business needs, the sample data being short texts.
The search engine data matching module (2) runs a fuzzy search of the database, according to the applicable rules, on the text processed by the data preprocessing module (1) and returns the first n results. The rules can be customized for the business at hand. For example, when clustering news, the places mentioned in a text can be extracted and stored in a place field in the database; at retrieval time a multi-field fuzzy search over the text and the place returns data whose place and text are both similar, which can improve clustering accuracy.
The short text similarity calculation module (3) uses a short text similarity calculation method to compute the similarity between the input text and each sentence returned by the search engine data matching module (2).
The data processing module (4) places data whose similarity exceeds a given threshold into the corresponding field of the search engine table according to the rules.
The invention also provides a search-engine-based short text clustering method comprising the following steps:
Step (1): the data preprocessing module (1) processes the input text data. It removes stop words from the short text, that is, words that carry no real meaning (such as "yes" and the like), removes format marks and garbled characters, and, depending on the actual situation, optionally removes English text, digits, emoticons and any special stop words defined by the application.
For example, after processing, the sentence '# how the Shanxi Daizhongda unclear gas # goes back, unclear gas appears around the Shanxi Daizda colorless and odorless, all people choking to start to cough, @ Shanxi environmental protection public sentiment gateway department surveys Shanzhong and Shanxi agriculture university' becomes 'Shanxi Daizhongda unclear gas around the Shanxi Daizda colorless and odorless choking all people starts to cough department surveys Shanzhong and Shanxi agriculture university'.
Step (2): the search engine data matching module (2) selects a search engine database such as Elasticsearch or Solr. When the search engine (Elasticsearch, for example) handles a full-text search, it first analyzes the query string and then builds the query from the resulting word segments. The search result is a result set sorted from highest to lowest score; in general, the higher the score, the more similar the two sentences. The tokenizer in the search engine is set to ik_smart, ik_max_word or another Chinese tokenizer, and the stop words and dictionaries in the search engine are added to, deleted from and modified as required.
The short text is fuzzily searched in the search engine database, either by a direct search or by a cURL command, and the first n similar sentences are returned. The value of n can be tuned according to the final results; a value of about 3 is suggested, in which case the 3 records returned are the 3 records in the database most similar to the short text.
Step (3): the short text similarity calculation module (3) segments each of the first n sentences from the search engine data matching module (2) into words and removes the noise, converts the words into a list of word vectors using a word vector space model, and computes the similarity between short sentences as the cosine similarity between the vectors.
The word vector space model is obtained by segmenting a Wikipedia corpus or another large corpus with a Chinese word segmentation tool, removing stop words, and then training with word2vec from the gensim toolkit; the model represents each word as a vector.
Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. The closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are.
The cosine between two vectors is obtained with the following formula, where A and B are the word vector lists produced from the two sentences by the word vector space model:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\,\sqrt{\sum_{i} B_i^{2}}}$$
Step (4): the data processing module (4) defines a similar-data-set field in the database and, using the formula from the short text similarity calculation module (3), loops over the returned match list, computing the similarity between the sentence being matched and each sentence in the list. If the similarity is below the set threshold, the sentence is added directly to the database. If the similarity is above the threshold and the sentence is shorter than the similar sentence, it is added to the similar-data-set field of that similar sentence. If the similarity is above the threshold and the sentence is longer than the similar sentence, it replaces the similar sentence, and the similar sentence is added to the sentence's similar-data-set field.
The threshold is generally set to 0.8. If the first returned record is similar and the second is similar as well, the second record is added to the similar-data-set field of the first and then deleted; subsequent sentences are handled in the same way.
That is, if the list of sentences similar to the sentence being matched, s, is [a, b, c], then s goes through the following operations with each sentence in the list (taking a as the example). The similarity between s and a is computed. If it is below the threshold, s is added directly to the database. If it is above the threshold and s is shorter than a, s is added to the similar field of a. If it is above the threshold and s is longer than a, a is replaced by s and a is added to the similar field of the replacement. Here s, a, b and c are each a single short text.
The word segmentation tools referred to herein include, but are not limited to, LTP, NLPIR, jieba and HanLP, each configured with a custom dictionary and a stop-word dictionary as required.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
The invention addresses the drawbacks of current clustering approaches (slow computation, short text clustering quality that is hard to control, and the inability to cluster an individual text in real time): it can place a single text into its set of similar data in real time, and it achieves efficient clustering by exploiting the high-concurrency capability of a search engine database.
Drawings
FIG. 1 is a block diagram of a short text clustering system based on a search engine according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1, the invention provides a specific embodiment of a search-engine-based short text clustering system, which comprises a data preprocessing module (1), a search engine data matching module (2), a short text similarity calculation module (3) and a data processing module (4).
Short text generally refers to text of limited length, in principle no more than 300 characters, such as a microblog post, a news topic, an opinion comment, a mobile phone short message or a document abstract.
The data preprocessing module (1) preprocesses the text data according to actual business needs, the sample data being short texts.
The search engine data matching module (2) runs a fuzzy search of the database, according to the applicable rules, on the text processed by the data preprocessing module (1) and returns the first n results. The rules can be customized for the business at hand. For example, when clustering news, the places mentioned in a text can be extracted and stored in a place field in the database; at retrieval time a multi-field fuzzy search over the text and the place returns data whose place and text are both similar, which can improve clustering accuracy.
The short text similarity calculation module (3) uses a short text similarity calculation method to compute the similarity between the input text and each sentence returned by the search engine data matching module (2).
The data processing module (4) places data whose similarity exceeds a given threshold into the corresponding field of the search engine table according to the rules.
As shown in FIG. 1, the invention further provides a specific embodiment of a search-engine-based short text clustering method, which comprises the following steps:
Step (1): the data preprocessing module (1) processes the input text data. It removes stop words from the short text, that is, words that carry no real meaning (such as "yes" and the like), removes format marks and garbled characters, and, depending on the actual situation, optionally removes English text, digits, emoticons and any special stop words defined by the application.
For example, after processing, the sentence '# how the Shanxi Daizhongda unclear gas # goes back, unclear gas appears around the Shanxi Daizda colorless and odorless, all people choking to start to cough, @ Shanxi environmental protection public sentiment gateway department surveys Shanzhong and Shanxi agriculture university' becomes 'Shanxi Daizhongda unclear gas around the Shanxi Daizda colorless and odorless choking all people starts to cough department surveys Shanzhong and Shanxi agriculture university'.
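The cleaning in step (1) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the stop-word set, the whitespace tokenizer and the function name `preprocess` are assumptions made for illustration, and Chinese text would need a word segmentation tool such as jieba passed in as `tokenize`.

```python
import re

# Hypothetical stop-word set; a real system would load a domain dictionary.
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}


def preprocess(text, stop_words=STOP_WORDS, tokenize=str.split):
    """Remove format marks, symbols and stop words from a short text."""
    text = re.sub(r"@\S+", " ", text)      # drop @mentions
    text = text.replace("#", " ")          # drop hashtag markers, keep the topic words
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation and stray symbols
    tokens = [t for t in tokenize(text) if t.lower() not in stop_words]
    return " ".join(tokens)
```

With the stop-word set above, `preprocess("#gas leak# What is the smell? @EnvNews please check!")` returns `"gas leak What smell please check"`.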
Step (2): the search engine data matching module (2) selects a search engine database such as Elasticsearch or Solr. When the search engine (Elasticsearch, for example) handles a full-text search, it first analyzes the query string and then builds the query from the resulting word segments. The search result is a result set sorted from highest to lowest score; in general, the higher the score, the more similar the two sentences. The tokenizer in the search engine is set to ik_smart, ik_max_word or another Chinese tokenizer, and the stop words and dictionaries in the search engine are added to, deleted from and modified as required.
The short text is fuzzily searched in the search engine database, either by a direct search or by a cURL command, and the first n similar sentences are returned. The value of n can be tuned according to the final results; a value of about 3 is suggested, in which case the 3 records returned are the 3 records in the database most similar to the short text.
Step (3): the short text similarity calculation module (3) segments each of the first n sentences from the search engine data matching module (2) into words and removes the noise, converts the words into a list of word vectors using a word vector space model, and computes the similarity between short sentences as the cosine similarity between the vectors.
The word vector space model is obtained by segmenting a Wikipedia corpus or another large corpus with a Chinese word segmentation tool, removing stop words, and then training with word2vec from the gensim toolkit; the model represents each word as a vector.
Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. The closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are.
The cosine between two vectors is obtained with the following formula, where A and B are the word vector lists produced from the two sentences by the word vector space model:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\,\sqrt{\sum_{i} B_i^{2}}}$$
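As a minimal sketch of step (3), the following Python code computes the cosine similarity of two sentence vectors. The text does not say how a list of word vectors is reduced to one vector per sentence; averaging the word vectors, as `mean_vector` does here, is one common choice and is an assumption, as are the function names.

```python
import math


def mean_vector(word_vectors):
    """Average a non-empty list of equal-length word vectors (assumed pooling)."""
    count = len(word_vectors)
    return [sum(column) / count for column in zip(*word_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two vectors pointing in the same direction, such as `[1.0, 2.0, 3.0]` and `[2.0, 4.0, 6.0]`, give a value of 1 up to floating-point rounding, while orthogonal vectors give 0.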
Step (4): the data processing module (4) defines a similar-data-set field in the database and, using the formula from the short text similarity calculation module (3), loops over the returned match list, computing the similarity between the sentence being matched and each sentence in the list. If the similarity is below the set threshold, the sentence is added directly to the database. If the similarity is above the threshold and the sentence is shorter than the similar sentence, it is added to the similar-data-set field of that similar sentence. If the similarity is above the threshold and the sentence is longer than the similar sentence, it replaces the similar sentence, and the similar sentence is added to the sentence's similar-data-set field.
The threshold is generally set to 0.8. If the first returned record is similar and the second is similar as well, the second record is added to the similar-data-set field of the first and then deleted; subsequent sentences are handled in the same way.
That is, if the list of sentences similar to the sentence being matched, s, is [a, b, c], then s goes through the following operations with each sentence in the list (taking a as the example). The similarity between s and a is computed. If it is below the threshold, s is added directly to the database. If it is above the threshold and s is shorter than a, s is added to the similar field of a. If it is above the threshold and s is longer than a, a is replaced by s and a is added to the similar field of the replacement. Here s, a, b and c are each a single short text.
The word segmentation tools referred to herein include, but are not limited to, LTP, NLPIR, jieba and HanLP, each configured with a custom dictionary and a stop-word dictionary as required.
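The decision rule of step (4) can be sketched as follows. This is one illustrative reading of the text, not the patent's code: records are modelled as in-memory dictionaries with hypothetical `text` and `similar` keys, the `similarity` callable stands in for the cosine computation, texts of equal length are treated like shorter ones, and a new text is inserted as its own record only when no candidate exceeds the threshold.

```python
def cluster_one(s, candidates, database, similarity, threshold=0.8):
    """Place short text s using the candidate records returned by the search step.

    candidates and database hold dicts {"text": str, "similar": list of str};
    the record that ends up containing s is returned.
    """
    home = None  # first candidate found to be similar to s
    for rec in candidates:
        if similarity(s, rec["text"]) <= threshold:
            continue  # below threshold: leave this record untouched
        if home is None:
            home = rec
            if len(s) > len(rec["text"]):
                # s is longer: it replaces the record's text, old text joins the set
                rec["similar"].append(rec["text"])
                rec["text"] = s
            else:
                rec["similar"].append(s)
        else:
            # a later similar record is folded into the first one and deleted
            home["similar"] += [rec["text"], *rec["similar"]]
            database[:] = [r for r in database if r is not rec]
    if home is None:
        home = {"text": s, "similar": []}
        database.append(home)
    return home
```

In a real deployment the mutations on `database` would be update and delete requests against the search engine index rather than list operations.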
For example, suppose the existing database is as follows:
[Table image not reproduced: the existing database, with id, text and similar data set fields]
The short text is 'explosion accident of extra large gas in the west two mining areas of xx coal mines'; after text preprocessing it becomes 'the gas explosion accident in the west two mining areas of xx coal mines'. Fuzzy matching of the short text against the text field of the table returns the two records with id 1 and id 3, and the similarity between the short text and the text field of each is computed. The similarity with the record with id 1 is above the threshold and the short text has fewer characters, so the short text is added to the similar data set of id 1. The similarity with the record with id 3 is below the threshold, so nothing is done. The record with id 1 is updated, giving the following result:
[Table image not reproduced: the database after the record with id 1 is updated]
The short text clustering method mainly consists of operating Elasticsearch and computing the similarity between texts. Elasticsearch is a distributed, RESTful search and data analytics engine that supports near-real-time processing of massive data, and the similarity between texts is computed in milliseconds, so short text clustering is efficient.
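The retrieval performed against Elasticsearch can be sketched as building a query body for its `_search` endpoint. The sketch below only constructs the JSON body and sends nothing. The field names `text` and `place`, the function name and the default n = 3 are illustrative assumptions; using plain `match` queries combined with `bool`/`must` is one possible way to express the multi-field fuzzy search, not necessarily the one the inventors used.

```python
import json


def build_search_body(text, n=3, place=None):
    """Return a query-DSL body asking for the n best full-text matches."""
    must = [{"match": {"text": text}}]            # full-text match on the text field
    if place:
        must.append({"match": {"place": place}})  # optional second field, e.g. for news
    return {"size": n, "query": {"bool": {"must": must}}}


if __name__ == "__main__":
    body = build_search_body("gas explosion accident", place="xx coal mine")
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

The resulting body could be posted to the index's `_search` endpoint with a cURL command or an Elasticsearch client; the host and index name depend on the deployment.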
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The principle and embodiments of the invention have been described herein by way of specific examples, which are provided only to help in understanding the method and core idea of the invention. The above is only a preferred embodiment. Because written description is necessarily limited while the number of possible concrete structures is effectively unlimited, those skilled in the art can make various modifications, refinements or changes, and can combine the above technical features in suitable ways, without departing from the principle of the invention; such modifications, variations and combinations, as well as applications of the invention to other uses and embodiments, remain within its spirit and scope as defined by the claims.

Claims (5)

1. A short text clustering system based on a search engine is characterized by comprising a data preprocessing module (1), a search engine data matching module (2), a short text similarity calculation module (3) and a data processing module (4);
the data preprocessing module (1) preprocesses the text data according to actual business needs, the sample data being short texts;
the search engine data matching module (2) runs a fuzzy search of the database on the text processed by the data preprocessing module (1) and returns the first n results;
the short text similarity calculation module (3) uses a short text similarity calculation method to compute the similarity between the input text and each sentence returned by the search engine data matching module (2);
the data processing module (4) places data whose similarity exceeds a given threshold into the corresponding field of the search engine table.
2. The search-engine-based short text clustering system according to claim 1, wherein the short text is text of no more than 300 characters, such as a microblog post, a news topic, an opinion comment, a short message or a document abstract.
3. A short text clustering method based on a search engine is characterized by comprising the following processes:
step (1), the data preprocessing module (1) processes the input text data and removes stop words from the short text;
step (2), the search engine data matching module (2) selects a search engine database, and the search engine is used to process the full-text search;
step (3), the short text similarity calculation module (3) segments each of the first n sentences from the search engine data matching module (2) into words and removes the noise, converts the words into a list of word vectors using a word vector space model, and computes the similarity between short sentences as the cosine similarity between the vectors;
step (4), the data processing module (4) defines a similar-data-set field in the database and, using the formula from the short text similarity calculation module (3), loops over the returned match list, computing the similarity between the sentence being matched and each sentence in the list; if the similarity is below the set threshold, the sentence is added directly to the database; if the similarity is above the set threshold and the sentence is shorter than the similar sentence, it is added to the similar-data-set field of that similar sentence; and if the similarity is above the set threshold and the sentence is longer than the similar sentence, it replaces the similar sentence, and the similar sentence is added to the sentence's similar-data-set field.
4. The search-engine-based short text clustering method according to claim 3, wherein in step (2), when the search engine processes the full-text search, it first analyzes the query string and then builds the query from the resulting word segments; the search result is a result set sorted from highest to lowest score, and the higher the score, the more similar the two sentences; the tokenizer in the search engine is set to ik_smart, ik_max_word or another Chinese tokenizer, and the stop words and dictionaries in the search engine are added to, deleted from and modified as required;
the short text is fuzzily searched in the search engine database, either by a direct search or by a cURL command, and the first n similar sentences are returned, n being adjusted according to the final results.
5. The search-engine-based short text clustering method according to claim 3, wherein in step (3), the word vector space model is obtained by segmenting a Wikipedia corpus or another large corpus with a Chinese word segmentation tool, removing stop words, and then training with word2vec from the gensim toolkit, the word vector space model representing each word as a vector;
cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals; the closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are; the cosine between two vectors is obtained with the following formula, where A and B are the word vector lists produced from the two sentences by the word vector space model:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\,\sqrt{\sum_{i} B_i^{2}}}$$

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010194422.2A CN111488429A (en) 2020-03-19 2020-03-19 Short text clustering system based on search engine and short text clustering method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010194422.2A CN111488429A (en) 2020-03-19 2020-03-19 Short text clustering system based on search engine and short text clustering method thereof

Publications (1)

Publication Number Publication Date
CN111488429A true CN111488429A (en) 2020-08-04

Family

ID=71794537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010194422.2A Pending CN111488429A (en) 2020-03-19 2020-03-19 Short text clustering system based on search engine and short text clustering method thereof

Country Status (1)

Country Link
CN (1) CN111488429A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257431A (en) * 2020-10-30 2021-01-22 中电万维信息技术有限责任公司 NLP-based short text data processing method
CN112380348A (en) * 2020-11-25 2021-02-19 中信百信银行股份有限公司 Metadata processing method and device, electronic equipment and computer-readable storage medium
CN112667809A (en) * 2020-12-25 2021-04-16 平安科技(深圳)有限公司 Text processing method and device, electronic equipment and storage medium
CN113138982A (en) * 2021-05-25 2021-07-20 黄柱挺 Big data cleaning method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101075251A (en) * 2007-06-18 2007-11-21 中国电子科技集团公司第五十四研究所 Method for searching file based on data excavation
CN106383877A (en) * 2016-09-12 2017-02-08 电子科技大学 On-line short text clustering and topic detection method of social media
CN106844786A (en) * 2016-12-08 2017-06-13 中国电子科技网络信息安全有限公司 A kind of public sentiment region focus based on text similarity finds method
CN107943762A (en) * 2017-11-24 2018-04-20 四川长虹电器股份有限公司 A kind of text similarity sort method based on ES search
CN109101479A (en) * 2018-06-07 2018-12-28 苏宁易购集团股份有限公司 A kind of clustering method and device for Chinese sentence
CN109960799A (en) * 2019-03-12 2019-07-02 中南大学 A kind of Optimum Classification method towards short text
CN110046260A (en) * 2019-04-16 2019-07-23 广州大学 A kind of darknet topic discovery method and system of knowledge based map
WO2019196314A1 (en) * 2018-04-10 2019-10-17 平安科技(深圳)有限公司 Text information similarity matching method and apparatus, computer device, and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101075251A (en) * 2007-06-18 2007-11-21 中国电子科技集团公司第五十四研究所 Method for searching file based on data excavation
CN106383877A (en) * 2016-09-12 2017-02-08 电子科技大学 On-line short text clustering and topic detection method of social media
CN106844786A (en) * 2016-12-08 2017-06-13 中国电子科技网络信息安全有限公司 A kind of public sentiment region focus based on text similarity finds method
CN107943762A (en) * 2017-11-24 2018-04-20 四川长虹电器股份有限公司 A kind of text similarity sort method based on ES search
WO2019196314A1 (en) * 2018-04-10 2019-10-17 平安科技(深圳)有限公司 Text information similarity matching method and apparatus, computer device, and storage medium
CN109101479A (en) * 2018-06-07 2018-12-28 苏宁易购集团股份有限公司 A kind of clustering method and device for Chinese sentence
CN109960799A (en) * 2019-03-12 2019-07-02 中南大学 A kind of Optimum Classification method towards short text
CN110046260A (en) * 2019-04-16 2019-07-23 广州大学 A kind of darknet topic discovery method and system of knowledge based map

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257431A (en) * 2020-10-30 2021-01-22 中电万维信息技术有限责任公司 NLP-based short text data processing method
CN112380348A (en) * 2020-11-25 2021-02-19 中信百信银行股份有限公司 Metadata processing method and device, electronic equipment and computer-readable storage medium
CN112380348B (en) * 2020-11-25 2024-03-26 中信百信银行股份有限公司 Metadata processing method, apparatus, electronic device and computer readable storage medium
CN112667809A (en) * 2020-12-25 2021-04-16 平安科技(深圳)有限公司 Text processing method and device, electronic equipment and storage medium
CN113138982A (en) * 2021-05-25 2021-07-20 黄柱挺 Big data cleaning method
CN113138982B (en) * 2021-05-25 2022-09-27 深圳市元宇宙科技有限公司 Big data cleaning method

Similar Documents

Publication Publication Date Title
CN105824959B (en) Public opinion monitoring method and system
CN108287858B (en) Semantic extraction method and device for natural language
CN107451126B (en) Method and system for screening similar meaning words
CN108491462B (en) Semantic query expansion method and device based on word2vec
CN109543178B (en) Method and system for constructing judicial text label system
CN109508414B (en) Synonym mining method and device
CN111488429A (en) Short text clustering system based on search engine and short text clustering method thereof
CN108268668B (en) Topic diversity-based text data viewpoint abstract mining method
CN106708893A (en) Error correction method and device for search query term
CN102081602B (en) Method and equipment for determining category of unlisted word
JP2017511922A (en) Method, system, and storage medium for realizing smart question answer
CN102214189B (en) Data mining-based word usage knowledge acquisition system and method
CN105224640A (en) A kind of method and apparatus extracting viewpoint
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
CN106446018B (en) Query information processing method and device based on artificial intelligence
CN109325124B (en) Emotion classification method, device, server and storage medium
CN101923556B (en) Method and device for searching webpages according to sentence serial numbers
CN112948543A (en) Multi-language multi-document abstract extraction method based on weighted TextRank
CN106570120A (en) Process for realizing searching engine optimization through improved keyword optimization
CN110674378A (en) Chinese semantic recognition method based on cosine similarity and minimum editing distance
CN104281565A (en) Semantic dictionary constructing method and device
CN107577713B (en) Text handling method based on electric power dictionary
JP2020027548A (en) Program, device and method for creating dialog scenario corresponding to character attribute
CN104346382A (en) Text analysis system and method employing language query
JPH05120345A (en) Keyword extracting device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination