CN109446516A - Data processing method and system based on theme recommendation model - Google Patents

Data processing method and system based on theme recommendation model

Info

Publication number
CN109446516A
CN109446516A
Authority
CN
China
Prior art keywords
theme
document
term
data
distribution information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811142853.3A
Other languages
Chinese (zh)
Other versions
CN109446516B (en)
Inventor
王军平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Cyberbas Data Technology Co Ltd
Original Assignee
Beijing Cyberbas Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Cyberbas Data Technology Co Ltd filed Critical Beijing Cyberbas Data Technology Co Ltd
Priority to CN201811142853.3A priority Critical patent/CN109446516B/en
Publication of CN109446516A publication Critical patent/CN109446516A/en
Application granted granted Critical
Publication of CN109446516B publication Critical patent/CN109446516B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

The present invention provides a data processing method and system based on a theme recommendation model. The method includes: obtaining a document training sample set, the document training sample set containing multiple sample documents; generating document-term distribution information based on the document training sample set; training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information; receiving data to be processed and predicting its corresponding theme from the trained document-theme distribution information; and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed. The technical solution provided by the present application can improve the accuracy of judging data association.

Description

Data processing method and system based on theme recommendation model
Technical field
The present invention relates to the technical field of data processing, and in particular to a data processing method and system based on a theme recommendation model.
Background art
Currently, when judging whether two documents are related, the judgment is usually made by comparing the number of identical or similar words appearing in the two documents. In some cases, however, two related documents may share no identical or similar words at all. For example, document 1 is "Steve Jobs has left us" and document 2 is "The iPhone may drop in price". Taken literally, the two documents share no identical or similar words, yet analyzed semantically they are in fact related.
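The limitation described above can be sketched with a hypothetical literal-overlap measure (a Jaccard-style comparison, not part of the patent), which scores the two related example documents as entirely unrelated:

```python
def word_overlap(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity over whitespace-tokenized, lower-cased words."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Two semantically related documents with no shared vocabulary score zero:
doc1 = "Steve Jobs has left us"
doc2 = "Apple iPhone prices may fall"
print(word_overlap(doc1, doc2))  # → 0.0
```

A theme-level comparison, as proposed below, is meant to catch exactly the relatedness this measure misses.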
Therefore, existing methods for judging whether two pieces of data are associated suffer from considerable misjudgment.
Summary of the invention
The purpose of the present application is to provide a data processing method and system based on a theme recommendation model, which can improve the accuracy of judging data association.
To achieve the above object, the application provides a data processing method based on a theme recommendation model. The method includes: obtaining a document training sample set, the document training sample set containing multiple sample documents; generating document-term distribution information based on the document training sample set; training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information; receiving data to be processed and predicting the theme corresponding to the data to be processed from the trained document-theme distribution information; and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed.
Further, the document training sample set is obtained as follows:
document data is collected by a search engine, and noise data in the collected document data is cleaned. Specifically:
A clean database is established for storing noise-free clean data. Text data to be cleaned is obtained and preprocessed into structured data, the structured data forming the set of words of the text data. Specifically: the data to be cleaned is segmented into words, and all words are converted to a unified encoding form; the data in unified encoding form is checked against a data dictionary to eliminate inconsistent data, yielding standardized data; a consistency check is performed on the standardized data, and obvious errors in the content are corrected; identical words are deduplicated to obtain the structured data.
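A minimal sketch of this preprocessing step, under stated assumptions: whitespace splitting stands in for the patent's word segmentation, Unicode NFKC normalization stands in for the unified encoding form, and the `data_dictionary` mapping is a hypothetical example:

```python
import unicodedata

def preprocess(raw_text: str, data_dictionary: dict) -> list:
    """Segment, normalize encoding, map variants to canonical forms via a
    data dictionary, and deduplicate -- yielding the structured word set."""
    # Segment (whitespace split stands in for a real word segmenter)
    words = raw_text.split()
    # Convert all words to a unified (NFKC-normalized, lower-case) form
    words = [unicodedata.normalize("NFKC", w).lower() for w in words]
    # Eliminate inconsistent variants according to the data dictionary
    words = [data_dictionary.get(w, w) for w in words]
    # Deduplicate identical words, preserving first-seen order
    seen, structured = set(), []
    for w in words:
        if w not in seen:
            seen.add(w)
            structured.append(w)
    return structured

print(preprocess("Colour colour color", {"colour": "color"}))  # → ['color']
```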
The semantic similarity of every two words is obtained. Specifically: the concepts expressed by each word, and the sememes describing each concept, are obtained; for any two independent words, the similarities between the sememes under each of their concepts are computed, the similarity of two sememes being measured by their semantic distance; the largest and smallest sememe similarities between two concepts are found, and the similarity between the two concepts is the mean of the largest and smallest sememe similarities; the maximum concept similarity between the two words is found and taken as the semantic similarity of the two words.
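The sememe-based similarity above can be sketched as follows. The conversion from semantic distance to similarity (`alpha / (alpha + distance)`) is a common choice for HowNet-style sememe hierarchies and is an assumption here — the patent only states that two sememes' similarity is measured by their semantic distance:

```python
def sememe_similarity(dist: float, alpha: float = 1.6) -> float:
    """Similarity of two sememes from their semantic distance in the
    sememe hierarchy: sim = alpha / (alpha + distance). (Assumed form.)"""
    return alpha / (alpha + dist)

def concept_similarity(sememes_a, sememes_b, distance) -> float:
    """Mean of the largest and smallest pairwise sememe similarities
    between the sememe sets describing two concepts."""
    sims = [sememe_similarity(distance(x, y)) for x in sememes_a for y in sememes_b]
    return (max(sims) + min(sims)) / 2.0

def word_similarity(concepts_a, concepts_b, distance) -> float:
    """Maximum concept similarity over all concept pairs of the two words."""
    return max(concept_similarity(ca, cb, distance)
               for ca in concepts_a for cb in concepts_b)
```

Here `distance` is any sememe-distance function; the toy usage in the test treats sememes as integers with absolute difference as distance.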
Using the semantic similarity of two words as the distance metric, the words are automatically clustered with the K-means algorithm to identify noise data. Specifically: K words are selected at random as centroids and a similarity threshold is set; for each remaining word, its distance to each centroid is measured and the word is assigned to the class of its nearest centroid; the centroid of each resulting class is recomputed; it is then judged whether the distance between each new centroid and the old centroid is at or below the similarity threshold, and if so, the remaining data that are far from every centroid and cannot be assigned to any centroid's class are noise data.
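A sketch of this clustering step. Because words cannot be averaged, centers are recomputed here as medoids (the member minimizing total in-cluster distance), which is an adaptation rather than textbook K-means; `distance` would be derived from the semantic similarity above (e.g. 1 minus similarity):

```python
import random

def cluster_noise(words, distance, k, threshold, max_iter=20):
    """Pick K words as centers, assign each word to its nearest center,
    recompute centers as medoids, and flag as noise any word farther
    than `threshold` from every final center."""
    centers = random.sample(words, k)
    for _ in range(max_iter):
        clusters = {c: [] for c in centers}
        for w in words:
            nearest = min(centers, key=lambda c: distance(w, c))
            clusters[nearest].append(w)
        # Recompute each center as the member minimizing total distance
        new_centers = [
            min(members, key=lambda m: sum(distance(m, x) for x in members))
            for members in clusters.values() if members
        ]
        if set(new_centers) == set(centers):
            break
        centers = new_centers
    noise = [w for w in words if min(distance(w, c) for c in centers) > threshold]
    return centers, noise
```

The toy usage in the test clusters integers by absolute difference; the outlier 100 cannot be assigned to any center within the threshold and is flagged as noise.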
The semantic ontology causing the noise is found in the noise data and corrected, so as to obtain clean data, and the clean data is stored in the clean database. Specifically: a noise datum is taken, and it is judged whether some field of it deviates considerably from the cluster centroid and causes the anomaly; if so, that field is considered the semantic ontology causing the noise; if not, all fields of the noise datum are obtained, and the datum is re-clustered after each field is discarded in turn. If, after a field is discarded, the data point remains noise, the discarded field is considered a non-noise ontology; if, after a field is discarded, the data point no longer counts as noise, the discarded field is the semantic ontology causing the noise. The noise-causing semantic ontology is then removed, the datum is clustered again and assigned to the class of its nearest centroid, the data values of the corresponding ontology attribute of the original words in that centroid's class are averaged, and the average is taken as the ontology attribute of the noise datum, which is then considered corrected into clean data.
The above steps are repeated until the cleaning of noise data in the text data is complete.
When the collected document data includes multiple themes, cleaning the noise data in the collected document data further includes cleaning the collected document data as follows:
A data cleansing rule file is configured. The data cleansing rule file includes at least one data cleansing rule, and each data cleansing rule includes a data table name, data cleansing rule pseudocode, and a rule number.
Data cleansing code is generated from the data cleansing rule file. This includes: obtaining from the data cleansing rule file the data cleansing rules corresponding to the table name of the data table to be cleaned, and generating a temporary file; reading the first data cleansing rule of the temporary file, and generating cleaning code for that rule using the condition part of its pseudocode as the judgment condition; traversing all data cleansing rules in the temporary file, generating corresponding cleaning code for each rule, and combining them into the complete cleaning code of the data table to be cleaned.
The data cleansing code is executed to tag the data to be cleaned. This includes: reading a record of the data table to be cleaned and setting an initial label value for it; every time the record triggers a data cleansing rule, increasing its label value by 2^n, where n is the rule number of that data cleansing rule; traversing every data cleansing rule corresponding to the table name of the data table to be cleaned; and traversing every record in the data table to be cleaned, tagging each record to be cleaned.
The labels are parsed and the dirty data is cleaned. This includes: ANDing the label value with 2^n for each n; if the result is 2^n itself, the record corresponding to that label value triggered data cleansing rule n, otherwise it did not trigger rule n, where n is the rule number of the data cleansing rule.
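The 2^n tagging-and-parsing scheme above can be sketched as follows (the two rule predicates are hypothetical examples, not from the patent):

```python
def tag_record(record, rules):
    """Give a record a label whose bit n is set iff cleansing rule n fires."""
    label = 0
    for n, rule in rules.items():
        if rule(record):          # rule predicate triggers on dirty data
            label += 2 ** n       # equivalently: label |= 1 << n
    return label

def triggered_rules(label, rules):
    """Parse a label: rule n was triggered iff (label & 2**n) == 2**n."""
    return [n for n in rules if label & (2 ** n) == 2 ** n]

# Hypothetical rules: 0 -> missing name, 1 -> negative age
rules = {0: lambda r: not r.get("name"), 1: lambda r: r.get("age", 0) < 0}
label = tag_record({"name": "", "age": -5}, rules)
print(label, triggered_rules(label, rules))  # → 3 [0, 1]
```

One integer label thus records every rule a record triggered, and each rule can be recovered independently with a bitwise AND.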
Further, generating document-term distribution information based on the document training sample set includes: determining a target document in the document training sample set, and segmenting the text information in the target document to obtain multiple terms; counting in turn the ratio at which each term occurs in the target document, and taking the counted ratio of each term as the document-term distribution information of the target document.
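A minimal sketch of the counting step (whitespace splitting again stands in for the patent's word segmentation):

```python
from collections import Counter

def document_term_distribution(document: str) -> dict:
    """Ratio of each term's occurrence count to the document length."""
    terms = document.split()
    counts = Counter(terms)
    total = len(terms)
    return {term: c / total for term, c in counts.items()}

print(document_term_distribution("a b a c"))  # → {'a': 0.5, 'b': 0.25, 'c': 0.25}
```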
Further, the document-theme distribution information and the theme-term distribution information are obtained by training according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
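The decomposition above is the PLSA (probabilistic latent semantic analysis) mixture, which is conventionally fitted with expectation-maximization. The patent does not specify the fitting procedure, so the EM sketch below is an assumption:

```python
import numpy as np

def train_plsa(n_dw, K, iters=50, seed=0):
    """EM sketch for the decomposition P(w|d) = sum_k P(w|z_k) P(z_k|d).
    n_dw: (D, W) document-term count matrix. Returns (P(z|d), P(w|z))."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)  # P(z_k|d_i)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)  # P(w_j|z_k)
    for _ in range(iters):
        # E-step: responsibilities P(z|d,w), shape (D, W, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / joint.sum(2, keepdims=True).clip(1e-12)
        weighted = n_dw[:, :, None] * resp
        # M-step: re-estimate both distributions from weighted counts
        p_z_d = weighted.sum(1); p_z_d /= p_z_d.sum(1, keepdims=True).clip(1e-12)
        p_w_z = weighted.sum(0).T; p_w_z /= p_w_z.sum(1, keepdims=True).clip(1e-12)
    return p_z_d, p_w_z
```

The input n_dw would come from the document-term distribution information of the training samples; the two returned matrices are the document-theme and theme-term distribution information the method trains.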
Further, the method also includes: determining the generation probability of each term in a document according to the following formula:

P(d_i, ω_j) = P(d_i) · P(ω_j|d_i)

where P(d_i, ω_j) denotes the probability that the j-th term occurs in the i-th document, and P(d_i) denotes the probability that the i-th document occurs in the document training sample set.
Further, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information includes: predicting, from the document-theme distribution information, multiple themes corresponding to the data to be processed, each of the multiple themes being associated with a prediction probability value; and taking, from the multiple themes, the N themes with the largest prediction probability values as the themes corresponding to the data to be processed, where N is an integer greater than or equal to 1.
Further, judging whether the target data is associated with the data to be processed includes: calling the theme-term distribution information matching the predicted theme, and judging whether the terms in the target data satisfy the called theme-term distribution information; if satisfied, determining that the target data is associated with the data to be processed; if not satisfied, determining that the target data is unrelated to the data to be processed.
To achieve the above object, the application also provides a data processing system based on a theme recommendation model. The system includes: a sample set acquiring unit for obtaining a document training sample set, the document training sample set containing multiple sample documents; an information generating unit for generating document-term distribution information based on the document training sample set; an information training unit for training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information; and a data processing unit for receiving data to be processed, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information, and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed.
Further, the information training unit trains the document-theme distribution information and the theme-term distribution information according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
Further, the data processing unit includes: a theme prediction module for predicting, from the document-theme distribution information, multiple themes corresponding to the data to be processed, each of the multiple themes being associated with a prediction probability value; and a theme determining module for taking, from the multiple themes, the N themes with the largest prediction probability values as the themes corresponding to the data to be processed, where N is an integer greater than or equal to 1.
Further, the data processing unit includes: an information matching module for calling the theme-term distribution information matching the predicted theme, and judging whether the terms in the target data satisfy the called theme-term distribution information; and an information judging module for determining that the target data is associated with the data to be processed if satisfied, and that the target data is unrelated to the data to be processed if not satisfied.
As can be seen from the above, when identifying whether two pieces of data are related, the technical solution provided by the present application first generates document-term distribution information from a large number of training samples. From this known distribution information, document-theme distribution information and theme-term distribution information can then be trained, so that the content of a document is associated both with the terms it contains and with the themes it expresses. In this way, when subsequently judging whether two pieces of data are associated, both the terms contained in a document and the themes the document reflects can be taken into account, thereby improving the accuracy of judging data association.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description, claims, and drawings.
The technical solution of the present invention is described in further detail below through the drawings and embodiments.
Brief description of the drawings
The drawings are provided for a further understanding of the present invention, constitute a part of the specification, and together with the embodiments of the invention serve to explain the invention without limiting it. In the drawings:
Fig. 1 is a flowchart of the data processing method based on the theme recommendation model in an embodiment of the present invention;
Fig. 2 is a functional block diagram of the data processing system based on the theme recommendation model in an embodiment of the present invention.
Specific embodiment
The preferred embodiments of the present invention are described below with reference to the drawings. It should be understood that the preferred embodiments described herein are intended only to illustrate and explain the present invention, not to limit it.
Referring to Fig. 1, the application provides a data processing method based on a theme recommendation model. The method includes:
S1: obtain a document training sample set, the document training sample set containing multiple sample documents;
S2: generate document-term distribution information based on the document training sample set;
S3: train on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information;
S4: receive data to be processed, predict the theme corresponding to the data to be processed from the trained document-theme distribution information, and, when target data matching the predicted theme is received, judge from the theme-term distribution information whether the target data is associated with the data to be processed.
In the present embodiment, generating document-term distribution information based on the document training sample set includes:
determining a target document in the document training sample set, and segmenting the text information in the target document to obtain multiple terms;
counting in turn the ratio at which each term occurs in the target document, and taking the counted ratio of each term as the document-term distribution information of the target document.
In practical applications, the document-term distribution information of multiple documents can constitute the document-term distribution information of the whole, which characterizes the frequency at which terms occur in the documents.
In the present embodiment, the document-theme distribution information and the theme-term distribution information can be obtained by training according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
Further, the generation probability of each term in a document can also be determined according to the following formula:

P(d_i, ω_j) = P(d_i) · P(ω_j|d_i)

where P(d_i, ω_j) denotes the probability that the j-th term occurs in the i-th document, and P(d_i) denotes the probability that the i-th document occurs in the document training sample set.
In practical applications, after the document-theme distribution information and the theme-term distribution information are trained, a document can also be generated automatically based on the training results.
Specifically, the document-theme distribution information corresponding to a reference document can be determined according to the prior probability of the reference document (namely the above P(d_i)). A theme for generating the reference document can then be sampled from the document-theme distribution information. Based on that theme, a term distribution is determined from the corresponding theme-term distribution information, the term distribution is sampled, and a term is finally generated. In this way, the themes of the reference document are analyzed one by one, so as to generate another document associated with the reference document.
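The sampling procedure described above can be sketched as follows (the two-topic, two-term distributions in the usage line are toy values, not from the patent):

```python
import random

def generate_document(p_z_d, p_w_z, vocab, length, seed=0):
    """Sample a document: for each position, draw a theme z_k from the
    document-theme distribution, then draw a term from that theme's
    theme-term distribution."""
    rng = random.Random(seed)
    topics = list(range(len(p_z_d)))
    words = []
    for _ in range(length):
        k = rng.choices(topics, weights=p_z_d)[0]            # sample theme z_k
        words.append(rng.choices(vocab, weights=p_w_z[k])[0])  # sample term
    return words

doc = generate_document([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]], ["apple", "price"], 5)
```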
In the present embodiment, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information includes:
predicting, from the document-theme distribution information, multiple themes corresponding to the data to be processed, each of the multiple themes being associated with a prediction probability value;
taking, from the multiple themes, the N themes with the largest prediction probability values as the themes corresponding to the data to be processed, where N is an integer greater than or equal to 1.
Specifically, judging whether the target data is associated with the data to be processed includes:
calling the theme-term distribution information matching the predicted theme, and judging whether the terms in the target data satisfy the called theme-term distribution information;
if satisfied, determining that the target data is associated with the data to be processed; if not satisfied, determining that the target data is unrelated to the data to be processed.
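A sketch of the prediction and matching steps above. The `threshold` matching rule is a hypothetical interpretation of "the terms satisfy the theme-term distribution information", which the patent leaves unspecified:

```python
def top_n_topics(topic_probs: dict, n: int = 1) -> list:
    """Keep the N themes with the largest predicted probability values."""
    return sorted(topic_probs, key=topic_probs.get, reverse=True)[:n]

def is_associated(target_terms, p_w_z: dict, threshold: float = 0.01) -> bool:
    """Hypothetical matching rule: the target data is associated if every
    one of its terms appears with non-negligible probability in the
    predicted theme's theme-term distribution."""
    return all(p_w_z.get(t, 0.0) > threshold for t in target_terms)

probs = {"phones": 0.7, "fruit": 0.2, "cars": 0.1}
print(top_n_topics(probs, 2))  # → ['phones', 'fruit']
print(is_associated(["iphone", "price"], {"iphone": 0.3, "price": 0.1}))  # → True
```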
Referring to Fig. 2, the application also provides a data processing system based on a theme recommendation model. The system includes:
a sample set acquiring unit for obtaining a document training sample set, the document training sample set containing multiple sample documents;
an information generating unit for generating document-term distribution information based on the document training sample set;
an information training unit for training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information;
a data processing unit for receiving data to be processed, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information, and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed.
In the present embodiment, the information training unit trains the document-theme distribution information and the theme-term distribution information according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
In the present embodiment, the data processing unit includes:
a theme prediction module for predicting, from the document-theme distribution information, multiple themes corresponding to the data to be processed, each of the multiple themes being associated with a prediction probability value;
a theme determining module for taking, from the multiple themes, the N themes with the largest prediction probability values as the themes corresponding to the data to be processed, where N is an integer greater than or equal to 1.
In the present embodiment, the data processing unit includes:
an information matching module for calling the theme-term distribution information matching the predicted theme, and judging whether the terms in the target data satisfy the called theme-term distribution information;
an information judging module for determining that the target data is associated with the data to be processed if satisfied, and that the target data is unrelated to the data to be processed if not satisfied.
As can be seen from the above, when identifying whether two pieces of data are related, the technical solution provided by the present application first generates document-term distribution information from a large number of training samples. From this known distribution information, document-theme distribution information and theme-term distribution information can then be trained, so that the content of a document is associated both with the terms it contains and with the themes it expresses. In this way, when subsequently judging whether two pieces of data are associated, both the terms contained in a document and the themes the document reflects can be taken into account, thereby improving the accuracy of judging data association.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If such modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (10)

1. A data processing method based on a theme recommendation model, characterized in that the method includes:
obtaining a document training sample set, the document training sample set containing multiple sample documents;
generating document-term distribution information based on the document training sample set;
training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information;
receiving data to be processed, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information, and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed.
2. The method according to claim 1, characterized in that generating document-term distribution information based on the document training sample set includes:
determining a target document in the document training sample set, and segmenting the text information in the target document to obtain multiple terms;
counting in turn the ratio at which each term occurs in the target document, and taking the counted ratio of each term as the document-term distribution information of the target document.
3. The method according to claim 1, characterized in that the document-theme distribution information and the theme-term distribution information are obtained by training according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
4. The method according to claim 3, characterized in that the method further includes:
determining the generation probability of each term in a document according to the following formula:

P(d_i, ω_j) = P(d_i) · P(ω_j|d_i)

where P(d_i, ω_j) denotes the probability that the j-th term occurs in the i-th document, and P(d_i) denotes the probability that the i-th document occurs in the document training sample set.
5. The method according to claim 1, characterized in that predicting the theme corresponding to the data to be processed from the trained document-theme distribution information includes:
predicting, from the document-theme distribution information, multiple themes corresponding to the data to be processed, each of the multiple themes being associated with a prediction probability value;
taking, from the multiple themes, the N themes with the largest prediction probability values as the themes corresponding to the data to be processed, where N is an integer greater than or equal to 1.
6. The method according to claim 1, characterized in that judging whether the target data is associated with the data to be processed includes:
calling the theme-term distribution information matching the predicted theme, and judging whether the terms in the target data satisfy the called theme-term distribution information;
if satisfied, determining that the target data is associated with the data to be processed; if not satisfied, determining that the target data is unrelated to the data to be processed.
7. A data processing system based on a theme recommendation model, characterized in that the system includes:
a sample set acquiring unit for obtaining a document training sample set, the document training sample set containing multiple sample documents;
an information generating unit for generating document-term distribution information based on the document training sample set;
an information training unit for training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information;
a data processing unit for receiving data to be processed, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information, and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed.
8. The system according to claim 7, characterized in that the information training unit trains the document-theme distribution information and the theme-term distribution information according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
9. The system according to claim 7, wherein the data processing unit comprises:
a topic prediction module, configured to predict, according to the document-topic distribution information, a plurality of topics corresponding to the data to be processed, each of the plurality of topics being associated with a respective prediction probability value;
a topic determination module, configured to select, from the plurality of topics, the N topics with the largest prediction probability values as the topics corresponding to the data to be processed, where N is an integer greater than or equal to 1.
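The topic determination module of claim 9 reduces to a top-N selection over the predicted probability values. The list-of-pairs input format below is an assumption for illustration; the claim does not prescribe a data representation.

```python
# Hypothetical sketch of the claim-9 topic determination module:
# keep the N topics with the largest prediction probability values.

def top_n_topics(topic_probs, n=1):
    """topic_probs: list of (topic_id, probability) pairs.

    Returns the ids of the n most probable topics (at least one, per the claim).
    """
    ranked = sorted(topic_probs, key=lambda tp: tp[1], reverse=True)
    return [topic for topic, _ in ranked[:max(1, n)]]

probs = [("sports", 0.1), ("tech", 0.6), ("finance", 0.3)]
print(top_n_topics(probs, n=2))  # ['tech', 'finance']
```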
10. The system according to claim 7, wherein the data processing unit comprises:
an information matching module, configured to invoke the topic-term distribution information matching the predicted topic, and to determine whether the terms in the target data satisfy the invoked topic-term distribution information;
an information judgment module, configured to determine that the target data is associated with the data to be processed if the terms satisfy the invoked distribution information, and to determine that the target data is unrelated to the data to be processed otherwise.
CN201811142853.3A 2018-09-28 2018-09-28 Data processing method and system based on theme recommendation model Active CN109446516B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811142853.3A CN109446516B (en) 2018-09-28 2018-09-28 Data processing method and system based on theme recommendation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811142853.3A CN109446516B (en) 2018-09-28 2018-09-28 Data processing method and system based on theme recommendation model

Publications (2)

Publication Number Publication Date
CN109446516A true CN109446516A (en) 2019-03-08
CN109446516B CN109446516B (en) 2022-11-11

Family

ID=65544620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811142853.3A Active CN109446516B (en) 2018-09-28 2018-09-28 Data processing method and system based on theme recommendation model

Country Status (1)

Country Link
CN (1) CN109446516B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5687364A (en) * 1994-09-16 1997-11-11 Xerox Corporation Method for learning to infer the topical content of documents based upon their lexical content
JP2013134752A (en) * 2011-12-27 2013-07-08 Nippon Telegr & Teleph Corp <Ntt> Topic model learning method, apparatus, and program
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
CN107193892A (en) * 2017-05-02 2017-09-22 东软集团股份有限公司 Document topic determination method and device
CN107239438A (en) * 2016-03-28 2017-10-10 阿里巴巴集团控股有限公司 Document analysis method and device


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105041A (en) * 2019-12-02 2020-05-05 成都四方伟业软件股份有限公司 Machine learning method and device for intelligent data collision
CN111105041B (en) * 2019-12-02 2022-12-23 成都四方伟业软件股份有限公司 Machine learning method and device for intelligent data collision

Also Published As

Publication number Publication date
CN109446516B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN106570708B (en) Management method and system of intelligent customer service knowledge base
Jacovi et al. Understanding convolutional neural networks for text classification
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN106156204B (en) Text label extraction method and device
Alfonseca et al. Extending a lexical ontology by a combination of distributional semantics signatures
US20100205198A1 (en) Search query disambiguation
JP5536875B2 (en) Method and apparatus for identifying synonyms and searching using synonyms
CN112035730B (en) Semantic retrieval method and device and electronic equipment
US20100306144A1 (en) System and method for classifying information
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN108681564B (en) Keyword and answer determination method, device and computer readable storage medium
KR101638535B1 (en) Method of detecting issue patten associated with user search word, server performing the same and storage medium storing the same
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN111613341A (en) Entity linking method and device based on semantic components
Hillard et al. Learning weighted entity lists from web click logs for spoken language understanding
CN110209659A Resume screening method and system, and computer-readable storage medium
CN109446516A Data processing method and system based on topic recommendation model
CN105159905B (en) Microblogging clustering method based on forwarding relationship
CN115130455A (en) Article processing method and device, electronic equipment and storage medium
Palm Sentiment classification of Swedish Twitter data
Fan et al. Stop Words for Processing Software Engineering Documents: Do they Matter?
CN111898034A (en) News content pushing method and device, storage medium and computer equipment
CN111898375A (en) Automatic detection and division method for article argument and data based on word vector sentence subchain
WO2019132648A1 (en) System and method for identifying concern evolution within temporal and geospatial windows
Lu et al. Improving web search relevance with semantic features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant