CN109446516A - Data processing method and system based on theme recommendation model - Google Patents

Data processing method and system based on theme recommendation model

Info

Publication number
CN109446516A
CN109446516A
Authority
CN
China
Prior art keywords
theme
document
term
data
distribution information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811142853.3A
Other languages
Chinese (zh)
Other versions
CN109446516B (en)
Inventor
王军平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Cyberbas Data Technology Co Ltd
Original Assignee
Beijing Cyberbas Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Cyberbas Data Technology Co Ltd filed Critical Beijing Cyberbas Data Technology Co Ltd
Priority to CN201811142853.3A priority Critical patent/CN109446516B/en
Publication of CN109446516A publication Critical patent/CN109446516A/en
Application granted granted Critical
Publication of CN109446516B publication Critical patent/CN109446516B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

The present invention provides a data processing method and system based on a theme recommendation model. The method includes: obtaining a document training sample set, the document training sample set containing multiple sample documents; generating document-term distribution information based on the document training sample set; training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information; receiving data to be processed and predicting its corresponding theme from the trained document-theme distribution information; and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed. The technical solution provided by the present application can improve the accuracy of judging data association.

Description

Data processing method and system based on theme recommendation model
Technical field
The present invention relates to the technical field of data processing, and in particular to a data processing method and system based on a theme recommendation model.
Background art
Currently, when judging whether two documents are related, the judgment is usually made by comparing the number of identical or similar words appearing in the two documents. In some cases, however, two related documents may share no identical or similar words at all. For example, document 1 is "Steve Jobs has left us" and document 2 is "The iPhone may drop in price". Taken literally, the two documents share no identical or similar words, yet analyzed semantically they are in fact related.
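The limitation described above can be sketched with a hypothetical literal-overlap measure (a Jaccard-style comparison, not part of the patent), which scores the two related example documents as entirely unrelated:

```python
def word_overlap(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity over whitespace-tokenized, lower-cased words."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Two semantically related documents with no shared vocabulary score zero:
doc1 = "Steve Jobs has left us"
doc2 = "Apple iPhone prices may fall"
print(word_overlap(doc1, doc2))  # → 0.0
```

A theme-level comparison, as proposed below, is meant to catch exactly the relatedness this measure misses.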
Therefore, existing methods for judging whether two pieces of data are associated suffer from considerable misjudgment.
Summary of the invention
The purpose of the present application is to provide a data processing method and system based on a theme recommendation model, which can improve the accuracy of judging data association.
To achieve the above object, the application provides a data processing method based on a theme recommendation model. The method includes: obtaining a document training sample set, the document training sample set containing multiple sample documents; generating document-term distribution information based on the document training sample set; training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information; receiving data to be processed and predicting the theme corresponding to the data to be processed from the trained document-theme distribution information; and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed.
Further, the document training sample set is obtained as follows:
document data is collected by a search engine, and noise data in the collected document data is cleaned. Specifically:
A clean database is established for storing noise-free clean data. Text data to be cleaned is obtained and preprocessed into structured data, the structured data forming the set of words of the text data. Specifically: the data to be cleaned is segmented into words, and all words are converted to a unified encoding form; the data in unified encoding form is checked against a data dictionary to eliminate inconsistent data, yielding standardized data; a consistency check is performed on the standardized data, and obvious errors in the content are corrected; identical words are deduplicated to obtain the structured data.
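A minimal sketch of this preprocessing step, under stated assumptions: whitespace splitting stands in for the patent's word segmentation, Unicode NFKC normalization stands in for the unified encoding form, and the `data_dictionary` mapping is a hypothetical example:

```python
import unicodedata

def preprocess(raw_text: str, data_dictionary: dict) -> list:
    """Segment, normalize encoding, map variants to canonical forms via a
    data dictionary, and deduplicate -- yielding the structured word set."""
    # Segment (whitespace split stands in for a real word segmenter)
    words = raw_text.split()
    # Convert all words to a unified (NFKC-normalized, lower-case) form
    words = [unicodedata.normalize("NFKC", w).lower() for w in words]
    # Eliminate inconsistent variants according to the data dictionary
    words = [data_dictionary.get(w, w) for w in words]
    # Deduplicate identical words, preserving first-seen order
    seen, structured = set(), []
    for w in words:
        if w not in seen:
            seen.add(w)
            structured.append(w)
    return structured

print(preprocess("Colour colour color", {"colour": "color"}))  # → ['color']
```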
The semantic similarity of every two words is obtained. Specifically: the concepts expressed by each word, and the sememes describing each concept, are obtained; for any two independent words, the similarities between the sememes under each of their concepts are computed, the similarity of two sememes being measured by their semantic distance; the largest and smallest sememe similarities between two concepts are found, and the similarity between the two concepts is the mean of the largest and smallest sememe similarities; the maximum concept similarity between the two words is found and taken as the semantic similarity of the two words.
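The sememe-based similarity above can be sketched as follows. The conversion from semantic distance to similarity (`alpha / (alpha + distance)`) is a common choice for HowNet-style sememe hierarchies and is an assumption here — the patent only states that two sememes' similarity is measured by their semantic distance:

```python
def sememe_similarity(dist: float, alpha: float = 1.6) -> float:
    """Similarity of two sememes from their semantic distance in the
    sememe hierarchy: sim = alpha / (alpha + distance). (Assumed form.)"""
    return alpha / (alpha + dist)

def concept_similarity(sememes_a, sememes_b, distance) -> float:
    """Mean of the largest and smallest pairwise sememe similarities
    between the sememe sets describing two concepts."""
    sims = [sememe_similarity(distance(x, y)) for x in sememes_a for y in sememes_b]
    return (max(sims) + min(sims)) / 2.0

def word_similarity(concepts_a, concepts_b, distance) -> float:
    """Maximum concept similarity over all concept pairs of the two words."""
    return max(concept_similarity(ca, cb, distance)
               for ca in concepts_a for cb in concepts_b)
```

Here `distance` is any sememe-distance function; the toy usage in the test treats sememes as integers with absolute difference as distance.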
Using the semantic similarity of two words as the distance metric, the words are automatically clustered with the K-means algorithm to identify noise data. Specifically: K words are selected at random as centroids and a similarity threshold is set; for each remaining word, its distance to each centroid is measured and the word is assigned to the class of its nearest centroid; the centroid of each resulting class is recomputed; it is then judged whether the distance between each new centroid and the old centroid is at or below the similarity threshold, and if so, the remaining data that are far from every centroid and cannot be assigned to any centroid's class are noise data.
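A sketch of this clustering step. Because words cannot be averaged, centers are recomputed here as medoids (the member minimizing total in-cluster distance), which is an adaptation rather than textbook K-means; `distance` would be derived from the semantic similarity above (e.g. 1 minus similarity):

```python
import random

def cluster_noise(words, distance, k, threshold, max_iter=20):
    """Pick K words as centers, assign each word to its nearest center,
    recompute centers as medoids, and flag as noise any word farther
    than `threshold` from every final center."""
    centers = random.sample(words, k)
    for _ in range(max_iter):
        clusters = {c: [] for c in centers}
        for w in words:
            nearest = min(centers, key=lambda c: distance(w, c))
            clusters[nearest].append(w)
        # Recompute each center as the member minimizing total distance
        new_centers = [
            min(members, key=lambda m: sum(distance(m, x) for x in members))
            for members in clusters.values() if members
        ]
        if set(new_centers) == set(centers):
            break
        centers = new_centers
    noise = [w for w in words if min(distance(w, c) for c in centers) > threshold]
    return centers, noise
```

The toy usage in the test clusters integers by absolute difference; the outlier 100 cannot be assigned to any center within the threshold and is flagged as noise.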
The semantic ontology causing the noise is found in the noise data and corrected, so as to obtain clean data, and the clean data is stored in the clean database. Specifically: a noise datum is taken, and it is judged whether some field of it deviates considerably from the cluster centroid and causes the anomaly; if so, that field is considered the semantic ontology causing the noise; if not, all fields of the noise datum are obtained, and the datum is re-clustered after each field is discarded in turn. If, after a field is discarded, the data point remains noise, the discarded field is considered a non-noise ontology; if, after a field is discarded, the data point no longer counts as noise, the discarded field is the semantic ontology causing the noise. The noise-causing semantic ontology is then removed, the datum is clustered again and assigned to the class of its nearest centroid, the data values of the corresponding ontology attribute of the original words in that centroid's class are averaged, and the average is taken as the ontology attribute of the noise datum, which is then considered corrected into clean data.
The above steps are repeated until the cleaning of noise data in the text data is complete.
When the collected document data includes multiple themes, cleaning the noise data in the collected document data further includes cleaning the collected document data as follows:
A data cleansing rule file is configured. The data cleansing rule file includes at least one data cleansing rule, and each data cleansing rule includes a data table name, data cleansing rule pseudocode, and a rule number.
Data cleansing code is generated from the data cleansing rule file. This includes: obtaining from the data cleansing rule file the data cleansing rules corresponding to the table name of the data table to be cleaned, and generating a temporary file; reading the first data cleansing rule of the temporary file, and generating cleaning code for that rule using the condition part of its pseudocode as the judgment condition; traversing all data cleansing rules in the temporary file, generating corresponding cleaning code for each rule, and combining them into the complete cleaning code of the data table to be cleaned.
The data cleansing code is executed to tag the data to be cleaned. This includes: reading a record of the data table to be cleaned and setting an initial label value for it; every time the record triggers a data cleansing rule, increasing its label value by 2^n, where n is the rule number of that data cleansing rule; traversing every data cleansing rule corresponding to the table name of the data table to be cleaned; and traversing every record in the data table to be cleaned, tagging each record to be cleaned.
The labels are parsed and the dirty data is cleaned. This includes: ANDing the label value with 2^n for each n; if the result is 2^n itself, the record corresponding to that label value triggered data cleansing rule n, otherwise it did not trigger rule n, where n is the rule number of the data cleansing rule.
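The 2^n tagging-and-parsing scheme above can be sketched as follows (the two rule predicates are hypothetical examples, not from the patent):

```python
def tag_record(record, rules):
    """Give a record a label whose bit n is set iff cleansing rule n fires."""
    label = 0
    for n, rule in rules.items():
        if rule(record):          # rule predicate triggers on dirty data
            label += 2 ** n       # equivalently: label |= 1 << n
    return label

def triggered_rules(label, rules):
    """Parse a label: rule n was triggered iff (label & 2**n) == 2**n."""
    return [n for n in rules if label & (2 ** n) == 2 ** n]

# Hypothetical rules: 0 -> missing name, 1 -> negative age
rules = {0: lambda r: not r.get("name"), 1: lambda r: r.get("age", 0) < 0}
label = tag_record({"name": "", "age": -5}, rules)
print(label, triggered_rules(label, rules))  # → 3 [0, 1]
```

One integer label thus records every rule a record triggered, and each rule can be recovered independently with a bitwise AND.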
Further, generating document-term distribution information based on the document training sample set includes: determining a target document in the document training sample set, and segmenting the text information in the target document to obtain multiple terms; counting in turn the ratio at which each term occurs in the target document, and taking the counted ratio of each term as the document-term distribution information of the target document.
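A minimal sketch of the counting step (whitespace splitting again stands in for the patent's word segmentation):

```python
from collections import Counter

def document_term_distribution(document: str) -> dict:
    """Ratio of each term's occurrence count to the document length."""
    terms = document.split()
    counts = Counter(terms)
    total = len(terms)
    return {term: c / total for term, c in counts.items()}

print(document_term_distribution("a b a c"))  # → {'a': 0.5, 'b': 0.25, 'c': 0.25}
```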
Further, the document-theme distribution information and the theme-term distribution information are obtained by training according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
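The decomposition above is the PLSA (probabilistic latent semantic analysis) mixture, which is conventionally fitted with expectation-maximization. The patent does not specify the fitting procedure, so the EM sketch below is an assumption:

```python
import numpy as np

def train_plsa(n_dw, K, iters=50, seed=0):
    """EM sketch for the decomposition P(w|d) = sum_k P(w|z_k) P(z_k|d).
    n_dw: (D, W) document-term count matrix. Returns (P(z|d), P(w|z))."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)  # P(z_k|d_i)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)  # P(w_j|z_k)
    for _ in range(iters):
        # E-step: responsibilities P(z|d,w), shape (D, W, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / joint.sum(2, keepdims=True).clip(1e-12)
        weighted = n_dw[:, :, None] * resp
        # M-step: re-estimate both distributions from weighted counts
        p_z_d = weighted.sum(1); p_z_d /= p_z_d.sum(1, keepdims=True).clip(1e-12)
        p_w_z = weighted.sum(0).T; p_w_z /= p_w_z.sum(1, keepdims=True).clip(1e-12)
    return p_z_d, p_w_z
```

The input n_dw would come from the document-term distribution information of the training samples; the two returned matrices are the document-theme and theme-term distribution information the method trains.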
Further, the method also includes: determining the generation probability of each term in a document according to the following formula:

P(d_i, ω_j) = P(d_i) · P(ω_j|d_i)

where P(d_i, ω_j) denotes the probability that the j-th term occurs in the i-th document, and P(d_i) denotes the probability that the i-th document occurs in the document training sample set.
Further, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information includes: predicting, from the document-theme distribution information, multiple themes corresponding to the data to be processed, each of the multiple themes being associated with a prediction probability value; and taking, from the multiple themes, the N themes with the largest prediction probability values as the themes corresponding to the data to be processed, where N is an integer greater than or equal to 1.
Further, judging whether the target data is associated with the data to be processed includes: calling the theme-term distribution information matching the predicted theme, and judging whether the terms in the target data satisfy the called theme-term distribution information; if satisfied, determining that the target data is associated with the data to be processed; if not satisfied, determining that the target data is unrelated to the data to be processed.
To achieve the above object, the application also provides a data processing system based on a theme recommendation model. The system includes: a sample set acquiring unit for obtaining a document training sample set, the document training sample set containing multiple sample documents; an information generating unit for generating document-term distribution information based on the document training sample set; an information training unit for training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information; and a data processing unit for receiving data to be processed, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information, and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed.
Further, the information training unit trains the document-theme distribution information and the theme-term distribution information according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
Further, the data processing unit includes: a theme prediction module for predicting, from the document-theme distribution information, multiple themes corresponding to the data to be processed, each of the multiple themes being associated with a prediction probability value; and a theme determining module for taking, from the multiple themes, the N themes with the largest prediction probability values as the themes corresponding to the data to be processed, where N is an integer greater than or equal to 1.
Further, the data processing unit includes: an information matching module for calling the theme-term distribution information matching the predicted theme, and judging whether the terms in the target data satisfy the called theme-term distribution information; and an information judging module for determining that the target data is associated with the data to be processed if satisfied, and that the target data is unrelated to the data to be processed if not satisfied.
As can be seen from the above, when identifying whether two pieces of data are related, the technical solution provided by the present application first generates document-term distribution information from a large number of training samples. From this known distribution information, document-theme distribution information and theme-term distribution information can then be trained, so that the content of a document is associated both with the terms it contains and with the themes it expresses. In this way, when subsequently judging whether two pieces of data are associated, both the terms contained in a document and the themes the document reflects can be taken into account, thereby improving the accuracy of judging data association.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description, claims, and drawings.
The technical solution of the present invention is described in further detail below through the drawings and embodiments.
Brief description of the drawings
The drawings are provided for a further understanding of the present invention, constitute a part of the specification, and together with the embodiments of the invention serve to explain the invention without limiting it. In the drawings:
Fig. 1 is a flowchart of the data processing method based on the theme recommendation model in an embodiment of the present invention;
Fig. 2 is a functional block diagram of the data processing system based on the theme recommendation model in an embodiment of the present invention.
Specific embodiment
The preferred embodiments of the present invention are described below with reference to the drawings. It should be understood that the preferred embodiments described herein are intended only to illustrate and explain the present invention, not to limit it.
Referring to Fig. 1, the application provides a data processing method based on a theme recommendation model. The method includes:
S1: obtain a document training sample set, the document training sample set containing multiple sample documents;
S2: generate document-term distribution information based on the document training sample set;
S3: train on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information;
S4: receive data to be processed, predict the theme corresponding to the data to be processed from the trained document-theme distribution information, and, when target data matching the predicted theme is received, judge from the theme-term distribution information whether the target data is associated with the data to be processed.
In the present embodiment, generating document-term distribution information based on the document training sample set includes:
determining a target document in the document training sample set, and segmenting the text information in the target document to obtain multiple terms;
counting in turn the ratio at which each term occurs in the target document, and taking the counted ratio of each term as the document-term distribution information of the target document.
In practical applications, the document-term distribution information of multiple documents can constitute the document-term distribution information of the whole, which characterizes the frequency at which terms occur in the documents.
In the present embodiment, the document-theme distribution information and the theme-term distribution information can be obtained by training according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
Further, the generation probability of each term in a document can also be determined according to the following formula:

P(d_i, ω_j) = P(d_i) · P(ω_j|d_i)

where P(d_i, ω_j) denotes the probability that the j-th term occurs in the i-th document, and P(d_i) denotes the probability that the i-th document occurs in the document training sample set.
In practical applications, after the document-theme distribution information and the theme-term distribution information are trained, a document can also be generated automatically based on the training results.
Specifically, the document-theme distribution information corresponding to a reference document can be determined according to the prior probability of the reference document (namely the above P(d_i)). A theme for generating the reference document can then be sampled from the document-theme distribution information. Based on that theme, a term distribution is determined from the corresponding theme-term distribution information, the term distribution is sampled, and a term is finally generated. In this way, the themes of the reference document are analyzed one by one, so as to generate another document associated with the reference document.
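The sampling procedure described above can be sketched as follows (the two-topic, two-term distributions in the usage line are toy values, not from the patent):

```python
import random

def generate_document(p_z_d, p_w_z, vocab, length, seed=0):
    """Sample a document: for each position, draw a theme z_k from the
    document-theme distribution, then draw a term from that theme's
    theme-term distribution."""
    rng = random.Random(seed)
    topics = list(range(len(p_z_d)))
    words = []
    for _ in range(length):
        k = rng.choices(topics, weights=p_z_d)[0]            # sample theme z_k
        words.append(rng.choices(vocab, weights=p_w_z[k])[0])  # sample term
    return words

doc = generate_document([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]], ["apple", "price"], 5)
```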
In the present embodiment, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information includes:
predicting, from the document-theme distribution information, multiple themes corresponding to the data to be processed, each of the multiple themes being associated with a prediction probability value;
taking, from the multiple themes, the N themes with the largest prediction probability values as the themes corresponding to the data to be processed, where N is an integer greater than or equal to 1.
Specifically, judging whether the target data is associated with the data to be processed includes:
calling the theme-term distribution information matching the predicted theme, and judging whether the terms in the target data satisfy the called theme-term distribution information;
if satisfied, determining that the target data is associated with the data to be processed; if not satisfied, determining that the target data is unrelated to the data to be processed.
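A sketch of the prediction and matching steps above. The `threshold` matching rule is a hypothetical interpretation of "the terms satisfy the theme-term distribution information", which the patent leaves unspecified:

```python
def top_n_topics(topic_probs: dict, n: int = 1) -> list:
    """Keep the N themes with the largest predicted probability values."""
    return sorted(topic_probs, key=topic_probs.get, reverse=True)[:n]

def is_associated(target_terms, p_w_z: dict, threshold: float = 0.01) -> bool:
    """Hypothetical matching rule: the target data is associated if every
    one of its terms appears with non-negligible probability in the
    predicted theme's theme-term distribution."""
    return all(p_w_z.get(t, 0.0) > threshold for t in target_terms)

probs = {"phones": 0.7, "fruit": 0.2, "cars": 0.1}
print(top_n_topics(probs, 2))  # → ['phones', 'fruit']
print(is_associated(["iphone", "price"], {"iphone": 0.3, "price": 0.1}))  # → True
```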
Referring to Fig. 2, the application also provides a data processing system based on a theme recommendation model. The system includes:
a sample set acquiring unit for obtaining a document training sample set, the document training sample set containing multiple sample documents;
an information generating unit for generating document-term distribution information based on the document training sample set;
an information training unit for training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information;
a data processing unit for receiving data to be processed, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information, and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed.
In the present embodiment, the information training unit trains the document-theme distribution information and the theme-term distribution information according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
In the present embodiment, the data processing unit includes:
a theme prediction module for predicting, from the document-theme distribution information, multiple themes corresponding to the data to be processed, each of the multiple themes being associated with a prediction probability value;
a theme determining module for taking, from the multiple themes, the N themes with the largest prediction probability values as the themes corresponding to the data to be processed, where N is an integer greater than or equal to 1.
In the present embodiment, the data processing unit includes:
an information matching module for calling the theme-term distribution information matching the predicted theme, and judging whether the terms in the target data satisfy the called theme-term distribution information;
an information judging module for determining that the target data is associated with the data to be processed if satisfied, and that the target data is unrelated to the data to be processed if not satisfied.
As can be seen from the above, when identifying whether two pieces of data are related, the technical solution provided by the present application first generates document-term distribution information from a large number of training samples. From this known distribution information, document-theme distribution information and theme-term distribution information can then be trained, so that the content of a document is associated both with the terms it contains and with the themes it expresses. In this way, when subsequently judging whether two pieces of data are associated, both the terms contained in a document and the themes the document reflects can be taken into account, thereby improving the accuracy of judging data association.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If such modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (10)

1. A data processing method based on a theme recommendation model, characterized in that the method includes:
obtaining a document training sample set, the document training sample set containing multiple sample documents;
generating document-term distribution information based on the document training sample set;
training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information;
receiving data to be processed, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information, and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed.
2. The method according to claim 1, characterized in that generating document-term distribution information based on the document training sample set includes:
determining a target document in the document training sample set, and segmenting the text information in the target document to obtain multiple terms;
counting in turn the ratio at which each term occurs in the target document, and taking the counted ratio of each term as the document-term distribution information of the target document.
3. The method according to claim 1, characterized in that the document-theme distribution information and the theme-term distribution information are obtained by training according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
4. The method according to claim 3, characterized in that the method further includes:
determining the generation probability of each term in a document according to the following formula:

P(d_i, ω_j) = P(d_i) · P(ω_j|d_i)

where P(d_i, ω_j) denotes the probability that the j-th term occurs in the i-th document, and P(d_i) denotes the probability that the i-th document occurs in the document training sample set.
5. The method according to claim 1, characterized in that predicting the theme corresponding to the data to be processed from the trained document-theme distribution information includes:
predicting, from the document-theme distribution information, multiple themes corresponding to the data to be processed, each of the multiple themes being associated with a prediction probability value;
taking, from the multiple themes, the N themes with the largest prediction probability values as the themes corresponding to the data to be processed, where N is an integer greater than or equal to 1.
6. The method according to claim 1, characterized in that judging whether the target data is associated with the data to be processed includes:
calling the theme-term distribution information matching the predicted theme, and judging whether the terms in the target data satisfy the called theme-term distribution information;
if satisfied, determining that the target data is associated with the data to be processed; if not satisfied, determining that the target data is unrelated to the data to be processed.
7. A data processing system based on a theme recommendation model, characterized in that the system includes:
a sample set acquiring unit for obtaining a document training sample set, the document training sample set containing multiple sample documents;
an information generating unit for generating document-term distribution information based on the document training sample set;
an information training unit for training on the document-term distribution information to obtain document-theme distribution information and theme-term distribution information;
a data processing unit for receiving data to be processed, predicting the theme corresponding to the data to be processed from the trained document-theme distribution information, and, when target data matching the predicted theme is received, judging from the theme-term distribution information whether the target data is associated with the data to be processed.
8. The system according to claim 7, characterized in that the information training unit trains the document-theme distribution information and the theme-term distribution information according to the following formula:

P(ω_j|d_i) = Σ_{k=1}^{K} P(ω_j|z_k) · P(z_k|d_i)

where P(ω_j|d_i) denotes the document-term distribution information, P(ω_j|z_k) denotes the theme-term distribution information, P(z_k|d_i) denotes the document-theme distribution information, ω_j denotes the j-th term, z_k denotes the k-th theme, d_i denotes the i-th document, and K denotes the total number of themes.
9. The system according to claim 7, wherein the data processing unit comprises:
a topic prediction module, configured to predict, according to the document-topic distribution information, a plurality of topics corresponding to the data to be processed, each of the plurality of topics being associated with a respective prediction probability value;
a topic determination module, configured to select, from the plurality of topics, the N topics with the largest prediction probability values as the topics corresponding to the data to be processed, where N is an integer greater than or equal to 1.
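The topic determination module of claim 9 reduces to a top-N selection over the predicted probability values. The list-of-pairs input format below is an assumption for illustration; the claim does not prescribe a data representation.

```python
# Hypothetical sketch of the claim-9 topic determination module:
# keep the N topics with the largest prediction probability values.

def top_n_topics(topic_probs, n=1):
    """topic_probs: list of (topic_id, probability) pairs.

    Returns the ids of the n most probable topics (at least one, per the claim).
    """
    ranked = sorted(topic_probs, key=lambda tp: tp[1], reverse=True)
    return [topic for topic, _ in ranked[:max(1, n)]]

probs = [("sports", 0.1), ("tech", 0.6), ("finance", 0.3)]
print(top_n_topics(probs, n=2))  # ['tech', 'finance']
```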
10. The system according to claim 7, wherein the data processing unit comprises:
an information matching module, configured to invoke the topic-term distribution information matching the predicted topic, and to determine whether the terms in the target data satisfy the invoked topic-term distribution information;
an information judgment module, configured to determine that the target data is associated with the data to be processed if the terms satisfy the invoked distribution information, and to determine that the target data is unrelated to the data to be processed otherwise.
CN201811142853.3A 2018-09-28 2018-09-28 Data processing method and system based on theme recommendation model Active CN109446516B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811142853.3A CN109446516B (en) 2018-09-28 2018-09-28 Data processing method and system based on theme recommendation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811142853.3A CN109446516B (en) 2018-09-28 2018-09-28 Data processing method and system based on theme recommendation model

Publications (2)

Publication Number Publication Date
CN109446516A true CN109446516A (en) 2019-03-08
CN109446516B CN109446516B (en) 2022-11-11

Family

ID=65544620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811142853.3A Active CN109446516B (en) 2018-09-28 2018-09-28 Data processing method and system based on theme recommendation model

Country Status (1)

Country Link
CN (1) CN109446516B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5687364A (en) * 1994-09-16 1997-11-11 Xerox Corporation Method for learning to infer the topical content of documents based upon their lexical content
JP2013134752A (en) * 2011-12-27 2013-07-08 Nippon Telegr & Teleph Corp <Ntt> Topic model learning method, apparatus, and program
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
CN107193892A (en) * 2017-05-02 2017-09-22 东软集团股份有限公司 Document topic determination method and device
CN107239438A (en) * 2016-03-28 2017-10-10 阿里巴巴集团控股有限公司 Document analysis method and device


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105041A (en) * 2019-12-02 2020-05-05 成都四方伟业软件股份有限公司 Machine learning method and device for intelligent data collision
CN111105041B (en) * 2019-12-02 2022-12-23 成都四方伟业软件股份有限公司 Machine learning method and device for intelligent data collision

Also Published As

Publication number Publication date
CN109446516B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN106570708B (en) Management method and system of intelligent customer service knowledge base
Jacovi et al. Understanding convolutional neural networks for text classification
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN106156204B (en) Text label extraction method and device
Alfonseca et al. Extending a lexical ontology by a combination of distributional semantics signatures
US20100205198A1 (en) Search query disambiguation
JP5536875B2 (en) Method and apparatus for identifying synonyms and searching using synonyms
CN112035730B (en) Semantic retrieval method and device and electronic equipment
US20100306144A1 (en) System and method for classifying information
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN108681564B (en) Keyword and answer determination method, device and computer readable storage medium
KR101638535B1 (en) Method of detecting issue patten associated with user search word, server performing the same and storage medium storing the same
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN111613341A (en) Entity linking method and device based on semantic components
Hillard et al. Learning weighted entity lists from web click logs for spoken language understanding
CN110209659A Resume screening method and system, and computer-readable storage medium
CN109446516A Data processing method and system based on topic recommendation model
CN105159905B (en) Microblogging clustering method based on forwarding relationship
CN115130455A (en) Article processing method and device, electronic equipment and storage medium
Palm Sentiment classification of Swedish Twitter data
Fan et al. Stop Words for Processing Software Engineering Documents: Do they Matter?
CN111898034A (en) News content pushing method and device, storage medium and computer equipment
CN111898375A (en) Automatic detection and division method for article argument and data based on word vector sentence subchain
WO2019132648A1 (en) System and method for identifying concern evolution within temporal and geospatial windows
Lu et al. Improving web search relevance with semantic features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant