CN106970988A - Data processing method, device and electronic equipment - Google Patents

Data processing method, device and electronic equipment

Info

Publication number
CN106970988A
Authority
CN
China
Prior art keywords
keyword
comment
corpus
word
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710202705.5A
Other languages
Chinese (zh)
Inventor
刘帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN201710202705.5A
Publication of CN106970988A
Priority to PCT/CN2017/102942 (WO2018176764A1)
Priority to US16/587,440 (US11468108B2)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/33: Querying
    • G06F16/338: Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a data processing method, a device, and electronic equipment. A keyword library is built from the comment corpus associated with a target object. For a first keyword in the library, the comment phrases in the corpus that contain it are selected and merged into a pending corpus for that keyword, which yields a long text containing the first keyword. Topic clustering is then applied to the pending corpus of each keyword, so that the topic words the comment corpus expresses about the target object can be obtained accurately. This addresses the prior-art problem that applying a topic clustering algorithm directly to short-text corpora yields multiple topic words, and it removes the need to extract topic words from short texts manually, which reduces labor and material costs and improves efficiency.

Description

Data processing method, device and electronic equipment
Technical field
The present application relates generally to the technical field of data processing, and more particularly to a data processing method, device, and electronic equipment.
Background
An information organization method analyzes, indexes, and processes the content of information resources to meet retrieval needs, so that the resources end up in an ordered form. Such methods play an important role in Internet application services.
In practice, to understand what users care about in a product or in an entire industry, so that products can be developed in a targeted way and user needs better met, the feedback of users of a given product, or of users working in a given industry, is usually subjected to topic clustering analysis. This yields the topic categories that users pay attention to for that product or industry, that is, the users' points of concern.
However, prior-art topic clustering is usually designed for long-text corpora such as news reports. When it is applied to user feedback, it tends to produce multiple topics, so the users' points of concern cannot be determined. Extracting topics manually instead consumes a great deal of labor and material resources and is very inefficient.
Summary of the invention
In view of this, the present application provides a data processing method, device, and electronic equipment that perform topic clustering on short-text corpora without manually extracting the topics of the short texts, which greatly reduces labor and material costs and improves efficiency.
To achieve the above objective, the present application provides the following technical solutions:
A data processing method, the method comprising:
building a keyword library for a comment corpus by using the obtained comment corpus associated with a target object;
selecting, from the comment corpus, the comment phrases that contain a first keyword in the keyword library, and merging them to obtain a pending corpus for the first keyword, where a comment phrase is a part of the comment corpus that contains the first keyword;
performing topic clustering on the obtained pending corpus of each keyword to obtain the topic words of the comment corpus for the target object.
Preferably, selecting, from the comment corpus, the comment phrases that contain a first keyword in the keyword library and merging them to obtain the pending corpus of the first keyword includes:
scanning the comment corpus;
for the first keyword in the keyword library, extracting from the currently scanned comment at least one comment phrase composed of the detected first keyword and at least one adjacent word, where the adjacent word lies within a preset length of the first keyword;
splicing the extracted comment phrases corresponding to the first keyword into one pending corpus.
Preferably, building the keyword library of the comment corpus by using the obtained comment corpus associated with the target object includes:
obtaining candidate keywords from the comment corpus associated with the target object;
filtering irrelevant words out of the obtained candidate keywords to obtain the keyword library of the comment corpus, where irrelevant words are comment words unrelated to the attributes of the target object.
Preferably, the method further includes:
supplementing the keyword library with preset keywords for the target object.
Preferably, extracting from the currently scanned comment at least one comment phrase composed of the detected first keyword and at least one adjacent word includes:
extracting, by using a preset-length window, a comment phrase containing the first keyword from the currently scanned comment;
where the length of the preset-length window is greater than the character length of the first keyword.
Preferably, obtaining candidate keywords from the comment corpus associated with the target object includes:
segmenting the comment corpus associated with the target object into words;
calculating the IDF value of each segmented word by using the term frequency-inverse document frequency (TF-IDF) algorithm;
selecting the words whose IDF value exceeds a preset threshold as candidate keywords.
A data processing device, the device comprising:
a first building module, configured to build a keyword library for a comment corpus by using the obtained comment corpus associated with a target object;
a screening and merging module, configured to select, from the comment corpus, the comment phrases that contain a first keyword in the keyword library and merge them to obtain the pending corpus of the first keyword, where a comment phrase is a part of the comment corpus that contains the first keyword;
a clustering module, configured to perform topic clustering on the obtained pending corpus of each keyword to obtain the topic words of the comment corpus for the target object.
Electronic equipment, comprising:
a communication module, configured to obtain the comment corpus associated with a target object;
a processor, configured to build a keyword library of the comment corpus by using the comment corpus, select from the comment corpus the comment phrases that contain a first keyword in the keyword library, merge them to obtain the pending corpus of the first keyword, where a comment phrase is a part of the comment corpus that contains the first keyword, and perform topic clustering on the obtained pending corpus of each keyword to obtain the topic words of the comment corpus for the target object.
Preferably, the electronic equipment further includes:
an input device, configured to obtain preset keywords for the target object and supplement the keyword library with them;
and/or
a display, configured to output the topic words of the target object.
Preferably, the processor is specifically configured to scan the comment corpus; for the first keyword in the keyword library, extract from the currently scanned comment at least one comment phrase composed of the detected first keyword and at least one adjacent word; and splice the extracted comment phrases corresponding to the first keyword into one pending corpus, where the adjacent word lies within a preset length of the first keyword.
Compared with the prior art, the data processing method, device, and electronic equipment provided by the present application build a keyword library from the comment corpus associated with a target object, then select the comment phrases in the corpus that contain a first keyword in the library and merge them into the pending corpus of that keyword, which yields a long text containing the first keyword. Topic clustering is then applied to the pending corpus of each keyword, so that the topic words the comment corpus expresses about the target object can be obtained accurately. This addresses the prior-art problem that applying a topic clustering algorithm directly to short-text corpora yields multiple topic words, and it removes the need to extract topic words from short texts manually, which reduces labor and material costs and greatly improves efficiency.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a data processing method provided by an embodiment of the present application;
Fig. 2 is a flowchart of another data processing method provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of displaying a keyword and its comment phrase, provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of a topic display provided by an embodiment of the present application;
Fig. 5 is a structural block diagram of a data processing device provided by an embodiment of the present application;
Fig. 6 is a structural block diagram of another data processing device provided by an embodiment of the present application;
Fig. 7 is a structural block diagram of yet another data processing device provided by an embodiment of the present application;
Fig. 8 is a structural diagram of electronic equipment provided by an embodiment of the present application;
Fig. 9 is a hardware structure diagram of electronic equipment provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present application.
With the development of network technology, the market now offers many network products across every field, such as audio and video services, online games, shopping applications, and payment applications, which add enjoyment and convenience to users' lives, work, and study.
In practice, to understand how users experience a network product, developers usually provide a user comment option, so that users can submit comments such as their impressions after using the product. By analyzing the comments, developers learn what users focus on in the product, or which aspects they are dissatisfied with, and upgrade or modify the product accordingly to make it more popular and expand its market.
In addition, when developers plan to develop a product in a certain field, they can collect user comments on related products in that field, analyze what users care about in such products, and develop the product accordingly.
For these situations, the prior art usually applies a topic clustering algorithm to the collected user comment information, that is, the comment corpus, to obtain the corresponding topic words, which are the users' points of concern. However, in natural language processing, topic clustering algorithms generally can only analyze long-text corpora such as news reports, whereas collected user comments are almost all short texts. Because the feature words of short texts are sparse and strongly context-dependent, applying a topic clustering algorithm directly often fails to yield topic words. Moreover, some user comments have no clear topic of their own, so multiple topic words are often obtained and the topic clustering algorithm does not achieve its intended effect. If the topic information of the field is instead extracted manually to improve accuracy, this consumes a great deal of labor, increases cost, and is very inefficient.
To solve these problems, the present application proposes a data processing method, device, and electronic equipment. A keyword library is built from the comment corpus associated with a target object. The comment phrases in the corpus that contain a first keyword in the library are then selected and merged into the pending corpus of that keyword, which yields a long text containing the first keyword. Topic clustering is then applied to the pending corpus of each keyword, so that the topic words the comment corpus expresses about the target object can be obtained accurately. This addresses the prior-art problem that applying a topic clustering algorithm directly to short-text corpora yields multiple topic words, and it removes the need to extract topic words from short texts manually, which reduces labor and material costs and greatly improves efficiency.
Referring to Fig. 1, a flowchart of a data processing method provided by an embodiment of the present application, the method may include:
Step S11: build a keyword library for the comment corpus by using the obtained comment corpus associated with the target object;
In practice, after the comment corpus associated with the target object is obtained, it can first be segmented into words. The term frequency-inverse document frequency (TF-IDF) algorithm is then used to calculate the inverse document frequency (IDF) value of each segmented word, keywords in the comment corpus are selected according to the size of their IDF values, and these keywords form the keyword library.
TF-IDF is a statistical method for assessing how important a word is to one file in a file set or corpus. The importance of a word increases in proportion to the number of times it appears in the file, and decreases in inverse proportion to the frequency with which it appears across the corpus. The IDF value serves as a measure of the importance of the corresponding word; it can be obtained by dividing the total number of files by the number of files containing the word and taking the logarithm of the quotient. The details are not elaborated here.
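The verbal definition above can be written compactly as follows. The symbols are assumptions introduced here for illustration and do not appear in the source: D is the file set, d a single file, t a word, and tf(t, d) the number of times t occurs in d.

```latex
\mathrm{idf}(t, D) = \log \frac{|D|}{\left|\{\, d \in D : t \in d \,\}\right|},
\qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```

Under this form a word that occurs in every file has an IDF of zero, while a word confined to a few files scores high.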
The present embodiment can therefore use the size of the IDF value to decide whether the corresponding word is a keyword. In general, the larger the IDF value, the more likely the word is a keyword. Specifically, the present application can first select the words whose IDF value exceeds a preset threshold as candidate keywords.
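A minimal sketch of this candidate selection, assuming each comment counts as one "file" and has already been segmented into a list of words (the source does not name a segmentation tool). The function names, the sample comments, and the threshold value are hypothetical.

```python
import math

def idf_scores(tokenized_comments):
    """Return {word: idf}, where idf = log(N / number of comments containing the word)."""
    total = len(tokenized_comments)
    doc_freq = {}
    for tokens in tokenized_comments:
        for word in set(tokens):          # count each word once per comment
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(total / df) for word, df in doc_freq.items()}

def candidate_keywords(tokenized_comments, threshold):
    """Keep the words whose IDF value exceeds the preset threshold."""
    scores = idf_scores(tokenized_comments)
    return {word for word, score in scores.items() if score > threshold}

# Hypothetical, already segmented comments about a phone.
comments = [
    ["battery", "drains", "fast"],
    ["screen", "is", "bright"],
    ["battery", "is", "weak"],
    ["it", "is", "fine"],
]
candidates = candidate_keywords(comments, threshold=0.5)
```

With this toy data the very common word "is" falls below the threshold, but a pronoun such as "it" still passes because it is rare, which is one reason a separate irrelevant-word filter is useful.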
Afterwards, to improve keyword accuracy and processing efficiency, the present application can also filter irrelevant words out of these candidate keywords, that is, delete comment words unrelated to the attributes of the target object, such as personal pronouns like "I" and "it" and various modal particles. In addition, topic words commonly used in the field can be added, so that the resulting keyword library is richer.
Note that the present application does not limit the specific process of building the keyword library of the comment corpus. The way of determining candidate keywords is not limited to the TF-IDF algorithm described above, and the way of filtering irrelevant words from the candidate keywords is not limited either.
Step S12: select, from the comment corpus, the comment phrases that contain a first keyword in the keyword library, and merge them to obtain the pending corpus of the first keyword;
Here, a comment phrase may be a part of the comment corpus that contains the first keyword, for example the first keyword together with the n words immediately before and after it.
In practice, the comment corpus initially obtained for the target object can be scanned to find the first keyword it contains and, at the same time, the words adjacent to that keyword. The resulting comment phrases of the first keyword are then spliced together to obtain one long-text corpus for the first keyword, that is, the pending corpus.
The present application does not limit the order in which the comment phrases of the first keyword are spliced; they can be spliced in any order to obtain the long-text corpus. The way of obtaining the comment phrases of the first keyword is likewise not limited to the one described above.
Note that the first keyword can be any keyword in the keyword library; it does not refer to one particular keyword.
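Step S12 can be sketched as below, assuming segmented comments and the "n words before and after" form of a comment phrase mentioned above. The function names and sample data are hypothetical, and scan order is used as the splicing order only because the source leaves the order open.

```python
def extract_comment_phrases(tokens, keyword, n=2):
    """Return one phrase per occurrence: the keyword plus up to n words on each side."""
    phrases = []
    for i, token in enumerate(tokens):
        if token == keyword:
            phrases.append(tokens[max(0, i - n): i + n + 1])
    return phrases

def build_pending_corpus(tokenized_comments, keyword, n=2):
    """Splice all phrases found for one keyword into a single long document."""
    merged = []
    for tokens in tokenized_comments:
        for phrase in extract_comment_phrases(tokens, keyword, n):
            merged.extend(phrase)
    return merged

comments = [
    ["the", "battery", "charges", "slowly"],
    ["love", "the", "screen"],
    ["battery", "life", "is", "short", "sadly"],
]
pending = build_pending_corpus(comments, "battery", n=2)
```

Each short comment contributes only the words near the keyword, so the merged document stays focused on that keyword while becoming long enough for topic clustering.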
Step S13: determine whether the pending corpora of all keywords in the keyword library have been obtained; if yes, go to step S14; if no, return to step S12 and continue selecting comment phrases for keywords;
Optionally, the present application can compare the number of pending corpora obtained with the number of keywords in the keyword library to determine whether all keywords have been processed. Alternatively, the selection can proceed in the order of the keywords in the library, in which case checking whether the first keyword of the pending corpus just obtained is the last keyword in the library shows whether selection is complete. The present application does not limit the specific implementation of step S13.
Note also that when the result of step S13 is no, a new first keyword can be selected and its pending corpus obtained. To this end, keywords that have already been processed can be marked during selection, to avoid repeated or missed selection, although other approaches are possible.
Step S14: perform topic clustering on the obtained pending corpus of each keyword to determine the topic words of the comment corpus for the target object.
Optionally, after the pending corpora of all keywords in the keyword library are obtained, they can be assembled into a collection of pending corpora, and a topic clustering algorithm is then applied to this collection to obtain the topic words for the target object.
Because the processing above makes every pending corpus a long text, applying a topic clustering algorithm to them can obtain the corresponding topic words accurately, with no need for manual extraction of topic information, which reduces labor cost and greatly improves data processing efficiency.
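The source does not say which topic clustering algorithm is used. As one possible stand-in, the sketch below runs latent Dirichlet allocation (LDA) with collapsed Gibbs sampling over the merged per-keyword documents. The choice of LDA, the function name, the hyperparameters, and the sample documents are all assumptions made for illustration.

```python
import random
from collections import Counter

def lda_topic_words(docs, num_topics=2, iterations=200,
                    alpha=0.1, beta=0.01, top_n=3, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns, for each topic, up to top_n words with the highest counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    assign = []
    for d, doc in enumerate(docs):                        # random initial assignment
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]                          # remove the current assignment
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k                          # record the resampled topic
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return [[w for w, c in topic_word[k].most_common(top_n) if c > 0]
            for k in range(num_topics)]

# Hypothetical pending corpora, one merged document per keyword.
pending_docs = [
    ["battery", "charge", "slow", "battery", "drain"],
    ["screen", "bright", "screen", "sharp"],
    ["battery", "drain", "fast"],
]
topics = lda_topic_words(pending_docs, num_topics=2, iterations=50)
```

On a corpus this small the topics are not meaningful; the point is only the shape of the step: merged per-keyword documents go in, and a short list of high-weight words per topic comes out.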
Building on the above analysis, the present application provides a preferred embodiment of the data processing method. Practical applications are not limited to the implementation process described in this preferred embodiment, which serves only as an example of the scheme.
Referring to Fig. 2, a flowchart of another data processing method provided by an embodiment of the present application, the method may include:
Step S21: obtain the comment corpus associated with the target object;
In practice, the comment corpus can be obtained from a third party, or obtained from the application platforms by means of multiple user identifiers, as the comments that correspond to those user identifiers and are associated with the target object. The present application does not limit how the comment corpus is obtained.
Step S22: segment the comment corpus into words and use the TF-IDF algorithm to calculate the IDF value of each resulting word;
Step S23: select the words whose IDF value exceeds the preset threshold as candidate keywords;
Step S24: filter irrelevant words out of the candidate keywords to obtain the keyword library of the comment corpus;
Here, irrelevant words may be comment words unrelated to the attributes of the target object, such as personal pronouns. The irrelevant words may differ from one comment corpus to another, and the present application does not limit their content or how they are determined.
Step S25: supplement the candidate keywords with preset keywords for the target object to obtain the keyword library of the comment corpus;
In this embodiment, the preset keywords can be topic words commonly used in the field of the target object, and they generally differ between target objects. If the target object is a mobile phone, the preset keywords can include standby time, performance, and so on; if the target object is a watch, they can include waterproofing, accuracy, service life, and so on. Further examples are not listed here.
By supplementing industry topic words in this way, the present application makes the resulting keyword library richer, which lays a foundation for obtaining the topic words of the target object accurately.
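Steps S24 and S25 amount to a set difference followed by a set union. A minimal sketch follows; the function name and the contents of the irrelevant-word list are hypothetical, while the preset keywords reuse the mobile phone example from the text.

```python
def build_keyword_library(candidates, irrelevant_words, preset_keywords):
    """Drop irrelevant candidates (S24), then add the preset domain keywords (S25)."""
    filtered = {word for word in candidates if word not in irrelevant_words}
    return filtered | set(preset_keywords)

# Hypothetical irrelevant-word list: pronouns and modal particles.
IRRELEVANT = {"i", "it", "ah", "oh"}
# Preset keywords for a mobile phone, following the example in the text.
PRESET = ["standby time", "performance"]

library = build_keyword_library({"battery", "it", "screen"}, IRRELEVANT, PRESET)
```

Keeping the irrelevant-word list and the preset keywords as inputs rather than constants reflects the point that both depend on the target object and the comment corpus.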
Step S26: scan the comment corpus;
In practice, scanning can start from the first comment in the comment corpus, although other orders can also be used; the present application does not limit the scanning order.
Step S27: for the first keyword in the keyword library, extract from the currently scanned comment at least one comment phrase composed of the detected first keyword and at least one adjacent word;
Here, the adjacent word lies within a preset length of the first keyword, and the first keyword can be any keyword in the keyword library rather than one particular keyword. The comment phrases of every keyword in the library can therefore be determined in the manner described in step S27.
Optionally, when determining the comment phrases of a keyword, the present application can use a preset-length window to extract, from the currently scanned comment, a comment phrase containing the first keyword. Note that the length of the preset-length window is greater than the character length of the first keyword, so that the comment phrase selected by the window contains other words in addition to the keyword.
In this optional embodiment, a preset-length window at a fixed position can be set and the words inside it detected in real time; their IDF values can then be calculated in the manner described above to decide whether they are candidate keywords. Specifically, once the first keyword is determined, the window can be positioned so that the first character of the first keyword is the first character in the window, or so that the last character of the first keyword is the last character in the window, and the comment phrase is whatever the window then contains.
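The two window alignments can be sketched at character level as follows. This is an illustrative reading of the text, not the source's implementation; the function name, the `align` parameter, and the English sample sentence are assumptions.

```python
def window_phrases(text, keyword, window_len, align="start"):
    """Cut a fixed-length character window around each occurrence of keyword.

    align="start": the keyword's first character is the window's first character.
    align="end":   the keyword's last character is the window's last character.
    """
    if window_len <= len(keyword):
        raise ValueError("window must be longer than the keyword")
    phrases = []
    pos = text.find(keyword)
    while pos != -1:
        if align == "start":
            begin, end = pos, pos + window_len
        else:
            end = pos + len(keyword)
            begin = max(0, end - window_len)   # clamp at the start of the comment
        phrases.append(text[begin:end])
        pos = text.find(keyword, pos + 1)
    return phrases

comment = "the battery charges slowly"
after = window_phrases(comment, "battery", 15, align="start")
before = window_phrases(comment, "battery", 15, align="end")
```

The length check mirrors the requirement that the window be longer than the keyword, so that every extracted phrase carries some context besides the keyword itself.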
The present application can also analyze historical operation data to find the positions at which keywords appear in comments of this kind, and place the preset-length window at those positions to obtain the comment phrases of the candidate keywords. The present application does not limit how the comment phrases of candidate keywords are determined with a fixed-position preset-length window.
As another embodiment, the preset-length window can be moved according to the content of the comment to obtain the comment phrases of candidate keywords. Specifically, when a candidate keyword is encountered while scanning a comment, a preset-length window can be generated at that position, its position can be adjusted as appropriate according to the words before and after the candidate keyword, and the words inside the window are then taken as the comment phrase corresponding to that candidate keyword.
Referring to Fig. 3, after the keyword "battery" is detected, the preset-length window can be moved to that position to select the comment phrase corresponding to the keyword, for example "battery charging time is long", although the selection is not limited to this.
In addition, after a candidate keyword is found during scanning, the n words before and/or after it can be taken directly, together with the candidate keyword, as the comment phrase, where n is an integer greater than or equal to 1. The present application does not limit how the comment phrases of candidate keywords are determined.
Step S28: splice the extracted comment phrases corresponding to the first keyword into one pending corpus;
The comment phrases obtained for the first keyword can be spliced in any order; the present application does not limit the order of the comment phrases.
Step S29: determine whether the keyword library still contains a candidate keyword whose comment phrases have not been extracted; if yes, return to step S26; if no, go to step S210;
Step S210: form the target corpus from the spliced pending corpora;
Step S211: process the target corpus with a topic clustering algorithm to obtain the topic words of the comment corpus for the target object.
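Steps S22 to S211 can be strung together roughly as follows. This is a sketch under several assumptions: comments are already segmented, each comment counts as one file for the IDF calculation, comment phrases use the n-word neighborhood, and the topic clustering algorithm is supplied by the caller as `cluster`, since the source does not fix one. All names are hypothetical.

```python
import math

def run_pipeline(tokenized_comments, idf_threshold, irrelevant, preset, n, cluster):
    # S22-S23: IDF per word, keep the words above the threshold as candidates.
    total = len(tokenized_comments)
    doc_freq = {}
    for tokens in tokenized_comments:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    candidates = {w for w, df in doc_freq.items()
                  if math.log(total / df) > idf_threshold}

    # S24-S25: remove irrelevant words, then add the preset keywords.
    library = (candidates - set(irrelevant)) | set(preset)

    # S26-S29: take keywords one at a time until none is left unprocessed.
    pending = {}
    remaining = set(library)
    while remaining:
        keyword = remaining.pop()            # removal marks the keyword as processed
        merged = []
        for tokens in tokenized_comments:
            for i, token in enumerate(tokens):
                if token == keyword:
                    merged.extend(tokens[max(0, i - n): i + n + 1])
        pending[keyword] = merged

    # S210-S211: non-empty merged documents form the target corpus.
    target_corpus = [doc for doc in pending.values() if doc]
    return cluster(target_corpus)
```

Passing the clustering step in as a function keeps the sketch neutral about the algorithm; any routine that accepts a list of tokenized documents and returns topic words would fit.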
Illustrated so that destination object is electronic equipment as an example, the keywords database obtained in the manner described above can include: Battery, battery capacity, power consumption, screen intensity, processor type etc., it can be continuation of the journey to cluster obtained descriptor, it is seen then that User is concerned with its continuation of the journey to the electronic equipment, and for the combination, developer can grind to the continuation of the journey of electronic equipment Optimization is studied carefully, so as to extend its cruising time.
In summary, for the comment language material being made up of short text, the application is first to extract the multiple keys wherein included Word, constitutes keywords database, afterwards, then the comment word of same keyword is spliced into a long article this document, so, recycles master Inscribe clustering algorithm and Subject Clustering is carried out to long article this document, you can the accurate descriptor for obtaining commenting on language material, that is, user couple The focus of the destination object, is solved and directly the comment language material of short text is carried out using Subject Clustering algorithm in the prior art Processing, it is impossible to obtain the technical problem of accurate descriptor, and whole processing procedure need not manually extract subject information, save Cost of labor, and improve data-handling efficiency.
As another embodiment of the present application, on the basis of the above embodiments, the obtained topic words of the comment corpus for the target object may be presented directly in the current display area. Specifically, they may be displayed in a preset area, or presented in a pop-up prompt window, as shown in FIG. 4; the present application does not limit the manner in which the topic words are presented.
Optionally, the obtained keyword library may be stored as needed, so that after a corresponding query instruction is detected, the keyword library is output for developers' reference in improving the target object.
Referring to FIG. 5, which is a structural block diagram of a data processing apparatus provided by an embodiment of the present application, the apparatus may include:
A first building module 51, configured to build a keyword library of the comment corpus using the obtained comment corpus associated with the target object;
Optionally, as shown in FIG. 6, the first building module 51 may include:
A word segmentation unit 511, configured to perform word segmentation on the comment corpus associated with the target object;
A calculation unit 512, configured to calculate, using the term frequency-inverse document frequency (TF-IDF) algorithm, the IDF value of each word obtained after word segmentation;
A screening unit 513, configured to screen out words whose IDF value is greater than a preset threshold and determine them as candidate keywords;
A filtering unit 514, configured to filter irrelevant words out of the obtained candidate keywords to obtain the keyword library of the comment corpus;
Wherein, an irrelevant word may be a comment word unrelated to the attributes of the target object.
In practical applications, after the candidate keywords are obtained, topic words of the relevant field may also be used as preset keywords to supplement the candidate keywords; alternatively, after the keyword library is obtained, it may be supplemented with the preset keywords. This makes the resulting keyword library richer and more comprehensive, which improves the accuracy of the topic words finally obtained for the target object.
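The work of units 511 to 514, together with the optional supplement by preset keywords, can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the application: whitespace splitting stands in for a real word segmenter, the IDF value is taken as the unsmoothed log(N / df), and the names `idf_values` and `build_keyword_library`, the threshold, and the irrelevant-word list are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def idf_values(tokenized_comments: List[List[str]]) -> Dict[str, float]:
    """IDF(w) = log(N / df(w)), where df(w) is the number of comments containing w."""
    n_docs = len(tokenized_comments)
    doc_freq: Counter = Counter()
    for tokens in tokenized_comments:
        doc_freq.update(set(tokens))     # count each word once per comment
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def build_keyword_library(
    tokenized_comments: List[List[str]],
    threshold: float,
    irrelevant: Iterable[str],
    preset: Iterable[str] = (),
) -> List[str]:
    idf = idf_values(tokenized_comments)
    candidates = {w for w, v in idf.items() if v > threshold}  # screening step
    library = candidates - set(irrelevant)                     # irrelevant-word filtering
    library |= set(preset)                                     # optional supplement
    return sorted(library)


comments = [
    "the battery is great",
    "the screen is great",
    "the battery is weak",
    "the delivery is slow",
]
tokenized = [c.split() for c in comments]  # stand-in for word segmentation
library = build_keyword_library(
    tokenized,
    threshold=0.5,
    irrelevant={"great", "weak", "delivery", "slow"},
    preset=["power consumption"],
)
print(library)  # ['battery', 'power consumption', 'screen']
```

Words that occur in every comment, such as "the" and "is" here, get an IDF value of 0 and fall below the threshold, while the irrelevant-word list removes words that pass the threshold but do not describe an attribute of the target object.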
A screening and merging module 52, configured to screen the comment corpus for comment words containing a first keyword in the keyword library, and merge them to obtain the to-be-processed corpus of the first keyword;
Wherein, a comment word may be a part of the comment corpus that contains the first keyword.
Specifically, as shown in FIG. 7, the screening and merging module 52 may include:
A scanning unit 521, configured to scan the comment corpus;
An extraction unit 522, configured to extract, for a first keyword in the keyword library, at least one comment word from the currently scanned comment corpus, each comment word consisting of the detected first keyword and at least one adjacent word;
Wherein, the distance between an adjacent word and the first keyword may be within a preset length range; the present application does not limit this preset length range, which may be determined according to the character length of the first keyword.
In practical applications, a preset length window may be used to extract the comment words containing the first keyword from the currently scanned comment corpus, where the length of the preset length window is greater than the character length of the first keyword. The preset length window may be a window whose position is fixed relative to the first keyword, or a window that moves according to the content of the first keyword. For details, refer to the description of the corresponding part of the above method embodiments, which is not repeated in this embodiment.
A splicing unit 523, configured to splice the extracted comment words corresponding to the first keyword into one to-be-processed corpus.
It should be noted that the present application does not limit the order in which multiple comment words are spliced.
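The preset length window of extraction unit 522 and the splicing of unit 523 can be sketched as below. The sketch assumes the fixed-position variant of the window, measured in characters, with the same hypothetical `margin` on both sides of the keyword; the function names are likewise illustrative only.

```python
from typing import List


def extract_with_window(comment: str, keyword: str, margin: int) -> List[str]:
    """Return one comment word per keyword occurrence: the keyword plus up to
    `margin` characters on each side, so the window exceeds the keyword length."""
    found: List[str] = []
    start = comment.find(keyword)
    while start != -1:
        left = max(0, start - margin)
        right = min(len(comment), start + len(keyword) + margin)
        found.append(comment[left:right])
        start = comment.find(keyword, start + len(keyword))  # next occurrence
    return found


def splice(comment_words: List[str]) -> str:
    """Join the comment words of one keyword into a single to-be-processed document."""
    return " ".join(comment_words)


comments = ["the battery drains quickly", "great battery life"]
words: List[str] = []
for comment in comments:
    words.extend(extract_with_window(comment, "battery", margin=7))
document = splice(words)
print(document)  # the battery drains great battery life
```

A character window suits unsegmented Chinese text; for space-delimited languages it can cut a word in half, so a window counted in tokens may be the better choice there.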
A clustering module 53, configured to perform topic clustering on the obtained to-be-processed corpus of each keyword to obtain the topic words of the comment corpus for the target object.
The present application may use a topic clustering algorithm such as LDA (Latent Dirichlet Allocation, a document topic generation model) to process the obtained to-be-processed corpus set; the specific implementation process is not described in detail here.
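Since the text names LDA without detailing it, the following is one possible sketch of topic clustering over the spliced documents, using a textbook collapsed Gibbs sampler written with the Python standard library only. It is not taken from the application: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions.

```python
import random
from collections import Counter
from typing import List


def lda_gibbs(docs: List[List[str]], n_topics: int, n_iter: int = 200,
              alpha: float = 0.1, beta: float = 0.01, seed: int = 0) -> List[Counter]:
    """Collapsed Gibbs sampling for LDA; returns one word-count table per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens of doc d assigned to topic k
    topic_word = [Counter() for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                       # tokens assigned to topic k
    assign: List[List[int]] = []
    for d, doc in enumerate(docs):                     # random initial assignment
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]                       # remove the current assignment
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k                       # record the resampled topic
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word


# Each document is the spliced long text of one keyword, already tokenized.
docs = [
    "battery drains fast battery lasts day battery charge slow".split(),
    "screen bright screen sharp screen colors vivid".split(),
]
topics = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topics):
    print(k, [w for w, c in counts.most_common(3) if c > 0])
```

The highest-count words of each topic are the candidates for topic words. Sampling is stochastic, so the exact grouping depends on the seed and on how much text each spliced document contains; in practice an established LDA library would normally be used instead of a hand-written sampler.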
In summary, in this embodiment, a keyword library is built for the comment corpus associated with the target object; comment words containing a first keyword in the keyword library are then screened from the comment corpus and merged to obtain the to-be-processed corpus of the first keyword, that is, a long text containing the first keyword; topic clustering is then performed on the to-be-processed corpus of each keyword, so that the topic words of the comment corpus for the target object can be obtained accurately. This solves the problem in the prior art that directly processing a short-text corpus with a topic clustering algorithm yields multiple topic words, and, because the topic words of the short-text corpus need not be extracted manually, it reduces the consumption of manpower and material resources and greatly improves working efficiency.
The data processing apparatus has been described above mainly from the perspective of virtual functional modules; the hardware structure that implements the above data processing scheme is described below.
Referring to FIG. 8, a structural diagram of an electronic device provided by an embodiment of the present application, and to FIG. 9, a hardware structural diagram of an electronic device provided by an embodiment of the present application, the electronic device may be a computer, a mobile phone, or similar equipment; the present application does not limit the product type of the electronic device. In practical applications, the electronic device may include:
A communication module 81, configured to obtain the comment corpus associated with the target object;
In practical applications, the comment corpus associated with the target object may be obtained through a third party, so the electronic device may establish a communication connection with a third-party device through the communication module 81 in order to obtain the comment corpus.
It should be noted that the present application does not limit the specific structure of the communication module 81, which may be a wireless communication module or a wired communication module, such as a GSM module, a WIFI module, or a Bluetooth module.
A processor 82, configured to build a keyword library of the comment corpus using the comment corpus; screen the comment corpus for comment words containing a first keyword in the keyword library and merge them to obtain the to-be-processed corpus of the first keyword, a comment word being a part of the comment corpus that contains the first keyword; and perform topic clustering on the obtained to-be-processed corpus of each keyword to obtain the topic words of the comment corpus for the target object.
Wherein, for the detailed process by which the processor 82 implements these functions, refer to the description of the corresponding part of the above method embodiments, which is not detailed in this embodiment.
Optionally, as shown in FIG. 9, the electronic device may further include:
A memory 83, configured to store the obtained comment corpus; as needed, it may also store information such as the obtained keyword library and topic words. The present application does not limit the manner in which these data are stored.
In this embodiment, the memory 83 may include high-speed RAM, or non-volatile memory, for example at least one disk memory.
An input device 84, configured to obtain preset keywords for the target object to supplement the keyword library;
In this embodiment, the input device may include a keyboard, a mouse, a voice collector, or the like, as determined by the data input mode of the electronic device.
A display 85, configured to output the obtained topic words of the target object.
In summary, the electronic device extracts the keywords in the short texts, splices the comment words of the same keyword into one long text, and then performs topic clustering on the resulting long texts, thereby accurately obtaining the topic words of the comment corpus for the target object. This solves the problem that topic clustering algorithms are not suited to clustering short texts, and, because topic information need not be extracted manually, data processing efficiency is greatly improved.
Finally, it should be noted that in the above embodiments, relational terms such as first and second are used only to distinguish one operation, unit, or module from another, and do not necessarily require or imply any actual relationship or order between these units, operations, or modules. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, or system comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, or system that includes the element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar among the embodiments, reference may be made to one another. Because the apparatus and the electronic device disclosed in the embodiments correspond to the method disclosed in the embodiments, they are described relatively briefly; for relevant details, refer to the description of the method.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A data processing method, characterized in that the method comprises:
Building a keyword library of a comment corpus using the obtained comment corpus associated with a target object;
Screening the comment corpus for comment words containing a first keyword in the keyword library, and merging them to obtain a to-be-processed corpus of the first keyword, a comment word being a part of the comment corpus that contains the first keyword;
Performing topic clustering on the obtained to-be-processed corpus of each keyword to obtain topic words of the comment corpus for the target object.
2. The method according to claim 1, characterized in that screening the comment corpus for comment words containing a first keyword in the keyword library, and merging them to obtain the to-be-processed corpus of the first keyword, comprises:
Scanning the comment corpus;
For a first keyword in the keyword library, extracting at least one comment word from the currently scanned comment corpus, each comment word consisting of the detected first keyword and at least one adjacent word, wherein the distance between the adjacent word and the first keyword is within a preset length range;
Splicing the extracted comment words corresponding to the first keyword into one to-be-processed corpus.
3. The method according to claim 1, characterized in that building the keyword library of the comment corpus using the obtained comment corpus associated with the target object comprises:
Obtaining candidate keywords in the comment corpus associated with the target object;
Filtering irrelevant words out of the obtained candidate keywords to obtain the keyword library of the comment corpus, an irrelevant word being a comment word unrelated to the attributes of the target object.
4. The method according to claim 3, characterized in that the method further comprises:
Supplementing the keyword library with preset keywords for the target object.
5. The method according to claim 2, characterized in that extracting at least one comment word from the currently scanned comment corpus, each comment word consisting of the detected first keyword and at least one adjacent word, comprises:
Extracting the comment words containing the first keyword from the currently scanned comment corpus using a preset length window;
Wherein the length of the preset length window is greater than the character length of the first keyword.
6. The method according to claim 3, characterized in that obtaining the candidate keywords in the comment corpus associated with the target object comprises:
Performing word segmentation on the comment corpus associated with the target object;
Calculating, using the term frequency-inverse document frequency (TF-IDF) algorithm, the IDF value of each word obtained after word segmentation;
Screening out words whose IDF value is greater than a preset threshold and determining them as candidate keywords.
7. A data processing apparatus, characterized in that the apparatus comprises:
A first building module, configured to build a keyword library of a comment corpus using the obtained comment corpus associated with a target object;
A screening and merging module, configured to screen the comment corpus for comment words containing a first keyword in the keyword library, and merge them to obtain a to-be-processed corpus of the first keyword, a comment word being a part of the comment corpus that contains the first keyword;
A clustering module, configured to perform topic clustering on the obtained to-be-processed corpus of each keyword to obtain topic words of the comment corpus for the target object.
8. An electronic device, characterized in that the electronic device comprises:
A communication module, configured to obtain a comment corpus associated with a target object;
A processor, configured to build a keyword library of the comment corpus using the comment corpus; screen the comment corpus for comment words containing a first keyword in the keyword library and merge them to obtain a to-be-processed corpus of the first keyword, a comment word being a part of the comment corpus that contains the first keyword; and perform topic clustering on the obtained to-be-processed corpus of each keyword to obtain topic words of the comment corpus for the target object.
9. The electronic device according to claim 8, characterized in that the electronic device further comprises:
An input device, configured to obtain preset keywords for the target object to supplement the keyword library;
And/or;
A display, configured to output the topic words of the target object.
10. The electronic device according to claim 8, characterized in that the processor is specifically configured to scan the comment corpus; for a first keyword in the keyword library, extract at least one comment word from the currently scanned comment corpus, each comment word consisting of the detected first keyword and at least one adjacent word; and splice the extracted comment words corresponding to the first keyword into one to-be-processed corpus, wherein the distance between the adjacent word and the first keyword is within a preset length range.
CN201710202705.5A 2017-03-30 2017-03-30 Data processing method, device and electronic equipment Pending CN106970988A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201710202705.5A CN106970988A (en) 2017-03-30 2017-03-30 Data processing method, device and electronic equipment
PCT/CN2017/102942 WO2018176764A1 (en) 2017-03-30 2017-09-22 Data processing method and apparatus, and electronic device
US16/587,440 US11468108B2 (en) 2017-03-30 2019-09-30 Data processing method and apparatus, and electronic device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710202705.5A CN106970988A (en) 2017-03-30 2017-03-30 Data processing method, device and electronic equipment

Publications (1)

Publication Number Publication Date
CN106970988A true CN106970988A (en) 2017-07-21

Family

ID=59336336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710202705.5A Pending CN106970988A (en) 2017-03-30 2017-03-30 Data processing method, device and electronic equipment

Country Status (3)

Country Link
US (1) US11468108B2 (en)
CN (1) CN106970988A (en)
WO (1) WO2018176764A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109857922A (en) * 2019-01-18 2019-06-07 深圳壹账通智能科技有限公司 Data evaluate and test model modelling approach, device, computer equipment and storage medium
CN110321561A (en) * 2019-06-27 2019-10-11 腾讯科技(深圳)有限公司 A kind of keyword extracting method and device
CN112464081A (en) * 2020-09-08 2021-03-09 广东省华南技术转移中心有限公司 Project information matching method, device and storage medium
CN112101012B (en) * 2020-09-25 2024-04-26 北京百度网讯科技有限公司 Interactive domain determining method and device, electronic equipment and storage medium
CN112364169B (en) * 2021-01-13 2022-03-04 北京云真信科技有限公司 Nlp-based wifi identification method, electronic device and medium
CN113177399B (en) * 2021-04-25 2024-02-06 网易(杭州)网络有限公司 Text processing method, device, electronic equipment and storage medium
CN115103212B (en) * 2022-06-10 2023-09-05 咪咕文化科技有限公司 Bullet screen display method, bullet screen processing device and electronic equipment
CN115408420B (en) * 2022-09-02 2023-08-01 自然资源部地图技术审查中心 Method and apparatus for automatically filtering map notes and points of interest using a computer
CN115840510B (en) * 2023-02-21 2023-04-28 中航信移动科技有限公司 Input association method for intelligent questioning and answering of civil aviation, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013015971A (en) * 2011-07-01 2013-01-24 Kddi Corp Representative comment extraction method and program
CN103903164A (en) * 2014-03-25 2014-07-02 华南理工大学 Semi-supervised automatic aspect extraction method and system based on domain information
CN106445912A (en) * 2016-08-31 2017-02-22 五八同城信息技术有限公司 Evaluation information processing method and apparatus

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8423565B2 (en) * 2006-12-21 2013-04-16 Digital Doors, Inc. Information life cycle search engine and method
US8082248B2 (en) * 2008-05-29 2011-12-20 Rania Abouyounes Method and system for document classification based on document structure and written style
US8949252B2 (en) * 2010-03-29 2015-02-03 Ebay Inc. Product category optimization for image similarity searching of image-based listings in a network-based publication system
US8316030B2 (en) * 2010-11-05 2012-11-20 Nextgen Datacom, Inc. Method and system for document classification or search using discrete words
CN104615593B (en) * 2013-11-01 2017-09-29 北大方正集团有限公司 Hot microblog topic automatic testing method and device
US9495444B2 (en) * 2014-02-07 2016-11-15 Quixey, Inc. Rules-based generation of search results
US9348920B1 (en) * 2014-12-22 2016-05-24 Palantir Technologies Inc. Concept indexing among database of documents using machine learning techniques
CN104778209B (en) * 2015-03-13 2018-04-27 国家计算机网络与信息安全管理中心 A kind of opining mining method for millions scale news analysis
US10810217B2 (en) * 2015-10-07 2020-10-20 Facebook, Inc. Optionalization and fuzzy search on online social networks
CN105512277B (en) * 2015-12-04 2019-09-20 北京航空航天大学 A kind of short text clustering method towards Book Market title
US10354009B2 (en) * 2016-08-24 2019-07-16 Microsoft Technology Licensing, Llc Characteristic-pattern analysis of text
US11106712B2 (en) * 2016-10-24 2021-08-31 Google Llc Systems and methods for measuring the semantic relevance of keywords
US20180246973A1 (en) * 2017-02-28 2018-08-30 Laserlike Inc. User interest modeling
CN106970988A (en) * 2017-03-30 2017-07-21 联想(北京)有限公司 Data processing method, device and electronic equipment

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176764A1 (en) * 2017-03-30 2018-10-04 联想(北京)有限公司 Data processing method and apparatus, and electronic device
US11468108B2 (en) 2017-03-30 2022-10-11 Lenovo (Beijing) Limited Data processing method and apparatus, and electronic device thereof
CN107730346A (en) * 2017-09-25 2018-02-23 北京京东尚科信息技术有限公司 The method and apparatus of article cluster
CN107872714A (en) * 2017-10-27 2018-04-03 咪咕视讯科技有限公司 A kind of processing method of barrage, terminal device and computer-readable recording medium
CN107872714B (en) * 2017-10-27 2020-09-18 咪咕视讯科技有限公司 Bullet screen processing method, terminal equipment and computer readable storage medium
CN108984519A (en) * 2018-06-14 2018-12-11 华东理工大学 Event corpus method for auto constructing, device and storage medium based on double mode
CN108984519B (en) * 2018-06-14 2022-07-05 华东理工大学 Dual-mode-based automatic event corpus construction method and device and storage medium
CN109871486B (en) * 2019-02-18 2021-04-06 合肥工业大学 Product demand analysis method and system for market-ahead under social media environment
CN109871486A (en) * 2019-02-18 2019-06-11 合肥工业大学 The Product Requirement Analysis method and system of perceived social support under social media environment
CN111709226A (en) * 2020-06-18 2020-09-25 中国银行股份有限公司 Text processing method and device
CN111709226B (en) * 2020-06-18 2023-10-13 中国银行股份有限公司 Text processing method and device
CN112052397A (en) * 2020-09-29 2020-12-08 北京百度网讯科技有限公司 User feature generation method and device, electronic equipment and storage medium
CN112052397B (en) * 2020-09-29 2024-05-03 北京百度网讯科技有限公司 User characteristic generation method and device, electronic equipment and storage medium
CN112491649A (en) * 2020-11-17 2021-03-12 中国平安财产保险股份有限公司 Interface joint debugging test method and device, electronic equipment and storage medium
CN114372446A (en) * 2021-12-13 2022-04-19 北京五八信息技术有限公司 Vehicle attribute labeling method, device and storage medium

Also Published As

Publication number Publication date
WO2018176764A1 (en) 2018-10-04
US20200026724A1 (en) 2020-01-23
US11468108B2 (en) 2022-10-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170721
