CN107679217B - Associated content extraction method and device based on data mining

Info

Publication number: CN107679217B
Application number: CN201710976636.3A
Authority: CN
Inventors: 徐伟建; 刘建林
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2017-10-19
Filing date: 2017-10-19
Publication date: 2021-12-07
Anticipated expiration: 2037-10-19
Also published as: CN107679217A

Abstract

The embodiments of the application disclose a method and a device for extracting associated content based on data mining. One embodiment of the method comprises: acquiring data to be processed, wherein the data to be processed comprises a preset query object; determining candidate comment tags associated with the preset query object in the data to be processed; screening out comment tags from the candidate comment tags; and determining the presentation order of the comment tags based on the number of user clicks on each comment tag. In this way, comment tags of the preset query object are intelligently extracted and presented according to priority.

Description

Associated content extraction method and device based on data mining

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to the technical field of internet, and particularly relates to a method and a device for extracting associated content based on data mining.

Background

In existing search tools, search keywords are typically entered by a user and corresponding search results are presented to the user after the user triggers a search.

When a user wants a summary view of a search keyword, the user has to read the search results one by one and then summarize and refine them manually.

Data mining generally refers to the process of using algorithms to search a large amount of data for information hidden in it. Data mining is closely related to computer science and achieves this goal through many methods, such as statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past rules of thumb), and pattern recognition.

Existing search tools offer no technical solution that uses data mining to show a summary view of the search keyword in the search results.

Disclosure of Invention

The embodiment of the application aims to provide a method and a device for extracting associated content based on data mining.

In a first aspect, an embodiment of the present application provides a method for extracting associated content based on data mining, including: acquiring data to be processed, wherein the data to be processed comprises a preset query object; determining candidate comment tags associated with a preset query object in the data to be processed; screening out comment tags from the candidate comment tags; and determining the presentation sequence of the comment tags based on the click quantity of the user on the comment tags.

In some embodiments, determining candidate comment tags associated with the preset query object in the data to be processed includes: extracting the candidate comment tags associated with the preset query object from the data to be processed based on a natural language processing method.

In some embodiments, screening out comment tags from the candidate comment tags includes: removing, based on a preset matching rule, candidate comment tags that do not match the preset query object from the candidate comment tags, so as to screen out the comment tags.

In some embodiments, the data to be processed includes comment data for a preset query object, and the method further includes: obtaining comment data containing a preset query object from a preset hotspot data source; determining the weight of each comment data; and determining the display sequence of the comment data based on the weight of the comment data.

In some embodiments, obtaining comment data containing a preset query object from a preset hotspot data source includes: acquiring candidate comment data containing a preset query object from a preset hotspot data source; and determining comment data from the candidate comment data based on the page browsing amount of each piece of candidate comment data.

In some embodiments, determining the weight of each piece of comment data includes determining the weight based on any one of the following: determining the weight of the comment data based on whether the comment data contains hot words whose number of co-occurrences with the preset query object exceeds a preset number; determining a quality score of the comment data based on a machine learning algorithm, and determining the weight of the comment data based on the quality score; and determining the weight of the comment data based on the number of user clicks on the comment data.

In some embodiments, the method further comprises: determining the emotional tendency of each piece of comment data based on a natural language processing tool, and determining the favorable rating of the preset query object based on the emotional tendency of each piece of comment data.

In some embodiments, the method further comprises: generating a favorable rating curve of the preset query object based on the favorable rating of the preset query object in each preset time period.

In a second aspect, an embodiment of the present application provides an associated content extraction device based on data mining, including: a to-be-processed data acquisition unit for acquiring data to be processed, the data to be processed comprising a preset query object; a determining unit for determining candidate comment tags associated with the preset query object in the data to be processed; a first screening unit for screening out comment tags from the candidate comment tags; and a first presentation unit for determining the presentation order of the comment tags based on the number of user clicks on each comment tag.

In some embodiments, the determining unit is further configured to: extract the candidate comment tags associated with the preset query object from the data to be processed based on a natural language processing method.

In some embodiments, the first screening unit is further configured to: remove, based on a preset matching rule, candidate comment tags that do not match the preset query object from the candidate comment tags, so as to screen out the comment tags.

In some embodiments, the data to be processed includes comment data for a preset query object, and the apparatus further includes: the comment data acquisition unit is used for acquiring comment data containing a preset query object from a preset hotspot data source; a weight determination unit configured to determine a weight of each comment data; and the second presentation unit is used for determining the display sequence of the comment data based on the weight of the comment data.

In some embodiments, the comment data acquisition unit is further configured to: acquiring candidate comment data containing a preset query object from a preset hotspot data source; and determining comment data from the candidate comment data based on the page browsing amount of each piece of candidate comment data.

In some embodiments, the weight determination unit is further configured to determine the weight of each piece of comment data based on any one of the following: determining the weight of the comment data based on whether the comment data contains hot words whose number of co-occurrences with the preset query object exceeds a preset number; determining a quality score of the comment data based on a machine learning algorithm, and determining the weight of the comment data based on the quality score; and determining the weight of the comment data based on the number of user clicks on the comment data.

In some embodiments, the apparatus further comprises: a favorable rating determining unit for determining the emotional tendency of each piece of comment data based on a natural language processing tool and determining the favorable rating of the preset query object based on the emotional tendency of each piece of comment data.

In some embodiments, the apparatus further comprises: a favorable rating curve generating unit for generating a favorable rating curve of the preset query object based on the favorable rating of the preset query object in each preset time period.

In a third aspect, an embodiment of the present application provides a server, including: one or more processors; a storage device for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement the method as above.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method described above.

According to the method and the device for extracting the associated content based on the data mining, the to-be-processed data including the preset query object are obtained, the candidate comment tags associated with the preset query object are determined from the to-be-processed data, the comment tags are screened out from the candidate comment tags, and finally the presentation sequence of the comment tags is determined based on the click amount of the user on the comment tags, so that the comment tags of the preset query object are intelligently extracted and presented according to the priority.

Furthermore, when the preset query object is used as a search keyword, the user no longer needs to click through and read the search results one by one, which reduces the occupation of network resources and helps the search server run stably.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a data mining based associated content extraction method according to the present application;

FIG. 3 is a flow diagram of yet another embodiment of a data mining based associated content extraction method according to the present application;

FIG. 4 is a schematic diagram illustrating an embodiment of an associated content extraction apparatus based on data mining according to the present application;

FIG. 5 is a schematic structural diagram of a computer system suitable for implementing the terminal device or the server according to the embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein merely illustrate the invention and do not limit it. It should also be noted that, for convenience of description, only the parts relevant to the invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the data mining-based associated content extraction method or the data mining-based associated content extraction apparatus of the present application may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, networks 104, 105, a first server 106, and a second server 107. The network 104 is used to provide the medium of a communication link between the terminal devices 101, 102, 103 and the first server 106, and the network 105 is used to provide the medium of a communication link between the first server 106 and the second server 107. The networks 104, 105 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The user may use the terminal devices 101, 102, 103 to interact with the first server 106 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a search application, a web browser application, a shopping application, a mailbox client, social platform software, and the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.

The first server 106 may be a server that provides various services, such as a background search server that provides search results for search requests sent by the terminal devices 101, 102, 103. The background search server may analyze and otherwise process received data such as the search request, and feed back the processing results (e.g., the search results) to the terminal devices 101, 102, 103.

It should be noted that the method for extracting associated content based on data mining provided by the embodiment of the present application is generally executed by the first server 106, and accordingly, the associated content extracting apparatus based on data mining is generally disposed in the first server 106.

The second server 107 may be a server that provides various services, for example, a background server that generates comment tags for search keywords to which search results are directed by crawling the search results on the first server 106. The second server 107 may obtain a search result corresponding to the search keyword on the first server 106, generate a comment tag corresponding to the search keyword, and feed back the generated comment tag to the first server 106.

It should be understood that the number of terminal devices, networks, first servers and second servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

It should be noted that the method for extracting associated content based on data mining provided in the embodiment of the present application is generally executed by the second server 107, and accordingly, the associated content extracting apparatus based on data mining is generally disposed in the second server 107.

With continued reference to FIG. 2, a flow 200 of one embodiment of a data mining based associated content extraction method in accordance with the present application is illustrated. The associated content extraction method based on data mining comprises the following steps:

Step 210, obtaining data to be processed, where the data to be processed includes a preset query object.

In this embodiment, the electronic device (e.g., the second server 107 shown in fig. 1) on which the associated content extraction method based on data mining operates may obtain the data to be processed from an electronic device (e.g., the first server 106 shown in fig. 1) communicatively connected thereto through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra-wideband) connection, and other wireless connection means now known or developed in the future.

The data to be processed obtained in this step may include, but is not limited to, semi-structured data such as articles, comments made by users in social applications, and the like. Here, the semi-structured data may mean, for example, data that can form structured data by appropriate data processing.

In some application scenarios, the acquiring of the to-be-processed data of this step may be associated with a search request of a user, for example. Specifically, when a user sends a search request for a certain search keyword to a search server (e.g., the first server 106 shown in fig. 1) through a terminal device (e.g., the terminal devices 101, 102, 103 shown in fig. 1), an electronic device (e.g., the second server 107 shown in fig. 1) on which the data mining-based associated content extraction method of the present embodiment operates may receive a search result sent thereto by the search server as data to be processed for the current search keyword. In these application scenarios, the search keyword input by the user can be regarded as a preset query object in this step.

Alternatively, in other application scenarios, the obtaining of the to-be-processed data in this step may be independent of any search request of the user. Specifically, the electronic device on which the associated content extraction method based on data mining of this embodiment is executed may actively crawl to-be-processed data that includes the preset query object. For example, the electronic device may actively crawl to-be-processed data containing a keyword A from the servers of major social platform applications.

Step 220, determining candidate comment tags associated with the preset query object in the data to be processed.

Here, the candidate comment tag associated with the preset query object may be understood as information that is likely to be a feature of the preset query object, for example.

In some application scenarios, assume that the preset query object is "convolutional neural network (CNN)" and that a certain piece of data to be processed mentions both the convolutional neural network and the recurrent neural network (RNN). Evaluation phrases in that data which might describe features of the CNN, such as "high accuracy", "fast model operation speed", and "directional cycle", may then be understood as candidate comment tags.
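The candidate extraction of step 220 can be illustrated with a deliberately naive sketch. It is not the natural language processing method of the embodiment: it swaps in a plain lookup against a hypothetical, hand-written list of evaluation phrases, and the function name `extract_candidate_tags` and the sample texts are assumptions made for illustration only.

```python
def extract_candidate_tags(documents, query_object, evaluation_phrases):
    """Collect evaluation phrases from documents that mention the query object."""
    query = query_object.lower()
    candidates = []
    for text in documents:
        lowered = text.lower()
        if query not in lowered:
            continue  # this document does not mention the query object
        for phrase in evaluation_phrases:
            # Keep each phrase once, in the order it is first seen.
            if phrase.lower() in lowered and phrase not in candidates:
                candidates.append(phrase)
    return candidates


# Hypothetical sample data mirroring the CNN/RNN example in the text.
documents = [
    "The CNN has high accuracy and fast model operation speed, "
    "while the RNN contains a directional cycle.",
    "Unrelated text about something else with high accuracy.",
]
phrases = ["high accuracy", "fast model operation speed",
           "directional cycle", "low cost"]
candidates = extract_candidate_tags(documents, "CNN", phrases)
```

Because the lookup works on whole documents, the RNN-only phrase "directional cycle" is still returned as a candidate, which is why a separate screening step follows.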

Step 230, screening out comment tags from the candidate comment tags.

The purpose of this step is to remove the candidate comment tags that do not belong to the preset query object from the candidate comment tags associated with the preset query object obtained in step 220, so that the association relationship between the comment tags obtained by screening and the preset query object is more accurate.

This is illustrated with the example from step 220. The candidate comment tags extracted in step 220 for the preset query object "convolutional neural network (CNN)" include "high accuracy", "fast model operation speed", and "directional cycle". In this step, the candidate comment tag "directional cycle", which obviously cannot be used to describe a feature of the CNN, is removed, so that the comment tags obtained by screening, "high accuracy" and "fast model operation speed", match the CNN more closely.

In addition, the comment tag obtained by screening in the step can be used as one of the extracted associated contents related to the preset query object.

Step 240, determining the presentation order of the comment tags based on the number of user clicks on each comment tag.

In step 230, the comment tags associated with the preset query object have been determined. When a user initiates a search request with the preset query object as a search keyword, the comment tag associated with the preset query object can be sent to the terminal device used by the user along with a search result page.

When the comment tags are presented on the search result page, the comment tags with more user clicks are presented at more prominent positions, which signals that these tags have attracted high attention over a period of time. The user can then click a high-attention comment tag to further filter the search results and have the matching results displayed first.

Still taking the preset query object "convolutional neural network (CNN)" as an example, when a user searches with "convolutional neural network" as the search keyword, the comment tags "high accuracy" and "fast model operation speed" associated with the convolutional neural network can be sent to the terminal device used by the user together with the search result page. In addition, because it has received more user clicks, the comment tag "fast model operation speed" is displayed ahead of other comment tags (for example, "high accuracy") on the search result page. If the user then clicks "fast model operation speed", the search results for "convolutional neural network" can be further filtered so that only the results associated with the tag "fast model operation speed" remain.
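The ordering in step 240 amounts to a sort on click counts. The following is a minimal sketch under stated assumptions: the function name `order_tags_by_clicks` and the click numbers are hypothetical, and how the click counts are collected is outside the scope of the example.

```python
def order_tags_by_clicks(tags, click_counts):
    """Return tags sorted by descending click count; unseen tags count as 0."""
    # sorted() is stable, so tags with equal counts keep their input order.
    return sorted(tags, key=lambda tag: click_counts.get(tag, 0), reverse=True)


# Hypothetical click counts for the two screened tags from the example.
tags = ["high accuracy", "fast model operation speed"]
click_counts = {"high accuracy": 120, "fast model operation speed": 340}
ordered = order_tags_by_clicks(tags, click_counts)
```

With these assumed counts, "fast model operation speed" is placed first, matching the presentation described in the example.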

According to the method for extracting the associated content based on the data mining, the to-be-processed data including the preset query object is obtained, the candidate comment tags associated with the preset query object are determined from the to-be-processed data, the comment tags are screened out from the candidate comment tags, and finally the presentation sequence of the comment tags is determined based on the click amount of the user on the comment tags, so that the comment tags of the preset query object are intelligently extracted and presented according to the priority.

In some optional implementations of this embodiment, determining, in step 220, the candidate comment tags associated with the preset query object in the data to be processed may further include: extracting the candidate comment tags associated with the preset query object from the data to be processed based on a natural language processing method.

Natural language processing (NLP) is a field that studies how computers process human language. Its branches include syntactic and semantic analysis, information extraction, text mining, machine translation, information retrieval, and the like. Natural language processing methods are widely studied and are not described in detail herein.

In some optional implementations, the screening out of comment tags from the candidate comment tags in step 230 may further include: removing, based on a preset matching rule, candidate comment tags that do not match the preset query object from the candidate comment tags, so as to screen out the comment tags.

In some application scenarios, for example, the preset matching rule includes: comment tags used to evaluate men do not apply to women. Then, assuming that the candidate comment tags obtained in step 220 include "good performing", "beautiful" and "general", and the preset query object is a woman, the candidate comment tag "general" will be removed.
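A preset matching rule of this kind can be sketched as a lookup table. The table `TAG_RULES`, the category labels "man" and "woman", and the function name `screen_tags` are all hypothetical; the sketch only assumes that each restricted tag lists the categories of query object it may describe, and that tags without a rule are always kept.

```python
# Hypothetical rule table: tag -> categories of query object it may describe.
TAG_RULES = {"general": {"man"}}


def screen_tags(candidates, object_category, rules=TAG_RULES):
    """Drop candidate tags whose rule excludes the query object's category."""
    kept = []
    for tag in candidates:
        allowed = rules.get(tag)
        # Tags without a rule are unrestricted and always kept.
        if allowed is None or object_category in allowed:
            kept.append(tag)
    return kept


candidates = ["good performing", "beautiful", "general"]
screened = screen_tags(candidates, "woman")
```

Keeping the rules in data rather than in code means a new restriction can be added without touching the screening function.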

Referring to fig. 3, a schematic flow chart diagram 300 of another embodiment of the data mining-based associated content extraction method of the present application is shown.

The method of the embodiment comprises the following steps:

Step 310, obtaining data to be processed, where the data to be processed includes a preset query object.

Step 320, determining candidate comment tags associated with the preset query object in the data to be processed.

Step 330, screening out comment tags from the candidate comment tags.

Step 340, determining the presentation order of the comment tags based on the number of user clicks on each comment tag.

The steps 310 to 340 are similar to the steps 210 to 240 of the embodiment shown in fig. 2, and are not described herein again.

Unlike the embodiment shown in fig. 2, this embodiment further includes:

Step 350, obtaining comment data containing the preset query object from a preset hotspot data source.

Here, the preset hotspot data source may be, for example, recent hot search data of a preset social platform. Assume that the preset query object is a movie recently shown. The movie name of the movie may be a preset query object. If the movie name appears in the hot search data of a certain social platform, comment data containing the movie name in the hot search data of the social platform can be acquired.

Step 360, determining the weight of each piece of comment data.

By determining the weight of the comment data, the degree of association between the comment data and the preset query object, and/or the quality of the comment data itself can be determined.

Step 370, determining the display order of the comment data based on the weight of the comment data.

By determining the display order of the comment data based on the weight of the comment data, comment data that is more closely associated with the preset query object and/or of higher quality can be preferentially displayed to the user.

In some optional implementation manners, the obtaining of the comment data containing the preset query object from the preset hotspot data source in step 350 in this embodiment may further include:

Step 351, obtaining candidate comment data containing the preset query object from the preset hotspot data source.

Step 352, determining comment data from the candidate comment data based on the page view count of each piece of candidate comment data.

Therefore, comment data which are associated with the preset query object and have higher user attention can be further screened from the hotspot data source.
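Steps 351 and 352 can be sketched as two successive filters. The text only says that the selection is based on page views; using a fixed minimum threshold is an assumption of this sketch, as are the `Comment` record, the function name `select_comments`, and the sample data.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    page_views: int


def select_comments(candidates, query_object, min_page_views):
    """Step 351: keep comments mentioning the query object.
    Step 352: keep those whose page views reach the assumed threshold."""
    mentioning = [c for c in candidates if query_object in c.text]
    return [c for c in mentioning if c.page_views >= min_page_views]


# Hypothetical hot-search comments about a movie called "MovieX".
pool = [
    Comment("MovieX has a great soundtrack", 5000),
    Comment("MovieX was boring", 40),
    Comment("Weather is nice today", 90000),
]
selected = select_comments(pool, "MovieX", min_page_views=1000)
```

A top-N cut by page views would be an equally valid reading of the step; the threshold is used here only because it is the shortest to show.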

In some optional implementations, the determining the weight of each comment data in step 360 of this embodiment may include, for example, at least one of the following:

Step 361, determining the weight of the comment data based on whether the comment data contains hot words whose number of co-occurrences with the preset query object exceeds a preset number. By checking for such hot words, the core hot spots (namely, the hot words) that users care about can be extracted from the comment data, and the weight of comment data containing them is increased so that it is preferentially displayed.

Step 362, determining a quality score of the comment data based on a machine learning algorithm, and determining the weight of the comment data based on the quality score. In some application scenarios, although comment data in the hotspot data source has a certain degree of association with the preset query object, it shows an obvious "twitching tendency", and such comment data can be considered to have a lower quality score. In these application scenarios, the comment data may be input into a machine learning model (e.g., a neural network model) trained in advance to obtain a quality score for the comment data, and comment data with a higher quality score may be given a higher weight so that it is preferentially displayed.

On the other hand, in some application scenarios, comment data in the hotspot data source may contain spam that should not be presented to the user, such as comments that use hot topics to advertise, comments with inappropriate language, and so on. In these application scenarios, such comment data can be filtered out using a machine learning method. For example, the comment data containing spam content may be filtered out using the same machine learning model that determines the quality score, or a separate machine learning model may be used.

Step 363, determining the weight of the comment data based on the number of user clicks on the comment data. In this way, comment data that users are more interested in (comment data with more clicks) can be given a higher weight so that it is preferentially displayed.

It is to be understood that, if at least two of the steps 361 to 363 are adopted to determine the weight of the comment data, the weights determined by at least two of the steps may be weighted and added, so as to determine the final weight of the comment data.
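The hot-word check of step 361 and the weighted addition described above can be sketched as follows. The whitespace tokenization, the 0/1 hot-word weight, the coefficient values, and the function names are all assumptions; the quality score and click weight are passed in as plain numbers because the model of step 362 is not specified here.

```python
from collections import Counter


def find_hot_words(comments, query_object, preset_number):
    """Words whose co-occurrence count with the query object exceeds the preset number."""
    counts = Counter()
    query_words = set(query_object.split())
    for text in comments:
        if query_object in text:
            # Count each distinct word at most once per comment.
            counts.update(set(text.split()) - query_words)
    return {word for word, n in counts.items() if n > preset_number}


def hot_word_weight(text, hot_words):
    """Assumed 0/1 weight: 1.0 if the comment contains any hot word."""
    return 1.0 if any(word in text.split() for word in hot_words) else 0.0


def combine_weights(partial_weights, coefficients):
    """Weighted sum of the partial weights from steps 361 to 363."""
    return sum(coefficients[name] * w for name, w in partial_weights.items())


comments = ["MovieX great soundtrack", "MovieX soundtrack rocks",
            "MovieX boring", "other soundtrack"]
hot_words = find_hot_words(comments, "MovieX", preset_number=1)

# Hypothetical partial weights and coefficients for one comment.
partial = {"hot_word": hot_word_weight(comments[0], hot_words),
           "quality": 0.8, "clicks": 0.5}
coefficients = {"hot_word": 0.5, "quality": 0.3, "clicks": 0.2}
final_weight = combine_weights(partial, coefficients)
```

Sorting the comment data by `final_weight` in descending order then gives the display order of step 370.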

In some optional implementation manners, the method for extracting associated content based on data mining according to this embodiment may further include:

Step 380, determining the emotional tendency of each piece of comment data based on a natural language processing tool, and determining the favorable rating of the preset query object based on the emotional tendency of each piece of comment data.

For example, in some application scenarios, an emotion score may be determined for each piece of comment data about the preset query object based on a natural language processing tool (for example, a positive tendency is assigned 1, a negative tendency is assigned 0, and a neutral tendency is assigned 0.5), and the favorable rating of the preset query object is then computed from these scores by a suitable calculation.
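One possible calculation is the arithmetic mean of the emotion scores. The text does not fix the operation, so the mean is an assumption of this sketch, as are the function name `favorable_rating` and the label strings; the score values 1, 0.5 and 0 are the ones given in the example.

```python
# Score assignment from the example: positive 1, neutral 0.5, negative 0.
SCORES = {"positive": 1.0, "neutral": 0.5, "negative": 0.0}


def favorable_rating(tendencies):
    """Mean emotion score over all comments, or None if there are no comments."""
    if not tendencies:
        return None
    return sum(SCORES[t] for t in tendencies) / len(tendencies)


# Hypothetical tendencies produced by a sentiment tool for four comments.
rating = favorable_rating(["positive", "positive", "neutral", "negative"])
```

Here the rating is (1 + 1 + 0.5 + 0) / 4 = 0.625. Counting only the share of positive comments would be another reasonable choice.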

Step 390, generating a favorable rating curve of the preset query object based on the favorable rating of the preset query object in each preset time period.

It is understood that, since the amount of comment data gradually changes (e.g., increases) over time, the favorable rating of the preset query object will change accordingly. Generating the favorable rating curve from the favorable rating of the preset query object in each preset time period makes it possible to show visually how the favorable rating develops over a period of time.
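The data points of such a curve can be sketched by grouping scored comments into time periods. Using calendar months as the preset time period and the mean score as the favorable rating are both assumptions, and the function name `rating_curve` and the sample dates are hypothetical.

```python
from collections import defaultdict
from datetime import date


def rating_curve(scored_comments):
    """Map (date, emotion score) pairs to a chronological list of
    (period, favorable rating) points, one point per calendar month."""
    buckets = defaultdict(list)
    for day, score in scored_comments:
        buckets[day.strftime("%Y-%m")].append(score)
    # "YYYY-MM" keys sort chronologically as plain strings.
    return [(period, sum(scores) / len(scores))
            for period, scores in sorted(buckets.items())]


# Hypothetical scored comments spanning two months.
data = [
    (date(2017, 9, 3), 1.0),
    (date(2017, 9, 20), 0.0),
    (date(2017, 10, 1), 1.0),
    (date(2017, 10, 5), 0.5),
]
curve = rating_curve(data)
```

The resulting list of points can be handed to any plotting tool to draw the curve.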

With further reference to fig. 4, as an implementation of the method shown in the above-mentioned figures, the present application provides an embodiment of an associated content extraction apparatus based on data mining, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.

As shown in fig. 4, the associated content extraction apparatus 400 based on data mining according to the present embodiment includes: a to-be-processed data acquisition unit 410, a determination unit 420, a first filtering unit 430, and a first presentation unit 440.

The to-be-processed data obtaining unit 410 may be configured to obtain to-be-processed data, where the to-be-processed data includes a preset query object.

The determining unit 420 may be configured to determine candidate comment tags associated with a preset query object in the to-be-processed data.

The first filtering unit 430 may be configured to filter the comment tags from the candidate comment tags.

The first presenting unit 440 may be configured to determine a presentation order of the comment tags based on a click amount of the comment tags by the user.

In some optional implementations, the determining unit 420 may be further configured to:

extract the candidate comment tags associated with the preset query object from the data to be processed based on a natural language processing method.

In some optional implementations, the first screening unit may be further configured to:

remove, based on a preset matching rule, candidate comment tags that do not match the preset query object from the candidate comment tags, so as to screen out the comment tags.

In some optional implementations, the data to be processed may include comment data for a preset query object.

In these optional implementations, the data mining-based associated content extraction apparatus may further include: a comment data acquisition unit configured to acquire comment data containing the preset query object from a preset hotspot data source; a weight determination unit configured to determine a weight of each piece of comment data; and a second presentation unit configured to determine the display order of the comment data based on the weights of the comment data.

In some optional implementations, the comment data acquisition unit may be further configured to: acquire candidate comment data containing the preset query object from a preset hotspot data source; and determine the comment data from the candidate comment data based on the page view count of each piece of candidate comment data.
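The two-step selection can be sketched as follows. The threshold `min_page_views`, the optional `top_n` cut-off, and the `CandidateComment` record are assumptions for illustration; the application only states that the selection is based on page views.

```python
from typing import List, NamedTuple, Optional


class CandidateComment(NamedTuple):
    text: str
    page_views: int


def select_comment_data(
    query_object: str,
    candidates: List[CandidateComment],
    min_page_views: int = 1000,   # assumed threshold
    top_n: Optional[int] = None,  # assumed optional cut-off
) -> List[CandidateComment]:
    """Keep candidates that mention the query object and are viewed often enough."""
    matching = [
        c for c in candidates
        if query_object in c.text and c.page_views >= min_page_views
    ]
    matching.sort(key=lambda c: c.page_views, reverse=True)
    return matching if top_n is None else matching[:top_n]


# Hypothetical candidate comment data from a hotspot data source.
pool = [
    CandidateComment("Movie X review: great", 5400),
    CandidateComment("Movie X is fine", 120),
    CandidateComment("Movie Y review", 9000),
    CandidateComment("Why Movie X works", 2300),
]
selected_texts = [c.text for c in select_comment_data("Movie X", pool)]
```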

In some optional implementations, the weight determination unit may be further configured to determine the weight of each piece of comment data based on any one of the following:

determining the weight of the comment data based on whether the comment data contains a hot word whose number of co-occurrences with the preset query object exceeds a preset number;

determining a quality score of the comment data based on a machine learning algorithm, and determining the weight of the comment data based on the quality score; and

determining the weight of the comment data based on the click rate of users on the comment data.
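The three alternatives can be sketched side by side. Everything concrete here is an assumption for illustration: the binary 1.0/0.0 hot-word weight, the word-level tokenization, the `quality_model` callable standing in for a trained model, and the click dictionary are not specified by the application.

```python
import re
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set


def _words(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def find_hot_words(query_object: str, corpus: Iterable[str], preset_count: int) -> Set[str]:
    """Words whose co-occurrence count with the query object exceeds preset_count."""
    counts: Counter = Counter()
    for text in corpus:
        if query_object.lower() in text.lower():
            counts.update(_words(text) - _words(query_object))
    return {w for w, n in counts.items() if n > preset_count}


def weight_by_hot_words(comment: str, hot: Set[str]) -> float:
    # Option 1 (assumed binary weight): does the comment contain a hot word?
    return 1.0 if _words(comment) & hot else 0.0


def weight_by_quality(comment: str, quality_model: Callable[[str], float]) -> float:
    # Option 2: quality_model is a placeholder for any trained scoring model.
    return quality_model(comment)


def weight_by_clicks(comment_id: str, clicks: Dict[str, int]) -> float:
    # Option 3: weight taken directly from user clicks on the comment.
    return float(clicks.get(comment_id, 0))


def order_comments(comments: List[str], weight_of: Callable[[str], float]) -> List[str]:
    """Display order: highest weight first (role of the second presentation unit)."""
    return sorted(comments, key=weight_of, reverse=True)


corpus = [
    "Movie X has a great soundtrack",
    "The soundtrack of Movie X is great",
    "Movie X was boring",
    "Movie Y soundtrack",
]
hot = find_hot_words("Movie X", corpus, preset_count=1)
display_order = order_comments(
    ["Movie X was boring", "Movie X: loved the soundtrack"],
    lambda c: weight_by_hot_words(c, hot),
)
```

Because each option reduces to a function from a comment to a number, the ordering step does not need to know which option produced the weight.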

In some optional implementations, the data mining-based associated content extraction apparatus may further include: a favorable rating determination unit configured to determine the emotional tendency of each piece of comment data based on a natural language processing tool, and to determine the favorable rating of the preset query object based on the emotional tendency of each piece of comment data.
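A minimal sketch of this computation follows. It assumes the favorable rating is the fraction of comments whose emotional tendency is positive, and it uses a tiny word-list classifier (`toy_tendency`) as a placeholder for the natural language processing tool; both are assumptions, not details taken from the application.

```python
from typing import Callable, Iterable, Optional

# Assumed toy lexicons; a real NLP tool would replace toy_tendency entirely.
POSITIVE_WORDS = {"great", "good", "loved"}
NEGATIVE_WORDS = {"boring", "bad", "awful"}


def toy_tendency(comment: str) -> str:
    words = set(comment.lower().split())
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    if score > 0:
        return "positive"
    return "negative" if score < 0 else "neutral"


def favorable_rating(
    comments: Iterable[str],
    tendency_of: Callable[[str], str] = toy_tendency,
) -> Optional[float]:
    """Share of comments classified as positive; None when there are no comments."""
    tendencies = [tendency_of(c) for c in comments]
    if not tendencies:
        return None
    return tendencies.count("positive") / len(tendencies)


rating = favorable_rating(["great film", "loved it", "boring plot", "it exists"])
```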

In some optional implementations, the data mining-based associated content extraction apparatus may further include: a favorable rating curve generation unit configured to generate a favorable rating curve of the preset query object based on the favorable rating of the preset query object in each preset time period.

Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing a server according to embodiments of the present application is shown. The server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. The RAM 503 also stores various programs and data necessary for the operation of the system 500. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is installed into the storage section 508 as necessary.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 501.

It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described, for example, as: a processor comprising a to-be-processed data acquisition unit, a determining unit, a first screening unit, and a first presentation unit. In some cases the names of these units do not limit the units themselves; for example, the to-be-processed data acquisition unit may also be described as "a unit for acquiring data to be processed".

As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire data to be processed, wherein the data to be processed comprises a preset query object; determine candidate comment tags associated with the preset query object in the data to be processed; screen out comment tags from the candidate comment tags; and determine the presentation order of the comment tags based on the click amount of the comment tags by users.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A method for extracting associated content based on data mining is characterized by comprising the following steps:

acquiring data to be processed, wherein the data to be processed comprises a preset query object and comment data of the preset query object;

determining candidate comment tags associated with the preset query object in the data to be processed;

screening out comment tags conforming to the preset query object from the candidate comment tags;

determining the presentation sequence of each comment tag based on the click rate of the user on each comment tag;

obtaining comment data containing the preset query object from a preset hotspot data source, and determining the display sequence of each comment data;

the obtaining of the comment data containing the preset query object from the preset hotspot data source includes: acquiring candidate comment data containing the preset query object from a preset hotspot data source; and determining the comment data from the candidate comment data based on the page browsing amount of each piece of the candidate comment data.

2. The method of claim 1, wherein the determining candidate comment tags associated with the preset query object in the to-be-processed data comprises:

and extracting candidate comment tags associated with the preset query object from the data to be processed based on a natural language processing method.

3. The method of claim 1, wherein the screening the candidate comment tags for comment tags that conform to the preset query object comprises:

based on a preset matching rule, removing candidate comment tags which do not accord with the preset query object from the candidate comment tags to screen out the comment tags which accord with the preset query object.

4. The method according to any one of claims 1 to 3, wherein the determining of the display sequence of each comment data comprises:

determining a weight for each of the comment data; and

determining the display sequence of each comment data based on the weight of each comment data.

5. The method of claim 4, wherein determining the weight of each of the comment data comprises determining the weight of each of the comment data based on any one of:

determining the weight of the comment data based on whether the comment data has the hot words of which the number of co-occurrence times with the preset query object exceeds the preset number of times;

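determining a quality score of the comment data based on a machine learning algorithm and determining a weight of the comment data based on the quality score; and

determining the weight of the comment data based on the click rate of the user on the comment data.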

6. The method of claim 5, further comprising:

and determining the emotional tendency of each comment data based on a natural language processing tool, and determining the favorable rating of the preset query object based on the emotional tendency of each comment data.

7. The method of claim 6, further comprising:

and generating a favorable rating curve of the preset query object based on the favorable rating of the preset query object in each preset time period.

8. An associated content extraction device based on data mining, comprising:

the to-be-processed data acquisition unit is used for acquiring to-be-processed data, wherein the to-be-processed data comprises a preset query object and comment data of the preset query object;

the determining unit is used for determining candidate comment tags which are associated with the preset query object in the data to be processed;

the first screening unit is used for screening the comment tags conforming to the preset query object from the candidate comment tags;

a first presentation unit, configured to determine a presentation order of each of the comment tags based on a click amount of a user on each of the comment tags;

the comment data acquisition unit is used for acquiring comment data containing the preset query object from a preset hotspot data source; the second presentation unit is used for determining the display sequence of each piece of comment data;

wherein the comment data acquiring unit is further configured to: acquiring candidate comment data containing the preset query object from a preset hotspot data source; and determining the comment data from the candidate comment data based on the page browsing amount of each piece of the candidate comment data.

9. The apparatus of claim 8, wherein the determining unit is further configured to:

and extracting candidate comment tags associated with the preset query object from the data to be processed based on a natural language processing device.

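10. The apparatus of claim 8, wherein the first screening unit is further configured to:

based on a preset matching rule, remove candidate comment tags which do not accord with the preset query object from the candidate comment tags to screen out the comment tags which accord with the preset query object.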

11. The apparatus according to any one of claims 8-10, further comprising:

a weight determination unit for determining a weight of each of the comment data; and

the second presentation unit is used for determining the display sequence of each comment data based on the weight of each comment data.

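12. The apparatus of claim 11, wherein the weight determination unit is further configured to determine the weight of each comment data based on any one of:

determining the weight of the comment data based on whether the comment data has the hot words of which the number of co-occurrence times with the preset query object exceeds the preset number of times;

determining a quality score of the comment data based on a machine learning algorithm and determining a weight of the comment data based on the quality score; and

determining the weight of the comment data based on the click rate of the user on the comment data.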

13. The apparatus of claim 12, further comprising:

and the favorable rating determining unit is used for determining the emotional tendency of each piece of comment data based on a natural language processing tool and determining the favorable rating of the preset query object based on the emotional tendency of each piece of comment data.

14. The apparatus of claim 13, further comprising:

and the favorable rating curve generating unit is used for generating a favorable rating curve of the preset query object based on the favorable rating of the preset query object in each preset time period.

15. A server, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

16. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.