WO2019041521A1 - User keyword extraction apparatus and method, and computer-readable storage medium - Google Patents

User keyword extraction apparatus and method, and computer-readable storage medium

Info

Publication number
WO2019041521A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
word
preset
keywords
score
Prior art date
Application number
PCT/CN2017/108797
Other languages
English (en)
French (fr)
Inventor
吴振宇
刘睿恺
王建明
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Priority to AU2017408801A priority Critical patent/AU2017408801B2/en
Priority to US16/084,988 priority patent/US20210097238A1/en
Priority to JP2018538141A priority patent/JP2019533205A/ja
Priority to KR1020187024862A priority patent/KR102170929B1/ko
Priority to EP17904351.8A priority patent/EP3477495A4/en
Publication of WO2019041521A1 publication Critical patent/WO2019041521A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a social network-based user keyword extraction apparatus, method, and computer readable storage medium.
  • Current recommendation methods are mainly friend recommendation based on identical tag information, friend recommendation based on commonly followed accounts, and microblog topic recommendation based on topic popularity. These methods are limited, and it is difficult to make targeted recommendations according to a user's interests. Therefore, how to extract keywords that can effectively represent a user's interests from massive blog post data, and thereby analyze and determine the user's real interests, is an urgent problem to be solved.
  • The present application provides a social network-based user keyword extraction apparatus, method, and computer readable storage medium, the main purpose of which is to solve the prior-art problem that it is difficult to extract keywords that effectively represent a user's interests from the user's blog posts.
  • the present application provides a social network-based user keyword extraction apparatus, the apparatus comprising a memory and a processor, wherein the memory stores a user keyword extraction program executable on the processor, When the user keyword extraction program is executed by the processor, the following steps are implemented:
  • The Pagerank algorithm is run on the semantic similarity graph to score each keyword, and a keyword whose score satisfies a preset condition is used as an interest keyword of the target user.
  • the step of constructing a semantic similarity map according to the candidate keyword set and the word vector corresponding to each keyword in the candidate keyword set includes:
  • the keyword in the candidate keyword set is used as a word node, wherein one keyword corresponds to one word node;
  • the semantic similarity map is composed of all word nodes and established edges.
  • the step of calculating the context similarity between each two word nodes according to the corresponding word vector comprises:
  • a word vector of two word nodes is obtained, and a cosine similarity between the two word vectors is calculated, and the cosine similarity is used as a context similarity between the two word nodes.
  • the step of extracting the keyword corresponding to the blog post from the word list of the blog post by the keyword extraction algorithm includes:
  • the repeated keywords in the keywords extracted by the plurality of keyword extraction algorithms are used as keywords corresponding to the blog posts.
  • the step of using the keyword that meets the preset condition as the interest keyword of the target user includes:
  • a keyword having a score greater than a preset score is used as a keyword of interest of the target user
  • Or, keywords having scores greater than the preset score are used as the interest keywords of the target user, wherein, when the number of keywords whose scores are greater than the preset score exceeds a first preset number, a second preset number of keywords among the first preset number of keywords are used as the interest keywords of the target user, the first preset number being greater than the second preset number.
  • the present application further provides a social network-based user keyword extraction method, including:
  • the Pagerank algorithm is run on the semantic similarity graph to score each keyword, and the keyword whose score meets the preset condition is used as the interest keyword of the target user.
  • the step of constructing a semantic similarity map according to the candidate keyword set and the word vector corresponding to each keyword in the candidate keyword set includes:
  • the keyword in the candidate keyword set is used as a word node, wherein one keyword corresponds to one word node;
  • the semantic similarity map is composed of all word nodes and established edges.
  • the step of calculating the context similarity between each two word nodes according to the corresponding word vector comprises:
  • a word vector of two word nodes is obtained, and a cosine similarity between the two word vectors is calculated, and the cosine similarity is used as a context similarity between the two word nodes.
  • the step of extracting the keyword corresponding to the blog post from the word list of the blog post by the keyword extraction algorithm includes:
  • the repeated keywords in the keywords extracted by the plurality of keyword extraction algorithms are used as keywords corresponding to the blog posts.
  • the present application further provides a computer readable storage medium having a user keyword extraction program stored thereon, the user keyword extraction program being executable by at least one processor, To achieve the following steps:
  • the Pagerank algorithm is run on the semantic similarity graph to score each keyword, and the keyword whose score meets the preset condition is used as the interest keyword of the target user.
  • The social network-based user keyword extraction apparatus, method, and computer readable storage medium proposed by the present application perform word segmentation on each blog post published by the target user within a preset time interval to obtain a word list for each blog post; the word lists are input into the Word2Vec model for training to obtain a word vector model; corresponding keywords are extracted from each blog post's word list by a keyword extraction algorithm to form a candidate keyword set, and the word vector of each keyword in the set is calculated with the word vector model; a semantic similarity graph is constructed from the keywords and their word vectors; and the Pagerank algorithm is run on the graph to score the keywords, the keywords whose scores satisfy a preset condition being taken as the target user's interest keywords.
  • In this way, the present application synthesizes the blog posts a user has published and extracts keywords that can effectively represent the user's interests.
  • FIG. 1 is a schematic diagram of a preferred embodiment of a user keyword extraction apparatus based on a social network according to the present application
  • FIG. 2 is a schematic diagram of a program module of a user keyword extraction program in an embodiment of a social network-based user keyword extraction apparatus according to the present application;
  • FIG. 3 is a flowchart of a preferred embodiment of a method for extracting user keywords based on a social network.
  • the application provides a social network based user keyword extraction device.
  • Referring to FIG. 1, it is a schematic diagram of a preferred embodiment of the social network-based user keyword extraction apparatus of the present application.
  • the social network-based user keyword extraction device may be a PC (Personal Computer), or may be a terminal device such as a smart phone, a tablet computer, an e-book reader, or a portable computer.
  • the social network based user keyword extraction device includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (for example, an SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 11 may in some embodiments be an internal storage unit of a social network based user keyword extraction device, such as a hard disk of the social network based user keyword extraction device.
  • The memory 11 may also, in other embodiments, be an external storage device of the social network-based user keyword extraction device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the social network-based user keyword extraction device.
  • the memory 11 may also include both an internal storage unit of the social network based user keyword extraction device and an external storage device.
  • The memory 11 can be used not only for storing application software installed in the social network-based user keyword extraction device and various types of data, such as the code of the user keyword extraction program, but also for temporarily storing data that has been output or is to be output.
  • The processor 12 may in some embodiments be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run program code stored in the memory 11 or to process data, for example to execute the user keyword extraction program.
  • Communication bus 13 is used to implement connection communication between these components.
  • the network interface 14 can optionally include a standard wired interface, a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device and other electronic devices.
  • FIG. 1 shows only a social network-based user keyword extraction device with components 11-14 and a user keyword extraction program; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
  • Optionally, the device may further include a user interface, which may include a display and an input unit such as a keyboard; the optional user interface may further include a standard wired interface and a wireless interface.
  • Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and is used to display information processed in the social network-based user keyword extraction device and to display a visualized user interface.
  • a user keyword extraction program is stored in the memory 11; when the processor 12 executes the user keyword extraction program stored in the memory 11, the following steps are implemented:
  • A. Obtain a blog post published by the target user in a preset time interval, perform a word segmentation process on the obtained blog post using a preset word segmentation tool, and respectively obtain a word list corresponding to each blog post;
  • the scheme of the present application is explained by taking Weibo as an example.
  • the blog post that the user has published is obtained for word segmentation processing.
  • the published blog posts are filtered in the time dimension to set a preset time interval. Only the published blog posts of this time period are analyzed, for example, only the blog posts published in the past year are analyzed.
  • Of course, in other embodiments, when the number of blog posts published by the user in the preset time interval is small, all the blog posts that the user has ever published may also be analyzed.
  • the word segmentation tool is used to perform word segmentation processing on each of the obtained blog posts one by one, for example, a word segmentation tool such as Stanford Chinese word segmentation tool and jieba word segmentation is used for word segmentation processing.
  • For example, segmenting the blog post "昨天晚上去看了电影" ("Went to see a movie last night") yields "昨天|晚上|去|看|了|电影". The segmentation result is retained and, to further improve keyword quality, only the verbs and/or nouns are kept, while adverbs, adjectives, and other words that cannot express the user's interest are removed; in the above example only the word "电影" ("movie") may be kept.
  • If the result of the word segmentation is empty, the corresponding blog post is filtered out; for each blog post whose segmentation result is not empty, a corresponding word list is obtained (a sketch of this segmentation-and-filtering step follows below).
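  • For illustration only, the following is a minimal sketch, assuming Python and the jieba toolkit mentioned above, of how the segmentation-and-filtering step could look; the function name segment_post and the list of part-of-speech tags to keep are illustrative assumptions, not part of the patent.

```python
# A minimal sketch (assumption: Python + jieba) of the segmentation step:
# segment each blog post, keep only noun/verb tokens, drop empty results.
import jieba.posseg as pseg

KEEP_FLAGS = ("n", "v")  # part-of-speech prefixes to keep: nouns and verbs

def segment_post(post_text):
    """Segment one blog post and keep only nouns/verbs as its word list."""
    return [word for word, flag in pseg.cut(post_text)
            if flag.startswith(KEEP_FLAGS)]

posts = ["昨天晚上去看了电影"]                 # posts from the preset time interval
word_lists = [segment_post(p) for p in posts]
word_lists = [wl for wl in word_lists if wl]   # filter out posts with empty results
print(word_lists)                              # e.g. [['去', '看', '电影']]
```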
  • The word lists corresponding to all blog posts in the above time interval are input into the Word2Vec model for training to obtain a word vector model, which is used to convert a keyword into a word vector.
  • the Word2Vec model is a tool for word vector calculation. There are mature calculation methods for training the model and using it to calculate the word vector of a word, and will not be described here.
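  • For reference only, below is a minimal sketch of the training step, assuming Python and the gensim implementation of Word2Vec; the hyper-parameter values (vector_size, window, min_count, workers) are illustrative, not taken from the patent.

```python
# Sketch: train a word vector model on the per-post word lists obtained above.
# gensim's Word2Vec is one common implementation; parameter values are illustrative.
from gensim.models import Word2Vec

w2v = Word2Vec(sentences=word_lists, vector_size=100, window=5,
               min_count=1, workers=4)

vec = w2v.wv["电影"]     # word vector of a keyword (if present in the vocabulary)
print(vec.shape)         # (100,)
```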
  • Next, keyword extraction is performed on each blog post using a keyword extraction algorithm: for example, any one of the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, the LSA (Latent Semantic Analysis) algorithm, or the PLSA (Probabilistic Latent Semantic Analysis) algorithm is applied to the word list of each blog post, the one or more highest-scoring words are taken as the keywords corresponding to that blog post, and the above word vector model is used to convert each keyword into a corresponding word vector.
  • Alternatively, as one implementation, a plurality of keyword extraction algorithms are combined. Specifically, the step of extracting the keywords corresponding to a blog post from its word list based on the keyword extraction algorithm includes: extracting keywords from the word list of the blog post according to each of a plurality of preset keyword extraction algorithms respectively; and using the keywords that are repeated among the keywords extracted by the plurality of algorithms as the keywords corresponding to the blog post.
  • For example, keyword extraction is performed once with each of the TF-IDF, LSA, and PLSA algorithms, and the keywords in the overlapping portion of their results are taken as the keywords corresponding to the blog post.
  • Because blog posts are generally short, the keywords extracted by the above algorithms tend to be noisy and overly broad and cannot accurately reflect the user's interests on their own; therefore, in this embodiment, the keywords extracted from a large number of blog posts are treated only as candidate keywords to build a candidate keyword set, which is then processed by the subsequent steps to obtain keywords that truly reflect the user's interests (a sketch of the candidate-extraction step follows below).
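  • As an illustration of the candidate-extraction step, the sketch below scores each post's words with TF-IDF (via scikit-learn) and keeps the top-scoring words of every post as candidate keywords; an LSA or PLSA extractor could be run alongside and the overlapping keywords retained, as described above. TOP_K and the vectorizer settings are assumptions, not values from the patent.

```python
# Sketch: per-post TF-IDF scoring; the top-k words of each post become candidates.
# Other extractors (LSA, PLSA, ...) could be run in parallel and their results
# intersected with these, as described in the embodiment above.
from sklearn.feature_extraction.text import TfidfVectorizer

TOP_K = 2                                         # illustrative number of keywords per post
docs = [" ".join(wl) for wl in word_lists]        # one whitespace-joined document per post
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")
tfidf = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.get_feature_names_out()

candidate_keywords = set()
for row in tfidf:
    best = row.argsort()[::-1][:TOP_K]            # indices of the highest-scoring words
    candidate_keywords.update(vocab[i] for i in best if row[i] > 0)
print(candidate_keywords)
```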
  • the keywords corresponding to each blog post published by the target user in the preset time interval constitute a candidate keyword set of the target user, and the word vector of each keyword in the set is calculated using the above word vector model.
  • a semantic similarity graph is constructed according to the above candidate keyword set and the word vector.
  • The step of constructing the semantic similarity graph may include the following refinement steps: using each keyword in the candidate keyword set as a word node, wherein one keyword corresponds to one word node; traversing all word nodes and calculating the context similarity between every two word nodes according to their word vectors; whenever the context similarity between two word nodes is greater than a preset threshold, establishing an edge between those two word nodes; and forming the semantic similarity graph from all word nodes and the established edges.
  • When calculating the context similarity, the word vectors of the two word nodes are obtained, the cosine similarity between the two word vectors is calculated, and the cosine similarity is used as the context similarity between the two word nodes (see the sketch below).
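  • Purely as a sketch, assuming Python with numpy and networkx, the graph-construction step could look as follows; SIM_THRESHOLD stands in for the preset threshold and its value is illustrative. An undirected graph is shown here; a directed variant would add the edge from the earlier-appearing word to the later-appearing one, as discussed below.

```python
# Sketch: one node per candidate keyword; an edge whenever the cosine similarity
# of two word vectors exceeds the preset threshold (undirected variant shown).
import numpy as np
import networkx as nx

SIM_THRESHOLD = 0.5                                # illustrative preset threshold

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

nodes = [kw for kw in candidate_keywords if kw in w2v.wv]
graph = nx.Graph()
graph.add_nodes_from(nodes)
for i, a in enumerate(nodes):
    for b in nodes[i + 1:]:
        if cosine(w2v.wv[a], w2v.wv[b]) > SIM_THRESHOLD:
            graph.add_edge(a, b)
```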
  • The edges established between word nodes may be directed or undirected; the direction of a directed edge may point from the earlier-appearing word node to the later-appearing word node. The two choices have different advantages.
  • With directed edges, running the Pagerank algorithm requires iterative calculation, so the amount of computation is slightly larger, but the denoising effect is good. For example, suppose the keywords obtained for a user are "Cristiano Ronaldo", "Real Madrid", "La Liga", "football", and "lottery": no matter which of the first four points to which in the semantic similarity graph, they reinforce one another in the Pagerank scores, whereas a word such as "snacks", even if it has directed edges to other words, gains no reinforcement in the iteration; the word "lottery" therefore receives a low score and can be excluded.
  • With undirected edges, the Pagerank computation is faster and no iterative calculation is needed, but the denoising effect is not as good; in the above example, the word "lottery" might not be excluded.
  • In other embodiments, other methods may also be used to calculate the semantic similarity between two words, for example, methods that compute semantic similarity based on a large-scale corpus; such methods are relatively mature ways of calculating word-to-word similarity, and their specific principles are not described here.
  • the step of using a keyword that satisfies a preset condition as a keyword of interest of the target user may include:
  • a keyword having a score greater than a preset score is used as a keyword of interest of the target user
  • Or, keywords having scores greater than the preset score are used as the interest keywords of the target user, wherein, when the number of keywords whose scores are greater than the preset score exceeds a first preset number, a second preset number of keywords among the first preset number of keywords are used as the interest keywords of the target user, the first preset number being greater than the second preset number.
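  • Finally, a minimal sketch of the scoring-and-selection step, assuming networkx's pagerank over the graph built above; PRESET_SCORE, FIRST_N, and SECOND_N are illustrative stand-ins for the preset score and the first and second preset numbers.

```python
# Sketch: score every word node with Pagerank, then keep the keywords whose
# scores exceed the preset score, capping the result at SECOND_N keywords when
# more than FIRST_N keywords pass the threshold.
PRESET_SCORE = 0.05
FIRST_N, SECOND_N = 20, 10

scores = nx.pagerank(graph)                        # {keyword: Pagerank score}
passing = sorted((kw for kw, s in scores.items() if s > PRESET_SCORE),
                 key=scores.get, reverse=True)

interest_keywords = passing[:SECOND_N] if len(passing) > FIRST_N else passing
print(interest_keywords)
```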
  • the preset threshold, the preset number of words, the first preset number, and the second preset number which are involved in the foregoing embodiments, may be preset according to actual conditions.
  • The social network-based user keyword extraction apparatus proposed in the above embodiment performs word segmentation on each blog post published by the target user within the preset time interval to obtain a word list for each blog post, inputs the word lists into the Word2Vec model for training to obtain a word vector model, extracts the corresponding keywords from each blog post's word list with the keyword extraction algorithm to form a candidate keyword set, calculates the word vector of each keyword in the set with the word vector model, constructs a semantic similarity graph from the keywords and word vectors in the set, and runs the Pagerank algorithm on the graph to score the keywords, taking the keywords whose scores satisfy the preset condition as the user's interest keywords. In this way, the present application synthesizes the blog posts the user has published and, by segmenting them, extracts keywords that can effectively represent the user's interests.
  • Optionally, in other embodiments, the user keyword extraction program may also be divided into one or more modules, the one or more modules being stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application. A module referred to herein is a series of computer program instruction segments capable of performing a particular function.
  • For example, referring to FIG. 2, it is a schematic diagram of the program modules of the user keyword extraction program in an embodiment of the social network-based user keyword extraction apparatus of the present application.
  • In this embodiment, the user keyword extraction program may be divided into an acquisition module 10, a training module 20, an extraction module 30, a mapping module 40, and a scoring module 50, by way of example:
  • the obtaining module 10 is configured to obtain a blog post that has been published by the target user in a preset time interval, and perform a word segmentation process on the obtained blog post by using a preset word segmentation tool, and respectively obtain a word list corresponding to each blog post;
  • the training module 20 is configured to input the obtained word list corresponding to each blog post into the Word2Vec model for training to obtain a word vector model;
  • The extraction module 30 is configured to extract the keywords corresponding to each blog post from its word list based on the keyword extraction algorithm, accumulate the keywords of the blog posts published by the target user in the preset time interval into a candidate keyword set of the target user, and calculate the word vector of each keyword in the candidate keyword set based on the word vector model;
  • the mapping module 40 is configured to construct a semantic similarity map according to the candidate keyword set and the word vector corresponding to each keyword in the candidate keyword set;
  • the scoring module 50 is configured to run the Pagerank algorithm on the semantic similarity graph to score each keyword, and use a keyword whose score meets the preset condition as the interest keyword of the target user.
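  • To make the module split concrete, here is a small, hypothetical Python skeleton that wires five callables together in the order FIG. 2 describes; the class and attribute names mirror the module names but are otherwise assumptions, not part of the patent.

```python
# Hypothetical skeleton of the five program modules described above, wired into
# one pipeline; each attribute is a callable supplied by the caller.
class UserKeywordExtractionProgram:
    def __init__(self, acquire, train, extract, build_graph, score):
        self.acquire = acquire            # acquisition module 10
        self.train = train                # training module 20
        self.extract = extract            # extraction module 30
        self.build_graph = build_graph    # mapping (graph-building) module 40
        self.score = score                # scoring module 50

    def run(self, target_user, time_interval):
        word_lists = self.acquire(target_user, time_interval)
        wv_model = self.train(word_lists)
        candidates, vectors = self.extract(word_lists, wv_model)
        graph = self.build_graph(candidates, vectors)
        return self.score(graph)          # interest keywords of the target user
```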
  • the present application also provides a social network based user keyword extraction method.
  • Referring to FIG. 3, it is a flowchart of a preferred embodiment of the social network-based user keyword extraction method of the present application. The method may be performed by an apparatus, and the apparatus may be implemented by software and/or hardware.
  • the social network-based user keyword extraction method includes:
  • Step S10 Obtain a blog post published by the target user in a preset time interval, perform a word segmentation process on the obtained blog post by using a preset word segmentation tool, and obtain a word list corresponding to each blog post respectively;
  • Step S20 inputting the obtained word list corresponding to each blog post into the Word2Vec model for training, to obtain a word vector model
  • Step S30: extracting, based on the keyword extraction algorithm, the keywords corresponding to each blog post from its word list, accumulating the keywords of the blog posts published by the target user in the preset time interval into a candidate keyword set of the target user, and calculating the word vector of each keyword in the candidate keyword set based on the word vector model.
  • the scheme is explained by taking Weibo as an example.
  • the blog post that the user has published is obtained for word segmentation processing. It can be understood that since the user's hobbies may change over time, in order to improve the accuracy of keyword extraction, the published blog posts are filtered in the time dimension to set a preset time interval.
  • the word segmentation tool is used to perform word segmentation processing on each of the obtained blog posts one by one, for example, a word segmentation tool such as Stanford Chinese word segmentation tool and jieba word segmentation is used for word segmentation processing.
  • For example, segmenting the blog post "昨天晚上去看了电影" ("Went to see a movie last night") yields "昨天|晚上|去|看|了|电影". The segmentation result is retained and, to further improve keyword quality, only the verbs and/or nouns are kept, while adverbs, adjectives, and other words that cannot express the user's interest are removed; in the above example only the word "电影" ("movie") may be kept.
  • If the result of the word segmentation is empty, the corresponding blog post is filtered out; for each blog post whose segmentation result is not empty, a corresponding word list is obtained. The word lists corresponding to all blog posts in the above time interval are input into the Word2Vec model for training to obtain a word vector model, which is used to convert a keyword into a word vector.
  • the Word2Vec model is a tool for word vector calculation. There are mature calculation methods for training the model and using it to calculate the word vector of a word, and will not be described here.
  • Next, keyword extraction is performed on each blog post using a keyword extraction algorithm: for example, any one of the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, the LSA (Latent Semantic Analysis) algorithm, or the PLSA (Probabilistic Latent Semantic Analysis) algorithm is applied to the word list of each blog post, the one or more highest-scoring words are taken as the keywords corresponding to that blog post, and the above word vector model is used to convert each keyword into a corresponding word vector.
  • Alternatively, as one implementation, a plurality of keyword extraction algorithms are combined. Specifically, the step of extracting the keywords corresponding to a blog post from its word list based on the keyword extraction algorithm includes: extracting keywords from the word list of the blog post according to each of a plurality of preset keyword extraction algorithms respectively; and using the keywords that are repeated among the keywords extracted by the plurality of algorithms as the keywords corresponding to the blog post.
  • For example, keyword extraction is performed once with each of the TF-IDF, LSA, and PLSA algorithms, and the keywords in the overlapping portion of their results are taken as the keywords corresponding to the blog post.
  • Because blog posts are generally short, the keywords extracted by the above algorithms tend to be noisy and overly broad and cannot accurately reflect the user's interests on their own; therefore, in this embodiment, the keywords extracted from a large number of blog posts are treated only as candidate keywords to build a candidate keyword set, which is then processed by the subsequent steps to obtain keywords that truly reflect the user's interests.
  • Step S40 Construct a semantic similarity map according to the candidate keyword set and the word vector corresponding to each keyword in the candidate keyword set.
  • the keywords corresponding to each blog post published by the target user in the preset time interval constitute a candidate keyword set of the target user, and the word vector of each keyword in the set is calculated using the above word vector model.
  • a semantic similarity graph is constructed according to the above candidate keyword set and the word vector.
  • The step of constructing the semantic similarity graph may include the following refinement steps: using each keyword in the candidate keyword set as a word node, wherein one keyword corresponds to one word node; traversing all word nodes and calculating the context similarity between every two word nodes according to their word vectors; whenever the context similarity between two word nodes is greater than a preset threshold, establishing an edge between those two word nodes; and forming the semantic similarity graph from all word nodes and the established edges.
  • When calculating the context similarity, the word vectors of the two word nodes are obtained, the cosine similarity between the two word vectors is calculated, and the cosine similarity is used as the context similarity between the two word nodes.
  • The edges established between word nodes may be directed or undirected; the direction of a directed edge may point from the earlier-appearing word node to the later-appearing word node. The two choices have different advantages.
  • With directed edges, running the Pagerank algorithm requires iterative calculation, so the amount of computation is slightly larger, but the denoising effect is good. For example, suppose the keywords obtained for a user are "Cristiano Ronaldo", "Real Madrid", "La Liga", "football", and "lottery": no matter which of the first four points to which in the semantic similarity graph, they reinforce one another in the Pagerank scores, whereas a word such as "snacks", even if it has directed edges to other words, gains no reinforcement in the iteration; the word "lottery" therefore receives a low score and can be excluded.
  • With undirected edges, the Pagerank computation is faster and no iterative calculation is needed, but the denoising effect is not as good; in the above example, the word "lottery" might not be excluded.
  • In other embodiments, other methods may also be used to calculate the semantic similarity between two words, for example, methods that compute semantic similarity based on a large-scale corpus; such methods are relatively mature ways of calculating word-to-word similarity, and their specific principles are not described here.
  • Step S50: the Pagerank algorithm is run on the semantic similarity graph to score each keyword, and the keywords whose scores meet the preset condition are used as the interest keywords of the target user.
  • the step of using a keyword that satisfies a preset condition as a keyword of interest of the target user may include:
  • a keyword having a score greater than a preset score is used as a keyword of interest of the target user
  • Or, keywords having scores greater than the preset score are used as the interest keywords of the target user, wherein, when the number of keywords whose scores are greater than the preset score exceeds a first preset number, a second preset number of keywords among the first preset number of keywords are used as the interest keywords of the target user, the first preset number being greater than the second preset number.
  • the preset threshold, the preset number of words, the first preset number, and the second preset number which are involved in the foregoing embodiments, may be preset according to actual conditions.
  • The social network-based user keyword extraction method proposed in the above embodiment performs word segmentation on each blog post published by the target user within the preset time interval to obtain a word list for each blog post, inputs the word lists into the Word2Vec model for training to obtain a word vector model, extracts the corresponding keywords from each blog post's word list with the keyword extraction algorithm to form a candidate keyword set, calculates the word vector of each keyword in the set with the word vector model, constructs a semantic similarity graph from the keywords and word vectors in the set, and runs the Pagerank algorithm on the graph to score the keywords, taking the keywords whose scores satisfy the preset condition as the user's interest keywords. In this way, the present application synthesizes the blog posts the user has published and, by segmenting them, extracts keywords that can effectively represent the user's interests.
  • the embodiment of the present application further provides a computer readable storage medium, where the user keyword extraction program is stored, and the user keyword extraction program may be executed by one or more processors, Implement the following operations:
  • the Pagerank algorithm is run on the semantic similarity graph to score each keyword, and the keyword whose score meets the preset condition is used as the interest keyword of the target user.
  • the keyword in the candidate keyword set is used as a word node, wherein one keyword corresponds to one word node;
  • the semantic similarity map is composed of all word nodes and established edges.
  • a word vector of two word nodes is obtained, and a cosine similarity between the two word vectors is calculated, and the cosine similarity is used as a context similarity between the two word nodes.
  • the repeated keywords in the keywords extracted by the plurality of keyword extraction algorithms are used as keywords corresponding to the blog posts.
  • Based on such understanding, the technical solution of the present application, in essence or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disk), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application.

Abstract

The present application discloses a social network-based user keyword extraction method, including: acquiring the blog posts published by a target user within a preset time interval, performing word segmentation on them, and obtaining a word list for each blog post; inputting the word list of each acquired blog post into a Word2Vec model for training to obtain a word vector model; extracting, based on a keyword extraction algorithm, the keywords corresponding to the blog posts to form a candidate keyword set of the target user, calculating the word vector of each keyword in the candidate keyword set based on the word vector model, and constructing a semantic similarity graph; and running the Pagerank algorithm on the semantic similarity graph to score the keywords so as to obtain the user's interest keywords. The present application further provides a social network-based user keyword extraction apparatus and a computer readable storage medium. The present application solves the technical problem in the prior art that it is difficult to extract keywords that effectively represent a user's interests from the user's blog posts.

Description

用户关键词提取装置、方法及计算机可读存储介质
优先权申明
本申请基于巴黎公约申明享有2017年08月29日递交的申请号为201710754314.4、名称为“用户关键词提取装置、方法及计算机可读存储介质”的中国专利申请的优先权,该中国专利申请的整体内容以参考的方式结合在本申请中。
技术领域
本申请涉及计算机技术领域,尤其涉及一种基于社交网络的用户关键词提取装置、方法及计算机可读存储介质。
背景技术
目前,随着社交网络的普及,基于微博等社交网络的各种应用也越来越多,例如,针对用户的博文进行个性化的推荐,目前的推荐方式主要是基于相同标签信息的好友推荐、基于共同关注的好友推荐、基于话题热度的微博话题推荐等,但是这种推荐方式局限性大,难以根据用户的兴趣爱好有针对性地进行推荐。所以,如何从海量博文数据中,提取出能够有效代表用户的兴趣的关键词,分析确定用户的真正兴趣是急需解决的问题。
发明内容
本申请提供一种基于社交网络的用户关键词提取装置、方法及计算机可读存储介质,其主要目的在于解决现有技术中难以根据用户的博文提取出能够有效代表用户的兴趣的关键词的技术问题。
为实现上述目的,本申请提供一种基于社交网络的用户关键词提取装置,该装置包括存储器和处理器,所述存储器上存储有可在所述处理器上运行的用户关键词提取程序,所述用户关键词提取程序被所述处理器执行时实现如下步骤:
获取目标用户在预设时间区间内发表过的博文,使用预设的分词工具对获取的博文进行分词处理,分别获取每条博文对应的单词列表;
将获取的每个博文对应的单词列表输入到Word2Vec模型中进行训练,以获取词向量模型;
基于关键词提取算法从博文的单词列表中提取该博文对应的关键词,将所述目标用户在所述预设时间区间内发表过的博文累计的关键词构成所述目标用户的候选关键词集合,并基于所述词向量模型计算所述候选关键词集合中每一个关键词的词向量;
根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图;
在所述语义相似图上运行Pagerank算法为每一个关键词打分,将得分满 足预设条件的关键词作为所述目标用户的兴趣关键词。
可选地,所述根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图的步骤包括:
将所述候选关键词集合中的关键词作为单词节点,其中,一个关键词对应一个单词节点;
遍历全部单词节点,根据对应的词向量计算每两个单词节点之间的上下文相似度,每当两个单词节点之间的上下文相似度大于预设阈值时,在所述两个单词节点之间建立一条边;
由全部单词节点以及建立的边构成所述语义相似图。
可选地,所述根据对应的词向量计算每两个单词节点之间的上下文相似度的步骤包括:
获取两个单词节点的词向量,并计算这两个词向量之间的余弦相似度,将所述余弦相似度作为所述两个单词节点之间的上下文相似度。
可选地,当所述博文包含的字数大于或者等于预设字数时,所述基于关键词提取算法从博文的单词列表中提取该博文对应的关键词的步骤包括:
分别按照预设的多个关键词提取算法从博文的单词列表中提取关键词;
将所述多个关键词提取算法提取的关键词中重复的关键词作为该博文对应的关键词。
可选地,所述将得分满足预设条件的关键词作为所述目标用户的兴趣关键词的步骤包括:
将得分大于预设分数的关键词作为所述目标用户的兴趣关键词;
或者,将得分大于预设分数的关键词作为所述目标用户的兴趣关键词,其中,在得分大于预设分数的关键词的数量大于第一预设个数时,将所述第一预设个数个关键词中的第二预设个数个关键词作为所述目标用户的兴趣关键词,所述第一预设个数大于所述第二预设个数。
此外,为实现上述目的,本申请还提供一种基于社交网络的用户关键词提取方法,该方法包括:
获取目标用户在预设时间区间内发表过的博文,使用预设的分词工具对获取的博文进行分词处理,分别获取每条博文对应的单词列表;
将获取的每个博文对应的单词列表输入到Word2Vec模型中进行训练,以获取词向量模型;
基于关键词提取算法从博文的单词列表中提取该博文对应的关键词,将所述目标用户在所述预设时间区间内发表过的博文累计的关键词构成所述目标用户的候选关键词集合,并基于所述词向量模型计算所述候选关键词集合中每一个关键词的词向量;
根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图;
在所述语义相似图上运行Pagerank算法为每一个关键词打分,将得分满足预设条件的关键词作为所述目标用户的兴趣关键词。
可选地,所述根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图的步骤包括:
将所述候选关键词集合中的关键词作为单词节点,其中,一个关键词对应一个单词节点;
遍历全部单词节点,根据对应的词向量计算每两个单词节点之间的上下文相似度,每当两个单词节点之间的上下文相似度大于预设阈值时,在所述两个单词节点之间建立一条边;
由全部单词节点以及建立的边构成所述语义相似图。
可选地,所述根据对应的词向量计算每两个单词节点之间的上下文相似度的步骤包括:
获取两个单词节点的词向量,并计算这两个词向量之间的余弦相似度,将所述余弦相似度作为所述两个单词节点之间的上下文相似度。
可选地,当所述博文包含的字数大于或者等于预设字数时,所述基于关键词提取算法从博文的单词列表中提取该博文对应的关键词的步骤包括:
分别按照预设的多个关键词提取算法从博文的单词列表中提取关键词;
将所述多个关键词提取算法提取的关键词中重复的关键词作为该博文对应的关键词。
此外,为实现上述目的,本申请还提供一种计算机可读存储介质,所述计算机可读存储介质上存储有用户关键词提取程序,所述用户关键词提取程序可被至少一个处理器执行,以实现如下步骤:
获取目标用户在预设时间区间内发表过的博文,使用预设的分词工具对获取的博文进行分词处理,分别获取每条博文对应的单词列表;
将获取的每个博文对应的单词列表输入到Word2Vec模型中进行训练,以获取词向量模型;
基于关键词提取算法从博文的单词列表中提取该博文对应的关键词,将所述目标用户在所述预设时间区间内发表过的博文累计的关键词构成所述目标用户的候选关键词集合,并基于所述词向量模型计算所述候选关键词集合中每一个关键词的词向量;
根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图;
在所述语义相似图上运行Pagerank算法为每一个关键词打分,将得分满足预设条件的关键词作为所述目标用户的兴趣关键词。
本申请提出的基于社交网络的用户关键词提取装置、方法及计算机可读存储介质,对目标用户在预设时间区间内发表过的每个博文进行分词处理, 以获取每条博文对应的单词列表,输入到Word2Vec模型中进行训练,以获取词向量模型,基于关键词提取算法从博文的单词列表中提取对应的关键词构成一个候选关键词集合,基于上述词向量模型计算集合中的各个关键词的词向量,根据关键词集合中的关键词以及词向量构建语义相似图,在语义相似图上运行Pagerank算法为关键词打分,将得分满足预设条件的关键词作为该用户的兴趣关键词,本申请通过上述方式综合用户发表的过的博文进行分词处理的方式,提取出能够有效代表用户的兴趣的关键词。
附图说明
图1为本申请基于社交网络的用户关键词提取装置较佳实施例的示意图;
图2为本申请基于社交网络的用户关键词提取装置一实施例中用户关键词提取程序的程序模块示意图;
图3为本申请基于社交网络的用户关键词提取方法较佳实施例的流程图。
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。
具体实施方式
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
本申请提供一种基于社交网络的用户关键词提取装置。参照图1所示,为本申请基于社交网络的用户关键词提取装置较佳实施例的示意图。
在本实施例中,基于社交网络的用户关键词提取装置可以是PC(Personal Computer,个人电脑),也可以是智能手机、平板电脑、电子书阅读器、便携计算机等终端设备。
该基于社交网络的用户关键词提取装置包括存储器11、处理器12,通信总线13,以及网络接口14。
其中,存储器11至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、磁性存储器、磁盘、光盘等。存储器11在一些实施例中可以是基于社交网络的用户关键词提取装置的内部存储单元,例如该基于社交网络的用户关键词提取装置的硬盘。存储器11在另一些实施例中也可以是基于社交网络的用户关键词提取装置的外部存储设备,例如基于社交网络的用户关键词提取装置上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,存储器11还可以既包括基于社交网络的用户关键词提取装置的内部存储单元也包括外部存储设备。存储器11不仅可以用于存储安装于基于社交网络的用户关键词提取装置的应用软件及各类数据,例如用户关键词提取程序的代码等,还可以用于暂时地存储已经输出或者将要输出的数据。
处理器12在一些实施例中可以是一中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器或其他数据处理芯片,用于运行存储器11中存储的程序代码或处理数据,例如执行用户关键词提取程序等。
通信总线13用于实现这些组件之间的连接通信。
网络接口14可选的可以包括标准的有线接口、无线接口(如WI-FI接口),通常用于在该装置与其他电子设备之间建立通信连接。
图1仅示出了具有组件11-14以及用户关键词提取程序的基于社交网络的用户关键词提取装置,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。
可选地,该装置还可以包括用户接口,用户接口可以包括显示器(Display)、输入单元比如键盘(Keyboard),可选的用户接口还可以包括标准的有线接口、无线接口。可选地,在一些实施例中,显示器可以是LED显示器、液晶显示器、触控式液晶显示器以及OLED(Organic Light-Emitting Diode,有机发光二极管)触摸器等。其中,显示器也可以适当的称为显示屏或显示单元,用于显示在基于社交网络的用户关键词提取装置中处理的信息以及用于显示可视化的用户界面。
在图1所示的装置实施例中,存储器11中存储有用户关键词提取程序;处理器12执行存储器11中存储的用户关键词提取程序时实现如下步骤:
A、获取目标用户在预设时间区间内发表过的博文,使用预设的分词工具对获取的博文进行分词处理,分别获取每条博文对应的单词列表;
B、将获取的每个博文对应的单词列表输入到Word2Vec模型中进行训练,以获取词向量模型;
C、基于关键词提取算法从博文的单词列表中提取该博文对应的关键词,将所述目标用户在所述预设时间区间内发表过的博文累计的关键词构成所述目标用户的候选关键词集合,并基于所述词向量模型计算所述候选关键词集合中每一个关键词的词向量。
本实施例中,以微博为例对本申请的方案进行解释。当需要根据目标用户发表过的微博内容来获取能够有效体现该用户的兴趣爱好的关键词时,获取用户发表过的博文进行分词处理。可以理解的是,由于随着时间的推移,用户的兴趣爱好可能会发生变化,因此,为了提高关键词提取的准确性,在时间维度上对发表过的博文进行过滤,设置预设时间区间,只对该时间段的发表的博文进行分析,例如,只分析近一年发表过的博文。当然,在其他的实施例中,当用户在预设时间区间内发表过的博文的数量较少时,也可以对该用户过去曾发表过的全部博文进行分析。
在获取到目标用户的博文后,使用分词工具逐个对获取到的每一个博文进行分词处理,例如使用Stanford汉语分词工具、jieba分词等分词工具进行分词处理。例如,对这一博文内容“昨天晚上去看了电影”进行分词,会得 到如下结果“昨天|晚上|去|看|了|电影”。分词处理后保留分词结果,进一步地,为了进一步提高关键词的有效性,只保留分词结果中的动词和/或名词,去掉副词、形容词等无法体现用户兴趣的词,例如上述例子中,可以只保留“电影”这个词。可以理解的是,经过分词处理后的结果为空,则过滤掉对应的博文,而对于每一个分词结果不为空的博文都能得到一个对应的单词列表,将上述时间区间内的所有博文对应的单词列表输入到Word2Vec模型中进行训练,得到词向量模型,该词向量模型用于将关键词转化为一个词向量。Word2Vec模型是一个用于词向量计算的工具,关于对该模型进行训练并使用它来计算单词的词向量已经有成熟的计算方法,在此不再赘述。
接下来,使用关键词提取算法对每一个博文进行关键词提取,例如,使用TF-IDF(Term Frequency-Inverse Document Frequency,词项频率-逆向文本频率)算法、LSA(Latent Semantic Analysis,隐性语义分析)算法或者PLSA(Probabilisitic Latent Semantic Analysis,概率隐性语义分析)算法等关键词提取算法中的任意一种算法对每一个博文的单词列表进行计算,将得分最高的一个或者多个单词作为该博文对应的关键词,使用上述词向量模型将每一个关键词转换为一个对应的词向量。或者,作为一种实施方式,结合多个关键词提取算法进行关键词的提取,具体地,基于关键词提取算法从博文的单词列表中提取该博文对应的关键词的步骤包括:分别按照预设的多个关键词提取算法从博文的单词列表中提取关键词;将所述多个关键词提取算法提取的关键词中重复的关键词作为该博文对应的关键词。例如,分别按照上述TF-IDF算法、LSA算法或者PLSA算法进行一次关键词的提取,然后取重合部分的关键词作为该博文对应的关键词。
由于博文的内容一般比较短小,在应用上述关键词提取算法对博文进行关键词提取时,一般提取到的关键词噪声大,并且过于宽泛,难以准确地反映用户的兴趣,因此,本实施例中,针对大量的博文,应用上述关键词提取算法提取到关键词并作为候选关键词,建立候选关键词集合,再根据后续的算法对该关键词集合进行处理,从中获取能够反映用户兴趣的关键词。
D、根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图。
将目标用户在上述预设时间区间内发表过的每一个博文对应的关键词构成该目标用户的候选关键词集合,并使用上述词向量模型计算集合中每一个关键词的词向量。根据上述候选关键词集合以及词向量构建一个语义相似图。
根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图的步骤可以包括如下细化步骤:将所述候选关键词集合中的关键词作为单词节点,其中,一个关键词对应一个单词节点;遍历全部单词节点,根据对应的词向量计算每两个单词节点之间的上下文相似度,每当两个单词节点之间的上下文相似度大于预设阈值时,在所述两个单词节点之间建立一条边;由全部单词节点以及建立的边构成所述语义相似图。
其中,在计算上下文相似度时,获取两个单词节点的词向量,并计算这两个词向量之间的余弦相似度,将所述余弦相似度作为所述两个单词节点之间的上下文相似度。其中,在单词节点之间建立的边可以是有向边,也可以是无向边,其中,有向边的方向可以是有出现的早的单词节点指向出现的晚的单词节点。它们具有不同的优点,有向边的特点是运行Pagerank算法时需要进行迭代计算,计算量稍大,其优点是去噪效果良好;例如,对一个用户进行分析后,得到的关键词有:C罗,皇马,西甲,足球,抽奖,前四个词在语义相似图中无论谁指向谁,都会在Pagerank算法的打分中形成相互促进的作用,那么就算有一些词,例如零食,和其它词建立了有向边,但是在迭代中形成不了促进,这样对于“抽奖”这个词的打分就比较低,就可以排除掉这个词。而对于无向边,运行Pagerank算法时的计算速度快,不需要进行迭代计算,但是去噪的效果不是很好,例如在上述例子中,有可能不会排除掉“抽奖”这个词。在其他实施例中,也可以采用其他的方式计算两个单词之间的语义相似度,例如,通过基于大规模语料库计算语义相似度的方法等,基于大规模语料库计算语义相似度的方法是一种较为成熟的词语之间相似度的计算方法,其具体原理在此不再赘述。
E、在所述语义相似图上运行Pagerank算法为每一个关键词打分,将得分满足预设条件的关键词作为所述目标用户的兴趣关键词。
在语义相似图上运行Pagerank算法对每个单词节点进行打分,单词节点的Pagerank值越大,说明在图上指向该单词节点的其他单词节点(针对有向边的情况)或者与该单词节点建立连接的其他单词节点(针对无向边的情况)越多,进而说明在图上有越多的其他单词节点与该单词节点的相似度比较高,则该单词节点对应的关键词越能够体现用户的兴趣,因此,将得分较高的关键词作为目标用户的兴趣关键词。具体地,将得分满足预设条件的关键词作为所述目标用户的兴趣关键词的步骤可以包括:
将得分大于预设分数的关键词作为所述目标用户的兴趣关键词;
或者,将得分大于预设分数的关键词作为所述目标用户的兴趣关键词,其中,在得分大于预设分数的关键词的数量大于第一预设个数时,将所述第一预设个数个关键词中的第二预设个数个关键词作为所述目标用户的兴趣关键词,所述第一预设个数大于所述第二预设个数。
可以理解的是,上述各实施例中涉及到的预设阈值、预设字数、第一预设个数以及第二预设个数等需要预先设置的参数,可以用户根据实际情况进行设置。
上述实施例提出的基于社交网络的用户关键词提取装置,对目标用户在预设时间区间内发表过的每个博文进行分词处理,以获取每条博文对应的单词列表,输入到Word2Vec模型中进行训练,以获取词向量模型,基于关键词提取算法从博文的单词列表中提取对应的关键词构成一个候选关键词集合,基于上述词向量模型计算集合中的各个关键词的词向量,根据关键词集合中 的关键词以及词向量构建语义相似图,在语义相似图上运行Pagerank算法为关键词打分,将得分满足预设条件的关键词作为该用户的兴趣关键词,本申请通过上述方式综合用户发表的过的博文进行分词处理的方式,提取出能够有效代表用户的兴趣的关键词。
可选地,在其他的实施例中,用户关键词提取程序还可以被分割为一个或者多个模块,一个或者多个模块被存储于存储器11中,并由一个或多个处理器(本实施例为处理器12)所执行,以完成本申请,本申请所称的模块是指能够完成特定功能的一系列计算机程序指令段。例如,参照图2所示,为本申请基于社交网络的用户关键词提取装置一实施例中的用户关键词提取程序的程序模块示意图,该实施例中,用户关键词提取程序可以被分割为获取模块10、训练模块20、提取模块30、建图模块40以及打分模块50,示例性地:
获取模块10用于获取目标用户在预设时间区间内发表过的博文,使用预设的分词工具对获取的博文进行分词处理,分别获取每条博文对应的单词列表;
训练模块20用于将获取的每个博文对应的单词列表输入到Word2Vec模型中进行训练,以获取词向量模型;
提取模块30用于基于关键词提取算法从博文的单词列表中提取该博文对应的关键词,将所述目标用户在所述预设时间区间内发表过的博文累计的关键词构成所述目标用户的候选关键词集合,并基于所述词向量模型计算所述候选关键词集合中每一个关键词的词向量;
建图模块40用于根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图;
打分模块50用于在所述语义相似图上运行Pagerank算法为每一个关键词打分,将得分满足预设条件的关键词作为所述目标用户的兴趣关键词。
上述获取模块10、训练模块20、提取模块30、建图模块40以及打分模块50被执行所实现的功能或操作步骤与上述实施例大体相同,在此不再赘述。
此外,本申请还提供一种基于社交网络的用户关键词提取方法。参照图3所示,为本申请基于社交网络的用户关键词提取方法较佳实施例的流程图。该方法可以由一个装置执行,该装置可以由软件和/或硬件实现。
在本实施例中,基于社交网络的用户关键词提取方法包括:
步骤S10,获取目标用户在预设时间区间内发表过的博文,使用预设的分词工具对获取的博文进行分词处理,分别获取每条博文对应的单词列表;
步骤S20,将获取的每个博文对应的单词列表输入到Word2Vec模型中进行训练,以获取词向量模型;
步骤S30,基于关键词提取算法从博文的单词列表中提取该博文对应的关 键词,将所述目标用户在所述预设时间区间内发表过的博文累计的关键词构成所述目标用户的候选关键词集合,并基于所述词向量模型计算所述候选关键词集合中每一个关键词的词向量。本实施例中,以微博为例对方案进行解释说明。当需要根据目标用户发表过的微博内容来获取能够有效体现该用户的兴趣爱好的关键词时,获取用户发表过的博文进行分词处理。可以理解的是,由于随着时间的推移,用户的兴趣爱好可能会发生变化,因此,为了提高关键词提取的准确性,在时间维度上对发表过的博文进行过滤,设置预设时间区间,只对该时间段的发表的博文进行分析,例如,只分析近一年发表过的博文。当然,在其他的实施例中,当用户在预设时间区间内发表过的博文的数量较少时,也可以对该用户过去曾发表过的全部博文进行分析。
在获取到目标用户的博文后,使用分词工具逐个对获取到的每一个博文进行分词处理,例如使用Stanford汉语分词工具、jieba分词等分词工具进行分词处理。例如,对这一博文内容“昨天晚上去看了电影”进行分词,会得到如下结果“昨天|晚上|去|看|了|电影”。分词处理后保留分词结果,进一步地,为了进一步提高关键词的有效性,只保留分词结果中的动词和/或名词,去掉副词、形容词等无法体现用户兴趣的词,例如上述例子中,可以只保留“电影”这个词。可以理解的是,经过分词处理后的结果为空,则过滤掉对应的博文,而对于每一个分词结果不为空的博文都能得到一个对应的单词列表,将上述时间区间内的所有博文对应的单词列表输入到Word2Vec模型中进行训练,得到词向量模型,该词向量模型用于将关键词转化为一个词向量。Word2Vec模型是一个用于词向量计算的工具,关于对该模型进行训练并使用它来计算单词的词向量已经有成熟的计算方法,在此不再赘述。
接下来,使用关键词提取算法对每一个博文进行关键词提取,例如,使用TF-IDF(Term Frequency-Inverse Document Frequency,词项频率-逆向文本频率)算法、LSA(Latent Semantic Analysis,隐性语义分析)算法或者PLSA(Probabilisitic Latent Semantic Analysis,概率隐性语义分析)算法等关键词提取算法中的任意一种算法对每一个博文的单词列表进行计算,将得分最高的一个或者多个单词作为该博文对应的关键词,使用上述词向量模型将每一个关键词转换为一个对应的词向量。或者,作为一种实施方式,结合多个关键词提取算法进行关键词的提取,具体地,基于关键词提取算法从博文的单词列表中提取该博文对应的关键词的步骤包括:分别按照预设的多个关键词提取算法从博文的单词列表中提取关键词;将所述多个关键词提取算法提取的关键词中重复的关键词作为该博文对应的关键词。例如,分别按照上述TF-IDF算法、LSA算法或者PLSA算法进行一次关键词的提取,然后取重合部分的关键词作为该博文对应的关键词。
由于博文的内容一般比较短小,在应用上述关键词提取算法对博文进行关键词提取时,一般提取到的关键词噪声大,并且过于宽泛,难以准确地反映用户的兴趣,因此,本实施例中,针对大量的博文,应用上述关键词提取 算法提取到关键词并作为候选关键词,建立候选关键词集合,再根据后续的算法对该关键词集合进行处理,从中获取能够反映用户兴趣的关键词。
步骤S40,根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图。
将目标用户在上述预设时间区间内发表过的每一个博文对应的关键词构成该目标用户的候选关键词集合,并使用上述词向量模型计算集合中每一个关键词的词向量。根据上述候选关键词集合以及词向量构建一个语义相似图。
根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图的步骤可以包括如下细化步骤:将所述候选关键词集合中的关键词作为单词节点,其中,一个关键词对应一个单词节点;遍历全部单词节点,根据对应的词向量计算每两个单词节点之间的上下文相似度,每当两个单词节点之间的上下文相似度大于预设阈值时,在所述两个单词节点之间建立一条边;由全部单词节点以及建立的边构成所述语义相似图。
其中,在计算上下文相似度时,获取两个单词节点的词向量,并计算这两个词向量之间的余弦相似度,将所述余弦相似度作为所述两个单词节点之间的上下文相似度。其中,在单词节点之间建立的边可以是有向边,也可以是无向边,其中,有向边的方向可以是有出现的早的单词节点指向出现的晚的单词节点。它们具有不同的优点,有向边的特点是运行Pagerank算法时需要进行迭代计算,计算量稍大,其优点是去噪效果良好;例如,对一个用户进行分析后,得到的关键词有:C罗,皇马,西甲,足球,抽奖,前四个词在语义相似图中无论谁指向谁,都会在Pagerank算法的打分中形成相互促进的作用,那么就算有一些词,例如零食,和其它词建立了有向边,但是在迭代中形成不了促进,这样对于“抽奖”这个词的打分就比较低,就可以排除掉这个词。而对于无向边,运行Pagerank算法时的计算速度快,不需要进行迭代计算,但是去噪的效果不是很好,例如在上述例子中,有可能不会排除掉“抽奖”这个词。在其他实施例中,也可以采用其他的方式计算两个单词之间的语义相似度,例如,通过基于大规模语料库计算语义相似度的方法等,基于大规模语料库计算语义相似度的方法是一种较为成熟的词语之间相似度的计算方法,其具体原理在此不再赘述。
步骤S50,在所述语义相似图上运行Pagerank算法为每一个关键词打分,将得分满足预设条件的关键词作为所述目标用户的兴趣关键词。
在语义相似图上运行Pagerank算法对每个单词节点进行打分,单词节点的Pagerank值越大,说明在图上指向该单词节点的其他单词节点(针对有向边的情况)或者与该单词节点建立连接的其他单词节点(针对无向边的情况)越多,进而说明在图上有越多的其他单词节点与该单词节点的相似度比较高,则该单词节点对应的关键词越能够体现用户的兴趣,因此,将得分较高的关键词作为目标用户的兴趣关键词。具体地,将得分满足预设条件的关键词作为所述目标用户的兴趣关键词的步骤可以包括:
将得分大于预设分数的关键词作为所述目标用户的兴趣关键词;
或者,将得分大于预设分数的关键词作为所述目标用户的兴趣关键词,其中,在得分大于预设分数的关键词的数量大于第一预设个数时,将所述第一预设个数个关键词中的第二预设个数个关键词作为所述目标用户的兴趣关键词,所述第一预设个数大于所述第二预设个数。
可以理解的是,上述各实施例中涉及到的预设阈值、预设字数、第一预设个数以及第二预设个数等需要预先设置的参数,可以用户根据实际情况进行设置。
上述实施例提出的基于社交网络的用户关键词提取方法,对目标用户在预设时间区间内发表过的每个博文进行分词处理,以获取每条博文对应的单词列表,输入到Word2Vec模型中进行训练,以获取词向量模型,基于关键词提取算法从博文的单词列表中提取对应的关键词构成一个候选关键词集合,基于上述词向量模型计算集合中的各个关键词的词向量,根据关键词集合中的关键词以及词向量构建语义相似图,在语义相似图上运行Pagerank算法为关键词打分,将得分满足预设条件的关键词作为该用户的兴趣关键词,本申请通过上述方式综合用户发表的过的博文进行分词处理的方式,提取出能够有效代表用户的兴趣的关键词。
此外,本申请实施例还提出一种计算机可读存储介质,所述计算机可读存储介质上存储有用户关键词提取程序,所述用户关键词提取程序可被一个或多个处理器执行,以实现如下操作:
获取目标用户在预设时间区间内发表过的博文,使用预设的分词工具对获取的博文进行分词处理,分别获取每条博文对应的单词列表;
将获取的每个博文对应的单词列表输入到Word2Vec模型中进行训练,以获取词向量模型;
基于关键词提取算法从博文的单词列表中提取该博文对应的关键词,将所述目标用户在所述预设时间区间内发表过的博文累计的关键词构成所述目标用户的候选关键词集合,并基于所述词向量模型计算所述候选关键词集合中每一个关键词的词向量;
根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图;
在所述语义相似图上运行Pagerank算法为每一个关键词打分,将得分满足预设条件的关键词作为所述目标用户的兴趣关键词。
进一步地,所述用户关键词提取程序被处理器执行时还实现如下操作:
将所述候选关键词集合中的关键词作为单词节点,其中,一个关键词对应一个单词节点;
遍历全部单词节点,根据对应的词向量计算每两个单词节点之间的上下文相似度,每当两个单词节点之间的上下文相似度大于预设阈值时,在所述 两个单词节点之间建立一条边;
由全部单词节点以及建立的边构成所述语义相似图。
进一步地,所述用户关键词提取程序被处理器执行时还实现如下操作:
获取两个单词节点的词向量,并计算这两个词向量之间的余弦相似度,将所述余弦相似度作为所述两个单词节点之间的上下文相似度。
进一步地,所述用户关键词提取程序被处理器执行时还实现如下操作:
分别按照预设的多个关键词提取算法从博文的单词列表中提取关键词;
将所述多个关键词提取算法提取的关键词中重复的关键词作为该博文对应的关键词。
本申请计算机可读存储介质具体实施方式与上述基于社交网络的用户关键词提取装置和方法各实施例基本相同,在此不作累述。
需要说明的是,上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。并且本文中的术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、装置、物品或者方法不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、装置、物品或者方法所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、装置、物品或者方法中还存在另外的相同要素。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在如上所述的一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
以上仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。

Claims (20)

  1. 一种基于社交网络的用户关键词提取装置,其特征在于,所述装置包括存储器和处理器,所述存储器上存储有可在所述处理器上运行的用户关键词提取程序,所述用户关键词提取程序被所述处理器执行时实现如下步骤:
    获取目标用户在预设时间区间内发表过的博文,使用预设的分词工具对获取的博文进行分词处理,分别获取每条博文对应的单词列表;
    将获取的每个博文对应的单词列表输入到Word2Vec模型中进行训练,以获取词向量模型;
    基于关键词提取算法从博文的单词列表中提取该博文对应的关键词,将所述目标用户在所述预设时间区间内发表过的博文累计的关键词构成所述目标用户的候选关键词集合,并基于所述词向量模型计算所述候选关键词集合中每一个关键词的词向量;
    根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图;
    在所述语义相似图上运行Pagerank算法为每一个关键词打分,将得分满足预设条件的关键词作为所述目标用户的兴趣关键词。
  2. 根据权利要求1所述的基于社交网络的用户关键词提取装置,其特征在于,所述根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图的步骤包括:
    将所述候选关键词集合中的关键词作为单词节点,其中,一个关键词对应一个单词节点;
    遍历全部单词节点,根据对应的词向量计算每两个单词节点之间的上下文相似度,每当两个单词节点之间的上下文相似度大于预设阈值时,在所述两个单词节点之间建立一条边;
    由全部单词节点以及建立的边构成所述语义相似图。
  3. 根据权利要求2所述的基于社交网络的用户关键词提取装置,其特征在于,所述根据对应的词向量计算每两个单词节点之间的上下文相似度的步骤包括:
    获取两个单词节点的词向量,并计算这两个词向量之间的余弦相似度,将所述余弦相似度作为所述两个单词节点之间的上下文相似度。
  4. 根据权利要求1所述的基于社交网络的用户关键词提取装置,其特征在于,当所述博文包含的字数大于或者等于预设字数时,所述基于关键词提取算法从博文的单词列表中提取该博文对应的关键词的步骤包括:
    分别按照预设的多个关键词提取算法从博文的单词列表中提取关键词;
    将所述多个关键词提取算法提取的关键词中重复的关键词作为该博文对 应的关键词。
  5. 根据权利要求2所述的基于社交网络的用户关键词提取装置,其特征在于,当所述博文包含的字数大于或者等于预设字数时,所述基于关键词提取算法从博文的单词列表中提取该博文对应的关键词的步骤包括:
    分别按照预设的多个关键词提取算法从博文的单词列表中提取关键词;
    将所述多个关键词提取算法提取的关键词中重复的关键词作为该博文对应的关键词。
  6. 根据权利要求1所述的基于社交网络的用户关键词提取装置,其特征在于,所述将得分满足预设条件的关键词作为所述目标用户的兴趣关键词的步骤包括:
    将得分大于预设分数的关键词作为所述目标用户的兴趣关键词;
    或者,将得分大于预设分数的关键词作为所述目标用户的兴趣关键词,其中,在得分大于预设分数的关键词的数量大于第一预设个数时,将所述第一预设个数个关键词中的第二预设个数个关键词作为所述目标用户的兴趣关键词,所述第一预设个数大于所述第二预设个数。
  7. 根据权利要求2所述的基于社交网络的用户关键词提取装置,其特征在于,所述将得分满足预设条件的关键词作为所述目标用户的兴趣关键词的步骤包括:
    将得分大于预设分数的关键词作为所述目标用户的兴趣关键词;
    或者,将得分大于预设分数的关键词作为所述目标用户的兴趣关键词,其中,在得分大于预设分数的关键词的数量大于第一预设个数时,将所述第一预设个数个关键词中的第二预设个数个关键词作为所述目标用户的兴趣关键词,所述第一预设个数大于所述第二预设个数。
  8. 一种基于社交网络的用户关键词提取方法,其特征在于,所述方法包括:
    获取目标用户在预设时间区间内发表过的博文,使用预设的分词工具对获取的博文进行分词处理,分别获取每条博文对应的单词列表;
    将获取的每个博文对应的单词列表输入到Word2Vec模型中进行训练,以获取词向量模型;
    基于关键词提取算法从博文的单词列表中提取该博文对应的关键词,将所述目标用户在所述预设时间区间内发表过的博文累计的关键词构成所述目标用户的候选关键词集合,并基于所述词向量模型计算所述候选关键词集合中每一个关键词的词向量;
    根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图;
    在所述语义相似图上运行Pagerank算法为每一个关键词打分,将得分满足预设条件的关键词作为所述目标用户的兴趣关键词。
  9. 根据权利要求8所述的基于社交网络的用户关键词提取方法,其特征在于,所述根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图的步骤包括:
    将所述候选关键词集合中的关键词作为单词节点,其中,一个关键词对应一个单词节点;
    遍历全部单词节点,根据对应的词向量计算每两个单词节点之间的上下文相似度,每当两个单词节点之间的上下文相似度大于预设阈值时,在所述两个单词节点之间建立一条边;
    由全部单词节点以及建立的边构成所述语义相似图。
  10. 根据权利要求9所述的基于社交网络的用户关键词提取方法,其特征在于,所述根据对应的词向量计算每两个单词节点之间的上下文相似度的步骤包括:
    获取两个单词节点的词向量,并计算这两个词向量之间的余弦相似度,将所述余弦相似度作为所述两个单词节点之间的上下文相似度。
  11. 根据权利要求8所述的基于社交网络的用户关键词提取方法,其特征在于,当所述博文包含的字数大于或者等于预设字数时,所述基于关键词提取算法从博文的单词列表中提取该博文对应的关键词的步骤包括:
    分别按照预设的多个关键词提取算法从博文的单词列表中提取关键词;
    将所述多个关键词提取算法提取的关键词中重复的关键词作为该博文对应的关键词。
  12. 根据权利要求9所述的基于社交网络的用户关键词提取方法,其特征在于,当所述博文包含的字数大于或者等于预设字数时,所述基于关键词提取算法从博文的单词列表中提取该博文对应的关键词的步骤包括:
    分别按照预设的多个关键词提取算法从博文的单词列表中提取关键词;
    将所述多个关键词提取算法提取的关键词中重复的关键词作为该博文对应的关键词。
  13. 根据权利要求8所述的基于社交网络的用户关键词提取装置,其特征在于,所述将得分满足预设条件的关键词作为所述目标用户的兴趣关键词的步骤包括:
    将得分大于预设分数的关键词作为所述目标用户的兴趣关键词;
    或者,将得分大于预设分数的关键词作为所述目标用户的兴趣关键词,其中,在得分大于预设分数的关键词的数量大于第一预设个数时,将所述第 一预设个数个关键词中的第二预设个数个关键词作为所述目标用户的兴趣关键词,所述第一预设个数大于所述第二预设个数。
  14. 根据权利要求9所述的基于社交网络的用户关键词提取装置,其特征在于,所述将得分满足预设条件的关键词作为所述目标用户的兴趣关键词的步骤包括:
    将得分大于预设分数的关键词作为所述目标用户的兴趣关键词;
    或者,将得分大于预设分数的关键词作为所述目标用户的兴趣关键词,其中,在得分大于预设分数的关键词的数量大于第一预设个数时,将所述第一预设个数个关键词中的第二预设个数个关键词作为所述目标用户的兴趣关键词,所述第一预设个数大于所述第二预设个数。
  15. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质上存储有用户关键词提取程序,所述用户关键词提取程序可被至少一个处理器执行,以实现如下步骤:
    获取目标用户在预设时间区间内发表过的博文,使用预设的分词工具对获取的博文进行分词处理,分别获取每条博文对应的单词列表;
    将获取的每个博文对应的单词列表输入到Word2Vec模型中进行训练,以获取词向量模型;
    基于关键词提取算法从博文的单词列表中提取该博文对应的关键词,将所述目标用户在所述预设时间区间内发表过的博文累计的关键词构成所述目标用户的候选关键词集合,并基于所述词向量模型计算所述候选关键词集合中每一个关键词的词向量;
    根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图;
    在所述语义相似图上运行Pagerank算法为每一个关键词打分,将得分满足预设条件的关键词作为所述目标用户的兴趣关键词。
  16. 根据权利要求15所述的计算机可读存储介质,其特征在于,所述根据所述候选关键词集合以及所述候选关键词集合中每一个关键词对应的词向量,构建语义相似图的步骤包括:
    将所述候选关键词集合中的关键词作为单词节点,其中,一个关键词对应一个单词节点;
    遍历全部单词节点,根据对应的词向量计算每两个单词节点之间的上下文相似度,每当两个单词节点之间的上下文相似度大于预设阈值时,在所述两个单词节点之间建立一条边;
    由全部单词节点以及建立的边构成所述语义相似图。
  17. 根据权利要求16所述的计算机可读存储介质,其特征在于,所述根 据对应的词向量计算每两个单词节点之间的上下文相似度的步骤包括:
    获取两个单词节点的词向量,并计算这两个词向量之间的余弦相似度,将所述余弦相似度作为所述两个单词节点之间的上下文相似度。
  18. 根据权利要求15所述的计算机可读存储介质,其特征在于,当所述博文包含的字数大于或者等于预设字数时,所述基于关键词提取算法从博文的单词列表中提取该博文对应的关键词的步骤包括:
    分别按照预设的多个关键词提取算法从博文的单词列表中提取关键词;
    将所述多个关键词提取算法提取的关键词中重复的关键词作为该博文对应的关键词。
  19. 根据权利要求16所述的计算机可读存储介质,其特征在于,当所述博文包含的字数大于或者等于预设字数时,所述基于关键词提取算法从博文的单词列表中提取该博文对应的关键词的步骤包括:
    分别按照预设的多个关键词提取算法从博文的单词列表中提取关键词;
    将所述多个关键词提取算法提取的关键词中重复的关键词作为该博文对应的关键词。
  20. 根据权利要求15所述的计算机可读存储介质,其特征在于,所述将得分满足预设条件的关键词作为所述目标用户的兴趣关键词的步骤包括:
    将得分大于预设分数的关键词作为所述目标用户的兴趣关键词;
    或者,将得分大于预设分数的关键词作为所述目标用户的兴趣关键词,其中,在得分大于预设分数的关键词的数量大于第一预设个数时,将所述第一预设个数个关键词中的第二预设个数个关键词作为所述目标用户的兴趣关键词,所述第一预设个数大于所述第二预设个数。
PCT/CN2017/108797 2017-08-29 2017-10-31 用户关键词提取装置、方法及计算机可读存储介质 WO2019041521A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
AU2017408801A AU2017408801B2 (en) 2017-08-29 2017-10-31 User keyword extraction device and method, and computer-readable storage medium
US16/084,988 US20210097238A1 (en) 2017-08-29 2017-10-31 User keyword extraction device and method, and computer-readable storage medium
JP2018538141A JP2019533205A (ja) 2017-08-29 2017-10-31 ユーザキーワード抽出装置、方法、及びコンピュータ読み取り可能な記憶媒体
KR1020187024862A KR102170929B1 (ko) 2017-08-29 2017-10-31 사용자 키워드 추출장치, 방법 및 컴퓨터 판독 가능한 저장매체
EP17904351.8A EP3477495A4 (en) 2017-08-29 2017-10-31 APPARATUS AND METHOD FOR USER KEYWORD EXTRACTION AND COMPUTER-READABLE MEMORY MEDIUM

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710754314.4 2017-08-29
CN201710754314.4A CN107704503A (zh) 2017-08-29 2017-08-29 用户关键词提取装置、方法及计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2019041521A1 true WO2019041521A1 (zh) 2019-03-07

Family

ID=61169937

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/108797 WO2019041521A1 (zh) 2017-08-29 2017-10-31 用户关键词提取装置、方法及计算机可读存储介质

Country Status (7)

Country Link
US (1) US20210097238A1 (zh)
EP (1) EP3477495A4 (zh)
JP (1) JP2019533205A (zh)
KR (1) KR102170929B1 (zh)
CN (1) CN107704503A (zh)
AU (1) AU2017408801B2 (zh)
WO (1) WO2019041521A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489758A (zh) * 2019-09-10 2019-11-22 深圳市和讯华谷信息技术有限公司 应用程序的价值观计算方法及装置
CN111160193A (zh) * 2019-12-20 2020-05-15 中国平安财产保险股份有限公司 关键信息提取方法、装置及存储介质
CN111191119A (zh) * 2019-12-16 2020-05-22 绍兴市上虞区理工高等研究院 一种基于神经网络的科技成果自学习方法及装置
CN111581492A (zh) * 2020-04-01 2020-08-25 车智互联(北京)科技有限公司 一种内容推荐方法、计算设备及可读存储介质
CN111858834A (zh) * 2020-07-30 2020-10-30 平安国际智慧城市科技股份有限公司 基于ai的案件争议焦点确定方法、装置、设备及介质
CN112101012A (zh) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 互动领域确定方法、装置、电子设备及存储介质
CN112800771A (zh) * 2020-02-17 2021-05-14 腾讯科技(深圳)有限公司 文章识别方法、装置、计算机可读存储介质和计算机设备
CN112988971A (zh) * 2021-03-15 2021-06-18 平安科技(深圳)有限公司 基于词向量的搜索方法、终端、服务器及存储介质

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596789B (zh) * 2018-03-29 2022-08-30 时时同云科技(成都)有限责任公司 一种菜品标准化的方法
CN108573134A (zh) * 2018-04-04 2018-09-25 阿里巴巴集团控股有限公司 一种识别身份的方法、装置及电子设备
CN109635273B (zh) * 2018-10-25 2023-04-25 平安科技(深圳)有限公司 文本关键词提取方法、装置、设备及存储介质
CN109408826A (zh) * 2018-11-07 2019-03-01 北京锐安科技有限公司 一种文本信息提取方法、装置、服务器及存储介质
CN111259656A (zh) * 2018-11-15 2020-06-09 武汉斗鱼网络科技有限公司 短语相似度计算方法、存储介质、电子设备及系统
CN109508423A (zh) * 2018-12-14 2019-03-22 平安科技(深圳)有限公司 基于语义识别的房源推荐方法、装置、设备及存储介质
CN110298029B (zh) * 2019-05-22 2022-07-12 平安科技(深圳)有限公司 基于用户语料的好友推荐方法、装置、设备及介质
JP7451917B2 (ja) * 2019-09-26 2024-03-19 株式会社Jvcケンウッド 情報提供装置、情報提供方法及びプログラム
KR102326744B1 (ko) * 2019-11-21 2021-11-16 강원오픈마켓 주식회사 사용자 참여형 키워드 선정 시스템의 제어 방법, 장치 및 프로그램
CN111274428B (zh) * 2019-12-19 2023-06-30 北京创鑫旅程网络技术有限公司 一种关键词的提取方法及装置、电子设备、存储介质
CN111460099B (zh) * 2020-03-30 2023-04-07 招商局金融科技有限公司 关键词提取方法、装置及存储介质
KR102476334B1 (ko) * 2020-04-22 2022-12-09 인하대학교 산학협력단 딥러닝 기반 일기 생성 방법 및 장치
CN111737523B (zh) * 2020-04-22 2023-11-14 聚好看科技股份有限公司 一种视频标签、搜索内容的生成方法及服务器
CN111724196A (zh) * 2020-05-14 2020-09-29 天津大学 一种基于用户体验的提高汽车产品质量的方法
CN112069232B (zh) * 2020-09-08 2023-08-01 中国移动通信集团河北有限公司 宽带业务覆盖范围的查询方法及装置
CN112347778B (zh) * 2020-11-06 2023-06-20 平安科技(深圳)有限公司 关键词抽取方法、装置、终端设备及存储介质
CN112329462B (zh) * 2020-11-26 2024-02-20 北京五八信息技术有限公司 一种数据排序方法、装置、电子设备及存储介质
CN113919342A (zh) * 2021-09-18 2022-01-11 暨南大学 一种会计术语共现网络图构建的方法
CN115080718B (zh) * 2022-06-21 2024-04-09 浙江极氪智能科技有限公司 一种文本关键短语的抽取方法、系统、设备及存储介质
CN115344679A (zh) * 2022-08-16 2022-11-15 中国平安财产保险股份有限公司 问题数据的处理方法、装置、计算机设备及存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020073095A1 (en) * 2000-12-07 2002-06-13 Patentmall Ltd. Patent classification displaying method and apparatus
CN104778161A (zh) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 基于Word2Vec和Query log抽取关键词方法
CN106997382A (zh) * 2017-03-22 2017-08-01 山东大学 基于大数据的创新创意标签自动标注方法及系统

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5088096B2 (ja) * 2007-11-02 2012-12-05 富士通株式会社 情報抽出プログラムおよび情報抽出装置
CN103201718A (zh) * 2010-11-05 2013-07-10 乐天株式会社 关于关键词提取的系统和方法
US9798818B2 (en) * 2015-09-22 2017-10-24 International Business Machines Corporation Analyzing concepts over time
CN105893410A (zh) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 一种关键词提取方法和装置
US20170139899A1 (en) 2015-11-18 2017-05-18 Le Holdings (Beijing) Co., Ltd. Keyword extraction method and electronic device
CN105447179B (zh) * 2015-12-14 2019-02-05 清华大学 基于微博社交网络的话题自动推荐方法及其系统
CN105912524B (zh) * 2016-04-09 2019-08-20 北京交通大学 基于低秩矩阵分解的文章话题关键词提取方法和装置
CN106372064B (zh) * 2016-11-18 2019-04-19 北京工业大学 一种文本挖掘的特征词权重计算方法
CN106970910B (zh) * 2017-03-31 2020-03-27 北京奇艺世纪科技有限公司 一种基于图模型的关键词提取方法及装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020073095A1 (en) * 2000-12-07 2002-06-13 Patentmall Ltd. Patent classification displaying method and apparatus
CN104778161A (zh) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 基于Word2Vec和Query log抽取关键词方法
CN106997382A (zh) * 2017-03-22 2017-08-01 山东大学 基于大数据的创新创意标签自动标注方法及系统

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489758B (zh) * 2019-09-10 2023-04-18 深圳市和讯华谷信息技术有限公司 应用程序的价值观计算方法及装置
CN110489758A (zh) * 2019-09-10 2019-11-22 深圳市和讯华谷信息技术有限公司 应用程序的价值观计算方法及装置
CN111191119A (zh) * 2019-12-16 2020-05-22 绍兴市上虞区理工高等研究院 一种基于神经网络的科技成果自学习方法及装置
CN111191119B (zh) * 2019-12-16 2023-12-12 绍兴市上虞区理工高等研究院 一种基于神经网络的科技成果自学习方法及装置
CN111160193A (zh) * 2019-12-20 2020-05-15 中国平安财产保险股份有限公司 关键信息提取方法、装置及存储介质
CN111160193B (zh) * 2019-12-20 2024-02-09 中国平安财产保险股份有限公司 关键信息提取方法、装置及存储介质
CN112800771B (zh) * 2020-02-17 2023-11-07 腾讯科技(深圳)有限公司 文章识别方法、装置、计算机可读存储介质和计算机设备
CN112800771A (zh) * 2020-02-17 2021-05-14 腾讯科技(深圳)有限公司 文章识别方法、装置、计算机可读存储介质和计算机设备
CN111581492A (zh) * 2020-04-01 2020-08-25 车智互联(北京)科技有限公司 一种内容推荐方法、计算设备及可读存储介质
CN111581492B (zh) * 2020-04-01 2024-02-23 车智互联(北京)科技有限公司 一种内容推荐方法、计算设备及可读存储介质
CN111858834B (zh) * 2020-07-30 2023-12-01 平安国际智慧城市科技股份有限公司 基于ai的案件争议焦点确定方法、装置、设备及介质
CN111858834A (zh) * 2020-07-30 2020-10-30 平安国际智慧城市科技股份有限公司 基于ai的案件争议焦点确定方法、装置、设备及介质
CN112101012A (zh) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 互动领域确定方法、装置、电子设备及存储介质
CN112101012B (zh) * 2020-09-25 2024-04-26 北京百度网讯科技有限公司 互动领域确定方法、装置、电子设备及存储介质
CN112988971A (zh) * 2021-03-15 2021-06-18 平安科技(深圳)有限公司 基于词向量的搜索方法、终端、服务器及存储介质

Also Published As

Publication number Publication date
EP3477495A4 (en) 2019-12-11
KR102170929B1 (ko) 2020-10-29
AU2017408801B2 (en) 2020-04-02
KR20190038751A (ko) 2019-04-09
CN107704503A (zh) 2018-02-16
AU2017408801A1 (en) 2019-03-14
EP3477495A1 (en) 2019-05-01
JP2019533205A (ja) 2019-11-14
US20210097238A1 (en) 2021-04-01

Similar Documents

Publication Publication Date Title
WO2019041521A1 (zh) 用户关键词提取装置、方法及计算机可读存储介质
WO2019200806A1 (zh) 文本分类模型的生成装置、方法及计算机可读存储介质
CN108287864B (zh) 一种兴趣群组划分方法、装置、介质及计算设备
US10026021B2 (en) Training image-recognition systems using a joint embedding model on online social networks
CN107609152B (zh) 用于扩展查询式的方法和装置
WO2019218514A1 (zh) 网页目标信息的提取方法、装置及存储介质
US11797620B2 (en) Expert detection in social networks
CN110334272B (zh) 基于知识图谱的智能问答方法、装置及计算机存储介质
US10083379B2 (en) Training image-recognition systems based on search queries on online social networks
JP6661790B2 (ja) テキストタイプを識別する方法、装置及びデバイス
KR20200094627A (ko) 텍스트 관련도를 확정하기 위한 방법, 장치, 기기 및 매체
WO2020000717A1 (zh) 网页分类方法、装置及计算机可读存储介质
WO2019205373A1 (zh) 相似用户查找装置、方法及计算机可读存储介质
CN110413787B (zh) 文本聚类方法、装置、终端和存储介质
JP2019519019A5 (zh)
WO2020056977A1 (zh) 知识点推送方法、装置及计算机可读存储介质
CN110275962B (zh) 用于输出信息的方法和装置
WO2021068681A1 (zh) 标签分析方法、装置及计算机可读存储介质
WO2020258481A1 (zh) 个性化文本智能推荐方法、装置及计算机可读存储介质
CN110019763B (zh) 文本过滤方法、系统、设备及计算机可读存储介质
WO2018205459A1 (zh) 获取目标用户的方法、装置、电子设备及介质
CN113626704A (zh) 基于word2vec模型的推荐信息方法、装置及设备
CN115248890A (zh) 用户兴趣画像的生成方法、装置、电子设备以及存储介质
WO2019085118A1 (zh) 基于主题模型的关联词分析方法、电子装置及存储介质
WO2018205460A1 (zh) 获取目标用户的方法、装置、电子设备及介质

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2018538141

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20187024862

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2017904351

Country of ref document: EP

Effective date: 20181008

ENP Entry into the national phase

Ref document number: 2017408801

Country of ref document: AU

Date of ref document: 20171031

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE