WO2014108004A1 - Microblog user identification method and system - Google Patents

Microblog user identification method and system

Publication number
WO2014108004A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
feature
user behavior
information
identified
Prior art date
Application number
PCT/CN2013/088616
Other languages
English (en)
French (fr)
Inventor
赵立永
于晓明
杨建武
郑妍
Original Assignee
北大方正集团有限公司
北京大学
北京北大方正电子有限公司
Priority date
Filing date
Publication date
Application filed by 北大方正集团有限公司, 北京大学, 北京北大方正电子有限公司
Priority to US14/760,048 (US20150356091A1)
Publication of WO2014108004A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/316 User authentication by observing the pattern of computer usage, e.g. typical user behaviour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication

Definitions

  • the present invention relates to the field of computer information processing technologies, and in particular, to a microblog user identification method and system.
  • the identification process mainly relies on the data that microblog (Weibo) users register and store on the network. For example, user identification is performed by obtaining from the website the access logs, temporary information, and registration information of the user to be identified; alternatively, a Chinese text classification method is used to identify the microblog user.
  • in the prior art that identifies users through the access logs, temporary information, and registration information obtained from the website, the data on which identification relies comes mainly from the user registration information held by the website together with the user's logs and temporary information, which makes data acquisition difficult and accuracy low.
  • the object of the present invention is to provide a microblog user identification method and system with high accuracy and real-time performance.
  • the invention provides a microblog user identification method, which comprises:
  • the identity of the user to be identified is determined.
  • the invention also provides a microblog user identification system, including:
  • An information obtaining unit configured to obtain user behavior data to be identified and feature library information of user behavior
  • a preprocessing unit configured to preprocess the acquired user behavior data to be identified
  • a semantic unit reconstruction unit configured to perform semantic unit reconstruction on the pre-processed user behavior data
  • An attribute and weight information obtaining unit configured to acquire attribute information of the semantic unit and a corresponding weight thereof;
  • a behavior feature extraction unit configured to acquire the user behavior feature to be identified according to attribute information of the semantic unit and a corresponding weight thereof;
  • a comparing unit configured to compare the user behavior feature to be identified with each feature type in the feature library information of user behavior
  • the identity determining unit is configured to determine the identity of the user to be identified when the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold.
  • with the microblog user identification method and system provided by the present invention, the user behavior data to be identified and the feature library information of user behavior are obtained; the acquired user behavior data to be identified is preprocessed; semantic unit reconstruction is performed on the preprocessed user behavior data; the attribute information of the semantic units and their corresponding weights are acquired; the user behavior feature to be identified is acquired according to the attribute information of the semantic units and their corresponding weights; the user behavior feature to be identified is compared with each feature type in the feature library information of user behavior; and when the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold, the identity of the user to be identified is determined.
  • the microblog user identification method and system provided by the invention can effectively improve the accuracy and real-time performance of the microblog user identification.
  • FIG. 1 is a flowchart of a microblog user identity identification method according to an embodiment of the present invention
  • FIG. 2 is a flowchart of constructing a feature database of user behavior in a microblog user identity recognition method according to the present invention
  • FIG. 3 is a flowchart of updating the feature library of user behavior in a microblog user identification method provided by the present invention
  • FIG. 4 is a schematic structural diagram of a microblog user identity recognition system according to an embodiment of the present invention
  • FIG. 5 is a schematic structural diagram of another microblog user identity recognition system according to an embodiment of the present invention
  • FIG. 1 shows a microblog user identification method according to an embodiment of the present invention, where the method includes:
  • Step 101: Obtain user behavior data to be identified and feature library information of user behavior.
  • Step 102: Preprocess the acquired user behavior data to be identified; the preprocessing mainly includes behavior data screening, spelling correction, word segmentation, and part-of-speech tagging.
  • Step 103: Perform semantic unit reconstruction on the preprocessed user behavior data. Semantic unit reconstruction applies part-of-speech information on top of preprocessing to perform word adhesion: specific words are merged to build semantic units (word strings) that carry richer semantics.
  • Step 104: Obtain the attribute information of the semantic units and their corresponding weights. The attribute information refers to the term frequency and document frequency counted for each semantic unit; the weight of a semantic unit is computed with the TF-IDF function, which turns the user behavior features into numerical values.
  • Step 105: Acquire the user behavior feature to be identified according to the attribute information of the semantic units and their corresponding weights. The user behavior feature to be identified is the extracted feature that best represents the user's behavior, and its feature items (that is, the semantic units) discriminate well. For a single user to be identified, word weight, term frequency, and part of speech are combined: keywords are sorted by word weight and term frequency; stop words, and non-stop words longer than the maximum length or shorter than the minimum length, are filtered out according to the stop word list; and words whose part of speech is "a", "cw", "v", "j", "ns", "nr", "nf", or "nz", or that contain "不" ("not"), are selected.
  • Step 106: Compare the user behavior feature to be identified with each feature type in the feature library information of user behavior. The comparison includes user classification, which can mainly use the KNN algorithm; the value of K is selected with a probability distribution method, namely the ratio of similar feature vectors to the feature vector space.
  • the classification idea is: compute the similarity sim(u, C) between the user to be identified and each user category in the feature library information of user behavior, and the similarity sim(u, Cui) between the user and each user contained in each category. If sim(u, C) is greater than an empirical threshold, or if most sim(u, Cui) are greater than the empirical threshold, the user is considered to be related to that category, and the user category with the highest similarity is selected to determine the user identity.
  • the similarity between feature vectors is calculated with the adjusted cosine similarity measure; the specific steps are as follows:
  • (1) for each feature vector in the feature vector library, calculate its similarity with the feature vector of the user;
  • (2) align the vectors: for vectors v1 and v2, take the union C(v1, v2) of all their feature items, then map v1 and v2 onto C to obtain new vectors v1' and v2';
  • (3) calculate the similarity of v1' and v2' with the adjusted cosine similarity formula.
  • Step 107: When the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold, the identity of the user to be identified is determined.
  • before step 101, the method may further include a process of constructing the feature library of user behavior.
  • FIG. 2 shows a process for constructing the feature library of user behavior in a microblog user identification method according to an embodiment of the present invention, where the construction method includes:
  • Step 201: Acquire known user behavior data, that is, training data; the training data is used to construct the feature library of user behavior.
  • Step 202: Preprocess the acquired known user behavior data. Specifically, the training data (that is, the known user data) is labeled according to the different identities of the known users, and the microblog messages of each user with the same identity are filtered by comparing the message length with an observed value θ. Statistical analysis of a large number of microblog messages shows that messages of 10 characters or fewer carry little or no semantic information, so θ = 10 in this system; a message shorter than the observed value is filtered out as noise. Spelling errors are corrected mainly against a table of common spelling mistakes. Word segmentation and part-of-speech tagging tools are then applied; after processing, each word carries its string and its part of speech. These tools come from known technologies and are not described here.
  • Step 203: Perform semantic unit reconstruction on the preprocessed user behavior data. Because a long word string contains more semantic information than a short one and is more expressive, semantic unit reconstruction takes the processing result of step 201 and glues adjacent specific words together by specific rules, producing longer semantic strings.
  • the adjacent words handled in this step include "ns" (place name), "nr" (person name), "nf" (institution name), "nz" (proper noun), and "j" (abbreviation); the rule is to combine all words between the first occurrence and the last occurrence of a word of these types.
  • the part of speech of a glued word string is labeled "cw"; such words are more important in feature selection and weight calculation.
  • Step 204: Obtain the attribute information of the semantic units and their corresponding weights.
  • to obtain the attribute information, based on steps 201 and 202 the semantic units are numbered uniformly and a microblog-semantic unit index vector is built. Attribute information, including term frequency and document frequency, is counted per user in preparation for extracting the behavior features of a single user, and counted per group of users with the same identity in preparation for extracting the category behavior features of that identity category. The results are stored in the data structure shown in FIG. 6.
  • the weights of the semantic units are obtained as follows:
  • first, stop words are filtered out according to a stop word list commonly used in natural language processing, and semantic units whose term frequency is below an empirical threshold and whose part of speech contains neither "n" nor "cw" are filtered out.
  • second, a TF-IDF based method calculates the weight of each semantic unit and assigns a higher weight to specific types of semantic units.
  • for a person name, whose part of speech is "nr", the weighting coefficient is 2, as in formula (2): $\mathrm{weight}_2 = 2.0 \times TF \times \log_2 IDF$;
  • for a glued word, whose part of speech is "cw", the weighting coefficient is 1.5, as in formula (3): $\mathrm{weight}_3 = 1.5 \times TF \times \log_2 IDF$; all other semantic units use formula (1): $\mathrm{weight}_1 = TF \times \log_2 IDF$.
  • Step 205: Acquire the known user behavior features according to the attribute information of the semantic units and their corresponding weights, as follows:
  • for the acquired training data of known user identity, chi-square statistics, part of speech, and term frequency are combined. First, the chi-square value of each semantic unit with respect to the user category is calculated, and the semantic units are sorted by chi-square value; words of length 1 whose part of speech is not "nr" are filtered out; stop words, and non-stop words longer than the maximum length or shorter than the minimum length, are filtered out according to the stop word list; words whose part of speech is "a", "cw", "v", "j", "ns", "nr", "nf", or "nz", or that contain "不" ("not"), are selected; when none of the above can distinguish candidates, the semantic unit with the larger term frequency is chosen.
  • Step 206: Store the acquired known user behavior features in the feature library of user behavior by category.
  • FIG. 3 shows a process for updating the feature library of user behavior in a microblog user identification method according to an embodiment of the present invention, where the process includes:
  • Step 301: Acquire at least one semantic unit of the user whose identity has been determined, and the user type information corresponding to that identity.
  • Step 302: Compare the semantic units with the user type information of the user identity, and give the similarity between each semantic unit and the user type information. This step may use the chi-square statistical method: the chi-square value between a semantic unit and the user category is calculated, and relevance is evaluated by the obtained chi-square value.
  • Step 303: Sort the semantic units in descending order of similarity.
  • Step 304: Take the top-n semantic units by similarity as the behavior features of users of this type.
  • Step 305: Add the behavior features of the user to the corresponding category of the feature library of user behavior.
  • the behavior feature includes at least one semantic unit; as shown in FIG. 6, the semantic unit attribute information includes at least: an index value, a character information, a part of speech, a word frequency, and a document frequency; The semantic unit includes at least one word; the attribute information of the word includes: an index of the word, a word frequency, a document frequency, an IDF value, and a weight.
  • FIG. 4 shows a microblog user identification system according to an embodiment of the present invention; the system includes:
  • the information obtaining unit 401 is configured to obtain user behavior data to be identified and feature library information of user behavior;
  • the pre-processing unit 402 is configured to pre-process the acquired user behavior data to be identified;
  • a semantic unit reconstruction unit 403 configured to perform semantic unit reconstruction on the pre-processed user behavior data
  • An attribute and weight information obtaining unit 404 configured to acquire attribute information of the semantic unit and a corresponding weight thereof;
  • the behavior feature extraction unit 405 is configured to acquire the user behavior feature to be identified according to the attribute information of the semantic unit and the corresponding weight thereof;
  • the comparing unit 406 is configured to compare the to-be-identified user behavior feature with each feature type in the feature library information of the user behavior;
  • the identity determining unit 407 is configured to determine the identity of the user to be identified when the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold.
  • as shown in FIG. 5, the system further includes a feature library construction unit 501 for user behavior and/or an information feedback unit 502.
  • the feature library construction unit 501 is configured to: acquire known user behavior data; preprocess the acquired known user behavior data; perform semantic unit reconstruction on the preprocessed known user behavior data; obtain the attribute information of the semantic units and their corresponding weights; acquire the known user behavior features according to the attribute information of the semantic units and their corresponding weights; and store the acquired known user behavior features in the feature library of user behavior by category.
  • the information feedback unit 502 is configured to: acquire at least one semantic unit of the user whose identity has been determined, and the user type information corresponding to that identity; compare the semantic units with the user type information and give the similarity between each semantic unit and the user type information; sort the semantic units in descending order of similarity; take the top-n semantic units by similarity as the behavior features of users of this type; and add the behavior features of the user to the corresponding category of the feature library of user behavior.
  • the behavior feature described above includes at least one semantic unit; the semantic unit attribute information includes at least: an index value, character information, a part of speech, a term frequency, and a document frequency; the semantic unit includes at least one word; the attribute information of the word includes: an index of the word, a term frequency, a document frequency, an IDF value, and a weight.
  • the above pre-processing operations mainly include: behavior data screening, spelling correction, word segmentation and part-of-speech tagging.
  • with the microblog user identification method and system provided by the present invention, the user behavior data to be identified and the feature library information of user behavior are obtained; the acquired user behavior data to be identified is preprocessed; semantic unit reconstruction is performed on the preprocessed user behavior data; the attribute information of the semantic units and their corresponding weights are acquired; the user behavior feature to be identified is acquired according to the attribute information of the semantic units and their corresponding weights; the user behavior feature to be identified is compared with each feature type in the feature library information of user behavior; and when the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold, the identity of the user to be identified is determined.
  • the microblog user identification method and system provided by the invention can effectively improve the accuracy and real-time performance of the microblog user identification.
  • one or more computer readable media having computer executable instructions are also provided; when executed by a computer, the instructions perform a microblog user identification method, the method comprising: obtaining user behavior data to be identified and feature library information of user behavior; preprocessing the acquired user behavior data to be identified; performing semantic unit reconstruction on the preprocessed user behavior data; acquiring the attribute information of the semantic units and their corresponding weights; acquiring the user behavior feature to be identified according to the attribute information of the semantic units and their corresponding weights; comparing the user behavior feature to be identified with each feature type in the feature library information of user behavior; and, when the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold, determining the identity of the user to be identified.
  • a computer is also provided that includes one or more computer readable media with computer executable instructions that, when executed by a computer, perform the above described microblog user identification method.
  • Exemplary operating environment
  • a computer or computing device such as described herein, has hardware, including one or more processors or processing units, system memory, and some form of computer-readable media.
  • computer-readable media includes computer storage media and communication media.
  • Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer readable instructions, data structures, program modules, or other data.
  • Communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal, such as a carrier wave or other transmission mechanism, and include any information delivery medium. Combinations of any of the above are also included within the scope of computer readable media.
  • the computer may operate in a networked environment using logical connections to one or more remote computers. Although described in connection with an exemplary computing system environment, embodiments of the invention can be used with numerous other general purpose or special purpose computing system environments or configurations.
  • the computing system environment is not intended to suggest any limitation as to the scope of use or functionality of any aspect of the invention.
  • the computer environment should not be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment.
  • Examples of well-known computing systems, environments, and/or configurations suitable for use with aspects of the present invention include, but are not limited to: personal computers, server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, mobile phones, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
  • Computer executable instructions can be organized as software into one or more computer executable components or modules.
  • program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types. Any number of such components or modules and their organization may be utilized to implement aspects of the present invention.
  • aspects of the invention are not limited to the specific computer-executable instructions or specific components or modules illustrated in the figures and described herein.
  • Other embodiments of the invention may include other computer-executable instructions or components. Aspects of the invention may also be implemented in a distributed computing environment where tasks are performed by remote processing devices linked through a communications network.
  • in a distributed computing environment, program modules can be located in both local and remote computer storage media, including memory storage devices.
  • the methods and systems of the present invention may be implemented in a number of ways.
  • the methods and systems of the present invention can be implemented in software, hardware, firmware, or any combination of software, hardware, or firmware.
  • the above sequence of steps for the method is for illustrative purposes only, and the steps of the method of the present invention are not limited to the order described above, unless otherwise specified.
  • the invention may also be embodied as a program recorded in a recording medium, the program comprising machine readable instructions for implementing the method according to the invention.
  • the present invention therefore also covers a recording medium storing a program for executing the method according to the present invention.

Abstract

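The present invention provides a microblog user identification method and system. The method includes: obtaining user behavior data to be identified and feature library information of user behavior; preprocessing the acquired user behavior data to be identified; performing semantic unit reconstruction on the preprocessed user behavior data; obtaining the attribute information of the semantic units and their corresponding weights; obtaining the user behavior feature to be identified according to the attribute information of the semantic units and their corresponding weights; comparing the user behavior feature to be identified with each feature type in the feature library information of user behavior; and, when the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold, determining the identity of the user to be identified. The microblog user identification method and system provided by the present invention can effectively improve the accuracy and real-time performance of microblog user identification.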

Description

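Microblog user identification method and system
Technical Field
The present invention relates to the field of computer information processing technologies, and in particular to a microblog user identification method and system.
Background
With the development of web technology and the emergence of microblogs, more and more users have joined the Internet and become members of the virtual society, which has changed the way information spreads and improved the efficiency of its dissemination. Microblog user identification is an important part of microblog back-end maintenance, and the identification process mainly relies on the data that microblog users register and store on the network. For example, user identification is performed by obtaining from the website the access logs, temporary information, and registration information of the user to be identified; alternatively, a Chinese text classification method is used to identify microblog users.
However, in the existing microblog user identification process, the inventors found that the prior art has at least the following problems:
In the prior art that identifies users through the access logs, temporary information, and registration information obtained from the website, the data on which identification relies comes mainly from the user registration information held by the website together with the user's logs and temporary information, which makes data acquisition difficult and accuracy low.
Although the Chinese text classification method used in the prior art can identify microblog users, it cannot meet the current requirements for accuracy and real-time performance of microblog user identification.
Summary of the Invention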
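In view of the defects in the prior art, the object of the present invention is to provide a microblog user identification method and system with high accuracy and strong real-time performance.
The present invention provides a microblog user identification method, including:
obtaining user behavior data to be identified and feature library information of user behavior;
preprocessing the acquired user behavior data to be identified;
performing semantic unit reconstruction on the preprocessed user behavior data;
obtaining the attribute information of the semantic units and their corresponding weights;
obtaining the user behavior feature to be identified according to the attribute information of the semantic units and their corresponding weights;
comparing the user behavior feature to be identified with each feature type in the feature library information of user behavior;
when the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold, determining the identity of the user to be identified.
The present invention also provides a microblog user identification system, including:
an information obtaining unit, configured to obtain user behavior data to be identified and feature library information of user behavior;
a preprocessing unit, configured to preprocess the acquired user behavior data to be identified;
a semantic unit reconstruction unit, configured to perform semantic unit reconstruction on the preprocessed user behavior data;
an attribute and weight information obtaining unit, configured to obtain the attribute information of the semantic units and their corresponding weights;
a behavior feature extraction unit, configured to obtain the user behavior feature to be identified according to the attribute information of the semantic units and their corresponding weights;
a comparing unit, configured to compare the user behavior feature to be identified with each feature type in the feature library information of user behavior;
an identity determining unit, configured to determine the identity of the user to be identified when the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold.
With the microblog user identification method and system provided by the present invention, the user behavior data to be identified and the feature library information of user behavior are obtained; the acquired user behavior data to be identified is preprocessed; semantic unit reconstruction is performed on the preprocessed user behavior data; the attribute information of the semantic units and their corresponding weights are obtained; the user behavior feature to be identified is obtained according to the attribute information of the semantic units and their corresponding weights; the user behavior feature to be identified is compared with each feature type in the feature library information of user behavior; and when the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold, the identity of the user to be identified is determined. The microblog user identification method and system provided by the present invention can effectively improve the accuracy and real-time performance of microblog user identification.
Brief Description of the Drawings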
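FIG. 1 is a flowchart of a microblog user identification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of constructing the feature library of user behavior in a microblog user identification method provided by the present invention;
FIG. 3 is a flowchart of updating the feature library of user behavior in a microblog user identification method provided by the present invention;
FIG. 4 is a schematic structural diagram of a microblog user identification system according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of another microblog user identification system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the data structure of semantic unit attribute information in a microblog user identification method according to an embodiment of the present invention.
Detailed Description
A microblog user identification method and system provided by embodiments of the present invention are described in detail below with reference to the accompanying drawings.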
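FIG. 1 shows a microblog user identification method according to an embodiment of the present invention. The method includes:
Step 101: Obtain user behavior data to be identified and feature library information of user behavior.
Step 102: Preprocess the acquired user behavior data to be identified. The preprocessing mainly includes behavior data screening, spelling correction, word segmentation, and part-of-speech tagging.
Step 103: Perform semantic unit reconstruction on the preprocessed user behavior data. Semantic unit reconstruction applies part-of-speech information on top of preprocessing to perform word adhesion: specific words are merged to build semantic units (word strings) that carry richer semantics.
Step 104: Obtain the attribute information of the semantic units and their corresponding weights. The attribute information refers to the term frequency and document frequency counted for each semantic unit; the weight of a semantic unit is computed with the TF-IDF function, which turns the user behavior features into numerical values.
Step 105: Obtain the user behavior feature to be identified according to the attribute information of the semantic units and their corresponding weights. The user behavior feature to be identified is the extracted feature that best represents the user's behavior, and its feature items (that is, the semantic units) discriminate well. For a single user to be identified, word weight, term frequency, and part of speech are combined: keywords are sorted by word weight and term frequency; stop words, and non-stop words longer than the maximum length or shorter than the minimum length, are filtered out according to the stop word list; and words whose part of speech is "a", "cw", "v", "j", "ns", "nr", "nf", or "nz", or that contain "不" ("not"), are selected.
Step 106: Compare the user behavior feature to be identified with each feature type in the feature library information of user behavior. The comparison includes user classification, which can mainly use the KNN algorithm; the value of K is selected with a probability distribution method, namely the ratio of similar feature vectors to the feature vector space.
The classification idea is: compute the similarity sim(u,C) between the user to be identified and each user category in the feature library information of user behavior, and the similarity sim(u,Cui) between the user and each user contained in each category. If sim(u,C) is greater than an empirical threshold, or if most sim(u,Cui) are greater than the empirical threshold, the user is considered to be related to that category, and the user category with the highest similarity is selected to determine the user identity.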
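The similarity between feature vectors is calculated with the adjusted cosine similarity measure. The specific steps are as follows:
(1) For each feature vector in the feature vector library, calculate its similarity with the feature vector of the user;
(2) Align the vectors: for vectors v1 and v2, take the union C(v1, v2) of all their feature items, then map v1 and v2 onto C to obtain new vectors v1' and v2';
(3) Calculate the similarity of v1' and v2' with the adjusted cosine similarity formula.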
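The alignment and similarity steps above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the source does not give the adjusted cosine formula, so plain cosine similarity on the aligned vectors is used here as a stand-in, and the names `align` and `cosine` as well as the dict-based sparse vector layout are assumptions.

```python
import math


def align(v1, v2):
    """Map two sparse vectors (feature item -> weight) onto the union of their feature items."""
    items = sorted(set(v1) | set(v2))
    return [v1.get(i, 0.0) for i in items], [v2.get(i, 0.0) for i in items]


def cosine(v1, v2):
    """Plain cosine similarity of the aligned vectors (stand-in for the adjusted variant)."""
    a, b = align(v1, v2)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # An all-zero vector has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0
```

Feature items missing from one vector get weight 0 after alignment, so two users who share no semantic units end up with similarity 0.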
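Step 107: When the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold, the identity of the user to be identified is determined.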
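One way to read the classification rule of steps 106 and 107 is sketched below. It is an illustration under assumptions: the library layout (`features` for the category vector, `members` for the vectors of known users), the function names, the default threshold of 0.5, and the reading of "most" as a strict majority are all hypothetical, and plain cosine similarity stands in for the adjusted cosine measure named in the text.

```python
import math


def cosine(v1, v2):
    """Cosine similarity of two sparse vectors (feature item -> weight)."""
    items = set(v1) | set(v2)
    dot = sum(v1.get(i, 0.0) * v2.get(i, 0.0) for i in items)
    norm = math.sqrt(sum(x * x for x in v1.values())) * math.sqrt(sum(y * y for y in v2.values()))
    return dot / norm if norm else 0.0


def classify(user, library, threshold=0.5):
    """Return the best related category, or None if no category is related.

    library: {category: {"features": vector, "members": [vector, ...]}} (assumed layout).
    """
    best_category, best_sim = None, -1.0
    for category, entry in library.items():
        sim_c = cosine(user, entry["features"])                 # sim(u, C)
        member_sims = [cosine(user, m) for m in entry["members"]]  # sim(u, Cui)
        majority = 2 * sum(s > threshold for s in member_sims) > len(member_sims)
        # Related if sim(u, C) or most sim(u, Cui) exceed the threshold;
        # among related categories keep the one with the highest sim(u, C).
        if (sim_c > threshold or majority) and sim_c > best_sim:
            best_category, best_sim = category, sim_c
    return best_category
```

Returning `None` when no category passes the threshold keeps the "identity is determined" outcome tied to the threshold test rather than to whichever category happens to score highest.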
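In one implementation of the above microblog user identification method according to an embodiment of the present invention, before step 101 of obtaining the user behavior data to be identified and the feature library information of user behavior, the method may further include a process of constructing the feature library of user behavior. FIG. 2 shows a process for constructing the feature library of user behavior in a microblog user identification method according to an embodiment of the present invention. The construction method includes:
Step 201: Acquire known user behavior data, that is, training data; the training data is used to construct the feature library of user behavior.
Step 202: Preprocess the acquired known user behavior data. Specifically, the training data (that is, the known user data) is labeled according to the different identities of the known users, and the microblog messages of each user with the same identity are filtered by comparing the message length with an observed value θ. Statistical analysis of a large number of microblog messages shows that messages of 10 characters or fewer carry little or no semantic information, so θ = 10 in this system; a message shorter than the observed value is filtered out as noise. Spelling errors are corrected mainly against a table of common spelling mistakes. Word segmentation and part-of-speech tagging tools are then applied; after processing, each word carries its string and its part of speech. These tools come from known technologies and are not described here.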
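The length filter in step 202 is simple enough to show directly. The sketch below keeps only messages whose character count is at least the observed value θ = 10 given in the text; the function name `filter_messages` and the choice to count raw characters without trimming whitespace are assumptions.

```python
THETA = 10  # observed value from the text: shorter messages are treated as noise


def filter_messages(messages, theta=THETA):
    """Drop microblog messages whose length is less than theta characters."""
    return [m for m in messages if len(m) >= theta]
```

Python's `len` counts Unicode code points, so a Chinese message is measured in characters rather than bytes.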
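Step 203: Perform semantic unit reconstruction on the preprocessed user behavior data. Because a long word string contains more semantic information than a short one and is more expressive, semantic unit reconstruction takes the processing result of step 201 and glues adjacent specific words together by specific rules, producing longer semantic strings. The adjacent words handled in this step include "ns" (place name), "nr" (person name), "nf" (institution name), "nz" (proper noun), and "j" (abbreviation); the rule is to combine all words between the first occurrence and the last occurrence of a word of these types. The part of speech of a glued word string is labeled "cw"; such words are more important in feature selection and weight calculation.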
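A hedged sketch of the word adhesion in step 203 follows. It adopts one possible reading of the rule: each run of adjacent tokens tagged "ns", "nr", "nf", "nz", or "j" is merged into a single string tagged "cw". The `(word, pos)` token layout and the function name `reconstruct` are assumptions, and the patent's exact merging rule may differ.

```python
ADHESIVE_POS = {"ns", "nr", "nf", "nz", "j"}  # tags listed in the text


def reconstruct(tokens):
    """Merge runs of adjacent adhesive-tagged words into one 'cw' semantic unit.

    tokens: list of (word, pos) pairs from segmentation and tagging.
    """
    units, run = [], []

    def flush():
        if len(run) > 1:
            # Two or more adjacent specific words become one glued string.
            units.append(("".join(word for word, _ in run), "cw"))
        elif run:
            units.append(run[0])
        run.clear()

    for word, pos in tokens:
        if pos in ADHESIVE_POS:
            run.append((word, pos))
        else:
            flush()
            units.append((word, pos))
    flush()
    return units
```

For example, a place name followed by an institution name is emitted as one "cw" unit, while an isolated person name keeps its original tag.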
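Step 204: Obtain the attribute information of the semantic units and their corresponding weights.
The attribute information of the semantic units is obtained as follows: based on steps 201 and 202, the semantic units are numbered uniformly and a microblog-semantic unit index vector is built. Attribute information, including term frequency and document frequency, is counted per user in preparation for extracting the behavior features of a single user, and counted per group of users with the same identity in preparation for extracting the category behavior features of that identity category. The results are stored in the data structure shown in FIG. 6.
The weights of the semantic units are obtained as follows:
First, stop words are filtered out according to a stop word list commonly used in natural language processing, and semantic units whose term frequency is below an empirical threshold and whose part of speech contains neither "n" nor "cw" are filtered out. Second, a TF-IDF based method calculates the weight of each semantic unit and assigns a higher weight to specific types of semantic units: for a person name, whose part of speech is "nr", the weighting coefficient is 2, as in formula (2); for a glued word, whose part of speech is "cw", the weighting coefficient is 1.5, as in formula (3). The weight formulas are:
$\mathrm{weight}_1 = TF \times \log_2 IDF$ (1)
$\mathrm{weight}_2 = 2.0 \times TF \times \log_2 IDF$ (2)
$\mathrm{weight}_3 = 1.5 \times TF \times \log_2 IDF$ (3)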
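Formulas (1) to (3) differ only in the coefficient applied to TF × log2(IDF), which the sketch below expresses as a lookup on the part-of-speech tag. The function name `weight` and the `BOOST` table are assumptions; the source does not say how the IDF value itself is computed, so it is taken here as a given positive number.

```python
import math

# Coefficients from formulas (2) and (3); every other tag uses 1.0 as in formula (1).
BOOST = {"nr": 2.0, "cw": 1.5}


def weight(tf, idf, pos):
    """Boosted TF-IDF weight of a semantic unit: coefficient * TF * log2(IDF)."""
    return BOOST.get(pos, 1.0) * tf * math.log2(idf)
```

With TF = 3 and IDF = 8, an ordinary noun gets 3 × 3 = 9, a person name 18, and a glued word 13.5.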
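Step 205: Obtain the known user behavior features according to the attribute information of the semantic units and their corresponding weights, as follows:
For the acquired training data of known user identity, chi-square statistics, part of speech, and term frequency are combined. First, the chi-square value of each semantic unit with respect to the user category is calculated, and the semantic units are sorted by chi-square value; words of length 1 whose part of speech is not "nr" are filtered out; stop words, and non-stop words longer than the maximum length or shorter than the minimum length, are filtered out according to the stop word list; words whose part of speech is "a", "cw", "v", "j", "ns", "nr", "nf", or "nz", or that contain "不" ("not"), are selected; when none of the above can distinguish candidates, the semantic unit with the larger term frequency is chosen.
To control the dimensionality of the features during classification, an upper limit is set on the number of selected semantic units (= 2).
Step 206: Store the acquired known user behavior features in the feature library of user behavior by category.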
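The chi-square ranking in step 205 can be illustrated as follows. The source does not spell out the statistic, so this sketch assumes the standard 2×2 contingency form used in text classification; the function names, the representation of each document as a set of semantic units, and the restriction to units that actually occur in the target category are all assumptions, and the part-of-speech and length filters of step 205 are left out.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table.

    a: docs in the category containing the unit, b: docs outside it containing the unit,
    c: docs in the category without the unit,    d: docs outside it without the unit.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, labels, category, top_n):
    """Rank the semantic units seen in `category` by chi-square and keep the top_n."""
    in_cat = sum(1 for label in labels if label == category)
    out_cat = len(labels) - in_cat
    scores = {}
    for unit in set().union(*docs):
        a = sum(1 for doc, label in zip(docs, labels) if label == category and unit in doc)
        b = sum(1 for doc, label in zip(docs, labels) if label != category and unit in doc)
        if a == 0:
            continue  # assumption: only units that occur in the category are candidates
        scores[unit] = chi_square(a, b, in_cat - a, out_cat - b)
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))[:top_n]
```

The `top_n` argument plays the role of the upper limit on selected semantic units mentioned in the text. The candidate restriction matters because chi-square is symmetric: a unit that never appears in a category scores as high as one that appears only there.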
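In one implementation of the microblog user identification method according to an embodiment of the present invention shown in FIG. 1, after step 107 of determining the identity of the user to be identified, the method may further include a process of updating the feature library of user behavior. FIG. 3 shows a process for updating the feature library of user behavior in a microblog user identification method according to an embodiment of the present invention. The process includes:
Step 301: Acquire at least one semantic unit of the user whose identity has been determined, and the user type information corresponding to that identity.
Step 302: Compare the semantic units with the user type information of the user identity, and give the similarity between each semantic unit and the user type information. This step may use the chi-square statistical method: the chi-square value between a semantic unit and the user category is calculated, and relevance is evaluated by the obtained chi-square value.
Step 303: Sort the semantic units in descending order of similarity.
Step 304: Take the top-n semantic units by similarity as the behavior features of users of this type.
Step 305: Add the behavior features of the user to the corresponding category of the feature library of user behavior.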
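Steps 303 to 305 amount to a sort, a cut at n, and an append. The sketch below assumes the relevance scores from step 302 are already available as a dict and that the feature library is a plain mapping from category to a list of semantic units; both the layout and the name `update_library` are hypothetical.

```python
def update_library(library, category, unit_scores, n):
    """Add the top-n semantic units (by score) of an identified user to its category.

    library: {category: [semantic unit, ...]}; unit_scores: {semantic unit: relevance score}.
    """
    # Step 303: sort by descending score (ties broken alphabetically).
    ranked = sorted(unit_scores, key=lambda u: (-unit_scores[u], u))
    features = library.setdefault(category, [])
    # Steps 304 and 305: keep the top n and add those not already present.
    for unit in ranked[:n]:
        if unit not in features:
            features.append(unit)
    return library
```

Skipping units that are already stored keeps repeated feedback from the same category from inflating the library with duplicates.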
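It should be noted that in the embodiments described above, the behavior feature includes at least one semantic unit. As shown in FIG. 6, the semantic unit attribute information includes at least: an index value, character information, a part of speech, a term frequency, and a document frequency. The semantic unit includes at least one word, and the attribute information of a word includes: the index of the word, a term frequency, a document frequency, an IDF value, and a weight.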
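The attribute lists above map naturally onto two small record types. FIG. 6 itself is not reproduced in the text, so the following is only a sketch of what such a structure could look like; the class and field names are assumptions chosen to mirror the listed attributes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    """Attributes of one word inside a semantic unit."""
    index: int
    term_frequency: int
    document_frequency: int
    idf: float
    weight: float


@dataclass
class SemanticUnit:
    """Attributes of a semantic unit; it contains at least one word."""
    index: int
    text: str                 # character information
    pos: str                  # part of speech, e.g. "cw" for a glued word string
    term_frequency: int
    document_frequency: int
    words: List[Word] = field(default_factory=list)
```

A glued unit such as a place name plus an institution name would then hold two `Word` entries while carrying its own unit-level frequencies.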
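The preprocessing step mainly includes: behavior data screening, spelling correction, word segmentation, and part-of-speech tagging.
FIG. 4 shows a microblog user identification system according to an embodiment of the present invention. The system includes:
an information obtaining unit 401, configured to obtain user behavior data to be identified and feature library information of user behavior;
a preprocessing unit 402, configured to preprocess the acquired user behavior data to be identified;
a semantic unit reconstruction unit 403, configured to perform semantic unit reconstruction on the preprocessed user behavior data;
an attribute and weight information obtaining unit 404, configured to obtain the attribute information of the semantic units and their corresponding weights;
a behavior feature extraction unit 405, configured to obtain the user behavior feature to be identified according to the attribute information of the semantic units and their corresponding weights;
a comparing unit 406, configured to compare the user behavior feature to be identified with each feature type in the feature library information of user behavior;
an identity determining unit 407, configured to determine the identity of the user to be identified when the similarity between the user behavior feature to be identified and one feature type in the feature library information of user behavior exceeds a preset threshold.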
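It should be noted that, as shown in Fig. 5, the system further includes a user behavior feature library construction unit 501 and/or an information feedback unit 502.
The user behavior feature library construction unit 501 is configured to: acquire known-user behavior data; preprocess the acquired known-user behavior data; perform semantic unit reconstruction on the preprocessed known-user behavior data; obtain the attribute information of the semantic units and their corresponding weights; obtain the known-user behavior features according to the attribute information of the semantic units and their corresponding weights; and store the acquired known-user behavior features in the user behavior feature library by category.
The information feedback unit 502 is configured to: obtain at least one semantic unit of the user to be identified whose identity has been determined, together with the user type information corresponding to that user identity; compare the semantic units with the user type information of the user identity, and give the similarity between each semantic unit and the user type information of the user identity; sort the semantic units in descending order of similarity; take the top-n semantic units by similarity as the behavior features of this type of user; and add the user's behavior features to the corresponding category of the user behavior feature library.
A behavior feature as described above includes at least one semantic unit. The attribute information of a semantic unit includes at least: index value, character information, part of speech, term frequency, and document frequency. A semantic unit includes at least one word, and the attribute information of a word includes: word index, term frequency, document frequency, IDF value, and weight.
The preprocessing operation described above mainly includes: behavior data filtering, spelling correction, word segmentation, and part-of-speech tagging.
With the microblog user identity identification method and system provided by the present invention, behavior data of the user to be identified and user behavior feature library information are acquired; the acquired behavior data of the user to be identified is preprocessed; semantic unit reconstruction is performed on the preprocessed user behavior data; the attribute information of the semantic units and their corresponding weights are obtained; the behavior features of the user to be identified are obtained according to the attribute information of the semantic units and their corresponding weights; the behavior features of the user to be identified are compared with each feature type in the user behavior feature library information; and when the similarity between the behavior features of the user to be identified and one feature type in the user behavior feature library information exceeds a preset threshold, the identity of the user to be identified is determined. The microblog user identity identification method and system provided by the present invention can effectively improve the accuracy and real-time performance of microblog user identity identification.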
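The present disclosure further provides one or more computer-readable media having computer-executable instructions which, when executed by a computer, perform a microblog user identity identification method, the method including: acquiring behavior data of a user to be identified and user behavior feature library information; preprocessing the acquired behavior data of the user to be identified; performing semantic unit reconstruction on the preprocessed user behavior data; obtaining the attribute information of the semantic units and their corresponding weights; obtaining the behavior features of the user to be identified according to the attribute information of the semantic units and their corresponding weights; comparing the behavior features of the user to be identified with each feature type in the user behavior feature library information; and, when the similarity between the behavior features of the user to be identified and one feature type in the user behavior feature library information exceeds a preset threshold, determining the identity of the user to be identified.
The present disclosure further provides a computer including one or more computer-readable media with computer-executable instructions which, when executed by the computer, perform the microblog user identity identification method described above.
Exemplary Operating Environment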
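A computer or computing device such as described herein has hardware, including one or more processors or processing units, system memory, and some form of computer-readable media. By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
The computer may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer. Although described in connection with an exemplary computing system environment, embodiments of the present invention are operational with numerous other general-purpose or special-purpose computing system environments or configurations. The computing system environment is not intended to suggest any limitation as to the scope of use or functionality of any aspect of the present invention. Moreover, the computing environment should not be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment. Examples of well-known computing systems, environments, and/or configurations suitable for use with aspects of the present invention include, but are not limited to: personal computers, server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, mobile phones, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Embodiments of the present invention may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. The computer-executable instructions may be organized as software into one or more computer-executable components or modules. Generally, program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types. Aspects of the present invention may be implemented with any number and organization of such components or modules. For example, aspects of the present invention are not limited to the specific computer-executable instructions or the specific components or modules illustrated in the figures and described herein. Other embodiments of the present invention may include different computer-executable instructions or components. Aspects of the present invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
The method and system of the present invention may be implemented in many ways. For example, they may be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the method is for illustration only, and the steps of the method of the present invention are not limited to the order specifically described above unless otherwise stated. In addition, in some embodiments, the present invention may also be implemented as programs recorded on a recording medium, the programs including machine-readable instructions for implementing the method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.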
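A person of ordinary skill in the art will understand that all or part of the steps of the method in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may implement the steps of the microblog user identity identification method discussed above. The storage medium is, for example, ROM/RAM, a magnetic disk, or an optical disc.
The above are only specific embodiments of the present invention, and the scope of protection of the present invention is not limited to them. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.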

Claims

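1. A microblog user identity identification method, characterized by comprising:
acquiring behavior data of a user to be identified and user behavior feature library information;
preprocessing the acquired behavior data of the user to be identified;
performing semantic unit reconstruction on the preprocessed user behavior data;
obtaining attribute information of the semantic units and their corresponding weights;
obtaining behavior features of the user to be identified according to the attribute information of the semantic units and their corresponding weights;
comparing the behavior features of the user to be identified with each feature type in the user behavior feature library information;
when the similarity between the behavior features of the user to be identified and one feature type in the user behavior feature library information exceeds a preset threshold, determining the identity of the user to be identified.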
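2. The microblog user identity identification method according to claim 1, characterized in that, before the step of acquiring the behavior data of the user to be identified and the user behavior feature library information, the method further comprises:
acquiring known-user behavior data;
preprocessing the acquired known-user behavior data;
performing semantic unit reconstruction on the preprocessed known-user behavior data;
obtaining attribute information of the semantic units and their corresponding weights;
obtaining the known-user behavior features according to the attribute information of the semantic units and their corresponding weights;
storing the acquired known-user behavior features in the user behavior feature library by category.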
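3. The microblog user identity identification method according to claim 1 or 2, characterized in that, after the identity of the user to be identified is determined, the method further comprises:
obtaining at least one semantic unit of the user to be identified whose identity has been determined, together with the user type information corresponding to the user identity;
comparing the semantic units with the user type information of the user identity, and giving the similarity between each semantic unit and the user type information of the user identity;
sorting the semantic units in descending order of similarity;
taking the top-n semantic units by similarity as the behavior features of this type of user;
adding the user's behavior features to the corresponding category of the user behavior feature library.
4. The microblog user identity identification method according to claim 3, characterized in that the behavior features include at least one semantic unit; the attribute information of the semantic unit includes at least: index value, character information, part of speech, term frequency, and document frequency; the semantic unit includes at least one word; and the attribute information of the word includes: word index, term frequency, document frequency, IDF value, and weight.
5. The microblog user identity identification method according to claim 4, characterized in that the preprocessing step includes: behavior data filtering, spelling correction, word segmentation, and part-of-speech tagging.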
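6. A microblog user identity identification system, characterized by comprising:
an information acquisition unit, configured to acquire behavior data of a user to be identified and user behavior feature library information;
a preprocessing unit, configured to preprocess the acquired behavior data of the user to be identified;
a semantic unit reconstruction unit, configured to perform semantic unit reconstruction on the preprocessed user behavior data;
an attribute and weight information acquisition unit, configured to obtain attribute information of the semantic units and their corresponding weights;
a behavior feature extraction unit, configured to obtain behavior features of the user to be identified according to the attribute information of the semantic units and their corresponding weights;
a comparison unit, configured to compare the behavior features of the user to be identified with each feature type in the user behavior feature library information;
an identity determination unit, configured to determine the identity of the user to be identified when the similarity between the behavior features of the user to be identified and one feature type in the user behavior feature library information exceeds a preset threshold.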
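7. The microblog user identity identification system according to claim 6, characterized in that the system further comprises: a user behavior feature library construction unit, configured to acquire known-user behavior data; preprocess the acquired known-user behavior data; perform semantic unit reconstruction on the preprocessed known-user behavior data; obtain attribute information of the semantic units and their corresponding weights; obtain the known-user behavior features according to the attribute information of the semantic units and their corresponding weights; and store the acquired known-user behavior features in the user behavior feature library by category.
8. The microblog user identity identification system according to claim 6 or 7, characterized in that the system further comprises: an information feedback unit, configured to obtain at least one semantic unit of the user to be identified whose identity has been determined, together with the user type information corresponding to the user identity; compare the semantic units with the user type information of the user identity, and give the similarity between each semantic unit and the user type information of the user identity; sort the semantic units in descending order of similarity; take the top-n semantic units by similarity as the behavior features of this type of user; and add the user's behavior features to the corresponding category of the user behavior feature library.
9. The microblog user identity identification system according to claim 8, characterized in that the behavior features include at least one semantic unit; the attribute information of the semantic unit includes at least: index value, character information, part of speech, term frequency, and document frequency; the semantic unit includes at least one word; and the attribute information of the word includes: word index, term frequency, document frequency, IDF value, and weight.
10. The microblog user identity identification system according to claim 9, characterized in that the preprocessing includes: behavior data filtering, spelling correction, word segmentation, and part-of-speech tagging.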
PCT/CN2013/088616 2013-01-09 2013-12-05 Method and system for identifying microblog user identity WO2014108004A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/760,048 US20150356091A1 (en) 2013-01-09 2013-12-05 Method and system for identifying microblog user identity

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310008156.X 2013-01-09
CN201310008156.XA CN103914494B (zh) 2013-01-09 2013-01-09 Method and system for identifying microblog user identity

Publications (1)

Publication Number Publication Date
WO2014108004A1 true WO2014108004A1 (zh) 2014-07-17

Family

ID=51040184

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/088616 WO2014108004A1 (zh) 2013-01-09 2013-12-05 Method and system for identifying microblog user identity

Country Status (3)

Country Link
US (1) US20150356091A1 (zh)
CN (1) CN103914494B (zh)
WO (1) WO2014108004A1 (zh)

Also Published As

Publication number Publication date
US20150356091A1 (en) 2015-12-10
CN103914494A (zh) 2014-07-09
CN103914494B (zh) 2017-05-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13871143

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 14760048

Country of ref document: US

122 Ep: pct application non-entry in european phase

Ref document number: 13871143

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC ( EPO FORM 1205A DATED 26/01/2016 )
