CN105630928B - The identification method and device of text - Google Patents

The identification method and device of text

Info

Publication number
CN105630928B
Authority
CN
China
Prior art keywords
text
words
feature vector
identification method
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510974385.6A
Other languages
Chinese (zh)
Other versions
CN105630928A (en)
Inventor
张伸正
魏少俊
陈培军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201510974385.6A
Publication of CN105630928A
Application granted
Publication of CN105630928B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a method and a device for identifying text. The method comprises: selecting a first text to be identified; determining multiple feature fingerprints of the first text, one for each of multiple benchmark texts; and identifying the first text according to the multiple feature fingerprints. In summary, the method and device according to embodiments of the invention generate multiple feature fingerprints for the text to be identified according to multiple benchmark texts, which increases the identifiability of the text and greatly reduces the space needed to identify it.

Description

The identification method and device of text
Technical field
The present invention relates to the technical field of network information, and in particular to a method for identifying text and a device for identifying text.
Background art
With the development of network technology, people can obtain a large amount of information through Internet communication platforms. Much of this information is provided in the form of text.
To store and identify massive amounts of text, many means of identifying text have been developed. For example, a widely known approach obtains the feature vector of a text with the TF-IDF algorithm and then compresses the vector information with a min-hash algorithm to obtain a feature fingerprint of the text, which greatly reduces the space the text occupies.
However, if two texts are similar, enough elements must be sampled from the feature vector to ensure that the feature fingerprints of the two texts differ, and this makes the space needed to identify a text large.
Summary of the invention
In view of the above problems, a method and a device for identifying text are proposed, which can identify a text by means of multiple feature fingerprints.
According to one aspect of the invention, a method for identifying text is provided, comprising:
selecting a first text to be identified;
determining, according to multiple benchmark texts, multiple feature fingerprints of the first text, one for each benchmark text;
identifying the first text according to the multiple feature fingerprints.
Optionally, a feature fingerprint is obtained as follows:
obtaining a first feature vector of the first text;
determining, according to the benchmark text, the weight of each element in the first feature vector of the first text;
obtaining the feature fingerprint of the first text according to the weights.
Optionally, obtaining the feature fingerprint of the first text according to the weights comprises:
establishing a second feature vector of the first text according to the weights and on the basis of the first feature vector;
generating the feature fingerprint of the first text according to the second feature vector.
Optionally, generating the feature fingerprint of the first text according to the second feature vector comprises:
generating the feature fingerprint of the first text according to the second feature vector and based on the distance between the first text and the benchmark text.
Optionally, the distance between the first text and the benchmark text is determined by a min-hash operation.
Optionally, the numbers of occurrences of the elements in the second feature vector are in the same multiple relationship as the weights of those elements.
Optionally, obtaining the first feature vector of the first text comprises:
sorting the words in a word sequence by frequency of occurrence from high to low, and taking a preset number of words from the front as the first feature vector of the first text.
Optionally, the word sequence before sorting is formed by performing word segmentation on the first text and then removing junk.
Optionally, the feature vector is extracted from one or more of the following: the text title, the text abstract, the text body.
According to another aspect of the invention, a device for identifying text is provided, comprising:
a selection module, for selecting a first text to be identified;
a determining module, for determining, according to multiple benchmark texts, multiple feature fingerprints of the first text, one for each benchmark text;
an identification module, for identifying the first text according to the multiple feature fingerprints.
Optionally, the determining module obtains a feature fingerprint as follows:
obtaining a first feature vector of the first text;
determining, according to the benchmark text, the weight of each element in the first feature vector of the first text;
obtaining the feature fingerprint of the first text according to the weights.
Further, the determining module obtains the feature fingerprint of the first text as follows:
establishing a second feature vector of the first text according to the weights and on the basis of the first feature vector;
generating the feature fingerprint of the first text according to the second feature vector.
Optionally, the determining module generates the feature fingerprint of the first text as follows:
generating the feature fingerprint of the first text according to the second feature vector and based on the distance between the first text and the benchmark text.
Optionally, the distance between the first text and the benchmark text is determined by a min-hash operation.
Optionally, the numbers of occurrences of the elements in the second feature vector are in the same multiple relationship as the weights of those elements.
Optionally, the acquisition module is used to sort the words in a word sequence by frequency of occurrence from high to low, and to take a preset number of words from the front as the first feature vector of the first text.
Optionally, the acquisition module is used to perform word segmentation on the first text and then to form the word sequence before sorting after junk removal.
In summary, the method and device for identifying text according to embodiments of the invention generate multiple feature fingerprints for the text to be identified according to multiple benchmark texts, which increases the identifiability of the text and greatly reduces the space needed to identify it.
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference signs denote the same parts. In the drawings:
Fig. 1 is a flow chart of the steps of the method for identifying text according to an embodiment of the invention;
Fig. 2 is a flow chart of the steps of obtaining a feature fingerprint according to an embodiment of the invention;
Fig. 3 is a schematic structural diagram of the device for identifying text according to an embodiment of the invention.
Detailed description of embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
Referring to Fig. 1, a flow chart of the method for identifying text according to one embodiment of the invention is shown. As shown in the figure, the method comprises the following steps:
11. Select the first text to be identified.
Once the first text to be identified has been determined, its first feature vector can be obtained.
In general, word segmentation is first performed on the first text to obtain multiple words. The words obtained after segmentation may still include junk. In general, these words are sorted by the frequency with which they occur in the text, from high to low, and a preset number of words at the front is taken as the first feature vector of the first text.
Further, the junk that occurs in the text may be removed, for example "的", "地" and "得". Junk can be divided into punctuation marks and words that carry no meaning in Chinese, such as structural particles and function words. Such words occur frequently in a text but usually have no practical meaning, so they need to be ignored when the feature vector is produced. That is, word segmentation is performed on the first text, and the word sequence before sorting is formed after junk removal.
Optionally, the words obtained after junk removal may be used as the feature vector of the news text. Alternatively, representative words may be extracted from the words obtained after junk removal to form the feature vector of the news text.
For example, for a news report web page, a word sequence S = (s1, s2, s3, ..., sN) is obtained after word segmentation and junk removal, where s1, s2, s3 and so on denote the words remaining after word segmentation and junk removal.
The same word may occur more than once in the word sequence S, so term-frequency statistics can be computed for the words in the sequence. The words are then sorted by number of occurrences from high to low, and a preset number of words is taken from the front as the feature vector of the news text.
It will be appreciated that the elements of the feature vector may be extracted from one or more of the following: the text title, the text abstract, the text body.
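The construction of the first feature vector described in step 11 can be sketched in Python as follows. This is only an illustration under stated assumptions: the text is taken to be already segmented into tokens by some external segmenter, and the function name `first_feature_vector` and the small `JUNK` set are hypothetical, not part of the patent.

```python
from collections import Counter

# Hypothetical junk list for illustration; a real list would be far larger.
JUNK = {"的", "地", "得", ",", "。"}

def first_feature_vector(tokens, k, junk=JUNK):
    """Return the k most frequent tokens that remain after junk removal."""
    kept = [t for t in tokens if t not in junk]   # junk removal
    counts = Counter(kept)                        # term-frequency statistics
    # most_common sorts by count, highest first.
    return [word for word, _ in counts.most_common(k)]

tokens = ["exam", "的", "school", "exam", "class", "school", "exam"]
print(first_feature_vector(tokens, 2))  # ['exam', 'school']
```

The preset quantity `k` controls how many of the highest-frequency words are kept.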
12. According to multiple benchmark texts, determine multiple feature fingerprints of the first text, one for each benchmark text.
The first text obtains one feature fingerprint for each benchmark text, so as many feature fingerprints are obtained as there are benchmark texts.
The first text obtains a feature fingerprint according to a benchmark text through the following steps:
S121: obtain the first feature vector of the first text;
S122: determine, according to the benchmark text, the weight of each element in the first feature vector of the first text.
In one embodiment of the invention, the weight can be determined by the following method:
The term frequency TF denotes the frequency with which a word Ti occurs in a document Dj. The more often Ti occurs, the higher TFi is, and the more important the word is to the document as a whole. For example, in a document Dj that discusses the primary-to-junior-high transition, the term frequency TFi of "primary-to-junior-high" is relatively high.
That is, the weight of each element in the feature vector is determined according to the term frequency of each word in the feature vector.
In another embodiment of the invention, the weight can be determined by the following method:
The document frequency DF denotes the number of documents that contain a word Ti. The more documents contain Ti, that is, the larger DFi is, the less useful Ti is for distinguishing documents; such a word is a non-focus word.
The inverse document frequency IDF is inversely related to the document frequency DF. For example, but not limited to, one may set IDFi = log(N / DFi) for a word, where N is the total number of documents. If a word occurs in only one document, that is, DFi is 1, then IDFi is log N, and the word is then maximally useful for distinguishing documents.
That is, the weight of each element in the feature vector is determined according to the inverse document frequency of each word in the feature vector.
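The formula IDFi = log(N / DFi) can be checked with a short Python sketch. The function name and the tiny corpus are hypothetical, and the natural logarithm is an assumption, since the text does not fix the base.

```python
import math

def inverse_document_frequency(word, documents):
    """IDF = log(N / DF), where DF counts the documents containing `word`."""
    df = sum(1 for doc in documents if word in doc)
    if df == 0:
        raise ValueError("word does not occur in any document")
    return math.log(len(documents) / df)

# Each document is represented as the set of words it contains.
docs = [{"exam", "school"}, {"exam"}, {"exam", "class"}]
common = inverse_document_frequency("exam", docs)    # DF = 3, so IDF = log(1) = 0
rare = inverse_document_frequency("school", docs)    # DF = 1, so IDF = log(3)
```

A word found in every document gets an IDF of zero, while a word found in a single document gets the maximum value log N, matching the description above.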
In yet another embodiment of the invention, the weight can be determined by the following method:
The weight of each element in the feature vector is determined according to the term frequency and the inverse document frequency of each word in the feature vector. For example, but not limited to, the product of TF and IDF can be used as the parameter that determines the weight of each element in the feature vector.
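A minimal sketch of the TF and IDF product, assuming that TF is the relative frequency of a word within the document and that IDF uses the natural logarithm. The function name `tf_idf_weights` and the sample data are illustrative and not taken from the patent.

```python
import math
from collections import Counter

def tf_idf_weights(feature_words, document_tokens, corpus):
    """Weight each feature word by TF (in this document) times IDF (over corpus)."""
    counts = Counter(document_tokens)
    total = len(document_tokens)
    weights = {}
    for word in feature_words:
        tf = counts[word] / total
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log(len(corpus) / df) if df else 0.0  # unseen word: weight 0
        weights[word] = tf * idf
    return weights

corpus = [["exam", "school"], ["exam", "class"], ["exam"]]
weights = tf_idf_weights(["exam", "school"], corpus[0], corpus)
```

Here "exam" occurs in every document and therefore receives weight 0, while "school" receives 0.5 times log 3.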
In a specific embodiment of the invention, the weight can be determined by the following method:
A word that appears in different positions, such as the title, the text abstract or the text body, differs in importance and in how well it represents the text. The weight of each element in the feature vector can therefore be determined according to the position of that element in the text; the position can include, but is not limited to, the text title, the text abstract and the text body.
In a further embodiment of the invention, the weight can be determined by the following method:
The weight of each element in the feature vector is determined according to the position of the word in the text together with its term frequency and/or inverse document frequency.
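One possible way to combine position with term frequency is to count each occurrence with a factor that depends on where it appears. The factors below (title 3.0, abstract 2.0, body 1.0) and the function name are assumptions made for illustration only; the patent does not specify values.

```python
# Assumed position factors; the patent gives no concrete numbers.
POSITION_FACTOR = {"title": 3.0, "abstract": 2.0, "body": 1.0}

def position_weighted_frequency(sections):
    """`sections` maps a position name to the tokens found at that position.

    Every occurrence of a word contributes the factor of its position,
    so the result mixes position with raw term frequency.
    """
    scores = {}
    for position, tokens in sections.items():
        factor = POSITION_FACTOR[position]
        for token in tokens:
            scores[token] = scores.get(token, 0.0) + factor
    return scores

scores = position_weighted_frequency(
    {"title": ["exam"], "body": ["exam", "school"]}
)
# "exam": 3.0 (title) + 1.0 (body) = 4.0; "school": 1.0
```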
S123: obtain the feature fingerprint of the first text according to the weights.
Specifically, a second feature vector of the first text is established according to the weights and on the basis of the first feature vector; the feature fingerprint of the first text is then generated according to the second feature vector.
That is, the number of occurrences of each element in the newly generated second feature vector reflects the weight of that element in identifying the first text. For example, the more a word contributes to identifying the text, the larger its weight.
Optionally, the numbers of occurrences of the elements in the second feature vector are in the same multiple relationship as the weights of those elements.
For example, suppose the first feature vector of the first text is (primary-to-junior-high, section, primary school, examination, class placement), while the benchmark text relates to (kindergarten-to-primary, section, primary school, examination, class placement). The weight of "primary-to-junior-high" in the first text should then be increased accordingly. If the weight of "primary-to-junior-high" is set to 0.4, the weight of "section" to 0.2 and the others to 0.1, the generated second feature vector is (primary-to-junior-high, primary-to-junior-high, primary-to-junior-high, primary-to-junior-high, section, section, primary school, examination, class placement).
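The replication in this example can be reproduced with a few lines of Python. The sketch assumes that the smallest weight serves as the unit and that the ratios are rounded to whole numbers; the patent only requires that the counts follow the multiple relationship of the weights, so both choices and the function name are hypothetical.

```python
def second_feature_vector(weights):
    """Repeat each word so that its count is proportional to its weight."""
    unit = min(weights.values())          # assumed unit: the smallest weight
    expanded = []
    for word, weight in weights.items():
        copies = round(weight / unit)     # e.g. 0.4 / 0.1 gives 4 copies
        expanded.extend([word] * copies)
    return expanded

weights = {
    "primary-to-junior-high": 0.4,
    "section": 0.2,
    "primary school": 0.1,
    "examination": 0.1,
    "class placement": 0.1,
}
vector = second_feature_vector(weights)
# Four copies of "primary-to-junior-high", two of "section", one of each other word.
```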
Further, the feature fingerprint of the first text may be generated from the second feature vector based on the distance between the first text and the benchmark text. For example, the distance between the first text and the benchmark text is determined by a min-hash operation. Algorithms other than min-hash may also be used to obtain the distance.
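A min-hash distance between two second feature vectors might be estimated as sketched below. Several details are assumptions rather than the patent's method: repeated words are made distinct by numbering them (so that repetition influences the result), the hash family is built from seeded SHA-1 digests, and 64 hash functions are used. All function names are illustrative.

```python
import hashlib

def _hash(seed, item):
    """Seeded 64-bit hash built from SHA-1 (an illustrative choice)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def expand(words):
    """Number repeated words, e.g. exam, exam becomes {'exam#0', 'exam#1'}."""
    seen = {}
    items = set()
    for word in words:
        index = seen.get(word, 0)
        items.add(f"{word}#{index}")
        seen[word] = index + 1
    return items

def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value under each of num_hashes hash functions."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def minhash_distance(sig_a, sig_b):
    """1 minus the fraction of matching positions (estimated Jaccard distance)."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return 1.0 - matches / len(sig_a)

first = minhash_signature(expand(["exam", "exam", "school"]))
bench = minhash_signature(expand(["exam", "school", "class"]))
distance = minhash_distance(first, bench)  # estimates the true Jaccard distance 0.5
```

More hash functions give a closer estimate at the cost of a longer signature.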
13. Identify the first text according to the multiple feature fingerprints.
Through the above steps, multiple feature fingerprints of the first text are obtained, one relative to each of several different benchmark texts. Identifying the first text with these multiple feature fingerprints increases the identifiability of the text and greatly reduces the space needed to identify it.
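Steps 11 to 13 taken together yield one fingerprint per benchmark text. The sketch below shows only that overall shape and simplifies heavily: it uses the exact Jaccard distance on word sets as a stand-in for the min-hash estimate, skips the weighting step, and quantizes each distance to one byte. The quantization and all names are assumptions, not details given in the patent.

```python
def jaccard_distance(a, b):
    """Exact Jaccard distance between two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def identify(text_words, benchmark_texts, levels=256):
    """Return a tuple with one quantized distance fingerprint per benchmark text."""
    return tuple(
        round(jaccard_distance(text_words, bench) * (levels - 1))
        for bench in benchmark_texts
    )

benchmarks = [["exam", "school"], ["music", "film"]]
identifier = identify(["exam", "school"], benchmarks)
# Identical to the first benchmark and disjoint from the second: (0, 255)
```

With this layout the identifier occupies one small integer per benchmark text, regardless of how long the text itself is.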
Fig. 3 shows a schematic structural diagram of the device for identifying text according to an embodiment of the invention.
In Fig. 3, the device 30 for identifying text comprises a selection module 31, a determining module 32 and an identification module 33. The selection module 31 is used to select the first text to be identified; the determining module 32 determines multiple feature fingerprints of the first text according to multiple benchmark texts; the identification module 33 identifies the first text according to the multiple feature fingerprints.
The determining module 32 is specifically configured to:
obtain the first feature vector of the first text; determine, according to the benchmark text, the weight of each element in the first feature vector of the first text; and obtain the feature fingerprint of the first text according to the weights.
For example, a second feature vector of the first text is established according to the weights and on the basis of the first feature vector, and the feature fingerprint of the first text is generated according to the second feature vector.
Specifically, the feature fingerprint of the first text is generated according to the second feature vector and based on the distance between the first text and the benchmark text.
For example, the distance between the first text and the benchmark text can be determined by a min-hash operation.
Further, the numbers of occurrences of the elements in the second feature vector are in the same multiple relationship as the weights of those elements.
The acquisition module 31 is specifically configured to:
sort the words in a word sequence by frequency of occurrence from high to low, and take a preset number of words from the front as the first feature vector of the first text. Further, word segmentation is performed on the first text, and the word sequence before sorting is formed after junk removal.
The feature vector described here can be extracted from one or more of the following: the text title, the text abstract, the text body.
As above, in one embodiment of the invention, the weight can be determined by the following method:
The term frequency TF denotes the frequency with which a word Ti occurs in a document Dj. The more often Ti occurs, the higher TFi is, and the more important the word is to the document as a whole. For example, in a document Dj that discusses the primary-to-junior-high transition, the term frequency TFi of "primary-to-junior-high" is relatively high.
That is, the weight of each element in the feature vector is determined according to the term frequency of each word in the feature vector.
In another embodiment of the invention, the weight can be determined by the following method:
The document frequency DF denotes the number of documents that contain a word Ti. The more documents contain Ti, that is, the larger DFi is, the less useful Ti is for distinguishing documents; such a word is a non-focus word.
The inverse document frequency IDF is inversely related to the document frequency DF. For example, but not limited to, one may set IDFi = log(N / DFi) for a word, where N is the total number of documents. If a word occurs in only one document, that is, DFi is 1, then IDFi is log N, and the word is then maximally useful for distinguishing documents.
That is, the weight of each element in the feature vector is determined according to the inverse document frequency of each word in the feature vector.
In yet another embodiment of the invention, the weight can be determined by the following method:
The weight of each element in the feature vector is determined according to the term frequency and the inverse document frequency of each word in the feature vector. For example, but not limited to, the product of TF and IDF can be used as the parameter that determines the weight of each element in the feature vector.
In a specific embodiment of the invention, the weight can be determined by the following method:
A word that appears in different positions, such as the title, the text abstract or the text body, differs in importance and in how well it represents the text. The weight of each element in the feature vector can therefore be determined according to the position of that element in the text; the position can include, but is not limited to, the text title, the text abstract and the text body.
In a further embodiment of the invention, the weight can be determined by the following method:
The weight of each element in the feature vector is determined according to the position of the word in the text together with its term frequency and/or inverse document frequency.
In summary, the device for identifying text according to embodiments of the invention generates multiple feature fingerprints for the text to be identified according to multiple benchmark texts, which increases the identifiability of the text and greatly reduces the space needed to identify it.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for determining news authority weight based on comments according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
Numerous specific details are set forth in the description provided here. It should be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words may be interpreted as names.
Furthermore, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the invention is illustrative and not restrictive with respect to the scope of the invention, which is defined by the appended claims.

Claims (10)

1. A method for identifying text, comprising:
selecting a first text to be identified;
according to multiple benchmark texts, obtaining one feature fingerprint of the first text for each benchmark text, thereby determining multiple feature fingerprints of the first text;
identifying the first text according to the multiple feature fingerprints.
2. The method for identifying text according to claim 1, wherein a feature fingerprint is obtained as follows:
obtaining a first feature vector of the first text;
determining, according to the benchmark text, the weight of each element in the first feature vector of the first text;
obtaining the feature fingerprint of the first text according to the weights.
3. The method for identifying text according to claim 2, wherein obtaining the feature fingerprint of the first text according to the weights comprises:
establishing a second feature vector of the first text according to the weights and on the basis of the first feature vector;
generating the feature fingerprint of the first text according to the second feature vector.
4. The method for identifying text according to claim 3, wherein generating the feature fingerprint of the first text according to the second feature vector comprises:
generating the feature fingerprint of the first text according to the second feature vector and based on the distance between the first text and the benchmark text.
5. The method for identifying text according to claim 4, wherein the distance between the first text and the benchmark text is determined by a min-hash operation.
6. The method for identifying text according to claim 5, wherein the numbers of occurrences of the elements in the second feature vector are in the same multiple relationship as the weights of those elements.
7. The method for identifying text according to any one of claims 2 to 6, wherein obtaining the first feature vector of the first text comprises:
sorting the words in a word sequence by frequency of occurrence from high to low, and taking a preset number of words from the front as the first feature vector of the first text.
8. The method for identifying text according to claim 7, wherein word segmentation is performed on the first text and the word sequence before sorting is formed after junk removal.
9. The method for identifying text according to any one of claims 2 to 6 or claim 8, wherein the feature vector is extracted from one or more of the following: the text title, the text abstract, the text body.
10. A device for identifying text, comprising:
a selection module, for selecting a first text to be identified;
a determining module, for obtaining, according to multiple benchmark texts, one feature fingerprint of the first text for each benchmark text, thereby determining multiple feature fingerprints of the first text;
an identification module, for identifying the first text according to the multiple feature fingerprints.
CN201510974385.6A 2015-12-22 2015-12-22 The identification method and device of text Active CN105630928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510974385.6A CN105630928B (en) 2015-12-22 2015-12-22 The identification method and device of text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510974385.6A CN105630928B (en) 2015-12-22 2015-12-22 The identification method and device of text

Publications (2)

Publication Number Publication Date
CN105630928A (en) 2016-06-01
CN105630928B (en) 2019-06-21

Family

ID=56045861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510974385.6A Active CN105630928B (en) 2015-12-22 2015-12-22 The identification method and device of text

Country Status (1)

Country Link
CN (1) CN105630928B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840321B (en) * 2017-11-29 2022-02-01 腾讯科技(深圳)有限公司 Text recommendation method and device and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101409634A (en) * 2007-10-10 2009-04-15 中国科学院自动化研究所 Quantitative analysis tools and method for internet news influence based on information retrieval
CN101504667A (en) * 2009-03-20 2009-08-12 北京学之途网络科技有限公司 Keyword confirming method and system, weight vector learning method and system
CN102880724A (en) * 2012-10-23 2013-01-16 盛科网络(苏州)有限公司 Method and system for processing Hash collision
CN103246676A (en) * 2012-02-10 2013-08-14 富士通株式会社 Method and device for clustering messages
CN103324666A (en) * 2013-05-14 2013-09-25 亿赞普(北京)科技有限公司 Topic tracing method and device based on micro-blog data
CN103970806A (en) * 2013-02-05 2014-08-06 百度在线网络技术(北京)有限公司 Method and device for establishing lyric-feelings classification models
CN105022840A (en) * 2015-08-18 2015-11-04 新华网股份有限公司 News information processing method, news recommendation method and related devices

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663046A (en) * 2012-03-29 2012-09-12 中国科学院自动化研究所 Sentiment analysis method oriented to micro-blog short text

Also Published As

Publication number Publication date
CN105630928A (en) 2016-06-01

Similar Documents

Publication Publication Date Title
CN106528532B (en) Text error correction method, device and terminal
US8577155B2 (en) System and method for duplicate text recognition
CN104462152B (en) A kind of recognition methods of webpage and device
US20150295942A1 (en) Method and server for performing cloud detection for malicious information
CN102779170B (en) System and method for identifying text floor of webpage
CN103336766A (en) Short text garbage identification and modeling method and device
CN106960030A (en) Pushed information method and device based on artificial intelligence
CN105653984B (en) File fingerprint method of calibration and device
CN106874253A (en) Recognize the method and device of sensitive information
CN102306287B (en) A kind of method and equipment for identifying a sensitive image
CN105630931A (en) Document classification method and device
WO2019028990A1 (en) Code element naming method, device, electronic equipment and medium
Li et al. TextShield: Robust text classification based on multimodal embedding and neural machine translation
CN105630767A (en) Text similarity comparison method and device
WO2014153457A1 (en) Merging web page style addresses
CN103605691A (en) Device and method used for processing issued contents in social network
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
CN107678968A (en) Sample extraction method, apparatus, computing device and the storage medium of source code function
CN110647896A (en) Fishing page identification method based on logo image and related equipment
CN105989184A (en) Classification method and apparatus
CN104966109B (en) Medical laboratory single image sorting technique and device
CN110020430B (en) Malicious information identification method, device, equipment and storage medium
CN108388556B (en) Method and system for mining homogeneous entity
CN105630928B (en) The identification method and device of text
CN110704611B (en) Illegal text recognition method and device based on feature de-interleaving

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220715

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: Room 112, block D, No. 28, Xinjiekou outer street, Xicheng District, Beijing 100088 (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.