CN105630928B - The identification method and device of text - Google Patents
- Publication number
- CN105630928B
- Authority
- CN
- China
- Prior art keywords
- text
- words
- feature vector
- identification method
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application provides a method and a device for identifying text. The method comprises: choosing a first text to be identified; determining, according to multiple mark post (benchmark) texts, multiple characteristic fingerprints of the first text respectively; and identifying the first text according to the multiple characteristic fingerprints. In summary, by producing multiple characteristic fingerprints for the text to be identified according to multiple mark post texts, the method and device according to embodiments of the present invention increase the identifiability of the text and greatly reduce the space needed to identify it.
Description
Technical field
The present invention relates to the technical field of network information, and in particular to a method and a device for identifying text.
Background art
With the development of network technology, people can obtain a large amount of information through Internet communication platforms. Much of this information is provided in the form of text.
In order to store and identify massive amounts of text, many means of identifying text have been developed. For example, a widely known approach obtains the feature vector of a text with the TFIDF algorithm and then compresses the vector information with a min-hash algorithm to obtain the characteristic fingerprint of the text, which greatly saves the space needed for the text.
However, if two texts are similar, enough elements of the feature vector must be sampled to ensure that the characteristic fingerprints of the two texts differ, and this makes the space needed to identify a text larger.
Summary of the invention
In view of the above problems, a method and a device for identifying text are proposed, which can identify a text by means of multiple characteristic fingerprints.
According to one aspect of the present invention, a method for identifying text is provided, comprising:
Choosing a first text to be identified;
Determining, according to multiple mark post texts, multiple characteristic fingerprints of the first text respectively;
Identifying the first text according to the multiple characteristic fingerprints.
Optionally, a characteristic fingerprint is obtained in the following manner:
Obtaining a first feature vector of the first text;
Determining, according to the mark post text, the weight of each element in the first feature vector of the first text;
Obtaining the characteristic fingerprint of the first text according to the weights.
Optionally, obtaining the characteristic fingerprint of the first text according to the weights comprises:
Establishing, according to the weights and on the basis of the first feature vector, a second feature vector of the first text;
Generating the characteristic fingerprint of the first text according to the second feature vector.
Optionally, generating the characteristic fingerprint of the first text according to the second feature vector comprises:
Generating the characteristic fingerprint of the first text according to the second feature vector, based on the distance between the first text and the mark post text.
Optionally, the distance between the first text and the mark post text is determined by a min-hash operation.
Optionally, the numbers of occurrences of the elements in the second feature vector satisfy the same multiple relationship as the weights of those elements.
Optionally, obtaining the first feature vector of the first text comprises:
Arranging the words of a word sequence in descending order of their frequency of occurrence, and taking a preset quantity of words from the front as the first feature vector of the first text.
Optionally, the first text is subjected to word segmentation and then to garbage removal to form the word sequence before sorting.
Optionally, the feature vector is extracted from one or more of the following: the text title, the text summary, the text body.
According to another aspect of the present invention, a device for identifying text is provided, comprising:
A choosing module, configured to choose a first text to be identified;
A determining module, configured to determine, according to multiple mark post texts, multiple characteristic fingerprints of the first text respectively;
An identifying module, configured to identify the first text according to the multiple characteristic fingerprints.
Optionally, the determining module obtains a characteristic fingerprint in the following manner:
Obtaining a first feature vector of the first text;
Determining, according to the mark post text, the weight of each element in the first feature vector of the first text;
Obtaining the characteristic fingerprint of the first text according to the weights.
Further, the determining module obtains the characteristic fingerprint of the first text in the following manner:
Establishing, according to the weights and on the basis of the first feature vector, a second feature vector of the first text;
Generating the characteristic fingerprint of the first text according to the second feature vector.
Optionally, the determining module generates the characteristic fingerprint of the first text in the following manner:
Generating the characteristic fingerprint of the first text according to the second feature vector, based on the distance between the first text and the mark post text.
Optionally, the distance between the first text and the mark post text is determined by a min-hash operation.
Optionally, the numbers of occurrences of the elements in the second feature vector satisfy the same multiple relationship as the weights of those elements.
Optionally, the acquisition module is configured to arrange the words of a word sequence in descending order of their frequency of occurrence, and to take a preset quantity of words from the front as the first feature vector of the first text.
Optionally, the acquisition module is configured to subject the first text to word segmentation and then to garbage removal to form the word sequence before sorting.
In summary, the method and device for identifying text according to embodiments of the present invention produce multiple characteristic fingerprints for the text to be identified according to multiple mark post texts, which increases the identifiability of the text and greatly reduces the space needed to identify it.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numbers refer to the same parts. In the drawings:
Fig. 1 is a flow chart of the steps of the text identification method according to an embodiment of the present invention;
Fig. 2 is a flow chart of the steps for obtaining a characteristic fingerprint according to an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of the text identification device according to an embodiment of the present invention.
Specific embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be realized in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
Referring to Fig. 1, a flow chart of the text identification method according to one embodiment of the present invention is shown. As shown in the figure, the method comprises the following steps:
11. Choose the first text to be identified.
Once the first text to be identified has been determined, its first feature vector can be obtained.
In general, the first text is first subjected to word segmentation to obtain multiple words. The words obtained from segmentation may still include garbage. In general, the words are arranged in descending order of the frequency with which they occur in the text, and a preset quantity of words at the front is then taken as the first feature vector of the first text.
Further, the garbage occurring in the text can be removed, such as "的", "地" and "得". Garbage can be divided into punctuation marks and meaningless vocabulary such as the structural particles and function words of Chinese. Such words occur with high frequency in the text but usually carry no practical meaning, so they need to be ignored when the feature vector is produced. That is, the first text is subjected to word segmentation and then to garbage removal to form the word sequence before sorting.
Optionally, the words obtained after garbage removal can be used as the feature vector of a news text. Alternatively, representative words can be extracted from the words obtained after garbage removal to form the feature vector of the news text.
For example, for a news report web page, segmentation and garbage removal yield a word sequence S=(s1,s2,s3,...,sN), where s1, s2, s3 and so on denote the words remaining after segmentation and garbage removal.
Identical words may occur more than once in the word sequence S. Word frequency statistics can therefore be computed for the words in the sequence, the words can then be arranged in descending order of their number of occurrences, and a preset quantity of words taken from the front serves as the feature vector of the news text.
It is understood that the elements of the feature vector can be extracted from one or more of the following sources: the text title, the text summary, the text body.
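The extraction in step 11 can be illustrated with a minimal Python sketch. It assumes the text has already been segmented into words; the function name `first_feature_vector`, the `GARBAGE` set and the sample words are hypothetical and are not taken from the patent.

```python
from collections import Counter

# Hypothetical garbage list: punctuation and function words without practical meaning.
GARBAGE = {"的", "地", "得", ",", "。", "the", "of", "a"}

def first_feature_vector(words, preset_quantity, garbage=GARBAGE):
    """Return the `preset_quantity` most frequent words after garbage removal."""
    kept = [w for w in words if w not in garbage]   # word sequence S before sorting
    counts = Counter(kept)                          # word frequency statistics
    # Sort by descending frequency; ties keep first-appearance order because
    # Counter preserves insertion order and sorted() is stable.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return ranked[:preset_quantity]

# Already-segmented sample text (assumed input).
words = ["exam", "the", "school", "exam", "of", "class", "school", "exam"]
print(first_feature_vector(words, 2))  # ['exam', 'school']
```

Word segmentation itself is left out of the sketch because the source does not name a segmenter.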
12. According to multiple mark post texts, determine multiple characteristic fingerprints of the first text respectively.
The first text obtains one characteristic fingerprint for each mark post text, so as many characteristic fingerprints are obtained as there are mark post texts.
The steps by which the first text obtains a characteristic fingerprint according to a mark post text are as follows:
S121: obtain the first feature vector of the first text;
S122: determine, according to the mark post text, the weight of each element in the first feature vector of the first text;
In one embodiment of the present invention, the weight can be determined by the following method:
Term frequency TF denotes the frequency with which a given word Ti occurs in a given document Dj. The more often Ti occurs, the higher TFi, which indicates that the word is more important to the document as a whole. For example, in a document Dj about the transition from primary to junior middle school, the frequency TFi of "from-primary-to-junior-middle-school" is relatively high.
That is, the weight of each element in the feature vector is determined according to the term frequency of each word in the feature vector.
In another embodiment of the present invention, the weight can be determined by the following method:
Document frequency DF denotes the number of documents that contain a given word Ti. The more documents contain Ti, that is, the larger DFi, the less useful Ti is for distinguishing documents from one another; such a word is a non-focus word.
Inverse document frequency IDF is inversely related to document frequency DF. For example, but not exclusively, for a given word one can set IDFi=log(N/DFi), where N is the total number of documents. If a word occurs in only one document, that is, DFi is 1, then IDFi is logN, and the word is then maximally useful for distinguishing documents.
That is, the weight of each element in the feature vector is determined according to the inverse document frequency of each word in the feature vector.
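As a worked instance of the formula above, assume a hypothetical corpus of N = 1000 documents and a base-10 logarithm (the source does not fix the base). A word that occurs in a single document reaches the maximum value, and a word that occurs in every document contributes nothing:

```latex
\[
\mathrm{IDF}_i = \log\frac{N}{\mathrm{DF}_i}, \qquad
\mathrm{DF}_i = 1 \;\Rightarrow\; \mathrm{IDF}_i = \log_{10} 1000 = 3, \qquad
\mathrm{DF}_i = 1000 \;\Rightarrow\; \mathrm{IDF}_i = \log_{10} 1 = 0 .
\]
```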
In another embodiment of the present invention, the weight can be determined by the following method:
The weight of each element in the feature vector is determined according to both the term frequency and the inverse document frequency of each word in the feature vector. For example, but not exclusively, the product of TF and IDF can be used as the parameter that determines the weight of each element in the feature vector.
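A minimal Python sketch of the TF and IDF product weighting follows. It assumes that TF is the relative frequency of a word within the document and that the natural logarithm is used; the function name `tf_idf_weights` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(feature_vector, document_words, corpus):
    """Weight each feature-vector word by TF * IDF (hypothetical helper)."""
    counts = Counter(document_words)
    total = len(document_words)
    n_docs = len(corpus)
    weights = {}
    for word in feature_vector:
        tf = counts[word] / total                      # relative frequency in this document
        df = sum(1 for doc in corpus if word in doc)   # number of documents containing the word
        idf = math.log(n_docs / df) if df else 0.0     # IDF = log(N / DF), natural log assumed
        weights[word] = tf * idf
    return weights

# Toy corpus: each document is represented as the set of words it contains.
corpus = [{"exam", "school"}, {"exam", "class"}, {"exam", "district"}]
doc = ["exam", "school", "school", "exam"]
w = tf_idf_weights(["exam", "school"], doc, corpus)
print(w)  # "exam" occurs in every document, so its weight is 0.0
```

In this toy corpus "school" receives the larger weight because it occurs in only one document, while "exam" is a non-focus word.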
In a specific embodiment of the present invention, the weight can be determined by the following method:
A word may appear in different positions such as the title, the text summary or the text body; its importance differs by position, and so does how well it represents the text. Therefore, the weight of each element in the feature vector can be determined according to its position in the text. The position can include, but is not limited to, the text title, the text summary and the text body.
In one embodiment of the present invention, the weight can be determined by the following method:
The weight of each element in the feature vector is determined according to the position of the word in the text together with its term frequency and/or inverse document frequency.
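One possible way to combine position with a frequency-based weight is sketched below. The position factors in `POSITION_FACTOR` and the rule of taking the most important position are assumptions made for illustration; the source states only that title, summary and body differ in importance.

```python
# Hypothetical position factors; the source gives no concrete values.
POSITION_FACTOR = {"title": 3.0, "summary": 2.0, "body": 1.0}

def position_weighted(base_weights, positions, factors=POSITION_FACTOR):
    """Scale each word's TF/IDF-based weight by the best position it appears in."""
    result = {}
    for word, base in base_weights.items():
        # A word may occur in several places; use the most important one.
        # Words without position information are treated as body words.
        factor = max(factors[p] for p in positions.get(word, {"body"}))
        result[word] = base * factor
    return result

base = {"exam": 0.1, "school": 0.2}                          # e.g. TF or TF*IDF weights
positions = {"exam": {"title", "body"}, "school": {"body"}}  # where each word occurs
combined = position_weighted(base, positions)
print(combined)
```

Here "exam" ends up heavier than "school" despite its smaller base weight, because it also appears in the title.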
S123: obtain the characteristic fingerprint of the first text according to the weights.
Specifically, according to the weights and on the basis of the first feature vector, a second feature vector of the first text is established; the characteristic fingerprint of the first text is then generated according to the second feature vector.
That is, in the newly generated second feature vector, the number of occurrences of each element reflects the weight of that element in identifying the first text. For example, the more a word contributes to identifying the text, the larger its weight.
Optionally, the numbers of occurrences of the elements in the second feature vector satisfy the same multiple relationship as the weights of those elements.
For example, suppose the first feature vector of the first text is (from-primary-to-junior-middle-school, section, primary school, examination, put into several classes), while the mark post text relates to (kindergarten-to-primary-school, section, primary school, examination, put into several classes). Because "from-primary-to-junior-middle-school" is the element that sets the first text apart from the mark post text, its weight in the first text should be increased accordingly. If the weight of "from-primary-to-junior-middle-school" is set to 0.4, the weight of "section" to 0.2 and the other weights to 0.1, the generated second feature vector is (from-primary-to-junior-middle-school, from-primary-to-junior-middle-school, from-primary-to-junior-middle-school, from-primary-to-junior-middle-school, section, section, primary school, examination, put into several classes).
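The construction of the second feature vector in this example can be sketched in Python as follows. The helper name `second_feature_vector` and the choice of the smallest weight as the unit are assumptions; the source requires only that the element counts follow the multiple relationship of the weights.

```python
def second_feature_vector(first_vector, weights):
    """Repeat each word in proportion to its weight (hypothetical helper)."""
    smallest = min(weights[w] for w in first_vector)   # the smallest weight maps to one copy
    second = []
    for word in first_vector:
        # round() absorbs floating-point noise in the ratio.
        copies = max(1, round(weights[word] / smallest))
        second.extend([word] * copies)
    return second

first = ["from-primary-to-junior-middle-school", "section", "primary school",
         "examination", "put into several classes"]
weights = {"from-primary-to-junior-middle-school": 0.4, "section": 0.2,
           "primary school": 0.1, "examination": 0.1, "put into several classes": 0.1}
second = second_feature_vector(first, weights)
print(len(second))  # 9 elements: 4 + 2 + 1 + 1 + 1
```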
Further, the characteristic fingerprint of the first text can be generated from the second feature vector based on the distance between the first text and the mark post text. For example, the distance between the first text and the mark post text is determined by a min-hash operation. Algorithms other than the min-hash operation can also be used to obtain this distance.
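A minimal min-hash sketch in Python is given below. The source does not specify the hash family or how repeated elements are handled, so two assumptions are made here: seeded BLAKE2b hashes from the standard library stand in for the hash functions, and each repeated word is tagged with a copy index so that its multiplicity in the second feature vector influences the estimate. All function names are hypothetical.

```python
import hashlib

def _hash(item, seed):
    """Deterministic 64-bit hash of a string under a given seed."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def expand(vector):
    """Turn repeated words into distinct 'word#k' items so that multiplicity counts."""
    seen = {}
    items = []
    for word in vector:
        seen[word] = seen.get(word, 0) + 1
        items.append(f"{word}#{seen[word]}")
    return items

def minhash_signature(vector, num_hashes=128):
    """Keep the minimum hash value of the items under each of `num_hashes` seeds."""
    items = expand(vector)
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]

def minhash_distance(sig_a, sig_b):
    """Estimated Jaccard distance: the share of signature positions that differ."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)

text = ["x", "x", "y"]                       # a second feature vector (toy example)
mark_post = ["x", "y", "z"]                  # feature vector of a mark post text
d = minhash_distance(minhash_signature(text), minhash_signature(mark_post))
print(d)  # an estimate between 0.0 and 1.0
```

More hash functions give a more accurate estimate at the cost of a longer signature; the value 128 is an arbitrary choice for the sketch.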
13. According to the multiple characteristic fingerprints, identify the first text.
Through the above steps, multiple characteristic fingerprints of the first text relative to multiple different mark post texts are obtained. Identifying the first text with these multiple characteristic fingerprints increases the identifiability of the text and greatly reduces the space needed to identify it.
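Steps 12 and 13 taken together can be sketched as follows. The sketch is an illustration under stated assumptions, not the patented implementation: words absent from the mark post text are simply given double weight, an exact weighted Jaccard distance replaces the min-hash estimate to keep the example short, and the distance is quantized into `levels` buckets to form a small fingerprint. None of these choices comes from the source.

```python
from collections import Counter

def fingerprint(first_vector, mark_post_vector, levels=16):
    """One small integer per mark post text (hypothetical scheme)."""
    mark = set(mark_post_vector)
    # Assumed weighting rule: words absent from the mark post text count double.
    weighted = Counter({w: (1 if w in mark else 2) for w in first_vector})
    reference = Counter(mark_post_vector)
    shared = sum((weighted & reference).values())   # multiset intersection size
    union = sum((weighted | reference).values())    # multiset union size
    distance = 1.0 - shared / union                 # exact weighted Jaccard distance
    return min(levels - 1, int(distance * levels))  # quantize into `levels` buckets

def identify(first_vector, mark_post_vectors):
    """The identifier is the tuple of fingerprints, one per mark post text."""
    return tuple(fingerprint(first_vector, m) for m in mark_post_vectors)

text = ["a", "b", "c"]
marks = [["a", "b", "c"], ["x", "y", "z"], ["a", "x"]]
print(identify(text, marks))  # (0, 15, 13)
```

Each fingerprint alone is coarse, but the tuple locates the text relative to several mark post texts at once, which is the effect the step describes.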
Fig. 3 shows a structural schematic diagram of the text identification device according to an embodiment of the present invention.
In Fig. 3, the text identification device 30 includes a choosing module 31, a determining module 32 and an identifying module 33. The choosing module 31 is configured to choose the first text to be identified; the determining module 32 determines, according to multiple mark post texts, multiple characteristic fingerprints of the first text respectively; and the identifying module 33 identifies the first text according to the multiple characteristic fingerprints.
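The division of device 30 into three modules can be mirrored in a small Python class. The class name, the method names and the pluggable `fingerprint_fn` are hypothetical; the sketch shows only how the responsibilities of modules 31, 32 and 33 relate to one another.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Vector = Sequence[str]

@dataclass
class TextIdentificationDevice:
    """Hypothetical mirror of device 30 with modules 31, 32 and 33."""
    mark_post_vectors: Sequence[Vector]
    fingerprint_fn: Callable[[Vector, Vector], int]

    def choose(self, candidates: Sequence[Vector], index: int = 0) -> Vector:
        # Choosing module 31: pick the first text to be identified.
        return candidates[index]

    def determine(self, first_text: Vector) -> Tuple[int, ...]:
        # Determining module 32: one characteristic fingerprint per mark post text.
        return tuple(self.fingerprint_fn(first_text, m) for m in self.mark_post_vectors)

    def identify(self, first_text: Vector) -> Tuple[int, ...]:
        # Identifying module 33: the fingerprints together identify the text.
        return self.determine(first_text)

# Toy fingerprint function (an assumption): number of words shared with the mark post text.
device = TextIdentificationDevice([["a"], ["b", "c"]], lambda t, m: len(set(t) & set(m)))
result = device.identify(device.choose([["a", "b"]]))
print(result)  # (1, 1)
```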
The determining module 32 is specifically configured to:
Obtain the first feature vector of the first text; determine, according to the mark post text, the weight of each element in the first feature vector of the first text; and obtain the characteristic fingerprint of the first text according to the weights.
For example, according to the weights and on the basis of the first feature vector, a second feature vector of the first text is established; the characteristic fingerprint of the first text is then generated according to the second feature vector.
Specifically, the characteristic fingerprint of the first text is generated according to the second feature vector, based on the distance between the first text and the mark post text.
For example, the distance between the first text and the mark post text can be determined by a min-hash operation.
Further, the numbers of occurrences of the elements in the second feature vector satisfy the same multiple relationship as the weights of those elements.
The choosing module 31 is specifically configured to:
Arrange the words of a word sequence in descending order of their frequency of occurrence, and take a preset quantity of words from the front as the first feature vector of the first text. Further, the first text is subjected to word segmentation and then to garbage removal to form the word sequence before sorting.
The feature vector described here can be extracted from one or more of the following: the text title, the text summary, the text body.
As above, in one embodiment of the present invention, the weight can be determined by the following method:
Term frequency TF denotes the frequency with which a given word Ti occurs in a given document Dj. The more often Ti occurs, the higher TFi, which indicates that the word is more important to the document as a whole. For example, in a document Dj about the transition from primary to junior middle school, the frequency TFi of "from-primary-to-junior-middle-school" is relatively high.
That is, the weight of each element in the feature vector is determined according to the term frequency of each word in the feature vector.
In another embodiment of the present invention, the weight can be determined by the following method:
Document frequency DF denotes the number of documents that contain a given word Ti. The more documents contain Ti, that is, the larger DFi, the less useful Ti is for distinguishing documents from one another; such a word is a non-focus word.
Inverse document frequency IDF is inversely related to document frequency DF. For example, but not exclusively, for a given word one can set IDFi=log(N/DFi), where N is the total number of documents. If a word occurs in only one document, that is, DFi is 1, then IDFi is logN, and the word is then maximally useful for distinguishing documents.
That is, the weight of each element in the feature vector is determined according to the inverse document frequency of each word in the feature vector.
In another embodiment of the present invention, the weight can be determined by the following method:
The weight of each element in the feature vector is determined according to both the term frequency and the inverse document frequency of each word in the feature vector. For example, but not exclusively, the product of TF and IDF can be used as the parameter that determines the weight of each element in the feature vector.
In a specific embodiment of the present invention, the weight can be determined by the following method:
A word may appear in different positions such as the title, the text summary or the text body; its importance differs by position, and so does how well it represents the text. Therefore, the weight of each element in the feature vector can be determined according to its position in the text. The position can include, but is not limited to, the text title, the text summary and the text body.
In one embodiment of the present invention, the weight can be determined by the following method:
The weight of each element in the feature vector is determined according to the position of the word in the text together with its term frequency and/or inverse document frequency.
In summary, the text identification device according to embodiments of the present invention produces multiple characteristic fingerprints for the text to be identified according to multiple mark post texts, which increases the identifiability of the text and greatly reduces the space needed to identify it.
The various component embodiments of the present invention can be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) can be used in practice to realize some or all functions of some or all components of the device according to embodiments of the present invention. The present invention can also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for executing part or all of the method described here. Such a program implementing the invention can be stored on a computer-readable medium or can take the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
In the specification provided here, numerous specific details are set forth. It is understood, however, that embodiments of the present invention can be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this specification.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of multiple such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words can be interpreted as names.
Furthermore, it should be noted that the language used in this specification has been selected primarily for readability and instructional purposes, not to delineate or circumscribe the subject matter of the invention. Therefore, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. As regards the scope of the invention, the disclosure made here is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.
Claims (10)
1. A method for identifying text, comprising:
Choosing a first text to be identified;
According to multiple mark post texts, obtaining one characteristic fingerprint of the first text for each mark post text, thereby determining multiple characteristic fingerprints of the first text respectively;
Identifying the first text according to the multiple characteristic fingerprints.
2. The method for identifying text according to claim 1, wherein a characteristic fingerprint is obtained in the following manner:
Obtaining a first feature vector of the first text;
Determining, according to the mark post text, the weight of each element in the first feature vector of the first text;
Obtaining the characteristic fingerprint of the first text according to the weights.
3. The method for identifying text according to claim 2, wherein obtaining the characteristic fingerprint of the first text according to the weights comprises:
Establishing, according to the weights and on the basis of the first feature vector, a second feature vector of the first text;
Generating the characteristic fingerprint of the first text according to the second feature vector.
4. The method for identifying text according to claim 3, wherein generating the characteristic fingerprint of the first text according to the second feature vector comprises:
Generating the characteristic fingerprint of the first text according to the second feature vector, based on the distance between the first text and the mark post text.
5. The method for identifying text according to claim 4, wherein the distance between the first text and the mark post text is determined by a min-hash operation.
6. The method for identifying text according to claim 5, wherein the numbers of occurrences of the elements in the second feature vector satisfy the same multiple relationship as the weights of those elements.
7. The method for identifying text according to any one of claims 2 to 6, wherein obtaining the first feature vector of the first text comprises:
Arranging the words of a word sequence in descending order of their frequency of occurrence, and taking a preset quantity of words from the front as the first feature vector of the first text.
8. The method for identifying text according to claim 7, wherein the first text is subjected to word segmentation and then to garbage removal to form the word sequence before sorting.
9. The method for identifying text according to any one of claims 2-6 or 8, wherein the feature vector is extracted from one or more of the following: the text title, the text summary, the text body.
10. A device for identifying text, comprising:
A choosing module, configured to choose a first text to be identified;
A determining module, configured to obtain, according to multiple mark post texts, one characteristic fingerprint of the first text for each mark post text, thereby determining multiple characteristic fingerprints of the first text respectively;
An identifying module, configured to identify the first text according to the multiple characteristic fingerprints.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510974385.6A CN105630928B (en) | 2015-12-22 | 2015-12-22 | The identification method and device of text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510974385.6A CN105630928B (en) | 2015-12-22 | 2015-12-22 | The identification method and device of text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105630928A CN105630928A (en) | 2016-06-01 |
CN105630928B true CN105630928B (en) | 2019-06-21 |
Family
ID=56045861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510974385.6A Active CN105630928B (en) | 2015-12-22 | 2015-12-22 | The identification method and device of text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105630928B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109840321B (en) * | 2017-11-29 | 2022-02-01 | 腾讯科技(深圳)有限公司 | Text recommendation method and device and electronic equipment |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101409634A (en) * | 2007-10-10 | 2009-04-15 | 中国科学院自动化研究所 | Quantitative analysis tools and method for internet news influence based on information retrieval |
CN101504667A (en) * | 2009-03-20 | 2009-08-12 | 北京学之途网络科技有限公司 | Keyword confirming method and system, weight vector learning method and system |
CN102880724A (en) * | 2012-10-23 | 2013-01-16 | 盛科网络(苏州)有限公司 | Method and system for processing Hash collision |
CN103246676A (en) * | 2012-02-10 | 2013-08-14 | 富士通株式会社 | Method and device for clustering messages |
CN103324666A (en) * | 2013-05-14 | 2013-09-25 | 亿赞普(北京)科技有限公司 | Topic tracing method and device based on micro-blog data |
CN103970806A (en) * | 2013-02-05 | 2014-08-06 | 百度在线网络技术(北京)有限公司 | Method and device for establishing lyric-feelings classification models |
CN105022840A (en) * | 2015-08-18 | 2015-11-04 | 新华网股份有限公司 | News information processing method, news recommendation method and related devices |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663046A (en) * | 2012-03-29 | 2012-09-12 | 中国科学院自动化研究所 | Sentiment analysis method oriented to micro-blog short text |
-
2015
- 2015-12-22 CN CN201510974385.6A patent/CN105630928B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN105630928A (en) | 2016-06-01 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN106528532B (en) | Text error correction method, device and terminal | |
US8577155B2 (en) | System and method for duplicate text recognition | |
CN104462152B (en) | A kind of recognition methods of webpage and device | |
US20150295942A1 (en) | Method and server for performing cloud detection for malicious information | |
CN102779170B (en) | System and method for identifying text floor of webpage | |
CN103336766A (en) | Short text garbage identification and modeling method and device | |
CN106960030A (en) | Pushed information method and device based on artificial intelligence | |
CN105653984B (en) | File fingerprint method of calibration and device | |
CN106874253A (en) | Recognize the method and device of sensitive information | |
CN102306287B (en) | A kind of method and equipment for identifying a sensitive image | |
CN105630931A (en) | Document classification method and device | |
WO2019028990A1 (en) | Code element naming method, device, electronic equipment and medium | |
Li et al. | TextShield: Robust text classification based on multimodal embedding and neural machine translation | |
CN105630767A (en) | Text similarity comparison method and device | |
WO2014153457A1 (en) | Merging web page style addresses | |
CN103605691A (en) | Device and method used for processing issued contents in social network | |
CN111488732B (en) | Method, system and related equipment for detecting deformed keywords | |
CN107678968A (en) | Sample extraction method, apparatus, computing device and the storage medium of source code function | |
CN110647896A (en) | Fishing page identification method based on logo image and related equipment | |
CN105989184A (en) | Classification method and apparatus | |
CN104966109B (en) | Medical laboratory single image sorting technique and device | |
CN110020430B (en) | Malicious information identification method, device, equipment and storage medium | |
CN108388556B (en) | Method and system for mining homogeneous entity | |
CN105630928B (en) | The identification method and device of text | |
CN110704611B (en) | Illegal text recognition method and device based on feature de-interleaving | |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 2022-07-15. Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015. Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Address before: Room 112, block D, No. 28, Xinjiekou outer street, Xicheng District, Beijing 100088 (Desheng Park). Patentees before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.; Qizhi software (Beijing) Co.,Ltd. |