CN110990577A - Text classification method and device - Google Patents

Info

Publication number
CN110990577A
Authority
CN
China
Prior art keywords
text
word list
classified
similarity
constructed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911360415.9A
Other languages
Chinese (zh)
Inventor
孙宇浩
孙龙超
唐劭
张斌
龚平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Asiainfo Data Co ltd
Original Assignee
Beijing Asiainfo Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Asiainfo Data Co ltd
Priority to CN201911360415.9A
Publication of CN110990577A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a text classification method and device. The text classification method comprises: calculating the similarity between a text to be classified and each of at least one pre-constructed word list; and judging whether the calculated similarity is greater than a preset threshold; if so, classifying the text to be classified into the pre-constructed word list for which the similarity is greater than the preset threshold; otherwise, not classifying the text to be classified into any pre-constructed word list. The method and device remove the need for extensive manual item-by-item screening to obtain two or more specific word lists from disordered mass data, which saves manpower and material resources, improves working efficiency, allows various word lists to be constructed effectively, and reduces manual errors.

Description

Text classification method and device
Technical Field
The present disclosure relates to the field of computers, and in particular, to a text classification method and apparatus.
Background
In actual production projects, two or more types of word lists often need to be constructed. Before the word lists are constructed, the mass data obtained through a web crawler is highly disordered, with two or more types of text mixed together. In the prior art, the required data is screened from the mass text manually, item by item, and added to the respective word lists, which requires a large amount of manpower and material resources and is inefficient.
Disclosure of Invention
To solve or at least alleviate at least one of the above technical problems, the present disclosure provides a text classification method and apparatus.
In a first aspect, the present disclosure provides a text classification method, including:
calculating the similarity between a text to be classified and each of at least one pre-constructed word list;
judging whether the calculated similarity is greater than a preset threshold; if so, classifying the text to be classified into the pre-constructed word list for which the similarity is greater than the preset threshold; otherwise, not classifying the text to be classified into any pre-constructed word list.
Optionally, each pre-constructed word list contains at least one predefined text belonging to the same class.
Optionally, the text classification method further includes:
updating the pre-constructed word list after the text to be classified has been classified into the pre-constructed word list for which the similarity is greater than the preset threshold.
In a second aspect, the present disclosure provides a text classification apparatus comprising: a calculation module and a judgment module, wherein,
the calculation module is used for calculating the similarity between a text to be classified and each of at least one pre-constructed word list;
the judgment module is used for judging whether the calculated similarity is greater than a preset threshold; if so, classifying the text to be classified into the pre-constructed word list for which the similarity is greater than the preset threshold; otherwise, not classifying the text to be classified into any pre-constructed word list.
Optionally, each pre-constructed word list contains at least one predefined text belonging to the same class.
Optionally, the text classification apparatus further includes an updating module, which is used for updating the pre-constructed word list after the text to be classified has been classified into the pre-constructed word list for which the similarity is greater than the preset threshold.
In a third aspect, the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the text classification method of any one of claims 1 to 3.
In a fourth aspect, the present disclosure provides a computing device comprising a memory storing a computer program and a processor implementing the text classification method of any one of claims 1 to 3 when executing the computer program.
Compared with the prior art, the method has the following beneficial effects:
The method and device remove the need for extensive manual item-by-item screening to obtain two or more specific word lists from disordered mass data, which saves manpower and material resources, improves working efficiency, allows various word lists to be constructed effectively, and reduces manual errors.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a schematic flow chart of a text classification method provided by the present disclosure;
FIG. 2 is a schematic flow chart diagram of another text classification method provided by the present disclosure;
FIG. 3 is a structural block diagram of a text classification apparatus provided by the present disclosure.
Detailed Description
The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
As shown in FIG. 1, the present disclosure provides a text classification method, including:
calculating the similarity between a text to be classified and each of at least one pre-constructed word list;
judging whether the calculated similarity is greater than a preset threshold; if so, classifying the text to be classified into the pre-constructed word list for which the similarity is greater than the preset threshold; otherwise, not classifying the text to be classified into any pre-constructed word list.
In one embodiment of the present disclosure, each pre-constructed word list includes at least one predefined text belonging to the same class.
In this embodiment, each text is segmented into words, word vectors are generated from the segmentation result, and matrix calculation is then performed on the word vectors to obtain the similarity between the text to be classified and the texts in the pre-constructed word list. A text to be classified whose similarity is greater than the preset threshold is classified into the corresponding pre-constructed word list. For example, suppose pre-constructed word list 1 contains text a, text b, and text c, and the texts to be classified are text A and text B. The similarity between text A and texts a, b, and c is calculated; if the result is 97% and the preset threshold is 95%, then since 97% > 95%, text A is classified into pre-constructed word list 1. The similarity between text B and texts a, b, and c is likewise calculated; if the result is 86% and the preset threshold is 95%, then since 86% < 95%, text B cannot be classified into pre-constructed word list 1. In this example, text a, text b, and text c are the predefined texts belonging to the same class in the pre-constructed word list.
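The segment, vectorize, compare, and threshold steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the disclosure: whitespace splitting stands in for word segmentation, bag-of-words counts stand in for word vectors, cosine similarity stands in for the matrix calculation, and the best match over a list's entries is taken as the similarity to that list. The function names `tokenize`, `cosine`, `list_similarity`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace segmentation. Chinese text would need a real
    # word segmenter, which the disclosure does not name.
    return text.lower().split()


def cosine(a, b):
    # Assumption: bag-of-words count vectors compared by cosine similarity.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def list_similarity(text, word_list):
    # Assumption: similarity to a list is the best match over its entries.
    return max((cosine(text, entry) for entry in word_list), default=0.0)


def classify(text, word_lists, threshold):
    # Return the names of all lists whose similarity exceeds the threshold;
    # an empty result means the text is not classified into any list.
    return [
        name
        for name, entries in word_lists.items()
        if list_similarity(text, entries) > threshold
    ]
```

With a threshold of 0.95, as in the example above, only near-identical texts would pass under this particular similarity measure; a looser threshold admits partial matches.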
As shown in FIG. 2, in an embodiment provided by the present disclosure, the text classification method further includes:
updating the pre-constructed word list after the text to be classified has been classified into the pre-constructed word list for which the similarity is greater than the preset threshold.
The pre-constructed word list in the present disclosure may also be referred to as a pre-constructed word bank. It is built by customizing, according to the type of word list to be constructed, a set of texts (texts, including vocabulary items, of the kind one expects to collect, that is, predefined texts belonging to the same class) and placing them into the corresponding word list. Similarity is then calculated between the texts in the word bank and the mass texts in order to expand the word list: similarity is calculated between each acquired mass text and the texts (including vocabulary items) of each word list, and texts whose similarity is greater than the preset threshold are placed into the corresponding word list, so that the word lists keep growing. The preset threshold can be adjusted according to actual requirements. Similarity is then calculated again between the expanded word lists and the mass texts, further increasing the number of texts in each word list, so that a larger word list, and correspondingly a larger database, is built. In the example above, after text A is classified into pre-constructed word list 1, the texts in pre-constructed word list 1 are text A, text a, text b, and text c.
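The repeated expansion described above can be sketched as a loop that rescans the remaining texts against the grown word list until nothing new is admitted. This is a hedged sketch: the Jaccard token overlap used as the default similarity, the `max_rounds` cap, and the names `jaccard` and `expand_word_list` are assumptions for illustration and do not come from the disclosure.

```python
def jaccard(a, b):
    # Assumption: token-set overlap as a stand-in similarity measure.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def expand_word_list(seed, corpus, threshold, similarity=jaccard, max_rounds=10):
    # Start from the predefined (seed) texts and repeatedly admit corpus
    # texts that are similar enough to any text already in the list.
    word_list = list(seed)
    remaining = [t for t in corpus if t not in word_list]
    for _ in range(max_rounds):
        added = [
            t
            for t in remaining
            if any(similarity(t, entry) > threshold for entry in word_list)
        ]
        if not added:
            break  # no new text passed the threshold; the list is stable
        word_list.extend(added)
        remaining = [t for t in remaining if t not in added]
    return word_list
```

Because each round compares against the expanded list, a text that is not similar enough to any seed can still be admitted through an intermediate text added earlier. That is the intended growth effect, but it also means a low threshold can let the list drift away from its original class.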
The following explains the text classification method provided in the present disclosure again by taking emotion analysis as an example.
In this example, two types of word lists, positive and negative, are pre-constructed, and a set of seed words is customized and placed into each of them. For example, the positive word list contains words such as happy and pleasant; the negative word list contains words such as difficult, poor, and damaged.
Similarity is calculated between the words in the positive and negative word lists and the mass texts; if the similarity is greater than ninety percent, the text is placed into the corresponding word list. Here the preset threshold is set to ninety percent.
Similarity calculation is then repeated with the two expanded word lists, continuously increasing the number of words in the word lists and building a larger database.
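The emotion-analysis example can be sketched with two seed lists and the ninety percent threshold. As an assumption for illustration only, similarity here is the character-level ratio from Python's standard `difflib.SequenceMatcher`, which measures surface resemblance rather than meaning; the word-vector similarity of the embodiment would behave differently. The names `POSITIVE`, `NEGATIVE`, `char_similarity`, and `label_word` are hypothetical.

```python
from difflib import SequenceMatcher

# Seed words taken from the example above.
POSITIVE = ["happy", "pleasant"]
NEGATIVE = ["difficult", "poor", "damaged"]


def char_similarity(a, b):
    # Assumption: character-level ratio as a stand-in for the
    # word-vector similarity described in the embodiment.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def label_word(word, threshold=0.9):
    # Return every list name whose best seed match exceeds the threshold.
    labels = []
    for name, seeds in (("positive", POSITIVE), ("negative", NEGATIVE)):
        if max(char_similarity(word, seed) for seed in seeds) > threshold:
            labels.append(name)
    return labels
```

A strict threshold such as 0.9 admits only close variants of a seed word under this measure, which is one reason the threshold is left adjustable.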
As shown in FIG. 3, the present disclosure also provides a text classification apparatus including a calculation module and a judgment module, wherein
the calculation module is used for calculating the similarity between a text to be classified and each of at least one pre-constructed word list;
the judgment module is used for judging whether the calculated similarity is greater than a preset threshold; if so, classifying the text to be classified into the pre-constructed word list for which the similarity is greater than the preset threshold; otherwise, not classifying the text to be classified into any pre-constructed word list.
In one embodiment of the present disclosure, each pre-constructed word list includes at least one predefined text belonging to the same class.
In one embodiment of the present disclosure, the text classification apparatus further includes an updating module, which is used for updating the pre-constructed word list after the text to be classified has been classified into the pre-constructed word list for which the similarity is greater than the preset threshold.
For the information interaction between the modules of the apparatus, their execution processes, and related details, reference may be made to the description of the method embodiments of the present disclosure, since the apparatus is based on the same concept; they are not repeated here.
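One way to picture the division of work among the calculation, judgment, and updating modules is the sketch below. The class names mirror the module names in the text, but the interfaces, the `TextClassifier` wrapper, and the token-overlap similarity are assumptions made for illustration; the disclosure does not specify them.

```python
from typing import Callable, Dict, List

Similarity = Callable[[str, str], float]


def token_overlap(a: str, b: str) -> float:
    # Assumption: Jaccard overlap of whitespace tokens as the similarity.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


class CalculationModule:
    def __init__(self, similarity: Similarity = token_overlap):
        self.similarity = similarity

    def scores(self, text: str, word_lists: Dict[str, List[str]]) -> Dict[str, float]:
        # Best similarity between the text and any entry of each list.
        return {
            name: max((self.similarity(text, e) for e in entries), default=0.0)
            for name, entries in word_lists.items()
        }


class JudgmentModule:
    def __init__(self, threshold: float):
        self.threshold = threshold

    def select(self, scores: Dict[str, float]) -> List[str]:
        # Lists whose similarity is strictly greater than the threshold.
        return [name for name, s in scores.items() if s > self.threshold]


class UpdateModule:
    def apply(self, text: str, selected: List[str], word_lists: Dict[str, List[str]]) -> None:
        # Add the classified text to each selected list, avoiding duplicates.
        for name in selected:
            if text not in word_lists[name]:
                word_lists[name].append(text)


class TextClassifier:
    def __init__(self, word_lists: Dict[str, List[str]], threshold: float):
        self.word_lists = word_lists
        self.calculation = CalculationModule()
        self.judgment = JudgmentModule(threshold)
        self.update = UpdateModule()

    def classify(self, text: str) -> List[str]:
        selected = self.judgment.select(self.calculation.scores(text, self.word_lists))
        self.update.apply(text, selected, self.word_lists)
        return selected
```

Keeping the similarity function injectable in `CalculationModule` means the threshold logic and the update step stay unchanged if a different similarity measure is substituted.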
The present disclosure also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the text classification methods of the present disclosure.
The computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a device of test software, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the means for testing software over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The present disclosure also provides a computing device comprising a memory and a processor, the memory storing a computer program, the processor implementing any of the text classification methods of the present disclosure when executing the computer program.
The computing devices of the disclosed embodiments exist in a variety of forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communications. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones, among others.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as iPads.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) Other electronic devices with data processing capabilities.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other similar elements in a process, method, article, or apparatus that comprises the element.
In the description herein, reference to the terms "one embodiment/mode," "some embodiments/modes," "example," "specific example," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with that embodiment/mode or example is included in at least one embodiment/mode or example of the application. In this specification, such schematic uses of the above terms do not necessarily refer to the same embodiment/mode or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/modes or examples. In addition, those skilled in the art may combine the various embodiments/modes or examples described in this specification, and the features of those embodiments/modes or examples, provided they do not conflict with one another.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of illustration of the disclosure and are not intended to limit the scope of the disclosure. Other variations or modifications may occur to those skilled in the art, based on the foregoing disclosure, and are still within the scope of the present disclosure.

Claims (8)

1. A text classification method is characterized by comprising the following steps:
calculating the similarity between a text to be classified and each of at least one pre-constructed word list;
judging whether the calculated similarity is greater than a preset threshold; if so, classifying the text to be classified into the pre-constructed word list for which the similarity is greater than the preset threshold; otherwise, not classifying the text to be classified into any pre-constructed word list.
2. The text classification method according to claim 1, characterized in that each pre-constructed word list comprises at least one predefined text belonging to the same class.
3. The text classification method according to claim 2, characterized in that the text classification method further comprises:
updating the pre-constructed word list after the text to be classified has been classified into the pre-constructed word list for which the similarity is greater than the preset threshold.
4. A text classification apparatus, characterized by comprising a calculation module and a judgment module, wherein
the calculation module is used for calculating the similarity between a text to be classified and each of at least one pre-constructed word list;
the judgment module is used for judging whether the calculated similarity is greater than a preset threshold; if so, classifying the text to be classified into the pre-constructed word list for which the similarity is greater than the preset threshold; otherwise, not classifying the text to be classified into any pre-constructed word list.
5. The text classification apparatus according to claim 4, characterized in that each pre-constructed word list comprises at least one predefined text belonging to the same class.
6. The text classification apparatus according to claim 5, characterized by further comprising an updating module, which is used for updating the pre-constructed word list after the text to be classified has been classified into the pre-constructed word list for which the similarity is greater than the preset threshold.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the text classification method of any one of claims 1 to 3.
8. A computing device comprising a memory and a processor, wherein the memory stores a computer program and the processor implements the text classification method of any one of claims 1 to 3 when executing the computer program.
CN201911360415.9A 2019-12-25 2019-12-25 Text classification method and device Pending CN110990577A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911360415.9A CN110990577A (en) 2019-12-25 2019-12-25 Text classification method and device

Publications (1)

Publication Number Publication Date
CN110990577A true CN110990577A (en) 2020-04-10

Family

ID=70076627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911360415.9A Pending CN110990577A (en) 2019-12-25 2019-12-25 Text classification method and device

Country Status (1)

Country Link
CN (1) CN110990577A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813941A (en) * 2020-07-23 2020-10-23 北京来也网络科技有限公司 Text classification method, device, equipment and medium combining RPA and AI
CN113535965A (en) * 2021-09-16 2021-10-22 杭州费尔斯通科技有限公司 Method and system for large-scale classification of texts

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160321243A1 (en) * 2014-01-10 2016-11-03 Cluep Inc. Systems, devices, and methods for automatic detection of feelings in text
CN107729520A (en) * 2017-10-27 2018-02-23 北京锐安科技有限公司 File classifying method, device, computer equipment and computer-readable medium
CN107766426A (en) * 2017-09-14 2018-03-06 北京百分点信息科技有限公司 A kind of file classification method, device and electronic equipment
CN109492110A (en) * 2018-11-28 2019-03-19 南京中孚信息技术有限公司 Document Classification Method and device
CN110019785A (en) * 2017-09-29 2019-07-16 北京国双科技有限公司 A kind of file classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200410