CN117033633A - Text classification method, system, medium and equipment - Google Patents

Text classification method, system, medium and equipment

Info

Publication number: CN117033633A
Application number: CN202310996367.2A
Authority: CN (China)
Prior art keywords: word, information, text, groups, segmentation
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 张全飞, 邓举名, 张敏, 卢春辉
Current assignee: Qizhi Technology Co ltd
Original assignee: Qizhi Technology Co ltd
Application filed by Qizhi Technology Co ltd
Priority to CN202310996367.2A
Publication of CN117033633A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G06F 16/353: Classification into predefined classes
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text classification method, system, medium and device relate to the field of text processing. The method comprises the following steps: dividing text information into sentences to obtain a plurality of pieces of clause information; identifying word information in the clause information according to a preset word stock; judging whether the word information contains a plurality of target words that cross-share the same character; if so, segmenting the same clause information separately according to each target word to generate multiple groups of word segmentation groups to be selected, and selecting an optimal word segmentation group from those groups; if not, segmenting the clause information directly to obtain a common word segmentation group; and inputting the common word segmentation group and the optimal word segmentation group into a text classification model to obtain a text classification result. Words that share the same character are screened out and segmented separately, the accurate segmentation is selected according to the relatedness between the segmented words, and that segmentation is input into the text classification model, so that an accurate text classification result is obtained.

Description

Text classification method, system, medium and equipment
Technical Field
The application relates to the field of text processing, in particular to a text classification method, a system, a medium and equipment.
Background
Text classification refers to assigning a given text to one of several predefined categories. It helps to process large amounts of text automatically, which in turn enables applications such as information filtering, information retrieval and sentiment analysis.
When Chinese text is classified, a word segmentation step is needed first. However, some words in the text interfere with one another: the last character of one word and the first character of the adjacent word can themselves form a valid word. For example, when a word meaning "absorb water" is adjacent to the word "fruit", the segmenter may produce the word "fruit" where it was not intended; similarly, a spurious word such as "water detection" may appear. Such interference makes the word segmentation inaccurate, and inputting the inaccurate segmentation into the text classification model makes the text classification inaccurate.
Disclosure of Invention
In order to enable text classification to be more accurate, the application provides a text classification method, a system, a medium and equipment.
In a first aspect, the present application provides a text classification method.
A text classification method comprising the steps of:
dividing the text information into sentences to obtain a plurality of pieces of clause information;
identifying word information in the clause information according to a preset word stock;
judging whether the word information contains a plurality of target words that cross-share the same character;
if yes, segmenting the same clause information separately according to each target word to generate multiple groups of word segmentation groups to be selected, and selecting an optimal word segmentation group from the groups to be selected according to the relatedness of the words within each group;
if not, segmenting the clause information directly to obtain a common word segmentation group;
and inputting the common word segmentation group and the optimal word segmentation group into a text classification model to obtain a text classification result.
By adopting the technical scheme, the target words that cross-share the same character are distinguished through the word stock, the relatedness of the candidate segmentations produced for those target words is then calculated, and the optimal segmentation is screened out according to that relatedness, so the word segmentation accuracy is improved and the text classification becomes more accurate.
Optionally, in identifying word information in the clause information according to the preset word stock, the method further includes the following steps:
dividing the clause information into a plurality of clause groups according to a dividing rule;
all word information in each clause group is identified concurrently.
By adopting the technical scheme, concurrent recognition processes multiple pieces of data at once, which shortens the overall processing time, improves the efficiency of recognizing word information and makes the text classification faster.
Optionally, the dividing rule is to select, in the order in which the clause information appears in the text information, up to the maximum number of pieces of clause information that can be processed concurrently and to divide each such selection into one clause group.
By adopting the technical scheme, selecting as many pieces of clause information as can be processed concurrently makes the concurrent processing most efficient and speeds up the processing.
Optionally, in inputting the common word segmentation group and the optimal word segmentation group into the text classification model to obtain a text classification result, the method further comprises the following steps:
replacing untrained text word segmentations with synonyms from the training synonym library to obtain a plurality of replacement text word segmentations, wherein a text word segmentation is any word in the common word segmentation group or the optimal word segmentation group;
and sequentially inputting the text word segmentations and the replacement text word segmentations into the text classification model to obtain a text classification result.
By adopting the technical scheme, untrained text word segmentations are replaced with synonyms from the training synonym library, so the text classification model can handle words it was not trained on and inaccurate classification results caused by untrained text word segmentations are less likely to occur.
Optionally, in replacing an untrained text word segmentation that has a synonym with a synonym from the training synonym library, the method comprises the following steps:
judging whether the text word segmentation has a plurality of synonyms;
if yes, selecting the synonym with the highest relatedness to the clause information to replace the text word segmentation.
By adopting the technical scheme, the synonyms are screened and the synonym with the highest relatedness, that is, the one closest to the language environment of the clause information, is selected, so the replacement text word segmentation expresses a meaning close to that of the original text and the influence on the classification result is reduced.
Optionally, selecting the optimal word segmentation group from the multiple groups of word segmentation groups to be selected according to the relatedness of the words among the groups further comprises the following steps:
selecting, as the optimal word segmentation group, the word segmentation group to be selected whose adjacent words have the largest sum of word-vector cosine similarities.
By adopting the technical scheme, a new word formed by the tail of one word and the head of the next has low relatedness to the other words in the sentence, and the resulting sentence semantics usually deviate from reality; therefore the word segmentation group with the highest word-to-word relatedness among all the groups to be selected, screened out by comparing word-vector cosine similarities, is usually the grammatically correct segmentation of the sentence.
Optionally, after identifying word information in the clause information according to the preset word stock, the method further comprises the following steps:
text cleansing is performed on the clause information to filter nonsensical words.
By adopting the technical scheme, nonsensical words are filtered out by cleaning the text, which reduces their interference with the clauses and makes the text classification more efficient.
In a second aspect of the application, a text classification system is provided.
A text classification system, comprising:
the sentence dividing module is used for dividing the text information to obtain a plurality of sentence information;
the word recognition module is used for recognizing word information in the clause information according to a preset word stock;
the target word judging module is used for judging whether the word information contains a plurality of target words that cross-share the same character;
the to-be-selected word group generating module is used for segmenting the same clause information separately according to each target word when the word information contains a plurality of target words that cross-share the same character, so as to generate a plurality of groups of word segmentation groups to be selected;
the optimal word group selecting module is used for selecting the optimal word segmentation group from the plurality of groups of word segmentation groups to be selected according to the relatedness of the words among the groups;
the common word segmentation group generation module is used for segmenting the clause information directly to obtain a common word segmentation group when the word information does not contain a plurality of target words that cross-share the same character;
and the text classification result output module is used for inputting the common word segmentation group and the optimal word segmentation group into the text classification model to obtain a text classification result.
In a third aspect of the application, an electronic device is provided comprising a processor, a memory for storing instructions, and a transceiver for communicating with other devices, the processor for executing the instructions stored in the memory to cause the electronic device to perform a text classification method.
In a fourth aspect of the application, a computer readable storage medium is provided, the computer readable storage medium storing instructions that, when executed, perform a text classification method.
In summary, one or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:
1. words containing the same character in the clause information are identified and screened out and segmented separately, the remainder of the clause information (with those words removed) is segmented correspondingly, the accurate segmentation is screened out according to the relatedness between the segmented words, and the accurate segmentation is input into the text classification model to obtain an accurate text classification result;
2. the word information of a plurality of pieces of clause information is identified concurrently, so the text classification is more efficient;
3. untrained words among the text word segmentations are replaced with the most closely related synonyms, so the text classification result is more accurate.
Drawings
FIG. 1 is a flow chart of a text classification method according to an embodiment of the application;
FIG. 2 is a flowchart of the optimized steps for identifying word information in the clause information according to a preset word stock, in an embodiment of the application;
FIG. 3 is a flowchart of the step of inputting the common word segmentation group and the optimal word segmentation group into the text classification model to obtain a text classification result, in an embodiment of the application;
FIG. 4 is a schematic diagram of a text classification system according to an embodiment of the application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals illustrate: 1. a sentence module; 2. a word recognition module; 3. a target word judging module; 4. a generating module of the word group to be selected; 5. the optimal word group selecting module; 6. a common word segmentation group generation module; 7. a text classification result output module; 1000. an electronic device; 1001. a processor; 1002. a communication bus; 1003. a user interface; 1004. a network interface; 1005. a memory.
Detailed Description
In order that those skilled in the art will better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments.
In describing embodiments of the present application, words such as "such as" or "for example" are used to mean serving as an example, illustration or description. Any embodiment or design described as "such as" or "for example" in embodiments of the application should not be construed as preferred over, or more advantageous than, other embodiments or designs. Rather, the use of such words is intended to present related concepts in a concrete fashion.
In the description of embodiments of the application, the term "plurality" means two or more. For example, a plurality of systems means two or more systems, and a plurality of screen terminals means two or more screen terminals. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating an indicated technical feature. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
Referring to fig. 1, a text classification method includes the steps of:
S1: dividing the text information into sentences to obtain a plurality of pieces of clause information;
Specifically, the text information is divided into a plurality of pieces of clause information by identifying sentence-ending punctuation marks, namely periods, semicolons, question marks or exclamation marks, and splitting at the positions where they appear.
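The following Python sketch illustrates the clause splitting of step S1; the function name and the exact delimiter set are assumptions for illustration, not taken from the patent.

```python
import re

def split_into_clauses(text: str) -> list[str]:
    # Treat periods, semicolons, question marks and exclamation marks
    # (ASCII and full-width Chinese forms) as clause-ending punctuation (S1).
    parts = re.split(r"(?<=[。；？！.;?!])", text)
    return [p.strip() for p in parts if p.strip()]
```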
S2: identifying word information in the clause information according to a preset word stock;
Specifically, the preset word stock may be a library storing a large number of words. A target word is selected from the preset word stock, the target word being a word whose first two characters are the same as those of a group of adjacent words in the clause information. The target words in the word stock are compared with the words in the clause information one by one to judge whether the clause information contains a word consistent with a target word; if so, that word is extracted as word information.
In addition, the recognition step is further optimized to improve the recognition processing efficiency; the specific optimization steps are given in S21-S23:
S21: dividing the clause information into a plurality of clause groups according to a dividing rule;
Specifically, the dividing rule is to select, in the order in which the clause information appears in the text information, up to the maximum number of pieces of clause information that can be processed concurrently and to divide them into one clause group, repeating this until all the clause information has been divided, which yields a plurality of clause groups. In this embodiment, a plurality of pieces of clause information are processed simultaneously by concurrent processing, which improves the efficiency of recognizing word information in the clause information.
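A minimal sketch of the dividing rule of S21, assuming the maximum concurrent processing number is known in advance (the function and parameter names are hypothetical):

```python
def group_clauses(clauses: list[str], max_concurrent: int) -> list[list[str]]:
    # Take the clauses in their original order and cut them into batches of at
    # most max_concurrent items; each batch is one clause group (step S21).
    return [clauses[i:i + max_concurrent]
            for i in range(0, len(clauses), max_concurrent)]
```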
S22: concurrently identifying all word information in each clause group;
Word information in all the clause information within one clause group is identified simultaneously by concurrent processing; after all word information in one clause group has been identified, the word information in the next clause group is identified, and so on until every clause group has been processed. It should be noted that concurrent processing refers to a computer system's ability to execute multiple tasks at the same time; concretely, the word information in each clause group may be identified by multithreaded parallel computation or by distributed computing.
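One way to realise the concurrent recognition of S22 is a thread pool, as sketched below; the forward-maximum-matching recogniser is only an illustrative stand-in for whatever lexicon lookup is actually used.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_words(clause: str, lexicon: set[str]) -> list[str]:
    # Forward maximum matching against the preset word stock (illustrative).
    max_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(clause):
        for length in range(min(max_len, len(clause) - i), 0, -1):
            if clause[i:i + length] in lexicon:
                words.append(clause[i:i + length])
                i += length
                break
        else:
            i += 1  # single character not found in the word stock
    return words

def recognize_clause_group(clause_group: list[str], lexicon: set[str]) -> list[list[str]]:
    # Recognise every clause in one group at the same time (step S22);
    # the next group is processed only after this group has finished.
    with ThreadPoolExecutor(max_workers=max(1, len(clause_group))) as pool:
        return list(pool.map(lambda c: recognize_words(c, lexicon), clause_group))
```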
S23: text cleaning is carried out on the clause information to filter nonsensical words;
Specifically, the text cleaning comprises removing HTML tags, special characters, punctuation marks, numbers, Chinese and English spaces, and stop words. Removing these from the clause information removes noise, so the subsequent word segmentation of the clauses is more accurate.
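A hedged sketch of the cleaning of S23; the regular expressions and the single-character stop-word handling are illustrative assumptions rather than the patent's exact rules.

```python
import re

def clean_clause(clause: str, stopwords: set[str]) -> str:
    clause = re.sub(r"<[^>]+>", "", clause)                  # HTML tags
    clause = re.sub(r"[\d\s]+", "", clause)                  # numbers and spaces
    clause = re.sub(r"[^\u4e00-\u9fa5A-Za-z]", "", clause)   # punctuation / special characters
    # Drop (single-character) stop words; multi-character stop words would
    # need a word-level pass after segmentation.
    return "".join(ch for ch in clause if ch not in stopwords)
```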
S3: judging whether the word information contains a plurality of target words that cross-share the same character; if yes, segmenting the same clause information separately according to each target word to generate a plurality of groups of word segmentation groups to be selected, and selecting the optimal word segmentation group from those groups according to the relatedness of the words among them; if not, segmenting the clause information directly to obtain a common word segmentation group;
Specifically, target words that cross-share the same character can be illustrated as follows: in a phrase such as "power the motor on and run", the two words "motor" and "machine" cross-share the same character, "machine"; words like these, which interfere with the segmentation result, are screened out.
After it is judged that the word information contains a plurality of target words that cross-share the same character, the target words are extracted and one of them is selected for the word segmentation operation. The word segmentation operation is as follows: the selected target word is removed from the clause information, the remainder is segmented to generate one group of words to be selected, and the selected target word is then added back into that group according to the order of the original text. One of the remaining target words is then selected and the word segmentation operation is performed again, and this is repeated until every target word has been selected, giving a plurality of groups of word segmentation groups to be selected.
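The word segmentation operation just described can be sketched as follows; `segment` stands for any ordinary segmenter (for example jieba.lcut), and the simple find-based positioning is an assumption made only for this illustration.

```python
def candidate_groups(clause: str, target_words: list[str], segment) -> list[list[str]]:
    # For each target word that cross-shares a character: fix that word,
    # segment the text before and after it, and concatenate the pieces in the
    # original order, so each target word yields one candidate group (step S3).
    groups = []
    for tw in target_words:
        idx = clause.find(tw)
        if idx < 0:
            continue
        groups.append(segment(clause[:idx]) + [tw] + segment(clause[idx + len(tw):]))
    return groups
```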
Then, the cosine similarity between word vectors within each group is calculated with a word vector model such as Word2Vec or GloVe; the sum of the cosine similarities of all adjacent word vectors is taken as the relatedness of the words, and the word segmentation group to be selected with the highest relatedness is chosen as the optimal word segmentation group.
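Selecting the optimal group by the sum of adjacent-word cosine similarities might look like the sketch below; the word vectors would come from a Word2Vec or GloVe model, and treating out-of-vocabulary pairs as contributing zero is an assumption of the sketch.

```python
import numpy as np

def relatedness(group: list[str], vectors: dict[str, np.ndarray]) -> float:
    # Sum of cosine similarities of every pair of adjacent words in the group.
    total = 0.0
    for a, b in zip(group, group[1:]):
        if a in vectors and b in vectors:
            va, vb = vectors[a], vectors[b]
            total += float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return total

def select_optimal_group(candidates: list[list[str]], vectors: dict[str, np.ndarray]) -> list[str]:
    # The candidate group with the largest relatedness is taken as optimal.
    return max(candidates, key=lambda g: relatedness(g, vectors))
```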
After it is judged that the word information does not contain a plurality of target words that cross-share the same character, the clause information can be segmented and input into the text classification model without the risk that multiple possible segmentations make the classification inaccurate; the clause information is therefore segmented directly to obtain a common word segmentation group.
S4: inputting the common word segmentation group and the optimal word segmentation group into a text classification model to obtain a text classification result;
The specific processing procedure is as follows:
S41: replacing untrained text word segmentations with synonyms from the training synonym library to obtain a plurality of replacement text word segmentations;
Specifically, a text word segmentation is any word in the common word segmentation group or the optimal word segmentation group. The training word library is preset by personnel; it contains all the words used in training the text classification model together with, for each training word, a corresponding synonym library. By querying the training word library it can be determined whether a text word segmentation belongs to the synonym library of some training word, and through that correspondence the training word corresponding to the text word segmentation can be found.
The specific replacement process further comprises the following steps:
S411: judging whether the text word segmentation has a plurality of synonyms; if yes, selecting the synonym with the highest relatedness to the clause information to replace the text word segmentation; if not, replacing the text word segmentation directly with its single synonym.
It should be noted that a text word segmentation may correspond to several synonyms. If only one is chosen arbitrarily, special cases easily arise and the result becomes inaccurate; if all of them are used, too much data has to be processed and the efficiency becomes too low; the number of synonyms selected therefore needs to be controlled. After it is judged that a text word segmentation has several synonyms, the relatedness between each synonym and the clause information is calculated, for example with the word-vector-based method used in step S3, and the synonym with the highest relatedness is selected as the target synonym. The text word segmentation is replaced with the target synonym to obtain a replacement text word segmentation, which takes the position of the original text word segmentation in the common word segmentation group or the optimal word segmentation group. When the text word segmentation has only one synonym, that synonym directly replaces it at its position in the common word segmentation group or the optimal word segmentation group. When the text word segmentation has no synonym, it is not replaced and the original text word segmentation is kept.
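A minimal sketch of the replacement in S41/S411, assuming the synonym library is a plain dict and the clause is represented by a single vector (both are assumptions of this illustration, not structures defined by the patent).

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def replace_untrained(word: str, clause_vec: np.ndarray, trained_words: set[str],
                      synonyms: dict[str, list[str]],
                      vectors: dict[str, np.ndarray]) -> str:
    # Keep words the classifier was trained on; otherwise replace the word by
    # the synonym most related to the clause, or keep it if no synonym exists.
    if word in trained_words or word not in synonyms:
        return word
    candidates = [s for s in synonyms[word] if s in vectors]
    if not candidates:
        return word
    return max(candidates, key=lambda s: cosine(vectors[s], clause_vec))
```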
s42: and sequentially inputting the text segmentation and the replacement text segmentation into the text classification model to obtain a text classification result.
Specifically, the text classification model is a deep learning model, which may be a recurrent neural network (RNN) or a convolutional neural network (CNN); in other embodiments it may also be a naive Bayes classifier, which classifies text data well and computes quickly. In this embodiment, after the text word segmentations have been replaced with synonyms, the text word segmentations and the corresponding replacement text word segmentations are input, in the order in which they appear in the text information, into the trained text classification model to obtain the text classification result, completing the text classification. It should be noted that the trained text classification model belongs to the prior art and is not described further here.
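Because the description names a naive Bayes classifier as one option, a small scikit-learn sketch is given below; the training texts, labels and whitespace token pattern are placeholders, not data from the patent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: pre-segmented texts joined by spaces, plus labels.
train_texts = ["开机 运行 正常", "水果 很 甜"]
train_labels = ["device", "food"]

model = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\S+"), MultinomialNB())
model.fit(train_texts, train_labels)

# At inference time each word segmentation group is joined back into a
# space-separated string and fed to the trained model.
print(model.predict([" ".join(["水果", "很", "甜"])]))
```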
Referring to fig. 4, the application further provides a text classification system.
A text classification system, comprising:
the sentence module 1 is used for sentence dividing of the text information to obtain a plurality of sentence information;
the word recognition module 2 is used for recognizing word information in the clause information according to a preset word stock;
a target word judging module 3, configured to judge whether the word information contains a plurality of target words that cross-share the same character;
the to-be-selected word group generating module 4 is used for segmenting the same clause information separately according to each target word when the word information contains a plurality of target words that cross-share the same character, so as to generate a plurality of groups of word segmentation groups to be selected;
the optimal word group selecting module 5 is used for selecting the optimal word segmentation group from the plurality of groups of word segmentation groups to be selected according to the relatedness of the words among the groups;
the common word segmentation group generation module 6 is used for segmenting the clause information directly to obtain a common word segmentation group when the word information does not contain a plurality of target words that cross-share the same character;
and the text classification result output module 7 is used for inputting the common word segmentation group and the optimal word segmentation group into the text classification model to obtain a text classification result.
It should be noted that: in the device provided in the above embodiment, when implementing the functions thereof, only the division of the above functional modules is used as an example, in practical application, the above functional allocation may be implemented by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the embodiments of the apparatus and the method provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the embodiments of the method are detailed in the method embodiments, which are not repeated herein.
The application further provides electronic equipment.
Referring to fig. 5, a schematic structural diagram of an electronic device is provided in an embodiment of the present application. As shown in fig. 5, the electronic device 1000 may include: at least one processor 1001, at least one network interface 1004, a user interface 1003, a memory 1005, at least one communication bus 1002.
Wherein the communication bus 1002 is used to enable connected communication between these components.
The user interface 1003 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 1003 may further include a standard wired interface and a wireless interface.
The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
Wherein the processor 1001 may include one or more processing cores. The processor 1001 connects various parts of the overall electronic device 1000 using various interfaces and lines, and performs the various functions of the electronic device 1000 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 1005 and invoking data stored in the memory 1005. Alternatively, the processor 1001 may be implemented in at least one hardware form of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA) or programmable logic array (Programmable Logic Array, PLA). The processor 1001 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is used for rendering and drawing the content to be displayed on the display screen; the modem is used to handle wireless communications. It will be appreciated that the modem may also not be integrated into the processor 1001 and may instead be implemented by a single chip.
The memory 1005 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). Optionally, the memory 1005 includes a non-transitory computer-readable storage medium. The memory 1005 may be used to store instructions, programs, code, code sets or instruction sets. The memory 1005 may include a stored-program area and a stored-data area, wherein the stored-program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.) and instructions for implementing the respective method embodiments described above, and the stored-data area may store the data referred to in the respective method embodiments described above. The memory 1005 may also optionally be at least one storage device located remotely from the processor 1001. As a computer storage medium, the memory 1005 may include an operating system, a network communication module, a user interface module and an application program of the text classification method.
It should be noted that: when the device provided in the above embodiment implements its functions, the division into the functional modules described above is only an example; in practical applications the functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the device embodiment and the method embodiments provided above belong to the same concept; their detailed implementation processes are given in the method embodiments and are not repeated here. In the electronic device 1000 shown in fig. 5, the user interface 1003 is mainly used for providing an input interface for the user and acquiring the data input by the user, and the processor 1001 may be configured to invoke the application program of the text classification method stored in the memory 1005 which, when executed by one or more processors, causes the electronic device to perform the method as described in one or more of the embodiments above.
The application also provides a computer-readable storage medium storing instructions which, when executed by one or more processors, cause an electronic device to perform the method as described in one or more of the embodiments above.
It will be clear to a person skilled in the art that the solution according to the application can be implemented by means of software and/or hardware. "Unit" and "module" in this specification refer to software and/or hardware capable of performing a particular function, either alone or in combination with other components, such as a field programmable gate array (Field-Programmable Gate Array, FPGA), an integrated circuit (Integrated Circuit, IC), and the like.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, such as the division of the units, merely a logical function division, and there may be additional manners of dividing the actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some service interface, device or unit indirect coupling or communication connection, electrical or otherwise.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on this understanding, the technical solution of the present application may be embodied essentially or partly in the form of a software product, or all or part of the technical solution, which is stored in a memory, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned memory includes: a U-disk, a Read-only memory (ROM), a random access memory (RandomAccessMemory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be performed by hardware associated with a program that is stored in a computer readable memory, which may include: flash disk, read-only memory (ROM), random access memory (RandomAccessMemory, RAM), magnetic or optical disk, and the like.
The foregoing is merely exemplary embodiments of the present disclosure and is not intended to limit the scope of the present disclosure. That is, equivalent changes and modifications are contemplated by the teachings of this disclosure, which fall within the scope of the present disclosure. Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a scope and spirit of the disclosure being indicated by the claims.

Claims (10)

1. A method of classifying text, comprising the steps of:
dividing the text information into sentences to obtain a plurality of pieces of clause information;
identifying word information in the clause information according to a preset word stock;
judging whether the word information contains a plurality of target words that cross-share the same character;
if yes, segmenting the same clause information separately according to each target word to generate multiple groups of word segmentation groups to be selected, and selecting an optimal word segmentation group from the groups to be selected according to the relatedness of the words within each group;
if not, segmenting the clause information directly to obtain a common word segmentation group;
and inputting the common word segmentation group and the optimal word segmentation group into a text classification model to obtain a text classification result.
2. The text classification method of claim 1, further comprising, in identifying word information in the clause information according to the preset word stock, the steps of:
dividing the clause information into a plurality of clause groups according to a dividing rule;
identifying all word information in each clause group concurrently.
3. A text classification method as claimed in claim 2, wherein: the dividing rule is to select, in the order in which the clause information appears in the text information, up to the maximum number of pieces of clause information that can be processed concurrently and to divide them into one clause group.
4. The text classification method according to claim 1, wherein in inputting the common word segment group and the optimal word segment group into the text classification model to obtain the text classification result, the method further comprises the steps of:
replacing untrained text word segmentations with synonyms from the training synonym library to obtain a plurality of replacement text word segmentations, wherein a text word segmentation is any word in the common word segmentation group or the optimal word segmentation group;
and sequentially inputting the text word segmentations and the replacement text word segmentations into the text classification model to obtain a text classification result.
5. The text classification method according to claim 4, wherein replacing an untrained text word segmentation that has a synonym with a synonym from the training synonym library comprises the steps of:
judging whether the text word segmentation has a plurality of synonyms;
if yes, selecting the synonym with the highest relatedness to the clause information to replace the text word segmentation.
6. The text classification method of claim 1, wherein selecting the optimal word segmentation group from the multiple groups of word segmentation groups to be selected according to the relatedness of the words among the groups further comprises the steps of:
selecting, as the optimal word segmentation group, the word segmentation group to be selected whose adjacent words have the largest sum of word-vector cosine similarities.
7. The text classification method of claim 1, further comprising the following steps after identifying word information in the clause information according to the preset word stock:
text cleansing is performed on the clause information to filter nonsensical words.
8. A system based on the text classification method of any of claims 1-7, comprising:
the sentence dividing module (1) is used for dividing the text information to obtain a plurality of sentence information;
the word recognition module (2) is used for recognizing word information in the clause information according to a preset word stock;
the target word judging module (3) is used for judging whether the word information contains a plurality of target words that cross-share the same character;
the to-be-selected word group generation module (4) is used for segmenting the same clause information separately according to each target word when the word information contains a plurality of target words that cross-share the same character, so as to generate a plurality of groups of word segmentation groups to be selected;
the optimal word group selecting module (5) is used for selecting the optimal word segmentation group from the plurality of groups of word segmentation groups to be selected according to the relatedness of the words among the groups;
the common word segmentation group generation module (6) is used for segmenting the clause information directly to obtain a common word segmentation group when the word information does not contain a plurality of target words that cross-share the same character;
and the text classification result output module (7) is used for inputting the common word segmentation group and the optimal word segmentation group into the text classification model to obtain a text classification result.
9. An electronic device comprising a processor, a memory for storing instructions, and a transceiver for communicating with other devices, the processor for executing the instructions stored in the memory to cause the electronic device to perform the method of any of claims 1-7.
10. A computer readable storage medium storing instructions which, when executed, perform the method steps of any of claims 1-7.
CN202310996367.2A 2023-08-08 2023-08-08 Text classification method, system, medium and equipment Pending CN117033633A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310996367.2A CN117033633A (en) 2023-08-08 2023-08-08 Text classification method, system, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310996367.2A CN117033633A (en) 2023-08-08 2023-08-08 Text classification method, system, medium and equipment

Publications (1)

Publication Number Publication Date
CN117033633A true CN117033633A (en) 2023-11-10

Family

ID=88633022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310996367.2A Pending CN117033633A (en) 2023-08-08 2023-08-08 Text classification method, system, medium and equipment

Country Status (1)

Country Link
CN (1) CN117033633A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114065739A (en) * 2021-11-12 2022-02-18 北京沃东天骏信息技术有限公司 Text word segmentation method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination