CN112364159A - Method, device and storage medium for classifying texts - Google Patents

Info

Publication number
CN112364159A
Authority
CN
China
Prior art keywords
word
text
pinyin
basic
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910684756.5A
Other languages
Chinese (zh)
Inventor
乔宏利
罗欢
权圣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd filed Critical Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN201910684756.5A priority Critical patent/CN112364159A/en
Publication of CN112364159A publication Critical patent/CN112364159A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method, an apparatus and a storage medium for classifying texts. The method comprises the following steps: acquiring a plurality of basic lemmas for classification from a text to be classified; determining the lemma pinyin corresponding to each basic lemma; and determining the category of the text to be classified from the plurality of basic lemmas and the corresponding lemma pinyins, using a pre-trained text classification model that classifies by both lemma and pinyin. The text is thus classified by combining Chinese words with their pinyin. Compared with training a model and classifying text using only the text or only its pinyin, this improves the precision, accuracy and recall of the results, preserves the property that Chinese homophones written with different characters carry different meanings, mitigates the influence of homophone typos common in internet input, and can effectively improve the effect and generalization ability of Chinese text classification.

Description

Method, device and storage medium for classifying texts
Technical Field
The present application relates to the field of computers and artificial intelligence, and in particular, to a method, an apparatus, and a storage medium for classifying texts.
Background
Text classification is one of the main application forms of machine learning in the field of NLP (natural language processing). By classifying texts that carry different semantic labels, it can realize semantic identification and intention prediction within a specific semantic space.
Most existing text classification techniques first train text word vectors and then operate on those vectors with various machine learning algorithms. In real network applications, such as intelligent customer service and chat robots, the input text often contains homophonic errors, such as phonetically-typed words and homophone characters; that is, the input itself is noisy. For example, if the relevant positions in the training corpus all contain the word 'simple and general', the model trained on that corpus cannot generalize to the variant when a user actually types the homophonic misspelling 'spectrum reduction' at that position. Training a text classifier on such noisy corpora, or training on standard corpora and then recognizing such real user input, therefore often degrades the practical effect. In addition, when classifying by text alone, homophonic misspellings entered by users are counted and represented as distinct words, adding invalid lemmas and interfering with classification computation and semantic recognition; when classifying by pinyin alone, large numbers of homophones with different characters cannot be distinguished and are treated as the same lemma, which also affects the final result.
For the technical problem in the prior art that training a classification model and classifying texts using only Chinese words or only Chinese pinyin allows noise generated during input, such as homophonic misspellings, homophone words and homophone characters, to affect the accuracy of classification results, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the disclosure provide a method, an apparatus and a storage medium for classifying texts, to at least solve the technical problem in the prior art that training a classification model and classifying texts using only Chinese words or only Chinese pinyin allows noise generated during input, such as homophonic misspellings, homophone words and homophone characters, to affect the accuracy of classification results.
According to an aspect of the embodiments of the present disclosure, there is provided a method of classifying text, including: acquiring a plurality of basic word elements for classification from a text to be classified; determining the word element pinyin corresponding to the basic word elements; and determining the category of the text to be classified according to a plurality of basic word elements and corresponding word element pinyins by utilizing a pre-trained text classification model classified according to the word elements and the pinyins.
According to another aspect of the embodiments of the present disclosure, there is also provided a storage medium including a stored program, wherein, when the program runs, a processor performs any one of the methods described above.
According to another aspect of the embodiments of the present disclosure, there is also provided an apparatus for classifying text, including: the acquisition module is used for acquiring a plurality of basic word elements for classification from the text to be classified; the determining module is used for determining the word element pinyin corresponding to the basic word elements; and the classification module is used for determining the category of the text to be classified according to a plurality of basic word elements and corresponding word element pinyin by utilizing a pre-trained text classification model classified according to the word elements and the pinyin.
According to another aspect of the embodiments of the present disclosure, there is also provided an apparatus for classifying text, including: a processor; and a memory coupled to the processor for providing instructions to the processor for processing the following processing steps: acquiring a plurality of basic word elements for classification from a text to be classified; determining the word element pinyin corresponding to the basic word elements; and determining the category of the text to be classified according to a plurality of basic word elements and corresponding word element pinyins by utilizing a pre-trained text classification model classified according to the word elements and the pinyins.
In the embodiment of the disclosure, a plurality of basic lemmas for classification are first obtained from a text to be classified; the lemma pinyins corresponding to the basic lemmas are then determined; and finally the category of the text to be classified is determined from the plurality of basic lemmas and the corresponding lemma pinyins, using a pre-trained text classification model that classifies by both lemma and pinyin. The text classification model itself is trained on the Chinese words of a corpus together with their corresponding pinyin. The text is thus classified by combining Chinese words with their pinyin. Compared with training a model and classifying text using only the text or only its pinyin, this improves the precision, accuracy and recall of the results, preserves the property that Chinese homophones written with different characters carry different meanings, mitigates the influence of homophone typos common in internet input, and can effectively improve the effect and generalization ability of Chinese text classification. It further solves the technical problem that classifying texts using only Chinese words or only Chinese pinyin allows input noise such as homophonic misspellings, homophone words and homophone characters to affect the accuracy of classification results.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a hardware configuration block diagram of a computer terminal (or mobile device) for implementing the method according to embodiment 1 of the present disclosure;
fig. 2 is a schematic flow chart of a method for classifying text according to a first aspect of embodiment 1 of the present disclosure;
fig. 3 is a schematic diagram of an apparatus for classifying text according to embodiment 2 of the present disclosure; and
fig. 4 is a schematic diagram of an apparatus for classifying texts according to embodiment 3 of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings. It is to be understood that the described embodiments are merely some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by a person skilled in the art from the embodiments disclosed herein without creative effort shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with the present embodiment, there is provided an embodiment of a method of classifying text. It should be noted that the steps illustrated in the flowchart of the figure may be performed in a computer system, such as by a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one here.
The method provided by this embodiment can be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a hardware configuration block diagram of a computer terminal (or mobile device) for implementing the method of classifying text. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, …, 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, it may also include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the disclosed embodiments, the data processing circuit acts as a processor control (e.g., selection of a variable resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the method for classifying texts in the embodiment of the present disclosure, and the processor 102 executes various functional applications and data processing by executing the software programs and modules stored in the memory 104, that is, implementing the method for classifying texts of the application program. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both. It should also be noted that fig. 1 is only one particular example and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
Under the operating environment, according to a first aspect of the present embodiment, there is provided a method for classifying text, where fig. 2 shows a flowchart of the method, and with reference to fig. 2, the method includes:
s202: acquiring a plurality of basic word elements for classification from a text to be classified;
s204: determining the word element pinyin corresponding to the basic word elements; and
s206: and determining the category of the text to be classified according to a plurality of basic word elements and corresponding word element pinyins by utilizing a pre-trained text classification model classified according to the word elements and the pinyins.
As described in the background, most existing text classification techniques first train text word vectors and then operate on those vectors with various machine learning algorithms. In real network applications, such as intelligent customer service and chat robots, the input text often contains homophonic errors, such as phonetically-typed words and homophone characters; that is, the input itself is noisy, and training a text classifier on such corpora, or training a classifier on standard corpora and then recognizing such real user input, often degrades the practical effect. In addition, when classifying by text alone, homophonic misspellings entered by users are counted and represented as distinct words, adding invalid lemmas and interfering with classification computation and semantic recognition; when classifying by pinyin alone, large numbers of homophones with different characters cannot be distinguished and are treated as the same lemma, which also affects the final result.
Specifically, to address the problems of existing text classification approaches described in the background, the technical solution of this embodiment first obtains a plurality of basic lemmas for classification from the text to be classified. Concretely, the text to be classified may first be segmented into a plurality of lemmas, and a vocabulary is then used to screen the basic lemmas from them. For example: if the text to be classified is "go play ball with Xiaoming on the weekend", the segmentation result is (weekend, Xiaoming, playing ball), and if the vocabulary is [weekend, playing ball, game], the basic lemmas obtained for the text to be classified are (weekend, playing ball). It should be noted that the vocabulary is generated while training the classification model on the corpus, as described later.
Further, the lemma pinyins corresponding to the basic lemmas are determined, namely (zhoumo, daqiu). Finally, the category of the text to be classified is determined from the plurality of basic lemmas (weekend, playing ball) and the corresponding lemma pinyins (zhoumo, daqiu), using the pre-trained text classification model that classifies by both lemma and pinyin, where the text classification model is itself trained on the Chinese words of the corpus and their corresponding pinyin.
Therefore, through the above method, a plurality of basic lemmas for classification are obtained from the text to be classified, the lemma pinyins corresponding to the basic lemmas are determined, and finally the category of the text to be classified is determined from the basic lemmas and the corresponding lemma pinyins using a pre-trained text classification model that classifies by both lemma and pinyin. The text is thus classified by combining Chinese words with their pinyin. Compared with classifying by text alone or by pinyin alone, this avoids the influence of homophones and misspelled characters on the result and improves the precision, accuracy and recall of the results. It solves the technical problem that, when classifying text using only Chinese words or only Chinese pinyin, input noise such as homophonic misspellings, homophone words and homophone characters affects the accuracy of classification results.
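As an illustration, the lemma-extraction and pinyin steps (S202 and S204) can be sketched as follows. The vocabulary and the token-to-pinyin table here are hypothetical stand-ins: a real system would use a Chinese word segmenter and a pinyin conversion library rather than these toy tables.

```python
# Sketch of S202/S204, assuming a pre-built vocabulary and a hypothetical
# token-to-pinyin table (both are illustrative, not the patent's actual data).
VOCAB = {"weekend", "playing ball", "game"}                 # mined vocabulary
PINYIN = {"weekend": "zhoumo", "playing ball": "daqiu"}     # token -> pinyin

def basic_lemmas(tokens):
    """S202: keep only the segmented tokens found in the vocabulary."""
    return [t for t in tokens if t in VOCAB]

def lemma_pinyins(lemmas):
    """S204: map each basic lemma to its lemma pinyin."""
    return [PINYIN[t] for t in lemmas]

tokens = ["weekend", "Xiaoming", "playing ball"]   # segmentation result
lemmas = basic_lemmas(tokens)                      # ['weekend', 'playing ball']
pinyins = lemma_pinyins(lemmas)                    # ['zhoumo', 'daqiu']
```

Both outputs are then passed to the classification model in step S206.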
Optionally, the operation of determining the category of the text to be classified includes: determining a plurality of basic word elements and word vectors corresponding to the pinyin of the corresponding word elements; and determining the category of the text to be classified according to the word vector of the determined basic word element and the word vector of the corresponding word element pinyin by using a text classification model.
Specifically, in the operation of determining the category of the text to be classified, the word vectors corresponding to the plurality of basic lemmas and their lemma pinyins are determined, for example by computing each basic lemma and its corresponding lemma pinyin with a word embedding algorithm to obtain its word vector (e.g., the word vector for "weekend" is [0.1, 0.2, 0.3, …] and the word vector for "zhoumo" is [-0.1, 0.02, 0.03, …]). In this way, the text and the pinyin are converted into vectors that a computer can process, so that the classification computation can be carried out.
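A minimal sketch of this lookup step, with made-up 3-dimensional vectors standing in for the output of a trained word embedding:

```python
# Hypothetical embedding table; real vectors would come from a word embedding
# algorithm trained on the corpus and would have many more dimensions.
EMB = {
    "weekend": [0.1, 0.2, 0.3],
    "playing ball": [0.4, 0.1, 0.0],
    "zhoumo": [-0.1, 0.02, 0.03],
    "daqiu": [0.2, 0.3, 0.5],
}

def vectors(units):
    """Look up the word vector of each lemma or lemma pinyin."""
    return [EMB[u] for u in units]

# Both lemmas and their pinyins map to vectors the classifier can consume.
vecs = vectors(["weekend", "zhoumo"])
```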
Optionally, the operation of determining the category of the text to be classified according to the determined word vector of the basic word element and the word vector of the corresponding word element pinyin comprises: combining the word vectors of a plurality of basic word elements and the word vectors of corresponding word element pinyin to generate combined information corresponding to the text to be classified; and determining the category of the text to be classified according to the combined information by using the text classification model.
Specifically, in the operation of determining the category of the text to be classified according to the determined word vector of the basic word element and the word vector of the corresponding word element pinyin, the basic word element and the word vector of the word element pinyin are combined firstly, wherein the combination mode can be tiling and concatenation, dimension-increasing alignment and the like, and then the combination information corresponding to the text to be classified is generated. And finally, determining the category of the text to be classified according to the combined information by using a text classification model.
Optionally, the method further comprises generating the combined information according to the following operations: arranging the word vectors of a plurality of basic word elements according to the original text sequence of the text to be classified; and arranging the word vectors of the word element pinyin behind the word vectors of the basic word elements according to the sequence corresponding to the basic word elements.
Specifically, in the operation of generating the combined information, the word vectors of the plurality of basic lemmas are first arranged according to the original text order of the text to be classified, i.e., <weekend> <playing ball>; the word vectors of the lemma pinyins are then arranged after them in the corresponding order. The resulting combined information is, for example: <weekend> <playing ball> <zhoumo> <daqiu>, with a corresponding word vector arrangement such as: [0.1, 0.2, 0.3, …], […], [-0.1, 0.02, 0.03, …], […].
Optionally, the method further comprises generating the combined information according to the following operations: arranging word vectors of the word element pinyin corresponding to a plurality of basic word elements according to a sequence; and arranging the word vectors of the plurality of basic word elements behind the word vector of the word element pinyin according to the sequence corresponding to the word element pinyin.
Specifically, in the operation of generating the combined information, the word vectors of the lemma pinyins corresponding to the plurality of basic lemmas are first arranged in order, and the word vectors of the plurality of basic lemmas are then arranged after them in the order corresponding to the lemma pinyins, giving another piece of combined information: <zhoumo> <daqiu> <weekend> <playing ball>, with a corresponding word vector arrangement such as: [0.2, 0.3, 0.4, …], [0.2, 0.3, 0.5, …], […], […].
optionally, the method further comprises generating the combined information according to the following operations: arranging the word vectors of the basic word elements and the word vectors of the word element pinyin in a crossed mode, wherein the word vectors of the word element pinyin are adjacently arranged before or after the word vectors of the corresponding basic word elements.
Specifically, the word vectors of the basic lemmas and the word vectors of the lemma pinyins are arranged in interleaved fashion. When the word vector of each lemma pinyin is placed immediately after the word vector of its corresponding basic lemma, the resulting combined information is: <weekend> <zhoumo> <playing ball> <daqiu>, with a corresponding word vector arrangement such as: [0.1, 0.2, 0.3, …], [-0.1, 0.02, 0.03, …], […], […].
Arranging the word vectors of the basic lemmas and the word vectors of the lemma pinyins in interleaved fashion, with the word vector of each lemma pinyin placed immediately before that of its corresponding basic lemma, gives yet another piece of combined information: <zhoumo> <weekend> <daqiu> <playing ball>, with a corresponding word vector arrangement such as: [0.2, 0.3, 0.4, …], […], [0.2, 0.3, 0.5, …], […].
The four kinds of combined information can each be used alone with the text classification model to classify texts, or be combined into a matrix and classified with the classification model. Note that, because the arrangements of lemmas and pinyins differ, the word vectors obtained by the word embedding computation may differ as well. In the final classification operation, the input word vectors can be combined by stacking or splicing.
Therefore, in this way, four groups of word vectors are obtained from the four arrangements, and the four groups are combined into a matrix. A classification model is trained with a preset classification algorithm; for a text to be predicted, an input parameter matrix is generated in the same way and fed into the trained classification model, and the model output completes the text classification. Because different arrangements of text and pinyin are combined, the precision and recall are markedly improved compared with using the text alone or its pinyin alone, and the model not only relates the mined text vocabulary but also fully learns the word-sound order relation.
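The four arrangements described above can be sketched as follows; stacking the four rows gives the matrix input mentioned in this paragraph (mode names are illustrative, not the patent's terminology):

```python
# Sketch of the four lemma/pinyin arrangements; the classifier input is
# either one arrangement alone or all four stacked into a matrix.
def arrange(lemmas, pinyins, mode):
    if mode == "lemmas_then_pinyins":      # <weekend> <playing ball> <zhoumo> <daqiu>
        return lemmas + pinyins
    if mode == "pinyins_then_lemmas":      # <zhoumo> <daqiu> <weekend> <playing ball>
        return pinyins + lemmas
    if mode == "lemma_pinyin_interleaved": # <weekend> <zhoumo> <playing ball> <daqiu>
        return [u for pair in zip(lemmas, pinyins) for u in pair]
    if mode == "pinyin_lemma_interleaved": # <zhoumo> <weekend> <daqiu> <playing ball>
        return [u for pair in zip(pinyins, lemmas) for u in pair]
    raise ValueError(mode)

lemmas, pinyins = ["weekend", "playing ball"], ["zhoumo", "daqiu"]
matrix = [arrange(lemmas, pinyins, m) for m in (
    "lemmas_then_pinyins", "pinyins_then_lemmas",
    "lemma_pinyin_interleaved", "pinyin_lemma_interleaved")]
```

Each row of `matrix` would then be replaced by the corresponding word vectors before being fed to the classifier.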
Optionally, the operation of determining the category of the text to be classified further includes: and respectively taking the word vector of the basic word element and the word vector of the corresponding word element pinyin as different inputs, inputting the different inputs to the corresponding classification models, and determining the category of the text to be classified.
Specifically, in this operation the word vectors of the basic lemmas and the word vectors of the corresponding lemma pinyins are used as separate inputs, that is, the basic lemmas (weekend, playing ball) and the lemma pinyins (zhoumo, daqiu) each form their own input, which are then combined; the combination method can be tiling and concatenation, dimension-increasing alignment, or dimension-reducing alignment. Finally, text classification is performed with a classification model trained on the correspondingly combined training corpus. Adding this second classification input makes the classification result more accurate.
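A sketch of the two-input variant, interpreting "tiling and concatenation" as flattening each branch and joining them; both this interpretation and the 2-dimensional placeholder vectors are assumptions for illustration:

```python
# Two separate input branches: lemma word vectors and pinyin word vectors.
lemma_vecs = [[0.1, 0.2], [0.4, 0.1]]     # vectors of (weekend, playing ball)
pinyin_vecs = [[-0.1, 0.0], [0.2, 0.3]]   # vectors of (zhoumo, daqiu)

def concat_inputs(a, b):
    """Flatten both branches and concatenate them into one feature row."""
    flatten = lambda vs: [x for v in vs for x in v]
    return flatten(a) + flatten(b)

features = concat_inputs(lemma_vecs, pinyin_vecs)
```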
Further, referring to fig. 1, according to a second aspect of the present embodiment, a storage medium 104 is provided. The storage medium 104 comprises a stored program, wherein, when the program runs, a processor performs any one of the methods described above.
In addition, the embodiment also provides a method for training a classification model, which specifically comprises the following steps:
determining a first word list from the classified linguistic data, wherein the first word list is used for recording high semantic lemmas appearing in the classified linguistic data. Wherein, the classified corpora include a plurality of corpora, for example:
and 1, corpus one: and playing the game on weekends.
And II, corpus II: go to play the ball on weekends and then go to play the game.
And 3, corpus III: go to the park to play the ball and then go to eat.
And fourthly, corpus: go to restaurant for meal on monday and then go to play ball.
........
Further, the corpora are segmented. Taking corpus 1 as an example, its segmentation result is (weekend, game). Segmentation results are then obtained for all corpora, i.e., all the words produced by segmenting every corpus, for example: (weekend, play ball, game, park, Monday, restaurant, eat). The word frequency (the number of times each word appears) is counted and the words are sorted by frequency, giving: play ball (3), weekend (2), game (2), eat (2), park (1), Monday (1), restaurant (1). A predetermined number of words is selected from the sorted result, for example by setting a threshold: words whose frequency reaches the threshold 2 are regarded as high-frequency words (high-frequency words: play ball, weekend, game, eat). High-semantic lemmas are then selected from the high-frequency words according to the tf-idf algorithm (i.e., screening the words important for classifying the corpora); for example, the computed high-semantic lemmas of the corpora are (weekend, play ball, game). It should be noted that selecting high-semantic lemmas with the tf-idf algorithm is well known to those skilled in the art and is not described in detail here. Next, a second vocabulary is determined, consisting of the pinyin corresponding to the words in the first vocabulary; the second vocabulary is therefore: [zhoumo daqiu youxi]. The first vocabulary and the second vocabulary are combined into a third vocabulary: [weekend play-ball game zhoumo daqiu youxi].
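The vocabulary-construction steps above can be sketched in Python. The toy corpora (English stand-ins for the Chinese words), the frequency threshold, the tf-idf scoring, and the pinyin mapping table are all illustrative assumptions, so the selected lemmas may differ from the worked example above:

```python
from collections import Counter
import math

# Toy tokenized corpora; the tokenizer and pinyin table are assumptions.
corpora = [
    ["weekend", "game"],                       # corpus 1
    ["weekend", "ball", "game"],               # corpus 2
    ["park", "ball", "eat"],                   # corpus 3
    ["monday", "restaurant", "eat", "ball"],   # corpus 4
]
PINYIN = {"weekend": "zhoumo", "ball": "daqiu", "game": "youxi"}

# 1) word frequencies over all corpora
freq = Counter(w for doc in corpora for w in doc)

# 2) high-frequency words: frequency above a chosen threshold
HIGH_FREQ_THRESHOLD = 1
high_freq = [w for w, c in freq.items() if c > HIGH_FREQ_THRESHOLD]

# 3) rank high-frequency words by a simple tf-idf score, keep the top ones
def tf_idf(word):
    tf = freq[word]
    df = sum(1 for doc in corpora if word in doc)
    return tf * math.log((1 + len(corpora)) / (1 + df))

first_vocab = sorted(high_freq, key=tf_idf, reverse=True)[:3]   # lemmas
second_vocab = [PINYIN[w] for w in first_vocab if w in PINYIN]  # pinyin
third_vocab = first_vocab + second_vocab                        # combined
print(third_vocab)
```

The scoring function here is one standard tf-idf variant; any screening criterion that ranks class-discriminative words would serve the same role.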
Further, the classified corpora are represented with the third vocabulary to obtain the corresponding corpus tables. Taking corpus 1 as an example, one way to represent it with the third vocabulary is: remove the words outside the first vocabulary to obtain a lemma sequence C1 (weekend, game); express each word in C1 with the second vocabulary to obtain a lemma-pinyin sequence C2 (zhoumo, youxi); the representation is C1 + C2, i.e., the corpus table corresponding to corpus 1 is: &lt;weekend game zhoumo youxi&gt;.
Further, word-embedding training is performed on the obtained corpus tables to obtain a word vector for each word. For example: the word vector corresponding to &lt;weekend&gt; is [0.1, 0.2, 0.3 ...], the word vector corresponding to &lt;game&gt; is [-0.1, 0.5, 0.2 ...], and so on. Word-vector training is a conventional method that those skilled in the art can understand and is not described again here. The training input is then obtained by representing the corpus table with the trained vectors (i.e., the vector of each word replaces the word in the corpus table, producing a matrix).
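The replacement step just described can be sketched as follows; the embedding table is a toy stand-in for the result of word-embedding training:

```python
# Turning a corpus table into a training matrix: each token in the table is
# replaced by its trained word vector (toy 3-dimensional vectors assumed).
embedding = {
    "weekend": [0.1, 0.2, 0.3],
    "game":    [-0.1, 0.5, 0.2],
    "zhoumo":  [0.4, 0.0, 0.1],
    "youxi":   [0.2, 0.3, -0.2],
}
corpus_table = ["weekend", "game", "zhoumo", "youxi"]
matrix = [embedding[token] for token in corpus_table]  # one row per token
print(len(matrix), len(matrix[0]))  # 4 3
```

The resulting matrix (one row per token, one column per embedding dimension) is what the classification algorithm consumes.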
Taking corpus 1 as an example, another way to represent it with the third vocabulary is: remove the words outside the first vocabulary to obtain the lemma sequence C1 (weekend, game); express each word in C1 with the second vocabulary to obtain the lemma-pinyin sequence C2; the representation is C2 + C1, i.e., another corpus table corresponding to corpus 1 is: &lt;zhoumo youxi weekend game&gt;.
Taking corpus 1 as an example, another way to represent it with the third vocabulary is: remove the words outside the first vocabulary to obtain the lemma sequence C1 (weekend, game); add the corresponding pinyin from the second vocabulary after each word in C1, i.e., another corpus table corresponding to corpus 1 is: &lt;weekend zhoumo game youxi&gt;.
Taking corpus 1 as an example, another way to represent it with the third vocabulary is: remove the words outside the first vocabulary to obtain the lemma sequence C1 (weekend, game); add the corresponding pinyin from the second vocabulary before each word in C1, i.e., another corpus table corresponding to corpus 1 is: &lt;zhoumo weekend youxi game&gt;.
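The four representation forms described above (C1 + C2, C2 + C1, pinyin after each lemma, pinyin before each lemma) can be sketched together; C1 and the pinyin mapping are assumed for illustration:

```python
# Build the four corpus-table arrangements for one corpus.
C1 = ["weekend", "game"]                      # lemma sequence kept from vocab 1
PINYIN = {"weekend": "zhoumo", "game": "youxi"}
C2 = [PINYIN[w] for w in C1]                  # lemma-pinyin sequence

form1 = C1 + C2                                  # C1 + C2
form2 = C2 + C1                                  # C2 + C1
form3 = [t for w in C1 for t in (w, PINYIN[w])]  # pinyin after each lemma
form4 = [t for w in C1 for t in (PINYIN[w], w)]  # pinyin before each lemma

print(form1)  # ['weekend', 'game', 'zhoumo', 'youxi']
print(form3)  # ['weekend', 'zhoumo', 'game', 'youxi']
```

Each form is then embedded and trained separately, which is why the same word can receive a different vector in each of the four tables.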
Then, word-embedding training is performed on each of the four corpus tables of corpus 1 to obtain four word-embedding tables; the embedding vector of the same word is not necessarily the same across the four tables. For example, the vector of "weekend" in the embedding table trained from the first representation may be [0.1, 0.2, 0.3 ...], while in the table trained from the second representation it may be [-0.1, 0.05, 0.003 ...].
In addition, the method also includes taking the word vectors of the lemmas and the word vectors of the corresponding pinyin as separate inputs. Taking corpus 1 as an example, the lemmas (weekend, play ball) and the lemma pinyin (zhoumo, daqiu) serve as two different inputs and are then combined; the combination method may be tiling/concatenation, dimension-increasing alignment, or dimension-reducing alignment. For example, the Wide & Deep Learning (WDL) algorithm can be adopted in training: the word features corresponding to the lemmas serve as the wide (W) layer input, the pinyin features of the lemmas serve as the deep (D) layer input, and model training with the WDL algorithm then yields the classification model.
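A minimal sketch of preparing the two separate inputs described above, under toy assumptions: the lemma features go to the wide (W) part and the pinyin features to the deep (D) part; the vectors and the flattening scheme are illustrative, and the actual WDL model training is omitted:

```python
# Prepare wide/deep inputs for one sample: lemma vectors feed the W layer,
# pinyin vectors feed the D layer (toy 2-dimensional vectors assumed).
lemma_vectors  = {"weekend": [0.1, 0.2], "ball": [0.3, 0.4]}
pinyin_vectors = {"zhoumo": [0.5, 0.6], "daqiu": [0.7, 0.8]}

lemmas  = ["weekend", "ball"]
pinyins = ["zhoumo", "daqiu"]

# flatten each group into a single feature vector per sample
wide_input = [x for w in lemmas for x in lemma_vectors[w]]
deep_input = [x for p in pinyins for x in pinyin_vectors[p]]

sample = {"wide": wide_input, "deep": deep_input}
print(sample["wide"])  # [0.1, 0.2, 0.3, 0.4]
```

Keeping the two feature groups separate is what lets a wide-and-deep architecture memorize lemma co-occurrences while generalizing over the pinyin features.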
Further, the corresponding vectors in the four word-embedding tables are stacked or spliced to obtain the training input of corpus 1, thereby obtaining the training input matrix. Finally, the input matrices of all the labeled corpora are trained with a predetermined algorithm (SVM, LR, an NN neural network, or WDL) to obtain the classification model.
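Stacking versus splicing the corresponding vectors from the four word-embedding tables can be sketched as follows, with toy two-dimensional vectors assumed:

```python
# Build the training input for one token from the four embedding tables:
# "stacked" keeps one row per table; "spliced" concatenates them into a
# single vector. The vectors are illustrative assumptions.
tables = [
    {"weekend": [0.10, 0.20]},   # table from representation form 1
    {"weekend": [-0.10, 0.05]},  # form 2
    {"weekend": [0.30, 0.10]},   # form 3
    {"weekend": [0.00, 0.40]},   # form 4
]
token = "weekend"

stacked = [t[token] for t in tables]             # 4 x 2 matrix
spliced = [x for t in tables for x in t[token]]  # single vector, length 8

print(len(stacked), len(spliced))  # 4 8
```

Repeating this for every token of a labeled corpus yields the input matrix that the predetermined algorithm trains on.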
In addition, it should be added that a lemma in this scheme may be a natural Chinese word computed by a word-segmentation algorithm, or a word sequence obtained by N-gram segmentation; this does not affect the protection scope of the scheme.
Therefore, according to this embodiment, a plurality of basic lemmas for classification are obtained from the text to be classified, the lemma pinyin corresponding to the basic lemmas is determined, and finally the category of the text to be classified is determined from the plurality of basic lemmas and the corresponding lemma pinyin, using the text classification model, trained as described above, that classifies by lemmas and pinyin. The text classification model itself is trained on the Chinese words of the corpora and their corresponding Chinese pinyin. The purpose of training a classification model and classifying texts by integrating Chinese words and pinyin is thus achieved; compared with model training and text classification using only the text or only its pinyin, the technical effect of improving the accuracy, precision, and recall of the result is achieved. Furthermore, this solves the technical problem in the prior art that training a model and classifying texts purely by Chinese words or purely by Chinese pinyin allows noise such as homophonic wrongly-written characters, homophones, and homophonic characters introduced during input to affect the accuracy of the classification result.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Fig. 3 shows an apparatus 300 for classifying text according to the present embodiment; the apparatus 300 corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 3, the apparatus 300 includes: an obtaining module 310, configured to obtain a plurality of basic lemmas for classification from a text to be classified; a determining module 320, configured to determine the lemma pinyin corresponding to the basic lemmas; and a classification module 330, configured to determine the category of the text to be classified according to the plurality of basic lemmas and the corresponding lemma pinyin, using a pre-trained text classification model that classifies by lemmas and pinyin.
Optionally, the classification module 330 includes: the determining submodule is used for determining a plurality of basic word elements and word vectors corresponding to the pinyin of the corresponding word elements; and the first classification submodule is used for determining the classification of the text to be classified according to the word vector of the determined basic word element and the word vector of the corresponding word element pinyin by using a text classification model.
Optionally, the first classification submodule includes: the combination unit is used for combining the word vectors of a plurality of basic word elements and the word vectors of corresponding word element pinyin to generate combination information corresponding to the text to be classified; and the classification unit is used for determining the category of the text to be classified according to the combined information by utilizing the text classification model.
Optionally, a combination unit comprising: the first arrangement subunit is used for arranging the word vectors of the basic word elements according to the original text sequence of the text to be classified; and the second arrangement subunit is used for arranging the word vectors of the pinyin of the word elements behind the word vectors of the basic word elements according to the sequence corresponding to the basic word elements.
Optionally, the combination unit further comprises: the third arrangement subunit is used for arranging word vectors of the word element pinyin corresponding to the plurality of basic word elements according to the sequence; and the fourth arrangement subunit is used for arranging the word vectors of the plurality of basic word elements behind the word vector of the word element pinyin according to the sequence corresponding to the word element pinyin.
Optionally, the combination unit further comprises: and the fifth arrangement subunit is used for arranging the word vectors of the basic word elements and the word vectors of the word element pinyin in a crossed manner, wherein the word vectors of the word element pinyin are adjacently arranged before or behind the word vectors of the corresponding basic word elements.
Optionally, the classification module 330 further includes: and the second classification submodule is used for inputting the word vector of the basic word element and the word vector of the corresponding word element pinyin into the text classification model respectively as different inputs and determining the category of the text to be classified.
Therefore, according to this embodiment, the apparatus 300 for classifying text achieves the purpose of classifying text by integrating Chinese words and Chinese pinyin; compared with classifying text using only the text or only its pinyin, the technical effect of improving the accuracy, precision, and recall of the result is achieved. It also solves the technical problem in the prior art that classifying text purely by Chinese words or purely by Chinese pinyin allows noise such as homophonic wrongly-written characters, homophones, and homophonic characters introduced during input to affect the accuracy of the classification result.
Example 3
Fig. 4 shows an apparatus 400 for classifying text according to the present embodiment, the apparatus 400 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 4, the apparatus 400 includes: a processor 410; and a memory 420 coupled to the processor 410 for providing instructions to the processor 410 to process the following process steps: acquiring a plurality of basic word elements for classification from a text to be classified; determining the word element pinyin corresponding to the basic word elements; and determining the category of the text to be classified according to a plurality of basic word elements and corresponding word element pinyins by utilizing a pre-trained text classification model classified according to the word elements and the pinyins.
Optionally, the operation of determining the category of the text to be classified includes: determining a plurality of basic word elements and word vectors corresponding to the pinyin of the corresponding word elements; and determining the category of the text to be classified according to the word vector of the determined basic word element and the word vector of the corresponding word element pinyin by using a text classification model.
Optionally, the operation of determining the category of the text to be classified according to the determined word vector of the basic word element and the word vector of the corresponding word element pinyin comprises: combining the word vectors of a plurality of basic word elements and the word vectors of corresponding word element pinyin to generate combined information corresponding to the text to be classified; and determining the category of the text to be classified according to the combined information by using the text classification model.
Optionally, the memory 420 is further configured to provide the processor 410 with instructions to process the following processing steps: generating the combined information according to the following operations: arranging the word vectors of a plurality of basic word elements according to the original text sequence of the text to be classified; and arranging the word vectors of the word element pinyin behind the word vectors of the basic word elements according to the sequence corresponding to the basic word elements.
Optionally, the memory 420 is further configured to provide the processor 410 with instructions to process the following processing steps: generating the combined information according to the following operations: arranging word vectors of the word element pinyin corresponding to a plurality of basic word elements according to a sequence; and arranging the word vectors of the plurality of basic word elements behind the word vector of the word element pinyin according to the sequence corresponding to the word element pinyin.
Optionally, the memory 420 is further configured to provide the processor 410 with instructions to process the following processing steps: generating the combined information according to the following operations: arranging the word vectors of the basic word elements and the word vectors of the word element pinyin in a crossed mode, wherein the word vectors of the word element pinyin are adjacently arranged before or after the word vectors of the corresponding basic word elements.
Optionally, the operation of determining the category of the text to be classified further includes: and respectively taking the word vector of the basic word element and the word vector of the corresponding word element pinyin as different inputs, inputting the different inputs into the text classification model, and determining the category of the text to be classified.
Therefore, according to this embodiment, the apparatus 400 for classifying text achieves the purpose of classifying text by integrating Chinese words and Chinese pinyin; compared with classifying text using only the text or only its pinyin, the technical effect of improving the accuracy, precision, and recall of the result is achieved. It also solves the technical problem in the prior art that classifying text purely by Chinese words or purely by Chinese pinyin allows noise such as homophonic wrongly-written characters, homophones, and homophonic characters introduced during input to affect the accuracy of the classification result.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A method of classifying text, comprising:
acquiring a plurality of basic word elements for classification from a text to be classified;
determining the word element pinyin corresponding to the basic word elements; and
and determining the category of the text to be classified according to the plurality of basic word elements and the corresponding word element pinyin by utilizing a pre-trained text classification model classified according to the word elements and the pinyin.
2. The method according to claim 1, wherein the operation of determining the category of the text to be classified comprises:
determining word vectors corresponding to the multiple basic word elements and corresponding word element pinyin respectively; and
and determining the category of the text to be classified according to the word vector of the determined basic word element and the word vector of the corresponding word element pinyin by using the text classification model.
3. The method of claim 2, wherein the operation of determining the category of the text to be classified according to the determined word vector of the basic lemma and the word vector of the corresponding lemma pinyin comprises:
combining the word vectors of the basic word elements and the word vectors of the corresponding word element pinyin to generate combined information corresponding to the text to be classified; and
and determining the category of the text to be classified according to the combined information by using the text classification model.
4. The method of claim 3, further comprising generating the combined information according to the operations of:
arranging the word vectors of the basic word elements according to the original text sequence of the text to be classified; and
and arranging the word vectors of the word element pinyin behind the word vectors of the basic word elements according to the sequence corresponding to the basic word elements.
5. The method of claim 3, further comprising generating the combined information according to the operations of:
arranging word vectors of the word element pinyin corresponding to the plurality of basic word elements according to the sequence; and
and arranging the word vectors of the basic word elements behind the word vector of the word element pinyin according to the sequence corresponding to the word element pinyin.
6. The method of claim 3, further comprising generating the combined information according to the operations of:
arranging the word vectors of the basic word elements and the word vectors of the word element pinyin in a crossed mode, wherein the word vectors of the word element pinyin are adjacently arranged before or after the word vectors of the corresponding basic word elements.
7. The method of claim 2, wherein determining the category of the text to be classified further comprises:
and respectively taking the word vector of the basic word element and the word vector of the corresponding word element pinyin as different inputs, inputting the different inputs into corresponding classification models, and determining the category of the text to be classified.
8. A storage medium comprising a stored program, wherein the method of any one of claims 1 to 7 is performed by a processor when the program is run.
9. An apparatus for classifying text, comprising:
the acquisition module is used for acquiring a plurality of basic word elements for classification from the text to be classified;
the determining module is used for determining the word element pinyin corresponding to the basic word element; and
and the classification module is used for determining the category of the text to be classified according to the plurality of basic word elements and the corresponding word element pinyin by utilizing a pre-trained text classification model classified according to the word elements and the pinyin.
10. An apparatus for classifying text, comprising:
a processor; and
a memory coupled to the processor for providing instructions to the processor for processing the following processing steps:
acquiring a plurality of basic word elements for classification from a text to be classified;
determining the word element pinyin corresponding to the basic word elements; and
and determining the category of the text to be classified according to the plurality of basic word elements and the corresponding word element pinyin by utilizing a pre-trained text classification model classified according to the word elements and the pinyin.
CN201910684756.5A 2019-07-26 2019-07-26 Method, device and storage medium for classifying texts Pending CN112364159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910684756.5A CN112364159A (en) 2019-07-26 2019-07-26 Method, device and storage medium for classifying texts


Publications (1)

Publication Number Publication Date
CN112364159A true CN112364159A (en) 2021-02-12

Family

ID=74516367


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107577662A (en) * 2017-08-08 2018-01-12 上海交通大学 Towards the semantic understanding system and method for Chinese text
CN109189901A (en) * 2018-08-09 2019-01-11 北京中关村科金技术有限公司 Automatically a kind of method of the new classification of discovery and corresponding corpus in intelligent customer service system
CN109857868A (en) * 2019-01-25 2019-06-07 北京奇艺世纪科技有限公司 Model generating method, file classification method, device and computer readable storage medium
CN109977361A (en) * 2019-03-01 2019-07-05 广州多益网络股份有限公司 A kind of Chinese phonetic alphabet mask method, device and storage medium based on similar word



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination