CN111767728A - Short text classification method, device, equipment and storage medium - Google Patents


Info

Publication number
CN111767728A
Authority
CN
China
Prior art keywords
short text; target; determining; target short; frequency vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010604146.2A
Other languages
Chinese (zh)
Inventor
张瑾
庞敏辉
杨舰
冯博豪
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010604146.2A
Publication of CN111767728A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Abstract

The application discloses a short text classification method, apparatus, device and storage medium, relating to the technical fields of natural language processing and deep learning. The specific implementation scheme is as follows: acquire a target short text; determine the target high-frequency vocabulary in the target short text according to the target short text and a preset high-frequency vocabulary set; determine the position information of the target high-frequency vocabulary in the target short text; determine a sentence vector of the target short text; and classify the target short text according to the position information and the sentence vector. The method can make full use of the context information and external knowledge of the short text and improve the accuracy of short text classification.

Description

Short text classification method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, specifically to the fields of natural language processing and deep learning, and in particular to a short text classification method, apparatus, device and storage medium.
Background
Banks and financial institutions handle large volumes of bill auditing work. Practitioners typically classify bills according to their contents and enter the element information they contain, such as time, amount, purpose and personal names, into a system as the basis for financial reimbursement and the like. Processing massive numbers of bills consumes considerable manpower and financial resources, so efficient automatic auditing and automatic extraction of the element information in bills is one of the research hotspots in the intelligent finance field.
Disclosure of Invention
A short text classification method, apparatus, device and storage medium are provided.
According to a first aspect, there is provided a short text classification method, comprising: acquiring a target short text; determining a target high-frequency vocabulary in the target short text according to the target short text and a preset high-frequency vocabulary set; determining the position information of the target high-frequency vocabulary in the target short text; determining a sentence vector of the target short text; and classifying the target short text according to the position information and the sentence vector.
According to a second aspect, there is provided a short text classification apparatus comprising: a short text acquisition unit configured to acquire a target short text; the high-frequency vocabulary determining unit is configured to determine target high-frequency vocabularies in the target short text according to the target short text and a preset high-frequency vocabulary set; a position information determining unit configured to determine position information of the target high-frequency vocabulary in the target short text; a sentence vector determination unit configured to determine a sentence vector of the target short text; and the short text classification unit is configured to classify the target short text according to the position information and the sentence vector.
According to a third aspect, there is provided a short text classification device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in the first aspect.
The short text classification method and apparatus of the present application solve the technical problem that existing short text classification, constrained by the shortness of the text, cannot achieve a good classification effect: they make full use of the context information and external knowledge of the short text and improve the accuracy of short text classification.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a short text classification method according to the present application;
FIG. 3 is a diagram illustrating an application scenario of the short text classification method according to the present application;
FIG. 4 is a flow diagram of another embodiment of a short text classification method according to the present application;
FIG. 5 is a schematic diagram of an embodiment of a short text classification device according to the application;
fig. 6 is a block diagram of an electronic device for implementing the short text classification method according to the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the short text classification method or short text classification apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as shopping applications, voice recognition applications, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102 and 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smartphones, tablet computers, e-book readers, car computers, laptop computers and desktop computers. When they are software, they can be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.
The server 105 may be a server providing various services, such as a background server providing language models on the terminal devices 101, 102, 103. The background server may train the initial language model by using the training samples to obtain a target language model, and feed the target language model back to the terminal devices 101, 102, and 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the short text classification method provided by the embodiment of the present application is generally performed by the server 105. Accordingly, the short text classification device is generally provided in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a short text classification method according to the present application is shown. The short text classification method of the embodiment comprises the following steps:
step 201, acquiring a target short text.
In this embodiment, the execution subject of the short text classification method (for example, the server 105 shown in fig. 1) may acquire the target short text in various ways, for example, by recognizing it in an image, or by obtaining it from a social platform. Here, the target short text may be a text whose number of words is less than a preset threshold, and a technician may set this threshold according to the actual application scenario. In some specific application scenarios, the target short text may include a business name, a person's name, and the like.
Step 202, determining a target high-frequency vocabulary in the target short text according to the target short text and a preset high-frequency vocabulary set.
After the target short text is obtained, the execution subject may compare it with a preset high-frequency vocabulary set and determine the target high-frequency vocabulary contained in it. The high-frequency vocabulary set includes a plurality of high-frequency words; here, a high-frequency word is a word whose number of occurrences in texts of the same category as the target short text is greater than a preset number. For example, if the target short text is a business name such as "XXXX limited", the execution subject may first determine a plurality of texts in the business category, then count the words occurring more than N times (N being a natural number) in those texts, and use these words as high-frequency words.
The execution subject may perform word segmentation on the target short text to obtain the words contained in it, and then determine those of the words that appear in the high-frequency vocabulary set as the target high-frequency vocabulary.
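The two operations above, segmenting the target short text and intersecting the result with the high-frequency vocabulary set, can be sketched as follows. This is a minimal illustration; the vocabulary, the sample text and the `find_target_high_freq_words` helper are all hypothetical, and a real system would use a proper Chinese word segmenter rather than the greedy substring match that stands in for segmentation here.

```python
# Sketch of step 202: find the high-frequency words a target short text
# contains. A greedy longest-match against the known vocabulary stands
# in for real word segmentation. Vocabulary and sample text are
# hypothetical.
HIGH_FREQ_VOCAB = {"flagship store", "exclusive store", "makeup"}

def find_target_high_freq_words(short_text, vocab):
    """Return the high-frequency words found in the short text, longest first."""
    found = []
    for word in sorted(vocab, key=len, reverse=True):
        if word in short_text:
            found.append(word)
    return found

print(find_target_high_freq_words("XX makeup flagship store", HIGH_FREQ_VOCAB))
# ['flagship store', 'makeup']
```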
Step 203, determining the position information of the target high-frequency vocabulary in the target short text.
After determining the target high-frequency vocabulary, the execution subject can determine its position information in the target short text. Specifically, the execution subject may determine the position of the first character of the target high-frequency vocabulary relative to the first character of the target short text. Alternatively, the execution subject may determine the position of the last character of the target high-frequency vocabulary relative to the last character of the target short text.
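Taking the first variant, the position information is simply the offset of each high-frequency word's first character from the first character of the short text. A minimal sketch; the `first_char_positions` helper and the sample inputs are illustrative assumptions:

```python
# Sketch of step 203 (first variant): offset of each target
# high-frequency word's first character from the first character of
# the short text.
def first_char_positions(short_text, target_words):
    return [short_text.find(word) for word in target_words]

positions = first_char_positions("XX makeup flagship store", ["makeup", "flagship store"])
print(positions)  # [3, 10]
```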
Step 204, determining a sentence vector of the target short text.
The execution subject may also determine a sentence vector for the target short text. Specifically, it may determine the sentence vector from the word vectors of the high-frequency words contained in the target short text, or alternatively from the word vectors of all the words contained in the target short text.
Step 205, classifying the target short text according to the position information and the sentence vector.
After obtaining the position information and the sentence vector, the execution subject can concatenate them into a combined vector and input that vector into a classification model to determine the category of the target short text. When concatenating, the position information may be appended directly after the sentence vector, or the position information and the sentence vector may each be truncated first and then concatenated. The classification model can be implemented as a softmax layer, or with algorithms such as support vector machines, naive Bayes or random forests.
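The simplest concatenation scheme, appending the position feature after the sentence vector, amounts to the following sketch. The numbers and the `build_feature` helper are illustrative assumptions, and the classifier that would consume the result is omitted here.

```python
# Sketch of step 205's feature construction: append the position
# information after the sentence vector to form the combined vector
# that is fed to the classification model.
def build_feature(position, sentence_vector):
    return sentence_vector + [float(position)]

feat = build_feature(3, [0.1, -0.4, 0.7])
print(feat)  # [0.1, -0.4, 0.7, 3.0]
```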
With continued reference to fig. 3, a schematic diagram of an application scenario of the short text classification method according to the present application is shown. In the application scenario of fig. 3, after receiving an invoice image sent by the terminal 302, the server 301 performs character recognition on the image and obtains the short text "one-to-two home private stores" from the invoice. After the processing of steps 201 to 205, this short text is determined to be a merchant name.
The short text classification method provided by the embodiment of the application makes full use of the context information and the external knowledge of the short text, and improves the accuracy of short text classification.
With continued reference to FIG. 4, a flow 400 of another embodiment of a short text classification method according to the present application is shown. As shown in fig. 4, the short text classification method of this embodiment may include the following steps:
step 401, a plurality of shop names are obtained from a preset website.
In this embodiment, the execution subject may obtain a plurality of shop names from a preset website, which may be an e-commerce website or a website for registering store information. Taking an e-commerce website as an example, the execution subject can use a crawler to crawl the names of shops selling different categories of goods on the website. For the makeup category, for example, the shop names may include "XX makeup exclusive store" and "XX makeup flagship store"; for the large home-appliance category, the shop names may include "XX appliance exclusive store" and the like.
Step 402, determining a high-frequency vocabulary set according to a plurality of shop names and preset custom words.
In this embodiment, after crawling a plurality of shop names, the execution subject may determine the high-frequency vocabulary in them in combination with preset custom words. In particular, a shop name may contain a name that is not easily recognized as a single word during processing. The user can preset custom words for such names so that the execution subject can recognize them. For example, for the shop name "one-two-home-dedicated store", the execution subject might not recognize "one-two" as a word when segmenting the name; by presetting the custom word "one-two", the user ensures that the execution subject can recognize it.
After accurately recognizing the words contained in the shop names, the execution subject may count them to obtain the number of occurrences of each word, take the words whose occurrence count is greater than a preset threshold as high-frequency words, and add them to the high-frequency vocabulary set.
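The counting step can be sketched with a word counter over the segmented shop names. The names, the threshold and the resulting set below are all illustrative assumptions.

```python
from collections import Counter

# Sketch of the counting in step 402: tally the words recognized in the
# crawled shop names and keep those occurring more often than a preset
# threshold. Names and threshold are illustrative.
segmented_names = [
    ["XX", "makeup", "exclusive store"],
    ["YY", "makeup", "flagship store"],
    ["ZZ", "appliance", "exclusive store"],
]
THRESHOLD = 1  # keep words that appear more than once

counts = Counter(word for name in segmented_names for word in name)
high_freq = {word for word, c in counts.items() if c > THRESHOLD}
print(sorted(high_freq))  # ['exclusive store', 'makeup']
```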
It should be understood that steps 401 to 402 may also be executed by another electronic device, for example a server of an e-commerce website. After determining the high-frequency vocabulary set, that device can send the set to the execution subject of the short text classification method.
In some optional implementations of the embodiment, the preset websites include different categories of store names. The step 402 can be implemented by the following steps not shown in fig. 4: for each category, determining a high-frequency vocabulary subset of the category according to the shop name and the custom word of the category; and determining a high-frequency vocabulary set according to each high-frequency vocabulary subset.
In this implementation, the executive may determine the store name for each category, and then determine the high-frequency vocabulary subset for that category in combination with the custom word. Then, according to each high-frequency vocabulary subset, a high-frequency vocabulary set is determined. Specifically, the execution main body may merge the high-frequency vocabulary subsets and then remove duplication, or the execution main body may take an intersection of the high-frequency vocabulary subsets. And taking the obtained set as a high-frequency vocabulary set after the processing.
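Merging the per-category subsets into the final high-frequency vocabulary set can be done either by a deduplicated union or by an intersection, as described above. A minimal sketch with two hypothetical category subsets:

```python
# Sketch of the per-category merge: the final high-frequency vocabulary
# set may be the deduplicated union of the category subsets, or their
# intersection. The two subsets below are hypothetical.
makeup_subset = {"flagship store", "exclusive store", "makeup"}
appliance_subset = {"flagship store", "exclusive store", "appliance"}

union_set = makeup_subset | appliance_subset         # merge, then deduplicate
intersection_set = makeup_subset & appliance_subset  # words common to all categories

print(sorted(intersection_set))  # ['exclusive store', 'flagship store']
```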
Step 403, acquiring the target short text.
And step 404, determining a target high-frequency vocabulary in the target short text according to the target short text and a preset high-frequency vocabulary set.
The principle of steps 403 to 404 is similar to that of steps 201 to 202, and the description thereof is omitted.
Step 405, determining the position information of the first character of the target high-frequency vocabulary in the target short text.
After determining the target high-frequency vocabulary, the execution subject may determine the position information of the first character of each target high-frequency vocabulary in the target short text. Specifically, the execution subject may take the distance between the first character of each target high-frequency vocabulary and the first character of the target short text as the position information.
Step 406, in response to determining that the high frequency vocabulary is not included in the target short text, setting the location information to a default value.
In this embodiment, when the target short text is a person's name or another very short text, it may contain no high-frequency word, since its words are few and unrelated to one another. In that case, the execution subject may set the position information to a default value, such as -1 or 0.
Step 407, determining the sentence vector of the target short text according to the target short text and a pre-trained sentence vector generation model.
In this embodiment, the execution subject may determine the sentence vector of the target short text with a pre-trained sentence vector generation model. The model can be implemented with various algorithms; specifically, Gensim's Doc2Vec algorithm may be used. Doc2Vec, also called Paragraph Vector, was proposed by Quoc Le and Tomas Mikolov as an extension of the word2vec model. It has several advantages: it accepts sentences of different lengths as training samples without requiring a fixed sentence length, and it is an unsupervised learning algorithm that learns a vector to represent each document, a structure that overcomes some defects of the bag-of-words model. Doc2Vec has two model variants: the distributed memory model of paragraph vectors (PV-DM) and the distributed bag-of-words model of paragraph vectors (PV-DBOW). This embodiment may use either model to determine the sentence vector.
Step 408, concatenating the position information and the sentence vector to obtain a combined vector.
After obtaining the position information and the sentence vector, the execution subject can concatenate them to obtain the combined vector.
Step 409, classifying the target short text according to the combined vector.
The execution subject may input the combined vector into a classification model to determine the category of the target short text. In practice, a random forest is preferably used as the classifier; with its parameters reasonably set, the accuracy can reach 97.1%.
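A sketch of this final classification step with scikit-learn's random forest, trained on synthetic combined vectors (a sentence vector plus a position feature). The data, dimensions and parameters below are invented for illustration and are not the configuration behind the 97.1% figure quoted above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic combined vectors: an 8-dimensional "sentence vector" plus
# one position feature, for two well-separated classes (toy data).
rng = np.random.default_rng(0)
X = np.vstack([
    np.hstack([rng.normal(0.0, 1.0, (50, 8)), np.zeros((50, 1))]),      # class 0
    np.hstack([rng.normal(2.0, 1.0, (50, 8)), np.full((50, 1), 3.0)]),  # class 1
])
y = np.array([0] * 50 + [1] * 50)

# Fit a random forest on the combined vectors and classify a sample.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:1]))
```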
The short text classification method provided by this embodiment of the application determines the high-frequency words in the short text by making full use of short text knowledge from external websites, and uses a sentence vector of the short text, which encodes its context information. Combining the position information of the high-frequency vocabulary in the short text with the sentence vector makes the classification result more accurate.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of a short text classification apparatus, which corresponds to the method embodiment shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 5, the short text classification apparatus 500 of the present embodiment includes: a short text acquisition unit 501, a high-frequency vocabulary determination unit 502, a position information determination unit 503, a sentence vector determination unit 504, and a short text classification unit 505.
A short text acquisition unit 501 configured to acquire a target short text.
The high-frequency vocabulary determining unit 502 is configured to determine a target high-frequency vocabulary in the target short text according to the target short text and a preset high-frequency vocabulary set.
A position information determining unit 503 configured to determine position information of the target high-frequency vocabulary in the target short text.
A sentence vector determination unit 504 configured to determine a sentence vector of the target short text.
And a short text classification unit 505 configured to classify the target short text according to the position information and the sentence vector.
In some optional implementations of this embodiment, the location information determining unit 503 may be further configured to: and determining the position information of the first character of the target high-frequency vocabulary in the target short text.
In some optional implementations of this embodiment, the location information determining unit 503 may be further configured to: in response to determining that the high frequency vocabulary is not included in the target short text, the location information is set to a default value.
In some optional implementations of this embodiment, the sentence vector determination unit 504 may be further configured to: and determining a sentence vector of the target short text according to the target short text and a pre-trained sentence vector generation model.
In some optional implementations of this embodiment, the short text classification unit 505 may be further configured to: concatenate the position information and the sentence vector to obtain a combined vector; and classify the target short text according to the combined vector.
In some optional implementations of this embodiment, the apparatus 500 may further include a vocabulary set determining unit, not shown in fig. 5, configured to: acquiring a plurality of shop names from a preset website; and determining a high-frequency vocabulary set according to the plurality of shop names and preset custom words.
In some alternative implementations of the present embodiment, the preset websites include different categories of store names. The vocabulary set determination unit is further configured to: for each category, determining a high-frequency vocabulary subset of the category according to the shop name and the custom word of the category; and determining a high-frequency vocabulary set according to each high-frequency vocabulary subset.
It should be understood that units 501 to 505 recited in the short text classification device 500 correspond to respective steps in the method described with reference to fig. 2. Thus, the operations and features described above for the short text classification method are also applicable to the apparatus 500 and the units included therein, and are not described herein again.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 6, it is a block diagram of an electronic device that executes a short text classification method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, along with multiple memories. Also, multiple electronic devices may be connected, with each device providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of performing short text classification provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of performing short text classification provided herein.
The memory 602, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the execution of the short text classification method in the embodiment of the present application (for example, the short text acquisition unit 501, the high-frequency vocabulary determination unit 502, the position information determination unit 503, the sentence vector determination unit 504, and the short text classification unit 505 shown in fig. 5). The processor 601 executes various functional applications and data processing of the server by running non-transitory software programs, instructions and modules stored in the memory 602, namely, implementing the short text classification method in the above method embodiment.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the electronic device performing short text classification, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 602 optionally includes memory located remotely from processor 601, which may be connected via a network to an electronic device that performs short text classification. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device performing the short text classification method may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device performing short text classification; examples of such input devices include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output device 604 may include a display device, an auxiliary lighting device (e.g., an LED), a tactile feedback device (e.g., a vibrating motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the short text classification method and apparatus of the present application, the context information and external knowledge of the short text are fully utilized, thereby improving the accuracy of short text classification.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order; this is not limited herein, as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (16)

1. A short text classification method, comprising:
acquiring a target short text;
determining a target high-frequency vocabulary in the target short text according to the target short text and a preset high-frequency vocabulary set;
determining the position information of the target high-frequency vocabulary in the target short text;
determining a sentence vector of the target short text;
and classifying the target short text according to the position information and the sentence vector.
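As an illustration only (not part of the claims), the pipeline of claims 1-5 can be sketched as follows. The vocabulary set, the toy sentence-vector function, and the classifier below are hypothetical stand-ins for the preset high-frequency vocabulary set, the pre-trained sentence vector generation model, and the trained classifier referred to in the claims.

```python
def find_target_word(text, high_freq_vocab):
    """Return the first high-frequency word found in the text, or None."""
    for word in high_freq_vocab:
        if word in text:
            return word
    return None

def word_position(text, word, default=-1):
    """Position of the word's first character in the text (cf. claim 2),
    or a default value when no high-frequency word is present (cf. claim 3)."""
    if word is None:
        return default
    return text.index(word)

def sentence_vector(text, dim=4):
    """Toy character-frequency embedding standing in for the
    pre-trained sentence vector generation model of claim 4."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec

def classify(text, high_freq_vocab, clf):
    """End-to-end sketch of claim 1: locate a target high-frequency word,
    compute its position and the sentence vector, splice, and classify."""
    word = find_target_word(text, high_freq_vocab)
    pos = word_position(text, word)
    vec = sentence_vector(text)
    spliced = [float(pos)] + vec  # splicing of claim 5
    return clf(spliced)
```

Splicing the position feature onto the sentence vector lets the classifier exploit where a category-indicative word occurs in the short text, in addition to the text's overall semantics.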
2. The method of claim 1, wherein the determining the position information of the target high-frequency vocabulary in the target short text comprises:
determining the position information of the first character of the target high-frequency vocabulary in the target short text.
3. The method of claim 1, wherein the determining the position information of the target high-frequency vocabulary in the target short text comprises:
setting the position information to a default value in response to determining that the target short text does not include any high-frequency vocabulary.
4. The method of claim 1, wherein the determining a sentence vector of the target short text comprises:
determining a sentence vector of the target short text according to the target short text and a pre-trained sentence vector generation model.
5. The method of claim 1, wherein the classifying the target short text according to the position information and the sentence vector comprises:
splicing the position information and the sentence vector to obtain a spliced vector;
and classifying the target short text according to the splicing vector.
6. The method of claim 1, wherein the set of high frequency words is obtained by:
acquiring a plurality of shop names from a preset website;
and determining a high-frequency vocabulary set according to the shop names and preset custom words.
7. The method of claim 6, wherein the preset websites include store names of different categories; and
determining a high-frequency vocabulary set according to the shop names and preset custom words, wherein the determining comprises the following steps:
for each category, determining a high-frequency vocabulary subset of the category according to the shop name of the category and the custom word;
and determining the high-frequency vocabulary set according to each high-frequency vocabulary subset.
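A minimal sketch (for illustration only) of how the high-frequency vocabulary set of claims 6-7 might be built. The whitespace tokenization, the `top_k` frequency cutoff, and the sample shop names are assumptions, since the claims do not fix a particular frequency criterion.

```python
from collections import Counter

def high_freq_subset(shop_names, custom_words, top_k=3):
    """Build one category's high-frequency vocabulary subset (cf. claim 7):
    count tokens across the category's shop names, keep the top_k most
    frequent, and merge in the preset custom words."""
    counts = Counter()
    for name in shop_names:
        counts.update(name.split())
    frequent = {word for word, _ in counts.most_common(top_k)}
    return frequent | set(custom_words)

def build_vocab_set(names_by_category, custom_words):
    """Union of the per-category subsets (cf. claims 6-7)."""
    vocab = set()
    for category, names in names_by_category.items():
        vocab |= high_freq_subset(names, custom_words)
    return vocab
```

Building the subsets per category, then taking their union, keeps words that are frequent within a single category even when they are rare across the corpus as a whole.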
8. A short text classification apparatus comprising:
a short text acquisition unit configured to acquire a target short text;
the high-frequency vocabulary determining unit is configured to determine target high-frequency vocabularies in the target short text according to the target short text and a preset high-frequency vocabulary set;
a position information determining unit configured to determine position information of the target high-frequency vocabulary in the target short text;
a sentence vector determination unit configured to determine a sentence vector of the target short text;
a short text classification unit configured to classify the target short text according to the position information and the sentence vector.
9. The apparatus of claim 8, wherein the location information determining unit is further configured to:
determining the position information of the first character of the target high-frequency vocabulary in the target short text.
10. The apparatus of claim 8, wherein the location information determining unit is further configured to:
setting the position information to a default value in response to determining that the target short text does not include any high-frequency vocabulary.
11. The apparatus of claim 8, wherein the sentence vector determination unit is further configured to:
determining a sentence vector of the target short text according to the target short text and a pre-trained sentence vector generation model.
12. The apparatus of claim 8, wherein the short text classification unit is further configured to:
splicing the position information and the sentence vector to obtain a spliced vector;
and classifying the target short text according to the splicing vector.
13. The apparatus of claim 8, wherein the apparatus further comprises a vocabulary set determination unit configured to:
acquiring a plurality of shop names from a preset website;
and determining a high-frequency vocabulary set according to the shop names and preset custom words.
14. The apparatus of claim 13, wherein the preset websites include store names of different categories; and
the vocabulary set determination unit is further configured to:
for each category, determining a high-frequency vocabulary subset of the category according to the shop name of the category and the custom word;
and determining the high-frequency vocabulary set according to each high-frequency vocabulary subset.
15. A short text classification apparatus comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202010604146.2A 2020-06-29 2020-06-29 Short text classification method, device, equipment and storage medium Pending CN111767728A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010604146.2A CN111767728A (en) 2020-06-29 2020-06-29 Short text classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111767728A true CN111767728A (en) 2020-10-13

Family

ID=72722989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010604146.2A Pending CN111767728A (en) 2020-06-29 2020-06-29 Short text classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111767728A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744958A (en) * 2014-01-06 2014-04-23 同济大学 Webpage classification algorithm based on distributed computation
CN110069627A (en) * 2017-11-20 2019-07-30 中国移动通信集团上海有限公司 Classification method, device, electronic equipment and the storage medium of short text
WO2019214145A1 (en) * 2018-05-10 2019-11-14 平安科技(深圳)有限公司 Text sentiment analyzing method, apparatus and storage medium
CN111125356A (en) * 2019-11-29 2020-05-08 江苏艾佳家居用品有限公司 Text classification method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948584A (en) * 2021-03-03 2021-06-11 北京百度网讯科技有限公司 Short text classification method, device, equipment and storage medium
CN112948584B (en) * 2021-03-03 2023-06-23 北京百度网讯科技有限公司 Short text classification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination