CN115062150B - Text classification method and device, electronic equipment and storage medium - Google Patents

Text classification method and device, electronic equipment and storage medium

Info

Publication number
CN115062150B
CN115062150B (application CN202210733565.5A)
Authority
CN
China
Prior art keywords
vocabulary
text
expression
classification
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210733565.5A
Other languages
Chinese (zh)
Other versions
CN115062150A (en)
Inventor
解春欣
张金晶
吴荣强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210733565.5A priority Critical patent/CN115062150B/en
Publication of CN115062150A publication Critical patent/CN115062150A/en
Application granted granted Critical
Publication of CN115062150B publication Critical patent/CN115062150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text classification method and device, an electronic device, and a storage medium, which can be applied to the map field and relate to the technical field of natural language processing. The method comprises the following steps: performing vocabulary recombination processing based on a candidate word set obtained by vocabulary mining on a short text training set to obtain a vocabulary combination set, and performing text expression mining processing on the vocabulary combination set to obtain an initial text expression set; parsing each initial text expression contained in the initial text expression set to obtain a corresponding vocabulary index sequence; screening, from the initial text expression set, target text expressions respectively matched with a plurality of short texts in the short text training set, and setting the vocabulary index sequence associated with each target text expression and the classification identifier corresponding to the short text as a classification group; and obtaining a target classification rule base based on the obtained classification groups. In this way, a short text to be classified can be classified rapidly by using the target classification rule base, which improves classification efficiency.

Description

Text classification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text classification method, a device, an electronic device, and a storage medium.
Background
With the development of artificial intelligence and computer technology, more and more users choose to express and transmit information through text on the Internet, so scenarios involving text processing are becoming increasingly common, such as searching for locations in an electronic map based on short text or searching for videos in video software based on text.
Currently, short text classification is typically performed using deep learning models based on variants of the Bidirectional Encoder Representations from Transformers (BERT) model.
However, although deep learning models generalize well, they rely on a large amount of high-quality labeled data and a pre-trained model. When that labeled data does not include short texts of a new category, the classification accuracy for the new category is poor; improving it requires re-labeling data and determining a new pre-trained model, which makes the whole classification process cumbersome and the classification efficiency for short texts low.
In addition, since a deep learning model is trained as a black box, when the training object wants to adjust a particular classification rule in the model, that rule cannot be adjusted precisely, so the classification accuracy for the short texts governed by that rule remains low.
Obviously, short text classification under the related art therefore suffers from low classification efficiency and low classification accuracy.
Disclosure of Invention
The embodiment of the application provides a text classification method, a text classification device, electronic equipment and a storage medium, which are used for improving the efficiency and accuracy of classifying short texts.
In one aspect, a method for classifying text is provided, the method comprising:
performing vocabulary recombination processing based on a candidate word set obtained by performing vocabulary mining processing on a short text training set, to obtain a vocabulary combination set, and performing text expression mining processing on the vocabulary combination set to obtain an initial text expression set;
respectively carrying out grammar analysis on each initial text expression contained in the initial text expression set to obtain a corresponding vocabulary index sequence; each vocabulary index sequence includes an index position of at least one vocabulary in a corresponding initial text expression;
target text expressions respectively matched with a plurality of short texts in the short text training set are screened out from the initial text expression set, and the vocabulary index sequence associated with each target text expression and the classification identifier corresponding to the short text are set as a classification group;
based on the obtained classification groups, a target classification rule base is obtained, and the target classification rule base is used for classifying short texts to be classified.
In one aspect, there is provided a text classification apparatus, the apparatus comprising:
the processing unit is used for performing vocabulary recombination processing based on a candidate word set obtained by performing vocabulary mining processing on the short text training set, to obtain a vocabulary combination set, and performing text expression mining processing on the vocabulary combination set to obtain an initial text expression set;
the parsing unit is used for respectively carrying out grammar parsing on each initial text expression contained in the initial text expression set to obtain a corresponding vocabulary index sequence; each vocabulary index sequence includes an index position of at least one vocabulary in a corresponding initial text expression;
the screening unit is used for respectively screening target text expressions matched with a plurality of short texts in the short text training set from the initial text expression set, and setting a vocabulary index sequence associated with each target text expression and a classification identifier corresponding to the short text as a classification group;
the obtaining unit is used for obtaining a target classification rule base based on the obtained classification groups, wherein the target classification rule base is used for classifying short texts to be classified.
Optionally, the device further comprises a mining unit for:
performing vocabulary mining processing on the short text training set to obtain at least one character vocabulary group; each character vocabulary group comprises a plurality of vocabularies with the same number of characters, and the vocabularies belonging to different character vocabulary groups contain different numbers of characters;
screening the at least one character vocabulary group based on the vocabulary evaluation value corresponding to each character vocabulary group to obtain at least one first vocabulary group; each vocabulary evaluation value is determined based on the frequency of occurrence of the corresponding character vocabulary group in the plurality of short texts and the combined frequency of the other character vocabulary groups;
and selecting, from the at least one first vocabulary group, candidate words whose word frequencies in the plurality of short texts meet a first threshold, and obtaining the candidate word set based on each obtained candidate word.
Optionally, the mining unit is further configured to:
the following operations are respectively executed for each character vocabulary group:
selecting a first target evaluation value that meets a first screening condition from first evaluation values respectively corresponding to at least two sub-vocabulary groups obtained by splitting the character vocabulary group, wherein each first evaluation value characterizes the frequency of occurrence of the corresponding sub-vocabulary group in the plurality of short texts;
selecting a second target evaluation value which meets a second screening condition from second evaluation values which correspond to at least two associated vocabularies respectively and are obtained by carrying out association processing on the character vocabulary group; wherein each second evaluation value characterizes the frequency of occurrence of the corresponding associative vocabulary in the plurality of short texts;
based on the first target evaluation value and the second target evaluation value, a vocabulary evaluation value of the one character vocabulary group is determined.
Optionally, the mining unit is further configured to:
determining at least one splitting mode based on the number of words contained in the one character word group, wherein each splitting mode is used for splitting the one character word group into at least two sub word groups;
and splitting the character vocabulary group into at least two corresponding sub-vocabulary groups according to the at least one splitting mode.
Optionally, the mining unit is further configured to:
determining, for each vocabulary contained in the character vocabulary group, at least one associated vocabulary that takes that vocabulary as a suffix;
and determining, for each vocabulary contained in the character vocabulary group, at least one associated vocabulary that takes that vocabulary as a prefix.
Optionally, the processing unit is configured to:
constructing a target dictionary tree based on the candidate word set; the root node of the target dictionary tree is empty, and each leaf node except the root node contains a vocabulary;
the following is performed for each leaf node:
respectively obtaining a first path between the one leaf node and each other leaf node; each first path takes the vocabulary contained in the one leaf node as a starting vocabulary and the vocabulary contained in the corresponding other leaf node as an ending vocabulary, thereby forming a vocabulary combination;
based on the obtained vocabulary combinations, a vocabulary combination set is obtained.
Optionally, the processing unit is further configured to:
selecting, from the vocabulary combination set, the candidate vocabulary combinations whose occurrence frequencies meet a second threshold; wherein each occurrence frequency is determined based on the number of occurrences of the corresponding vocabulary combination in the plurality of short texts and the number of the plurality of short texts;
connecting, through at least one grammar symbol and according to a preset arrangement rule, the vocabularies in each candidate vocabulary combination with the vocabularies in the other candidate vocabulary combinations respectively, to obtain a plurality of first initial text expressions;
matching each first initial text expression with a preset text expression library to obtain a corresponding second initial text expression containing the first initial text expression;
based on each of the obtained first initial text expression and second initial text expression, an initial text expression set is obtained.
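For illustration only, one possible simplified reading of this construction is that the starting vocabulary and the ending vocabulary of a candidate vocabulary combination are connected by a wildcard-style grammar symbol, yielding regular-expression-like first initial text expressions. The grammar symbol ".*" and the function name in the Python sketch below are assumptions rather than the syntax actually used in this application.

from typing import List, Tuple

def build_first_initial_expressions(
    candidate_combinations: List[Tuple[str, str]],
    grammar_symbol: str = ".*",  # assumed wildcard-style grammar symbol
) -> List[str]:
    # Connect the starting and ending vocabulary of each combination with the grammar symbol.
    return [f"{start}{grammar_symbol}{end}" for start, end in candidate_combinations]

Under this assumed reading, the combination ("company", "building") would yield the expression "company.*building".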
Optionally, the parsing unit is configured to:
converting each initial text expression contained in the initial text expression set into a corresponding grammar tree; each grammar tree comprises a plurality of grammar nodes, and each grammar node comprises a grammar symbol or vocabulary;
simplifying the grammar nodes of each grammar tree, and executing the following operations on the simplified initial text expressions:
determining an expression type corresponding to the simplified initial text expression, and determining a target index sequence construction rule based on the expression type and a mapping relation between the expression type and a preset index sequence construction rule;
And determining the corresponding relation between each vocabulary and the corresponding index in the simplified initial text expression based on the target index sequence construction rule, and obtaining a vocabulary index sequence based on the corresponding relation between each vocabulary and the corresponding index.
Optionally, the parsing unit is configured to:
the following operations are performed on each syntax tree:
when determining a grammar node containing a first grammar symbol existing in a grammar tree, removing the grammar node to obtain the first grammar tree;
and when determining grammar nodes containing second grammar symbols in the first grammar tree, performing recursion operation and removal operation on the grammar nodes to obtain a second grammar tree, and taking an expression corresponding to the second grammar tree as a simplified initial text expression.
Optionally, the obtaining unit is configured to:
respectively determining a matching frequency corresponding to each classification group, where each matching frequency characterizes the frequency with which the vocabulary index sequence in the corresponding classification group is matched to the corresponding short texts among the plurality of short texts;
selecting, from the obtained matching frequencies, target matching frequencies that are not smaller than a third threshold;
and constructing the target classification rule base based on the candidate classification groups corresponding to the obtained target matching frequencies.
In one aspect, an electronic device is provided that includes a processor and a memory, wherein the memory stores program code that, when executed by the processor, causes the processor to perform the text classification method described above.
In one aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device performs the text classification method described above.
Embodiments of the present application also provide a computer-readable storage medium comprising program code, the program code being configured to cause an electronic device to perform the steps of any one of the text classification methods described above when the program code is run on the electronic device.
The beneficial effects of the application are as follows:
In the embodiments of the application, a text classification method and apparatus, an electronic device, and a storage medium are provided. Because the electronic device can perform vocabulary mining on the short text training set, perform vocabulary recombination processing on the obtained candidate word set, and then perform text expression mining processing on the vocabulary combination set, mining from single vocabularies to text expressions can be realized rapidly and efficiently. Further, to make the subsequent search for matched text expressions efficient, the electronic device may parse each initial text expression contained in the initial text expression set to obtain a corresponding vocabulary index sequence. Then, the vocabulary index sequence associated with each determined target text expression and the classification identifier corresponding to the short text are set as a classification group. In this way, a target classification rule base can be obtained based on the obtained classification groups. Therefore, when a short text to be classified is received, it can be matched against the vocabulary index sequences to determine its corresponding classification identifier, thereby improving both the classification efficiency and the classification accuracy for short texts.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described below, and it is apparent that the drawings in the following description are only embodiments of the present application, and other drawings may be obtained according to the provided drawings without inventive effort for a person having ordinary skill in the art.
Fig. 1 is an alternative schematic diagram of an application scenario in an embodiment of the present application;
FIG. 2 is a flow chart of a text classification method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of determining candidate word collections according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a vocabulary mining process according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a character vocabulary set according to an embodiment of the present application;
FIG. 6 is a schematic diagram of determining a sub-vocabulary group according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a target dictionary tree in an embodiment of the present application;
FIG. 8 is a schematic diagram of yet another target dictionary tree in an embodiment of the present application;
FIG. 9 is a schematic diagram of a syntax tree in an embodiment of the present application;
FIG. 10 is a schematic diagram of a mapping relationship between an expression type and a preset index sequence construction rule in an embodiment of the present application;
FIG. 11 is a schematic diagram of a text classification method in an embodiment of the present application;
fig. 12 is a schematic structural diagram of a text classification device according to an embodiment of the present application;
fig. 13 is a schematic diagram of a hardware composition structure of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure. Embodiments and features of embodiments in this application may be combined with each other arbitrarily without conflict. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.
The terms "first" and "second" in the description, claims, and figures of the present application are used to distinguish between different objects and not to describe a particular sequential order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements, but may include other steps or elements not listed or inherent to such a process, method, article, or apparatus. The term "plurality" in the present application means at least two, for example, two, three, or more, which is not limited by the embodiments of the present application.
In order to facilitate understanding of the technical solutions provided in the embodiments of the present application, some key terms related to the embodiments of the present application are explained here:
1. Short text: in the embodiments of the present application, a short text may be understood as text whose length does not exceed a preset number of characters. The preset number can be determined based on the actual implementation; for example, when the preset number is 100, a short text is text of no more than 100 characters. It should be noted that a character may be at least one of a Chinese character, an English character, and a character of another language, which is not limited in the embodiments of the present application.
2. Classification identifier: in the embodiments of the application, a classification identifier can be understood as an enumerable and exhaustible label used to represent the semantic classification of a short text under a specific scene. The classification identifier may be represented by one or more of various characters, such as connection symbols, numbers, letters, and words, or in other manners, such as graphics or patterns, which is not limited in the embodiments of the present application.
In addition, it should be noted that, in the embodiments of the present application, the classification identifier is a label of the semantic classification under a specific scene; that is, for the same short text, its classification identifier 1 in scenario 1 and its classification identifier 2 in scenario 2 may be the same or different.
3. Short text classification group: in embodiments of the present application, a short text classification group may be used to characterize a short text classification rule. Each short text classification group is a binary group containing a vocabulary index sequence corresponding to the text expression and classification identification.
4. Target classification rule base: a set of a plurality of short text classification groups.
5. Short text training set: in the embodiment of the application, the short text training set may be understood as a short text set including a plurality of training items, wherein each training item includes a short text and a classification identifier corresponding to the short text.
It should be noted that, the short text training set may be updated, for example, periodically or may be updated when a trigger condition is met (for example, a new training item reaches a threshold value, or a training object manually increases a training item), which is not limited in the embodiment of the present application. Furthermore, a short text training set may be understood as being determined for a particular scene, i.e. different scenes correspond to different short text training sets.
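For illustration, the terms defined above can be pictured with a minimal data layout in which a classification group is a two-element record pairing a vocabulary index sequence with a classification identifier, and the target classification rule base is a collection of such groups. The Python sketch below is only an assumed illustration of these definitions; all type and field names are hypothetical and are not taken from this application.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ClassificationGroup:
    # index positions of the vocabularies of one target text expression
    vocabulary_index_sequence: Tuple[int, ...]
    # enumerable label for the semantic class of the matched short text
    classification_identifier: str

# The target classification rule base is a set of classification groups.
TargetClassificationRuleBase = List[ClassificationGroup]

# A training item in the short text training set: (short text, classification identifier).
TrainingItem = Tuple[str, str]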
The intelligent transportation system (Intelligent Traffic System, ITS), also called intelligent transportation system (Intelligent Transportation System), is a comprehensive transportation system which uses advanced scientific technology (information technology, computer technology, data communication technology, sensor technology, electronic control technology, automatic control theory, operation study, artificial intelligence, etc.) effectively and comprehensively for transportation, service control and vehicle manufacturing, and enhances the connection among vehicles, roads and users, thereby forming a comprehensive transportation system for guaranteeing safety, improving efficiency, improving environment and saving energy.
The intelligent vehicle-road cooperative system (Intelligent Vehicle Infrastructure Cooperative Systems, IVICS), simply called the vehicle-road cooperative system, is one development direction of the Intelligent Transportation System (ITS). The vehicle-road cooperative system adopts advanced wireless communication, new-generation Internet, and other technologies, carries out all-round dynamic real-time vehicle-vehicle and vehicle-road information interaction, and develops vehicle active safety control and cooperative road management on the basis of full-time-and-space dynamic traffic information acquisition and fusion, thereby fully realizing effective cooperation among people, vehicles, and roads, ensuring traffic safety, improving traffic efficiency, and forming a safe, efficient, and environment-friendly road traffic system.
The following briefly describes the design concept of the embodiment of the present application:
At present, in an intelligent traffic system or an intelligent vehicle-road cooperative system, a user needs to input a short text into the vehicle-mounted system, and the system needs to classify that short text accurately and rapidly so as to determine the destination corresponding to the input text and thus provide services such as fast and accurate navigation.
In the related art, a deep learning model is generally used to classify short texts. However, such a model relies on a large amount of high-quality labeled data and a pre-trained model; when that labeled data does not include short texts of a new category, the classification accuracy for the new category is poor, and improving it requires re-labeling data and determining a new pre-trained model, which makes the whole classification process cumbersome and the classification efficiency for short texts low.
In addition, since a deep learning model is trained as a black box, when the training object wants to adjust a particular classification rule in the model, that rule cannot be adjusted precisely, so the classification accuracy for the short texts governed by that rule remains low.
Specifically, when electronic map Point of Interest (POI) data is classified automatically, the electronic map involves roughly 500 or more classification categories in total, and the categories corresponding to map POIs are diverse and highly specific, so a simple deep learning model in the related art cannot solve well the problem of classifying the short text data corresponding to electronic map POIs.
In summary, short text classification in the related art suffers from low classification efficiency and low classification accuracy.
In view of this, the present application provides a text classification method. Compared with the schemes in the related art, the text classification method in the present application requires no material other than the short text training set; therefore, when short texts of a new category need to be classified and recognized, only the short text training set needs to be updated and no other material needs to be determined, which reduces the steps in the classification process and improves classification efficiency.
Specifically, the embodiment of the application provides a text classification method, and an electronic device can perform vocabulary mining processing on a short text training set to obtain a candidate word set, then perform vocabulary recombination processing based on the candidate word set to obtain a vocabulary combined set, and then perform text expression mining processing on the vocabulary combined set to obtain an initial text expression set. And in order to quickly realize the construction of the target classification rule base, the electronic equipment can parse each initial text expression contained in the initial text expression set so as to obtain a corresponding vocabulary index sequence. Further, the electronic device may determine each classification group including the vocabulary index sequence and the classification identifier corresponding thereto, and obtain the target classification rule base based on each obtained classification group.
Therefore, in the embodiments of the application, the electronic device can quickly generate, based on the short text training set, a target classification rule base containing tens of thousands of high-precision classification rules, and can then classify the short text to be classified based on the target classification rule base, so that short texts can be classified quickly and accurately.
After the design concept of the embodiment of the present application is introduced, some simple descriptions are made below for application scenarios applicable to the technical solution of the embodiment of the present application, and it should be noted that the application scenarios described below are only used to illustrate the embodiment of the present application and are not limiting. In the specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The scheme provided by the embodiment of the application can be applied to any scene related to short text classification, such as cloud technology, artificial intelligence, intelligent transportation, auxiliary driving and the like.
Specifically, the application scenario provided in the embodiments of the present application may be a scenario related to classifying short texts, for example, searching for videos in video software based on a short text, searching for resources such as music or articles in an application based on a short text, or searching for locations in an electronic map based on a short text, which is not limited in the embodiments of the present application. Of course, the application scenario may also be one in which processing is performed based on the short text classification result, that is, after the short text is classified, related recommendation or subsequent guiding operations are performed; for example, the scheme may be applied to the automatic classification of electronic map POI data, an application scenario corresponding to an electronic map, or an intelligent traffic system or intelligent vehicle-road cooperative system, which is not limited in the embodiments of the present application.
Fig. 1 is a schematic diagram of an application scenario in an embodiment of the present application. The application scenario schematic diagram includes a terminal device 110 and a server 120, where the terminal device 110 and the server 120 may communicate through a communication network.
In the embodiment of the present application, the terminal device 110 is an electronic device used by a user, and the electronic device may be a personal computer, a mobile phone, a tablet computer, a notebook, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, an aircraft, or the like. Each terminal device 110 may communicate with the server 120 via a communication network.
In an alternative embodiment, the communication network may be a wired network or a Wireless network, for example, the Wireless network may be a mobile cellular network, or may be a Wireless-Fidelity (WIFI) network, or may be other possible networks, which is not limited in this embodiment. Accordingly, the terminal device 110 and the server 120 may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited herein.
The server 120 may be an independent physical server, an edge device in the cloud computing field, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud storage, cloud functions, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
It should be noted that the text classification method in the embodiments of the present application may be executed by the server 120 or the terminal device 110 alone, or jointly by the server 120 and the terminal device 110. When the server 120 and the terminal device 110 execute the method together, for example, the terminal device 110 determines a short text to be classified and sends it to the server 120; in response to receiving the short text to be classified, the server 120 classifies it based on the target classification rule base to obtain a classification result, and then sends the classification result to the terminal device 110 for display.
Specifically, the server 120 may obtain the target classification rule base in the following manner:
determining a candidate word set obtained based on vocabulary mining processing of a short text training set, performing vocabulary recombination processing to obtain a vocabulary combination set, and performing text expression mining processing on the vocabulary combination set to obtain an initial text expression set; respectively carrying out grammar analysis on each initial text expression contained in the initial text expression set to obtain a corresponding vocabulary index sequence; each vocabulary index sequence includes an index position of at least one vocabulary in a corresponding initial text expression; respectively screening target text expressions matched with a plurality of short texts in a short text training set from an initial text expression set, and setting a vocabulary index sequence associated with each target text expression and a classification identifier corresponding to the short text as a classification group; based on the obtained classification groups, a target classification rule base is obtained.
It should be noted that, in the actual implementation process, the server 120 may determine the short text training set, or the terminal device 110 may determine the short text training set and send the short text training set to the server 120, which is not limited in this embodiment of the present application. Furthermore, the short text training set may be updated based on actual implementation, for example, periodic update, or update based on manual triggering of the object, which is not limited in the embodiment of the present application. That is, the target classification rule base obtained in the application can be updated based on the update of the short text training set, so that the short text to be classified can be classified more accurately.
Hereinafter, the description mainly takes the case where the server alone executes the method as an example, which is not particularly limited herein.
Of course, the method provided in the embodiment of the present application is not limited to the application scenario shown in fig. 1, but may also be used in other possible application scenarios, for example, application scenarios where multiple terminal devices interact with multiple servers, which is not limited in the embodiment of the present application. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described together in the following method embodiments, which are not described in detail herein.
In order to further explain the technical solutions provided in the embodiments of the present application, details are described below with reference to the accompanying drawings and specific implementations. Although the embodiments of the present application provide the method operation steps shown in the following embodiments or figures, the method may include more or fewer operation steps based on routine or non-inventive work. For steps that have no logically necessary causal relationship, the execution order is not limited to the order provided in the embodiments of the present application. When executed in an actual process or by an apparatus, the method steps may be performed sequentially or in parallel according to the order shown in the embodiments or the drawings.
Referring to fig. 2, a flowchart of a text classification method is provided in an embodiment of the present application. As shown in fig. 2, the flow of the method may be performed by an electronic device, which may be the server shown in fig. 1.
The specific flow is as follows:
step 201: based on a candidate word set obtained by carrying out word mining processing on the short text training set, carrying out word recombination processing to obtain a word combination set, and carrying out text expression mining processing on the word combination set to obtain an initial text expression set.
In an embodiment of the present application, before performing step 201, the electronic device may perform vocabulary mining processing on the short text training set to determine the candidate word sets.
Referring to fig. 3, in a specific implementation, the electronic device may determine the candidate word set by, but not limited to, the following steps:
step 301: performing vocabulary mining processing on the short text training set to obtain at least one character vocabulary group; each character vocabulary group comprises a plurality of vocabularies with the same number of characters, and the vocabularies belonging to different character vocabulary groups contain different numbers of characters.
In the embodiment of the application, the electronic device may first determine a short text training set, where the short text training set includes at least one training item, and each training item includes a short text and a classification identifier corresponding to the short text. For example, one training item may be represented as (short text t, class identifier c).
In the embodiment of the application, after the electronic device determines the short text training set, vocabulary mining processing may be performed on the short text training set to obtain at least one character vocabulary group.
Specifically, the electronic device may continuously scan the training items in the short text training set and mine out each character vocabulary group that meets a preset frequency threshold. The preset frequency threshold may be determined based on the actual implementation, for example 4 or 10, which is not limited in the embodiments of the present application. The preset frequency threshold is used to screen out the vocabularies whose number of occurrences in all short texts of the short text training set is greater than the preset number; each such vocabulary is used as a vocabulary in a character vocabulary group. For example, if the word "building" occurs 50 times in the short text training set and the preset frequency threshold is 10, it may be determined as a vocabulary in the character vocabulary group whose vocabularies contain one character.
It can be seen that each character vocabulary group in the embodiment of the present application includes a plurality of vocabularies with the same number of characters, and the vocabularies belonging to different character vocabulary groups include different numbers of characters.
Referring to fig. 4, the character vocabulary group 1 containing only vocabularies of 1 character may be referred to as 1-gram, the character vocabulary group 2 containing vocabularies of 2 characters may be referred to as 2-gram, the character vocabulary group 3 containing vocabularies of 3 characters may be referred to as 3-gram, ..., and the character vocabulary group N containing vocabularies of N characters may be referred to as N-gram, where N is a positive integer and characterizes the maximum length limit of a vocabulary, i.e., the maximum number of characters.
For example, referring to fig. 5, assume that the short text in a training item is "company headquarters building" (a six-character phrase in Chinese). Vocabulary mining on this text yields, for the 1-gram group, the six individual characters of the phrase; for the 2-gram group, the vocabularies "company", "headquarters", and "building"; for the 4-gram group, vocabularies such as "company headquarters" and "headquarters building"; and for the 6-gram group, the vocabulary "company headquarters building".
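As a minimal sketch of this mining step, assuming a plain character n-gram count with a frequency threshold (function and parameter names below are hypothetical), the following Python code groups the mined vocabularies by character length:

from collections import Counter
from typing import Dict, List

def mine_character_vocabulary_groups(
    short_texts: List[str],
    max_len: int = 6,            # N: maximum vocabulary length in characters
    freq_threshold: int = 10,    # preset frequency threshold
) -> Dict[int, Counter]:
    # Return one character vocabulary group per n-gram length (1-gram .. N-gram).
    groups: Dict[int, Counter] = {n: Counter() for n in range(1, max_len + 1)}
    for text in short_texts:
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                groups[n][text[i:i + n]] += 1
    # keep only vocabularies whose occurrence count meets the threshold
    return {
        n: Counter({w: c for w, c in counts.items() if c >= freq_threshold})
        for n, counts in groups.items()
    }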
Step 302: screening at least one character vocabulary group based on the vocabulary evaluation value corresponding to each character vocabulary group to obtain at least one first vocabulary group; each vocabulary evaluation value is determined based on the frequency of occurrence of the corresponding character vocabulary group in the plurality of short texts and the frequency of combination with other character vocabulary groups.
In the embodiment of the application, after the electronic device determines at least one character vocabulary group, a vocabulary evaluation value corresponding to each character vocabulary group may be determined.
In the embodiment of the application, the electronic device may determine the vocabulary evaluation value corresponding to each character vocabulary group by adopting, but not limited to, the following steps. In the following, a description will be given of an example of determining a vocabulary evaluation value of one character vocabulary group, that is, in the implementation, the following operations may be performed for each character vocabulary group, respectively:
step a: splitting a character vocabulary group, and selecting a first target evaluation value which meets a first screening condition from the first evaluation values which are respectively corresponding to at least two sub-vocabulary groups; wherein each first evaluation value characterizes a frequency of occurrence of the corresponding sub-vocabulary group in the plurality of short texts.
In the embodiment of the application, the electronic device may determine at least one splitting manner based on the number of words included in one character vocabulary group, where each splitting manner is used to split one character vocabulary group into at least two sub-vocabulary groups; and splitting a character vocabulary group into at least two corresponding sub-vocabulary groups according to at least one splitting mode.
Alternatively, assuming that the character vocabulary group is represented by a k-gram, where k = 1, 2, …, N-1, the electronic device may split the k-gram into two parts, an a-gram (prefix) and a b-gram (suffix); that is, the a-gram may be understood as one sub-vocabulary group and the b-gram as another sub-vocabulary group.
For example, referring to fig. 6, the electronic device splits the character vocabulary group a into 2 sub-vocabulary groups, and the character vocabulary group a contains 10 vocabularies, so the splitting manners corresponding to the character vocabulary group a include: 1 vocabulary as the a-gram and 9 vocabularies as the b-gram; 2 vocabularies as the a-gram and 8 as the b-gram; 3 as the a-gram and 7 as the b-gram; 4 as the a-gram and 6 as the b-gram; and 5 as the a-gram and 5 as the b-gram.
Alternatively, assuming that the character vocabulary group is represented by a k-gram, where k = 1, 2, …, N-1, the electronic device may divide the k-gram into three parts, an a-gram, a b-gram, and a c-gram; that is, each of the a-gram, the b-gram, and the c-gram may be understood as one sub-vocabulary group. The sizes of a and b may be combined in various ways.
Assuming that the electronic device splits the character vocabulary group M into 3 sub-vocabulary groups and the character vocabulary group M contains 10 vocabularies, the number of vocabularies contained in each of the 3 sub-vocabulary groups can be determined, so that the multiple splitting manners corresponding to the character vocabulary group M are determined accordingly.
Specifically, assuming that the number of vocabularies corresponding to the b-gram is 2, the multiple splitting manners corresponding to the character vocabulary group M may be determined as: 1 vocabulary as the a-gram, 2 as the b-gram, and 7 as the c-gram; 2 as the a-gram, 2 as the b-gram, and 6 as the c-gram; 3 as the a-gram, 2 as the b-gram, and 5 as the c-gram; 4 as the a-gram, 2 as the b-gram, and 4 as the c-gram; 5 as the a-gram, 2 as the b-gram, and 3 as the c-gram; 6 as the a-gram, 2 as the b-gram, and 2 as the c-gram; and 7 as the a-gram, 2 as the b-gram, and 1 as the c-gram.
It should be noted that, in the actual implementation process, when the electronic device performs splitting on any character vocabulary group, the number of corresponding sub-vocabulary groups may be determined based on the actual implementation requirement, which is not limited in the embodiment of the present application. In this embodiment of the present application, the electronic device may split from front to back according to the order of the vocabularies in the character vocabulary group, which is not limited in this application, although other splitting orders are also possible.
In the embodiment of the present application, after determining at least two sub-vocabulary groups corresponding to each character vocabulary group, a first evaluation value corresponding to each sub-vocabulary group may also be determined.
Alternatively, the electronic device may determine the first evaluation value corresponding to the sub-vocabulary group using the following formula:
P = (F(k-gram) × total number of short texts) / (F(a-gram) × F(b-gram))
where P represents the first evaluation value, F(k-gram) represents the frequency of occurrence of the character vocabulary group in the plurality of short texts in the short text training set, F(a-gram) represents the frequency of occurrence of the a-gram sub-vocabulary group in those short texts, F(b-gram) represents the frequency of occurrence of the b-gram sub-vocabulary group in those short texts, and the total number of short texts is the number of short texts in the short text training set.
In this embodiment of the present application, after determining the first evaluation values corresponding to the respective sub-vocabulary groups based on the foregoing formula, the first target evaluation value meeting the first screening condition may be selected from the obtained first evaluation values corresponding to the respective at least two sub-vocabulary groups. The first screening condition may be understood as screening a first evaluation value having a smallest value from among a plurality of first evaluation values as a first target evaluation value.
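Under one reading of the above formula, the first evaluation value of a prefix/suffix split of a k-gram is P = F(k-gram) × total number of short texts / (F(a-gram) × F(b-gram)), and the first target evaluation value is the smallest P over all such splits. The sketch below follows that reading for the two-part split only; it is an assumption for illustration, not the exact implementation of this application.

from typing import Dict

def first_target_evaluation(
    word: str,
    freq: Dict[str, int],      # occurrence counts of all mined n-grams
    total_short_texts: int,
) -> float:
    # Minimum cohesion score P over all prefix/suffix splits of `word`.
    f_k = freq.get(word, 0)
    best = float("inf")
    for i in range(1, len(word)):
        a, b = word[:i], word[i:]              # a-gram (prefix), b-gram (suffix)
        denom = freq.get(a, 0) * freq.get(b, 0)
        if denom == 0:
            continue
        p = f_k * total_short_texts / denom
        best = min(best, p)                    # first screening condition: smallest value
    return best if best != float("inf") else 0.0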
Step b: performing association processing on a character vocabulary group, and selecting a second target evaluation value which meets a second screening condition from second evaluation values which correspond to at least two associated vocabularies respectively; wherein each second evaluation value characterizes the occurrence frequency of the corresponding associative vocabulary in a plurality of short texts;
In the embodiments of the application, the electronic device performs at least one of the following operations for each character vocabulary group: determining, for each vocabulary contained in the character vocabulary group, at least one associated vocabulary that takes that vocabulary as a suffix; and determining, for each vocabulary contained in the character vocabulary group, at least one associated vocabulary that takes that vocabulary as a prefix.
That is, a character vocabulary group corresponds to: for each vocabulary it contains, at least one associated vocabulary that takes that vocabulary as a prefix; and, for each vocabulary it contains, at least one associated vocabulary that takes that vocabulary as a suffix.
For example, assume that the character vocabulary group s contains the vocabularies: vocabulary 1, vocabulary 2, vocabulary 3, and vocabulary 4. Then the character vocabulary group s corresponds to at least one associated vocabulary taking vocabulary 1 as a prefix, at least one associated vocabulary taking vocabulary 2 as a prefix, at least one associated vocabulary taking vocabulary 3 as a prefix, and at least one associated vocabulary taking vocabulary 4 as a prefix; and at least one associated vocabulary taking vocabulary 1 as a suffix, at least one associated vocabulary taking vocabulary 2 as a suffix, at least one associated vocabulary taking vocabulary 3 as a suffix, and at least one associated vocabulary taking vocabulary 4 as a suffix.
Optionally, in the practical implementation, the electronic device may determine the at least one associated vocabulary that takes each vocabulary of the character vocabulary group as a prefix as the right character vocabulary group, and may determine the at least one associated vocabulary that takes each vocabulary of the character vocabulary group as a suffix as the left character vocabulary group.
Further, using the frequency statistics of the (k+1)-grams that take the k-gram as a suffix, the electronic device records the left-character distribution information entropy, namely one second evaluation value, as LQ(k-gram); based on a similar method, the electronic device can determine the other second evaluation value, namely the right-character distribution information entropy, as RQ(k-gram).
In the embodiments of the present application, after the second evaluation values corresponding to the at least two associated vocabularies of the character vocabulary group are obtained, a second target evaluation value that meets the second screening condition is selected from the obtained second evaluation values. The second screening condition may be to select the smallest of the second evaluation values as the second target evaluation value.
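The left and right character distribution information entropies can be illustrated with a conventional neighbouring-character entropy computation: for every occurrence of the vocabulary in the short texts, the character adjacent to it on the given side is collected, and the Shannon entropy of that character distribution is taken; the second target evaluation value is then the smaller of the two. The sketch below is assumed to correspond to LQ(k-gram) and RQ(k-gram) as described above and is illustrative only.

import math
from collections import Counter
from typing import List

def neighbor_entropy(word: str, short_texts: List[str], side: str = "left") -> float:
    # Shannon entropy of the characters adjacent to `word` on the given side.
    neighbors = Counter()
    for text in short_texts:
        start = text.find(word)
        while start != -1:
            if side == "left" and start > 0:
                neighbors[text[start - 1]] += 1
            elif side == "right" and start + len(word) < len(text):
                neighbors[text[start + len(word)]] += 1
            start = text.find(word, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())

def second_target_evaluation(word: str, short_texts: List[str]) -> float:
    # second screening condition: take the smaller of LQ and RQ
    return min(neighbor_entropy(word, short_texts, "left"),
               neighbor_entropy(word, short_texts, "right"))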
Step c: a vocabulary evaluation value for a character vocabulary group is determined based on the first target evaluation value and the second target evaluation value.
In the embodiment of the present application, after obtaining the first target evaluation value and the second target evaluation value corresponding to the character vocabulary group, the electronic device may determine the vocabulary evaluation value of one character vocabulary group based on the first target evaluation value and the second target evaluation value.
Alternatively, the electronic device may determine the vocabulary evaluation value for a character vocabulary group using, but not limited to, the following formula:
S(k-gram)=(m*P(k-gram))*(n*Q(k-gram));
where S(k-gram) represents the vocabulary evaluation value of the k-gram character vocabulary group, P(k-gram) represents the first target evaluation value of the k-gram character vocabulary group, and Q(k-gram) represents its second target evaluation value; m and n are weight coefficients with m + n = 1, whose specific values are determined based on the actual implementation.
In this embodiment of the present application, after the electronic device determines the vocabulary evaluation value corresponding to each character vocabulary group based on the foregoing method, at least one first vocabulary group may be obtained by screening at least one character vocabulary group.
Optionally, the electronic device may select, as the first vocabulary group, a character vocabulary group having a vocabulary evaluation value greater than a preset threshold, where the preset threshold may be determined correspondingly based on actual implementation, which is not limited in the embodiment of the present application.
Alternatively, the electronic device may rank the character vocabulary groups from large to small by vocabulary evaluation value and select the top N character vocabulary groups as the at least one first vocabulary group. Wherein N is a positive integer determined correspondingly based on the actual implementation, which is not limited in the embodiment of the present application.
Of course, in the implementation process, the electronic device may also adopt other manners to screen at least one character vocabulary group based on the vocabulary evaluation value corresponding to each character vocabulary group, which is not limited in the embodiment of the present application.
Step 303: selecting, from the at least one first vocabulary group, candidate vocabularies whose word frequencies in the plurality of short texts meet a first threshold, and obtaining a candidate word set based on each obtained candidate vocabulary.
In the embodiment of the application, a dictionary can be built based on the at least one first vocabulary group, and the optimal segmentation of each input short text, namely the segmentation that maximizes the total score, can be calculated by using a Viterbi algorithm, so as to obtain the segmentation result of the short text training set. Then, the electronic device may count the word frequency of each vocabulary in the first vocabulary groups based on the segmentation result of the short text training set. Further, the electronic device may select the vocabularies whose word frequencies rank in the top D to determine the candidate word set, where D is a positive integer determined correspondingly based on the actual implementation, which is not limited. The word frequency of the D-th ranked vocabulary is used as the first threshold.
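The following is a minimal, illustrative Python sketch of such score-maximizing (Viterbi-style) segmentation and of the top-D word-frequency selection; the score table, the maximum word length and the out-of-vocabulary fallback score are assumptions of this sketch, not values given by the disclosure.

from collections import Counter

MAX_WORD_LEN = 8    # assumed maximum vocabulary length
OOV_SCORE = -1.0    # assumed fallback score for single characters outside the dictionary

def best_segmentation(text, scores):
    # Viterbi-style dynamic programming: best[i] is the maximum total score of any
    # segmentation of text[:i]; back[i] records where the last vocabulary starts.
    best = [float("-inf")] * (len(text) + 1)
    best[0] = 0.0
    back = [0] * (len(text) + 1)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            word = text[j:i]
            gain = scores.get(word, OOV_SCORE if i - j == 1 else float("-inf"))
            if best[j] + gain > best[i]:
                best[i], back[i] = best[j] + gain, j
    words, i = [], len(text)
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

def top_d_vocabularies(short_texts, scores, d):
    # Count word frequencies over the segmented training set and keep the top D;
    # the word frequency of the D-th vocabulary then acts as the first threshold.
    counts = Counter(w for t in short_texts for w in best_segmentation(t, scores))
    return [w for w, _ in counts.most_common(d)]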
In the embodiment of the application, after the electronic device obtains the candidate word set, a relevant step of the initial text expression mining process may be performed.
In the embodiment of the application, the electronic device can construct a target dictionary tree based on the candidate word set; wherein the root node of the target dictionary tree is empty, and each leaf node except the root node contains one vocabulary. For example, referring to fig. 7, a schematic diagram of a target dictionary tree according to an embodiment of the present application is shown.
In the embodiment of the application, after the electronic device builds the target dictionary tree, the following operations may be performed for each leaf node:
respectively obtaining a first path between one leaf node and each other leaf node; each first path includes: the vocabulary contained in one leaf node is used as a starting vocabulary, and the vocabulary contained in the corresponding other leaf nodes is used as an ending vocabulary, so that a vocabulary combination is formed; based on the obtained vocabulary combinations, a vocabulary combination set is obtained.
In a specific implementation process, the first path between one leaf node and another leaf node may include 2 words, 3 words, 4 words, or the like. That is, the first path may connect only two leaf nodes, or may also connect three leaf nodes.
For example, referring to fig. 8, the first path between leaf node 1 and leaf node 2 includes 2 words, the first path between leaf node 1 and leaf node 3 includes 3 words, and the first path between leaf node 1 and leaf node 4 includes 2 words. Thus, vocabulary combination 1 formed by recombining 2 vocabularies and vocabulary combination 2 formed by recombining 3 vocabularies can be obtained.
As can be seen, the vocabulary combination set in the embodiment of the present application includes: vocabulary combination 1 formed by 2 vocabularies, vocabulary combination 2 formed by 3 vocabularies, vocabulary combination 3 formed by 4 vocabularies, ..., and vocabulary combination N formed by M vocabularies, wherein M and N are positive integers.
Optionally, after the electronic device obtains each vocabulary combination, each vocabulary combination may be further processed so that the positions where the selected vocabularies appear in the original text do not overlap, and a vocabulary combination of length m is recorded as an m-sense, so as to obtain the vocabulary combination set. Here, m can be understood as a recombination of m vocabularies, and m is a positive integer.
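A minimal illustrative sketch of forming vocabulary combinations of 2 to m words from the non-overlapping vocabulary occurrences of one short text; the occurrence format and the upper bound max_len are assumptions of this sketch:

from itertools import combinations

def vocabulary_combinations(hits, max_len=4):
    # hits: list of (word, start, end) occurrences in one short text whose text
    # positions do not overlap; combinations keep the original word order.
    hits = sorted(hits, key=lambda h: h[1])
    combos = []
    for m in range(2, max_len + 1):
        for combo in combinations(hits, m):
            combos.append(tuple(w for w, _, _ in combo))   # an m-word combination
    return combos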
Alternatively, after the electronic device determines the vocabulary combination set, the electronic device may perform text-expression mining processing on the vocabulary combination set to obtain an initial text expression set using, but not limited to, the following steps.
Step A: selecting candidate vocabulary combinations corresponding to the occurrence frequency meeting a second threshold from the occurrence frequencies corresponding to the vocabulary combinations in the vocabulary combination set; wherein each occurrence frequency is determined based on the number of occurrences of the corresponding vocabulary combination in the plurality of short texts and the number of the plurality of short texts.
In the embodiment of the present application, the electronic device may determine the frequency of occurrence corresponding to each vocabulary combination based on the number of occurrences of each vocabulary combination in the short text included in the short text training set and the number of short texts in the short text training set. Then, the electronic device may select, from the frequencies of occurrence corresponding to each of the plurality of vocabulary combinations in the vocabulary combination set, a candidate vocabulary combination corresponding to the frequency of occurrence that meets the second threshold. The second threshold may be determined based on the actual implementation, that is, the second threshold may be updated, which is not limited in the embodiment of the present application.
Step B: connecting the vocabulary in each candidate vocabulary combination with the vocabulary in other candidate vocabulary combinations, respectively, by at least one grammar symbol according to a preset arrangement rule, to obtain a plurality of first initial text expressions.
In this embodiment of the present application, the electronic device may connect the vocabulary in each candidate vocabulary combination with the vocabulary in other candidate vocabulary combinations according to a preset arrangement rule with at least one grammar symbol, so as to obtain a corresponding first initial text expression.
For example, the two vocabularies "beijing headquarter" and "building" are connected by the grammar symbol "x" according to the arrangement rule in which "beijing headquarter" precedes "building", and the first initial text expression "beijing headquarter x building" can be obtained.
It should be noted that, in the embodiment of the present application, the preset arrangement rule may be a preset ordering of two vocabularies, or the vocabulary from the earlier-ranked vocabulary combination may be arranged before the vocabulary from the later-ranked vocabulary combination. Further, the grammar symbol may be any suitable symbol, which the present application does not limit.
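As an illustration of Step B, a minimal sketch that joins the words of each candidate combination with a grammar symbol; the symbol ".*" used here is only an assumed example, since the application does not limit the symbol:

def build_first_expressions(candidate_combinations, symbol=".*"):
    # Join the words of each candidate combination, in their preset order,
    # with the grammar symbol to form one first initial text expression each.
    return [symbol.join(combo) for combo in candidate_combinations]

# build_first_expressions([("beijing headquarter", "building")])
# -> ["beijing headquarter.*building"]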
Step C: and matching each first initial text expression with a preset text expression library to obtain a corresponding second initial text expression containing the first initial text expression.
In the embodiment of the application, the electronic device may determine a preset text expression library, where the preset text expression library includes a plurality of text expressions, and the preset text expression library may be determined correspondingly based on an actual application scenario.
Further, the electronic device may match each first initial text expression with a preset text expression library to obtain a corresponding second initial text expression including the first initial text expression.
For example, assuming that the first initial text expression is "headquarter building", the first initial text expression is matched with a preset text expression library to obtain a second initial text expression: "Beijing headquarter building", "headquarter (first) building", "headquarter |subsection building".
Step D: based on each of the obtained first initial text expression and second initial text expression, an initial text expression set is obtained.
In the embodiment of the application, the electronic device may form the obtained first initial text expression and the second initial text expression into one initial text expression set, thereby obtaining the initial text expression set.
Step 202: respectively carrying out grammar analysis on each initial text expression contained in the initial text expression set to obtain a corresponding vocabulary index sequence; each vocabulary index sequence includes an index position of at least one vocabulary in the corresponding initial text expression.
In the embodiment of the application, the electronic device may convert each initial text expression included in the initial text expression set into a corresponding syntax tree; each grammar tree includes a plurality of grammar nodes, each grammar node including a grammar symbol or vocabulary.
For example, please refer to fig. 9, which is a schematic diagram of a syntax tree provided in an embodiment of the present application. And connecting each grammar node so as to determine an initial text expression corresponding to the grammar tree.
In the embodiment of the application, the electronic device performs simplification processing on the grammar nodes of each grammar tree to obtain simplified initial text expressions.
Optionally, the electronic device may perform the following for each syntax tree:
when determining grammar nodes containing first grammar symbols in a grammar tree, removing the grammar nodes to obtain a first grammar tree;
and when determining grammar nodes containing grammar symbols of a second type in the first grammar tree, performing recursion operation and removal operation on the grammar nodes to obtain a second grammar tree, and taking an expression corresponding to the second grammar tree as a simplified initial text expression.
In an embodiment of the present application, the electronic device may remove a syntax node that contains a first type of syntax symbol, where the first type of syntax symbol may include, but is not limited to, the following syntax symbols: the head-of-line match ("^"), the end-of-line match ("$"), the word separator, the arbitrary character ("."), and "?". Of course, the first type of syntax symbol may be any other symbol that does not provide valuable index filtering information, which is not limited in this application.
In the embodiment of the present application, after the electronic device performs the removal operation on the syntax nodes containing the first type of syntax symbol to obtain the first syntax tree, the electronic device may further determine whether the first syntax tree contains syntax nodes of the second type of syntax symbol. The second type of syntax symbol may include, but is not limited to, "()", "+" and "{}".
Alternatively, when the electronic device determines that the initial text expression is a literal expression, it is retained; for example, "abc" is a literal expression, and its simplified form is the text "abc" itself.
Alternatively, when the electronic device determines that the initial text expression is a "()" expression or a "+" expression, its sub-expression may be moved up one layer and processed recursively to obtain the simplified initial text expression.
Alternatively, when the electronic device determines that the initial text expression is a "{m,n}" expression, if it is determined that m >= 1, it is processed in the same manner as a "+" expression; otherwise, the initial text expression is removed.
Alternatively, when the electronic device determines that the initial text expression is a "connection expression", for example "abc def", its sub-expressions can be determined and each sub-expression can be processed recursively; the sub-expressions that should be removed are removed from the sub-expression list, and if all sub-expressions are removed, the whole initial text expression should be removed.
Alternatively, when the electronic device determines that the initial text expression is a "|" expression, each sub-expression may be processed recursively; if any one sub-expression is "removed", the initial text expression is also "removed". This is because the sub-expressions of "|" are in an "or" relationship with each other, and in the processing of the embodiment of the present application "removed" represents "unrestricted"; logically, if one branch is "unrestricted", the whole expression is "unrestricted", hence the foregoing processing.
Optionally, when the electronic device determines that the initial text expression is a character set expression, for example "[0-9]" or "[a-z]", it determines whether the character set in the expression is too large. When it is determined that the character set exceeds a preset threshold, which may be adjusted according to actual service needs, for example 128 characters, the filtering effect would not be good, so the expression is removed. When it is determined that the character set does not exceed the preset threshold, each character is extracted and the expression is converted into a "|" expression composed of multiple single characters for processing.
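A minimal illustrative sketch of the recursive simplification described above, assuming a simple tuple-based syntax-tree representation; the node kinds and their layout are assumptions of this sketch and not the original data structure:

REMOVE = object()   # sentinel meaning "node removed / unrestricted"

def simplify(node):
    # node kinds assumed here: ("literal", text), ("anchor",), ("any",), ("optional",),
    # ("word_sep",), ("group", child), ("plus", child), ("repeat", m, n, child),
    # ("concat", [children]), ("alt", [children]).
    kind = node[0]
    if kind in ("anchor", "any", "optional", "word_sep"):
        return REMOVE                         # first-type symbols are removed outright
    if kind == "literal":
        return node                           # literal expressions are kept as-is
    if kind in ("group", "plus"):
        return simplify(node[1])              # move the sub-expression up one layer
    if kind == "repeat":
        m, child = node[1], node[3]
        return simplify(child) if m >= 1 else REMOVE
    if kind == "concat":
        kept = [c for c in (simplify(ch) for ch in node[1]) if c is not REMOVE]
        return ("concat", kept) if kept else REMOVE
    if kind == "alt":
        kids = [simplify(ch) for ch in node[1]]
        # one "unrestricted" branch makes the whole "|" expression unrestricted
        return REMOVE if any(k is REMOVE for k in kids) else ("alt", kids)
    return REMOVE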
In the embodiment of the application, when the electronic device performs the simplification processing on the grammar nodes of the grammar tree corresponding to each initial text expression, each simplified initial text expression is obtained, and three types of expressions can result, namely the literal expression, the "|" expression, and the connection expression.
Further, the electronic device performs the following operations on the obtained simplified initial text expressions:
determining an expression type corresponding to the simplified initial text expression, and determining a target index sequence construction rule based on the expression type and the mapping relation between the expression type and a preset index sequence construction rule;
and determining the corresponding relation between each vocabulary and the corresponding index in a simplified initial text expression based on the target index sequence construction rule, and obtaining the vocabulary index sequence based on the corresponding relation between each vocabulary and the corresponding index.
Fig. 10 is a schematic diagram of a mapping relationship between an expression type and a preset index sequence construction rule in the embodiment of the present application.
Optionally, when the electronic device determines that the expression type corresponding to the simplified initial text expression is the literal type, it may determine that the target index sequence construction rule is index sequence construction rule 1, and index sequence construction rule 1 determines the index of each vocabulary according to the vocabulary position of each vocabulary in the simplified initial text expression.
For example, assuming that the simplified initial text expression is w, it can be determined that its corresponding vocabulary index sequence is [ { w } ].
Optionally, when the electronic device determines that the expression type corresponding to the simplified initial text expression is the "|" type, it may determine that the target index sequence construction rule is index sequence construction rule 2, and index sequence construction rule 2 is: when every sub-expression in the simplified initial text expression is a literal expression, determine the index of each vocabulary according to the vocabulary position of each vocabulary in the simplified initial text expression; when some of the sub-expressions are connection expressions, collect all the literal sub-expressions together and, if their number is greater than 0, determine the index of each of their vocabularies according to its vocabulary position in the simplified initial text expression, process each remaining connection sub-expression recursively, and merge all the recursion results with the foregoing result as the vocabulary index.
For example, assume that the index sets of the corresponding sub-expressions of the simplified initial text expression are {[S1]} and {[S2, S3], [S4]}; then the vocabulary index corresponding to the simplified initial text expression is: {[S1], [S2, S3], [S4]}.
Optionally, when the electronic device determines that the expression type corresponding to the simplified initial text expression is the connection type, it may determine that the target index sequence construction rule is index sequence construction rule 3, where index sequence construction rule 3 is: process each sub-expression recursively, and perform a sequence-set cross-connection operation on the recursion results to obtain the index sequence of the simplified initial text expression.
For example, assume that the index sequence sets of the corresponding sub-expressions of the simplified initial text expression are {[S1], [S2]} and {[S3], [S4]}; performing the cross-connection operation on {[S1], [S2]} and {[S3], [S4]}, the index sequence corresponding to the simplified initial text expression can be determined as: {[S1, S3], [S1, S4], [S2, S3], [S2, S4]}.
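A minimal illustrative sketch of the sequence-set cross-connection operation used for connection-type expressions:

from itertools import product

def cross_connect(seq_sets):
    # seq_sets: one set (list) of index sequences per sub-expression,
    # e.g. [[["S1"], ["S2"]], [["S3"], ["S4"]]].
    result = [[]]
    for seqs in seq_sets:
        # Concatenate every partial sequence with every sequence of the next sub-expression.
        result = [prev + seq for prev, seq in product(result, seqs)]
    return result

# cross_connect([[["S1"], ["S2"]], [["S3"], ["S4"]]])
# -> [["S1", "S3"], ["S1", "S4"], ["S2", "S3"], ["S2", "S4"]]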
It can be seen that, in the embodiment of the present application, the electronic device may parse each initial text expression contained in the initial text expression set to obtain the corresponding vocabulary index sequence; that is, the electronic device may convert the initial text expression set into a vocabulary index sequence set.
Step 203: and respectively screening target text expressions matched with a plurality of short texts in the short text training set from the initial text expression set, and setting a vocabulary index sequence associated with each target text expression and a classification identifier corresponding to the short text as a classification group.
In this embodiment of the present application, the electronic device matches each training item, that is, each training item containing (short text t, classification identifier c), against the initial text expression set. When it is determined that the short text t matches the vocabulary index sequence r1 corresponding to text expression 1, the vocabulary index sequence r2 corresponding to text expression 2, ..., and the vocabulary index sequence rk corresponding to text expression k in the initial text expression set, the classification group (ri, c) is determined, where i ranges over 1, 2, ..., k.
Step 204: based on the obtained classification groups, a target classification rule base is obtained, and the target classification rule base is used for classifying short texts to be classified.
In this embodiment of the present application, for each obtained classification group, when one training item matches the index sequences r1, r2, ..., rk corresponding to text expressions in the initial text expression set, 1 is added to the classification group (ri, c) in a global counter, so that the matching frequency corresponding to each classification group can be determined based on the global counter. The matching frequency is used to characterize the frequency at which the vocabulary index sequence in the corresponding classification group matches the corresponding short text among the plurality of short texts.
In the embodiment of the application, after the electronic device determines the matching frequency corresponding to each classification group, a target matching probability that is not smaller than a third threshold is selected from the obtained matching probabilities. Then, the electronic device constructs the target classification rule base based on the candidate classification groups corresponding to the obtained target matching probabilities. The third threshold is, for example, 0.8 or 0.9, which is not limited in the embodiment of the present application.
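As an illustration only, a minimal Python sketch of the global counter and the third-threshold screening; the exact definition of the matching probability used here (the fraction of matches of a vocabulary index sequence that carry the classification identifier of the group) is an assumption of this sketch:

from collections import Counter

def build_rule_base(training_items, match_sequences, third_threshold=0.8):
    # training_items: list of (short_text, classification_identifier)
    # match_sequences(text): ids of the vocabulary index sequences the text matches
    group_hits = Counter()
    seq_hits = Counter()
    for text, label in training_items:
        for sid in match_sequences(text):
            group_hits[(sid, label)] += 1   # global counter for the classification group (ri, c)
            seq_hits[sid] += 1
    # Keep the classification groups whose matching probability reaches the threshold.
    return {
        (sid, label): hits / seq_hits[sid]
        for (sid, label), hits in group_hits.items()
        if hits / seq_hits[sid] >= third_threshold
    }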
Optionally, the electronic device may further assign a unique number sid to each vocabulary index sequence, recording the relationship between each sid and the corresponding initial text expression, as well as the length slen of each sid sequence.
Specifically, for each vocabulary index sequence, the electronic device may record, for each vocabulary in the sequence, index information including the sid in which the vocabulary appears and its position. For example, if the vocabulary index sequence is [{w1, w2}, {w3, w4}], the electronic device may record the index information of the vocabularies w1, w2, w3 and w4 as: w1 -> (sid, 1), w2 -> (sid, 1), w3 -> (sid, 2), w4 -> (sid, 2).
Optionally, the electronic device may further group the vocabulary position index information of all vocabularies in the whole sequence set by vocabulary content, collect it into lists, and sort the elements in each list by the two dimensions (sid, rn), where sid is the high-priority sorting field. A trie tree is then built over all the vocabularies, so that the position index list of a hit vocabulary can be queried directly in the trie tree; in this way a target classification rule base containing the classification identifiers, the vocabulary index sequences and the positions of the respective vocabularies is obtained.
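A minimal illustrative sketch of building the per-vocabulary position index described above; the data layout (a dict from sid to the list of vocabulary sets at positions 1, 2, ...) is an assumption of this sketch:

from collections import defaultdict

def build_position_index(sequences):
    # sequences: {sid: [set_of_words_at_position_1, set_of_words_at_position_2, ...]}
    # Position numbering starts at 1, as in the example above.
    index = defaultdict(list)
    lengths = {}
    for sid, groups in sequences.items():
        lengths[sid] = len(groups)
        for pos, group in enumerate(groups, start=1):
            for word in group:
                index[word].append((sid, pos))
    for word in index:
        index[word].sort()   # sort by (sid, position), sid being the high-priority field
    return index, lengths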
In the embodiment of the application, after the electronic device receives the short text to be classified, for the input short text t, fast full-text word segmentation is performed using the target dictionary tree, word segments whose text positions overlap are removed, and the hit word-segmentation sequence of the short text to be classified in the target classification rule base is determined.
Optionally, the electronic device looks up the vocabulary index sequence hit results in the corresponding target classification rule base according to the word-segmentation hit sequence (w1, w2, ..., wn), and further finds the corresponding initial text expression to perform the actual regular verification, so as to finally classify the short text to be classified.
Step 1: assuming that the position index lists corresponding to the word-segmentation hit sequence are l1, l2, ..., ln, since each element in each list is ordered, the index elements can be grouped by sid using a multi-way merge-sorting method, so as to obtain a plurality of sid groups.
Step 2: for each sid group, the sequence length sl corresponding to the sid can be determined; then the rn part of all the position index entries (sid, rn) in the group is verified to determine whether the rn parts cover all of the numbers 1, 2, ..., sl. If so, it is determined that the current sid index is hit, and the classification identifier of the classification group corresponding to the sid is determined as the classification identifier of the short text to be classified.
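A minimal illustrative sketch of Step 1 and Step 2, assuming the index built in the previous sketch and a mapping from sid to the classification identifier of its classification group (assumed unique here for simplicity):

from heapq import merge
from itertools import groupby

def classify(hit_words, index, lengths, labels):
    # hit_words: non-overlapping words obtained from the target dictionary tree;
    # index/lengths come from build_position_index; labels maps sid -> classification identifier.
    entries = list(merge(*(index.get(w, []) for w in hit_words)))   # multi-way merge of sorted lists
    for sid, group in groupby(entries, key=lambda e: e[0]):         # group the entries by sid
        positions = {pos for _, pos in group}
        sl = lengths[sid]
        if positions.issuperset(range(1, sl + 1)):                  # all of 1, 2, ..., sl are present
            return labels.get(sid)
    return None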
In a specific implementation process, the text classification method provided by the embodiment of the application can be combined with automatic POI classification: a single 10-core server can finish mining rules from 10 million POI names within 20 minutes, yielding 70,000 rules with a data coverage rate of 80% and an accuracy of 96%, so the text classification method provided by the application can classify short texts to be classified quickly and accurately. In addition, the target classification rule base can be further processed, for example by manual deep curation, so as to improve the classification accuracy for the short texts to be classified.
Based on the foregoing embodiments, the text classification method according to an embodiment of the present application is described in detail below using a specific example, with reference to fig. 11, which is an exemplary diagram of the text classification method according to the embodiment of the present application, specifically including the following.
first, the electronic device may determine that the short text training set includes 3 training items, respectively: training item 1 (corporate headquarter building, commercial building), training item 2 (cheng du river beach park, leisure sports), and training item 3 (hotpot, restaurant).
Secondly, the electronic equipment can perform vocabulary mining processing on the short text training set to obtain candidate word sets. Wherein, candidate word sets are: public, department, general, department, large, building, adult, city, river, beach, public, garden, fire, pot, store, company, headquarter, building, adult, river beach, park, chafing dish store, company headquarter, headquarter building, company building, adult river beach, river beach park, company headquarter building, adult river beach park.
Then, the electronic device may perform a vocabulary recombination process on the vocabulary in the candidate vocabulary set to obtain a vocabulary combination set. Wherein, the vocabulary combination set is: companies, headquarters, buildings, achievements, river beaches, parks, chafing dishes, corporate headquarters, corporate buildings, river beach parks, achievements chafing dishes, achievements headquarters, achievements parks, achievements chafing dishes, corporate headquarters, achievements river beach parks, achievements headquarters, achievements corporate buildings.
Again, the electronic device may perform text-expression mining processing on the vocabulary combination set to obtain an initial text expression set. Wherein the initial text expression set is: all-purpose company's headquarter building, (all-purpose company's) building, all-purpose hot pot store, all-purpose river beach park, all-purpose park, hot pot store.
Then, the electronic device may parse each initial text expression contained in the initial text expression set to obtain the corresponding vocabulary index sequence. Wherein, the vocabulary index sequences are respectively: adult -> {sid1, 1}, company -> {sid1, 2}, headquarter building -> {sid1, 3}; adult -> {sid2, 1}, company -> {sid2, 2}, building -> {sid2, 3}; adult -> {sid3, 1}, chafing dish store -> {sid3, 2}; adult -> {sid4, 1}, chafing dish store -> {sid4, 3}; adult -> {sid5, 1}, river beach park -> {sid5, 3}; adult -> {sid6, 1}, park -> {sid6, 3}; chafing dish store -> {sid7, 1}.
And the electronic equipment can respectively screen out target text expressions matched with a plurality of short texts in the short text training set from the initial text expression set, and set a vocabulary index sequence associated with each target text expression and a classification identifier corresponding to the short text as a classification group. Specifically, the classification groups are respectively: classification group 1 (sid 1, commercial building), classification group 2 (sid 2, commercial building), classification group 3 (sid 3, dining), classification group 4 (sid 4, dining), classification group 5 (sid 6, leisure sports), classification group 6 (sid 7, dining).
Finally, a target classification rule base is obtained based on the classification group 1, the classification group 2, the classification group 3, the classification group 4, the classification group 5 and the classification group 6, and the short text to be classified is classified based on the target classification rule base.
Based on the same inventive concept, the embodiment of the application also provides a text classification device. As shown in fig. 12, which is a schematic structural diagram of a text classification apparatus 1200, may include:
the processing unit 1201 is configured to perform vocabulary recombination processing based on a candidate word set obtained by performing vocabulary mining processing on the short text training set, obtain a vocabulary combination set, and perform text expression mining processing on the vocabulary combination set to obtain an initial text expression set;
The parsing unit 1202 is configured to parse each initial text expression included in the initial text expression set to obtain a corresponding vocabulary index sequence; each vocabulary index sequence includes an index position of at least one vocabulary in a corresponding initial text expression;
a screening unit 1203, configured to screen target text expressions matching with a plurality of short texts in the short text training set from the initial text expression set, respectively, and set a vocabulary index sequence associated with each target text expression and a classification identifier corresponding to the short text as a classification group;
an obtaining unit 1204, configured to obtain, based on each obtained classification group, a target classification rule base, where the target classification rule base is used to classify the short text to be classified.
Optionally, the text classification apparatus further comprises a mining unit for:
performing vocabulary mining processing on the short text training set to obtain at least one character vocabulary group; each character vocabulary group comprises a plurality of vocabularies with the same number of characters, and the vocabularies belonging to different character vocabulary groups contain different numbers of characters;
screening at least one character vocabulary group based on the vocabulary evaluation value corresponding to each character vocabulary group to obtain at least one first vocabulary group; each vocabulary evaluation value is determined based on the occurrence frequency of the corresponding character vocabulary group in the plurality of short texts and the combination frequency with other character vocabulary groups;
and selecting, from the at least one first vocabulary group, candidate vocabularies whose word frequencies in the plurality of short texts meet a first threshold, and obtaining the candidate word set based on each obtained candidate vocabulary.
Optionally, the excavation unit is further configured to:
the following operations are performed for each character vocabulary group respectively:
selecting a first target evaluation value which meets a first screening condition from first evaluation values which correspond to at least two sub-vocabulary groups respectively and are obtained by splitting a character vocabulary group; wherein each first evaluation value characterizes the frequency of occurrence of the corresponding sub-vocabulary group in the plurality of short texts;
selecting a second target evaluation value which meets a second screening condition from second evaluation values which correspond to at least two associated vocabularies respectively and are obtained by carrying out association processing on a character vocabulary group; wherein each second evaluation value characterizes the occurrence frequency of the corresponding associative vocabulary in a plurality of short texts;
a vocabulary evaluation value for a character vocabulary group is determined based on the first target evaluation value and the second target evaluation value.
Optionally, the excavation unit is further configured to:
determining at least one splitting mode based on the number of words contained in one character word group, wherein each splitting mode is used for splitting one character word group into at least two sub word groups;
And splitting a character vocabulary group into at least two corresponding sub-vocabulary groups according to at least one splitting mode.
Optionally, the excavation unit is further configured to:
for each vocabulary contained in one character vocabulary group, respectively determining at least one associated vocabulary that takes the vocabulary as a suffix;
for each vocabulary contained in one character vocabulary group, respectively determining at least one associated vocabulary that takes the vocabulary as a prefix.
Optionally, the processing unit 1201 is configured to:
constructing a target dictionary tree based on the candidate word sets; the root node of the target dictionary tree is empty, and each leaf node except the root node contains a vocabulary;
the following is performed for each leaf node:
respectively obtaining a first path between one leaf node and each other leaf node; each first path includes: the vocabulary contained in one leaf node is used as a starting vocabulary, and the vocabulary contained in the corresponding other leaf nodes is used as an ending vocabulary, so that a vocabulary combination is formed;
based on the obtained vocabulary combinations, a vocabulary combination set is obtained.
Optionally, the processing unit 1201 is further configured to:
selecting candidate vocabulary combinations corresponding to the occurrence frequency meeting a second threshold from the occurrence frequencies corresponding to the vocabulary combinations in the vocabulary combination set; wherein each occurrence frequency is determined based on the number of occurrences of the corresponding vocabulary combination in the plurality of short texts and the number of the plurality of short texts;
The vocabulary in each candidate vocabulary combination and the vocabulary in other candidate vocabulary combinations are respectively connected through at least one grammar symbol according to a preset arrangement rule, and a plurality of first initial text expressions are obtained;
matching each first initial text expression with a preset text expression library to obtain a corresponding second initial text expression containing the first initial text expression;
based on each of the obtained first initial text expression and second initial text expression, an initial text expression set is obtained.
Optionally, the parsing unit 1202 is configured to:
converting each initial text expression contained in the initial text expression set into a corresponding grammar tree; each grammar tree comprises a plurality of grammar nodes, and each grammar node comprises a grammar symbol or vocabulary;
simplifying the grammar nodes of each grammar tree, and executing the following operations on the simplified initial text expressions:
determining an expression type corresponding to the simplified initial text expression, and determining a target index sequence construction rule based on the expression type and the mapping relation between the expression type and a preset index sequence construction rule;
And determining the corresponding relation between each vocabulary and the corresponding index in a simplified initial text expression based on the target index sequence construction rule, and obtaining the vocabulary index sequence based on the corresponding relation between each vocabulary and the corresponding index.
Optionally, the parsing unit 1202 is configured to:
the following operations are performed for each syntax tree:
when determining grammar nodes containing first grammar symbols in a grammar tree, removing the grammar nodes to obtain a first grammar tree;
and when determining grammar nodes containing grammar symbols of a second type in the first grammar tree, performing recursion operation and removal operation on the grammar nodes to obtain a second grammar tree, and taking an expression corresponding to the second grammar tree as a simplified initial text expression.
Optionally, the obtaining unit 1204 is configured to:
respectively determining the matching frequency corresponding to each classification group; the matching frequency is used to characterize: the vocabulary index sequences in the corresponding classification groups are matched to the frequencies of the corresponding short texts in the plurality of short texts;
selecting a target matching probability which is not smaller than a third threshold value from the obtained matching probabilities;
and constructing a target classification rule base based on the candidate classification group corresponding to the obtained target matching probability.
Those skilled in the art will appreciate that the various aspects of the present application may be implemented as a system, method, or program product. Accordingly, aspects of the present application may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit", "module" or "system".
In some possible embodiments, a text classification device according to the present application may include at least a processor and a memory. Wherein the memory stores program code that, when executed by the processor, causes the processor to perform the steps in the text classification method according to various exemplary embodiments of the application described in this specification. For example, the processor may perform the steps as shown in fig. 2.
The embodiment of the application also provides electronic equipment based on the same inventive concept as the method embodiment. The structure of the electronic device may be as shown in fig. 13. The electronic device in the embodiment of the present application includes at least one processor 1301, and a memory 1302 and a communication interface 1303 connected to the at least one processor 1301. The embodiment of the present application does not limit the specific connection medium between the processor 1301 and the memory 1302; in fig. 13, the processor 1301 and the memory 1302 are connected through the bus 1300, represented by a thick line, as an example, and the connection manner between the other components is merely illustrative and not limiting. The bus 1300 may be divided into an address bus, a data bus, a control bus, etc., and is shown with only one thick line in fig. 13 for convenience of illustration, but this does not mean there is only one bus or one type of bus.
In the embodiment of the present application, the memory 1302 stores instructions executable by the at least one processor 1301, and the at least one processor 1301 can perform the steps included in the text classification method by executing the instructions stored in the memory 1302.
The processor 1301 is the control center of the electronic device, and may connect various parts of the entire electronic device using various interfaces and lines, and implement various functions of the electronic device by running or executing the instructions stored in the memory 1302 and calling the data stored in the memory 1302. Optionally, the processor 1301 may include one or more processing units, and the processor 1301 may integrate an application processor and a modem processor, wherein the application processor primarily handles the operating system, user interfaces, applications, etc., and the modem processor primarily handles wireless communication. It will be appreciated that the modem processor described above may also not be integrated into the processor 1301. In some embodiments, the processor 1301 and the memory 1302 may be implemented on the same chip, and in some embodiments they may be implemented separately on separate chips.
Processor 1301 may be a general purpose processor such as a Central Processing Unit (CPU), digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, which may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution.
The memory 1302, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 1302 may include at least one type of storage medium, which may include, for example, flash memory, hard disk, multimedia card, card memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), programmable read-only memory (Programmable Read Only Memory, PROM), read-only memory (Read-Only Memory, ROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory), magnetic memory, magnetic disk, optical disk, and the like. The memory 1302 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1302 in the present embodiment may also be circuitry or any other device capable of implementing a storage function for storing program instructions and/or data.
The communication interface 1303 is a transmission interface that can be used for communication, and data can be received or transmitted through the communication interface 1303.
In addition, the electronic device includes a basic input/output system (I/O system) 1304, a mass storage device 1308 for storing an operating system 1305, application programs 1306, and other program modules 1307, which facilitate the transfer of information between the various devices within the electronic device.
The basic input/output system 1304 includes a display 1309 for displaying information and an input device 1310, such as a mouse, keyboard, etc., for user input of information. Wherein a display 1309 and an input device 1310 are connected to the processor 1301 through a basic input/output system 1304 that is connected to the system bus 1300. The basic input/output system 1304 may also include an input-output controller for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus, among others. Similarly, the input-output controller also provides output to a display screen, a printer, or other type of output device.
In particular, the mass storage device 1308 is connected to the processor 1301 through a mass storage controller (not shown) connected to the system bus 1300. The mass storage device 1308 and its associated computer-readable media provide non-volatile storage for the electronic device. That is, the mass storage device 1308 may include a computer-readable medium (not shown), such as a hard disk or CD-ROM drive.
According to various embodiments of the present application, the electronic device may also be operated by means of a remote computer connected through a network, such as the Internet. That is, the electronic device may be connected to the network 1311 through the communication interface 1303 coupled to the system bus 1300, or the communication interface 1303 may be used to connect to other types of networks or remote computer systems (not shown).
In some possible embodiments, aspects of the text classification method provided herein may also be implemented in the form of a program product comprising program code for causing an electronic device to perform the steps of the text classification method according to various exemplary embodiments of the application described herein above when the program product is run on the electronic device, e.g. the electronic device may perform the steps as shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code and may run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's equipment, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (11)

1. A method of text classification, the method comprising:
based on a candidate word set obtained by carrying out word mining processing on the short text training set, carrying out word recombination processing to obtain a word combination set; the short text training set at least comprises one training item, and each training item comprises a short text and a classification identifier corresponding to the short text;
selecting a candidate vocabulary combination corresponding to the occurrence frequency meeting a second threshold from the occurrence frequencies corresponding to the vocabulary combinations in the vocabulary combination set; wherein each occurrence frequency is determined based on the number of occurrences of the corresponding vocabulary combination in the plurality of short texts in the short text training set and the number of the plurality of short texts;
the vocabulary in each candidate vocabulary combination and the vocabulary in other candidate vocabulary combinations are respectively connected through at least one grammar symbol according to a preset arrangement rule, and a plurality of first initial text expressions are obtained;
Matching each first initial text expression with a preset text expression library to obtain a corresponding second initial text expression containing the first initial text expression;
obtaining an initial text expression set based on each obtained first initial text expression and second initial text expression;
converting each initial text expression contained in the initial text expression set into a corresponding grammar tree; wherein each grammar tree comprises a plurality of grammar nodes, and each grammar node comprises a grammar symbol or vocabulary;
simplifying the grammar nodes of each grammar tree, and executing the following operations on the simplified initial text expressions:
determining an expression type corresponding to the simplified initial text expression, and determining a target index sequence construction rule based on the expression type and a mapping relation between the expression type and a preset index sequence construction rule;
determining the corresponding relation between each vocabulary and the corresponding index in the simplified initial text expression based on the target index sequence construction rule, and obtaining a vocabulary index sequence based on the corresponding relation between each vocabulary and the corresponding index; wherein each vocabulary index sequence includes an index position of at least one vocabulary in the corresponding initial text expression;
Respectively screening target text expressions matched with the short texts from the initial text expression set, and setting a vocabulary index sequence associated with each target text expression and a classification identifier corresponding to the short text as a classification group;
based on the obtained classification groups, a target classification rule base is obtained, and the target classification rule base is used for classifying short texts to be classified.
2. The method of claim 1, wherein the vocabulary reorganizing process is preceded by the candidate vocabulary collection obtained by performing the vocabulary mining process on the short text training set, further comprising:
performing vocabulary mining processing on the short text training set to obtain at least one character vocabulary group; each character vocabulary group comprises a plurality of vocabularies with the same number of characters, and the vocabularies belonging to different character vocabulary groups contain different numbers of characters;
screening the at least one character vocabulary group based on the vocabulary evaluation value corresponding to each character vocabulary group to obtain at least one first vocabulary group; each vocabulary evaluation value is determined based on the frequency of occurrence of the corresponding character vocabulary group in the plurality of short texts and the combined frequency of the other character vocabulary groups;
And selecting word frequencies of words in the plurality of short texts from the at least one first word group, selecting candidate words meeting a first threshold, and obtaining the candidate word set based on each obtained candidate word.
3. The method of claim 2, wherein before the filtering the at least one character vocabulary group based on the vocabulary evaluation value corresponding to each character vocabulary group, further comprises:
the following operations are respectively executed for each character vocabulary group:
selecting a first target evaluation value which meets a first screening condition from first evaluation values which correspond to at least two sub-vocabulary groups respectively and are obtained by splitting a character vocabulary group; wherein each first evaluation value characterizes the frequency of occurrence of a corresponding sub-vocabulary group in the plurality of short texts;
selecting a second target evaluation value which meets a second screening condition from second evaluation values which correspond to at least two associated vocabularies respectively and are obtained by carrying out association processing on the character vocabulary group; wherein each second evaluation value characterizes the frequency of occurrence of the corresponding associative vocabulary in the plurality of short texts;
based on the first target evaluation value and the second target evaluation value, a vocabulary evaluation value of the one character vocabulary group is determined.
4. The method of claim 3, wherein splitting a character vocabulary set comprises:
determining at least one splitting mode based on the number of words contained in the one character word group, wherein each splitting mode is used for splitting the one character word group into at least two sub word groups;
and splitting the character vocabulary group into at least two corresponding sub-vocabulary groups according to the at least one splitting mode.
5. The method of claim 3, wherein said performing associative processing on said one character vocabulary group comprises at least one of:
each vocabulary contained in the character vocabulary group is respectively determined, and at least one associated vocabulary which takes the character vocabulary group as a suffix is respectively corresponding to each vocabulary;
and respectively determining each vocabulary contained in the character vocabulary group, and respectively corresponding at least one associated vocabulary prefixed by the vocabulary.
6. The method according to claim 1 or 2, wherein the performing a vocabulary recombination process based on the candidate word set obtained by performing the vocabulary mining process on the short text training set to obtain the vocabulary combination set includes:
constructing a target dictionary tree based on the candidate word set; the root node of the target dictionary tree is empty, and each leaf node except the root node contains a vocabulary;
The following is performed for each leaf node:
respectively obtaining a first path between one leaf node and each other leaf node; each first path includes: the vocabulary contained in one leaf node is used as a starting vocabulary, and the vocabulary contained in the corresponding other leaf nodes is used as an ending vocabulary, so that a vocabulary combination is formed;
based on the obtained vocabulary combinations, a vocabulary combination set is obtained.
7. The method according to claim 1 or 2, wherein the simplifying the syntax node of each syntax tree to obtain simplified initial text expressions includes:
the following operations are performed on each syntax tree:
when determining a grammar node containing a first grammar symbol existing in a grammar tree, removing the grammar node to obtain the first grammar tree;
and when determining grammar nodes containing second grammar symbols in the first grammar tree, performing recursion operation and removal operation on the grammar nodes to obtain a second grammar tree, and taking an expression corresponding to the second grammar tree as a simplified initial text expression.
8. The method according to claim 1 or 2, wherein obtaining the target classification rule base based on each obtained classification group comprises:
respectively determining a matching frequency corresponding to each classification group; wherein each matching frequency characterizes the frequency with which the vocabulary index sequence in the corresponding classification group is matched to corresponding short texts among the plurality of short texts;
selecting, from the obtained matching frequencies, target matching frequencies that are not smaller than a third threshold;
and constructing the target classification rule base based on the classification groups corresponding to the obtained target matching frequencies.
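The filtering step of claim 8 reduces to computing one matching frequency per classification group and keeping the groups whose frequency clears the third threshold. In the sketch below a classification group is a (vocabulary index sequence, classification identifier) pair, the matches predicate is left abstract because the matching rule is defined by the index-sequence construction elsewhere in the claims, and the dictionary layout of the rule base is an assumption made only for illustration.

from typing import Callable, Dict, List, Sequence, Tuple

ClassificationGroup = Tuple[Sequence[int], str]     # (vocabulary index sequence, classification id)

def matching_frequency(group: ClassificationGroup,
                       short_texts: Sequence[str],
                       matches: Callable[[Sequence[int], str], bool]) -> float:
    """Fraction of the short texts matched by the group's vocabulary index sequence."""
    if not short_texts:
        return 0.0
    index_sequence, _ = group
    return sum(matches(index_sequence, text) for text in short_texts) / len(short_texts)

def build_target_rule_base(groups: List[ClassificationGroup],
                           short_texts: Sequence[str],
                           matches: Callable[[Sequence[int], str], bool],
                           third_threshold: float) -> Dict[str, List[Sequence[int]]]:
    """Keep only the groups whose matching frequency is not smaller than the third threshold."""
    rule_base: Dict[str, List[Sequence[int]]] = {}
    for group in groups:
        if matching_frequency(group, short_texts, matches) >= third_threshold:
            index_sequence, classification_id = group
            rule_base.setdefault(classification_id, []).append(index_sequence)
    return rule_base
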
9. A text classification device, the device comprising:
a processing unit, used for performing vocabulary recombination processing based on a candidate word set obtained by performing vocabulary mining processing on a short text training set, to obtain a vocabulary combination set; wherein the short text training set comprises at least one training item, and each training item comprises a short text and a classification identifier corresponding to the short text;
selecting, based on occurrence frequencies corresponding to the vocabulary combinations in the vocabulary combination set, candidate vocabulary combinations whose occurrence frequencies meet a second threshold; wherein each occurrence frequency is determined based on the number of occurrences of the corresponding vocabulary combination in the plurality of short texts in the short text training set and the number of the plurality of short texts;
connecting the vocabulary in each candidate vocabulary combination with the vocabulary in other candidate vocabulary combinations respectively through at least one grammar symbol according to a preset arrangement rule, to obtain a plurality of first initial text expressions;
matching each first initial text expression with a preset text expression library to obtain a corresponding second initial text expression containing the first initial text expression;
obtaining an initial text expression set based on each obtained first initial text expression and second initial text expression;
a parsing unit, used for converting each initial text expression contained in the initial text expression set into a corresponding grammar tree; wherein each grammar tree comprises a plurality of grammar nodes, and each grammar node comprises a grammar symbol or a vocabulary;
simplifying the grammar nodes of each grammar tree, and executing the following operations on the simplified initial text expressions:
determining an expression type corresponding to the simplified initial text expression, and determining a target index sequence construction rule based on the expression type and a mapping relation between the expression type and a preset index sequence construction rule;
determining the corresponding relation between each vocabulary and the corresponding index in the simplified initial text expression based on the target index sequence construction rule, and obtaining a vocabulary index sequence based on the corresponding relation between each vocabulary and the corresponding index; wherein each vocabulary index sequence includes an index position of at least one vocabulary in the corresponding initial text expression;
a screening unit, used for screening, from the initial text expression set, target text expressions matched with the plurality of short texts, and setting the vocabulary index sequence associated with each target text expression and the classification identifier corresponding to the matched short text as a classification group;
an obtaining unit, used for obtaining a target classification rule base based on the obtained classification groups, wherein the target classification rule base is used for classifying short texts to be classified.
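The device of claim 9 mirrors the method claims as four cooperating units. A skeletal Python rendering, with every body left as a placeholder because the unit internals are those already described above, can make the division of responsibilities easier to follow; the class and method names are illustrative and not part of the claim.

class TextClassificationDevice:
    """Skeleton of the four units recited in claim 9 (names are illustrative only)."""

    def process(self, training_set):
        """Processing unit: mine candidate words, recombine them, build the initial text expression set."""
        raise NotImplementedError

    def parse(self, initial_expressions):
        """Parsing unit: convert expressions to grammar trees, simplify them, derive vocabulary index sequences."""
        raise NotImplementedError

    def screen(self, expression_set, short_texts):
        """Screening unit: keep target text expressions matching the short texts and form classification groups."""
        raise NotImplementedError

    def obtain_rule_base(self, classification_groups):
        """Obtaining unit: assemble the target classification rule base used to classify new short texts."""
        raise NotImplementedError
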
10. An electronic device comprising a processor and a memory, wherein the memory stores program code that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1-8.
11. A computer-readable storage medium, characterized in that it comprises program code which, when run on an electronic device, causes the electronic device to perform the steps of the method of any of claims 1-8.
CN202210733565.5A 2022-06-27 2022-06-27 Text classification method and device, electronic equipment and storage medium Active CN115062150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210733565.5A CN115062150B (en) 2022-06-27 2022-06-27 Text classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210733565.5A CN115062150B (en) 2022-06-27 2022-06-27 Text classification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115062150A CN115062150A (en) 2022-09-16
CN115062150B (en) 2024-04-02

Family

ID=83202565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210733565.5A Active CN115062150B (en) 2022-06-27 2022-06-27 Text classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115062150B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879441B (en) * 2022-11-10 2024-04-12 中国科学技术信息研究所 Text novelty detection method and device, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EA201800581A1 (en) * 2018-09-24 2020-03-31 Общество С Ограниченной Ответственностью "Незабудка Софтвер" SEARCH METHOD IN TEXT OF MATCHES WITH PATTERNS
CN110968692A (en) * 2019-10-23 2020-04-07 全球能源互联网研究院有限公司 Text classification method and system
CN111651586A (en) * 2020-05-29 2020-09-11 北京小米松果电子有限公司 Rule template generation method for text classification, classification method and device, and medium
CN114548093A (en) * 2022-02-28 2022-05-27 亿咖通(湖北)技术有限公司 Natural language processing method, device, equipment, medium and program product
CN114580399A (en) * 2022-03-15 2022-06-03 中国工商银行股份有限公司 Text error correction method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9588958B2 (en) * 2006-10-10 2017-03-07 Abbyy Infopoisk Llc Cross-language text classification

Also Published As

Publication number Publication date
CN115062150A (en) 2022-09-16

Similar Documents

Publication Publication Date Title
CN111488426B (en) Query intention determining method, device and processing equipment
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
JP5171813B2 (en) Demographic classification for local word wheeling / web search
CN111914156B (en) Cross-modal retrieval method and system for self-adaptive label perception graph convolution network
CN111353030A (en) Knowledge question and answer retrieval method and device based on travel field knowledge graph
CN112800170A (en) Question matching method and device and question reply method and device
Su et al. Making sense of trajectory data: A partition-and-summarization approach
US20160275196A1 (en) Semantic search apparatus and method using mobile terminal
CN104281702B (en) Data retrieval method and device based on electric power critical word participle
KR20200019824A (en) Entity relationship data generating method, apparatus, equipment and storage medium
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN112836487B (en) Automatic comment method and device, computer equipment and storage medium
CN111125460A (en) Information recommendation method and device
CN102663022B (en) Classification recognition method based on URL (uniform resource locator)
CN110162644A (en) A kind of image set method for building up, device and storage medium
CN113095080B (en) Theme-based semantic recognition method and device, electronic equipment and storage medium
US11494559B2 (en) Hybrid in-domain and out-of-domain document processing for non-vocabulary tokens of electronic documents
CN103914513A (en) Entity input method and device
CN104516910A (en) Method and system for recommending content in client-side server environment
CN112818092B (en) Knowledge graph query statement generation method, device, equipment and storage medium
US20190056235A1 (en) Path querying method and device, an apparatus and non-volatile computer storage medium
CN114328807A (en) Text processing method, device, equipment and storage medium
CN115062150B (en) Text classification method and device, electronic equipment and storage medium
CN110427404A (en) A kind of across chain data retrieval system of block chain
CN113128431A (en) Video clip retrieval method, device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40074434

Country of ref document: HK

GR01 Patent grant