CN115563990A - Information classification method, information classification device, and readable storage medium - Google Patents

Information classification method, information classification device, and readable storage medium

Info

Publication number
CN115563990A
Authority
CN
China
Prior art keywords
information, preset, intention, parameter, target
Legal status
Pending
Application number
CN202211181226.7A
Other languages
Chinese (zh)
Inventor
任欣源
吴士中
方高林
Current Assignee
Yonyou Network Technology Co Ltd
Original Assignee
Yonyou Network Technology Co Ltd
Application filed by Yonyou Network Technology Co Ltd
Priority to CN202211181226.7A
Publication of CN115563990A

Classifications

    • G06F40/35 Discourse or dialogue representation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F40/237 Lexical tools
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention provides an information classification method, an information classification device, and a readable storage medium. The information classification method includes: in a case where first information is received, extracting a plurality of target keywords of the first information; determining, according to the plurality of target keywords, a first parameter corresponding to each intention category in a preset word bank, wherein the first parameter is associated with the proportion of the target keywords in the vocabulary set corresponding to that intention category; and determining a target intention category among a plurality of intention categories according to the first parameters. The preset word bank includes the plurality of intention categories and a plurality of vocabulary sets, and the vocabulary sets correspond to the intention categories one to one. The method meets the small-data-volume requirements of the 2B field, allows server resources to be allocated reasonably, requires no large amount of computation or training, and improves the speed of recognizing and responding to the intention category of information.

Description

Information classification method, information classification device, and readable storage medium
Technical Field
The present invention relates to the field of information search technologies, and in particular, to an information classification method, an information classification device, and a readable storage medium.
Background
At present, dialogue, question answering, and search over information all require confirming the intention category of the information. In the 2C (to-consumer) field, a deeply trained network model is usually adopted; such a model handles a large data volume, responds slowly, and cannot display the process by which the intention category is confirmed.
Compared with the 2C field, the 2B (to-business) field has a smaller data volume and needs to display the confirmation process of the intention category, so a deeply trained network model is not suited to the 2B field and cannot meet its response-time requirements.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art or the related art.
In view of the above, a first aspect of the present invention provides an information classification method.
A second aspect of the present invention provides an information classification apparatus.
A third aspect of the present invention provides an information classification apparatus.
A fourth aspect of the invention provides a readable storage medium.
In order to achieve at least one of the above objects, the first aspect of the present invention provides an information classification method, including: in a case where first information is received, extracting a plurality of target keywords of the first information; determining, according to the plurality of target keywords, a first parameter corresponding to each intention category in a preset word bank, wherein the first parameter is associated with the proportion of the target keywords in the vocabulary set corresponding to that intention category; and determining a target intention category among a plurality of intention categories according to the first parameters; wherein the preset word bank includes the plurality of intention categories and a plurality of vocabulary sets, and the vocabulary sets correspond to the intention categories one to one.
In this technical solution, a plurality of target keywords that represent the first information can be extracted in dialogue, question-answering, and search scenarios; the proportion that these keywords occupy in the vocabulary set of each intention category in the preset word bank is determined, and the intention category of the first information is determined from those proportions. Compared with a semantic training model, the preset word bank only needs to store a certain number of intention categories and the first parameters corresponding to them, and the target intention category is determined through lightweight computation on the first parameters. This meets the small-data-volume requirements of the 2B field, requires no large amount of computation or training, and improves the speed of recognizing and responding to the intention category of information.
Furthermore, because a plurality of keywords are extracted, the corresponding computations can run on multiple threads at the same time, further accelerating recognition and response. At the same time, the sequence of extracting the target keywords, determining the first parameter corresponding to each intention category in the preset word bank, and determining the target intention category among the plurality of intention categories from the first parameters exposes the complete confirmation process of the intention category, giving the user a reasonable, interpretable result and letting the user follow how the target intention category was confirmed.
In addition, even for first information whose target intention category is not identified, a result is still displayed, so that it can be fed back to the user in time and the preset word bank can be updated promptly according to the user's needs; the target intention category can then be determined in time in subsequent dialogue, question-answering, and search scenarios, improving the user experience.
In addition, the information classification method in the above technical solution provided by the present invention may further have the following additional technical features:
In the above technical solution, in a case where the first information is received, extracting the plurality of target keywords of the first information includes: identifying preset keywords in the first information through the preset word bank; in a case where the first information is identified as including a preset keyword, determining the preset keyword in the first information as a target keyword; in a case where the first information is not identified as including a preset keyword, querying second information in a preset corpus, wherein the semantics of the second information match the semantics of the first information; and performing word segmentation on the second information to obtain the target keywords.
In this technical solution, when the first information is identified as including a preset keyword, the preset keyword in the first information is determined as a target keyword, which avoids confirming the target keyword repeatedly. When the first information is not identified as including a preset keyword, second information is queried in a preset corpus. Compared with the large amount of data stored by a semantic training model, the preset corpus only pre-stores a plurality of intention categories, each of which is provided with a plurality of pieces of second information; this reduces the storage pressure on the server and meets the requirements of the 2B field. Because the second information matches the semantics of the first information, the accuracy of intention category recognition is guaranteed.
In any of the above technical solutions, after performing word segmentation on the second information to obtain the target keyword, the method further includes: storing the target keyword corresponding to the second information in the preset word bank.
In this technical solution, storing the target keywords corresponding to the second information in the preset word bank allows the preset word bank to be called quickly while keeping the amount of stored data small; this reduces the storage pressure on the preset word bank and meets the requirements of the 2B field, so the smaller data volume is processed faster and the user experience is improved.
In any of the above technical solutions, before identifying the preset keywords in the first information through the preset word bank, the method further includes: extracting the preset keywords from a preset corpus to generate the preset word bank.
In this technical solution, the preset keywords are extracted from the preset corpus to generate the preset word bank, so the preset corpus and the preset word bank can be set up independently. This avoids the excessive storage pressure that results from keeping all data in a single database and simplifies the process of confirming the target intention category. The preset corpus and the preset word bank each process their own information and data independently, achieving asynchronous processing of information and data and thereby improving the speed of recognizing and responding to the intention category of information.
In any of the above technical solutions, extracting the preset keywords from the preset corpus to generate the preset word bank includes: acquiring a first lexicon corresponding to each intention category in the preset corpus; determining a second parameter of each first word in the first lexicon, wherein the second parameter is associated with the frequency of occurrence of the first word in the first lexicon; determining a third parameter according to the frequency of occurrence, the total number of intention categories, and the number of intention categories corresponding to the first word; and screening the preset keywords in each first lexicon through the third parameter to generate the preset word bank.
In this technical solution, each intention category in the preset corpus has its own first lexicon, which guarantees that the intention categories are set up with a certain degree of distinctiveness. The third parameter is determined and the preset keywords in each first lexicon are screened through it; the computation does not scale with data volume, so compared with a semantic training model the speed of recognizing and responding to the intention category of information is improved.
In any of the above technical solutions, before acquiring the first lexicon corresponding to each intention category in the preset corpus, the method further includes: receiving third information from a client and an intention category corresponding to the third information; and updating the preset corpus according to the third information and its corresponding intention category.
In this technical solution, the preset corpus can be updated according to the user's needs on the basis of the received third information and its corresponding intention category, thereby meeting the requirements of dialogue, question-answering, and search scenarios and improving the user experience.
In any of the above technical solutions, determining the first parameter of each of the plurality of intention categories according to the plurality of target keywords includes: calculating a fourth parameter of each target keyword in each vocabulary set, wherein the fourth parameter is associated with the proportion of the target keyword in the vocabulary set; and calculating the sum of the plurality of fourth parameters in each vocabulary set and determining the sum as the first parameter, wherein the plurality of fourth parameters correspond one-to-one to the plurality of target keywords.
In this technical solution, the sum of the fourth parameters corresponding to each vocabulary set is calculated, which normalizes the fourth parameters, and the sum is determined as the first parameter. Compared with a semantic training model the calculation is greatly simplified, which meets the requirements of the 2B field and improves the speed of recognizing and responding to the intention category of information.
A second aspect of the present application provides an information classification apparatus, including: an extracting unit configured to extract a plurality of target keywords of first information in a case where the first information is received; and a determining unit configured to determine, according to the plurality of target keywords, a first parameter corresponding to each intention category in a preset word bank, the first parameter being associated with the proportion of the target keywords in the vocabulary set corresponding to that intention category; the determining unit being further configured to determine a target intention category among the plurality of intention categories according to the first parameters.
In this technical solution, the extracting unit extracts a plurality of target keywords that represent the first information in dialogue, question-answering, and search scenarios, and the determining unit determines, from the target keywords, the proportion of each intention category's vocabulary set that they cover and, from those proportions, the intention category of the first information. Compared with a semantic training model, the preset word bank only needs to store a certain number of intention categories and their corresponding first parameters, keeping data storage lightweight; the data only needs to be computed once, which improves the speed of recognizing and responding to the intention category of information and meets the dialogue, question-answering, and search requirements of the 2B field.
A third aspect of the present invention provides an information classification apparatus, including: a memory and a processor, the memory storing a program or instructions for execution on the processor, the program or instructions when executed by the processor implementing the steps of the information classification method of the first aspect.
In this technical solution, the information classification apparatus includes a memory and a processor, where the memory stores a program or an instruction running on the processor, and the program or the instruction implements the steps of the information classification method of the first aspect when executed by the processor, so as to have all the beneficial technical effects of any one of the above technical solutions, and no further description is given here.
A fourth aspect of the present invention proposes a readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the information classification method of the first aspect.
In this technical solution, a readable storage medium stores a program or an instruction thereon, and the program or the instruction, when executed by a processor, implements the steps of the information classification method according to the first aspect, so as to have all the beneficial technical effects of any one of the above technical solutions, and no further description is provided herein.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 shows one of the flow diagrams of an information classification method according to one embodiment of the invention;
FIG. 2 is a second flowchart of an information classification method according to an embodiment of the invention;
FIG. 3 is a third flowchart of an information classification method according to an embodiment of the invention;
FIG. 4 shows a fourth flowchart of an information classification method according to an embodiment of the invention;
FIG. 5 shows a fifth flowchart of an information classification method according to an embodiment of the invention;
FIG. 6 shows one of the processing diagrams of an information classification method according to one embodiment of the invention;
FIG. 7 illustrates a second process diagram of an information classification method according to an embodiment of the invention;
FIG. 8 is a third process diagram illustrating an information classification method according to an embodiment of the invention;
fig. 9 shows one of the block diagrams of the structure of an information classifying apparatus according to an embodiment of the present invention;
fig. 10 shows a second block diagram of the information classification apparatus according to an embodiment of the present invention.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein and, therefore, the scope of the present invention is not limited by the specific embodiments disclosed below.
Information classification methods, information classification apparatuses, and readable storage media according to some embodiments of the present invention are described below with reference to fig. 1 to 10.
Example one
As shown in fig. 1, an information classification method according to some embodiments of the present application includes:
step 102, under the condition that first information is received, extracting a plurality of target keywords of the first information;
the first information may be information input by the user, for example, question and answer information, dialogue information, search information, and the like input by the user. The target keyword may specifically be semantic-related information in question and answer information, semantic-related information in dialog information, semantic-related information in question and answer information, and the like.
104, determining a first parameter corresponding to each intention category in a preset word stock according to the target keywords, wherein the first parameter is associated with the occupation ratio of the target keywords in a vocabulary set corresponding to each intention category;
when question and answer information, dialogue information and search information input by a user are received, the intention category related to the semantic information in the preset word bank can be determined through the extracted target keywords related to the semantic information in the input information.
A plurality of intention categories are stored in the preset word stock. A plurality of target keywords are correspondingly arranged in each intention category in the preset word stock, and the target keywords form a vocabulary set corresponding to the intention category. The preset word library further stores a first parameter corresponding to each intention category, for example, the first parameter may be a weight, and the weight specifically indicates the importance degree of each target keyword to the intention category. The occupation ratio of each target keyword in the vocabulary set is different, and the importance degree of the target keyword to the intention category is different.
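Purely as an illustration (not part of the original disclosure), the preset word bank described above can be pictured as a mapping from each intention category to its weighted vocabulary set; all names and weights below are hypothetical, and Python is used only as a convenient notation.

```python
# Hypothetical sketch of the preset word bank: each intention category maps to a
# vocabulary set in which every keyword carries a weight (its importance to that category).
preset_word_bank = {
    "schedule": {"schedule": 1.00, "customize": 1.00, "reminder": 0.80},
    "phone":    {"call": 0.9954, "dial": 0.85},
    "weather":  {"weather": 0.95, "rain": 0.70, "sunny": 0.60},
}

def first_parameter(target_keywords, vocabulary_set):
    """Accumulate the weights of the target keywords found in one vocabulary set."""
    return sum(vocabulary_set.get(word, 0.0) for word in target_keywords)

scores = {
    category: first_parameter(["customize", "call", "schedule"], vocab)
    for category, vocab in preset_word_bank.items()
}
print(max(scores, key=scores.get))  # the target intention category
```

Because each category's first parameter is computed independently from the same keywords, the per-category computations can be placed on separate threads, which is the parallelism the next paragraph refers to.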
Because a plurality of target keywords are extracted, each of them can be processed independently and the first parameters can be confirmed on multiple threads; this speeds up the processing of the input information on the one hand and, on the other hand, improves the accuracy of intention recognition by confirming the first parameters over several target keywords.
And 106, determining a target intention category in the plurality of intention categories according to the first parameter.
In the present application, the target intention category is determined according to the first parameters corresponding to the intention categories in the preset word bank. The computation is lightweight and runs over preset data, so no additional data derived from the received first information needs to be computed, which reduces the amount of calculation and speeds up the determination of the target intention category.
A semantic training model in the related art requires a large amount of training data up front; the present application needs no such large-scale computation and can meet the requirements of the 2B field. Moreover, a semantic training model must load its parameter data and the data related to the input information, so its response time is long, whereas the present application loads no model data and works on the pre-stored data of the preset word bank, shortening response time and accelerating responses.
Meanwhile, a semantic training model determines the semantic information of the input by having the model evaluate a large amount of data, and recognizing the overall semantics of the whole input is a complex process. The present application instead classifies against preset intentions: by comparing the parameters corresponding to the plurality of intention categories, the intention category most strongly related to the input information is found and taken as the target intention category, and the semantic information of the input is thus determined through the target intention category.
In the present application, the extracted target keywords accurately reflect the semantic information of the first information. The first parameter corresponding to each intention category in the preset word bank is determined from the target keywords, and the target intention category is determined from the first parameters. On the one hand, the first parameter of each intention category is computed independently, enabling multi-threaded computation over the intention categories; on the other hand, compared with a semantic training model, the lightweight computation over lightweight pre-stored data reduces both the amount of data to be computed and the preparation time of model training, meeting the requirements of the 2B field.
Further, the sequence of extracting the target keywords, determining the first parameter corresponding to each intention category in the preset word bank, and determining the target intention category among the plurality of intention categories from the first parameters exposes the complete confirmation process of the intention category, giving the user a reasonable, interpretable result for the target intention category.
As shown in fig. 2, in one embodiment of the present application, in a case where the first information is received, extracting a plurality of target keywords of the first information includes:
step 202, identifying preset keywords in the first information through a preset word bank;
the preset keywords are keywords set in advance, and the use requirements of the user can be reflected.
Step 204, identifying whether the first information comprises a preset keyword, if so, executing step 206, and if not, executing step 208;
step 206, determining a preset keyword in the first information as a target keyword;
the preset keywords in the first information are determined as the target keywords, so that repeated confirmation of the target keywords can be avoided when subsequent input information is similar dialogue information, question and answer information and search information, the identification difficulty of the target keywords is reduced, the identification speed of the target keywords is improved, and the use experience of a user is improved.
Step 208, inquiring second information in a preset corpus, wherein the semantics of the second information is matched with that of the first information;
the second information may be sentences in a predetermined corpus, and a plurality of sentences corresponding to each intention category are stored in the predetermined corpus.
And step 210, performing word segmentation processing on the second information to obtain a target keyword.
Because the semantics of the second information match those of the first information, processing the matched second information ensures the accuracy of the information processing, and target keywords that have a definite matching relationship with the first information can be obtained.
Word segmentation of the second information specifically includes slot-word processing and word segmentation with stop-word removal. A slot is a predefined attribute. Taking the second information "I want to take a car" as an example, the departure place, destination, and departure time are predefined in the second information and can be determined from "I want to take a car": the attribute of the departure-place slot is the departure place, the attribute of the destination slot is the destination, and the attribute of the departure-time slot is the departure time; each slot corresponds to one slot value, which can be understood as a slot-filling scheme. Custom word segmentation can be configured according to user requirements. The server can regenerate the slot-word segmentation table on a certain period; to reduce server load, the table can be generated while the server is offline, and the specific generation period can be set according to user requirements.
Word segmentation with stop-word removal specifically means splitting the second information into a plurality of words and then removing the stop words among them. Taking the second information "I want to customize a schedule of making a call" as an example, "I", "want", and "a" are stop words; stop words are typically common articles, prepositions, conjunctions, pronouns, and the like.
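As a rough illustration of custom (slot-word) segmentation followed by stop-word removal, the sketch below uses a greedy longest-match over a hypothetical custom word list; a real deployment would more likely use a proper Chinese segmenter (such as jieba) together with the slot-word segmentation table described above.

```python
# Minimal greedy longest-match segmentation over a custom (slot) word list,
# followed by stop-word removal; all word lists here are hypothetical examples.
CUSTOM_WORDS = {"customize", "schedule", "make a call", "departure place", "destination"}
STOP_WORDS = {"i", "want", "to", "a", "of", "the"}

def segment(sentence, custom_words=CUSTOM_WORDS):
    """Greedy longest-match segmentation on a whitespace-tokenised sentence."""
    tokens = sentence.lower().split()
    words, i = [], 0
    while i < len(tokens):
        # try to match the longest multi-token custom word first
        for length in range(len(tokens) - i, 0, -1):
            candidate = " ".join(tokens[i:i + length])
            if length == 1 or candidate in custom_words:
                words.append(candidate)
                i += length
                break
    return words

def remove_stop_words(words, stop_words=STOP_WORDS):
    return [w for w in words if w not in stop_words]

print(remove_stop_words(segment("I want to customize a schedule of make a call")))
# -> ['customize', 'schedule', 'make a call']
```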
When the first information is identified as including a preset keyword, the preset keyword in the first information is determined as the target keyword; when it is not, a sentence in the preset corpus whose semantics are similar to those of the first information is queried and segmented to obtain the target keywords. The two cases, preset keyword recognized and preset keyword not recognized, are handled independently, so target keywords can be obtained whether or not the input information is recognized, improving the user experience. A plurality of intention categories are pre-stored in the preset corpus, and each intention category is provided with a plurality of pieces of second information; for example, 22 intention categories and 6500 pieces of data may be stored.
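The branch just described (use the preset keywords when any are recognized, otherwise fall back to a semantically matching sentence in the preset corpus) could be organised roughly as follows. Here `semantic_match` is a hypothetical stand-in for whatever sentence matching is actually used, and `segment`/`remove_stop_words` are the helpers sketched above.

```python
def extract_target_keywords(first_information, preset_word_bank, preset_corpus):
    """Steps 202-210 as a sketch: preset-keyword recognition with a corpus fallback."""
    preset_keywords = {kw for vocab in preset_word_bank.values() for kw in vocab}
    recognized = [kw for kw in preset_keywords if kw in first_information.lower()]
    if recognized:                      # step 206: preset keywords were found
        return recognized
    second_information = semantic_match(first_information, preset_corpus)  # step 208 (hypothetical matcher)
    if second_information is None:
        return []                       # no target intention category will be identified
    return remove_stop_words(segment(second_information))                 # step 210

def semantic_match(sentence, corpus):
    """Hypothetical placeholder: return the corpus sentence sharing the most tokens."""
    tokens = set(remove_stop_words(segment(sentence)))
    best = max(corpus, key=lambda s: len(tokens & set(segment(s))), default=None)
    return best if best and tokens & set(segment(best)) else None
```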
In an embodiment of the application, after the word segmentation processing is performed on the second information to obtain the target keyword, the target keyword corresponding to the second information is stored in a preset word bank.
In this embodiment, the target keywords corresponding to the second information are stored in the preset word bank, which makes the preset word bank quick to call. Because only the target keywords and the first parameters are stored in the preset word bank, no sentences need to be stored there and no first parameters need to be stored in the preset corpus; this reduces the storage pressure on both the preset word bank and the preset corpus, meets the requirements of the 2B field, gives faster processing on the smaller data volume, and improves the user experience.
In an embodiment of the present application, before the identifying the preset keyword in the first information through the preset lexicon, the method further includes: and extracting preset keywords from a preset corpus to generate a preset word bank.
In this embodiment, the preset keywords are extracted from the preset corpus to generate the preset word bank. A sentence in the preset corpus can be segmented into a plurality of word segments, and the preset keywords are determined according to the importance of each word segment. Each intention category corresponds to a plurality of sentences; a plurality of preset keywords is determined over these sentences, and the preset word bank is generated after the preset keywords of the sentences are collected.
The preset corpus and the preset word bank are set up independently, which avoids the excessive storage pressure of keeping all data in the same database. Each of them processes its own sentences and data independently, achieving asynchronous processing of sentences and data and thereby improving the speed of recognizing and responding to the intention category of information.
As shown in fig. 3, in an embodiment of the present application, extracting a preset keyword from a preset corpus to generate a preset lexicon includes:
step 302, acquiring a first word stock corresponding to each intention category in a preset corpus;
the preset corpus stores a plurality of intention categories, each intention is different, the corresponding categories are different, for example, the intention "weather" and the intention "make a call" are two completely different categories, the corresponding sentences are different, the keywords are different, each intention category is provided with a plurality of sentences correspondingly, each sentence is subjected to word segmentation to form a plurality of words, and the words form a first word bank.
For example, a preset corpus may be provided with a "weather" intention, which may be a plurality of sentences, such as "today is a sunny day", "today is bad weather", and "it is unknown that the weather does not rain" in the sunny day, and after processing the plurality of sentences, the vocabularies of "sunny day", "weather" and "rain" are obtained, and a first corpus is formed for "sunny day", "weather" and "rain".
Step 304, determining a second parameter of each first vocabulary in the first vocabulary bank, wherein the second parameter is associated with the occurrence frequency of the first vocabulary in the first vocabulary bank;
the first thesaurus stores a plurality of first words processed by sentences, the first words can appear in the sentences for a plurality of times, and the second parameter is related to the number of times of the first words appearing in the sentences, for example, when the input information is dialogue information, the sentences in the first thesaurus can be' A: how do the weather today? B: bad weather today ". "weather" as a first vocabulary appears twice in the sentence, while the second parameter is related to twice.
The second parameter may be calculated from the number of occurrences of each word of the first lexicon in the corresponding intention category and the number of occurrences of all words of the first lexicon in that intention category, which may be expressed by the following formula:
tf_word = N_word / N_all
where tf_word denotes the second parameter, namely the frequency of occurrence of a word in the first lexicon, N_word denotes the number of occurrences of that word in the first lexicon, and N_all denotes the total number of occurrences of all words in the first lexicon.
Step 306, determining a third parameter according to the occurrence frequency, the total number of the intention categories and the number of the intention categories corresponding to the first vocabulary;
the third parameter may be a word frequency-inverse text frequency, and the third parameter may be determined by the following formula:
Figure BDA0003866875130000112
wherein tfidf word Representing word frequency-inverse text frequency, D word Indicating the number of intention categories corresponding to a word, and D indicating the total number of intention categories.
And 308, screening the preset keywords in each first word stock through the third parameters to generate a preset word stock.
The third parameter may be calculated by the TF-IDF (term frequency-inverse text frequency) method, which accounts for how many intention categories contain the same keyword.
K words are extracted from the first lexicon corresponding to each intention category; K can be chosen according to the actual data volume, typically between 200 and 1000. The method of calculating the third parameter is not particularly limited in this application. Because each intention category in the preset corpus has its own first lexicon, the intention categories are guaranteed a certain degree of distinctiveness. The second parameter is determined from the number of occurrences of each word of the first lexicon in the corresponding intention category and the number of occurrences of all words of the first lexicon in that category, so the user can follow how the second parameter is determined, increasing the user's understanding of how the target intention category is determined.
The third parameter is determined from the frequency of occurrence, the total number of intention categories, and the number of intention categories corresponding to each word of the first lexicon, so the user can likewise follow how the third parameter is determined. The determination of the target intention category therefore does not depend on computing over a large data volume. Moreover, whereas the computation of a semantic training model is complex and its results cannot be clearly displayed, the present application can display the complete determination process; the user can judge from this process whether the target intention category is reasonable and adjust the processing steps as needed, which improves the user experience.
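Steps 302-308 can be sketched end to end under the formulas above (tf = N_word / N_all, tfidf = tf * log(D / D_word)); the corpus contents and the value of K below are illustrative only, whereas the description suggests a K between 200 and 1000 in practice.

```python
import math
from collections import Counter

def build_preset_word_bank(preset_corpus, k=3):
    """Steps 302-308 (sketch): per-category tf, cross-category idf, keep the top-K words."""
    # Step 302: first lexicon (word counts) per intention category
    first_lexicons = {cat: Counter(w for s in sents for w in s.split())
                      for cat, sents in preset_corpus.items()}
    total_categories = len(first_lexicons)                        # D
    category_count = Counter()                                    # D_word per word
    for counts in first_lexicons.values():
        category_count.update(counts.keys())

    word_bank = {}
    for cat, counts in first_lexicons.items():
        n_all = sum(counts.values())
        scored = {}
        for word, n_word in counts.items():
            tf = n_word / n_all                                   # step 304: second parameter
            idf = math.log(total_categories / category_count[word])
            scored[word] = tf * idf                               # step 306: third parameter
        # Step 308: screen the top-K words of this category as preset keywords
        word_bank[cat] = dict(sorted(scored.items(), key=lambda x: -x[1])[:k])
    return word_bank

# Illustrative (hypothetical) corpus with a few English placeholder sentences
corpus = {"weather": ["today is a sunny day", "the weather is bad today"],
          "phone":   ["please call him now", "dial the number and call back"]}
print(build_preset_word_bank(corpus))
```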
As shown in fig. 4, in an embodiment of the present application, before acquiring the first lexicon corresponding to each intention category in the preset corpus, the method further includes:
step 402, receiving third information from the client and an intention category corresponding to the third information;
the third information may be a sentence in which the keyword is not recognized, and the sentence that the user can input, for example, the input sentence is "today is a typhoon day", and the user sets the intention category corresponding to the sentence to "weather".
Step 404, updating the preset corpus according to the third information and the intention category corresponding to the third information.
When the first information is not recognized and does not match any second information, the user is asked to enter the relevant information and its corresponding intention category into the preset corpus, so that the intention category of similar sentences can be recognized accurately when they appear later.
The user can change the correspondence between intention categories and sentences in the pre-stored corpus as needed, and can increase or decrease the number of sentences; everything is configured according to the user's requirements, improving the user experience.
The user can also set an update cycle in advance. For example, with a cycle of seven days, the information collected over seven days and its corresponding intention categories are sent to the preset corpus together on the eighth day; this avoids irregular, ad-hoc updates and improves the user experience.
By receiving the user's third information and its corresponding intention category and updating them into the preset corpus, the target intention category can be recognized when input similar to the third information is later received, which improves the accuracy of target intention category recognition and the user experience.
As shown in fig. 5, in one embodiment of the present application, determining the first parameter of each of the plurality of intention categories according to the plurality of target keywords comprises:
step 502, calculating a fourth parameter of each target keyword in each vocabulary set, wherein the fourth parameter is associated with the proportion of the target keyword in the vocabulary set;
the fourth parameter may be a weight value of each target keyword in the vocabulary set, indicating the importance of each target keyword in the vocabulary set.
Step 504, calculating a sum of a plurality of fourth parameters corresponding to each vocabulary set, and determining the sum as the first parameter, wherein each fourth parameter in the plurality of fourth parameters corresponds to each keyword in the plurality of keywords one by one.
As shown in fig. 6, the sum of a plurality of fourth parameters may be calculated using a DF (Document frequency) method. Taking the first information as "i want to customize a schedule of making a call", the keywords of the first information are identified as customization, making a call and schedule. Three intention categories, namely phone, schedule, weather and business-trip, are prestored in the DF document frequency, only schedule corresponding to the intention categories is customized, the score corresponding to the schedule is 100%, the score corresponding to the call is 99.54%, the score corresponding to the schedule is 0.22%, the score corresponding to the weather is 0.22%, the score corresponding to the schedule is 100%, the sum of the scores of the three keywords is calculated to be 0.9954, 2.0022, 0.0022 and 0, the sum of the scores of the four keywords is calculated to be 0, the sum of the scores of the four intention categories and the values are compared, the sum of the orders is determined to be the maximum, and the schedule is taken as a target intention category.
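For concreteness, the comparison above can be reproduced in a few lines; the per-keyword scores are the ones quoted from fig. 6, and the category and keyword names are simply those of the example.

```python
# Per-keyword scores in each intention category, as quoted above (fig. 6)
keyword_scores = {
    "customize":   {"schedule": 1.0000},
    "make a call": {"phone": 0.9954, "schedule": 0.0022, "weather": 0.0022},
    "schedule":    {"schedule": 1.0000},
}
categories = ["phone", "schedule", "weather", "business-trip"]

sums = {cat: sum(scores.get(cat, 0.0) for scores in keyword_scores.values())
        for cat in categories}
print(sums)                       # {'phone': 0.9954, 'schedule': 2.0022, 'weather': 0.0022, 'business-trip': 0.0}
print(max(sums, key=sums.get))    # 'schedule' -> the target intention category
```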
For calculating the sum of the fourth parameters, methods such as TF (term frequency) or TF-IDF (term frequency-inverse text frequency) may also be used; the present application does not limit the method of calculating the sum. When TF-IDF is used, the calculated sum is the sum of the TF-IDF scores.
By calculating the fourth parameter within each vocabulary set and then the sum of the fourth parameters of that vocabulary set, the importance of each keyword to the vocabulary set is determined and the fourth parameters are normalized. The sums are compared, the sum of each vocabulary set serves as its first parameter, and the target intention category is determined from the largest first parameter.
As shown in fig. 7, in an embodiment of the present application, the preset corpus provides a data interface connected to a remote database, and the server further provides a data interface to the user database through which the weight data computed for the keywords can be imported into the user database. The remote database can periodically push changed data to the user database, which facilitates sharing and fusing data across users and improves the accuracy and robustness of the information classification method.
As shown in fig. 8, in an embodiment of the present application, intentions A, B, ..., N are stored in the preset corpus, and each intention is provided with a plurality of sentences: intention A has sentences A1, A2, A3, ...; intention B has sentences B1, B2, B3, ...; and intention N has sentences N1, N2, N3, .... Each sentence is processed with custom word segmentation and with word segmentation that removes stop words, and the slot-word segmentation table is updated periodically offline. After a sentence is segmented, the TF-IDF value of each keyword is calculated and a Term-Weight table is generated, where Term is the keyword and Weight is its TF-IDF value. After the TF-IDF values are calculated, intention prediction is run on the data itself as a self-test, and the minimum score among correctly judged samples is used as the threshold, where the threshold is greater than 0.2; an intention threshold table is generated from these values.
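The threshold step is only loosely specified above, so the following is no more than one plausible reading, stated as an assumption: every training sentence is re-classified against the Term-Weight table, and for each intention the minimum score among its correctly classified sentences, floored at 0.2, becomes that intention's threshold.

```python
def build_intention_thresholds(labelled_sentences, classify, floor=0.2):
    """Hypothetical reading of the self-test step: per intention, take the minimum
    score among correctly classified training sentences, floored at 0.2."""
    minima = {}
    for sentence, true_intention in labelled_sentences:
        predicted, score = classify(sentence)       # classify() is assumed to exist
        if predicted == true_intention:
            minima[true_intention] = min(score, minima.get(true_intention, float("inf")))
    return {intention: max(score, floor) for intention, score in minima.items()}
```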
As shown in fig. 8, after the user inputs information, the input is segmented according to the slot-word segmentation table, the resulting words are passed to the preset word bank, and the preset word bank calculates the score of each keyword within each intention. For example, keyword 1 has a score of 0.2 in intention A; keywords hit intention B twice, keyword 2 with a score of 0.3 and keyword 4 with a score of 0.4; and keyword 2 has a score of 0.1 in intention C. The scores within each intention are summed, and the final score of each intention is obtained by the following formula:
Score=Sum/Num;
where Score is the score of an intention, Sum is the sum of the scores of the keywords hitting that intention, and Num is the number of times keywords hit that intention.
The score of intention A is therefore 0.2/1; the score of intention B is 0.7/2, since keywords hit intention B twice (keyword 2 once and keyword 4 once); and the score of intention C is 0.1/1. The calculation shows that intention B has the highest score, so intention B is taken as the target classification intention, and whether the target classification intention is accurate is determined by comparing it against the intention threshold table.
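The A/B/C numbers above follow directly from Score = Sum/Num; a short check (the intention labels are just the placeholders used above):

```python
# Score = Sum / Num, using the per-intention totals quoted above
hits = {"A": (0.2, 1), "B": (0.7, 2), "C": (0.1, 1)}   # (Sum, Num) per intention
scores = {intent: s / n for intent, (s, n) in hits.items()}
print(scores)                        # {'A': 0.2, 'B': 0.35, 'C': 0.1}
print(max(scores, key=scores.get))   # 'B' -> the target classification intention
```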
The Term-Weight table and the intention threshold table are calculated separately, splitting what would otherwise be one mixed flow into two independent flows; this makes errors easier to locate and improves the accuracy of the calculation. Each calculation step also generates a table associated with that step, so the user can intuitively follow the information classification process, which makes the confirmation of the target intention category interpretable. The format of the generated tables is not particularly limited in this application.
Example two
As shown in fig. 9, an embodiment of the present application proposes an information classification apparatus 900 including: an extracting unit 902, configured to, in a case where the first information is received, extract a plurality of target keywords of the first information; a determining unit 904, configured to determine, according to the multiple target keywords, a first parameter corresponding to each intention category in the preset lexicon, where the first parameter is associated with a proportion of the multiple target keywords in a vocabulary set corresponding to each intention category; the determining unit 904 is further configured to determine a target intention category of the plurality of intention categories according to the first parameter.
In this embodiment, the extracting unit 902 extracts a plurality of target keywords that represent the first information in dialogue, question-answering, and search scenarios, and the determining unit 904 determines the proportion of each intention category's vocabulary set covered by the target keywords and, from those proportions, the intention category of the first information. Compared with a semantic training model, the preset word bank only needs to store a certain number of intention categories and their corresponding first parameters, keeping data storage lightweight; the data only needs to be computed once, which improves the speed of recognizing and responding to the intention category of information and meets the dialogue, question-answering, and search requirements of the 2B field.
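As a non-authoritative sketch only, the extracting unit 902 and the determining unit 904 could be organised as a single class along the following lines; every name here is hypothetical.

```python
class InformationClassifier:
    """Sketch of the information classification apparatus (units 902/904)."""

    def __init__(self, preset_word_bank):
        # preset word bank: {intention category: {keyword: weight}}
        self.preset_word_bank = preset_word_bank

    def extract(self, first_information):
        """Extracting unit 902: pick out preset keywords contained in the input."""
        keywords = {kw for vocab in self.preset_word_bank.values() for kw in vocab}
        return [kw for kw in keywords if kw in first_information.lower()]

    def determine(self, target_keywords):
        """Determining unit 904: first parameter per category, then the arg-max."""
        first_parameters = {
            category: sum(vocab.get(kw, 0.0) for kw in target_keywords)
            for category, vocab in self.preset_word_bank.items()
        }
        return max(first_parameters, key=first_parameters.get), first_parameters
```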
EXAMPLE III
As shown in fig. 10, an embodiment of the present invention proposes an information classification apparatus 1000 including: a memory 1002 and a processor 1004, the memory 1002 storing a program or instructions running on the processor 1004, the program or instructions when executed by the processor implementing the steps of the information classification method of any of the embodiments described above.
In this embodiment, the information classification apparatus 1000 includes a memory 1002 and a processor 1004, where the memory 1002 stores a program or an instruction running on the processor 1004, and the program or the instruction implements the steps of the information classification method according to any of the above embodiments when executed by the processor, so that all the beneficial technical effects of any of the above embodiments are achieved, and are not described herein again.
Example four
An embodiment of the present invention proposes a readable storage medium, on which a program or instructions are stored, which when executed by a processor, implement the steps of the information classification method of any of the above embodiments.
In this embodiment, the readable storage medium stores a program or an instruction, and the program or the instruction, when executed by the processor, implements the steps of the information classification method according to any one of the above embodiments, so as to have all the beneficial technical effects of any one of the above embodiments, and therefore, the description is omitted here.
Further, it will be understood that any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and that the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware that is related to instructions of a program, and the program may be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An information classification method, comprising:
under the condition that first information is received, extracting a plurality of target keywords of the first information;
determining a first parameter corresponding to each intention category in a preset word bank according to the target keywords, wherein the first parameter is associated with the proportion of the target keywords in a vocabulary set corresponding to each intention category;
determining a target intent category of the plurality of intent categories based on the first parameter;
the preset word stock comprises a plurality of intention categories and a plurality of vocabulary sets, and the vocabulary sets correspond to the intention categories one by one.
2. The information classification method according to claim 1, wherein, in a case where the first information is received, the extracting a plurality of target keywords of the first information comprises:
identifying preset keywords in the first information through the preset lexicon;
in a case where the first information is identified to include a preset keyword, determining the preset keyword in the first information as a target keyword;
in a case where the first information is not identified to include any preset keyword, querying a preset corpus for second information, wherein the second information matches the semantics of the first information;
and performing word segmentation processing on the second information to obtain the target keywords.
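Purely as an illustration of the extraction flow in claim 2, a Python sketch is given below; the semantic matcher preset_corpus.most_similar and the tokenize function are hypothetical helpers, not anything the claim prescribes.

    def extract_target_keywords(first_info, preset_lexicon, preset_corpus, tokenize):
        # all preset keywords known to the preset lexicon, across every category
        preset_keywords = set().union(*preset_lexicon.values())
        # identify preset keywords directly contained in the first information
        found = [kw for kw in preset_keywords if kw in first_info]
        if found:
            return found
        # otherwise, query the preset corpus for second information whose
        # semantics match the first information (matcher is assumed)
        second_info = preset_corpus.most_similar(first_info)
        # word segmentation of the second information yields the target keywords
        return tokenize(second_info)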
3. The information classification method according to claim 2, wherein after the performing word segmentation processing on the second information to obtain the target keywords, the method further comprises:
storing the target keywords corresponding to the second information in the preset lexicon.
4. The information classification method according to claim 2, wherein before the identifying preset keywords in the first information through the preset lexicon, the method further comprises:
extracting the preset keywords from the preset corpus to generate the preset lexicon.
5. The information classification method according to claim 4, wherein the extracting the preset keywords from the preset corpus to generate the preset lexicon comprises:
acquiring a first lexicon corresponding to each intention category in the preset corpus;
determining a second parameter of each first vocabulary in the first lexicon, wherein the second parameter is associated with an occurrence frequency of the first vocabulary in the first lexicon;
determining a third parameter according to the occurrence frequency, a total number of the intention categories, and a number of intention categories corresponding to the first vocabulary;
and screening the preset keywords in each first lexicon according to the third parameters to generate the preset lexicon.
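As a sketch of the screening in claim 5, one plausible reading treats the second parameter as a term frequency and the third parameter as a TF-IDF-style weight computed from that frequency, the total number of intention categories, and the number of categories containing the word; this reading, the top_k cut-off, and all names below are assumptions.

    import math

    def build_preset_lexicon(first_lexicons, top_k=50):
        # first_lexicons: assumed to map each intention category to the list of
        # first vocabularies obtained from the preset corpus for that category
        n_categories = len(first_lexicons)
        category_count = {}          # number of categories in which each word occurs
        for words in first_lexicons.values():
            for w in set(words):
                category_count[w] = category_count.get(w, 0) + 1
        preset_lexicon = {}
        for category, words in first_lexicons.items():
            freq = {}                # second parameter: occurrence frequency
            for w in words:
                freq[w] = freq.get(w, 0) + 1
            weights = {              # third parameter: frequency x inverse category frequency
                w: f * math.log(n_categories / (1 + category_count[w]))
                for w, f in freq.items()
            }
            ranked = sorted(weights, key=weights.get, reverse=True)
            preset_lexicon[category] = set(ranked[:top_k])    # screened preset keywords
        return preset_lexicon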
6. The information classification method according to claim 5, wherein before the acquiring of the first lexicon corresponding to each intention category in the preset corpus, the method further comprises:
receiving, from a client, third information and an intention category corresponding to the third information;
and updating the preset corpus according to the third information and the intention category corresponding to the third information.
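A minimal sketch of the corpus update in claim 6, assuming the preset corpus is kept as a mapping from intention categories to lists of labelled texts:

    def update_preset_corpus(preset_corpus, third_info, intention_category):
        # append the client-supplied third information under its intention category
        preset_corpus.setdefault(intention_category, []).append(third_info)
        return preset_corpus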
7. The information classification method according to any one of claims 1 to 6, wherein the determining, according to the plurality of target keywords, the first parameter corresponding to each intention category in the preset lexicon comprises:
calculating a fourth parameter of each target keyword in each vocabulary set, wherein the fourth parameter is associated with a proportion of the target keyword in the vocabulary set;
and calculating, for each vocabulary set, a sum of the corresponding plurality of fourth parameters and determining the sum as the first parameter, wherein the plurality of fourth parameters correspond one to one to the plurality of target keywords.
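To illustrate claim 7 more concretely, the fourth parameter can be read, for example, as a target keyword's weight within one vocabulary set, and the first parameter as the sum of those per-keyword values; the weighting scheme below is an assumption.

    def first_parameter(target_keywords, vocab_weights):
        # vocab_weights: assumed mapping from the words of one vocabulary set
        # to their weights within that set
        total = sum(vocab_weights.values()) or 1.0
        # fourth parameter: each target keyword's proportion within the set
        fourth_parameters = [vocab_weights.get(kw, 0.0) / total for kw in target_keywords]
        # first parameter: sum of the fourth parameters for this vocabulary set
        return sum(fourth_parameters)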
8. An information classification apparatus, comprising:
an extraction unit configured to extract a plurality of target keywords of first information in a case where the first information is received;
a determining unit configured to determine, according to the plurality of target keywords, a first parameter corresponding to each intention category in a preset lexicon, wherein the first parameter is associated with a proportion of the target keywords in a vocabulary set corresponding to each intention category;
the determining unit is further configured to determine a target intention category of the plurality of intention categories according to the first parameter.
9. An information classification apparatus, comprising: a memory and a processor, the memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the information classification method according to any one of claims 1 to 7.
10. A readable storage medium on which a program or instructions are stored, characterized in that the program or instructions, when executed by a processor, implement the steps of the information classification method according to any one of claims 1 to 7.
CN202211181226.7A 2022-09-27 2022-09-27 Information classification method, information classification device, and readable storage medium Pending CN115563990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211181226.7A CN115563990A (en) 2022-09-27 2022-09-27 Information classification method, information classification device, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211181226.7A CN115563990A (en) 2022-09-27 2022-09-27 Information classification method, information classification device, and readable storage medium

Publications (1)

Publication Number Publication Date
CN115563990A true CN115563990A (en) 2023-01-03

Family

ID=84743224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211181226.7A Pending CN115563990A (en) 2022-09-27 2022-09-27 Information classification method, information classification device, and readable storage medium

Country Status (1)

Country Link
CN (1) CN115563990A (en)

Similar Documents

Publication Publication Date Title
CN108847241B (en) Method for recognizing conference voice as text, electronic device and storage medium
US20160210962A1 (en) Methods and systems for analyzing communication situation based on dialogue act information
CN108829893A (en) Determine method, apparatus, storage medium and the terminal device of video tab
CN110413760B (en) Man-machine conversation method, device, storage medium and computer program product
CN112069298A (en) Human-computer interaction method, device and medium based on semantic web and intention recognition
CN106997342B (en) Intention identification method and device based on multi-round interaction
CN109545185B (en) Interactive system evaluation method, evaluation system, server, and computer-readable medium
CN108027814B (en) Stop word recognition method and device
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN111309916B (en) Digest extracting method and apparatus, storage medium, and electronic apparatus
CN112579733B (en) Rule matching method, rule matching device, storage medium and electronic equipment
CN112256845A (en) Intention recognition method, device, electronic equipment and computer readable storage medium
CN109582788A (en) Comment spam training, recognition methods, device, equipment and readable storage medium storing program for executing
CN114625855A (en) Method, apparatus, device and medium for generating dialogue information
CN110795942B (en) Keyword determination method and device based on semantic recognition and storage medium
CN114003682A (en) Text classification method, device, equipment and storage medium
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN114706945A (en) Intention recognition method and device, electronic equipment and storage medium
CN116662555B (en) Request text processing method and device, electronic equipment and storage medium
CN114528851B (en) Reply sentence determination method, reply sentence determination device, electronic equipment and storage medium
CN116304046A (en) Dialogue data processing method and device, storage medium and electronic equipment
CN115577109A (en) Text classification method and device, electronic equipment and storage medium
CN110874408A (en) Model training method, text recognition device and computing equipment
CN113012687B (en) Information interaction method and device and electronic equipment
CN115563990A (en) Information classification method, information classification device, and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination