CN116842168B - Cross-domain problem processing method and device, electronic equipment and storage medium - Google Patents

Cross-domain problem processing method and device, electronic equipment and storage medium

Info

Publication number
CN116842168B
CN116842168B (application CN202311105721.4A)
Authority
CN
China
Prior art keywords
input
domain
preset
sub
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311105721.4A
Other languages
Chinese (zh)
Other versions
CN116842168A (en)
Inventor
任梦星
刘迎建
彭菲
吴雅萱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hanwang Technology Co Ltd
Original Assignee
Hanwang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hanwang Technology Co Ltd
Priority to CN202311105721.4A
Publication of CN116842168A
Application granted
Publication of CN116842168B
Legal status: Active


Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
            • G06F 16/30: Information retrieval of unstructured textual data
              • G06F 16/33: Querying
                • G06F 16/332: Query formulation
                  • G06F 16/3329: Natural language query formulation or dialogue systems
                • G06F 16/3331: Query processing
                  • G06F 16/334: Query execution
              • G06F 16/35: Clustering; Classification
                • G06F 16/353: Clustering; Classification into predefined classes
          • G06F 40/00: Handling natural language data
            • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a cross-domain question processing method and apparatus, an electronic device and a storage medium, belonging to the technical field of natural language processing. The method comprises the following steps: classifying an input question based on the preset domain lexicons of the respective domains and a pre-trained text classification model to obtain a domain classification result for the input question; then taking the domain classification result and the input question together as the input of a preset language processing model, and obtaining the answer corresponding to the input question through the preset language processing model. The method can effectively improve the question-answering accuracy of the preset language processing model.

Description

Cross-domain question processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of natural language processing, and in particular to a cross-domain question processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In the field of natural language processing (NLP), with the development of deep learning technology, Transformer-based preset language processing models (such as ChatGPT and ChatGLM), that is, large models, have achieved remarkable success. These preset language processing models are trained on massive amounts of text data and therefore have strong generation and understanding capabilities. However, although they can generate fluent and seemingly reasonable text, they may also produce inaccurate or even misleading information, which makes practical deployment of such models difficult.
Text classification methods commonly used in the prior art include methods that implement text classification with deep-learning techniques and methods that implement text classification based on rules. Methods relying on deep learning (for example, classification based on BERT) involve large model parameter counts, which leads to slow classification, large training data sets, and training that occupies substantial resources and time, limiting rapid iteration and real-time application of the classification model. Rule-based text classification methods, on the other hand, usually require manually written rules, so their generalization ability is weak on new, unseen samples and they struggle with complex semantics and language variation; moreover, because they classify according to explicit rules, they cannot capture implicit semantics and context information.
Disclosure of Invention
The embodiments of the present application provide a cross-domain question processing method and apparatus, an electronic device and a storage medium, which can facilitate the practical deployment of a preset language processing model and improve its overall question-answering accuracy.
In a first aspect, an embodiment of the present application provides a cross-domain question processing method, comprising:
classifying an input question based on the preset domain lexicons of the respective domains and a pre-trained text classification model, and obtaining a domain classification result for the input question;
taking the domain classification result and the input question as the input of a preset language processing model, and obtaining the answer corresponding to the input question through the preset language processing model.
In a second aspect, an embodiment of the present application provides a cross-domain question processing apparatus, comprising:
a question classification module, configured to classify an input question based on the preset domain lexicons of the respective domains and a pre-trained text classification model, and obtain a domain classification result for the input question;
a question answering module, configured to take the domain classification result and the input question as the input of a preset language processing model, and obtain the answer corresponding to the input question through the preset language processing model.
In a third aspect, an embodiment of the present application provides a cross-domain question processing method, comprising:
performing keyword matching on an input question based on the preset domain lexicons of the respective domains to obtain a first index value of the input question hitting keywords in each domain lexicon, and the sub-domains to which the hit keywords belong;
classifying the input question through a preset first classification model to obtain second index values indicating how well the input question matches each domain;
determining the target domain matching the input question according to the first index values and the second index values;
classifying the input question through the preset second classification model corresponding to the target domain to obtain third index values indicating how well the input question matches the preset sub-domains of the target domain;
performing sub-domain classification of the input question according to the third index values and the sub-domains to which the hit keywords belong, and obtaining the domain classification result of the input question.
In a fourth aspect, an embodiment of the present application provides a cross-domain question processing apparatus, comprising:
a keyword matching module, configured to perform keyword matching on an input question based on the preset domain lexicons of the respective domains to obtain a first index value of the input question hitting keywords in each domain lexicon, and the sub-domains to which the hit keywords belong;
a first classification module, configured to classify the input question through a preset first classification model to obtain second index values indicating how well the input question matches each domain;
a first classification result determining module, configured to determine the target domain matching the input question according to the first index values and the second index values;
a second classification module, configured to classify the input question through the preset second classification model corresponding to the target domain to obtain third index values indicating how well the input question matches the preset sub-domains of the target domain;
a classification result acquisition module, configured to perform sub-domain classification of the input question according to the third index values and the sub-domains to which the hit keywords belong, and obtain the domain classification result of the input question.
In a fifth aspect, an embodiment of the present application further discloses an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the cross-domain question processing method of the embodiments of the present application when executing the computer program.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the cross-domain question processing method disclosed in the embodiments of the present application.
According to the cross-domain question processing method disclosed in the embodiments of the present application, keyword matching is performed on an input question based on the preset domain lexicons of the respective domains to obtain a first index value of the input question hitting keywords in each domain lexicon, and the sub-domains to which the hit keywords belong; the input question is classified through a preset first classification model to obtain second index values indicating how well it matches each domain; the target domain matching the input question is determined according to the first and second index values; the input question is then classified through the preset second classification model corresponding to the target domain to obtain third index values indicating how well it matches the preset sub-domains of the target domain; finally, sub-domain classification is performed according to the third index values and the sub-domains to which the hit keywords belong, yielding the domain classification result of the input question. In this way, the input question can be classified across domains quickly and with high accuracy.
On the other hand, the embodiments of the present application also disclose a cross-domain question processing method that classifies an input question based on the preset domain lexicon of each domain and a pre-trained text classification model to obtain a domain classification result for the input question, and then takes the domain classification result and the input question as the input of a preset language processing model and obtains the answer corresponding to the input question through that model, which can effectively improve the question-answering accuracy of the preset language processing model.
The foregoing is merely an overview of the technical solutions of the present application. To allow the technical means of the application to be understood more clearly and implemented according to the contents of the specification, and to make the above and other objects, features and advantages of the application more apparent, specific embodiments are set forth below.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below.
FIG. 1 is the first flowchart of a cross-domain question processing method disclosed in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of the first classification model and the second classification model in a cross-domain question processing method according to an embodiment of the present application;
FIG. 3 is the second flowchart of a cross-domain question processing method according to an embodiment of the present application;
FIG. 4 is the third flowchart of a cross-domain question processing method disclosed in an embodiment of the present application;
FIG. 5 is a flowchart of a classification step in a cross-domain question processing method according to an embodiment of the present application;
FIG. 6 is the first schematic diagram of a cross-domain question processing apparatus according to an embodiment of the present application;
FIG. 7 is the second schematic diagram of a cross-domain question processing apparatus according to an embodiment of the present application;
FIG. 8 schematically shows a block diagram of an electronic device for performing the method according to the application; and
FIG. 9 schematically shows a storage unit for holding or carrying program code implementing the method according to the application.
Detailed Description
The embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
An embodiment of the application discloses a cross-domain question processing method which, as shown in FIG. 1, comprises the following steps: step 110 and step 120.
Step 110: classify an input question based on the preset domain lexicon of each domain and a pre-trained text classification model, and obtain a domain classification result for the input question.
Step 120: take the domain classification result and the input question as the input of a preset language processing model, and obtain the answer corresponding to the input question through the preset language processing model.
The preset language processing model in the embodiments of the present application may be an existing cross-domain language processing model, for example a large model such as ChatGPT or ChatGLM. The preset language processing model supports outputting a corresponding answer according to the question text input by the user.
Optionally, the preset domains may include education, medical care, law, and so on. The preset domains match the business domains supported by the preset language processing model. For example, when the preset language processing model is trained on data from the three business domains of education, medical care and law, that is, it supports knowledge question answering in those three domains, the preset domains are: education, medical care and law.
In the embodiments of the present application, a domain lexicon needs to be built in advance for each preset domain individually. A domain lexicon contains the keywords of that domain, and each keyword belongs to a sub-domain. Specific ways of building a domain lexicon for each domain are described below and are not repeated here.
In the embodiments of the present application, the pre-trained text classification model is a lightweight text classification model capable of handling large volumes of text and previously unseen text, with high classification accuracy and speed.
In some embodiments of the present application, the training data of the text classification model is obtained by fine-tuning the training data set of the preset language processing model. The text classification model may adopt a two-stage design: the first stage uses a text classification model (the preset first classification model) to classify an input question into a business domain, and the second stage uses a text classification model (the preset second classification model) to classify the input question into a sub-domain of that business domain.
Each stage's text classification model, that is, the preset first classification model and the preset second classification model, may adopt the structure shown in FIG. 2. As shown in FIG. 2, each of them comprises: a character vector encoding sub-model 210, a word vector encoding sub-model 220, and a text classification sub-model 230.
The character vector encoding sub-model may adopt a FastText encoding model, the word vector encoding sub-model may adopt a Word2Vec encoding model, and the text classification sub-model may adopt a TextCNN model.
The structure and training method of the text classification model are described below, and are not described in detail herein.
Optionally, the text classification model comprises a preset first classification model and a preset second classification model, and classifying the input question based on the domain lexicon of each domain and the pre-trained text classification model to obtain a domain classification result comprises: performing keyword matching on the input question based on the preset domain lexicons of the respective domains to obtain a first index value of the input question hitting keywords in each domain lexicon, and the sub-domains to which the hit keywords belong; classifying the input question through the preset first classification model to obtain second index values indicating how well the input question matches each domain; determining the target domain matching the input question according to the first and second index values; classifying the input question through the preset second classification model corresponding to the target domain to obtain third index values indicating how well the input question matches each preset sub-domain of the target domain; and performing sub-domain classification according to the third index values and the sub-domains to which the hit keywords belong, thereby obtaining the domain classification result of the input question.
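To make this flow concrete, the following is a minimal sketch of the two-stage classification, assuming a simple lexicon layout ({keyword: sub-domain} per domain), a predict interface on the classification models, and invented fusion weights and sub-domain boost; none of these details are prescribed by the patent itself.

```python
# Minimal sketch of the two-stage classification flow; the lexicon layout,
# model interface, weights and sub-domain boost are illustrative assumptions.

def match_keywords(question, domain_lexicons):
    """First index value per domain (hit-keyword count) plus hit sub-domains."""
    counts, hit_subdomains = {}, []
    for domain, lexicon in domain_lexicons.items():
        hits = [sub for kw, sub in lexicon.items() if kw in question]
        counts[domain] = len(hits)
        hit_subdomains.extend(hits)
    return counts, hit_subdomains

def classify_question(question, domain_lexicons, first_model, second_models,
                      w_kw=0.4, w_cls=0.6):
    first_index, hit_subs = match_keywords(question, domain_lexicons)
    second_index = first_model.predict(question)            # {domain: score}
    fused = {d: w_kw * first_index.get(d, 0) + w_cls * s    # weighted summation
             for d, s in second_index.items()}
    target = max(fused, key=fused.get)                      # target domain
    third_index = second_models[target].predict(question)   # {sub-domain: score}
    # Combine model scores with the hit keywords' sub-domains: here a simple
    # additive boost for sub-domains that were hit by lexicon keywords.
    final = {sub: sc + (0.1 if sub in hit_subs else 0.0)
             for sub, sc in third_index.items()}
    return target, max(final, key=final.get)
```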
Specific ways of performing keyword matching on the input question based on the preset domain lexicons, obtaining the first index value of the input question hitting keywords in each domain lexicon, and determining the sub-domains of the hit keywords are described below and not repeated here.
Optionally, classifying the input question through the preset first classification model to obtain the second index values of the input question matching each domain comprises: performing character vector encoding on the input question to obtain a first vector; performing word vector encoding on the input question to obtain a second vector; fusing the first vector and the second vector to obtain a third vector; and performing text classification based on the third vector to obtain the second index value of the input question matching each preset domain.
Specifically, for example: the input question is character-vector encoded through the character vector encoding sub-model 210 to obtain the first vector; the input question is word-vector encoded through the word vector encoding sub-model 220 to obtain the second vector; the first and second vectors are fused through the text classification sub-model 230 to obtain the third vector, and text classification is performed based on the third vector to obtain the second index value of the input question matching each preset domain.
Specific implementations of the character vector encoding step, the word vector encoding step, the vector fusion step, and the text classification step that yields the second index values are described below and not repeated here.
Optionally, determining the target domain matching the input question according to the first index values and the second index values comprises:
performing a weighted summation of the first index value and the second index value with preset weight values to obtain a matching score for each preset domain;
determining the preset domain with the highest matching score as the target domain matching the input question.
Specific implementations of the weighted summation that yields the matching score of each preset domain, and of selecting the highest-scoring preset domain as the target domain, are described below and not repeated here.
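As a purely numerical illustration of the weighted summation (the weights and index values below are invented for the example, not taken from the patent):

```python
# Illustrative numbers only; the preset weights and index values are invented.
weights = (0.4, 0.6)                                  # (keyword weight, model weight)
first_index  = {"legal": 2, "medical": 0, "education": 1}      # keyword hits
second_index = {"legal": 0.7, "medical": 0.1, "education": 0.2}

scores = {d: weights[0] * first_index[d] + weights[1] * second_index[d]
          for d in first_index}
# legal: 0.4*2 + 0.6*0.7 = 1.22; education: 0.52; medical: 0.06
target_domain = max(scores, key=scores.get)           # -> "legal"
```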
Optionally, classifying the input question through the preset second classification model corresponding to the target domain to obtain the third index values of the input question matching the preset sub-domains of the target domain comprises: performing character vector encoding on the input question to obtain a fourth vector; performing word vector encoding on the input question to obtain a fifth vector; fusing the fourth and fifth vectors to obtain a sixth vector; and performing text classification based on the sixth vector to obtain the third index value of the input question matching each preset sub-domain of the target domain.
Specifically, for example: the input question is character-vector encoded through the character vector encoding sub-model 210 to obtain the fourth vector; the input question is word-vector encoded through the word vector encoding sub-model 220 to obtain the fifth vector; and the fourth and fifth vectors are fused through the text classification sub-model 230 to obtain the sixth vector, with text classification then performed on the sixth vector to obtain the third index value of the input question matching each preset sub-domain of the target domain.
Specific implementations of the character vector encoding that yields the fourth vector, the word vector encoding that yields the fifth vector, the fusion that yields the sixth vector, the text classification based on the sixth vector, and the sub-domain classification that combines the third index values with the sub-domains of the hit keywords to obtain the domain classification result, are all described below and not repeated here.
Optionally, a domain lexicon is constructed as follows:
for a target preset domain, keywords of the target preset domain are acquired with one or more of the following keyword extraction methods: extracting first keywords from sample text based on preset rules; extracting long-tail keywords from the sample text by word splicing, as second keywords; clustering the sub-words of preset long-tail keywords to obtain target sub-words as third keywords; and the domain lexicon of the target preset domain is then built from one or more of the first, second and third keywords.
A specific implementation of extracting the first keywords from sample text based on preset rules is described below and not repeated here.
Optionally, extracting long-tail keywords from the sample text by word splicing comprises: splitting the sample text into sentences according to preset symbols and a text length, obtaining sentences to be processed; performing word segmentation on the sentences to be processed to obtain candidate segmented words; screening the candidate segmented words and tagging their parts of speech to obtain target candidate words with preset parts of speech; splicing the target candidate words into noun phrases as candidate long-tail keywords; obtaining an importance score for each candidate long-tail keyword; and selecting candidate long-tail keywords as second keywords according to the importance scores.
Specific implementations of the word-splicing extraction of long-tail keywords as second keywords, the clustering of sub-words of preset long-tail keywords to obtain target sub-words as third keywords, and the construction of the domain lexicon of the target preset domain from the first, second and third keywords are described below and not repeated here.
In the foregoing step 110, the user's input question may first be pre-classified based on the domain lexicon of each preset domain and the pre-trained text classification model, obtaining the sub-domain of the business domain to which the input question belongs; the sub-domain information and the input question are then used together as the input of the preset language processing model, which combines the input question with its sub-domain and outputs an answer.
The specific way in which the preset language processing model produces the corresponding answer from the input question text is prior art and is not repeated in the embodiments of the present application.
According to the cross-domain question processing method disclosed in the embodiments of the present application, an input question is classified based on the preset domain lexicon of each domain and a pre-trained text classification model to obtain a domain classification result; the domain classification result and the input question are then taken as the input of a preset language processing model, and the answer corresponding to the input question is obtained through that model, which can effectively improve the question-answering accuracy of the preset language processing model.
Furthermore, the method accurately classifies user questions into a specific domain through the pre-classification algorithm and then further into a sub-domain of that domain, which reduces classification errors. With accurate classification, the preset language processing model can understand and answer the question using the knowledge content of that sub-domain and avoid generating irrelevant content, improving overall question-answering accuracy and making practical deployment of the model possible. The hierarchical question classification and question-answering scheme answers user questions more accurately and comprehensively, improving user experience and satisfaction.
The text classification model adopted in the embodiments of the present application classifies user input questions quickly and with high accuracy. The classification label output by the preset first classification model is used to directly invoke the preset second classification model of the corresponding industry domain, yielding the sub-domain classification result. Based on that result, the preset language processing model can query the sub-domain knowledge base of the corresponding industry domain, producing results quickly. Through this process, questions are answered more efficiently, working efficiency is improved, and the professionalism and accuracy of the preset language processing model are enhanced.
By introducing an efficient text classification model, the embodiments of the present application classify the questions input by users accurately. This classification is not merely about sorting questions into categories; more importantly, it provides strong support for the further processing performed by the large model. Accurately classifying questions and invoking the corresponding modules of the large model according to the classification results improves accuracy in a question-answering setting.
In the embodiments of the present application, the text classification model serves to enhance the question-answering accuracy of the large model. Fine-grained question classification lets the large model concentrate more effectively on the specific question domain, improving overall question-answering accuracy. Introducing text classification provides key support for optimizing the large model and lays a solid foundation for question answering in practical applications.
An embodiment of the application further discloses a cross-domain question processing method which, as shown in FIG. 3, comprises steps 310 to 350.
Step 310: perform keyword matching on the input question based on the preset domain lexicons of the respective domains to obtain a first index value of the input question hitting keywords in each domain lexicon, and the sub-domains to which the hit keywords belong.
Optionally, the first index value may be the number of keywords in the corresponding domain lexicon hit by the input question.
In implementation, the domain lexicons of the relevant domains are first built according to the question-answering requirements of the preset language processing model. As shown in FIG. 4, before keyword matching is performed on the input question based on the preset domain lexicons to obtain the first index value of the input question hitting keywords in each domain lexicon and the sub-domains of the hit keywords, the method further comprises:
Step 300: construct the domain lexicon of each preset domain.
A domain lexicon comprises the keywords of the corresponding domain and the sub-domain to which each keyword belongs.
For example, for a preset language processing model supporting professional domains such as law, medical care and education, a domain lexicon is first built in advance for each of those professional domains. Each domain lexicon comprises a number of keywords of the corresponding professional domain. For example, a legal-domain lexicon may include names of laws and regulations, names of crimes, and so on; a medical-domain lexicon may include drug names, disease names, typical symptom names, names of medical institutions, and so on.
Further, during application, the input question can be compared with the keywords in a given domain's lexicon to determine the list of keywords in that lexicon hit by the input question and the sub-domains to which the hit keywords belong. By counting the hit keywords per sub-domain, the keywords hit in each sub-domain of the domain corresponding to the lexicon are determined, and the numbers of hit keywords of all sub-domains in the domain are accumulated to obtain the number of keywords hit in the domain as the first index value.
In some embodiments of the present application, containment matching may be used to decide hits. For example, when the input question text contains a keyword D1 from the domain lexicon, the input question is considered to hit keyword D1; conversely, when a keyword D2 in the domain lexicon contains the input question text, the input question is considered to hit keyword D2.
In implementation, the keyword-hit criterion can be set according to the actual matching precision requirements. The embodiments of the present application do not limit the specific way keyword matching is performed between the input question and the domain lexicon of a given preset domain.
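One possible hit criterion implementing the bidirectional containment rule above is sketched below; the {keyword: sub-domain} lexicon layout and the example entries are assumptions for illustration only.

```python
# A keyword is hit if the question contains it (case D1) or the keyword
# contains the question text (case D2); hits are accumulated per sub-domain.

def first_index_value(question, lexicon):
    hit_subdomains = []
    for keyword, sub_domain in lexicon.items():
        if keyword in question or question in keyword:
            hit_subdomains.append(sub_domain)
    return len(hit_subdomains), hit_subdomains  # first index value, sub-domains

medical_lexicon = {"心血管": "cardiology", "疾病": "general", "专家号": "registration"}
print(first_index_value("心血管疾病的专家号在哪个网站能挂", medical_lexicon))
# -> (3, ['cardiology', 'general', 'registration'])
```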
To improve the accuracy of keyword matching and of subsequent classification, building a rich and accurate domain lexicon plays an important role. In the embodiments of the present application, a domain lexicon is constructed as follows: for a target preset domain, keywords are acquired using one or more of the following extraction methods: extracting first keywords from sample text based on preset rules; extracting long-tail keywords from the sample text by word splicing, as second keywords; clustering the sub-words of preset long-tail keywords to obtain target sub-words as third keywords; the domain lexicon of the target preset domain is then built from one or more of the first, second and third keywords.
The sample text in the embodiments of the present application may be selected from the classification question data set used to train the preset language processing model.
The keyword extraction methods are illustrated below in turn.
1. Extracting first keywords from sample text based on preset rules
Optionally, the preset rules may be determined according to industry keywords provided by professionals in the relevant field. For example, the rules may perform exact keyword matching against a keyword table supplied by those professionals.
For example, for the medical domain, professionals can construct a medical keyword table including pharmacopoeia names, designated-hospital names, disease names, and domain keywords such as "medical insurance", "hospital", "chief physician", "expert appointment" and "disease". Keywords are then extracted from the sample text according to the preset rules to obtain first keywords. For example, the sample text "On which website can I book an expert appointment for cardiovascular and cerebrovascular diseases?" involves three medical-domain keywords, "cardiovascular and cerebrovascular", "disease" and "expert appointment", so three first keywords can be extracted from this sample text.
In this way, a number of first keywords of the medical domain can be extracted from multiple sample texts.
Similarly, by providing keyword tables for other domains, keywords of those domains can be extracted in the same way, yielding a number of first keywords per domain.
2. Extracting long-tail keywords from sample text by word splicing, as second keywords
Each industry domain usually further comprises several sub-areas, denoted "sub-domains" in the embodiments of the present application. Taking the legal domain as an example, the sub-domains mainly include: marriage and family, contract disputes, intellectual property, internet disputes, traffic accidents, debt and credit, criminal cases, real-estate disputes, inheritance, medical disputes, damage compensation, land expropriation and demolition, and so on. Taking the medical domain as an example, the sub-domains mainly include: ophthalmology, dentistry, otorhinolaryngology, orthopedics, gastroenterology, endocrinology, surgery, internal medicine, pediatrics, obstetrics and gynecology, and so on.
Because of the specific nature of industry data, the knowledge content of sub-domains involves many long specialized terms, denoted long-tail keywords in the embodiments of the present application, for example "Personal Information Protection Law of the People's Republic of China". Conventional keyword extraction methods cannot extract such long-tail keywords effectively. The embodiments of the present application therefore provide a method of extracting long-tail keywords from sample text by word splicing, integrating the sub-domain knowledge of each industry domain into keyword extraction.
Optionally, extracting long-tail keywords from the sample text by word splicing, as second keywords, comprises sub-steps S1 to S6.
Sub-step S1: split the sample text into sentences according to preset symbols and a text length, obtaining the sentences to be processed.
Optionally, the preset symbols may be punctuation marks indicating the end of a sentence, such as periods, question marks and exclamation marks. The text length is chosen per domain; for example, it may be 20.
For example, the sample text (such as question text from the classification data set) may be split according to periods and a length of 20, yielding one or more sentences as the sentences to be processed.
Sub-step S2: perform word segmentation on the sentences to be processed, obtaining candidate segmented words.
Word segmentation may be performed on the sentences to be processed based on a custom dictionary, yielding the segmented words of each sentence as the candidate segmented words.
Sub-step S3: screen the candidate segmented words and tag their parts of speech, obtaining target candidate words with preset parts of speech.
The candidate segmented words are then screened, including but not limited to removing stop words, removing meaningless words, and so on. Part-of-speech tagging is applied to the candidates retained after screening, for example marking whether each retained candidate is a noun or an adjective. The candidates tagged as nouns or adjectives are taken as the target candidate words.
Sub-step S4: splice the target candidate words into noun phrases as candidate long-tail keywords.
The target candidate words obtained in the previous step, that is, the nouns and adjectives, can then be spliced into noun phrases according to preset rules, and the spliced noun phrases are taken as candidate long-tail keywords. In the embodiments of the present application, regular expressions may be used to splice two or more target candidate words into a noun phrase following patterns such as adjective + noun.
For example, for the text "At what time was the Personal Information Protection Law of the People's Republic of China issued?", word segmentation, screening and part-of-speech tagging yield the nouns and adjectives "People's Republic of China", "personal information", "protection law", "time" and "law", and noun splicing then yields the candidate long-tail keyword "Personal Information Protection Law of the People's Republic of China".
Sub-step S5: obtain the importance score of each candidate long-tail keyword.
Next, the importance score of each candidate long-tail keyword with respect to its original sentence (the sentence from which it was segmented) is computed. Optionally, the word vector of each candidate long-tail keyword may be computed with an ELMo (Embeddings from Language Models) method, the sentence vector of the original sentence computed with a pre-trained method such as SIF (Smooth Inverse Frequency), and the cosine similarity between the word vector and the sentence vector taken as the importance score of the candidate. The higher the similarity, the higher the importance score and the more important the candidate long-tail keyword; the more important a candidate keyword, the better it represents the meaning of the input text.
For example, for "What are the difficulties of the probability part of high-school mathematics, and what effective methods or learning tools can assist study?", the candidate long-tail keywords spliced from "high-school mathematics", "probability knowledge" and "learning tools" obtain higher importance scores and represent the core points of the question.
Sub-step S6: select candidate long-tail keywords as second keywords according to the importance scores.
Finally, the candidate long-tail keywords whose importance scores satisfy a preset condition can be selected as second keywords. For example, the N candidates with the highest importance scores may be selected, where N is an integer greater than 1 chosen according to the actual situation.
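Sub-steps S1 to S6 can be sketched as follows, assuming the jieba tokenizer for segmentation and POS tagging; the splicing rule and the injected scoring function are simplified stand-ins for the regular-expression patterns and the ELMo/SIF cosine scoring described above.

```python
# Sketch of sub-steps S1-S6; jieba's POS flags 'n*'/'a*' mark nouns/adjectives.
import re
import jieba.posseg as pseg

def split_sentences(text, max_len=20):
    # S1: split by end-of-sentence punctuation, then by the preset text length.
    parts = [p for p in re.split(r"[。！？!?.]", text) if p]
    return [p[i:i + max_len] for p in parts for i in range(0, len(p), max_len)]

def noun_phrases(sentence):
    run, phrases = [], []
    def flush():
        # S4: a run of two or more adjective/noun tokens ending in a noun
        # is spliced into a noun phrase (a candidate long-tail keyword).
        if len(run) >= 2 and run[-1][1].startswith("n"):
            phrases.append("".join(w for w, _ in run))
        run.clear()
    for tok in pseg.cut(sentence):                    # S2: segment + POS tag
        if tok.flag.startswith(("a", "n")):           # S3: keep adj/noun only
            run.append((tok.word, tok.flag))
        else:
            flush()
    flush()
    return phrases

def second_keywords(text, score_fn, n=5):
    candidates = {p for s in split_sentences(text) for p in noun_phrases(s)}
    # S5: score_fn stands in for the cosine similarity between the phrase's
    # word vector (ELMo) and the original sentence's vector (SIF).
    ranked = sorted(candidates, key=lambda p: score_fn(p, text), reverse=True)
    return ranked[:n]                                 # S6: top-N candidates
```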
3. Clustering the sub-words of preset long-tail keywords to obtain target sub-words as third keywords
Optionally, the preset long-tail keywords may be long-tail keywords extracted by the method above, or long-tail keywords obtained in other ways.
In industry settings, questions asked by non-professional users often do not contain complete technical terms, and because of the long-tail effect such keywords may be absent from existing lexicons; questions posed by professional users, by contrast, typically involve many technical terms, which often are long-tail vocabulary. To handle long-tail words effectively, the embodiments of the present application cluster the sub-words that make up long-tail keywords to obtain their core sub-words, thereby improving the classification accuracy of user questions.
Optionally, clustering the sub-words of preset long-tail keywords to obtain target sub-words as third keywords comprises: obtaining the sub-words that make up the preset long-tail keywords; clustering the sub-words by semantic similarity to obtain one or more sub-word clusters; and selecting sub-words from the clusters as third keywords according to the number and parts of speech of the sub-words in each cluster.
In some embodiments of the present application, the core sub-words of long-tail keywords may be obtained by sub-word clustering. For example, the sub-words making up the long-tail keywords are first pre-processed to remove special characters, stop words and redundant punctuation. Each pre-processed sub-word is then converted into a word vector with a word-vector encoding method (for example Word2Vec). To facilitate the subsequent cluster analysis, principal component analysis (PCA) may be used to reduce the word vectors to two dimensions; the reduced two-dimensional data is then fed to a clustering algorithm (such as k-means), and the clustering operation finally yields the sub-word clusters. Next, the sub-words in clusters whose size exceeds a preset threshold may be selected as third keywords; alternatively, only the nouns in such clusters may be selected as third keywords.
After clustering, the result is a set of sub-words. For example, if the sub-words making up the long-tail keywords are [suppurative, conjunctivitis] and [severe, blepharitis], clustering yields sub-words such as "blepharitis" and "conjunctivitis". Sub-words of long-tail keywords are clustered together by semantic similarity, and the nouns among the clustered sub-words can then serve as part of the keywords, namely the third keywords.
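A sketch of this clustering step, assuming gensim and scikit-learn are available; the tiny corpus, vector size and cluster count are invented for illustration, and a real system would reuse an existing word-vector model rather than train on a handful of sub-words.

```python
# Sub-word clustering: Word2Vec vectors -> PCA to 2-D -> k-means clusters.
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

subwords = ["化脓性", "结膜炎", "重症", "睑缘炎"]        # from long-tail keywords
w2v = Word2Vec([subwords], vector_size=50, min_count=1, sg=1)  # toy training
vectors = [w2v.wv[w] for w in subwords]

coords = PCA(n_components=2).fit_transform(vectors)    # dimensionality reduction
labels = KMeans(n_clusters=2, n_init=10).fit_predict(coords)

clusters = {}
for word, label in zip(subwords, labels):
    clusters.setdefault(label, []).append(word)
# Sub-words in clusters larger than a preset threshold (and, optionally, only
# the nouns among them) would then be kept as third keywords.
print(clusters)
```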
The first, second and third keywords of a domain X obtained by the three methods above can then all be used as keywords of domain X and added to its domain lexicon. In this way, the domain lexicon of each domain covered by the preset language processing model can be constructed.
In the embodiments of the present application, a sub-domain attribute is further set for each keyword to mark the sub-domain to which it belongs. For example, the medical domain includes sub-domains such as ophthalmology, dentistry and gastroenterology. When the domain lexicon of each domain is built, the sub-domain attribute can be set for each keyword in the lexicon at the same time. For example, the sub-domain attribute is set to "ophthalmology" for the keywords "myopia", "astigmatism", "cataract", "blepharitis", "meibomian gland", "conjunctivitis" and "dry eye", and to "dentistry" for the keywords "dental caries", "pulpitis" and "gingivitis".
Taking the medical-domain lexicon as an example, the keywords it contains can be as shown in the table below.
Table 1. Keywords in the domain lexicon of the medical domain
In the embodiments of the present application, long-tail keywords are obtained by splicing words with preset parts of speech, which improves classification accuracy for questions asked by professionals in a field. For example, when a user asks whether the "Personal Information Protection Law of the People's Republic of China" contains provisions on the protection of personal network information, the long-tail keyword "Personal Information Protection Law of the People's Republic of China" can be matched exactly, classifying the question into the legal domain.
On the other hand, clustering the sub-words of long-tail keywords and extracting third keywords for the domain lexicon improves classification accuracy for questions asked by non-professionals. For example, after clustering the sub-words of the long-tail keyword "Criminal Procedure Law of the People's Republic of China" obtained by the second method, the third keywords "criminal" and "procedure law" can be obtained and added to the legal-domain lexicon, which then contains both the long-tail keyword and the third keywords. During question classification, a legal professional's question may contain the full long term, for example "Do you know when the Criminal Procedure Law of the People's Republic of China was promulgated?"; matching the input question against the long-tail keywords (the aforementioned second keywords) in the lexicon then hits "Criminal Procedure Law of the People's Republic of China" and classifies the question into the legal domain. An ordinary user's question is more colloquial, for example "When was the criminal law issued?"; matching the input question against the keywords in the lexicon then hits "criminal", likewise classifying the question into the legal domain.
Step 320: classify the input question through the preset first classification model, obtaining the second index values of the input question matching each domain.
The preset first classification model and the preset second classification model used in the embodiments of the present application need to be trained in advance. As shown in FIG. 4, before keyword matching is performed on the input question based on the preset domain lexicons to obtain the first index value of the input question hitting keywords in each domain lexicon and the sub-domains of the hit keywords, the method further comprises:
Step 301: train the preset first classification model and the preset second classification model.
Specific ways of training the preset first classification model and the preset second classification model are described below and not repeated here.
The execution order of step 300 and step 301 is not limited in the embodiments of the present application.
In step 320, besides the keyword matching of step 310, the input question is additionally classified based on text features, obtaining the second index values of the input question matching each domain.
Optionally, as shown in FIG. 5, classifying the input question through the preset first classification model to obtain the second index values of the input question matching each domain comprises sub-steps 3201 to 3204.
Sub-step 3201: perform character vector encoding on the input question, obtaining the first vector.
Sub-step 3202: perform word vector encoding on the input question, obtaining the second vector.
Sub-step 3203: fuse the first vector and the second vector, obtaining the third vector.
Sub-step 3204: perform text classification based on the third vector, obtaining the second index value of the input question matching each preset domain.
Optionally: the input question is character-vector encoded through the character vector encoding sub-model 210 to obtain the first vector; the input question is word-vector encoded through the word vector encoding sub-model 220 to obtain the second vector; and the first and second vectors are fused through the text classification sub-model 230 to obtain the third vector, with text classification performed on the third vector to obtain the second index value of the input question matching each preset domain.
As shown in FIG. 2, the preset first classification model comprises: a character vector encoding sub-model 210, a word vector encoding sub-model 220, and a text classification sub-model 230.
The character vector encoding sub-model may adopt a FastText encoding model, the word vector encoding sub-model may adopt a Word2Vec encoding model, and the text classification sub-model may adopt a TextCNN model.
Character vector encoding is a technique that maps characters to continuous vector representations. Given the diversity, complexity and volume of the classification data, the embodiments of the present application implement character vector encoding based on FastText. This technique captures character-level information, helps capture the internal structure of words, such as affixes and morphological variation, and aids in understanding the meaning and context of words. Moreover, FastText trains very quickly and is computationally efficient, especially on large corpora.
Meanwhile, character vector encoding handles rare words and out-of-vocabulary words (words that do not appear in the training data) better and generalizes better: because the FastText encoding model uses character-level information, it processes low-frequency and out-of-vocabulary words more effectively and can generate reasonable vector representations for words that purely word-level methods cannot handle. Text in real business covers an especially broad range of content with many low-frequency out-of-vocabulary words, and character vector encoding effectively addresses this problem.
In addition, FastText-based character vector encoding supports multiple languages: since character-level representations are generic, the FastText encoding model is readily applied to text in different languages, particularly morphologically rich ones.
In some embodiments of the present application, the FastText encoding model structure may comprise: an input layer whose input is the vectors of the characters in the text and their N-gram features; a hidden layer that mainly averages the vectors coming from the input layer; and an output layer that directly outputs the encoded value of the input text as the first vector. N denotes the length of the segmentation unit and can be customized; for example, N may be 3. The method for generating the N-gram features is prior art and not described here.
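For illustration, character N-grams of the kind FastText consumes can be generated as below; the boundary markers "<" and ">" follow the original FastText design and are an assumption about this model's configuration.

```python
# Character n-grams with n = 3, matching the example value of N above.
def char_ngrams(word, n=3):
    marked = f"<{word}>"                       # mark word boundaries
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("医疗保险"))   # ['<医疗', '医疗保', '疗保险', '保险>']
```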
In the application phase, in the aforementioned sub-step 3201, the input question is input to the FastText coding model, which will output the first vector of the input question.
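A minimal Python sketch of the n-gram slicing and averaging just described follows. The embedding tables word_emb and ngram_emb are hypothetical stand-ins for a trained FastText model's parameters; in practice a trained FastText model would supply the sentence vector directly, and the dimension 128 is an illustrative assumption:

```python
import numpy as np

def char_ngrams(word, n=3):
    # Wrap the word in boundary markers and slice character n-grams,
    # e.g. "eye" -> ["<ey", "eye", "ye>"] for n = 3.
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def fasttext_encode(question, word_emb, ngram_emb, dim=128):
    # Average the vectors of every token and of its character n-grams,
    # mirroring the input-layer / hidden-layer averaging described above.
    vectors = []
    for token in question.split():
        vectors.append(word_emb.get(token, np.zeros(dim)))
        for g in char_ngrams(token):
            vectors.append(ngram_emb.get(g, np.zeros(dim)))
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# With empty tables this returns zeros; a trained model supplies real values.
first_vector = fasttext_encode("eye is very painful", {}, {})
```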
Word vector coding based solely on the FastText coding model has some drawbacks; for example, character-level vectors may be somewhat weaker than word-level vectors at capturing word-sense relationships.
Word2Vec word vector coding is a technique that maps words into a semantic space, yielding semantic-space vectors. The Word2Vec coding model is usually implemented with a neural network and, taking the context of the text into account, comes in two variants, CBOW and Skip-Gram, whose training processes are similar. The Skip-Gram model takes a word as input and predicts its surrounding context, while the CBOW model takes a word's context as input and predicts the word itself. Skip-Gram trains efficiently and performs well on large datasets, so in combination with the actual business requirements the Skip-Gram mode is selected here to realize the word vector representation. The specific processing steps are as follows:
(1) Determine a window size window; for each word, generate 2*window training samples: (i, i-window), (i, i-window+1), ..., (i, i+window-1), (i, i+window), where i denotes the position of the center word.
(2) Determine the batch size batch_size; note that batch_size must be an integer multiple of 2*window to ensure that each batch contains all the samples of a word.
(3) Choose one of the two training algorithms: Hierarchical Softmax or Negative Sampling.
(4) Train the neural network iteratively for a number of rounds to obtain the parameter matrix from the input layer to the hidden layer; each row of this matrix (transposed) is the word vector of the corresponding word.
In the application phase, in the aforementioned sub-step 3202, the input question is fed into the pre-trained Word2Vec word vector coding sub-model, and the word vector it outputs is taken as the second vector.
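A hedged sketch of this Skip-Gram training and application using the gensim library; the toy corpus, the vector size 128, and the averaging of token vectors into a question-level second vector are illustrative assumptions rather than requirements of the embodiment:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus of tokenized question texts (illustrative only).
corpus = [["eye", "very", "painful", "blepharitis"],
          ["contract", "dispute", "compensation", "claim"]]

# sg=1 selects Skip-Gram; negative=5 enables negative sampling;
# window controls how many context words surround each center word.
w2v = Word2Vec(corpus, vector_size=128, window=5, sg=1,
               negative=5, min_count=1, epochs=10)

def word2vec_encode(question_tokens, model):
    # Average the vectors of in-vocabulary tokens as the second vector.
    vecs = [model.wv[t] for t in question_tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

second_vector = word2vec_encode(["eye", "very", "painful"], w2v)
```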
Next, in the foregoing sub-step 3203, the first vector (i.e., the character vector) and the second vector (i.e., the word vector) may be spliced to obtain the third vector.
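A minimal sketch of this splicing, assuming both vectors are one-dimensional numpy arrays of illustrative size 128:

```python
import numpy as np

def fuse(char_vec, word_vec):
    # Splice (concatenate) the character-level first vector and the
    # word-level second vector into the third vector.
    return np.concatenate([char_vec, word_vec], axis=-1)

third_vector = fuse(np.zeros(128), np.zeros(128))  # shape: (256,)
```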
Combining the character vector (i.e., the first vector) output by the FastText model with the word vector (i.e., the second vector) output by the Word2Vec model provides richer semantic information: character vectors and word vectors capture semantics at the character and word levels respectively, so combining them helps provide richer semantic information and improves the performance of natural language processing tasks. At the same time, the combination compensates for each representation's weaknesses: word vectors perform poorly on unregistered and low-frequency words, while character vectors are relatively weak at capturing word-sense relationships. Combining the character vector and word vector of the input question remedies their respective shortcomings and provides a more accurate and comprehensive text representation.
For the more complicated classification task across multiple industry sub-domains, combining character vectors and word vectors improves classification accuracy and gives the model stronger robustness. For example, a model that combines the two is more robust when handling misspellings, grammatical errors, or unregistered words, improving the stability and reliability of the classification task in practical applications. In addition, because the training data contains many long-tail words, a long-tail problem arises (in text classification, the long-tail problem means that some classes have far fewer samples than others); combining character vectors and word vectors improves the model's recognition of low-frequency words and rare classes, alleviating the long-tail problem.
In short, combining character vectors and word vectors brings richer semantic information, better generalization capability, and stronger robustness to the text classification task, thereby improving its performance.
In some embodiments of the application, the TextCNN model may include: an embedding layer, a convolution layer, a pooling layer, and an output layer. The embedding layer splices the word vector output by the Word2Vec model and the character vector output by the FastText model to obtain the third vector of the input question text.
The convolution layer of the TextCNN model performs convolution operations on the third vector, the pooling layer pools the hidden vectors output by the convolution layer, and finally the output layer performs feature mapping on the pooled hidden vectors and outputs the class label of each preset class together with the corresponding class prediction probability value.
In the foregoing sub-step 3204, the fused third vector is classified and mapped through the convolution layer, pooling layer, and output layer of the TextCNN model to obtain a class prediction probability value for each preset class; the probability value of each class may be used as the second index value of the preset domain corresponding to that class.
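A hedged PyTorch sketch of such a TextCNN follows. The embodiment describes convolving the fused third vector; this sketch assumes the common token-level reading, in which each token's character and word vectors are fused so the convolution runs over a sequence. The dimensions, kernel sizes, and three-class output are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, fused_dim=256, num_classes=3,
                 kernel_sizes=(2, 3, 4), channels=100):
        super().__init__()
        # One 1-D convolution per kernel size, applied in parallel.
        self.convs = nn.ModuleList(
            [nn.Conv1d(fused_dim, channels, k) for k in kernel_sizes])
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, fused_seq):
        # fused_seq: (batch, seq_len, fused_dim); Conv1d wants channels first.
        x = fused_seq.transpose(1, 2)
        feats = [F.relu(conv(x)) for conv in self.convs]        # (B, C, L')
        pooled = [F.max_pool1d(f, f.size(2)).squeeze(2) for f in feats]
        return self.fc(torch.cat(pooled, dim=1))                # class logits

# Softmax over the logits yields the class prediction probability values,
# i.e. the second index values of the corresponding preset domains:
# probs = torch.softmax(model(fused_batch), dim=1)
```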
In order to make the scheme of classifying the input problem by the preset first classification model and obtaining the second index value of the input problem matched with each field clearer, the training process of the preset first classification model is described below with reference to fig. 2.
The character vector coding sub-model 210, the word vector coding sub-model 220, and the text classification sub-model 230 used in the embodiment of the present application may be obtained by fine-tuning based on the training data used by the preset language processing models of the respective domains. The preset first classification model is trained on a dataset comprising data from multiple domains: the sample data of each training sample is a question text, and the sample label is the ground-truth domain class corresponding to that question text.
For example, a question input by a user is first classified and mapped to the corresponding industry domain. A user question such as "My eye is very painful, it seems to be blepharitis, what should I do?" obviously belongs to the medical domain and should be answered by the preset language processing model of the medical domain; therefore, the QA (question-answer pair) dataset used to train the medical large model needs to be processed into a classification dataset.
When processing a QA dataset in the medical domain, the questions and answers inside the QA dataset must first be split and each treated as part of the training data. Before the data is used, preprocessing and data cleansing operations are required, including but not limited to: removing noise such as irrelevant information or interfering text, and removing stop words and punctuation. The cleaned text can then be used as sample data, with each sample labeled with the medical-domain sample tag.
Similarly, corresponding data processing and preprocessing steps are performed for the other preset domains, such as the legal and educational domains, finally yielding a classification dataset for each domain. For example, the sample tags of the classification dataset may include tag values corresponding to the medical, legal, and educational domains.
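A minimal sketch of this cleaning and labeling step, assuming jieba for segmentation and an illustrative stop-word list (the embodiment does not prescribe specific tools):

```python
import re
import jieba

STOPWORDS = {"的", "了", "吗", "呢", "是"}   # illustrative stop-word list

def clean_sample(text):
    # Strip punctuation and other noise, segment, and drop stop words.
    text = re.sub(r"[^\w\u4e00-\u9fff]", "", text)
    return [tok for tok in jieba.lcut(text) if tok not in STOPWORDS]

# Question and answer of a QA pair are split and cleaned separately, then
# labeled with the ground-truth domain of the dataset they came from.
sample = {"tokens": clean_sample("眼睛很痛，好像是睑缘炎，该怎么办？"),
          "label": "medical"}
```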
In the training process of the preset first classification model, for each training sample in the classification dataset, character vector coding is first performed on the input question of the training sample through the character vector coding sub-model 210 to obtain a first vector; word vector coding is performed on the input question of the training sample through the word vector coding sub-model 220 to obtain a second vector; the first vector and the second vector are spliced through the text classification sub-model 230 to obtain a third vector; then, text classification processing is performed based on the third vector to obtain the probability values of the input question of the training sample matching each preset domain; and the classification loss of the training sample is computed from these probability values and the sample label. The loss value of the preset first classification model is then computed from the classification losses of all training samples, the model parameters are optimized with the objective of minimizing this loss value, and the preset first classification model is trained iteratively.
In the embodiment of the application, the loss function for calculating the model loss value is not limited.
The character vector coding sub-model 210, the word vector coding sub-model 220, and the text classification sub-model 230 used in the embodiment of the present application may adopt model structures from the prior art, and the loss function of the preset first classification model may likewise be taken from the prior art.
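A hedged sketch of this training loop, reusing the TextCNN sketch above and choosing cross-entropy purely for illustration (the embodiment leaves the loss function open); the random tensors stand in for real fused third vectors and domain labels:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 64 fused sequences (seq_len 20, dim 256) with
# ground-truth labels in {0: medical, 1: legal, 2: education}.
data = TensorDataset(torch.randn(64, 20, 256), torch.randint(0, 3, (64,)))
train_loader = DataLoader(data, batch_size=16, shuffle=True)

model = TextCNN(fused_dim=256, num_classes=3)
criterion = nn.CrossEntropyLoss()            # one common choice, not mandated
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):                      # iterative training
    for fused_batch, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(fused_batch), labels)  # per-batch loss
        loss.backward()                      # optimize toward minimum loss
        optimizer.step()
```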
And step 330, determining the target field matched with the input problem according to the first index value and the second index value.
In some embodiments of the present application, determining the target domain matching the input question according to the first index value and the second index value includes: performing weighted summation on the first index value and the second index value using preset weight values to obtain a matching degree score corresponding to each preset domain; and determining the preset domain with the highest matching degree score as the target domain matching the input question.
For example, the first index value obtained in step 310 and the second index value obtained in step 320 may be weighted with preset weight values according to the formula: score1 = α1 * (first index value) + β1 * (second index value), to obtain the matching degree score1 of the input question against each preset domain. The weight values α1 and β1 are preset according to the relative importance of the number of hit keywords (i.e., the first index value) and of the output of the preset first classification model. For example, when keywords are considered more important, α1 may be set greater than β1.
Finally, the matching degree scores score1 are compared, and the preset domain with the largest score1 is taken as the determined target domain.
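A minimal sketch of this weighted fusion; the weight values and index values below are illustrative assumptions (α1 greater than β1 reflects weighting keyword hits more heavily):

```python
def match_scores(first_index, second_index, alpha=0.6, beta=0.4):
    # score1 = alpha1 * first_index + beta1 * second_index per preset domain.
    return {d: alpha * first_index[d] + beta * second_index[d]
            for d in first_index}

first = {"medical": 3, "legal": 0, "education": 1}         # keyword hit counts
second = {"medical": 0.8, "legal": 0.1, "education": 0.1}  # model probabilities
scores = match_scores(first, second)
target_domain = max(scores, key=scores.get)                # -> "medical"
```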
And step 340, classifying the input problems through a preset second classification model corresponding to the target field, and obtaining a third index value of the input problems matched with preset sub-fields of the target field.
After the domain matching the input question has been determined, the domain is further subdivided. For example, after an input question is classified into the medical domain, it is further determined to which sub-domain of the medical domain the question belongs.
Optionally, classifying the input question through the preset second classification model corresponding to the target domain to obtain the third index value of the input question matching each preset sub-domain of the target domain includes: performing character vector coding on the input question to obtain a fourth vector; performing word vector coding on the input question to obtain a fifth vector; fusing the fourth vector and the fifth vector to obtain a sixth vector; and performing text classification processing based on the sixth vector to obtain a third index value for each preset sub-domain of the target domain.
Optionally, the structure of the preset second classification model is shown in fig. 2 and includes: a character vector encoding sub-model 210, a word vector encoding sub-model 220, and a text classification sub-model 230. The character vector coding sub-model may adopt a FastText coding model, the word vector coding sub-model may adopt a Word2Vec coding model, and the text classification sub-model may adopt a TextCNN model.
In the embodiment of the application, aiming at each preset field, respectively training a sub-field classification model of the field, namely a preset second classification model, according to the data of the preset field. Thus, in the application stage, after the input problem is classified into the target domain through the preliminary domain classification, the input problem can be further classified into the sub-domain in the target domain through the sub-domain classification model of the target domain, namely, the preset second classification model trained based on the data of the target domain.
The training data of the preset second classification model is derived from the fine-grained classification knowledge-base data of the various industry domains and from the fine-tuned versions of those data. The processing of these data may refer to the specific manner of processing the training data of the preset first classification model, which is not repeated here.
For example, knowledge texts related to the sub-domains of each domain are processed; for sub-domains of the legal domain such as medical disputes, contract disputes, and damage compensation, classification datasets of the respective sub-domains of the legal domain are constructed, and a legal-domain sub-domain label is set for each training sample. For example, the sub-domain labels of the legal domain may include: criminal cases, labor disputes, contract disputes, and the like. Sub-domain classification within the legal domain is then realized based on a preset second classification model with the FastText+Word2Vec+TextCNN structure.
The training method of the preset second classification model may refer to the training method of the preset first classification model and is not repeated here.
Correspondingly, classifying the input question through the preset second classification model corresponding to the target domain to obtain the third index value of the input question matching each preset sub-domain of the target domain includes: performing character vector coding on the input question through the character vector coding sub-model 210 to obtain a fourth vector; performing word vector coding on the input question through the word vector coding sub-model 220 to obtain a fifth vector; and fusing the fourth vector and the fifth vector through the text classification sub-model 230 to obtain a sixth vector, and performing text classification processing based on the sixth vector to obtain the third index value of the input question matching each preset sub-domain of the target domain.
The specific implementation of classifying the input question through the preset second classification model corresponding to the target domain to obtain the third index value matching each preset sub-domain of the target domain may refer to the foregoing specific implementation of classifying the input question through the preset first classification model to obtain the second index value matching each domain, and is not repeated here.
And step 350, classifying the input questions in the sub-fields according to the third index value and the sub-fields to which the hit keywords belong, and obtaining field classification results of the input questions.
Optionally, classifying the input question at the sub-domain level according to the third index value and the sub-domains to which the hit keywords belong to obtain the domain classification result of the input question includes: acquiring, according to the sub-domains to which the hit keywords belong, a fourth index value of the input question's keyword hits in each sub-domain; performing weighted summation on the third index value and the fourth index value using preset weight values to obtain the matching degree score of each corresponding sub-domain; and determining the classification result of the input question among the sub-domains of the target domain according to the matching degree scores.
Next, according to the sub-domains to which the input question's hit keywords belong (acquired in step 310), the number of keywords hit in each sub-domain is determined as the fourth index value of the input question's keyword hits in that sub-domain. Then, the third index value obtained in step 340 and the fourth index value may be weighted with preset weight values according to the formula: score2 = α2 * (fourth index value) + β2 * (third index value), to obtain the matching degree score2 of the input question against each sub-domain. The weight values α2 and β2 are preset according to the relative importance of the number of hit keywords and of the output of the preset second classification model. For example, when keywords are considered more important, α2 may be set greater than β2.
Finally, the matching degree scores score2 are combined to determine the classification result of the input question among the sub-domains of the target domain. For example, the sub-domain corresponding to the maximum matching degree score may be selected as the target category matching the input question.
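The same weighted fusion applies at the sub-domain level, reusing the match_scores helper sketched above; the sub-domain names, hit counts, and weights are illustrative assumptions:

```python
fourth = {"ophthalmology": 2, "dermatology": 0}     # keyword hits per sub-domain
third = {"ophthalmology": 0.7, "dermatology": 0.3}  # second-model probabilities
sub_scores = match_scores(fourth, third, alpha=0.5, beta=0.5)
target_subdomain = max(sub_scores, key=sub_scores.get)  # -> "ophthalmology"
```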
According to the cross-domain problem processing method disclosed by the embodiment of the application, keyword matching is performed on the input question based on the preset domain word stock of each domain to obtain the first index value of the input question's keyword hits in each domain word stock and the sub-domains to which the hit keywords belong; the input question is classified through the preset first classification model to obtain the second index values of the input question matching the respective domains; the target domain matching the input question is determined according to the first index value and the second index value; the input question is classified through the preset second classification model corresponding to the target domain to obtain the third index values of the input question matching the preset sub-domains of the target domain; and the input question is classified at the sub-domain level according to the third index values and the sub-domains to which the hit keywords belong to obtain the domain classification result of the input question, so that cross-domain classification of the input question can be performed quickly and with higher classification accuracy.
The embodiment of the application also discloses a cross-domain problem processing device, as shown in fig. 6, comprising:
the problem classification module 610 is configured to perform problem classification processing on an input problem based on a preset domain word library and a pre-trained text classification model in each domain, and obtain a domain classification result of the input problem;
and a question answering module 620, configured to take the domain classification result and the input question as input of a preset language processing model, and obtain an answer corresponding to the input question through the preset language processing model.
Optionally, the text classification model includes: a first classification model and a second classification model are preset, and the problem classification module 610 is further configured to:
performing keyword matching on input questions based on a preset domain word stock of each domain to obtain first index values of keywords in each domain word stock of the input questions, and sub-domains to which the hit keywords belong;
classifying the input problems through the preset first classification model to obtain second index values of the input problems matched with the fields;
determining a target field matched with the input problem according to the first index value and the second index value;
Classifying the input problems through the preset second classification model corresponding to the target field to obtain a third index value of each preset sub-field of the target field matched with the input problems;
and classifying the input problem in the sub-domain according to the third index value and the sub-domain to which the hit keyword belongs, and obtaining a domain classification result of the input problem.
Optionally, the classifying the input problem by the preset first classification model to obtain a second index value of the input problem matched with each field includes:
performing character vector coding on the input problem to obtain a first vector;
performing word vector coding on the input problem to obtain a second vector;
fusing the first vector and the second vector to obtain a third vector;
and performing text classification processing based on the third vector to obtain a second index value of the input problem matched with each preset field.
Optionally, the domain word stock is constructed by the following method:
for a target preset field, acquiring keywords of the target preset field by adopting one or more of the following keyword extraction methods: extracting a first keyword from the sample text based on a preset rule; extracting long tail keywords from the sample text by adopting a word splicing mode, and taking the long tail keywords as second keywords; clustering the sub-words of the preset long-tail keywords to obtain target sub-words as third keywords;
And constructing a domain word stock of the target preset domain according to one or more keywords of the first keyword, the second keyword and the third keyword.
Optionally, the extracting long tail keywords from the sample text by word stitching includes:
sentence division is carried out on the sample text according to the preset symbol and the text length, so that sentences to be processed are obtained;
performing word segmentation processing on the obtained sentence to be processed to obtain candidate word segmentation;
screening and part-of-speech tagging are carried out on the candidate segmented words, and target candidate segmented words with preset part-of-speech are obtained;
splicing to obtain noun phrases serving as candidate long tail keywords according to the target candidate segmentation;
obtaining importance scores of the candidate long tail keywords;
and selecting the candidate long tail keywords as second keywords according to the importance degree scores.
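A rough sketch of the splicing step of this pipeline follows, assuming jieba's part-of-speech segmentation and an illustrative set of kept parts of speech; sentence division and the importance scoring (e.g., a TF-IDF-style ranking) are omitted:

```python
import jieba.posseg as pseg

def candidate_longtail_keywords(sentence, keep_pos=("n", "nz", "vn")):
    # Segment, keep tokens whose part of speech is in the preset set, and
    # splice adjacent kept tokens into noun phrases as candidate keywords.
    phrases, current = [], []
    for word, flag in pseg.cut(sentence):
        if flag in keep_pos:
            current.append(word)
        else:
            if len(current) > 1:
                phrases.append("".join(current))
            current = []
    if len(current) > 1:
        phrases.append("".join(current))
    return phrases

candidates = candidate_longtail_keywords("医疗纠纷损害赔偿如何认定")
```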
Optionally, the determining, according to the first index value and the second index value, the target field of matching the input problem includes:
carrying out weighted summation on the first index value and the second index value by adopting a preset weight value to obtain a matching degree score corresponding to each preset field;
And determining the preset field with the highest matching degree score as the target field for matching the input problem.
The cross-domain problem processing device disclosed by the embodiment of the application is used for realizing the cross-domain problem processing method disclosed by the embodiment of the application, and the specific implementation of each module of the device is not repeated, and can be referred to the specific implementation of the corresponding steps of the method embodiment.
According to the cross-domain problem processing device disclosed by the embodiment of the application, the problem classification processing is carried out on the input problem based on the domain word stock of each preset domain and the pre-trained text classification model, so that the domain classification result of the input problem is obtained; and then, taking the domain classification result and the input questions as input of a preset language processing model, and acquiring answers corresponding to the input questions through the preset language processing model, so that the accuracy of question and answer of the preset language processing model can be effectively improved.
According to the embodiment of the application, the efficient text classification model is introduced, so that the problems input by a user can be accurately classified. This classification process is not only intended to categorize the problem into different categories, but more importantly to provide a powerful support for further processing of large models. By accurately classifying the questions and pertinently calling corresponding modules in the large model according to classification results, the accuracy is improved in a question-answering environment.
In embodiments of the present application, text classification models are serviced to enhance question-answer accuracy for large models. By finely classifying the questions, the large model can concentrate on the specific question field more effectively, so that the overall question-answering accuracy is improved. By introducing a text classification technology, key support is provided for optimization of a large model, and a solid foundation is laid for solving the question-answer problem in practical application.
Further, when text classification is performed, keyword matching is performed on the input question based on the preset domain word stock of each domain to obtain the first index value of the input question's keyword hits in each domain word stock and the sub-domains to which the hit keywords belong; the input question is classified through the preset first classification model to obtain the second index values of the input question matching the respective domains; the target domain matching the input question is determined according to the first index value and the second index value; the input question is classified through the preset second classification model corresponding to the target domain to obtain the third index values of the input question matching the preset sub-domains of the target domain; and the input question is classified at the sub-domain level according to the third index values and the sub-domains to which the hit keywords belong to obtain the domain classification result of the input question, so that cross-domain classification of the input question can be performed quickly and with higher classification accuracy.
The embodiment of the application also discloses a cross-domain problem processing device, as shown in fig. 7, comprising:
the keyword matching module 710 is configured to perform keyword matching on an input problem based on a domain word stock of each preset domain, so as to obtain a first index value of keywords in each domain word stock of the input problem, and a sub-domain to which the hit keywords belong;
the first classification module 720 is configured to classify the input problem by presetting a first classification model, and obtain a second index value of the input problem matched with each of the fields;
a first classification result determining module 730, configured to determine, according to the first index value and the second index value, a target domain in which the input problem is matched;
the second classification module 740 is configured to classify the input problem according to a preset second classification model corresponding to the target domain, and obtain a third index value of the input problem that matches preset sub-domains of the target domain;
the classification result obtaining module 750 is configured to classify the input question in the sub-domain according to the third index value and the sub-domain to which the hit keyword belongs, and obtain a domain classification result of the input question.
Optionally, the first classification module 720 is further configured to:
performing character vector coding on the input problem to obtain a first vector;
performing word vector coding on the input problem to obtain a second vector;
fusing the first vector and the second vector to obtain a third vector;
and performing text classification processing based on the third vector to obtain a second index value of the input problem matched with each preset field.
Optionally, the domain word stock is constructed by the following method:
for a target preset field, acquiring keywords of the target preset field by adopting one or more of the following keyword extraction methods: extracting a first keyword from the sample text based on a preset rule; extracting long tail keywords from the sample text by adopting a word splicing mode, and taking the long tail keywords as second keywords; clustering the sub-words of the preset long-tail keywords to obtain target sub-words as third keywords;
and constructing a domain word stock of the target preset domain according to one or more keywords of the first keyword, the second keyword and the third keyword.
Optionally, the extracting long tail keywords from the sample text by word stitching includes:
Sentence division is carried out on the sample text according to the preset symbol and the text length, so that sentences to be processed are obtained;
performing word segmentation processing on the obtained sentence to be processed to obtain candidate word segmentation;
screening and part-of-speech tagging are carried out on the candidate segmented words, and target candidate segmented words with preset part-of-speech are obtained;
splicing to obtain noun phrases serving as candidate long tail keywords according to the target candidate segmentation;
obtaining importance scores of the candidate long tail keywords;
and selecting the candidate long tail keywords as second keywords according to the importance degree scores.
Optionally, the first classification result determining module 730 is further configured to:
carrying out weighted summation on the first index value and the second index value by adopting a preset weight value to obtain a matching degree score corresponding to each preset field;
and determining the preset field with the highest matching degree score as the target field for matching the input problem.
The cross-domain problem processing device disclosed by the embodiment of the application is used for realizing the cross-domain problem processing method disclosed by the embodiment of the application, and the specific implementation of each module of the device is not repeated, and can be referred to the specific implementation of the corresponding steps of the method embodiment.
According to the cross-domain problem processing device disclosed by the embodiment of the application, keyword matching is performed on the input question based on the preset domain word stock of each domain to obtain the first index value of the input question's keyword hits in each domain word stock and the sub-domains to which the hit keywords belong; the input question is classified through the preset first classification model to obtain the second index values of the input question matching the respective domains; the target domain matching the input question is determined according to the first index value and the second index value; the input question is classified through the preset second classification model corresponding to the target domain to obtain the third index values of the input question matching the preset sub-domains of the target domain; and the input question is classified at the sub-domain level according to the third index values and the sub-domains to which the hit keywords belong to obtain the domain classification result of the input question, so that cross-domain classification of the input question can be performed quickly and with higher classification accuracy.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other. For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
The foregoing has described in detail the cross-domain problem processing method and device provided by the present application; specific examples are used herein to illustrate the principles and embodiments of the present application, and the above description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present application; in view of the above, the content of this description should not be construed as limiting the present application.
The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without undue burden.
Various component embodiments of the application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in an electronic device according to embodiments of the present application may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present application can also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present application may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
For example, fig. 8 shows an electronic device in which the method according to the application may be implemented. The electronic device may be a PC, a mobile terminal, a personal digital assistant, a tablet computer, etc. The electronic device conventionally comprises a processor 810 and a memory 820 and a program code 830 stored on said memory 820 and executable on the processor 810, said processor 810 implementing the method described in the above embodiments when said program code 830 is executed. The memory 820 may be a computer program product or a computer readable medium. The memory 820 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 820 has a storage space 8201 for program code 830 of a computer program for performing any of the method steps described above. For example, the memory space 8201 for the program code 830 may include individual computer programs that are each used to implement various steps in the above methods. The program code 830 is computer readable code. These computer programs may be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a Compact Disc (CD), a memory card or a floppy disk. The computer program comprises computer readable code which, when run on an electronic device, causes the electronic device to perform a method according to the above-described embodiments.
The embodiment of the application also discloses a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, realizes the steps of the cross-domain problem processing method according to the embodiment of the application.
Such a computer program product may be a computer readable storage medium, which may have memory segments, memory spaces, etc. arranged similarly to the memory 820 in the electronic device shown in fig. 8. The program code may be stored in the computer readable storage medium, for example, in a suitable form. The computer readable storage medium is typically a portable or fixed storage unit as described with reference to fig. 9. In general, the memory unit comprises computer readable code 830', which computer readable code 830' is code that is read by a processor, which code, when executed by the processor, implements the steps of the method described above.
Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Furthermore, it is noted that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A cross-domain problem processing method, the method comprising:
performing problem classification processing on input problems based on a preset field word stock of each field and a pre-trained text classification model, and obtaining a field classification result of the input problems;
taking the domain classification result and the input problem as input of a preset language processing model, and acquiring an answer corresponding to the input problem through the preset language processing model;
wherein the text classification model comprises: presetting a first classification model and a second classification model, and performing problem classification processing on input problems based on a field word stock and a pre-trained text classification model of each field to obtain field classification results of the input problems, wherein the method comprises the following steps:
performing keyword matching on input questions based on a preset domain word stock of each domain to obtain first index values of keywords in each domain word stock of the input questions, and sub-domains to which the hit keywords belong;
classifying the input problems through the preset first classification model to obtain second index values of the input problems matched with the fields;
Determining a target field matched with the input problem according to the first index value and the second index value;
classifying the input problems through the preset second classification model corresponding to the target field to obtain a third index value of each preset sub-field of the target field matched with the input problems;
and classifying the input problem in the sub-domain according to the third index value and the sub-domain to which the hit keyword belongs, and obtaining a domain classification result of the input problem.
2. A cross-domain problem processing method, the method comprising:
performing keyword matching on input questions based on a preset domain word stock of each domain to obtain first index values of keywords in each domain word stock of the input questions, and sub-domains to which the hit keywords belong;
classifying the input problems through a preset first classification model to obtain second index values of the input problems matched with the fields;
determining a target field matched with the input problem according to the first index value and the second index value;
classifying the input problems through a preset second classification model corresponding to the target field to obtain a third index value of the input problems matched with preset sub-fields of the target field;
And classifying the input problem in the sub-domain according to the third index value and the sub-domain to which the hit keyword belongs, and obtaining a domain classification result of the input problem.
3. The method according to claim 2, wherein classifying the input problem by a preset first classification model to obtain a second index value of the input problem matching each of the fields includes:
performing character vector coding on the input problem to obtain a first vector;
performing word vector coding on the input problem to obtain a second vector;
fusing the first vector and the second vector to obtain a third vector;
and performing text classification processing based on the third vector to obtain a second index value of the input problem matched with each preset field.
4. The method of claim 2, wherein the domain word stock is constructed by:
for a target preset field, acquiring keywords of the target preset field by adopting one or more of the following keyword extraction methods: extracting a first keyword from the sample text based on a preset rule; extracting long tail keywords from the sample text by adopting a word splicing mode, and taking the long tail keywords as second keywords; clustering the sub-words of the preset long-tail keywords to obtain target sub-words as third keywords;
And constructing a domain word stock of the target preset domain according to one or more keywords of the first keyword, the second keyword and the third keyword.
5. The method of claim 4, wherein extracting long-tail keywords from the sample text by means of word stitching includes:
sentence division is carried out on the sample text according to the preset symbol and the text length, so that sentences to be processed are obtained;
performing word segmentation processing on the obtained sentence to be processed to obtain candidate word segmentation;
screening and part-of-speech tagging are carried out on the candidate segmented words, and target candidate segmented words with preset part-of-speech are obtained;
splicing to obtain noun phrases serving as candidate long tail keywords according to the target candidate segmentation;
obtaining importance scores of the candidate long tail keywords;
and selecting the candidate long tail keywords as second keywords according to the importance degree scores.
6. The method of claim 2, wherein the determining the target area for the input question match based on the first index value and the second index value comprises:
carrying out weighted summation on the first index value and the second index value by adopting a preset weight value to obtain a matching degree score corresponding to each preset field;
And determining the preset field with the highest matching degree score as the target field for matching the input problem.
7. A cross-domain problem processing apparatus, the apparatus comprising:
the problem classification module is used for carrying out problem classification processing on the input problems based on a preset field word stock of each field and a pre-trained text classification model, and obtaining a field classification result of the input problems;
the problem solving module is used for taking the domain classification result and the input problem as input of a preset language processing model, and obtaining an answer corresponding to the input problem through the preset language processing model;
wherein the text classification model comprises: the method comprises the steps of presetting a first classification model and presetting a second classification model, wherein the problem classification module is further used for:
performing keyword matching on input questions based on a preset domain word stock of each domain to obtain first index values of keywords in each domain word stock of the input questions, and sub-domains to which the hit keywords belong;
classifying the input problems through the preset first classification model to obtain second index values of the input problems matched with the fields;
Determining a target field matched with the input problem according to the first index value and the second index value;
classifying the input problems through the preset second classification model corresponding to the target field to obtain a third index value of each preset sub-field of the target field matched with the input problems;
and classifying the input problem in the sub-domain according to the third index value and the sub-domain to which the hit keyword belongs, and obtaining a domain classification result of the input problem.
8. A cross-domain problem processing apparatus, the apparatus comprising:
the keyword matching module is used for carrying out keyword matching on the input problem based on a preset domain word stock of each domain to obtain a first index value of the keywords in each domain word stock of the input problem, and the sub-domain to which the hit keywords belong;
the first classification module is used for classifying the input problems through a preset first classification model to obtain second index values of the input problems matched with the fields;
the first classification result determining module is used for determining the target field matched with the input problem according to the first index value and the second index value;
The second classification module is used for classifying the input problems through a preset second classification model corresponding to the target field to obtain a third index value of the input problems matched with preset sub-fields of the target field;
and the classification result acquisition module is used for classifying the input problem in the sub-domain according to the third index value and the sub-domain to which the hit keyword belongs, and acquiring a domain classification result of the input problem.
9. An electronic device comprising a memory, a processor and program code stored on the memory and executable on the processor, wherein the processor implements the cross-domain problem handling method of any of claims 1 to 6 when the program code is executed by the processor.
10. A computer readable storage medium having stored thereon program code, which when executed by a processor realizes the steps of the cross-domain problem processing method of any of claims 1 to 6.
CN202311105721.4A 2023-08-30 2023-08-30 Cross-domain problem processing method and device, electronic equipment and storage medium Active CN116842168B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311105721.4A CN116842168B (en) 2023-08-30 2023-08-30 Cross-domain problem processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116842168A CN116842168A (en) 2023-10-03
CN116842168B true CN116842168B (en) 2023-11-14

Family

ID=88162102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311105721.4A Active CN116842168B (en) 2023-08-30 2023-08-30 Cross-domain problem processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116842168B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446286A (en) * 2017-02-16 2018-08-24 阿里巴巴集团控股有限公司 A kind of generation method, device and the server of the answer of natural language question sentence
CN112650914A (en) * 2020-12-30 2021-04-13 深圳市世强元件网络有限公司 Long-tail keyword identification method, keyword search method and computer equipment
CN114218392A (en) * 2022-02-22 2022-03-22 浙商期货有限公司 Futures question-answer oriented user intention identification method and system
CN115062145A (en) * 2022-05-26 2022-09-16 电子科技大学 Cloud ERP community cross-domain problem classification method based on BERT-TextCNN
CN116450796A (en) * 2023-05-17 2023-07-18 中国兵器工业计算机应用技术研究所 Intelligent question-answering model construction method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9015031B2 (en) * 2011-08-04 2015-04-21 International Business Machines Corporation Predicting lexical answer types in open domain question and answering (QA) systems

Also Published As

Publication number Publication date
CN116842168A (en) 2023-10-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant