CN118211585A - Problem identification method, device, electronic equipment and storage medium - Google Patents

Problem identification method, device, electronic equipment and storage medium

Publication number
CN118211585A
Authority
CN
China
Prior art keywords
original text
type
question
subject
solving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410146033.0A
Other languages
Chinese (zh)
Inventor
韩红旗
徐紫燕
张均胜
李琳娜
王莉军
周则旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Scientific And Technical Information Of China
Original Assignee
Institute Of Scientific And Technical Information Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Scientific And Technical Information Of China filed Critical Institute Of Scientific And Technical Information Of China
Priority to CN202410146033.0A priority Critical patent/CN118211585A/en
Publication of CN118211585A publication Critical patent/CN118211585A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present disclosure provide a problem identification method, a device, electronic equipment and a storage medium, and relate to the technical field of data mining. The method comprises the following steps: determining the solving type of each of at least two original texts, wherein the solving type is information indicating the research purpose of the original text; inputting each original text into a pre-trained title generation model corresponding to its solving type to generate a candidate title for each original text; carrying out the same semantic merging processing on all the candidate titles to obtain reference questions; and determining whether each reference question is interdisciplinary according to the discipline types of the original texts corresponding to that reference question. The embodiments of the present disclosure enable interdisciplinary research at question granularity, which refines interdisciplinary research analysis and thereby improves the accuracy of interdisciplinary research results.

Description

Problem identification method, device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of data mining, in particular to a problem identification method, a device, electronic equipment and a storage medium.
Background
Cross-discipline intelligence research is a research method that spans multiple disciplines; it aims to integrate the knowledge and methods of different disciplines in order to understand the essence, generation and application of intelligence more fully and deeply. Current interdisciplinary information research is mostly conducted at the granularity of papers or topics: it examines whether a specific paper or topic spans disciplines and which disciplines it spans. However, the content of a paper or topic may cover the theories and methods of several different disciplines, so the scope of analysis is broad, and the identified discipline types often fail to reflect accurately what the paper or topic actually focuses on. As a result, the accuracy of interdisciplinary research results is low.
Under such circumstances, there is a need to provide a problem identification scheme that improves the accuracy of interdisciplinary research results.
Disclosure of Invention
The embodiment of the disclosure provides a problem identification method, a device, electronic equipment, a computer readable storage medium and a computer program product, which are used for solving the technical problem of low accuracy of interdisciplinary research results.
According to an aspect of the embodiments of the present disclosure, there is provided a problem identification method including:
Determining the solving type of each of at least two original texts; wherein the solving type is information indicating the research purpose of the original text;
Inputting each original text into a title generation model corresponding to its solving type, and generating a candidate title corresponding to each original text; wherein the title generation model is pre-trained;
carrying out the same semantic merging processing on all candidate titles to obtain a reference question;
and determining whether the reference question spans disciplines according to the discipline types of the original texts corresponding to the reference question.
In one possible implementation, determining the solution type of the original text includes:
Inputting the original text into a classifier corresponding to each solving type respectively to obtain the probability that the original text belongs to each solving type; the classifier is obtained by training a sample original text with a solving type label;
And determining the solving type of the original text according to the probability of each solving type.
In one possible implementation, determining the subject type of the original text includes:
Inputting the original text and the classification number of the original text into a subject class classifier to obtain the subject class of the original text; the subject class classifier is obtained by training on sample original texts with subject class labels and classification number labels;
Inputting the original text, the classification number and the subject class into a multi-label classifier to obtain the subject type of the original text; the multi-label classifier is obtained by training on sample original texts with subject type labels and classification number labels.
In one possible implementation, inputting the original text, the class number, and the subject class into a multi-label classifier, obtaining a subject type of the original text, including:
Inputting the original text, the classification number and the subject class into a multi-label classifier to obtain a classification result; the classification result comprises the reference subject types and, for each reference subject type, the probability that the subject type of the original text is that reference subject type;
Obtaining the corresponding conditional probability when the subject type of the original text is each reference subject type under the condition of a given classification number according to the classification result;
The subject type of the original text is determined based on the conditional probability of each reference subject type.
In one possible implementation, the title generation model is obtained by:
Carrying out solution type labeling on sample original texts matched with the candidate title templates of each solution type to obtain sample original texts with solution type labels corresponding to each solution type;
And respectively retraining the pre-trained model by taking the sample original text with the solving type label corresponding to each solving type as external data to obtain a title generation model corresponding to each solving type.
In one possible implementation, the same semantic merging process is performed on all candidate titles to obtain a reference problem, including:
extracting a corresponding question reference phrase from each candidate title;
Obtaining question factors according to the keywords in all the question reference phrases; the question factors are keywords whose word frequency is greater than a preset word frequency threshold;
and carrying out the same semantic merging processing according to the question factors corresponding to all the question reference phrases to obtain the reference questions.
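The question factor step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: keywords are approximated by lowercase whitespace tokens, and the function name `extract_question_factors`, the parameter `min_count` and the sample phrases are all hypothetical.

```python
from collections import Counter


def extract_question_factors(phrases, min_count=1):
    """Return the keywords whose frequency across all phrases exceeds min_count.

    Assumption: keywords are plain lowercase whitespace tokens. A real system
    would use proper word segmentation and stop-word filtering instead.
    """
    counts = Counter()
    for phrase in phrases:
        counts.update(phrase.lower().split())
    # Keep only keywords strictly above the preset word frequency threshold.
    return {word for word, n in counts.items() if n > min_count}


# Made-up question reference phrases, for illustration only.
phrases = [
    "protein structure prediction",
    "prediction of protein folding",
    "urban traffic forecasting",
]
factors = extract_question_factors(phrases, min_count=1)
print(sorted(factors))
```

Raising `min_count` keeps only the most frequent keywords and therefore yields fewer question factors.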
In one possible implementation manner, according to the question factors corresponding to all the question reference phrases, the same semantic merging process is performed to obtain the reference questions, including:
according to the question factors corresponding to all the question reference phrases, taking question reference phrases that share more question factors than a preset question factor threshold as suspected identical questions;
Taking the suspected identical questions as nodes and the text similarity between suspected identical questions as the edge weights to obtain a suspected identical question relationship network;
identifying communities in the suspected identical question relationship network by using a network community discovery algorithm;
and merging the question reference phrases in the same community to obtain a reference question.
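The grouping of suspected identical questions can be sketched as follows. The sketch deliberately simplifies the steps above: token-level Jaccard similarity stands in for the unspecified text similarity, and connected components over edges above a similarity cut-off stand in for the network community discovery algorithm. The names `jaccard`, `merge_suspected_same`, `factor_threshold` and `min_similarity`, as well as the sample data, are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-level Jaccard similarity, a stand-in for the text similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def merge_suspected_same(phrases, factors, factor_threshold=1, min_similarity=0.3):
    """Group phrases that share enough question factors and are similar enough.

    Simplification: connected components replace community discovery, and
    the similarity acts as a cut-off rather than as an edge weight.
    """
    shared = [set(p.lower().split()) & factors for p in phrases]
    parent = list(range(len(phrases)))  # union-find forest over phrase indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(phrases)), 2):
        enough_factors = len(shared[i] & shared[j]) > factor_threshold
        if enough_factors and jaccard(phrases[i], phrases[j]) >= min_similarity:
            parent[find(i)] = find(j)  # link the two suspected identical questions

    groups = {}
    for i, phrase in enumerate(phrases):
        groups.setdefault(find(i), []).append(phrase)
    return list(groups.values())


# Made-up phrases and question factors, for illustration only.
phrases = [
    "protein structure prediction",
    "prediction of protein folding",
    "urban traffic forecasting",
]
factors = {"protein", "prediction"}
groups = merge_suspected_same(phrases, factors)
print(groups)
```

Each returned group corresponds to one reference question. A weighted community discovery algorithm would use the similarity values as edge weights instead of thresholding them.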
According to another aspect of the embodiments of the present disclosure, there is provided a problem identification apparatus including:
The solving type determining module is used for determining the solving type of each of at least two original texts; the solving type is information indicating the research purpose of the original text;
The candidate title generation module is used for inputting each original text into a title generation model corresponding to the solving type to generate candidate titles respectively corresponding to each original text; wherein the title generation model is pre-trained;
the reference question determining module is used for carrying out the same semantic merging processing on all the candidate titles to obtain a reference question;
And the reference question identification module is used for determining whether the reference question crosses disciplines according to the discipline type of the original text corresponding to the reference question.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic device including a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the steps of the problem identification method provided in any of the embodiments described above.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the problem identification method provided by any of the above embodiments.
According to yet another aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the problem identification method provided by any of the above embodiments.
The technical scheme provided by the embodiment of the disclosure has the beneficial effects that:
By determining the solving types of at least two original texts and using, in a targeted manner, the title generation model corresponding to the solving type of each original text to generate the candidate title for that text, the solving type of the original text is matched to the type of title generation model, which can improve the accuracy of title generation. The same semantic merging processing is then carried out on all candidate titles to obtain reference questions, and, according to the discipline types of the original texts corresponding to each reference question, a reference question involving a plurality of discipline types is identified as an interdisciplinary research question. Whether a reference question is interdisciplinary can thus be judged accurately, which realizes interdisciplinary information analysis at question granularity, refines interdisciplinary research analysis, and effectively improves the reliability and accuracy of the analysis results.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments of the present disclosure will be briefly introduced below.
Fig. 1 is a schematic flow chart of a problem identification method according to an embodiment of the disclosure;
Fig. 2 is a schematic diagram of a solving type determination method according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of a subject type determination method according to an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a training method for a title generation model according to an embodiment of the disclosure;
Fig. 5 is a schematic diagram of a method for determining a reference problem according to an embodiment of the present disclosure;
Fig. 6 is a schematic flow chart of a method for identifying interdisciplinary research problems according to an embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of a problem identification device according to an embodiment of the present disclosure;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present disclosure, and the technical solutions of the embodiments of the present disclosure are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, all of which may be included in the present specification. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term, e.g., "A and/or B" may be implemented as "A", or as "B", or as "A and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Technical solutions of the embodiments of the present disclosure and technical effects produced by the technical solutions of the present disclosure are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
The disclosed embodiments provide a problem identification method, alternatively, the disclosed embodiments may be applied to a terminal or a server, and all steps in the method may be performed by the terminal or the server independently or by the terminal and the server together.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart voice interaction device (e.g., a smart speaker), a wearable electronic device (e.g., a smart watch), a vehicle-mounted terminal, a smart home appliance (e.g., a smart television), an AR/VR device, etc.
Embodiments of the present disclosure will be described hereinafter with reference to a server as the execution subject, however, this is not to be construed as limiting the embodiments of the present disclosure.
Fig. 1 is a schematic flow chart of a problem identification method provided by an embodiment of the present disclosure, and as shown in fig. 1, a technical solution provided by an embodiment of the present disclosure includes the following steps:
Step S101, determining the solving type of each of at least two original texts; wherein the solving type is information indicating the research purpose of the original text.
The original text may be text used for conveying information in different fields, such as paper text, patent text, fund project text, work text, research report text, social media posts, news manuscripts, and the like. The solving type is information indicating the research purpose of the original text, and is used for dividing original texts into different types according to the gist of the content recorded in them.
Specifically, in step S101, the server acquires each original text in the original text dataset, and determines the solving type corresponding to each original text. The original text dataset comprises at least two original texts. The number of solving types and the specific types may be preset according to actual requirements (e.g., the solving types may be preset as four types A to D).
It can be understood that step S101 may be implemented by determining and labeling the solving type of the original text manually according to experience, or by constructing a solving type labeling model according to actual requirements, training the model on sample original texts of different solving types, and using the trained model to determine and label the solving type of the original text.
It should be noted that, if the solving type labeling model is used to complete the step S101, the algorithm adopted by the model, the specific structure of the model and the model training mode may be determined according to the actual requirement.
For example, solving type labels are added to the sample original texts according to the classification rules of the solving types. When the number of solving types is 2, a binary classification algorithm can be used to construct the solving type labeling model; when the number of solving types is greater than 2, a multi-class classification algorithm (Multiclass Classification) can be used. The model is trained on the labeled sample original texts so that the trained model can perform binary or multi-class classification correctly and thereby determine the solving type of the original text.
Besides constructing a single model, a plurality of models can be constructed to jointly determine the solving type of the original text. For example, a binary classifier (i.e., a binary classification model) is constructed for each solving type, and the solving type of the original text is determined jointly from the recognition results of the binary classifiers. Alternatively, a plurality of multi-class classifiers (i.e., multi-class classification models) are constructed for all solving types using different multi-class classification algorithms, and the solving type of the original text is determined jointly from the recognition results of the multi-class classifiers.
Step S102, each original text is input into a title generation model corresponding to the solving type, and candidate titles corresponding to each original text are generated.
The title generation model is pre-trained and is used for summarizing the content of the original text and generating a candidate title in a standardized description format.
Specifically, because original texts of different solving types differ considerably, the ways in which their titles summarize them also differ considerably. To ensure the quality of the generated candidate titles, a dedicated title generation model is constructed and pre-trained for each solving type; the standardized description format of the candidate titles for each title generation model and the training mode of each model can be determined according to actual requirements.
In step S102, for any original text, the original text is input into a title generation model corresponding to the solution type of the original text, and candidate titles corresponding to the original text are generated. And processing each original text in the mode to obtain a candidate title corresponding to each original text.
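The routing in step S102 can be sketched as a lookup from solving type to title generation model. This is a minimal sketch: the lambda functions below are hypothetical stand-ins for pre-trained title generation models, and the function name `generate_candidate_titles`, the type names and the sample texts are assumptions for illustration.

```python
def generate_candidate_titles(texts, solving_types, title_models):
    """Route each original text to the title model registered for its solving type."""
    titles = []
    for text, solving_type in zip(texts, solving_types):
        model = title_models[solving_type]  # raises KeyError if no model is registered
        titles.append(model(text))
    return titles


# Hypothetical stand-ins for the pre-trained title generation models.
title_models = {
    "understanding": lambda text: "Review of " + text,
    "solution": lambda text: "Method for " + text,
}

titles = generate_candidate_titles(
    ["graph neural networks", "noisy label learning"],
    ["understanding", "solution"],
    title_models,
)
print(titles)
```

Keeping one model per solving type in a dictionary means that adding a new solving type only requires registering one more entry.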
Step S103, carrying out the same semantic merging processing on all the candidate titles to obtain a reference problem.
Specifically, the candidate titles serve as a high-level summary of the contents of the original texts, from which questions to be studied, analyzed, discussed, or solved (answered) in each original text can be determined.
For example, if the original text is a scientific paper, which is generally intended to solve a problem, the "question" in this embodiment may refer to the research question that the scientific paper sets out to solve. If the original text is a social science article, the "question" in this embodiment may refer to the social issue discussed in the article. If the original text is a news text, the "question" in this embodiment may refer to the disputed issue analyzed in the news report.
Each original text has a candidate title corresponding to the original text, and considering that the problems to be solved by different original texts may be the same, in step S103, the same semantic merging process is performed on all candidate titles, so as to obtain a reference problem.
Because the reference questions are obtained by merging and generalizing the candidate titles (for example, identical questions are merged and different questions are retained), they are representative of the questions of the original texts in the original text data set. Repeated and redundant information among the reference questions is eliminated as far as possible, so the reference questions are mutually independent, have low redundancy and carry a large amount of information.
It is understood that the same semantic merging processing may be based on keyword matching, semantic similarity calculation, text clustering, and the like. When the merging processing is performed, candidate titles with the same semantics can be merged according to the semantics of the candidate titles, and the reference questions can then be determined from the merged candidate titles. Alternatively, the question corresponding to each candidate title can be extracted first, and the questions with the same semantics can then be merged to obtain the reference questions. The basis of the same semantic merging processing and the specific manner of merging can be determined according to actual requirements.
Step S104, determining whether the reference question spans disciplines according to the discipline type of the original text corresponding to the reference question.
Specifically, each original text has a discipline type corresponding to the original text, and the reference questions inherit the discipline types of the corresponding original text, and each reference question corresponds to at least one discipline type. The number and specific types of discipline types may be preset according to actual needs (e.g., all raw text in a raw text dataset is known to be medical science treatises, and discipline types may be divided according to medical science professions).
It can be appreciated that the manner of determining the subject type of the original text can be directly obtained (for example, obtaining the professional classification of the published paper as the subject type), or manually labeling according to experience, or constructing a subject type labeling model according to actual requirements, training the model according to sample original texts of different subject types, and determining and labeling the subject type of the original text according to the trained subject type labeling model.
It should be noted that, the algorithm adopted by the discipline type labeling model, the specific structure of the model and the model training mode can be determined according to actual requirements.
In step S104, whether the reference question is interdisciplinary is determined according to the number of different subject types among the original texts corresponding to the reference question.
For example, suppose the reference question Q1 corresponds to the original texts W1 to W10. Each subject type is counted only once across the subject types of these original texts; if the subject types corresponding to Q1 are thus determined to be D1 to D3, the number of subject types covered by Q1 is greater than 1, and Q1 is determined to be interdisciplinary.
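The decision in step S104 amounts to counting distinct subject types. The following minimal sketch assumes that each original text carries a list of subject type labels; the function name `is_cross_discipline`, the parameter `discipline_threshold` and the sample labels are hypothetical.

```python
def is_cross_discipline(disciplines_per_text, discipline_threshold=1):
    """Decide whether a reference question covers more subject types than the threshold.

    disciplines_per_text holds one list of subject type labels per original text.
    Each subject type is counted only once, however many texts carry it.
    """
    distinct = set()
    for disciplines in disciplines_per_text:
        distinct.update(disciplines)
    return len(distinct) > discipline_threshold, sorted(distinct)


# Made-up subject types for the original texts of one reference question.
q1_disciplines = [["D1"], ["D1", "D2"], ["D3"], ["D2"]]
crossing, covered = is_cross_discipline(q1_disciplines)
print(crossing, covered)
```

With the default threshold of 1, any reference question whose original texts cover two or more distinct subject types is flagged as interdisciplinary.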
It will be appreciated that, when determining whether the reference question is interdisciplinary, a discipline number threshold other than 1 may also be set; when the number of discipline types covered by the reference question is greater than the discipline number threshold, the reference question is determined to be interdisciplinary. In addition, considering the length of the original text and the amount of information carried by each part of its content, in this embodiment the input (and the training samples) of each model may be either the full-length original text or a preprocessed original text (or the model itself may preprocess the original text); the preprocessed original text may be the text, or a text representation, of one or more of the title, abstract, keywords, conclusion and the like of the original text.
According to the technical scheme provided by the embodiment of the disclosure, the solution types of at least two original texts are determined, the title generation model corresponding to the solution types of the original texts is used for generating the candidate title corresponding to each original text in a targeted manner, and the accuracy of title generation can be improved by adapting the solution types of the original texts and the title generation model types. And carrying out the same semantic merging processing on all candidate titles, integrating to obtain reference questions, identifying the reference questions related to a plurality of discipline types as interdisciplinary research questions according to discipline types of the original texts corresponding to the reference questions, and accurately judging whether the reference questions are interdisciplinary, so that interdisciplinary information analysis of the granularity of the questions is realized, the fineness of interdisciplinary research analysis is improved, and the reliability and the accuracy of analysis results are effectively improved.
In one possible implementation, determining the solution type of the original text includes:
Inputting the original text into a classifier corresponding to each solving type respectively to obtain the probability that the original text belongs to each solving type; the classifier is obtained by training a sample original text with a solving type label;
And determining the solving type of the original text according to the probability of each solving type.
Specifically, a binary classifier is constructed for each solving type. Taking the determination of the solving type of any one original text as an example, the original text is input into the binary classifier corresponding to each solving type to obtain the recognition result of each binary classifier, and the solving type of the original text is determined jointly from the recognition results of the binary classifiers.
It can be understood that the number of binary classifiers is the same as the number of preset solving types, and each binary classifier is obtained by training on sample original texts with the corresponding solving type label. The binary classifiers can be constructed using algorithms such as logistic regression (Logistic Regression), decision tree (Decision Tree), random forest (Random Forest) and support vector machine (Support Vector Machine, SVM), and the specific training mode can be determined according to actual requirements.
For example, Fig. 2 is a schematic diagram of a solving type determination method provided by an embodiment of the present disclosure. As shown in Fig. 2, when the original text is a scientific paper, three solving types, namely the understanding type, the solution type and the exploratory type, are set according to the goal that the paper's research question is meant to achieve. Understanding-type papers are intended to explain and summarize existing knowledge or insights. Solution-type papers are intended to address and solve specific problems or disputes. Exploratory-type papers aim to discover new knowledge or views through independent research and investigation.
Considering that the number of labels is small, a classifier is constructed for each of the three solving types using the SVM classification algorithm. For each of the three solving types (understanding, solution and exploratory), 500 papers are obtained through manual labeling to form the experimental data set. For each type, 400 papers form the model training set and 100 papers form the model test set. The labeled texts are preprocessed by word segmentation, stop-word filtering and the like, converted into static word vector representations, and used to train one classifier per type with the SVM algorithm.
The title, abstract and keywords of each paper are input into each trained classifier, which outputs the probabilities P'1 to P'3 that the paper belongs to the corresponding solving type, and the solving type with the highest probability is selected as the final solving type labeling result.
It will be appreciated that, when multiple binary classifiers are used to determine the solving type of the original text, two or more classifiers may output the same probability. In that case, manual judgment or a voting mechanism can be introduced to determine the final classification result; alternatively, ensemble learning can be used, in which a plurality of binary classifiers are constructed for each solving type with different algorithms and their results are combined by weighted averaging or comprehensive evaluation to obtain a more accurate classification result.
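Selecting the solving type from the per-type probabilities, including detection of ties, can be sketched as follows. This is a minimal sketch: the probability values are made up for illustration and do not come from trained classifiers, and the function name `pick_solving_type` and its return convention are assumptions.

```python
def pick_solving_type(probabilities):
    """Return (best_type, tied_types) from per-type classifier probabilities.

    best_type is None when several solving types share the highest probability,
    so the caller can fall back to manual review, voting, or an ensemble.
    """
    best = max(probabilities.values())
    tied = sorted(t for t, p in probabilities.items() if p == best)
    if len(tied) > 1:
        return None, tied
    return tied[0], tied


# Made-up outputs of the three per-type binary classifiers for one paper.
scores = {"understanding": 0.21, "solution": 0.83, "exploratory": 0.40}
best_type, tied_types = pick_solving_type(scores)
print(best_type)

# A tie between two classifiers is reported instead of being resolved arbitrarily.
tie_result = pick_solving_type({"understanding": 0.5, "solution": 0.5, "exploratory": 0.1})
print(tie_result)
```

Returning the tied types explicitly keeps the tie-breaking policy outside this function, which matches the several fallback options described above.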
According to the technical scheme provided by the embodiment of the disclosure, the probability that the original text belongs to each solving type is obtained by inputting the original text into the classifier corresponding to each solving type, so that the solving type of the original text can be more accurately determined, and the title generation model corresponding to the original text can be further determined. Moreover, the classifier is obtained by training the sample original text with the solving type label, so that the existing labeling information can be fully utilized, the performance and generalization capability of the classifier are improved, and the solving type is more reliably and effectively identified.
In one possible implementation, determining the subject type of the original text includes:
Inputting the original text and the classification number of the original text into a discipline category classifier to obtain the discipline category of the original text; the discipline category classifier is obtained by training on sample original texts with discipline category labels and classification number labels;
Inputting the original text, the classification number and the discipline category into a multi-label classifier to obtain the discipline type of the original text; the multi-label classifier is obtained by training on sample original texts with discipline type labels and classification number labels.
Specifically, the discipline type is used to assist in identifying cross-discipline reference questions, and its determination is divided into coarse classification and fine classification. Taking the step of determining the discipline type of any original text as an example: in coarse classification, the original text and its classification number are input into the discipline category classifier to obtain the discipline category of the original text; in fine classification, the original text, the classification number and the discipline category are input into the multi-label classifier to obtain the discipline type of the original text.
The discipline category classifier is obtained by training on sample original texts with discipline category labels and classification number labels. It may be constructed with algorithms such as FastText (fast text classification), logistic regression, support vector machine, decision tree, random forest and naive Bayes.
The multi-label classifier is obtained by training on sample original texts with discipline type labels and classification number labels. It may be constructed with algorithms such as Multi-Label K-Nearest Neighbors (ML-KNN), Back Propagation for Multi-Label Learning (BP-MLL) and Deep Extreme Multi-Label Learning (DeepXML).
The algorithm used for constructing the classifier and the specific training mode of the classifier can be determined according to actual requirements.
It can be understood that the classification numbers and the discipline types follow different classification systems (that is, the classification rules for the classification numbers differ from those for the discipline types). In this embodiment, the classification numbers are used to assist the classification of discipline categories and discipline types, combining the two classification systems to obtain more accurate discipline type classification results.
For example, the Ministry of Education discipline classification system comprises discipline categories, first-level disciplines and second-level disciplines. This embodiment determines the discipline type with the Ministry of Education discipline classification system, taking the first-level discipline (or second-level discipline) as the discipline type, and determines the classification number with the Chinese Library Classification (CLC).
When the original text is a scientific paper, the title, abstract, keywords and CLC classification number of the paper are input into a discipline category classifier constructed with the FastText classification algorithm to perform coarse classification and obtain a unique discipline category label for the paper. The title, abstract, keywords, CLC classification number and discipline category are then input into a multi-label classifier constructed with the DeepXML algorithm to perform fine classification and obtain several possible discipline types with their corresponding probabilities; the discipline type with the highest probability is taken as the discipline type of the original text.
It should be noted that the classification systems used above for the discipline type and the classification number are only a specific example. Other classification systems may also be used to determine the classification number and the discipline type of the original text: an existing classification system may be adopted, or a new one meeting the requirements may be defined according to the actual situation of the original text.
For example, when the original text is a patent, the classification number may be an International Patent Classification (IPC) number, and the discipline type may be determined by the International Standard Classification of Education (ISCED).
When the original text is news, the classification number may be a code determined according to the Chinese news information classification and code standard (GB/T 20093-2013), or a code determined by the first-level and second-level classes of that standard. The discipline type may be determined by the Classification of Instructional Programs (CIP).
In one possible implementation, inputting the original text, the classification number and the discipline category into a multi-label classifier to obtain the discipline type of the original text includes:
Inputting the original text, the classification number and the discipline category into the multi-label classifier to obtain a classification result; the classification result comprises reference discipline types and, for each reference discipline type, the probability that the discipline type of the original text is that reference discipline type;
Obtaining, according to the classification result, the conditional probability that the discipline type of the original text is each reference discipline type given the classification number;
Determining the discipline type of the original text based on the conditional probability of each reference discipline type.
Specifically, in fine classification, taking any original text as an example, the original text, the classification number and the discipline category determined by coarse classification are input into the multi-label classifier to obtain a classification result. The classification result output by the multi-label classifier consists of reference discipline types (that is, candidate discipline types) and the probability that the discipline type of the original text is each reference discipline type.
A Bayesian classifier is then constructed. It combines the prior probability with the probability that the discipline type of the original text is each reference discipline type to calculate the conditional probability of each reference discipline type given the classification number, and the discipline type of the original text is determined according to these conditional probabilities.
For example, fig. 3 is a schematic diagram of a discipline type determining method provided in an embodiment of the present disclosure. As shown in fig. 3, when the original text is a scientific paper, the classification number is the CLC classification number, and the discipline category and the discipline type are respectively the discipline category and the first-level discipline of the Ministry of Education discipline classification system, the discipline type determining method is as follows:
In the discipline type labeling process, the models are trained on a self-constructed data set with Ministry of Education discipline labels. The discipline category classifier is constructed with the FastText algorithm; it takes the title, abstract, keywords and CLC classification number of a paper as input and outputs the discipline category of the paper. The multi-label classifier is constructed with the DeepXML algorithm and a Bayesian classification algorithm; it takes the title, abstract, keywords, CLC classification number and discipline category of a paper as input and outputs the first-level discipline of the paper.
When labeling the discipline type of any paper, the title, abstract, keywords and CLC classification number A of the paper are first input into the discipline category classifier. First-layer preprocessing produces a text vector in FastText format, the text vector is classified, and the discipline category of the paper is determined from among S discipline categories and labeled.
Next, the title, abstract, keywords, CLC classification number A and discipline category of the paper are input into the multi-label classifier. Second-layer preprocessing produces a text vector in BOW (bag-of-words) format, a word embedding representation of the text vector is obtained, and the DeepXML algorithm outputs k first-level discipline labels with corresponding probabilities P_1 to P_k. From these k labels, the m first-level disciplines that are consistent with the discipline category labeling result are selected as possible first-level discipline labels (that is, reference discipline types), with corresponding probabilities P_1 to P_m.
The CLC classification numbers in the data set are counted by discipline in advance to obtain the prior probability P(CLC_i | X), which represents the conditional probability that the CLC classification number of a paper is CLC_i given that the first-level discipline of the paper in the data set is X.
Given a paper, its CLC classification number is known, so the Bayesian formula can be used to calculate the comprehensive probability P(X), where P(X) represents the conditional probability that the first-level discipline of the paper is X given the classification number. The calculation formula is as follows:
$$P(X) = \frac{1}{L} \sum_{i=1}^{L} P(CLC_i \mid X) \cdot P_X$$
It can be understood that, when the original text has L classification numbers (the paper has L CLC classification numbers), P(X) is the average, over the L CLC classification numbers, of the product of the prior probability P(CLC_i | X) and the probability P_X of the first-level discipline label X.
The comprehensive probability is calculated for each reference discipline type, and the reference discipline type with the maximum comprehensive probability is selected as the discipline type of the original text.
The discipline type labeling method provided by the embodiment of the present disclosure is based on a hierarchical classification algorithm, integrates an extreme multi-label classification algorithm, and introduces CLC classification number information to optimize the classification result. The extreme multi-label classification algorithm DeepXML processes large-scale data with high computational efficiency and can assign several classification labels to one document. In addition, using prior knowledge of the statistical relationship between CLC classification numbers and discipline classification labels, a Bayesian classification algorithm determines the first-level discipline label of a paper from the multi-label classification result.
The technical scheme provided by the embodiment of the disclosure is based on a hierarchical classification algorithm and determines the discipline type of the original text in two steps, coarse classification and fine classification. Integrating the multi-label classifier improves the computational efficiency of processing large-scale data while allowing one document to receive several classification labels, each with its own probability. Using prior knowledge of the statistical relationship between CLC classification numbers and discipline classification labels, a Bayesian classification algorithm determines the discipline type from the multi-label classification result, which improves the accuracy and reliability of the discipline classification result.
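The comprehensive probability and the final selection can be sketched as follows. This is an illustrative sketch only: the discipline names, CLC numbers and probability values are made up, the function names are hypothetical, and treating a CLC number that was never observed for a discipline as having prior 0.0 is an assumption not stated in the text.

```python
from typing import Dict, List, Tuple


def combined_probability(p_x: float,
                         priors_for_x: Dict[str, float],
                         clc_numbers: List[str]) -> float:
    """Average over the paper's L CLC numbers of P(CLC_i | X) * P_X."""
    if not clc_numbers:
        raise ValueError("at least one classification number is required")
    # Assumption: an unseen CLC number contributes a prior of 0.0.
    total = sum(priors_for_x.get(c, 0.0) * p_x for c in clc_numbers)
    return total / len(clc_numbers)


def pick_discipline(candidates: Dict[str, float],
                    priors: Dict[str, Dict[str, float]],
                    clc_numbers: List[str]) -> Tuple[str, float]:
    """Return the reference discipline type with the largest comprehensive probability."""
    scores = {x: combined_probability(p, priors.get(x, {}), clc_numbers)
              for x, p in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical values: m = 2 reference discipline types with classifier probabilities P_X.
candidates = {"Computer Science": 0.6, "Information Management": 0.5}
# Hypothetical priors P(CLC_i | X) counted from a labeled data set.
priors = {
    "Computer Science": {"TP391": 0.30, "G350": 0.01},
    "Information Management": {"TP391": 0.20, "G350": 0.40},
}
best, score = pick_discipline(candidates, priors, ["TP391", "G350"])
```

In this made-up case the classifier alone would prefer the first discipline, while the CLC priors shift the final choice to the second, which is the effect the prior knowledge is meant to have.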
In one possible implementation, the title generation model is obtained by:
Carrying out solution type labeling on sample original texts matched with the candidate title templates of each solution type to obtain sample original texts with solution type labels corresponding to each solution type;
Retraining a pre-trained model separately with the sample original texts bearing the solution type label of each solution type as external data, to obtain the title generation model corresponding to each solution type.
Specifically, because papers of different solution types differ greatly, their candidate titles also differ greatly, so a dedicated title generation model is constructed for each type to ensure the quality of the generated titles. At generation time, the title generation model is selected according to the solution type label of the paper.
A candidate title template is preset for each solution type, sample original texts matching the candidate title template of each solution type are obtained, and each sample original text is labeled with its solution type, yielding sample original texts with solution type labels for each solution type.
A pre-trained model (large model) is obtained by pre-training on large-scale unlabeled data, for example BERT (Bidirectional Encoder Representations from Transformers) or GPT (Generative Pre-trained Transformer), and can be fine-tuned for a specific task so that it is adapted to the application scenario.
In text generation based on a pre-trained model, the pre-trained model is retrained: the sample original texts with the solution type label of each solution type are used as external data for the corresponding title generation model, and retraining the pre-trained model on them yields the title generation model for each solution type.
For example, fig. 4 is a schematic diagram of a title generation model training method provided by an embodiment of the present disclosure. As shown in fig. 4, when the original text is a scientific paper, three solution types (understanding, solving and exploratory) are defined according to the solution target of the paper's research question. Based on the analysis of a large number of papers, a candidate title template is preset for each solution type; the candidate title is a summary title of the paper's research question.
The understanding-type candidate title template may be "performance/characteristics of A + research/analysis", the solving-type candidate title template may be "research on A based on B", and the exploratory-type candidate title template may be "research status/review of A" or "research on the influence/action mechanism of A on B".
When external data is acquired, paper data matching each candidate title template is first retrieved from a large paper literature database by title search. The paper data is then labeled with the solution type labeling model. Finally, the paper data whose label is consistent with its title form is selected as the external data of that solution type. For example, 11000 pieces of external data are acquired for each solution type, of which 10000 are used to train the title generation model and 1000 are used to test its generation quality.
To ensure the quality of the candidate titles, when the title generation model for each solution type is built, some parameters are transferred from a BART pre-trained model that has been trained on large-scale data, and the pre-trained model is then retrained with the external data obtained from the paper data, yielding an understanding-type generation model, a solving-type generation model and an exploratory-type generation model.
For each paper in the paper set to be identified, the title and abstract are concatenated in title-abstract order and input into the title generation model corresponding to the paper's solution type, so that a candidate title (an understanding-type, solving-type or exploratory-type title) that matches the solution type and contains the research question is obtained in a targeted manner.
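The routing step described above, in which each paper is sent to the generator for its solution type with its title and abstract joined in title-abstract order, can be sketched as follows. This is only a sketch: the stub generators stand in for the three retrained models, and the field names, the separator and the function names are hypothetical.

```python
from typing import Callable, Dict


def build_model_input(title: str, abstract: str, sep: str = " ") -> str:
    """Concatenate title and abstract in title-abstract order."""
    return f"{title.strip()}{sep}{abstract.strip()}"


def generate_candidate_title(paper: Dict[str, str],
                             models: Dict[str, Callable[[str], str]]) -> str:
    """Route the paper to the title generation model of its solution type."""
    model = models[paper["solution_type"]]
    return model(build_model_input(paper["title"], paper["abstract"]))


# Stub generators standing in for the three retrained models (assumption).
models = {
    "understanding": lambda text: "U|" + text,
    "solving": lambda text: "S|" + text,
    "exploratory": lambda text: "E|" + text,
}
paper = {
    "title": " A sample title ",
    "abstract": "A sample abstract.",
    "solution_type": "solving",
}
candidate = generate_candidate_title(paper, models)
```

Keeping the models in a dictionary keyed by solution type means an unknown label raises a `KeyError` instead of silently using the wrong generator.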
From the perspective of technical implementation, current question identification methods fall mainly into information extraction methods and text generation methods.
Information extraction is currently the mainstream approach: phrases, words or sentences representing the research question are extracted from the original text of the paper. Extraction methods can be divided into four kinds: manual labeling, rule matching, machine learning and hybrid methods.
In manual labeling, experts in the relevant field label the papers according to a predefined procedure. In rule matching, rules are defined manually in advance, and rule templates are then used to match questions and methods in the literature. Machine learning is one of the main current approaches; it uses existing or improved machine learning models to extract research questions by sequence labeling or text classification. Hybrid methods combine rule matching and machine learning to build a research question recognition model, and some studies additionally use features such as part of speech and word form to improve recognition.
Text generation is a more recent approach to obtaining the research question of a paper. Instead of extracting the research question from the original text, a machine generates a generalized representation of the research question on the basis of understanding the text content of the paper, so that the complex question identification task is converted into a text generation task.
The text generation approach holds that the research question of a paper is carried by the text content of the paper, and that a language generation model can learn the mapping from the original text to the research question, thereby generating the research question automatically. However, research in this direction is limited, mostly to identifying the lexical functions of academic text through text generation techniques or to suggesting reference titles when writing papers. For example, one study regards the title as a generalized description of a document that plays an important role in expressing the author's writing intention and the core subject of the text; it therefore uses text generation to produce a paper title of the form "B research based on A" and then extracts B as the research question. However, these methods do not take the variety of scientific paper types into account, so it is difficult for them to identify research questions accurately.
Information extraction methods for research question identification therefore struggle to balance precision and efficiency. They depend heavily on manual work and manually labeled corpora, face difficulties in data acquisition, data processing and corpus construction, require large labor and time costs, and extract well only within a vertical field. An information extraction model built for one vertical field is generally difficult to apply to another.
Text generation methods for research question identification depend little on manual work and their training data is easy to obtain. However, the summary title of the form "B research based on A" proposed so far generally cannot represent the research questions of all types of papers, especially review papers and theoretical research papers, so the identification accuracy and overall results are mediocre.
According to the technical scheme provided by the embodiment of the disclosure, literature data with particular title characteristics in a large-scale literature database can be used directly as the fine-tuning training data required by the text generation model, which makes the title generation model convenient to build. A pre-trained model (large model) with text generation capability is fine-tuned on a large corpus to build the title generation model, so the method can be applied across disciplines. By distinguishing the solution types of papers and using a text generation model to generate a summary title for each paper, the research question of the paper is identified with better results than general text generation methods.
Information extraction methods can only be applied to question extraction in a field after a corpus in that field has been labeled to train a question extraction model. In contrast, the text generation method provided by the embodiment of the disclosure requires no manual labeling, and thus avoids the very high labor and time costs that information extraction methods incur by needing large amounts of manually labeled data to build a question extraction model. While the cost is effectively reduced, the title generation model for each solution type is obtained by retraining a pre-trained model, so the title generation models adapt well across fields and the accuracy of question extraction is effectively improved.
In one possible implementation, the same semantic merging process is performed on all candidate titles to obtain a reference problem, including:
extracting the corresponding question reference phrase from each candidate title;
Obtaining question factors according to the keywords in all the question reference phrases; the question factors are keywords whose word frequency is larger than a preset word frequency threshold;
and carrying out the same semantic merging processing according to the question factors corresponding to all the question reference phrases to obtain the reference questions.
Specifically, the candidate titles generated by the title generation model contain content unrelated to the question of the original text, and analysis of the candidate title templates shows that the position of the question reference phrase is relatively fixed in the candidate titles of each solution type.
Therefore, regular expressions can be used to extract the question reference phrase from a candidate title according to its positional characteristics in the title; the question reference phrase is the question of the corresponding original text. Besides regular expressions, methods such as machine learning algorithms or natural language processing techniques may also be used to extract the question reference phrase.
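Position-based extraction with regular expressions can be sketched as follows. The patterns are hypothetical English renderings of the candidate title templates (the embodiment's actual patterns are not given), only one pattern per solution type is shown, and the group name `q` for the question reference phrase is an assumption.

```python
import re
from typing import Optional

# Hypothetical patterns; the named group "q" marks the question reference phrase.
PATTERNS = {
    "understanding": re.compile(
        r"^(?:Research|Analysis) on the (?:performance|characteristics) of (?P<q>.+)$",
        re.IGNORECASE),
    "solving": re.compile(
        r"^Research on (?P<q>.+?) based on (?P<method>.+)$", re.IGNORECASE),
    "exploratory": re.compile(
        r"^(?:Research status|Review) of (?P<q>.+)$", re.IGNORECASE),
}


def extract_question_phrase(candidate_title: str, solution_type: str) -> Optional[str]:
    """Return the phrase at the template's fixed position, or None if no match."""
    match = PATTERNS[solution_type].match(candidate_title.strip())
    return match.group("q").strip() if match else None


phrase = extract_question_phrase(
    "Research on text classification based on deep learning", "solving")
```

Titles that do not fit the template return `None`, which leaves room for the machine learning or natural language processing fallback mentioned above.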
Keywords are extracted from all the question reference phrases, and the keywords whose word frequency is larger than a preset word frequency threshold are taken as question factors. Keyword extraction may use the TextRank algorithm, the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, the RAKE (Rapid Automatic Keyword Extraction) algorithm or other keyword extraction algorithms, and the specific value of the preset word frequency threshold may be set according to the actual situation.
For example, in merging the research questions of papers, question factors need to be obtained from the question reference phrases. After word segmentation, part-of-speech tagging and word frequency statistics on all question reference phrases, it is found that some different question reference phrases contain the same keywords, usually nouns or verbs with a word frequency higher than 5. A combination of these keywords can briefly describe a specific question, so such keywords are used as question factors. Question factors are screened out according to the word frequency and part of speech of the keywords in all question reference phrases, and each question reference phrase can then be represented by the combination of question factors it contains.
It will be appreciated that, to make the question factors more expressive, words that do not help describe the question, such as the generic words "performance", "process" and "application", may be deleted.
Question reference phrases with the same semantics are merged according to the question factors of all question reference phrases to obtain the reference questions. The same-semantics merging may be based on keyword matching, semantic similarity calculation, text clustering and the like.
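Question factor screening by word frequency can be sketched as follows. This sketch assumes the phrases are already segmented into tokens, omits the part-of-speech filter, and uses a hypothetical threshold and toy phrases; the function names are also assumptions.

```python
from collections import Counter
from typing import FrozenSet, List, Set

# Generic words named in the text as unhelpful for describing a question.
GENERIC_WORDS: FrozenSet[str] = frozenset({"performance", "process", "application"})


def question_factors(phrases: List[List[str]], min_freq: int,
                     generic: FrozenSet[str] = GENERIC_WORDS) -> Set[str]:
    """Keep keywords whose frequency exceeds min_freq and that are not generic."""
    counts = Counter(token for phrase in phrases for token in phrase)
    return {w for w, c in counts.items() if c > min_freq and w not in generic}


def represent(phrase: List[str], factors: Set[str]) -> List[str]:
    """Represent a question reference phrase by the question factors it contains."""
    return [token for token in phrase if token in factors]


# Toy, pre-segmented question reference phrases (made-up data).
phrases = [
    ["knowledge", "graph", "construction"],
    ["knowledge", "graph", "application"],
    ["graph", "embedding"],
]
factors = question_factors(phrases, min_freq=1)
```

With real data the threshold would be set from the corpus (the example in the text uses a word frequency higher than 5), and a part-of-speech tagger would restrict the candidates to nouns and verbs.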
According to the technical scheme provided by the embodiment of the disclosure, the solution types of papers are distinguished and a text generation model generates a summary title for each paper, thereby identifying the research question of the paper; the phrases representing the question are then extracted automatically from the candidate titles, which effectively improves the accuracy of question extraction and reduces errors caused by human factors.
Merging all question reference phrases with the same semantics on the basis of question factors makes the resulting reference questions more general and representative. Using the discipline types of the original texts corresponding to each reference question provides question-level granularity for cross-discipline information analysis: whether a reference question covers several discipline types can be determined accurately, and cross-discipline research questions and the disciplines they involve can be identified automatically from a batch of scientific papers, which improves the efficiency of this determination. Compared with manual judgment of cross-discipline research questions, processing is faster and the barrier to use is lower.
Identifying cross-discipline research questions supports cross-discipline research from the perspective of questions. It helps researchers carry out cross-discipline research, and helps research management bodies such as government agencies grasp important cross-discipline questions and formulate policies and supporting measures in advance, so that the necessary resources and support are provided for cross-discipline research. The method is therefore forward-looking and practical.
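The cross-discipline decision for a reference question can be sketched as follows, assuming that "covering several discipline types" means that the original texts behind the reference question carry more than one distinct discipline type. The data structures, identifiers and function names are hypothetical.

```python
from typing import Dict, List, Set


def discipline_coverage(reference_questions: Dict[str, List[str]],
                        discipline_of: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each reference question to the discipline types of its source texts."""
    return {question: {discipline_of[paper] for paper in papers}
            for question, papers in reference_questions.items()}


def is_cross_discipline(disciplines: Set[str]) -> bool:
    """A reference question spans disciplines if it covers more than one type."""
    return len(disciplines) > 1


# Made-up example: reference questions and the papers merged into each.
reference_questions = {"knowledge graph": ["p1", "p2"], "solid electrolyte": ["p3"]}
discipline_of = {"p1": "Computer Science", "p2": "Information Management",
                 "p3": "Materials Science"}
coverage = discipline_coverage(reference_questions, discipline_of)
flags = {q: is_cross_discipline(d) for q, d in coverage.items()}
```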
In one possible implementation manner, according to the question factors corresponding to all the question reference phrases, the same semantic merging process is performed to obtain the reference questions, including:
according to the question factors corresponding to all the question reference phrases, taking question reference phrases whose number of shared question factors is larger than a preset question factor threshold as suspected identical questions;
Taking the suspected identical problems as nodes, taking the text similarity among the suspected identical problems as the weight of the edge, and obtaining a suspected identical problem relation network;
identifying communities by using a network community discovery algorithm according to the suspected same problem relationship network;
and merging the question reference phrases in the same community to obtain a reference question.
Specifically, when identical questions are merged, suspected identical questions are first identified with the question factors: according to the question factors of all question reference phrases, question reference phrases whose number of shared question factors is larger than a preset question factor threshold are taken as suspected identical questions. The specific value of the preset question factor threshold can be set according to actual requirements.
Then, a suspected identical question relationship network is built with the suspected identical questions as nodes and the text similarity between them as edge weights. To compute the text similarity, the questions may be segmented into words, word embedding vectors may be obtained for each question from a pre-trained word vector library (for example, a Word2Vec model or a GloVe model), and the similarity between the word embedding vectors of each pair of suspected identical questions is calculated and used as the edge weight of the network.
Finally, a community detection algorithm that takes edge weights into account is used to identify communities in the suspected identical question relationship network. The question reference phrases in the same community are merged to obtain a reference question.
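Building the suspected identical question relationship network can be sketched as follows. As a simplification, the sketch computes cosine similarity over term-count vectors instead of pre-trained word embeddings (an assumption made to keep the example self-contained), and the question identifiers, toy data and function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_network(factors_of: Dict[str, Set[str]],
                  tokens_of: Dict[str, List[str]],
                  factor_threshold: int = 1) -> Dict[Tuple[str, str], float]:
    """Connect questions sharing more than factor_threshold question factors.

    Each edge is weighted by the text similarity of the two questions.
    """
    edges: Dict[Tuple[str, str], float] = {}
    for u, v in combinations(sorted(factors_of), 2):
        if len(factors_of[u] & factors_of[v]) > factor_threshold:
            edges[(u, v)] = cosine(Counter(tokens_of[u]), Counter(tokens_of[v]))
    return edges


# Toy data: question id -> question factors, and question id -> segmented words.
factors_of = {"q1": {"knowledge", "graph", "construction"},
              "q2": {"knowledge", "graph"},
              "q3": {"graph"}}
tokens_of = {"q1": ["knowledge", "graph", "construction"],
             "q2": ["knowledge", "graph", "completion"],
             "q3": ["graph", "embedding"]}
edges = build_network(factors_of, tokens_of, factor_threshold=1)
```

Swapping in embedding-based similarity only requires replacing `cosine` with a function over averaged word vectors; the pairing logic stays the same.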
For example, fig. 5 is a schematic diagram of a reference question determining method provided in an embodiment of the present disclosure. As shown in fig. 5, the question in this embodiment is the research question of a paper, and each question reference phrase is represented by a combination of question factors.
The preset question factor threshold is 1: papers sharing 2 or more question factors are considered likely to have the same research question and are treated as suspected identical questions. With the research questions as nodes, the suspected identical relation as edges, and the text cosine similarity between suspected identical questions as edge weights, an undirected weighted suspected identical question relationship network is obtained.
Finally, a network community detection algorithm that takes edge weights into account is used to identify the communities in the network. Research questions in the same community are regarded as the same research question and are merged, and a preset number of the most frequent question factors in the community are used to represent the common research question (that is, the reference question) of all research questions in that community.
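The merging step can be sketched as follows. Note the substitution: instead of a weighted community detection algorithm, the sketch uses connected components over edges whose weight is at least `min_weight`, which is a much simpler stand-in and not the algorithm of the embodiment. The function names, the threshold and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def communities(nodes: Iterable[str],
                edges: Dict[Tuple[str, str], float],
                min_weight: float = 0.0) -> List[Set[str]]:
    """Stand-in for community detection: connected components over kept edges."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for (u, v), weight in edges.items():
        if weight >= min_weight:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen: Set[str] = set()
    result: List[Set[str]] = []
    for start in sorted(nodes):
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(component)
    return result


def reference_question(community: Set[str],
                       factors_of: Dict[str, Set[str]],
                       top_n: int = 2) -> List[str]:
    """Represent a community by its top_n most frequent question factors."""
    counts = Counter(f for question in community for f in factors_of[question])
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [factor for factor, _ in ranked[:top_n]]


factors_of = {"q1": {"knowledge", "graph", "construction"},
              "q2": {"knowledge", "graph"},
              "q3": {"graph"}}
groups = communities(factors_of, {("q1", "q2"): 0.67}, min_weight=0.5)
merged = reference_question(groups[0], factors_of, top_n=2)
```

A weighted community detection method from a graph library could replace `communities` without changing `reference_question`.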
When the technical scheme provided by the embodiment of the disclosure is used for generating the reference problem, not only the problem factors of the problem-designated phrases are considered, but also a suspected identical problem relation network is further constructed by utilizing the text similarity, and communities are identified through a network community discovery algorithm, so that the semantic similarity among the problems is more comprehensively considered, the identification accuracy of the identical problems and the fineness of merging processing are improved, and more accurate and useful reference problems are obtained.
The following describes in detail a specific application of the scheme of the embodiments of the present disclosure, taking the process of identifying interdisciplinary research problems in scientific papers as an example:
Fig. 6 is a schematic flow chart of a method for identifying an interdisciplinary research problem provided by an embodiment of the present disclosure. As shown in fig. 6, the technical implementation scheme includes the following five steps:
Paper type labeling:
Each paper in the paper set is preprocessed to obtain N paper texts, and the paper texts are input into a paper solving type labeling model and a paper discipline type labeling model to obtain a solving type label and a discipline type label for each paper. The solving type supports summary title generation; the discipline type assists the identification of interdisciplinary research problems.
Summary title generation:
Because papers of different solving types differ greatly, the summary titles of their research problems also differ greatly, so a dedicated generation model is built for each type to ensure the quality of title generation. At generation time, the corresponding generation model is selected according to the solving type label of the paper. The title and abstract of the paper are concatenated in title-abstract order and input into the corresponding title generation model, which outputs the candidate title.
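The model selection described in this step can be sketched as a simple dispatch on the solving type label. This is a minimal sketch under stated assumptions: the type names follow the three solving types named in this description, while the function names, the dictionary layout of a paper and the stand-in models are hypothetical; a real title generation model would replace each stand-in.

```python
def build_model_input(title, abstract):
    """Concatenate the paper title and abstract in title-abstract order."""
    return f"{title} {abstract}"


def generate_candidate_title(paper, title_models):
    """Select the generation model by solving type and produce a candidate title.

    paper: dict with the hypothetical keys "title", "abstract", "solving_type".
    title_models: dict mapping a solving type to a callable text -> title.
    """
    model = title_models[paper["solving_type"]]
    return model(build_model_input(paper["title"], paper["abstract"]))


# Stand-in "models" that only tag the input; they are placeholders, not real generators.
stub_models = {
    solving_type: (lambda text, t=solving_type: f"[{t}] {text}")
    for solving_type in ("understanding", "solving", "exploring")
}

example_paper = {
    "title": "Garbage sorting with deep learning",
    "abstract": "We study how to improve sorting efficiency.",
    "solving_type": "solving",
}
candidate = generate_candidate_title(example_paper, stub_models)
```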
Problem-designated phrase extraction:
The candidate titles generated by the title generation model contain content unrelated to the research problem of the paper. Analysis of the candidate title forms shows that, for each solving type, the position of the problem-designated phrase in the candidate title is relatively fixed. A regular-expression method is therefore used to extract the problem-designated phrase from the generated title according to its positional features; the extracted phrase is the research problem of the corresponding paper.
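A position-based extraction of this kind can be sketched with regular expressions. The title templates below are invented English examples used only for illustration; the actual candidate title forms for each solving type are not given in this description, so the patterns and the function name `extract_problem_phrase` are assumptions.

```python
import re

# Hypothetical title templates per solving type; the real templates are not disclosed here.
TITLE_PATTERNS = {
    "solving": re.compile(r"^Research on (?P<problem>.+?) based on .+$"),
    "understanding": re.compile(r"^Analysis of (?P<problem>.+)$"),
    "exploring": re.compile(r"^Exploration of (?P<problem>.+)$"),
}


def extract_problem_phrase(candidate_title, solving_type):
    """Return the phrase at the template's problem position, or None if no match."""
    match = TITLE_PATTERNS[solving_type].match(candidate_title.strip())
    return match.group("problem") if match else None


phrase = extract_problem_phrase(
    "Research on urban garbage classification efficiency based on deep learning",
    "solving",
)
# phrase == "urban garbage classification efficiency"
```

Titles that do not fit the template return `None`, so they can be set aside rather than producing a wrong phrase.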
Same-problem merging:
The problem factors of the paper set are determined through statistics and used to identify suspected identical problems. A suspected-identical relationship network of the research problems is then constructed: the research problems are segmented into words, word embeddings are obtained from a pre-trained word vector library, and the similarity between research problems is calculated and used as the edge weight of the network. Finally, a community discovery algorithm that takes the edge weights into account merges the research problems in the same community into one research problem, and the three highest-frequency problem factors in the community are used to represent it, yielding the reference problem.
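The edge weight computation in this step can be sketched as follows. The description states only that word embeddings come from a pre-trained word vector library and that a similarity is calculated; averaging the word vectors into one phrase vector and comparing phrase vectors by cosine similarity is an assumption made for this sketch, and the two-dimensional toy vectors are invented.

```python
import math


def phrase_vector(tokens, word_vectors):
    """Average the vectors of the known tokens (assumed pooling); None if none are known."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a pre-trained word vector library.
toy_vectors = {
    "garbage": [1.0, 0.0],
    "waste": [0.9, 0.1],
    "classification": [0.0, 1.0],
    "traffic": [-1.0, 0.0],
}

q1 = phrase_vector(["garbage", "classification"], toy_vectors)
q2 = phrase_vector(["waste", "classification"], toy_vectors)
q3 = phrase_vector(["traffic", "classification"], toy_vectors)

close_weight = cosine_similarity(q1, q2)  # near-synonymous problems, high weight
far_weight = cosine_similarity(q1, q3)    # unrelated problems, low weight
```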
Interdisciplinary research problem identification:
Once the discipline attributes of a paper have been labeled, the paper has a discipline type label, and this label can be assigned to the research problem of the paper after the research problem is identified.
The number of disciplines corresponding to each reference problem is counted. When a reference problem covers multiple discipline labels, the disciplines involved intersect on that problem, so it is identified as an interdisciplinary problem and added to the interdisciplinary problem list; a problem that belongs to only one discipline is identified as not interdisciplinary.
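The counting rule of this step can be sketched in a few lines. The data layout (a mapping from each reference problem to its papers, and from each paper to its discipline labels) and all names and labels below are hypothetical.

```python
def find_interdisciplinary_problems(problem_to_papers, paper_disciplines):
    """Return (interdisciplinary problem list, disciplines per reference problem)."""
    disciplines_of = {}
    for problem, papers in problem_to_papers.items():
        labels = set()
        for paper in papers:
            # A reference problem inherits the discipline labels of all its papers.
            labels |= set(paper_disciplines[paper])
        disciplines_of[problem] = labels
    interdisciplinary = sorted(p for p, labels in disciplines_of.items() if len(labels) > 1)
    return interdisciplinary, disciplines_of


example_problems = {
    "urban garbage classification": ["paper_a", "paper_b"],
    "traffic flow prediction": ["paper_c"],
}
example_labels = {
    "paper_a": ["environmental science"],
    "paper_b": ["computer science"],
    "paper_c": ["transportation engineering"],
}
cross_list, label_map = find_interdisciplinary_problems(example_problems, example_labels)
# cross_list == ["urban garbage classification"]
```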
The method uses text generation technology to identify problems: a corresponding generation model is built for each of the three problem-solving types of papers (understanding type, solving type and exploring type) to identify the research problems of the papers. The discipline types of each paper and of its research problem are labeled by an automated method, the discipline types of the research problems are integrated by merging identical problems, and research problems involving multiple disciplines are identified as interdisciplinary research problems.
The method can be applied by a literature database to identify interdisciplinary research problems in a set of journal paper texts. It compensates for the lack of discipline field labeling in traditional Chinese literature databases. It can also provide information support or decision reference for problem-based retrieval of scientific papers, analysis of interdisciplinary problems, searching for solutions to interdisciplinary problems, approval of interdisciplinary research projects, review of interdisciplinary projects, and the like. It helps in determining interdisciplinary research projects and selecting interdisciplinary problems, supports the analysis, management and storage of scientific paper information resources, helps relieve information overload and break down disciplinary barriers to promote interdisciplinary communication and cooperation, and can assist enterprises or research institutions in forming innovative ideas and carrying out scientific research innovation.
Moreover, current interdisciplinary information research is conducted at paper or topic granularity and does not treat the identification of interdisciplinary research problems as a research goal. In other words, interdisciplinary information research methods at problem granularity are lacking, and the identification of interdisciplinary research problems has not yet been studied in depth.
Compared with paper-granularity methods, which take a single paper text as the identification object, the method provided by this embodiment merges research problems after identifying the research problem of each paper. It therefore gives a more direct view of the research problems across the whole paper set rather than of isolated single texts, analyzes papers from a more essential angle, and further improves the accuracy of interdisciplinary information research results.
Compared with topic-granularity methods, which study the subjects, topics and concepts discussed in papers and therefore identify and analyze larger topics or fields, the method provided by this embodiment yields more specific research problems. For example, the topic of a paper may be "environmental protection", whereas its research problem may be "how to improve the efficiency of urban garbage classification". A research problem makes clear which aspect is studied or which specific problem is solved, which improves the accuracy of interdisciplinary information research results.
Fig. 7 is a schematic structural diagram of a problem identification device according to an embodiment of the present disclosure, as shown in fig. 7, the problem identification device 70 includes:
A solution type determining module 701, configured to determine a solution type of at least two original texts; the solving type is research purpose information indicating the original text;
The candidate title generation module 702 is configured to input each original text to a title generation model corresponding to the solution type, and generate candidate titles corresponding to each original text respectively; wherein the title generation model is pre-trained;
The reference problem determining module 703 is configured to perform the same semantic merging process on all candidate titles to obtain a reference problem;
The reference question identification module 704 is configured to determine whether the reference question spans disciplines according to the discipline type of the original text corresponding to the reference question.
According to the technical scheme provided by the embodiment, the solution types of at least two original texts are determined, the title generation model corresponding to the solution types of the original texts is used for generating the candidate title corresponding to each original text in a targeted manner, and the accuracy of title generation can be improved by adapting the solution types of the original texts and the title generation model types. And carrying out the same semantic merging processing on all candidate titles, integrating to obtain reference questions, identifying the reference questions related to a plurality of discipline types as interdisciplinary research questions according to discipline types of the original texts corresponding to the reference questions, and accurately judging whether the reference questions are interdisciplinary, so that interdisciplinary information analysis of the granularity of the questions is realized, the fineness of interdisciplinary research analysis is improved, and the reliability and the accuracy of analysis results are effectively improved.
The apparatus of the embodiments of the present disclosure may perform the method provided by the embodiments of the present disclosure, and implementation principles of the method are similar, and actions performed by each module in the apparatus of the embodiments of the present disclosure correspond to steps in the method of the embodiments of the present disclosure, and detailed functional descriptions of each module in the apparatus may be referred to in the corresponding method shown in the foregoing, which is not repeated herein.
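The correspondence between modules 701 to 704 and the method steps can be sketched as a small pipeline object. This is only an illustrative arrangement: the class name, field names and the stand-in callables are hypothetical, and the merging stand-in simply groups identical titles instead of performing the semantic merging described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProblemIdentificationDevice:
    determine_solving_type: Callable[[str], str]                       # role of module 701
    title_models: Dict[str, Callable[[str], str]]                      # used by module 702
    merge_same_semantics: Callable[[List[str]], Dict[str, List[int]]]  # role of module 703

    def run(self, texts: List[str], disciplines: List[List[str]]) -> Dict[str, bool]:
        solving_types = [self.determine_solving_type(t) for t in texts]
        titles = [self.title_models[s](t) for s, t in zip(solving_types, texts)]
        # Maps each reference problem to the indices of its original texts.
        reference = self.merge_same_semantics(titles)
        # Role of module 704: a reference problem spans disciplines when its
        # original texts carry more than one distinct discipline type.
        return {
            problem: len({d for i in indices for d in disciplines[i]}) > 1
            for problem, indices in reference.items()
        }


def group_identical_titles(titles):
    """Stand-in for semantic merging: group texts whose candidate titles are identical."""
    groups = {}
    for index, title in enumerate(titles):
        groups.setdefault(title, []).append(index)
    return groups


device = ProblemIdentificationDevice(
    determine_solving_type=lambda text: "solving",
    title_models={"solving": lambda text: text.split(":")[0]},
    merge_same_semantics=group_identical_titles,
)
result = device.run(
    ["garbage sorting: paper A", "garbage sorting: paper B", "traffic flow: paper C"],
    [["environmental science"], ["computer science"], ["transportation"]],
)
# result == {"garbage sorting": True, "traffic flow": False}
```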
In one possible implementation, the solution type determination module is configured to:
Inputting the original text into a classifier corresponding to each solving type respectively to obtain the probability that the original text belongs to each solving type; the classifier is obtained by training a sample original text with a solving type label;
And determining the solving type of the original text according to the probability of each solving type.
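A minimal sketch of this determination follows, assuming one classifier per solving type and assuming that the type with the highest probability is chosen; the description states only that the type is determined according to the probabilities, so the arg-max rule, the function name and the constant-valued stand-in classifiers are assumptions.

```python
def determine_solving_type(text, classifiers):
    """Score the text with one classifier per solving type and pick the most probable.

    classifiers: dict mapping a solving type to a callable text -> probability.
    Returns (chosen type, probabilities per type).
    """
    probabilities = {name: classify(text) for name, classify in classifiers.items()}
    # Iterating over sorted names makes the choice deterministic when probabilities tie.
    best = max(sorted(probabilities), key=probabilities.get)
    return best, probabilities


# Stand-in classifiers returning fixed probabilities; trained models would replace them.
stub_classifiers = {
    "understanding": lambda text: 0.2,
    "solving": lambda text: 0.7,
    "exploring": lambda text: 0.1,
}
chosen_type, type_probabilities = determine_solving_type("some paper text", stub_classifiers)
# chosen_type == "solving"
```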
In one possible implementation, the device further includes: a discipline type determination module;
the discipline type determination module is used for:
Inputting the original text and the classification number of the original text into a subject class classifier to obtain the subject class of the original text; the subject class classifier is obtained by training with sample original texts carrying subject class labels and classification number labels;
Inputting the original text, the classification number and the subject class into a multi-label classifier to obtain the discipline type of the original text; the multi-label classifier is obtained by training with sample original texts carrying subject type labels and classification number labels.
In one possible implementation, the discipline type determination module is configured to:
Inputting the original text, the classification number and the subject class into a multi-label classifier to obtain a classification result; the classification result comprises a reference subject type and a probability that the subject type of the original text is the reference subject type;
Obtaining the corresponding conditional probability when the subject type of the original text is each reference subject type under the condition of a given classification number according to the classification result;
The subject type of the original text is determined based on the conditional probability of each reference subject type.
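How the conditional probability is computed is not spelled out here, so the following is only one possible reading, offered as a sketch. It assumes that a classification number restricts the admissible reference subject types, that the conditional probability is the classifier probability renormalized over those admissible types, and that every type above a threshold is kept. The function names, the threshold value and the example labels are hypothetical.

```python
def conditional_probabilities(type_probabilities, admissible_types):
    """Renormalize classifier probabilities over the types admissible for a classification number."""
    restricted = {t: p for t, p in type_probabilities.items() if t in admissible_types}
    total = sum(restricted.values())
    if total == 0:
        return {}
    return {t: p / total for t, p in restricted.items()}


def select_subject_types(cond_probabilities, threshold=0.3):
    """Keep every subject type whose conditional probability reaches the (assumed) threshold."""
    return sorted(t for t, p in cond_probabilities.items() if p >= threshold)


classifier_output = {
    "computer science": 0.5,
    "environmental science": 0.3,
    "economics": 0.2,
}
# Hypothetical: the classification number of this text admits only two of the types.
admissible = {"computer science", "environmental science"}

cond = conditional_probabilities(classifier_output, admissible)
subject_types = select_subject_types(cond)
```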
In one possible implementation, the device further includes: a generation model training module;
the generation model training module is used for:
Carrying out solution type labeling on sample original texts matched with the candidate title templates of each solution type to obtain sample original texts with solution type labels corresponding to each solution type;
And respectively retraining the pre-trained model by taking the sample original text with the solving type label corresponding to each solving type as external data to obtain a title generation model corresponding to each solving type.
In one possible implementation, the reference problem determination module is configured to:
extracting the corresponding question reference phrase from each candidate title;
Obtaining question factors according to the keywords in all the question reference phrases; the question factors are keywords whose word frequency is greater than a preset word frequency threshold;
and carrying out the same semantic merging processing according to the question factors corresponding to all the question reference phrases to obtain the reference questions.
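The question factor statistics can be sketched as a word frequency count with a threshold. Whitespace tokenization, the small stop-word list and the function name are assumptions made for this English illustration; Chinese text would first need word segmentation, and the way keywords are selected is not detailed here.

```python
from collections import Counter

# Hypothetical stop words excluded from the keyword statistics.
STOP_WORDS = {"of", "the", "for", "in", "on", "and"}


def question_factors(phrases, frequency_threshold):
    """Return the keywords whose frequency over all phrases exceeds the threshold."""
    counts = Counter(
        word
        for phrase in phrases
        for word in phrase.lower().split()
        if word not in STOP_WORDS
    )
    return {word for word, count in counts.items() if count > frequency_threshold}


example_phrases = [
    "urban garbage classification efficiency",
    "garbage classification accuracy",
    "urban traffic flow prediction",
]
factors = question_factors(example_phrases, frequency_threshold=1)
# factors == {"urban", "garbage", "classification"}
```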
In one possible implementation, the reference problem determination module is configured to:
according to the question factors corresponding to all the question reference phrases, taking question reference phrases whose number of shared question factors is greater than a preset question factor threshold as suspected identical questions;
Taking the suspected identical problems as nodes, taking the text similarity among the suspected identical problems as the weight of the edge, and obtaining a suspected identical problem relation network;
identifying communities by using a network community discovery algorithm according to the suspected same problem relationship network;
and merging the problem-designated phrases in the same community to obtain a reference problem.
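The pairing rule and the representation of a merged community can be sketched as below. The threshold of 1 follows the example given for fig. 5, where phrases sharing 2 or more question factors are suspected identical, and the top-three representation follows the paper example in this description; the function names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def suspected_identical_pairs(phrase_factors, factor_threshold=1):
    """Pair phrases whose number of shared question factors exceeds the threshold.

    phrase_factors: dict mapping a phrase id to its set of question factors.
    """
    pairs = []
    for a, b in combinations(sorted(phrase_factors), 2):
        if len(phrase_factors[a] & phrase_factors[b]) > factor_threshold:
            pairs.append((a, b))
    return pairs


def represent_community(members, phrase_factors, top_k=3):
    """Represent a merged community by its top_k most frequent question factors."""
    counts = Counter(f for member in members for f in phrase_factors[member])
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    return sorted(counts, key=lambda f: (-counts[f], f))[:top_k]


example_factors = {
    "q1": {"garbage", "classification", "urban"},
    "q2": {"garbage", "classification"},
    "q3": {"traffic", "urban"},
}
pairs = suspected_identical_pairs(example_factors)
# pairs == [("q1", "q2")]
reference_problem = represent_community(["q1", "q2"], example_factors)
# reference_problem == ["classification", "garbage", "urban"]
```

The pairs become the edges of the suspected-identical-problem relationship network; q1 and q3 share only one factor and therefore stay unconnected.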
An electronic device (computer apparatus/device/system) is provided in an embodiment of the present disclosure, including a memory, a processor, and a computer program stored on the memory, the processor executing the computer program to implement the steps of the method provided in any of the alternative embodiments of the present disclosure. Compared with the prior art, the following can be achieved: the solving types of at least two original texts are determined, and the title generation model corresponding to the solving type of each original text is used in a targeted manner to generate the candidate title corresponding to that original text; matching the solving type of the original text to the type of the title generation model can improve the accuracy of title generation. The same semantic merging processing is carried out on all candidate titles to obtain reference questions, and, according to the discipline types of the original texts corresponding to the reference questions, reference questions involving multiple discipline types are identified as interdisciplinary research questions, so whether a reference question is interdisciplinary can be judged accurately. Interdisciplinary information analysis at question granularity is thereby realized, the fineness of interdisciplinary research analysis is improved, and the reliability and accuracy of the analysis results are effectively improved.
In an alternative embodiment, an electronic device is provided, and fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the disclosure, as shown in fig. 8, an electronic device 80 includes: a processor 801 and a memory 803. The processor 801 is coupled to a memory 803, such as via a bus 802. Optionally, the electronic device 80 may further comprise a transceiver 804, and the transceiver 804 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data, etc. It should be noted that, in practical applications, the transceiver 804 is not limited to one, and the structure of the electronic device 80 is not limited to the embodiments of the present disclosure.
The processor 801 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 801 may also be a combination that implements computing functions, e.g., a combination of one or more microprocessors, a combination of a DSP and a microprocessor, etc.
Bus 802 may include a path that transfers information between the aforementioned components. Bus 802 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. Bus 802 may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or one type of bus.
The memory 803 may be, without limitation, a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 803 is used to store a computer program that executes embodiments of the present disclosure and is controlled to be executed by the processor 801. The processor 801 is arranged to execute computer programs stored in the memory 803 to implement the steps shown in the foregoing method embodiments.
The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), wearable devices, and the like, and stationary terminals such as digital TVs, desktop computers, and the like.
The disclosed embodiments provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
The disclosed embodiments also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
The computer-readable medium of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although various operational steps are indicated by arrows in the flowcharts of the disclosed embodiments, the order in which these steps are performed is not limited to the order indicated by the arrows. In some implementations of embodiments of the present disclosure, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the scenario that the execution time is different, the execution sequence of the sub-steps or stages can be flexibly configured according to the requirement, and the embodiment of the disclosure is not limited to this.
The foregoing is merely an optional implementation of some implementation scenarios of the present application. It should be noted that, for those skilled in the art, other similar implementations based on the technical ideas of the present application, adopted without departing from the technical idea of the solution of the present application, also fall within the protection scope of the embodiments of the present disclosure.

Claims (11)

1. A method of problem identification, comprising:
Determining the solving type of at least two original texts; wherein the solution type indicates a purpose of investigation of the original text;
Inputting each original text into a title generation model corresponding to the solving type, and generating candidate titles corresponding to each original text respectively; wherein the title generation model is pre-trained;
carrying out the same semantic merging processing on all the candidate titles to obtain a reference problem;
And determining whether the reference question spans disciplines according to discipline types of the original text corresponding to the reference question.
2. The problem identification method of claim 1, wherein determining the solution type of the original text comprises:
Inputting the original text into a classifier corresponding to each solving type respectively to obtain the probability that the original text belongs to each solving type; the classifier is obtained through training according to a sample original text with a solving type label;
and determining the solving type of the original text according to the probability of each solving type.
3. The problem identification method of claim 1, wherein determining the subject type of the original text comprises:
inputting the original text and the classification number of the original text into a subject class classifier to obtain the subject class of the original text; the subject class classifier is obtained by training a sample original text with a subject class label and a class number label;
inputting the original text, the classification number and the discipline categories into a multi-label classifier to obtain discipline types of the original text; the multi-label classifier is obtained by training according to a sample original text with subject type labels and class number labels.
4. The method of claim 3, wherein said inputting the original text, the class number, and the subject class into a multi-label classifier results in the subject type of the original text, comprising:
Inputting the original text, the classification number and the subject class into a multi-label classifier to obtain a classification result; wherein the classification result includes a reference subject type and a probability that the subject type of the original text is the reference subject type;
Obtaining the corresponding conditional probability when the subject type of the original text is each reference subject type under the condition of giving the classification number according to the classification result;
A subject type of the original text is determined based on the conditional probabilities for each reference subject type.
5. The problem identification method according to any one of claims 1 to 4, wherein the title generation model is obtained by:
Carrying out solution type labeling on sample original texts matched with the candidate title templates of each solution type to obtain sample original texts with solution type labels corresponding to each solution type;
And respectively retraining the pre-trained model by taking the sample original text with the solving type label corresponding to each solving type as external data to obtain the title generating model corresponding to each solving type.
6. The method for identifying a problem according to any one of claims 1 to 4, wherein the step of performing the same semantic merging process on all the candidate titles to obtain a reference problem includes:
extracting the corresponding question reference phrase from each candidate title;
obtaining question factors according to the keywords in all the question reference phrases; the question factors are keywords whose word frequency is greater than a preset word frequency threshold;
and carrying out the same semantic merging processing according to the question factors corresponding to all the question reference phrases to obtain the reference questions.
7. The method for identifying a question according to claim 6, wherein the step of performing the same semantic merging process according to the question factors corresponding to all the question reference phrases to obtain the reference question includes:
according to the question factors corresponding to all the question reference phrases, taking question reference phrases whose number of shared question factors is greater than a preset question factor threshold as suspected identical questions;
taking the suspected identical problems as nodes, taking the text similarity among the suspected identical problems as the weight of the edges, and obtaining a suspected identical problem relation network;
Identifying communities by using a network community discovery algorithm according to the suspected same problem relationship network;
and merging the problem-designated phrases in the same community to obtain a reference problem.
8. A problem identification device, characterized by comprising:
the solving type determining module is used for determining solving types of at least two original texts; wherein the solution type is research purpose information indicating the original text;
The candidate title generation module is used for inputting each original text into a title generation model corresponding to the solving type and generating candidate titles corresponding to each original text respectively; wherein the title generation model is pre-trained;
the reference problem determining module is used for carrying out the same semantic merging processing on all the candidate titles to obtain a reference problem;
and the reference question identification module is used for determining whether the reference question spans disciplines according to the discipline type of the original text corresponding to the reference question.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-7.
10. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the problem identification method of any of claims 1-7.
11. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-7.
CN202410146033.0A 2024-02-01 2024-02-01 Problem identification method, device, electronic equipment and storage medium Pending CN118211585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410146033.0A CN118211585A (en) 2024-02-01 2024-02-01 Problem identification method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118211585A true CN118211585A (en) 2024-06-18

Family

ID=91451517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410146033.0A Pending CN118211585A (en) 2024-02-01 2024-02-01 Problem identification method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118211585A (en)

Legal Events

Date Code Title Description
PB01 Publication