CN114201607B - Information processing method and device - Google Patents

Information processing method and device

Info

Publication number
CN114201607B
Authority
CN
China
Prior art keywords
classification result
terms
paragraph
paragraph text
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111514421.2A
Other languages
Chinese (zh)
Other versions
CN114201607A (en)
Inventor
李舰
史亚冰
蒋烨
柴春光
朱勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111514421.2A priority Critical patent/CN114201607B/en
Publication of CN114201607A publication Critical patent/CN114201607A/en
Application granted granted Critical
Publication of CN114201607B publication Critical patent/CN114201607B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

The disclosure provides an information processing method and device, and relates to artificial intelligence technologies such as deep learning and knowledge graphs. The specific implementation scheme is as follows: acquiring a term to be classified; retrieving the term in a corpus to obtain at least one paragraph text; scoring the at least one paragraph text respectively; selecting a predetermined number of paragraph texts as contexts in descending order of score; inputting the term and the contexts into a first pre-trained language model, and outputting a first classification result of the term. The method can effectively reduce the construction cost of a term system and improve graph construction efficiency.

Description

Information processing method and device
Technical Field
The disclosure relates to the field of artificial intelligence technologies such as deep learning and knowledge graphs, and in particular to an information processing method and device.
Background
A term system is the basis of knowledge graph construction; in a term system, terms of different types need to be placed in different subgraphs. Term type prediction, such as medical term type prediction, is a knowledge classification task: given an input term, it predicts the term's corresponding subset of types from a limited set of types.
A term's type is a special attribute of the term. Compared with other attribute extraction tasks, it differs mainly in two respects: first, term type extraction has no explicit mining source and requires more implicit semantic knowledge; second, the set of term types is finite, so the knowledge extraction problem can be converted into a semantic matching problem.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, storage medium, and computer program product for information processing.
According to a first aspect of the present disclosure, there is provided an information processing method including: acquiring a term to be classified; retrieving the term in a corpus to obtain at least one paragraph text; scoring the at least one paragraph text respectively; selecting a predetermined number of paragraph texts as contexts in descending order of score; and inputting the term and the contexts into a first pre-trained language model and outputting a first classification result of the term.
According to a second aspect of the present disclosure, there is provided an information processing apparatus comprising: an acquisition unit configured to acquire terms to be classified; a retrieval unit configured to retrieve the term from the corpus to obtain at least one paragraph text; a scoring unit configured to score the at least one paragraph text, respectively; a selection unit configured to select a predetermined number of paragraph texts in order of a score from high to low as a context; a prediction unit configured to input the term and the context into a first pre-trained language model and output a first classification result of the term.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.
According to the information processing method and device provided by the embodiments of the disclosure, a term is retrieved from an authoritative corpus to obtain paragraph texts; after the paragraph texts are scored, those with higher scores are selected as the term's contexts. A language model then classifies the term together with its contexts to obtain the term's type. This optimizes term system construction: high recall can be achieved without much manual intervention, greatly improving construction efficiency and reducing labor cost.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a method of information processing according to the present disclosure;
FIGS. 3a-3d are schematic diagrams of one application scenario of a method of information processing according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method of information processing according to the present disclosure;
FIG. 5 is a schematic block diagram of one embodiment of an information processing apparatus according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the information processing method or information processing apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a knowledge graph, a model training application, a language model, a corpus, a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed herein.
The server 105 may be a server providing various services, such as a background knowledge graph server providing support for a knowledge graph displayed on the terminal devices 101, 102, 103. The background knowledge graph server can analyze and process received data such as a term classification request, and feed back the processing result (e.g., the type of the term) to the terminal device.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein. The server may also be a server of a distributed system, or a server incorporating a blockchain. The server can also be a cloud server, or an intelligent cloud computing server or an intelligent cloud host with artificial intelligence technology.
It should be noted that the method for processing information provided by the embodiment of the present disclosure is generally executed by the server 105, and accordingly, an apparatus for processing information is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of information processing in accordance with the present disclosure is shown. The information processing method comprises the following steps:
In step 201, the term to be classified is obtained.
In this embodiment, an execution subject of the information processing method (e.g., the server shown in fig. 1) may receive, through a wired or wireless connection, a term classification request including the term to be classified from a terminal with which a user constructs a knowledge graph. The request may also include the term's technical field, e.g., medical, musical, physical, etc. Different fields correspond to different type sets; for example, the medical field addresses type prediction over seven types of terms: disease, symptom, sign, examination, test, surgery, and procedure.
In step 202, terms are retrieved from a corpus to obtain at least one paragraph text.
In this embodiment, the corpus is an authoritative book or database in a certain field, containing the field's various term definitions, theoretical knowledge, etc. The term can be searched directly in the corpus by character string matching; when a match succeeds, the natural paragraph containing the matched character string is copied as a paragraph text. A term may match many character strings in the corpus, yielding multiple paragraph texts.
Alternatively, the term may be segmented into words, and each segmented word used as a search word in the corpus. Words and paragraph texts (para) can be associated and indexed in the corpus in advance, so the paragraph texts corresponding to a segmented word are obtained simply by looking the word up. For example, "acute gastritis" may be segmented to yield "gastritis", and all paragraph texts in the corpus containing "gastritis" have been indexed in advance, e.g., page 1 paragraph 2, page 2 paragraph 3, and so on.
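As a non-limiting illustration, this alternative retrieval step can be sketched in Python as follows; the index structure and the names (INDEX, build_index, retrieve_paragraphs) are assumptions introduced for illustration, and segment stands for any word segmentation function:

```python
from collections import defaultdict

# Hypothetical inverted index built offline: segmented word -> paragraph texts.
INDEX: defaultdict[str, list[str]] = defaultdict(list)

def build_index(paragraphs: list[str], segment) -> None:
    """Associate every segmented word with the paragraph texts containing it."""
    for para in paragraphs:
        for word in set(segment(para)):
            INDEX[word].append(para)

def retrieve_paragraphs(term: str, segment) -> list[str]:
    """Segment the term and look up the indexed paragraph texts for each word."""
    results: list[str] = []
    for word in segment(term):          # e.g. "acute gastritis" -> ["gastritis", ...]
        results.extend(INDEX.get(word, []))
    return list(dict.fromkeys(results))  # deduplicate while preserving order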
Step 203, scoring the at least one paragraph text respectively.
In this embodiment, each paragraph text may be scored according to preset scoring rules, the total score being a weighted sum of the individual scores, for example of the length of the paragraph text and the frequency with which the term appears in it. Length can be divided into several bands: a moderate length, taken as the reference length (e.g., 50-100 words), scores highest, and the score decreases as the deviation from the reference length grows, because too few words provide little effective information while too many words make the effective information hard to find. The more frequently the term appears in the paragraph text, the higher the score.
Alternatively, other terms listed in parallel with the term (identified by an enumeration comma (、), semicolon, or similar punctuation expressing a coordinate relationship) can also be found in the paragraph text. If the types of those other terms are known, the paragraph text may receive extra points, and the more known-type terms appear, the higher the bonus. For example, for the term "acute gastritis", the retrieved paragraph text "acute gastritis, chronic gastritis, gastric ulcer" can score extra points if the type of "gastric ulcer" is known to be disease, and still more if the type of "chronic gastritis" is also known to be disease.
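A minimal sketch of the scoring rules above follows; the length bands, the weights, and the lookup table of known-type terms are assumptions chosen for illustration only:

```python
import re

KNOWN_TYPED_TERMS = {"chronic gastritis", "gastric ulcer"}  # assumed lookup table

def score_paragraph(term: str, para: str,
                    ref_lo: int = 50, ref_hi: int = 100) -> float:
    # Length score: highest inside the reference band (50-100 in the example
    # above), decreasing as the deviation from the band grows. len(para) is
    # used as a rough word count for illustration.
    length = len(para)
    if ref_lo <= length <= ref_hi:
        length_score = 1.0
    else:
        deviation = min(abs(length - ref_lo), abs(length - ref_hi))
        length_score = max(0.0, 1.0 - deviation / 500)
    # Frequency score: the more often the term appears, the higher.
    freq_score = min(1.0, para.count(term) / 5)
    # Bonus for parallel terms of known type, split on enumeration commas,
    # semicolons, and similar punctuation.
    neighbors = re.split(r"[、;；,，]", para)
    bonus = 0.2 * sum(1 for n in neighbors if n.strip() in KNOWN_TYPED_TERMS)
    # Total score as a weighted sum; the weights are assumptions.
    return 0.5 * length_score + 0.3 * freq_score + bonus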
Step 204, selecting a predetermined number of paragraph texts as contexts in descending order of score.
In this embodiment, since many paragraph texts may be retrieved, scoring is used to screen out the low-scoring paragraph texts, keeping only the predetermined number of highest-scoring paragraph texts as contexts.
Step 205, inputting the terms and the context into a first pre-training language model, and outputting a first classification result of the terms.
In this embodiment, the essence of pre-training is that model parameters are no longer randomly initialized but are first trained on some task (e.g., language modeling). Pre-training belongs to transfer learning; the pre-trained language models here mainly refer to unsupervised pre-training tasks (sometimes also called self-learning or self-supervision), and the transfer paradigm is mainly feature integration and model fine-tuning (finetune). A pre-trained language model extracts features from the input text and then classifies them; a natural language model such as BERT (Bidirectional Encoder Representations from Transformers) may be used here. Two pre-trained language models are used herein; they may share the same structure but differ in network parameters. For distinction, they are named the "first pre-trained language model" and the "second pre-trained language model". The first pre-trained language model is suited to recognizing long text, taking the term and context as input, while the second pre-trained language model is suited to recognizing short text, taking only the term as input.
The output classification result gives the probabilities that the term belongs to the different types; the type with the highest probability can be determined as the term's type, yielding a <term, type> pair. The term and its type may then be used to construct the knowledge graph. For example, with the seven types disease, symptom, sign, examination, test, surgery, and procedure, the first classification result of the term "acute gastritis" may be (0.8, 0.01, 0.03, 0.04, …, 0.05), and the type of acute gastritis is determined to be disease.
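For illustration, this prediction step can be sketched with a BERT-style sequence classifier through the Hugging Face transformers API; the checkpoint name is an assumption (in practice the model would be fine-tuned on labeled <term, type> data), and the type list follows the medical example above:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

TYPES = ["disease", "symptom", "sign", "examination",
         "test", "surgery", "procedure"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(TYPES))

def classify(term: str, context: str) -> dict[str, float]:
    # Encode term and context as a sentence pair: [CLS] term [SEP] context [SEP].
    inputs = tokenizer(term, context, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    return dict(zip(TYPES, probs.tolist()))

# The highest-probability type yields the <term, type> pair:
# max(classify("acute gastritis", context).items(), key=lambda kv: kv[1])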
The method provided by the above embodiment of the present disclosure is a term type prediction scheme based on pre-trained language models. It is mainly oriented to term system construction within graph construction and assists in completing the schema definition. The schema of a knowledge graph is equivalent to a data model of a field, comprising the meaningful concept types in the field and the attributes of those concept types; the schema of any domain is mainly expressed through types and properties. The scheme optimizes term system construction efficiency and achieves high recall without much manual intervention, greatly improving construction efficiency and reducing labor cost.
In some optional implementations of this embodiment, the method further includes: matching the term with preset classification rules to obtain a second classification result; and taking the weighted sum of the first classification result and the second classification result as the final classification result. Matching the term against preset classification rules is the second term classification method provided herein, referred to as pattern recognition. Its main goal is to hard-code some common correct or incorrect patterns, covering as many samples as accurately as possible. The patterns currently involved are mainly character string matching based on type indicator words, and they are constructed mainly by manual configuration from evaluation data during model iteration. Multiple classification rules can be preset and matched one by one, and the type corresponding to a successfully matched rule is taken as the term's type. The second classification result is a vector in which each element is the flag bit of one type: a value of 1 means the term belongs to that type, and a value of 0 means it does not. A term may belong to one type or several. For example, with the seven types disease, symptom, sign, examination, test, surgery, and procedure, a second classification result of 0000011 means the term belongs to both surgery and procedure.
The final classification result is the weighted sum of the first and second classification results. The prediction results of the two approaches are thus fused, avoiding classification errors caused by any single inaccurate prediction and compensating for the weaknesses of the different classification methods, thereby improving classification accuracy.
Different weights may be set empirically, for example, if there is only a single type in the second classification result, the weight of the second classification result may be set highest.
In some optional implementations of this embodiment, matching the term with preset classification rules to obtain a second classification result includes: performing keyword matching and/or prefix/suffix matching between the term and the preset classification rules to obtain the second classification result.
Keywords: key indicator words can be located to their corresponding categories, for example "biopsy".
Prefixes and suffixes: key prefixes and suffixes, such as "XXX elevated" among test results.
Keyword matching and prefix/suffix matching can be combined to prevent missed detections. The keyword and affix tables can be derived from statistical analysis of a large amount of data. For example, if "XXX elevated" occurs frequently among test results, "elevated" can be taken as a suffix, and if the term also ends with "elevated", the flag bit corresponding to the type "test" can be set to 1.
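A hedged sketch of this rule matching follows; the rule tables are tiny illustrative stand-ins built from the examples in this description, and a real system would configure them from evaluation data:

```python
TYPES = ["disease", "symptom", "sign", "examination",
         "test", "surgery", "procedure"]
KEYWORD_RULES = {"biopsy": "procedure"}   # keyword -> type, from the example above
SUFFIX_RULES = {"elevated": "test"}       # suffix -> type, from the example above

def rule_match(term: str) -> list[int]:
    """Return the second classification result as a flag-bit vector."""
    flags = [0] * len(TYPES)
    for keyword, type_name in KEYWORD_RULES.items():
        if keyword in term:
            flags[TYPES.index(type_name)] = 1
    for suffix, type_name in SUFFIX_RULES.items():
        if term.endswith(suffix):
            flags[TYPES.index(type_name)] = 1
    return flags   # e.g. [0, 0, 0, 0, 0, 1, 1] means surgery and procedure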
In some optional implementations of this embodiment, the method further includes: inputting the term into a second pre-trained language model to obtain a third classification result; and taking the weighted sum of the first classification result and the third classification result as the final classification result. Classification by the second pre-trained language model is the third term classification method provided herein, referred to as short-text-based type prediction. The term's type is predicted by the second pre-trained language model, and the resulting third classification result gives the probability that the term belongs to each type. To speed up computation, only types with probability greater than a predetermined threshold may be kept effective, and probabilities not greater than the threshold may be set to 0. The final classification result is the weighted sum of the first and third classification results. The prediction results of the two approaches are thus fused, avoiding classification errors caused by any single inaccurate prediction and compensating for the weaknesses of the different classification methods, thereby improving classification accuracy.
In some optional implementations of this embodiment, the method further includes: matching the term with preset classification rules to obtain a second classification result; inputting the term into a second pre-trained language model to obtain a third classification result; and taking the weighted sum of the first, second, and third classification results as the final classification result. This fuses the results of all three classification methods, further improving the robustness of the classification algorithm: even if one classification result is wrong, the other two can correct it, improving classification accuracy.
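The three-way fusion can be sketched as below; the weights, the threshold, and the single-hit heuristic for pattern recognition are assumptions chosen for illustration:

```python
def fuse(first: list[float], second: list[int], third: list[float],
         w1: float = 0.5, w2: float = 0.3, w3: float = 0.2,
         threshold: float = 0.5) -> list[float]:
    """Weighted sum of the three per-type classification results."""
    # Zero out low-confidence short-text probabilities, as suggested above.
    third = [p if p > threshold else 0.0 for p in third]
    # If pattern recognition hit exactly one type, trust it more
    # (the confidence heuristic mentioned in this description).
    if sum(second) == 1:
        w1, w2, w3 = 0.3, 0.5, 0.2
    return [w1 * a + w2 * b + w3 * c for a, b, c in zip(first, second, third)]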
In some optional implementations of this embodiment, scoring the at least one paragraph text respectively includes: filtering the at least one paragraph text according to preset text filtering conditions; and, for each remaining paragraph text, extracting at least one feature of the paragraph text and taking the weighted sum of the feature scores as the paragraph text's score. The text filtering conditions may form a semantics-independent pre-filter, for example filtering by whether the paragraph text (para) contains a fragment of the term (including the term's name and aliases, its segmented words, etc.), by the paragraph text's length, and by typical words in the paragraph text. If a retrieved paragraph text contains none of the term's names, aliases, or segmented words, it can be filtered out; paragraph texts that are too long (e.g., more than 500 words) or too short (e.g., fewer than 10 words) may also be filtered out. Typical-word filtering targets specific paragraph texts: a paragraph text may be filtered out, for example, if it contains words such as "wrong example". The remaining paragraph texts are then scored on semantics-related features, for example computed from co-occurrence information; the features mainly include the frequency of type text appearing in the paragraph text, whether other terms listed in parallel with the term have a known type, and so on. A simple weighted fit of the feature scores gives each paragraph text's final score, and after sorting, the predetermined number of highest-scoring paragraph texts form the final output of the paragraph ranking module.
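The semantics-independent pre-filter can be sketched as follows; the length bounds and the blocked word come from the examples above, while the function shape is an assumption:

```python
BLOCKED_WORDS = {"wrong example"}   # typical-word filter, illustrative

def keep_paragraph(term: str, aliases: set[str], para: str) -> bool:
    if not (10 <= len(para) <= 500):                       # too short or too long
        return False
    if not any(fragment in para for fragment in {term, *aliases}):
        return False                                       # no fragment of the term
    if any(word in para for word in BLOCKED_WORDS):
        return False                                       # typical-word filtering
    return True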
With continuing reference to figs. 3a-3d, figs. 3a-3d are schematic diagrams of an application scenario of the information processing method of this embodiment in the medical field. In this application scenario, the three classification methods are combined. As shown in fig. 3a, the term is predicted by three branches in parallel (in no particular order), and multi-source fusion and selection then yields the final type. The specific process is as follows:
1. pattern recognition
The main goal of pattern recognition is to hard-code some common correct or incorrect patterns, covering as many samples as accurately as possible. The patterns currently involved are mainly character string matching based on type indicator words, and they are constructed mainly by manual configuration from evaluation data during model iteration. They mainly comprise the following:
Indicator words: key indicator words can be located to their corresponding categories. For example, "biopsy" is a procedure.
Prefixes and suffixes: key prefixes and suffixes, such as "XXX elevated" among test results.
2. Short text based type prediction
The main role of short-text-based type prediction is to predict, by short-text classification with the second pre-trained language model (for short text), the probability that a given term belongs to each type, and to output the list of types whose probability exceeds a threshold. The current base model is a multi-class classification task built on a pre-trained language model. The model structure is shown in fig. 3b.
3. Context-based term type prediction
Both construction methods above use only the information in the term name itself. Although they cover the term set of common patterns well, for more complex terms the prediction quality drops significantly, limited by the completeness of the input semantics. A context-aware term type determination method based on authoritative corpus enhancement is therefore designed, as shown in fig. 3c, mainly comprising three steps:
a) Paragraph acquisition: enriching input corpus based on input term names
Paragraph acquisition obtains, from the input term, the corpus text used to predict the entity type. Since term descriptions in real medical scenarios differ from those in authoritative books, for a given entity name the paragraph texts are retrieved using its segmented words as search conditions. To ensure the reliability of the data source, the text is drawn mainly from authoritative medical books.
b) Paragraph ranking: ranking the corpus data produced by enrichment so that only high-weight paragraph texts are computed on.
The corpus text obtained directly by paragraph acquisition may contain many text paragraphs, and computing on all of them directly is extremely costly, so a paragraph ranking module is added to reduce the text scale. Paragraph ranking filters and orders all captured paragraph texts, keeping only those relevant to the semantics of the given term. It mainly comprises two steps: semantics-independent pre-filtering and semantics-related ranking.
Semantics-independent pre-filtering: paragraph texts are filtered by whether they contain a fragment of the term S (including the name and aliases of S, its segmented words, etc.), by the paragraph text's length, and by typical words in the paragraph text.
Semantics-related ranking: feature computation is based on co-occurrence information. The features mainly include the frequency of type text appearing in the paragraph text, whether other terms listed in parallel with the term have a known type, and so on. A simple weighted fit of the feature scores gives each paragraph text's final score, and the top N paragraph texts after ranking are taken as the final output of paragraph ranking.
c) Context-based type prediction: context-aware term type determination based on authoritative corpus enhancement.
Context-based type prediction uses the first pre-trained language model (for long text) to predict, by semantic matching, the probability that the term belongs to each type given the term and its context (the filtered, ranked paragraph texts). The current base model is a multi-class classification task built on a pre-trained language model.
Since the finally captured paragraph text is usually long, it can far exceed the input length limit of the pre-trained model. The input can therefore be split by paragraph, dividing the original paragraph text into several segments (for example, one segment per 500 words), and the term name is predicted against each segment separately. The multiple outputs are converted to a fixed length by maximum pooling. Fig. 3d shows the structure of single-type prediction.
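An illustrative sketch of this long-context handling: split the text into segments, classify the term against each segment, and max-pool the per-type probabilities into one fixed-length output. Here classify refers to the illustrative function sketched earlier, and the 500-character segment size follows the example above:

```python
def classify_long(term: str, long_text: str, seg_len: int = 500) -> dict[str, float]:
    segments = [long_text[i:i + seg_len]
                for i in range(0, len(long_text), seg_len)] or [long_text]
    per_segment = [classify(term, seg) for seg in segments]
    # Maximum pooling over segments, per type.
    return {t: max(p[t] for p in per_segment) for t in per_segment[0]}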
Finally, the outputs of the three approaches are weighted and combined to build the complete output set; hypernym/hyponym information and post-processing rules are introduced to filter and adjudicate the outputs, and only knowledge that passes the filter is output.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method of information processing is shown. The flow 400 of the information processing method includes the following steps:
step 401, obtaining terms to be classified.
At step 402, terms are retrieved from a corpus to obtain at least one paragraph of text.
Step 403, scoring the at least one paragraph text respectively.
Step 404, selecting a predetermined number of paragraph texts as contexts in descending order of score.
Step 405, inputting the terms and the context into a first pre-training language model, and outputting a first classification result of the terms.
Step 406, matching the term with preset classification rules to obtain a second classification result.
Step 407, inputting the term into the second pre-trained language model to obtain a third classification result.
Step 408, taking the weighted sum of the first, second, and third classification results as the final classification result.
Steps 401-405 are substantially the same as steps 201-205, and steps 406-408 correspond to the optional implementations described above, so they are not described in detail here.
Step 409, acquiring classification result filtering conditions and checking the final classification result against them.
In this embodiment, the classification result filtering conditions may include hypernym/hyponym information of the term in the knowledge graph. If the types of the term's hyponyms and hypernyms are consistent with the classification result, the check passes.
The classification result filtering conditions may also include post-processing rules, e.g., the term belongs to mutually exclusive types, or the probability of the term belonging to any type does not exceed a predetermined threshold. If, in the classification result, the probability of the term being an animal is 0.5 and the probability of it being a plant is also 0.5, verification fails; if no probability in the classification result is greater than 0.6, the check also fails. Classification results that fail the check are filtered out.
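These checks can be sketched as follows; the exclusive-type pair and the 0.6 threshold follow the examples above, and the support level is an assumption:

```python
EXCLUSIVE_PAIRS = [("animal", "plant")]   # illustrative mutually exclusive types

def verify(result: dict[str, float], threshold: float = 0.6,
           support: float = 0.4) -> bool:
    if max(result.values()) <= threshold:
        return False        # no type is confident enough (e.g. all <= 0.6)
    for a, b in EXCLUSIVE_PAIRS:
        if result.get(a, 0.0) >= support and result.get(b, 0.0) >= support:
            return False    # strong support for mutually exclusive types
    return True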
Step 410, if the check succeeds, outputting the final classification result.
In this embodiment, a classification result that passes verification is one that meets the requirements. If the check fails, the classification result can be corrected: for example, the same term is looked up in historical annotation data and the annotated type is taken as the classification result. Optionally, the confidences of the three classification results may also be analyzed (for example, if pattern recognition hits a single type, the second classification result obtained that way has the highest confidence), and when computing the weighted sum the weight of each classification result is set by its confidence, a higher confidence giving a larger weight. The fused classification result is then recomputed and checked again.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the information processing method in this embodiment adds a step of verifying the classification result. The scheme described in this embodiment can thus be further verified using hypernym/hyponym information and the like, improving the reliability of the classification result; in addition, correction can be performed when verification fails.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for information processing, which corresponds to the method embodiment shown in fig. 2, and which is specifically applicable to various electronic devices.
As shown in fig. 5, the information processing apparatus 500 of the present embodiment includes: an acquisition unit 501, a retrieval unit 502, a scoring unit 503, a selection unit 504, and a prediction unit 505. Wherein, the obtaining unit 501 is configured to obtain terms to be classified; a retrieving unit 502 configured to retrieve the term from the corpus to obtain at least one paragraph text; a scoring unit 503 configured to score the at least one paragraph text, respectively; a selecting unit 504 configured to select a predetermined number of paragraph texts in order of a score from high to low as a context; a prediction unit 505 configured to input the term and the context into a first pre-trained language model and output a first classification result of the term.
In the present embodiment, specific processing of the acquisition unit 501, the retrieval unit 502, the scoring unit 503, the selection unit 504, and the prediction unit 505 of the information processing apparatus 500 may refer to step 201, step 202, step 203, step 204, and step 205 in the corresponding embodiment of fig. 2.
In some optional implementations of this embodiment, the apparatus 500 further comprises a first fusion unit (not shown in the drawings) configured to: matching the terms with a preset classification rule to obtain a second classification result; and taking the weighted sum of the first classification result and the second classification result as a final classification result.
In some optional implementations of the present embodiment, the apparatus 500 further comprises a second fusion unit (not shown in the drawings) configured to: inputting the terms into a second pre-training language model to obtain a third classification result; and taking the weighted sum of the first classification result and the third classification result as a final classification result.
In some optional implementations of this embodiment, the apparatus 500 further comprises a third fusion unit (not shown in the drawings) configured to: matching the terms with a preset classification rule to obtain a second classification result; inputting the terms into a second pre-training language model to obtain a third classification result; and taking the weighted sum of the first classification result, the second classification result and the third classification result as a final classification result.
In some optional implementations of this embodiment, the apparatus 500 further comprises a verification unit (not shown in the drawings) configured to: obtain classification result filtering conditions, the classification result filtering conditions including hypernym/hyponym information of the term in the knowledge graph; check the final classification result according to the classification result filtering conditions; and, if the check succeeds, output the final classification result.
In some optional implementations of this embodiment, the first fusion unit is further configured to: and carrying out keyword matching and/or prefix and suffix matching on the terms and a preset classification rule to obtain a second classification result.
In some optional implementations of this embodiment, the scoring unit 503 is further configured to: filtering the at least one paragraph text according to a preset text filtering condition; and for each paragraph text after filtering, extracting at least one feature of the paragraph text, and taking the weighted sum of the scores of the at least one feature as the score of the paragraph text.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of flow 200 or 400.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of flows 200 or 400.
A computer program product comprising a computer program which, when executed by a processor, implements the method of flows 200 or 400.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 can also store various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, and the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 executes the respective methods and processes described above, such as the method of information processing. For example, in some embodiments, the method of information processing may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method of information processing described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of information processing.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combining a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (16)

1. A method of information processing, comprising:
acquiring terms to be classified;
searching the term in a corpus to obtain at least one paragraph text, wherein a segmented word is obtained after the term is segmented and is used as a search word for retrieval in the corpus, different words and paragraph texts having been associated and indexed in the corpus in advance, so that the paragraph text corresponding to the segmented word can be obtained as long as the segmented word is looked up;
respectively scoring the at least one paragraph text, wherein the scoring rules comprise: the length of the paragraph text and the frequency with which the term appears in the paragraph text, the score of a paragraph text at the reference length being the highest, the score being lower the more the length of the paragraph text deviates from the reference length, and the score being higher the more frequently the term appears in the paragraph text;
selecting a preset number of paragraph texts as contexts according to the order of the scores from high to low;
inputting the term and the context into a first pre-training language model, and outputting a first classification result of the term.
2. The method of claim 1, wherein the method further comprises:
matching the terms with a preset classification rule to obtain a second classification result;
and taking the weighted sum of the first classification result and the second classification result as a final classification result.
3. The method of claim 1, wherein the method further comprises:
inputting the terms into a second pre-training language model to obtain a third classification result;
and taking the weighted sum of the first classification result and the third classification result as a final classification result.
4. The method of claim 1, wherein the method further comprises:
matching the terms with a preset classification rule to obtain a second classification result;
inputting the terms into a second pre-training language model to obtain a third classification result;
and taking the weighted sum of the first classification result, the second classification result and the third classification result as a final classification result.
5. The method according to any one of claims 2-4, wherein the method further comprises:
obtaining a classification result filtering condition, wherein the classification result filtering condition comprises hypernym/hyponym information of the term in the knowledge graph;
checking the final classification result according to the classification result filtering condition;
and if the verification is successful, outputting the final classification result.
6. The method of claim 2, wherein the matching the term with a preset classification rule to obtain a second classification result comprises:
and carrying out keyword matching and/or prefix and suffix matching on the terms and preset classification rules to obtain a second classification result.
7. The method of claim 1, wherein said scoring said at least one paragraph text separately comprises:
filtering the at least one paragraph text according to a preset text filtering condition;
and for each paragraph text after filtering, extracting at least one feature of the paragraph text, and taking the weighted sum of the scores of the at least one feature as the score of the paragraph text.
8. An apparatus for information processing, comprising:
an acquisition unit configured to acquire terms to be classified;
a retrieval unit configured to retrieve the term from a corpus to obtain at least one paragraph text, wherein a segmented word is obtained after the term is segmented and is used as a search word for retrieval in the corpus, different words and paragraph texts having been associated and indexed in the corpus in advance, so that the paragraph text corresponding to the segmented word can be obtained as long as the segmented word is looked up;
a scoring unit configured to score the at least one paragraph text respectively, wherein the scoring rules comprise: the length of the paragraph text and the frequency with which the term appears in the paragraph text, the score of a paragraph text at the reference length being the highest, the score being lower the more the length of the paragraph text deviates from the reference length, and the score being higher the more frequently the term appears in the paragraph text;
a selection unit configured to select a predetermined number of paragraph texts in order of a score from high to low as a context;
a prediction unit configured to input the term and the context into a first pre-trained language model and output a first classification result of the term.
9. The apparatus of claim 8, wherein the apparatus further comprises a first fusion unit configured to:
matching the terms with a preset classification rule to obtain a second classification result;
and taking the weighted sum of the first classification result and the second classification result as a final classification result.
10. The apparatus of claim 8, wherein the apparatus further comprises a second fusion unit configured to:
inputting the terms into a second pre-training language model to obtain a third classification result;
and taking the weighted sum of the first classification result and the third classification result as a final classification result.
11. The apparatus of claim 8, wherein the apparatus further comprises a third fusion unit configured to:
matching the terms with a preset classification rule to obtain a second classification result;
inputting the terms into a second pre-training language model to obtain a third classification result;
and taking the weighted sum of the first classification result, the second classification result and the third classification result as a final classification result.
12. The apparatus according to any one of claims 9-11, wherein the apparatus further comprises a verification unit configured to:
obtaining a classification result filtering condition, wherein the classification result filtering condition comprises hypernym/hyponym information of the term in the knowledge graph;
checking the final classification result according to the classification result filtering condition;
and if the verification is successful, outputting the final classification result.
13. The apparatus of claim 9, wherein the first fusion unit is further configured to:
and carrying out keyword matching and/or prefix and suffix matching on the terms and a preset classification rule to obtain a second classification result.
14. The apparatus of claim 8, wherein the scoring unit is further configured to:
filtering the at least one paragraph text according to a preset text filtering condition;
and for each paragraph text after filtering, extracting at least one feature of the paragraph text, and taking the weighted sum of the scores of the at least one feature as the score of the paragraph text.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202111514421.2A 2021-12-13 2021-12-13 Information processing method and device Active CN114201607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111514421.2A CN114201607B (en) 2021-12-13 2021-12-13 Information processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111514421.2A CN114201607B (en) 2021-12-13 2021-12-13 Information processing method and device

Publications (2)

Publication Number Publication Date
CN114201607A CN114201607A (en) 2022-03-18
CN114201607B 2023-01-03

Family

ID=80652692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111514421.2A Active CN114201607B (en) 2021-12-13 2021-12-13 Information processing method and device

Country Status (1)

Country Link
CN (1) CN114201607B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919689A (en) * 2017-03-03 2017-07-04 中国科学技术信息研究所 Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge
US20180225278A1 (en) * 2017-02-06 2018-08-09 International Business Machines Corporation Disambiguation of the meaning of terms based on context pattern detection
CN113344121A (en) * 2021-06-29 2021-09-03 北京百度网讯科技有限公司 Method for training signboard classification model and signboard classification
CN113688242A (en) * 2021-08-31 2021-11-23 上海基绪康生物科技有限公司 Method for classifying medical terms through text classification of network search results

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180225278A1 (en) * 2017-02-06 2018-08-09 International Business Machines Corporation Disambiguation of the meaning of terms based on context pattern detection
CN106919689A (en) * 2017-03-03 2017-07-04 中国科学技术信息研究所 Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge
CN113344121A (en) * 2021-06-29 2021-09-03 北京百度网讯科技有限公司 Method for training signboard classification model and signboard classification
CN113688242A (en) * 2021-08-31 2021-11-23 上海基绪康生物科技有限公司 Method for classifying medical terms through text classification of network search results

Also Published As

Publication number Publication date
CN114201607A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
US20220318275A1 (en) Search method, electronic device and storage medium
CN113204621B (en) Document warehouse-in and document retrieval method, device, equipment and storage medium
CN114595686B (en) Knowledge extraction method, and training method and device of knowledge extraction model
CN112925883B (en) Search request processing method and device, electronic equipment and readable storage medium
JP7369228B2 (en) Method, device, electronic device, and storage medium for generating images of user interest
CN112560461A (en) News clue generation method and device, electronic equipment and storage medium
WO2023040230A1 (en) Data evaluation method and apparatus, training method and apparatus, and electronic device and storage medium
CN113806483B (en) Data processing method, device, electronic equipment and computer program product
US20220129634A1 (en) Method and apparatus for constructing event library, electronic device and computer readable medium
CN113139043B (en) Question-answer sample generation method and device, electronic equipment and storage medium
CN114201607B (en) Information processing method and device
CN114780821A (en) Text processing method, device, equipment, storage medium and program product
CN113408280A (en) Negative example construction method, device, equipment and storage medium
CN114118049A (en) Information acquisition method and device, electronic equipment and storage medium
CN112784600A (en) Information sorting method and device, electronic equipment and storage medium
CN112528644A (en) Entity mounting method, device, equipment and storage medium
US20240070188A1 (en) System and method for searching media or data based on contextual weighted keywords
CN113656592B (en) Data processing method and device based on knowledge graph, electronic equipment and medium
CN115129816B (en) Question-answer matching model training method and device and electronic equipment
CN114238663B (en) Knowledge graph analysis method and system for material data, electronic device and medium
CN113971216B (en) Data processing method and device, electronic equipment and memory
CN117493704A (en) User credibility calculation method and device, electronic equipment and medium
CN117093601A (en) Recall method, device, equipment and medium for structured data
CN113850084A (en) Entity linking method and device, electronic equipment and storage medium
CN117609418A (en) Document processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant