CN114996458A - Text processing method and device, equipment and medium - Google Patents
Text processing method and device, equipment and medium
- Publication number
- CN114996458A CN114996458A CN202210739788.2A CN202210739788A CN114996458A CN 114996458 A CN114996458 A CN 114996458A CN 202210739788 A CN202210739788 A CN 202210739788A CN 114996458 A CN114996458 A CN 114996458A
- Authority
- CN
- China
- Prior art keywords
- text
- preliminary
- target
- word
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the application provides a text processing method, apparatus, device, and storage medium, belonging to the technical field of artificial intelligence. The method comprises the following steps: performing text extraction on an original corpus to extract a preliminary abstract text that best expresses the meaning of the original corpus, and dividing the preliminary abstract text according to a preset segment text length to obtain target segmented texts. The target segmented texts are filtered to determine target candidate phrases, semantic parsing is performed on the target candidate phrases, the importance of each word segment is determined according to its word type, and the word segments are linked, according to their word types, into a preset hypernym-hyponym knowledge graph to obtain a target hypernym-hyponym knowledge graph. From the target graph, the hypernym-hyponym information of each word segment and the relations among the word segments can be obtained easily, so the text processing method of the embodiment of the application improves the efficiency of information acquisition.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text processing method, apparatus, device, and medium.
Background
At present, news media, public accounts, news broadcasters, and the like generate a large volume of raw corpora every day, including but not limited to news reports, commentary and forecasts, and analysis pieces. The texts in these corpora are often very long, their content is complex, and their viewpoints differ, so it is difficult to obtain useful information from them directly. How to provide a text processing method that improves the efficiency of information acquisition has therefore become an urgent technical problem.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a text processing method, apparatus, device, and medium that can improve the efficiency of information acquisition.
In order to achieve the above object, a first aspect of an embodiment of the present application provides a text processing method, where the method includes:
acquiring an original corpus;
performing text extraction processing on the original corpus to obtain a primary abstract text;
dividing the preliminary abstract text into a plurality of preliminary segmented texts according to the preset segmented text length;
filtering the plurality of preliminary segmented texts to obtain a plurality of preliminary candidate phrases;
calculating the similarity value of each preliminary candidate phrase and the original corpus, and taking the preliminary candidate phrases meeting the preset similarity condition as target candidate phrases;
performing semantic parsing on the target candidate phrases to obtain a plurality of word segments and the word type of each word segment;
and linking the plurality of word segments to a preset hypernym-hyponym knowledge graph according to their word types, to obtain a target hypernym-hyponym knowledge graph.
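By way of a non-limiting illustration, the steps above can be sketched end to end as follows (graph linking omitted). Every helper here is a placeholder assumption; the description later names PEGASUS, a BERT quality model, and SIFRank for the actual extraction, filtering, and similarity steps.

```python
# Non-limiting end-to-end sketch of steps S101-S107 (graph linking omitted).
# Every helper is a placeholder stand-in, not the claimed implementation.

def extract_abstract(corpus: str, max_len: int = 512) -> str:
    # Stand-in for PEGASUS/LCS extraction: keep the leading text only.
    return corpus[:max_len]

def split_segments(text: str, seg_len: int) -> list:
    # Step S103: divide the abstract into fixed-length preliminary segments.
    return [text[i:i + seg_len] for i in range(0, len(text), seg_len)]

def filter_segments(segments: list) -> list:
    # Step S104 stand-in: drop fragments too short to carry meaning.
    return [s for s in segments if len(s.strip()) >= 4]

def similarity(a: str, b: str) -> float:
    # Step S105 stand-in: character-overlap ratio instead of semantics.
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def process(corpus: str, seg_len: int = 10, threshold: float = 0.2) -> list:
    abstract = extract_abstract(corpus)                              # S101-S102
    candidates = filter_segments(split_segments(abstract, seg_len))  # S103-S104
    return [c for c in candidates if similarity(c, corpus) >= threshold]  # S105
```

Each placeholder would be swapped for the corresponding model in a real pipeline; only the control flow mirrors the claimed steps.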
In some embodiments, said filtering a plurality of said preliminary segmented texts to obtain a plurality of preliminary candidate phrases comprises:
acquiring a reference scene classification of the original corpus;
predicting the plurality of preliminary segmented texts through a preset classification prediction model to obtain a prediction scene classification of each preliminary segmented text;
screening a plurality of preliminary segmented texts according to the matching relation between the reference scene classification and the prediction scene classification;
and determining a plurality of preliminary candidate phrases according to the screened plurality of preliminary segmented texts.
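A minimal sketch of this classification-based filter follows, with a keyword-lookup stand-in for the preset classification prediction model; the scene names and keywords are assumptions for illustration only.

```python
# Keyword-lookup stand-in for the preset classification prediction model.
SCENE_KEYWORDS = {
    "finance": ["stock", "market", "fund"],
    "travel": ["flight", "hotel", "visa"],
}

def predict_scene(segment: str) -> str:
    # Toy prediction: first scene whose keyword appears in the segment.
    for scene, words in SCENE_KEYWORDS.items():
        if any(w in segment for w in words):
            return scene
    return "other"

def filter_by_scene(segments, reference_scene):
    # Keep only segments whose predicted scene matches the reference scene.
    return [s for s in segments if predict_scene(s) == reference_scene]
```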
In some embodiments, said filtering a plurality of said preliminary segmented texts to obtain a plurality of preliminary candidate phrases comprises:
acquiring a reference scene classification of the original corpus;
determining a keyword set according to the reference scene classification, wherein the keyword set comprises a plurality of keywords;
and calculating a word matching value of each preliminary segmented text and a plurality of keywords, and obtaining a plurality of preliminary candidate phrases according to the plurality of preliminary segmented texts meeting preset matching conditions.
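The keyword-matching variant can be sketched as follows; the matching value is taken here to be simply the count of scene keywords found in a segment, and the threshold is an assumption.

```python
def word_match_value(segment: str, keywords) -> int:
    # Word matching value: number of scene keywords occurring in the segment.
    tokens = set(segment.lower().split())
    return sum(1 for k in keywords if k in tokens)

def filter_by_keywords(segments, keywords, min_matches=1):
    # Keep segments whose matching value meets the preset matching condition.
    return [s for s in segments if word_match_value(s, keywords) >= min_matches]
```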
In some embodiments, said filtering a plurality of said preliminary segmented texts to obtain a plurality of preliminary candidate phrases comprises:
filtering the plurality of preliminary segmented texts through a preset BERT-based quality model to obtain a plurality of target segmented texts;
calculating the text similarity between each target segmented text and the original corpus through a preset SIFRank semantic model;
and screening the target segmented texts according to the text similarity to obtain the plurality of preliminary candidate phrases.
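A toy version of the similarity-based screening follows, using bag-of-words cosine similarity as a stand-in for SIFRank sentence embeddings (the real model weights words and uses pre-trained embeddings; this sketch does neither).

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    # Bag-of-words cosine similarity, a toy stand-in for SIFRank embeddings.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_by_similarity(segments, corpus, k=2):
    # Screen segments: keep the k most similar to the original corpus.
    ranked = sorted(segments, key=lambda s: cosine_sim(s, corpus), reverse=True)
    return ranked[:k]
```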
In some embodiments, the linking of the plurality of word segments to a preset hypernym-hyponym knowledge graph according to their word types, to obtain a target hypernym-hyponym knowledge graph, includes:
acquiring a reference scene classification of the original corpus;
determining a scene node type according to the reference scene classification;
if the nodes of the hypernym-hyponym knowledge graph do not include the scene node type, determining a node to be processed from the graph according to its relatedness to the scene node type;
creating a scene node of the scene node type under the node to be processed to obtain a preliminary hypernym-hyponym knowledge graph;
and linking the plurality of word segments to the preliminary hypernym-hyponym knowledge graph according to their word types to obtain the target hypernym-hyponym knowledge graph.
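A sketch of the node-creation step under stated assumptions: the graph is modeled as a simple parent-to-children mapping, and the relatedness scores between the scene type and existing nodes are supplied externally rather than computed.

```python
# Illustrative graph structure; not the claimed graph implementation.
class TaxonomyGraph:
    def __init__(self):
        self.children = {"root": []}  # node name -> list of child node names

    def has_node(self, name):
        return name in self.children

    def add_node(self, name, parent):
        self.children.setdefault(parent, []).append(name)
        self.children[name] = []

def ensure_scene_node(graph, scene, relatedness):
    # relatedness: existing node -> score against the scene node type.
    if graph.has_node(scene):
        return scene
    # The most related existing node plays the role of the "node to be
    # processed"; the new scene node is created beneath it.
    parent = max(relatedness, key=relatedness.get)
    graph.add_node(scene, parent)
    return scene
```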
In some embodiments, the linking of the plurality of word segments to a preset hypernym-hyponym knowledge graph according to their word types, to obtain a target hypernym-hyponym knowledge graph, includes:
constructing a preliminary word-segment set from the plurality of word segments;
if the word type of a word segment does not belong to a preset word-type set, deleting that word segment from the preliminary word-segment set to obtain a target word-segment set comprising a plurality of target word segments, wherein the word-type set comprises a region type, an action type, and a quantity type;
and linking the target word segments to the preset hypernym-hyponym knowledge graph according to their word types to obtain the target hypernym-hyponym knowledge graph.
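This type-based filtering reduces to a set-membership test; the type labels below are assumptions mirroring the region, action, and quantity types named above.

```python
# Preset word-type set (labels are illustrative assumptions).
ALLOWED_TYPES = {"region", "action", "quantity"}

def filter_tokens(tokens):
    # tokens: list of (word, word_type) pairs. Word segments whose type
    # falls outside the preset set are dropped before graph linking.
    return [(w, t) for w, t in tokens if t in ALLOWED_TYPES]
```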
In some embodiments, the performing text extraction processing on the original corpus to obtain a preliminary abstract text includes:
performing prediction processing on the original corpus through a preset event classification prediction model to obtain at least two corpus event types;
calculating the matching degree of each corpus event type and a preset key event to obtain an event matching value;
determining a target event type according to the event matching value and at least two corpus event types;
text interception is carried out on the original corpus according to a preset intercepted text length to obtain at least two original abstract texts;
and determining the preliminary abstract text according to the matching degree of the original abstract text and the target event type.
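These steps can be sketched as follows; the Jaccard overlap measure and the fixed-length windows are illustrative assumptions, not the preset event classification prediction model of the claims.

```python
# Illustrative sketch of event-type-guided abstract selection.

def overlap(a: str, b: str) -> float:
    # Jaccard word overlap between two strings.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(len(sa | sb), 1)

def best_event_type(predicted_types, key_events):
    # Event matching value = best overlap with any preset key event;
    # the corpus event type with the highest value becomes the target.
    return max(predicted_types,
               key=lambda t: max((overlap(t, k) for k in key_events),
                                 default=0.0))

def pick_abstract(corpus, cut_len, target_event):
    # Cut the corpus into fixed-length windows ("original abstract texts")
    # and keep the window that best matches the target event type.
    windows = [corpus[i:i + cut_len] for i in range(0, len(corpus), cut_len)]
    return max(windows, key=lambda w: overlap(w, target_event))
```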
To achieve the above object, a second aspect of an embodiment of the present application proposes a text processing apparatus, including:
the acquisition module is used for acquiring the original corpus;
the abstract text extraction module is used for extracting texts from the original corpus to obtain a primary abstract text;
the text segmentation module is used for dividing the preliminary abstract text into a plurality of preliminary segmented texts according to the preset segmented text length;
a preliminary candidate phrase generating module, configured to filter the multiple preliminary segmented texts to obtain multiple preliminary candidate phrases;
the target candidate phrase generating module is used for calculating the similarity value of each preliminary candidate phrase and the original corpus, and taking the preliminary candidate phrases meeting the preset similarity condition as target candidate phrases;
the semantic parsing module is used for performing semantic parsing on the target candidate phrases to obtain a plurality of word segments and their word types;
and the word linking module is used for linking the plurality of word segments to a preset hypernym-hyponym knowledge graph according to their word types, to obtain a target hypernym-hyponym knowledge graph.
To achieve the above object, a third aspect of the embodiments of the present application provides a computer device, which includes a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection communication between the processor and the memory, wherein the program, when executed by the processor, implements the method of the first aspect.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium for computer-readable storage, and stores one or more programs, which are executable by one or more processors to implement the method of the first aspect.
According to the text processing method, apparatus, device, and medium of the embodiments of the application, text extraction is performed on the original corpus to extract a preliminary abstract text that best expresses its meaning, and the preliminary abstract text is divided according to a preset segment text length to obtain target segmented texts. The target segmented texts are filtered to determine target candidate phrases, semantic parsing is performed on the target candidate phrases, the importance of each word segment is determined according to its word type, and the word segments are linked to a preset hypernym-hyponym knowledge graph according to their word types to obtain a target hypernym-hyponym knowledge graph. From the target graph, the hypernym-hyponym information of each word segment and the relations among the word segments can be obtained easily, so the text processing method of the embodiments of the application improves the efficiency of information acquisition.
Drawings
Fig. 1 is a flowchart of a text processing method provided in an embodiment of the present application;
FIG. 2 is a flowchart of step S102 in FIG. 1;
FIG. 3 is a flowchart of step S104 in FIG. 1;
FIG. 4 is a flowchart of step S104 in FIG. 1;
FIG. 5 is a flowchart of step S104 in FIG. 1;
fig. 6 is a flowchart of step S107 in fig. 1;
fig. 7 is a flowchart of step S107 in fig. 1;
fig. 8 is a block diagram of the module structure of a text processing apparatus according to an embodiment of the present application;
fig. 9 is a hardware structure diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
First, several terms used in the present application are explained:
artificial Intelligence (AI): the method is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems, among others. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computers to process, understand, and apply human languages (such as Chinese or English); it is a branch of artificial intelligence at the intersection of computer science and linguistics, often called computational linguistics. Natural language processing includes syntactic parsing, semantic analysis, and discourse understanding, among others. It is commonly used in machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, information image processing, information extraction and filtering, text classification and clustering, and public-opinion analysis and opinion mining, and it draws on data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to language computation.
Information Extraction (IE): a text processing technique that extracts specified types of factual information (entities, relations, events, and the like) from natural-language text and outputs structured data. Information extraction pulls specific information out of text data. Text data is composed of specific units such as sentences, paragraphs, and chapters, and text information is composed of smaller units such as words, phrases, sentences, and paragraphs, or combinations of these units. Extracting noun phrases, person names, place names, and the like from text data is text information extraction, and the extracted information can of course be of many types.
Corpus: i.e., linguistic material, the basic unit of which a corpus is composed; typically a collection of text resources of some number and size. A corpus can be large or small, from around a hundred sentences up to tens of millions of sentences or more. In short, text stands in for language use, and the context within the text stands in for the context of language in the real world. A collection of texts may be called a corpus, and several such collections together form a corpus collection. The Internet itself is a huge and complex corpus. Corpora can be classified by different criteria; for example, a corpus may be monolingual or multilingual.
PEGASUS model: PEGASUS is a pre-trained model tailored to abstractive summarization that can also serve as a general generative pre-training task. It is a standard Transformer with both an encoder and a decoder, and its pre-training objectives include GSG (Gap Sentences Generation) and MLM (Masked Language Model).
Longest Common Subsequence (LCS): from two given sequences X and Y, take out as many elements as possible while preserving the order they have in the original sequences. Subsequence: a subsequence of a given sequence is the result of removing zero or more elements from it without changing the relative order of the remaining elements. Common subsequence: given sequences X and Y, a sequence Z that is a subsequence of both X and Y is called a common subsequence of X and Y. For example, if X = [A, B, C, B, D, A, B] and Y = [B, D, C, A, B, A], then Z = [B, C, A] is a common subsequence of X and Y with length 3. But Z is not a longest common subsequence: the sequences [B, C, B, A] and [B, D, A, B] are common subsequences of length 4, and X and Y have no common subsequence of length 5 or greater. The only common subsequence of [A, B, C] and [E, F, G] is the empty sequence []. Longest common subsequence: among all common subsequences of given sequences X and Y, the one or more of greatest length.
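The LCS length in the example above can be checked with the classic dynamic-programming recurrence:

```python
def lcs_length(x, y):
    # dp[i][j] = length of the LCS of x[:i] and y[:j].
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

For X = [A, B, C, B, D, A, B] and Y = [B, D, C, A, B, A] this returns 4, matching the example.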
BERT model (Bidirectional Encoder Representations from Transformers): a deep-learning model based on the Transformer encoder. After pre-training on unlabelled training data, the BERT model needs only a small amount of task-specific sample data to handle a given downstream task, which makes it well suited to natural language processing (NLP) and related fields.
At present, news media, public accounts, news broadcasters, and the like generate a large volume of raw corpora every day, including but not limited to news reports, commentary and forecasts, and analysis pieces. The texts in these corpora are often very long, their content is complex, and their viewpoints differ, so it is difficult to obtain useful information from them directly. How to provide a text processing method that improves the efficiency of information acquisition has therefore become an urgent technical problem.
Based on this, the embodiments of the present application provide a text processing method, apparatus, device, and medium that extract, through text extraction on the original corpus, a preliminary abstract text that best expresses the corpus's meaning; perform semantic parsing on it; determine the importance of each word segment according to its word type; and link the word segments, by word type, into a preset hypernym-hyponym knowledge graph to obtain a target hypernym-hyponym knowledge graph. From the target graph, the hypernym-hyponym information of each word segment and the relations among the word segments can be obtained easily, so the text processing method of the embodiments of the application improves the efficiency of information acquisition.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the application provides a text processing method, and relates to the technical field of artificial intelligence. The text processing method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, or the like; the server side can be configured into an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and cloud servers for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN (content delivery network) and big data and artificial intelligence platforms; the software may be an application or the like that implements a text processing method, but is not limited to the above form.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Embodiments of the present application provide a text processing method and apparatus, a device, and a medium, which are specifically described in the following embodiments, and first describe a text processing method in the embodiments of the present application.
Fig. 1 is an optional flowchart of a text processing method provided in an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, steps S101 to S107.
Step S101, obtaining an original corpus;
step S102, performing text extraction processing on the original corpus to obtain a preliminary abstract text;
step S103, dividing the preliminary abstract text into a plurality of preliminary segment texts according to the preset segment text length;
step S104, filtering the plurality of preliminary segmented texts to obtain a plurality of preliminary candidate phrases;
step S105, calculating the similarity value of each preliminary candidate phrase and the original corpus, and taking the preliminary candidate phrases meeting the preset similarity condition as target candidate phrases;
step S106, performing semantic parsing on the target candidate phrases to obtain a plurality of word segments and the word type of each word segment;
step S107, linking the plurality of word segments to a preset hypernym-hyponym knowledge graph according to their word types, to obtain a target hypernym-hyponym knowledge graph.
In steps S101 to S107 illustrated in this embodiment of the present application, text extraction is performed on the original corpus to extract a preliminary abstract text that best expresses its meaning, and the preliminary abstract text is then divided according to a preset segment text length to obtain target segmented texts. The target segmented texts are filtered to determine target candidate phrases. Semantic parsing is performed on the target candidate phrases, the importance of each word segment is determined according to its word type, and the word segments are linked to a preset hypernym-hyponym knowledge graph according to their word types to obtain a target hypernym-hyponym knowledge graph. From the target graph, the hypernym-hyponym information of each word segment and the relations among the word segments can be obtained easily, so the text processing method of this embodiment improves the efficiency of information acquisition.
In step S101 of some embodiments, the original corpus may be news data, including news titles and news content, or article data, including article titles and article content. Further, in this embodiment the original corpus can be crawled from web pages through a crawler tool; optionally, the crawler tool is built on Node.js. In detail, obtaining the original corpus through the crawler tool includes: using the crawler to collect the Uniform Resource Locator (URL) address of the original corpus to be obtained, performing character identification on the corpus to be obtained, loading the system interface corresponding to that corpus according to the URL address, and obtaining the corresponding original corpus from the system interface according to the character identification.
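An offline sketch of the crawling step: collect article URLs from a listing page, then pull the title and body out of an article page. The HTML shapes, tag names, and URL pattern are hypothetical; a production crawler (the description mentions a Node.js-based tool) would fetch these pages over HTTP rather than receive strings.

```python
import re

def collect_urls(listing_html):
    # Gather candidate article URLs from a listing page (pattern assumed).
    return re.findall(r'href="(https?://[^"]+)"', listing_html)

def extract_article(article_html):
    # Character identification stand-in: pull title and content by tag.
    title = re.search(r"<h1>(.*?)</h1>", article_html, re.S)
    body = re.search(r"<article>(.*?)</article>", article_html, re.S)
    return {
        "title": title.group(1).strip() if title else "",
        "content": body.group(1).strip() if body else "",
    }
```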
It can be understood that the crawled original corpus contains a large number of original titles, and the original content under each title varies greatly in length, which complicates subsequent semantic understanding.
In step S102 of some embodiments, text extraction is performed on the original corpus to obtain a preliminary abstract text. Specifically, a preset character-length threshold is obtained, and text extraction is performed on the original corpus through a PEGASUS model or an LCS algorithm to obtain the preliminary abstract text, whose character length is below the threshold. Pre-processing the original corpus in this way extracts a preliminary abstract text that fits the character-length threshold and improves the efficiency of subsequent semantic understanding. The preset threshold may be 512 characters or another value; the embodiments of the present application do not specifically limit it.
Specifically, referring to fig. 2, in some embodiments, step S102 includes, but is not limited to, step S201 to step S205:
step S201, performing prediction processing on an original corpus through a preset event classification prediction model to obtain at least two corpus event types;
step S202, calculating the matching degree of each corpus event type and a preset key event to obtain an event matching value;
step S203, determining a target event type according to the event matching value and the at least two corpus event types;
step S204, performing text interception on the original corpus according to a preset intercepted text length to obtain at least two original abstract texts;
and step S205, determining a preliminary abstract text according to the matching degree of the original abstract text and the target event type.
In steps S201 to S205 illustrated in the embodiment of the present application, the original corpus is predicted to determine the corpus event types of the original corpus. It should be noted that the event type can often represent the type of information the user most wants to obtain; for example, under the travel event type, insurance is most likely the information the user most wants. Therefore, the target event type is determined from the corpus event types, the original abstract texts are screened according to the target event type, and the preliminary abstract text that best matches the target event type is obtained, so that the information the user most wants can subsequently be obtained from the preliminary abstract text.
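Steps S202 to S205 can be sketched as below. The character-overlap scoring functions are hypothetical stand-ins for the event classification prediction model's matching-degree calculation:

```python
def choose_target_event_type(event_types, key_events):
    """Steps S202/S203: score each predicted corpus event type against the
    preset key events (here: character-set overlap) and keep the best match."""
    def match_value(event_type):
        return max(len(set(event_type) & set(k)) / len(set(k)) for k in key_events)
    return max(event_types, key=match_value)

def choose_preliminary_abstract(snippets, target_event_type):
    """Steps S204/S205: keep the fixed-length snippet that best matches
    the target event type (shared-character ratio as a stand-in)."""
    def degree(snippet):
        return len(set(snippet) & set(target_event_type)) / len(set(target_event_type))
    return max(snippets, key=degree)

event_types = ["travel", "sports"]
target = choose_target_event_type(event_types, key_events=["travel insurance"])
snippets = ["match results and league table", "travel insurance covers trip delays"]
print(target, "->", choose_preliminary_abstract(snippets, target))
```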
In step S103 of some embodiments, the preliminary abstract text is divided into a plurality of preliminary segmented texts according to the preset segmented text length. It should be noted that, because the preliminary abstract text contains a plurality of sentences and the meaning expressed by each sentence may differ, processing the sentences individually better reflects the central meaning of the original corpus. Accordingly, the preliminary abstract text is divided into a plurality of preliminary segmented texts by the segmented text length.
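A minimal sketch of the fixed-length division in step S103:

```python
def split_into_segments(abstract: str, segment_length: int = 10) -> list:
    """Step S103: divide the preliminary abstract text into fixed-length
    preliminary segmented texts (the last segment may be shorter)."""
    return [abstract[i:i + segment_length] for i in range(0, len(abstract), segment_length)]

segments = split_into_segments("black rice blended with apple", segment_length=10)
print(segments)  # ['black rice', ' blended w', 'ith apple']
```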
In step S104 of some embodiments, the plurality of preliminary segmented texts can be filtered according to their semantics, and the preliminary candidate phrases that best conform to the semantics of the original corpus are selected. The specific value of the segmented text length can be set according to actual requirements; for example, with ten characters as the segmented text length, the preliminary abstract text is segmented, preliminary segmented texts with incoherent semantics are deleted to obtain a plurality of semantically complete preliminary segmented texts, and the preliminary candidate phrases are then determined. It should be noted that a complete preliminary segmented text may be used directly as a preliminary candidate phrase, or keyword extraction may be performed on the preliminary segmented text to obtain a candidate phrase.
Specifically, referring to fig. 3, in some embodiments, step S104 includes, but is not limited to, step S301 to step S304:
step S301, acquiring a reference scene classification of an original corpus;
step S302, a plurality of preliminary segmented texts are predicted through a preset classification prediction model, so that the prediction scene classification of each preliminary segmented text is obtained;
step S303, screening a plurality of preliminary segmented texts according to the matching relation between the reference scene classification and the prediction scene classification;
step S304, determining a plurality of preliminary candidate phrases according to the screened plurality of preliminary segmented texts.
In steps S301 to S304 illustrated in the embodiment of the present application, the preliminary segmented text conforming to the reference scene classification of the original corpus is determined by determining the matching relationship between the reference scene classification of the original corpus and the predicted scene classification of each preliminary segmented text, so as to further obtain a preliminary candidate phrase. The embodiment of the application aims to keep the scene classification of the candidate phrases consistent with the scene classification of the original corpus, so that the scene classification of information is not changed when information is acquired through the candidate phrases subsequently, and the accuracy of information acquisition is improved.
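Steps S301 to S303 can be sketched as follows; `toy_predictor` is a hypothetical stand-in for the preset classification prediction model:

```python
def filter_by_scene(segments, reference_scene, predict_scene):
    """Steps S301-S303: keep only segments whose predicted scene
    classification matches the reference scene of the original corpus."""
    return [s for s in segments if predict_scene(s) == reference_scene]

# Hypothetical stand-in for the preset classification prediction model.
def toy_predictor(segment: str) -> str:
    return "diet" if "rice" in segment or "apple" in segment else "other"

segments = ["black rice paste", "league results", "apple breakfast"]
kept = filter_by_scene(segments, "diet", toy_predictor)
print(kept)
```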
Specifically, referring to fig. 4, in other embodiments, step S104 includes, but is not limited to, step S401 to step S403:
step S401, obtaining a reference scene classification of an original corpus;
step S402, determining a keyword set according to the reference scene classification, wherein the keyword set comprises a plurality of keywords;
step S403, calculating a word matching value between each preliminary segmented text and the plurality of keywords, and obtaining a plurality of preliminary candidate phrases according to the plurality of preliminary segmented texts meeting the preset matching conditions.
In steps S401 to S403 illustrated in the embodiment of the present application, the keyword set is determined through the reference scene classification, and if a preliminary segmented text includes a keyword, or includes a word whose matching value with a keyword meets the matching condition, a preliminary candidate phrase is determined according to that preliminary segmented text. The embodiment of the application aims to keep the scene classification of the candidate phrases consistent with the scene classification of the original corpus, so that the scene classification of information is not changed when information is subsequently acquired through the candidate phrases, improving the accuracy of the acquired information.
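A minimal sketch of the keyword matching in step S403, assuming the matching value is the fraction of preset keywords found in a segment (the actual matching condition is a design choice):

```python
def keyword_match_value(segment: str, keywords: list) -> float:
    """Step S403 stand-in: fraction of preset keywords occurring in the segment."""
    words = set(segment.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)

def preliminary_candidates(segments, keywords, threshold=0.5):
    """Keep segments whose word matching value meets the preset condition."""
    return [s for s in segments if keyword_match_value(s, keywords) >= threshold]

keywords = ["insurance", "travel"]
segments = ["travel insurance covers delays", "sports news today"]
result = preliminary_candidates(segments, keywords)
print(result)
```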
Specifically, referring to fig. 5, in other embodiments, step S104 includes, but is not limited to, step S501 to step S503:
step S501, filtering the plurality of preliminary segmented texts through a preset BERT quality model to obtain a plurality of target segmented texts;
step S502, calculating the text similarity between each target segmented text and the original corpus through a preset SIFRank semantic model;
step S503, screening the plurality of target segmented texts according to the text similarity to obtain a plurality of candidate phrases.
In steps S501 to S503 illustrated in the embodiment of the present application, a first filtering is performed through the BERT quality model, and a second filtering is then performed on the obtained target segmented texts through the SIFRank semantic model to obtain the candidate phrases. The embodiment of the application aims to ensure the semantic information and semantic quality of the candidate phrases through the BERT quality model and the SIFRank semantic model, and to keep the scene classification of the candidate phrases consistent with that of the original corpus, so that the scene classification of information is not changed when information is subsequently acquired through the candidate phrases, improving the accuracy of information acquisition. It can be understood that, based on a refined phrase quality model (the BERT quality model), feature information of each dimension of the words in each preliminary segmented text is obtained, and a quality score is determined according to the feature information, so that the preliminary segmented texts are screened according to the quality score to obtain the target segmented texts.
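The two-stage filtering can be sketched as below. The quality heuristic and the Jaccard word overlap are deliberately simple stand-ins for the BERT quality model and the SIFRank semantic model, respectively:

```python
def quality_score(segment: str) -> float:
    """Stage 1 stand-in for the BERT phrase-quality model: penalise
    very short fragments and fragments made mostly of stop words."""
    stops = {"the", "a", "of", "and", "to"}
    words = segment.lower().split()
    if len(words) < 2:
        return 0.0
    return sum(1 for w in words if w not in stops) / len(words)

def similarity(a: str, b: str) -> float:
    """Stage 2 stand-in for the SIFRank semantic model: Jaccard word overlap."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

corpus = "travel insurance covers trip delays and lost luggage"
segments = ["the", "insurance covers delays", "weather report tomorrow"]
targets = [s for s in segments if quality_score(s) > 0.5]          # first filtering
candidates = [s for s in targets if similarity(s, corpus) > 0.3]   # second filtering
print(candidates)
```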
In step S105 of some embodiments, the similarity value between each preliminary candidate phrase and the original corpus is calculated, and the preliminary candidate phrases meeting a preset similarity condition are taken as the target candidate phrases. To stay as close as possible to the semantics of the original corpus, the similarity values of the preliminary candidate phrases and the original corpus can be calculated through an existing semantic similarity model, with the aim of screening the plurality of preliminary candidate phrases to obtain target candidate phrases meeting the preset similarity condition.
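A minimal sketch of the similarity screening in step S105, using character-level sequence similarity from the standard library as a stand-in for a semantic similarity model (the 0.4 threshold is an assumed similarity condition):

```python
from difflib import SequenceMatcher

def target_candidate_phrases(candidates, corpus, threshold=0.4):
    """Step S105 sketch: character-level similarity (a stand-in for a
    semantic similarity model) screens the preliminary candidate phrases."""
    def sim(phrase):
        return SequenceMatcher(None, phrase, corpus).ratio()
    return [p for p in candidates if sim(p) >= threshold]

corpus = "black rice blended with apple is more nutritious"
candidates = ["black rice with apple", "stock market update"]
selected = target_candidate_phrases(candidates, corpus)
print(selected)
```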
In step S106 of some embodiments, semantic parsing processing is performed on the target candidate phrase to obtain a plurality of word segments and the word type of each word segment. Semantic parsing processing refers to segmenting the words of a text and labeling the category of each word. For example, the target candidate phrase may be "black rice blended with apple into a paste is more nutritious", and the semantic parsing processing can be performed on the target candidate phrase through an existing word segmentation model. After the semantic parsing processing is performed on the target candidate phrase, the result shown in Table 1 is obtained.
Table 1:
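Since Table 1 is not reproduced here, the following sketch illustrates what semantic parsing produces: word segments paired with word types. The lexicon and the greedy longest-match segmentation are hypothetical stand-ins for a trained word segmentation model:

```python
# Hypothetical word-type lexicon; a real system would use a trained
# word segmentation / word-type labeling model instead.
WORD_TYPES = {
    "black rice": "diet",
    "apple": "diet",
    "blended": "action",
    "more nutritious": "modifier",
}

def semantic_parse(phrase: str):
    """Greedy longest-match segmentation plus word-type lookup;
    characters not covered by the lexicon are skipped."""
    result, i = [], 0
    while i < len(phrase):
        for j in range(len(phrase), i, -1):
            token = phrase[i:j].strip()
            if token in WORD_TYPES:
                result.append((token, WORD_TYPES[token]))
                i = j
                break
        else:
            i += 1  # no lexicon entry starts here; advance one character
    return result

parsed = semantic_parse("black rice blended with apple more nutritious")
print(parsed)
```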
In step S107, the plurality of word segments are linked to a preset hypernym-hyponym (upper-lower) cognitive graph according to the word types of the word segments to obtain a target hypernym-hyponym cognitive graph. Specifically, the importance of a word segment can be determined according to its word type: for example, word types such as the modifier type are non-key word types, while the diet type and the scene event type can be configured as key word types, and the word segments of key word types are linked to the preset hypernym-hyponym cognitive graph to obtain the target hypernym-hyponym cognitive graph, so that information can be obtained clearly, simply and conveniently through the target graph, improving the efficiency of information acquisition.
In a specific example, the information recommendation capability can be improved through the target hypernym-hyponym cognitive graph. Specifically, a service to be processed is obtained; the predicted scene classification of the service to be processed is obtained through a preset classification model; the corresponding scene node is matched from the target cognitive graph through the predicted scene classification; and the target recommendation result is determined according to the parent scene nodes and child scene nodes corresponding to the scene node. In this way, the information recommendation capability is improved through the target hypernym-hyponym cognitive graph. It should be noted that determining a recommendation result according to the nodes includes the following steps: a graph embedding algorithm is run on the nodes in the target hypernym-hyponym cognitive graph to obtain an embedding (emb) representation of each node; a word segment is represented as the weighted sum of the embeddings of its related nodes (the related nodes include parent, child and sibling nodes), and the target recommendation result is determined using the distances between the embeddings.
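The embedding-based recommendation can be sketched as follows; the two-dimensional node embeddings and the weights are made-up values standing in for the output of a real graph embedding algorithm:

```python
import math

# Hypothetical node embeddings, as would be produced by a graph embedding
# algorithm run over the target hypernym-hyponym cognitive graph.
NODE_EMB = {
    "diet": [1.0, 0.0],
    "breakfast": [0.9, 0.3],
    "sports": [0.0, 1.0],
}

def word_embedding(related_nodes, weights):
    """A word segment is represented as the weighted sum of the embeddings
    of its related nodes (parent, child and sibling nodes)."""
    dim = len(next(iter(NODE_EMB.values())))
    emb = [0.0] * dim
    for node, w in zip(related_nodes, weights):
        for i, v in enumerate(NODE_EMB[node]):
            emb[i] += w * v
    return emb

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

porridge = word_embedding(["diet", "breakfast"], [0.6, 0.4])
# Recommend the node whose embedding is closest to the word-segment embedding.
best = min(NODE_EMB, key=lambda n: distance(NODE_EMB[n], porridge))
print(best)
```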
Specifically, referring to fig. 6, in other embodiments, step S107 includes, but is not limited to, step S601 to step S604:
step S601, acquiring a reference scene classification of an original corpus, and determining a scene node type according to the reference scene classification;
step S602, if the nodes of the hypernym-hyponym cognitive graph do not include the scene node type, determining a node to be processed from the hypernym-hyponym cognitive graph according to the relevance of the scene node type;
step S603, creating a scene node corresponding to the scene node type under the node to be processed to obtain a preliminary hypernym-hyponym cognitive graph;
step S604, linking the plurality of word segments to the preliminary hypernym-hyponym cognitive graph according to the word types of the word segments to obtain the target hypernym-hyponym cognitive graph.
In steps S601 to S604 illustrated in the embodiment of the application, a node to be processed is determined in the hypernym-hyponym cognitive graph according to the relevance of the scene node type, and the scene node is created under the node to be processed, that is, the scene node is used as a child node of the node to be processed, obtaining a preliminary hypernym-hyponym cognitive graph; the plurality of word segments are then linked to the preliminary graph according to their word types to obtain the target hypernym-hyponym cognitive graph. In this way, the corresponding node information is obtained through the scene node, and the node information of the parent node and sibling nodes can also be obtained, further improving the efficiency of information acquisition.
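Steps S602 to S604 can be sketched on a dictionary-based graph, where each key is a node and its value lists the child nodes (a simplification of the hypernym-hyponym cognitive graph):

```python
def link_to_graph(graph, scene_type, related_parent, word_segments):
    """Steps S602-S604 sketch: if the scene node type is missing, create it
    under the most relevant existing node (the node to be processed), then
    link the word segments beneath the new scene node."""
    if scene_type not in graph:
        graph[scene_type] = []                    # create the scene node ...
        graph[related_parent].append(scene_type)  # ... as a child of the node to be processed
    graph[scene_type].extend(word_segments)
    return graph

graph = {"food": ["fruit"], "fruit": []}
link_to_graph(graph, "diet", related_parent="food", word_segments=["black rice", "apple"])
print(graph["food"], graph["diet"])
```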
Specifically, referring to fig. 7, in other embodiments, step S107 includes, but is not limited to, steps S701 to S703:
step S701, constructing a preliminary word segmentation set according to the plurality of word segments;
step S702, if the word type of a word segment is determined not to belong to a preset word type set, deleting the word segment corresponding to that word type from the preliminary word segmentation set to obtain a target word segmentation set, wherein the target word segmentation set comprises a plurality of target word segments, and the word type set comprises a region type, an action type and a quantity word type;
step S703, linking the plurality of target word segments to a preset hypernym-hyponym cognitive graph according to the word types of the target word segments to obtain the target hypernym-hyponym cognitive graph.
In steps S701 to S703 illustrated in the embodiment of the present application, the importance of a word segment is determined according to its word type. If the word type of a word segment is determined not to belong to the preset word type set, for example, the word type is a modifier type, the word segment corresponding to that type is deleted from the preliminary word segmentation set to obtain the target word segmentation set. The word type set includes the region type, the action type, the quantity word type and the like, and can be configured according to actual requirements. Screening the word segments according to their word types reduces information redundancy and further improves the efficiency of information acquisition.
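A minimal sketch of step S702, using the region/action/quantity word-type set from the embodiment:

```python
def filter_word_segments(parsed, word_type_set=frozenset({"region", "action", "quantity"})):
    """Step S702 sketch: a word segment whose type does not belong to the
    preset word-type set is deleted from the preliminary set."""
    return [(w, t) for w, t in parsed if t in word_type_set]

parsed = [("Beijing", "region"), ("blended", "action"), ("more nutritious", "modifier")]
kept = filter_word_segments(parsed)
print(kept)  # the modifier-type segment is deleted
```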
Referring to fig. 8, a text processing apparatus is further provided in an embodiment of the present application, which can implement the foregoing text processing method. Fig. 8 is a block diagram of the module structure of the text processing apparatus provided in the embodiment of the present application, where the apparatus includes: an obtaining module 801, an abstract text extraction module 802, a text segmentation module 803, a preliminary candidate phrase generation module 804, a target candidate phrase generation module 805, a semantic parsing module 806 and a word linking module 807. The obtaining module 801 is configured to obtain an original corpus; the abstract text extraction module 802 is configured to perform text extraction processing on the original corpus to obtain a preliminary abstract text; the text segmentation module 803 is configured to divide the preliminary abstract text into a plurality of preliminary segmented texts according to a preset segmented text length; the preliminary candidate phrase generation module 804 is configured to filter the plurality of preliminary segmented texts to obtain a plurality of preliminary candidate phrases; the target candidate phrase generation module 805 is configured to calculate the similarity value between each preliminary candidate phrase and the original corpus, and take the preliminary candidate phrases meeting a preset similarity condition as target candidate phrases; the semantic parsing module 806 is configured to perform semantic parsing processing on the target candidate phrases to obtain a plurality of word segments and the word types of the word segments; and the word linking module 807 is configured to link the plurality of word segments to a preset hypernym-hyponym cognitive graph according to the word types of the word segments to obtain a target hypernym-hyponym cognitive graph.
Through text extraction processing on the original corpus, the text processing apparatus can extract the preliminary abstract text that best expresses the meaning of the original corpus, and then divides the preliminary abstract text according to the preset segmented text length to obtain a plurality of preliminary segmented texts. The preliminary segmented texts are filtered, and the target candidate phrases are determined by similarity screening. Semantic parsing processing is performed on the target candidate phrases, the importance of each word segment is determined according to its word type, and the word segments are linked to the preset hypernym-hyponym cognitive graph according to their word types to obtain the target hypernym-hyponym cognitive graph. Because the hypernym-hyponym information of each word segment and the relations among word segments can easily be obtained through the target graph, the text processing method of the embodiment of the application can improve the efficiency of information acquisition.
It should be noted that the specific implementation of the text processing apparatus is substantially the same as the specific implementation of the text processing method, and is not described herein again.
An embodiment of the present application further provides a computer device, including: a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the text processing method. The computer device can be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of a computer device according to another embodiment, where the computer device includes:
the processor 901 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided in the embodiments of the present application;
the Memory 902 may be implemented in the form of a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 902 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 902 and called by the processor 901 to execute the text processing method of the embodiments of the present application;
an input/output interface 903 for implementing information input and output;
a communication interface 904, configured to implement communication interaction between the device and another device, where communication may be implemented in a wired manner (e.g., USB, network cable, etc.), or in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.);
a bus 905 that transfers information between various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 enable a communication connection within the device with each other through a bus 905.
The embodiment of the application also provides a storage medium, which is a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the text processing method.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the text processing method and apparatus, the computer device and the storage medium, text extraction processing is performed on the original corpus to extract the preliminary abstract text that best expresses the meaning of the original corpus, and the preliminary abstract text is divided according to the preset segmented text length to obtain a plurality of preliminary segmented texts. The preliminary segmented texts are filtered, and the target candidate phrases are determined by similarity screening. Semantic parsing processing is performed on the target candidate phrases, the importance of each word segment is determined according to its word type, and the word segments are linked to the preset hypernym-hyponym cognitive graph according to their word types to obtain the target hypernym-hyponym cognitive graph. Since the hypernym-hyponym information of each word segment and the relations among word segments can be easily obtained through the target graph, the text processing method of the embodiment of the application can improve the efficiency of information acquisition.
The embodiments described in the embodiments of the present application are for more clearly illustrating the technical solutions of the embodiments of the present application, and do not constitute a limitation to the technical solutions provided in the embodiments of the present application, and it is obvious to those skilled in the art that the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems with the evolution of technology and the emergence of new application scenarios.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1-7 are not intended to limit the embodiments of the present application and may include more or fewer steps than those shown, or some of the steps may be combined, or different steps may be included.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that, in this application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing programs, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereto. Any modifications, equivalents, and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.
Claims (10)
1. A method of text processing, the method comprising:
acquiring an original corpus;
performing text extraction processing on the original corpus to obtain a primary abstract text;
dividing the preliminary abstract text into a plurality of preliminary segmented texts according to the preset segmented text length;
filtering the plurality of preliminary segmented texts to obtain a plurality of preliminary candidate phrases;
calculating the similarity value of each preliminary candidate phrase and the original corpus, and taking the preliminary candidate phrases meeting the preset similarity condition as target candidate phrases;
performing semantic parsing processing on the target candidate phrases to obtain a plurality of word segments and word types of the word segments;
and linking the plurality of word segments to a preset hypernym-hyponym cognitive graph according to the word types of the word segments to obtain a target hypernym-hyponym cognitive graph.
2. The method of claim 1, wherein filtering the plurality of preliminary segmented texts to obtain a plurality of preliminary candidate phrases comprises:
acquiring a reference scene classification of the original corpus;
predicting the plurality of preliminary segmented texts through a preset classification prediction model to obtain a prediction scene classification of each preliminary segmented text;
screening a plurality of preliminary segmented texts according to the matching relation between the reference scene classification and the prediction scene classification;
and determining a plurality of preliminary candidate phrases according to the screened plurality of preliminary segmented texts.
3. The method of claim 1, wherein filtering the plurality of preliminary segmented texts to obtain a plurality of preliminary candidate phrases comprises:
acquiring a reference scene classification of the original corpus;
determining a keyword set according to the reference scene classification, wherein the keyword set comprises a plurality of keywords;
and calculating a word matching value of each preliminary segmented text and a plurality of keywords, and obtaining a plurality of preliminary candidate phrases according to the plurality of preliminary segmented texts meeting preset matching conditions.
4. The method of claim 1, wherein filtering the plurality of preliminary segmented texts to obtain a plurality of preliminary candidate phrases comprises:
filtering the plurality of preliminary segmented texts through a preset BERT quality model to obtain a plurality of target segmented texts;
calculating the text similarity between each target segmented text and the original corpus through a preset SIFRank semantic model;
and screening the target segmented texts according to the text similarity to obtain a plurality of candidate phrases.
5. The method according to any one of claims 1 to 4, wherein the linking the plurality of word segments to a preset hypernym-hyponym cognitive graph according to the word types of the word segments to obtain a target hypernym-hyponym cognitive graph comprises:
acquiring a reference scene classification of the original corpus;
determining a scene node type according to the reference scene classification;
if the nodes of the hypernym-hyponym cognitive graph do not comprise the scene node type, determining a node to be processed from the hypernym-hyponym cognitive graph according to the relevance of the scene node type;
creating a scene node corresponding to the scene node type under the node to be processed to obtain a preliminary hypernym-hyponym cognitive graph;
and linking the plurality of word segments to the preliminary hypernym-hyponym cognitive graph according to the word types of the word segments to obtain the target hypernym-hyponym cognitive graph.
6. The method according to any one of claims 1 to 4, wherein the linking the plurality of word segments to a preset hypernym-hyponym cognitive graph according to the word types of the word segments to obtain a target hypernym-hyponym cognitive graph comprises:
constructing a preliminary word segmentation set according to the plurality of word segmentations;
if the word type of the participle does not belong to a preset word type set, deleting the participle corresponding to the word type of the participle from the preliminary participle set to obtain a target participle set, wherein the target participle set comprises a plurality of target participles; the word type set comprises a region type, an action type and a quantity word type;
and linking the target word segmentation to a preset upper cognitive map and a preset lower cognitive map according to the word type of the word segmentation so as to obtain the target upper cognitive map and the target lower cognitive map.
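Following claim 6 as literally stated (only word types inside the preset set survive the deletion step), the filtering reduces to a simple comprehension; the English type labels are illustrative stand-ins.

```python
# Preset word type set from claim 6: region, action, and quantifier types.
ALLOWED_WORD_TYPES = {"region", "action", "quantifier"}

def build_target_word_set(words_with_types: list[tuple[str, str]]) -> list[str]:
    """Delete any segmented word whose type falls outside the preset set,
    leaving the target segmented-word set to be linked into the graph."""
    return [word for word, word_type in words_with_types
            if word_type in ALLOWED_WORD_TYPES]

words = [("Shenzhen", "region"), ("purchase", "action"),
         ("apple", "noun"), ("three", "quantifier")]
print(build_target_word_set(words))  # the noun is deleted
```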
7. The method according to any one of claims 1 to 4, wherein the performing text extraction on the original corpus to obtain a preliminary abstract text comprises:
performing prediction processing on the original corpus through a preset event classification prediction model to obtain at least two corpus event types;
calculating a matching degree between each corpus event type and a preset key event to obtain an event matching value;
determining a target event type from the at least two corpus event types according to the event matching values;
performing text interception on the original corpus according to a preset intercepted text length to obtain at least two original abstract texts;
and determining the preliminary abstract text according to the matching degree between each original abstract text and the target event type.
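The steps of claim 7 can be sketched end to end; the word-overlap `match` function is an assumed stand-in for both the prediction model and the matching-degree computation, and the word-based interception length is an illustrative choice.

```python
def select_preliminary_summary(corpus: str, event_types: list[str],
                               key_events: list[str], snippet_words: int,
                               match) -> str:
    """Choose the target event type by its best match value against the
    preset key events, intercept the corpus into fixed-length snippets,
    and keep the snippet best matching the target event type."""
    target_event = max(event_types,
                       key=lambda e: max(match(e, k) for k in key_events))
    words = corpus.split()
    snippets = [" ".join(words[i:i + snippet_words])
                for i in range(0, len(words), snippet_words)]
    return max(snippets, key=lambda s: match(s, target_event))

# Toy matching degree: number of words the two texts share.
match = lambda a, b: len(set(a.split()) & set(b.split()))

corpus = "company announces merger deal today weather is sunny and mild"
print(select_preliminary_summary(corpus, ["merger deal", "weather report"],
                                 ["merger deal"], 5, match))
```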
8. A text processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire an original corpus;
an abstract text extraction module, configured to perform text extraction processing on the original corpus to obtain a preliminary abstract text;
a text segmentation module, configured to divide the preliminary abstract text into a plurality of preliminary segmented texts according to a preset segmented text length;
a preliminary candidate phrase generating module, configured to filter the plurality of preliminary segmented texts to obtain a plurality of preliminary candidate phrases;
a target candidate phrase generating module, configured to calculate a similarity value between each preliminary candidate phrase and the original corpus, and take the preliminary candidate phrases satisfying a preset similarity condition as target candidate phrases;
a semantic parsing module, configured to perform semantic parsing on the target candidate phrases to obtain a plurality of segmented words and the word types of the segmented words;
and a word linking module, configured to link the plurality of segmented words to a preset hypernym-hyponym cognitive graph according to the word types of the segmented words to obtain a target hypernym-hyponym cognitive graph.
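The modules of claim 8 compose into a straight pipeline; below, each module is an injected callable with trivial stand-ins shown at the call site (all names and the 0.5 similarity condition are assumptions for illustration).

```python
def process_text(corpus, graph, *, extract, segment, filter_prelim,
                 similarity, parse, link, threshold=0.5):
    """Wire claim 8's modules end to end: extraction, segmentation,
    preliminary filtering, similarity screening, parsing, linking."""
    summary = extract(corpus)
    segments = segment(summary)
    candidates = filter_prelim(segments)
    targets = [c for c in candidates if similarity(c, corpus) >= threshold]
    words_with_types = parse(targets)
    return link(words_with_types, graph)

# Trivial stand-ins so the pipeline runs end to end.
result = process_text(
    "alpha beta gamma", {"root": []},
    extract=lambda c: c,
    segment=lambda s: s.split(),
    filter_prelim=lambda segs: [s for s in segs if len(s) > 4],
    similarity=lambda c, corpus: 1.0 if c in corpus else 0.0,
    parse=lambda ts: [(t, "noun") for t in ts],
    link=lambda wts, g: {**g, "root": [w for w, _ in wts]},
)
print(result)
```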
9. A computer device, comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling connection communication between the processor and the memory, wherein the program, when executed by the processor, implements the steps of the method according to any one of claims 1 to 7.
10. A storage medium, being a computer-readable storage medium for computer-readable storage, characterized in that the storage medium stores one or more programs executable by one or more processors to implement the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210739788.2A CN114996458A (en) | 2022-06-28 | 2022-06-28 | Text processing method and device, equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210739788.2A CN114996458A (en) | 2022-06-28 | 2022-06-28 | Text processing method and device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114996458A true CN114996458A (en) | 2022-09-02 |
Family
ID=83036734
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210739788.2A Pending CN114996458A (en) | 2022-06-28 | 2022-06-28 | Text processing method and device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114996458A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117195913A (en) * | 2023-11-08 | 2023-12-08 | 腾讯科技(深圳)有限公司 | Text processing method, text processing device, electronic equipment, storage medium and program product |
CN117195913B (en) * | 2023-11-08 | 2024-02-27 | 腾讯科技(深圳)有限公司 | Text processing method, text processing device, electronic equipment, storage medium and program product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106570171B (en) | Science and technology information processing method and system based on semantics | |
CN110705206B (en) | Text information processing method and related device | |
CN111753060A (en) | Information retrieval method, device, equipment and computer readable storage medium | |
CN114240552A (en) | Product recommendation method, device, equipment and medium based on deep clustering algorithm | |
CN114722069A (en) | Language conversion method and device, electronic equipment and storage medium | |
CN112188312B (en) | Method and device for determining video material of news | |
CN114626097A (en) | Desensitization method, desensitization device, electronic apparatus, and storage medium | |
CN114926039A (en) | Risk assessment method, risk assessment device, electronic device, and storage medium | |
CN114359810A (en) | Video abstract generation method and device, electronic equipment and storage medium | |
CN114841146A (en) | Text abstract generation method and device, electronic equipment and storage medium | |
CN114637847A (en) | Model training method, text classification method and device, equipment and medium | |
CN115238039A (en) | Text generation method, electronic device and computer-readable storage medium | |
CN114064894A (en) | Text processing method and device, electronic equipment and storage medium | |
CN114519356A (en) | Target word detection method and device, electronic equipment and storage medium | |
CN112667815A (en) | Text processing method and device, computer readable storage medium and processor | |
CN114722174A (en) | Word extraction method and device, electronic equipment and storage medium | |
CN115270746A (en) | Question sample generation method and device, electronic equipment and storage medium | |
CN114091475A (en) | Dialog text generation method and device, electronic equipment and storage medium | |
CN114996458A (en) | Text processing method and device, equipment and medium | |
CN116741396A (en) | Article classification method and device, electronic equipment and storage medium | |
CN115080743A (en) | Data processing method, data processing device, electronic device, and storage medium | |
CN114492437B (en) | Keyword recognition method and device, electronic equipment and storage medium | |
CN114398903B (en) | Intention recognition method, device, electronic equipment and storage medium | |
CN115204300A (en) | Data processing method, device and storage medium for text and table semantic interaction | |
CN114090778A (en) | Retrieval method and device based on knowledge anchor point, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |