CN116467405A - Text processing method, device, equipment and computer readable storage medium - Google Patents

Text processing method, device, equipment and computer readable storage medium

Info

Publication number
CN116467405A
Authority
CN
China
Prior art keywords
word
text
type
processed
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210033962.1A
Other languages
Chinese (zh)
Inventor
曾双
刘康龙
荆宁
梁海金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210033962.1A
Publication of CN116467405A
Legal status: Pending

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3338Query expansion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Abstract

The embodiment of the application provides a text processing method, device, equipment and computer readable storage medium, applied at least to the technical field of artificial intelligence. The method comprises the following steps: for each first-type word in the text to be processed, encoding the first-type word together with the text to be processed to obtain a text word vector corresponding to the first-type word and the text to be processed; performing hypernym-hyponym relation decoding on the text word vector to obtain, for each segmented word in the text to be processed, a confidence that the word has a hypernym-hyponym relation with the first-type word; determining, according to the confidences, a second-type word corresponding to each first-type word from the at least two segmented words; and associating the first-type word with the second-type word to obtain at least one hypernym-hyponym word pair corresponding to the text to be processed. With the method and the device, multiple hypernym-hyponym word pairs in the text to be processed can be identified accurately, and the efficiency of identifying such pairs can be improved.

Description

Text processing method, device, equipment and computer readable storage medium
Technical Field
Embodiments of the present application relate to the field of internet technologies, and relate to, but are not limited to, a text processing method, a text processing device, text processing equipment, and a computer readable storage medium.
Background
With the development of internet technology, the amount of information and data on the internet has grown rapidly, making information search more difficult; with the development of artificial intelligence technology, the accuracy requirements on information search have also risen. In most search scenarios, when an upper-level concept word (hypernym) is input, the user wants information about its corresponding lower-level entity words (hyponyms), or, when a lower-level entity word is input, information about its corresponding upper-level concept word. The correspondence between upper-level concept words and lower-level entity words therefore needs to be obtained in advance; that is, hypernym-hyponym word pairs need to be determined in advance.
In the related art, hypernym-hyponym word pairs are generally determined in one of the following ways: based on preset rules, by template matching, by sequence labeling, or by classification based on a pre-trained model.
However, these related-art methods either cannot accurately identify hypernym-hyponym word pairs when only plain text is input, or can identify only one such pair in a text at a time, so both the accuracy and the efficiency of identifying hypernym-hyponym word pairs in the related art are low.
Disclosure of Invention
The embodiment of the application provides a text processing method, device, equipment and computer readable storage medium, applied at least to the technical field of artificial intelligence, which can accurately identify multiple hypernym-hyponym word pairs in a text to be processed and improve the efficiency of identifying such pairs.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a text processing method, which comprises the following steps:
performing first-type word recognition on the text to be processed to obtain at least one first-type word;
for each first-type word, encoding the first-type word together with the text to be processed to obtain a text word vector corresponding to the first-type word and the text to be processed; the text to be processed comprises at least two segmented words;
performing hypernym-hyponym relation decoding on the text word vector to obtain, for each of the at least two segmented words, a confidence that the segmented word has a hypernym-hyponym relation with the first-type word;
determining a second-type word corresponding to each first-type word from the at least two segmented words according to the confidences;
and associating the first-type word with the second-type word to obtain at least one hypernym-hyponym word pair corresponding to the text to be processed.
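The steps above can be sketched in code. This is a hedged illustration only: the function names, the stub scorer, and the 0.5 threshold are assumptions made for this sketch, whereas the patent uses a trained encoder and a hypernym-hyponym relation decoder for the scoring step.

```python
# Hedged sketch of the claimed pipeline. score_relation() is a stand-in
# for the encode-then-decode step; a real system would run the trained
# encoder/decoder model described in this application.

def score_relation(first_type_word, tokens):
    """Return one confidence per segmented word in the text to be
    processed. Toy heuristic: capitalized tokens other than the query
    word score 1.0 (purely illustrative)."""
    return [1.0 if t != first_type_word and t[0].isupper() else 0.0
            for t in tokens]

def extract_pairs(first_type_words, tokens, threshold=0.5):
    """For each first-type word, keep the segmented words whose
    confidence of a hypernym-hyponym relation reaches the threshold,
    and associate them into word pairs."""
    pairs = []
    for word in first_type_words:
        confidences = score_relation(word, tokens)
        for token, conf in zip(tokens, confidences):
            if conf >= threshold:
                pairs.append((word, token))
    return pairs

tokens = ["Chess", "is", "a", "board", "game"]
print(extract_pairs(["game"], tokens))  # [('game', 'Chess')]
```

Because one decoding pass scores every segmented word at once, a single first-type word needs only one pass rather than one classification per candidate pair.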
The embodiment of the application provides a text processing device, which comprises:
the recognition module is used for recognizing the first type words of the text to be processed to obtain at least one first type word;
the coding processing module is used for coding the first type word and the text to be processed aiming at each first type word to obtain text word vectors corresponding to the first type word and the text to be processed; the text to be processed comprises at least two segmentation words;
the decoding processing module is used for performing upper-lower relation decoding processing on the text word vector to obtain confidence degrees of upper-lower relation between each word segment in the at least two word segments and the first type word;
the determining module is used for determining a second type word corresponding to each first type word from the at least two word segments according to the confidence level;
and the association module is used for associating the first type word with the second type word to obtain at least one context word pair corresponding to the text to be processed.
The embodiment of the application provides a text processing device, which comprises:
a memory for storing executable instructions; and the processor is used for realizing the text processing method when executing the executable instructions stored in the memory.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium; the processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor is configured to execute the computer instructions to implement the text processing method.
The embodiment of the application provides a computer readable storage medium, which stores executable instructions for causing a processor to execute the executable instructions to implement the text processing method.
The embodiment of the application has the following beneficial effects: when a text to be processed is processed, the first-type words in it are recognized first; then, based on each first-type word, encoding and hypernym-hyponym relation decoding are performed in turn on the first-type word and the text to be processed, yielding a confidence for each segmented word in the text; based on these confidences, the second-type words that have a hypernym-hyponym relation with each first-type word are determined from the at least two segmented words, forming at least one hypernym-hyponym word pair. In this way, all hypernym-hyponym word pairs in the text can be identified simultaneously with only the text to be processed as input, which greatly improves recognition efficiency; and because encoding and decoding are performed on the basis of the first-type words recognized first, the second-type word corresponding to each first-type word can be identified accurately, so that multiple hypernym-hyponym word pairs in the text to be processed can be identified accurately.
Drawings
FIG. 1A is an interface diagram of a rule-based conceptual context mining method mining pairs of context words;
FIG. 1B is a context word pair recognition process based on Bootstrapping template matching techniques;
FIG. 1C is a schematic diagram of an extraction process of a method for extracting a context based on sequence labeling;
FIG. 1D is a schematic diagram of a classification process of a superior-inferior relationship classification method based on a pre-training model BERT;
FIG. 2 is a schematic diagram of an alternative architecture of a text processing system provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a text processing device according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an alternative text processing method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of another alternative text processing method provided in an embodiment of the present application;
FIG. 6 is a schematic flow chart of yet another alternative text processing method provided in an embodiment of the present application;
FIG. 7 is a product interface diagram of search term recommendations provided by an embodiment of the present application;
FIG. 8 is a product interface diagram of concept class candidate search box word generation provided by an embodiment of the present application;
fig. 9 is a schematic diagram of a question-answer matching process of a question-answer system based on a knowledge graph according to an embodiment of the present application;
FIG. 10 is a flowchart of a text processing system implementing a text processing method according to an embodiment of the present application;
FIG. 11 is a schematic view of three scenarios provided in embodiments of the present application;
FIG. 12 is a schematic structural diagram of an MRC model provided in an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an encoder of an MRC model provided in an embodiment of the present application;
fig. 14 is a schematic illustration of a case predicted using the method of an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict. Unless defined otherwise, all technical and scientific terms used in the embodiments of the present application have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments of the present application belong. The terminology used in the embodiments of the present application is for the purpose of describing the embodiments of the present application only and is not intended to be limiting of the present application.
Before explaining the schemes of the embodiments of the present application, the terms related to the embodiments of the present application are explained first:
(1) Bidirectional Encoder Representations from Transformers (BERT): a pre-trained language representation model; BERT conditions on context from both sides (left and right) of each word.
(2) Machine Reading Comprehension (MRC) model: a reading comprehension model used in question answering systems.
(3) Bootstrapping algorithm: a method for generating semantic matching templates from a large-scale corpus; with limited sample data, repeated re-sampling is used to build new samples sufficient to represent the parent sample distribution.
(4) Information Box (Info Box): the information frame in the encyclopedia query page is used for recording attribute information of an entity corresponding to the encyclopedia page.
(5) Named Entity Recognition (NER): identifying named entities with specific meanings, such as person names, place names, organization names, work titles, and proper nouns, from text.
(6) BIO (Begin-Inside-Outside) label: the sequence annotation model is a common annotation mode, and BIO annotation is to annotate each element as 'B-X', 'I-X' or 'O', wherein 'B-X' indicates that a fragment where the element is located belongs to an X type and the element is at the beginning of the fragment, 'I-X' indicates that the fragment where the element is located belongs to the X type and the element is at the middle position of the fragment, and 'O' indicates that the element is not of any type. For example, in NER, each word in a BIO tagged text sequence is a first word belonging to a certain entity (or concept), an internal word, or a non-entity (or concept) word.
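The BIO scheme above can be illustrated with a short decoding sketch (the tag names and the example sentence are illustrative, not taken from the patent's figures):

```python
def decode_bio(tokens, tags):
    """Collect (type, text) spans from a BIO-tagged token sequence."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):              # "B-X": beginning of an X span
            if current:
                spans.append(current)
            current = (tag[2:], token)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current = (current[0], current[1] + token)  # "I-X": continue span
        else:                                  # "O" or an inconsistent tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

# Character-level BIO tags for "张三是演员" ("Zhang San is an actor").
print(decode_bio(list("张三是演员"),
                 ["B-PER", "I-PER", "O", "B-CONCEPT", "I-CONCEPT"]))
# [('PER', '张三'), ('CONCEPT', '演员')]
```

This is how a sequence labeler's per-token output is turned back into entity and concept strings.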
(7) Hyponym: typically a named entity, such as a person, organization, place, or work.
(8) Hypernym: typically a concept, i.e., an entity with more abstract semantics; a hypernym can refer broadly to a class of named entities. For example, the concept "educational mobile game" can refer broadly to various specific game titles.
Before explaining the text processing method of the embodiment of the present application, a method in the related art will be first described.
In the related art, when recognizing hypernym-hyponym word pairs, the first method is rule-based mining of concept hypernym-hyponym relations. In this approach, coarse-grained concepts are extracted from structured and semi-structured data such as open knowledge bases and encyclopedia Info Boxes. Fig. 1A is an interface diagram of this rule-based method mining hypernym-hyponym word pairs: in an encyclopedia query page, the "occupation" attribute in the Info Box of the entity entry "Zhang San" yields the upper-level concepts corresponding to the lower-level entity "Zhang San", namely: actor, singer, producer, lyricist, etc.
In the rule-based concept context mining method, although the mining accuracy is high, the mining method does not have generalization capability, and the extracted concept is limited to specific Info Box attributes, such as occupation and position corresponding to a character entity and category corresponding to a video entity, and the ranges are relatively narrow. The method is completely dependent on the existing encyclopedia data, the upper and lower word pairs outside the encyclopedia data cannot be extracted, and the recognition range of the upper and lower word pairs is smaller.
The second approach is a template matching technique based on Bootstrapping, which automatically mines concept phrases from text. Bootstrapping-based template matching is commonly used to solve the problem of sample scarcity in machine learning: with limited samples, repeated re-sampling creates new samples sufficient to represent the parent sample distribution. For concept hypernym-hyponym extraction, the process is as follows: a small number of concept hypernym-hyponym word pairs serve as an initial seed sample set; templates are predicted through some learning strategy; more samples are labeled based on the predicted templates, expanding the original seed sample set; and the process is iterated. As shown in fig. 1B, in hypernym-hyponym word pair recognition based on Bootstrapping template matching, the input text 101 ("Everyone knows LL, a well-known pianist") first undergoes alignment 102; template matching is then performed on the text 101 based on the Bootstrapping templates 103, yielding sample words 104 such as "LL", "a", and "well-known pianist". The obtained sample words are then labeled via regular expressions and a predefined dictionary to obtain the final labeled words, which are added to the templates 103 as expansion samples.
Compared with the rule-based method, Bootstrapping-based template matching has broader capability: using the templates matched for concept hypernym-hyponym word pairs, such pairs can be mined efficiently and quickly from large-scale corpora. However, because the method depends heavily on the initial seed sample set, its accuracy is not very high and its generalization capability is very limited.
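The seed-template-harvest loop described above can be sketched as follows. Everything here is a toy illustration: the corpus sentences, the name "Ma", and the single-word `\w+` slot assumption are invented for the sketch and are not the learning strategy the related art actually uses.

```python
import re

def bootstrap(corpus, seed_pairs, rounds=2):
    """Toy Bootstrapping loop: induce surface templates from sentences
    containing known (hyponym, hypernym) pairs, then match the templates
    against the corpus to harvest new candidate pairs."""
    pairs = set(seed_pairs)
    for _ in range(rounds):
        # Step 1: turn each sentence containing a known pair into a template.
        templates = set()
        for sentence in corpus:
            for hypo, hyper in pairs:
                if hypo in sentence and hyper in sentence:
                    templates.add(
                        sentence.replace(hypo, "{X}").replace(hyper, "{Y}"))
        # Step 2: apply each template as a regex to label new pairs,
        # expanding the seed set for the next iteration.
        for template in templates:
            rx = (re.escape(template)
                  .replace(re.escape("{X}"), r"(\w+)")
                  .replace(re.escape("{Y}"), r"(\w+)"))
            for sentence in corpus:
                match = re.fullmatch(rx, sentence)
                if match:
                    pairs.add((match.group(1), match.group(2)))
    return pairs

corpus = ["LL is a famous pianist", "Ma is a famous writer"]
print(sorted(bootstrap(corpus, {("LL", "pianist")})))
```

The sketch also makes the stated weakness visible: every harvested pair is reachable only through templates grown from the initial seeds, so a poor seed set caps both accuracy and coverage.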
The third way is hypernym-hyponym relation extraction based on sequence labeling. Because hyponyms are recognized as named entities, and the common, mature way to extract named entities is sequence labeling, the hypernym corresponding to an entity can be labeled at the same time as the entity is extracted. Sequence labeling adopts the BIO scheme, tagging each word in the text as the beginning word or internal word of some entity or concept, or as a non-entity (non-concept) word. The method uses two sequence labelers to label the entities and the concepts in the text respectively, finally yielding the extracted hypernym-hyponym word pairs. As shown in fig. 1C, for the input text 105, a red drama "Crossing YLJ" broadcast by the X national television station channel 1 in 2020, sequence labeling extracts the entity and its corresponding hypernym, marking the hypernym-hyponym word pair "Crossing YLJ" and "red drama".
The sequence-labeling extraction method uses a neural network to jointly extract concept hypernym-hyponym word pairs from the text, and has a certain degree of generalization capability and accuracy. However, an input text may contain multiple "entity-concept" hypernym-hyponym pairs, and this approach cannot handle such samples.
The fourth way is hypernym-hyponym relation classification based on the pre-trained model BERT. It assumes that the entities and concepts in a text have already been identified, and trains a classification model on top of the pre-trained language model BERT to judge whether an entity and a concept satisfy the hypernym-hyponym relation. As shown in fig. 1D, the BERT model 106 judges, based on the text 107 ("cream chestnut powder is a kind of dessert"), whether the input concept word "dessert" and entity word "cream chestnut powder" are related, and outputs the result 108; according to the result 108, "dessert" and "cream chestnut powder" have a hypernym-hyponym relation.
The BERT-based classification method exploits the information contained in the pre-trained language model: given an entity, a concept, and the corresponding text, it classifies whether the hypernym-hyponym relation holds, decomposing a complex extraction problem into a simple classification problem that is easier for the model to learn; it also benefits from the knowledge acquired through large-scale pre-training and performs strongly. However, such a sample must be constructed for every potential entity-concept pair, which makes the time cost very high. In addition, the method requires entities and concepts to be extracted in advance, so it cannot be applied in real scenarios where only plain text is given.
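The cost argument above can be made concrete with a toy count (the function names and the example sizes are illustrative, not measurements of any real model): the pair-classification baseline needs one model call per candidate (entity, concept) pair, while a per-concept decoding pass scores all candidate tokens at once.

```python
from itertools import product

def pair_classification_calls(entities, concepts):
    """The BERT pair-classification baseline scores every
    (entity, concept) candidate separately: |E| * |C| model calls."""
    return len(list(product(entities, concepts)))

def per_concept_calls(concepts):
    """A per-concept scheme runs one encode/decode pass per recognized
    concept word, scoring all segmented words in that single pass."""
    return len(concepts)

entities = ["e1", "e2", "e3", "e4"]
concepts = ["c1", "c2"]
print(pair_classification_calls(entities, concepts))  # 8
print(per_concept_calls(concepts))                    # 2
```

As the number of candidate entities grows, the gap between `|E| * |C|` calls and `|C|` calls widens linearly in `|E|`, which is the time-cost problem the passage describes.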
Based on the above method and the existing problems in the related art, the embodiments of the present application provide a text processing method, which provides a system for automatically extracting upper and lower word pairs from a plain text, and compared with other methods in the industry, the system can rapidly extract upper and lower word pairs in a text, has wider generalization capability, and can cover wider business scenarios. The main contributions are as follows: (1) Knowledge in the pre-training language model is effectively utilized, and lower entities and corresponding upper concepts thereof are directly extracted from the plain text; (2) The problem that a plurality of upper and lower word pairs possibly exist in a text is solved; (3) Under the condition that accuracy and recall are guaranteed, time complexity and overhead are lower than those of the method in the related art; (4) support three business scenarios: many-to-many scenes (extracting pairs of upper and lower terms from plain text), one-to-many scenes (given lower entities and text, extracting upper concepts in text), and many-to-one scenes (given upper concepts and text, extracting lower entities in text); (5) at least able to serve 3 possible big applications: query-understood search associative word recommendation, search box word recommendation and knowledge-based question-answering systems.
In the text processing method provided by the embodiment of the application, first-type word recognition is first performed on the text to be processed to obtain at least one first-type word. Then, for each first-type word, the first-type word and the text to be processed are encoded to obtain a text word vector corresponding to the first-type word and the text to be processed; the text to be processed comprises at least two segmented words. Hypernym-hyponym relation decoding is performed on the text word vector to obtain, for each of the at least two segmented words, a confidence that it has a hypernym-hyponym relation with the first-type word. Next, the second-type word corresponding to each first-type word is determined from the at least two segmented words according to the confidences. Finally, each first-type word is associated with its second-type word to obtain at least one hypernym-hyponym word pair corresponding to the text to be processed. In this way, all hypernym-hyponym word pairs in the text can be identified simultaneously with only the text to be processed as input, greatly improving recognition efficiency; and because encoding and decoding are performed on the basis of the first-type words recognized first, the second-type word corresponding to each first-type word, and thus multiple hypernym-hyponym word pairs in the text to be processed, can be identified accurately.
An exemplary application of the text processing device according to the embodiment of the present application is described below, and the text processing device provided in the embodiment of the present application may be implemented as a terminal or as a server. In one implementation manner, the text processing device provided in the embodiments of the present application may be implemented as any terminal having a search function and a text processing function, such as a notebook computer, a tablet computer, a desktop computer, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, and a portable game device), an intelligent robot, an intelligent home appliance, an intelligent speaker, an intelligent watch, and a vehicle-mounted terminal; in another implementation manner, the text processing device provided in the embodiment of the present application may be implemented as a server, where the server may be an independent physical server, or may be a server cluster or a distributed system formed by multiple physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content distribution networks (CDN, content Delivery Network), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application. In the following, an exemplary application when the text processing device is implemented as a server will be described.
Referring to fig. 2, fig. 2 is an optional architecture diagram of a text processing system 10 according to an embodiment of the present application, to support a text processing application, and respond to an information search request of a user through the text processing application, a text to be processed may be identified, at least one context word pair in the text to be processed may be identified, and respond to the information search request of the user based on the context word pair. In the embodiment of the present application, the text processing system 10 includes at least a terminal 100, a network 200, and a server 300, where the server 300 forms the text processing device in the embodiment of the present application. The terminal 100 is connected to the server 300 through the network 200, and the network 200 may be a wide area network or a local area network, or a combination of the two. The terminal 100 is provided with at least one text processing application, the text processing application can be any one of an information searching application, a question-answer matching application, a shopping application and the like, which need to perform information searching or information matching, and the server 300 can acquire a text to be processed sent by the terminal 100 through the network 200; performing first-type word recognition on the text to be processed to obtain at least one first-type word; for each first type word, carrying out coding processing on the first type word and the text to be processed to obtain text word vectors corresponding to the first type word and the text to be processed; the text to be processed comprises at least two segmentation words; performing upper-lower relation decoding processing on the text word vector to obtain confidence degrees of upper-lower relation between each word segment in at least two word segments and the first type word; determining a second type word corresponding to each first type word from at 
least two segmentation words according to the confidence level; and associating the first type word with the second type word to obtain at least one context word pair corresponding to the text to be processed. After obtaining the context word pairs, the server 300 may construct an information base based on the obtained context word pairs, or may send the context word pairs to the terminal 100 via the network 200, and the terminal may construct the information base.
In some embodiments, when the text processing device is implemented as a terminal, the terminal may obtain a text to be processed input by a user through the terminal, and identify the text to be processed by using the text processing method provided by the embodiment of the present application, to identify a pair of upper and lower terms in the text to be processed, and construct an information base according to the identified pair of upper and lower terms, or respond to an information search request of the user according to the identified pair of upper and lower terms.
The text processing method provided in the embodiment of the present application may be implemented by a cloud technology based on a cloud platform, for example, the server 300 may be a cloud server, and the first type word recognition is performed on the text to be processed by the cloud server, and the second type word corresponding to each first type word is determined by the cloud server. In some embodiments, the system may further have a cloud memory, and the identified pairs of hypernyms may be stored in the cloud memory, or the text to be processed and the identified pairs of hypernyms may be mapped and stored in the cloud memory. Therefore, when the information search request of the user is responded subsequently, the upper word pair and the lower word pair can be directly obtained from the cloud storage, so that the information search request of the user can be responded rapidly and accurately.
Here, cloud technology refers to a hosting technology that unifies resources such as hardware, software, and networks in a wide area network or local area network to realize computation, storage, processing, and sharing of data. It is the general term for the network, information, integration, management-platform, and application technologies applied under the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: background services of technical network systems require large amounts of computing and storage resources, for example video websites, image websites, and portal sites. As the internet industry develops, each article may in the future carry its own identification mark, which needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong backing from the system, which can only be realized through cloud computing.
Fig. 3 is a schematic structural diagram of a text processing apparatus provided in an embodiment of the present application. The text processing device shown in Fig. 3 includes: at least one processor 310, a memory 350, at least one network interface 320, and a user interface 330. The various components in the text processing device are coupled together by a bus system 340, which, it will be understood, enables communication among these components. In addition to a data bus, the bus system 340 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled in Fig. 3 as the bus system 340.
The processor 310 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor (which may be a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The user interface 330 includes one or more output devices 331 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 330 also includes one or more input devices 332, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical drives, and the like. Memory 350 optionally includes one or more storage devices physically located remote from processor 310. Memory 350 includes volatile memory, nonvolatile memory, or both. The nonvolatile memory may be a read-only memory (ROM, Read-Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 350 described in the embodiments of the present application is intended to comprise any suitable type of memory. In some embodiments, memory 350 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 351, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, used to implement various basic services and process hardware-based tasks;
A network communication module 352 for reaching other computing devices via one or more (wired or wireless) network interfaces 320; exemplary network interfaces 320 include Bluetooth, Wireless Fidelity (Wi-Fi), Universal Serial Bus (USB, Universal Serial Bus), and the like;
An input processing module 353 for detecting one or more user inputs or interactions from one of the one or more input devices 332 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software. Fig. 3 shows a text processing apparatus 354 stored in the memory 350; it may be the text processing apparatus in a text processing device, in the form of a program or plug-in, and includes the following software modules: a recognition module 3541, an encoding processing module 3542, a decoding processing module 3543, a determination module 3544, and an association module 3545. These modules are logical and may therefore be combined arbitrarily or further split depending on the functions implemented. The functions of the respective modules are described hereinafter.
In other embodiments, the apparatus provided in the embodiments of the present application may be implemented in hardware. By way of example, it may be a processor in the form of a hardware decoding processor programmed to perform the text processing method provided in the embodiments of the present application; for instance, such a processor may employ one or more Application Specific Integrated Circuits (ASIC), DSPs, Programmable Logic Devices (PLD), Complex Programmable Logic Devices (CPLD), Field-Programmable Gate Arrays (FPGA), or other electronic components.
The text processing method provided in the embodiments of the present application is described below in connection with exemplary applications and implementations of the text processing device. The text processing device may be any terminal having a search function and a text processing function, or may be a server; that is, the text processing method of the embodiments of the present application may be executed by the terminal, by the server, or by the terminal interacting with the server.
Referring to Fig. 4, Fig. 4 is a schematic flowchart of an alternative text processing method provided in an embodiment of the present application. The steps shown in Fig. 4 are described below; it should be noted that the method in Fig. 4 is described with the terminal as the executing entity.
Step S401, performing first type word recognition on the text to be processed to obtain at least one first type word.
In the embodiments of the present application, the text to be processed may be input by a user through a text processing application, downloaded from a network, or obtained from a text library. The user may also input other types of information, such as voice, picture, or video information; the text to be processed is then obtained by performing speech recognition on the voice information, or by recognizing the picture or video information using image recognition or text recognition technology.
The text processing application may be an application dedicated to identifying context word pairs, or any application that requires information searching or information matching, such as an information search application, a question-answer matching application, or a shopping application. When the text processing application is dedicated to identifying context word pairs, it identifies the pairs and either outputs them to a specific application or generates an information base from them that is provided to other applications. When the text processing application itself requires information searching or matching, its server can process the text to be processed in advance with the text processing method of the embodiments of the present application, obtaining at least one context word pair and constructing an information base from those pairs; when the application subsequently receives a user's information search request, information matching can be performed against the pre-built information base, enabling an accurate and rapid response.
In some embodiments, after responding to the user's information search request, the pre-constructed information base may be further expanded: the text in the information search request is treated as text to be processed and identified, yielding context word pairs that are then added to the pre-constructed information base.
In the embodiments of the present application, performing first type word recognition on the text to be processed may mean recognizing the subordinate entity words in the text, or recognizing the superordinate concept words in the text. Since subordinate entity word recognition is essentially named entity recognition, a sequence labeling method can be used to recognize the subordinate entity words in the text to be processed. The sequence labeling may adopt the BIO scheme, labeling each character in the text as the beginning character of an entity (B), an internal character of an entity (I), or a non-entity character (O). Alternatively, the same sequence labeling scheme can be used to mark the superordinate concept words, labeling each character as the beginning character of a concept, an internal character of a concept, or a non-concept character. When the first type word recognition identifies the entity words in the text to be processed, the obtained first type words are subordinate entity words; when it identifies the concept words in the text to be processed, the obtained first type words are superordinate concept words.
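The BIO sequence labeling just described can be sketched as a small tag-decoding routine; a minimal, framework-free example in which the tags are assumed to come from some upstream labeling model (the example sentence and tags are illustrative, not from the patent):

```python
def extract_entities(chars, tags):
    """Collect entity spans from character-level BIO tags.

    chars: list of characters of the text to be processed.
    tags:  "B" (first character of an entity), "I" (internal character),
           or "O" (non-entity character), one per position.
    """
    entities, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":                 # a new entity starts here
            if current:
                entities.append("".join(current))
            current = [ch]
        elif tag == "I" and current:   # continue the currently open entity
            current.append(ch)
        else:                          # "O" (or a stray "I") closes any open entity
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities

# Toy example: "apple and pear" with "apple" and "pear" tagged as entities.
chars = list("apple and pear")
tags  = ["B","I","I","I","I","O","O","O","O","O","B","I","I","I"]
print(extract_entities(chars, tags))  # ['apple', 'pear']
```

The same decoding applies unchanged when the tags mark superordinate concept words instead of entities; only the meaning of B/I/O changes.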
In some embodiments, the first type word may be identified by a pre-trained text recognition model, where the model parameters are parameters for identifying subordinate entity words or parameters for identifying superordinate concept words, and may further include preset constraint conditions. A constraint condition here restricts the entity word type or concept word type; for example, words of the type person name, place name, organization name, work title, or proper noun may be constrained to be subordinate entity words, so that when the text recognition model performs recognition, such words in the text to be processed are recognized as subordinate entity words.
Step S402, aiming at each first type word, carrying out coding processing on the first type word and the text to be processed to obtain text word vectors corresponding to the first type word and the text to be processed; wherein the text to be processed comprises at least two segmentations.
The encoding processing refers to performing word vectorization processing on the first type word and the text to be processed, and converting the first type word and the text to be processed into word vector representation to obtain text word vectors of the first type word and the text to be processed. The text word vector comprises a first word vector corresponding to a first type word and a text vector corresponding to a text to be processed, and the text word vector is a word vector representation for simultaneously representing the first type word and the text to be processed.
In one implementation, when the first type word is encoded, each word in the first type word may be encoded to obtain a first word vector corresponding to the first type word; when the text to be processed is encoded, each word in each word of the text to be processed may be encoded, so as to obtain a text vector corresponding to the text to be processed.
In another implementation manner, when the first type word is encoded, the encoding processing may be performed on the whole first type word to obtain a first word vector corresponding to the first type word; when the text to be processed is subjected to coding processing, each word of the text to be processed is subjected to coding processing to obtain a second word vector of each word, and then the corresponding second word vectors are spliced according to the sequence of the words in the text to be processed to form a text vector of the text to be processed.
Because in some text processing scenarios the superordinate concept word may be a phrase formed by several word segments (i.e., a text segment), in the embodiments of the present application the at least two word segments included in the text to be processed may be at least two words, at least two phrases, or a mix of words and phrases.
Step S403, performing upper and lower relation decoding processing on the text word vector to obtain the confidence that each word in the at least two word segments has an upper and lower relation with the first type word.
Here, the context decoding process refers to decoding and computing over the text word vector of the text to be processed and the first type word, obtaining the confidence that each word segment in the text to be processed has a context relationship with the first type word. The confidence may be any real number between 0 and 1: the higher the confidence, the greater the probability that the corresponding word segment has a context relationship with the first type word; the lower the confidence, the smaller that probability. In the embodiments of the present application, the confidence that each word segment has a context relationship with the first type word can be computed.
In the embodiment of the application, when the recognized first type word is a lower level entity word, the confidence coefficient obtained by decoding calculation is the confidence coefficient of the upper level concept word of the corresponding word segmentation as the first type word; when the recognized first type word is the upper concept word, the confidence coefficient obtained by decoding calculation is the confidence coefficient of the lower entity word taking the corresponding segmentation word as the first type word.
In some embodiments, the context determination during decoding may be performed with a pre-trained model, for example a pre-trained MRC model: the first type word and the text to be processed are taken as inputs to the MRC model, and the context decoding is performed by the MRC model, obtaining the confidence that each of the at least two word segments has a context relationship with the first type word. That is, the MRC model can be used to determine, for the input first type word, which word segments of the text to be processed have a context relationship with it.
Step S404, a second type word corresponding to each first type word is determined from the at least two word segments according to the confidence.
In the embodiments of the present application, a word segment whose confidence is greater than a threshold may be determined as a second type word corresponding to the first type word. The threshold may be derived from the model parameters of the MRC model, or set manually according to the model's training results. For example, with the threshold set to 0.8, any word segment whose confidence exceeds 0.8 is determined to be a second type word corresponding to the first type word.
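The thresholding in step S404 amounts to a one-line filter over the decoded confidences; a sketch with made-up confidence values (the 0.8 threshold follows the example in the text, the word segments and scores are hypothetical):

```python
# Hypothetical per-segment confidences from the context decoding step.
confidences = {"fruit": 0.93, "and": 0.05, "tasty": 0.41, "food": 0.86}

THRESHOLD = 0.8  # per the example in the text; in practice tuned from training

# Word segments whose confidence exceeds the threshold become second type words.
second_type_words = [w for w, c in confidences.items() if c > THRESHOLD]
print(second_type_words)  # ['fruit', 'food']
```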
Step S405, associating the first type word with the second type word to obtain at least one context word pair corresponding to the text to be processed.
Here, the first type word and the second type word may be stored in an associated manner, that is, a mapping relationship between the upper concept word and the lower entity word may be determined, so that when the second type word is queried based on the first type word or the first type word is queried based on the second type word, the corresponding word may be queried rapidly based on the mapping relationship.
According to the text processing method described above, when the text to be processed is handled, the first type words in it are recognized first; then, based on each first type word, encoding processing and context decoding processing are performed in sequence on the first type word and the text to be processed, yielding a confidence for each word segment in the text; based on these confidences, the second type words that have a context relationship with each first type word are determined from the at least two word segments, thereby forming at least one context word pair. In this way, all context word pairs in the text can be identified with only the text to be processed as input, greatly improving recognition efficiency; and because encoding and context decoding are performed on the basis of the first type words recognized first, the second type word corresponding to each first type word, and hence the multiple context word pairs in the text, can be identified accurately.
Fig. 5 is a schematic flow chart of another alternative text processing method provided in an embodiment of the present application, as shown in fig. 5, the method includes the following steps:
in step S501, the terminal acquires a text to be processed, and generates a text processing request according to the text to be processed.
In the embodiment of the application, the text input by the user can be acquired to obtain the text to be processed, or the acquired voice is subjected to voice recognition to obtain the text to be processed, or the acquired image or video is subjected to recognition to obtain the text to be processed.
In some embodiments, a user may operate on a client of a text processing application to generate a text processing request, e.g., may enter text to be processed and trigger a request button to generate a text processing request including the text to be processed.
In step S502, the terminal sends a text processing request to the server.
Here, the server refers to a server of the text processing application. The server may be a local server, a cloud server, or a server in a server cluster or a distributed system formed by a plurality of physical servers.
In step S503, the server performs first-type word recognition on the text to be processed to obtain at least one first-type word.
In step S504, the server performs feature extraction on the first type word by using the encoder for each first type word, to obtain a first word vector corresponding to the first type word.
Here, a Chinese BERT model based on the whole-word-masking technique (Chinese-BERT-wwm) may be used as the encoder to extract features from the first type word, obtaining the first word vector corresponding to the first type word.
In step S505, the server performs word segmentation on the text to be processed to obtain at least two segmented words.
In the embodiments of the present application, when segmenting the text to be processed, segmentation may be performed character by character, with each character treated as one word segment, or word by word, in either case obtaining at least two word segments.
Step S506, the server extracts the characteristics of each word through the encoder to obtain a second word vector corresponding to each word; all second word vectors corresponding to at least two word segments form text vectors of the text to be processed, and the first word vectors and the text vectors form text word vectors.
Here, the Chinese-BERT-wwm may be used as an encoder to extract features of each word segment, thereby obtaining a second word vector corresponding to each word segment.
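The encoder pipeline of steps S504-S506 can be sketched end to end. In practice Chinese-BERT-wwm would supply the vectors; the sketch below substitutes deterministic hash-based toy embeddings purely to show the data flow, so the helper names and the 8-dimensional vectors are illustrative assumptions, not the patent's implementation:

```python
import hashlib

DIM = 8  # toy embedding dimension; real BERT hidden sizes are 768 or more

def toy_embed(token):
    """Deterministic stand-in for a pretrained encoder's output vector."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]

def encode(first_type_word, segments):
    """Step S504: one first word vector for the first type word.
    Step S506: one second word vector per word segment; together
    they form the text word vector of the method."""
    first_word_vector = toy_embed(first_type_word)
    text_vector = [toy_embed(seg) for seg in segments]
    return first_word_vector, text_vector

first_vec, text_vecs = encode("apple", ["apple", "is", "a", "fruit"])
print(len(first_vec), len(text_vecs))  # 8 4
```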
In step S507, the server performs a context decoding process on the text word vector, to obtain a confidence level that each of the at least two segmented words has a context with the first type word.
In some embodiments, step S507 may be implemented by:
in step S5071, through the linear layer in the decoder, the second word vector corresponding to each word segment is classification-mapped based on the first word vector, obtaining a classification result for the second word vector of each word segment.
In the embodiments of the present application, when the first type word is a subordinate entity word, the second type word is a superordinate concept word; when the first type word is a superordinate concept word, the second type word is a subordinate entity word.
The classification result indicates whether the word segment belongs to part of a superordinate concept word of the first type word or, correspondingly, to part of a subordinate entity word of the first type word.
In step S5072, the server decodes the classification result to obtain a confidence level that each of the at least two segmented words has an upper-lower relationship with the first type word.
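Per word segment, the linear layer plus decoding of steps S5071-S5072 reduces to a logit mapped through a squashing function; a minimal sketch with hypothetical 2-dimensional vectors and made-up trained parameters (the concatenation-based scoring is an illustrative assumption, not the patent's exact architecture):

```python
import math

def sigmoid(x):
    """Squash a real-valued logit into a 0..1 confidence."""
    return 1.0 / (1.0 + math.exp(-x))

def token_confidence(first_word_vec, second_word_vec, weights, bias):
    """Classification mapping conditioned on the first word vector:
    score the vector pair with a linear layer (step S5071), then
    decode the score into a confidence (step S5072)."""
    features = first_word_vec + second_word_vec  # simple concatenation
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(logit)

# Hypothetical vectors and parameters, for illustration only.
conf = token_confidence([0.2, 0.7], [0.5, 0.1],
                        weights=[1.0, 2.0, 1.5, -0.5], bias=-1.0)
print(0.0 <= conf <= 1.0)  # True
```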
In one implementation, individual word segments in the text to be processed may be determined directly as second type words according to the confidence; that is, the second type word may be determined as follows:
In step S508, the server determines the word segment with the confidence level greater than the confidence level threshold as the second type word.
In another implementation, a plurality of consecutive segmentation words in the text to be processed may also be determined as one second type word according to the confidence, that is, a text segment or phrase in the text to be processed is determined as the second type word, and thus the second type word may be determined by:
in step S509, when the positions of the plurality of segmented words with the confidence degrees greater than the confidence degree threshold value in the text to be processed are continuous, the server determines the text segment corresponding to the continuous plurality of segmented words as the second type word.
In general, superordinate concept words often appear in the form of text segments or phrases; therefore, when a text segment is determined to be the second type word in the embodiments of the present application, the second type word may be a superordinate concept word and, correspondingly, the first type word a subordinate entity word.
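The merging rule of step S509 — positionally consecutive word segments that each exceed the confidence threshold become one text segment — can be sketched as a run-grouping pass; segments and confidences below are hypothetical, and English segments are joined without spaces to mirror the direct concatenation of Chinese word segments:

```python
def merge_consecutive(segments, confidences, threshold=0.8):
    """Group runs of adjacent word segments whose confidence exceeds the
    threshold; each run becomes one candidate second type word."""
    results, run = [], []
    for seg, conf in zip(segments, confidences):
        if conf > threshold:
            run.append(seg)
        else:
            if run:                       # a run just ended: emit it
                results.append("".join(run))
            run = []
    if run:                               # flush a run ending at the text's end
        results.append("".join(run))
    return results

# Hypothetical example: "rose" passes alone; "flowering"+"plant" merge.
segs  = ["rose", "is", "flowering", "plant"]
confs = [0.91,   0.10, 0.88,        0.95]
print(merge_consecutive(segs, confs))  # ['rose', 'floweringplant']
```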
In step S510, the server associates the first type word with the second type word to obtain at least one context word pair corresponding to the text to be processed.
In step S511, the server sends the identified at least one context word pair to the terminal.
According to the text processing method described above, the first type word and the text to be processed are encoded by a specific encoder, and the text word vector is decoded by a specific decoder, obtaining the confidence that each word segment has a context relationship with the first type word. Based on this confidence, either a single word segment or a text segment is determined to be the second type word, in different ways; this allows accurate determination of different kinds of second type words, achieves comprehensive and complete identification of the context words in the text to be processed, and improves the generality of the text processing method.
Fig. 6 is a schematic flow chart of still another alternative text processing method provided in an embodiment of the present application, as shown in fig. 6, the method includes the following steps:
in step S601, the terminal acquires a text to be processed, and generates a text processing request according to the text to be processed.
In step S602, the terminal sends a text processing request to the server.
In step S603, the server performs first-type word recognition on the text to be processed to obtain at least one first-type word.
In step S604, the server performs, for each first type word, concatenation on the first type word and the text to be processed to form a spliced text.
Here, the text to be processed may be spliced after the first type word to form the spliced text.
In step S605, the server inputs the stitched text into a pre-trained MRC model.
And step S606, encoding the spliced text through an encoding module of the MRC model to obtain text word vectors corresponding to the first type words and the text to be processed.
In the embodiments of the present application, the MRC model is trained on sample words generated with a whole-word-mask generation scheme; that is, when training the MRC model, the sample words are first generated using whole-word masking and the model is then trained on them. Whole-word-mask sample generation means that, when generating a training sample, if one WordPiece sub-word of a complete word is masked, the other parts of the same word are masked as well, i.e., the whole word is masked. It should be noted that masking here is meant in the generalized sense and may be performed in any of the following ways: replacing the word to be masked with the MASK flag, keeping the original word unchanged, or randomly replacing the word to be masked with another word; it is not limited to replacement with the MASK tag.
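A minimal sketch of whole-word-mask sample generation, assuming the text is already grouped into words of WordPiece sub-tokens; for simplicity it always uses the `[MASK]` token (the text above notes that keeping the word or swapping in a random word are also valid masking moves), and the 15% rate follows common BERT practice rather than the patent:

```python
import random

def whole_word_mask(words, mask_prob=0.15, seed=0):
    """If any sub-token of a word is chosen for masking, mask every
    sub-token of that word -- the decision is made per whole word."""
    rng = random.Random(seed)  # seeded for a reproducible example
    out = []
    for sub_tokens in words:              # each word = list of sub-tokens
        if rng.random() < mask_prob:      # one draw per whole word
            out.extend(["[MASK]"] * len(sub_tokens))
        else:
            out.extend(sub_tokens)
    return out

# Hypothetical WordPiece grouping: "unbelievable" = un + ##believ + ##able.
words = [["un", "##believ", "##able"], ["story"]]
print(whole_word_mask(words, seed=1))
# ['[MASK]', '[MASK]', '[MASK]', 'story']
```

Note that all three sub-tokens of the first word are masked together; sub-word-level masking would have masked only one of them.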
In some embodiments, the MRC model includes a first decoding module that corresponds to a first decoding parameter, the first decoding parameter being a parameter for determining a confidence level for each word segment having a context relationship with the first type word; correspondingly, the method further comprises:
in step S607, the first decoding module decodes the text word vector based on the first decoding parameter trained in advance to obtain the confidence level of each word segment of the at least two word segments having the upper-lower relationship with the first type word.
In step S608, the server determines, from the at least two segmented words, a second type word corresponding to each first type word according to the confidence level.
In some embodiments, the MRC model includes a second decoding module that corresponds to a second decoding parameter that is a parameter for determining a start confidence level for each of the tokens as a start position of a second type of token and an end confidence level as an end position of the second type of token; correspondingly, the method further comprises:
step S609, through a second decoding module, performing upper-lower relation decoding processing on the text word vector based on the second decoding parameters obtained through pre-training, and obtaining a start confidence coefficient of each of at least two segmented words serving as a start position of the second type word and an end confidence coefficient of each segmented word serving as an end position of the second type word.
In step S610, the server determines a second type word from the at least two segmentation words according to the start confidence level and the end confidence level.
In some embodiments, step S610 may be implemented by:
in step S6101, the word segment with the start confidence level greater than the first threshold is determined as the start word.
In the embodiment of the application, at least one start word can be determined from the text to be processed.
In step S6102, the word segment with the ending confidence level greater than the second threshold is determined as the ending word.
In the embodiment of the application, at least one end word can be determined from the text to be processed.
In step S6103, a target start word and a target end word, which are adjacent in position and have start words before end words, are determined from all start words and all end words.
Here, the adjacent start word and end word constitute a pair of words, the pair of words includes only one start word and one end word, and the position of the start word in the text to be processed is located before the position of the end word in the text to be processed. The beginning word in each word pair is the target beginning word and the ending word is the target ending word. There are no other start words and no other end words between the target start word and the target end word.
In step S6104, in the text to be processed, a text segment between the target start word and the target end word is determined as the second type word.
In the embodiments of the present application, since there are no other start words and no other end words between the target start word and the target end word, the text segment between them constitutes a superordinate concept word or a subordinate entity word paired with the first type word.
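Steps S6101-S6104 can be sketched as a start/end pairing pass over the two confidence sequences; the thresholds, segments, and confidences below are illustrative, and English segments are again joined without spaces to mirror Chinese concatenation:

```python
def extract_spans(segments, start_conf, end_conf, t_start=0.5, t_end=0.5):
    """Pair each start word with the nearest end word at or after it,
    with no other start or end word in between (steps S6101-S6104)."""
    starts = [i for i, c in enumerate(start_conf) if c > t_start]  # S6101
    ends   = [i for i, c in enumerate(end_conf) if c > t_end]      # S6102
    spans = []
    for k, s in enumerate(starts):
        nxt = starts[k + 1] if k + 1 < len(starts) else len(segments)
        # S6103: candidate ends lie at/after this start and before the
        # next start, so no other start or end word falls inside the pair.
        cand = [e for e in ends if s <= e < nxt]
        if cand:
            e = min(cand)                  # nearest adjacent end word
            spans.append("".join(segments[s:e + 1]))  # S6104
    return spans

# Hypothetical example: start fires on "citrus", end fires on "fruit".
segs = ["a", "kind", "of", "citrus", "fruit", "juice"]
sc   = [0.1, 0.2,    0.1,  0.9,      0.1,     0.2]
ec   = [0.1, 0.1,    0.1,  0.2,      0.8,     0.3]
print(extract_spans(segs, sc, ec))  # ['citrusfruit']
```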
In step S611, the server associates the first type word with the second type word to obtain at least one context word pair corresponding to the text to be processed.
Step S612, the server sends the at least one identified context word pair to the terminal.
According to the text processing method described above, different decoding modules in the MRC model perform different decoding processes on the text word vector, so that different kinds of context word pairs can be recognized: word-level and phrase-level context word pairs in the text to be processed are each identified, achieving comprehensive and accurate recognition of the text to be processed.
In some embodiments, after identifying the text to be processed to obtain at least one context word pair, the method may further include:
Step S11, an information base is constructed using the context word pairs.
Here, when at least one context word pair is obtained and no information base has been constructed yet, an information base can be constructed and the pairs stored in it; when an information base already exists, constructing it means updating it. The update may proceed by checking the currently identified context word pairs against those already in the information base: when a currently identified pair does not already exist in the information base, i.e., it is a new word pair, it is stored in the information base; when the same pair already exists, i.e., it is an existing word pair, the information base is not updated.
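The update rule of step S11 is an insert-if-absent over word pairs; a minimal in-memory sketch using a set of (hypernym, hyponym) tuples (a production information base would live in cloud storage, per the text above, and the example pairs are illustrative):

```python
def update_info_base(info_base, identified_pairs):
    """Add only the context word pairs not already present (step S11).
    Returns the newly added pairs for inspection."""
    added = []
    for pair in identified_pairs:          # (hypernym, hyponym) tuples
        if pair not in info_base:          # new word pair: store it
            info_base.add(pair)
            added.append(pair)
        # existing word pair: the information base is not updated
    return added

info_base = {("fruit", "apple")}
new = update_info_base(info_base, [("fruit", "apple"), ("fruit", "pear")])
print(new)  # [('fruit', 'pear')]
```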
Step S12, a concept graph of the target domain is constructed according to the context word pairs in the information base.
In the embodiments of the present application, the domain of the superordinate concept word or the subordinate entity word in a context word pair can be classified according to its content to determine the target domain to which the pair belongs; alternatively, text domain identification can be performed on the text to be processed, and the resulting domain of the text is taken as the target domain of the identified context word pairs.
The concept graph may be a graph formed from upper concept words and lower entity words. The concept graph includes a plurality of nodes; two nodes having a hypernym-hyponym relationship are connected by an edge, and the edge represents that relationship between the two connected nodes. Each node corresponds to a word: one node corresponds to an upper concept word, another node corresponds to a lower entity word, and the upper concept word and the lower entity word form an upper and lower word pair.
In some embodiments, as upper and lower word pairs continuously accumulate, the concept graph may be updated continuously, that is, updated according to the newly identified upper and lower word pairs. In the embodiment of the present application, the constructed information base may be obtained and the concept graph of the target field constructed according to the upper and lower word pairs in the information base, so that when an update to the information base is detected, the concept graph can be dynamically updated according to the updated upper and lower word pairs in the information base.
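As one illustrative sketch (not the embodiment's actual implementation), the concept graph can be kept as an adjacency mapping from each upper concept word to its lower entity words, and incrementally updated whenever new pairs appear in the information base:

```python
def build_concept_graph(pairs):
    """Build a concept graph from (upper concept word, lower entity word) pairs.

    Nodes are words; an edge concept -> entity represents the
    hypernym-hyponym relationship between the two connected nodes.
    """
    graph = {}
    for concept, entity in pairs:
        graph.setdefault(concept, set()).add(entity)
    return graph


def update_concept_graph(graph, new_pairs):
    """Dynamically update the graph with newly identified pairs."""
    for concept, entity in new_pairs:
        graph.setdefault(concept, set()).add(entity)
    return graph


graph = build_concept_graph([("TV drama", "drama A"), ("TV drama", "drama B")])
update_concept_graph(graph, [("mobile game", "xx game")])
assert graph["TV drama"] == {"drama A", "drama B"}
assert graph["mobile game"] == {"xx game"}
```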
In step S13, a received information search request is responded to based on the concept graph, to obtain a response result.
Here, the information search request may be a search request issued by any search application, for example, an information search request entered in search software, a commodity search request entered in shopping software, or a disease query request entered in a medical query application.
In one implementation, the concept graph of the embodiment of the present application may be constructed autonomously by the search application: the search application may identify the upper and lower word pairs by using the text processing method of the embodiment of the present application and then construct the concept graph based on those pairs, or the search application may request a plurality of upper and lower word pairs identified by a text processing application and then construct the concept graph based on them. In another implementation, the concept graph of the embodiment of the present application may already be constructed by another text processing application, and the search application may request, acquire and use that concept graph when responding to the information search request.
The embodiment of the present application may be applied to at least any one of the following scenarios:
Scene one: search association word recommendation when understanding an input question. For example, after at least one upper and lower word pair is identified by the method of the embodiment of the present application, an information base storing a plurality of upper and lower word pairs may be constructed from those pairs. When a user performs a question search through a search application and has entered part of a question, text recognition and understanding may be performed on the partial question to determine the entity words in it, and the corresponding concept words may be matched from the information base according to those entity words; alternatively, the concept words in the partial question may be determined, and the corresponding entity words may be matched from the information base according to those concept words. Then, a complete question is generated according to the concept words, the entity words, or both; that is, association words are obtained, and the complete question is generated based on the association words.
Scene two: generating concept-class candidate search box words. For example, in any search application, a plurality of operable items may be provided on the search interface, and when the user clicks any operable item, a plurality of entity contents corresponding to that operable item may be displayed or acquired in response to the user operation, as the result selected by the operation. In this scenario, the information base may be constructed from the upper and lower word pairs identified by the method of the embodiment of the present application. Generating the plurality of entity contents corresponding to each operable item may then be implemented based on the constructed information base: the operable items may be concept words of upper and lower word pairs in the information base; after the concept words are determined, at least one entity word corresponding to each concept word may be pulled from the information base, and the information corresponding to the pulled entity words is used as the content under the operable item corresponding to that concept word.
Scene three: a question-answering system based on a knowledge graph. For example, after at least one upper and lower word pair is identified by the method of the embodiment of the present application, an information base storing a plurality of upper and lower word pairs may be constructed, and a concept graph may be constructed based on the upper and lower word pairs in the information base. In the concept graph, the upper concept words and the lower entity words serve as nodes, and an edge between a concept word and an entity word represents the relationship between them, so that one upper concept word is connected to a plurality of lower entity words. In the question-answering system, the user's questions can be answered based on the concept graph; that is, answers to the user's questions can be generated from the concept graph based on the relationships between the concept words and the entity words.
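A minimal sketch of how such a question-answering system might enrich a concrete entity answer with the upper concepts connected to it in the concept graph (function name and data here are hypothetical):

```python
def answer_with_concepts(entity_answer, concept_graph):
    """Given a concrete entity answer, attach the upper concepts connected to
    it in the concept graph to produce a more abstract, human-like answer."""
    concepts = [c for c, entities in concept_graph.items() if entity_answer in entities]
    if not concepts:
        return entity_answer  # no concept node connected: fall back to the entity
    return f"{entity_answer} ({', '.join(sorted(concepts))})"


# Usage: the bare entity answer "X city" is extended with its upper concept.
graph = {"beautiful seaside city": {"X city"}}
assert answer_with_concepts("X city", graph) == "X city (beautiful seaside city)"
```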
In the following, an exemplary application of the embodiment of the present application in a practical application scenario is described. The embodiment of the present application provides a text processing method. Because the task faced by this text processing is, broadly speaking, a relation extraction task, an existing relation extraction model could in principle be reused for it; however, because general relation extraction models have the following limitations, a dedicated model is designed in view of the characteristics of the concept extraction task to realize recognition of the text to be processed:
1) The relation extraction task extracts relationships between entities; the extracted head entity and tail entity are homogeneous (that is, both are concrete entities), whereas the entity and the concept in the concept extraction task (that is, the task of extracting upper and lower word pairs formed by a concept and an entity) are heterogeneous (that is, one is a concrete entity and the other is a generalized concept). The entity is concrete while the concept is abstract and carries generalized meaning, so a tailored extraction paradigm needs to be designed that distinguishes the difference between the two. 2) In the concept extraction task, only the relationship between entity and concept needs to be considered, not the relationships between concepts or between entities; a reused relation extraction model cannot avoid extracting these two types of relationships (namely, concept-concept and entity-entity relationships), so a more reasonable extraction mode needs to be considered when designing the model, such as first extracting the entity and then extracting the upper concept corresponding to that entity.
The text processing method provided in the embodiment of the present application can simultaneously solve the following problems: 1) the knowledge of the pre-trained language model BERT can be utilized to jointly extract the upper and lower word pairs in a text; 2) the situation where multiple upper and lower word pairs exist in a text can be handled; 3) compared with a relation classification method based on the pre-trained model BERT, the time cost of the embodiment of the present application is lower: entities and concepts do not need to be extracted in advance, and all upper and lower word pairs having a hypernym-hyponym relationship in a text can be extracted directly from plain text; 4) the model architecture of the text processing method is universal and applicable to other information extraction tasks, such as coreference resolution, event extraction and entity linking; 5) the model of the text processing method in the embodiment of the present application also supports various service scenarios and application products, which are illustrated below.
The text processing method of the embodiment of the application can be at least applied to the following products:
1) Search association word recommendation for query understanding. For a searched query text, more search association word recommendations with abstract meaning can be obtained by identifying the entities in the query text and searching the concept graph constructed by the concept hypernym-hyponym extraction method. Fig. 7 shows a product interface diagram of search association word recommendation provided in the embodiment of the present application: when a user inputs XX school in the search box 701, the entity in the input content can be identified, and the constructed concept graph can be searched using the upper and lower words obtained by the text processing method provided in the embodiment of the present application, to obtain more search association word recommendations 702 with abstract meaning.
2) Generating concept-class candidate search box words. All lower entities attached under a concept upper word can be obtained through the concept hypernym-hyponym extraction method, and the concept words can be placed as search cards on a browser homepage to serve as search recommendations. Fig. 8 shows a product interface diagram of concept-class candidate search box word generation provided in the embodiment of the present application: a TV drama ranking list may contain different types of dramas under top-level concepts; for example, under the upper concept of high-rated period suspense dramas, there are the specific names of several dramas, which are lower entities of that upper concept. When the TV drama ranking list is formed, the lower entities corresponding to the upper concept of high-rated period suspense dramas can be generated based on the text processing method provided in the embodiment of the present application to form upper and lower word pairs, and the ranking list is then organized according to those pairs.
3) A question-answering system based on a knowledge graph. A traditional knowledge-graph-based question-answering system is built on a graph of entities and entity relationships, and the answer generated for a question is usually the text corresponding to one specific entity. However, human answers generally carry abstract meaning. Therefore, using the text processing method of the embodiment of the present application to identify upper and lower word pairs, constructing a concept graph based on the identified pairs, and then generating the text that answers the question according to the concept graph better matches the logic of human answers. For example, consider the question "Where is the hometown of Xiao Wang's wife?". If question-answer matching is performed without a concept graph constructed based on the method of the embodiment of the present application, the result may be "X city"; with the method of the embodiment of the present application, answering based on the constructed concept graph, the result may be "X city, a beautiful seaside city". That is, when the method of the embodiment of the present application is used for question-answer matching based on a knowledge graph, not only a specific entity but also at least one concept can be returned, that is, a more abstract answer can be obtained, which clearly matches human language logic. Fig. 9 is a schematic diagram of the question-answer matching process of the knowledge-graph-based question-answering system provided in the embodiment of the present application: first, semantic parsing 902 is performed on the input question 901, "Where is the hometown of Xiao Wang's wife?", to obtain a semantic representation 903; then semantic matching, querying and reasoning 904 are performed on the obtained semantic representation 903 based on a pre-constructed knowledge base 905, to obtain an answer 906.
Here, the concept graph serves as the knowledge base constructed in advance.
The text processing method of the embodiment of the present application extracts all possible upper and lower word pairs in a text directly from plain text. Fig. 10 is a flowchart of a text processing system implementing the text processing method provided in the embodiment of the present application; as shown in fig. 10, the whole text processing process includes a data labeling stage 1001, a model training stage 1002 and a model reasoning stage 1003.
In the data labeling stage 1001, the unlabeled data 1004 is labeled with entities, concepts and the "is-a" relationship 1005 through a combination of automatic and manual labeling, so as to obtain the training data 1006 (i.e., the training samples).
In the model training stage 1002, question-plus-text format data 1007 is constructed using the training data 1006, and the question-plus-text format data 1007 is input into the MRC model 1008 provided in the embodiment of the present application for prediction, thereby predicting the span label 1009 of a concept (a span here refers to a contiguous segment of the text).
In the model reasoning stage 1003, entity extraction 1011 is performed on the test data 1010 to obtain a lower entity set 1012, where the lower entity set 1012 includes at least one lower entity word. Question-plus-text format data 1013 is constructed based on the lower entity set 1012 and input into the MRC model 1014 for prediction, so that the span label 1015 of the concept corresponding to each lower entity word is predicted; then, the upper and lower triples 1016 are obtained based on the identified lower entity set 1012 and the span labels 1015 of the concepts corresponding to the lower entity words.
Fig. 11 is a schematic diagram of three scenarios provided in the embodiment of the present application; as shown in fig. 11, the method provided in the embodiment of the present application can address concept hypernym-hyponym extraction in the following three scenarios:
1) Many-to-Many scene 1101 (Many-to-Many): given plain text, all upper and lower word pairs having a hypernym-hyponym relationship in the plain text are extracted through the MRC model 1104. This scenario is used to mine all possible generic upper and lower word pairs in the text.
2) One-to-Many scene 1102 (One-to-Many): the scenario gives a subordinate entity and the text in which the entity exists, and all the superordinate concepts of the subordinate entity in the text are extracted from the text by the MRC model 1104. The scene is used for mining the upper concept of a specific entity in the text, and can expand the concept set corresponding to the entity.
3) Many-to-One scene 1103 (Many-to-One): given an upper concept and the text in which the concept occurs, all lower entities to be attached under that upper concept are extracted from the text through the MRC model 1104. This scenario is used to mine the lower entities of a specific concept in the text, and can expand the lower entity set corresponding to the concept.
The following describes an entity extraction process in the text processing method provided in the embodiment of the present application. In this embodiment of the present application, the entity referred to represents an entity word, and the concept referred to represents a concept word.
The embodiment of the present application first identifies all possible entities from the text, which may be realized using a sequence labeling tool and its corresponding deep learning NER (named entity recognition) approach, providing stronger generalization capability. For example, given the text "The xx game is a puzzle-type mobile game under T company, and was also the most popular game at the time.", two named entities can be identified by the sequence labeling tool: "xx game" and "T company".
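Sequence labeling output is commonly decoded into entities with standard BIO tags; the sketch below is a generic illustration of that decoding step, not the in-house tool mentioned above:

```python
def decode_bio(tokens, tags):
    """Collect entities from per-token BIO tags
    (B-*: entity start, I-*: inside an entity, O: outside)."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # O tag (or stray I-) closes any open entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


tokens = ["xx", "game", "is", "a", "mobile", "game", "under", "T", "company"]
tags = ["B-ENT", "I-ENT", "O", "O", "O", "O", "O", "B-ENT", "I-ENT"]
assert decode_bio(tokens, tags) == ["xx game", "T company"]
```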
The following describes a concept extraction process of an entity in the text processing method provided in the embodiment of the present application.
In the embodiment of the present application, after all entities are extracted from the text, each entity may have a corresponding upper concept in the text. So for each entity, a machine reading comprehension model (i.e., the MRC model) can be used: the entity is taken as the question, the text is taken as the context, and the upper concept corresponding to the entity can be extracted from the text. The concept extraction flow for entities is shown in fig. 12: the MRC model 120 includes a Chinese BERT model 1201 based on whole word masking (Chinese-BERT-wwm) and a multilayer perceptron 1202 (MLP, Multilayer Perceptron). The entity 1203 and the text 1204 are input into the MRC model 120; that is, the entity is taken as the question (query) and the text as the context, the question and the context are spliced together as the input text of the MRC model 120, and the confidence 1205 that each word in the text 1204 is part of the upper concept of the entity 1203 can be obtained.
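The question-plus-context splicing might be sketched as below; the character-level tokenization and [CLS]/[SEP] layout follow common BERT conventions, not necessarily the embodiment's actual tokenizer, and the scoring model itself is omitted:

```python
def build_mrc_input(entity, text):
    """Splice the entity (as the question) and the text (as the context)
    into a single BERT-style input sequence: [CLS] query [SEP] context [SEP]."""
    query_tokens = list(entity)    # character-level tokens, common for Chinese BERT
    context_tokens = list(text)
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment ids distinguish the query (0) from the context (1),
    # as in BERT's two-sentence input format.
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, segment_ids


tokens, segments = build_mrc_input("ab", "cde")
assert tokens == ["[CLS]", "a", "b", "[SEP]", "c", "d", "e", "[SEP]"]
assert segments == [0, 0, 0, 0, 1, 1, 1, 1]
```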
In the MRC model 120, Chinese-BERT-wwm is used as the encoder of the MRC model, so that the knowledge learned by the pre-trained language model during the pre-training stage can be utilized effectively. Fig. 13 is a schematic structural diagram of the encoder of the MRC model provided in the embodiment of the present application. In the pre-training stage 131 (Pre-training), the input is an input sentence A and an input sentence B; the two sentences are input into the BERT model, and word segmentation and feature extraction are performed on them to obtain input vectors. The BERT model adds several special tokens: the [CLS] token is placed at the beginning of the first sentence, and the representation vector C obtained through the BERT model can be used for subsequent classification tasks; the [SEP] token is used to separate two input sentences, such as input sentences A and B, with a [SEP] token appended after each of them; the [MASK] token is used to mask some words in a sentence, and after masking words with [MASK], the corresponding vector output by the BERT model is used to predict what those words are. After the BERT model obtains the input sentence, the words of the sentence are converted into embedding vectors, denoted by E. The fine-tuning stage 132 (Fine-Tuning) is performed later for downstream tasks, such as a question-answering system; the BERT model can be fine-tuned on different tasks without adjusting its structure.
In the embodiment of the present application, a linear layer (i.e., the MLP) may be used to map the representation of each word obtained by the encoder into a binary classification space, and a binary multi-label classification loss function is used to determine, for each word in the context of the input text, whether that word belongs to part of some upper concept.
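The per-token classification described here can be sketched as a linear layer followed by a sigmoid, giving each context token an independent probability of belonging to an upper concept; the weights and dimensions below are purely illustrative:

```python
import math

def token_concept_probs(hidden_states, weights, bias):
    """Map each token's encoder representation to a probability that the token
    is part of an upper concept (linear layer + sigmoid, one score per token)."""
    probs = []
    for h in hidden_states:
        logit = sum(w * x for w, x in zip(weights, h)) + bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return probs


# Two tokens with 3-dim representations; the first scores high, the second low.
hidden = [[2.0, 1.0, 0.0], [-2.0, -1.0, 0.0]]
probs = token_concept_probs(hidden, weights=[1.0, 1.0, 1.0], bias=0.0)
assert probs[0] > 0.9 and probs[1] < 0.1
```

Because each token gets its own sigmoid score rather than competing in a softmax, several disjoint spans can exceed the decision threshold at once, which is what allows multiple upper concepts per entity.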
Finally, in the reasoning stage, the prediction result is decoded using a specific threshold obtained after parameter tuning: all runs of consecutive words in the context whose output scores are greater than the threshold can be taken as upper concepts corresponding to the lower entity (the query). Since there may be multiple such consecutive segments, in the embodiment of the present application multiple upper concepts corresponding to one entity may be extracted. The model of the embodiment of the present application also covers the case where a lower entity has no upper concept in the text, namely the case where the output scores corresponding to all words are smaller than the threshold.
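The threshold decoding described above (every maximal run of consecutive tokens scoring above the tuned threshold becomes an upper concept; no such run means the entity has no upper concept in the text) can be sketched as:

```python
def decode_concept_spans(tokens, scores, threshold):
    """Return every maximal run of consecutive tokens scoring above threshold."""
    spans, current = [], []
    for token, score in zip(tokens, scores):
        if score > threshold:
            current.append(token)
        elif current:  # run ends: emit the accumulated span
            spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


tokens = ["m", "o", "b", "x", "g", "a"]
scores = [0.9, 0.8, 0.7, 0.1, 0.95, 0.85]
# Two disjoint runs clear the threshold, so two upper concepts are extracted.
assert decode_concept_spans(tokens, scores, 0.5) == ["mob", "ga"]
# When no score clears the threshold, the entity has no upper concept in the text.
assert decode_concept_spans(tokens, scores, 0.99) == []
```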
In the embodiment of the present application, since zero or more concepts corresponding to each entity identified in the text are extracted, the many-to-many and one-to-many scenarios can be handled; for the many-to-one scenario, it suffices to change the query of the MRC model to an upper concept and change the decoding result to lower entities.
The embodiment of the present application provides a method of identifying the entity first and then identifying the concept corresponding to that entity, which effectively avoids extracting entity-entity and concept-concept relationships and thus better matches the task goal of extracting upper and lower word pairs. In addition, the method can effectively solve the word pair recognition problem in the three major business scenarios of many-to-many, one-to-many and many-to-one. Table 1 below shows the experimental results of recognizing upper and lower word pairs using the method of the embodiment of the present application, where the test set is real service data. As can be seen from table 1, the verification set consisting of 500 samples includes 802 true positive sample words (TP, true positive), 63 false positive sample words (FP, false positive) and 289 false negative sample words (FN, false negative); after recognition on the verification set, the precision P is 92.7%, the recall R is 73.5% and the F-measure F1 is 82.0%. The test set consisting of 100 samples includes 166 true positive sample words, 8 false positive sample words and 87 false negative sample words; after recognition on the test set, the precision P is 95.4%, the recall R is 65.6% and the F-measure F1 is 77.8%. Clearly, the precision on real service data is high.
TABLE 1

Data set           Samples   TP    FP   FN    Precision P   Recall R   F1
Verification set   500       802   63   289   92.7%         73.5%      82.0%
Test set           100       166   8    87    95.4%         65.6%      77.8%
In addition, the text processing module provided in the embodiment of the present application has also extracted 3.4 million de-duplicated upper and lower word pairs. Fig. 14 is a schematic diagram of cases predicted by the method of the embodiment of the present application, from which it can be seen that the text processing module of the embodiment of the present application can handle many-to-many and one-to-many scenarios at the same time. In addition, the architecture of first extracting the entity and then extracting the upper concept corresponding to the entity provided in the embodiment of the present application is universal and can be used for other information extraction tasks, such as coreference resolution, event extraction and entity linking.
It should be noted that the entity extraction process of the embodiment of the present application uses an in-house sequence labeling tool, which has strong entity recognition capability for short texts in the general field and may also be applied to entity recognition for long texts in different fields. However, if the method of the embodiment of the present application is used in other fields (such as medicine or finance) or on other types of data (such as information flow data), the entity extraction model needs to be adapted accordingly, for example by training a vertical-domain model on vertical-domain data to perform entity extraction. In addition, the decoder of the MRC model may be a more complex decoder; for example, two linear layers may be used to determine whether each word is a start position (start) or an end position (end) of a concept, and finally all adjacent start and end positions are paired to obtain the final upper concept recognition result.
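The more complex start/end decoder mentioned above (two linear layers scoring each token as a start or end position, then pairing each start with the nearest following end) might be sketched like this; the thresholds and scores are illustrative inputs, with the scoring layers themselves omitted:

```python
def decode_start_end(tokens, start_scores, end_scores, t_start, t_end):
    """Pair each above-threshold start position with the nearest
    above-threshold end position at or after it, yielding concept spans."""
    starts = [i for i, s in enumerate(start_scores) if s > t_start]
    ends = [i for i, s in enumerate(end_scores) if s > t_end]
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]
        if following:
            e = min(following)  # nearest end position after this start
            spans.append("".join(tokens[s:e + 1]))
    return spans


tokens = list("abcdefg")
start_scores = [0.9, 0, 0, 0.8, 0, 0, 0]
end_scores = [0, 0.9, 0, 0, 0.7, 0, 0]
# Starts at positions 0 and 3 pair with ends at 1 and 4 respectively.
assert decode_start_end(tokens, start_scores, end_scores, 0.5, 0.5) == ["ab", "de"]
```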
It will be appreciated that in embodiments of the present application, the content of the user information, such as the text to be processed entered by the user or the text to be processed associated with the user information, needs to be licensed or agreed upon by the user when the embodiments of the present application are applied to specific products or technologies, and the collection, use and processing of the relevant data needs to comply with relevant laws and regulations and standards of the relevant country and region.
Continuing with the exemplary structure of the text processing device 354 provided in the embodiment of the present application implemented as software modules, in some embodiments, as shown in fig. 3, the text processing device 354 includes:
the recognition module 3541 is configured to perform first-type word recognition on the text to be processed to obtain at least one first-type word;
the encoding processing module 3542 is configured to encode, for each of the first type words, the first type word and the text to be processed to obtain text word vectors corresponding to the first type word and the text to be processed; the text to be processed comprises at least two segmentation words;
the decoding processing module 3543 is configured to perform a context decoding process on the text word vector, so as to obtain a confidence level of a context between each of the at least two segmented words and the first type word;
A determining module 3544, configured to determine, according to the confidence level, a second type word corresponding to each of the first type words from the at least two segmented words;
and the association module 3545 is configured to associate the first type word with the second type word, so as to obtain at least one context word pair corresponding to the text to be processed.
In some embodiments, the encoding processing module is further to: extracting features of the first type word through an encoder to obtain a first word vector corresponding to the first type word; performing word segmentation processing on the text to be processed to obtain at least two segmented words; extracting features of each word segment through the encoder to obtain a second word vector corresponding to each word segment; all second word vectors corresponding to the at least two word segments form text vectors of the text to be processed, and the first word vectors and the text vectors form the text word vectors.
In some embodiments, the decoding processing module is further configured to: perform classification mapping on the second word vector corresponding to each word segment based on the first word vector through a linear layer in a decoder, to obtain a classification result of the second word vector of each word segment; and perform the decoding processing on the classification result, to obtain the confidence that each of the at least two word segments has an upper-lower relationship with the first type word.
In some embodiments, when the first type word is a subordinate entity word, the second type word is a superordinate concept word; when the first type word is an upper concept word, the second type word is a lower entity word; the classification result is used for representing a part of the upper concept words of the word segment belonging to the first type word, or the classification result is used for representing a part of the lower entity words of the word segment belonging to the first type word.
In some embodiments, the determination module is further configured to at least one of: determining the word segmentation with the confidence coefficient larger than a confidence coefficient threshold value as the second type word; and determining text fragments corresponding to the continuous multiple segmented words as the second type words when the positions of the multiple segmented words with the confidence degrees larger than a confidence degree threshold value in the text to be processed are continuous.
In some embodiments, the encoding processing module is further to: splicing the first type word and the text to be processed to form a spliced text; inputting the spliced text into a pre-trained MRC model, and carrying out coding processing on the spliced text through a coding module of the MRC model to obtain text word vectors corresponding to the first type words and the text to be processed; the MRC model is trained by adopting sample words generated in a sample word generation mode based on whole word masking.
In some embodiments, the MRC model includes a first decoding module; the decoding processing module is further configured to: and performing upper-lower relation decoding processing on the text word vector based on a first decoding parameter obtained through pre-training by the first decoding module to obtain the confidence degree of each word in the at least two words and the first type word with the upper-lower relation.
In some embodiments, the MRC model includes a second decoding module; the decoding processing module is further configured to: performing upper-lower relation decoding processing on the text word vector based on a second decoding parameter obtained through pre-training by the second decoding module to obtain a start confidence coefficient of each word of the at least two words serving as a start position of the second type word and an end confidence coefficient of each word of the at least two words serving as an end position of the second type word; the determining module is further configured to: and determining the second type word from the at least two word segments according to the start confidence and the end confidence.
In some embodiments, the determining module is further to: determining the word segmentation with the start confidence coefficient larger than a first threshold value as a start word; determining the word segmentation with the ending confidence coefficient larger than a second threshold value as an ending word; determining target start words and target end words which are adjacent in position and located in front of the end words in all the start words and all the end words; and determining a text segment between the target beginning word and the target ending word as the second type word in the text to be processed.
In some embodiments, the apparatus further comprises: the processing module is used for constructing an information base by adopting the upper and lower word pairs; constructing a conceptual diagram of the target field according to the upper and lower word pairs in the information base; and responding to the received information search request based on the conceptual diagram to obtain a response result.
It should be noted that, the description of the apparatus in the embodiment of the present application is similar to the description of the embodiment of the method described above, and has similar beneficial effects as the embodiment of the method, so that a detailed description is omitted. For technical details not disclosed in the embodiments of the present apparatus, please refer to the description of the embodiments of the method of the present application for understanding.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method according to the embodiment of the present application.
Embodiments of the present application provide a storage medium storing executable instructions that, when executed by a processor, cause the processor to perform a method provided by the embodiments of the present application, for example, the method shown in FIG. 4.
In some embodiments, the storage medium may be a computer-readable storage medium, such as a Ferroelectric Random Access Memory (FRAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a flash memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); or may be any device including one of, or any combination of, the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). As an example, executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or distributed across multiple sites and interconnected by a communication network.
The foregoing describes merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, and improvements made within the spirit and scope of the present application are intended to be included within the scope of the present application.

Claims (13)

1. A method of text processing, the method comprising:
performing first type word recognition on a text to be processed to obtain at least one first type word;
for each first type word, encoding the first type word and the text to be processed to obtain a text word vector corresponding to the first type word and the text to be processed, wherein the text to be processed comprises at least two segmented words;
performing hypernym-hyponym relation decoding on the text word vector to obtain a confidence that each of the at least two segmented words has a hypernym-hyponym relation with the first type word;
determining, from the at least two segmented words according to the confidences, a second type word corresponding to each first type word; and
associating the first type word with the second type word to obtain at least one hypernym-hyponym word pair corresponding to the text to be processed.
2. The method of claim 1, wherein encoding the first type word and the text to be processed to obtain the text word vector corresponding to the first type word and the text to be processed comprises:
extracting features of the first type word through an encoder to obtain a first word vector corresponding to the first type word;
performing word segmentation processing on the text to be processed to obtain at least two segmented words;
extracting features of each segmented word through the encoder to obtain a second word vector corresponding to each segmented word, wherein all the second word vectors corresponding to the at least two segmented words form a text vector of the text to be processed, and the first word vector and the text vector form the text word vector.
3. The method of claim 2, wherein performing the hypernym-hyponym relation decoding on the text word vector to obtain the confidence that each of the at least two segmented words has a hypernym-hyponym relation with the first type word comprises:
performing, through a linear layer in a decoder, classification mapping on the second word vector corresponding to each segmented word based on the first word vector, to obtain a classification result for the second word vector of each segmented word; and
performing decoding processing on the classification results to obtain the confidence that each of the at least two segmented words has a hypernym-hyponym relation with the first type word.
4. The method of claim 3, wherein:
when the first type word is a hyponym entity word, the second type word is a hypernym concept word;
when the first type word is a hypernym concept word, the second type word is a hyponym entity word; and
the classification result indicates that the segmented word belongs to part of a hypernym concept word of the first type word, or indicates that the segmented word belongs to part of a hyponym entity word of the first type word.
5. The method of claim 1, wherein determining, from the at least two segmented words according to the confidences, the second type word corresponding to each first type word comprises at least one of:
determining a segmented word whose confidence is greater than a confidence threshold as the second type word; and
when the positions in the text to be processed of a plurality of segmented words whose confidences are greater than the confidence threshold are consecutive, determining a text segment corresponding to the consecutive plurality of segmented words as the second type word.
6. The method of claim 1, wherein encoding the first type word and the text to be processed to obtain the text word vector corresponding to the first type word and the text to be processed comprises:
splicing the first type word and the text to be processed to form a spliced text;
inputting the spliced text into a pre-trained machine reading comprehension (MRC) model; and
encoding the spliced text through an encoding module of the MRC model to obtain the text word vector corresponding to the first type word and the text to be processed,
wherein the MRC model is trained with sample words generated in a sample word generation manner based on whole word masking.
7. The method of claim 6, wherein the MRC model comprises a first decoding module, and performing the hypernym-hyponym relation decoding on the text word vector to obtain the confidence that each of the at least two segmented words has a hypernym-hyponym relation with the first type word comprises:
performing, through the first decoding module, hypernym-hyponym relation decoding on the text word vector based on first decoding parameters obtained through pre-training, to obtain the confidence that each of the at least two segmented words has a hypernym-hyponym relation with the first type word.
8. The method of claim 6, wherein the MRC model comprises a second decoding module;
performing the hypernym-hyponym relation decoding on the text word vector comprises: performing, through the second decoding module, hypernym-hyponym relation decoding on the text word vector based on second decoding parameters obtained through pre-training, to obtain a start confidence that each of the at least two segmented words is a start position of the second type word and an end confidence that each of the at least two segmented words is an end position of the second type word; and
determining, from the at least two segmented words according to the confidences, the second type word corresponding to each first type word comprises: determining the second type word from the at least two segmented words according to the start confidences and the end confidences.
9. The method of claim 8, wherein determining the second type word from the at least two segmented words according to the start confidences and the end confidences comprises:
determining a segmented word whose start confidence is greater than a first threshold as a start word;
determining a segmented word whose end confidence is greater than a second threshold as an end word;
determining, among all the start words and all the end words, a target start word and a target end word that are adjacent in position, the target start word being located before the target end word; and
determining a text segment between the target start word and the target end word in the text to be processed as the second type word.
10. The method according to any one of claims 1 to 9, further comprising:
constructing an information base from the hypernym-hyponym word pairs;
constructing a concept graph of a target field according to the hypernym-hyponym word pairs in the information base; and
responding to a received information search request based on the concept graph to obtain a response result.
11. A text processing apparatus, the apparatus comprising:
a recognition module configured to perform first type word recognition on a text to be processed to obtain at least one first type word;
an encoding processing module configured to, for each first type word, encode the first type word and the text to be processed to obtain a text word vector corresponding to the first type word and the text to be processed, wherein the text to be processed comprises at least two segmented words;
a decoding processing module configured to perform hypernym-hyponym relation decoding on the text word vector to obtain a confidence that each of the at least two segmented words has a hypernym-hyponym relation with the first type word;
a determining module configured to determine, from the at least two segmented words according to the confidences, a second type word corresponding to each first type word; and
an association module configured to associate the first type word with the second type word to obtain at least one hypernym-hyponym word pair corresponding to the text to be processed.
12. A text processing apparatus, comprising:
a memory for storing executable instructions; and a processor for implementing the text processing method of any one of claims 1 to 10 when executing the executable instructions stored in the memory.
13. A computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to implement the text processing method of any one of claims 1 to 10.
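As an informal illustration of the selection step recited in claim 5 (not part of the claims themselves), merging positionally consecutive above-threshold segmented words into one text segment could be sketched as follows; the function name and threshold value are assumptions.

```python
def select_second_type_words(tokens, confidences, threshold=0.5):
    """Merge runs of consecutive above-threshold segmented words into text segments."""
    segments, current = [], []
    for token, conf in zip(tokens, confidences):
        if conf > threshold:
            current.append(token)  # extend the current consecutive run
        elif current:
            segments.append("".join(current))  # close the run at a below-threshold word
            current = []
    if current:  # flush a run that reaches the end of the text
        segments.append("".join(current))
    return segments
```

For example, tokens ["x", "y", "z", "w"] with confidences [0.9, 0.8, 0.1, 0.7] would produce the segments "xy" and "w".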
CN202210033962.1A 2022-01-12 2022-01-12 Text processing method, device, equipment and computer readable storage medium Pending CN116467405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210033962.1A CN116467405A (en) 2022-01-12 2022-01-12 Text processing method, device, equipment and computer readable storage medium


Publications (1)

Publication Number Publication Date
CN116467405A true CN116467405A (en) 2023-07-21

Family

ID=87172281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210033962.1A Pending CN116467405A (en) 2022-01-12 2022-01-12 Text processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116467405A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40090197
Country of ref document: HK