CN113743115A

CN113743115A - Text processing method and device, electronic equipment and storage medium

Info

Publication number: CN113743115A
Application number: CN202111041964.7A
Authority: CN
Inventors: 李薛; 陈旭涛
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2021-09-07
Filing date: 2021-09-07
Publication date: 2021-12-03

Abstract

The invention discloses a text processing method and apparatus, an electronic device, and a storage medium, and relates to the field of computer technologies. One embodiment of the method comprises: acquiring a preset word bank, performing sequence labeling on each text in a knowledge base, and calculating a root bank; screening sentences that include a root from the knowledge base, and segmenting each sentence to determine the word segmentation combinations corresponding to each sentence; splicing the word segmentation combinations that include the same root, based on the business scenario to which the sentence corresponding to each word segmentation combination belongs and a preset business scenario association order, to obtain splicing results; and calculating an effective score for each splicing result, determining business entities from the splicing results to obtain a business entity library, and identifying business entities in a received text to be processed. This embodiment addresses the problem that manual labeling is hard to standardize, is time-consuming, and lowers the accuracy of business entity labeling, which in turn lowers the accuracy of business entity recognition in text.

Description

Text processing method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for text processing, an electronic device, and a storage medium.

Background

In the e-commerce field, the after-sales stage plays an important role in user experience and user retention. At present, in after-sales consultation scenarios, many processing flows rely on artificial intelligence and generally involve text processing, such as identifying business entities in text. Because a business entity is usually tied to a business scenario and is often not universal, the prior art identifies business entities in text by manually segmenting the corpus of the relevant field, labeling each business entity according to business understanding and the definition of a business entity, and then extracting business entities from text based on the labeled entities. However, manual labeling is hard to standardize: it takes considerable time and lowers the accuracy of business entity labeling, which in turn lowers the accuracy of business entity recognition in text.

Disclosure of Invention

In view of this, embodiments of the present invention provide a text processing method and apparatus, an electronic device, and a storage medium, which can solve the problem that manual labeling is hard to standardize, is time-consuming, and lowers the accuracy of business entity labeling, thereby lowering the accuracy of business entity recognition in text.

To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of text processing.

The text processing method of the embodiment of the invention comprises the following steps: acquiring a preset word bank, performing sequence labeling on each text in a knowledge base, inputting the labeled texts into a preset recognition model, and calculating a root bank; for each root in the root bank, screening sentences that include the root from the knowledge base, segmenting each sentence, and combining the segmentation results to determine the word segmentation combinations corresponding to each sentence, wherein each word segmentation combination includes the root; splicing the word segmentation combinations that include the same root, based on the business scenario to which the sentence corresponding to each word segmentation combination belongs and a preset business scenario association order, to obtain splicing results; calling a preset calculation model, calculating the effective score of each splicing result, and determining business entities from the splicing results based on the effective scores to obtain a business entity library; and identifying the business entities in a received text to be processed based on the business entity library.

In one embodiment, segmenting each sentence, and combining the segmentation results to determine a segmentation combination corresponding to each sentence, includes:

for each sentence, segmenting the sentence, taking the root as a starting word, taking each segmented word located after the root in the sentence in turn as an ending word, and determining a word segmentation combination from the sentence based on the starting word and the ending word.

In another embodiment, splicing the word segmentation combinations that include the same root, based on the business scenario to which the sentence corresponding to each word segmentation combination belongs and a preset business scenario association order, to obtain a splicing result includes:

sorting the word segmentation combinations corresponding to each sentence by the order of their ending words in that sentence, to generate a word segmentation array corresponding to each sentence;

determining a business scene corresponding to the word segmentation array based on the business scene to which the sentence corresponding to the word segmentation array belongs;

and screening target word segmentation arrays corresponding to the same word root, and splicing word segmentation combinations at the same position in the target word segmentation arrays corresponding to each service scene based on the service scene association sequence to obtain a splicing result.

In another embodiment, calling the preset calculation model to calculate the effective score of each splicing result includes:

for each splicing result, calculating a feature vector of the splicing result, calling the preset calculation model to extract features of the splicing result based on the feature vector, calculating a floating-point value corresponding to the features, and determining the floating-point value as the effective score of the splicing result.

In yet another embodiment, determining business entities from the splicing results based on the effective scores, so as to update them to the business entity library, comprises:

judging whether the effective score of the splicing result is greater than a preset score threshold value or not;

if yes, determining the word segmentation combination that is last in splicing order within the splicing result as a business entity, and updating the business entity to the business entity library; if not, leaving the splicing result unprocessed.

In yet another embodiment, before screening the sentences that include the root from the knowledge base, the method further comprises:

acquiring historical search words of the knowledge base, and adding the historical search words and the preset word bank to the root bank.

In yet another embodiment, the text to be processed comprises a consultation text sent by a user;

the identifying the business entity in the received text to be processed based on the business entity library comprises:

receiving consultation information sent by a user, and determining a consultation text corresponding to the consultation information;

matching the business entity library against the consultation text, identifying the business entities included in the consultation text, querying a corresponding response document based on those business entities, and replying to the consultation information.

To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided a text processing apparatus.

The text processing device of the embodiment of the invention comprises: a calculation unit, configured to acquire a preset word bank, perform sequence labeling on each text in a knowledge base, input the labeled texts into a preset recognition model, and calculate a root bank; a determining unit, configured to screen, for each root in the root bank, sentences that include the root from the knowledge base, segment each sentence, and combine the segmentation results to determine the word segmentation combinations corresponding to each sentence, where each word segmentation combination includes the root; a splicing unit, configured to splice the word segmentation combinations that include the same root, based on the business scenario to which the sentence corresponding to each word segmentation combination belongs and a preset business scenario association order, to obtain splicing results; the calculation unit being further configured to call a preset calculation model, calculate the effective score of each splicing result, and determine business entities from the splicing results based on the effective scores so as to update them to a business entity library; and a processing unit, configured to identify the business entities in a received text to be processed based on the business entity library.

In an embodiment, the computing unit is specifically configured to:

In another embodiment, the splicing unit is specifically configured to:

In another embodiment, the computing unit is specifically configured to:

In another embodiment, the processing unit is specifically configured to:

In yet another embodiment, the apparatus further comprises:

an adding unit, configured to acquire historical search words of the knowledge base and add the historical search words and the preset word bank to the root bank.

In yet another embodiment, the pending text comprises advisory text sent by the user; the processing unit is specifically configured to:

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus.

An electronic device of an embodiment of the present invention includes: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors realize the text processing method provided by the embodiment of the invention.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided a computer-readable medium.

A computer-readable medium of an embodiment of the present invention stores thereon a computer program, which, when executed by a processor, implements a method of text processing provided by an embodiment of the present invention.

One embodiment of the above invention has the following advantages or benefits. In the embodiment of the invention, the texts in the knowledge base are sequence-labeled based on a preset word bank, and a root bank is calculated through a recognition model, where the words in the root bank represent the roots of business entities. Sentences that include a root are then screened from the knowledge base based on the root bank; each sentence is segmented and the segmentation results are combined to determine, for each sentence, the word segmentation combinations that include the root. Next, based on the business scenario to which the sentence corresponding to each word segmentation combination belongs and a preset business scenario association order, the word segmentation combinations that include the same root are spliced to obtain splicing results. The effective score of each splicing result is calculated through a calculation model, and business entities are determined from the splicing results based on the effective scores to obtain a business entity library, so that business entities in a received text to be processed can be identified based on the business entity library.

In other words, the root bank is first expanded from the preset word bank and the knowledge base. Sentences that include the roots are then screened out and their word segmentation combinations determined, which amounts to expanding each root with the texts in the knowledge base to obtain scenario-specific word segmentation combinations. Based on the association order of the business scenarios and the business scenario to which each sentence belongs, the word segmentation combinations that include the same root are spliced, which amounts to splicing associated word segmentation combinations across scenarios, so that the inference relationship between business scenarios is reflected in the splicing results. Business entities are then determined based on the effective scores of the splicing results. Building the business entity library in this way improves the efficiency of determining business entities, improves the accuracy and comprehensiveness of the determined business entities, and thereby improves the accuracy of business entity recognition in text.

Further effects of the above non-conventional optional implementations are described below in connection with the specific embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of one primary flow of a method of text processing according to an embodiment of the invention;

FIG. 2 is a flow diagram illustrating a process for computing a root repository, according to an embodiment of the invention;

FIG. 3 is a schematic diagram of yet another major flow of a method of text processing according to an embodiment of the invention;

FIG. 4 is a flow chart illustrating a process of determining a business entity according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the main elements of an apparatus for text processing according to an embodiment of the present invention;

FIG. 6 is a diagram of yet another exemplary system architecture to which embodiments of the present invention may be applied;

FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing embodiments of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

An embodiment of the present invention provides a text processing method, which may be performed by a text processing device. As shown in FIG. 1, the method includes:

S101: Acquire a preset word bank, perform sequence labeling on each text in the knowledge base, input the labeled texts into a preset recognition model, and calculate a root bank.

The word bank is preset. The business entities to be determined in the embodiment of the invention are not the same as common words and generally correspond to specific business scenarios, so the initial word bank can be labeled manually; for example, customer service personnel can manually label the basic words frequently used in each business scenario. For instance, the terms "AAPLUS member", "AA association", and "AA to home APP" are common in some services and are all extensions of AA, so AA can be labeled, that is, taken as a word in the preset word bank. In the embodiment of the invention, the preset word bank may also be expanded using a weakly supervised algorithm. To keep common words out of the word bank, the word bank can be deduplicated against a common word bank before subsequent processing; the common word bank can be preset or provided by a third party.

The knowledge base is established in advance and can be the corpus used in each business scenario. For example, if the business entities of a customer service scenario are to be determined, the knowledge base can be the corpus used in that customer service scenario. It can include various texts and other files, and each text can include one or more sentences. Each text is associated with keywords and a corresponding business scenario, which means each sentence is also associated with keywords and a corresponding business scenario. The knowledge base is the basis for determining the business entities.

Because keywords are associated with each text or other corpus item when it is entered into the knowledge base, and those keywords are usually words related to the business scenario, the keyword bank associated with the corpora in the knowledge base can be added to the preset word bank in this step to make the word bank more comprehensive.

Based on the word bank, sequence labeling can be performed on each corpus item in the knowledge base, that is, sequence labels corresponding to the word bank are generated; for example, the sequence labels can be BIO labels. The sequence labels are then input into a preset recognition model to calculate the root bank. The recognition model may include, for example, one or more of a supervised Albert model and an unsupervised combined entropy model (Unper). To keep common words out of the word bank calculated by the recognition model, it can be deduplicated against the common word bank, and the root bank is then obtained.

Because the knowledge base can be searched, and the search words entered are usually also business-related words, the historical search words of the knowledge base can be added to the root bank to make the roots more comprehensive. In addition, to avoid losing words from the preset word bank during the calculation, the preset word bank can also be added to the root bank.
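As an illustration of the BIO labeling mentioned above, the following minimal Python sketch tags an already segmented sentence by greedy longest match against the word bank. The function name `bio_tag`, the token-level matching, and the plain `B`/`I`/`O` tag set are assumptions made for this sketch; the source does not specify how the labels are generated.

```python
def bio_tag(tokens, lexicon):
    """Tag tokens with B/I/O labels by greedy longest match against lexicon terms.

    Assumption of this sketch: lexicon terms are space-separated token sequences.
    """
    # Try longer terms first so that "AA logistics" wins over "AA".
    terms = sorted((tuple(t.split()) for t in lexicon), key=len, reverse=True)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for term in terms:
            n = len(term)
            if n and tuple(tokens[i:i + n]) == term:
                tags[i] = "B"                      # first token of a matched term
                for j in range(i + 1, i + n):
                    tags[j] = "I"                  # remaining tokens of the term
                i += n
                break
        else:
            i += 1                                 # no term starts here
    return tags


tokens = ["AA", "logistics", "forwarding", "details"]
tags = bio_tag(tokens, {"AA"})
```

The resulting tag sequence would then be paired with the tokens and passed to the recognition model, which is outside the scope of this sketch.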

FIG. 2 is a schematic flow chart of a process of calculating a root bank according to an embodiment of the present invention. As shown in FIG. 2, Step 1 is performed first: the pre-labeled word bank CD is deduplicated against the common word bank, and the keyword bank kefu associated with the corpora in the knowledge base is added, giving the word bank Kd-1. Step 2 is then performed: the text K & A in the knowledge base is sequence-labeled using Kd-1, the Albert model and the Unper model are used to calculate the corresponding word banks Sup and Unper respectively, and Sup and Unper are each deduplicated against the common word bank and then merged to obtain Kd-2', which may be corrected by manual screening to obtain the word bank Kd-2. Step 3 is then performed: the historical search word bank search of the knowledge base is corrected by manual screening to obtain the word bank Kd-afs, Kd-1 and Kd-2 are merged to obtain the word bank Kd-bfs, and Kd-afs and Kd-bfs are merged to obtain the final root bank Kd-3.
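The three steps of FIG. 2 reduce to set operations once the model outputs are available. The sketch below is only an illustration under that assumption: the function name `build_root_library` is hypothetical, the outputs of the two models (`sup`, `unper`) are passed in as ready-made sets, and the manual screening steps are treated as leaving the sets unchanged.

```python
def build_root_library(cd, common, kefu, sup, unper, search):
    """Combine the word banks of FIG. 2 into the final root bank Kd-3.

    All arguments are sets of strings. Manual screening is omitted here.
    """
    kd1 = (cd - common) | kefu                 # Step 1: deduplicate CD, add keyword bank
    kd2 = (sup - common) | (unper - common)    # Step 2: deduplicate model outputs, merge
    kd_afs = set(search)                       # Step 3: historical search words
    kd_bfs = kd1 | kd2                         # Step 3: merge Kd-1 and Kd-2
    return kd_afs | kd_bfs                     # final root bank Kd-3


# Hypothetical inputs, for illustration only.
kd3 = build_root_library(
    cd={"AA", "order"},
    common={"order"},
    kefu={"refund card"},
    sup={"AAPLUS", "order"},
    unper={"AA logistics"},
    search={"AA to home"},
)
```

Because Kd-1 is merged back in at Step 3, words from the preset word bank are not lost even if the models fail to reproduce them.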

S102: For each root in the root bank, screen sentences that include the root from the knowledge base, segment each sentence, and combine the segmentation results to determine the word segmentation combinations corresponding to each sentence.

Each word segmentation combination includes the root, and each screened sentence includes a root from the root bank. After the root bank is obtained, the roots can be expanded based on the texts in the knowledge base, so for each root the sentences that include it are screened from the knowledge base. Each sentence is segmented to obtain its segmentation result, and the segmented words are then combined to obtain word segmentation combinations that include the root.

It should be noted that, to ensure the comprehensiveness of the business entities, each root in the root bank can be processed in turn to obtain the business entities corresponding to that root. A business entity library is formed in this way, and business entities in subsequently received texts to be processed can be identified based on it.

Each sentence is segmented first. Because the embodiment of the invention obtains business entities by expanding roots, the segmented words are combined around the root in the sentence, that is, every word segmentation combination must include the root. Specifically, generating the word segmentation combinations that include the root may be performed as follows: for each sentence, segment the sentence, take the root as the starting word, take each segmented word located after the root in the sentence in turn as the ending word, and determine a word segmentation combination from the sentence based on the starting word and the ending word. For example, if the sentence is "AA logistics forwarding details", segmentation gives: AA, logistics, forwarding, details. Taking the root as the starting word and each subsequent segmented word in turn as the ending word gives the word segmentation combinations: AA logistics, AA logistics forwarding, and AA logistics forwarding details.
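The combination rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sentence has already been segmented into a token list and that tokens are joined with a space; the function name `segment_combinations` is hypothetical.

```python
def segment_combinations(tokens, root):
    """Return the combinations that start at the root and end at each later token.

    The list is ordered by the position of the ending word in the sentence.
    """
    if root not in tokens:
        return []
    start = tokens.index(root)  # the root is the starting word
    return [
        " ".join(tokens[start:end + 1])
        for end in range(start + 1, len(tokens))  # each later token is an ending word
    ]


combos = segment_combinations(["AA", "logistics", "forwarding", "details"], "AA")
# combos == ["AA logistics", "AA logistics forwarding", "AA logistics forwarding details"]
```

Since the combinations are produced in order of their ending words, the returned list can serve directly as the word segmentation array of the sentence.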

S103: Splice the word segmentation combinations that include the same root, based on the business scenario to which the sentence corresponding to each word segmentation combination belongs and a preset business scenario association order, to obtain splicing results.

When a sentence involves business entities from different scenarios, it usually reflects an inference relationship between those scenarios, so the inference relationship between business scenarios can be taken into account when determining business entities. In the embodiment of the invention, an association order between business scenarios can be set based on the inference relationship between them, indicating that a business scenario later in the order is influenced by the business scenarios earlier in the order. For example, the business scenarios include UAD (after-sales order system), JDL (logistics), and IM (online chat robot), and the preset association order may be UAD-JDL-IM, which means that IM is influenced by UAD and JDL, JDL is influenced by UAD, and UAD is not influenced by any other business scenario.

In the embodiment of the invention, for each root, each business scenario may correspond to multiple sentences that include the root, so multiple word segmentation combinations can be obtained.

In the embodiment of the invention, for each root, one sentence can be selected in turn from the sentences corresponding to each business scenario as a target sentence, and the word segmentation combinations corresponding to the target sentences are then spliced based on the business scenario association order and the business scenario to which each target sentence belongs. If a business scenario has no sentence that includes the root, that scenario does not take part in the expansion of the root and can be deleted from the preset business scenario association order. For example, the business scenarios include UAD, JDL, and IM, and the preset association order is UAD-JDL-IM. When the root AA is processed, if none of the sentences corresponding to JDL includes AA, JDL is deleted from the association order, which is updated to UAD-IM.

Specifically, the splicing results in this step may be obtained as follows: sort the word segmentation combinations corresponding to each sentence by the order of their ending words in that sentence, to generate a word segmentation array corresponding to each sentence; determine the business scenario corresponding to each word segmentation array based on the business scenario to which its sentence belongs; and screen the target word segmentation arrays that include the same root, and splice the word segmentation combinations at the same position in the target word segmentation arrays corresponding to each business scenario based on the business scenario association order, to obtain the splicing results.
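The pruning of the association order for a given root can be sketched as follows. This is a minimal illustration: the function name `prune_association_order` and the example sentences are hypothetical, and a plain substring test stands in for "the sentence includes the root".

```python
def prune_association_order(order, sentences_by_scene, root):
    """Keep only the scenarios that have at least one sentence containing the root."""
    return [
        scene
        for scene in order
        if any(root in sentence for sentence in sentences_by_scene.get(scene, []))
    ]


# Hypothetical sentences, for illustration only.
sentences_by_scene = {
    "UAD": ["AAPLUS member classical year card"],
    "JDL": ["parcel forwarding details"],  # no sentence with the root AA
    "IM": ["AA safety insurance"],
}
pruned = prune_association_order(["UAD", "JDL", "IM"], sentences_by_scene, "AA")
# pruned == ["UAD", "IM"]
```

The relative order of the remaining scenarios is preserved, so the influence relationship between them is unchanged.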

Arranging the word segmentation combinations of each sentence generates the corresponding word segmentation array, in which the combinations are ordered by the position of their ending words in the sentence. For example, if the word segmentation combinations are AA logistics, AA logistics forwarding, and AA logistics forwarding details, the resulting word segmentation array can be (AA logistics, AA logistics forwarding, AA logistics forwarding details). The business scenario corresponding to each word segmentation array is then determined from the business scenario to which its sentence belongs, so that the target word segmentation arrays corresponding to the same root can be screened out of all the arrays and, based on the business scenario association order, the word segmentation combinations at the same position in the target arrays corresponding to each business scenario are spliced to obtain the splicing results.

It should be noted that a business scenario later in the association order may be influenced by the business scenarios before it, so the word segmentation array of a later scenario needs to be spliced in turn with the word segmentation array of each earlier scenario, following the association order. Therefore, splicing the word segmentation combinations at the same position in each array based on the association order can be performed as follows: for each business scenario, splice the word segmentation combinations in its array with the combinations at the same positions in the arrays of the business scenarios that precede it in the association order, to obtain the splicing results.

For example, the business scenarios include UAD, JDL, and IM, the preset association order is UAD-JDL-IM, and the root is AA. One sentence corresponding to UAD that includes AA is "AAPLUS member classical year card"; one sentence corresponding to JDL that includes AA is "AA logistics forwarding details"; and one sentence corresponding to IM that includes AA is "AA safety insurance". The word segmentation array corresponding to UAD is then (AAPLUS, AAPLUS member classical year card), the array corresponding to JDL is (AA logistics, AA logistics forwarding details), and the array corresponding to IM is (AA safety insurance). According to the association order UAD-JDL-IM, IM is influenced by UAD and JDL, JDL is influenced by UAD, and UAD is not influenced by any other business scenario. Because UAD is not influenced by other business scenarios, the word segmentation combinations in its array are used directly as splicing results. Because JDL is influenced by UAD, the word segmentation combinations at the same position in the arrays corresponding to UAD and JDL are spliced, giving the splicing results AAPLUS-AA logistics and AAPLUS member-AA logistics forwarding. Because IM is influenced by UAD and JDL, the splicing results obtained are AAPLUS-AA logistics-AA safety union and AAPLUS member-AA logistics forwarding-AA safety union insurance. The final splicing results are therefore: AAPLUS, AAPLUS member classical year card, AAPLUS-AA logistics, AAPLUS member-AA logistics forwarding, AAPLUS-AA logistics-AA safety union, and AAPLUS member-AA logistics forwarding-AA safety union insurance.

It should be noted that, during splicing, if the word segmentation array of a business scenario does not contain enough word segmentation combinations, it can be determined whether that business scenario is last in the association order. If so, splicing of that scenario's word segmentation combinations can stop; otherwise, the last word segmentation combination in the array is used in its place for the subsequent splicing. For example, if the word segmentation array corresponding to JDL is (AA logistics), the splicing result AAPLUS member-AA logistics forwarding-AA safety insurance becomes AAPLUS member-AA logistics-AA safety insurance.
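The position-wise splicing and the fallback for short arrays can be sketched as below. This is one possible reading of the rule rather than a definitive implementation: the function name `splice`, the `-` separator, and the example arrays are assumptions, every scenario is assumed to have a non-empty array (scenarios without the root having already been removed from the order), and the number of positions for each chain is taken to be the length of the longest array in that chain.

```python
def splice(order, arrays):
    """Splice same-position combinations along the association order.

    order:  scenario names in association order (earlier influences later).
    arrays: scenario name -> non-empty word segmentation array for one root.
    """
    if not order:
        return []
    results = []
    last_scene = order[-1]
    for k in range(len(order)):
        chain = order[:k + 1]                      # scenario k plus all earlier ones
        depth = max(len(arrays[s]) for s in chain)
        for pos in range(depth):
            parts = []
            for scene in chain:
                combos = arrays[scene]
                if pos < len(combos):
                    parts.append(combos[pos])
                elif scene == last_scene:
                    parts = None                   # last scenario exhausted: stop here
                    break
                else:
                    parts.append(combos[-1])       # reuse the last combination
            if parts:
                results.append("-".join(parts))
    return results


# Hypothetical arrays, for illustration only.
arrays = {
    "UAD": ["AAPLUS", "AAPLUS member"],
    "JDL": ["AA logistics"],
    "IM": ["AA safety", "AA safety insurance"],
}
spliced = splice(["UAD", "JDL", "IM"], arrays)
```

With these inputs the second position of JDL is missing, so its last combination, AA logistics, is reused, which yields AAPLUS member-AA logistics and AAPLUS member-AA logistics-AA safety insurance.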

S104: Call a preset calculation model, calculate the effective score of each splicing result, and determine business entities from the splicing results based on the effective scores to obtain a business entity library.

The calculation model is pre-trained. In this step, after the effective score of a splicing result is calculated, it can be judged whether the effective score is greater than a preset score threshold. If so, the word segmentation combination that is last in splicing order within the splicing result is determined as a business entity and the business entity library is updated with it; if not, the splicing result is not processed. For example, for the splicing result AAPLUS-AA logistics, if its effective score is greater than the score threshold, AA logistics can be determined to be a business entity.
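The threshold test can be sketched as follows. This is a minimal illustration in which the pre-trained calculation model is replaced by a dictionary of made-up scores; the function name `select_entities`, the tuple representation of a splicing result, and the threshold value are all assumptions.

```python
def select_entities(scored_results, threshold):
    """Return the last combination of every splicing result scoring above the threshold.

    scored_results: mapping from a splicing result (tuple of word segmentation
    combinations in splicing order) to its effective score.
    """
    entities = set()
    for parts, score in scored_results.items():
        if score > threshold:
            entities.add(parts[-1])  # the combination spliced last is the entity
    return entities


# Made-up scores standing in for the output of the calculation model.
scored_results = {
    ("AAPLUS", "AA logistics"): 0.91,
    ("AAPLUS", "AA logistics", "AA safety insurance"): 0.42,
}
entity_library = select_entities(scored_results, threshold=0.8)
# entity_library == {"AA logistics"}
```

Keeping each splicing result as a tuple avoids having to split a joined string, which would be ambiguous if a combination itself contained the separator.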

S105: Identify the business entities in the received text to be processed based on the business entity library.

To make the determined business entities comprehensive, in the embodiment of the present invention the processing of steps S102 to S104 may be performed in turn on each root in the root library to obtain the business entities corresponding to that root, so that a business entity library is formed; business entities can then be identified in subsequently received texts to be processed based on the business entity library. Specifically, the received text to be processed may be a query text determined from query information sent by a user. In this case, the business entity library may be matched against the query text to identify the business entities in the query text, and a corresponding response may then be retrieved from data such as the knowledge base based on the identified business entities in order to answer the user's query, thereby improving the accuracy of responses to user queries.
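Matching the business entity library against a query text can be sketched as a plain substring search. The description does not specify the matching technique, so the substring test, the function name `identify_entities`, and the sample query are assumptions made only for illustration.

```python
def identify_entities(text, entity_library):
    """Return the library entities that occur in the text, ordered by first occurrence."""
    hits = [(text.find(entity), entity) for entity in entity_library if entity in text]
    return [entity for _, entity in sorted(hits)]


library = ["AAPLUS", "AA logistics", "AA safety insurance"]
query = "Where can I see AA logistics updates for my AAPLUS order?"
print(identify_entities(query, library))
```

The identified entities could then serve as keys for looking up a response in the knowledge base; that lookup is not shown here because its structure is not described.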

It should be noted that, after the business entity library is determined, the business entities can be edited manually, mainly to correct the determined results. Based on the business scenario corresponding to each business entity, a tree structure can be established to summarize the inference relationships between business entities effectively. In addition, in the embodiment of the present invention the business entities can be updated manually or by real-time calculation, and business entity queries and the addition of new business scenarios can also be supported.

After the business entities are determined, their effectiveness can be evaluated by a fitting calculation module. Because business entities are used differently in different scenarios, the fitting can be performed in combination with the corresponding business indicators. For example, when business entity identification is applied to knowledge base search, the search effectiveness of the identified business entities can be matched against historical or real-time search results and a matching degree calculated; if the matching degree decreases, a correction mode can be activated to correct the business entities. Correcting the business entities may include correction judgment and dynamic optimization. Because an effective score is calculated in the embodiment of the present invention, splicing results whose effective scores are relatively high but not greater than the preset score threshold can be selected for manual re-screening, so that the correction judgment of the business entities is carried out directly. Dynamic optimization feeds results with a high fitting degree and results with a low fitting degree into the calculation model as positive samples and negative samples, respectively, so that the model updates itself according to the fitting results.
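Selecting near-threshold splicing results for manual re-screening can be sketched as below. The description only says that such results have relatively high scores that do not exceed the threshold; the lower bound `review_floor`, the function name `triage`, and the sample scores are assumptions introduced for illustration.

```python
def triage(scored_results, threshold, review_floor):
    """Split splicing results into accepted ones and ones queued for manual re-screening."""
    accepted, review = [], []
    for spliced, score in scored_results:
        if score > threshold:
            accepted.append(spliced)
        elif score > review_floor:       # relatively high, but not above the threshold
            review.append(spliced)
    return accepted, review


# Hypothetical scores, for illustration only.
scored = [("AAPLUS", 0.9), ("AAPLUS-AA logistics", 0.45), ("AAPLUS-AA logistics-AA safety", 0.1)]
print(triage(scored, threshold=0.5, review_floor=0.4))
```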

In the embodiment of the present invention, after the root library is expanded based on the preset word library and the knowledge base, sentences including a root in the root library can be screened out and the word segmentation combination corresponding to each sentence determined; that is, the root is expanded based on the texts in the knowledge base to obtain the word segmentation combinations corresponding to each scenario. Word segmentation combinations including the same root can then be spliced based on the association sequence of the business scenarios and the business scenario to which the sentence corresponding to each combination belongs. This is equivalent to splicing the word segmentation combinations according to the associations between scenarios, so that the inference relationship between business scenarios is reflected in the splicing results. Business entities are then determined based on the effective scores of the splicing results. In this way, the root library is expanded from the preset word library, the word segmentation combinations of the business scenarios are expanded from the root library, and the business entities are determined in combination with the inference relationship between business scenarios to obtain a business entity library, which is used to identify the business entities in the received text to be processed. This improves the efficiency of determining business entities, the accuracy and comprehensiveness of the determined business entities, and the accuracy of identifying business entities in text.

It should be noted that, in the embodiment of the present invention, step S102, step S103 and step S104 may be executed together. For example, for a root in the root library, the sentences including that root may be screened out and determined as target sentences, and the processing of steps S102, S103 and S104 is then performed on the target sentences. The method for determining a business entity is described below taking as an example a preset association sequence of UAD-JDL-IM, a root AA, a target sentence of UAD "AAPLUS member classic annual card", a target sentence of JDL "AA logistics forwarding details", and a target sentence of IM "AA safety insurance". As shown in FIG. 3, the method includes:

S301: Segment each target sentence, take the root as the start word, and take the word segment immediately following the root as the end word.

FIG. 4 is a logic diagram for determining a business scenario, in which AA is the root. The end words of the target sentences in this step are therefore PLUS, logistics and safety, in sequence.

S302: Determine the word segmentation combination corresponding to each target sentence based on the start word and the end word.

Based on the processing result of step S301, the word segmentation combinations are AAPLUS, AA logistics and AA safety.
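Steps S301 and S302 can be sketched on a pre-segmented sentence. The token lists below are an assumed segmentation of the example sentences, the function name `initial_combination` is hypothetical, and the combination is kept as a list of tokens so that no assumption is made about how tokens are joined.

```python
def initial_combination(tokens, root):
    """Return (start, end, combination) with the root as start word and the next token as end word."""
    start = tokens.index(root)                 # raises ValueError if the root is absent
    end = min(start + 1, len(tokens) - 1)      # a sentence ending in the root has no following token
    return start, end, tokens[start:end + 1]


# Assumed segmentation of the three target sentences.
sentences = {
    "UAD": ["AA", "PLUS", "member", "classic", "annual card"],
    "JDL": ["AA", "logistics", "forwarding", "details"],
    "IM": ["AA", "safety", "insurance"],
}
for scene, tokens in sentences.items():
    print(scene, initial_combination(tokens, "AA"))
```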

S303: Splice the word segmentation combinations corresponding to the target sentences in sequence based on the business scenario association sequence.

As shown in FIG. 4, Concat denotes a splicing algorithm, which splices the preceding and following word segmentation combinations to obtain the splicing results AAPLUS, AAPLUS-AA logistics and AAPLUS-AA logistics-AA safety.

S304: calculating a feature vector of the splicing result, calling a preset calculation model to extract features of the splicing result based on the feature vector, further calculating a floating point numerical value corresponding to the features, and determining the floating point numerical value as a valid score of the splicing result.

As shown in FIG. 4, Inverse Decoding denotes a vector calculation algorithm, and Conv1d denotes a feature extraction and floating-point value calculation algorithm; the feature extraction may be implemented by a convolutional layer.
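The flow of step S304 (vector, one-dimensional convolution, floating-point score) can be illustrated with a toy sketch in plain Python. It is not the trained calculation model: the character-code encoding stands in for the unspecified vector calculation, the kernel weights are arbitrary rather than learned, and the ReLU, max pooling and sigmoid stages are assumptions chosen only to produce a single score.

```python
import math


def encode(text):
    """Hypothetical stand-in for the vector calculation: one float in [0, 1) per character."""
    return [(ord(ch) % 97) / 97.0 for ch in text]


def conv1d(signal, kernel, bias=0.0):
    """Valid (no padding) one-dimensional convolution as used in neural networks."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(signal) - k + 1)]


def effective_score(text, kernel=(0.5, -0.25, 0.5), bias=0.0):
    """Map a splicing result to a float: conv1d -> ReLU -> max pooling -> sigmoid."""
    signal = encode(text)
    if len(signal) < len(kernel):                       # pad very short inputs with zeros
        signal = signal + [0.0] * (len(kernel) - len(signal))
    features = [max(0.0, v) for v in conv1d(signal, kernel, bias)]
    pooled = max(features)
    return 1.0 / (1.0 + math.exp(-pooled))


print(effective_score("AAPLUS-AA logistics"))
```

Because the weights are untrained, the printed value has no meaning beyond showing that each splicing result is reduced to one floating-point number that can be compared with a score threshold.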

As shown in FIG. 4, the calculation model can calculate the effective scores of AAPLUS, AAPLUS-AA logistics and AAPLUS-AA logistics-AA safety in sequence.

S305: if the effective score of the splicing result is larger than a preset score threshold value, determining the word segmentation combination with the last splicing sequence in the splicing result as a business entity, and updating the business entity to a business entity library; and if the effective score of the splicing result is not greater than the preset score threshold value, the splicing result is not processed.

For each stitching result, a comparison may be made based on the validity score and a score threshold to determine a business entity. The score threshold may be determined in the model training process, as shown in fig. 4, where the left side in the figure represents the score threshold corresponding to each layer of the stitching result.

S306: Judge whether none of the target sentences of the business scenarios includes a word segment after its end word; if so, end the process; if not, go to step S307.

S307: Judge whether the business scenario corresponding to a target sentence that includes no word segment after its end word is the last one in the business scenario association sequence; if so, delete the last business scenario from the business scenario association sequence to update the sequence, and execute the step; if not, go to step S308.

When the business scenario corresponding to a target sentence that includes no word segment after its end word is the last business scenario in the association sequence, that last business scenario can be deleted to update the association sequence, because the last business scenario in the sequence does not influence any other business scenario.

S308: For each target sentence that includes a word segment after its end word, update the end word to that word segment and determine a new word segmentation combination for the target sentence based on the start word and the updated end word; then splice the new word segmentation combinations corresponding to the target sentences in sequence based on the updated business scenario association sequence to obtain splicing results, and execute step S304.

For a target sentence that includes a word segment after its end word, the end word is updated to that word segment so that the word segmentation combination is updated; for a target sentence that includes no word segment after its end word, the word segmentation combination is not updated.

It should be noted that a target sentence usually includes at most one business entity. Therefore, after a business entity has been determined in a target sentence, the word segmentation combinations after that business entity need not be processed, which simplifies the calculation; for such a target sentence, the end word is no longer updated, so that updating of its word segmentation combination stops.
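One round of steps S306 to S308 can be sketched as a state update over end-word positions. This is a reading of the description under explicit assumptions: the function name `next_round` and the index-based state are hypothetical, the step that follows deletion of the last business scenario is not stated in the text and is assumed here to be S308, only one scenario is deleted per round, and a sentence whose business entity has already been found is treated as having nothing left to extend.

```python
def next_round(order, tokens, end, found=frozenset()):
    """Advance end words by one token; return (new_order, new_end) or None when finished.

    order  -- business scenarios in association order
    tokens -- scenario -> token list of its target sentence
    end    -- scenario -> index of the current end word
    found  -- scenarios whose business entity has already been determined
    """
    def has_more(scene):
        return end[scene] + 1 < len(tokens[scene])

    def can_grow(scene):
        return has_more(scene) and scene not in found

    # S306: finish when no target sentence can be extended any further.
    if not any(can_grow(scene) for scene in order):
        return None
    # S307: an exhausted last scenario is removed from the association sequence.
    new_order = list(order)
    if not has_more(new_order[-1]):
        new_order.pop()
    # S308: move the end word one token to the right where possible.
    new_end = {scene: end[scene] + 1 if can_grow(scene) else end[scene]
               for scene in new_order}
    return new_order, new_end


tokens = {
    "UAD": ["AA", "PLUS", "member", "classic", "annual card"],
    "JDL": ["AA", "logistics", "forwarding", "details"],
    "IM": ["AA", "safety", "insurance"],
}
state = (["UAD", "JDL", "IM"], {"UAD": 1, "JDL": 1, "IM": 1})
while state is not None:
    print(state)
    state = next_round(state[0], tokens, state[1])
```

In a full loop, the combinations implied by each printed state would be spliced and scored (steps S303 to S305) before the next call.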

In the embodiment of the present invention, after the root library is expanded based on the preset word library and the knowledge base, sentences including a root in the root library can be screened out and the word segmentation combination corresponding to each sentence determined; that is, the root is expanded based on the texts in the knowledge base to obtain the word segmentation combinations corresponding to each scenario. Word segmentation combinations including the same root can then be spliced based on the association sequence of the business scenarios and the business scenario to which the sentence corresponding to each combination belongs. This is equivalent to splicing the word segmentation combinations according to the associations between scenarios, so that the inference relationship between business scenarios is reflected in the splicing results. Business entities are then determined based on the effective scores of the splicing results. In this way, the root library is expanded from the preset word library, the word segmentation combinations of the business scenarios are expanded from the root library, and the business entities are determined in combination with the inference relationship between business scenarios to obtain a business entity library. This improves the efficiency of determining business entities, the accuracy and comprehensiveness of the determined business entities, and the accuracy of identifying business entities in text.

In order to solve the problems in the prior art, an embodiment of the present invention provides an apparatus 500 for text processing, as shown in fig. 5, the apparatus 500 includes:

a calculating unit 501, configured to obtain a preset word library, perform sequence labeling on each text in the knowledge base, input the labeled texts into a preset recognition model, and calculate a root library;

a determining unit 502, configured to, for each root in the root bank, screen a sentence including the root from the knowledge base, perform word segmentation on each sentence, and combine word segmentation results to determine a word segmentation combination corresponding to each sentence, where the word segmentation combination includes the root;

a splicing unit 503, configured to splice word segmentation combinations including the same root based on the business scenario to which the sentence corresponding to each word segmentation combination belongs and a preset business scenario association sequence, so as to obtain splicing results;

the calculating unit 501 is further configured to invoke a preset calculating model, calculate an effective score of each splicing result, and determine a business entity from the splicing results based on the effective score, so as to update the business entity to a business entity library;

a processing unit 504, configured to identify a business entity in the received text to be processed based on the business entity library.

It should be understood that the manner of implementing the embodiment of the present invention is the same as the manner of implementing the embodiment shown in fig. 1, and the description thereof is omitted.

In an implementation manner of the embodiment of the present invention, the calculating unit 501 is specifically configured to:

In an implementation manner of the embodiment of the present invention, the splicing unit 503 is specifically configured to:

In an implementation manner of the embodiment of the present invention, the processing unit 504 is specifically configured to:

In an implementation manner of the embodiment of the present invention, the apparatus 500 further includes:

In an implementation manner of the embodiment of the invention, the text to be processed comprises a consultation text sent by a user; the processing unit 504 is specifically configured to:

It should be understood that the embodiment of the present invention may be implemented in the same manner as the embodiment shown in fig. 1 or 3, and thus, will not be described herein again.

In the embodiment of the present invention, after the root library is expanded based on the preset word library and the knowledge base, sentences including a root in the root library can be screened out and the word segmentation combination corresponding to each sentence determined; that is, the root is expanded based on the texts in the knowledge base to obtain the word segmentation combinations corresponding to each scenario. Word segmentation combinations including the same root can then be spliced based on the association sequence of the business scenarios and the business scenario to which the sentence corresponding to each combination belongs. This is equivalent to splicing the word segmentation combinations according to the associations between scenarios, so that the inference relationship between business scenarios is reflected in the splicing results. Business entities are then determined based on the effective scores of the splicing results. In this way, the root library is expanded from the preset word library, the word segmentation combinations of the business scenarios are expanded from the root library, and the business entities are determined in combination with the inference relationship between business scenarios to obtain a business entity library, which is used to identify the business entities in the received text to be processed. This improves the efficiency of determining business entities, the accuracy and comprehensiveness of the determined business entities, and the accuracy of identifying business entities in text.

According to an embodiment of the present invention, an electronic device and a readable storage medium are also provided.

The electronic device of the embodiment of the invention comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the processor, and the instructions are executed by the at least one processor to cause the at least one processor to perform the method for text processing provided by the embodiment of the invention.

Fig. 6 shows an exemplary system architecture 600 of a text processing apparatus or method to which embodiments of the invention may be applied.

As shown in FIG. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various types of connections, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. Various client applications may be installed on the terminal devices 601, 602, 603.

The terminal devices 601, 602, 603 may be, but are not limited to, smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 605 may be a server that provides various services. The server may analyze and otherwise process data such as a received product information query request, and feed back a processing result (for example, product information, given merely as an example) to the terminal device.

It should be noted that the method for processing the text provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the apparatus for processing the text is generally disposed in the server 605.

It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 7, a block diagram of a computer system 700 suitable for use in implementing embodiments of the present invention is shown. The computer system illustrated in FIG. 7 is only an example and should not impose any limitations on the scope of use or functionality of embodiments of the invention.

As shown in FIG. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as necessary, so that a computer program read out therefrom is installed into the storage section 708 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a unit, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present invention may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a computing unit, a screening unit, and a processing unit. Where the names of these units do not in some cases constitute a limitation of the unit itself, for example, a computing unit may also be described as a "unit of functionality of a computing unit".

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the method for text processing provided by the present invention.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of text processing, comprising:

acquiring a preset word bank, carrying out sequence tagging on each text in the knowledge bank, inputting the tagged text into a preset recognition model, and calculating a root bank;

screening sentences including the roots from the knowledge base for each root in the root base, segmenting each sentence, and combining segmentation results to determine a segmentation combination corresponding to each sentence, wherein the segmentation combination includes the roots;

splicing the participle combinations including the same root word based on the service scene to which the corresponding sentence of the participle combination belongs and a preset service scene correlation sequence to obtain a splicing result;

calling a preset calculation model, calculating effective scores of all splicing results, and determining a business entity from the splicing results based on the effective scores to obtain a business entity library;

and identifying the business entities in the received text to be processed based on the business entity library.

2. The method of claim 1, wherein performing word segmentation on each sentence, and combining the word segmentation results to determine a corresponding word segmentation combination for each sentence comprises:

3. The method according to claim 2, wherein splicing the participle combinations including the same root word based on the service scene to which the corresponding sentence of the participle combination belongs and the preset service scene correlation sequence to obtain the splicing result comprises:

4. The method of claim 1, wherein the invoking a predetermined calculation model to calculate the validity score of each stitching result comprises:

5. The method of claim 1, wherein determining business entities from the concatenation results for updating to a business entity library based on the validity score comprises:

6. The method of claim 1, further comprising, prior to screening the knowledge base for sentences that include the root word:

7. The method of claim 1, wherein the text to be processed comprises advisory text sent by a user;

8. An apparatus for text processing, comprising:

the calculation unit is used for acquiring a preset word bank, carrying out sequence labeling on each text in the knowledge base, inputting the labeled text into a preset identification model, and calculating a root bank;

a determining unit, configured to screen, for each root in the root bank, sentences including the root from the knowledge base, perform word segmentation on each sentence, and combine word segmentation results to determine a word segmentation combination corresponding to each sentence, where the word segmentation combination includes the root;

the splicing unit is used for splicing the participle combinations comprising the same root word based on the service scene to which the corresponding sentence of the participle combination belongs and a preset service scene correlation sequence to obtain a splicing result;

the calculation unit is further configured to invoke a preset calculation model, calculate an effective score of each splicing result, and determine a business entity from the splicing results based on the effective score so as to update the business entity to a business entity library;

and the processing unit is used for identifying the business entities in the received text to be processed based on the business entity library.

9. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.