CN115906858A - Text processing method and system and electronic equipment

Info

Publication number: CN115906858A
Application number: CN202110903758.6A
Authority: CN
Inventors: 曾骞; 张国慈; 张小洵; 王浩
Original assignee: Alibaba Singapore Holdings Pte Ltd
Current assignee: Alibaba Innovation Co
Priority date: 2021-08-06
Filing date: 2021-08-06
Publication date: 2023-04-04

Abstract

Embodiments of the present application provide a text processing method, a text processing system, and an electronic device. The method comprises: determining semantic tags respectively corresponding to a plurality of sentences in a text; extracting at least one first-class word from sentences whose semantic tags belong to a first-class tag; extracting at least two second-class words, and relationship information among different second-class words, from sentences whose semantic tags belong to a second-class tag; and generating a simplified expression of the text according to the at least one first-class word, the at least two second-class words, and the relationship information among the different second-class words. The technical solution shortens the text, and it does so not merely by extracting key words: it also extracts the relationship information among the second-class words, so that the simplified expression preserves the relationships among words and remains semantically accurate. The simplified expression of the text is accurate, intuitive, and concise, which reduces the time a user spends reading and understanding the text.

Description

Text processing method and system and electronic equipment

Technical Field

The present application relates to the field of computer technologies, and in particular, to a text processing method and system, and an electronic device.

Background

Because texts such as policy documents and news releases are numerous and vary in content, enterprises, users and the like often need to read and understand each text in detail when searching for a policy that suits them or when trying to follow current developments. If a policy document or news release involves a preferential policy, enterprises, users and the like must interpret the policy and calculate its benefits by themselves, which takes a great deal of time. Even after spending that time, they may still misunderstand the text, so that they fail to enjoy, or remain unaware of, a preferential policy whose requirements they actually meet.

Disclosure of Invention

In view of the above, the present application provides a text processing method, system and electronic device that solve the above problems, or at least partially solve the above problems.

In one embodiment of the present application, a method of text processing is provided. The method comprises the following steps:

determining semantic tags respectively corresponding to a plurality of sentences in a text;

extracting at least one first-class word from sentences whose semantic tags belong to a first-class tag;

extracting at least two second-class words, and relationship information among different second-class words, from sentences whose semantic tags belong to a second-class tag;

and generating a simplified expression of the text according to the at least one first-class word, the at least two second-class words, and the relationship information among the different second-class words.

In another embodiment of the present application, a text processing method is provided. The method comprises the following steps:

determining semantic tags corresponding to a plurality of sentences in a policy text respectively;

extracting at least one first-class word from sentences whose semantic tags belong to a first-class tag;

extracting at least two second-class words, and relationship information among different second-class words, from sentences whose semantic tags belong to a second-class tag;

and generating, for display to an enterprise user, a simplified expression of the text according to the at least one first-class word, the at least two second-class words, and the relationship information among the different second-class words.

In yet another embodiment of the present application, a text processing system is provided. The system comprises:

a data layer, configured to interact with a database to store and acquire data, wherein the database stores corpora;

a language processing layer, provided with at least one language processing model and configured to optimize any of the language processing models according to the corpora acquired by the data layer; the language processing layer is further configured to perform semantic analysis on an input text by using at least part of the at least one language processing model, so as to determine semantic tags respectively corresponding to a plurality of sentences in the text; extract at least one first-class word from sentences whose semantic tags belong to a first-class tag; extract at least two second-class words, and relationship information among different second-class words, from sentences whose semantic tags belong to a second-class tag; and generate a simplified expression of the text according to the at least one first-class word, the at least two second-class words, and the relationship information among the different second-class words;

and an output layer, configured to output the simplified expression of the text.

In yet another embodiment of the present application, a text processing system is provided. The system comprises:

a database, configured to store corpora;

a language processing engine, configured to optimize any language processing model of at least one language processing model by using the corpora stored in the database;

the language processing engine is further configured to perform semantic analysis on an input text by using at least part of the at least one language processing model, so as to determine semantic tags respectively corresponding to a plurality of sentences in the text; extract at least one first-class word from sentences whose semantic tags belong to a first-class tag; extract at least two second-class words, and relationship information among different second-class words, from sentences whose semantic tags belong to a second-class tag; and generate a simplified expression of the text according to the at least one first-class word, the at least two second-class words, and the relationship information among the different second-class words;

the database is further configured to store the text and the simplified expression of the text.

In one embodiment of the present application, an electronic device is provided. The electronic device includes a memory and a processor; the memory is used for storing one or more computer instructions, and the one or more computer instructions can realize the steps of the text processing method provided by the above embodiments when being executed by the processor.

The technical solutions provided by the embodiments of the present application are applicable not only to the policy texts mentioned in the background, but also to other texts, such as news texts and internal enterprise documents. Specifically, according to the technical solution provided by each embodiment of the application, semantic tags respectively corresponding to a plurality of sentences in a text are determined; at least one first-class word is extracted from the sentences whose semantic tags belong to the first-class tag; at least two second-class words, and relationship information among different second-class words, are extracted from the sentences whose semantic tags belong to the second-class tag; and finally a simplified expression of the text is generated from the extracted first-class words, second-class words, and relationship information. The technical solution thus shortens the text, and it does so not merely by extracting key words: it also extracts the relationship information among the second-class words, so that the simplified expression preserves the relationships among words and remains semantically accurate. The simplified expression of the text is accurate, intuitive, and concise, which reduces the time a user spends reading and understanding the text.

Furthermore, simplified expressions of texts obtained by the technical scheme provided by the embodiment of the application also provide data support for subsequently recommending texts for users and enterprises, and the method is beneficial to improving the recommendation efficiency and accuracy.

In the text processing system provided by the embodiment of the application, a hierarchical system architecture design concept is adopted, and a data layer can interact with a database to obtain corpora or store data and the like; the language processing layer can be flexibly deployed with at least one language processing model, and the suitable language processing model can be selected for different types of texts. For example, for documents such as policy texts, a language processing model (e.g., a classification model) can be optimized (e.g., trained) by using a dedicated document corpus, so that the optimized (or trained) language processing model suitable for the policy text can be used to process the policy text to obtain a simplified representation of the policy text. For other types of texts, such as news texts, an adaptive language processing model can be selected based on the characteristics of the news texts, so that the news texts are processed by utilizing the adaptive language processing model to obtain corresponding simplified expressions; and so on. The output layer is used for outputting the simplified expression of the text processed by the language processing layer so as to store the simplified expression into a database or output the simplified expression to a corresponding client and the like. In addition, in the technical solution provided by the embodiment of the present application, more than one semantic processing model may be used for semantic analysis of the text. Therefore, the text processing system adopting the hierarchical system architecture design concept provided by the embodiment of the application has higher flexibility, and can flexibly deploy and replace the language processing model based on actual requirements.
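The layered design described above can be sketched roughly as follows. This is only an illustrative sketch, not the system of the embodiment: the class names `DataLayer`, `LanguageProcessingLayer` and `OutputLayer`, the in-memory dictionary standing in for the database, and the placeholder "models" (plain functions that truncate the text) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A "language processing model" is reduced here to a callable that maps a
# text to its simplified expression; real models would be trained components.
Model = Callable[[str], str]


@dataclass
class DataLayer:
    """Stands in for the layer that interacts with the database (assumed in-memory)."""
    store: Dict[str, List[str]] = field(default_factory=dict)

    def save(self, key: str, value: str) -> None:
        self.store.setdefault(key, []).append(value)

    def load(self, key: str) -> List[str]:
        return list(self.store.get(key, []))


@dataclass
class LanguageProcessingLayer:
    """Holds one model per text type so that models can be deployed or replaced."""
    models: Dict[str, Model] = field(default_factory=dict)

    def deploy(self, text_type: str, model: Model) -> None:
        self.models[text_type] = model

    def process(self, text_type: str, text: str) -> str:
        return self.models[text_type](text)


@dataclass
class OutputLayer:
    data_layer: DataLayer

    def emit(self, simplified: str) -> str:
        # Store the result and hand it back to the caller (e.g. a client).
        self.data_layer.save("simplified", simplified)
        return simplified


data = DataLayer()
lang = LanguageProcessingLayer()
out = OutputLayer(data)

# Hypothetical placeholder models: each one only prefixes and truncates the text.
lang.deploy("policy", lambda t: "policy-summary: " + t[:20])
lang.deploy("news", lambda t: "news-summary: " + t[:20])

result = out.emit(lang.process("policy", "Enterprises registered in district A may apply."))
```

Because each text type maps to its own entry in `models`, swapping the model used for news texts does not touch the data layer or the output layer, which is the flexibility the hierarchical design is meant to provide.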

Further, the text processing system provided by the embodiment of the application may further include a recommendation layer, the recommendation layer is located between the language processing layer and the output layer, and the recommendation layer may also have at least one recommendation model, and the recommendation model may be selected according to the characteristics of an actual recommendation object, and the like, so that the flexibility is high, and the application scenarios are wider.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required to be utilized in the description of the embodiments or the prior art are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained according to the drawings without creative efforts for those skilled in the art.

Fig. 1 is a schematic diagram of the principle framework, and of the processing flow of each functional module, for implementing a text processing method according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of a text processing method according to an embodiment of the present application;

Fig. 3a is a schematic diagram of a technical architecture corresponding to a text processing system according to an embodiment of the present application;

Fig. 3b is a block diagram characterizing a text processing system in terms of its hierarchical structure according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of a text processing method according to another embodiment of the present application;

Fig. 5 is a schematic flowchart of a text processing method according to another embodiment of the present application;

Fig. 6 is a schematic structural diagram of a text processing system according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Against the background of rapid global informatization, users, enterprises, organizations and the like face a large amount of information. Capturing social development trends, beneficial policy information and the like from this information takes a great deal of time, and beneficial opportunities may still be missed.

For example, in one specific application scenario, the government-enterprise service scenario, an important service is to help enterprises, organizations, users and the like make efficient use of the policies promulgated by government bodies, service-type organizations and the like. To assist enterprises, organizations and the like, such bodies publish a wide variety of policy documents each year on public service platforms and associated websites, usually in text form (hereinafter collectively referred to as policy texts). Because the policy texts are numerous and differ in content, an enterprise, organization or user searching for a suitable policy text has to read and understand each policy text in detail in order to screen out the ones that apply. That is, the more policy texts there are, the more time an enterprise, organization or user must spend reading them in detail and filtering out the useful ones. In addition, policy texts usually contain preferential provisions, and enterprises, organizations or users who want to understand a specific preferential policy must interpret it and calculate its benefit by themselves.

The prior art can, to a certain extent, automatically recommend suitable policy texts to enterprises, users or organizations. However, the recommendation process in the prior art relies mainly on historical behavior data (such as viewing records) of the enterprise, user or organization; because such data is relatively sparse, the recommendation system is prone to the cold start problem.

The cold start problem refers to the difficulty of optimizing or training a recommendation model that recommends well when little historical or training data is available; that is, how to make a recommendation system run well, and make its recommendations progressively better, under such conditions. In addition, because the prior art focuses mainly on recommendation, besides the cold start problem it also lacks detailed interpretation of the content of policy texts and intelligent analysis and calculation of the degree of benefit that a policy text can provide.

In view of the above problems, the present application provides the following embodiments, which offer a solution that intelligently analyzes and semantically understands a text to obtain a simplified expression of the text. The simplified expression helps a user, enterprise or organization understand the text and find a suitable text (e.g., a policy text or a news release) as soon as possible, reducing the probability that the enterprise, user or organization misses beneficial information. In addition to providing the simplified expression of a text, the following embodiments can also actively recommend suitable texts to the user, enterprise or organization, further reducing the effort of searching by themselves. In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

In some of the flows described in the specification, claims, and above-described figures of the present application, a number of operations are included that occur in a particular order, and these operations may be performed out of order or in parallel as they occur herein. The sequence numbers of the operations, e.g., 101, 102, etc., are used merely to distinguish between the various operations, and do not represent any order of execution per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor do they limit the types of "first" and "second". In addition, the embodiments described below are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Before introducing the method embodiment provided by the present application, a principle framework corresponding to a text processing system based on the technical solution provided by the present application is explained.

Fig. 1 shows a schematic diagram of the principle framework, and of the processing flow of each functional module, for implementing the text processing method provided by the present application. As shown in Fig. 1, a device implementing the text processing method may include the following main functional modules, which include but are not limited to: a content interpretation module 10, a relationship interpretation module 20, and an expression module 30. Wherein:

the content interpretation module 10 is configured to determine semantic tags corresponding to a plurality of sentences in a text; extracting at least one first class word from the statement of which the semantic tag belongs to the first class tag;

the relation interpretation module 20 is configured to extract, from the sentences whose semantic tags belong to the second category tags, at least two second-category words and relation information between different second-category words;

the expression module 30 is configured to generate a simplified expression of the text according to the at least one first-type word, the at least two second-type words, and relationship information between different second-type words.

Further, as shown in Fig. 1, the present embodiment may further include a recommendation module 40. Specifically, the recommendation module 40 is configured to obtain description information of a target object; determine a recommended text among a plurality of texts according to the description information and the simplified expressions of the plurality of texts; and send the simplified expression corresponding to the recommended text to a client corresponding to the target object.

In a specific implementation, when the recommendation module 40 determines the recommended text among the plurality of texts according to the description information and the simplified expressions of the plurality of texts, it may specifically: analyze the relevance between the target object and each text according to the description information and the at least one first-class word contained in the simplified expression of each text, and determine a text whose relevance meets a requirement as the recommended text.

The step of "analyzing the relevance between the target object and each text according to the description information and the at least one first-class word contained in the simplified expression of each text" is described in detail below.
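As a rough illustration of relevance-based selection, the sketch below scores each text by the fraction of its first-class words that also appear in the target object's description, and keeps the texts whose score meets a threshold. The overlap ratio, the threshold of 0.5, and the sample data are assumptions made for the example; the embodiment itself refers to synonym expansion, normalization and calculation models for this step.

```python
from typing import Dict, List, Set


def relevance(description: Set[str], first_class_words: Set[str]) -> float:
    """Fraction of a text's first-class words matched by the description."""
    if not first_class_words:
        return 0.0
    return len(description & first_class_words) / len(first_class_words)


def recommend(description: Set[str],
              simplified: Dict[str, Set[str]],
              threshold: float = 0.5) -> List[str]:
    """Return ids of texts whose relevance meets the threshold, best first."""
    scored = [(relevance(description, words), text_id)
              for text_id, words in simplified.items()]
    kept = [(score, text_id) for score, text_id in scored if score >= threshold]
    # Sort by descending score, then by id so that ties are deterministic.
    return [text_id for score, text_id in sorted(kept, key=lambda p: (-p[0], p[1]))]


# Hypothetical description of an enterprise and first-class words of three texts.
enterprise = {"district A", "software", "registered 2019"}
texts = {
    "policy-1": {"district A", "software"},
    "policy-2": {"district B", "manufacturing"},
    "policy-3": {"district A", "manufacturing"},
}
recommended = recommend(enterprise, texts)
```

Since the score depends only on the description and on the extracted words, no viewing history is needed, which is why this style of matching sidesteps the cold start problem.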

The text in this embodiment may be a policy text. Correspondingly, the semantic tags of the sentences contained in the policy text may include, but are not limited to: "application condition", "preferential item", "other", and the like. For example, the semantic tag "application condition" may be classified as a first-class tag, and the semantic tag "preferential item" as a second-class tag.

A first-class tag may also have lower-level tags; for example, the semantic tag "application condition" may include lower-level tags such as "registered place" and "business scope". It should be noted here that the above tags are characteristic of texts such as enterprise-oriented policy texts. For another category of text, such as policy texts aimed at individual members of the public, the semantic tags of the sentences may likewise include, but are not limited to, "application condition", "preferential item", "other", and the like, but the lower-level tags of "application condition" may instead be "years of work", "household registration", "industry", and the like.

It should therefore be added that the above semantic tags are only examples given for specific types of text. The semantic tags differ for policy texts aimed at different objects, and for non-policy texts such as news or analysis articles published on websites; they are not enumerated exhaustively in this embodiment. It should also be added that the embodiments of the present application illustrate extraction for sentences of only two classes of tags; in a specific implementation, a third-class tag, a fourth-class tag, and so on may be further defined, and a different extraction strategy may be adopted for each class of tag. The extraction strategy is not particularly limited in this embodiment.
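One simple way to hold such a tag scheme is a pair of lookup tables, sketched below. The structure and the names `TAG_CLASS`, `SUB_TAGS`, `class_of` and `sub_tags` are assumptions made for illustration; only the enterprise-oriented example tags are taken from the description above.

```python
from typing import Dict, List, Optional

# Class of each top-level semantic tag. Tags not listed here (e.g. "other")
# belong to neither class, so nothing is extracted from their sentences.
TAG_CLASS: Dict[str, str] = {
    "application condition": "first",
    "preferential item": "second",
}

# Lower-level tags, keyed by text type and then by top-level tag. Further
# text types can be added as extra entries without changing the lookups.
SUB_TAGS: Dict[str, Dict[str, List[str]]] = {
    "enterprise policy": {
        "application condition": ["registered place", "business scope"],
    },
}


def class_of(tag: str) -> Optional[str]:
    """Return "first", "second", or None for a top-level semantic tag."""
    return TAG_CLASS.get(tag)


def sub_tags(text_type: str, tag: str) -> List[str]:
    """Return the lower-level tags of a tag for a given text type, if any."""
    return SUB_TAGS.get(text_type, {}).get(tag, [])
```

Adding a third-class tag would only require a new value in `TAG_CLASS` and a matching extraction strategy keyed on that value.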

More specifically, in an embodiment, the content interpretation module 10 may use a classification model to analyze the text at sentence granularity, classifying all the sentences contained in the text and assigning to each sentence the semantic tag of its class. The content interpretation module 10 may then use a first extraction model to extract at least one first-class word from the sentences whose semantic tags belong to the first-class tag. The first extraction model can be understood as analyzing each sentence at a granularity finer than the sentence, such as words and phrases.

The relationship interpretation module 20 may use a second extraction model to extract a plurality of second-class words from the sentences whose semantic tags belong to the second-class tag, and then use a relationship recognition model, together with those sentences, to identify the relationship information between any two of the second-class words extracted by the second extraction model.

Here, it should be noted that: the first extraction model, the second extraction model, the relationship recognition model, and the like mentioned above are not particularly limited in this embodiment. Specific implementations of the models will be listed below by way of example.

In one implementation, each kind of relationship information may be preset with a corresponding relational expression template. The expression module 30 can therefore express any two second-class words as a simplified relational expression according to the relational expression template corresponding to the relationship information between them, as interpreted by the relationship interpretation module 20. Of course, a relationship may involve more than two second-class words, for example three or four; in that case the expression module 30 may express the three, four or more second-class words as a simplified relational expression according to the relational expression template corresponding to their relationship information. Further, the expression module 30 is configured to generate the simplified expression of the text from the at least one first-class word interpreted by the content interpretation module 10 and the at least one simplified relational expression obtained by the above process.
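Template-based expression of this kind can be sketched as follows. The relationship names (`capped_at`, `proportion_of`), the template strings, the output layout and the sample words are all hypothetical; the description only states that each kind of relationship information may have a preset relational expression template.

```python
from typing import Dict, List, Sequence

# Hypothetical relationship types mapped to preset expression templates.
# A template may take two, three or more second-class words.
TEMPLATES: Dict[str, str] = {
    "capped_at": "{0} <= {1}",
    "proportion_of": "{0} = {1} x {2}",
}


def render(relation: str, words: Sequence[str]) -> str:
    """Fill the template of a relationship with its second-class words."""
    return TEMPLATES[relation].format(*words)


def simplified_expression(first_class_words: List[str],
                          relational_expressions: List[str]) -> str:
    """Combine first-class words and relational expressions into one summary."""
    return ("Conditions: " + "; ".join(first_class_words) + "\n"
            + "Benefits: " + "; ".join(relational_expressions))


cap = render("capped_at", ["subsidy", "500,000 yuan"])
share = render("proportion_of", ["subsidy", "R&D expenditure", "10%"])
summary = simplified_expression(["district A", "software"], [share, cap])
```

Keeping the templates in a table means a new kind of relationship can be supported by adding one entry, without changing `render`.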

The recommendation module 40 may calculate the relevance between a text and the description information of the target object based on a synonym expansion model, a normalization model, a calculation model, and the like, and may recommend the simplified expression of the text to the target object based on a lightweight recommendation framework (e.g., a framework composed of a recall model, a ranking model, and the like). Of course, in addition to the simplified expression of the recommended text, the original text may also be recommended to the target object together with it. The entities corresponding to the target object may differ for different types of texts. For example, for policy texts aimed at enterprises, organizations and the like, the target object is the enterprise, organization or the like, and its description information may include, but is not limited to: a place of registration, a business scope, a date of registration, and the like. For another example, for policy texts aimed at individual citizens or employees of enterprises and public institutions, the target objects are the individual citizens, employees and the like, and their description information may include, but is not limited to, working years, the industry in which they work, and the like.

Among them, the above-mentioned synonym expansion model, normalization model, calculation model, and the like will be described in detail below.

Therefore, in the technical solution provided by the present application, the content interpretation module 10 performs fine-grained interpretation of the sentences of the text, the relationship interpretation module 20 interprets the relationship information between the second-class words in the text, the expression module 30 produces the simplified expression of the text, and the recommendation module 40 recommends matching texts to the target object. The system architecture corresponding to the functions of the content interpretation module 10, the relationship interpretation module 20, the expression module 30 and the recommendation module 40 is shown in Fig. 3a and Fig. 3b.

By comprehensively using language processing models and recommendation models, and based on information technologies such as cloud computing, big data and artificial intelligence, the technical solution provided by this embodiment can intelligently perform fine-grained interpretation and relationship interpretation of text content, so as to accurately determine the simplified expression of the text (for example, the relational expressions for the application conditions of policy support, the benefits of policy support, and the like in a policy text) and display it to target objects (such as enterprise users, individual users, and organizations) in an intuitive and concise manner. Further, the technical solution may also recommend a suitable text to a target object based on the description information of the target object (e.g., introductory information published by an enterprise) in combination with the simplified expression of the text (more specifically, the at least one first-class word contained in the simplified expression), which effectively avoids the cold start problem caused by recommending from sparse historical behavior data. The technical solutions provided in the embodiments of the present application are described in detail below to explain how the above problems in the prior art are solved.

The following description is directed to embodiments of the methods provided herein.

The execution body of the text processing method provided in the following embodiments may be a server, or a virtual server deployed on a server cluster, which is not particularly limited in this embodiment. Specifically, the method may be carried out by a computer program that is deployed on the execution body and has the functions corresponding to the modules of the above embodiments. Of course, if a client device has sufficient computing power, the execution body of the following method embodiments may also be the client device. Specifically, the client device may be a desktop computer, a notebook computer, a smart wearable device, or the like.

Fig. 2 is a flowchart illustrating a text processing method according to an embodiment of the present application. As shown in fig. 2, the method comprises the steps of:

101. determining semantic tags respectively corresponding to a plurality of sentences in a text;

102. extracting at least one first class word from the statement of which the semantic tag belongs to the first class tag;

103. extracting at least two second-class words and relationship information among different second-class words from sentences of which the semantic tags belong to second-class tags;

104. and generating a simplified expression of the text according to the at least one first-class word, the at least two second-class words and the relationship information among different second-class words.
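Steps 101 to 104 can be pictured with the minimal sketch below. Keyword rules, a dictionary lookup and one regular expression stand in for the classification model, the extraction models and the relationship recognition model; these stand-ins, the function names and the sample sentences are assumptions made for illustration and are not the models of the embodiment.

```python
import re
from typing import Dict, List, Tuple

FIRST_CLASS_TAGS = {"application condition"}
SECOND_CLASS_TAGS = {"preferential item"}

Relation = Tuple[str, str, str]  # (word, relationship, word)


def classify(sentence: str) -> str:
    """Toy stand-in for the classification model (keyword rules)."""
    lowered = sentence.lower()
    if "must" in lowered or "eligible" in lowered:
        return "application condition"
    if "subsidy" in lowered or "reward" in lowered:
        return "preferential item"
    return "other"


def extract_first_class(sentence: str, vocabulary: List[str]) -> List[str]:
    """Toy stand-in for the first extraction model (dictionary lookup)."""
    return [word for word in vocabulary if word in sentence]


def extract_second_class(sentence: str) -> Tuple[List[str], List[Relation]]:
    """Toy stand-in for second-class word extraction plus relationship recognition."""
    match = re.search(r"(subsidy) of (\d+%) of ([\w& ]+?) is granted", sentence)
    if match is None:
        return [], []
    benefit, rate, base = match.groups()
    return [benefit, rate, base], [(benefit, "rate", rate), (benefit, "base", base)]


def simplify(sentences: List[str], vocabulary: List[str]) -> Dict[str, list]:
    tags = [classify(s) for s in sentences]                      # step 101
    first_words: List[str] = []
    second_words: List[str] = []
    relations: List[Relation] = []
    for sentence, tag in zip(sentences, tags):
        if tag in FIRST_CLASS_TAGS:                              # step 102
            first_words.extend(extract_first_class(sentence, vocabulary))
        elif tag in SECOND_CLASS_TAGS:                           # step 103
            words, rels = extract_second_class(sentence)
            second_words.extend(words)
            relations.extend(rels)
    return {                                                     # step 104
        "first_class_words": first_words,
        "second_class_words": second_words,
        "relations": relations,
    }


sentences = [
    "Applicants must be registered in district A.",
    "A subsidy of 10% of R&D expenditure is granted.",
    "This notice takes effect on publication.",
]
summary = simplify(sentences, ["district A", "software"])
```

The third sentence is tagged "other" and contributes nothing, which mirrors how only sentences with first-class or second-class tags feed the simplified expression.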

Taking a policy text as the text in this embodiment, the policy text may specifically be any of various preferential policy texts (also referred to as support policy texts) issued by a certain jurisdiction or region to benefit people and enterprises. Generally, a preferential policy text includes key content such as the application conditions for policy support and the benefits the policy provides, so as to guide and support the development of enterprises, talent and the like. To solve the above-mentioned problems of the prior art, it is therefore necessary to identify this key content in a policy text so that it can be processed later. Accordingly, when the text in step 101 is a preferential policy text, the sentences of the text include at least: at least one sentence whose semantic tag is "application condition", at least one sentence whose semantic tag is "preferential item", and so on.

In an achievable technical solution, in the foregoing 101, semantic tags respectively corresponding to a plurality of sentences in a text can be obtained by analyzing text contents in a policy text according to sentence granularity. Specifically, the step 101 of "determining semantic tags corresponding to a plurality of sentences in a text" may specifically include:

1011. acquiring the text;

1012. performing sentence segmentation on the text to obtain a plurality of sentences;

1013. and utilizing a classification model to identify semantic labels respectively corresponding to the sentences.

Sometimes, the semantic labels corresponding to two sentences or more sentences are the same; in this regard, to simplify the present embodiment, the method may further include a step 1014 of summarizing the sentences with the same identified semantic tags.
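Steps 1012 and 1013 can be illustrated with the following minimal sketch. It is not the classification model of this embodiment: the keyword table `TAG_KEYWORDS` and the function names are hypothetical stand-ins, and a trained model would normally replace the keyword lookup.

```python
import re

# Hypothetical keyword rules standing in for a trained classification model.
TAG_KEYWORDS = {
    "declaration conditions": ["registered capital", "not less than", "must meet"],
    "preferential items": ["subsidy", "reward", "award"],
}


def split_sentences(text: str) -> list[str]:
    """Step 1012: cut the text after sentence-ending punctuation (Latin or CJK)."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [p.strip() for p in parts if p.strip()]


def classify(sentence: str) -> str:
    """Step 1013: return the first tag whose keywords occur in the sentence."""
    lowered = sentence.lower()
    for tag, keywords in TAG_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return tag
    return "other"


def tag_sentences(text: str) -> list[tuple[int, str, str]]:
    """Return (position in text, sentence, semantic tag) for every sentence."""
    return [(i, s, classify(s)) for i, s in enumerate(split_sentences(text))]
```

The splitter is deliberately naive (it would also cut at a decimal point), which is acceptable here only because the sketch is meant to show the data flow between the two steps.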

In the above 1011, a web crawler (also called a web spider or a web robot) may be used to crawl from an open website or platform to obtain a text meeting the requirement. Specifically, the crawled entry may be set to the home page of a website or a web page for posting information, or the like. In addition, the crawled entry may also be a homepage of a third-party website on the internet, a webpage for publishing information in the third-party website, or the like, which is not limited herein. Further, some crawling rules may be set to crawl only text that meets the rules, such as crawling only policy text that contains a document number, where the document number is the number of a document issued by a particular department, and the document number is generated according to official regulations. In addition to obtaining the policy text by the web crawler, of course, the embodiment may also obtain the required text by other methods, such as a manual importing method, system docking with a website or a platform, and the like, which is not limited in this embodiment.

In the above 1013, the classification model may be a pre-trained machine learning model, which may be constructed based on, for example, a BERT (Bidirectional Encoder Representations from Transformers) algorithm, an SVM (Support Vector Machine) algorithm, an LSTM (Long Short-Term Memory, a type of recurrent neural network) algorithm, a GBDT (Gradient Boosting Decision Tree) algorithm, and the like, which is not limited herein. Accordingly, the training process of the classification model may include the following steps:

acquiring a training sample; the training sample comprises a plurality of sentences and sample labels to which the plurality of sentences belong;

taking the sentences as the input of the classification model, and executing the classification model to obtain an output result containing semantic labels corresponding to the sentences;

and optimizing the classification model based on the output result and the sample labels to which the multiple sentences belong respectively.
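The three training steps above can be sketched with a very small model. The example below uses a multi-class perceptron over bag-of-words features purely as an assumption for illustration; the embodiment names BERT, SVM, LSTM and GBDT as possible algorithms, and none of the function names below come from the original.

```python
from collections import defaultdict


def featurize(sentence: str) -> list[str]:
    """Bag-of-words features: lower-cased whitespace tokens."""
    return sentence.lower().split()


def predict(weights, labels, sentence: str) -> str:
    """Execute the model: score every label and return the best one."""
    feats = featurize(sentence)
    scores = {lab: sum(weights[lab][f] for f in feats) for lab in labels}
    return max(labels, key=lambda lab: scores[lab])


def train(samples, epochs: int = 10):
    """samples: list of (sentence, sample label) pairs."""
    labels = sorted({lab for _, lab in samples})
    weights = {lab: defaultdict(float) for lab in labels}
    for _ in range(epochs):
        for sentence, gold in samples:
            guess = predict(weights, labels, sentence)  # output result
            if guess != gold:                           # compare with sample label
                for f in featurize(sentence):           # optimize the model
                    weights[gold][f] += 1.0
                    weights[guess][f] -= 1.0
    return weights, labels
```

A usage example: after `weights, labels = train(samples)`, calling `predict(weights, labels, sentence)` returns the semantic label for a new sentence.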

Alternatively, the classification model may be a model constructed based on a rule-based approach. The rule-based method is to convert knowledge into rules, and the rules are usually accurate but have poor expansibility.

The machine learning model is mainly classified into the following two categories:

One type is a machine learning model constructed based on a traditional machine learning algorithm, such as an SVM (Support Vector Machine) algorithm or a GBDT (Gradient Boosting Decision Tree) algorithm; such a model is trained by manually defining model features and feeding the features into a classification algorithm.

The other type is a classification algorithm based on deep learning, such as CNN (Convolutional Neural Network), BERT, LSTM, and the like, which can automatically learn the internal connection between data and categories, so that there is no need to spend a long time building a feature library for the application scenario.

Still alternatively, the classification model may be a model constructed based on a hybrid approach. For example, the model may be constructed by combining algorithms such as CNN, BERT and LSTM. Alternatively, the model may combine a rule-based method with a machine learning algorithm: for example, a rule-based method is first used as a pre-filter to analyze the plurality of sentences in the text and obtain a preliminary semantic tag for each sentence; the preliminary semantic tags and the text are then used as the input of the model constructed with the machine learning algorithm, and that model is executed to obtain the semantic tag corresponding to each sentence. Of course, a post-rule method may also be added as a backstop: the semantic tags output by the machine learning model are used as the input of the post-rule method, and the post-rule method is executed to correct (or optimize) the semantic tags corresponding to the sentences in the text.
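The hybrid flow of pre-rule, model and post-rule can be sketched as a chain of three functions. Everything in the sketch is hypothetical: `model_predict` is a trivial stand-in for a trained model, and the two rules are invented only to show where each stage sits.

```python
from typing import Optional


def pre_rule(sentence: str) -> Optional[str]:
    """Pre-filter rule: return a preliminary tag, or None if no rule fires."""
    if "subsidy" in sentence.lower():
        return "preferential items"
    return None


def model_predict(sentence: str, hint: Optional[str]) -> str:
    """Stand-in for the machine learning model; it receives the rule hint as input."""
    if hint is not None:
        return hint
    if "not less than" in sentence.lower():
        return "declaration conditions"
    return "other"


def post_rule(sentence: str, tag: str) -> str:
    """Post-rule: correct tags that the model is assumed to get wrong."""
    if tag == "declaration conditions" and sentence.lower().startswith("note:"):
        return "other"
    return tag


def hybrid_classify(sentence: str) -> str:
    """Pre-rule output feeds the model; the model output feeds the post-rule."""
    return post_rule(sentence, model_predict(sentence, pre_rule(sentence)))
```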

In the above 1012, the text may be segmented at sentence granularity, and each segmented sentence is input into the classification model, which recognizes and analyzes each sentence to obtain its semantic tag. Sentence segmentation may be implemented with related means in the prior art, which is not specifically limited in this embodiment. For example, for a policy text aimed at enterprises, organizations and the like, the semantic tags corresponding to the sentences in the policy text may include, but are not limited to: declaration conditions, preferential items, other, etc. The sentences with the same semantic tag are then summarized based on the semantic tag of each sentence and the policy text. Here, summarizing includes operations such as merging and sorting: the sentences with the same semantic tag are merged, and are then sorted according to their positions in the policy text.
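The merging and sorting described above can be sketched as follows, assuming each classified sentence is available as a hypothetical `(position, sentence, tag)` tuple, where `position` is the index of the sentence in the policy text.

```python
from collections import defaultdict


def summarize_by_tag(tagged):
    """Merge sentences that share a semantic tag, then sort each group by position.

    tagged: iterable of (position in text, sentence, semantic tag).
    """
    groups = defaultdict(list)
    for position, sentence, tag in tagged:
        groups[tag].append((position, sentence))          # merging
    return {
        tag: [sentence for _, sentence in sorted(items)]  # sorting by position
        for tag, items in groups.items()
    }
```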

For example, after classifying text content in the policy text a by using a classification model, based on the policy text a and semantic tags corresponding to each sentence in the policy text a obtained after the classification, sentences with the same semantic tags are summarized to obtain a final result, where the final result is:

Sentences whose semantic tag is "declaration conditions" include: 1. Support objects included in the bluish blue program must meet the following basic conditions: (I) a technology enterprise created after June 1, 2011 by college teachers and scientific research institute experts leading a team, which is registered and pays taxes in the city, has independent legal person status, and has a sound financial management system and accounting system. (II) The college teachers and scientific research institute experts hold shares in the form of currency, intangible assets and the like, with a shareholding ratio of not less than 20% within 1 year after enterprise registration, and the technical achievements used by the enterprise are independently developed and free of intellectual property disputes. The following requirements must also be met: 1. Manufacturing headquarters. An enterprise engaged in advanced manufacturing has registered capital of not less than 50 million yuan, and its tax paid (local treasury receipts, the same below) in the previous year is not less than 10 million yuan. 2. Service headquarters. An enterprise engaged in modern logistics, leisure travel, trade and retail, business exhibition, scientific and technological research and development, creative design, service outsourcing, electronic commerce, equity investment and the like has registered capital of not less than 10 million yuan and tax paid of not less than 5 million yuan in the previous year.

Sentences whose semantic tag is "preferential items" include: a patentee who converts patented technology in the city is given a reward of no more than 10% of the subsidy limit of the enterprise implementing the patented technology.

In the policy text A, sentences other than those whose semantic tag is "declaration conditions" or "preferential items" may have the semantic tag "other"; specific sentences tagged "other" are not exemplified here.

In the above example, the illustrated sentences whose semantic tag is "declaration conditions" and the sentences whose semantic tag is "preferential items" may be used as the key contents in the policy text A.

It should be added that the classification model in the present embodiment may be a pre-written program code, an application program, a functional module, etc., and may be stored in a storage medium. When a sentence of text needs to be analyzed using a classification model, the classification model may be called from a corresponding storage medium. Specifically, referring to fig. 3a, a technical architecture corresponding to a text processing system is shown, and the technical architecture is a conceptual hierarchical architecture of the system. The technical architecture shown in fig. 3a provides technical support for implementing the steps of the text processing method provided by the present embodiment. As shown in fig. 3a, the system architecture includes:

the language processing layer is provided with at least one language processing model and is used for optimizing any language processing model according to the corpora acquired by the data layer; it is further used for performing semantic analysis on the input text by using at least part of the at least one language processing model, so as to determine semantic tags respectively corresponding to a plurality of sentences in the text; extracting at least one first-class word from the sentence whose semantic tag belongs to the first-class tag; extracting at least two second-class words and relationship information among different second-class words from the sentences whose semantic tags belong to the second-class tag; and generating a simplified expression of the text according to the at least one first-class word, the at least two second-class words and the relationship information among the different second-class words;

The data layer may interact with multiple databases, which may include but are not limited to: a basic thesaurus, an industry thesaurus, a stop-word thesaurus, an internet thesaurus, other thesauri, etc. The language processing layer can perform semantic analysis on the text by using at least one language processing model, and also provides some training algorithms; using the corpora provided by the databases (i.e., the words, sentences and the like stored in each database), the execution subject can optimize the at least one language processing model based on the training algorithms (for example, by adding, deleting or modifying rules, or by training) to obtain the required models, such as the classification model, the first extraction model, the second extraction model, the relation recognition model and the like. The optimized (or trained) models may be stored in a storage medium and called when needed.

The named entity recognition function shown in fig. 3a refers to recognizing entities of specific types, such as time, amount and location, from text, by recognizing the boundaries of the entities and determining the categories to which the entities belong based on the recognized boundaries. Similar to the classification model, the named entity recognition function is mainly realized with the following three kinds of methods: rule-based methods, machine learning-based methods, and hybrid methods. The rule-based method constructs rule templates whose features include statistical information, punctuation marks, keywords, indicator words, direction words, position words (such as tail words), central words and the like; in combination with a named entity library, a weight is assigned to each rule, and the type (such as the semantic tag of a sentence or the word label of a word mentioned above) is determined according to how well the entity conforms to the rules. The machine learning-based method uses a machine learning model, such as a hidden Markov model, a maximum entropy model, a support vector machine or a conditional random field, to learn a labeling model from a large amount of labeled corpora, so as to label the entity type at each position of a sentence. The hybrid method combines the rule-based and model-based methods, in the same way as the hybrid approach described above.
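As a rough illustration of the rule-based variant, the sketch below locates entity boundaries with regular expressions and assigns a category to each match. The three patterns in `RULES` are assumptions made for this example, and the rule weighting and named entity library mentioned above are omitted for brevity.

```python
import re

# Hypothetical rule templates: (entity category, pattern that fixes the boundaries).
RULES = [
    ("percentage", re.compile(r"\d+(?:\.\d+)?%")),
    ("amount", re.compile(r"\d+(?:\.\d+)?\s*(?:million|thousand)?\s*yuan")),
    ("time", re.compile(r"\b(?:19|20)\d{2}\b")),
]


def recognize_entities(sentence: str):
    """Return (start, end, category, text) for every rule match, ordered by start."""
    entities = []
    for category, pattern in RULES:
        for match in pattern.finditer(sentence):
            entities.append((match.start(), match.end(), category, match.group()))
    return sorted(entities)
```

Each returned tuple carries the entity boundaries (`start`, `end`), so the position of an entity in its sentence remains available to later steps.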

Machine Reading Comprehension (MRC) shown in fig. 3a is a technology that uses algorithms to enable a computer to understand text semantics and answer related questions. By summarizing the technical ideas of mainstream machine reading comprehension models and separating and recombining certain technical points, the models for machine reading comprehension can be classified into: one-dimensional matching models, two-dimensional matching models, inference models, other models, and the like. The one-dimensional and two-dimensional matching models are basic models, while inference models mainly study mechanisms for reasoning over text contents on top of the basic models. For specific implementations of the one-dimensional matching model, the two-dimensional matching model and the inference model, reference may be made to the existing literature, which is not described in detail herein. Fig. 3a shows that the first extraction model may be implemented with the named entity recognition function or the machine reading comprehension function; in fact, the classification model may also be implemented with the machine reading comprehension function, which is not specifically limited in this embodiment.

As can be seen, in the text processing system provided by this embodiment, a hierarchical system architecture design concept (as shown in fig. 3 a) is adopted, and the data layer may interact with the database to obtain corpora or store data; the language processing layer can be flexibly deployed with at least one language processing model, and the suitable language processing model can be selected for different types of texts. Or, an appropriate language processing model is selected according to actual precision, processing efficiency requirements and the like. For example, for documents such as policy texts, a language processing model (e.g., a neural network model LSTM provided in a machine learning-based approach) can be optimized (e.g., trained) using a proprietary document corpus, so that the optimized (or trained) language processing model suitable for the policy text can be used to process the policy text to obtain a simplified representation of the policy text. For other types of texts, such as news texts, the adaptive language processing model can be selected based on the characteristics of the news texts, so that the news texts are processed by utilizing the adaptive language processing model to obtain corresponding simplified expressions; and so on. The output layer is used for outputting the simplified expression of the text processed by the language processing layer so as to store the simplified expression into a database or output the simplified expression to a corresponding client and the like. In addition, in the technical solution provided by the embodiment of the present application, more than one semantic processing model may be used for semantic analysis of the text. Therefore, the text processing system adopting the hierarchical system architecture design concept provided by the embodiment of the application has higher flexibility, and can flexibly deploy and replace the language processing model based on actual requirements.
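The flexibility of the layered design can be pictured with a small registry in which language processing models are deployed and replaced per text type. The class and method names below are hypothetical and only illustrate the idea of selecting a model by text type; they are not part of the architecture of fig. 3a.

```python
from typing import Callable, Dict

# A language processing model is treated here as any callable from text to text.
Model = Callable[[str], str]


class LanguageProcessingLayer:
    """Holds one deployable (and replaceable) model per text type."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}

    def deploy(self, text_type: str, model: Model) -> None:
        """Deploy a model for a text type, replacing any earlier one."""
        self._models[text_type] = model

    def process(self, text_type: str, text: str) -> str:
        """Process the text with the model adapted to its type."""
        if text_type not in self._models:
            raise KeyError(f"no model deployed for text type {text_type!r}")
        return self._models[text_type](text)
```

For example, `layer.deploy("policy", policy_model)` followed by `layer.process("policy", text)` would route policy texts to the model optimized on policy corpora, while news texts could be routed to a different model.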

In this embodiment, the step 102 "extracting at least one first category word from the sentence whose semantic tag belongs to the first category tag" can be implemented by using the named entity recognition function or the machine reading understanding function, as shown in fig. 3 a. In other words, the step 102 of the present embodiment can be implemented by using the first extraction model, which is constructed based on the named entity recognition function. Specifically, in step 102, "extracting at least one first-class word from a sentence whose semantic tag belongs to a first-class tag" in this embodiment includes:

1021. acquiring a first extraction model;

1022. and taking the statement of which the semantic label belongs to the first class label as the input of the first extraction model, and executing the first extraction model to obtain the at least one first class word.

For example, the first extraction model may be constructed based on a BERT or LSTM algorithm. After extracting the at least one first-class word, the first extraction model can also identify a word label corresponding to each first-class word. The word label may be a subordinate label of the first-class semantic tag described above. For example, the subordinate labels of the semantic tag "declaration conditions" may include: business scope, place of registration, etc.

It should be added that the entity mentioned above may also be understood as a first-class word extracted from a sentence, and the word label corresponding to the first-class word may be: business scope, time, amount, location, etc. In this embodiment, "entity" and "word" are merely different names for the same concept in different scenarios. The term "entity" appearing hereinafter may also be understood as a word with a word label, more specifically a first-class word with a word label.

Each first-class word extracted by the first extraction model may carry its position in the sentence where it is located. Further, to improve extraction accuracy, the first-class words extracted by the first extraction model may be optimized by applying pre-established, well-suited rules in combination with the sentences where the words are located, so as to obtain at least one optimized first-class word. Such rules can be constructed by a linguistic expert or a technician based on intensive study and understanding of the texts (for example, a certain type of text such as policy texts). The optimized first-class words can then be summarized based on the words themselves and the positions they carry. Similar to the sentence summarization described above, the extracted first-class words with the same word label are merged and sorted according to the order of their sentences and their positions within those sentences.

Here, it should be noted that a word in this embodiment, such as a first-class word or a second-class word, may be a word formed by a single character or multiple characters, or a word segment formed by multiple words, which is not limited in this embodiment. Not every string of characters or of words counts as a word as defined herein; the words and word segments referred to herein are those commonly used in natural language. For example, "leisure travel" is a word segment, while "business research and development" is not.

Continuing with the policy text A shown in the above example, assume that the first-class words extracted by the first extraction model from the sentences belonging to the first-class tag include a plurality of first-class words whose word label is "business scope", namely "created by college teachers and scientific research institute experts leading a team", "engaged in advanced manufacturing", and "modern logistics, leisure travel, trade and retail, business exhibition, scientific and technological research and development, creative design, service outsourcing, electronic commerce, equity investment", as well as a first-class word whose word label is "place of registration", such as "city". Further, assume that, through intensive research and understanding of policies, a rule for the word label "business scope" is constructed from start words and tail words as: org_start (i.e., an initial word in a sentence) = ['engage', ...] or org_end (i.e., a tail word in a sentence) = ['business', 'logistics', 'travel', 'retail', 'exhibitions', 'research and development', 'design', 'outsource', 'business', 'investment', ...]. By applying this rule, together with the corresponding sentences, to the first-class words labeled "business scope" extracted by the first extraction model, it can be determined that "created by college teachers and scientific research institute experts leading a team" does not belong to the word label "business scope". Thus, the optimized first-class words finally labeled "business scope" include: engaged in advanced manufacturing, modern logistics, leisure travel, trade and retail, business exhibition, scientific and technological research and development, creative design, service outsourcing, electronic commerce, equity investment and the like.
Similarly, corresponding rules may also be constructed for words corresponding to other word labels (e.g., "place of registration" and "fund of registration"), which are not described herein again.

Here, it should be noted that: the rules specifically shown in the above examples and constructed for the word "business scope" are only illustrative and do not represent the actually constructed rules, that is, the constructed rules may be other types of rules, and the present embodiment does not limit this.
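The start-word / tail-word rule in the example can be sketched as a simple filter over the candidates returned by the first extraction model. The word lists below are adapted to the English wording of the example and are assumptions of this sketch, just as the rule in the text is itself only illustrative.

```python
# Hypothetical English renderings of the org_start / org_end rule lists.
ORG_START = ("engage",)
ORG_END = (
    "industry", "logistics", "travel", "retail", "exhibition",
    "research and development", "design", "outsourcing", "commerce", "investment",
)


def matches_business_scope(candidate: str) -> bool:
    """Keep a candidate if it starts with a start word or ends with a tail word."""
    text = candidate.strip().lower()
    return text.startswith(ORG_START) or text.endswith(ORG_END)


def optimize_business_scope(candidates: list[str]) -> list[str]:
    """Drop candidates the model labeled "business scope" that violate the rule."""
    return [c for c in candidates if matches_business_scope(c)]
```

Applied to the example, the candidate about college teachers leading a team is dropped, while "engaged in advanced manufacturing" and "modern logistics" are kept.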

In a specific embodiment, in the step 103 "extracting at least two second-type words and relationship information between different second-type words from the sentence whose semantic tag belongs to the second-type tag" in this embodiment may include the following steps:

1031. acquiring a second extraction model and a relation identification model;

1032. taking the statement of which the semantic label belongs to the second category label as the input of the second extraction model, and executing the second extraction model to obtain the at least two second category words;

1033. and taking the at least two second-class words and the sentence of which the semantic label belongs to the second-class label as the input of the relation recognition model, and executing the relation recognition model to obtain the relation information between any two second-class words in the at least two second-class words.

That is, the second extraction model is first used to extract at least two second-class words from the sentence whose semantic tag belongs to the second-class tag; then that sentence and the extracted second-class words are used as the input of the relation recognition model, which is executed to obtain the relationship information between any two of the at least two second-class words.

For example, take the sentence in the policy text A whose semantic tag belongs to the second-class tag (i.e., the sentence whose semantic tag is "preferential items"), denoted sentence a: "a patentee who converts patented technology in the city is given a reward of no more than 10% of the subsidy limit of the enterprise implementing the patented technology". Word extraction is performed on sentence a by using the second extraction model, which identifies two second-class words in sentence a, "subsidy limit" and "10%". These two second-class words constitute the word pair corresponding to sentence a. Inputting sentence a and the word pair ("subsidy limit" and "10%") into the relation recognition model yields the relationship information reflecting the relationship between the two second-class words, namely "multiply by no more than". In specific implementation, the two second-class words and the relationship information can be represented as the triple {"subsidy limit", "multiply by no more than", "10%"}.
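The data flow of steps 1032 and 1033 for this example can be sketched as follows. Both functions are hand-written stand-ins (assumptions) for the second extraction model and the relation recognition model, which in practice would be trained models; only the shape of the inputs and of the resulting triple is the point here.

```python
import re
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    head: str      # second-class word 1
    relation: str  # relationship information
    tail: str      # second-class word 2


def extract_second_class_words(sentence: str) -> list[str]:
    """Stand-in for the second extraction model (step 1032)."""
    words = []
    if "subsidy limit" in sentence:
        words.append("subsidy limit")
    match = re.search(r"\d+(?:\.\d+)?%", sentence)
    if match:
        words.append(match.group())
    return words


def recognize_relation(sentence: str, head: str, tail: str) -> Optional[str]:
    """Stand-in for the relation recognition model (step 1033): one fixed pattern."""
    pattern = rf"(?:no|not) more than {re.escape(tail)} of .*{re.escape(head)}"
    if re.search(pattern, sentence):
        return "multiply by no more than"
    return None


def build_triple(sentence: str) -> Optional[Triple]:
    """Run both steps and pack the result into a triple, if a relation is found."""
    words = extract_second_class_words(sentence)
    if len(words) < 2:
        return None
    relation = recognize_relation(sentence, words[0], words[1])
    return Triple(words[0], relation, words[1]) if relation else None
```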

Here, it should be noted that the relationship recognition model may be a machine learning model. Specifically, the relationship recognition model may be, but is not limited to being, constructed based on an algorithm such as R-BERT (a derivative model of BERT for relation extraction); the constructed model is trained and optimized with training samples to obtain a relationship recognition model that meets requirements. More generally, the relationship recognition model can be realized with the following two types of methods: rule-based methods and machine learning-based methods. A relationship recognition model constructed with a rule-based method is designed according to the requirements of the extraction task and comprises a number of patterns built from lexical, syntactic and semantic features; during sentence analysis, second-class words matching the patterns are searched for, and the relationship information between two second-class words is inferred. According to the degree of dependence on the training corpus, relationship recognition models constructed with machine learning-based methods fall into three kinds: unsupervised relationship recognition models, supervised relationship recognition models and weakly supervised relationship recognition models, wherein:

The core idea of the unsupervised relationship recognition model is that two words have similar meanings if they are used similarly and appear in the same contexts. Likewise, if two entity pairs have similar contexts, they tend to express the same semantic relationship. Unsupervised relationship recognition models can be used to discover new relationships, but the resulting relationships do not have semantic information.

The supervised relationship recognition model treats relationship recognition as a multi-class classification task: text feature vectors are extracted, and a supervised classifier is used to perform relationship extraction. Existing supervised relationship extraction methods can be classified into, but are not limited to, feature vector methods, kernel function-based methods, neural network-based methods, and the like.

Weakly supervised relationship recognition models can be divided into two frameworks. One reduces the cost of improving the extraction effect by using techniques such as semi-supervised learning and active learning, for example, using a small amount of labeled data to find unlabeled data in the text and then labeling the found data manually. The other is based on the back-labeling idea: using the relation triples in a knowledge base, the text where the entities are located is automatically back-labeled and used as training data, thereby reducing the cost of manual labeling.
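The second framework can be sketched in a few lines: every sentence that mentions both entities of a knowledge-base triple is automatically labeled with that triple's relation. The knowledge base contents and names below are hypothetical, and the resulting labels are noisy, since a sentence may mention both entities without expressing the relation.

```python
def back_label(sentences, knowledge_base):
    """Automatically label sentences with relations taken from a knowledge base.

    knowledge_base: iterable of (head entity, relation, tail entity) triples.
    Returns (sentence, head, tail, relation) training examples.
    """
    labeled = []
    for sentence in sentences:
        for head, relation, tail in knowledge_base:
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled
```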

For example, the second extraction model in this embodiment may be, but is not limited to, constructed and trained based on a machine learning method, and the relationship recognition model may be, but is not limited to, constructed and trained based on a deep learning supervised relationship recognition algorithm.

Referring back to the technical architecture diagram shown in fig. 3a, the language processing models provided in the language processing layer may include, but are not limited to, the classification model, the first extraction model, the second extraction model, the relationship recognition model, and the like mentioned above.

After the at least one first-class word, the at least two second-class words, and the relationship information among different second-class words are obtained in steps 102 and 103, the simplified expression of the text can be generated. Specifically, step 104 may include the following steps:

1041. expressing at least two second-class words with the relationship information into a simplified relational expression according to a relationship expression template corresponding to the relationship information;

1042. and generating a simplified expression of the text according to the at least one first word and the at least one simplified relational expression.

One piece of relationship information may correspond to one or two relationship expression templates. There may be multiple triples {second-class word 1, relationship information, second-class word 2} extracted from the sentences of the text, and different triples may have the same or different relationship information. For each triple, the relationship expression template corresponding to the relationship information in the triple can be selected, and the triple is expressed as a simplified relational expression. For example, the simplified relational expression corresponding to the triple {"subsidy limit", "multiply by no more than", "10%"} is: <= subsidy limit × 10%.

In specific implementation, an expression simplification model can be constructed, which comprises a plurality of relationship expression templates. A relationship expression template is an expression rule constructed for one relation word in a relation word library and used for describing the relationship between two words; it reflects the two words and the relation word between them. The relation word library is a database dedicated to storing the various relation words that describe the relationship between two elements; for example, the relation words stored in it may be "multiply by no more than", "multiply equal to", "increase by no more than", and the like, which is not limited herein. The relationship expression template for each relation word may be constructed manually, automatically, or by a combination of both. For example, the template for "multiply by no more than" may be "<= A × B", the template for "multiply equal to" may be "= A × B", and the template for "increase by no more than" may be "<= A + B".

Therefore, when the relational expression templates in the simplified expression model are used, the template corresponding to each triplet may be determined based on the relation word (i.e., the relationship information) in the at least one group of triplets, and the simplified relational expression corresponding to each triplet may then be generated according to that template. For example, continuing the above example, the two second-class words extracted from sentence a of policy text A, "a patentee who converts patented technology in Hangzhou is given a reward of no more than 10% of the subsidy limit of the enterprise implementing the patented technology", are "subsidy limit" and "10%", and the relation word describing the relationship information between "subsidy limit" and "10%" is "multiplied by no more than". Based on the template corresponding to "multiplied by no more than" in the simplified expression model, for example "<= A × B", the simplified relational expression reflecting the relation between "subsidy limit" and "10%" can be generated as: "<= subsidy limit × 10%". This simplified relational expression is the simplified expression corresponding to sentence a, and it reflects more intuitively, to a certain extent, the degree of preference provided by policy text A. Of course, the simplified expression corresponding to sentence a may take forms other than the formula shown above, such as a specific numerical form.
For example, when the specific value of the "subsidy limit" is known, the product of the "subsidy limit" and "10%" can be further calculated, and the simplified expression corresponding to sentence a then takes a numerical form. Specifically, if the value of the "subsidy limit" is 20,000 yuan, the simplified expression corresponding to sentence a may be "<= 2000 yuan". This embodiment does not limit the specific form of the simplified expression.
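The template lookup described above can be illustrated with a minimal sketch. It assumes a hypothetical dictionary `RELATION_TEMPLATES` standing in for the relation word library and hypothetical helper names `simplify_triplet` and `evaluate_product_bound`; none of these names come from the application, and the 20,000 value is the subsidy limit assumed in the example.

```python
# Hypothetical relation word library: each relation word maps to an
# expression template with placeholders A and B (all names are assumptions).
RELATION_TEMPLATES = {
    "multiplied by no more than": "<= {A} × {B}",
    "multiplied by equal to": "= {A} × {B}",
    "increased by no more than": "<= {A} + {B}",
}


def simplify_triplet(word_a: str, relation: str, word_b: str) -> str:
    """Render a (word, relation word, word) triplet with its template."""
    template = RELATION_TEMPLATES[relation]
    return template.format(A=word_a, B=word_b)


def evaluate_product_bound(limit: float, percent: str) -> float:
    """Numerical form for a 'multiplied by' template, e.g. 20000 and '10%'."""
    return limit * float(percent.rstrip("%")) / 100.0


# Formula form of the simplified relational expression.
formula = simplify_triplet("subsidy limit", "multiplied by no more than", "10%")

# Numerical form, available only once the subsidy limit is known.
upper_bound = evaluate_product_bound(20000, "10%")
```

Keeping the templates as data rather than code means a new relation word only needs a new dictionary entry, which matches the idea of one template per relation word.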

Further, the step 1042 "generating the simplified representation of the text according to the at least one first word and the at least one simplified relation" may include:

10421. acquiring a post-processing algorithm;

10422. post-processing the at least one first type of word by using the post-processing algorithm to obtain a summarized expression;

10423. and generating a simplified expression of the text according to the summarized expression and the at least one simplified relational expression.

The post-processing algorithm performs the above-mentioned summarization of first-class words of the same word type. For example, in the above example, the first-class words whose word label is "business scope" include: engaged in advanced manufacturing, modern logistics, leisure tourism, commerce and retail, business exhibitions, scientific and technological research and development, creative design, service outsourcing, e-commerce, and equity investment. Summarizing these first-class words yields the following summarized expression:

Business scope: engaged in advanced manufacturing, modern logistics, leisure tourism, commerce and retail, business exhibitions, scientific and technological research and development, creative design, service outsourcing, e-commerce, and equity investment.
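The summarization in step 10422 can be sketched as a grouping of first-class words by word label. This is only an illustration under assumptions: the function name `summarize_first_class_words` and the `(label, value)` pair representation are hypothetical, and the application does not prescribe a particular post-processing algorithm.

```python
from collections import defaultdict


def summarize_first_class_words(words):
    """Group (word label, value) pairs by label and join the values per label."""
    grouped = defaultdict(list)
    for label, value in words:
        if value not in grouped[label]:  # drop exact duplicates, keep order
            grouped[label].append(value)
    return [f"{label}: {', '.join(values)}" for label, values in grouped.items()]


# Hypothetical first-class words extracted from a policy text.
first_class_words = [
    ("business scope", "advanced manufacturing"),
    ("business scope", "modern logistics"),
    ("business scope", "e-commerce"),
    ("business scope", "modern logistics"),  # repeated mention in the text
]
summary = summarize_first_class_words(first_class_words)
```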

Suppose the first-class words also include a word whose word label is "registered place" and whose value is "city"; then, based on step 10423 above, the following simplified expression can be generated:

Further, the method provided by this embodiment may further include the following steps:

105. acquiring description information of a target object;

106. determining a recommended text in the plurality of texts according to the description information and simplified expressions of the plurality of texts;

107. and sending the simplified expression corresponding to the recommended text to a client corresponding to the target object.

Before step 106 is executed, the description information of the target object may be preprocessed. Such preprocessing may include, but is not limited to, near-synonym expansion and normalization. Normalization refers to representing a plurality of words or labels that are near-synonyms of one another by the same Chinese word or label.

Assume that the application scenario of this embodiment is a government-enterprise service scenario, that is, the target object is an enterprise, and the text is a policy text. The simplified expression of the text obtained through the steps comprises at least one first word and at least one simplified relational expression. The first category of words includes word labels and corresponding values, such as word label "registered place", and corresponding values "city". The description information of the target object may also be preprocessed into a representation including word labels and corresponding values, which is done to facilitate subsequent relevancy calculations.

In step 105, the target object may be an enterprise user, or another type of user such as an individual user, which is not limited herein. The description information of the target object is preprocessed to obtain a plurality of third-class words describing the target object. Similarly, a third-class word includes a word label and its corresponding value. For example, the word label "business name" corresponds to a value of ". X company", the word label "registered place" corresponds to a value of ". X country", the word label "business scope" corresponds to a value of "business retail", and the like.

In practical applications, the same content can be expressed in various ways; for example, "Beijing" and similar forms can all refer to "Beijing city". In order to recommend texts to the target object more accurately in subsequent steps, in the technical scheme provided by this embodiment, after the description information of the target object is obtained, operations such as near-synonym expansion and normalization are performed on the words extracted from the description information, and the same operations are also performed on the first-class words in the simplified expression of the text.

In a specific implementation, a near-synonym model is used to perform near-synonym expansion on the words extracted from the description information and on the first-class words in the simplified expression of the text, yielding a plurality of near-synonym sequences with similar meanings. Here, it should be noted that, in this embodiment, both the words extracted from the description information and the first-class words in the simplified expression of the text include a word label and a corresponding value. Therefore, near-synonym expansion can be performed on the word label and, at the same time, on the value corresponding to the word label. For example, in the description information the word label "business scope" corresponds to the value "business retail". The near-synonyms of the word label "business scope" obtained after expansion include: main business item, business item, local industry, … …; the near-synonyms of "business" include: trade, commerce, … …; and the near-synonyms of "retail" include: retail, vending, selling, … ….

The goal of near-synonym expansion is to enrich the ways a word can be expressed, which helps improve recommendation accuracy later on. A normalization model is then used to normalize the near-synonyms obtained from the expansion, that is, words with the same meaning but different expressions are represented by one unified standard word.

For example, a word vector model is constructed based on the word2vec algorithm, and the word vector model is then used to calculate word vectors for the data in a database (such as a basic database, an industry word library, an internet word library and the like). The distance between two words is then calculated based on their word vectors, and two words whose distance is smaller than a set threshold (the specific value is not limited in this embodiment) are defined as near-synonyms. Finally, the two words defined as near-synonyms are stored in a near-synonym library. A near-synonym expansion model is obtained based on the near-synonym library and is used to perform near-synonym expansion on the words extracted from the description information of the target object and on the first-class words in the simplified expression of the text, respectively.
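The thresholding step described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the three-dimensional `toy_vectors` are made-up stand-ins for vectors that a trained word2vec model would produce, cosine distance is an assumed choice (the text does not name a distance measure), and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; an assumed distance measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def build_synonym_library(word_vectors, threshold):
    """Store every pair of words closer than `threshold` as near-synonyms."""
    library = {word: set() for word in word_vectors}
    for w1, w2 in combinations(word_vectors, 2):
        if cosine_distance(word_vectors[w1], word_vectors[w2]) < threshold:
            library[w1].add(w2)
            library[w2].add(w1)
    return library


# Toy vectors for illustration only; real vectors would come from a
# word vector model trained on the databases mentioned in the text.
toy_vectors = {
    "retail": [0.9, 0.1, 0.0],
    "vending": [0.8, 0.2, 0.1],
    "logistics": [0.0, 0.1, 0.9],
}
synonyms = build_synonym_library(toy_vectors, threshold=0.1)
```

Near-synonym expansion of a word then amounts to looking the word up in `synonyms`. The pairwise loop is quadratic in vocabulary size, so a large word library would likely need an approximate nearest-neighbour index instead.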

A normalization model is then used to express all the words obtained from near-synonym expansion as one unified standard word. For example, the near-synonyms "free selling", "retail", etc. are uniformly expressed as "retail". The normalization model may be a machine learning model, such as a neural network model. For example, the word vectors of a plurality of words that are near-synonyms in the near-synonym library are taken as the input of the neural network model, and the neural network model is executed to obtain an output result; the neural network model is then optimized based on the output result and the sample normalized word corresponding to the plurality of words. The trained neural network model can be used as the normalization model in this embodiment.

Correspondingly, in this embodiment, the step 106 "determining a recommended text in the plurality of texts according to the description information and the simplified expressions of the plurality of texts" includes:

1061. Preprocessing the description information and the simplified expressions of the plurality of texts, respectively.

Specifically, the preprocessing of the description information may include: extracting at least one word from the description information (which can also be done with the first extraction model mentioned above), performing near-synonym expansion on the extracted word by using the near-synonym expansion model, and then normalizing the resulting near-synonyms by using the normalization model. This preprocessing yields the standard word (or normalized expression) corresponding to each word contained in the description information.

The preprocessing of the simplified expression of a text may specifically be preprocessing of the at least one first-class word in the simplified expression. The process is the same as that described above and yields the standard word (or normalized expression) corresponding to each first-class word.

1062. Calculating the degree of correlation between the description information and each text according to the at least one standard word corresponding to the preprocessed description information and the at least one standard word corresponding to each of the plurality of texts.

The degree of correlation may also be referred to as a matching degree, an adaptation degree, or the like. In a specific implementation, a distance may be calculated based on the word vector of the at least one standard word corresponding to the description information and the word vector of the at least one standard word corresponding to one text. The calculated distance can be used directly as the degree of correlation between the description information and the text, or it can be converted, according to a preset conversion rule, into a degree parameter representing the degree of correlation. In implementation, the word vectors of the standard words corresponding to the description information and the word vectors of the standard words corresponding to the text may be embedded into the same vector space before the distance is calculated.
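One way to picture this calculation is the sketch below. It rests on several assumptions that the text does not fix: each side is pooled by averaging its word vectors, the distance is Euclidean, and the "preset conversion rule" is taken to be 1 / (1 + distance). All function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length word vectors (assumed pooling step)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relevance(description_vectors, text_vectors):
    """Map the distance between pooled vectors to a score in (0, 1].

    The rule 1 / (1 + distance) is an assumed conversion rule: a distance
    of 0 gives relevance 1, and larger distances give smaller relevance.
    """
    distance = euclidean_distance(
        mean_vector(description_vectors), mean_vector(text_vectors)
    )
    return 1.0 / (1.0 + distance)


# Toy standard-word vectors in a shared two-dimensional space.
description = [[1.0, 0.0], [0.0, 1.0]]
close_text = [[0.5, 0.5]]
far_text = [[3.5, 4.5]]
```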

In a specific implementation, the degree of correlation between the description information and a text can be calculated by using a calculation model. The input of the calculation model can be the word vector of the at least one standard word corresponding to the description information and the word vector of the at least one standard word corresponding to a text, and the distance is output by executing the calculation model. The calculation model may be a model constructed based on a machine learning algorithm, for example a neural network model. The training sample set includes a plurality of samples, and one sample includes: the word vector of at least one standard word corresponding to sample enterprise description information, the word vector of at least one standard word corresponding to a sample text, and a sample distance. During training, a sample is used as the input of the neural network model, and the neural network model is executed to obtain an output result; the neural network model is optimized based on the output result and the sample distance; after optimization, the next sample is acquired from the training sample set and the process is repeated until the difference between the output result and the sample distance meets a preset condition, which indicates that training of the model is complete.

It should be noted that the above-mentioned calculation model is not limited to the neural network model listed above, and may be other types of models, which is not limited in this embodiment.

1063. And determining a recommended text in the plurality of texts according to the correlation degree between the description information and each text.

Wherein, the number of the determined recommended texts can be multiple.

One implementation may be: recalling a plurality of texts whose degree of correlation meets a preset condition, then sorting the plurality of texts, and recommending the sorted texts to the target object.

The above process may be referred to as a content-based recommendation algorithm, i.e., a recommendation algorithm based on the distance between the word vectors of the description information and those of each text. The algorithm does not have the cold start problem. In addition, the algorithm is suitable for scenarios in which texts are recommended to enterprises and organizations: in practical applications the behavior information of enterprises and organizations, such as records of texts historically consulted, collected or downloaded, is difficult to acquire, so it is difficult to recommend texts on the basis of historical behavior information.

In other scenarios (such as recommending policy texts to an individual user), the historical behavior information of the user is not difficult to obtain. Where the historical behavior information of the target object can be obtained, a text that is similar to a historical text included in the historical behavior information may, when step 1063 is performed, be selected as the recommended text from among the plurality of texts whose degree of correlation satisfies the preset condition. The historical text included in the historical behavior information may be a text consulted, collected or downloaded by the target object at a certain historical time, or the like.

In a scenario where historical behavior information of a target object can be obtained, another recommendation algorithm that can be used is a collaborative filtering recommendation algorithm. Collaborative filtering recommendation algorithms may further include object-based collaborative filtering, text-based collaborative filtering, model-based collaborative filtering, and the like.

1. Object-based collaborative filtering

The similarity between objects (such as users, enterprises and the like) is calculated based on their operation behaviors on texts (such as collecting, downloading, consulting and the like). For example, user A collects text a at historical time t1, views text b at historical time t2, and downloads text c at historical time t3. User B downloads text a at historical time t4, views text d at historical time t5, and views text c at historical time t6. User C downloads text e at historical time t7, views text f at historical time t8, and views text g at historical time t9. It can be derived that:

user A and user B have both viewed, collected or downloaded text a and text c;

user A and user C, as well as user B and user C, have no texts in common;

it can thus be determined that user A and user B are similar users, whereas user A and user C, and user B and user C, are not. Because the text preferences of similar users have recommendation value, texts that similar users have historically viewed, collected or downloaded can be used as recommended texts.
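The example with users A, B and C can be reproduced with a short sketch. The Jaccard coefficient is used here as an assumed similarity measure, since the text only says that similarity is calculated from operation behaviors; the function names and the 0.5 cut-off are likewise hypothetical.

```python
def jaccard_similarity(texts_a, texts_b):
    """Share of texts two objects have in common (assumed similarity measure)."""
    union = texts_a | texts_b
    return len(texts_a & texts_b) / len(union) if union else 0.0


def similar_objects(target, history, min_similarity):
    """Objects whose interaction history overlaps enough with the target's."""
    return [
        other
        for other in history
        if other != target
        and jaccard_similarity(history[target], history[other]) >= min_similarity
    ]


def texts_from_similar_objects(target, history, min_similarity):
    """Texts that similar objects interacted with but the target has not."""
    candidates = set()
    for other in similar_objects(target, history, min_similarity):
        candidates |= history[other] - history[target]
    return candidates


# Texts each user viewed, collected or downloaded, as in the example.
history = {
    "A": {"a", "b", "c"},
    "B": {"a", "d", "c"},
    "C": {"e", "f", "g"},
}
neighbours_of_a = similar_objects("A", history, min_similarity=0.5)
candidates_for_a = texts_from_similar_objects("A", history, min_similarity=0.5)
```

Under these assumptions user B is the only neighbour of user A, and text d, which B viewed and A did not, becomes a candidate recommendation for A.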

In this embodiment, in the step 1063, the content-based recommendation algorithm may be fused with the collaborative filtering algorithm to implement text recommendation. That is, the step 1063 may specifically be:

and determining a recommended text in the plurality of texts according to the correlation degree between the description information and each text and the similarity between the target object and other objects.

That is, a plurality of texts whose degree of correlation satisfies a preset condition are recalled as recommended texts, and texts preferred by objects whose similarity to the target object satisfies a similarity condition are also recalled as recommended texts. The texts preferred by an object may be determined based on the historical behavior information of that object, which is not limited herein. For example, in the above example, user A and user B are similar users; if user A is the target object in this embodiment, text d viewed by user B at historical time t5 can be recommended to user A as a recommended text.

The recalled recommended texts thus contain both texts whose degree of correlation meets the preset condition and texts preferred by similar objects. Alternatively, an intersection may be taken at recall time: for example, the plurality of texts whose degree of correlation satisfies the preset condition form a first set, the plurality of texts preferred by similar objects form a second set, and the intersection of the first set and the second set is taken as the recommended texts.
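The two fusion options described above, union and intersection, can be sketched as set operations. The function name `fuse_recalls`, the `mode` parameter and the text identifiers are hypothetical and serve only to illustrate the idea.

```python
def fuse_recalls(content_recall, preference_recall, mode="union"):
    """Combine content-based recall with similar-object preference recall.

    mode="union" keeps every text recalled by either route;
    mode="intersection" keeps only texts recalled by both routes.
    """
    if mode == "union":
        return content_recall | preference_recall
    if mode == "intersection":
        return content_recall & preference_recall
    raise ValueError(f"unknown fusion mode: {mode}")


# Hypothetical text identifiers.
first_set = {"text-1", "text-2", "text-d"}   # relevance meets the preset condition
second_set = {"text-d", "text-9"}            # preferred by similar objects

broad = fuse_recalls(first_set, second_set, mode="union")
strict = fuse_recalls(first_set, second_set, mode="intersection")
```

The union favours coverage, while the intersection favours precision at the cost of possibly returning few or no texts.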

2. Text-based collaborative filtering

Put simply, the texts consulted, collected or downloaded by the target object during a historical period are clustered by similarity to analyze the text preference of the target object, so that texts the target object is likely to prefer can be recommended.

Correspondingly, the step 1063 may specifically be:

and determining a recommended text in the plurality of texts according to the correlation degree of the description information and each text and the text preference of the target object.

3. Model-based collaborative filtering

Put simply, a filtering model is constructed that can predict the future behavior of the target object by learning from the historical behavior information of the target object. For example, the historical behavior information of the target object includes information on the texts viewed, downloaded or collected during a historical period. The filtering model can learn from the historical behavior information as samples in order to predict the preference of the target object for a given text.

Correspondingly, the step 1063 may specifically be:

determining at least one first candidate text in the plurality of texts according to the correlation degree between the description information and each text;

taking the historical behavior information of the target object and the simplified expressions of the plurality of texts as inputs of the filtering model, and executing the filtering model to obtain at least one second candidate text preferred by the target object;

and determining the recommended text according to the at least one first candidate text and the at least one second candidate text.

Further, the recommendation algorithm that can be used in the present embodiment may also be a knowledge-based recommendation algorithm. A knowledge-based recommendation algorithm provides a recommended text according to preset knowledge rules and a specified requirement (for example, a text content requirement input by the target object, which may take the form of a keyword, a phrase, a sentence, and the like). If no corresponding recommended text exists, the requirement can be refined through one or more rounds of information interaction with the target object until a recommended text that satisfies it is found according to the preset knowledge rules. Specifically, the knowledge-based recommendation algorithm may include a constraint-based recommendation algorithm and an instance-based recommendation algorithm, wherein:

1. The constraint-based recommendation algorithm constructs a recommendation rule set and recommends content that satisfies the recommendation rules to the target object. For example, after the requirement of the target object is obtained, at least one matching recommendation rule is obtained from the recommendation rule set based on that requirement, and a text satisfying the at least one recommendation rule is then recommended to the target object.

Correspondingly, the step 1063 may specifically be:

acquiring demand information input by the target object at least once, acquiring at least one adaptive recommendation rule in a recommendation rule set based on the demand information, and recommending at least one second candidate text meeting the at least one recommendation rule for the target object;

2. The instance-based recommendation algorithm retrieves, from a plurality of texts, a text similar to the requirement of the target object as a recommended text. The algorithm resembles a search algorithm: the target object enters a requirement (e.g., a keyword), and texts that contain or approximate the requirement are then retrieved.

Correspondingly, the step 1063 may specifically be:

acquiring demand information input by the target object at least once, and retrieving at least one second candidate text containing content similar to the demand information from a plurality of texts based on the demand information;
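The two knowledge-based variants can be contrasted with a small sketch. Everything here is assumed for illustration: the `texts` dictionary of simplified expressions, the `RULE_SET` of predicates keyed by a requirement keyword, and the substring matching used for retrieval. The application does not specify how rules or retrieval are represented.

```python
# Hypothetical simplified expressions: text id -> {word label: value}.
texts = {
    "policy-1": {"registered place": "city", "business scope": "retail"},
    "policy-2": {"registered place": "city", "business scope": "logistics"},
}

# Hypothetical recommendation rule set: a keyword in the requirement
# selects a rule, and each rule is a predicate over a simplified expression.
RULE_SET = {
    "retail": lambda fields: fields.get("business scope") == "retail",
    "logistics": lambda fields: fields.get("business scope") == "logistics",
}


def constraint_based(requirement, texts, rule_set):
    """Pick the rules matching the requirement, then keep texts passing all."""
    rules = [rule for key, rule in rule_set.items() if key in requirement]
    if not rules:
        return []  # no rule applies; the requirement needs refining
    return [tid for tid, fields in texts.items() if all(r(fields) for r in rules)]


def instance_based(requirement, texts):
    """Retrieve texts whose values contain the requirement keyword."""
    return [
        tid
        for tid, fields in texts.items()
        if any(requirement in value for value in fields.values())
    ]
```

An empty result corresponds to the case described above in which no recommended text is found and further interaction with the target object is needed to refine the requirement.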

Further, the recommendation algorithm that can be used in this embodiment may also be a combined recommendation algorithm, which uses two or more recommendation algorithms at the same time, for example a collaborative filtering recommendation algorithm combined with a content-based recommendation algorithm. The more specific implementations of step 1063 listed above in the part introducing the collaborative filtering recommendation algorithm reflect this combined approach. Similarly, the more specific implementations of step 1063 listed in the part introducing the knowledge-based recommendation algorithm reflect the combination of the knowledge-based recommendation algorithm with the content-based recommendation algorithm.

It should be added that the content-based recommendation algorithm, the knowledge-based recommendation algorithm, and the collaborative filtering recommendation algorithm described above may each also be used alone to recommend texts for a target object. In a specific implementation, which algorithm or combination of algorithms is selected can be determined based on recommendation precision requirements, application scenarios and the like. For example, in a scenario of recommending policy texts to an enterprise, since it is difficult to obtain the historical behavior information of the enterprise, a content-based recommendation algorithm is selected to perform the recommendation.

Still further, the functional modules corresponding to the method provided by this embodiment can be designed as low-dependency, decouplable recommendation function modules, such as a coarse recall module and a fine sorting module. These functional modules together form a lightweight recommendation framework, and each functional module can flexibly select among various algorithms. In implementation, a functional module such as the coarse recall module may take the form of an executable file written in a computer language, and the corresponding step is completed by executing the executable file that implements the function. That is, after a plurality of recommended texts are recalled by the above method, the fine sorting module (i.e., the executable file implementing the fine sorting function) may be called to sort the plurality of recommended texts in descending order. Finally, the top N recommended texts are pushed to the target object. The value of N may be a default value, such as 5 or 8, or a value set by the target object, which is not specifically limited in this embodiment.
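The decoupled recall-then-sort flow can be sketched as two independent functions composed by a third. This is an illustration under assumptions: the function names, the relevance threshold used as the recall condition, and sorting by relevance score are all hypothetical choices, since each module may select its own algorithm.

```python
def coarse_recall(relevance_by_text, min_relevance):
    """Coarse recall: keep texts whose relevance meets the preset condition."""
    return {
        tid: score
        for tid, score in relevance_by_text.items()
        if score >= min_relevance
    }


def fine_sort(candidates):
    """Fine sorting: order recalled texts from most to least relevant."""
    return sorted(candidates, key=candidates.get, reverse=True)


def recommend(relevance_by_text, min_relevance, top_n=5):
    """Compose the two modules and push only the top N texts."""
    return fine_sort(coarse_recall(relevance_by_text, min_relevance))[:top_n]


# Hypothetical relevance scores for four texts.
scores = {"t1": 0.4, "t2": 0.9, "t3": 0.7, "t4": 0.1}
top = recommend(scores, min_relevance=0.3, top_n=2)
```

Because the two stages only share a dictionary of scores, either one can be swapped for a different algorithm without touching the other.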

Here, it should be added that, when there are a plurality of recommended texts, they may be recommended to the target object in the form of a list. In a specific implementation, the recommendation information sent to the target object may include only the simplified expression of each recommended text and a link to each recommended text. Alternatively, the recommendation information sent to the target object may include the full text of each recommended text as well as its simplified expression.

FIG. 3b is a schematic diagram depicting the text processing system provided herein in terms of a layered architecture. A recommendation layer is added on the basis of FIG. 3a. In the embodiment, the contents of the language processing layer, the data layer and the output layer are received

The language processing layer is also used for storing, through the data layer, the text in association with the simplified expression of the text in the database.

The recommendation layer is positioned between the language processing layer and the output layer, is provided with at least one recommendation model, and is used for acquiring the description information of the target object; the recommendation layer is further configured to perform relevance analysis on the description information and the simplified expressions of a plurality of texts stored in the database by using at least part of the at least one recommendation model, and to determine at least one recommended text to be recommended to the target object.

The output layer is used for outputting the simplified expression of the at least one recommended text recommended to the target object.

Specifically, the at least one recommendation model may include, but is not limited to: a model built based on a content-based recommendation algorithm, a model built based on a collaborative filtering recommendation algorithm, a model built based on a knowledge-based recommendation algorithm, a model built based on a combined recommendation algorithm, and the like.

The above describes the system in terms of a more abstract layered architecture. A more hardware-oriented description may also be adopted; for example, the text processing system provided by the following embodiment includes a database and a language processing engine, wherein:

the database is used for storing the linguistic data;

Further, the text processing system provided in this embodiment further includes:

the recommendation engine is used for acquiring the description information of the target object; performing relevance analysis on the description information and simplified expressions of a plurality of texts stored in the database by using at least part of at least one recommendation model, and determining at least one recommendation text recommended to the target object;

an output module for outputting a simplified representation of the at least one recommended text recommended to the target object.

Fig. 4 is a flowchart illustrating another text processing method according to an embodiment of the present application. The execution main body of the method may be a server, and the server may be a common server, a cloud, a virtual server, or the like, which is not specifically limited in this embodiment. The embodiment is suitable for a scene of recommending the policy text to the enterprise user. As shown in fig. 4, the method includes the steps of:

201. determining semantic tags respectively corresponding to a plurality of sentences in a policy text;

202. extracting at least one first category word from sentences of which the semantic tags belong to the first category tags;

203. extracting at least two second-class words and relationship information among different second-class words from sentences of which the semantic tags belong to the second-class tags;

204. generating a simplified expression of the policy text, for display to the enterprise user, according to the at least one first-class word, the at least two second-class words and the relationship information among the different second-class words;

205. acquiring description information of an enterprise user;

206. determining a recommended policy text in the plurality of policy texts according to the description information and simplified expressions of the plurality of policy texts;

207. and sending the simplified expression corresponding to the recommended policy text to a client corresponding to the enterprise user.

For more details of the above 201-207, reference may be made to the corresponding description above, and details are not repeated here.

Fig. 5 also shows a flow diagram of a text processing method. The embodiment shown in FIG. 5 is written from the perspective of client-side interaction with the server-side. Specifically, as shown in fig. 5, the method includes:

301. receiving request information sent by a target object through a client;

302. and acquiring the description information of the target object according to the request information.

Wherein the description information may be carried in request information; or the request information may carry an identifier of the target object, and the execution main body of the method of this embodiment obtains corresponding description information according to the identifier.

303. Determining a recommended text in the plurality of texts according to the description information and simplified expressions of the plurality of texts;

304. Sending the simplified expression of the recommended text to the client so as to be displayed on the client.

The simplified expression of a text is generated according to at least one first-class word, at least two second-class words and the relationship information among different second-class words of the text; the at least one first-class word is extracted from sentences of the text whose semantic tags belong to the first-class tags, and the at least two second-class words and the relationship information among different second-class words are extracted from sentences of the text whose semantic tags belong to the second-class tags.

Here, it should be noted that: for more detailed contents of each step in this embodiment, reference may be made to the corresponding description above, and further description is omitted here. In addition, the present embodiment may include other steps mentioned above in addition to the above steps.

Fig. 6 illustrates a text processing system provided in an embodiment of the present application. The system provided by the present embodiment is described from a hardware perspective. As shown in fig. 6, the text processing system includes a server 401 and a client 402, wherein:

the server 401 is configured to determine semantic tags corresponding to a plurality of sentences in a text; extracting at least one first class word from the statement of which the semantic tag belongs to the first class tag; extracting at least two second-class words and relationship information among different second-class words from sentences of which the semantic tags belong to the second-class tags; generating a simplified expression of the text according to the at least one first word, the at least two second words and the relationship information among the different second words; and storing the simplified expression of the text into a text simplified expression library.

A client 402, configured to send request information to a server 401;

the server 401 is further configured to receive request information sent by a target object through the client 402; acquire the description information of the target object according to the request information; determine a recommended text among a plurality of texts according to the description information and the simplified expressions of the plurality of texts stored in the text simplified expression library; and send the simplified expression of the recommended text to the client;

the client 402 is further configured to display a simplified representation of the recommended text.
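The server-side flow described above can be pictured with the following minimal sketch. It is illustrative only: the label names, the keyword rules standing in for the classification and extraction models, the relation type `greater_equal`, and the template table are all hypothetical and are not taken from the application, which leaves the concrete models open.

```python
import re

# Hypothetical label names; the application only distinguishes
# "first-class" and "second-class" tags.
FIRST_CLASS_LABELS = {"applicant"}
SECOND_CLASS_LABELS = {"condition"}

# Hypothetical relationship expression templates, keyed by relation type.
RELATION_TEMPLATES = {"greater_equal": "{left} >= {right}"}

# Hypothetical term list standing in for a trained first extraction model.
GAZETTEER = ("small technology enterprises", "subsidy")

# Hypothetical pattern standing in for the second extraction model
# and the relation recognition model.
CONDITION_PATTERN = re.compile(
    r"(?P<left>.+?) must be at least (?P<right>.+?)\.?$", re.IGNORECASE
)


def split_sentences(text):
    """Split after sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def classify(sentence):
    """Stand-in for the classification model: simple keyword rules."""
    lowered = sentence.lower()
    if "eligible" in lowered:
        return "applicant"
    if "at least" in lowered:
        return "condition"
    return "other"


def extract_first_class_words(sentence):
    """Return the gazetteer terms that occur in the sentence."""
    lowered = sentence.lower()
    return [term for term in GAZETTEER if term in lowered]


def extract_second_class(sentence):
    """Return (words, relations); each relation is (left, type, right)."""
    match = CONDITION_PATTERN.match(sentence)
    if not match:
        return [], []
    left = match.group("left").strip()
    right = match.group("right").strip()
    return [left, right], [(left, "greater_equal", right)]


def simplify(text):
    """Build a simplified expression from keywords and templated relations."""
    keywords, relations = [], []
    for sentence in split_sentences(text):
        label = classify(sentence)
        if label in FIRST_CLASS_LABELS:
            keywords.extend(extract_first_class_words(sentence))
        elif label in SECOND_CLASS_LABELS:
            _, found = extract_second_class(sentence)
            for left, relation, right in found:
                template = RELATION_TEMPLATES[relation]
                relations.append(template.format(left=left, right=right))
    return {"keywords": keywords, "relations": relations}


TEXT = (
    "Small technology enterprises are eligible for the subsidy. "
    "Annual revenue must be at least 5 million yuan. "
    "Applications close in March."
)

# The "text simplified expression library" is modelled here as a plain dict.
library = {"policy-001": simplify(TEXT)}
```

In this sketch the third sentence receives neither a first-class nor a second-class label, so it contributes nothing to the simplified expression; in a real deployment each stand-in function would be replaced by a trained model.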

Referring to fig. 6, in the technical solution provided in this embodiment, the server 401 may be a single server, a virtual server deployed on a server or a server cluster, or a set of computers based on cloud computing (i.e., a cloud). Cloud computing is a form of distributed computing in which a group of loosely coupled computers forms a super virtual computer. The client 402 may be, for example, a desktop computer, a tablet computer, a smartphone, or a smart wearable device (e.g., a smart watch or smart glasses).

Here, it should be noted that: for content of the system provided in this embodiment that is not described in detail here, reference may be made to the corresponding content in the foregoing embodiments. In addition, the system provided in this embodiment may further perform some or all of the other steps in the foregoing embodiments; for details, refer to the corresponding content above, which is not repeated here.

In summary, the technical solutions provided by the embodiments of the present application have at least the following advantages:

For a text, language processing capabilities (such as text classification and entity recognition) can be combined with the characteristics of the text to locate and extract words from the text and to extract the relationship information between words, so that the text is expressed accurately, completely and concisely, which saves the target object's reading time.

When a text is recommended to the target object, a content-based recommendation algorithm is applied to the description information of the target object and the simplified expressions of the texts, so that the recommendation accuracy is high and the cold-start problem in the recommendation process is avoided.
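As a rough illustration of why a content-based approach sidesteps cold start, the sketch below ranks texts purely by comparing the target object's description with each stored simplified expression, so no interaction history is needed. The bag-of-words cosine similarity, the function names, and the sample library entries are assumptions made for this example; the application does not specify the recommendation model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(description, library, top_k=1):
    """Return up to top_k text ids whose simplified expression matches best."""
    profile = Counter(tokenize(description))
    scored = [
        (cosine(profile, Counter(tokenize(expression))), text_id)
        for text_id, expression in library.items()
    ]
    # Highest score first; ties are broken by text id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [text_id for score, text_id in scored[:top_k] if score > 0]


# Hypothetical simplified expressions stored in the library.
expression_library = {
    "policy_a": "small technology enterprises subsidy annual revenue >= 5 million yuan",
    "policy_b": "agricultural cooperatives loan interest discount",
}

description = "small technology enterprise with annual revenue of 8 million yuan"
recommended = recommend(description, expression_library)
```

Here `recommended` contains only `policy_a`, because the description shares no tokens with `policy_b`. A description that overlaps with nothing in the library yields an empty list rather than an arbitrary recommendation.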

The effect of the technical solutions provided by the embodiments of the present application is more prominent in certain specific scenarios, for example when a policy text is recommended to an enterprise or a user, where they can fill the technical gap in conventional policy text recommendation.

Fig. 7 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 7, the electronic device includes: a memory 81 and a processor 82. The memory 81 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device. The memory 81 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.

The memory 81 is configured to store one or more computer instructions;

the processor 82 is coupled to the memory 81, and configured to execute one or more computer instructions stored in the memory 81 to implement the steps in the text processing method provided in the foregoing embodiments.

Further, as shown in fig. 7, the electronic device further includes: communication components 83, power components 85, and a display 86. Only some of the components are schematically shown in fig. 7, and the electronic device is not meant to include only the components shown in fig. 7.

Accordingly, embodiments of the present application further provide a computer-readable storage medium storing a computer program, where the computer program can implement the steps or functions of the text processing method provided in the foregoing embodiments when executed by a computer.

Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the processor is enabled to implement the steps or functions of the text processing method provided in the foregoing embodiments.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the description of the above embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

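1. A method of text processing, comprising:

determining semantic tags respectively corresponding to a plurality of sentences in a text;

extracting at least one first-class word from the sentences whose semantic tags belong to first-class tags;

extracting at least two second-class words and relationship information among different second-class words from the sentences whose semantic tags belong to second-class tags;

and generating a simplified expression of the text according to the at least one first-class word, the at least two second-class words and the relationship information among different second-class words.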

2. The method of claim 1, wherein generating the simplified expression of the text according to the at least one first-class word, the at least two second-class words, and the relationship information among different second-class words comprises:

expressing the at least two second-class words having the relationship information as a simplified relational expression according to a relationship expression template corresponding to the relationship information;

and generating a simplified expression of the text according to the at least one first-class word and the at least one simplified relational expression.

3. The method of claim 2, wherein generating the simplified expression of the text according to the at least one first-class word and the at least one simplified relational expression comprises:

acquiring a post-processing algorithm;

post-processing the at least one first-class word by using the post-processing algorithm to obtain a summarized expression;

and generating a simplified expression of the text according to the summarized expression and the at least one simplified relational expression.

4. The method according to any one of claims 1 to 3, wherein determining semantic tags corresponding to a plurality of sentences in the text respectively comprises:

performing sentence segmentation on the text to obtain a plurality of sentences;

identifying, by using a classification model, the semantic tags respectively corresponding to the plurality of sentences;

and grouping together the sentences identified as having the same semantic tag.

5. The method of claim 4, wherein extracting at least one first-class word from the sentences whose semantic tags belong to the first-class tags comprises:

acquiring a first extraction model;

and taking the sentences whose semantic tags belong to the first-class tags as the input of the first extraction model, and executing the first extraction model to obtain the at least one first-class word.

6. The method according to any one of claims 1 to 3, wherein extracting at least two second-class words and relationship information between different second-class words from the sentences whose semantic tags belong to the second-class tags comprises:

acquiring a second extraction model and a relation recognition model;

taking the sentences whose semantic tags belong to the second-class tags as the input of the second extraction model, and executing the second extraction model to obtain the at least two second-class words;

and taking the at least two second-class words and the sentences whose semantic tags belong to the second-class tags as the input of the relation recognition model, and executing the relation recognition model to obtain the relationship information between any two second-class words in the at least two second-class words.

7. The method of any of claims 1 to 3, further comprising:

acquiring description information of a target object;

determining a recommended text in the plurality of texts according to the description information and simplified expressions of the plurality of texts;

and sending the simplified expression corresponding to the recommended text to a client corresponding to the target object.

8. A method of text processing, comprising:

determining semantic tags respectively corresponding to a plurality of sentences in a policy text;

and generating, according to the at least one first-class word, the at least two second-class words and the relationship information among the different second-class words, a simplified expression of the policy text for display to an enterprise user.

9. The method of claim 8, further comprising:

acquiring description information of an enterprise user;

determining a recommended policy text in the plurality of policy texts according to the description information and simplified expressions of the plurality of policy texts;

and sending the simplified expression corresponding to the recommended policy text to a client corresponding to the enterprise user.

10. A text processing system, comprising:

the language processing layer is provided with at least one language processing model and is used for optimizing any of the language processing models according to the corpus acquired by the data layer; the language processing layer is further used for performing semantic analysis on the input text by using at least some of the at least one language processing model to determine semantic tags corresponding to a plurality of sentences in the text; extracting at least one first-class word from the sentences whose semantic tags belong to the first-class tags; extracting at least two second-class words and relationship information among different second-class words from the sentences whose semantic tags belong to the second-class tags; and generating a simplified expression of the text according to the at least one first-class word, the at least two second-class words and the relationship information among the different second-class words;

11. The system of claim 10, further comprising:

the language processing layer is further used for storing, through the data layer, the text in association with the simplified expression of the text in the database;

the recommendation layer is positioned between the language processing layer and the output layer, is provided with at least one recommendation model, and is used for acquiring the description information of the target object; the recommendation layer is further used for performing relevance analysis on the description information and the simplified expressions of a plurality of texts stored in the database by using at least some of the at least one recommendation model, to determine at least one recommended text to be recommended to the target object;

12. A text processing system, comprising:

the database is used for storing the linguistic data;

the language processing engine is further configured to: perform semantic analysis on the input text by using at least some of the at least one language processing model to determine semantic tags corresponding to a plurality of sentences in the text; extract at least one first-class word from the sentences whose semantic tags belong to the first-class tags; extract at least two second-class words and relationship information among different second-class words from the sentences whose semantic tags belong to the second-class tags; and generate a simplified expression of the text according to the at least one first-class word, the at least two second-class words and the relationship information among different second-class words;

13. The system of claim 12, further comprising:

14. An electronic device comprising a memory and a processor; the memory is configured to store one or more computer instructions which, when executed by the processor, are capable of implementing the steps of the text processing method of any one of claims 1 to 7 or the text processing method of claim 8 or 9.