CN113283218A - Semantic text compression method and computer equipment - Google Patents

Semantic text compression method and computer equipment

Publication number: CN113283218A
Authority: CN (China)
Application number: CN202110705874.7A
Original language: Chinese (zh)
Inventor: 马建 (Ma Jian)
Assignee: Ping An Life Insurance Company of China Ltd
Legal status: Pending

Classifications

    • G06F40/151 Transformation (Handling natural language data; Text processing; Use of codes for handling textual entities)
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting (Pattern recognition; Analysing; Design or setup of recognition systems or techniques)
    • G06F40/279 Recognition of textual entities (Handling natural language data; Natural language analysis)
    • G06F40/30 Semantic analysis (Handling natural language data)

Abstract

The application relates to the technical field of artificial intelligence, and provides a semantic text compression method, a semantic text compression apparatus, a computer device, and a computer-readable storage medium. The method performs noise recognition on a semantic text input by a user with a pre-trained noise recognition model, denoises the semantic text based on the noise recognition result to obtain a text to be compressed, and invokes a sentence compression tool to compress the text to be compressed based on the part of speech of each word segment to be compressed, yielding a preliminary compressed text. Because the semantic text is denoised before sentence compression, noise cannot interfere with the content of the preliminary compressed text. Finally, the preliminary compressed text is repaired according to a preset repair strategy to obtain a target compressed text, which improves the accuracy of the compressed content while the semantic text input by the user is compressed.

Description

Semantic text compression method and computer equipment
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a semantic text compression method, a semantic text compression device, computer equipment and a computer readable storage medium.
Background
Voice interaction is one of the more mature application directions in the field of artificial intelligence. With breakthroughs in deep learning research and the accumulation of large amounts of voice data, voice recognition technology has advanced dramatically.
However, existing semantic text compression schemes remove redundant content from a semantic text by deleting stop words and then compress the remaining text with rule-based compression. Because such a scheme must generalize to semantic texts from many fields and scenarios, the accuracy of the compressed text content is relatively low: sentences in the resulting target compressed text are not concise enough, and their completeness and fluency cannot be guaranteed. Existing semantic text compression schemes therefore suffer from low accuracy of the target compressed text obtained by compressing the semantic text.
Disclosure of Invention
In view of this, embodiments of the present application provide a semantic text compression method, a semantic text compression apparatus, a computer device, and a computer-readable storage medium, so as to solve the problem that existing semantic text compression schemes produce a target compressed text of low accuracy.
A first aspect of an embodiment of the present application provides a semantic text compression method, including:
performing noise recognition on a semantic text input by a user with a pre-trained noise recognition model to obtain a noise recognition result, where the noise recognition result indicates whether each word segment in the semantic text is noise;
denoising the semantic text based on the noise recognition result to obtain a text to be compressed;
invoking a sentence compression tool to perform sentence compression on the text to be compressed based on the part of speech of each word segment to be compressed, to obtain a preliminary compressed text;
and performing text repair on the preliminary compressed text according to a preset repair strategy to obtain a target compressed text (a sketch of this four-step pipeline follows).
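Purely as an illustration, the following minimal Python sketch chains the four steps. Every helper here (the toy whitespace segmentation, the stand-in noise list, the droppable-modifier list) is a hypothetical simplification, not the model or tools the method actually specifies.

```python
import re

def recognize_noise(segments):
    # Stand-in for the pre-trained model: flag assumed modal particles as noise.
    modal = {"ah", "oh", "um", "hmm"}
    return [seg.lower() in modal for seg in segments]

def denoise(segments, flags):
    # Step S12: keep only the segments not flagged as noise.
    return [s for s, is_noise in zip(segments, flags) if not is_noise]

def compress_sentences(segments, droppable=("very", "really")):
    # Stand-in for POS-driven compression: drop assumed modifier words.
    return [s for s in segments if s.lower() not in droppable]

def repair_text(text):
    # Stand-in repair: collapse doubled spaces left by deletions.
    return re.sub(r"\s{2,}", " ", text).strip()

def compress_semantic_text(semantic_text):
    segments = semantic_text.split()          # toy word segmentation
    flags = recognize_noise(segments)         # step S11
    kept = denoise(segments, flags)           # step S12
    compressed = compress_sentences(kept)     # step S13
    return repair_text(" ".join(compressed))  # step S14

print(compress_semantic_text("um the premium is really paid for 30 years"))
# -> "the premium is paid for 30 years"
```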
A second aspect of the embodiments of the present application provides a semantic text compression apparatus, including:
a noise recognition unit, configured to perform noise recognition on a semantic text input by a user with a pre-trained noise recognition model to obtain a noise recognition result, where the noise recognition result indicates whether each word segment in the semantic text is noise;
a denoising unit, configured to denoise the semantic text based on the noise recognition result to obtain a text to be compressed;
a compression unit, configured to invoke a sentence compression tool to perform sentence compression on the text to be compressed based on the part of speech of each word segment to be compressed, to obtain a preliminary compressed text;
and a repair unit, configured to perform text repair on the preliminary compressed text according to a preset repair strategy to obtain a target compressed text.
A third aspect of embodiments of the present application provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the first aspect.
The implementation of the semantic text compression method, the semantic text compression device, the computer equipment and the computer readable storage medium provided by the embodiment of the application has the following beneficial effects:
according to the embodiment of the application, noise recognition is carried out on the semantic text input by a user by utilizing the pre-trained noise recognition model, noise reduction processing is carried out on the semantic text based on a noise recognition result to obtain the text to be compressed, statement compression is carried out on the text to be compressed by calling a statement compression tool based on the part of speech of each word to be compressed in the text to be compressed to obtain the primary compressed text, interference of noise on the content of the primary compressed text can be avoided due to the fact that the semantic text is subjected to noise reduction processing and then is subjected to statement compression, finally, text restoration is carried out on the primary compressed text according to a preset restoration strategy to obtain the target compressed text, and the content accuracy degree of the target compressed text is improved while the semantic text input by the user is compressed.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of an implementation of a semantic text compression method provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating an implementation of a semantic text compression method according to another embodiment of the present disclosure;
fig. 3 is a block diagram illustrating a semantic text compression apparatus according to an embodiment of the present disclosure;
fig. 4 is a block diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In the semantic text compression method provided by this embodiment, the execution subject is a server; specifically, it may be a single server configured with the functions of the method, or any server in a server cluster. Here, the server cluster is composed of a plurality of servers on which a distributed system is built, so that data sharing or data synchronization can be achieved among the servers. On this basis, a target script file describing the semantic text compression method provided by this embodiment is deployed to any server in the cluster, so that the server can execute each step of the method by executing the script file.
In implementation, before the server compresses the semantic text, the semantic text may be obtained in either of two ways: the user inputs voice information through a terminal, the terminal converts the voice information into text and sends the resulting semantic text to the server; or the terminal sends the voice information to the server and the server performs the text conversion. Neither way is limited herein. The server then performs noise recognition on the semantic text with a pre-trained noise recognition model, denoises the semantic text based on the noise recognition result to obtain a text to be compressed, and invokes a sentence compression tool to compress the text to be compressed based on the part of speech of each word segment to be compressed, obtaining a preliminary compressed text. Because the semantic text is denoised before sentence compression, noise cannot interfere with the content of the preliminary compressed text. Finally, the server repairs the preliminary compressed text according to a preset repair strategy to obtain a target compressed text, improving the accuracy of the content while the semantic text input by the user is compressed.
For example, taking an application program having a communication function as an example: a user opens the application through a terminal and inputs voice information in it; the terminal converts the voice information into the corresponding semantic text and sends it to the server through a communication network. Alternatively, the terminal sends the voice information itself to the server, and the server performs the text conversion to obtain the semantic text. After acquiring the semantic text, the server performs noise recognition on it with a pre-trained noise recognition model, denoises it based on the noise recognition result to obtain a text to be compressed, and invokes a sentence compression tool to compress the text to be compressed based on the part of speech of each word segment to be compressed, obtaining a preliminary compressed text; because the semantic text is denoised before sentence compression, noise cannot interfere with the content of the preliminary compressed text. Finally, the preliminary compressed text is repaired according to a preset repair strategy to obtain a target compressed text, which can be displayed in the communication application, so that the semantic text input by the user is compressed while the accuracy of the compressed content is improved.
The following describes a semantic text compression method provided in this embodiment in detail by a specific implementation manner.
Fig. 1 shows a flowchart of an implementation of a semantic text compression method provided in an embodiment of the present application, which is detailed as follows:
s11: carrying out noise recognition on semantic texts input by a user by utilizing a pre-trained noise recognition model to obtain a noise recognition result; and the noise identification result is used for indicating whether each participle in the semantic text is noise or not.
In step S11, the semantic text is a text for characterizing the content of the speech information obtained by text-converting the speech information input by the user. Here, in actual use, the semantic text may be obtained by text conversion by the terminal according to the voice information of the user, and the semantic text is sent to the server by the terminal. Or the terminal sends the voice information to the server, and the server performs text conversion on the voice information to obtain the voice information.
In this embodiment, the pre-trained noise recognition model performs noise recognition on the semantic text; that is, the model identifies the content belonging to noise words in the semantic text, and the noise-word recognition result is taken as the noise recognition result. The noise-word recognition result may be the probability that each word segment in the semantic text is a noise word, or a noise identifier for each word segment indicating whether that segment is a noise word.
In implementation, when noise words are recognized from the semantic text with the pre-trained model, the semantic text is fed to the model as input, and the model judges each character, word, and word combination in the semantic text, for example judging whether its connection to the preceding and following characters or words is reasonable, or whether it is misrecognized non-speech content. This yields the probability that each word segment in the semantic text is a noise word, which is taken as the noise recognition result. Alternatively, a corresponding noise identifier is configured for each word segment according to that probability, and the noise identifiers of all word segments are taken as the noise recognition result.
As an example, the pre-trained noise recognition model may compute the probability that a character, or the word it forms, is a noise word by recognizing the context or collocation relations between words in the semantic text, and output the probability of each word segment being a noise word as the model output, thereby obtaining the noise recognition result.
It should be noted that, to further improve the degree of compression of the semantic text, some embodiments also identify whether each character or word in the semantic text is modal-particle noise contained in the semantic information, so that such noise is deleted in the subsequent denoising step and the semantic text can be reduced further.
As another example, besides recognizing the context or collocation relations between words in the semantic text, the pre-trained noise recognition model also determines the probability that each character or word is a noise word by recognizing whether it is a modal word.
It should be understood that, in implementation, the pre-trained noise recognition model may be obtained by training an existing text-type recognition model with pre-configured training samples, where each training sample describes the probability that the characters and/or words in a sample sentence are noise words.
It can be understood that a person skilled in the art can, for the pre-trained noise recognition model mentioned in this embodiment, configure corresponding training samples according to actual requirements, train a pre-constructed noise recognition model with those samples, and describe the model's convergence with a corresponding loss function, thereby obtaining the pre-trained noise recognition model. How to construct and train it is therefore not detailed here.
As an example, in the foregoing solution, the step S11 specifically includes:
obtaining a binary classification value for each word segment in the semantic text with the pre-trained noise recognition model;
labeling each word segment as noise or not according to its binary classification value to obtain a noise identifier for each word segment;
and taking the noise identifiers of all word segments as the noise recognition result.
In this embodiment, since the semantic text is obtained by the terminal or the server converting the user's voice information into text, a plurality of word segments can be obtained by segmenting it. These word segments serve as the input of the pre-trained noise recognition model, which preliminarily evaluates the binary classification value of each segment and outputs a preliminary noise prediction for it. Each word segment is then labeled based on this preliminary prediction, so that each segment is assigned a noise identifier indicating whether it is a noise word, and the noise identifiers of all segments form the noise recognition result.
As one example, the pre-trained noise recognition model may consist of an ELECTRA model framework and a CRF model framework (see the sketch below). The input layer of the ELECTRA framework serves as the input layer of the whole model and receives the semantic text; its output layer is connected to the CRF framework. The ELECTRA framework segments the input semantic text and produces a binary classification value for each word segment, which is passed to the CRF framework; the CRF framework then labels each segment based on its binary classification value, marking which segments are noise and which are not.
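As an illustration of this architecture only: the sketch below assumes Hugging Face's transformers package for the ELECTRA encoder and the third-party pytorch-crf package for the CRF layer; the checkpoint name and the binary tag count are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
from transformers import ElectraModel   # Hugging Face transformers
from torchcrf import CRF                # pytorch-crf package

class NoiseTagger(nn.Module):
    """ELECTRA encoder with a CRF head that labels each word segment
    as noise / not-noise (binary tags, as described above)."""

    def __init__(self, checkpoint="google/electra-small-discriminator",
                 num_tags=2):
        super().__init__()
        self.encoder = ElectraModel.from_pretrained(checkpoint)
        self.emit = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(hidden)      # per-segment binary scores
        mask = attention_mask.bool()
        if tags is not None:               # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: tag sequence
```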
In implementation, semantic samples can be fed to the noise recognition model in advance for training. Since the model comprises the ELECTRA framework and the CRF framework, the training process mainly uses the semantic samples to train the generator (Generator) and the discriminator (Discriminator) inside the ELECTRA framework. Specifically:
The generator is a small masking module that works in the classic BERT MLM manner: it randomly masks some word segments of the original sequence and has the model predict the masked segments, producing predicted segments.
The discriminator takes the generator's predictions as input and, for each input segment, judges whether it is an original segment of the semantic sample or a replacement of one.
Based on these two outputs, training proceeds by minimizing a combined loss; the corresponding expression is:
$$\min_{\theta_G,\,\theta_D}\ \sum_{x\in X}\ \mathcal{L}_{\mathrm{MLM}}(x,\theta_G)+\lambda\,\mathcal{L}_{\mathrm{Dis}}(x,\theta_D)$$

where $\theta_G$ denotes the generator parameters; $\theta_D$ the discriminator parameters; $x_t$ the word segment at the current position; $X$ the set of all input word segments; $\mathcal{L}_{\mathrm{MLM}}$ the generator loss function; and $\mathcal{L}_{\mathrm{Dis}}$ the discriminator loss function.

The generator scores each candidate segment with a softmax over embeddings, and its loss is taken over the masked positions:

$$p_G(x_t\mid x)=\frac{\exp\!\big(e(x_t)^{\top}h_G(x)_t\big)}{\sum_{x'}\exp\!\big(e(x')^{\top}h_G(x)_t\big)}$$

$$\mathcal{L}_{\mathrm{MLM}}(x,\theta_G)=\mathbb{E}\Big[\sum_{i\in m}-\log p_G\big(x_i\mid \mathrm{mask}(x)\big)\Big]$$

where $e(x_t)$ is the embedding vector of $x_t$; $\mathrm{mask}(\cdot)$ denotes the preset masking (occlusion) function; and $m$ indexes the masked word segments.
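As a sketch of one optimization step under the combined objective above: the generator and discriminator are assumed callables returning L_MLM (with its predictions) and L_Dis respectively, and the weight λ = 50 follows the published ELECTRA formulation; none of these interfaces come from the patent.

```python
import torch

def train_step(generator, discriminator, optimizer, batch, lam=50.0):
    # L_MLM and the generator's predictions at the masked positions.
    gen_loss, masked_preds = generator(batch)
    # Build the discriminator input by replacing masked segments with
    # the generator's predictions (batch["mask_positions"] is assumed
    # to be a boolean tensor matching input_ids).
    disc_inputs = batch["input_ids"].clone()
    disc_inputs[batch["mask_positions"]] = masked_preds
    disc_loss = discriminator(disc_inputs, batch)   # L_Dis over all tokens
    loss = gen_loss + lam * disc_loss               # min over theta_G, theta_D
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```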
It can be understood that although the ELECTRA model uses context information during training, its output results are mutually independent and do not use context information; therefore a CRF model is attached after the last layer of the ELECTRA model, and the ELECTRA output is fed into the CRF to obtain the final labeling result.
As shown in fig. 1, after step S11 is performed, that is, after a noise recognition result is obtained, steps S12 to S14 are performed in sequence.
S12: and denoising the semantic text based on the noise identification result to obtain a text to be compressed.
In step S12, the noise recognition result is a probability value of each participle in the semantic text belonging to a noise word, or a noise flag of each participle in the semantic text, the noise flag indicating whether each participle in the text is a noise word. Here, when the noise recognition result is a noise flag of each participle in the semantic text, the noise flag may be configured based on a probability value that each participle in the semantic text belongs to a noise word.
In this embodiment, the semantic text is denoised based on the noise recognition result, which is equivalent to removing the words or the segments recognized as noise in the semantic text, for example, removing the segments with probability value of the segment belonging to the noise word being greater than a preset threshold, or removing the segments with noise identification indicating that the segment is noise.
As an example, after determining a probability value that each participle in the semantic text belongs to a noise word, configuring a corresponding noise identifier for each participle according to the probability value of each participle, and when the probability value of the participle is greater than a first preset threshold, configuring the identifier of the participle as the determined noise identifier.
It should be understood that, in this example, when the probability value of a segment is equal to the first preset threshold, the identifier of the segment may also be configured as a pending noise identifier, and when the probability value of the segment is smaller than the first preset threshold, the identifier of the segment may also be configured as a non-noise identifier, and a segment configured with a pending noise identifier or a non-noise identifier is not removed from the semantic text as a noise word.
In this embodiment, the text to be compressed is the text after the semantic text is rejected by the noise word with a higher probability value, or the text after the semantic text is rejected and marked with the noise identifier. And performing noise reduction processing on the semantic text based on the noise identification result, namely performing content elimination on the semantic text according to the probability value that each participle in the semantic text belongs to a noise word or the noise identifier of each participle, namely eliminating the participle which is recognized as text noise by a machine in the semantic text, and further obtaining the text to be compressed.
For example, when the noise recognition result is the probability value that each participle in the semantic text belongs to the noise word, the semantic text is subjected to noise reduction processing based on the noise recognition result, specifically, whether the probability value that each participle belongs to the noise word is equal to or greater than a second preset threshold value is judged, and when the probability value that each participle belongs to the noise word is equal to or greater than the second preset threshold value, the participle is removed from the semantic text as the noise word.
For another example, when the noise identification result is the noise identifier of each participle in the semantic text, the semantic text is denoised based on the noise identification result, specifically, a removing operation is performed according to the noise identifier corresponding to each participle, that is, if the noise identifier of the participle is the determined noise identifier, the participle is removed from the semantic text as a noise word.
It is understood that, in practical applications, when the word segmentation mode or the noise word recognition mode in the semantic text is different, the obtained noise recognition result is also different. The skilled person knows that, in order to deal with different types of noise recognition results, a corresponding noise processing scheme should be matched based on an existing scheme, and then noise words are removed from the semantic text, so that the types of the noise recognition results are not enumerated here, and detailed details of how to perform noise reduction processing on the semantic text are not repeated.
As an example, in the above solution, the step S12 includes:
taking the word segments carrying the determined noise identifier in the semantic text as noise segments;
and removing the noise segments from the semantic text to obtain the text to be compressed (see the sketch after this list).
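A minimal sketch of the two removal branches of step S12 described above, assuming a 0.5 threshold and the identifier names "noise", "pending", and "clean"; the patent fixes neither the threshold values nor the identifier encoding.

```python
def denoise_by_probability(segments, probs, threshold=0.5):
    # Remove a segment when its noise probability reaches the threshold.
    return [s for s, p in zip(segments, probs) if p < threshold]

def denoise_by_identifier(segments, idents):
    # Only segments carrying the *determined* noise identifier are removed;
    # "pending" and "clean" segments are retained, as described above.
    return [s for s, ident in zip(segments, idents) if ident != "noise"]

segments = ["the", "um", "premium", "is", "paid"]
print(denoise_by_probability(segments, [0.1, 0.9, 0.2, 0.1, 0.1]))
# -> ['the', 'premium', 'is', 'paid']
print(denoise_by_identifier(segments,
                            ["clean", "noise", "clean", "pending", "clean"]))
# -> ['the', 'premium', 'is', 'paid']
```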
In this embodiment, the noise identifiers of all word segments in the semantic text constitute the noise recognition result, which describes the correspondence between each segment and its identifier; the noise segments can therefore be removed from the semantic text according to the noise recognition result, yielding the text to be compressed.
It should be noted that only the segments carrying the determined noise identifier are noise segments; segments carrying a pending noise identifier or a non-noise identifier are not removed as noise. That is, when the semantic text is denoised based on the noise recognition result, the segments carrying the determined noise identifier are removed as noise segments while the others are retained, and the retained segments form the text to be compressed.
In connection with the above example, since the pre-trained noise recognition model may consist of the ELECTRA model framework and the CRF model framework described earlier, the ELECTRA framework segments the input semantic text and produces a binary classification value for each word segment, which is passed to the CRF framework for noise labeling.
Further, because the CRF framework labels each segment based on its binary classification value, each segment can be determined to carry a determined noise identifier, a pending noise identifier, or a non-noise identifier; the segments carrying the determined noise identifier in the semantic text can thus be identified and removed as noise segments.
In this embodiment, after the semantic text has been denoised based on the noise recognition result, that is, after the word segments recognized as noise have been removed and the text to be compressed has been obtained, step S13 is performed.
S13: invoking a sentence compression tool to perform sentence compression on the text to be compressed based on the part of speech of each word segment to be compressed, to obtain a preliminary compressed text.
In step S13, the part of speech describes the type of each word segment to be compressed in the text to be compressed, for example whether the segment is an adjective, an auxiliary word, a preposition, and so on.
In this embodiment, the text to be compressed is the semantic text with the noise segments removed; because the noise has been eliminated, sentence compression of the text to be compressed cannot be disordered by noise content. Since different word segments occupy different positions and carry different meanings in the text to be compressed, different sentence compression tools are invoked according to the part of speech of each segment when sentence compression is performed.
In implementation, the sentence compression tool can be an existing machine-learning-based text compression tool, e.g., Hanlp, Stanford, or pyltp. Because compression rules and methods differ among tools, different text compression tools can be invoked through different configured interfaces to meet different compression requirements, compressing the sentences of the text to be compressed in a certain order. For example, a correspondence between tool-invocation interfaces and text compression tools is configured, and a sentence compression tool is invoked through its interface based on that correspondence to compress the text to be compressed; the correspondence between interfaces and tools may be one-to-one.
In practical applications, to adapt to different compression requirements, the strategy for invoking the compression tools can be adjusted according to the specific application scenario, as in the example below.
As an example, a first sentence compression tool that removes modal auxiliary words is invoked to compress the text to be compressed, yielding a text to be refined, where the modal auxiliary words are words representing spoken content such as "ah", "oh", and "hmm"; a second sentence compression tool that removes modifiers is then invoked to compress the text to be refined, yielding a second text, where the modifiers may include adjectives such as "cheap", "cost-effective", "most", and "obvious"; finally, a third sentence compression tool that removes prepositional phrases is invoked to compress the second text, yielding the preliminary compressed text.
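A sketch of these three sequential passes, using jieba's part-of-speech tagger as a stand-in for the unspecified compression tools (HanLP, Stanford, and pyltp are named above, but their APIs differ). The jieba flags y (modal particle), a (adjective), and p (preposition) drive the passes, and pass 3 is simplified to dropping bare prepositions rather than whole prepositional phrases.

```python
import jieba.posseg as pseg  # jieba's POS tagger as a stand-in tool

def compress_pass(segments, drop_flags):
    # One compression pass: drop segments whose POS flag is in drop_flags.
    return [(w, f) for w, f in segments if f not in drop_flags]

def preliminary_compress(text):
    segments = [(p.word, p.flag) for p in pseg.cut(text)]
    segments = compress_pass(segments, {"y"})  # pass 1: modal particles
    segments = compress_pass(segments, {"a"})  # pass 2: adjectives (modifiers)
    segments = compress_pass(segments, {"p"})  # pass 3: bare prepositions
    return "".join(w for w, _ in segments)

# Usage: modal particles and adjectives are dropped from the input sentence.
print(preliminary_compress("保费真的很便宜啊"))
```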
It should be understood that, in specific implementations, a new sentence compression tool may also be constructed according to the characteristics of the semantic text content, for example a tool that removes English words or phrases, or an existing tool may be adapted into a new one; the new tool is then invoked through a configured tool-invocation interface to perform the sentence compression on the text to be compressed.
As an example, in the foregoing solution, the step S13 specifically includes:
performing a word segmentation operation on the text to be compressed to obtain a word segment set of the text to be compressed;
configuring a part of speech for each word segment to be compressed in the set based on a pre-configured part-of-speech database, so that each segment corresponds to at least one part-of-speech tag;
and invoking a sentence compression tool to compress the text to be compressed according to the part-of-speech tag of each word segment, to obtain the preliminary compressed text.
In this embodiment, segmenting the text to be compressed means reducing it to its minimal text units. For example, if the text to be compressed is a paragraph, the segmentation first decomposes it into several sentences and then decomposes those sentences into words and characters, finally producing the minimal text set of the text to be compressed, i.e., a segment set consisting of word and character segments. Since every segment in the set is obtained by segmenting the text to be compressed, the set's content can be pieced back together in order to restore the text to be compressed.
It should be noted that, after the segment set of the text to be compressed is obtained, the part of speech of each word segment can be configured by matching the corresponding part-of-speech tag from the pre-configured part-of-speech database.
As an example, in other embodiments, after the segment set is obtained, the structure of each sentence in the text to be compressed and the characteristics of its internal components can be examined by analyzing the structural patterns of the text, and the grammatical structure and grammatical relations of each sentence can be analyzed, so as to configure part-of-speech tags for the segments produced by the segmentation. A sentence compression tool is then invoked according to the tag configuration to compress the sentences following the tool's built-in rules or strategies for compressing text content.
It should be understood that, when the sentence compression tool is invoked according to the part-of-speech tags of the word segments, it may be invoked sequentially, tag by tag, or invoked for all tags simultaneously.
As a possible implementation of this embodiment, the foregoing step of invoking a sentence compression tool to compress the text to be compressed according to the part-of-speech tag of each word segment includes:
when a single word segment to be compressed corresponds to two or more part-of-speech tags, determining a target part-of-speech tag from among them, where the target part-of-speech tag is a tag the segment shares with an adjacent word segment to be compressed;
and invoking a sentence compression tool according to the target part-of-speech tag to compress the text to be compressed, obtaining the preliminary compressed text.
This implementation considers that configuring tags for a single word segment may yield two or more part-of-speech tags; for example, when an adjective precedes a modal auxiliary word, the modal auxiliary word may be configured with two tags, namely a modal-auxiliary tag and an adjective tag.
To reduce invocations of the sentence compression tool as much as possible, a target part-of-speech tag is determined from the two or more tags, and the tool is then invoked according to that target tag to compress the text to be compressed, obtaining the preliminary compressed text. The target part-of-speech tag is a tag that the segment shares with an adjacent word segment to be compressed: when a segment carries two or more tags, some of those tags are influenced by its neighbors, so by checking whether the tags of the adjacent segments match, only the compression tool corresponding to the shared tag needs to be invoked.
For example, in a sentence of the form "……", a segment (rendered here as "cost") may serve as an independent word segment that is a modal auxiliary word but can also be read as an adjective, so it carries both a modal-auxiliary tag and an adjective tag. Its target part-of-speech tag is determined to be the tag it shares with the adjacent segment, i.e., the adjective tag. When the sentence "……" is compressed, the sentence compression tool can then be invoked for the adjective tag, so that this word segment is eliminated.
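A sketch of the target-tag selection rule just described, assuming tags are plain strings; the fallback for a segment that shares no tag with its neighbors is an assumption, since the patent leaves that case open.

```python
def target_tag(tags, neighbor_tags):
    # tags: tags of the ambiguous segment; neighbor_tags: tags of the
    # segments immediately before and after it.
    if len(tags) < 2:
        return tags[0] if tags else None
    shared = [t for t in tags if t in neighbor_tags]
    return shared[0] if shared else tags[0]  # fallback: first tag (assumed)

# The ambiguous segment carries both a modal-auxiliary and an adjective
# tag; its neighbor is an adjective, so the adjective tag wins and only
# the adjective-removal pass needs to be invoked.
print(target_tag(["modal_aux", "adjective"], {"adjective"}))  # -> adjective
```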
It can be understood that, when compressing the text to be compressed, the following can be considered: components inside parentheses or brackets generally provide explanation; adjective phrases that modify nouns act as attributives in the sentence and generally belong to minor components; preposed attributives are generally not long, and deleting them has little effect on the core meaning of the sentence; numeral-classifier attributives generally indicate the number of a noun and do not touch the sentence's core meaning; and the "的" and "地" constructions modify noun phrases and verb phrases respectively, i.e., they act as attributives and adverbials that belong to minor components and can be deleted. Those skilled in the art can configure or invoke sentence compression tools based on these rules, so the details of how to configure a sentence compression tool are not repeated here.
S14: and performing text restoration on the preliminary compressed text according to a preset restoration strategy to obtain a target compressed text.
In step S14, the preset repairing policy is used to describe an implementation means or method for text repairing on the preliminary compressed text, where in consideration of the problem that when the text to be compressed is compressed, the content of the preliminary compressed text may be removed to cause the content to be unsmooth or the readability may be reduced, the text repairing is performed on the preliminary compressed text according to the preset repairing policy, so as to adjust the content of the preliminary compressed text to make it have better readability.
As an example, the text repairing is performed on the preliminary compressed text according to a preset repairing strategy, which may be repairing of a number word or a unit quantifier, and since the number word is compressed and only the unit quantifier may be retained, the compressed number word needs to be supplemented. For example, the text to be compressed is "premium paid for 30 years, and a child can get the text", where "30 years" is a combination of a number word and a unit amount word, and the corresponding preliminary compressed text is "premium paid for years and a child gets the text", at this time, the number word "30" in "30 years" is compressed, and the reserved unit amount word "year" causes the preliminary compressed text to be unsmooth, so that the text is restored to the preliminary compressed text, and the obtained target compressed text should be "premium paid for 30 years and a child gets the text".
As another example, performing text restoration on the preliminary compressed text according to a preset restoration policy may further include performing restoration according to a relationship between punctuations or the number of text words before and after the punctuations. For example, the text to be compressed is a word "can be retrieved in the next year", where the word "has only one" between two commas, and the corresponding preliminary compressed text is a word "can be retrieved in the next year", and where commas appear before and after the word "here", so that the word "delete", that is, the text of the preliminary compressed text is repaired, and the obtained target compressed text should be "can be retrieved in the next year".
As an embodiment of the present application, step S14 includes:
performing numeral repair on the preliminary compressed text to obtain a text to be completed;
and, if a preset punctuation-error event is identified in the text to be completed, performing punctuation correction on it based on a preset punctuation repair strategy to obtain the target compressed text.
In this embodiment, when the sentence compression tool compresses the text to be compressed, removing a numeral can leave an orphaned unit classifier in the preliminary compressed text. Numeral repair therefore means restoring the removed numeral, i.e., writing the numeral removed during compression back into the preliminary compressed text, to obtain the text to be completed. Punctuation-relation repair describes a strategy for fixing unreasonable punctuation relations in the text to be completed based on the positional relation between two punctuation marks, or for fixing overly short clauses.
It should be noted that, in practical applications, numerals and unit classifiers in the text to be compressed appear in pairs and are easily recognized as short clauses; that is, the text to be compressed may contain short clauses formed by combinations of numerals and unit classifiers, e.g., "one, two, three, ……", "first payment ……", "third payment ……", and so on. Therefore, after the preliminary compressed text is obtained, and to avoid treating such content as meaningless short clauses, the punctuation relations are repaired based on the preset punctuation repair strategy only after numeral repair has produced the text to be completed. This prevents unit classifiers and non-numeral characters from being mistakenly deleted as meaningless short clauses, improving the rationality and accuracy of the compression process.
Continuing the earlier example: the preliminary compressed text is first given numeral repair to obtain the text to be completed, and the text to be completed is then given punctuation-relation repair based on the preset punctuation repair strategy to obtain the target compressed text. Suppose the preset punctuation repair strategy limits the minimum number N of characters between two punctuation marks, with N ≥ 2. The text to be compressed is "the premium is paid for 30 years, and the child can receive it", where "30 years" is a numeral-plus-classifier combination; in the corresponding preliminary compressed text the numeral "30" has been compressed away while the classifier "years" remains, so numeral repair yields the text to be completed "the premium is paid for 30 years, …, received by the child", in which only one character remains between the two commas. Because N ≥ 2, the punctuation-relation repair deletes that character, and the resulting target compressed text is "the premium is paid for 30 years and is received by the child".
It can be understood that, in practical applications, when no numeral exists in the semantic text, no numeral is removed, so numeral repair of the preliminary compressed text is unnecessary; in that case the preliminary compressed text itself can be taken as the text to be completed and given the punctuation correction (see the sketch below).
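A simplified sketch of step S14 combining the two repairs, assuming the original text is available for numeral restoration and that N = 2 characters is the minimum clause length; both the matching heuristics and the punctuation set are illustrative assumptions.

```python
import re

def repair_numerals(preliminary, original):
    # Simplified numeral repair: if the original contained "<number> <unit>"
    # and the compressed text kept the unit but lost the number, restore it.
    for number, unit in re.findall(r"(\d+)\s*(\w+)", original):
        if unit in preliminary and number not in preliminary:
            preliminary = preliminary.replace(unit, f"{number} {unit}", 1)
    return preliminary

def repair_punctuation(text, min_chars=2):
    # Assumed preset strategy: a clause between two punctuation marks
    # shorter than min_chars characters is deleted.
    clauses = [c for c in re.split(r"[,，]", text)
               if len(c.strip()) >= min_chars]
    return ", ".join(c.strip() for c in clauses)

prelim = "premium paid for years, a, received by the child"
original = "premium paid for 30 years, and the child can receive it"
fixed = repair_punctuation(repair_numerals(prelim, original))
print(fixed)  # -> "premium paid for 30 years, received by the child"
```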
In this scheme, noise recognition is performed on the semantic text input by the user with the pre-trained noise recognition model, and the semantic text is denoised based on the noise recognition result to obtain the text to be compressed. A sentence compression tool is invoked to compress the text to be compressed based on the part of speech of each word segment to be compressed, yielding the preliminary compressed text; because the semantic text is denoised before sentence compression, noise cannot interfere with the content of the preliminary compressed text. Finally, text repair is performed on the preliminary compressed text according to the preset repair strategy to obtain the target compressed text.
Fig. 2 shows a flowchart of an implementation of a semantic text compression method according to another embodiment of the present application. Based on any of the above embodiments, in this embodiment, after the step S14, a step S21 is further included. The details are as follows:
s21: and generating a file to be loaded according to the target compressed text, wherein the file to be loaded is used for being loaded by a terminal, and the target compressed text is displayed in an additional text box.
In step S21, the target compressed text is a thumbnail text compressed by the server according to the semantic text, and the target compressed text can be used to represent the content backbone of the semantic text. The file to be loaded is obtained by the server through script construction according to the target compressed text, namely the file to be loaded is generated by the server, sent to the terminal by the server and loaded by the terminal.
In this embodiment, when the server generates the file to be loaded according to the target compressed text, specifically, according to a preconfigured file generation method, the target compressed text is embedded into a preconfigured script template, and is further encapsulated to obtain the file to be loaded, and the file to be loaded is sent to the terminal for loading. Here, after the terminal loads the file to be loaded, an additional text box is constructed on the terminal, and the target compressed text is displayed in the additional text box. The additional text box may be constructed at a position corresponding to the voice information input by the user, for example, a text box is constructed below the information bar of the voice information as the additional text box.
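A sketch of the server-side packaging, assuming a JSON payload; the field names and the widget identifier are invented for illustration, since the patent only requires that a loadable file carry the target compressed text for display in an additional text box.

```python
import json

def build_file_to_load(target_text, message_id):
    # Hypothetical payload: the terminal loads this file, builds an
    # additional text box under the voice-message bar identified by
    # message_id, and renders the target compressed text inside it.
    payload = {
        "type": "compressed_text_box",  # assumed widget identifier
        "anchor_message": message_id,   # which voice message to attach to
        "text": target_text,
    }
    return json.dumps(payload, ensure_ascii=False)

print(build_file_to_load("premium paid for 30 years, received by the child",
                         "msg-001"))
```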
In this scheme, as described above, noise recognition, denoising, part-of-speech-based sentence compression, and text repair are performed in sequence, so that the semantic text input by the user is compressed into the target compressed text without noise interfering with the content of the preliminary compressed text.
In addition, a file to be loaded is generated from the target compressed text; the terminal loads it and displays the target compressed text in an additional text box, so that the target compressed text can be shown on the terminal and conveniently used by the user, improving the usability of the semantic text compression result.
Referring to fig. 3, fig. 3 is a block diagram illustrating a semantic text compression apparatus according to an embodiment of the present disclosure. The apparatus in this embodiment includes units for executing the steps in the embodiments corresponding to fig. 1 and fig. 2; please refer to those embodiments for details. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 3, the semantic text compression apparatus 30 includes: a noise recognition unit 31, a denoising unit 32, a compression unit 33, and a repair unit 34. Specifically:
the noise recognition unit 31 is configured to perform noise recognition on a semantic text input by a user with a pre-trained noise recognition model to obtain a noise recognition result, where the noise recognition result indicates whether each word segment in the semantic text is noise;
the denoising unit 32 is configured to denoise the semantic text based on the noise recognition result to obtain a text to be compressed;
the compression unit 33 is configured to invoke a sentence compression tool to perform sentence compression on the text to be compressed based on the part of speech of each word segment to be compressed, to obtain a preliminary compressed text;
and the repair unit 34 is configured to perform text repair on the preliminary compressed text according to a preset repair strategy to obtain a target compressed text.
As an embodiment, the semantic text compression apparatus 30 further includes: a configuration unit 35. Specifically, the method comprises the following steps:
and the configuration unit 35 is configured to generate a file to be loaded according to the target compressed text, where the file to be loaded is loaded by a terminal and the target compressed text is displayed in an additional text box.
It should be understood that, in the structural block diagram of the semantic text compression apparatus shown in fig. 3, each unit executes a corresponding step in the embodiments of fig. 1 and 2; those steps have already been explained in detail above, so please refer to the relevant descriptions in the embodiments corresponding to fig. 1 and 2, which are not repeated here.
Fig. 4 is a block diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 4, the computer device 40 of this embodiment includes a processor 41, a memory 42, and a computer program 43 stored in the memory 42 and executable on the processor 41, such as a program for the semantic text compression method. When the processor 41 executes the computer program 43, the steps in the embodiments of the semantic text compression method described above are implemented, for example S11 to S14 shown in fig. 1 or S11 to S21 shown in fig. 2, and the functions of the units in the embodiment corresponding to fig. 3, for example units 31 to 35 shown in fig. 3, are implemented; refer to the relevant description in the embodiment corresponding to fig. 3, which is not repeated here.
Illustratively, the computer program 43 may be divided into one or more units, which are stored in the memory 42 and executed by the processor 41 to implement the present application. The one or more units may be a series of computer program instruction segments capable of performing specific functions, and these instruction segments describe the execution of the computer program 43 in the computer device 40. For example, the computer program 43 may be divided into the noise identification unit, the noise reduction unit, the compression unit, the repair unit, and the configuration unit described above.
The computer device may include, but is not limited to, a processor 41 and a memory 42. Those skilled in the art will appreciate that fig. 4 is merely an example of the computer device 40 and does not constitute a limitation of the computer device 40, which may include more or fewer components than shown, combine some components, or use different components; for example, the computer device may also include input/output devices, network access devices, buses, and the like.
The processor 41 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 42 may be an internal storage unit of the computer device 40, such as a hard disk or a memory of the computer device 40. The memory 42 may also be an external storage device of the computer device 40, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 40. Further, the memory 42 may include both an internal storage unit and an external storage device of the computer device 40. The memory 42 is used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A semantic text compression method, comprising:
performing noise recognition on a semantic text input by a user by using a pre-trained noise recognition model to obtain a noise recognition result, wherein the noise recognition result indicates whether each word segment in the semantic text is noise;
performing noise reduction processing on the semantic text based on the noise recognition result to obtain a text to be compressed;
invoking a sentence compression tool to perform sentence compression on the text to be compressed based on the part of speech of each word segment to be compressed in the text to be compressed, to obtain a preliminary compressed text; and
performing text repair on the preliminary compressed text according to a preset repair strategy to obtain a target compressed text.
2. The semantic text compression method according to claim 1, wherein the performing noise recognition on the semantic text input by the user by using the pre-trained noise recognition model to obtain the noise recognition result comprises:
obtaining a binary classification value for each word segment in the semantic text by using the pre-trained noise recognition model;
performing noise labeling on each word segment according to its binary classification value to obtain a noise identifier of each word segment; and
taking the noise identifiers of all the word segments as the noise recognition result.
3. The semantic text compression method according to claim 2, wherein the performing noise reduction processing on the semantic text based on the noise recognition result to obtain the text to be compressed comprises:
taking the word segments in the semantic text whose noise identifiers indicate noise as noise word segments; and
removing the noise word segments from the semantic text to obtain the text to be compressed.
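By way of a non-limiting Python sketch of claims 2 and 3 (the predict() interface and the 0/1 labeling convention are assumptions of this illustration, not features of the claims):

def recognize_noise(model, word_segments):
    # Claim 2 sketch: the pre-trained model yields a binary
    # classification value per word segment; a value of 1 labels the
    # segment as noise and serves as its noise identifier.
    return [(seg, model.predict(seg)) for seg in word_segments]

def reduce_noise(labelled_segments):
    # Claim 3 sketch: remove the word segments whose identifier marks
    # noise; concatenating the rest yields the text to be compressed.
    return "".join(seg for seg, is_noise in labelled_segments if not is_noise)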
4. The semantic text compression method according to claim 1, wherein the invoking a sentence compression tool to perform sentence compression on the text to be compressed based on the part of speech of each word segment to be compressed in the text to be compressed to obtain a preliminary compressed text comprises:
performing a word segmentation operation on the text to be compressed to obtain a word segment set of the text to be compressed;
performing part-of-speech configuration on each word segment to be compressed in the word segment set based on a pre-configured part-of-speech database, so that each word segment to be compressed corresponds to at least one part-of-speech tag; and
invoking the sentence compression tool to perform sentence compression on the text to be compressed according to the part-of-speech tag corresponding to each word segment to be compressed, to obtain the preliminary compressed text.
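By way of a non-limiting Python sketch of claim 4, using the jieba part-of-speech tagger as a stand-in (the present application does not name a specific segmenter or compression tool, and keeping only content words is merely an illustrative compression heuristic):

import jieba.posseg as pseg

def preliminary_compress(text_to_compress):
    # Segment the text and attach at least one part-of-speech tag to
    # each word segment, then keep only segments whose tag begins
    # with n (noun), v (verb), or a (adjective).
    pairs = pseg.cut(text_to_compress)
    kept = [p.word for p in pairs if p.flag and p.flag[0] in ("n", "v", "a")]
    return "".join(kept)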
5. The semantic text compression method according to claim 4, wherein the invoking the sentence compression tool to perform sentence compression on the text to be compressed according to the part-of-speech tag corresponding to each word segment to be compressed to obtain the preliminary compressed text comprises:
when a single word segment to be compressed corresponds to two or more part-of-speech tags, determining a target part-of-speech tag from the two or more part-of-speech tags, wherein the target part-of-speech tag is the tag shared by the single word segment to be compressed and its adjacent word segment to be compressed; and
invoking the sentence compression tool to perform sentence compression on the text to be compressed according to the target part-of-speech tag, to obtain the preliminary compressed text.
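A minimal Python sketch of the tag selection in claim 5 (the fallback to the first tag is an assumption of this illustration; the claim does not state what happens when no shared tag exists):

def pick_target_tag(tags, neighbor_tags):
    # When a single word segment carries two or more part-of-speech
    # tags, keep the tag it shares with the adjacent word segment.
    shared = [t for t in tags if t in neighbor_tags]
    return shared[0] if shared else tags[0]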
6. The semantic text compression method according to claim 1, wherein the performing text repair on the preliminary compressed text according to a preset repair strategy to obtain a target compressed text comprises:
performing digit repair on the preliminary compressed text to obtain a text to be completed; and
if a preset punctuation error event is identified in the text to be completed, performing punctuation correction on the text to be completed based on a preset punctuation repair strategy to obtain the target compressed text.
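By way of a non-limiting Python sketch of claim 6 (both repair rules below are assumptions, since the present application leaves the concrete digit and punctuation strategies open):

import re

def repair_text(preliminary, original):
    # Digit repair: if a digit run from the original text survived
    # compression only as a truncated prefix, restore the full run.
    for run in re.findall(r"\d+", original):
        if run not in preliminary:
            for i in range(len(run) - 1, 0, -1):
                if run[:i] in preliminary:
                    preliminary = preliminary.replace(run[:i], run, 1)
                    break
    # Punctuation repair: collapse accidental runs of identical marks.
    return re.sub(r"([，。、！？,.!?])\1+", r"\1", preliminary)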
7. The semantic text compression method according to any one of claims 1 to 6, wherein after the performing text repair on the preliminary compressed text according to the preset repair strategy to obtain the target compressed text, the method further comprises:
generating a file to be loaded according to the target compressed text, wherein the file to be loaded is loaded by a terminal, and the target compressed text is displayed in an additional text box.
8. A semantic text compression apparatus, comprising:
a noise identification unit configured to perform noise recognition on a semantic text input by a user by using a pre-trained noise recognition model to obtain a noise recognition result, wherein the noise recognition result indicates whether each word segment in the semantic text is noise;
a noise reduction unit configured to perform noise reduction processing on the semantic text based on the noise recognition result to obtain a text to be compressed;
a compression unit configured to invoke a sentence compression tool to perform sentence compression on the text to be compressed based on the part of speech of each word segment to be compressed in the text to be compressed, to obtain a preliminary compressed text; and
a repair unit configured to perform text repair on the preliminary compressed text according to a preset repair strategy to obtain a target compressed text.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202110705874.7A 2021-06-24 2021-06-24 Semantic text compression method and computer equipment Pending CN113283218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110705874.7A CN113283218A (en) 2021-06-24 2021-06-24 Semantic text compression method and computer equipment

Publications (1)

Publication Number Publication Date
CN113283218A (en) 2021-08-20

Family

ID=77285501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110705874.7A Pending CN113283218A (en) 2021-06-24 2021-06-24 Semantic text compression method and computer equipment

Country Status (1)

Country Link
CN (1) CN113283218A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114510925A (en) * 2022-01-25 2022-05-17 森纵艾数(北京)科技有限公司 Chinese text error correction method, system, terminal equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104035992A (en) * 2014-06-10 2014-09-10 复旦大学 Method and system for processing text semantics by utilizing image processing technology and semantic vector space
CN111191022A (en) * 2019-12-27 2020-05-22 苏宁云计算有限公司 Method and device for generating short titles of commodities
CN112380866A (en) * 2020-11-25 2021-02-19 厦门市美亚柏科信息股份有限公司 Text topic label generation method, terminal device and storage medium
CN112417823A (en) * 2020-09-16 2021-02-26 中国科学院计算技术研究所 Chinese text word order adjusting and quantitative word completion method and system
CN112580329A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Text noise data identification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination