CN112949261A

CN112949261A - Text restoration method and device and electronic equipment

Info

Publication number: CN112949261A
Application number: CN202110158872.0A
Authority: CN
Inventors: 佟禹
Original assignee: Vivo Mobile Communication Co Ltd
Current assignee: Vivo Mobile Communication Co Ltd
Priority date: 2021-02-04
Filing date: 2021-02-04
Publication date: 2021-06-11
Also published as: WO2022166808A1

Abstract

The application discloses a text restoration method and device and electronic equipment, belongs to the technical field of language identification, and can solve the problem that existing electronic equipment restores text inaccurately. The method comprises the following steps: acquiring a first candidate word and a second candidate word according to a first character group; determining a first confusion degree and a second confusion degree, wherein the first confusion degree is the confusion degree corresponding to a first sentence obtained by replacing the first character group and a second character group in a target sentence with the first candidate word, and the second confusion degree is the confusion degree corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word; when the first confusion degree is smaller than the second confusion degree, obtaining a restored target text according to the first candidate word; or, when the second confusion degree is smaller than the first confusion degree, obtaining the restored target text according to the second candidate word. The method is applied to text restoration scenarios.

Description

Text restoration method and device and electronic equipment

Technical Field

The application belongs to the technical field of language identification, and particularly relates to a text restoration method and device and electronic equipment.

Background

In the process of editing text, if a western character group (for example, an English character group) at the end of a line cannot be displayed completely on that line, the character group may be broken at the position of the automatic line break, and a separator may be added at the break position, for example at mark 1, mark 2, mark 3, mark 4, mark 5, and mark 6 in Fig. 1.

Currently, if the text is copied to another file, the character groups can be automatically restored according to the separators. Specifically, the separator located at the end of a text line may be directly removed, so that the character groups before and after the separator form a single character group, which is displayed in the copied text. For example, the text shown in Fig. 2 is the text obtained after copying the text shown in Fig. 1.

However, some character groups are compound words, that is, the character group itself includes a separator. Directly removing the separators may therefore produce incorrect character groups in the restored text, for example, the character groups at mark 3, mark 5, and mark 6 in Fig. 2. Therefore, how to accurately restore the text has become a problem to be solved urgently.

Disclosure of Invention

The embodiment of the application aims to provide a text restoration method, a text restoration device, and electronic equipment, which can solve the problem that existing electronic equipment restores text inaccurately.

In order to solve the technical problem, the present application is implemented as follows:

In a first aspect, an embodiment of the present application provides a text restoration method, where the method includes: obtaining a first candidate word and a second candidate word according to a first character group, wherein the first character group is a character group that is located at the end of the Nth line in the target text to be restored and ends with a separator, the first candidate word is a word obtained by combining the first character group and a second character group, the second candidate word is a word obtained by combining a third character group and the second character group, the second character group is the character group at the start of the (N + 1)th line in the target text to be restored, and the third character group is a character group obtained by removing the separator from the first character group; determining a first confusion degree and a second confusion degree, wherein the first confusion degree is the confusion degree corresponding to a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second confusion degree is the confusion degree corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word; when the first confusion degree is smaller than the second confusion degree, obtaining a restored target text according to the first candidate word; or, when the second confusion degree is smaller than the first confusion degree, obtaining the restored target text according to the second candidate word.

In a second aspect, an embodiment of the present application provides a text restoring apparatus, which includes an obtaining module, a determining module, and a restoring module. The acquisition module is used for acquiring a first candidate word and a second candidate word according to the first character group, wherein the first character group is a character group which is positioned at the end of the line of the Nth line in the target text to be restored and ends with a separator, the first candidate word is a word obtained by combining the first character group and the second character group, the second candidate word is a word obtained by combining the third character group and the second character group, the second character group is a first character group of the (N + 1) th line in the target text to be restored, and the third character group is a character group obtained by removing the separator from the first character group; the determining module is used for determining a first confusion degree and a second confusion degree, wherein the first confusion degree is the confusion degree corresponding to a first sentence obtained by replacing a first character group and a second character group in a target sentence by a first candidate word, and the second confusion degree is the confusion degree corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence by a second candidate word; the restoring module is used for obtaining a restored target text according to the first candidate word under the condition that the first confusion degree is smaller than the second confusion degree; or under the condition that the second confusion degree is smaller than the first confusion degree, obtaining the reduced target text according to the second candidate word.

In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when the program or instructions are executed by the processor, the steps of the text restoration method in the first aspect are implemented.

In a fourth aspect, the present application provides a readable storage medium, on which a program or instructions are stored, and when the program or instructions are executed by a processor, the steps of the text restoration method in the first aspect are implemented.

In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the steps of the text restoration method as in the first aspect.

In this embodiment of the present application, a first candidate word and a second candidate word may be obtained according to a first character group, where the first character group is a character group that is located at the end of the nth line in a target text to be restored and ends with a delimiter, the first candidate word is a word obtained by combining the first character group and a second character group, the second candidate word is a word obtained by combining a third character group and a second character group, the second character group is a first character group of the (N + 1) th line in the target text to be restored, and the third character group is a character group obtained by removing the delimiter from the first character group; determining a first confusion degree and a second confusion degree, wherein the first confusion degree is the confusion degree corresponding to a first sentence obtained by replacing a first character group and a second character group in a target sentence by a first candidate word, and the second confusion degree is the confusion degree corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence by a second candidate word; under the condition that the first confusion degree is smaller than the second confusion degree, obtaining a reduced target text according to the first candidate word; or under the condition that the second confusion degree is smaller than the first confusion degree, obtaining the reduced target text according to the second candidate word. 
According to the scheme, the lower the confusion degree corresponding to a sentence, the more fluent, and therefore the more accurate, the sentence is. Therefore, by comparing the confusion degree corresponding to the first sentence obtained according to the first candidate word with the confusion degree corresponding to the second sentence obtained according to the second candidate word, it can be determined which of the first candidate word and the second candidate word is correct, that is, the correct word composed of the first character group and the second character group in the target text can be determined, so that the text can be accurately restored.

Drawings

Fig. 1 is a schematic diagram of a text to be restored according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a restored text according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of a text restoration method according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a text restoration device according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

Fig. 6 is a hardware schematic diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that embodiments of the application may be practiced in sequences other than those illustrated or described herein, and that the terms "first," "second," and the like are generally used herein in a generic sense and do not limit the number of terms, e.g., the first term can be one or more than one. In addition, "and/or" in the specification and claims means at least one of connected objects, a character "/" generally means that a preceding and succeeding related objects are in an "or" relationship.

In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.

The text restoration method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.

As shown in Fig. 3, an embodiment of the present application provides a text restoration method, which includes steps 201 to 204, or steps 201 to 203 and step 205, described below.

It should be noted that the text restoration method provided in the embodiment of the present application may be executed by a text restoration device, by a control module in the text restoration device that is used for executing the text restoration method, or by an electronic device. The text restoration method provided by the embodiment of the present application is described below by taking a text restoration device as an example.

Optionally, in this embodiment of the present application, when the text restoration method is executed by an electronic device, the electronic device may include the text restoration device provided in this embodiment of the present application, or be externally connected to the text restoration device. This may be determined according to actual use requirements and is not limited in the embodiment of the present application.

Step 201, the electronic device obtains a first candidate word and a second candidate word according to the first character group.

The first character group may be a character group that is located at the end of the Nth line in the target text to be restored and ends with a separator, the first candidate word is a word obtained by combining the first character group and the second character group, the second candidate word is a word obtained by combining the third character group and the second character group, the second character group is the character group at the start of the (N + 1)th line in the target text to be restored, the third character group is a character group obtained by removing the separator from the first character group, and N is a positive integer.

In this embodiment of the application, after the electronic device obtains the target text to be restored, the electronic device may obtain the first candidate word and the second candidate word according to the first character group, so that the target text may be restored according to a correct word in the first candidate word and the second candidate word.

Optionally, the text reduction method provided by the embodiment of the present application may be applied to the following two possible scenarios:

Scenario one: the electronic device copies the target text from one location to another, such as from one document to another.

Scenario two: the target text is text in a target image, which is recognized by the electronic device through Optical Character Recognition (OCR) technology.

Optionally, in scenario two, the text in the target image may be laid out horizontally or vertically. When the text in the target image is laid out vertically, the first character group may be a character group that is located at the end of the Mth column in the target text to be restored and ends with a separator, the second character group may be the character group at the start of the (M + 1)th column in the target text to be restored, and M is a positive integer.

Of course, in actual implementation, the text reduction method provided in the embodiment of the present application may also be applied to any other possible scenarios, which may be determined according to actual use requirements, and the embodiment of the present application is not limited.

Optionally, the character group in the embodiment of the present application may be a western character group, for example, an English, French, German, Russian, or Portuguese character group. This may be determined according to actual use requirements and is not limited in the embodiment of the present application. In the embodiment of the present application, an English character group is taken as an example for exemplary explanation.

In this embodiment of the application, after the electronic device obtains the target text to be restored, the electronic device may detect, line by line, whether a line end of each line of text in the target text ends with a separator or a specific separator (e.g., "-"), and if so, the electronic device may use a character group including the separator as the first character group. If not, the electronic device may continue to detect the next line of text.
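The line-by-line detection described above can be sketched as follows. This is a minimal Python illustration, not the implementation of the embodiment: the function name `find_first_character_groups`, the whitespace-based splitting into character groups, and the default separator tuple are assumptions made for this example.

```python
def find_first_character_groups(lines, separators=("-",)):
    """Return (line_index, character_group) for each line whose last
    character group ends with one of the given separators."""
    hits = []
    for index, line in enumerate(lines):
        stripped = line.rstrip()
        if not stripped:
            continue  # empty line: nothing to inspect, move to the next line
        last_group = stripped.split()[-1]
        if last_group.endswith(tuple(separators)):
            hits.append((index, last_group))
    return hits


lines = [
    "We introduce a new language representa-",
    "tion model called BERT",
]
print(find_first_character_groups(lines))  # [(0, 'representa-')]
```

Lines that do not end with a separator are skipped, which corresponds to the electronic device continuing with the next line of text.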

Optionally, in this embodiment of the application, a manner of acquiring, by the electronic device, the first candidate word and the second candidate word may be:

Step 1, the electronic device combines the character group before the line-end separator of the current line (i.e., the third character group), the separator (e.g., "-"), and the first character group of the line following the current line into a unit (hereinafter referred to as a candidate set to be processed).

For example, taking the text shown in Fig. 1 as an example, the electronic device may obtain the candidate set to be processed { representa, -, tion } from the first line, { repre, -, sentation } from the fourth line, { pre, -, train } from the sixth line, { re, -, sult } from the ninth line, { fine, -, tuned } from the tenth line, and { task, -, specific } from the fourteenth line.

Step 2, for each candidate set to be processed, the electronic device combines the character groups before and after the separator to obtain a candidate word, such as { representation }, { representation }, { pretrain }, { result }, { finetuned }, and { taskspecific }, so as to obtain the first candidate word.

Step 3, for each candidate set to be processed, the electronic device generates a compound-word candidate in which the separator is preserved, such as { representa-tion }, { repre-sentation }, { pre-train }, { re-sult }, { fine-tuned }, and { task-specific }, so as to obtain the second candidate word.
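Steps 1 to 3 can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the names `Candidates` and `build_candidates` are hypothetical, character groups are taken to be whitespace-delimited, and the two results are named by how they are built (`merged`, `hyphenated`) rather than as "first" or "second" candidate word.

```python
from typing import NamedTuple


class Candidates(NamedTuple):
    merged: str      # character groups before and after the separator joined directly
    hyphenated: str  # compound word with the separator preserved


def build_candidates(current_line, next_line, separator="-"):
    """Build both candidate words from a line ending with a separator
    and the line that follows it."""
    first_group = current_line.split()[-1]        # e.g. "representa-"
    if not first_group.endswith(separator):
        raise ValueError("current line does not end with the separator")
    second_group = next_line.split()[0]           # e.g. "tion"
    third_group = first_group[: -len(separator)]  # e.g. "representa"
    return Candidates(
        merged=third_group + second_group,
        hyphenated=first_group + second_group,
    )


print(build_candidates("can be fine-", "tuned with just one"))
# Candidates(merged='finetuned', hyphenated='fine-tuned')
```

Keeping both forms side by side lets a later step score each of them in the same sentence context.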

It can be understood that, in the embodiment of the present application, the second candidate word is a compound word, so that the electronic device may respectively perform word validity detection and sentence fluency detection by using a word obtained by combining the characters before and after the delimiter and a compound word formed by the delimiter, thereby ensuring accuracy of the restored target text.

Step 202, the electronic device determines a first confusion degree and a second confusion degree.

The first confusion degree may be a confusion degree corresponding to a first sentence obtained by replacing a first character group and a second character group in the target sentence with the first candidate word, and the second confusion degree may be a confusion degree corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word.

In this embodiment of the application, after the electronic device obtains the first candidate word and the second candidate word, the electronic device may determine the first confusion degree and the second confusion degree, so that a correct word may be determined from the first candidate word and the second candidate word to restore the target text.

Optionally, in this embodiment of the application, for the step 202, the electronic device may perform the following steps 202a and 202b on the first candidate word and the second candidate word, respectively, so as to determine the first confusion degree and the second confusion degree.

It is understood that the following steps 202a and 202b are exemplified by one of the first candidate word and the second candidate word (e.g., the target candidate word in the embodiment of the present application).

In step 202a, the electronic device determines a target parameter based on a probability of occurrence of each character in the target candidate word in the target text.

The target candidate word may be the first candidate word or the second candidate word.

Step 202b, the electronic device determines a confusion degree corresponding to the target candidate word according to the target parameter.

Wherein, the target parameters may include: the legality value of the target candidate word, the fluency value of the target phrase and the fluency value of the target sentence. The target phrase may include the target candidate word, a fourth character group, and a fifth character group, where the fourth character group may be a character group located before the first character group in the target text, and the fifth character group is a character group located after the second character group in the target text.

In this embodiment of the application, the electronic device may determine, based on a probability that each character in a target candidate word (a first candidate word or a second candidate word) appears in a target text, a legality value of the target candidate word, a fluency value of a target phrase, and a fluency value of a target sentence, so as to obtain the target parameter, and then the electronic device may determine, according to the target parameter, a confusion degree (e.g., the first confusion degree or the second confusion degree) corresponding to the target candidate word.

In this embodiment of the application, the electronic device may input the target candidate word (the first candidate word or the second candidate word) and the target text into the language model, and the language model may then calculate the legality value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence, so as to obtain the target parameter.

In this embodiment of the application, the legality value of the target candidate word may be the probability (denoted as Score_1) of occurrence of the target candidate word in the target text.

Optionally, in this embodiment of the present application, the legality value of the target candidate word may be a product between probabilities of each character in the target candidate word appearing in the target text.

The probability that the Kth character in the target candidate word appears in the target text refers to the probability that the Kth character occurs given that a sixth character group occurs in the target text, where the sixth character group consists of the 1st character to the (K-1)th character in the target candidate word, and K is an integer greater than 1.

It should be noted that "another character (denoted as B) appears in the case where a certain character group or character (denoted as A) appears", as referred to in the embodiments of the present application, means that, in the text, B is located immediately after A, and there is no separator between A and B.

Specifically, the target candidate word may have a legality value expressed as:

$P(W) = p(C_1) \times p(C_2 \mid C_1) \times \cdots \times p(C_K \mid C_1, C_2, \ldots, C_{K-1})$;

where $P(W)$ represents the legality value of the target candidate word, $p(C_1)$ represents the probability that the first character in the target candidate word appears in the target text, and $p(C_K \mid C_1, C_2, \ldots, C_{K-1})$ represents the probability that the Kth character occurs given that the sixth character group, consisting of the 1st character to the (K-1)th character in the target candidate word, occurs in the target text.

Illustratively, the judgment is made by a language model. In formula (1) below, $W$ represents a candidate word, $C_1$ represents the first character in the candidate word, and $C_K$ represents the last character in the candidate word; the probability that the characters $C_1$ to $C_K$ form the candidate word $W$ is calculated to judge whether $W$ is a legal word. The probability of the word is calculated by formula (2) below, where $p(C_1)$ represents the probability that the character $C_1$ appears in the target text and is calculated by formula (3) below. For example, if $C_1$ represents the character "r", the total number of characters in the target text is 100, and the character "r" appears 10 times, then the probability that "r" appears is 10/100 = 0.1, that is, $p(C_1) = 0.1$.

In formula (4), $p(C_2 \mid C_1)$ indicates that the occurrence of $C_2$ is correlated with $C_1$, that is, it is the probability of $C_2$ given that $C_1$ has occurred. For example, if $C_1$ represents the character "w" and $C_2$ represents the character "e", then the probability that "e" occurs given that "w" has occurred is $p(e \mid w) = p(we) / p(w)$.

$W = C_1, C_2, C_3, \ldots, C_K$ (1)

$P(W) = P(C_1, C_2, C_3, \ldots, C_K) = p(C_1) \times p(C_2 \mid C_1) \times \cdots \times p(C_K \mid C_1, C_2, \ldots, C_{K-1})$ (2)

$p(C_k) = \dfrac{\text{number of occurrences of character } C_k}{\text{total number of characters in the document}}$ (3)

$p(C_2 \mid C_1) = \dfrac{p(C_1 C_2)}{p(C_1)}$ (4)
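Formulas (2) to (4) can be illustrated with a small count-based sketch in Python. It is only an approximation under stated assumptions: the helper names `count_occurrences` and `legality_value` are hypothetical, each conditional probability is estimated as the ratio of raw occurrence counts in the target text (so the joint and the single-character probabilities share one normalizer), and no smoothing is applied. The embodiment itself leaves these estimates to a language model.

```python
def count_occurrences(text, fragment):
    """Count (possibly overlapping) occurrences of fragment in text."""
    return sum(
        1
        for i in range(len(text) - len(fragment) + 1)
        if text.startswith(fragment, i)
    )


def legality_value(word, text):
    """Chain-rule estimate of P(W) from character counts in the text."""
    if not word or not text:
        return 0.0
    # Formula (3): occurrences of the first character / total characters.
    probability = count_occurrences(text, word[0]) / len(text)
    for k in range(2, len(word) + 1):
        prefix_count = count_occurrences(text, word[: k - 1])
        if prefix_count == 0:
            return 0.0  # the prefix never occurs, so the word cannot occur
        # Formula (4), estimated with counts: count(prefix + C_k) / count(prefix).
        probability *= count_occurrences(text, word[:k]) / prefix_count
    return probability


# In "we were", "w" occurs 2 times in 7 characters and is always followed by "e".
print(legality_value("we", "we were"))
```

With this estimate the product telescopes to the number of occurrences of the whole word divided by the total number of characters, which makes the behaviour easy to check by hand.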

In this embodiment of the application, the fluency value of the target phrase may be the probability (denoted as Score_2) of occurrence, in the target text, of the phrase composed of the target candidate word, the fourth character group, and the fifth character group.

In the embodiment of the present application, the fluency value of the target phrase may be calculated according to the following formula (5).

In formula (5), S represents a sentence or phrase consisting of the words $W_1, \ldots, W_N$. Generally, the lower the confusion degree, the more fluent the sentence or phrase.
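As a hedged illustration of why a lower value indicates a more fluent sentence, the sketch below uses the standard definition of perplexity, the inverse probability of the word sequence normalized by its length N. This definition is an assumption and may differ from formula (5); the function name `perplexity` and the idea of passing in per-word conditional probabilities are also assumptions made for this example.

```python
import math


def perplexity(word_probabilities):
    """Standard perplexity of a word sequence, given the conditional
    probability that a language model assigns to each word."""
    if not word_probabilities:
        raise ValueError("at least one word probability is required")
    if any(p <= 0.0 for p in word_probabilities):
        return math.inf  # an impossible word makes the sequence maximally confusing
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / len(word_probabilities))


likely = perplexity([0.9, 0.8, 0.9])    # words the model expects
unlikely = perplexity([0.1, 0.05, 0.2])  # words the model finds surprising
print(likely < unlikely)  # True
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.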

Exemplarily, as shown by reference numeral 1 in Fig. 1, assuming that the target candidate word is "representation", the word before "representa-" is "language", and the word after "tion" is "model", the fluency value of the phrase "language representation model" can be obtained by formula (5).

In this embodiment of the application, the fluency value of the target sentence may be the probability (denoted as Score_3) of occurrence of the target sentence in the target text.

In this embodiment, the fluency value of the target sentence may be calculated according to the formula (5).

Exemplarily, as shown by reference numeral 1 in Fig. 1, assuming that the target candidate word is "representation" and the sentence in which "representa-" is located is "We introduce a new language representation model called BERT", the fluency value of this sentence can be obtained by formula (5).

Optionally, in this embodiment of the application, step 202b may be specifically implemented by step 202b1 described below.

In step 202b1, the electronic device obtains the confusion degree corresponding to the target candidate word according to the sum of the product of the legality value of the target candidate word and the first coefficient, the product of the fluency value of the target phrase and the second coefficient, and the product of the fluency value of the target sentence and the third coefficient.

Wherein the sum of the first coefficient, the second coefficient and the third coefficient is equal to 1.

In this embodiment of the present application, after the electronic device determines the legality value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence, the electronic device may calculate a sum of a product of the legality value of the target candidate word and a first coefficient (denoted as α), a product of the fluency value of the target phrase and a second coefficient (denoted as β), and a product of the fluency value of the target sentence and a third coefficient (denoted as γ), so as to obtain a confusion degree (denoted as Score) corresponding to the target candidate word.

That is, Score = α × Score_1 + β × Score_2 + γ × Score_3.

Optionally, in this embodiment of the application, values of the first coefficient, the second coefficient, and the third coefficient may be any possible positive numbers, and a sum of the first coefficient, the second coefficient, and the third coefficient is equal to 1.
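The weighted combination in step 202b1 can be sketched as follows. The function name `confusion_degree` and the explicit validation of the coefficients are assumptions added for illustration; the arithmetic itself follows the weighted sum stated above.

```python
import math


def confusion_degree(score_1, score_2, score_3, alpha, beta, gamma):
    """Weighted sum of the legality value (score_1), the phrase fluency
    value (score_2), and the sentence fluency value (score_3)."""
    if min(alpha, beta, gamma) <= 0:
        raise ValueError("coefficients must be positive")
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("coefficients must sum to 1")
    return alpha * score_1 + beta * score_2 + gamma * score_3


# 0.5 * 0.2 + 0.3 * 0.4 + 0.2 * 0.6 = 0.34 (up to floating-point rounding)
print(confusion_degree(0.2, 0.4, 0.6, alpha=0.5, beta=0.3, gamma=0.2))
```

Because the coefficients sum to 1, the result stays within the range spanned by the three input values.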

Step 203, the electronic device determines whether the first confusion degree is smaller than the second confusion degree.

In an embodiment of the application, after the electronic device determines the first confusion degree and the second confusion degree, the electronic device may compare them, thereby determining which of the first candidate word and the second candidate word is correct.

In this embodiment of the application, if the first confusion degree is smaller than the second confusion degree, the electronic device may obtain the restored target text according to the first candidate word, that is, the electronic device may execute step 204 described below. If the second confusion degree is smaller than the first confusion degree, the electronic device may obtain the restored target text according to the second candidate word, that is, the electronic device may execute step 205 described below.

It can be understood that, in the embodiment of the present application, only one of step 204 and step 205 described below is performed.
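The comparison in step 203 and the resulting choice between step 204 and step 205 can be sketched as below. The function name `choose_candidate` is hypothetical, and returning `None` when the two confusion degrees are equal is an assumption of this sketch, since the text only describes the two strict inequalities.

```python
def choose_candidate(first_candidate, second_candidate,
                     first_confusion, second_confusion):
    """Return the candidate word with the smaller confusion degree."""
    if first_confusion < second_confusion:
        return first_candidate   # corresponds to step 204
    if second_confusion < first_confusion:
        return second_candidate  # corresponds to step 205
    return None                  # tie: not covered by the described steps


print(choose_candidate("fine-tuned", "finetuned", 12.5, 48.0))  # fine-tuned
```

A caller that receives `None` could, for example, leave the original text unchanged.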

Step 204, the electronic device obtains the restored target text according to the first candidate word.

In the embodiment of the application, when the first confusion degree is smaller than the second confusion degree, the electronic device may restore the target text according to the first candidate word, so that the restored target text is obtained.

In one possible implementation manner, the electronic device may directly replace the first character group and the second character group in the target text with the first candidate word, so that the restored target text is obtained.

In another possible implementation manner, the electronic device may replace the target sentence in the target text with the first sentence (where the first sentence includes the first candidate word), so as to obtain the restored target text.

Step 205, the electronic device obtains the restored target text according to the second candidate word.

In this embodiment of the application, when the second confusion degree is smaller than the first confusion degree, the electronic device may restore the target text according to the second candidate word, so that the restored target text is obtained.

In one possible implementation manner, the electronic device may directly replace the first character group and the second character group in the target text with the second candidate word, so that the restored target text is obtained.

In another possible implementation manner, the electronic device may replace the target sentence in the target text with the second sentence (where the second sentence includes the second candidate word), so as to obtain the restored target text.
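The direct-replacement implementation manner can be sketched as follows. This is an assumption-laden illustration: the function name `apply_candidate` is hypothetical, character groups are treated as whitespace-delimited, the chosen candidate word is placed at the end of line N, and the second character group is removed from line N + 1. Where the restored word is placed is a design choice of this sketch.

```python
def apply_candidate(lines, n, candidate):
    """Replace the last character group of lines[n] and the first
    character group of lines[n + 1] with the chosen candidate word."""
    current_groups = lines[n].split()
    next_groups = lines[n + 1].split()
    if not current_groups or not next_groups:
        raise ValueError("both lines must contain at least one character group")
    restored = list(lines)  # copy, so the input list is left untouched
    restored[n] = " ".join(current_groups[:-1] + [candidate])
    restored[n + 1] = " ".join(next_groups[1:])
    return restored


lines = ["a new language representa-", "tion model called BERT"]
print(apply_candidate(lines, 0, "representation"))
# ['a new language representation', 'model called BERT']
```

Note that joining with single spaces normalizes any repeated whitespace inside the two affected lines.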

In the text restoration method provided by the embodiment of the application, a smaller confusion degree corresponding to a sentence indicates a more fluent, and therefore more accurate, sentence. By comparing the confusion degree corresponding to the first sentence obtained according to the first candidate word with the confusion degree corresponding to the second sentence obtained according to the second candidate word, it can be determined which of the first candidate word and the second candidate word is correct, that is, the correct word formed by the first character group and the second character group in the target text can be determined, so that the text can be restored accurately.

Optionally, in this embodiment of the application, after the electronic device obtains the restored target text, the text restoration method provided in this embodiment of the application may further include step 206 described below.

Step 206, the electronic device acquires the keywords of the restored target text based on the keyword recognition model.

The content type of the keyword may be the same as a content type preset in the keyword recognition model.

In the embodiment of the application, after the electronic device obtains the restored target text, the electronic device can input the restored target text into the keyword recognition model and obtain the keywords in the restored target text based on the keyword recognition model. Because the text has been restored before recognition, accurate keywords can be obtained, and the accuracy of keyword recognition can be improved.

Optionally, in this embodiment of the application, after the keyword recognition model recognizes the keywords in the restored target text, the keyword recognition model may output a keyword list to the electronic device. The keyword list may include all keywords in the restored target text.

For example, assuming that the content type preset in the keyword recognition model is "place name", after the electronic device inputs the restored target text into the keyword recognition model, the keyword recognition model may extract and output all words related to "place name" from the restored target text, thereby obtaining the keywords.

In the embodiment of the application, after the electronic device inputs the restored target text into the keyword recognition model, the keyword recognition model can perform keyword recognition on the restored target text, obtain the keywords in the restored target text, and output a list of the keywords to the electronic device, so that the keywords in the target text can be accurately obtained.
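The effect of step 206 can be illustrated with a toy example. The description does not specify how the keyword recognition model is built, so the sketch below substitutes a simple lexicon lookup for it; the class name `LexiconKeywordModel`, its `recognize` method, and the place-name lexicon are hypothetical stand-ins, not the model of the embodiment.

```python
import re


class LexiconKeywordModel:
    """Toy stand-in for a keyword recognition model with one preset content type."""

    def __init__(self, content_type, lexicon):
        self.content_type = content_type                  # e.g. "place name"
        self.lexicon = {word.lower() for word in lexicon}

    def recognize(self, text):
        """Return the list of distinct keywords of the preset type, in order of appearance."""
        keywords = []
        # A token is a run of letters and hyphens starting with a letter.
        for token in re.findall(r"[A-Za-z][A-Za-z\-]*", text):
            if token.lower() in self.lexicon and token not in keywords:
                keywords.append(token)
        return keywords


model = LexiconKeywordModel("place name", ["Beijing", "Shenzhen"])

unrestored = "The flight from Bei-\njing to Shenzhen is delayed."
restored = "The flight from Beijing to Shenzhen is delayed."

print(model.recognize(unrestored))  # the broken word "Bei-" / "jing" is missed
print(model.recognize(restored))    # both place names are found
```

In this toy setting the unrestored text yields only one of the two place names, which is the kind of recognition error that restoring the text first is meant to avoid.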

The following describes the text restoration apparatus provided in the embodiment of the present application, taking as an example the case in which the text restoration apparatus executes the text restoration method.

As shown in fig. 4, the embodiment of the present application provides a text restoration apparatus 300, where the text restoration apparatus 300 includes an obtaining module 301, a determining module 302, and a restoring module 303.

The obtaining module 301 is configured to obtain a first candidate word and a second candidate word according to a first character group, where the first character group is a character group that is located at the end of the Nth line in the target text to be restored and ends with a delimiter, the first candidate word is a word obtained by combining the first character group and a second character group, the second candidate word is a word obtained by combining a third character group and the second character group, the second character group is the character group at the beginning of the (N+1)th line in the target text to be restored, and the third character group is a character group obtained by removing the delimiter from the first character group.

The determining module 302 is configured to determine a first confusion degree and a second confusion degree, where the first confusion degree is the confusion degree corresponding to a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second confusion degree is the confusion degree corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word.

The restoring module 303 is configured to obtain the restored target text according to the first candidate word under the condition that the first confusion degree is smaller than the second confusion degree, or to obtain the restored target text according to the second candidate word under the condition that the second confusion degree is smaller than the first confusion degree.
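The behaviour of the obtaining module 301 can be sketched as follows. This is a minimal illustration under stated assumptions: character groups are taken to be whitespace-separated tokens, the delimiter is taken to be a hyphen, and the function name `get_candidates` is hypothetical.

```python
def get_candidates(line_n, line_n_plus_1, delimiter="-"):
    """Return (first_candidate, second_candidate), or None when line N does
    not end with a character group that ends with the delimiter."""
    groups_n = line_n.split()
    groups_next = line_n_plus_1.split()
    if not groups_n or not groups_next:
        return None
    first_group = groups_n[-1]            # last character group of line N
    if not first_group.endswith(delimiter):
        return None
    second_group = groups_next[0]         # leading character group of line N+1
    third_group = first_group[:-len(delimiter)]   # first group without the delimiter
    first_candidate = first_group + second_group   # delimiter kept, e.g. "well-known"
    second_candidate = third_group + second_group  # delimiter dropped, e.g. "wellknown"
    return first_candidate, second_candidate


print(get_candidates("This is a well-", "known method."))
print(get_candidates("The text recog-", "nition model"))
```

The two calls show why both candidates are needed: in the first case the delimiter belongs to the word ("well-known"), while in the second it was only added at the line break ("recognition"), and the delimiter alone does not say which case applies.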

Optionally, the determining module is specifically configured to perform the following steps on the first candidate word and the second candidate word respectively: determining a target parameter based on the probability of each character in a target candidate word appearing in a target text, wherein the target candidate word is a first candidate word or a second candidate word; determining the confusion degree corresponding to the target candidate words according to the target parameters; wherein the target parameters include: the legality value of the target candidate word, the fluency value of the target phrase and the fluency value of the target sentence; the target phrase comprises a target candidate word, a fourth character group and a fifth character group, wherein the fourth character group is a character group positioned in the target text before the first character group, and the fifth character group is a character group positioned in the target text after the second character group.

Optionally, the determining module is specifically configured to obtain a confusion degree corresponding to the target candidate word according to a sum of a product of the legitimacy value of the target candidate word and the first coefficient, a product of the fluency value of the target phrase and the second coefficient, and a product of the fluency value of the target sentence and the third coefficient; wherein the sum of the first coefficient, the second coefficient, and the third coefficient is equal to 1.
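The weighted sum described above can be written as a short Python sketch. The function name `confusion_degree` and the default coefficient values are hypothetical; the description only requires that the three coefficients sum to 1. How the three input values are scaled so that a smaller sum indicates a more fluent sentence is not spelled out in this passage, so the sketch simply takes them as given inputs.

```python
def confusion_degree(legality_value, phrase_fluency, sentence_fluency,
                     coefficients=(0.2, 0.3, 0.5)):
    """Weighted sum of the three target parameters.

    The default coefficients are illustrative; they must sum to 1.
    """
    first_coef, second_coef, third_coef = coefficients
    if abs(first_coef + second_coef + third_coef - 1.0) > 1e-9:
        raise ValueError("the three coefficients must sum to 1")
    return (first_coef * legality_value
            + second_coef * phrase_fluency
            + third_coef * sentence_fluency)


# Illustrative input values only; they are not taken from the description.
first_confusion = confusion_degree(1.0, 2.0, 3.0)
second_confusion = confusion_degree(2.0, 4.0, 6.0)
print(first_confusion < second_confusion)
```

Because the coefficients sum to 1, the result is a weighted average of the three values, so it stays within the range spanned by them.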

Optionally, the legality value of the target candidate word is the probability of the target candidate word appearing in the target text; the fluency value of the target phrase is the probability of the phrase consisting of the target candidate word, the fourth character group and the fifth character group appearing in the target text; the fluency value of the target sentence is the probability of the target sentence appearing in the target text.

Optionally, the legality value of the target candidate word is the product of the probabilities of each character in the target candidate word appearing in the target text; wherein the probability that the Kth character in the target candidate word appears in the target text refers to the probability that the Kth character occurs given that a sixth character group occurs in the target text, the sixth character group consists of the 1st character to the (K-1)th character in the target candidate word, and K is an integer greater than 1.
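The character-by-character product can be sketched in Python. The description does not say how the conditional probabilities are estimated, so this sketch assumes a simple estimate from prefix counts over a small word list; the function names `build_prefix_counts` and `legality_value` and the word list are hypothetical.

```python
from collections import Counter


def build_prefix_counts(words):
    """Count how many words start with each prefix (including the empty prefix)."""
    counts = Counter()
    for word in words:
        for end in range(len(word) + 1):
            counts[word[:end]] += 1
    return counts


def legality_value(candidate, prefix_counts):
    """Product over K of P(Kth character | characters 1..K-1).

    For K = 1 the condition is the empty prefix, i.e. an unconditional probability.
    """
    probability = 1.0
    for k in range(1, len(candidate) + 1):
        seen = prefix_counts[candidate[:k - 1]]   # occurrences of the sixth character group
        if seen == 0:
            return 0.0                            # the prefix never occurs
        probability *= prefix_counts[candidate[:k]] / seen
    return probability


# Hypothetical word list standing in for the statistics of the target text.
counts = build_prefix_counts(["well-known", "well", "wellness", "known"])
print(legality_value("well-known", counts))
print(legality_value("wellknown", counts))
```

With this word list, the candidate that keeps the hyphen receives a non-zero value, while the candidate without it receives zero because the character sequence "wellk" never occurs.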

Optionally, the determining module is further configured to obtain a keyword of the restored target text based on the keyword recognition model, where a content type of the keyword is the same as a content type preset in the keyword recognition model.

The embodiment of the present application provides a text restoration apparatus. A smaller confusion degree corresponding to a sentence indicates a more fluent, and therefore more accurate, sentence. Thus, by comparing the confusion degree corresponding to the first sentence obtained according to the first candidate word with the confusion degree corresponding to the second sentence obtained according to the second candidate word, it can be determined which of the first candidate word and the second candidate word is correct, that is, the correct word formed by the first character group and the second character group in the target text can be determined, so that the text can be restored accurately.

The text restoration apparatus in the embodiment of the present application may be an apparatus, or may be a component, an integrated circuit, or a chip in an electronic device. The apparatus may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA), and the non-mobile electronic device may be a personal computer (PC), a television (TV), a teller machine, or a self-service machine; the embodiments of the present application are not specifically limited in this respect.

The text restoration apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system; the embodiments of the present application are not specifically limited in this respect.

The text restoration apparatus provided by the embodiment of the application can implement each process implemented by the foregoing method embodiment; to avoid repetition, details are not described here again.

Optionally, as shown in fig. 5, an electronic device 500 is further provided in this embodiment of the present application, and includes a processor 501, a memory 502, and a program or an instruction stored in the memory 502 and executable on the processor 501, where the program or the instruction, when executed by the processor 501, implements each process of the foregoing text restoration method embodiment and can achieve the same technical effect; to avoid repetition, details are not described here again.

It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic device and the non-mobile electronic device described above.

Fig. 6 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.

The electronic device 100 includes, but is not limited to: a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, and a processor 110.

Those skilled in the art will appreciate that the electronic device 100 may further comprise a power source (e.g., a battery) for supplying power to the various components, and the power source may be logically connected to the processor 110 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system. The electronic device structure shown in fig. 6 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than those shown, combine some components, or arrange the components differently, which is not described in detail here.

The processor 110 may be configured to obtain a first candidate word and a second candidate word according to a first character group, where the first character group is a character group that is located at the end of the Nth line in the target text to be restored and ends with a delimiter, the first candidate word is a word obtained by combining the first character group and a second character group, the second candidate word is a word obtained by combining a third character group and the second character group, the second character group is the character group at the beginning of the (N+1)th line in the target text to be restored, and the third character group is a character group obtained by removing the delimiter from the first character group; determine a first confusion degree and a second confusion degree, where the first confusion degree is the confusion degree corresponding to a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second confusion degree is the confusion degree corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word; and, under the condition that the first confusion degree is smaller than the second confusion degree, obtain the restored target text according to the first candidate word, or, under the condition that the second confusion degree is smaller than the first confusion degree, obtain the restored target text according to the second candidate word.

Optionally, the processor 110 is specifically configured to perform the following steps on the first candidate word and the second candidate word respectively: determining a target parameter based on the probability of each character in a target candidate word appearing in a target text, wherein the target candidate word is a first candidate word or a second candidate word; determining the confusion degree corresponding to the target candidate words according to the target parameters; wherein the target parameters include: the legality value of the target candidate word, the fluency value of the target phrase and the fluency value of the target sentence; the target phrase comprises a target candidate word, a fourth character group and a fifth character group, wherein the fourth character group is a character group positioned in the target text before the first character group, and the fifth character group is a character group positioned in the target text after the second character group.

Optionally, the processor 110 is specifically configured to obtain a confusion degree corresponding to the target candidate word according to a sum of a product of the legality value of the target candidate word and the first coefficient, a product of the fluency value of the target phrase and the second coefficient, and a product of the fluency value of the target sentence and the third coefficient; wherein the sum of the first coefficient, the second coefficient, and the third coefficient is equal to 1.

Optionally, the processor 110 is further configured to obtain a keyword of the restored target text based on the keyword recognition model, where a content type of the keyword is the same as a content type preset in the keyword recognition model.

The embodiment of the present application provides an electronic device, where the smaller the confusion degree corresponding to a sentence is, the more fluent the sentence is, that is, the smaller the confusion degree corresponding to the sentence is, the more accurate the sentence is, so by comparing the confusion degree corresponding to a first sentence obtained according to a first candidate word with the confusion degree corresponding to a second sentence obtained according to a second candidate word, which of the first candidate word and the second candidate word is correct can be determined, that is, a correct word composed of a first character group and a second character group in a target text can be determined, and thus the text can be accurately restored.

It should be noted that, in the embodiment of the present application, the obtaining module, the determining module, the restoring module, and the input module in the text restoring apparatus may all be implemented by the processor 110.

It should be understood that in the embodiment of the present application, the radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. The electronic device provides wireless broadband internet access to the user via the network module 102, such as assisting the user in sending and receiving e-mails, browsing web pages, and accessing streaming media. The audio output unit 103 may include a speaker, a buzzer, a receiver, and the like. The input unit 104 may include a graphics processing unit (GPU) 1041 and a microphone 1042, and the graphics processing unit 1041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 107 includes a touch panel 1071 and other input devices 1072. The touch panel 1071 is also referred to as a touch screen. The touch panel 1071 may include two parts: a touch detection device and a touch controller. The other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. The memory 109 may be used to store software programs as well as various data, including, but not limited to, application programs and an operating system. The processor 110 may integrate an application processor, which primarily handles the operating system, user interfaces, applications, and the like, and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may alternatively not be integrated into the processor 110.

The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, each process of the foregoing text restoration method embodiment is implemented and the same technical effect can be achieved; to avoid repetition, details are not described here again.

The processor is the processor in the electronic device in the above embodiment. The readable storage medium may include a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or an instruction to implement each process of the foregoing text restoration method embodiment, and the same technical effect can be achieved.

It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-on-chip (SoC) or the like.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved; for example, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling an electronic device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.

While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A text restoration method, the method comprising:

acquiring a first candidate word and a second candidate word according to a first character group, wherein the first character group is a character group which is positioned at the end of the Nth line in a target text to be restored and ends with a separator, the first candidate word is a word obtained by combining the first character group and a second character group, the second candidate word is a word obtained by combining a third character group and the second character group, the second character group is the character group at the beginning of the (N+1)th line in the target text to be restored, and the third character group is a character group obtained by removing the separator from the first character group;

determining a first confusion degree and a second confusion degree, wherein the first confusion degree is a confusion degree corresponding to a first sentence obtained by replacing the first character group and the second character group in the target sentence by the first candidate word, and the second confusion degree is a confusion degree corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence by the second candidate word;

under the condition that the first confusion degree is smaller than the second confusion degree, obtaining the restored target text according to the first candidate word; or under the condition that the second confusion degree is smaller than the first confusion degree, obtaining the restored target text according to the second candidate word.

2. The method of claim 1, wherein determining the first and second degrees of confusion comprises:

performing the following steps on the first candidate word and the second candidate word respectively:

determining a target parameter based on a probability of occurrence of each character in a target candidate word in the target text, wherein the target candidate word is the first candidate word or the second candidate word;

determining the confusion degree corresponding to the target candidate word according to the target parameter;

wherein the target parameters include: the legality value of the target candidate word, the fluency value of the target phrase and the fluency value of the target sentence; the target phrase comprises the target candidate word, a fourth character group and a fifth character group, wherein the fourth character group is a character group positioned in the target text before the first character group, and the fifth character group is a character group positioned in the target text after the second character group.

3. The method of claim 2, wherein determining the confusion degree corresponding to the target candidate word according to the target parameter comprises:

obtaining the confusion degree corresponding to the target candidate word according to the sum of the product of the legality value of the target candidate word and a first coefficient, the product of the fluency value of the target phrase and a second coefficient, and the product of the fluency value of the target sentence and a third coefficient;

wherein the sum of the first coefficient, the second coefficient, and the third coefficient is equal to 1.

4. The method according to claim 2 or 3, wherein the legality value of the target candidate word is a probability that the target candidate word appears in the target text;

the fluency value of the target phrase is the probability of the phrase consisting of the target candidate word, the fourth character group and the fifth character group appearing in the target text;

and the fluency value of the target sentence is the probability of the target sentence appearing in the target text.

5. The method of claim 4, wherein the legality value of the target candidate word is a product between probabilities of each character in the target candidate word appearing in the target text;

wherein the probability that the Kth character in the target candidate word appears in the target text is: the probability that the Kth character occurs given that a sixth character group occurs in the target text, the sixth character group consisting of the 1st to (K-1)th characters in the target candidate word, K being an integer greater than 1.

6. A text restoration apparatus, comprising an obtaining module, a determining module and a restoring module;

the obtaining module is configured to obtain a first candidate word and a second candidate word according to a first character group, where the first character group is a character group that is located at the end of an Nth line in a target text to be restored and ends with a delimiter, the first candidate word is a word obtained by combining the first character group and a second character group, the second candidate word is a word obtained by combining a third character group and the second character group, the second character group is the character group at the beginning of the (N+1)th line in the target text to be restored, and the third character group is a character group obtained by removing the delimiter from the first character group;

the determining module is configured to determine a first confusion degree and a second confusion degree, where the first confusion degree is a confusion degree corresponding to a first sentence obtained by replacing the first character group and the second character group in the target sentence with the first candidate word, and the second confusion degree is a confusion degree corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word;

the restoring module is configured to obtain the restored target text according to the first candidate word under the condition that the first confusion degree is smaller than the second confusion degree; or to obtain the restored target text according to the second candidate word under the condition that the second confusion degree is smaller than the first confusion degree.

7. The apparatus according to claim 6, wherein the determining module is specifically configured to perform the following steps for the first candidate word and the second candidate word, respectively:

8. The apparatus according to claim 7, wherein the determining module is specifically configured to obtain the confusion degree corresponding to the target candidate word according to a sum of a product of the legality value of the target candidate word and a first coefficient, a product of the fluency value of the target phrase and a second coefficient, and a product of the fluency value of the target sentence and a third coefficient;

9. The apparatus according to claim 7 or 8, wherein the legality value of the target candidate word is a probability of the target candidate word appearing in the target text;

10. The apparatus of claim 9, wherein the legality value of the target candidate word is a product between probabilities of each character in the target candidate word appearing in the target text;

11. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the text restoration method according to any one of claims 1-5.

12. A readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the text restoration method according to any one of claims 1-5.