CN105320641B - Text verification method and user terminal - Google Patents

Info

Publication number
CN105320641B
Authority
CN
China
Prior art keywords
text
read
character string
segment
abstract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410370686.3A
Other languages
Chinese (zh)
Other versions
CN105320641A (en)
Inventor
芦世先
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201410370686.3A
Publication of CN105320641A
Application granted
Publication of CN105320641B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a text verification method and a user terminal. The method comprises the following steps: acquiring, from a text signing site, a text abstract of a standard text segment having the same title as a text segment to be read; matching the text segment to be read against the text abstract; and outputting the text segment to be read when the matching result is greater than a preset threshold. This improves the accuracy of the text segment and helps ensure text reading quality.

Description

Text verification method and user terminal
Technical Field
The invention relates to the technical field of internet, in particular to a text verification method and a user terminal.
Background
With the continuous development and improvement of internet technology, networks have become an indispensable part of people's life, and users can connect with networks through user terminals such as mobile phones and computers to perform file transmission, browse network texts, play games and the like.
In the existing process of reading web texts, a feature vector is calculated from a text segment, and whether the text segment is correct is determined according to the feature vector, for example, whether a piece of novel content actually belongs to the novel. Because the segment content is judged only on the basis of the network text currently being read, the accuracy of the text segment cannot be well ensured, which affects the quality of text reading.
Disclosure of Invention
The embodiment of the invention provides a text verification method and a user terminal, which can further improve the accuracy of text fragments and ensure the text reading quality.
In order to solve the foregoing technical problem, a first aspect of an embodiment of the present invention provides a text verification method, which may include:
acquiring a text abstract of a standard text segment with the same title as a text segment to be read in a text signing site;
matching the text segments to be read by adopting the text abstract;
and when the matching result is larger than a preset threshold value, outputting the text segment to be read.
A second aspect of an embodiment of the present invention provides a user terminal, which may include:
the abstract acquiring unit is used for acquiring a text abstract of a standard text segment with the same title as that of a text segment to be read in a text signing site;
the segment matching unit is used for matching the text segment to be read by adopting the text abstract;
and the segment output unit is used for outputting the text segment to be read when the matching result is greater than a preset threshold value.
In the embodiment of the invention, the text abstract of the standard text segment with the same title as the text segment to be read is acquired from the text signing site, the text abstract is adopted to carry out matching processing on the text segment to be read, and the text segment to be read is output when the matching result is greater than the preset threshold value. The text abstract of the standard text segment acquired by the text signing site is adopted to match the text segment to be read, so that the accuracy of the text segment is further improved, and the quality of text reading is further ensured.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a text verification method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another text verification method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a user terminal according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a segment matching unit according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another segment matching unit according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another user terminal according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of another user terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The text verification method provided by the embodiment of the invention can be applied to scenarios of reading network novels. For example, when the text of a network novel is read, a text abstract of a standard text segment with the same title as the text segment to be read (such as a chapter) is acquired from the text signing site of the novel's author; the text segment to be read is matched against the text abstract; and the text segment to be read is output when the matching result is greater than a preset threshold. Because the text segment to be read is matched against the text abstract of the standard text segment acquired from the text signing site, the accuracy of the text segment is improved and the quality of text reading is ensured.
The user terminal related to the embodiment of the invention can comprise: terminal devices such as a computer, a tablet computer, a smart phone, a notebook computer, a palm computer and a Mobile Internet Device (MID); the text signing site is a site signed by a text author, the copyright of the text is owned by the text signing site, and the texts in the text signing site are all standard texts, namely accurate and error-free texts; the standard text segment is a part of content belonging to the standard text, for example: chapter, subtitle, and content attributed to the subtitle, and the like.
The text verification method provided by the embodiment of the invention will be described in detail below with reference to fig. 1 and 2.
Referring to fig. 1, a flow chart of a text verification method according to an embodiment of the present invention is schematically shown. As shown in fig. 1, the method of the embodiment of the present invention includes the following steps S101 to S103.
S101, acquiring a text abstract of a standard text segment with the same title as a text segment to be read in a text signing site;
Specifically, when a user opens a text to be read through a user terminal, the user terminal acquires, from the text signing site, the text abstract of the standard text segment that has the same title as a text segment to be read in that text. It can be understood that the text to be read opened by the user terminal comes from a text aggregation site, which provides users with free text reading by extracting texts from third-party sites. The standard text segments in the text signing site must be paid for, but the text abstract of a standard text segment can be obtained; matching the text segment to be read against this text abstract therefore improves the accuracy of the text segment to be read.
It should be noted that, before acquiring the text abstract of the standard text segment with the same title as the text segment to be read in the text signing site, the user terminal may, on receiving a browsing request carrying a tag of the text to be read, first acquire from the text signing site the directory information of the standard text associated with the tag. The tag is preferably the text name of the text to be read. The user terminal may then match the directory information of the text to be read against the directory information of the standard text; more specifically, it matches the titles in the directory of the standard text against the titles in the directory of the text to be read, where a title may be the title of each chapter in the directory. This preliminary matching of the directory information provides a basis for the subsequent text segment matching. After the directory information matching is passed, that is, when the directory information of the standard text is consistent with the directory information of the text to be read, the user terminal acquires the text abstract of the standard text segment with the same title as the text segment to be read in the text signing site.
S102, matching the text segments to be read by adopting the text abstract;
specifically, the user terminal performs segmentation processing on the text abstract and the text segment to be read respectively according to a preset format, and performs matching processing according to the character string in the text abstract and the character string in the text segment to be read after the segmentation processing.
S103, outputting the text segment to be read when the matching result is larger than a preset threshold value;
specifically, when the matching result after the matching process is greater than a preset threshold, it indicates that the similarity between the text segment to be read and the text abstract is high, and the user terminal may output the text segment to be read.
In the embodiment of the invention, the text abstract of the standard text segment with the same title as the text segment to be read is acquired from the text signing site, the text abstract is adopted to carry out matching processing on the text segment to be read, and the text segment to be read is output when the matching result is greater than the preset threshold value. The text abstract of the standard text segment acquired by the text signing site is adopted to match the text segment to be read, so that the accuracy of the text segment is further improved, and the quality of text reading is further ensured.
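Steps S101 to S103 can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the names `verify_segment`, `fetch_abstract`, and `word_overlap`, the default threshold of 0.8, and the toy word-overlap scorer are all assumptions introduced here. The patent does not fix a threshold value, and its own matching procedures are the ones described for step S204.

```python
from typing import Callable, Optional


def verify_segment(
    title: str,
    segment: str,
    fetch_abstract: Callable[[str], str],
    match: Callable[[str, str], float],
    threshold: float = 0.8,  # assumed value; the patent only says "preset threshold"
) -> Optional[str]:
    """Return the segment if it matches the standard abstract closely enough, else None."""
    abstract = fetch_abstract(title)               # S101: abstract with the same title
    score = match(abstract, segment)               # S102: matching processing
    return segment if score > threshold else None  # S103: output only above the threshold


def word_overlap(abstract: str, segment: str) -> float:
    """Toy scorer (assumption): share of abstract words that also occur in the segment."""
    words = abstract.split()
    seen = set(segment.split())
    return sum(w in seen for w in words) / len(words) if words else 0.0


# Hypothetical stand-in for the text signing site: title -> abstract.
abstracts = {"Chapter 1": "the hero returns home"}
accepted = verify_segment("Chapter 1", "the hero returns home at dawn",
                          abstracts.__getitem__, word_overlap)
rejected = verify_segment("Chapter 1", "a stranger arrives",
                          abstracts.__getitem__, word_overlap)
```

Passing the scorer in as a parameter keeps the flow independent of which matching procedure is used.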
Referring to fig. 2, a schematic flow chart of another text verification method according to an embodiment of the present invention is provided. As shown in fig. 2, the method of the embodiment of the present invention includes the following steps S201 to S208.
S201, when a browsing request of a label carrying a text to be read is received, acquiring directory information of a standard text associated with the label in a text signing site;
s202, matching the directory information of the text to be read by adopting the directory information of the standard text;
Specifically, when a browsing request carrying a tag of a text to be read is received, the user terminal may acquire, from the text signing site, the directory information of the standard text associated with the tag. The tag is preferably the text name of the text to be read. The user terminal may match the directory information of the text to be read against the directory information of the standard text; more specifically, the user terminal matches the titles in the directory of the standard text against the titles in the directory of the text to be read, where a title may be the title of each chapter in the directory. This preliminary matching of the directory information provides a basis for the subsequent text segment matching.
It should be noted that the text to be read opened by the user terminal comes from a novel aggregation site, which provides users with free text reading by extracting texts from third-party sites.
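The directory matching of steps S201 and S202 might be sketched as below. The patent only requires that the two directories be "consistent"; treating that as an exact, order-preserving comparison of chapter titles after whitespace and case normalization is an assumption made for this sketch, as are the function names.

```python
from typing import Sequence


def normalize_title(title: str) -> str:
    """Drop all whitespace and lowercase, so trivial formatting differences are ignored."""
    return "".join(title.split()).lower()


def directories_match(standard_titles: Sequence[str], candidate_titles: Sequence[str]) -> bool:
    """Assumed rule: directories are consistent when the normalized chapter
    titles are identical and appear in the same order."""
    standard = [normalize_title(t) for t in standard_titles]
    candidate = [normalize_title(t) for t in candidate_titles]
    return standard == candidate


same = directories_match(["Chapter 1  Dawn", "Chapter 2 Dusk"],
                         ["chapter 1 dawn", "Chapter 2 Dusk"])
different = directories_match(["Chapter 1 Dawn", "Chapter 2 Dusk"],
                              ["Chapter 1 Dawn", "Chapter 2 Night"])
```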
S203, after the matching is passed, acquiring a text abstract of a standard text segment with the same title as the text segment to be read in the text signing site;
Specifically, after the directory information matching is passed, that is, when the directory information of the standard text is consistent with the directory information of the text to be read, the user terminal acquires, from the text signing site, the text abstract of the standard text segment that has the same title as a text segment to be read in the text to be read. It can be understood that the text to be read opened by the user terminal comes from a text aggregation site, which provides users with free text reading by extracting texts from third-party sites. The standard text segments in the text signing site must be paid for, but the text abstract of a standard text segment can be obtained; matching the text segment to be read against this text abstract therefore improves the accuracy of the text segment to be read.
S204, segmenting the text abstract and the text segment to be read respectively according to a preset format, and matching according to the character strings in the text abstract and the character strings in the text segment to be read after segmentation;
specifically, the user terminal may first perform segmentation processing on the text summary and the text segment to be read according to a preset format, for example: segmenting the text abstract and the text segment to be read into a plurality of character strings by taking preset word number as a segmentation boundary; or segmenting the text abstract and the text segment to be read into a plurality of character strings and the like according to the part of speech. The user terminal can perform matching processing according to the character strings in the text abstract after segmentation processing and the character strings in the text segment to be read.
Preferably, the matching process may be as follows. The user terminal segments the text abstract and the text segment to be read respectively according to a first preset format, and obtains a first character string in the text abstract and a second character string in the text segment to be read. It can be understood that the first character string may include at least one character string, and the second character string may also include at least one character string. The user terminal acquires character string information of the first character string and the second character string, the character string information including the sum of the character string lengths and the Levenshtein distance between the character strings. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or replacements) required to change one character string into the other. From the sum of the character string lengths and the Levenshtein distance, the user terminal obtains the Levenshtein ratio of the first character string and the second character string, and determines the Levenshtein ratio as the matching result of matching the text segment to be read against the text abstract.
Taking segmentation by part of speech as the first preset format, assume that the acquired first character string is "sun" and the second character string is "sunflower", that the sum of the character string lengths is 5, and that the Levenshtein distance between the first character string and the second character string is 2 (including deletion and replacement); these counts refer to the characters of the original Chinese strings rather than to their English renderings. The Levenshtein ratio is given by the formula (Sum - Idist)/Sum, where Sum represents the sum of the lengths of the character strings and Idist represents the Levenshtein distance between them, so the Levenshtein ratio of the first character string and the second character string is (5 - 2)/5 = 0.6. The user terminal calculates the Levenshtein ratio in this way for all the character strings in the text segment to be read and the text abstract, and determines the average of the resulting Levenshtein ratios as the matching result of matching the text segment to be read against the text abstract. Preferably, when the Levenshtein ratio of the first character string and the second character string is calculated, the two character strings may also each be converted into pinyin, and the Levenshtein ratio between the pinyin of the first character string and the pinyin of the second character string calculated in the same way. The user terminal then takes the average of the Levenshtein ratio of the character strings and the Levenshtein ratio of their pinyin as the final Levenshtein ratio between the first character string and the second character string. Matching additionally on pinyin avoids interference from homophones and further improves the accuracy of the matching.
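The Levenshtein ratio described above can be sketched as follows. The function names are illustrative, and the `to_phonetic` parameter is a hypothetical caller-supplied converter standing in for the pinyin conversion, which the patent does not specify. The strings "ab" and "xbc" are made up to reproduce the figures of the worked example (lengths summing to 5, distance 2).

```python
from typing import Callable, Optional


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and replacements."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # replace (or keep)
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """(Sum - Idist) / Sum, where Sum is the combined length of both strings."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (total - levenshtein_distance(a, b)) / total


def combined_ratio(a: str, b: str,
                   to_phonetic: Optional[Callable[[str], str]] = None) -> float:
    """Average the plain ratio with the ratio of the phonetic forms, if a converter is given."""
    direct = levenshtein_ratio(a, b)
    if to_phonetic is None:
        return direct
    return (direct + levenshtein_ratio(to_phonetic(a), to_phonetic(b))) / 2


ratio = levenshtein_ratio("ab", "xbc")  # lengths 2 + 3 = 5, distance 2
```

Averaging such ratios over all string pairs of a segment would give the overall matching result; how the strings are paired is left open by the text.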
It can be understood that the Levenshtein ratio approach is suited to cases where the character string lengths do not differ greatly. A text aggregation site, for example, usually adds advertising copy in front of the text, whereas the text abstract contains none; the advertising copy then also has to be matched, which makes the matching time-consuming. For this case, the embodiment of the present invention further provides another matching process: the user terminal segments the text abstract and the text segment to be read respectively according to a second preset format, and obtains at least one third character string in the text abstract and at least one fourth character string in the text segment to be read; it then obtains the number of fourth character strings in the text segment to be read that are the same as the at least one third character string, and determines the ratio of that number to the number of the at least one third character string as the matching result of matching the text segment to be read against the text abstract.
Taking segmentation by a preset number of characters as the second preset format, assume that a sentence in the text abstract is "where the language is finished after the graduation is returned to the country", that the corresponding sentence in the text segment to be read is "the land after returning from the country after finishing the page is recognized", and that the preset number of characters is three. The sentence in the text abstract can then be divided into six third character strings ("graduation return, industry return, after country, after land and so on"), and the sentence in the text segment to be read can be divided into eight fourth character strings ("page back after finishing, page back, land after returning, and so on"). Four third character strings of the sentence in the text abstract are the same as four fourth character strings of the sentence in the text segment to be read, so the matching result for the two sentences is 4/6 ≈ 0.67. The user terminal calculates the number ratio in this way for all the character strings in the text segment to be read and the text abstract, and determines the average of the resulting number ratios as the matching result of matching the text segment to be read against the text abstract.
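The number-ratio matching can be sketched as below. The text does not say exactly how a sentence is cut into strings of the preset length; because the example yields six strings from one sentence and eight from the other, this sketch assumes an overlapping sliding window (so an 8-character sentence gives six 3-character strings). The function names and the sample strings are made up for illustration.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """All overlapping substrings of length n (assumed segmentation scheme)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def count_ratio(abstract_sentence: str, segment_sentence: str, n: int = 3) -> float:
    """Share of the abstract's strings that also occur among the segment's strings."""
    third = char_ngrams(abstract_sentence, n)          # "third character strings"
    if not third:
        return 0.0
    fourth = set(char_ngrams(segment_sentence, n))     # "fourth character strings"
    shared = sum(1 for gram in third if gram in fourth)
    return shared / len(third)


# 8 characters -> 6 strings; 10 characters -> 8 strings; 4 strings in common.
score = count_ratio("abcdefgh", "zzabcdefzh")
```

Because the denominator counts only the abstract's strings, extra material in the segment, such as prepended advertising copy, does not by itself lower the score.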
It is to be understood that the first preset format and the second preset format may be the same preset format, and the naming is only used for distinguishing different matching processes, and similarly, the first character string, the second character string, the third character string, and the fourth character string may also be the same character string.
S205, when the matching result is larger than a preset threshold value, outputting the text segment to be read;
specifically, when the matching result after the matching process is greater than a preset threshold, it indicates that the similarity between the text segment to be read and the text abstract is high, and the user terminal may output the text segment to be read.
S206, when the matching result is smaller than or equal to a preset threshold value, at least one third-party text fragment with the same title as the text fragment to be read is obtained from at least one third-party site;
specifically, when the matching result is less than or equal to the preset threshold, it indicates that the similarity between the text segment to be read and the text abstract is not high, and the user terminal may obtain, from at least one third-party site, at least one third-party text segment having the same title as the text segment to be read.
S207, respectively calculating the similarity of each third-party text fragment in the at least one third-party text fragment by using the text abstract;
Specifically, the user terminal may acquire third-party text segments from a preset number of third-party sites, for example, from 10 third-party sites. The user terminal uses the text abstract to calculate the similarity of each of the preset number of third-party text segments; for the specific calculation, refer to the matching process described above, which is not repeated here.
S208, acquiring and outputting a third-party text segment with the similarity greater than the preset threshold and the maximum similarity;
specifically, the user terminal may obtain and output the third-party text segments with the similarity greater than the preset threshold and the maximum similarity from among the preset number of third-party text segments.
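The fallback of steps S206 to S208 can be sketched as follows: each third-party segment is scored against the text abstract, and the highest-scoring one is returned only if it exceeds the threshold. The function names and the toy scorer are assumptions for illustration; in practice the scorer would be one of the matching procedures described for step S204.

```python
from typing import Callable, Iterable, Optional


def pick_best_third_party(
    abstract: str,
    candidates: Iterable[str],
    match: Callable[[str, str], float],
    threshold: float,
) -> Optional[str]:
    """Return the candidate with the highest similarity above the threshold, or None."""
    best_text, best_score = None, threshold
    for text in candidates:
        score = match(abstract, text)   # S207: similarity against the abstract
        if score > best_score:          # S208: keep the maximum above the threshold
            best_text, best_score = text, score
    return best_text


def shared_words(abstract: str, text: str) -> float:
    """Toy scorer (assumption): fraction of distinct abstract words found in the text."""
    words = set(abstract.split())
    return len(words & set(text.split())) / len(words) if words else 0.0


candidates = ["the hero leaves", "the hero returns home again", "nothing"]
best = pick_best_third_party("the hero returns home", candidates, shared_words, 0.6)
none_found = pick_best_third_party("the hero returns home", candidates, shared_words, 1.0)
```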
In the embodiment of the invention, the text abstract of the standard text segment with the same title as the text segment to be read is acquired from the text signing site, the text segment to be read is matched against the text abstract, and the text segment to be read is output when the matching result is greater than the preset threshold. Matching the directory information of the texts provides a basis for the subsequent text segment matching; matching the text segment to be read against the text abstract of the standard text segment acquired from the text signing site improves the accuracy of the text segment; matching the text segments by means of the Levenshtein ratio and the number ratio further improves that accuracy; and judging the matching result against a preset threshold and outputting the most accurate text segment according to the judgment ensures the text reading quality.
The user terminal provided by the embodiment of the present invention will be described in detail below with reference to fig. 3 to 6. It should be noted that the user terminals shown in fig. 3 to fig. 6 are used for executing the methods of the embodiments of the present invention shown in fig. 1 and fig. 2. For convenience of description, only the parts related to the embodiment of the present invention are shown; for specific technical details that are not disclosed, refer to the embodiments of the present invention shown in fig. 1 and fig. 2.
Referring to fig. 3, a schematic structural diagram of a user terminal is provided in an embodiment of the present invention. As shown in fig. 3, the user terminal 1 according to the embodiment of the present invention may include: an abstract acquiring unit 11, a segment matching unit 12, and a segment output unit 13.
The abstract acquiring unit 11 is used for acquiring a text abstract of a standard text segment with the same title as that of a text segment to be read in a text signing site;
In a specific implementation, when a user opens a text to be read through the user terminal 1, the abstract acquiring unit 11 acquires, from the text signing site, the text abstract of the standard text segment that has the same title as a text segment to be read in that text. It can be understood that the text to be read opened by the user terminal comes from a text aggregation site, which provides users with free text reading by extracting texts from third-party sites.
It should be noted that, before the abstract acquiring unit 11 acquires the text abstract of the standard text segment with the same title as the text segment to be read in the text signing site, the user terminal 1 may, on receiving a browsing request carrying a tag of the text to be read, first acquire from the text signing site the directory information of the standard text associated with the tag. The tag is preferably the text name of the text to be read. The user terminal 1 may then match the directory information of the text to be read against the directory information of the standard text; more specifically, the user terminal 1 matches the titles in the directory of the standard text against the titles in the directory of the text to be read, where a title may be the title of each chapter in the directory. This preliminary matching of the directory information provides a basis for the subsequent text segment matching. After the directory information matching is passed, that is, when the directory information of the standard text is consistent with the directory information of the text to be read, the abstract acquiring unit 11 executes the step of acquiring the text abstract of the standard text segment with the same title as the text segment to be read in the text signing site.
The segment matching unit 12 is configured to perform matching processing on the text segment to be read by using the text abstract;
in a specific implementation, the segment matching unit 12 performs segmentation processing on the text abstract and the text segment to be read respectively according to a preset format, and performs matching processing according to a character string in the text abstract and a character string in the text segment to be read after the segmentation processing.
Specifically, please refer to fig. 4, which provides a schematic structural diagram of a segment matching unit according to an embodiment of the present invention. As shown in fig. 4, the segment matching unit 12 may include:
the first obtaining subunit 121 is configured to respectively perform segmentation processing on the text abstract and the text fragment to be read according to a first preset format, and obtain a first character string in the text abstract and a second character string in the text fragment to be read after the segmentation processing;
an information obtaining subunit 122, configured to obtain character string information of the first character string and the second character string, where the character string information includes a sum of character string lengths and a levenstein distance between character strings;
a first result determining subunit 123, configured to obtain a levenstein ratio of the first character string and the second character string according to the sum of the lengths of the character strings and the levenstein distance between the character strings, and determine the levenstein ratio as a matching result of matching the text segment to be read and the text abstract;
in a specific implementation, the segment matching unit 12 may first perform segmentation processing on the text abstract and the text segment to be read according to a preset format, for example: segmenting the text abstract and the text segment to be read into a plurality of character strings by taking preset word number as a segmentation boundary; or segmenting the text abstract and the text segment to be read into a plurality of character strings and the like according to the part of speech. The segment matching unit 12 may perform matching processing according to the character strings in the text abstract after the segmentation processing and the character strings in the text segment to be read.
Preferably, the matching processing procedure may be that the first obtaining subunit 121 separately performs segmentation processing on the text abstract and the text segment to be read according to a first preset format, and obtains a first character string in the text abstract and a second character string in the text segment to be read after the segmentation processing, it is understood that the first character string may include at least one character string, and the second character string may also include at least one character string, the information obtaining subunit 122 obtains character string information of the first character string and the second character string, the character string information includes a sum of character string lengths and a lewistan distance between character strings, the lewistan distance represents an edit number (including insertion, deletion, replacement, and the like) of a minimum single character required for one character string to become another character string, the first result determining subunit 123 obtains the levenstein ratio of the first character string and the second character string according to the sum of the lengths of the character strings and the levenstein distance between the character strings, and determines the levenstein ratio as a matching result of matching the text segment to be read and the text abstract. 
Taking the first preset format of the part of speech segmentation as an example, assuming that the acquired first character string is "sun", the second character string is "sunflower", the Sum of the string lengths is 5 and the levens distance seen by the first string and the second string is 2 (including deletion and replacement), then the resulting string length is determined according to the formula for the levens ratio (Sum-Idist)/Sum, where Sum represents the Sum of the lengths of the strings, Idist represents the Laves distance between strings, the ratio of the levens for the first character string and the second character string is 0.6, the segment matching unit 12 calculates the ratio of the levens for all the character strings in the text segment to be read and the text abstract in this way, and determining the average value of the final Levensstein ratio as a matching result of the matching processing of the text segment to be read and the text abstract. Preferably, when the levenson ratios of the first character string and the second character string are calculated, the first character string and the second character string may be converted into pinyin respectively, the levenson ratio between the pinyin of the first character string and the pinyin of the second character string is calculated in the above manner, the segment matching unit 12 obtains the levenson ratio of the first character string and the second character string and an average value of the levenson ratios between the pinyin of the first character string and the pinyin of the second character string, the average value is used as a final levenson ratio between the first character string and the second character string, and matching is further performed in a pinyin manner, so that interference of homophones on a matching process can be avoided, and meanwhile, accuracy of matching processing can be further improved.
It is understood that the Levenshtein ratio approach is suitable when the lengths of the character strings do not differ greatly. A text aggregation site, for example, usually adds some advertising expressions in front of the text, whereas the text abstract contains no such expressions; in the matching process these advertising expressions would also have to be matched, which is time-consuming. For this case, the embodiment of the present invention therefore provides another schematic structural diagram of the segment matching unit. As shown in fig. 5, the segment matching unit 12 may include:
a second obtaining subunit 124, configured to respectively perform segmentation processing on the text abstract and the text segment to be read according to a second preset format, and obtain at least one third character string in the text abstract and at least one fourth character string in the text segment to be read after the segmentation processing;
a number obtaining subunit 125, configured to obtain the number of fourth character strings in the text segment to be read, where the fourth character strings are the same as the at least one third character string;
a second result determining subunit 126, configured to determine, as a matching result of the matching processing between the text segment to be read and the text abstract, a ratio of the number of the fourth character string to the number of the at least one third character string;
in a specific implementation, in the matching process, the second obtaining subunit 124 may also perform segmentation processing on the text abstract and the text segment to be read according to a second preset format, and obtain at least one third character string in the text abstract and at least one fourth character string in the text segment to be read after the segmentation processing. The number obtaining subunit 125 obtains the number of fourth character strings in the text segment to be read that are the same as the at least one third character string, and the second result determining subunit 126 determines the ratio of that number to the number of the at least one third character string as the matching result of matching the text segment to be read against the text abstract. Taking segmentation by a preset number of words as an example of the second preset format, assume that a sentence in the text abstract is "where the language is finished after the graduation is returned to the country", that the corresponding sentence in the text segment to be read is "the land after returning from the country after finishing the page is recognized", and that the preset number of words is three. The sentence in the text abstract can then be divided into six third character strings ("graduation return, industry return, after country, after land and so on"), and the sentence in the text segment to be read can be divided into eight fourth character strings ("page back after finishing, page back, land after returning, and so on, so as to be recognized and opened"). Four of the third character strings in the sentence of the text abstract are the same as four character strings in the sentence of the text segment to be read, so the matching result for the two sentences is 4/6 ≈ 0.67.
The segment matching unit 12 calculates the number ratio of all the character strings in the text segment to be read and the text abstract in this way, and determines the average value of the final number ratio as the matching result of the matching processing of the text segment to be read and the text abstract.
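The number ratio described above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and the use of overlapping windows that advance one character at a time is an assumption, chosen because it is consistent with the worked example (six strings from one sentence and eight from the other with a preset number of three).

```python
def split_fixed(text: str, size: int = 3) -> list:
    """Split text into overlapping strings of `size` characters (step of 1 assumed)."""
    if len(text) < size:
        return [text] if text else []
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def count_ratio(abstract_sentence: str, segment_sentence: str, size: int = 3) -> float:
    """Share of the abstract's strings that also occur in the segment to be read."""
    third_strings = split_fixed(abstract_sentence, size)
    fourth_strings = set(split_fixed(segment_sentence, size))
    if not third_strings:
        return 0.0
    matched = sum(1 for s in third_strings if s in fourth_strings)
    return matched / len(third_strings)


def average_count_ratio(abstract_sentences, segment_sentences, size: int = 3) -> float:
    """Average the per-sentence ratios; pairing sentences by position is an assumption."""
    pairs = list(zip(abstract_sentences, segment_sentences))
    if not pairs:
        return 0.0
    return sum(count_ratio(a, s, size) for a, s in pairs) / len(pairs)
```

For instance, an 8-character abstract sentence `"abcdefgh"` yields six strings and a 10-character sentence `"xxcdefghyy"` yields eight, of which four coincide, so `count_ratio` returns 4/6, about 0.67. Because extra text such as a leading advertisement only adds unmatched fourth character strings, it does not lower the ratio, which is measured against the strings of the abstract.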
It is to be understood that the first preset format and the second preset format may be the same preset format, and the naming is only used for distinguishing different matching processes, and similarly, the first character string, the second character string, the third character string, and the fourth character string may also be the same character string. Meanwhile, the segment matching unit 12 may include a first obtaining subunit 121, an information obtaining subunit 122, a first result determining subunit 123, a second obtaining subunit 124, a number obtaining subunit 125, and a second result determining subunit 126 at the same time, so as to solve the matching processing procedure in different situations.
The segment output unit 13 is configured to output the text segment to be read when the matching result is greater than a preset threshold;
specifically, when the matching result after the matching processing is greater than the preset threshold, it indicates that the similarity between the text segment to be read and the text abstract is high, and the segment output unit 13 may output the text segment to be read.
In the embodiment of the invention, the text abstract of the standard text segment having the same title as the text segment to be read is acquired from the text signing site, the text segment to be read is matched against the text abstract, and the text segment to be read is output when the matching result is greater than the preset threshold. Matching the text segment to be read against the text abstract of the standard text segment acquired from the text signing site improves the accuracy of the text segment; matching the text segments using the Levenshtein ratio and the number ratio further improves the accuracy of the text segment and thereby ensures the text reading quality.
Referring to fig. 6, a schematic structural diagram of another user terminal is provided in an embodiment of the present invention. As shown in fig. 6, the user terminal 1 according to the embodiment of the present invention may include: an abstract acquiring unit 11, a segment matching unit 12, a segment output unit 13, an information acquiring unit 14, a notifying unit 15, a segment obtaining unit 16, and a calculating unit 17. For the specific structures of the abstract acquiring unit 11 and the segment matching unit 12, and for part of the structure of the segment output unit 13, refer to the description of the embodiment shown in fig. 3, which is not repeated here.
The information acquiring unit 14 is configured to acquire, when a browsing request of a tag carrying a text to be read is received, directory information of a standard text associated with the tag in a text signing site;
the notifying unit 15 is configured to match the directory information of the text to be read with the directory information of the standard text, and notify the abstract acquiring unit 11 to acquire a text abstract of a standard text segment having the same title as the text segment to be read in a text signing site after the matching is passed;
in a specific implementation, when the user terminal 1 receives a browsing request of a tag carrying a text to be read, the information obtaining unit 14 may obtain, in a text signing site, directory information of a standard text associated with the tag, where the tag is preferably a text name of the text to be read, the notifying unit 15 may match the directory information of the text to be read by using the directory information of the standard text, and further, the notifying unit 15 matches a title in the directory of the standard text with a title in the directory of the text to be read, where the title may be a title of each chapter in the directory. By carrying out primary matching on the directory information, a matching basis can be provided for the text segment matching process.
It should be noted that the text to be read opened by the user terminal 1 comes from novel aggregation sites, which provide users with free text reading by extracting the text of third-party sites. After the matching of the directory information is passed, that is, when the directory information of the standard text is consistent with the directory information of the text to be read, the notifying unit 15 notifies the abstract acquiring unit 11 to acquire the text abstract of the standard text segment having the same title as the text segment to be read in the text signing site.
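The preliminary directory check described above can be sketched as follows. This is a hedged sketch: the text says only that the directory information must be "consistent", so the interpretation used here (the same chapter titles in the same order, after whitespace normalization) and the function names are assumptions made for illustration.

```python
def normalize_title(title: str) -> str:
    """Collapse runs of whitespace so that cosmetic differences do not block a match."""
    return " ".join(title.split())


def directories_match(standard_titles, titles_to_read) -> bool:
    """True when both directories list the same chapter titles in the same order."""
    standard = [normalize_title(t) for t in standard_titles]
    to_read = [normalize_title(t) for t in titles_to_read]
    return standard == to_read
```

Only when `directories_match` returns `True` would the text abstract be requested, so that the later segment-level comparison is always made between chapters carrying the same title.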
The segment obtaining unit 16 is configured to obtain, in at least one third-party site, at least one third-party text segment having the same title as the text segment to be read when the matching result is smaller than or equal to a preset threshold;
in a specific implementation, when the matching result is less than or equal to the preset threshold, it indicates that the similarity between the text segment to be read and the text abstract is not high, and the segment obtaining unit 16 may obtain, in at least one third-party site, at least one third-party text segment having the same title as the text segment to be read, where it is understood that the third-party site may be a text aggregation site other than a currently used text aggregation site.
The calculating unit 17 is configured to calculate a similarity of each third-party text segment in the at least one third-party text segment by using the text abstract;
in a specific implementation, the calculating unit 17 may obtain the third-party text segments of a preset number of third-party sites, for example, the third-party text segments of 10 third-party sites. The calculating unit 17 calculates the similarity of each of the preset number of third-party text segments by using the text abstract; for the specific calculation process, refer to the matching process in the embodiment shown in fig. 3, which is not described again here.
The segment output unit 13 is further configured to acquire and output the third-party text segment whose similarity is greater than the preset threshold and is the maximum;
in a specific implementation, the segment output unit 13 may acquire, from among the preset number of third-party text segments, the third-party text segment whose similarity is greater than the preset threshold and is the maximum, and output it.
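The threshold decision and the third-party fallback described above can be sketched as follows. This is a minimal sketch under stated assumptions: `choose_segment` and its parameters are hypothetical names, the `similarity` callable stands in for whichever matching process is used, and returning `None` when no candidate exceeds the threshold is an assumption, since the text does not say what happens in that case.

```python
from typing import Callable, Iterable, Optional


def choose_segment(abstract: str,
                   segment_to_read: str,
                   third_party_segments: Iterable[str],
                   similarity: Callable[[str, str], float],
                   threshold: float) -> Optional[str]:
    """Return the segment to output, or None if nothing exceeds the threshold."""
    # Output the segment to be read when it matches the abstract well enough.
    if similarity(abstract, segment_to_read) > threshold:
        return segment_to_read

    # Otherwise keep the third-party segment with the highest similarity,
    # provided that similarity is strictly greater than the threshold.
    best_segment, best_score = None, threshold
    for candidate in third_party_segments:
        score = similarity(abstract, candidate)
        if score > best_score:
            best_segment, best_score = candidate, score
    return best_segment
```

Passing the similarity measure as a parameter keeps the selection logic independent of whether the Levenshtein ratio or the number ratio is used to score each candidate.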
In the embodiment of the invention, the text abstract of the standard text segment having the same title as the text segment to be read is acquired from the text signing site, the text segment to be read is matched against the text abstract, and the text segment to be read is output when the matching result is greater than the preset threshold. Matching the directory information of the texts provides a basis for the subsequent matching of text segments; matching the text segment to be read against the text abstract of the standard text segment acquired from the text signing site improves the accuracy of the text segment; matching the text segments using the Levenshtein ratio and the number ratio further improves the accuracy of the text segment; and judging the matching result against a preset threshold and outputting the most accurate text segment according to that judgment ensures the text reading quality.
Referring to fig. 7, a schematic structural diagram of another user terminal is provided in an embodiment of the present invention. As shown in fig. 7, the user terminal 1000 may include: at least one processor 1001, such as a CPU, at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the processor 1001. As shown in fig. 7, the memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a text verification application program.
In the user terminal 1000 shown in fig. 7, the network interface 1004 is mainly used for connecting a text signing site and a third party site, and performing data communication with the user terminal; the processor 1001 may be configured to call the text verification application stored in the memory 1005, and specifically perform the following steps:
acquiring a text abstract of a standard text segment with the same title as a text segment to be read in a text signing site;
matching the text segments to be read by adopting the text abstract;
and when the matching result is larger than a preset threshold value, outputting the text segment to be read.
In one embodiment, before the processor 1001 obtains the text abstract of the standard text segment having the same title as the text segment to be read in the text signing site, the following steps are further performed:
when a browsing request of a label carrying a text to be read is received, acquiring directory information of a standard text associated with the label in a text signing site;
and matching the directory information of the text to be read by adopting the directory information of the standard text, and acquiring the text abstract of the standard text segment with the same title as the text segment to be read in the text signing site after the matching is passed.
In an embodiment, when the processor 1001 performs the matching process on the text segment to be read by using the text abstract, the following steps are specifically performed:
and respectively carrying out segmentation processing on the text abstract and the text segment to be read according to a preset format, and carrying out matching processing according to the character strings in the text abstract and the character strings in the text segment to be read after the segmentation processing.
In an embodiment, when the processor 1001 performs segmentation processing on the text abstract and the text fragment to be read respectively according to a preset format, and performs matching processing according to a character string in the text abstract and a character string in the text fragment to be read after the segmentation processing, the following steps are specifically performed:
respectively carrying out segmentation processing on the text abstract and the text segment to be read according to a first preset format, and acquiring a first character string in the text abstract and a second character string in the text segment to be read after the segmentation processing;
acquiring character string information of the first character string and the second character string, wherein the character string information comprises the sum of the lengths of the character strings and the Levenshtein distance between the character strings;
and acquiring the Levenshtein ratio of the first character string and the second character string according to the sum of the lengths of the character strings and the Levenshtein distance between the character strings, and determining the Levenshtein ratio as a matching result of the matching processing of the text segment to be read and the text abstract.
In an embodiment, when the processor 1001 performs segmentation processing on the text abstract and the text fragment to be read respectively according to a preset format, and performs matching processing according to a character string in the text abstract and a character string in the text fragment to be read after the segmentation processing, the following steps are specifically performed:
segmenting the text abstract and the text segment to be read according to a second preset format, and acquiring at least one third character string in the text abstract and at least one fourth character string in the text segment to be read after segmentation;
acquiring the number of fourth character strings in the text segment to be read that are the same as the at least one third character string;
and determining the ratio of the number of the fourth character strings to the number of the at least one third character string as a matching result of the matching processing of the text segment to be read and the text abstract.
In one embodiment, the processor 1001 further performs the steps of:
when the matching result is smaller than or equal to a preset threshold value, at least one third-party text segment with the same title as the text segment to be read is obtained from at least one third-party site;
respectively calculating the similarity of each third-party text fragment in the at least one third-party text fragment by adopting the text abstract;
and acquiring and outputting the third-party text segment with the similarity greater than the preset threshold and the maximum similarity.
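Taken together, the steps performed by the processor can be sketched as one flow. The sketch below is illustrative only: the `Sources` container, its three callables, and `verify_segment` are hypothetical names standing in for requests to the text signing site and third-party sites; comparing directories by plain equality and returning `None` when the directory check fails or no candidate qualifies are assumptions, since the text does not specify those outcomes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Sources:
    """Hypothetical accessors for the text signing site and third-party sites."""
    standard_directory: Callable[[str], List[str]]               # label -> chapter titles
    standard_abstract: Callable[[str, str], str]                 # (label, title) -> abstract
    third_party_segments: Callable[[str, str], Sequence[str]]    # (label, title) -> segments


def verify_segment(label: str,
                   title: str,
                   directory_to_read: Sequence[str],
                   segment_to_read: str,
                   sources: Sources,
                   similarity: Callable[[str, str], float],
                   threshold: float) -> Optional[str]:
    # Step 1: preliminary match of the directory information.
    if sources.standard_directory(label) != list(directory_to_read):
        return None  # assumption: stop when the directories are inconsistent

    # Step 2: acquire the abstract of the standard segment with the same title.
    abstract = sources.standard_abstract(label, title)

    # Step 3: output the segment to be read if it matches well enough.
    if similarity(abstract, segment_to_read) > threshold:
        return segment_to_read

    # Step 4: otherwise fall back to the best third-party segment above the threshold.
    scored = [(similarity(abstract, c), c)
              for c in sources.third_party_segments(label, title)]
    qualifying = [pair for pair in scored if pair[0] > threshold]
    if not qualifying:
        return None
    return max(qualifying, key=lambda pair: pair[0])[1]
```

Keeping the site accessors and the similarity measure as injected callables means the flow can be exercised with in-memory stubs, without committing to any particular network API or matching process.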
In the embodiment of the invention, the text abstract of the standard text segment having the same title as the text segment to be read is acquired from the text signing site, the text segment to be read is matched against the text abstract, and the text segment to be read is output when the matching result is greater than the preset threshold. Matching the directory information of the texts provides a basis for the subsequent matching of text segments; matching the text segment to be read against the text abstract of the standard text segment acquired from the text signing site improves the accuracy of the text segment; matching the text segments using the Levenshtein ratio and the number ratio further improves the accuracy of the text segment; and judging the matching result against a preset threshold and outputting the most accurate text segment according to that judgment ensures the text reading quality.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above discloses only preferred embodiments of the present invention, which of course cannot be used to limit the scope of the claims of the present invention; equivalent changes made in accordance with the claims of the present invention therefore still fall within the scope of the invention.

Claims (10)

1. A text verification method, comprising:
when a browsing request of a label carrying a text to be read is received, acquiring directory information of a standard text associated with the label in a text signing site;
matching the directory information of the text to be read by adopting the directory information of the standard text;
after the matching is passed, namely when the directory information of the matched standard text is consistent with the directory information of the text to be read, acquiring a text abstract of the standard text segment with the same title as the text segment to be read;
matching the text segment to be read by adopting the text abstract, segmenting the text abstract and the text segment to be read respectively according to a preset format, converting character strings in the text abstract and character strings in the text segment to be read which are segmented into pinyin respectively, and matching according to the pinyin of the character strings in the text abstract and the pinyin of the character strings in the text segment to be read;
and when the matching result is larger than a preset threshold value, outputting the text segment to be read.
2. The method as claimed in claim 1, wherein the segmenting the text abstract and the text segment to be read according to a preset format, converting the character string in the text abstract and the character string in the text segment to be read after the segmenting process into pinyin respectively, and performing matching processing according to the pinyin of the character string in the text abstract and the pinyin of the character string in the text segment to be read comprises:
respectively carrying out segmentation processing on the text abstract and the text segment to be read according to a first preset format, and acquiring a first character string in the text abstract and a second character string in the text segment to be read after the segmentation processing;
respectively converting the first character string and the second character string into pinyin, and acquiring the sum of the lengths of the pinyin of the first character string and the pinyin of the second character string and the Levenshtein distance between the pinyin of the first character string and the pinyin of the second character string;
and acquiring a Levenshtein ratio between the pinyin of the first character string and the pinyin of the second character string according to the sum of the lengths of the pinyin of the first character string and the pinyin of the second character string and the Levenshtein distance between the pinyin of the first character string and the pinyin of the second character string, and determining the Levenshtein ratio as a matching result of matching the text segment to be read and the text abstract.
3. The method according to claim 1, wherein the matching the text segment to be read with the text abstract further comprises:
segmenting the text abstract and the text segment to be read according to a second preset format, and acquiring at least one third character string in the text abstract and at least one fourth character string in the text segment to be read after segmentation;
acquiring the number of fourth character strings in the text segment to be read that are the same as the at least one third character string;
and determining the ratio of the number of the fourth character strings to the number of the at least one third character string as a matching result of the matching processing of the text segment to be read and the text abstract.
4. The method of claim 1, further comprising:
when the matching result is smaller than or equal to a preset threshold value, at least one third-party text segment with the same title as the text segment to be read is obtained from at least one third-party site;
respectively calculating the similarity of each third-party text fragment in the at least one third-party text fragment by adopting the text abstract;
and acquiring and outputting the third-party text segment with the similarity greater than the preset threshold and the maximum similarity.
5. A user terminal, comprising:
the information acquisition unit is used for acquiring, in a text signing site, the directory information of a standard text associated with a label when a browsing request carrying the label of a text to be read is received;
the notification unit is used for matching the directory information of the text to be read by adopting the directory information of the standard text and notifying the abstract acquisition unit after the matching is passed, namely the directory information of the matched standard text is consistent with the directory information of the text to be read;
the abstract acquiring unit is used for acquiring a text abstract of a standard text segment with the same title as that of a text segment to be read in a text signing site;
the segment matching unit is used for respectively carrying out segmentation processing on the text abstract and the text segment to be read according to a preset format, respectively converting character strings in the text abstract and the text segment to be read after the segmentation processing into pinyin, and carrying out matching processing according to the pinyin of the character strings in the text abstract and the pinyin of the character strings in the text segment to be read;
and the segment output unit is used for outputting the text segment to be read when the matching result is greater than a preset threshold value.
6. The terminal of claim 5, wherein the segment matching unit comprises:
the first obtaining subunit is configured to respectively perform segmentation processing on the text abstract and the text segment to be read according to a first preset format, and obtain a first character string in the text abstract and a second character string in the text segment to be read after the segmentation processing;
an information obtaining subunit, configured to convert the first character string and the second character string into pinyin respectively, and obtain a sum of lengths of the pinyin of the first character string and the pinyin of the second character string and a Levenshtein distance between the pinyin of the first character string and the pinyin of the second character string;
and acquiring a Levenshtein ratio between the pinyin of the first character string and the pinyin of the second character string according to the sum of the lengths of the pinyin of the first character string and the pinyin of the second character string and the Levenshtein distance between the pinyin of the first character string and the pinyin of the second character string, and determining the Levenshtein ratio as a matching result of matching the text segment to be read and the text abstract.
7. The terminal of claim 5, wherein the segment matching unit comprises:
the second obtaining subunit is configured to respectively perform segmentation processing on the text abstract and the text segment to be read according to a second preset format, and obtain at least one third character string in the text abstract and at least one fourth character string in the text segment to be read after the segmentation processing;
the number obtaining subunit is configured to obtain the number of fourth character strings, which are the same as the at least one third character string, in the text segment to be read;
and the second result determining subunit is configured to determine, as a matching result of the matching processing between the text segment to be read and the text abstract, a ratio of the number of the fourth character string to the number of the at least one third character string.
8. The terminal of claim 5, further comprising:
the segment obtaining unit is used for obtaining at least one third-party text segment with the same title as the text segment to be read from at least one third-party site when the matching result is smaller than or equal to a preset threshold value;
the calculating unit is used for calculating the similarity of each third-party text fragment in the at least one third-party text fragment by adopting the text abstract;
the segment output unit is further configured to acquire and output a third-party text segment with a similarity greater than the preset threshold and a maximum similarity.
9. A user terminal comprising a processor and a memory;
the memory is to store a text verification application, the text verification application including program instructions;
the processor is configured to invoke the text verification application to perform the text verification method of any of claims 1-4.
10. A storage medium, characterized in that the storage medium stores a computer program comprising program instructions; the program instructions, when executed by a processor, cause the processor to perform the text verification method of any of claims 1-4.
CN201410370686.3A 2014-07-30 2014-07-30 Text verification method and user terminal Active CN105320641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410370686.3A CN105320641B (en) 2014-07-30 2014-07-30 Text verification method and user terminal

Publications (2)

Publication Number Publication Date
CN105320641A CN105320641A (en) 2016-02-10
CN105320641B true CN105320641B (en) 2020-04-03

Family

ID=55248046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410370686.3A Active CN105320641B (en) 2014-07-30 2014-07-30 Text verification method and user terminal

Country Status (1)

Country Link
CN (1) CN105320641B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319978B (en) * 2018-02-01 2021-01-22 北京捷通华声科技股份有限公司 Semantic similarity calculation method and device
CN110517050A (en) * 2019-08-12 2019-11-29 太平洋医疗健康管理有限公司 A kind of medical insurance, which instead cheats to exchange, encodes digging system and method
CN110532112B (en) * 2019-08-29 2022-10-04 维沃移动通信有限公司 Object extraction method and mobile terminal
CN111930890A (en) * 2020-07-28 2020-11-13 深圳市梦网科技发展有限公司 Information sending method and device, terminal equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1403959A (en) * 2001-09-07 2003-03-19 联想(北京)有限公司 Content filter based on text content characteristic similarity and theme correlation degree comparison
CN101826099A (en) * 2010-02-04 2010-09-08 蓝盾信息安全技术股份有限公司 Method and system for identifying similar documents and determining document diffusance
CN102999483A (en) * 2011-09-16 2013-03-27 北京百度网讯科技有限公司 Method and device for correcting text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
编辑距离算法在科研基金名称数据分析中的应用;赵胜钢 等;《数字图书馆论坛》;20140525;第55页 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant