CN108268438B

CN108268438B - Page content extraction method and device and client

Info

Publication number: CN108268438B
Application number: CN201611260567.8A
Authority: CN
Inventors: 李洋
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2021-10-22
Anticipated expiration: 2036-12-30
Also published as: CN108268438A

Abstract

The invention provides a page content extraction method, a device, and a client. The method comprises the following steps: acquiring a selected area in the page; identifying the characters in the selected area one by one, acquiring a meta sentence containing the characters, and splitting the meta sentence to obtain alternative words; sorting the alternative words by using at least one attribute of the alternative words to obtain a sorting result; and selecting the alternative word with the highest ranking in the sorting result as a target alternative word, and extracting the target alternative word. By splitting the meta sentence and sorting the alternative words by their attributes, the page content selected on the client can be extracted quickly and effectively, the extracted content is more accurate, the user does not need to adjust the selection manually, time is saved, and the user experience is improved.

Description

Page content extraction method and device and client

Technical Field

The invention relates to the technical field of internet, in particular to a page content extraction method, a page content extraction device and a client.

Background

With the rapid development of the mobile internet, people's daily life has become closely connected with the internet. The internet generates massive amounts of data, has become a main source of information, and has permeated a wide range of fields.

There is a growing demand for information analysis and processing. When reading web page text on a client device, users often need to copy text for other operations, such as searching or pasting it into a dialog box for further editing. As requirements on the accuracy and timeliness of information analysis rise, users want to complete text copying efficiently and accurately.

In the prior art, when a user selects and copies text, marking is slow in some cases, so the operation takes a long time. In other cases the content to be copied is not within the default selection range and cannot be selected correctly, which makes for a poor user experience. In still other cases the selection cursor has to be adjusted several times, and even then the word the user wants cannot be copied correctly, so the operation efficiency is low.

Disclosure of Invention

In order to solve the technical problem, the invention provides a page content extraction method, a page content extraction device and a client.

In a first aspect, a method for extracting page content is provided, where the method includes: acquiring a selected area in the page; identifying the characters in the selected area one by one, acquiring a meta sentence containing the characters, and splitting the meta sentence to obtain alternative words; sorting the alternative words by using at least one attribute of the alternative words to obtain a sorting result; and selecting target alternative words according to the sorting result, and extracting the target alternative words.

In a second aspect, an apparatus for extracting page content is provided, the apparatus including: an area acquisition module, used for acquiring the selected area in the page; an alternative word generation module, used for identifying the characters in the selected area one by one, acquiring a meta sentence containing the characters, and splitting the meta sentence into alternative words; an attribute sorting module, used for sorting the alternative words according to the attributes of the alternative words to obtain a sorting result; and a page content extraction module, used for selecting target alternative words according to the sorting result and extracting the target alternative words.

In a third aspect, a client is provided, where the client includes the foregoing page content extracting apparatus, and is installed in a user terminal, and is used to extract page content according to an input of a user.

The technical scheme provided by the embodiment of the invention has the following beneficial effects: the content of the user selected area can be extracted quickly and accurately based on the separation of the meta sentence into the alternative words and the sequencing of the alternative words by using at least one attribute of the alternative words, so that the user can conveniently copy, search and the like, and the user experience is greatly improved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic view of an application scenario provided in an embodiment of the present invention;

FIG. 2 is a flowchart of a method for extracting page content according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for extracting page content according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for extracting page content according to an embodiment of the present invention;

FIG. 5 is a flowchart of a method for extracting page content according to an embodiment of the present invention;

FIG. 6 is a flowchart of a method for extracting page content according to an embodiment of the present invention;

FIG. 7 is a flowchart of a method for extracting page content according to an embodiment of the present invention;

FIG. 8 is a flowchart of a method for extracting page content according to an embodiment of the present invention;

FIG. 9 is a schematic block diagram of a page content extracting device according to an embodiment of the present invention;

FIG. 10 is a schematic block diagram of a page content extracting device according to an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, a schematic structural diagram of an implementation environment related to the page content extraction method provided by an embodiment of the present invention is shown. The implementation environment comprises user equipment 101. The user equipment 101 displays the page whose content is to be extracted, the user selects page content, and the user equipment displays the selected content according to the user's selection.

In an embodiment of the present invention, a method for extracting page content is provided, as shown in fig. 2, the method includes:

S210, acquiring the selected area in the page.

Specifically, the client obtains a selected area of the user operation in the page through the man-machine interface. For example, the selected area may be an area selected on the touch interface by a user pressing with a finger. For example, the selected region may also be a region selected by swiping or clicking in the interface using an input tool such as a stylus.

S220, identifying the characters in the selected area one by one, obtaining a meta sentence containing the characters, and splitting the meta sentence into alternative words.

Specifically, the client identifies the characters contained in the selected area. These may be complete characters wholly contained in the selected area or incomplete characters only partially contained in it. The selected area is the area formed on the user interface when the user presses, touches, slides, and so on. If a character lies entirely inside the selected area, it is complete relative to the selected area. If a character sits on the boundary of the selected area, so that part of it is inside and part of it is outside, it is incomplete relative to the selected area. Both complete and incomplete characters are recognized as characters in the selected area, and during recognition they are distinguished by identification bits.

In one example, flag bit 1 indicates a complete character that is wholly contained in the selected region, and flag bit 0 indicates an incomplete character.

In another example, the completeness of a character is represented by a quantized flag value: the flag value is 1 for a complete character wholly contained in the selected region and X for an incomplete character, where X is a value between 0 and 1 that represents the fraction of the character's area lying inside the selected region.
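The quantized flag value can be illustrated with a minimal sketch. It assumes, purely for illustration, that both the character's bounding box and the selected region are axis-aligned rectangles; the `Rect` type and the `flag_value` function are hypothetical names that do not appear in the embodiment, and a real selected region (for example a fingertip contact patch) may have a different shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle (hypothetical helper, not from the patent)."""
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        # Clamp to zero so an "inverted" rectangle counts as empty.
        width = max(0.0, self.right - self.left)
        height = max(0.0, self.bottom - self.top)
        return width * height


def intersection(a: Rect, b: Rect) -> Rect:
    """Overlap of two rectangles; it has zero area when they do not meet."""
    return Rect(max(a.left, b.left), max(a.top, b.top),
                min(a.right, b.right), min(a.bottom, b.bottom))


def flag_value(char_box: Rect, selection: Rect) -> float:
    """Fraction of the character's area inside the selection (1.0 = complete)."""
    if char_box.area() == 0:
        return 0.0
    return intersection(char_box, selection).area() / char_box.area()
```

Under these assumptions a character wholly inside the region gets 1.0, a character outside it gets 0.0, and a character straddling the boundary gets a value in between.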

In one example, a meta sentence containing the characters is obtained by retrieval at the corresponding location in the page content. A meta sentence is the character string in which a character is located, delimited by the adjacent punctuation marks. Take the page content "AAAAAA, BBBBBBBBB, CCCCCCCCCC, DDDDDDDD, EEE, FF; G, HHHHHHH; IIIIIIIIII" as an example. The meta sentences it contains are "AAAAAA", "BBBBBBBBB", "CCCCCCCCCC", "DDDDDDDD", "EEE", "FF", "G", "HHHHHHH", and "IIIIIIIIII", respectively. Here A, B, C, D, E, F, G, H, and I denote the characters in each meta sentence; the characters may be the same or different.
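As a rough illustration of punctuation-delimited lookup, the sketch below expands outward from a character position until it meets punctuation on each side. The punctuation set and the function name `meta_sentence_at` are assumptions made for this example; the embodiment does not specify which marks count as boundaries.

```python
# Punctuation treated as meta-sentence boundaries (an assumed set).
PUNCTUATION = set(",.;:!?，。；：！？、")


def meta_sentence_at(page_text: str, index: int) -> str:
    """Return the punctuation-delimited string containing page_text[index]."""
    if page_text[index] in PUNCTUATION:
        return ""  # a punctuation mark belongs to no meta sentence
    start = index
    while start > 0 and page_text[start - 1] not in PUNCTUATION:
        start -= 1
    end = index
    while end < len(page_text) and page_text[end] not in PUNCTUATION:
        end += 1
    # Drop the spaces that follow punctuation in Latin-script text.
    return page_text[start:end].strip()


page = "AAAAAA, BBBBBBBBB, CCCCCCCCCC, DDDDDDDD, EEE, FF; G, HHHHHHH; IIIIIIIIII"
print(meta_sentence_at(page, page.index("F")))
```

Calling the function once per selected character yields one meta sentence per character, which is why repeated results can appear when several selected characters share a sentence.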

Specifically, the client splits the meta sentence into alternative words using a word segmentation technique, which may be one from the prior art or the improved technique of this embodiment. A word is the smallest meaningful language component that can be used independently. English words are naturally delimited by spaces, whereas Chinese is written character by character with no explicit marker between words, so word segmentation is the basis and key of Chinese information processing. A mature word segmentation technique is therefore required when processing Chinese text. The client in this embodiment splits each meta sentence (target sentence) with such a technique, producing one alternative phrase (group) per meta sentence, each containing multiple alternative words.

In one example, splitting into alternative words includes: setting the maximum granularity of the split alternative words, where the granularity is the number of characters contained in an alternative word; reading a continuous character string in the meta sentence; matching the continuous character string against a preset word list from left to right; when a substring of a first length matches the preset word list, judging whether the substring of the first length plus 1 also matches the preset word list; if not, taking the substring of the first length as an alternative word, cutting it from the continuous character string, and continuing to match with the remainder; and if so, updating the first length to the first length plus 1 and again judging whether the substring of the new first length plus 1 matches the preset word list.

In another example, splitting into alternative words includes the same steps, except that the continuous character string is matched against the preset word list from right to left: setting the maximum granularity; reading a continuous character string in the meta sentence; matching from right to left; when a substring of a first length matches the preset word list, judging whether the substring of the first length plus 1 also matches; if not, taking the substring of the first length as an alternative word, cutting it from the continuous character string, and continuing to match with the remainder; and if so, updating the first length to the first length plus 1 and continuing the judgment.
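The incremental matching described in the two examples above can be sketched as follows. This is only one reading of the steps: it assumes that a single character always counts as a match (so the procedure can always make progress), and the function name, the `right_to_left` flag, and the toy word list are all hypothetical.

```python
def split_incremental(text, vocab, max_granularity, right_to_left=False):
    """Grow each word one character at a time while the longer string is listed.

    Assumption: a single character is always accepted as a word, even if it
    is not in the preset word list.
    """
    # For right-to-left matching, work on the reversed string and undo later.
    s = text[::-1] if right_to_left else text
    words = []
    i = 0
    while i < len(s):
        length = 1
        while length < max_granularity and i + length < len(s):
            longer = s[i:i + length + 1]
            if right_to_left:
                longer = longer[::-1]
            if longer not in vocab:
                break  # "first length plus 1" does not match: stop growing
            length += 1  # it matches: the longer string becomes the first length
        word = s[i:i + length]
        words.append(word[::-1] if right_to_left else word)
        i += length  # cut the word off and continue with the remainder
    return words[::-1] if right_to_left else words


toy_vocab = {"ab", "abc", "de", "cde"}
print(split_incremental("abcde", toy_vocab, 4))                      # left to right
print(split_incremental("abcde", toy_vocab, 4, right_to_left=True))  # right to left
```

With this toy word list the two directions cut the same string differently, which is the situation the selection rule in the next example is meant to resolve.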

In another example, both splitting processes above are performed, and the output is chosen between the two results according to the maximum granularity principle and the minimum number of split words principle. For example, for the meta sentence "we play in a wild zoo", matching from left to right yields "we / in a wild / live / animal / zoo / play", while matching from right to left yields "we / in / wild zoo / play"; the latter is selected as the output according to the maximum granularity principle.
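The following sketch reproduces the example under several assumptions. It uses conventional maximum matching (try the longest window first, then shrink) rather than the incremental extension described earlier, because that variant is easier to check by hand. The Chinese sentence and the word list are guesses at what the translated example stands for and are not taken from the source, and the tie-breaking order (granularity first, then word count) is likewise an assumption.

```python
def forward_max_match(text, vocab, max_len):
    """Left-to-right maximum matching; single characters are the fallback."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Right-to-left maximum matching; single characters are the fallback."""
    words, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in vocab:
                words.append(piece)
                end -= size
                break
    return words[::-1]


def pick_segmentation(first, second):
    """Prefer the larger maximum granularity, then the smaller word count."""
    def key(seg):
        return (max(len(w) for w in seg), -len(seg))
    return max(first, second, key=key)


# Hypothetical reconstruction of "we play in a wild zoo" and its word list.
sentence = "我们在野生动物园玩"
vocab = {"我们", "在野", "动物", "野生动物园"}
forward = forward_max_match(sentence, vocab, 5)
backward = backward_max_match(sentence, vocab, 5)
print("/".join(forward))
print("/".join(backward))
print("/".join(pick_segmentation(forward, backward)))
```

Here the right-to-left result wins on both criteria: its longest word has five characters and it contains four words instead of six.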

Therefore, the alternative word splitting method of the embodiment can improve the accuracy of alternative word selection, so that the accuracy of page content extraction is improved.

S230, sorting the alternative words according to the attributes of the alternative words to obtain a sorting result.

The alternative words comprise various attributes, such as the use heat of the alternative words, the part of speech of the alternative words, the number of characters contained in the alternative words and the like, and the attributes of the words can be used for distinguishing and ordering the content selected by the user according to the importance, so that the content of the page selected by the user can be more easily identified and extracted.

In this embodiment, the split candidate words are sorted by using the completeness of the characters in the candidate words, the heat of the candidate words, and the part of speech of the candidate words, which are hereinafter referred to as the first attribute, the second attribute, and the third attribute, respectively.

When the alternative words are sorted by their attributes, the first sort uses the completeness (integrity) attribute value of each alternative word. The completeness attribute reflects where the user pressed when selecting the page content and is the most important index for extracting the page content. As mentioned above, a quantized flag value can represent the completeness of a character: the flag value is 1 for a complete character wholly contained in the selected region and X for an incomplete character, where X is a value between 0 and 1 that represents the fraction of the character's area lying inside the selected region. The completeness attribute value of a word is then the average of the completeness values of its characters. For example, if the completeness values of the five characters in the alternative word "wildlife zoo" are X1, X2, X3, X4, and X5, respectively, then the completeness of the word is (X1 + X2 + X3 + X4 + X5) divided by 5. The completeness formula is summarized as follows:

$$C = \frac{1}{n}\sum_{i=1}^{n} X_i$$

where i is the serial number of a character in the alternative word, n is the number of characters in the alternative word, and X_i is the completeness of the i-th character.

In the above example, if the values of X1 to X5 are 0.6, 1, 1, 1, and 0.8, respectively, then the completeness is:

$$C = \frac{0.6 + 1 + 1 + 1 + 0.8}{5} = 0.88$$

After the alternative words are ranked, the method further comprises judging whether the completeness of each alternative word is greater than a first preset threshold. A completeness that is too low indicates that the word deviates from the center of the user's selection area, so the threshold screens out words the user did not intend to select. In one example, the first preset threshold is set to 50%: if any of the candidate words has a completeness greater than 50%, the client stores those candidate words in a first candidate phrase (group), and the candidate words in the first candidate phrase are the objects of the second ranking.
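A minimal sketch of the completeness average and the threshold screening follows. The function names and the sample words are hypothetical; the 0.5 default stands in for the 50% threshold mentioned above.

```python
def completeness(flag_values):
    """Average of the per-character flag values of one candidate word."""
    return sum(flag_values) / len(flag_values)


def first_candidate_group(candidates, threshold=0.5):
    """Rank words by completeness and keep those above the threshold.

    candidates maps each word to the flag values of its characters.
    """
    scored = [(word, completeness(flags)) for word, flags in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(word, score) for word, score in scored if score > threshold]


sample = {
    "word_a": [0.6, 1, 1, 1, 0.8],  # the worked example: (0.6+1+1+1+0.8)/5
    "word_b": [0.2, 0.4],           # mostly outside the selection
    "word_c": [1, 1],               # wholly inside the selection
}
for word, score in first_candidate_group(sample):
    print(word, round(score, 2))
```

Only the words that survive the threshold go on to the second ranking by heat or part of speech.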

In one example, the client obtains a completeness ranking result of the candidate words, and takes the candidate word with the highest ranking as the target candidate word.

In one example, the client uses the heat and the part of speech of the candidate words to reorder the first candidate phrase and obtain the sorting result. The candidate words in the sorting result have priorities, and the word with the highest priority is the target word. The heat of a candidate word is the number of times it has been searched in a hot word service, that is, a service related to hot words such as a search engine or an input method. The part of speech of a candidate word is the grammatical category to which the word belongs.

S240, selecting target alternative words according to the sorting result, and extracting the target alternative words.

Specifically, every alternative word in the sorting result has a position in the order, and one or several alternative words with higher completeness that also satisfy the heat and part-of-speech criteria can be selected as target alternative words according to that order.

In one example, one target alternative word is selected according to the sorting result, the target alternative word is highlighted in the page, and the target alternative word is copied, so that the user can perform target word correlation operation on the target word copied by the client, such as pasting to a chat dialog box for editing, or performing correlation retrieval on the copied target word.

In one example, the target alternative words selected by the sorting result are multiple, the multiple target alternative words are highlighted in the page, and the user is waited for selection operation; the client copies the target candidate word according to the user selection, and the user can perform target word related operation on the target word copied by the client, such as pasting the target word to a chat dialog box for editing, or performing related retrieval on the copied target word.

The client highlights the target alternative words in the text; the highlighted words are the user's target words, and the client then copies the target words.

In summary, the embodiment provides a method for extracting page content selected by a client quickly and effectively by splitting a meta sentence and sorting by using alternative word attributes, the extracted content is more accurate, the user is prevented from manually adjusting after selecting, time is saved, and user experience is improved.

Referring to fig. 3, the present embodiment provides a method for extracting page content, which includes the following steps:

S310, acquiring the selected area in the page.

For example, when a user browsing a web page on a mobile phone client needs to copy text, the user operates on the touch screen of the mobile phone client, and the contact between the finger and the touch screen yields an annular selection area. As shown in fig. 3, the annular area is the selection area in the text.

S320, identifying the characters in the selected area, acquiring all meta sentences corresponding to the characters, and deleting repeated meta sentences to obtain the target sentences.

Step S320 includes the following substeps:

S3201, identifying the characters in the selected area. Referring to fig. 5, this step includes:

S32011, identifying the complete characters in the selected area and adding a complete character identification bit for each complete character.

S32012, identifying the incomplete characters in the selected area and adding an incomplete character identification bit for each incomplete character.

In step S320, the client identifies all the characters in the selected area through a character acquisition technique. Referring to fig. 4, the characters belonging to the selected area include:

[ two, ten, country, college, collection, go, color ]

Here, "ten" is a complete character in the selected area, while "two", "country", "set", "go", and "color" are incomplete characters in the selected area. A character identification bit is added for each character to indicate whether it is complete, or how complete it is.

S3202, acquiring all the meta sentences corresponding to the characters. Referring to fig. 6, this step includes:

S32021, retrieving the characters in the page content to obtain the meta sentences corresponding to each character in the selected area.

S32022, querying the meta sentences to determine whether any of them is repeated.

S32023, if so, deleting the repeated meta sentences.

Specifically, the client determines the meta sentence to which each character belongs, using the punctuation marks between sentences as boundaries, and identifies in turn the meta sentences corresponding to all characters in the selected area. The client then removes repeated meta sentences through deduplication. For example, still referring to fig. 4, the meta sentence corresponding to "two" is "The Antalya summit of the Group of Twenty leaders was held very successfully", and the meta sentence corresponding to "ten" is the same sentence. Because the two are identical, only one copy is kept and the other identical sentences are deleted. Identifying and deduplicating in this way finally yields the meta sentences corresponding to the characters, namely the target sentences:

"The Antalya summit of the Group of Twenty leaders was held very successfully."

"Thanks again to Turkey for its excellent work and positive results as last year's chair country."
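Steps S32021 to S32023 amount to an order-preserving deduplication, which can be sketched as below. The function name and the placeholder sentences are hypothetical; the sketch relies on Python dictionaries keeping insertion order.

```python
def target_sentences(char_sentence_pairs):
    """Collapse per-character meta sentences into unique target sentences.

    char_sentence_pairs is a list of (character, meta sentence) pairs in the
    order the characters were recognized; the first occurrence of each
    sentence is kept and later repeats are dropped.
    """
    return list(dict.fromkeys(sentence for _, sentence in char_sentence_pairs))


pairs = [
    ("two", "first sentence"),
    ("ten", "first sentence"),      # same meta sentence as "two"
    ("country", "first sentence"),
    ("go", "second sentence"),
    ("color", "second sentence"),
]
print(target_sentences(pairs))
```

Keeping the first occurrence preserves the reading order of the page, which may matter if later steps treat earlier sentences differently.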

S3203, splitting the meta sentence to obtain alternative words. Referring to fig. 7, this step includes the following sub-steps:

S32031, reading the continuous character strings in the meta sentence;

S32032, matching the continuous character strings with a preset word list in order from left to right;

S32033, when the character string of the first length in the continuous character string matches the preset word list, judging whether the character string of the first length plus 1 matches the preset word list;

S32034, if not, taking the character string of the first length as an alternative word, cutting the character string of the first length from the continuous character string, and continuing to match with the remaining continuous character string;

S32035, if so, updating the first length to the first length plus 1, and continuing to judge whether the character string of the first length plus 1 matches the preset word list.

Specifically, the sentences selected in fig. 4 are split as follows.

"The Antalya summit of the Group of Twenty leaders was held very successfully." is split into:

[twenty nations, group, leaders, Antalya, summit, held, (particle), very successful]

"Thanks again to Turkey for its excellent work and positive results as last year's chair country." is split into:

[again, thank, last year, chair, country, Turkey, excellent, work, and, obtain, positive, results]

S330, sequencing the alternative words by using at least one attribute of the alternative words to obtain a sequencing result. The alternative words comprise various attributes, such as the use heat of the alternative words, the part of speech of the alternative words, the number of characters contained in the alternative words and the like, and the attributes of the words can be used for distinguishing and ordering the content selected by the user according to the importance, so that the content of the page selected by the user can be more easily identified and extracted.

In one example, the alternative words are ranked by a single attribute to obtain the ranking result. For example, the alternative words may be sorted by the completeness of their characters, because the completeness of each character is obtained when the characters are acquired. According to the formula:

$$C = \frac{1}{n}\sum_{i=1}^{n} X_i$$

where i is the serial number of a character in the alternative word, n is the number of characters in the alternative word, and X_i is the completeness of the i-th character, the completeness value of each alternative word can be obtained, and the alternative words can be sorted by these values.

In one example, the alternative words are ranked by a single attribute, here the heat attribute. The heat of an alternative word can be looked up from the heat labels of hot words in a word stock; these labels are derived from big data collected from internet search engines or instant messaging tools. For example, if the heat values of "roast duck", "park", and "motor home" are 3.7 million, 1.5 million, and 80 searches respectively, the heat ranking is "roast duck, park, motor home".

In one example, the two or three attributes of the alternative words are used for sorting, and the process comprises firstly sorting by using the first attribute and then correcting the sorting result by using the second attribute and/or the third attribute. Specifically, step S330 may include the following sub-steps:

S3301, prioritizing the multiple candidate words according to their first attribute values to obtain a first ordering result.

S3302, judging whether the first attribute value of each candidate word is greater than a first preset threshold, and if so, storing the candidate word in a first candidate phrase;

S3303, re-ordering the candidate words in the first candidate phrase according to their second attribute values or third attribute values to obtain the ordering result.

Here the first attribute is the completeness of the alternative word, the second attribute is its heat, and the third attribute is its part of speech. The alternative words are first sorted by completeness; the completeness is then compared with a preset threshold, and the alternative words whose completeness exceeds the threshold form the first alternative phrase, which is then sorted by heat. If a unique candidate word still cannot be determined after sorting by heat, the candidate words are sorted by part of speech.

Certainly, the three attributes mentioned above are not limited, the character length and the like included in the candidate words may also be used to participate in the sorting, and the attribute sequence of the candidate words may also be arranged and combined, for example, the first attribute may be selected as the heat of the candidate words, and the sorting is performed through the heat, which is beneficial to directly selecting the network hot words and improving the efficiency and accuracy of extracting the content.

In one example, referring to fig. 8, step S3303 may further include the following sub-steps:

S33031, obtaining the second attribute values of the candidate words in the first candidate phrase, and comparing each second attribute value with a second preset threshold;

S33032, if there is a candidate word whose second attribute value is greater than the second preset threshold, re-ordering the candidate words in the first candidate phrase according to their second attribute values;

S33033, if there is no candidate word whose second attribute value is greater than the second preset threshold, re-ordering the candidate words in the first candidate phrase according to their third attribute.

Specifically, if all the candidate words in the first sorting result are not hot words or are not suitable for being used as a sorting basis after being judged one by one, the candidate words in the first candidate word group are sorted again according to the third attribute of the candidate words to obtain a sorting result.

The third attribute includes a part of speech of the alternative word, specifically, for the part of speech of the alternative word: through statistics of mass user copying behaviors, the user has higher copying probability of nouns, adjectives and verbs, wherein the nouns are the highest; therefore, the order of sorting the alternative word phrases is as follows:

noun > adjective > verb > other words

The other words include numbers, quantifiers, pronouns and the like, and because the possibility that the words of other parts of speech are taken as the default copied content of the user is very small, the other words can not be distinguished.

For example, in fig. 3, if "twenty nations" is found to have been searched 10,000 times, it is a hot word with a heat value of 10,000; if "excellent" is looked up in the hot word library and found to be a hot word that has been searched 5,000 times, its heat value is 5,000. Sorting the two by heat value ranks "twenty nations" above "excellent". However, if the preset heat threshold is higher than 10,000, the heat value is not used as the basis for sorting, and the part of speech is used instead.
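The staged ranking of steps S3301 to S3303 and S33031 to S33033 can be sketched as follows. The `Candidate` type, the `rank` function, the completeness values, and the third candidate "obtain" with a heat of 9,000 are hypothetical additions for illustration; only the heat values 10,000 and 5,000 come from the example above.

```python
from dataclasses import dataclass

# Part-of-speech priority: noun > adjective > verb > other words.
POS_RANK = {"noun": 0, "adjective": 1, "verb": 2}


@dataclass
class Candidate:
    word: str
    completeness: float  # first attribute
    heat: int            # second attribute
    pos: str             # third attribute


def rank(candidates, completeness_threshold, heat_threshold):
    # S3301: first ordering by completeness, highest first.
    first = sorted(candidates, key=lambda c: c.completeness, reverse=True)
    # S3302: keep only words above the first preset threshold.
    group = [c for c in first if c.completeness > completeness_threshold]
    # S33031/S33032: if any word is hot enough, re-order the group by heat.
    if any(c.heat > heat_threshold for c in group):
        return sorted(group, key=lambda c: c.heat, reverse=True)
    # S33033: otherwise fall back to part of speech. sorted() is stable, so
    # words with the same part of speech keep their completeness order.
    return sorted(group, key=lambda c: POS_RANK.get(c.pos, len(POS_RANK)))


words = [
    Candidate("twenty nations", 0.9, 10_000, "noun"),
    Candidate("excellent", 0.8, 5_000, "adjective"),
    Candidate("obtain", 0.7, 9_000, "verb"),   # hypothetical extra word
    Candidate("and", 0.3, 50_000, "other"),    # filtered out by completeness
]
print([c.word for c in rank(words, 0.5, 8_000)])   # heat decides
print([c.word for c in rank(words, 0.5, 20_000)])  # part of speech decides
```

In both calls "twenty nations" ends up first, as in the example, while the order of the remaining words shows which attribute drove the second sort.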

S340, selecting target alternative words according to the sorting result, and extracting the target alternative words.

According to the sorting result, the client takes the highest-ranked alternative word as the target alternative word and extracts it. Specifically, the extraction may include preparing the candidate word and copying it into memory in advance.

Specifically, the client marks the word ranked first in the second sorting result so that it stands out in the text; the marked word is the user's target candidate word, and the client further copies the target candidate word. The marking may use highlighting, color emphasis, or shape emphasis, among others. Highlighting means changing the background color of the target candidate word so that the area where the word is located is displayed in a highlighted manner; color emphasis means changing the color of the word so that it stands out from other words; shape emphasis means changing the font of the alternative word or the shape of the area where the alternative word is located.

In summary, the page content extraction method provided by this embodiment can greatly improve the efficiency and accuracy of extracting content by adopting multi-attribute sorting and screening. For example, after the candidate words are sorted by the completeness of the candidate words, the candidate words in the first sorting result are further identified and judged, and the popularity of the candidate words or the part of speech of the candidate words is selected for re-sorting, so that the operation target of the user is copied more efficiently.

Referring to fig. 9, the present embodiment provides a page content extracting apparatus, including:

the area acquisition module executes the step S210 to acquire the selected area in the page;

the alternative word generation module executes step S220, and is configured to recognize the characters in the selected area one by one, obtain a meta sentence containing the characters, and split the meta sentence into alternative words;

the attribute sorting module executes the step S230, and is configured to sort the candidate words according to the multiple attributes of the candidate words to obtain a sorting result;

and the page content extraction module executes step S240, and is configured to take the candidate word ranked highest in the sorting result as the target candidate word and extract the target candidate word.

Referring to fig. 10, the present embodiment provides a page content extracting apparatus, including:

and the area acquisition module executes the step S310 to acquire the selected area in the page.

And the alternative word generation module executes step S320, and is configured to identify the characters in the selected area, obtain all sentences corresponding to the characters, and delete repeated sentences in all sentences to obtain the target sentence.

The alternative word generation module comprises the following sub-modules:

and the character recognition submodule executes the step S3201 and is used for recognizing the characters in the selected area.

The character recognition submodule includes:

The complete character recognition sub-module executes step S32011, and is configured to recognize the complete characters in the selected area and add a complete-character identification bit to each complete character.

The incomplete character recognition sub-module executes step S32012, and is configured to recognize the incomplete characters in the selected area and add an incomplete-character identification bit to each incomplete character.
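As a minimal sketch only, the tagging of complete and incomplete characters could be modeled by comparing each character's bounding box with the selected area. The `Box` type, the overlap-ratio rule, and the function names below are assumptions introduced for illustration; the embodiment does not prescribe how coverage is measured.

```python
from dataclasses import dataclass


@dataclass
class Box:
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        # A box with no extent in either dimension has zero area.
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def overlap_ratio(char_box: Box, selection: Box) -> float:
    """Fraction of the character's box that lies inside the selection."""
    inter = Box(
        max(char_box.left, selection.left),
        max(char_box.top, selection.top),
        min(char_box.right, selection.right),
        min(char_box.bottom, selection.bottom),
    )
    return inter.area() / char_box.area() if char_box.area() else 0.0


def tag_characters(chars, selection):
    """Return (char, is_complete, ratio) for each character the selection touches.

    is_complete plays the role of the identification bit: True for a complete
    character, False for an incomplete one (assumed rule: fully covered or not).
    """
    tagged = []
    for ch, box in chars:
        ratio = overlap_ratio(box, selection)
        if ratio > 0:
            tagged.append((ch, ratio >= 1.0, ratio))
    return tagged


# Three 10x10 characters in a row; the selection cuts "A" in half and misses "C".
chars = [("A", Box(0, 0, 10, 10)), ("B", Box(10, 0, 20, 10)), ("C", Box(20, 0, 30, 10))]
tagged = tag_characters(chars, Box(5, 0, 20, 10))
print(tagged)
```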

And the meta sentence acquisition submodule executes the step S3202 and is configured to acquire all meta sentences corresponding to the characters.

The meta sentence acquisition submodule comprises the following submodules:

and the meta sentence retrieval sub-module executes step S32021 to retrieve the characters in the page content to obtain a plurality of meta sentences corresponding to each character in the selected area.

The query submodule performs step S32022 to query the multiple meta-sentences to determine whether there are repeated meta-sentences in the multiple meta-sentences.

And a deduplication sub-module executes step S32023, and is configured to delete the repeated meta sentence when a repeated meta sentence exists.
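For illustration only, the retrieval, query, and deduplication sub-modules can be sketched together as follows. The sketch assumes that a meta sentence is the run of text between two punctuation marks and that each selected character is identified by its offset in the page text; both assumptions, like the function names, are introduced here and are not stated by the embodiment.

```python
import re

# Assumed delimiters of a meta sentence (ASCII and full-width punctuation).
PUNCTUATION = re.compile(r"[,.;!?，。；！？]")


def meta_sentence_spans(page_text):
    """Return (start, end) spans of the text runs between punctuation marks."""
    spans = []
    start = 0
    for match in PUNCTUATION.finditer(page_text):
        if match.start() > start:
            spans.append((start, match.start()))
        start = match.end()
    if start < len(page_text):
        spans.append((start, len(page_text)))
    return spans


def meta_sentences_for(offsets, page_text):
    """Look up the meta sentence of each selected character, then drop repeats."""
    spans = meta_sentence_spans(page_text)
    found = []
    for offset in offsets:
        for start, end in spans:
            if start <= offset < end:
                found.append(page_text[start:end].strip())
                break
    # dict.fromkeys keeps the first occurrence of each sentence, in order.
    return list(dict.fromkeys(found))


page_text = "The summit opened today. Leaders praised the excellent venue."
# Offsets 20 and 21 fall in the first sentence, 25 and 26 in the second.
sentences = meta_sentences_for([20, 21, 25, 26], page_text)
print(sentences)
```

Each selected character yields one meta sentence, so several characters in the same sentence produce duplicates; the last step removes them while preserving reading order.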

The meta sentence splitting sub-module executes step S3203, and is configured to split the meta sentence to obtain the alternative words. It comprises the following sub-modules:

a character string reading submodule for executing step S32031 to read a continuous character string in the meta sentence;

the matching submodule executes the step S32032, and matches the continuous character strings with a preset word list according to the sequence from left to right;

a first matching judgment sub-module, executing step S32033, when the character string of the first length in the continuous character string matches the preset word list, judging whether the character string of the first length plus 1 length matches the preset word list;

a first logic judgment sub-module executing step S32034, configured to, when the judgment result of the first matching judgment sub-module is negative, take the character string of the first length as an alternative word, cut the character string of the first length from the continuous character string, and continue matching using the cut continuous character string;

and the second logic judgment sub-module executes step S32035, and is configured to, when the judgment result of the first matching judgment sub-module is yes, add 1 to the first length to update the first length, and continue to judge whether the character string with the length of the first length added by 1 matches the preset vocabulary.
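The splitting loop carried out by the above sub-modules (steps S32031 to S32035) can be sketched as follows. This is a minimal illustration, not a definitive implementation: the name `split_meta_sentence`, the toy word list, and the handling of a character that is absent from the word list (it is emitted as a one-character candidate) are assumptions introduced here.

```python
def split_meta_sentence(text, vocabulary):
    """Split a continuous character string into candidate words, left to right.

    Starting from a match of the current length, the match is extended by one
    character for as long as the longer string is also in the word list; the
    last matching string is cut off and matching continues on the remainder.
    """
    words = []
    pos = 0
    while pos < len(text):
        length = 1
        while (
            pos + length < len(text)
            and text[pos:pos + length] in vocabulary
            and text[pos:pos + length + 1] in vocabulary
        ):
            length += 1  # the string of "first length plus 1" matched as well
        # Assumption: an unmatched single character is kept as its own candidate.
        words.append(text[pos:pos + length])
        pos += length
    return words


# Toy word list; every prefix of a longer entry is itself an entry.
vocabulary = {"a", "ab", "abc", "d", "de"}
words = split_meta_sentence("abcdex", vocabulary)
print(words)
```

Note that this step-by-step extension only reaches a longer word if each of its shorter prefixes is also in the preset word list, which is why the toy list above contains "a" and "ab" alongside "abc".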

And the attribute sorting module executes the step S330, and is used for sorting the alternative words through at least one attribute of the alternative words to obtain a sorting result. The alternative words comprise various attributes, such as the use heat of the alternative words, the part of speech of the alternative words, the number of characters contained in the alternative words and the like, and the attributes of the words can be used for distinguishing and ordering the content selected by the user according to the importance, so that the content of the page selected by the user can be more easily identified and extracted.

In one example, the attribute ordering module may include the following sub-modules:

and the first attribute sorting submodule executes the step S3301, and carries out priority sorting on the multiple candidate words according to the first attribute values of the candidate words to obtain a first sorting result.

The first attribute judgment sub-module executes the step S3302, judges whether the first attribute value of the alternative word is greater than a first preset threshold value, and stores the alternative word in a first alternative phrase if the first attribute value of the alternative word is greater than the first preset threshold value;

and the secondary sorting sub-module executes the step S3303, and sorts the candidate words in the first candidate word group again according to the second attribute value or the third attribute value of the candidate words to obtain the sorting result.
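As a non-limiting sketch of the first two sub-modules, the first-stage sort and the threshold filter might be written as below. It assumes the first attribute value is a completeness score between 0 and 1 that has already been computed for each candidate; the function name, the sample scores, and the threshold of 0.5 are hypothetical.

```python
def first_stage(candidates, first_threshold):
    """Sort candidates by their first attribute value and filter by threshold.

    candidates: list of (word, first_attribute_value) pairs.
    Returns (first_sorting_result, first_candidate_group).
    """
    # Step S3301: priority sort, highest first attribute value first.
    first_sorting_result = sorted(candidates, key=lambda c: c[1], reverse=True)
    # Step S3302: keep only candidates whose value exceeds the preset threshold.
    first_group = [c for c in first_sorting_result if c[1] > first_threshold]
    return first_sorting_result, first_group


# Hypothetical completeness scores for three candidates.
candidates = [("twenty country", 0.9), ("excellent", 1.0), ("of", 0.3)]
first_sorting_result, first_group = first_stage(candidates, first_threshold=0.5)
print(first_sorting_result)
print(first_group)
```

The first candidate word group produced here is the input that the secondary sorting sub-module re-orders by the second or third attribute.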

In one example, the secondary sorting sub-module may further include the following sub-modules:

a second attribute threshold comparison sub-module, configured to execute step S33031, obtain a second attribute value of the candidate word in the first candidate word group, and compare the second attribute value of the candidate word with a second preset threshold;

the first logic ordering submodule executes the step S33032, and when there is a candidate word whose second attribute value is greater than the second preset threshold value, reorders the candidate words in the first candidate word group according to the second attribute value of the candidate word;

and the second logic ordering submodule executes step S33033, and when there is no candidate word whose second attribute value is greater than the second preset threshold, reorders the candidate words in the first candidate word group according to the third attribute of the candidate words.

And the page content extraction module executes the step S340 to select the candidate word with the highest rank in the ranking result as the target candidate word and extract the target candidate word.

Referring to fig. 11, the present embodiment provides a terminal, and the terminal may be configured to implement the page content extraction method provided in the foregoing embodiments. Specifically:

the terminal 700 may include RF (Radio Frequency) circuitry 110, memory 120 including one or more computer-readable storage media, an input unit 130, a display unit 140, a sensor 150, audio circuitry 160, a WiFi (wireless fidelity) module 170, a processor 180 including one or more processing cores, and a power supply 190. Those skilled in the art will appreciate that the terminal structures shown in the figures are not intended to be limiting of the terminal, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:

the RF circuit 110 may be used for receiving and transmitting signals during information transmission and reception or during a call, and in particular, receives downlink information from a base station and then sends the received downlink information to the one or more processors 180 for processing; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuitry 110 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), a duplexer, and the like. In addition, the RF circuitry 110 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile communications), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), email, SMS (Short Messaging Service), and the like.

The memory 120 may be used to store software programs and modules, and the processor 180 executes various functional applications and data processing by operating the software programs and modules stored in the memory 120. The memory 120 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required by functions (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal 700, and the like. Further, the memory 120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 120 may further include a memory controller to provide the processor 180 and the input unit 130 with access to the memory 120.

The input unit 130 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, the input unit 130 may include a touch-sensitive surface 131 as well as other input devices 132. The touch-sensitive surface 131, also referred to as a touch display screen or a touch pad, may collect touch operations by a user on or near the touch-sensitive surface 131 (e.g., operations by a user on or near the touch-sensitive surface 131 using a finger, a stylus, or any other suitable object or attachment), and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface 131 may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 180, and can receive and execute commands sent by the processor 180. In addition, the touch-sensitive surface 131 may be implemented using various types, such as resistive, capacitive, infrared, and surface acoustic wave; in addition to the touch-sensitive surface 131, the input unit 130 may also include other input devices 132. In particular, other input devices 132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.

The display unit 140 may be used to display information input by or provided to a user and various graphical user interfaces of the terminal 700, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 140 may include a Display panel 141, and optionally, the Display panel 141 may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or the like. Further, the touch-sensitive surface 131 may cover the display panel 141, and when a touch operation is detected on or near the touch-sensitive surface 131, the touch operation is transmitted to the processor 180 to determine the type of the touch event, and then the processor 180 provides a corresponding visual output on the display panel 141 according to the type of the touch event. Although in FIG. 11, touch-sensitive surface 131 and display panel 141 are shown as two separate components to implement input and output functions, in some embodiments, touch-sensitive surface 131 may be integrated with display panel 141 to implement input and output functions.

The terminal 700 can also include at least one sensor 150, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 141 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 141 and/or a backlight when the terminal 700 is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when the terminal is stationary, and can be used for applications of recognizing terminal gestures (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal 700, detailed descriptions thereof are omitted.

Audio circuitry 160, speaker 161, and microphone 162 may provide an audio interface between a user and terminal 700. The audio circuit 160 may transmit the electrical signal converted from the received audio data to the speaker 161, and convert the electrical signal into a sound signal for output by the speaker 161; on the other hand, the microphone 162 converts the collected sound signal into an electric signal, converts the electric signal into audio data after being received by the audio circuit 160, and then outputs the audio data to the processor 180 for processing, and then to the RF circuit 110 to be transmitted to, for example, another terminal, or outputs the audio data to the memory 120 for further processing. The audio circuit 160 may also include an earbud jack to provide communication of a peripheral headset with the terminal 700.

WiFi is a short-range wireless transmission technology. Through the WiFi module 170, the terminal 700 can help a user send and receive e-mails, browse web pages, access streaming media, and the like, and it provides the user with wireless broadband internet access. Although fig. 11 shows the WiFi module 170, it is understood that the module is not an essential component of the terminal 700 and may be omitted as needed without changing the essence of the invention.

The processor 180 is a control center of the terminal 700, connects various parts of the entire terminal using various interfaces and lines, performs various functions of the terminal 700 and processes data by operating or executing software programs and/or modules stored in the memory 120 and calling data stored in the memory 120, thereby performing overall monitoring of the terminal. Optionally, processor 180 may include one or more processing cores; preferably, the processor 180 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 180.

The terminal 700 also includes a power supply 190 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 180 via a power management system to manage charging, discharging, and power consumption management functions via the power management system. The power supply 190 may also include any component including one or more of a dc or ac power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

Although not shown, the terminal 700 may further include a camera, a bluetooth module, etc., which will not be described herein. Specifically, in this embodiment, the display unit of the terminal is a touch screen display, the terminal further includes a memory, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for:

acquiring a selected area in the text;

identifying characters in the selected area, and acquiring sentences corresponding to the characters;

splitting the sentence into a plurality of alternative words;

performing priority ordering on the multiple alternative words according to the alternative word attributes to obtain an ordering result;

and marking target words according to the sequencing result, and copying the target words.

Further, the processor of the terminal is also configured to execute the instructions of: identifying all sentences corresponding to the characters in the selected area; and deleting the repeated sentences in all the sentences to obtain the sentences corresponding to the characters.

Further, the processor of the terminal is also configured to execute the instructions of: and splitting the sentence by adopting a forward maximum matching algorithm to obtain a plurality of alternative words.

Further, the processor of the terminal is also configured to execute the instructions of: performing priority ordering on the multiple candidate words according to the first attribute of the candidate words to obtain a first ordering result; judging whether the first attribute of the alternative word is larger than a first preset threshold value or not, if so, storing the alternative word in a first alternative phrase; and re-ordering the alternative words in the first alternative phrase according to the second attribute or the third attribute of the alternative words to obtain the ordering result.

Specifically, the first attribute includes the integrity of the candidate word, and the integrity of the candidate word is an area occupied by the candidate word in the selection area.

Further, the processor of the terminal is also configured to execute the instructions of: acquiring a second attribute of the alternative word in the first alternative word group, and comparing the second attribute of the alternative word with a second preset threshold value; if the candidate words with the second attribute larger than the second preset threshold exist, the candidate words in the first candidate word group are ranked again according to the second attribute of the candidate words; and if the candidate words with the second attribute larger than the second preset threshold value do not exist, sequencing the candidate words in the first candidate word group again according to the third attribute of the candidate words.

Further, the second attribute of the alternative word comprises the heat of the alternative word, and the third attribute comprises the part of speech of the alternative word.

In summary, the terminal provided in this embodiment acquires the complete and incomplete characters in the selected area, splits the sentences corresponding to those characters, and sorts the resulting candidate words multiple times, so that the target content that the user wants to copy can be correctly marked and the number of user operations is reduced; by combining surrounding components, the experience of the user in copying the text is further optimized.

The technical solution of this embodiment, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions that enable one or more terminal devices to execute all or part of the steps of the method according to each embodiment of the present invention.

The division of the modules/units described in this embodiment is only a logical function division, and other division manners may be available in actual implementation, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed. Some or all of the modules/units can be selected according to actual needs to achieve the purpose of implementing the scheme of the invention.

In addition, each module/unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for extracting page content is characterized by comprising the following steps:

acquiring a selected area in the page;

identifying the characters in the selected area one by one, acquiring a meta sentence containing the characters, and splitting the meta sentence to obtain alternative words;

sorting the alternative words by using at least one attribute of the alternative words to obtain a sorting result;

selecting target alternative words according to the sorting result, and extracting the target alternative words;

wherein said identifying one by one characters in said selected region comprises: identifying complete characters in the selected region; identifying incomplete characters in the selected region; and adding identification bits for the complete characters and the incomplete characters, wherein the identification bits are used for identifying the integrity of the complete characters and the incomplete characters.

2. The method of claim 1, wherein obtaining the meta sentence containing the character comprises:

retrieving the characters in the page content to obtain a plurality of meta sentences corresponding to each character in the selected area;

querying the multiple meta-sentences to judge whether repeated meta-sentences exist in the multiple meta-sentences;

and if so, deleting the repeated meta sentence.

3. The method of claim 1, wherein splitting the meta sentence to obtain the candidate word comprises:

reading continuous character strings in the meta sentence;

matching the continuous character strings with a preset word list according to the sequence from left to right;

when the character string with the first length in the continuous character string is matched with a preset word list, judging whether the character string with the first length plus 1 length is matched with the preset word list;

if not, taking the character string with the first length as an alternative word, cutting the character string with the first length from the continuous character string, and continuously matching by using the cut continuous character string;

and if so, updating the first length plus 1 as the first length, and continuously judging whether the character string of the first length plus 1 is matched with a preset word list.

4. The method of claim 1, wherein the ranking the alternative words using the at least one attribute of the alternative words to obtain a ranking result comprises:

carrying out priority sequencing on a plurality of alternative words according to the first attribute values of the alternative words to obtain a first sequencing result;

judging whether the first attribute value of the alternative word is larger than a first preset threshold value or not, if so, storing the alternative word in a first alternative phrase;

and re-ordering the alternative words in the first alternative phrase according to the second attribute or the third attribute of the alternative words to obtain the ordering result.

5. The method of claim 4, wherein the first attribute value comprises a completeness of the alternative word, and wherein the completeness of the alternative word is calculated by the following formula:

wherein X represents the integrity of the alternative word, i represents the index of a character in the alternative word, n represents the number of characters in the alternative word, and X_i represents the integrity of the i-th character.

6. The method of claim 4, wherein the reordering the candidate words in the first candidate word group according to the second attribute or the third attribute of the candidate words comprises:

acquiring a second attribute value of the alternative word in the first alternative word group, and comparing the second attribute value of the alternative word with a second preset threshold value;

if the candidate words with the second attribute values larger than the second preset threshold exist, the candidate words in the first candidate word group are ranked again according to the second attribute values of the candidate words;

and if the candidate words with the second attribute values larger than the second preset threshold value do not exist, sequencing the candidate words in the first candidate word group again according to the third attribute of the candidate words.

7. The method of claim 6, wherein the second attribute value of the alternative word comprises a heat value of the alternative word, and wherein the third attribute comprises a part-of-speech of the alternative word.

8. The method of claim 1, wherein the target alternate word is highlighted and/or the target alternate word is copied.

9. A page content extraction apparatus, characterized in that the apparatus comprises the following modules:

the area acquisition module is used for acquiring the selected area in the page;

the alternative word generation module is used for identifying the characters in the selected area one by one, acquiring a meta sentence containing the characters, and splitting the meta sentence into alternative words;

the attribute sorting module is used for sorting the alternative words according to the attributes of the alternative words to obtain a sorting result;

the page content extraction module is used for selecting target alternative words according to the sorting result and extracting the target alternative words; the alternative word generation module comprises a character recognition sub-module, and the character recognition sub-module is used for: identifying complete characters in the selected region; identifying incomplete characters in the selected region; and adding identification bits for the complete characters and the incomplete characters, wherein the identification bits are used for identifying the integrity of the complete characters and the incomplete characters.

10. The apparatus according to claim 9, wherein the alternative word generating module includes a meta-sentence obtaining sub-module, and the meta-sentence obtaining sub-module is configured to: retrieving the characters in the page content to obtain a plurality of meta sentences corresponding to each character in the selected area; querying the multiple meta-sentences to judge whether repeated meta-sentences exist in the multiple meta-sentences; and if so, deleting the repeated meta sentence.

11. The apparatus according to claim 9, wherein the alternative word generating module includes a word segmentation submodule configured to read a continuous character string in the meta sentence; matching the continuous character strings with a preset word list according to the sequence from left to right; when the character string with the first length in the continuous character string is matched with a preset word list, judging whether the character string with the first length plus 1 length is matched with the preset word list; if not, taking the character string with the first length as an alternative word, cutting the character string with the first length from the continuous character string, and continuously matching by using the cut continuous character string; and if so, updating the first length plus 1 as the first length, and continuously judging whether the character string of the first length plus 1 is matched with a preset word list.

12. The apparatus of claim 9, wherein the attribute ordering module comprises:

the first attribute sorting submodule is used for carrying out priority sorting on a plurality of candidate words according to the first attribute values of the candidate words to obtain a first sorting result;

the first attribute threshold value judging submodule is used for judging whether a first attribute value of the alternative word is larger than a first preset threshold value or not, and if so, the alternative word is stored in a first alternative word group;

and the secondary sorting submodule is used for re-sorting the candidate words in the first candidate word group according to the second attribute or the third attribute of the candidate words to obtain the sorting result.

13. The apparatus of claim 12, wherein the first attribute value comprises a completeness of the alternative word, and wherein the completeness of the alternative word is calculated by the following formula:

14. The apparatus of claim 12, wherein the secondary ranking sub-module comprises:

a second attribute value obtaining sub-module, configured to obtain a second attribute value of the candidate word in the first candidate word group;

a second attribute threshold judgment submodule for comparing the second attribute value of the candidate word with a second preset threshold; if the candidate words with the second attribute values larger than the second preset threshold exist, the candidate words in the first candidate word group are ranked again according to the second attribute values of the candidate words; and if the candidate words with the second attribute values larger than the second preset threshold value do not exist, sequencing the candidate words in the first candidate word group again according to the third attribute of the candidate words.

15. The apparatus of claim 14, wherein the second attribute value of the alternative word comprises a heat value of the alternative word, and wherein the third attribute comprises a part-of-speech of the alternative word.

16. The apparatus of claim 9, wherein the page content extraction module comprises:

the highlight display module is used for highlighting the target alternative words;

and the replication submodule is used for replicating the target alternative words.

17. A client comprising the apparatus of any one of claims 9-16.

18. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction or at least one program, which is loaded and executed by a processor to implement the page content extraction method according to any one of claims 1 to 8.