CN112288548A - Method, device, medium and electronic equipment for extracting key information of target object - Google Patents

Method, device, medium and electronic equipment for extracting key information of target object

Info

Publication number
CN112288548A
Authority
CN
China
Prior art keywords
sentence
scoring
statement
target
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011272208.0A
Other languages
Chinese (zh)
Inventor
李浩然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202011272208.0A
Publication of CN112288548A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0623 Item investigation
    • G06Q30/0625 Directed, with specific intent or strategy
    • G06Q30/0627 Directed, with specific intent or strategy using item specifications
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0641 Shopping interfaces
    • G06Q30/0643 Graphical representation of items or shoppers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Finance (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a method, a device, a medium and electronic equipment for extracting key information of a target object. The extraction method comprises the following steps: step S1, acquiring a first sentence set; step S2, scoring each sentence in the first sentence set according to a set scoring strategy; step S3, selecting the target sentence with the highest score in the first sentence set according to the score corresponding to each sentence, and adding the target sentence to a second sentence set; step S4, repeating steps S2 and S3 as a loop until the number of words contained in the target sentence selected in step S3 is greater than or equal to a set first threshold; step S5, using the target sentences in the second sentence set obtained in step S4 as the key information of the target object. The technical scheme of the embodiment of the invention can obtain more accurate key information of the target object.

Description

Method, device, medium and electronic equipment for extracting key information of target object
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for extracting key information of a target object, a computer-readable storage medium and electronic equipment.
Background
In scenarios such as commodity marketing, key information about a commodity needs to be extracted, for example key texts from the detailed text description of the commodity. Specifically, the detailed text may come from a detail page picture of the commodity: after the text in the picture is recognized, key texts are extracted from this large body of text to represent the characteristics of the commodity. These key texts may also be used for downstream tasks such as automatic generation of commodity marketing texts and e-commerce customer service question answering and dialogue.
The task of extracting key information of a commodity is in essence an extractive automatic summarization task: some texts are extracted from the input text according to certain requirements and serve as the output summary text.
In the downstream task of automatically generating commodity marketing text, the extracted summary must be highly targeted so that the commodity appears more attractive, but traditional extractive automatic summarization techniques cannot meet this requirement.
How to extract the key information of a target object more accurately is therefore a technical problem that urgently needs to be solved.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present invention and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a method and an apparatus for extracting key information of a target object, a computer-readable storage medium, and an electronic device, so as to extract the key information of the target object more accurately at least to a certain extent.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to a first aspect of the embodiments of the present invention, there is provided a method for extracting key information of a target object, the method including the steps of: step S1, acquiring a first sentence set, wherein the first sentence set is composed of sentences obtained by recognizing an image of the target object; step S2, scoring each sentence in the first sentence set according to a set scoring strategy to obtain a scoring score; step S3, selecting the target sentence with the highest score in the first sentence set according to the score corresponding to each sentence, deleting the target sentence from the first sentence set and adding it to a second sentence set; step S4, taking step S2 and step S3 executed in sequence as a loop, and repeatedly executing the loop until the number of words included in the target sentence selected in step S3 is greater than or equal to a set first threshold; step S5, using the target sentences in the second sentence set obtained in step S4, except the target sentence obtained in the last execution of step S3, as the key information of the target object.
In some embodiments, before the step S1, the extraction method further includes: performing text recognition on the image of the target object through an OCR technology to obtain the first sentence set.
In some embodiments, the set scoring policy comprises scoring the sentences in the first sentence set according to at least one of the following attribute scoring rules: an importance scoring rule, which scores the importance of the current sentence according to the ratio of the font size of the current sentence to the font size of the sentence with the largest font in the image of the target object; a specifiability scoring rule, which scores the specifiability of the current sentence according to the ratio of the number of specified keywords in the current sentence to the number of all words in the sentence, the specified keywords being determined according to a specified keyword dictionary; a readability scoring rule, which scores the readability of the current sentence according to the confidence given to the current sentence by the OCR recognition model when performing OCR recognition on the image of the target object; a compliance scoring rule, which scores the compliance of the current sentence according to the ratio of the number of compliant words in the current sentence to the number of all words in the sentence, the compliant words being determined according to a non-compliance dictionary; and a redundancy scoring rule, which scores the redundancy of the current sentence according to the similarity between the current sentence and the sentences in the second sentence set.
In some embodiments, the set scoring policy comprises: taking the weighted sum of the importance score, the specifiability score, the readability score, the compliance score and the redundancy score of the current sentence as the scoring score.
In some embodiments, the specifiability scoring rule comprises a marketability scoring rule, which scores the marketability of the current sentence according to the ratio of the number of marketing keywords in the current sentence to the number of all words in the sentence.
In some embodiments, before the step S2, the extraction method further includes: establishing a specified keyword dictionary; and/or establishing a non-compliance dictionary.
In some embodiments, before the step S2, the extraction method further includes: calculating the similarity between the current sentence and each sentence in the second sentence set; and taking the maximum of these similarities as the similarity between the current sentence and the sentences in the second sentence set.
According to a second aspect of the embodiments of the present invention, there is provided an apparatus for extracting key information of a target object, the apparatus including: an obtaining unit, configured to obtain a first sentence set, where the first sentence set is composed of sentences obtained by recognizing an image of the target object; a scoring unit, configured to score each sentence in the first sentence set according to a set scoring policy to obtain a scoring score; a selecting unit, configured to select the target sentence with the highest score in the first sentence set according to the score corresponding to each sentence, delete the target sentence from the first sentence set and add it to a second sentence set; a loop execution unit, configured to repeatedly execute, as a loop, the scoring and selecting operations performed in sequence by the scoring unit and the selecting unit until the number of words contained in the target sentence selected by the selecting unit is greater than or equal to a set first threshold; and an output unit, configured to use the target sentences in the second sentence set obtained after the loop execution unit finishes, except the target sentence obtained by the last selecting operation, as the key information of the target object.
According to a third aspect of embodiments of the present invention, there is provided a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method for extracting key information of a target object as described in the first aspect of the embodiments above.
According to a fourth aspect of embodiments of the present invention, there is provided an electronic apparatus, including: one or several processors; a storage device, configured to store one or several programs, which when executed by the one or several processors, cause the one or several processors to implement the method for extracting key information of a target object as described in the first aspect of the above embodiments.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
In the technical scheme provided by some embodiments of the present invention, sentences are obtained by recognizing images of a target object and are scored and selected according to a set scoring policy, so that key information of the target object meeting the requirements of the scoring policy is obtained and the key information of the target object is extracted accurately.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
Fig. 1 schematically shows a flowchart of a method of extracting key information of a target object according to an embodiment of the present invention;
Fig. 2 schematically shows a flowchart of a method of extracting key information of a target object according to another embodiment of the invention;
Fig. 3 schematically shows a block diagram of an apparatus for extracting key information of a target object according to an embodiment of the present invention;
Fig. 4 illustrates a schematic structural diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or several hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 is a flowchart illustrating a method for extracting key information of a target object according to an embodiment of the present invention. The method provided by the embodiment of the invention can be executed by any electronic equipment with computer processing capability, such as a terminal device and/or a server. As shown in fig. 1, the method for extracting key information of the target object includes:
Step S1, acquiring a first sentence set, where the first sentence set is composed of sentences obtained by recognizing an image of a target object.
Step S2, scoring each sentence in the first sentence set according to a set scoring strategy to obtain a scoring score.
Step S3, selecting the target sentence with the highest score in the first sentence set according to the score corresponding to each sentence, deleting the target sentence from the first sentence set, and adding it to the second sentence set.
Step S4, taking step S2 and step S3 executed in sequence as a loop, and repeating the loop until the number of words included in the target sentence selected in step S3 is greater than or equal to the set first threshold.
Step S5, using the target sentences in the second sentence set obtained in step S4, excluding the target sentence obtained in the last execution of step S3, as the key information of the target object.
In the technical scheme of the embodiment of the invention, sentences are obtained by recognizing images and are scored based on a set scoring strategy. After the sentence with the highest score is extracted, the scoring and extraction operations continue, and the high-scoring sentences that satisfy the word-count limit are finally obtained as the key information of the target object.
Before step S1, the extraction method further includes: performing text recognition on the image of the target object by using an OCR (Optical Character Recognition) technology to obtain the first sentence set.
Specifically, the target object in the embodiment of the present invention may be a commodity, and the first sentence set is a text data set. The texts of the commodities come from the detail page pictures of the commodities and are obtained by recognizing the detail page pictures of the commodities through an OCR technology.
In step S2, the set scoring policy includes: scoring the sentences in the first sentence set according to at least one attribute scoring rule. These attribute scoring rules include: importance scoring rules, specifiability scoring rules, readability scoring rules, compliance scoring rules, and redundancy scoring rules.
In the embodiment of the invention, modeling can be performed separately from five aspects, namely importance, specifiability, readability, compliance and redundancy, so as to score the sentences in the first sentence set.
Furthermore, the word-count limit can also be taken into account. The extraction of the key information of the target object can thus be constrained from six aspects: importance, specifiability, readability, compliance, redundancy and the word-count limit, so that the key information obtained is important, strongly marketable, fluent, compliant, non-redundant and within the length limit.
From the viewpoint of the word-count limit, the first threshold is the length limit of a sentence, i.e. its maximum number of words. For example, the first threshold may be N, where N is a natural number; in this case, every sentence retained as key information of the target object contains fewer than N words. In one embodiment, N may be 6, but practical applications are not limited to this value.
In the embodiment of the present invention, the importance scoring rule may score the importance of the current sentence according to a ratio of the font size of the current sentence to the font size of the sentence with the largest font in the image of the target object.
Specifically, the sentence with the largest font size in each image of the target object is used as the subject sentence, i.e. the heading sentence, of the image. The importance score of the subject sentence is 1, and the importance score $a_i$ of any other sentence is given by $a_i = h_i / h_{\max}$, where $h_i$ is the font size of the current sentence, $h_{\max}$ is the font size of the subject sentence, and i is a natural number indicating the sequence number of the current sentence in the first sentence set.
For example, if the font size of a subject sentence in an image of a target object is 16 points and the font size of a sentence of the target object is 12 points, the importance score of the sentence is 0.75.
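The importance rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the font size of each recognized sentence is already available as a number; the function name `importance_scores` is hypothetical and does not come from the patent.

```python
def importance_scores(font_sizes):
    """Return a_i = h_i / h_max for every sentence's font size.

    Assumption: font sizes are positive numbers in a common unit (e.g. points).
    """
    if not font_sizes:
        return []
    h_max = max(font_sizes)  # font size of the subject (heading) sentence
    if h_max <= 0:
        raise ValueError("font sizes must be positive")
    return [h / h_max for h in font_sizes]


# A 16-point subject sentence followed by 12-point and 8-point sentences.
scores = importance_scores([16, 12, 8])
print(scores)
```

The subject sentence receives a score of 1 because its own font size is the maximum, which matches the special case stated in the text.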
In the embodiment of the present invention, the specifiability scoring rule may perform specifiability scoring on the current sentence according to a ratio of the number of specified keywords in the current sentence to the number of all words in the sentence.
In one embodiment, the specifiability scoring rule may be a marketability scoring rule, under which the marketability of the current sentence is scored according to the ratio of the number of marketing keywords in the current sentence to the number of all words in the sentence.
In the specifiability scoring process, the specified keywords are determined according to the specified keyword dictionary; in the marketability scoring process, the marketing keywords are determined according to the marketing keyword dictionary. The specified keyword dictionary or the marketing keyword dictionary therefore needs to be established in advance.
For example, in the marketability scoring process, the marketability of a sentence is determined by the number of marketing keywords, i.e. selling point words, that it contains. Whether a word is a selling point word is determined by a selling point dictionary, which serves as the marketing keyword dictionary and can be annotated by experts.
For example, the marketability score $b_i$ of a sentence of a target object is given by $b_i = \#(m_i) / \#(M_i)$, where $\#(m_i)$ is the number of words of the current sentence that are included in the marketing keyword dictionary, $\#(M_i)$ is the total number of words of the current sentence, and i is a natural number representing the sequence number of the current sentence in the first sentence set.
The marketability of the sentence represents the proportion of marketing keywords in the sentence. When a sentence has 8 words in which the number of marketing keywords is 4, the marketability score of the sentence is 0.5.
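A minimal sketch of the keyword-ratio calculation follows. It assumes each sentence has already been segmented into a list of words and that the dictionary is a plain Python set; `keyword_ratio` and the sample words are hypothetical illustrations, not part of the patent.

```python
def keyword_ratio(words, keyword_dict):
    """Fraction of the words in a sentence that appear in keyword_dict.

    Assumption: an empty sentence scores 0.0 (the source does not cover it).
    """
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in keyword_dict)
    return hits / len(words)


# Hypothetical selling point dictionary and an 8-word sentence with 4 hits.
selling_points = {"waterproof", "lightweight", "durable", "breathable"}
sentence = ["waterproof", "lightweight", "durable", "breathable",
            "jacket", "for", "daily", "wear"]
b = keyword_ratio(sentence, selling_points)
print(b)
```

The same function would serve any specified keyword dictionary, since the marketability rule is one instance of the specifiability rule.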
In the embodiment of the present invention, the readability scoring rule may score readability of the current sentence according to a confidence of the current sentence given by the OCR recognition model in the OCR recognition process of the image of the target object.
The readability score $c_i$ is the confidence that the OCR recognition model assigns to a sentence when recognizing it in the image of the target object, i.e. the probability that the OCR recognition of that sentence is accurate.
For example, in one embodiment, if the OCR confidence of a sentence is 0.8, the readability score of the sentence is 0.8.
In the embodiment of the present invention, the compliance scoring rule may score the compliance of the current sentence according to the ratio of the number of compliant words in the current sentence to the number of all words in the sentence, and the compliance of the sentence represents the ratio of compliant words in the sentence.
In the process of performing the compliance scoring, a compliant word needs to be determined according to the non-compliant dictionary, which requires the pre-establishment of the non-compliant dictionary.
For example, when scoring compliance, the compliance of a sentence is determined by the number of compliant words it contains. Whether a word is compliant is determined by the non-compliance dictionary, which may be defined by the content review team and contains, for example, words related to pornography or gambling and illegal advertising terms. Each word is looked up in the non-compliance dictionary: if it is not found, the word is compliant; if it is found, the word is non-compliant.
For example, the compliance score $d_i$ of a sentence of a target object is given by $d_i = \#(n_i) / \#(N_i)$, where $\#(n_i)$ is the number of words of the current sentence that are not included in the non-compliance dictionary, $\#(N_i)$ is the total number of words of the current sentence, and i is a natural number representing the sequence number of the current sentence in the first sentence set.
The compliance of a sentence represents the proportion of compliant words in the sentence. When a sentence has 8 words, none of which is included in the non-compliance dictionary, the compliance score of the sentence is 1.
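The compliance rule counts words that are absent from the non-compliance dictionary. The sketch below assumes pre-segmented sentences and a set-based dictionary; the function name and the placeholder dictionary entries are hypothetical.

```python
def compliance_score(words, non_compliance_dict):
    """Fraction of the words in a sentence that are NOT in non_compliance_dict.

    Assumption: an empty sentence scores 0.0 so that it is never preferred.
    """
    if not words:
        return 0.0
    compliant = sum(1 for w in words if w not in non_compliance_dict)
    return compliant / len(words)


# Placeholder entries standing in for terms a review team would list.
non_compliance = {"banned_term_a", "banned_term_b"}

clean = ["soft", "cotton", "fabric", "keeps", "you", "warm", "all", "day"]
mixed = ["soft", "cotton", "banned_term_a", "fabric"]
print(compliance_score(clean, non_compliance))
print(compliance_score(mixed, non_compliance))
```

The 8-word sentence with no dictionary hits scores 1, and the 4-word sentence with one hit scores 0.75.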
In the embodiment of the present invention, the redundancy scoring rule may perform redundancy scoring on the current sentence according to the similarity between the current sentence and the sentences in the second sentence set.
The similarity between the current sentence and the sentences already extracted into the second sentence set can be measured by the Jaccard similarity index. The similarity $e_i$ between the current sentence and the second sentence set can be calculated as $e_i = \max_k J(t_i, t_k)$, where $J(t_i, t_k)$ is the Jaccard similarity index of the two sentences $t_i$ and $t_k$, i is a natural number representing the sequence number of the current sentence in the first sentence set, and k is a natural number representing the sequence number of a sentence in the second sentence set.
When calculating the similarity between the current sentence and the sentences in the second sentence set, the similarity between the current sentence and each sentence in the second sentence set may be calculated, and the largest similarity among the similarities may be taken as the similarity between the current sentence and the sentences in the second sentence set. I.e. the similarity between the current sentence and the sentence with the largest Jaccard similarity in the extracted second sentence set determines the redundancy of the current sentence. The greater the similarity of the current sentence to the sentences in the second sentence set, the greater the redundancy of the current sentence.
In one embodiment, when the similarities between the current sentence and the individual sentences in the second sentence set are 0.8, 0.7, 0.6 and 0.65, the similarity score $e_i$ between the current sentence and the second sentence set is 0.8.
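The redundancy calculation can be sketched as below. The source does not say whether the Jaccard index is taken over words or characters, so this example assumes word sets; the names `jaccard` and `redundancy_score`, and the choice of 0.0 for an empty second sentence set, are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity index |A ∩ B| / |A ∪ B| of two word lists."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def redundancy_score(current, second_set):
    """Largest Jaccard similarity between `current` and any selected sentence.

    Assumption: returns 0.0 while the second sentence set is still empty.
    """
    return max((jaccard(current, s) for s in second_set), default=0.0)


current = ["a", "b", "c", "d"]
selected = [["a", "b", "c", "d", "e"],  # shares 4 of 5 distinct words
            ["a", "x"]]                 # shares 1 of 5 distinct words
print(redundancy_score(current, selected))
```

Taking the maximum rather than the average means a single near-duplicate in the second sentence set is enough to mark the current sentence as redundant.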
Further, in the embodiment of the present invention, the set scoring policy may be: taking the weighted sum of the importance score, the specifiability score, the readability score, the compliance score and the redundancy score of the current sentence as the scoring score.
For example, when the weight of each scoring result is 1, the scoring score $s_i$ may be calculated as $s_i = a_i + b_i + c_i + d_i + e_i$. If the importance score of the current sentence is 0.75, the marketability score is 0.5, the readability score is 0.8, the compliance score is 1, and the similarity score $e_i$ is 0.8, the scoring score $s_i$ of the current sentence is 3.85.
By flexibly setting the weight of each parameter of the scoring formula, the influence of different sentence attributes on the scoring result of the sentence can be adjusted, so that the key information of the target object can be selected more flexibly.
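A minimal sketch of the weighted sum is shown here, reproducing the 3.85 example with unit weights. The dictionary keys and the function name `weighted_score` are hypothetical, and the second call, which gives the redundancy term a negative weight, only illustrates that weights are adjustable; the source does not specify any particular weight values or signs.

```python
ATTRIBUTES = ("importance", "marketability", "readability",
              "compliance", "redundancy")


def weighted_score(scores, weights=None):
    """Weighted sum of the five attribute scores of one sentence.

    Assumption: every weight defaults to 1, as in the worked example.
    """
    if weights is None:
        weights = {name: 1.0 for name in ATTRIBUTES}
    return sum(weights[name] * scores[name] for name in ATTRIBUTES)


example = {"importance": 0.75, "marketability": 0.5, "readability": 0.8,
           "compliance": 1.0, "redundancy": 0.8}

total = weighted_score(example)  # unit weights, about 3.85

# Hypothetical re-weighting: penalize redundancy instead of rewarding it.
penalized = weighted_score(example, {**{n: 1.0 for n in ATTRIBUTES},
                                     "redundancy": -1.0})
print(round(total, 2), round(penalized, 2))
```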
In step S2, it is necessary to score each sentence in the first sentence set according to the same set scoring policy, so as to obtain a scoring score of each sentence. In step S3, the sentence with the highest score in each sentence in the first sentence set is selected and added to the second sentence set.
If the number of words included in the sentence selected in step S3 is less than the set first threshold, steps S2 and S3 are repeatedly executed until the number of words included in the sentence selected in step S3 is greater than or equal to the set first threshold.
It should be noted that, when step S3 is executed for the first time, the second sentence set needs to be newly created, and then, before each time step S2 is executed, the similarity between the current sentence and each sentence in the second sentence set, including the sentence just added to the second sentence set, needs to be calculated.
Therefore, when steps S2 and S3 are executed in a loop, the redundancy score of the sentences in the first sentence set needs to be recalculated each time.
In the embodiment of the present invention, when extracting sentences from the first sentence set, a greedy extraction manner is adopted according to the formula $s_i = a_i + b_i + c_i + d_i + e_i$ for the scoring score of each sentence: each time, the not-yet-extracted sentence with the highest score is extracted, until the number of words of the extracted sentence reaches the length limit N. At this point, the last extracted sentence is excluded, and the set of the other extracted sentences is the key information of the commodity.
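The greedy loop of steps S2 to S5 can be sketched as follows. This is an illustration under stated assumptions: sentences are pre-segmented word lists, the scoring policy is passed in as a callable `score(sentence, second_set)`, the stop test follows step S4 as worded (the word count of the sentence just selected), and returning everything selected when the first sentence set runs out is an assumption, since the source does not cover that case. All names are hypothetical.

```python
def extract_key_sentences(first_set, score, threshold):
    """Greedy extraction following steps S2 to S5.

    first_set: list of sentences, each a list of words.
    score:     callable(sentence, second_set) -> float, the scoring policy.
    threshold: the first threshold N (word-count limit).
    """
    remaining = [list(s) for s in first_set]  # leave the caller's data intact
    second_set = []
    while remaining:
        # S2: rescore every remaining sentence, because the redundancy term
        # depends on what is already in the second sentence set.
        scores = [score(s, second_set) for s in remaining]
        # S3: move the highest-scoring sentence into the second sentence set.
        best = max(range(len(remaining)), key=scores.__getitem__)
        target = remaining.pop(best)
        second_set.append(target)
        # S4: stop once the selected sentence reaches the word-count limit.
        if len(target) >= threshold:
            # S5: exclude the sentence selected in the last iteration.
            return second_set[:-1]
    # Assumption: first set exhausted before the stop condition was met.
    return second_set


# Toy scoring policy with fixed scores, ignoring the second sentence set.
toy_scores = {
    ("soft", "cotton"): 3.0,
    ("machine", "washable"): 2.0,
    ("this", "long", "sentence", "has", "six", "words"): 1.0,
    ("extra",): 0.5,
}
sentences = [list(k) for k in toy_scores]
result = extract_key_sentences(
    sentences, lambda s, selected: toy_scores[tuple(s)], threshold=6)
print(result)
```

With N = 6, the two short sentences are selected first, the six-word sentence triggers the stop condition and is discarded, and the lowest-scoring sentence is never reached.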
The technical scheme provided by the embodiment of the invention is a commodity key information extraction method based on extractive automatic summarization. Aimed at the characteristics of scenarios such as e-commerce, the method can model the task from six angles, namely importance, marketability, readability, compliance, redundancy and the word-count limit, and uses extractive automatic summarization to mine and extract the key information of the commodity.
As shown in fig. 2, in one embodiment of the present invention, the first threshold is 6. The method for extracting the key information of the target object in the embodiment comprises the following steps:
Step S201, acquiring a first sentence set composed of sentences obtained by recognizing an image of a target object.
Step S202, scoring each sentence in the first sentence set according to a set scoring strategy to obtain a scoring score.
Step S203, selecting the target sentence with the highest score in the first sentence set according to the score corresponding to each sentence, deleting the target sentence from the first sentence set, and adding it to the second sentence set.
Step S204, judging whether the number of words of the target sentence selected in step S203 is greater than or equal to 6; if so, executing step S205; if not, returning to step S202.
Step S205, using the target sentences in the second sentence set, except the target sentence obtained in the last execution of step S203, as the key information of the target object.
In the method for extracting the key information of the target object provided by the embodiment of the invention, the sentences are obtained by identifying the image of the target object, and the sentences are scored and selected according to the set scoring strategy, so that the key information of the target image meeting the requirements of the scoring strategy is obtained, and the extraction of the key information of the target object is accurately realized.
The following describes an embodiment of the apparatus of the present invention, which can be used to perform the above-mentioned method for extracting key information of the target object of the present invention. Referring to fig. 3, an apparatus 300 for extracting key information based on a target object according to an embodiment of the present invention includes:
An obtaining unit 302, configured to obtain a first sentence set, where the first sentence set is composed of sentences obtained by recognizing an image of a target object.
A scoring unit 304, configured to score each sentence in the first sentence set according to a set scoring strategy to obtain a score.
A selecting unit 306, configured to select, according to the score corresponding to each sentence, the target sentence with the highest score in the first sentence set, delete the target sentence from the first sentence set, and add the target sentence to the second sentence set.
A loop execution unit 308, configured to repeatedly execute the loop of scoring and selecting operations performed in sequence by the scoring unit and the selecting unit, until the number of words contained in the target sentence selected by the selecting unit is greater than or equal to a set first threshold.
An output unit 310, configured to take, as the key information of the target object, the target sentences in the second sentence set obtained after the loop execution unit finishes executing the loop, except the target sentence obtained by the last execution of the selecting operation.
Since each functional module of the apparatus for extracting key information of a target object of the present invention corresponds to a step of the above-described method embodiments, for details that are not disclosed in the apparatus embodiments, reference is made to the above-described embodiments of the method for extracting key information of a target object of the present invention.
In the extraction device based on the key information of the target object provided by the embodiment of the invention, the sentences are obtained by identifying the image of the target object, and the sentences are scored and selected according to the set scoring strategy, so that the key information of the target image meeting the requirements of the scoring strategy is obtained, and the extraction of the key information of the target object is accurately realized.
Referring now to FIG. 4, a block diagram of a computer system 400 suitable for use with the electronic device implementing an embodiment of the invention is shown. The computer system 400 of the electronic device shown in fig. 4 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present invention.
As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU) 401 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for system operation are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 410 as necessary, so that a computer program read out therefrom is installed into the storage section 408 as necessary.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable storage medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The above-described functions defined in the system of the present application are executed when the computer program is executed by a Central Processing Unit (CPU) 401.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of the units do not, in some cases, constitute a limitation on the units themselves.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer-readable storage medium carries one or more programs, which, when executed by an electronic device, enable the electronic device to implement the method for extracting key information of a target object as described in the above embodiments.
For example, the electronic device may implement the steps shown in fig. 1 and 2.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by several modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiment of the present invention.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method for extracting key information of a target object is characterized by comprising the following steps:
step S1, acquiring a first sentence set, wherein the first sentence set is composed of sentences obtained by identifying the image of the target object;
step S2, scoring each sentence in the first sentence set according to a set scoring strategy to obtain a score;
step S3, selecting the target sentence with the highest score in the first sentence set according to the score corresponding to each sentence, deleting the target sentence from the first sentence set and adding the target sentence to a second sentence set;
step S4, taking the step S2 and the step S3 executed in sequence as a loop, and repeatedly executing the loop until the number of words included in the target sentence selected in the step S3 is greater than or equal to a set first threshold;
step S5, using the target sentences in the second sentence set obtained in step S4 except the target sentence obtained in the last execution of step S3 as the key information of the target object.
2. The extraction method according to claim 1, wherein, before the step S1, the extraction method further comprises:
performing text recognition on the image of the target object through an OCR technology to obtain the first sentence set.
3. The extraction method according to claim 2, wherein the set scoring strategy comprises: scoring the sentences in the first sentence set according to at least one of the following attribute scoring rules of the sentences:
an importance scoring rule, including scoring the importance of the current sentence according to the ratio of the font size of the current sentence to the font size of the sentence with the largest font in the image of the target object;
a specified scoring rule, including performing specified scoring on the current sentence according to the ratio of the number of specified keywords in the current sentence to the number of all words in the sentence, the specified keywords being determined according to a specified keyword dictionary during the specified scoring;
a readability scoring rule, including scoring the readability of the current sentence according to the confidence given to the current sentence by the OCR recognition model during OCR recognition of the image of the target object;
a compliance scoring rule, including scoring the compliance of the current sentence according to the ratio of the number of compliant words in the current sentence to the number of all words in the sentence, the compliant words being determined according to a non-compliance dictionary during the compliance scoring;
and a redundancy scoring rule, including performing redundancy scoring on the current sentence according to the similarity between the current sentence and the sentences in the second sentence set.
4. The extraction method according to claim 3, wherein the set scoring strategy comprises:
taking the weighted sum of the importance score, the specified score, the readability score, the compliance score and the redundancy score of the current sentence as the score.
5. The extraction method according to claim 4, wherein the specified scoring rule includes a marketability scoring rule that scores the marketability of the current sentence according to the ratio of the number of marketing keywords in the current sentence to the number of all words in the sentence.
6. The extraction method according to claim 4, wherein, before the step S2, the extraction method further comprises:
establishing a specified keyword dictionary; and/or establishing a non-compliance dictionary.
7. The extraction method according to claim 4, wherein, before the step S2, the extraction method further comprises:
calculating the similarity between the current sentence and each sentence in the second sentence set;
and taking the maximum of these similarities as the similarity between the current sentence and the sentences in the second sentence set.
8. An extraction apparatus of key information of a target object, the extraction apparatus comprising:
an obtaining unit, configured to obtain a first sentence set, where the first sentence set is composed of sentences obtained by identifying an image of the target object;
a scoring unit, configured to score each sentence in the first sentence set according to a set scoring strategy to obtain a score;
a selecting unit, configured to select the target sentence with the highest score in the first sentence set according to the score corresponding to each sentence, delete the target sentence from the first sentence set and add the target sentence to a second sentence set;
a loop execution unit, configured to repeatedly execute the loop of scoring and selecting operations performed in sequence by the scoring unit and the selecting unit, until the number of words contained in the target sentence selected by the selecting unit is greater than or equal to a set first threshold;
and an output unit, configured to take, as the key information of the target object, the target sentences in the second sentence set obtained after the loop execution unit finishes executing the loop, except the target sentence obtained by the last execution of the selecting operation.
9. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing a method of extracting key information of a target object according to any one of claims 1 to 7.
10. An electronic device, comprising:
one or several processors;
storage means for storing one or several programs which, when executed by the one or several processors, cause the one or several processors to implement the method of extracting key information of a target object according to any one of claims 1 to 7.
CN202011272208.0A 2020-11-13 2020-11-13 Method, device, medium and electronic equipment for extracting key information of target object Pending CN112288548A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011272208.0A CN112288548A (en) 2020-11-13 2020-11-13 Method, device, medium and electronic equipment for extracting key information of target object

Publications (1)

Publication Number Publication Date
CN112288548A true CN112288548A (en) 2021-01-29

Family

ID=74398868

Country Status (1)

Country Link
CN (1) CN112288548A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH087076A (en) * 1994-06-20 1996-01-12 Ricoh Co Ltd Picture processor
US20070133874A1 (en) * 2005-12-12 2007-06-14 Xerox Corporation Personal information retrieval using knowledge bases for optical character recognition correction
US20100076991A1 (en) * 2008-09-09 2010-03-25 Kabushiki Kaisha Toshiba Apparatus and method product for presenting recommended information
WO2011159460A2 (en) * 2010-06-18 2011-12-22 Google Inc. Identifying establishments in images
JP2018067095A (en) * 2016-10-18 2018-04-26 株式会社東芝 Business card information management system, and search result display method and search result display program in business card information management system
CN109740510A (en) * 2018-12-29 2019-05-10 三星电子(中国)研发中心 Method and apparatus for output information
CN110597978A (en) * 2018-06-12 2019-12-20 北京京东尚科信息技术有限公司 Article abstract generation method and system, electronic equipment and readable storage medium
CN110781669A (en) * 2019-10-24 2020-02-11 泰康保险集团股份有限公司 Text key information extraction method and device, electronic equipment and storage medium
CN111475613A (en) * 2020-03-06 2020-07-31 深圳壹账通智能科技有限公司 Case classification method and device, computer equipment and storage medium
CN111597775A (en) * 2020-01-15 2020-08-28 南方电网调峰调频发电有限公司信息通信分公司 HTML-based information intelligent extraction technology method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination