CN103678407A - Data processing method and data processing device - Google Patents

Data processing method and data processing device

Info

Publication number
CN103678407A
Authority
CN
China
Prior art keywords
statement
paragraph
data processing
matrix
topic relativity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210358626.0A
Other languages
Chinese (zh)
Inventor
孙健
夏迎炬
杨宇航
张明明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201210358626.0A
Publication of CN103678407A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data processing method and a data processing device. The data processing method comprises the steps of picture identification, initialization, theme correlation determination, theme paragraph division and theme paragraph selection. The picture identification step is used for identifying pictures to obtain multiple identification result words, and one or more search words are generated from the multiple identification result words according to specific combination modes. The initialization step is used for carrying out searching through the searching words to initialize an obtained webpage and obtain multiple statements. The theme correlation determination step is used for determining theme correlation between the obtained statements. The theme paragraph division step is used for dividing the multiple statements into multiple paragraphs according to the determined theme correlation and determining thematic values of the paragraphs. The theme paragraph selection step is used for selecting theme paragraphs meeting preset conditions from the multiple paragraphs based on the determined thematic values of the paragraphs. By means of the data processing method, theme paragraphs of webpages related to pictures can be efficiently and accurately obtained, themes of the pictures can be determined, and information searching, integration and sharing are facilitated.

Description

Data processing method and data processing device
Technical field
The present invention relates to a data processing method and a data processing device and, more specifically, to a data processing method and a data processing device for extracting, from webpages related to a picture, the paragraphs that are highly relevant to the theme of that picture.
Background art
The text in a picture plays an important role in helping a user understand the picture's content. Because techniques such as optical character recognition (OCR) alone cannot reliably pin down the keywords that represent the picture's theme, the text in the picture can be verified and extracted with the help of the large amount of text available on the Internet. This allows the picture's text information to be extracted more accurately and comprehensively, helping the user obtain the required information quickly and accurately.
By searching with the OCR results in a search engine and applying data mining techniques such as text clustering and matching, webpages related to the theme of the picture can be obtained. However, because webpages obtained through web mining contain too much information, problems such as topic dispersion (for example, one webpage covering several themes) or topic drift (for example, shifting from the theme of the picture to another theme) may arise, which hinders determination of the picture's theme.
Summary of the invention
A brief summary of the present invention is given below to provide a basic understanding of some of its aspects. It should be understood that this summary is not exhaustive. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts of the invention in simplified form as a prelude to the more detailed description given later.
In view of the above, an object of the present invention is to provide a data processing method and a data processing device that search with the OCR results of a picture to obtain a plurality of related webpages, determine a theme paragraph division of each webpage based on the topic relativity between the statements in the webpage, and, based on that division, select and output the theme paragraphs most relevant to the picture. The user can thereby obtain webpage information related to the picture efficiently and accurately, which aids understanding of the text content in the picture.
To achieve the above object, according to one aspect of the present invention, a data processing method is provided. The method comprises: a picture recognition step of recognizing a picture to obtain a plurality of recognition result words and generating one or more search terms from the recognition result words according to a particular combination form; an initialization step of initializing the webpages retrieved using the search terms to obtain a plurality of statements; a topic relativity determining step of determining the topic relativity between the obtained statements; a theme paragraph dividing step of dividing the statements into a plurality of paragraphs based on the determined topic relativity and determining the thematic value of each paragraph; and a theme paragraph selection step of selecting, based on the determined thematic value of each paragraph, the theme paragraphs that satisfy a predetermined condition from the plurality of paragraphs.
According to an embodiment of the present invention, the topic relativity determining step may further comprise: a statement similarity calculating sub-step of calculating the similarity between the statements; a matching degree calculating sub-step of calculating the matching degree between each statement and the picture recognition result; and a relativity determining sub-step of determining the topic relativity between the statements based on the calculated similarity and matching degree.
According to another embodiment of the present invention, in the relativity determining sub-step, a topic relativity matrix may be generated in which each element represents the topic relativity between two statements, as follows: for an element on the main diagonal of the matrix, its value is determined based on the corresponding matching degree; for an element in the lower triangle of the matrix, its value is determined based on the elements adjacent to it and the similarity between the two statements it relates to; and the topic relativity matrix is symmetric.
According to yet another embodiment of the present invention, in the theme paragraph dividing step, a dynamic programming algorithm may be applied to the determined topic relativity matrix to find the optimal substructure of its division, and the paragraphs may be divided according to that optimal substructure.
According to a further embodiment of the present invention, in the theme paragraph selection step, the divided paragraphs may be sorted by their determined thematic values and selected for output according to the predetermined condition.
According to another aspect of the present invention, a data processing device is also provided. The device comprises: a picture recognition unit configured to recognize a picture to obtain a plurality of recognition result words and to generate one or more search terms from the recognition result words according to a particular combination form; an initialization unit configured to initialize the webpages retrieved using the search terms to obtain a plurality of statements; a topic relativity determining unit configured to determine the topic relativity between the obtained statements; a theme paragraph dividing unit configured to divide the statements into a plurality of paragraphs based on the determined topic relativity and to determine the thematic value of each paragraph; and a theme paragraph selection unit configured to select, based on the determined thematic value of each paragraph, the theme paragraphs that satisfy a predetermined condition from the plurality of paragraphs.
According to another aspect of the present invention, a storage medium is also provided. The storage medium contains machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the data processing method according to the present invention.
In addition, according to yet another aspect of the present invention, a program product is also provided. The program product contains machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the data processing method according to the present invention.
Therefore, embodiments of the present invention can improve data processing efficiency and help the user quickly and accurately obtain the webpage information most relevant to a picture and understand the theme of the picture's text, which facilitates information retrieval, integration, and sharing.
Other aspects of the embodiments of the present invention are given in the following part of the specification, where the detailed description serves to fully disclose the preferred embodiments without imposing limitations on them.
Brief description of the drawings
The present invention can be better understood by referring to the detailed description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals denote the same or similar parts throughout. The drawings, together with the detailed description below, are included in and form part of this specification, and serve to further illustrate the preferred embodiments and to explain the principles and advantages of the present invention. In the drawings:
Fig. 1 is a flowchart illustrating an example of a data processing method according to an embodiment of the present invention;
Fig. 2 is a flowchart illustrating the detailed processing in the topic relativity determining step of the data processing method shown in Fig. 1;
Fig. 3 is a schematic diagram illustrating an example of a topic relativity matrix according to an embodiment of the present invention and its division;
Fig. 4 is a block diagram illustrating the functional configuration of a data processing device according to an embodiment of the present invention;
Fig. 5 is a block diagram illustrating the detailed functional configuration of the topic relativity determining unit shown in Fig. 4; and
Fig. 6 is a block diagram illustrating an exemplary structure of a personal computer serving as the information processing device employed in embodiments of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual implementation, many implementation-specific decisions must be made to achieve the developer's specific goals, for example, compliance with system-related and business-related constraints, and these constraints may vary from one implementation to another. It should also be appreciated that, although such a development effort might be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the device structures and/or processing steps closely related to the solution of the present invention, and other details of little relevance are omitted.
A data processing method and a data processing device according to embodiments of the present invention are described below with reference to Figs. 1 to 6.
First, a data processing method according to an embodiment of the present invention is described with reference to Fig. 1. As shown in Fig. 1, the data processing method may include a picture recognition step S101, an initialization step S102, a topic relativity determining step S103, a theme paragraph dividing step S104, and a theme paragraph selection step S105.
Specifically, in the picture recognition step S101, the input picture may be recognized to obtain a plurality of recognition result words, and one or more search terms may be generated from the recognition result words according to a particular combination form. Preferably, by way of example and not limitation, optical character recognition (OCR) may be used in the picture recognition step S101. The picture may be any picture that needs to be processed, for example, an advertising picture, a frame captured from a video, or any other picture.
In addition, named-entity words among the recognition result words, such as names, times, and places, are highly distinctive and are therefore well suited to being combined in an appropriate form and selected as search terms. However, those skilled in the art will understand that the recognition result words obtained by optical character recognition may also be used directly as search terms without any processing.
Next, in the initialization step S102, the webpages retrieved using the obtained search terms may be initialized to obtain a plurality of statements. Specifically, the search terms obtained in step S101 are submitted to a search engine, which returns a plurality of related webpages, and each returned webpage may be split into statements based on punctuation marks (such as ",", ".", "?", and "!"). Preferably, in the initialization step S102, the statement sequence obtained from a webpage is kept in the same order as the original statements of that webpage for subsequent processing, because the theme paragraph of a webpage is assumed to lie in a continuous text fragment.
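The statement division described above can be sketched as follows. This is a minimal illustration only: the function name `split_statements` and the exact set of boundary characters (ASCII plus full-width forms) are assumptions, and the text is assumed to have already been extracted from the webpage.

```python
import re

# Punctuation treated as statement boundaries (assumed set: ASCII and
# full-width comma, period, question mark, exclamation mark).
_BOUNDARY = re.compile(r"[,.?!，。？！]+")


def split_statements(text):
    """Split page text into statements, preserving the original order."""
    parts = _BOUNDARY.split(text)
    # Drop empty fragments left by trailing or repeated punctuation.
    return [p.strip() for p in parts if p.strip()]
```

Note that splitting on every "." also breaks decimal numbers and abbreviations; a practical implementation would likely need a more careful boundary rule.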
In the topic relativity determining step S103, the topic relativity between the obtained statements may be determined.
Preferably, as shown in Fig. 2, the topic relativity determining step S103 may further include a statement similarity calculating sub-step S201, a matching degree calculating sub-step S202, and a relativity determining sub-step S203. The detailed processing in the topic relativity determining step S103 is described below with reference to Fig. 2.
In the statement similarity calculating sub-step S201, the similarity between the obtained statements is calculated. Preferably, as an example, the similarity between any two statements may be calculated with the cosine formula based on the word frequencies of each statement, as expressed by the following formula (1):
$$\mathrm{Sim}(s_i, s_j) = \frac{\sum_k a_k \times b_k}{\sqrt{\sum_k a_k^2 \times \sum_k b_k^2}} \qquad (1)$$
Here, s_i and s_j denote statement i and statement j respectively, and a_k and b_k denote the frequency of word k in statement i and statement j respectively. The word frequency may be obtained by simple counting, or may be a weight such as TF-IDF (term frequency-inverse document frequency); the present invention places no limitation on this.
It should be understood that, although cosine similarity is used here as an example for calculating the similarity between two statements, those skilled in the art may use any other suitable method to calculate the similarity between statements.
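As an illustration of formula (1), the following sketch computes the cosine similarity of two statements from raw word counts. The function name `cosine_similarity` is hypothetical, each statement is assumed to be already tokenized into a list of words, and raw counts are used rather than TF-IDF weights.

```python
import math
from collections import Counter


def cosine_similarity(words_i, words_j):
    """Cosine similarity of two tokenized statements using raw word counts."""
    a = Counter(words_i)
    b = Counter(words_j)
    # Only words present in both statements contribute to the numerator.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    # An empty statement has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0
```

Replacing the counts with TF-IDF weights would only change how `a` and `b` are built; the rest of the calculation stays the same.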
Next, in the matching degree calculating sub-step S202, the matching degree between each statement and the picture recognition result may be calculated.
Preferably, in the matching degree calculating sub-step S202, the matching degree between a statement and the picture recognition result may be calculated based on an edit distance algorithm. As an example, the maximum of the matching degrees between the statement and the individual recognition result words may be taken as the matching degree between the statement and the picture recognition result. Of course, those skilled in the art may instead use the mean rather than the maximum of those matching degrees; the present invention places no limitation on this.
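One possible reading of this sub-step is sketched below. The text does not specify how an edit distance is turned into a matching degree, so the normalization used here (1 minus the Levenshtein distance divided by the longer length) and the choice to compare each recognition result word against the individual tokens of a statement are assumptions, as are all function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def word_match(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def matching_degree(statement_tokens, ocr_words, use_mean=False):
    """Matching degree between one statement and the picture recognition result."""
    # Best match of each recognition result word against the statement's tokens.
    scores = [max((word_match(w, t) for t in statement_tokens), default=0.0)
              for w in ocr_words]
    if not scores:
        return 0.0
    return sum(scores) / len(scores) if use_mean else max(scores)
```

With `use_mean=False` the function follows the maximum variant described above; `use_mean=True` gives the mean variant.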
Next, in the relativity determining sub-step S203, the topic relativity between the statements is determined based on the similarity and the matching degree calculated in steps S201 and S202 respectively. Because the topic relativity is determined not only from the similarity between statements but also from the matching degree between each statement and the picture recognition result, the relevance between the paragraph division described later and the picture theme is ensured.
The processing in the relativity determining sub-step S203 is described in detail below, taking as an example the determination of the topic relativity between statements based on a topic relativity matrix.
Preferably, in the relativity determining sub-step S203, a topic relativity matrix may be generated in which each element represents the topic relativity between two statements, as follows: for an element on the main diagonal of the matrix, its value may be determined based on the corresponding matching degree; for an element in the lower triangle of the matrix, its value may be determined based on the elements adjacent to it and the similarity between the two statements it relates to; and the matrix is symmetric.
Fig. 3 shows an example of a topic relativity matrix according to an embodiment of the present invention, in which both the rows and the columns of the matrix correspond to the statement sequence 1 to m arranged in the original statement order of the text. An example calculation of each element of the matrix is given below.
Specifically, for an element on the main diagonal, i.e., if i = j, A[i][j] = match[i] + 1; for an element in the lower triangle, i.e., if i > j, A[i][j] = (A[i-1][j] + A[i][j+1])/2 - (1 - Sim[i][j]), and A[i][j] = A[j][i]. Here, A[i][j] denotes the element in row i, column j of the topic relativity matrix, match[i] denotes the matching degree between statement i and the picture recognition result, Sim[i][j] denotes the similarity between statement i and statement j, 1 ≤ i ≤ m, and 1 ≤ j ≤ m.
From the above formulas it can be seen that, for an element on the main diagonal, the similarity of a statement with itself is a constant, so the element depends only on the matching degree between that statement and the picture recognition result. Furthermore, because a theme paragraph necessarily lies among consecutive statements and the elements of the matrix are arranged in statement order, an element closer to the diagonal should have a larger topic relativity value, and an element farther from the diagonal should have a smaller one.
Although an example of determining the topic relativity matrix has been given above, it should be understood that this method is only an example and not a limitation, and those skilled in the art may modify the above calculation according to the principles taught.
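The example recurrence for the matrix elements can be sketched as follows. The function name `relativity_matrix` is hypothetical and indices are 0-based here, whereas the text counts statements from 1. The matrix is filled one sub-diagonal at a time, because an element at distance d from the main diagonal depends on two elements at distance d - 1.

```python
def relativity_matrix(match, sim):
    """Build the symmetric topic relativity matrix A.

    match[i]  : matching degree between statement i and the recognition result
    sim[i][j] : similarity between statements i and j
    """
    m = len(match)
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        A[i][i] = match[i] + 1.0          # main diagonal: A[i][i] = match[i] + 1
    for d in range(1, m):                 # distance from the main diagonal
        for j in range(m - d):
            i = j + d                     # lower-triangle element, i > j
            A[i][j] = (A[i - 1][j] + A[i][j + 1]) / 2.0 - (1.0 - sim[i][j])
            A[j][i] = A[i][j]             # mirror to keep the matrix symmetric
    return A
```

For example, with `match = [0.0, 1.0]` and a similarity of 0.5 between the two statements, the diagonal is 1.0 and 2.0 and the off-diagonal element is (1.0 + 2.0)/2 - 0.5 = 1.0.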
Next, returning to Fig. 1, the description of the data processing method according to an embodiment of the present invention continues.
Although the text in a picture is crucial, the theme that the picture is meant to express cannot be determined from that text alone. It is therefore necessary to mine the theme paragraphs of the webpages related to the picture and identify those highly relevant to the picture, so as to help determine the picture's theme. The statements of a webpage are divided into paragraphs on the assumption that the webpage text contains consecutive statements describing the theme of the webpage. The processing for dividing a webpage into theme paragraphs is described in detail below.
After the topic relativity has been determined, in the theme paragraph dividing step S104, the statements may be divided into a plurality of paragraphs based on the determined topic relativity, and the thematic value of each paragraph may be determined.
Preferably, to divide the theme paragraphs, a dynamic programming algorithm is applied to the determined topic relativity matrix to find the optimal substructure of its division, and the paragraphs are divided according to that optimal substructure.
As noted above, in the topic relativity matrix, the closer an element is to the main diagonal, the higher its topic relativity. Therefore, in the actual division process, elements are added, removed, and divided along the diagonal, with the criterion of maximizing the sum of the topic relativity values of the paragraph concerned (i.e., the thematic value of that paragraph).
Preferably, the optimal paragraph division of the statements and the thematic value of each paragraph may be determined based on the following formula (2):
$$S[i] = \max \begin{cases} B[i][i] + S[i-1] \\ B[i-1][i] + S[i-2] \\ \cdots \\ B[2][i] + S[1] \\ B[1][i] \end{cases} \qquad (2)$$
Here, S[i] denotes the thematic value of the optimal division of the first i statements, and B[i][j] denotes a value determined from the elements of the topic relativity matrix ranging from the element in row i, column i to the element in row j, column j. In the examples below, B[i][j] is, for example, the sum of the lower-triangle elements (including the diagonal elements) of the submatrix formed by statements i through j; those skilled in the art may of course instead use the sum of the upper-triangle elements (including the diagonal elements) or the sum of all elements of that submatrix.
A concrete example of dividing paragraphs and calculating their thematic values with the above expression, based on the dynamic programming algorithm, is described below.
For example, consider the submatrix of the topic relativity matrix formed by statements s1 and s2, as follows:
$$\begin{pmatrix} 5 & 2 \\ 2 & 3 \end{pmatrix}$$
Taking the lower-triangle sum as an example, since 5 + 3 + 2 > 5 + 3, i.e., the thematic value of the paragraph formed by merging statements s1 and s2 is greater than the sum of the thematic values of s1 and s2 as separate paragraphs, statements s1 and s2 are placed in one paragraph (s1, s2), and the thematic value of the first two statements is recorded as 10.
Next, the optimal substructure of the submatrix formed by statements s1, s2, and s3 is determined, as follows:
$$\begin{pmatrix} 5 & 2 & -3 \\ 2 & 3 & 3 \\ -3 & 3 & 6 \end{pmatrix}$$
Since the paragraph division of statements s1 and s2 has already been determined above (the thematic value of (s1, s2) is greater than that of (s1) (s2)), only the thematic values of three paragraph divisions need to be compared: (s1, s2) (s3), (s1) (s2, s3), and (s1, s2, s3). In this example, the division (s1) (s2, s3) has the largest thematic value, 5 + (3 + 6 + 3) = 17, compared with 10 + 6 = 16 for (s1, s2) (s3) and 5 + 2 + 3 - 3 + 3 + 6 = 16 for (s1, s2, s3). The optimal paragraph division of the first three statements therefore places statement s1 in a paragraph of its own and merges statements s2 and s3 into one paragraph, and the value 17 is recorded as the thematic value of the first three statements. Fig. 3 schematically shows this division result, with the elements belonging to the same optimal substructure marked with the same pattern.
Note that when determining the optimal paragraph division of statements s1, s2, and s3 above, the division of statements s1 and s2 has already been determined, so the previously recorded result can simply be reused without recomputation.
Similarly, the paragraph division of all the statements is determined step by step in the above manner using the dynamic programming algorithm, where the thematic value of each paragraph equals the sum of the lower-triangle elements (including the diagonal elements), the upper-triangle elements (including the diagonal elements), or all elements of the submatrix of the topic relativity matrix formed by the statements that make up the paragraph.
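The division by formula (2) can be sketched as follows, using the lower-triangle sum for B. The function names `lower_triangle_sum` and `segment` are hypothetical, indices are 0-based, and the block sums are recomputed naively for readability rather than cached.

```python
def lower_triangle_sum(A, start, end):
    """B for statements start..end (0-based, inclusive): lower triangle plus diagonal."""
    return sum(A[r][c] for r in range(start, end + 1) for c in range(start, r + 1))


def segment(A):
    """Return (total thematic value, [(first, last, thematic value), ...])."""
    m = len(A)
    best = [0.0] * (m + 1)  # best[i]: optimal value for the first i statements
    cut = [0] * (m + 1)     # cut[i]: index of the first statement of the last paragraph
    for i in range(1, m + 1):
        # Try every start k of the paragraph that ends at statement i - 1.
        best[i], cut[i] = max(
            (best[k] + lower_triangle_sum(A, k, i - 1), k) for k in range(i)
        )
    paragraphs = []
    i = m
    while i > 0:            # walk the recorded cuts backwards
        k = cut[i]
        paragraphs.append((k, i - 1, lower_triangle_sum(A, k, i - 1)))
        i = k
    paragraphs.reverse()
    return best[m], paragraphs
```

Applied to the 3 × 3 example matrix above, `segment` returns a total of 17 with paragraphs `(0, 0, 5)` and `(1, 2, 12)`, i.e., (s1) (s2, s3). Precomputing the block sums would avoid the repeated summation for longer statement sequences.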
Next, after the optimal paragraph division of the statements has been determined, in the theme paragraph selection step S105, the theme paragraphs that satisfy a predetermined condition are selected from the divided paragraphs based on the thematic value of each paragraph determined in step S104.
Preferably, in the theme paragraph selection step S105, the divided paragraphs may be sorted by their determined thematic values and selected for output according to the predetermined condition. For example, a predetermined number of top-ranked paragraphs may be selected and output. Alternatively, the paragraphs need not be sorted; instead, the paragraphs whose thematic value is greater than a predetermined threshold may be selected and output.
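The two selection conditions mentioned above can be sketched as follows. The function name `select_paragraphs` and the tuple layout `(first, last, thematic_value)` are assumptions made for illustration.

```python
def select_paragraphs(paragraphs, top_n=None, threshold=None):
    """Select theme paragraphs from (first, last, thematic_value) tuples."""
    if threshold is not None:
        # Threshold variant: no sorting, original order is kept.
        return [p for p in paragraphs if p[2] > threshold]
    # Ranking variant: highest thematic value first.
    ranked = sorted(paragraphs, key=lambda p: p[2], reverse=True)
    return ranked if top_n is None else ranked[:top_n]
```

For instance, `select_paragraphs(paras, top_n=2)` keeps the two paragraphs with the highest thematic values, while `select_paragraphs(paras, threshold=4)` keeps every paragraph whose value exceeds 4.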
By performing the processing in steps S101 to S105 above, theme information related to a picture can be mined from webpages. This overcomes the drawback of conventional techniques that webpage information is cluttered and poorly targeted, helps the user determine the picture theme, and facilitates further information retrieval, integration, and sharing.
Although the data processing method according to the embodiments of the present invention has been described in detail above with reference to Figs. 1 to 3, those skilled in the art will understand that the flowcharts shown in the drawings are only exemplary, and that the above method flow may be modified according to the actual application and specific requirements. For example, the execution order of some steps in the above method may be adjusted as needed, or some processing steps may be omitted or added. In addition, it should be understood that the above examples do not limit the present invention; those skilled in the art may, based on the principles taught, suitably modify the above process and apply it to other application scenarios.
Corresponding with the data processing method according to the embodiment of the present invention, embodiments of the invention also provide a kind of data processing equipment.Hereinafter with reference to Fig. 4, describe in detail according to the functional configuration example of data processing equipment of the present invention.
Particularly, as shown in Figure 4, data processing equipment can comprise picture recognition unit 401, initialization unit 402, topic relativity determining unit 403, theme paragraph division unit 404 and the subject matter segments selected cell 405 that falls.Hereinafter with reference to Fig. 4, describe the functional configuration of unit in detail.
Picture recognition unit 401 can be configured to picture to identify, and to obtain a plurality of recognition result words, and from a plurality of recognition result words, generates one or more terms according to particular combination form.Preferably, picture recognition unit 401 can adopt optical character recognition, and the picture of input can for advertising pictures, the picture intercepting from video or other need picture to be processed arbitrarily.In addition, neither be necessary to recognition result contamination, also can be by the word identifying directly as term.
The webpage that initialization unit 402 can be configured to obtain utilizing the term obtaining to retrieve carries out initialization, to obtain a plurality of statements.As example, can to the webpage returning, carry out statement division by the punctuation mark based in webpage, and preferably, keep the constant sequence that obtains a plurality of statements of original statement order in webpage for subsequent treatment, this is that theme paragraph based on webpage is present in the hypothesis in continuous text fragments.
Topic relativity determining unit 403 can be configured to determine the topic relativity between each obtained statement.
Hereinafter with reference to Fig. 5, describe the functional configuration example of topic relativity determining unit 403 in detail.
As shown in Figure 5, topic relativity determining unit 403 may further include statement similarity computing module 501, matching degree computing module 502 and correlativity determination module 503.The functional configuration of modules will be described in detail below.
Statement similarity computing module 501 can be configured to calculate the similarity between each statement obtaining.As example, word frequency that can be based on each statement, utilizes cosine formula to carry out the similarity between computing statement.Certainly, the invention is not restricted to this, but can adopt other method arbitrarily to carry out the similarity between computing statement.
Matching degree computing module 502 can be configured to calculate the matching degree between each statement and picture recognition result.As example, can carry out the matching degree between computing statement and picture recognition result based on editing distance algorithm.Preferably, the matching degree of maximum matching degree that can be between statement and each recognition result word between this statement and picture recognition result.As an alternative, also can adopt the mean value of the matching degree between statement and each recognition result word as the matching degree between this statement and picture recognition result.
The correlativity determination module 503 may be configured to determine the topic relativity between the statements based on the calculated similarity and matching degree. Preferably, the correlativity determination module 503 may determine the topic relativity between statements by generating a topic relativity matrix, in which each element represents the topic relativity between two statements, in the following manner: for an element on the main diagonal, its value is determined based on the corresponding matching degree; for an element in the lower triangular part of the matrix, its value is determined based on the elements adjacent to it and the similarity between the two statements to which it relates; and the matrix is a symmetric matrix.
The specific way of generating the topic relativity matrix has been described above with reference to the data processing method according to an embodiment of the invention, and is not repeated here.
Next, referring back to Fig. 4, the description of the data processing device according to an embodiment of the invention continues.
After the topic relativity has been determined, the paragraph division unit 404 may be configured to divide the plurality of statements into a plurality of paragraphs based on the determined topic relativity, and to determine the thematic value of each paragraph. Preferably, the paragraph division unit 404 may use a dynamic programming algorithm to determine the optimal substructure of the division of the topic relativity matrix, and divide the paragraphs according to the determined optimal substructure.
For the specific processing of using the dynamic programming algorithm to perform the optimal paragraph division and to determine the thematic value of each paragraph accordingly, see the description of paragraph division step S104 above, which is not repeated here. In addition, it should be understood that, because the topic relativity determining unit 403 considers both the similarity between statements and the matching degree between statements and the picture when generating the topic relativity matrix, the divided theme paragraphs not only reflect the theme of the web page but are also relevant to the theme of the picture.
The theme paragraph selection unit 405 may be configured to select, based on the determined thematic value of each paragraph, the theme paragraphs that meet a predetermined condition from the divided paragraphs. Preferably, the theme paragraph selection unit 405 may sort the divided paragraphs by their determined thematic values and select the output according to the predetermined condition. For example, a predetermined number of top-ranked paragraphs may be selected and output. Alternatively, without sorting, the paragraphs whose thematic value is greater than a predetermined threshold may be selected and output.
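The two selection conditions mentioned above can be illustrated with a short sketch. The representation of a paragraph as a `(text, thematic_value)` pair and the function names `select_top` and `select_above` are assumptions made for this example only.

```python
def select_top(paragraphs, count):
    """Sort paragraphs by thematic value (descending) and keep the first `count`."""
    ranked = sorted(paragraphs, key=lambda p: p[1], reverse=True)
    return ranked[:count]


def select_above(paragraphs, threshold):
    """Keep, without sorting, the paragraphs whose thematic value exceeds `threshold`."""
    return [p for p in paragraphs if p[1] > threshold]
```

For example, with `paragraphs = [("a", 0.2), ("b", 0.9), ("c", 0.5)]`, `select_top(paragraphs, 2)` ranks "b" before "c", while `select_above(paragraphs, 0.4)` returns the same two paragraphs in their original order.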
It should be noted that the data processing device described in this embodiment of the invention corresponds to the foregoing method embodiment. For parts not described in detail in the device embodiment, refer to the corresponding passages of the method embodiment, which are not repeated here.
In addition, it should also be noted that the above series of processes and devices may also be implemented by software and/or firmware. In that case, the program constituting the software is installed from a storage medium or a network onto a computer with a dedicated hardware structure, for example the general-purpose personal computer 600 shown in Fig. 6, which can perform various functions when various programs are installed.
In Fig. 6, a central processing unit (CPU) 601 performs various processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random-access memory (RAM) 603. The RAM 603 also stores, as needed, the data required when the CPU 601 performs the various processes.
The CPU 601, the ROM 602 and the RAM 603 are connected to each other via a bus 604. An input/output interface 605 is also connected to the bus 604.
The following components are connected to the input/output interface 605: an input section 606, including a keyboard, a mouse, etc.; an output section 607, including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a loudspeaker, etc.; a storage section 608, including a hard disk, etc.; and a communication section 609, including a network interface card such as a LAN card, a modem, etc. The communication section 609 performs communication processing via a network such as the Internet.
A drive 610 is also connected to the input/output interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it is installed into the storage section 608 as needed.
When the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 611.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 611 shown in Fig. 6, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 611 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608, or the like, in which the program is stored and which is distributed to the user together with the device containing it.
It should also be pointed out that the steps of the above series of processes may naturally be performed chronologically in the order described, but need not be performed in that order. Some steps may be performed in parallel or independently of one another.
Although the invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, in the embodiments of the invention, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises that element.
According to the embodiments of the invention, the following remarks are also disclosed:
Remark 1. A data processing method, comprising:
A picture recognition step of recognizing a picture to obtain a plurality of recognition result words, and generating one or more search terms from the plurality of recognition result words according to a particular combination form;
An initialization step of initializing the web pages retrieved using the search terms, to obtain a plurality of statements;
A topic relativity determining step of determining the topic relativity between the obtained statements;
A theme paragraph division step of dividing the plurality of statements into a plurality of paragraphs based on the determined topic relativity, and determining the thematic value of each paragraph; and
A theme paragraph selection step of selecting, based on the determined thematic value of each paragraph, the theme paragraphs that meet a predetermined condition from the plurality of paragraphs.
Remark 2. The data processing method according to Remark 1, wherein the topic relativity determining step further comprises:
A statement similarity calculation sub-step of calculating the similarity between the statements;
A matching degree calculation sub-step of calculating the matching degree between each statement and the picture recognition result; and
A correlativity determination sub-step of determining the topic relativity between the statements based on the calculated similarity and matching degree.
Remark 3. The data processing method according to Remark 2, wherein, in the matching degree calculation sub-step, the matching degree between each statement and each of the plurality of recognition result words is calculated, and the maximum of the calculated matching degrees is taken as the matching degree between that statement and the picture recognition result.
Remark 4. The data processing method according to Remark 2, wherein, in the correlativity determination sub-step, a topic relativity matrix is generated in the following manner, each element of the topic relativity matrix representing the topic relativity between two statements: for an element on the main diagonal of the matrix, the value of the element is determined based on the corresponding matching degree; for an element in the lower triangular part of the matrix, the value of the element is determined based on the elements adjacent to it and the similarity between the two statements to which it relates; and the topic relativity matrix is a symmetric matrix.
Remark 5. The data processing method according to Remark 4, wherein the topic relativity matrix is determined by the following expressions:
if i = j, A[i][j] = match[i] + 1;
if i > j, A[i][j] = (A[i-1][j] + A[i][j+1])/2 - (1 - Sim[i][j]);
and A[i][j] = A[j][i],
where A[i][j] denotes the element in row i, column j of the topic relativity matrix, match[i] denotes the matching degree between statement i and the picture recognition result, and Sim[i][j] denotes the similarity between statement i and statement j.
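The expressions above can be evaluated by filling the matrix outward from the main diagonal, because each element below the diagonal depends only on two elements that lie one step closer to the diagonal. The following is a minimal sketch under that observation; the function name `topic_relativity_matrix` and the use of zero-based indices are assumptions, and `match` and `sim` are taken as already computed.

```python
def topic_relativity_matrix(match, sim):
    """Build the symmetric topic relativity matrix A from match[i] and Sim[i][j].

    match: list of n matching degrees, one per statement.
    sim:   n x n list of lists with the similarity between statements.
    """
    n = len(match)
    A = [[0.0] * n for _ in range(n)]
    # Main diagonal: A[i][i] = match[i] + 1.
    for i in range(n):
        A[i][i] = match[i] + 1
    # Lower triangle, one sub-diagonal at a time (d = i - j), so that
    # A[i-1][j] and A[i][j+1] (both at distance d - 1) are already known.
    for d in range(1, n):
        for j in range(n - d):
            i = j + d
            A[i][j] = (A[i - 1][j] + A[i][j + 1]) / 2 - (1 - sim[i][j])
            A[j][i] = A[i][j]  # symmetry: A[i][j] = A[j][i]
    return A
```

For two statements with `match = [1.0, 0.0]` and a mutual similarity of 0.5, the diagonal is 2.0 and 1.0 and the off-diagonal element is (2.0 + 1.0)/2 - (1 - 0.5) = 1.0.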
Remark 6. The data processing method according to Remark 4, wherein, in the paragraph division step, based on the determined topic relativity matrix, a dynamic programming algorithm is used to determine the optimal substructure of the division of the matrix, and the paragraphs are divided according to the determined optimal substructure.
Remark 7. The data processing method according to Remark 6, wherein, in the paragraph division step, the thematic value of the optimal substructure is determined based on the following formula:
$$S[i] = \max \left\{ \begin{array}{l} B[i][i] + S[i-1] \\ B[i-1][i] + S[i-2] \\ \vdots \\ B[2][i] + S[1] \\ B[1][i] \end{array} \right.$$
where S[i] denotes the thematic value of the optimal substructure of the first i statements, and B[i][j] denotes a value determined based on the elements from row i, column i to row j, column j of the topic relativity matrix.
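The recurrence for S[i] can be sketched as a standard dynamic program with back-pointers. In this illustration the values B[k][i] are taken as given in a precomputed table `B` (zero-based, only entries with k <= i are read), because the way B is derived from the topic relativity matrix is described elsewhere in the document. The function name `divide_paragraphs` and the choice of reporting B[k][i] as the value of each paragraph are assumptions.

```python
def divide_paragraphs(B):
    """Return (S[n], paragraphs) for the recurrence S[i] = max_k (B[k][i] + S[k-1]).

    B is an n x n table; B[k][i] is the value of a paragraph that spans
    statements k..i (zero-based, inclusive). S[0] = 0 covers the case in
    which the first paragraph starts at the first statement.
    """
    n = len(B)
    S = [0.0] * (n + 1)   # S[i]: best total value for the first i statements
    start = [0] * (n + 1)  # start[i]: 1-based first statement of the last paragraph
    for i in range(1, n + 1):
        best_k = max(range(1, i + 1), key=lambda k: B[k - 1][i - 1] + S[k - 1])
        S[i] = B[best_k - 1][i - 1] + S[best_k - 1]
        start[i] = best_k
    # Walk the back-pointers from the end to recover the paragraph boundaries.
    paragraphs = []
    i = n
    while i > 0:
        k = start[i]
        paragraphs.append((k - 1, i - 1, B[k - 1][i - 1]))
        i = k - 1
    paragraphs.reverse()
    return S[n], paragraphs
```

Each returned tuple is `(first_statement, last_statement, value)` with zero-based, inclusive indices. The double loop makes the cost quadratic in the number of statements.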
Remark 8. The data processing method according to Remark 2, wherein, in the statement similarity calculation sub-step, the similarity between any two statements is calculated with the cosine formula, based on the word frequencies of each statement.
Remark 9. The data processing method according to Remark 2, wherein, in the matching degree calculation sub-step, the matching degree between each statement and the picture recognition result is calculated based on an edit distance algorithm.
Remark 10. The data processing method according to Remark 1, wherein, in the theme paragraph selection step, the divided paragraphs are sorted based on the determined thematic values, and the output is selected according to the predetermined condition.
Remark 11. The data processing method according to Remark 1, wherein, in the initialization step, the web page is divided into statements based on punctuation marks, and the original statement order of the web page is kept unchanged.
Remark 12. The data processing method according to Remark 1, wherein optical character recognition (OCR) technology is used in the picture recognition step.
Remark 13. A data processing device, comprising:
A picture recognition unit, configured to recognize a picture to obtain a plurality of recognition result words, and to generate one or more search terms from the plurality of recognition result words according to a particular combination form;
An initialization unit, configured to initialize the web pages retrieved using the search terms, to obtain a plurality of statements;
A topic relativity determining unit, configured to determine the topic relativity between the obtained statements;
A theme paragraph division unit, configured to divide the plurality of statements into a plurality of paragraphs based on the determined topic relativity, and to determine the thematic value of each paragraph; and
A theme paragraph selection unit, configured to select, based on the determined thematic value of each paragraph, the theme paragraphs that meet a predetermined condition from the plurality of paragraphs.
Remark 14. The data processing device according to Remark 13, wherein the topic relativity determining unit further comprises:
A statement similarity computing module, configured to calculate the similarity between the statements;
A matching degree computing module, configured to calculate the matching degree between each statement and the picture recognition result; and
A correlativity determination module, configured to determine the topic relativity between the statements based on the calculated similarity and matching degree.
Remark 15. The data processing device according to Remark 14, wherein the matching degree computing module is further configured to calculate the matching degree between each statement and each of the plurality of recognition result words, and to take the maximum of the calculated matching degrees as the matching degree between that statement and the picture recognition result.
Remark 16. The data processing device according to Remark 14, wherein the correlativity determination module is further configured to generate a topic relativity matrix in the following manner, each element of the topic relativity matrix representing the topic relativity between two statements: for an element on the main diagonal of the matrix, the value of the element is determined based on the corresponding matching degree; for an element in the lower triangular part of the matrix, the value of the element is determined based on the elements adjacent to it and the similarity between the two statements to which it relates; and the topic relativity matrix is a symmetric matrix.
Remark 17. The data processing device according to Remark 16, wherein the topic relativity matrix is determined by the following expressions:
if i = j, A[i][j] = match[i] + 1;
if i > j, A[i][j] = (A[i-1][j] + A[i][j+1])/2 - (1 - Sim[i][j]);
and A[i][j] = A[j][i],
where A[i][j] denotes the element in row i, column j of the topic relativity matrix, match[i] denotes the matching degree between statement i and the picture recognition result, and Sim[i][j] denotes the similarity between statement i and statement j.
Remark 18. The data processing device according to Remark 16, wherein the paragraph division unit is further configured to use a dynamic programming algorithm, based on the determined topic relativity matrix, to determine the optimal substructure of the division of the matrix, and to divide the paragraphs according to the determined optimal substructure.
Remark 19. The data processing device according to Remark 18, wherein the paragraph division unit is further configured to determine the thematic value of the optimal substructure based on the following formula:
$$S[i] = \max \left\{ \begin{array}{l} B[i][i] + S[i-1] \\ B[i-1][i] + S[i-2] \\ \vdots \\ B[2][i] + S[1] \\ B[1][i] \end{array} \right.$$
where S[i] denotes the thematic value of the optimal substructure of the first i statements, and B[i][j] denotes a value determined based on the elements from row i, column i to row j, column j of the topic relativity matrix.
Remark 20. The data processing device according to Remark 14, wherein the statement similarity computing module is further configured to calculate the similarity between any two statements with the cosine formula, based on the word frequencies of each statement.
Remark 21. The data processing device according to Remark 14, wherein the matching degree computing module is further configured to calculate the matching degree between each statement and the picture recognition result based on an edit distance algorithm.
Remark 22. The data processing device according to Remark 13, wherein the theme paragraph selection unit is further configured to sort the divided paragraphs based on the determined thematic values, and to select the output according to the predetermined condition.
Remark 23. The data processing device according to Remark 13, wherein the initialization unit is further configured to divide the web page into statements based on punctuation marks, and to keep the original statement order of the web page unchanged.
Remark 24. The data processing device according to Remark 13, wherein the picture recognition unit is configured to use optical character recognition (OCR) technology.

Claims (10)

1. A data processing method, comprising:
A picture recognition step of recognizing a picture to obtain a plurality of recognition result words, and generating one or more search terms from the plurality of recognition result words according to a particular combination form;
An initialization step of initializing the web pages retrieved using the search terms, to obtain a plurality of statements;
A topic relativity determining step of determining the topic relativity between the obtained statements;
A theme paragraph division step of dividing the plurality of statements into a plurality of paragraphs based on the determined topic relativity, and determining the thematic value of each paragraph; and
A theme paragraph selection step of selecting, based on the determined thematic value of each paragraph, the theme paragraphs that meet a predetermined condition from the plurality of paragraphs.
2. The data processing method according to claim 1, wherein the topic relativity determining step further comprises:
A statement similarity calculation sub-step of calculating the similarity between the statements;
A matching degree calculation sub-step of calculating the matching degree between each statement and the picture recognition result; and
A correlativity determination sub-step of determining the topic relativity between the statements based on the calculated similarity and matching degree.
3. The data processing method according to claim 2, wherein, in the correlativity determination sub-step, a topic relativity matrix is generated in the following manner, each element of the topic relativity matrix representing the topic relativity between two statements: for an element on the main diagonal of the matrix, the value of the element is determined based on the corresponding matching degree; for an element in the lower triangular part of the matrix, the value of the element is determined based on the elements adjacent to it and the similarity between the two statements to which it relates; and the topic relativity matrix is a symmetric matrix.
4. The data processing method according to claim 3, wherein, in the paragraph division step, based on the determined topic relativity matrix, a dynamic programming algorithm is used to determine the optimal substructure of the division of the matrix, and the paragraphs are divided according to the determined optimal substructure.
5. The data processing method according to claim 1, wherein, in the theme paragraph selection step, the divided paragraphs are sorted based on the determined thematic values, and the output is selected according to the predetermined condition.
6. A data processing device, comprising:
A picture recognition unit, configured to recognize a picture to obtain a plurality of recognition result words, and to generate one or more search terms from the plurality of recognition result words according to a particular combination form;
An initialization unit, configured to initialize the web pages retrieved using the search terms, to obtain a plurality of statements;
A topic relativity determining unit, configured to determine the topic relativity between the obtained statements;
A theme paragraph division unit, configured to divide the plurality of statements into a plurality of paragraphs based on the determined topic relativity, and to determine the thematic value of each paragraph; and
A theme paragraph selection unit, configured to select, based on the determined thematic value of each paragraph, the theme paragraphs that meet a predetermined condition from the plurality of paragraphs.
7. The data processing device according to claim 6, wherein the topic relativity determining unit further comprises:
A statement similarity computing module, configured to calculate the similarity between the statements;
A matching degree computing module, configured to calculate the matching degree between each statement and the picture recognition result; and
A correlativity determination module, configured to determine the topic relativity between the statements based on the calculated similarity and matching degree.
8. The data processing device according to claim 7, wherein the correlativity determination module is further configured to generate a topic relativity matrix in the following manner, each element of the topic relativity matrix representing the topic relativity between two statements: for an element on the main diagonal of the matrix, the value of the element is determined based on the corresponding matching degree; for an element in the lower triangular part of the matrix, the value of the element is determined based on the elements adjacent to it and the similarity between the two statements to which it relates; and the topic relativity matrix is a symmetric matrix.
9. The data processing device according to claim 8, wherein the paragraph division unit is further configured to use a dynamic programming algorithm, based on the determined topic relativity matrix, to determine the optimal substructure of the division of the matrix, and to divide the paragraphs according to the determined optimal substructure.
10. The data processing device according to claim 6, wherein the theme paragraph selection unit is further configured to sort the divided paragraphs based on the determined thematic values, and to select the output according to the predetermined condition.
CN201210358626.0A 2012-09-24 2012-09-24 Data processing method and data processing device Pending CN103678407A (en)


Publications (1)

Publication Number Publication Date
CN103678407A true CN103678407A (en) 2014-03-26

Family

ID=50315988


Country Status (1)

Country Link
CN (1) CN103678407A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751403A (en) * 2008-12-11 2010-06-23 易搜比控股公司 Method for transforming hypertext tag language file to text file
US20100161652A1 (en) * 2008-12-24 2010-06-24 Yahoo! Inc. Rapid iterative development of classifiers
CN101464903A (en) * 2009-01-09 2009-06-24 江阴明伦科技有限公司 OCR picture and text recognition and retrieval method and system through web mode
CN102214189A (en) * 2010-04-09 2011-10-12 腾讯科技(深圳)有限公司 Data mining-based word usage knowledge acquisition system and method
CN101944109A (en) * 2010-09-06 2011-01-12 华南理工大学 System and method for extracting picture abstract based on page partitioning

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016107190A1 (en) * 2014-12-30 2016-07-07 百度在线网络技术(北京)有限公司 Searching method and apparatus
US10296541B2 (en) 2014-12-30 2019-05-21 Baidu Online Network Technology (Beijing) Co., Ltd. Searching method and apparatus
CN110020421A (en) * 2018-01-10 2019-07-16 北京京东尚科信息技术有限公司 The session information method of abstracting and system of communication software, equipment and storage medium
CN109710939A (en) * 2018-12-28 2019-05-03 北京百度网讯科技有限公司 Method and apparatus for determining theme
CN109710939B (en) * 2018-12-28 2023-06-09 北京百度网讯科技有限公司 Method and device for determining theme


Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140326