CN110489570A - Candidate whole-network reference real-time update platform and system - Google Patents

Candidate whole-network reference real-time update platform and system

Info

Publication number
CN110489570A
CN110489570A CN201910722763.XA CN201910722763A CN110489570A CN 110489570 A CN110489570 A CN 110489570A CN 201910722763 A CN201910722763 A CN 201910722763A CN 110489570 A CN110489570 A CN 110489570A
Authority
CN
China
Prior art keywords
character string
image
recognition unit
candidate
bibliography
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910722763.XA
Other languages
Chinese (zh)
Inventor
欧峥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Super Intellectual Property Consultant (Beijing) Co.,Ltd.
Original Assignee
Beijing Ruyou Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruyou Education Technology Co Ltd
Priority to CN201910722763.XA
Publication of CN110489570A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2379 Updates performed during online database operations; commit processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/382 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using citations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/28 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/287 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of Kanji, Hiragana or Katakana characters

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)

Abstract

The present invention relates to a candidate whole-network reference real-time update platform system. The system comprises: a web page screen-capture device, which captures the web page screen that the user is editing to obtain a web page screen-capture image; a text box detection device, which identifies, based on text box imaging features, the image regions in the screen-capture image where the individual text boxes are located; an OCR recognition device, which performs OCR on each image region to obtain the corresponding character strings; a character string sorting device, which sorts the character strings of all image regions together by occurrence count and takes a preset number of the most frequent character strings as the latest keywords; and a search update device, which resets the search for candidate whole-network references based on the latest keywords. The invention allows the search keywords to be updated in real time as the text is edited.

Description

Candidate whole-network reference real-time update platform and system
Technical Field
The present invention relates to the field of paper editing, and in particular to a candidate whole-network reference real-time update platform and system.
Background
Source material is the foundation of thesis writing. Once the topic has been chosen and the necessary design, observation, and experiments have been carried out, collecting and processing source material is the next step in preparing to write.
Source material for a thesis falls into two classes: first-hand material and second-hand material. The former, also called primary or direct material, is what the author learns by personally taking part in an investigation, study, or observation; records made during experiments or observations belong to this class. The latter, also called secondary or indirect material, is information from literature on the relevant discipline or topic, accumulated mainly through day-to-day study. Once enough material has been gathered, it must be processed so that it becomes systematic and organized and is easy to apply. Both classes are essential to thesis writing. When applying them, the author should distinguish the primary from the secondary; literature in particular should be quoted appropriately, only after it has been fully digested, and should not overshadow the author's own work. First-hand material must be used truthfully, accurately, and without error.
In the current era of information explosion, collecting source material purely by hand is very inefficient. The whole-network literature is therefore generally searched using keywords entered by the user. However, keywords chosen subjectively by the user are inherently imprecise and cannot accurately and comprehensively reflect the actual content of the paper the user is currently editing.
Summary of the Invention
To solve the above problem, the present invention provides a candidate whole-network reference real-time update platform. The platform automatically selects the keywords used to search for candidate whole-network references from the ranking of the occurrence counts of the character strings on the web page screen the user is editing, which makes the extraction of search keywords more intelligent. More importantly, the weight applied when counting the occurrences of each character string is determined by where the string appears, so that character strings appearing in tables or formulas are favored when the keywords are determined.
According to one aspect of the present invention, a candidate whole-network reference real-time update platform is provided, the platform comprising:
A web page screen-capture device, arranged in the terminal that runs the web page, for capturing the web page screen that the user is editing to obtain a web page screen-capture image;
A text box detection device, connected to the web page screen-capture device, for receiving the web page screen-capture image and identifying, based on text box imaging features, the image regions in the screen-capture image where the individual text boxes are located;
An OCR recognition device, connected to the text box detection device, for performing OCR on each received image region to obtain the corresponding character strings;
A character string sorting device, connected to the OCR recognition device, for sorting the character strings of all image regions together by occurrence count and taking a preset number of the most frequent character strings as the latest keywords;
A search update device, connected to the character string sorting device, for resetting the search for candidate whole-network references based on the received latest keywords, so as to obtain multiple documents corresponding to the multiple reference journal articles needed for editing the paper;
Wherein the OCR recognition device further comprises an OCR recognition unit, a table recognition unit, and a formula recognition unit, the OCR recognition unit being used to perform OCR on each received image region to obtain the corresponding character strings;
Wherein the table recognition unit is connected to the OCR recognition unit and is used to determine whether each character string obtained by the OCR recognition unit lies within a table in its image region, and to apply a different multiplier to the occurrence count of that character string based on the result;
Wherein the formula recognition unit is connected to the OCR recognition unit and is used to determine whether each character string obtained by the OCR recognition unit lies within a formula in its image region, and to apply a different multiplier to the occurrence count of that character string based on the result.
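The behavior of the search update device can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class name `SearchUpdater`, the injected `search_fn` callable, and the choice to re-run the search only when the keyword list actually changes are all hypothetical and are not specified by the patent text.

```python
from typing import Callable, List, Sequence, Tuple


class SearchUpdater:
    """Hypothetical stand-in for the search update device.

    Assumption: the search is reset only when the latest keywords differ
    from the keywords used by the current search.
    """

    def __init__(self, search_fn: Callable[[Sequence[str]], List[str]]):
        self._search_fn = search_fn          # hypothetical search backend
        self._keywords: Tuple[str, ...] = () # keywords of the current search
        self.results: List[str] = []         # documents from the last search

    def update(self, latest_keywords: Sequence[str]) -> bool:
        """Reset the search if the keywords changed; return True if it was reset."""
        new_keywords = tuple(latest_keywords)
        if new_keywords == self._keywords:
            return False
        self._keywords = new_keywords
        self.results = self._search_fn(new_keywords)
        return True
```

Injecting the search function keeps the sketch independent of any particular literature database; a real system would pass in whatever whole-network search client it uses.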
According to another aspect of the present invention, a candidate whole-network reference real-time update system is also provided, characterized in that the system comprises a memory and a processor, the processor being connected to the memory; the memory is used to store instructions executable by the processor; and the processor is used to call the executable instructions in the memory so as to carry out, using the candidate whole-network reference real-time update platform described above, a method for updating in real time, according to the state of text editing, the keywords used to search for candidate whole-network references.
The standard format of a reference can be summarized briefly by reference type as follows:
The type of a reference (that is, the citation source) is identified by a single letter, specifically:
M: monograph; C: collection of papers; N: newspaper article; J: journal article; D: dissertation; R: report. References that belong to none of these types are identified by the letter "Z".
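The single-letter type codes listed above can be captured in a small lookup table. The following Python sketch is illustrative only; the English key names and the function name `reference_type_code` are assumptions chosen for this example, while the letters themselves come from the text.

```python
# Single-letter reference type codes as listed in the text.
# The English key names are assumptions made for this sketch.
REFERENCE_TYPE_CODES = {
    "monograph": "M",
    "collection of papers": "C",
    "newspaper article": "N",
    "journal article": "J",
    "dissertation": "D",
    "report": "R",
}


def reference_type_code(kind: str) -> str:
    """Return the type letter for a reference kind, or "Z" for any other kind."""
    return REFERENCE_TYPE_CODES.get(kind.strip().lower(), "Z")
```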
For English references, the following two points should also be noted:
1. Author names follow the "surname first, given name last" principle. The specific format is: surname, then the initials of the given names. For example, Malcolm Richard Cowley should be written as: Cowley, M.R. If there are two authors, the first author is written in the same way, while for the second author the initial of the given name is placed first and the surname after it. For example, Frank Norris and Irving Gordon should be written as: Norris, F. & I. Gordon.
2. Book titles and newspaper or periodical titles are set in italics, for example: Mastering English Literature, English Weekly.
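The author-name rule for English references can be sketched in Python as follows. This is a minimal illustration, not a full citation formatter: the function names are hypothetical, and the sketch assumes that the surname is the last whitespace-separated word of the name, which does not hold for multi-word surnames.

```python
from typing import List, Optional, Tuple


def _split_name(full_name: str) -> Tuple[List[str], str]:
    """Split a name into (given names, surname); assumes the surname is the last word."""
    parts = full_name.split()
    if not parts:
        raise ValueError("empty author name")
    return parts[:-1], parts[-1]


def _initials(given_names: List[str]) -> str:
    """Join the initials of the given names, e.g. ['Malcolm', 'Richard'] -> 'M.R.'."""
    return "".join(f"{name[0].upper()}." for name in given_names)


def format_first_author(full_name: str) -> str:
    """First author: surname first, then initials."""
    given, surname = _split_name(full_name)
    return f"{surname}, {_initials(given)}" if given else surname


def format_second_author(full_name: str) -> str:
    """Second author: initials first, then surname."""
    given, surname = _split_name(full_name)
    return f"{_initials(given)} {surname}" if given else surname


def format_authors(first: str, second: Optional[str] = None) -> str:
    """Format one or two authors according to the rule described in the text."""
    text = format_first_author(first)
    if second is not None:
        text += f" & {format_second_author(second)}"
    return text
```

With these definitions, `format_first_author("Malcolm Richard Cowley")` yields `Cowley, M.R.` and `format_authors("Frank Norris", "Irving Gordon")` yields `Norris, F. & I. Gordon`, matching the two examples in the text apart from the closing full stop.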
The present invention has at least the following two key inventive points:
(1) The image regions where the individual text boxes are located in the web page screen-capture image are identified, character strings are extracted from each image region separately, and the keywords used to search for candidate whole-network references are selected automatically from the ranking of the occurrence counts of the character strings of all image regions, which makes the extraction of search keywords more intelligent;
(2) The weight applied when counting the occurrences of each character string is determined by where the string appears, so that character strings appearing in tables or formulas are favored when the keywords are determined.
Brief Description of the Drawings
Embodiments of the present invention are described below with reference to the accompanying drawings, in which:
Fig. 1 is a structural block diagram of the candidate whole-network reference real-time update platform according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the interface showing the search results for candidate whole-network references obtained by the candidate whole-network reference real-time update platform according to an embodiment of the present invention.
Detailed Description
Embodiments of the candidate whole-network reference real-time update platform and the corresponding system of the present invention are described in detail below with reference to the accompanying drawings.
In the prior art, a user writing a paper usually needs to search the whole-network literature to obtain multiple references to consult and cite. The search is generally carried out using keywords entered by the user. However, keywords chosen subjectively by the user are inherently imprecise and cannot accurately and comprehensively reflect the actual content of the paper the user is currently editing.
To overcome this deficiency, the present invention provides a candidate whole-network reference real-time update platform and a corresponding system that can effectively solve the corresponding technical problem.
Fig. 1 is a structural block diagram of the candidate whole-network reference real-time update platform according to an embodiment of the present invention. The platform comprises:
A web page screen-capture device, arranged in the terminal that runs the web page, for capturing the web page screen that the user is editing to obtain a web page screen-capture image;
A text box detection device, connected to the web page screen-capture device, for receiving the web page screen-capture image and identifying, based on text box imaging features, the image regions in the screen-capture image where the individual text boxes are located;
An OCR recognition device, connected to the text box detection device, for performing OCR on each received image region to obtain the corresponding character strings;
A character string sorting device, connected to the OCR recognition device, for sorting the character strings of all image regions together by occurrence count and taking a preset number of the most frequent character strings as the latest keywords;
A search update device, connected to the character string sorting device, for resetting the search for candidate whole-network references based on the received latest keywords, so as to obtain multiple documents corresponding to the multiple reference journal articles needed for editing the paper;
Wherein the OCR recognition device further comprises an OCR recognition unit, a table recognition unit, and a formula recognition unit, the OCR recognition unit being used to perform OCR on each received image region to obtain the corresponding character strings;
Wherein the table recognition unit is connected to the OCR recognition unit and is used to determine whether each character string obtained by the OCR recognition unit lies within a table in its image region, and to apply a different multiplier to the occurrence count of that character string based on the result;
Wherein the formula recognition unit is connected to the OCR recognition unit and is used to determine whether each character string obtained by the OCR recognition unit lies within a formula in its image region, and to apply a different multiplier to the occurrence count of that character string based on the result.
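The data flow through the devices listed above (screen capture, text box detection, OCR, unified sorting by occurrence count) can be sketched in Python as follows. This is a simplified illustration under several assumptions: `detect_text_boxes` and `run_ocr` are hypothetical callables standing in for the detection and OCR devices, the function names are invented for this example, and the table and formula weighting is deliberately left out so that only the unified ranking step is shown.

```python
from collections import Counter
from typing import Any, Callable, Iterable, List


def latest_keywords(strings_per_region: Iterable[Iterable[str]],
                    preset_count: int) -> List[str]:
    """Count the strings of all regions together and keep the most frequent ones."""
    counts: Counter = Counter()
    for strings in strings_per_region:
        counts.update(strings)  # unified count across every image region
    return [text for text, _ in counts.most_common(preset_count)]


def refresh_keywords(screenshot: Any,
                     detect_text_boxes: Callable[[Any], Iterable[Any]],
                     run_ocr: Callable[[Any], List[str]],
                     preset_count: int = 3) -> List[str]:
    """Hypothetical end-to-end step: screenshot -> regions -> strings -> keywords."""
    regions = detect_text_boxes(screenshot)             # one region per text box
    strings_per_region = (run_ocr(region) for region in regions)
    return latest_keywords(strings_per_region, preset_count)
```

Passing the detector and the OCR engine in as parameters is a design choice of this sketch; it lets the ranking logic be exercised with simple stand-ins, without committing to a particular image-processing library.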
Next, the specific structure of the candidate whole-network reference real-time update platform of the present invention is described further.
In the candidate whole-network reference real-time update platform:
In the table recognition unit, determining whether each character string obtained by the OCR recognition unit lies within a table in its image region, and applying a different multiplier to the occurrence count of that character string based on the result, comprises: when the character string obtained by the OCR recognition unit is determined to lie within a table in its image region, increasing the occurrence count of the character string by N, where N is a natural number greater than 1.
In the candidate whole-network reference real-time update platform:
In the formula recognition unit, determining whether each character string obtained by the OCR recognition unit lies within a formula in its image region, and applying a different multiplier to the occurrence count of that character string based on the result, comprises: when the character string obtained by the OCR recognition unit is determined to lie within a formula in its image region, increasing the occurrence count of the character string by M, where M is a natural number greater than 1.
In the candidate whole-network reference real-time update platform:
In the table recognition unit, determining whether each character string obtained by the OCR recognition unit lies within a table in its image region, and applying a different multiplier to the occurrence count of that character string based on the result, comprises: when the character string obtained by the OCR recognition unit is determined not to lie within a table in its image region, increasing the occurrence count of the character string by 1.
In the candidate whole-network reference real-time update platform:
In the formula recognition unit, determining whether each character string obtained by the OCR recognition unit lies within a formula in its image region, and applying a different multiplier to the occurrence count of that character string based on the result, comprises: when the character string obtained by the OCR recognition unit is determined not to lie within a formula in its image region, increasing the occurrence count of the character string by 1.
In the candidate whole-network reference real-time update platform:
In the OCR recognition device, M is greater than N.
In the candidate whole-network reference real-time update platform:
In the OCR recognition device, M is 4 and N is 2.
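The position-dependent counting described above can be sketched in Python as follows, using the stated values M = 4 and N = 2. The sketch rests on assumptions the text does not settle: each occurrence contributes a single increment, a string that lies in both a formula and a table takes the larger formula weight, and the `Occurrence` record and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List

TABLE_WEIGHT_N = 2    # N in the text: increment for a string inside a table
FORMULA_WEIGHT_M = 4  # M in the text: increment for a string inside a formula (M > N)


@dataclass(frozen=True)
class Occurrence:
    """One recognized string plus where it was found (hypothetical record type)."""
    text: str
    in_table: bool = False
    in_formula: bool = False


def occurrence_weight(occ: Occurrence) -> int:
    """Assumption: one increment per occurrence; formula takes priority over table."""
    if occ.in_formula:
        return FORMULA_WEIGHT_M
    if occ.in_table:
        return TABLE_WEIGHT_N
    return 1


def weighted_counts(occurrences: Iterable[Occurrence]) -> Counter:
    """Accumulate position-weighted occurrence counts per string."""
    counts: Counter = Counter()
    for occ in occurrences:
        counts[occ.text] += occurrence_weight(occ)
    return counts


def top_keywords(occurrences: Iterable[Occurrence], preset_count: int) -> List[str]:
    """Return the preset number of strings with the highest weighted counts."""
    return [text for text, _ in weighted_counts(occurrences).most_common(preset_count)]
```

Under these assumptions, a term that appears once in a formula (weight 4) outranks a term that appears three times in plain text (weight 3), which is the tilt toward table and formula content that the text describes.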
In the candidate whole-network reference real-time update platform:
In the OCR recognition device, the OCR recognition unit, the table recognition unit, and the formula recognition unit are each implemented with an ASIC chip of a different model.
The candidate whole-network reference real-time update platform may further comprise:
An instant display device, connected to the character string sorting device, for displaying in real time the result of sorting the character strings of all image regions together by occurrence count.
In addition, to overcome the above deficiency, the present invention also provides a candidate whole-network reference real-time update system. The system comprises a memory and a processor, the processor being connected to the memory;
Wherein the memory is used to store instructions executable by the processor;
Wherein the processor is used to call the executable instructions in the memory so as to carry out, using the candidate whole-network reference real-time update platform described above, a method for updating in real time, according to the state of text editing, the keywords used to search for candidate whole-network references.
Fig. 2 is a schematic diagram of the interface showing the search results for candidate whole-network references obtained by the candidate whole-network reference real-time update platform according to an embodiment of the present invention.
As shown in Fig. 2, after the search update device resets the search for candidate whole-network references based on the received latest keywords, multiple documents corresponding to the multiple reference journal articles needed for editing the paper are obtained, and the relevant information for these documents is displayed on the interface of Fig. 2;
Wherein, in Fig. 2, resetting the search for candidate whole-network references based on the received latest keywords returned a total of 1329298 documents. The result information for these 1329298 documents is displayed and, because of space limits, is shown in Fig. 2 as a paged list;
Wherein the first page lists two journal articles. The first is titled "Analysis of Computer Network Security", its author is Liu Zhiqiang, and its source is "Heilongjiang Science and Technology Information", issue 31 of 2011. The second is titled "Exploration and Analysis of Computer Network Security", its author is Wu Hailiang, and its source is "Sci-tech Pioneering Monthly", issue 13 of 2011.
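The paged display of a large result set can be illustrated with a short Python sketch. The text does not state how many results appear on each page, so the page size used in the usage note is purely hypothetical, as are the function names.

```python
from typing import Tuple


def page_count(total_results: int, page_size: int) -> int:
    """Number of pages needed to show every result (integer ceiling division)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return (total_results + page_size - 1) // page_size


def page_bounds(page: int, page_size: int, total_results: int) -> Tuple[int, int]:
    """Return the 0-based [start, end) result indexes for a 1-based page number."""
    if not 1 <= page <= max(page_count(total_results, page_size), 1):
        raise ValueError("page out of range")
    start = (page - 1) * page_size
    return start, min(start + page_size, total_results)
```

For instance, with an assumed page size of 20, the 1329298 documents mentioned in the text would occupy 66465 pages, and the last page would hold the remaining 18 results.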
In addition, OCR (Optical Character Recognition) refers to the process in which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate the shapes into computer text. That is, for printed characters, the text in a paper document is converted optically into a black-and-white dot-matrix image file, and recognition software converts the text in the image into a text format that word-processing software can further edit and process. How to remove errors or use auxiliary information to improve recognition accuracy is the most important topic in OCR, and the term ICR (Intelligent Character Recognition) arose as a result. The main indicators for measuring the performance of an OCR system are: rejection rate, misrecognition rate, recognition speed, user-interface friendliness, product stability, ease of use, and feasibility.
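Two of the indicators named above, rejection rate and misrecognition rate, can be computed as in the following Python sketch. The text gives no formulas, so the definitions here are an assumption: both rates use the total number of characters as the denominator, which is one common convention, and the function name `ocr_rates` is hypothetical.

```python
from typing import Dict


def ocr_rates(total_chars: int, rejected: int, misrecognized: int) -> Dict[str, float]:
    """Compute OCR quality rates.

    Assumed definitions (not given in the text):
      rejection rate      = characters the system declined to classify / total
      misrecognition rate = characters given a wrong label / total
      accuracy            = remaining characters / total
    """
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    if rejected < 0 or misrecognized < 0 or rejected + misrecognized > total_chars:
        raise ValueError("inconsistent character counts")
    correct = total_chars - rejected - misrecognized
    return {
        "rejection_rate": rejected / total_chars,
        "misrecognition_rate": misrecognized / total_chars,
        "accuracy": correct / total_chars,
    }
```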
The concept of OCR was first proposed in 1929 by the German scientist Tausheck; later the American scientist Handel also proposed the idea of using technology to recognize text. The earliest research on recognizing printed Chinese characters was done by Casey and Nagy of IBM, who in 1966 published the first article on Chinese character recognition, using template matching to recognize 1000 printed Chinese characters.
As early as the 1960s and 1970s, countries around the world began research on OCR. In the early stage, research focused mostly on recognition methods, and the only characters recognized were the digits 0 to 9. Japan, which also uses ideographic characters, began studying the basic recognition theory of OCR around 1960, initially with digits as the object. Between 1965 and 1970 some simple products began to appear, such as postal code recognition systems for printed characters, which recognized the postal codes on mail and helped post offices sort letters by region; for this reason, writing the postal code has remained the addressing practice promoted by countries to this day.
In the early 1970s, Japanese scholars began research on Chinese character recognition and did a great deal of work. China started research on OCR technology relatively late: research on recognizing digits, English letters, and symbols began only in the 1970s, and research on Chinese character recognition began at the end of the 1970s. In 1986, China launched the "863" high-technology program and research on Chinese character recognition entered a substantive stage. Professor Ding Xiaoqing of Tsinghua University and the Chinese Academy of Sciences each carried out research and development and successively released Chinese OCR products, which are now China's leading Chinese character OCR technology. Early OCR software failed to meet practical requirements because of factors such as recognition rate and productization. At the same time, hardware was expensive and slow, so it had not reached a practical level either. Only a few departments, such as intelligence departments and news and publishing organizations, used OCR software. After the 1990s, the widespread use of flatbed scanners and the spread of information automation and office automation in China greatly promoted the further development of OCR technology, so that the recognition accuracy and speed of OCR came to meet the requirements of users.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented by software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art can be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or some of the steps of the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
The storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the invention; those of ordinary skill in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of the invention.

Claims (10)

1. a kind of candidate's the whole network bibliography real-time update platform characterized by comprising
Webpage grabs screen equipment, is arranged in the terminal of operation webpage, and the Web-page screen for being edited to user carries out Screen operation is grabbed, grabs screen image to obtain webpage;
Text box detection device is grabbed screen equipment with the webpage and is connect, and grabs screen image for receiving the webpage, and be based on text Frame imaging features identify that webpage grabs each image-region where each text box difference in screen image;
OCR identifies equipment, connect with the text box detection device, for carrying out respectively to each image-region received OCR is identified to obtain corresponding multiple character strings;
Character string sorting equipment connect with OCR identification equipment, for by multiple character strings of each image-region uniformly into The sequence of row frequency of occurrence order, using each character string of the most preset quantity of frequency of occurrence as latest keywords;
More new equipment is searched for, is connect with the character string sorting equipment, for based on each latest keywords resetting received The search of candidate the whole network bibliography, it is multiple with reference to the corresponding multiple documents of periodical needed for editing paper to obtain;
Wherein, the OCR identification equipment further includes OCR recognition unit, Table recognition unit and formulas solutions unit, the OCR Recognition unit is used to carry out each image-region received respectively OCR identification to obtain corresponding multiple character strings;
Wherein, the Table recognition unit is connect with the OCR recognition unit, for determining what the OCR recognition unit obtained Where whether each character string is located at it within the scope of table of image-region, and the character string is authorized based on definitive result The different multiples of frequency of occurrence;
Wherein, the formulas solutions unit is connect with the OCR recognition unit, for determining what the OCR recognition unit obtained Where whether each character string is located at it within the scope of formula of image-region, and the character string is authorized based on definitive result The different multiples of frequency of occurrence.
2. The candidate whole-network bibliography real-time update platform according to claim 1, characterized in that:
In the table recognition unit, determining whether each character string obtained by the OCR recognition unit lies within the table range of the image region in which it is located, and assigning a different multiple to the occurrence count of the character string based on the determination result, comprises: when a character string obtained by the OCR recognition unit is determined to lie within the table range of its image region, increasing the occurrence count of the character string by N, where N is a natural number greater than 1.
3. The candidate whole-network bibliography real-time update platform according to claim 2, characterized in that:
In the formula recognition unit, determining whether each character string obtained by the OCR recognition unit lies within the formula range of the image region in which it is located, and assigning a different multiple to the occurrence count of the character string based on the determination result, comprises: when a character string obtained by the OCR recognition unit is determined to lie within the formula range of its image region, increasing the occurrence count of the character string by M, where M is a natural number greater than 1.
4. The candidate whole-network bibliography real-time update platform according to claim 3, characterized in that:
In the table recognition unit, determining whether each character string obtained by the OCR recognition unit lies within the table range of the image region in which it is located, and assigning a different multiple to the occurrence count of the character string based on the determination result, comprises: when a character string obtained by the OCR recognition unit is determined not to lie within the table range of its image region, increasing the occurrence count of the character string by 1.
5. The candidate whole-network bibliography real-time update platform according to claim 4, characterized in that:
In the formula recognition unit, determining whether each character string obtained by the OCR recognition unit lies within the formula range of the image region in which it is located, and assigning a different multiple to the occurrence count of the character string based on the determination result, comprises: when a character string obtained by the OCR recognition unit is determined not to lie within the formula range of its image region, increasing the occurrence count of the character string by 1.
6. The candidate whole-network bibliography real-time update platform according to claim 5, characterized in that:
In the OCR identification equipment, M is greater than N.
7. The candidate whole-network bibliography real-time update platform according to claim 6, characterized in that:
In the OCR identification equipment, M is 4 and N is 2.
8. The candidate whole-network bibliography real-time update platform according to claim 7, characterized in that:
In the OCR identification equipment, the OCR recognition unit, the table recognition unit and the formula recognition unit are each implemented with an ASIC chip of a different model.
9. The candidate whole-network bibliography real-time update platform according to claim 8, characterized in that the platform further comprises:
Instant display equipment, connected to the character string sorting equipment, configured to display in real time the result of the unified ranking, by occurrence count, of the plurality of character strings of each image region.
10. A candidate whole-network bibliography real-time update system, characterized in that the system comprises: a memory and a processor, the processor being connected to the memory;
The memory is configured to store instructions executable by the processor;
The processor is configured to call the executable instructions in the memory so as to implement a method of using the candidate whole-network bibliography real-time update platform according to any one of claims 1-9 to update in real time, according to the text editing situation, the keywords used for searching candidate whole-network bibliography.
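The weighting scheme of claims 2 to 7 and the ranking of claim 9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation described in the application, which claim 8 says is realized with ASIC chips. The record type RecognizedString, its field names, and the functions occurrence_weight and rank_strings are hypothetical. The sketch also assumes that each recognized string receives a single increment (M inside a formula, N inside a table, 1 otherwise) and that the formula weight takes precedence when both flags are set; the claims do not spell out how the two units combine.

```python
from collections import Counter
from dataclasses import dataclass

# Values taken from claim 7: N = 2 for tables, M = 4 for formulas.
TABLE_WEIGHT_N = 2
FORMULA_WEIGHT_M = 4


@dataclass(frozen=True)
class RecognizedString:
    """Hypothetical record for one OCR result (field names are illustrative)."""
    text: str
    in_table: bool = False
    in_formula: bool = False


def occurrence_weight(item: RecognizedString) -> int:
    """Return the increment applied to the string's occurrence count.

    Assumption: formula membership wins over table membership when both hold.
    """
    if item.in_formula:
        return FORMULA_WEIGHT_M
    if item.in_table:
        return TABLE_WEIGHT_N
    return 1


def rank_strings(items: list[RecognizedString]) -> list[tuple[str, int]]:
    """Accumulate weighted counts and sort them in descending order.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts: Counter[str] = Counter()
    for item in items:
        counts[item.text] += occurrence_weight(item)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


sample = [
    RecognizedString("entropy"),
    RecognizedString("entropy", in_table=True),
    RecognizedString("gradient", in_formula=True),
    RecognizedString("dataset"),
]
ranking = rank_strings(sample)
# ranking == [("gradient", 4), ("entropy", 3), ("dataset", 1)]
```

With these weights a single appearance inside a formula outranks one plain-text plus one table appearance of another string, which matches the requirement of claim 6 that M be greater than N.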
CN201910722763.XA 2019-08-06 2019-08-06 Candidate the whole network bibliography real-time update platform and system Pending CN110489570A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910722763.XA CN110489570A (en) 2019-08-06 2019-08-06 Candidate the whole network bibliography real-time update platform and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910722763.XA CN110489570A (en) 2019-08-06 2019-08-06 Candidate the whole network bibliography real-time update platform and system

Publications (1)

Publication Number Publication Date
CN110489570A true CN110489570A (en) 2019-11-22

Family

ID=68549576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910722763.XA Pending CN110489570A (en) 2019-08-06 2019-08-06 Candidate the whole network bibliography real-time update platform and system

Country Status (1)

Country Link
CN (1) CN110489570A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111880697A (en) * 2020-08-07 2020-11-03 北京搜狗科技发展有限公司 Encyclopedic data processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102542273A (en) * 2011-12-02 2012-07-04 方正国际软件有限公司 Detection method and system for complex formula areas in document image
CN102591475A (en) * 2011-12-29 2012-07-18 北京百度网讯科技有限公司 Content input method and system for online editor
CN103559310A (en) * 2013-11-18 2014-02-05 广东利为网络科技有限公司 Method for extracting key word from article
CN104615640A (en) * 2014-11-28 2015-05-13 百度在线网络技术(北京)有限公司 Method and device for providing searching keywords and carrying out searching
CN105264486A (en) * 2012-12-18 2016-01-20 汤姆森路透社全球资源公司 Mobile-enabled systems and processes for intelligent research platform
CN109144954A (en) * 2018-09-18 2019-01-04 天津字节跳动科技有限公司 Edit resource recommendation method, device and the electronic equipment of document

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
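Issue 03: "Extraction and Analysis of Formulas and Text in Document Images with Complex Layouts", China Excellent Master's Theses Full-text Database, Information Science and Technology Series (《中国优秀硕士学位论文全文数据库 信息科技辑》) *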

Similar Documents

Publication Publication Date Title
CN102053991B (en) Method and system for multi-language document retrieval
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
JP2012529108A (en) Lighting system and language detection
CN109344914A (en) A kind of method and system of the Text region of random length end to end
US8208726B2 (en) Method and system for optical character recognition using image clustering
CN109062792A (en) A kind of Open Source Code detection method based on String matching and characteristic matching
CN102591475A (en) Content input method and system for online editor
Isheawy et al. Optical character recognition (ocr) system
CN107562843B (en) News hot phrase extraction method based on title high-frequency segmentation
Valy et al. A new khmer palm leaf manuscript dataset for document analysis and recognition: Sleukrith set
CN108197119A (en) The archives of paper quality digitizing solution of knowledge based collection of illustrative plates
US10970489B2 (en) System for real-time expression of semantic mind map, and operation method therefor
CN110209759B (en) Method and device for automatically identifying page
CN109074355B (en) Method and medium for ideographic character analysis
Fischer et al. Handwritten historical document analysis, recognition, and retrieval-state of the art and future trends
Shapira et al. Massive multi-document summarization of product reviews with weak supervision
CN110489570A (en) Candidate the whole network bibliography real-time update platform and system
CN112464907A (en) Document processing system and method
Ohta et al. CRF-based bibliography extraction from reference strings focusing on various token granularities
CN100444194C (en) Automatic extraction device, method and program of essay title and correlation information
CN107562932A (en) The academic reference of books data in literature acquisition method of Chinese
Karambelkar et al. Automated Text Extraction from Images using Optical Character Recognition.
CN113722421A (en) Contract auditing method and system and computer readable storage medium
JP2010092108A (en) Similar sentence extraction program, method, and apparatus
CN105335416A (en) Content extraction method, content extraction apparatus and content extraction system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211130

Address after: 1501-1, floor 15, No. 19, Chegongzhuang West Road, Haidian District, Beijing 100048

Applicant after: Super Intellectual Property Consultant (Beijing) Co.,Ltd.

Address before: 12a-3-110, block D, 12 / F, No. 28, information road, Haidian District, Beijing 100085

Applicant before: Beijing Ruyou Education Technology Co.,Ltd.

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191122
