CN111913930A - Species data analysis method, system and computer program product

Info

Publication number: CN111913930A
Application number: CN201910389715.3A
Authority: CN
Inventors: 田金山; 吕威廷; 刘任哲
Original assignee: Shanghai Microtek Technology Co Ltd
Current assignee: Shanghai Microtek Technology Co Ltd
Priority date: 2019-05-10
Filing date: 2019-05-10
Publication date: 2020-11-10

Abstract

The invention relates to a species data analysis method, a species data analysis system and a computer program product. The method comprises the following steps: an analysis unit recognizes a specimen record image to generate a corresponding text message and divides the text message into a plurality of word strings; when an operation unit determines that the word strings do not match a plurality of title fields of a species data table, it determines a plurality of weight values of the word strings with respect to the title fields; and the operation unit calculates the most relevant word string according to the weight values and writes it into the species data field adjacent to each title field in the species data table.

Description

Species data analysis method, system and computer program product

Technical Field

The present invention relates to a species data analysis method, system and computer program product, and more particularly, to a species data analysis method, system and computer program product suitable for various specimens.

Background

At present, the digital management of specimens of various kinds (for example, animals, plants, minerals, fossils or collection items) requires a user to read each specimen record manually and then enter and file the related data for subsequent statistical analysis, so these steps involve a great deal of manual work.

However, the format of specimen records may differ from one source to another; that is, specimens from different locations around the world may lack a uniform record format, so they have to be read manually one by one. This not only reduces work efficiency because of eye fatigue, but may also cause avoidable misreading of data when the operator lacks experience.

Disclosure of Invention

Some embodiments of the present invention provide a method, system and computer program product for analyzing species data.

The species data analysis method of an embodiment of the present invention includes: an analysis unit recognizes a specimen record image to generate a corresponding text message and divides the text message into a plurality of word strings; when an operation unit determines that the word strings do not match a plurality of title fields of a species data table, it determines a plurality of weight values of the word strings with respect to the title fields; and the operation unit calculates the most relevant word string according to the weight values and writes it into the species data field adjacent to each title field in the species data table.

In another embodiment of the present invention, the computer program product comprises a set of instructions capable of performing the species data analysis method according to any embodiment of the present invention when the set of instructions is loaded into and executed by a computer.

The species data analysis system of another embodiment of the present invention includes an analysis unit, an operation unit and a storage unit. The analysis unit recognizes a specimen record image to generate a corresponding text message and divides the text message into a plurality of word strings. The operation unit is electrically connected to the analysis unit. When the operation unit determines that the word strings do not match the title fields of the species data table, it determines a plurality of weight values of the word strings with respect to the title fields, calculates the most relevant word string according to the weight values, and writes it into the species data field adjacent to each title field in the species data table. The storage unit is electrically connected to the operation unit and stores the species data table.

The purpose, technical content, features and effects of the present invention will be more readily understood by the following detailed description of the embodiments with reference to the accompanying drawings.

Drawings

Fig. 1 is a schematic step diagram of a species data analysis method according to an embodiment of the invention.

Fig. 2 is a schematic diagram of an electronic device architecture of an embodiment of a species data analysis system implementing the species data analysis method of fig. 1.

Fig. 3 is a schematic step diagram of a species data analysis method according to an embodiment of the invention.

Description of the symbols

S0-S4 steps

1 electronic device

10 analysis unit

20 operation unit

30 storage unit

40 image capturing unit

50 communication unit

Detailed Description

The following detailed description of the various embodiments of the invention, taken in conjunction with the accompanying drawings, is provided by way of illustration. In the description of the specification, numerous specific details are set forth in order to provide a more thorough understanding of the invention; however, the present invention may be practiced without some or all of these specific details. The same or similar elements in the drawings will be denoted by the same or similar symbols. It is particularly noted that the drawings are merely schematic and do not represent actual sizes or quantities of elements, and that some of the details may not be fully drawn for clarity of the drawings.

Referring to Fig. 1 and Fig. 2 together, the species data analysis method of any embodiment of the present invention can be implemented as a computer program, so that the method is carried out when a computer (that is, any electronic device 1 having the analysis unit 10, the operation unit 20 and the storage unit 30, such as a server, a tablet computer or a smartphone) loads and executes the program.

The user provides the specimen record image corresponding to a specimen to the computer, which then performs automatic analysis and filing for digital management and application. In some embodiments, the term "species specimen" or "specimen" refers to a collected sample of an animal, plant, mineral, fossil or collection item, but is not limited thereto. A specimen is usually accompanied by a corresponding specimen record describing related information such as the collection date, location, biological or mineral classification and collector. Examples are the multi-title-field format data record of Table 1, the single-title-field format data record of Table 2 and the no-title-field format data record of Table 3 below.

Table 1: Multi-title-field format data record

Table 2: Single-title-field format data record

Table 3: No-title-field format data record

In this embodiment, the storage unit 30 is electrically connected to the operation unit 20 and stores a species data table serving as a preset standard format, as shown in Table 4 below. Meaningful species data obtained through the analysis process described later are filled into the species data fields, completing the species database for automatic filing and classification.

Table 4: Species data table

First, the analysis unit 10 can automatically recognize the text message that may be present in the specimen record image through, for example but not limited to, an optical character recognition (OCR) mechanism. However, the text message may mix numbers and text (hereinafter, characters) that carry different data meanings; for example, dates and longitudes and latitudes given as numbers are mixed with places and biological classifications as unsorted recorded data. In the past, the meaning of such data could only be understood correctly through manual interpretation, and the electronic device 1 cannot directly identify the correct meaning. For this reason, the analysis unit 10 divides the whole text message into a plurality of text blocks for subsequent recognition. For example, it divides the text message into a plurality of word strings according to at least one space or symbol (e.g., slash, comma, pause mark or semicolon) between words or numbers, and transmits the word strings to the operation unit 20 for data analysis. In brief, in step S1, the analysis unit 10 recognizes a specimen record image to generate a corresponding text message and divides the text message into a plurality of word strings.
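The splitting in step S1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `split_text_message`, the exact delimiter set and the sample message are hypothetical, and the rule that a slash between two digits is not treated as a delimiter is an assumption added here so that a date such as 1997/12/31 survives as a single word string.

```python
import re
from typing import List

# Delimiters named in the text: whitespace, comma, pause mark, semicolon, slash.
# Assumption of this sketch: a slash with a digit on both sides is kept, so
# that a date such as 1997/12/31 stays together as one word string.
_DELIMITERS = re.compile(r"(?:\s|[,;、，；]|(?<!\d)/|/(?!\d))+")


def split_text_message(text_message: str) -> List[str]:
    """Divide an OCR text message into word strings (step S1)."""
    return [s for s in _DELIMITERS.split(text_message) if s]


# Hypothetical OCR output of a specimen record.
message = "Date 1997/12/31, Tainan City; 22N58'00\" / 120E19'00\""
for word_string in split_text_message(message):
    print(word_string)
```

With the sample message, the sketch yields six word strings: the title string, the date kept intact, the two words of the place name, and the two coordinate strings.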

In addition, relying on the image-to-text mechanism alone easily produces garbled symbols because of machine failure or misrecognition, which in turn hampers and misleads subsequent data analysis. In one embodiment, therefore, the operation unit 20 filters out characters without grammatical significance from the text message. For example, the operation unit 20 may look up the word strings in a local database stored in the storage unit 30, or in a cloud database it connects to, and filter out characters or symbols that have no grammatical significance or that contain spelling or grammatical errors, such as meaningless symbols like @#$% or characters that do not conform to the spelling and grammar of common languages (Chinese, English, Latin), but not limited thereto. This filtering step helps improve the accuracy of the subsequent data analysis.
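A minimal Python sketch of such a filtering step is given below. It is not the database lookup of the embodiment: the function name `filter_meaningless` and the optional in-memory `vocabulary` set are hypothetical stand-ins for the local or cloud database, and the rule "drop strings with no letter or digit" is an assumed simplification of "no grammatical significance".

```python
from typing import Iterable, List, Optional, Set


def filter_meaningless(
    word_strings: Iterable[str],
    vocabulary: Optional[Set[str]] = None,
) -> List[str]:
    """Drop word strings that look like OCR noise.

    vocabulary is a hypothetical lowercase word set standing in for the
    local or cloud database; when given, purely alphabetic strings must
    appear in it to be kept.
    """
    kept = []
    for s in word_strings:
        if not any(ch.isalnum() for ch in s):
            continue  # pure symbol noise such as "@#$%"
        if vocabulary is not None and s.isalpha() and s.lower() not in vocabulary:
            continue  # alphabetic string not found in the lookup set
        kept.append(s)
    return kept


print(filter_meaningless(["1997/12/31", "@#$%", "Tainan", "xqzv"], {"tainan", "city"}))
```

Strings that mix digits and symbols, such as dates and coordinates, bypass the vocabulary check in this sketch because they are not purely alphabetic.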

Because Tables 1 to 3 are specimen records in different formats, existing image processing technology cannot handle data records of multiple formats at the same time. In this embodiment, the operation unit 20 first determines, through means such as string comparison and data lookup, whether the word strings of the text message include or match the title fields of the preset species data table shown in Table 4, so as to distinguish whether the data to be processed is in a multi-title-field format, a single-title-field format or a no-title-field format. If the result is that the text message does not match or cannot be mapped to the title fields, the text message may belong to the single-title-field format or the no-title-field format. In that case the operation unit 20 cannot directly and accurately extract the species data corresponding to each title field, so that data has to be screened out by an algorithm.

In this embodiment, in step S2, when the operation unit 20 determines that the word strings do not match the title fields of the preset species data table, it uses an algorithm to determine a plurality of weight values of the word strings with respect to the title fields, which makes it possible to judge the degree of association between each title field and each word string. Species data corresponding to a title field often follow a common regular format. For example, the regular format of "date" may be 10 characters made up of the digits of the Gregorian year, month and day together with a specific separator symbol, such as 1997/12/31 or 1997-12-31. The regular format of "place" may be address text with specific keywords, such as xx Street, xx District, xx City. The regular format of longitude and latitude may be English letters denoting a specific longitude or latitude combined with numbers and marks, such as xxxExx'xx", xxxWxx'xx", xxxNxx'xx" or xxxSxx'xx". The regular format of "classification" may be text with specific biological classification keywords, such as xx phylum, xx class or xx family, but is not limited thereto. Therefore, the operation unit 20 uses these regular formats, with their specific format rules or key features, to determine which title field each word string may correspond to and with what weight value. For example, a word string of numbers and symbols may be given a date weight of 90% and a location (latitude and longitude) weight of 80%.
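One way to picture the weight determination of step S2 is a small rule table of regular expressions. The sketch below is a hedged illustration only: the `RULES` table, the function name `weight_values`, the patterns and the percentage values are all assumptions chosen for the example and are not specified by the description.

```python
import re
from typing import Dict

# Hypothetical rule table: for each title field, patterns are ordered from
# strict to loose, and the first pattern that fully matches sets the weight
# (in percent). Patterns and weights are illustrative assumptions.
RULES = {
    "date": [
        (re.compile(r"\d{4}[/-]\d{1,2}[/-]\d{1,2}"), 100),   # e.g. 1997/12/31
        (re.compile(r"(?=.*\d)(?=.*[^\w\s]).*"), 40),         # digits plus a symbol
    ],
    "location": [
        (re.compile(r"\d{1,3}[NSEW]\d{1,2}'\d{1,2}\""), 100),  # e.g. 22N58'00"
        (re.compile(r".*(City|District|Street)"), 80),          # address keyword
    ],
}


def weight_values(word_string: str) -> Dict[str, int]:
    """Return a weight per title field for one word string (step S2)."""
    weights = {}
    for field, rules in RULES.items():
        weights[field] = 0
        for pattern, weight in rules:
            if pattern.fullmatch(word_string):
                weights[field] = weight
                break
    return weights


for s in ["1997/12/31", "22N58'00\"", "Tainan City"]:
    print(s, weight_values(s))
```

Ordering the rules from strict to loose lets a string that only partly resembles a format still receive a nonzero weight, which is what allows the later comparison between competing word strings.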

Next, in step S3, the operation unit 20 calculates, according to the weight values of the word strings, the word string most associated with a specific title field, and writes it into the species data field adjacent to that title field in the species data table. For example, the operation unit 20 determines that the date weight value of the word string 1997/12/31 is 100% and the date weight value of the word string 22N58'00" is 40%. It thereby recognizes that the word string most associated with the title field "date" is 1997/12/31 rather than 22N58'00", and writes 1997/12/31 into the species data field corresponding to and adjacent to the date title field in the species data table, as shown in Table 5 below. By analogy, using for example but not limited to regular formats with specific format rules or key features, the operation unit 20 assigns each word string a weight value for each title field and calculates the most suitable associated word string for each title field, thereby automatically obtaining the species data fields corresponding to title fields such as place, collector and classification, which are not described in detail here. That is, the operation unit 20 determines the weight value of each word string with respect to each title field according to the degree to which the word string matches the regular format preset for each species data field.

Table 5
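The selection in step S3 amounts to taking, for each title field, the word string with the highest weight value. The following Python sketch assumes the weights have already been computed and are held in a nested dictionary; the function name `most_relevant_per_field` and the data layout are hypothetical, and the location weights in the sample are made up for illustration (only the two date weights come from the example above).

```python
from typing import Dict


def most_relevant_per_field(weights: Dict[str, Dict[str, int]]) -> Dict[str, str]:
    """Pick the highest-weighted word string for each title field (step S3).

    weights maps title field -> {word string: weight in percent}.
    """
    species_data_table = {}
    for field, by_string in weights.items():
        if by_string:
            species_data_table[field] = max(by_string, key=by_string.get)
    return species_data_table


weights = {
    "date": {"1997/12/31": 100, "22N58'00\"": 40},
    "location": {"1997/12/31": 0, "22N58'00\"": 100},  # assumed values
}
print(most_relevant_per_field(weights))
```

In this sketch a tie is resolved in favor of the word string listed first, which is a property of Python's `max` rather than anything stated in the description.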

According to the above structure, the present disclosure uses an artificial intelligence calculation process to convert the image into unordered text, symbols and numeric information, first identifies the word strings with grammatical significance, then analyzes the word strings using the specific format rules or key features preset for each species data field, and finally rearranges the most relevant word strings and stores them in the species data table and the species database. This completes an automated digital integration workflow of automatic filing and classification, so that the user can go from collecting specimen record images, through digitizing the specimen records, to the final statistical analysis quickly and in one pass. The text data therefore no longer has to be classified and integrated through manual recognition, input and filing, which effectively prevents inefficiency and careless manual misreading and greatly improves practicality and convenience.

After related data for a large number of specimens have been obtained by the above analysis method and stored in the species data table, a huge species database is produced. A computer can then automatically perform various statistical analyses on this large body of data to produce statistical text and charts that academics, government agencies and other organizations can use as a reference for research and policy. The statistical analysis items include, for example but not limited to: combining the collection places of specimens with a map to present the geographic distribution of a species or item; plotting quantity against time (such as yearly or monthly intervals) to show the relationship between frequency of occurrence and time; presenting a graphical analysis of classification (e.g., species classification) proportions; compiling statistics on the relationship between altitude and time to present the distribution of species at different points in time; combining altitude with geographic information to present the altitude distribution in different areas; and combining multidimensional features such as species, place and time for big data analysis and data exploration, analyzing the dependencies among the features.
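As a small illustration of the quantity-versus-time statistic, the Python sketch below counts specimen records per collection year. The record layout (a dictionary with a `date` key in year/month/day form), the function name `specimens_per_year` and the sample records are assumptions made for this example only.

```python
from collections import Counter
from typing import Dict, Iterable


def specimens_per_year(records: Iterable[Dict[str, str]]) -> Counter:
    """Count specimen records by the four-digit year at the start of 'date'."""
    return Counter(r["date"][:4] for r in records if r.get("date"))


# Hypothetical rows of a filled species data table.
records = [
    {"date": "1997/12/31", "location": "Tainan City"},
    {"date": "1997/01/05", "location": "Tainan City"},
    {"date": "2001/06/30"},
    {"location": "Tainan City"},  # no date recorded, so it is skipped
]
print(specimens_per_year(records))
```

The resulting counts could feed a bar chart of occurrences per year; the same grouping idea extends to months, places or classifications.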

The species data analysis method of some derivative embodiments of the present invention is described below. Referring to Fig. 2 and Fig. 3 together, in this embodiment the electronic device 1 obtains a specimen record image in step S0. For example, the electronic device 1 optionally includes an image capturing unit 40 electrically connected to the operation unit 20. The image capturing unit 40 captures the specimen record image and transmits it to the analysis unit 10. For example, the electronic device 1 photographs the physical specimen together with its specimen record to digitize them, producing a specimen image (such as a photograph of a pineapple) that includes the specimen record (such as the contents of Table 1); in this case the electronic device 1 can be an image scanner, a camera device or a mobile device, but is not limited thereto. Alternatively, the electronic device 1 optionally includes a communication unit 50 electrically connected to the operation unit 20. For example, the communication unit 50 may be a wireless communication interface that establishes a connection with a remote device through a wireless communication protocol. The communication unit 50 receives specimen record images and transmits them to the analysis unit 10. That is, the electronic device 1 can receive one or more specimen record images sent from an external electronic device over a wired or wireless network, so that users in different places can capture specimen record images with different electronic devices and upload them to the electronic device 1 of a server or cloud system for subsequent image analysis, which speeds up the digitization work; in this case the electronic device 1 may be a server or a desktop computer, but is not limited thereto.

As mentioned above, specimen records have no unified format. When the operation unit 20 determines, through means such as string comparison and data lookup, that the word strings in the text message include or match the title fields of the preset species data table, the text message belongs to the multi-title-field format. If the relevant word strings in the text message can be mapped to the title fields shown in Table 4, even when the fields appear in a different order, the operation unit 20 can identify, from the comparison result, the title string corresponding to each title field in the species data table. Among the word strings of the text message, it treats the first word string that appears after a title string as the species data string and writes it into the species data field adjacent to that title field in the species data table. That is, in step S4, another word string (e.g., the species data string) that is adjacent in the text message to the word string corresponding to a title field (e.g., the title string) is written into the species data field adjacent to that title field in the species data table.
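The multi-title-field branch (step S4) can be sketched in Python as below. The function names, the exact-match comparison and the sample title fields and values are hypothetical; the description leaves open how "include or match" is decided, so plain string equality is an assumption of this sketch.

```python
from typing import Dict, Sequence


def matches_title_fields(word_strings: Sequence[str], title_fields: Sequence[str]) -> bool:
    """True when every title field appears among the word strings (assumed test)."""
    return all(title in word_strings for title in title_fields)


def fill_from_title_strings(word_strings: Sequence[str], title_fields: Sequence[str]) -> Dict[str, str]:
    """Take the word string right after each title string as its species data (step S4)."""
    table = {}
    for index, word_string in enumerate(word_strings[:-1]):
        if word_string in title_fields and word_string not in table:
            following = word_strings[index + 1]
            if following not in title_fields:  # skip a title with no value after it
                table[word_string] = following
    return table


# Hypothetical record whose fields come in a different order than the table.
strings = ["Location", "Tainan", "Date", "1997/12/31", "Collector", "Chen"]
titles = ["Date", "Location", "Collector"]
if matches_title_fields(strings, titles):
    print(fill_from_title_strings(strings, titles))
```

Because the lookup is keyed on the title string rather than on position, the order of the fields in the specimen record does not matter.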

In one embodiment, when the electronic device 1 determines the weight values of the word strings with respect to the title fields in step S2, the operation unit 20 establishes a candidate list comprising a plurality of candidate fields corresponding to the title fields of the species data table, then distributes the word strings to the candidate fields of the candidate list as candidates and assigns each word string its weight value. In at least one embodiment, the operation unit 20 assigns the weight values from high to low according to the order in which the word strings appear in the same candidate field. For example, the word string 1997/12/31 appears in the first row of the date candidate field with a weight value of 100%, while the word string 22N58'00" appears in the second row of the date candidate field with a weight value of 60%, as shown in Table 6 below, but the invention is not limited thereto. If the weight of the regular format is also taken into account, a word string appearing later in the order may be adjusted to have a higher weight value; for instance, in Table 6 the word string "three" corresponding to the collector field has a higher weight value than the word string "Tainan City". In other words, the variables that determine the weight values include, but are not limited to, the regular format and the order in which the word strings are arranged.

Table 6: Candidate list
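A candidate list with order-based weights might be built as in the following Python sketch. Everything specific here is an assumption for illustration: the function names, the `is_candidate` predicate that decides which word strings enter which candidate field, and the fixed decrement of 40 percentage points per row (chosen so that the second row receives 60%, as in the example above).

```python
from typing import Callable, Dict, List, Sequence, Tuple


def build_candidate_list(
    word_strings: Sequence[str],
    title_fields: Sequence[str],
    is_candidate: Callable[[str, str], bool],
    step: int = 40,
) -> Dict[str, List[Tuple[str, int]]]:
    """Distribute word strings to candidate fields and weight them by row order.

    The first row of a candidate field gets 100%, each later row `step`
    percentage points less (never below 0). Both rules are assumptions.
    """
    candidate_list = {}
    for field in title_fields:
        rows = [s for s in word_strings if is_candidate(s, field)]
        candidate_list[field] = [
            (s, max(0, 100 - step * row)) for row, s in enumerate(rows)
        ]
    return candidate_list


def is_candidate(word_string: str, field: str) -> bool:
    """Toy predicate: strings with digits go to 'date', the rest elsewhere."""
    has_digit = any(ch.isdigit() for ch in word_string)
    return has_digit if field == "date" else not has_digit


strings = ["1997/12/31", "22N58'00\"", "Tainan City"]
print(build_candidate_list(strings, ["date", "collector"], is_candidate))
```

A fuller version could combine this order-based weight with a format-based weight, which is how a later row could end up ranked above an earlier one.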

In one embodiment, when the electronic device 1 calculates the most relevant word string in step S3, the operation unit 20 selects the most relevant word string from among the word strings of the same candidate field according to their weight values. For example, within the candidate field corresponding to the date title field, the operation unit 20 selects the word string 1997/12/31, which has the highest weight value of 100%, and writes it into the corresponding species data field of the species data table. The calculation logic and operating principle for the other candidate fields are analogous and are not described in detail here.

In another embodiment, when the electronic device 1 calculates the most relevant word string in step S3, the operation unit 20 selects, from the weight values that the same word string has for different candidate fields, the candidate field with the highest weight value. For example, the word string 120E19'00" has a weight value of 60% for the date title field in the candidate list but a weight value of 100% for the place title field, so 120E19'00" is identified as the word string most associated with the place title field and is written into the corresponding species data field of the species data table. The calculation logic and operating principle for the other candidate fields are analogous and are not described in detail here.
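The per-string selection of this embodiment can be sketched in Python as follows. The function name `assign_by_best_field`, the data layout and the tie-breaking rule (when two word strings prefer the same field, the one with the higher weight is kept) are assumptions of this sketch; the description does not say how such conflicts are resolved, and the weights for 1997/12/31 in the sample are made up.

```python
from typing import Dict, Tuple


def assign_by_best_field(weights: Dict[str, Dict[str, int]]) -> Dict[str, str]:
    """For each word string, pick the candidate field where its weight is highest.

    weights maps word string -> {candidate field: weight in percent}.
    Returns title field -> word string.
    """
    best: Dict[str, Tuple[str, int]] = {}  # field -> (word string, weight)
    for word_string, by_field in weights.items():
        field = max(by_field, key=by_field.get)
        weight = by_field[field]
        # Assumed conflict rule: keep the higher-weighted string for a field.
        if field not in best or weight > best[field][1]:
            best[field] = (word_string, weight)
    return {field: word_string for field, (word_string, _) in best.items()}


weights = {
    "120E19'00\"": {"date": 60, "place": 100},
    "1997/12/31": {"date": 100, "place": 0},  # assumed values
}
print(assign_by_best_field(weights))
```

Compared with selecting per candidate field, this direction starts from each word string, so a string is never written into a field for which it has only its second-best weight.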

Referring to Fig. 2, a schematic diagram of the electronic device architecture of a species data analysis system according to an embodiment of the invention is shown. The species data analysis system may be any electronic device 1 comprising an analysis unit 10, an operation unit 20 and a storage unit 30, as described above. The analysis unit 10 may be an optical character recognition processor. The analysis unit 10 recognizes the specimen record image to generate a corresponding text message and divides it into a plurality of word strings; the related technical content and effects are as described above.

The operation unit 20 is electrically connected to the analysis unit 10. In one embodiment, the operation unit 20 may be implemented by one or more processing elements such as a microprocessor, microcontroller, digital signal processor, microcomputer, central processing unit, field programmable gate array, programmable logic device, state machine, logic circuitry, analog circuitry, digital circuitry, and/or any other type of processing element that manipulates signals (analog and/or digital) based on operational instructions. When the operation unit 20 determines that the word strings do not match the title fields of the species data table, it determines a plurality of weight values of the word strings with respect to the title fields; the related judgment mechanism, the weight value determination mechanisms and their derivative embodiments are as described above. The operation unit 20 calculates the most relevant word string according to the weight values and writes it into the species data field adjacent to each title field in the species data table; the related calculation mechanism, the mechanism for determining the most relevant word string, the mechanism for determining the species data string and their derivative embodiments are as described above.

The storage unit 30 is electrically connected to the operation unit 20. In one embodiment, the storage unit 30 may be implemented by one or more memories. The storage unit 30 stores the species data table, the species database into which one or more calculated data items have been written, and optionally the candidate list. These can be managed, queried and maintained by a user and used by a computer for statistical analysis of large amounts of data.

In some embodiments, the computer program product for analyzing species data is composed of a set of instructions, and the species data analyzing method of any of the above embodiments can be completed when the set of instructions is loaded into and executed by a computer.

In summary, some embodiments of the present invention provide a species data analysis method, system and computer program product that mainly use an analysis unit to extract, from a digitized image, a plurality of word strings that are meaningful as a specimen record, and use an operation unit to determine the weight value of each word string according to the preset rules and features of a species data table. Through this artificial intelligence calculation, the most relevant species data strings are automatically screened out and written into a species database, completing an automated digital integration workflow of automatic filing and classification, so that the user can go from collecting specimen record images, through digitizing the specimen records, to the final statistical analysis quickly and in one pass. The text data therefore no longer has to be classified and integrated through manual recognition, input and filing, which effectively prevents inefficiency and careless manual misreading and greatly improves practicality and convenience.

The above-mentioned embodiments are merely illustrative of the technical spirit and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the content of the present invention and to implement the same, so that the scope of the present invention should not be limited by the above-mentioned embodiments, and all equivalent changes and modifications made in the spirit of the present invention should be covered by the scope of the present invention.

Claims

1. A species data analysis method, comprising:

an analysis unit recognizing a specimen record image to generate a corresponding text message, and dividing the text message into a plurality of word strings;

an operation unit determining, when it judges that the plurality of word strings do not match a plurality of title fields of a species data table, a plurality of weight values of the plurality of word strings with respect to the plurality of title fields; and

the operation unit calculating the most relevant word string according to the plurality of weight values and writing the most relevant word string into a species data field adjacent to each title field in the species data table.

2. The species data analysis method of claim 1, wherein the step of dividing the text message into the plurality of word strings comprises: the operation unit filtering out characters without grammatical significance from the text message.

3. The species data analysis method of claim 1, wherein the step of determining the plurality of weight values of the plurality of word strings further comprises: determining the weight value of each word string with respect to each title field according to a degree of match obtained by comparing each word string with a regular format preset for the species data field.

4. The species data analysis method of claim 1, wherein the step of determining the plurality of weight values of the plurality of word strings with respect to the plurality of title fields comprises: the operation unit establishing a candidate list comprising a plurality of candidate fields corresponding to the plurality of title fields, distributing the plurality of word strings to the plurality of candidate fields, and assigning the plurality of weight values.

5. The species data analysis method of claim 4, wherein the step of determining the plurality of weight values of the plurality of word strings with respect to the plurality of title fields further comprises: assigning the plurality of weight values from high to low according to the order in which different word strings appear in the same candidate field.

6. The species data analysis method of claim 4, wherein the step of calculating the most relevant word string according to the plurality of weight values comprises: calculating the most relevant word string from the plurality of word strings of the same candidate field.

7. The species data analysis method of claim 4, wherein the step of calculating the most relevant word string according to the plurality of weight values comprises: selecting the candidate field with the highest weight value from among the plurality of weight values that the same word string has for the plurality of candidate fields.

8. The species data analysis method of claim 1, further comprising, before the step of recognizing the specimen record image:

a communication unit or an image capturing unit obtaining the specimen record image and transmitting it to the analysis unit.

9. The species data analysis method of claim 1, further comprising, after the step of dividing the text message into the plurality of word strings:

when the operation unit judges that the plurality of word strings match the plurality of title fields of the species data table, writing another word string that is adjacent in the text message to the word string corresponding to each title field into the species data field adjacent to that title field in the species data table.

10. A computer program product comprising a set of instructions which, when loaded into and executed by a computer, performs the species data analysis method of any one of claims 1 to 9.

11. A species data analysis system, comprising:

an analysis unit for recognizing a specimen record image to generate a corresponding text message and dividing the text message into a plurality of word strings;

an operation unit, electrically connected to the analysis unit, for determining, when the plurality of word strings do not match a plurality of title fields of a species data table, a plurality of weight values of the plurality of word strings with respect to the plurality of title fields, calculating the most relevant word string according to the plurality of weight values, and writing the most relevant word string into a species data field adjacent to each title field in the species data table; and

a storage unit, electrically connected to the operation unit, for storing the species data table.

12. The species data analysis system of claim 11, wherein the operation unit filters out characters without grammatical significance from the text message.

13. The species data analysis system of claim 11, wherein the operation unit determines the weight value of each word string with respect to each title field according to a degree of match obtained by comparing each word string with a regular format preset for the species data field.

14. The species data analysis system of claim 11, wherein the operation unit establishes a candidate list comprising a plurality of candidate fields corresponding to the plurality of title fields, distributes the plurality of word strings to the plurality of candidate fields, and assigns the plurality of weight values.

15. The species data analysis system of claim 14, wherein the operation unit assigns the plurality of weight values from high to low according to the order in which different word strings appear in the same candidate field.

16. The species data analysis system of claim 14, wherein the operation unit calculates the most relevant word string from the plurality of word strings of the same candidate field.

17. The species data analysis system of claim 14, wherein the operation unit selects the candidate field with the highest weight value from among the plurality of weight values that the same word string has for the plurality of candidate fields.

18. The species data analysis system of claim 11, further comprising:

a communication unit, electrically connected to the operation unit, for receiving the specimen record image and transmitting it to the analysis unit.

19. The species data analysis system of claim 11, further comprising:

an image capturing unit, electrically connected to the operation unit, for capturing the specimen record image and transmitting it to the analysis unit.

20. The species data analysis system of claim 11, wherein, when the operation unit judges that the plurality of word strings match the plurality of title fields of the species data table, the operation unit writes another word string that is adjacent in the text message to the word string corresponding to each title field into the species data field adjacent to that title field in the species data table.