CN116185209A - Processing, data splitting and merging and coding and decoding processing method for handwriting input characters - Google Patents

Publication number
CN116185209A
Authority
CN
China
Prior art keywords
data
character
metadata
code
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310088220.3A
Other languages
Chinese (zh)
Inventor
张锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]

Abstract

A method is provided for processing handwritten input characters, for splitting and merging data, and for encoding and decoding. It comprises an object-based open codec solution that can encode and decode any data object in any encoding manner, freely and openly, and an object-based data splitting/merging method that splits or strips the metadata and/or encoded data of a data object from the corresponding data content so as to protect the data content. These solutions may be implemented separately or in combination, or combined with applications in other technical fields.

Description

Processing, data splitting and merging and coding and decoding processing method for handwriting input characters
The present application is a divisional application of the invention patent application No. 2015800427616, filed on August 11, 2015 and entitled "Processing method for handwriting input characters, data splitting and merging, and encoding and decoding processing".
Technical Field
The invention relates to a data processing technology, in particular to a processing method for handwriting input characters, data splitting and merging and coding and decoding.
Background
With the development of computers, the variety of coding techniques keeps growing, and coding, as a foundation of computing, is widely applied to data transmission, storage and processing.
Character codes are the most basic codes: humans use them to input, view, edit and modify data, and computers use them for analysis and processing. From the early ASCII character code standard to today's Unicode, standardized character codes are one basis for transferring information between people and machines and among various systems. However, as a tool for recording human output, existing standardized character codes are far from adequate. With the popularization of computers and the development of human-computer interaction technology, standard character codes and the corresponding character input methods have gradually become a bottleneck between natural human output and the digital world.
Building on standard character codes, a series of general-purpose and special-purpose coding methods have been developed that express structured data/documents and domain-specific data by means such as markup, control characters and escaping. We call these text codes, and the corresponding data formats text formats. XML/SGML use a general tree structure formed by tags to describe complex structures, and JSON describes complex objects using JavaScript syntax; special-purpose markup formats include HTML for describing web pages and the XML-based MathML for mathematical expressions and SVG for vector graphics; CSV expresses tabular data; RTF, Markdown and the like represent formatted documents; and most programming languages also use text formats. Coding based on standard characters lets humans take part in creating, inspecting, debugging and modifying data, eases integration and exchange among different systems, speeds up system development and reduces the cost of fault maintenance. On the other hand, text formats are redundant for symbolized data and binary data, and as the structures a system must express become more complex, the markup and grammar built on text codes become far more complex and the data redundancy grows. In addition, because a given character coding standard has a limited number of codes, conflicts between data content and grammar marks are unavoidable, and character escaping adds further redundancy.
The world inside a computer is a world of numbers, and binary data is its natural form of data representation. Text-format data defined by people can be converted into binary data, which reduces redundancy and improves processing and transmission efficiency. There are also general-purpose binary coding methods, such as ASN.1 of the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU), Google's Protocol Buffers, Apache Thrift and Avro, BSON and MessagePack. Compared with text-based coding, however, binary data is relatively closed, harder to exchange and harder for humans to work with.
Encoding, whether text or binary, serves two purposes. One is to describe the data object itself, also known as serialization, which is referred to herein as the content encoding of the data object. The coding standards and methods mentioned above are mainly used for content encoding.
The other purpose of encoding is to describe the address of, or a reference to, a data object, which is referred to herein as the reference encoding of the data object. Text-based reference codes include URNs, URLs and the object identifiers (OIDs) of ASN.1; binary-based reference codes include database keys, UUIDs/GUIDs, IP addresses, MAC addresses, and MD5 and SHA-1 digests; there are even graphics-based reference codes such as one-dimensional barcodes and two-dimensional codes (which are in fact converted to text or binary codes by recognition).
Existing reference codes have two main problems. The first is that they hinder integration and exchange: different fields use different coding standards, which, given the development of the Internet and the Internet of Things, makes it hard to reference objects from different fields in a unified way. The second is coding efficiency: as the world becomes more interconnected, massive numbers of digital objects are online at all times. Although codes such as UUID (16 bytes) and SHA-1 (20 bytes) are in theory sufficient to give them unified reference codes, transmitting, processing and storing such massive numbers of reference codes itself consumes a large amount of resources and causes unnecessary waste.
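The storage cost mentioned above can be made concrete with a small calculation. The following is a minimal sketch; the figure of one billion references is a hypothetical workload chosen only for illustration and does not come from the application.

```python
import hashlib
import uuid

# Sizes of two binary reference codes named in the text.
uuid_size = len(uuid.uuid4().bytes)               # a UUID is 128 bits
sha1_size = hashlib.sha1(b"example").digest_size  # a SHA-1 digest is 160 bits

# Hypothetical workload (an assumption): one billion object references.
n_refs = 1_000_000_000
uuid_total_gb = uuid_size * n_refs / 1e9
sha1_total_gb = sha1_size * n_refs / 1e9

print(f"UUID:  {uuid_size} bytes each, {uuid_total_gb:.0f} GB in total")
print(f"SHA-1: {sha1_size} bytes each, {sha1_total_gb:.0f} GB in total")
```

Under this assumption the references alone occupy tens of gigabytes before any data content is stored, which is the kind of overhead the passage describes.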
Disclosure of Invention
A first aspect of the present invention provides a method for processing a handwriting input character, including:
acquiring strokes input by a user and corresponding input information in a first target row/column which is currently activated; wherein the input information includes an input position of the stroke in the first target row/column;
for each stroke, a new character is created for the stroke or a character to which the stroke belongs is determined according to the input position of the stroke in the first target row/column or the input position of the stroke in the first target row/column and the character specified in the first target row/column.
The technical effects of the first aspect of the invention are as follows. The method allows characters to be entered as they are formed: the user does not need an explicit or implicit command marking the start or end of a single character in order to separate characters, so there is no need to pause after each character or to interact with the system while writing, and writing is smoother and more efficient. In addition, the method determines the character to which a stroke belongs directly from the input position of the stroke, without recognizing standard characters, so the personalized information, writing style and characteristics of the user's handwriting are preserved.
A second aspect of the present invention provides a data splitting method, including:
when a storage request carrying a data identifier to be stored is received, acquiring metadata in a data object corresponding to the data identifier to be stored according to a preset metadata stripping protocol, and stripping the acquired metadata from the data object;
dividing the data content into at least two data fragments according to a preset data content splitting protocol.
The technical effects of the second aspect of the invention are: the data splitting method is provided, metadata in the original data of the user are separated from the data content, the data content is divided into a plurality of data fragments, the difficulty of illegally acquiring the original data of the user is increased, and the safety of data storage is more reliably realized.
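The two steps of the second aspect can be sketched as follows. This is only an illustration under stated assumptions: the `DataObject` type, the function names and the fixed-size splitting rule are hypothetical stand-ins for the "preset metadata stripping protocol" and "preset data content splitting protocol", which the text does not specify here.

```python
from dataclasses import dataclass


@dataclass
class DataObject:
    metadata: dict   # e.g. file name and type (hypothetical fields)
    content: bytes   # the data content itself


def strip_metadata(obj: DataObject) -> tuple[dict, bytes]:
    """Separate the metadata from the data content (assumed stripping protocol)."""
    return dict(obj.metadata), obj.content


def split_content(content: bytes, fragment_size: int) -> list[bytes]:
    """Split the content into fixed-size fragments (assumed splitting protocol)."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    return [content[i:i + fragment_size]
            for i in range(0, len(content), fragment_size)]


obj = DataObject({"name": "note.txt", "type": "text/plain"}, b"handwritten note body")
metadata, content = strip_metadata(obj)
fragments = split_content(content, fragment_size=8)
print(metadata, fragments)
```

In this sketch the metadata and the fragments could then be stored in different places, so that no single store holds a usable copy of the original object.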
A third aspect of the present invention provides a data merging method, including:
receiving a data object acquisition request carrying identification information; the identification information comprises positioning information, and the positioning information is used for positioning a storage address of partial data information in the data object;
acquiring storage content corresponding to the positioning information, and acquiring data information in other storage contents according to the acquired positioning information in the storage content until all data information of the data object is acquired;
and combining the acquired data information according to a preset combination protocol in the acquired data information to obtain the data object.
The technical effects of the third aspect of the present invention are: according to the data merging method, the data information which is split and stored in each storage body is obtained through gradual positioning according to the positioning information contained in the identification information in the data object obtaining request, so that each data information is merged according to a preset merging protocol to obtain the original data object of the user, the data which is scattered in each storage body can be obtained efficiently and safely, and the reliability of successful merging of the scattered data into the original data by the user is guaranteed.
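The step-by-step positioning described for the third aspect can be sketched as a chain walk. The storage layout below (each stored item carrying a piece of data plus the address of the next item) and the use of plain concatenation as the merging protocol are assumptions made for illustration only.

```python
# Hypothetical storage: address -> stored content. Each stored content holds a
# piece of data information plus positioning information for the next piece.
storage = {
    "a1": {"data": b"hand", "next": "b7"},
    "b7": {"data": b"written ", "next": "c3"},
    "c3": {"data": b"note", "next": None},
}


def merge_data_object(store: dict, first_address: str) -> bytes:
    """Follow positioning information step by step, then merge the pieces in order."""
    pieces = []
    visited = set()
    address = first_address
    while address is not None:
        if address in visited:
            raise ValueError(f"cycle in positioning information at {address!r}")
        visited.add(address)
        record = store[address]
        pieces.append(record["data"])
        address = record["next"]
    # Assumed merging protocol: concatenate the pieces in retrieval order.
    return b"".join(pieces)


merged = merge_data_object(storage, "a1")
print(merged)
```

Only the first address has to be carried in the acquisition request; every later address is discovered from content that has already been retrieved.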
A fourth aspect of the present invention provides an encoding processing method, including:
acquiring a data object to be encoded and its metadata according to a received encoding processing request;
and acquiring the object code of the data object according to a code repository, the data object and its metadata.
The technical effects of the fourth aspect of the present invention are: the data object to be encoded and its metadata are acquired according to the received encoding processing request, and the object code of the data object is acquired according to the code repository, the data object and its metadata.
A fifth aspect of the present invention provides a decoding processing method, including:
receiving a decoding processing request, and acquiring an object code to be decoded according to the decoding processing request;
disassembling the object code to obtain a meta code, or the meta code and an instance code;
querying a code repository, and acquiring the corresponding metadata and encoding specification according to the meta code;
and acquiring the data object corresponding to the object code according to the metadata and the encoding specification, or according to the metadata, the encoding specification and the instance code.
The technical effects of the fifth aspect of the present invention are: a decoding processing request is received and the object code to be decoded is acquired according to it; the object code is disassembled to obtain a meta code, or the meta code and an instance code; the code repository is queried and the corresponding metadata and encoding specification are acquired according to the meta code; and the data object corresponding to the object code is acquired according to the metadata and the encoding specification, or according to the metadata, the encoding specification and the instance code.
Drawings
FIG. 1A is a flowchart of an embodiment of a method for processing handwritten input characters according to the present invention;
FIG. 1B is a first schematic diagram of a character in an embodiment of a method for processing handwritten input characters according to the present invention;
FIG. 1C is a second schematic diagram of a character in an embodiment of a method for processing handwritten input characters according to the present invention;
FIG. 1D is a schematic diagram of two adjacent lines activated simultaneously in an embodiment of a method for processing a handwritten input character according to the present invention;
FIG. 1E is a schematic diagram of a handwriting input character processing method according to an embodiment of the present invention when characters are inserted;
FIG. 1F is a schematic diagram of an edit mode under a selection processing command in an embodiment of a method for processing handwritten input characters according to the present invention;
FIG. 1G is a schematic diagram of a blank character in an embodiment of a method for processing a handwritten input character according to the present invention;
FIG. 1H is a flowchart of Chinese editing in an embodiment of a method for processing handwritten input characters according to the present invention;
FIG. 1I is a flowchart of a method for converting source codes of a handwriting program in an embodiment of a method for processing handwriting input characters according to the present invention;
FIG. 1J is a detailed flowchart of the "Standard code conversion for B" method for converting the source code of the handwriting program shown in FIG. 1I;
FIG. 1K is a schematic diagram of a handwriting program in an embodiment of a method for processing handwriting input characters according to the present invention;
FIG. 1L is a schematic diagram of a handwriting input character processing device according to an embodiment of the present invention;
FIG. 2A is a flow chart illustrating a method of data splitting according to an exemplary embodiment;
FIG. 2B-1 is a flow chart illustrating a method of data splitting according to another exemplary embodiment;
FIG. 2B-2 is a system architecture diagram for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-3 is a time-domain analysis diagram for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-4 is a phonetic text coding chart for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-5 is a diagram of one presentation of the phonetic text for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-6 is a diagram of another presentation of the phonetic text for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-7 is a diagram of still another presentation of the phonetic text for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-8 is a diagram of yet another presentation of the phonetic text for a data splitting method of the present invention in which the data object is audio data;
FIG. 2C is a diagram illustrating a data splitting method according to the present invention in a computer system hierarchy;
FIG. 2D is a flow chart illustrating a method of data merge according to an example embodiment;
FIG. 2E is a flow chart illustrating a method of data merge according to another exemplary embodiment;
FIG. 2F is a schematic diagram of a data splitting apparatus according to an exemplary embodiment;
FIG. 2G is a schematic diagram illustrating a structure of a data splitting apparatus according to another exemplary embodiment;
FIG. 2H is a schematic diagram of a data merge device according to an exemplary embodiment;
FIG. 2I is a schematic diagram illustrating a structure of a data merging device according to another exemplary embodiment;
FIG. 2J is an exemplary data splitting flow diagram;
FIG. 2K is another exemplary data splitting flow diagram;
FIG. 2L is an exemplary data merge flow diagram;
FIG. 2M is a diagram of an exemplary data splitting description language definition;
FIG. 2N is an exemplary data splitting description language visualization flowchart;
FIG. 2O is a diagram of the associations among the three concepts of the present invention;
FIG. 3 is a schematic diagram of a prior art meta-model;
FIG. 4 is a schematic diagram of an encoding system according to the present invention;
FIG. 5A is a flowchart illustrating an embodiment of a coding method according to the present invention;
FIG. 5B is a flowchart of one embodiment of step 102C of FIG. 5A;
FIG. 6 is a diagram of the relationship among data objects, metadata, encoding specifications and encoding meta-objects;
FIG. 7 is a schematic diagram of the core coding element model;
FIG. 8 is a conceptual model of object encoding, meta-encoding and instance encoding (i.e., the object reference encoding with the meta-encoded portion removed), together with data objects and encoding meta-objects;
FIG. 9 is a diagram showing an example of meta-coding in the present embodiment;
FIG. 10 is an exemplary diagram of an example of layer-by-layer correlation of similar code meta-objects (variable length coding of 16-bit word length);
FIG. 11 is a diagram of a metamodel of a corresponding code;
FIG. 12 is a conceptual model diagram of the object encoding;
FIG. 13 is a flowchart illustrating a second embodiment of a coding method according to the present invention;
FIG. 14 is a flowchart of a third embodiment of a coding method according to the present invention;
FIG. 15 is a schematic diagram of a font corresponding to a nonstandard character code stored in a code repository in the handwriting input system according to the embodiment;
FIG. 16 is a core conceptual diagram of an exemplary context-dependent object encoding system encoding metamodel;
FIG. 17 is a schematic diagram of a base object that may be applied to a base encoding space;
FIG. 18 is a schematic diagram of a coding scheme for a 128 fixed length coding scheme;
FIG. 19 is a diagram of four binary bits being four space bits;
FIG. 20 is an exemplary diagram of a coding scheme;
FIG. 21 is an exemplary diagram of a coding scheme of UTF-8;
FIG. 22 is a schematic diagram of object encoding constructed by meta-encoding and instance encoding;
FIG. 23 is a detail view of the encoding;
FIG. 24 is a rendering result diagram;
FIG. 25 is a schematic diagram of the encoding points of OTF-8 other than UTF-8;
FIG. 26 is a diagram of a code to be defined;
FIG. 27 is a flowchart of a fourth embodiment of a coding method according to the present invention;
FIG. 28 is a corresponding code metamodel update diagram;
FIG. 29 is a schematic diagram of a coding assembly;
FIG. 30 is a flowchart of a fifth embodiment of a coding method according to the present invention;
FIG. 31 is a handwriting input program;
FIG. 32 is a flowchart of a first embodiment of a decoding processing method according to the present invention;
FIG. 33 is a flowchart of a second embodiment of a decoding processing method according to the present invention;
FIG. 34 is a flowchart of a third embodiment of a decoding processing method according to the present invention;
FIG. 35 is a flowchart of a fourth embodiment of a decoding processing method according to the present invention;
FIG. 36 is a diagram of the content of a handwriting input;
FIG. 37 is a schematic diagram of visualizing the length of a character spacing;
FIG. 38 is a schematic diagram of a decoding process;
FIG. 39 is an exemplary diagram of a hybrid coded content display;
FIG. 40 is a schematic diagram of the output content;
FIG. 41 is a schematic diagram of a handwritten stroke falling over the results of a character output;
FIG. 42 is a schematic illustration of the addition of a standard smiley face icon;
FIG. 43 is a schematic diagram of an online Go game;
FIG. 44 is a schematic diagram of a first embodiment of an encoding processing system according to the present invention;
FIG. 45 is a schematic diagram of a decoding processing system according to a first embodiment of the present invention;
FIG. 46 is a schematic diagram of a word processing system architecture based primarily on an object-based encoding system;
FIG. 47 is a schematic diagram of an architecture deployed within an application;
FIG. 48 is a schematic diagram of an architecture for terminal deployment;
FIG. 49 is a schematic diagram of an architecture for a mobile external device deployment;
FIG. 50 is a schematic diagram of an architecture in which applications share the same code repository;
FIG. 51 is an example diagram of a network deployment in which the code repository is deployed on a private cloud or an internal server;
FIG. 52 is a schematic diagram of an architecture for point-to-point deployment;
FIG. 53 is a schematic architecture diagram of a hybrid deployment;
FIG. 54 is a diagram of an architecture that extends an operating system to allow legacy applications to support object coding;
FIG. 55 is an interactive schematic diagram of an object coding system and an application system according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention. In the drawings or the specification, like or similar elements are denoted by the same reference numerals.
The background of the invention is described first. With the development of the Internet and mobile computing, cloud storage systems and related applications have emerged. A cloud storage system stores a user's data on servers in the cloud, so the user can access the data at any time from different terminal devices without migrating it between terminal systems. Users also no longer need to keep upgrading their storage devices, because cloud storage services provide enough scalability to cope with varied storage requirements. Traditional data maintenance work, such as backup and encryption, is likewise moved to the cloud storage server, where it is carried out more professionally and efficiently. In addition, because cloud storage is reliable and always online, data usage modes that differ from traditional applications have appeared, such as data sharing and network collaboration. These greatly improve the efficiency of data transfer between people and between application systems.
Applications based on cloud storage systems are diverse, and the most prominent terminal application is the desktop agent. A desktop agent is a file-system-based cloud storage client. It synchronizes a specific folder on the terminal with the cloud storage: files saved in the folder are automatically uploaded to the server by the agent, and files uploaded from elsewhere and received by the server are automatically downloaded to the corresponding folder. In this way, the files of one user are automatically synchronized across different terminals, and the user can use the data in the folder seamlessly across platforms in the conventional manner. The desktop agent can also automatically synchronize a shared folder to the terminals of different users, which makes data sharing and collaboration convenient. Dropbox is a typical desktop agent. Microsoft's OneDrive (formerly SkyDrive), Google Drive, Baidu Netdisk, Kingsoft Kuaipan and others also provide desktop agents for cloud storage. Besides desktop agents, there are many other terminal applications based on cloud storage and cross-device use.
Cloud storage systems make data access and sharing convenient and efficient. However, storing data in the cloud raises an unavoidable concern: the protection of security and privacy, since the confidentiality of core data depends entirely on the cloud storage system. For this reason, many organizations and individuals would rather not place their data, or at least their critical data, in a cloud storage system. There are two main hidden dangers. First, data in cloud storage is protected by user identity authentication; once a user's identity is compromised, all of that user's cloud data is exposed to the attacker. Second, the security of cloud storage rests on complete trust in the cloud storage service provider, and this foundation is not firm. The foundations of existing computer security technology are weak, security vulnerabilities keep emerging in all kinds of systems, and a malicious attacker can easily launch an attack on an online service. Among the major data leakage incidents of recent years, cloud storage providers have been among the affected parties.
The present invention is directed to data processing methods, systems and applications that provide effective solutions to the problems identified above. Specifically, it comprises innovations in the following three aspects: (1) a novel handwriting input method and system, in particular a method for splitting handwritten input into characters; (2) an object-based open codec solution that can encode and decode any data object in any encoding manner, freely and openly; and (3) an object-based data splitting/merging method that splits or strips the metadata and/or encoded data of a data object from the corresponding data content so as to protect the data content. These solutions can be implemented separately or in combination, or combined with applications in other technical fields. The invention has broad application prospects and great application value. The specific schemes are as follows:
The invention provides an encoding method based on a data object, which comprises the following steps:
a) Extracting metadata from the data object, and/or parsing the data object and creating or generating corresponding metadata for the data object;
b) Selecting or creating an encoding specification for the data object in dependence upon at least a portion of the metadata to describe the data object in encoded form;
c) An object code is generated or returned for the data object in accordance with the code specification.
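Steps a) to c) can be sketched as follows. This is a minimal illustration, not the encoding defined by the application: the repository contents, the one-byte meta code and the layout "meta code followed by instance code" are all assumptions introduced here.

```python
# Hypothetical code repository: maps metadata to a one-byte meta code and an
# encoding specification (here reduced to the name of a text codec).
CODE_REPOSITORY = {
    ("text", "utf-8"): {"meta_code": 0x01, "codec": "utf-8"},
    ("text", "utf-16-le"): {"meta_code": 0x02, "codec": "utf-16-le"},
}


def extract_metadata(data_object: str, preferred_codec: str = "utf-8") -> tuple[str, str]:
    """Step a): create metadata for the data object (assumed: type and codec)."""
    return ("text", preferred_codec)


def encode_object(data_object: str, preferred_codec: str = "utf-8") -> bytes:
    metadata = extract_metadata(data_object, preferred_codec)
    entry = CODE_REPOSITORY[metadata]                   # step b): select a specification
    instance_code = data_object.encode(entry["codec"])  # describe the object in encoded form
    return bytes([entry["meta_code"]]) + instance_code  # step c): return the object code


object_code = encode_object("字")
print(object_code.hex())
```

Keeping the meta code as a separate prefix means the same instance bytes can be interpreted under different specifications simply by looking up a different repository entry.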
Further, on the basis of the scheme of the data object-based encoding method, the step of generating the object code in step c) includes: generating a meta code and/or an instance code for the data object according to a predetermined rule, and generating the object code from the meta code and/or the instance code.
Further, on the basis of the scheme of the data object-based encoding method, the method further comprises a step of compressing and/or encrypting the data object before step a), and a step of encrypting the generated object code after step c).
Further, on the basis of the scheme of the data object-based encoding method, the meta code includes one of the following codes, or a combination and/or nesting of two or more of them: a spatial code, a context code, a type code.
Further, on the basis of the scheme of the data object-based encoding method, before the step a), the method further includes: a data splitting step of splitting a large data object into small data blocks (or referred to as data fragments) according to a predetermined rule, and performing steps a) to c) on each split data block during or after the data splitting process until the encoding of all the data blocks is completed.
The invention also provides a decoding method based on the data object, which comprises the following steps:
a) Obtaining an object code;
b) Disassembling the object code to obtain a meta code and/or an instance code;
c) Acquiring the corresponding encoding metadata and/or encoding specification according to the disassembled meta code;
d) Restoring the original data object according to the encoding metadata and/or encoding specification, and the instance code.
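Steps a) to d) can be sketched as follows, assuming for illustration that an object code consists of a one-byte meta code followed by the instance code. That layout, the repository contents and the function names are hypothetical and are not taken from the application.

```python
# Hypothetical code repository, indexed by meta code for decoding.
DECODE_REPOSITORY = {
    0x01: {"metadata": ("text", "utf-8"), "codec": "utf-8"},
    0x02: {"metadata": ("text", "utf-16-le"), "codec": "utf-16-le"},
}


def disassemble(object_code: bytes) -> tuple[int, bytes]:
    """Step b): split the object code into a meta code and an instance code."""
    if not object_code:
        raise ValueError("empty object code")
    return object_code[0], object_code[1:]


def decode_object(object_code: bytes) -> str:
    meta_code, instance_code = disassemble(object_code)
    entry = DECODE_REPOSITORY[meta_code]         # step c): look up the specification
    return instance_code.decode(entry["codec"])  # step d): restore the data object


decoded_utf8 = decode_object(b"\x01hi")
decoded_utf16 = decode_object(b"\x02" + "A".encode("utf-16-le"))
print(decoded_utf8, decoded_utf16)
```

A decoder that lacks the repository entry for a given meta code cannot interpret the instance code, which is one way to read the role of the code repository in the text.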
Further, on the basis of the scheme of the data object-based decoding method, the step of disassembling the object code in step b) includes: disassembling the object code into a meta code and/or an instance code according to the predetermined rule used during encoding.
Further, on the basis of the scheme of the data object-based decoding method, the method further comprises, before step a) and/or before step b), an authorization verification step for obtaining the object code and/or the predetermined rule used during encoding.
Further, on the basis of the scheme of the data object-based decoding method, if compression and/or encryption were used in the encoding process, the corresponding decompression and/or decryption is required in the decoding process.
The invention also provides a method for splitting the handwriting input characters, which comprises the following steps:
a) Receiving input of a user by taking a currently activated target row/column as constraint, and at least recording an input position of each stroke in the current row/column;
b) The relevance or association between each stroke and other strokes and/or characters is judged by comparing each stroke with all or part of the strokes and/or characters in the current row/column one by one, if a stroke is not associated with any character or stroke, a new character is created for the stroke, otherwise, the stroke is attributed to one or more characters with the highest relevance or association.
Further, on the basis of the above scheme based on the handwriting input character splitting method, the step c) is performed in one of the following cases: 1) during the writing of the current stroke, 2) or after the current stroke is input (i.e. after pen lifting), 3) or after the current line is input.
Further, on the basis of the scheme based on the handwriting input character splitting method, after the current stroke is input, the current stroke is compared with strokes and/or characters in a preset range one by one.
Further, on the basis of the above scheme based on the handwriting input character splitting method, the step c) includes:
Judging whether the currently input stroke is the first stroke in space or the last stroke in space in the row/column in the current input state;
creating a new character for a currently entered stroke if the stroke is the first stroke in space in the row/column and is not associated with other characters (or strokes) that have been entered in the current row/column, or if the stroke is the last stroke in space in the row/column and is not associated with other characters (or strokes) that have been entered in the current row/column; if the current stroke is neither the first or last stroke in space in the row/column, the current stroke is compared to the spacing between all characters that have been entered and the currently entered stroke is attributed to the associated character or characters (or strokes).
Further, on the basis of the scheme based on the handwriting input character splitting method, in the step c), a threshold value (min_gap) of the minimum distance between strokes and characters or strokes is preset, and the distance between each stroke and other characters or strokes which are already input is compared with the threshold value, so that the relevance between the stroke and other characters or strokes is judged.
Further, on the basis of the scheme based on the handwriting input character splitting method, the step b) further includes: when each input stroke is received, the input time and input position information of each stroke are recorded.
Furthermore, on the basis of the scheme based on the handwritten input character splitting method, the input time comprises a pen-down time and a pen-up time, and the input position at least comprises: the position at the time of pen down, the position at the time of pen up, and the coordinate position of each point in the handwriting of the stroke.
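The stroke-to-character attribution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `Stroke`, `Character`, `horizontal_gap` and `assign_stroke` are hypothetical, writing is assumed to be horizontal, and relevance is approximated by the horizontal gap between bounding extents compared against `min_gap`.

```python
from dataclasses import dataclass, field


@dataclass
class Stroke:
    pen_down_time: float   # time at which the pen touched the screen
    pen_up_time: float     # time at which the pen was lifted
    points: list           # [(x, y), ...] sampled along the stroke

    @property
    def x_min(self):
        return min(x for x, _ in self.points)

    @property
    def x_max(self):
        return max(x for x, _ in self.points)


@dataclass
class Character:
    strokes: list = field(default_factory=list)

    @property
    def x_min(self):
        return min(s.x_min for s in self.strokes)

    @property
    def x_max(self):
        return max(s.x_max for s in self.strokes)


def horizontal_gap(stroke, char):
    """Gap between the x-extents of a stroke and a character (0 if they overlap)."""
    return max(0.0, stroke.x_min - char.x_max, char.x_min - stroke.x_max)


def assign_stroke(stroke, characters, min_gap):
    """Attach the stroke to the closest character nearer than min_gap,
    or create a new character if no existing character is close enough."""
    best, best_gap = None, None
    for ch in characters:
        gap = horizontal_gap(stroke, ch)
        if gap < min_gap and (best_gap is None or gap < best_gap):
            best, best_gap = ch, gap
    if best is None:
        best = Character()
        characters.append(best)
    best.strokes.append(stroke)
    return best
```

Calling `assign_stroke` once per pen lift corresponds to case 2) above; a real system would likely also use vertical position and timing, which this sketch omits.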
The invention also provides a data object splitting method based on the object, which comprises the following steps:
a) Acquiring metadata of a data object;
b) Selecting or creating a corresponding data splitting/stripping specification for the data object in dependence upon at least a portion of the metadata;
c) Splitting at least a portion of the data object into data fragments and/or stripping at least a portion of the data object according to the data splitting/stripping specification.
Further, on the basis of the scheme of the object-based data object splitting method, the data splitting/stripping protocol comprises at least one or a combination of more than two of the following options: 1) The data content splitting protocol records a method and a process for splitting the data content; 2) Metadata stripping protocol, recording the method and process of stripping the corresponding metadata from the data object; 3) If the codes are generated in the data splitting process, the method also comprises a code separation protocol, and the code rules and the code process between the corresponding codes and the coded objects are recorded.
Further, on the basis of the scheme of the object-based data object splitting method, after step c), the method further includes step d): and recombining the split data fragments.
Further, on the basis of the scheme of the object-based data object splitting method, at least one part of the metadata forms splitting metadata.
The invention also provides a data object merging method based on the object, which comprises the following steps:
a) Obtaining each split data segment, a splitting/stripping protocol or a corresponding merging protocol;
b) Obtaining split metadata of the data object according to the obtained data fragment and/or the split/strip or merge specification;
c) Based on the data splitting/stripping specification or the merging specification, and the splitting metadata, the data pieces are combined together to recover the original data.
Further, on the basis of the scheme of the object-based data object merging method, after the splitting processing of the data object is completed, the method further comprises: and a storage step, wherein each split/stripped data segment is respectively stored in different storage bodies or under different security channels.
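The splitting, separate storage, and merging steps above can be illustrated with a small round trip. This is only a sketch under assumptions: the registry `SPLIT_SPECS`, the fixed-size splitting rule, the in-memory dictionaries standing in for separate storage bodies, and all function names are hypothetical and are not taken from the source.

```python
# Hypothetical registry of splitting specifications, selected by metadata.
SPLIT_SPECS = {
    "text": {"method": "fixed", "size": 4},
    "binary": {"method": "fixed", "size": 8},
}
DEFAULT_SPEC = {"method": "fixed", "size": 16}


def select_spec(metadata):
    """Step b): choose a splitting specification from part of the metadata."""
    return SPLIT_SPECS.get(metadata.get("type"), DEFAULT_SPEC)


def split_object(obj):
    """Steps a) to c): read metadata, pick a spec, split the content."""
    metadata = obj["metadata"]
    spec = select_spec(metadata)
    size = spec["size"]
    content = obj["content"]
    fragments = [content[i:i + size] for i in range(0, len(content), size)]
    split_metadata = {"type": metadata.get("type"), "count": len(fragments)}
    return spec, split_metadata, fragments


def store_fragments(fragments, stores):
    """Spread fragments over several stores (round robin), keyed by index."""
    for i, frag in enumerate(fragments):
        stores[i % len(stores)][i] = frag


def merge_object(split_metadata, stores):
    """Collect the fragments in index order and rebuild the original content."""
    count = split_metadata["count"]
    fragments = [stores[i % len(stores)][i] for i in range(count)]
    return b"".join(fragments)
```

Because no single store holds every fragment, the content can be rebuilt only by a party that has the split metadata and access to all stores, which is the point of storing fragments separately.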
The handwriting input method and system will be described in detail below.
Fig. 1A is a flowchart of an embodiment of a method for processing handwriting input characters according to the present invention. Compared with existing handwriting input systems, the method provided by this embodiment comes closer to people's natural writing habits while fully preserving the writer's style and characteristics. As shown in fig. 1A, the method in this embodiment may include:
step 101A, acquiring strokes input by a user and corresponding input information in a first target row/column activated currently; wherein the input information includes an input position of the stroke in the first target row/column.
The execution subject in this embodiment may be a handwriting input device such as a conventional touch screen, a handwriting screen, or other suitable handwriting device, or may be directly adapted to the handwriting system of this embodiment. Preferably, the present embodiment may employ a touch screen type handwriting input device, i.e., an input device that can directly input information on a screen by handwriting or by means of a dedicated or non-dedicated writing tool or the like.
Specifically, the embodiment may be applicable to any writing manner, where the writing manner may be user-defined, or default settings may be adopted. The writing manner described in this embodiment may include, but is not limited to, the following ways: line-by-line writing (corresponding to the common horizontal format, writing habit from left to right and top to bottom); writing in columns (corresponding to a vertical row format, right-to-left writing habit from top to bottom); other writing formats customized by the user can be adopted, for example, the writing format which is set for Arabic from right to left can be adopted; or may be in a top-down, left-to-right writing format, etc.
Typically, a user handwrites each character in its own stroke order during the writing process. This embodiment can record each stroke and its input position in chronological order. For example, when the user starts writing the character for "I", the first stroke (a left-falling stroke) is written first, and the system automatically records that stroke and its input position on the panel. The pixel position on the handwriting input screen may be used as the input position, or other positioning algorithms or position determining methods may be used, so long as the input position of each stroke can be uniquely determined.
When a user performs handwriting input, there is a concept of a target row/column, and the target row/column can be used as a constraint range of handwriting input of the user, that is, when a certain row/column is activated, the target row/column is formed, and input can be performed on the certain row/column. All strokes entered by the user belong to the target row/column before the target row/column is changed. In this case, the user may be prohibited from handwriting input in an area other than the target line/column, or allowed to input at an arbitrary position, but when the stroke input by the user exceeds the boundary of the target line/column, the following several different processing manners may be adopted: first, under the condition of low precision requirement, discarding the part of strokes exceeding the preset threshold value of the boundary; secondly, when the original input needs to be restored with high precision, stroke information beyond the boundary, such as time and position information, can be completely recorded, so that the original input state of the user can be completely restored.
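The two ways of handling strokes that cross the target row boundary can be sketched as below. The function name `clip_stroke_to_row`, its parameters, and the point-wise filtering are assumptions made for illustration; the source does not specify how the discarded portion is computed.

```python
def clip_stroke_to_row(points, top, bottom, tolerance=0.0, keep_overflow=False):
    """Handle stroke points that fall outside a horizontal target row.

    points        -- [(x, y), ...] sampled along the stroke
    top, bottom   -- vertical bounds of the target row (top < bottom)
    tolerance     -- how far beyond the boundary a point may lie and be kept
    keep_overflow -- True for high-precision mode: record everything as written
    """
    if keep_overflow:
        # High-precision restoration: keep the full stroke, overflow included.
        return list(points)
    # Low-precision mode: drop points beyond the boundary plus tolerance.
    return [(x, y) for x, y in points
            if top - tolerance <= y <= bottom + tolerance]
```

For example, with a row spanning `y = 0..50` and a tolerance of 10, a point at `y = 55` is kept while a point at `y = 70` is discarded unless `keep_overflow` is set.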
The method provided by this embodiment can take a row (horizontal row) or a column (vertical row) as a limitation or constraint of input, that is, the current input can only be limited in a specific row or column, and no strokes or characters crossing the row or column exist. Based on this row or column constraint, the input content can form a character stream in the input order. Compared with the prior art, the method provided by the embodiment is closer to the natural writing habit of people, so that the writing experience of users can be more natural and smooth.
When the user performs input, a range of target rows/columns may be displayed on the handwriting input screen, for example, the target rows/columns are highlighted, or row/column ground patterns or bright patterns in a text or letter format are displayed on the handwriting input screen, so as to prompt the user of the position where the target rows/columns can be currently input.
Preferably, the currently activated first target row/column may be selected or created prior to step 101A. The selection or creation of the currently active first target row/column may take a number of forms, the present embodiment provides for the following two.
Target row/column selection mode one: the position range of each row/column is first determined, and the target row/column is then selected by the user. Determining the position range of each row/column may specifically include:
Acquiring size information of a handwriting input screen and information of row height/column width;
dividing the handwriting input screen into at least one row/column according to the size information of the handwriting input screen and the row height/column width information, and determining the position range of each row/column;
wherein the row height/column width is a default value or is determined by user input, and the position range of each row/column refers to the top and bottom edge positions of each row in the handwriting input screen, or the left and right edge positions of each column in the handwriting input screen.
Through the steps, the handwriting input screen can be divided into a plurality of rows/columns, the position range of each row/column is determined, and in the subsequent input process of a user, strokes can be input based on the divided rows/columns.
After determining the location range for each row/column, the target row/column may be selected by the user. The selecting, by the user, the target row/column may specifically include:
receiving a target row/column selection message input by a user, wherein the target row/column selection message comprises a target row/column identifier to be input by the user;
and according to the target row/column selection message, taking the row/column corresponding to the target row/column identification to be input by the user as the first target row/column currently activated.
The identification of the target row/column to be input by the user can be any coordinate point clicked by the user, and the row/column where the coordinate point is located is the row/column corresponding to the coordinate point; alternatively, the identification of the target row/column to be input by the user may be a row/column number, for example, the 10 th row or the 10 th column, and then the row/column corresponding to the row/column number may be used as the first target row/column that is currently activated.
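Selection mode one can be sketched for the horizontal-row case as follows. The function names and the use of integer pixel coordinates are assumptions for illustration; columns would be handled the same way along the x axis.

```python
def divide_rows(screen_height, row_height):
    """Divide the screen into full rows and return [(top, bottom), ...]."""
    if row_height <= 0:
        raise ValueError("row_height must be positive")
    count = int(screen_height // row_height)
    return [(i * row_height, (i + 1) * row_height) for i in range(count)]


def row_at_point(rows, y):
    """Return the index of the row containing vertical position y, or None."""
    for index, (top, bottom) in enumerate(rows):
        if top <= y < bottom:
            return index
    return None


def select_target_row(rows, *, point=None, row_number=None):
    """Resolve a selection message to a row index.

    The identifier is either a clicked coordinate (x, y) or a 1-based row
    number; None is returned if it does not match any row.
    """
    if row_number is not None:
        index = row_number - 1
        return index if 0 <= index < len(rows) else None
    if point is not None:
        return row_at_point(rows, point[1])
    return None
```

With a 200-pixel-high screen and 50-pixel rows, a click at `(30, 120)` selects the third row (index 2).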
When other input devices are connected externally, the user can select a target row/column through the accessed input device. For example, when the keyboard is externally connected, the user can select a target row/column through the keyboard; or when the mouse is externally connected, a user can select different target rows/columns by moving the mouse; alternatively, when the stylus is externally connected, the target row/column may be selected by pointing the stylus before the stylus makes contact with the handwriting input screen.
Target row/column selection mode two: a target row/column is activated based on characters previously entered by the user. This specifically comprises the following steps:
acquiring at least one character input by a user;
taking the row/column where the at least one character is located as the first target row/column which is activated currently;
Setting a position range of the currently activated first target row/column according to the character boundary of the at least one character;
wherein the position range refers to the top and bottom edge positions of the first target row in the handwriting input screen, or the left and right edge positions of the first target column in the handwriting input screen.
Due to the different writing habits, an appropriate threshold may be set for the width of the first target row/column in order to meet the needs of a particular user. For example, for horizontal writing, the natural writing line of the writer may habitually tilt up to the right or down to the right, at which time the boundary of at least one character that the user has entered may be expanded upward or downward by a distance as appropriate as the boundary of the first target line/column.
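A minimal sketch of selection mode two for a horizontal row is given below. The bounding-box representation `(left, top, right, bottom)`, the function name, and the single symmetric `margin` are assumptions; the source only says the boundary may be expanded upward or downward by an appropriate distance.

```python
def row_range_from_characters(char_boxes, margin=0.0):
    """Derive the (top, bottom) range of a target row from existing characters.

    char_boxes -- [(left, top, right, bottom), ...] for characters already input
    margin     -- extra distance added above and below to tolerate slanted lines
    """
    if not char_boxes:
        raise ValueError("at least one character is required")
    top = min(box[1] for box in char_boxes) - margin
    bottom = max(box[3] for box in char_boxes) + margin
    return top, bottom
```

A larger `margin` suits writers whose lines drift upward or downward toward the right.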
Of the two modes for selecting the target row/column provided above, the first is simple and quick; the second can further accommodate personalized user input and handwritten character input in graphic systems.
Step 102A, for each stroke, creating a new character for the stroke or determining a character to which the stroke belongs according to an input position of the stroke in the first target row/column or an input position of the stroke in the first target row/column and a character specified in the first target row/column.
The present embodiment adopts a different word division or segmentation mode from the prior art, that is, the attribution of the current input stroke is determined based on the relevance between each input stroke and other characters or strokes. Therefore, the method provided by the embodiment can save the complicated interaction process of inputting by the user according to the character unit, thereby greatly simplifying the input operation.
Wherein, the character refers to an independent character object having a two-dimensional shape, and includes not only standard characters of ideograms, such as single Chinese characters, japanese, korean, arabic, tibetan, my, etc. or parts thereof (e.g. radicals, etc.), or standard words of phonograms, such as Western letters or words of English, german, french, russian, spanish, etc.; computer characters based on traditional standard codes, such as ASCII code characters, unicode code characters or character strings, and the like; or a combination character or character string formed by mixing the handwritten character and the standard character; and may even be any graphic, image, such as a "heart" shaped pattern, photograph, any graffiti, etc., or any other written representation of the user input.
Fig. 1B is a schematic diagram of a character in an embodiment of a method for processing a handwriting input character according to the present invention. Fig. 1C is a schematic diagram of a character in an embodiment of a method for processing a handwriting input character according to the present invention. Five characters are shown in fig. 1B, including "stroke characters" that is, handwritten characters entered by the user, such as first, third, and fourth characters, and "graphic characters" that is, any graphic or image information entered by the user, such as second and fifth characters. In addition, other characters, such as "standard characters" (any character in the existing standard word stock), "combined characters" (mixed characters in which various characters are mixed together), etc., may be input in the present embodiment, and the "combined characters" may also directly include handwriting strokes—when the handwriting strokes are written directly over non-stroke characters, "combined characters" are formed. As shown in fig. 1C, the "gulosity" two words are combined characters of standard characters and stroke characters.
In this embodiment, the character input by the user is not required to be recognized, and only the character to which each stroke belongs is required to be judged, and the character is divided. In determining the attribution of strokes entered by the user, the strokes entered in the first target row/column may be automatically divided according to an intrinsic convention of the set language (e.g., based on a writing or typesetting manner of each language, etc.).
The judgment of the character to which the stroke belongs is a process of splitting the input character. The splitting operation (i.e. word forming operation) of the input characters can be realized in a mode of splitting while inputting, namely, along with the natural writing of a user, the inputted strokes can be determined to which character, and thus, the effect of word forming while inputting can be realized.
For the trigger condition of character splitting, one of the following methods can be selected: (1) Starting from the moment the user puts the pen down, judging the stroke in real time, point by point as it is input, and determining the character to which the stroke belongs; (2) Judging the attribution of each stroke after the stroke has been input (that is, on pen lift); (3) After one line of input is completed, or when it is judged that the user has paused input for a relatively long time, judging all previously input strokes one by one and assigning the strokes with the highest relevance to one another to the same character.
Each of the three methods has advantages and disadvantages. In terms of computation, trigger condition (1) is the most expensive; the latter two are comparable to each other and cheaper than the first. In addition, under trigger condition (1), real-time judgment can cause the result to change dynamically: based on the points input so far, the current stroke may be judged to belong to the previous character, but as the stroke continues it may turn out that the stroke should form a character on its own or belong to the following character. The final attribution of each stroke then needs to be updated so that the same stroke is not attributed to different characters, which further increases the amount of computation. Although in most cases the user is not concerned with word formation, which runs as a background process, processing under trigger condition (1) can provide a more real-time interactive experience than the latter two methods.
For each stroke, if the stroke is the first stroke of the first target row/column, a new character may be created for the stroke; if the stroke is not the first stroke of the first target row/column, a new character may be created for the stroke or the character to which the stroke belongs may be determined based on the input location of the stroke in the first target row/column and other characters in the first target row/column.
According to the method for processing handwriting input characters provided by this embodiment, strokes input by the user and the corresponding input information are acquired in the currently activated first target row/column. According to the input position of each stroke in the first target row/column, or that input position together with the characters specified in the first target row/column, a new character is created for the stroke or the character to which the stroke belongs is determined, so that characters are formed while the user writes. The user does not need to separate characters by explicit or implicit commands that start or end the input of a single character, and therefore does not need to pause after each character or interact with the system during writing; the writing process is smooth and more efficient. In addition, the method determines the character to which a stroke belongs directly from the input position of the stroke, without recognizing standard characters, so the personalized information, writing style, and characteristics of the user's handwriting can be preserved.
The handwriting input can be more natural and smooth, so that the handwriting input device is more convenient for the old and children who are unfamiliar with the electronic input devices such as computers, mobile phones, tablet computers, laptop computers, notebooks, iPad and the like to use the device.
Unlike the conventional keyboard/character stream model, the handwriting input character processing method in this embodiment adopts a pen/paper model. The user may directly activate any line in the page for input. The system may process empty lines before and between handwriting input content into empty paragraphs. For the user, there may be only commands to change the input line, without the concept of carriage return, line feed.
When the user's input reaches the end of a row/column, it may be necessary to move the target row/column to the row/column following the first target row/column so that the user can continue there; this is the line-breaking function provided by this embodiment. The line-breaking function may be implemented in various ways, and this embodiment provides four:
Line-breaking mode one:
receiving a row/column break command input by a user;
and taking a second target row/column as a currently activated target row/column according to the row/column breaking command, wherein the second target row/column is the next row/column of the first target row/column.
In this mode, the break position can be determined through a preset interaction. For example, it may be agreed in advance that each time the user, writing naturally, reaches what he or she considers the end of the line, the end of the line is confirmed by double-clicking or triple-clicking a corresponding position or button at the right boundary of the input box or screen; alternatively, a command button may be provided at the end of the first target row/column, and when the user clicks the command button, the next row/column is automatically activated for editing.
Line-breaking mode two:
judging whether the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is smaller than a first preset threshold value or not;
if the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is smaller than the first preset threshold, taking a second target row/column as a currently activated target row/column so as to acquire the stroke input by a user in the second target row/column;
wherein the second target row/column is the next row/column to the first target row/column.
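Line-breaking mode two reduces to a single threshold comparison, sketched below for left-to-right horizontal writing. The function name and parameters are hypothetical, and the check that a next row exists is an added assumption.

```python
def next_active_row(current_row, stroke_end_x, row_end_x, threshold, row_count):
    """Return the index of the row that should be active after a stroke.

    If the stroke ends closer to the end of the current row than `threshold`,
    the following row becomes the active target row (if one exists).
    """
    distance_to_end = row_end_x - stroke_end_x
    if distance_to_end < threshold and current_row + 1 < row_count:
        return current_row + 1
    return current_row
```

With an 800-pixel-wide row and a threshold of 20, a stroke ending at `x = 790` activates the next row, while one ending at `x = 400` leaves the current row active.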
Line-breaking mode three:
judging whether the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is smaller than a first preset threshold value or not;
if the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is smaller than the first preset threshold value, the first target row/column and the second target row/column are simultaneously used as the currently activated target row/column;
acquiring at least one stroke which is input subsequently by a user in a first target row/column and/or a second target row/column, and taking the second target row/column as a currently activated target row/column only when the second target row/column acquires the first stroke;
Wherein the second target row/column is the next row/column to the first target row/column.
In this mode, to achieve continuous input, the attribution of strokes in adjacent rows/columns must be resolved. When two or more adjacent rows/columns are activated simultaneously, a user's stroke may span multiple rows/columns, and the row/column to which the stroke belongs must then be determined by a fixed rule: for example, the row/column containing the stroke's starting point, the row/column containing its end point, or the row/column containing the largest proportion of the stroke. This conflict can also be alleviated by increasing the spacing between adjacent rows/columns.
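The three attribution rules for a stroke that spans several rows can be sketched as follows. The rule names `"start"`, `"end"` and `"majority"` are hypothetical, and "largest proportion" is approximated here by counting sampled points per row, which is an assumption rather than something the source specifies.

```python
from collections import Counter


def row_of(y, rows):
    """Return the index of the row whose (top, bottom) range contains y."""
    for index, (top, bottom) in enumerate(rows):
        if top <= y < bottom:
            return index
    return None


def owning_row(points, rows, rule="majority"):
    """Decide which row a multi-row stroke belongs to.

    points -- [(x, y), ...] in drawing order
    rows   -- [(top, bottom), ...]
    rule   -- "start", "end", or "majority"
    """
    if rule == "start":
        return row_of(points[0][1], rows)
    if rule == "end":
        return row_of(points[-1][1], rows)
    if rule == "majority":
        hits = (row_of(y, rows) for _, y in points)
        counts = Counter(r for r in hits if r is not None)
        return counts.most_common(1)[0][0] if counts else None
    raise ValueError(f"unknown rule: {rule}")
```

Whichever rule is chosen, applying it consistently ensures that each stroke is attributed to exactly one row.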
Preferably, when the first target row/column and the second target row/column are simultaneously used as the currently activated target row/column, the first target row/column and the second target row/column are both activated in a partial area;
the starting position of the activation region of the first target row/column is arranged between the ending position of the activation region of the second target row/column and the ending position of the activation region of the first target row/column.
Line-breaking mode four:
the user decides whether to break the line by fully controlling the position, within the paragraph, of the handwriting panel that represents the activation area. The handwriting panel itself breaks lines automatically within a paragraph. When the user moves the handwriting panel along or against the typesetting direction through interaction (such as a keyboard command or a touch screen gesture), the system moves part or all of the handwriting panel to the next line or the previous line according to the position of the handwriting panel in the paragraph and its relation to the current line. As the position within the paragraph changes, the content presented in the handwriting panel changes accordingly. When the handwriting panel is moved to the last line of a paragraph, the automatic line breaking of the handwriting panel is triggered again, so that the paragraph is effectively broken onto a new line.
Fig. 1D is a schematic diagram of two adjacent lines activated simultaneously in an embodiment of a method for processing handwritten input characters according to the present invention. The box in the figure marks the activation area. As shown in FIG. 1D, the activation area is a logically contiguous area within two adjacent rows/columns, and the user can only write within it. Because the activation areas of adjacent rows/columns are staggered, strokes that cross rows/columns are avoided. Meanwhile, the activation area may also be switched to the full row/column range (the first target row/column or the second target row/column) according to the user's interactive operation.
For the case of simultaneous activation of two adjacent rows/columns, there is a constraint: for the first or last row/column of paragraphs, there is no corresponding forward-detour or backward-detour feature. The following is a detailed description.
Within the same paragraph, if the currently activated target line is not the first line of the paragraph, then when the distance between the input position of a stroke in the target line and the starting position of the line is smaller than a certain threshold, the relevant areas of the target line and the previous line can be activated simultaneously; if the currently activated target line is not the last line of the paragraph, then when the distance between the input position in the target line and the end position of the line is smaller than a certain threshold, the relevant areas of the target line and the next line can be activated simultaneously.
However, for the first and last lines of a paragraph, if other paragraphs precede or follow it: when the user inputs on the first line of the paragraph, the first line and the previous line cannot be activated at the same time because the previous line belongs to another paragraph; when the user inputs on the last line of the paragraph, the last line and the following line cannot be activated at the same time because the following line belongs to the next paragraph.
In particular, for the tail line of a paragraph, the user may need to issue a "line expansion" command, and insert an empty line of the same paragraph thereafter to enable the function of activating two adjacent lines simultaneously.
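The paragraph constraint on simultaneous activation can be expressed as a small decision function. This is an illustrative sketch: the function name, the use of line indexes, and the boolean flags `near_start` and `near_end` (standing for the threshold comparisons described above) are all assumptions.

```python
def co_activated_lines(line, paragraph_first, paragraph_last,
                       near_start=False, near_end=False):
    """Return the lines to activate together with the current target line.

    line            -- index of the currently activated line
    paragraph_first -- index of the first line of its paragraph
    paragraph_last  -- index of the last line of its paragraph
    near_start      -- input is within the threshold of the line start
    near_end        -- input is within the threshold of the line end
    """
    active = [line]
    # The previous line may join only if it belongs to the same paragraph.
    if near_start and line > paragraph_first:
        active.insert(0, line - 1)
    # The next line may join only if it belongs to the same paragraph.
    if near_end and line < paragraph_last:
        active.append(line + 1)
    return active
```

Under this model, the "line expansion" command amounts to increasing `paragraph_last` by one so that the last line gains a same-paragraph successor.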
Among the four line-breaking modes, the first and fourth are active line breaks by the user: the target row/column is changed through interaction with the user, which makes them accurate. The second and third are automatic line breaks: no extra interaction with the user is needed, and as long as the user writes in conformity with the row or column layout, the end position of each row/column can be identified automatically without the user confirming it. The whole handwriting input screen can thus be used like ordinary paper, which greatly improves the user's input experience.
For the method of processing handwritten input characters in this embodiment, there are two important concepts: the line break (soft carriage return) and the end of a paragraph (hard carriage return). A line break means that the current paragraph has not ended, but because handwritten characters have been entered up to the end of the line, the next line needs to be activated. The end of a paragraph means that the content of the paragraph has ended: when a paragraph is judged to have ended, an empty line can be inserted after it, and the line after the empty line is then activated as the first line of the next paragraph for the user to write on; alternatively, when a paragraph is judged to have ended, the next row/column may be activated directly as the first line of the next paragraph.
To distinguish line breaks from paragraph ends, different interactions can be set: for example, clicking one button breaks the line and clicking another ends the paragraph; or the line breaks automatically when the end position of a line is reached, and the paragraph is ended through manual interaction; or the paragraph ends automatically when the end position of a line is reached, and line breaks are made through manual interaction. This embodiment does not limit the choice.
For example, the line breaking can be performed in any of the above-mentioned line breaking modes one, two and three, and for the end of the paragraph, some interaction operation with the user is required.
Alternatively, when the user inputs on different lines, the system may automatically attribute the different lines to different paragraphs and create empty paragraphs for the empty lines between paragraphs, while extending a paragraph to the next line (i.e., a line break) requires an explicit interactive command. Typically, a paragraph expansion command is meaningful only on the last line of a paragraph or the last line inserted. The line currently being edited and all other lines of the same paragraph share a common visual state that distinguishes them from other paragraphs.
On the basis of the technical scheme provided by the embodiment, preferably, the characters input by the user can be stored.
The saving function in this embodiment may specifically include:
storing, at preset time intervals, the new characters or attributed characters obtained from the acquired strokes;
or,
when the currently activated target row/column on a page is switched from one target row/column to another target row/column on the same page, storing the new characters or attributed characters obtained from the strokes acquired on the former target row/column;
or,
when the current page is switched from one page to another page, storing the new characters or attributed characters obtained from the strokes acquired on the former page.
Specifically, when storing, the strokes input by the user and the corresponding input information may be stored in a first memory; storing the saved characters in a second memory, wherein for each saved character, the characters comprise strokes constituting the character and indexes corresponding to the strokes; and the index corresponding to the stroke points to the input information corresponding to the stroke in the first memory. Alternatively, the strokes, the input information thereof, and the corresponding characters may all be stored in one memory, which is not limited in this embodiment.
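The two-memory arrangement described above can be sketched as follows. This is a minimal illustration only: the names `StrokeRecord`, `Character`, and `HandwritingStore` are hypothetical, and plain Python lists stand in for the first and second memories.

```python
from dataclasses import dataclass, field


@dataclass
class StrokeRecord:
    """Input information of one stroke (kept in the 'first memory')."""
    points: list          # sampled (x, y) positions of the handwriting
    pen_down_time: float  # time at which the pen touched down
    pen_up_time: float    # time at which the pen was lifted


@dataclass
class Character:
    """One saved character (kept in the 'second memory')."""
    stroke_indexes: list = field(default_factory=list)  # indexes into the stroke store


class HandwritingStore:
    def __init__(self):
        self.stroke_store = []  # hypothetical stand-in for the first memory
        self.char_store = []    # hypothetical stand-in for the second memory

    def add_stroke(self, record):
        """Save a stroke and return the index that characters use to refer to it."""
        self.stroke_store.append(record)
        return len(self.stroke_store) - 1

    def new_character(self, stroke_index):
        """Create a character that initially owns a single stroke."""
        self.char_store.append(Character([stroke_index]))
        return len(self.char_store) - 1

    def strokes_of(self, char_index):
        """Resolve a character's indexes back to the stored input information."""
        return [self.stroke_store[i] for i in self.char_store[char_index].stroke_indexes]
```

Because a character holds only indexes, the stroke input information is stored once and can be re-attributed to another character without being copied.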
For the storage order or sequence of strokes and characters, any suitable storage manner may be employed as long as it is capable of effectively distinguishing between each character to which each stroke belongs and each different character. Preferably, the input strokes and the information such as the divided characters may be stored in a temporary storage location or space of the system (such as RAM or flash memory of the system) while being input, and after the input of each target row/column is finished, all the divided characters and stroke information in the target row/column are stored in a designated permanent storage location or space.
On the basis of the technical solution provided in the foregoing embodiment, preferably, the input information corresponding to the stroke further includes one or a combination of several of the following: the input time of the stroke, the input force of the stroke and the input speed of the stroke.
The input time comprises the pen falling time and the pen lifting time of the strokes and the stay time of each point in the handwriting of the strokes; the input location includes at least: the position when the pen is dropped, the position when the pen is lifted, and the coordinate position of each point in the handwriting of the stroke.
In this embodiment, the input time, intensity, speed and other information of each stroke may be recorded as required to further refine the input information. The strokes and the corresponding input times, intensities and speeds may be stored in a separate stroke database in the form of a list.
Since the embodiment can record and retain detailed input information of each stroke according to the stroke sequence when writing while receiving each input stroke, almost all information of all writing styles and habits related to each user, such as the stroke sequence style, the stroke style, the word spacing and the like, can be completely recorded and retained, thereby making handwriting identification and the like easy.
This embodiment is also advantageous for handling missing strokes. For example, suppose the user forgets the upper-right dot stroke when writing the character "me" and notices the omission only after entering other characters. The user can then add the dot at the upper-right corner of the original "me" character, just as when writing on paper, and the dot is judged from its position information to be a component of the previously entered "me" character, even though its input time differs from that of the other strokes of that character.
When a user draws a custom graphic or character in the form of a graffiti during the input process, the input time and input position of each stroke is recorded as in conventional characters.
Because the embodiment can completely reserve all input information including the input time, the position, the dynamics, the speed, the word spacing and the like of each stroke, a wider development space is provided for application services such as subsequent editing and other processing.
Based on the technical solution provided in the foregoing embodiment, it is preferable that in step 102A, according to the input position of the stroke in the first target row/column, or the input position of the stroke in the first target row/column and the character specified in the first target row/column, a new character is created for the stroke or a character to which the stroke belongs is determined, which specifically may include:
comparing the input position of the stroke in the first target row/column with position information corresponding to the character appointed in the first target row/column, and judging the relevance between the stroke and the character;
if the stroke is not associated with any character, creating a new character for the stroke, the stroke being attributed to the new character;
If the stroke is associated with at least one character, attributing the stroke according to the associated at least one character.
Wherein, the specified characters in the embodiment may be all characters existing in the first target row/column; alternatively, the specified character may be a character in a region to be compared in the first target row/column, and a distance between a boundary position of the region to be compared and the stroke is smaller than a second preset threshold. The stroke is compared with the characters in a certain range, so that the calculated amount can be effectively reduced, and the stroke attribution judging efficiency is improved.
Comparing the input position of the stroke in the first target row/column with the position information corresponding to the character appointed in the first target row/column, and judging the relevance between the stroke and the character, wherein various implementation methods are possible, and are respectively described below.
Relevance judgment mode I: the relevance between the stroke and a character is judged by whether the stroke overlaps the character. Specifically, creating a new character for the stroke or determining the character to which the stroke belongs in step 102A, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, may include:
Comparing the input position of the stroke in the first target row/column with position information corresponding to a character appointed in the first target row/column, and judging whether the stroke is overlapped with at least one stroke in the character;
if the stroke is overlapped with at least one stroke in the character, judging that the stroke is related to the character;
if the strokes are not overlapped with all strokes in the character, judging that the strokes are not associated with the character;
if the stroke is not associated with any character, creating a new character for the stroke, the stroke being attributed to the new character;
if the stroke is associated with at least one character, attributing the stroke according to the associated at least one character.
In the mode, strokes which are intersected with each other can be used as strokes of the same character, and the strokes are attributed to the same character, so that the mode is simple and quick.
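As a minimal sketch of the overlap test in this mode, assume that each stroke is a polyline given as a list of (x, y) sample points and that a character is a list of such strokes; these representations and all function names are illustrative assumptions, not part of the described method. Two strokes are treated as overlapping when any pair of their line segments intersects.

```python
def _orient(p, q, r):
    """Sign of the turn p -> q -> r: 1 counter-clockwise, -1 clockwise, 0 collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _within(p, q, r):
    """True if r lies inside the bounding box of segment p-q (r is known to be collinear)."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a, b, c, d):
    """True if segment a-b and segment c-d share at least one point."""
    o1, o2 = _orient(a, b, c), _orient(a, b, d)
    o3, o4 = _orient(c, d, a), _orient(c, d, b)
    if o1 != o2 and o3 != o4:
        return True
    return ((o1 == 0 and _within(a, b, c)) or (o2 == 0 and _within(a, b, d))
            or (o3 == 0 and _within(c, d, a)) or (o4 == 0 and _within(c, d, b)))


def strokes_overlap(s, t):
    """True if any segment of polyline s intersects any segment of polyline t.

    Single-point strokes have no segments and therefore never overlap here.
    """
    return any(segments_intersect(s[i], s[i + 1], t[j], t[j + 1])
               for i in range(len(s) - 1) for j in range(len(t) - 1))


def associated_by_overlap(stroke, characters):
    """Indexes of the characters that contain at least one stroke overlapping `stroke`."""
    return [k for k, ch in enumerate(characters)
            if any(strokes_overlap(stroke, s) for s in ch)]
```

An empty result corresponds to the case in which a new character is created for the stroke.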
Relevance judgment mode II: the relevance between the stroke and a character is judged by calculating the distance between the stroke and the boundary of the character. In this mode, creating a new character for the stroke or determining the character to which the stroke belongs in step 102A, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, may include:
for each character specified in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information corresponding to the character, and judging whether the distance between the stroke and the boundary of the character is smaller than a third preset threshold;
if the distance between the stroke and the boundary of the character is smaller than the third preset threshold, judging that the stroke is associated with the character;
if the distance between the stroke and the boundary of the character is not smaller than the third preset threshold, judging that the stroke is not associated with the character;
if the stroke is not associated with any character, creating a new character for the stroke, the stroke being attributed to the new character;
if the stroke is associated with at least one character, attributing the stroke according to the associated at least one character.
For example, for characters with a distinct left-right or up-down structure, such as the "warm" character, personal writing habits may leave the left component (the three-dot water radical) too far from the right half. In that case, the character to which a stroke belongs can be determined by comparison with the third preset threshold set in advance: when the distance between the currently input stroke and the adjacent character is smaller than the third preset threshold, the stroke can be considered to belong to the adjacent character; otherwise, a new character can be created for the stroke.
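The boundary-distance test of mode II might look like the sketch below. It assumes that the boundary of a character is its axis-aligned bounding box and that the stroke is likewise reduced to its bounding box; both are simplifying assumptions, and the function names are hypothetical.

```python
def bbox(points):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a list of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def box_gap(a, b):
    """Shortest distance between two bounding boxes; 0 if they touch or overlap."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return (dx * dx + dy * dy) ** 0.5


def associated_by_boundary(stroke, characters, threshold):
    """Indexes of characters whose boundary is closer to the stroke than `threshold`.

    `threshold` plays the role of the third preset threshold.
    """
    stroke_box = bbox(stroke)
    return [k for k, ch in enumerate(characters)
            if box_gap(stroke_box, bbox([p for s in ch for p in s])) < threshold]
```

With a left component whose box ends at x = 2 and a new stroke starting at x = 5, the gap is 3, so the stroke is attributed to that character for a threshold of 4 but not for a threshold of 3.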
Relevance judgment mode III: the relevance between the stroke and a character is judged by calculating the distance between the stroke and each stroke in the character. In this mode, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, may include:
comparing the input position of the stroke in the first target row/column with the position information corresponding to each stroke in the character for each character designated in the first target row/column, acquiring a minimum interval value in intervals between the stroke and each stroke corresponding to the character, and judging whether the minimum interval value is smaller than a fourth preset threshold value or not;
if so, the stroke is associated with the character;
if not, the stroke is not associated with the character;
if the stroke is not associated with any character, creating a new character for the stroke, the stroke being attributed to the new character;
if the stroke is associated with at least one character, attributing the stroke according to the associated at least one character.
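A minimal sketch of mode III follows. It approximates the interval between two strokes by the smallest distance between their sampled points, which is an assumption made here for brevity; the function names are hypothetical.

```python
import math


def stroke_gap(s, t):
    """Smallest point-to-point distance between two sampled strokes (an approximation)."""
    return min(math.hypot(p[0] - q[0], p[1] - q[1]) for p in s for q in t)


def min_gap_to_character(stroke, character):
    """Minimum interval value between the stroke and the strokes of one character."""
    return min(stroke_gap(stroke, s) for s in character)


def associated_by_stroke_gap(stroke, characters, threshold):
    """Indexes of characters whose minimum interval value is below `threshold`.

    `threshold` plays the role of the fourth preset threshold.
    """
    return [k for k, ch in enumerate(characters)
            if min_gap_to_character(stroke, ch) < threshold]
```

Compared with the bounding-box test of mode II, this compares the stroke with the actual ink of each character, at the cost of more distance computations.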
In the first, second and third judging relevance modes, the performing attribution processing on the strokes according to the associated at least one character may include:
if there is one character associated with the stroke, attributing the stroke to the one character associated with the stroke;
if there are at least two characters associated with the stroke, combining the at least two characters and attributing the stroke to the combined characters.
In this embodiment, when a stroke can be attributed to both the character on its left and the character on its right, the stroke should be combined with those left and right characters to form one character, for example the strokes of the middle component "again" in the "tree" character relative to the left component "wood" and the right component "inch". Of course, if no subsequent recognition is required, the above preset thresholds need not be set, as long as the characters can be divided.
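The attribution rule (no associated character: create a new one; one: attribute; two or more: merge, then attribute) can be sketched as below. The sketch assumes characters are kept as a list of stroke lists and that `associated` is the index list produced by one of the relevance judgment modes; the function name is hypothetical.

```python
def attribute_stroke(stroke, characters, associated):
    """Attribute `stroke` and return the index of the character that now owns it.

    `characters` is modified in place. Merged characters are folded into the
    associated character with the smallest index.
    """
    if not associated:
        characters.append([stroke])  # not associated with any character: new character
        return len(characters) - 1
    target = min(associated)
    # Remove the other associated characters from the highest index down so that
    # the indexes still to be processed (and `target`) remain valid.
    for k in sorted(set(associated) - {target}, reverse=True):
        characters[target].extend(characters.pop(k))
    characters[target].append(stroke)
    return target
```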
In addition, in the second and third ways of judging the relevance, the relevance of the strokes and the characters can be further divided into strong and weak, and the attribution of the strokes can be judged according to the relevance.
Specifically, the attributing the strokes according to the associated at least one character may include:
Acquiring a character with strongest stroke relevance from at least one associated character;
if the character with the strongest association with the stroke is one, attributing the stroke to the strongest character;
and if at least two characters with the strongest relevance to the strokes exist, merging at least two characters, and attributing the strokes to the merged characters.
Accordingly, the acquiring the character with the strongest association with the stroke from the associated at least one character may include:
sorting the at least one character associated with the stroke in ascending order of the distance between the stroke and the boundary of each character, and taking the character with the smallest distance as the character with the strongest association with the stroke; or,
sorting the at least one character associated with the stroke in ascending order of the minimum interval value between the stroke and each character, and taking the first character as the character with the strongest association with the stroke.
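Selecting the character with the strongest association can be sketched as follows, assuming the distances (to the boundary in mode II, or the minimum interval values in mode III) have already been computed per associated character. The `tolerance` parameter for treating near-equal distances as a tie is an assumption added for illustration and is not part of the described method.

```python
def strongest_associations(distances, tolerance=0.0):
    """Return the indexes of the characters with the strongest association.

    `distances` maps a character index to its distance from the stroke; a smaller
    distance means a stronger association. More than one index is returned only
    when several characters are tied for the smallest distance (within `tolerance`),
    which is the case in which those characters are merged.
    """
    if not distances:
        return []
    best = min(distances.values())
    return sorted(k for k, d in distances.items() if d - best <= tolerance)
```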
When rows are used as the input constraint, strokes in an upper-lower positional relationship are by default attributed to the same character, and only the positional relationship between the stroke and the adjacent left and right characters needs to be judged. Likewise, when columns are used as the input constraint, strokes in a left-right positional relationship are by default attributed to the same character, and only the positional relationship between the stroke and the adjacent upper and lower characters needs to be judged.
In the actual application process, when the attribution of the strokes needs to be judged, the methods in the multiple modes can be comprehensively adopted, for example, the method in the first mode of judging the relevance is adopted for some strokes, the method in the second mode of judging the relevance is adopted for some strokes, and the method in the third mode of judging the relevance is adopted for other strokes.
For example, if the currently input stroke is spatially the first or the last stroke in the first target row/column, relevance judgment mode I can be used to judge whether the stroke is associated with other characters already input in the first target row/column, and if not, a new character is created for the stroke; if the current stroke is neither the first nor the last stroke in the first target row/column, the currently input stroke may be compared with all the characters, or with the intervals between strokes, according to relevance judgment mode II or III, and attributed to the associated character or characters according to the comparison result.
The first preset threshold, the second preset threshold, the third preset threshold and the fourth preset threshold may be determined by the user according to writing habits of the user, or may be a system default value.
In addition, the system may also provide visual aids to assist automatic division, such as character division based on a composition grid: the character to which the currently input stroke belongs may be determined from the relationship between the stroke and the corresponding composition grid cell in the current input line.
In this embodiment, stroke attribution can thus also be determined using a composition grid. Specifically, before the strokes input by the user and the corresponding input information are acquired in step 101A, the first target row/column may be divided into a plurality of composition grid cells.
Accordingly, the creating a new character for the stroke or determining a character to which the stroke belongs according to the input position of the stroke in the first target row/column in step 102A or the input position of the stroke in the first target row/column and the character specified in the first target row/column includes:
determining the composition grid cell in which the stroke is located according to the input position of the stroke in the first target row/column;
judging whether a character exists in that composition grid cell;
if so, the stroke is attributed to the existing character in the cell; otherwise, a new character is created in the cell, and the stroke is attributed to the new character.
Specifically, if the stroke spans only one composition grid cell, it is judged whether a character exists in that cell; if so, the stroke is attributed to the character in the cell, and if not, a new character is created for the stroke, and the new character belongs to that cell. If the stroke spans at least two composition grid cells, it is judged whether characters exist in those cells: if none of the cells contains a character, a new character is created for the stroke, and the new character belongs to all of those cells; if exactly one of the cells contains a character, the stroke is attributed to that character; if several of the cells contain characters, the characters in those cells are merged, and the stroke is attributed to the merged character.
Using the composition grid to assist in judging the character to which a stroke belongs is simple and convenient, and it also constrains the user's input better, so the judgment result is more accurate.
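The composition-grid rule can be sketched as below for a horizontal row, assuming equal-width cells numbered from the left edge of the row and a dictionary `cell_to_char` that records which character occupies each cell. These structures and names are illustrative assumptions.

```python
def cells_spanned(stroke, cell_width):
    """Indexes of the grid cells covered by the stroke's horizontal extent."""
    xs = [p[0] for p in stroke]
    return list(range(int(min(xs) // cell_width), int(max(xs) // cell_width) + 1))


def attribute_by_grid(stroke, cell_to_char, characters, cell_width):
    """Attribute `stroke` using the grid and return the owning character's index.

    `characters` is a list of stroke lists and `cell_to_char` maps a cell index to a
    character index; both are modified in place. A merged-away character is left as
    an empty list so that the remaining character indexes stay stable.
    """
    cells = cells_spanned(stroke, cell_width)
    owners = sorted({cell_to_char[c] for c in cells if c in cell_to_char})
    if not owners:
        characters.append([])  # no character in any spanned cell: new character
        target = len(characters) - 1
    else:
        target = owners[0]
        for other in owners[1:]:  # several occupied cells: merge their characters
            characters[target].extend(characters[other])
            characters[other] = []
            for c, k in cell_to_char.items():
                if k == other:
                    cell_to_char[c] = target
    characters[target].append(stroke)
    for c in cells:
        cell_to_char[c] = target  # the character now belongs to every spanned cell
    return target
```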
The above describes how to judge which character a stroke belongs to, but automatic division inevitably involves division errors, such as one character being split into several characters or several characters being merged into one. In this embodiment, however, characters are generally not recognized; an input character is recognized only when specifically required. This is because, on the one hand, each input character in this embodiment is divided and stored as a font object (nonstandard, i.e., a handwritten character); in other words, each divided input character is treated as a nonstandard font object. On the other hand, if the handwritten content is ultimately intended only for human reading (where preserving the original form of the input matters most), division errors need not be corrected.
However, if a division error occurs where a row/column wraps, for example when a character written at the end of a row is erroneously split into two characters, "white" and "spoon", that are placed in different rows or columns, such an error needs to be corrected in some way. Likewise, when the user browses previously entered characters and finds an incorrectly divided character, it may be corrected in some way.
Such division errors can be corrected interactively, and the same effect can be achieved in other possible ways. This embodiment provides a correction method, which specifically includes:
respectively acquiring and displaying the boundary of each character stored locally;
receiving a correction request input by a user, wherein the correction request comprises characters to be corrected, or characters to be corrected and strokes to be corrected;
and carrying out corresponding correction processing on the character to be corrected according to the correction request.
Specifically, the specific content of the correction request may be different according to different scenarios, and in this embodiment, the following several scenarios are provided:
Scenario I: combining at least two characters into one, that is, the correction request is a merge correction request, and the characters to be corrected are at least two characters to be merged;
correspondingly, performing the correction processing on the characters to be corrected according to the correction request includes:
merging the at least two characters to be merged into one character.
Scenario II: splitting one character into several characters, that is, the correction request is a split correction request, and the character to be corrected is one character to be split;
correspondingly, performing the correction processing on the character to be corrected according to the correction request includes:
splitting the character to be split into at least two characters.
Scenario III: reassigning a stroke from one character to another, that is, the correction request is an attribution correction request, the character to be corrected is the character that is to receive the stroke, and the strokes to be corrected are at least one stroke to be reassigned;
correspondingly, performing the correction processing on the character to be corrected according to the correction request includes:
attributing the at least one stroke to be corrected to the receiving character.
Through the correction function, the split characters can be re-split in a mode of interaction with a user, so that the accuracy of character splitting is improved.
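The three correction scenarios can be sketched as simple operations on a list of characters, where each character is a list of strokes. The function names and the choice to place a split-off character directly after the original are assumptions for illustration.

```python
def merge_characters(characters, indexes):
    """Scenario I: merge the characters at `indexes` into the one with the smallest index."""
    keep = min(indexes)
    for k in sorted(set(indexes) - {keep}, reverse=True):
        characters[keep].extend(characters.pop(k))
    return keep


def split_character(characters, index, stroke_positions):
    """Scenario II: move the strokes at `stroke_positions` into a new character.

    The new character is inserted directly after the original; its index is returned.
    """
    original = characters[index]
    moved = [s for i, s in enumerate(original) if i in stroke_positions]
    characters[index] = [s for i, s in enumerate(original) if i not in stroke_positions]
    characters.insert(index + 1, moved)
    return index + 1


def reassign_stroke(characters, source, stroke_position, destination):
    """Scenario III: move one stroke from the `source` character to `destination`."""
    characters[destination].append(characters[source].pop(stroke_position))
```

For instance, two characters wrongly separated at a line wrap can be repaired with a single `merge_characters` call.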
Since each character (possibly one word or a combination of several words) has already been separated into an independent unit during character division, distinguishing between characters is easy. Further, since the method provided in this embodiment can also record the stroke order (based on time) and the shape features of each stroke written by the user, characters with the same or similar stroke order and stroke shape features can easily be found by comparing them against one another, and when an appropriate threshold condition is satisfied, these characters can be regarded as the same character. This makes matching and searching of characters easy, and even allows a character handwritten by the user to be used as the search condition.
In this embodiment, the functions of searching and inserting can also be added.
The search function may specifically include the following steps:
receiving a search command input by a user, wherein the search command comprises characters to be searched input by the user;
and respectively comparing the character to be searched with the locally stored characters according to the stroke quantity and the stroke characteristics of the character to be searched, and obtaining the characters matched with the character to be searched.
After the content input by the user is divided by the method provided by the embodiment, split handwritten characters can be obtained. On the basis, the handwritten character search based on the graph matching can be performed, and the process mainly comprises the step of matching each character in the search source with the character to be searched one by one. Matching characters can be found by matching the number of strokes and the sequence of strokes.
An exemplary process for single text matching based on strokes in this embodiment is given below:
judging whether the number of strokes in the character to be searched is the same as the number of strokes in a locally stored character; if not, the match fails; if so, executing the next step;
matching the strokes of the character to be searched against the strokes of the locally stored character one by one, i.e., curve matching; if any pair of strokes is inconsistent, the final matching result is failure, and if all are consistent, the final matching result is success.
Of course, any graphic analysis or other matching method known in the art may be used to implement the character search function, and the present embodiment is not limited in this regard. Based on the same principle as the search function, the function of replacing characters can also be realized, and the description is omitted here.
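One possible way to realize the two matching steps is sketched below. The text does not specify the curve-matching method, so this sketch substitutes a simple one as an assumption: each stroke is resampled to a fixed number of points along its length, characters are translated to a common origin, and two strokes are considered consistent when the mean distance between corresponding points is within a hypothetical `tolerance`. Strokes are compared in stroke order.

```python
import math


def resample(stroke, n=8):
    """Return n points evenly spaced along the arc length of a polyline."""
    seg = [math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(stroke, stroke[1:])]
    total = sum(seg)
    if total == 0:  # single point or zero-length stroke
        return [stroke[0]] * n
    out = []
    for k in range(n):
        target = total * k / (n - 1)
        i, acc = 0, 0.0
        while i < len(seg) - 1 and acc + seg[i] < target:
            acc += seg[i]
            i += 1
        t = (target - acc) / seg[i] if seg[i] else 0.0
        a, b = stroke[i], stroke[i + 1]
        out.append((a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1])))
    return out


def normalize(character):
    """Translate a character so that the corner of its bounding box is the origin."""
    x0 = min(p[0] for s in character for p in s)
    y0 = min(p[1] for s in character for p in s)
    return [[(x - x0, y - y0) for x, y in s] for s in character]


def strokes_match(s, t, tolerance):
    """Curve matching: mean distance between resampled points within `tolerance`."""
    a, b = resample(s), resample(t)
    mean = sum(math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b)) / len(a)
    return mean <= tolerance


def characters_match(query, candidate, tolerance=2.0):
    """Step 1: compare stroke counts. Step 2: match strokes one by one, in stroke order."""
    if len(query) != len(candidate):
        return False
    return all(strokes_match(s, t, tolerance)
               for s, t in zip(normalize(query), normalize(candidate)))


def search(query, stored, tolerance=2.0):
    """Indexes of the stored characters that match the character to be searched."""
    return [k for k, ch in enumerate(stored) if characters_match(query, ch, tolerance)]
```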
In this embodiment, the insertion function of the handwritten text input editing may specifically include the following steps:
receiving an insertion request input by a user, wherein the insertion request comprises a target row/column to be inserted, a position to be inserted in the target row/column to be inserted and a character to be inserted;
activating the target row/column to be inserted, and inserting the character to be inserted into the position to be inserted;
and correspondingly adjusting the characters after the position to be inserted.
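The insertion steps can be sketched as follows, under the simplifying assumptions that each row holds at most `capacity` characters (real handwritten characters have varying widths) and that all rows shown belong to the same paragraph. The function name is hypothetical.

```python
def insert_and_reflow(rows, row, position, new_characters, capacity):
    """Insert characters into `rows[row]` at `position` and push overflow downward.

    `rows` is a list of rows, each a list of characters, and is modified in place.
    Characters that no longer fit in a row are moved to the start of the next row;
    a new row is appended if the overflow runs past the last row.
    """
    rows[row][position:position] = new_characters
    r = row
    while r < len(rows) and len(rows[r]) > capacity:
        overflow = rows[r][capacity:]
        rows[r] = rows[r][:capacity]
        if r + 1 == len(rows):
            rows.append([])
        rows[r + 1][0:0] = overflow  # characters after the insertion shift onward
        r += 1
    return rows
```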
If a new character is to be inserted in the middle of an existing content, an explicit command is required to enter/exit the insertion mode instead of automatically inserting as in conventional character input. In addition, since the inserted characters may be handwritten characters, standard characters inputted by a keyboard, nonstandard characters using other input devices, or the like, corresponding insertion control or switching instructions, and instructions for identifying and editing the inserted content are also required.
If the user needs to add a character at a certain position in a line that has become inactive, for example, when inserting a character between the 3 rd and 4 th characters of a certain line, the user needs to first activate the line, the system will provide an auxiliary interface at the blank characters of the line that accepts user input. The user activates the auxiliary interface between the 3 rd and 4 th characters of the line, i.e. selects the application of the insert operation at the character interval.
Insertion may be performed before or after any character. For a handwriting system, insertion can be further restricted to blank characters. Fig. 1E is a schematic diagram of the character insertion state in an embodiment of the method for processing handwriting input characters according to the present invention. As shown in Fig. 1E, after the insertion editing state is entered, the existing characters after the insertion position may be moved to the next line, leaving the span from the insertion position to the end of the current line as writable space. A right arrow marks the row as an insertion row, and clicking the right arrow exits the insertion state. Until the insertion ends, the user can input only between the insertion marks.
Both the characters preceding the insertion position and those following it are read-only (but selectable) until the insertion is completed. After the insertion is completed, the lines are rearranged according to the inserted characters. The last insertion line (when the insertion starts, the current line is the last insertion line) can be extended, and the extended line becomes the new last insertion line. In theory, insertions can be nested, that is, content can be inserted into inserted content. An insertion row has a different visual state from an ordinary row to help the user identify the current editing state.
In addition to the above-mentioned search and insert functions, other processing may be performed on the character input by the user through handwriting, where the processing may include the following steps:
acquiring at least one character selected by the user;
receiving a selection processing command input by a user, and processing the at least one character according to the selection processing command;
wherein the selection processing command comprises any one or a combination of the following: copying the at least one character, cutting the at least one character, replacing the at least one character, and merging the at least one character.
Fig. 1F is a schematic diagram of an edit mode under a selection processing command in an embodiment of a method for processing a handwriting input character according to the present invention. As shown in fig. 1F, functions of insertion, pasting, full selection, combination and the like can be displayed on the handwriting input screen, so that a user can conveniently perform corresponding operations.
In addition, the present embodiment can insert or add strokes, comments, delete some characters, or the like over the entered characters. The functions of searching, inserting, copying and the like provided in the embodiment can effectively avoid the defects of the existing handwriting input system such as insufficient intuitiveness, difficult modification and the like.
On the basis of the technical solution provided in the foregoing embodiment, it is preferable that the number of the first target rows/columns is plural;
the activation areas corresponding to the first target rows/columns are all non-overlapping and are not in contact with each other.
In this case, a plurality of users can input in the activation areas corresponding to a plurality of first target rows/columns, respectively, so that the function of allowing multiple persons to input simultaneously of the large-size handwriting input screen is satisfied.
On the basis of the technical solution provided in the above embodiment, it is preferable that the present embodiment is compatible with existing keyboards, mice, and other existing input devices, and the mode switching is performed to implement hybrid input. The mode switching method in this embodiment may specifically include:
receiving a mode switching request input by a user, wherein the mode switching request comprises a target mode;
and switching the handwriting mode to the target mode, and receiving at least one standard character input by a user in the target mode.
The target mode may be a keyboard input mode, a mouse input mode or other existing input modes. For example, standard coding characters or other symbols or information can be added or inserted in the input limit of rows or columns in combination with the existing keyboard, so as to realize mixed typesetting (see handwriting graphic mixing arrangement in the example of the application).
In particular, other connected input devices, such as a keyboard, may be activated by means of appropriate touch keys or operations (e.g. clicking), allowing the user to switch freely between handwriting input and conventional input devices such as a keyboard. Keyboard input may be divided using standard code division, or using the character division method of the present invention.
In addition, the activation area may also automatically move with the user's input during handwriting input. For example, the active area is always repositioned with the position of the last stroke of the user as the midpoint position of the active area. Thus, in most cases, the activation area will automatically move with the user's writing, so that the location of the activation area may not be set manually.
In the conventional standard code input state, the system has a blinking cursor to indicate the current input position. In the handwritten character input state, the system displays an activation area to represent the currently available range. When the user makes an input mode switch, the two can mutually translate according to a certain rule. For example, when switching from standard character input to handwriting input, the system sets the position of the activation region with the cursor position as the midpoint of the activation region; when switching from handwriting input to standard character input, the character position nearest to the midpoint in the activation region is set to the current input position.
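The translation between a cursor and an activation area, and the re-centering of the activation area on the last stroke, can be sketched in one dimension along the row. The fixed `width` of the activation area and the use of character centre positions are assumptions for illustration.

```python
def activation_area_around(midpoint_x, width):
    """Activation area (left, right) whose midpoint is the given position.

    The same helper covers both cases in the text: centring on the cursor when
    switching to handwriting, and re-centring on the position of the last stroke.
    """
    half = width / 2
    return (midpoint_x - half, midpoint_x + half)


def nearest_character_index(area, char_centers):
    """Index of the character closest to the midpoint of the activation area.

    Used when switching from handwriting input back to standard character input.
    Assumes the row contains at least one character.
    """
    midpoint = (area[0] + area[1]) / 2
    return min(range(len(char_centers)), key=lambda i: abs(char_centers[i] - midpoint))
```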
On the basis of the technical scheme provided by the embodiment, preferably, the concept of control characters can be added, so that the problems of typesetting and editing of the handwritten text contents can be solved. Control characters exist in a standard code (such as ASCII code) character set, and similarly, the concept of the control characters can be introduced into the handwritten characters, so that the output and the processing of the handwritten character contents are more convenient and flexible.
Specifically, the control character may be a standard control character, such as a special character of space, tab, line feed, etc.; non-standard control characters, such as blank characters, are also possible. The standard control character is similar to the prior art, and the blank character is described in detail below by taking the sixth embodiment as an example.
In addition, this embodiment also provides a blank character function. Specifically, the spacing information between characters may be retained, for example the spacing between left and right characters in a horizontal format, or the spacing between upper and lower characters in a vertical format, and the spacing may be turned directly into a blank character that carries this spacing information.
For a character input by handwriting, when the writing style runs from left to right and from top to bottom, the horizontal baseline of the target line containing the character can be defined as the baseline of the character, and the position of the leftmost component (such as a graphic, an image or a stroke) in the character is set as the starting position of the character. Each component of the character then records its position relative to an origin defined by the baseline and the starting position, with the typesetting direction as the positive direction. Thus, the same character content can appear at different positions in the text, and all of its internal components can be drawn correctly as long as the origin coordinates of the character are correctly calculated from its line and its position within the line. For other writing styles, the starting position of each character may be set in a similar manner, and the positions of the components inside the character are expressed relative to that starting position.
These starting positions are required only when the character is drawn. When the divided characters are stored, the starting position is not stored. Instead, the spacing between characters is split off to form blank characters, which are stored in the character sequence together with the characters.
Fig. 1G is a schematic diagram of a blank character in an embodiment of a method for processing a handwriting input character according to the present invention. As shown in fig. 1G, this embodiment introduces a custom blank character, and the word spacing is saved as its parameter/content. The numerals 12, 16 and 10 in fig. 1G are the values of the blank characters and represent the length of each blank character. A blank character can be treated differently from other characters during analysis and processing (e.g., recognition, text wrapping, etc.). Similarly, a time-based blank character can be added to the text of voice input.
In general, the maximum coordinate of a user-input character in the typesetting direction is the width of the character. The character width may be stored, or it may be omitted and restored from the position information of all the internal components of the character. When typesetting, the starting positions of all characters in a row/column can be restored simply from the width information of all characters (including control characters), which provides the basis for further character rendering.
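A minimal sketch of this width-based restoration is given below. The function names and the `(offset, extent)` representation of components are assumptions made for illustration; blank characters are treated as ordinary entries whose width is their stored length.

```python
from typing import List, Sequence, Tuple


def char_width(components: Sequence[Tuple[float, float]]) -> float:
    """Width of a character restored from its internal components.

    Each component is an (offset, extent) pair along the typesetting
    direction, relative to the character's starting position.
    """
    return max((offset + extent for offset, extent in components), default=0.0)


def start_positions(widths: Sequence[float], origin: float = 0.0) -> List[float]:
    """Starting position of every character in a row/column, from widths only."""
    positions = []
    cursor = origin
    for width in widths:
        positions.append(cursor)
        cursor += width  # blank characters advance the cursor like any other
    return positions
```

For instance, `start_positions([5, 12, 4, 16, 6])` treats the second and fourth entries as blank characters of length 12 and 16 and yields the starting positions 0, 5, 17, 21 and 37.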
In this embodiment, standard control characters and blank characters are introduced, and these control characters have models, codes, glyphs, meanings and the like similar to those of the characters input by handwriting. Thus, the theory, methods and tools for handling handwritten input characters can be applied directly or indirectly to control characters. Furthermore, handwritten characters and control characters can be mixed together for processing, which is what makes the splitting of characters so significant.
The object processed in this embodiment may be a stroke character, a standard character, a graphic character, a combination character or a control character input by the user, or may be a mixture of a plurality of characters therein.
Fig. 1H is a flowchart of a method for processing a handwriting input character according to an embodiment of the present invention. As shown in fig. 1H, the text editing in this embodiment may specifically include the following steps:
Step 601A, judge the opening mode: if an existing document is opened, execute step 602A; if a new document is created, execute step 603A.
The embodiment is mainly used for providing personalized handwriting character input for related documents, and mainly comprises two modes of entering a handwriting input system: a mode with document data and a mode without document data. The former is to open an existing document, and the latter is to newly create a document.
Step 602A, loading document data and typesetting text according to typesetting constraints, and executing step 604A.
Specifically, the data of a character may be loaded hierarchically. For example, typesetting only requires the width of each character (the height for column-based typesetting), so in this step only the width information may be loaded. Other information needed for drawing, such as stroke information and outline information, can be loaded later as required, which saves system resources (memory, network traffic and the like).
Step 603A, initializing a handwritten document, and executing step 604A.
Step 604A initializes (empties) a sequence of handwritten text objects representing a text input line.
Hereinafter, a sequence of handwritten text objects representing a text input line will be abbreviated as AL (Active Line), and AL is core data to be processed in the method provided in this embodiment.
Step 605A, presenting the document content, and executing step 606A.
The presented content includes several parts: visualization information of the document itself (including that of handwritten characters, such as the positions and shapes of the characters), visualization information of the document presentation environment (such as background, shading and paper boundaries), and visualization information related to document editing (such as the selected area, the cursor or activation area representing the input focus, and auxiliary lines). As mentioned in step 602A, the visualization data of a handwritten character has to be loaded only when the character needs to be presented. For characters that do not need to be presented, the corresponding visualization data need not be loaded.
In this embodiment, as in a conventional data processing system, the character stream is loaded from the storage area into memory and must be typeset before display. For simple plain text, typesetting here means line breaking.
Specifically, a line is broken at a line-break character (hard carriage return), which marks the end of a paragraph. Within each row/column, the position of each character is calculated and the total length of the input text content is accumulated. A line is also broken (soft carriage return) when the position exceeds the maximum position of the line. The break is placed at the most recent breakable position.
A series of rules decides which positions are breakable:
a line may be broken after a punctuation mark (a punctuation mark cannot be the first character of a line after a soft carriage return);
a line may be broken at a blank (blank characters, tab characters and the like), and the first character of the next line is a non-blank character (a blank cannot be the first character of a line after a soft carriage return);
a line may be broken directly before or after an East Asian character;
a line may not be broken directly in the middle of an English word (a simple system can move the whole word to the next line; a more complex system with recognition capability can also break the word at its prefix or suffix and add a hyphen);
a line may be broken directly before or after a handwritten character.
In a practical implementation, a blank character may be converted into a blank space of a standard length. Successive blank spaces may be merged directly, which makes the layout algorithm simpler. Blank spaces are processed in the same way as blank characters.
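The breaking rules above can be sketched as a greedy line breaker. This is a simplified illustration rather than the method of the embodiment: the `Token` type and its `kind` labels are hypothetical, an English word is assumed to arrive as a single unbreakable token, hyphenation is not attempted, hard carriage returns are not handled, and a line is allowed to overflow when it contains no legal break position.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    kind: str     # "word", "punct", "blank", "cjk" or "hand" (hypothetical labels)
    text: str
    width: float


FREE = {"cjk", "hand"}  # East Asian and handwritten characters break on both sides


def can_break(prev: Token, nxt: Token) -> bool:
    """True if a soft carriage return may be placed between prev and nxt."""
    if nxt.kind == "punct":  # punctuation may not start a line
        return False
    if prev.kind in ("punct", "blank") or nxt.kind == "blank":
        return True
    return prev.kind in FREE or nxt.kind in FREE


def last_break(line: List[Token]) -> Optional[int]:
    """Index of the last token that may legally start the next line."""
    for j in range(len(line) - 1, 0, -1):
        if can_break(line[j - 1], line[j]):
            return j
    return None


def break_lines(tokens: List[Token], max_width: float) -> List[List[Token]]:
    lines: List[List[Token]] = []
    line: List[Token] = []
    for tok in tokens:
        if not line and lines and tok.kind == "blank":
            continue  # a blank cannot start a line after a soft carriage return
        line.append(tok)
        while sum(t.width for t in line) > max_width:
            j = last_break(line)
            if j is None:
                break  # no legal break position: let this line overflow
            head, tail = line[:j], line[j:]
            while tail and tail[0].kind == "blank":
                tail.pop(0)
            lines.append(head)
            line = tail
    if line:
        lines.append(line)
    return lines
```

Because a break is never placed before a punctuation token, a trailing punctuation mark that overflows the line pulls the preceding character down with it instead of starting the next line alone.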
The document model after typesetting includes the information of each display line. Each line contains positioned words (including words composed of letters, East Asian characters and handwritten characters). Blank characters do not need to appear in this model, because the relevant information is implicit in the positional properties of the words (left border, and right border = left border + width). Therefore, blank characters (including blank characters produced by handwriting spacing, standard space characters, tab characters, etc.) can be discarded after typesetting.
In the document model after typesetting, the distance between characters is implicit in the coordinates of the characters. For example, within a certain line, the left end coordinate of a character is 12 and its width is 2.5, and the left end coordinate of the next character is 16. The distance between the two characters is therefore 16 - 12 - 2.5 = 1.5. The text within each line may change as the user writes, and entering or erasing strokes may change the spacing of characters or create new characters. The spacing can be generated correctly as long as the character coordinates are correct. Blank characters need to be calculated, generated and inserted at the appropriate positions only when the edited content is stored.
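The regeneration of blank characters at storage time can be sketched as follows. The function name, the `"BLANK"` marker and the `(code, left, width)` tuples are hypothetical; the sketch assumes the characters of one line are already sorted along the typesetting direction, and that only positive gaps produce a blank character.

```python
from typing import List, Tuple


def with_blanks(chars: List[Tuple[str, float, float]]) -> List[Tuple[str, float]]:
    """Turn positioned characters into a storable sequence with blank characters.

    chars holds (code, left, width) for one line; the result holds
    (code, width) pairs with ("BLANK", gap) inserted where characters
    are separated by a positive gap.
    """
    out: List[Tuple[str, float]] = []
    prev_right = None
    for code, left, width in chars:
        if prev_right is not None:
            gap = left - prev_right
            if gap > 0:
                out.append(("BLANK", gap))
        out.append((code, width))
        prev_right = left + width
    return out
```

With the numbers from the text, a character at left coordinate 12 with width 2.5 followed by a character at left coordinate 16 yields a blank character of length 1.5 between them.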
Step 606A, receive the command, and perform different operations according to the command.
The command may be a command input by a user, or may be a system command or a command transmitted from other application systems.
The command may be sent in various ways through a conventional interactive device, or by a gesture. For example, when the user is recognized to have drawn a horizontal line across several consecutive characters, the gesture may be interpreted as an operation to delete those characters. Commands may also be issued automatically through settings, such as automatically starting handwriting input after a document is created or opened, or automatically ending handwriting input after content is selected.
Specifically, if the command is a text code typesetting command, step 607A is executed; if the command is a command to start handwriting input, step 608A is executed; if the command is a command to end handwriting input, step 610A is performed; if the command is a system exit command, step 612A is performed.
Step 607A, typeset the text content according to the command.
When character information is stored, the typesetting constraint and typesetting direction can be stored with the information of each character. Thus, when the same character appears in text with a different typesetting mode, the internal relative positions of all parts of the character can be adjusted to the current typesetting mode according to this information, so that the character is drawn correctly.
The following describes the interconversion of different typesetting modes in two examples.
One example is using characters originally typeset horizontally in a vertical layout, or vice versa. Horizontally typeset characters advance by width (i.e. the row length is accumulated from left to right along the typesetting direction), while vertically typeset characters advance by height. A specific implementation therefore needs to distinguish horizontal characters from vertical characters. For horizontal characters, an internal coordinate system may be used with the row axis (alignment line) as the horizontal axis and the leftmost stroke point as the vertical axis; for vertical characters, an internal coordinate system may be used with the column axis as the horizontal axis and the highest stroke point as the vertical axis. In this way, characters keep their original alignment when drawn in their own typesetting mode. When a horizontal character is used vertically, or a vertical character horizontally, the system can convert coordinates automatically using the typesetting meta information of the character. Although the original alignment between characters is not preserved, each character can still be presented normally.
Another example is changing a composition-grid layout into a normal layout. In the composition-grid layout, the layout is marked in the type of the character, and the internal coordinate system of each character may take the lower left corner of the corresponding grid cell (or in fact any point, such as the center point) as the origin. Thus, each character is aligned with its grid cell. Handwritten text in the composition-grid layout has no spacing/blank characters (although it may have blank grid-cell characters). When the composition-grid layout is changed into the normal layout, each character can be recalculated in a new coordinate system (such as one using the intersection of the baseline and the leftmost end as the origin), and the corresponding spacing characters are inserted between characters according to the new coordinate system.
Step 608A, activate the target row/column, and execute step 609A.
In this step, the target row/column may be activated, and the text object in the target row/column is activated (loading stroke information), and the object sequence is assigned to AL.
In this embodiment, the input of handwritten characters is performed under row/column constraints. Even though the input content spans multiple rows/columns, its corresponding character must ultimately be stored in a particular location in a particular row. The target row/column of character input can thus be presented in a visual manner and cross-row input by the user can also be avoided by specific settings, such as auxiliary panels, full screen row editing, etc.
Step 609A, perform handwriting input under the constraint of the activated target row/column, and return to step 605A.
In this step, handwriting input can be performed under the constraint of the activated target row/column, and each stroke of input is automatically combined with AL according to a certain rule to form a new object sequence of handwriting (i.e. AL is updated).
The input process for handwritten characters mainly consists of automatically grouping the input strokes into different characters according to the space constraints within rows/columns. Implementations are described in the previous embodiments; in particular, the grouping into characters can be achieved through word-spacing constraints or grid-cell constraints.
Step 610A, store the content of the character objects in AL, and execute step 611A.
In this step, the content of the character objects in AL is stored, and if necessary, the text content related to AL may be typeset again.
At the end of handwritten character input, the character objects in AL become fixed (until then they change dynamically with the stroke input). Some of these character objects are unchanged, some have changed content (strokes), and some are completely new. Both changed characters and completely new characters are treated as new characters. The character sequence of the final AL needs to be written back to the corresponding position in the document. If a storage mode that separates codes from contents is used, the contents of the new characters must first be stored in a code library to obtain the corresponding codes. The new code sequence is then saved to the corresponding location of the document (typically the document model in memory).
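A storage mode that separates codes from contents could look like the sketch below. The class `CodeLibrary`, its sequential integer codes and the reuse of a code for identical content are all assumptions for illustration; the text only states that storing new content yields a code and that the code sequence is then saved in the document.

```python
from typing import Dict, Hashable, Iterable, List


class CodeLibrary:
    """Hypothetical code library: maps character content to a code and back."""

    def __init__(self, first_code: int = 1) -> None:
        self._code_of: Dict[Hashable, int] = {}
        self._content_of: Dict[int, Hashable] = {}
        self._next = first_code

    def store(self, content: Hashable) -> int:
        """Store content and return its code (identical content reuses its code)."""
        code = self._code_of.get(content)
        if code is None:
            code = self._next
            self._next += 1
            self._code_of[content] = code
            self._content_of[code] = content
        return code

    def load(self, code: int) -> Hashable:
        return self._content_of[code]


def save_active_line(al_contents: Iterable[Hashable], library: CodeLibrary) -> List[int]:
    """Replace each character object of AL by its code, storing new content first."""
    return [library.store(content) for content in al_contents]
```

The returned code sequence is what would be written to the document model, while the stroke content stays in the library and can be loaded on demand.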
Because the handwritten character method uses space constraints within rows/columns, the length of a row/column generally does not change. However, when an edit that inserts content or extends a line (soft carriage return) ends, the layout information of the current line and the following lines needs to be updated, i.e., the text is typeset again starting from the current line.
Step 611A, clear AL, and return to step 605A.
After the handwriting input is finished, the target row/column of the handwriting input does not exist, and the corresponding data structure can be emptied.
Step 612A, end.
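The control flow of steps 601A to 612A can be traced with a small dispatch table. This is a sketch only: the command names are hypothetical, and the return from step 607A to step 605A is an assumption, since the text specifies the return path only for steps 609A and 611A.

```python
from typing import Iterable, List

# Command -> step executed from step 606A (command names are hypothetical).
DISPATCH = {
    "typeset": "607A",
    "start_handwriting": "608A",
    "end_handwriting": "610A",
    "exit": "612A",
}

# Steps that follow the dispatched step before the flow returns to 605A.
FOLLOW_UP = {
    "607A": [],        # assumption: typesetting returns directly to 605A
    "608A": ["609A"],  # activate target row/column, then handwriting input
    "610A": ["611A"],  # store AL content, then clear AL
}


def trace_steps(open_existing: bool, commands: Iterable[str]) -> List[str]:
    """Return the sequence of step labels visited for a sequence of commands."""
    steps = ["601A", "602A" if open_existing else "603A", "604A"]
    for command in commands:
        steps += ["605A", "606A"]  # present the document, then receive a command
        target = DISPATCH[command]
        steps.append(target)
        if target == "612A":
            return steps
        steps += FOLLOW_UP[target]
    return steps
```

The trace simply stops when the command sequence runs out; in an interactive system the flow would instead wait in step 606A for the next command.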
The processing method of the handwriting input word is convenient for the user to edit and process the handwriting character, and further improves the input experience of the user.
Besides editing and typesetting document contents and splitting, merging, recognizing, inserting, searching and replacing characters, this embodiment may also perform other processing on document contents, such as saving and printing documents, as well as processing operations specific to handwritten character input, including but not limited to the following examples.
In order to more closely resemble the effect of writing on paper, it is also possible to refer to the scroll bars in the existing conventional text editing tools or software, and in this embodiment, to arrange corresponding row and column scroll scales so as to extend the input range of the panel, i.e., the input range space of the rows and columns, upward, downward, leftward or rightward. And, as the scale is moved, the corresponding target row/column may be displayed and/or activated accordingly.
The line height during handwriting input can also be in a corresponding relation with the specific word size of the standard word stock, so that the word size of the handwriting input word is standardized or regulated.
Blank information among characters can be discarded after the characters are recognized, and even part of character spacing information and position information can be selectively discarded, so that a certain storage space is saved.
The coding function can also be added in this embodiment.
Specifically, the encoding function in the present embodiment may include:
receiving a coding request, and determining a font corresponding to a handwritten character in a handwriting input program according to the coding request;
and inquiring a mapping table in the coding warehouse to obtain the standard language parameters corresponding to the fonts.
Wherein the standard language parameters include one or a combination of several: numbers, symbols, keywords, public identifiers, and private identifiers.
The embodiment can realize the function of encoding characters generated in the handwriting input process, and the detailed description is described below.
In the present invention, the input text or data object is abstracted into the concept of a "character". A character may be a handwritten character of an ideographic script, such as a single Chinese, Japanese, Korean, Arabic, Tibetan or Burmese character, or a part thereof (e.g., a radical); a handwritten word of a phonetic script, such as letters or words of English, German, French, Russian, Spanish and other Western languages; a computer character based on a traditional standard code, such as an ASCII character, a Unicode character or a character string; a control character, such as a space, tab or line feed; a non-standard control character, such as the spacing between handwritten characters described herein; a mixture of handwritten characters with standard characters and/or a composite character or character string; or even any graphic or image entered by the user, such as a heart-shaped pattern, a photograph, any graffiti, or any other written representation. In the input scheme or system of the present invention, all character objects input in the above ways are recognized as characters in the form of non-standard glyphs.
The glyphs referred to in this invention are similar to the concept of characters in a standard word stock, except that the invention generates glyphs that are all non-standard. Since the present invention is not directed to generating standard fonts or word libraries, the glyphs ultimately generated by the system of the present invention are likely to include erroneous splits of various characters or words or combinations between them, as well as any graphics or images entered by the user, etc.
Modern high-level programming languages can broadly be divided into two processing approaches: compilation and interpretation. The former converts the source code through a series of compilation steps into a binary file that encapsulates the instruction sequence of the target machine (which may be a virtual machine). The binary file must be loaded into the target system to be executed. Interpretation means that an interpreter running in the target system reads the source code and runs it directly through a series of internal processes.
Languages based on interpretation are generally called scripting languages; typical examples are JavaScript, Lua and Tcl. Many conventional programming languages are compiled, such as C, C++, Objective-C, Java, C#, Go and Swift. Some languages support both approaches, such as Python, Ruby, Lua, Haskell, Scheme and F#.
The front end architecture of the core component that processes program source code, whether it be a compiler or an interpreter, is very similar, or even identical. By front-end, we mean converting the source code into an internal intermediate form. Correspondingly, for a compiler, the backend refers to converting the intermediate form into machine code, and for an interpreter, to executing the intermediate form by an execution engine. In some systems, there is also processing and optimization for intermediate forms, called the mid-end. The focus here is on the front-end part, so in general we do not distinguish between compiled and interpreted forms. The front-end is referred to herein collectively as a compilation front-end.
The compilation front-end generally includes four processes: lexical scanning, syntax analysis, semantic analysis and intermediate code generation. The lexical scanner converts the source code into a token stream; the parser converts the token stream into an abstract syntax tree; semantic analysis adds semantic annotations to the abstract syntax tree; the intermediate code generator converts the annotated abstract syntax tree into the intermediate form of the compiler.
In a programming environment, besides the core processor of the source code (compiler/interpreter), there are other related supporting systems, platforms and tools, such as a code editor for entering and modifying source code, a debugger for debugging code execution, a source code control tool for managing code versions, and so forth.
An integrated development environment (IDE, Integrated Development Environment) is an application that integrates all of these systems and tools and provides a unified user interface.
As a programming environment, the handwritten text system brings a brand new mode of text input, with advantages such as security and convenience. However, the result of input and editing is still a character stream; it simply uses a proprietary code specific to the individual writer instead of the standard code.
For handwritten characters, a special programming language can be designed; standard code based program source code may also be generated using a glyph matching service in a handwritten word system. For the latter, a large number of existing programming environments and tools can be reused directly. The present embodiment is mainly exemplified with respect to such a scheme.
In practice, this approach is quite straightforward: the person-specific proprietary codes are converted into standard codes. That is, the handwritten source code is converted into source code that can be recognized by a general compilation front-end. The handwritten source code can thus be processed by adding a conversion process before the traditional compilation front-end, so that the whole process generally comprises five processes: handwritten source code conversion, lexical scanning, syntax analysis, semantic analysis and intermediate code generation.
The code conversion process converts and matches the handwritten source code according to established rules and generates the corresponding standard-code content, which no longer depends on the glyphs in the glyph library. The process is divided into two parts: control character conversion and glyph conversion.
For control character conversion, the control characters in the programming language mainly comprise blank spaces, tab characters, carriage returns, line feeds and the like. This conversion is very straightforward since the same or similar control symbols as ordinary text can be used in our handwritten text. For example, the handwritten space code is directly converted to standard blank characters. If the handwritten line-feed symbol directly adopts standard line-feed codes, the handwritten line-feed symbol can be reserved without conversion.
Glyph conversion converts the personalized glyph codes in the handwritten source code into the corresponding standard codes. The conversion is based on the glyphs in the corresponding glyph library and requires the glyph matching service of the handwritten text system. It includes numeric symbol mapping, keyword mapping, interface identifier mapping, and private identifier generation and mapping.
With respect to numeric symbol mapping: the source programs of most high-level programming languages exist in the form of text files. Their foremost difference from plain text content is the grammar constraint, which is embodied in strict definitions of keywords and syntax symbols.
Numeric symbol mapping searches and matches the glyphs in the handwritten source code against a user-defined glyph numeric symbol mapping table and replaces them with the corresponding standard-code digits and symbols. The symbols referred to here are the punctuation used in programming languages, such as add, subtract, multiply, divide, greater than, equal to, less than, and the various brackets.
It can be seen that the glyph numeric symbol mapping table is the key to numeric symbol mapping. This table is a personalized setting. Writing habits, stroke order and glyph shapes differ from person to person, so searching and matching are meaningful only within the glyphs of the same person. Thus, each programmer has their own glyph numeric symbol mapping table, which can only map handwritten source code written by that programmer. In a team software development environment, programmers need to authorize specific users/accounts and share their glyph numeric symbol mapping tables before their handwritten source code can be compiled/run by others. In effect, this extends the security of handwritten text into software development and operation.
Because handwritten glyphs are not fully reliable, the glyph numeric symbol mapping table may be a many-to-one mapping. That is, multiple glyphs may correspond to the same digit or symbol.
Because program source code must remain valid over the long term, the glyph numeric symbol mapping table of a particular user for a particular programming language should in principle only be appended to, without deletion or modification. Its entries must not conflict with each other; for example, the same glyph is not allowed to correspond to different digits or symbols.
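The constraints on this table (personalized, many-to-one, append-only, conflict-free) can be sketched as a small class. The name `GlyphSymbolTable` and the use of integer glyph codes are assumptions for illustration; in a real system the lookup would go through the glyph matching service rather than an exact key comparison.

```python
from typing import Dict


class GlyphSymbolTable:
    """Hypothetical per-user glyph -> digit/symbol table: append-only, many-to-one."""

    def __init__(self) -> None:
        self._symbol_of: Dict[int, str] = {}

    def add(self, glyph_code: int, symbol: str) -> None:
        """Add a mapping; re-adding the same pair is harmless, a conflict is an error."""
        existing = self._symbol_of.get(glyph_code)
        if existing is not None and existing != symbol:
            raise ValueError(
                f"glyph {glyph_code} already maps to {existing!r}, not {symbol!r}"
            )
        self._symbol_of[glyph_code] = symbol

    def lookup(self, glyph_code: int) -> str:
        return self._symbol_of[glyph_code]

    # Deliberately no remove/update methods: entries are never deleted or modified.
```

Two different glyph codes may both be added for the symbol "+" (many-to-one), whereas adding an existing glyph code with a different symbol raises an error.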
Unlike keywords and identifiers, the digit and symbol characters in standard codes are not made up of letters of the alphabet. Therefore, traditional front-end lexical scanning often handles symbol characters specially: a symbol can directly terminate the preceding lexical token, and an identifier often cannot start with a digit. Similarly, special conventions are needed for handwritten glyphs. For example, it may be agreed that digits and symbols can only correspond to individual glyphs and not to combinations of glyphs.
Because of this special role of symbols, the glyph numeric symbol mapping table is typically predefined by the user.
Regarding keyword mapping: like numeric symbol mapping, keyword mapping maps glyphs to standard codes based on a mapping table, here the glyph keyword mapping table. It is also a personalized many-to-one table.
Keywords are critical to the recognition and parsing of a programming language, since they determine the location and number of the related syntax elements. The contents of the glyph keyword mapping table are generally predefined by the user, but can also be defined interactively during handwritten source code conversion.
Unlike numeric symbol mapping, keyword mapping allows one keyword to correspond to a combination of multiple glyphs, that is, different combinations of the same glyph may correspond to different keywords.
With respect to interface identifier mapping: similarly, interface identifier mapping maps glyphs to standard codes. The key here is again a mapping table, the glyph identifier mapping table. Traditional high-level programming languages come with built-in or third-party libraries, and the corresponding identifiers are needed to access their system constants, system functions, standard library functions, class libraries and so on. These identifiers usually consist of standard-code words. The glyph identifier mapping table maps the user's handwritten glyphs to the corresponding identifiers. Furthermore, some of the symbols in the handwritten code may themselves be interfaces that are used and accessed by others, in which case corresponding standard-code identifiers must also be provided for them.
In the glyph keyword mapping table, the set of target keywords (including system punctuation marks) is an explicit, closed and finite set for a particular programming language. In the glyph identifier mapping table, by contrast, the set of target identifiers is an unlimited, open set, which grows as the user accesses more system/external interfaces and exposes more interfaces to others.
As with the glyph keyword mapping table, the content of the glyph identifier may be predefined by the user or may be interacted with during handwriting source code conversion.
In practice, we can put the common character string and code segment into this mapping table and use the proper font sequence to correspond to it. This increases programming efficiency and increases program legibility.
With respect to private identifier generation and mapping: a private identifier occurs in source code in two cases, one being a definition or declaration and the other a reference. For defined symbols, the system automatically generates standard code identifiers for the user-defined or declared private symbols (non-interface symbols) according to established rules. The standard code identifier does not need to have a specific literal meaning; only its uniqueness must be guaranteed, namely, different glyphs generate different standard code identifiers.
The transcoding of reference symbols is substantially similar to the mapping-table-based conversion above, except that the mapping table is automatically generated by the system. The content of the mapping table is the correspondence between the glyphs of the above-defined symbols and the corresponding standard code identifiers.
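The relationship between definition and reference transcoding can be illustrated with a minimal sketch. The class name, method names, and the counter-based naming rule are hypothetical; the text only requires that different glyphs yield different standard code identifiers.

```python
class PrivateIdentifierTable:
    """Hypothetical glyph private identifier mapping table."""

    def __init__(self, prefix="__id"):
        self._prefix = prefix
        self._by_glyphs = {}

    def define(self, glyph_codes):
        """Return the identifier of a defined symbol, generating it on first use."""
        key = tuple(glyph_codes)
        if key not in self._by_glyphs:
            # A running counter guarantees uniqueness (an assumed rule).
            self._by_glyphs[key] = f"{self._prefix}{len(self._by_glyphs)}"
        return self._by_glyphs[key]

    def reference(self, glyph_codes):
        """Look up the identifier of a previously defined symbol."""
        return self._by_glyphs[tuple(glyph_codes)]


table = PrivateIdentifierTable()
name = table.define(["12", "13"])       # definition generates the identifier
same = table.reference(["12", "13"])    # a reference reuses it
```

In this sketch a reference to an undefined glyph sequence raises `KeyError`; a real converter might instead ask the user interactively.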
In our handwritten text scheme, handwritten text codes and standard codes may be mixed in the same content, and we allow such content in handwriting programming as well. During source code conversion, the standard code is simply skipped without any conversion. To prevent interference between the standard code generated from the handwritten characters and the original standard code, a blank character must be inserted during conversion wherever a standard character and a non-control handwritten character are directly adjacent.
Most programming languages are based primarily on alphabetic natural languages (e.g., English). Thus, an identifier often corresponds to a word. One benefit of handwriting programming is that it need not be limited by such a natural language, as long as the glyphs are mapped to the target language by a mapping table. For example, we can use Chinese. Written Chinese does not delimit words, and in handwritten Chinese in particular, each character may be separated from the next by a certain spacing. Treating every single character as an identifier on the basis of this spacing would obviously be unreasonable. Therefore, a larger character spacing needs to be defined so that multiple characters can form one identifier.
The input, output and related processing of standard code character strings are inevitably needed in traditional programs, and the contents of such strings are more or less embedded in the corresponding code. A characteristic of handwritten text, however, is that no handwriting recognition is performed, so standard code strings are not generated in real time. Therefore, embedding standard code strings in the program code of handwritten text is indeed a problem. It can be solved or circumvented in the following ways:
1. the character string is placed in the glyph interface identifier mapping table and the corresponding glyphs are used in programming; the required character string is then obtained through the standard code conversion process;
2. the string is put into a resource file (many systems support this, and it is recommended in view of internationalization) and is loaded at run time by its corresponding ID, so that character strings need not be embedded in the program source code;
3. support for handwritten text operations is added to the program, so that the written program can directly support input and output based on handwritten text.
In the glyph number symbol mapping table, the ten digits 0-9 and the glyph corresponding to the decimal point can be directly defined. However, one problem with handwritten numbers is that the glyphs of certain digits are difficult to distinguish from other symbols or letters, which leads to deviations in the results of the word finding matching service. For example, the digit 1 is highly similar to the brackets ( and ), the uppercase letter I and the lowercase letter l; the digit 0 is hard to distinguish from the letter O in either case; and the digit 7 may be identical to the letter T. To address this problem, users need to deliberately distinguish the glyphs of digits from other symbols and letters when entering handwritten numbers. People commonly do this in daily life as well.
One advantage of handwritten text is that it is not constrained by the glyphs of standard encoded text, and a user may use any glyphs or symbols. Thus, in handwriting programming, we can use arbitrary glyphs or symbols as keywords or identifiers. In use, however, we need to pay attention to collisions between keywords and identifiers. If an identifier uses the same glyph as a keyword, the converted result will often contain a syntax error. By employing special glyphs or symbols for keywords, we can readily avoid this conflict.
Fig. 1I is a flowchart of a handwriting program source code conversion method in an embodiment of a method for processing handwriting input characters. Fig. 1J is a detailed flowchart of "standard transcoding B" in the handwriting source code conversion method shown in fig. 1I.
As shown in fig. 1I and 1J, the entire conversion process has five inputs: a handwriting program source file, a handwritten text library, a glyph number symbol mapping table, a glyph keyword mapping table and a glyph interface identifier mapping table. The conversion produces three outputs: a standard code target file, a source-target position mapping table, and a glyph private identifier mapping table. The glyph private identifier mapping table is needed only during the conversion process and need not be retained. The source-target position mapping table, however, is very important, because the compilation or interpreted execution that follows the conversion takes the generated standard code target file as input, and the system information it reports is given in terms of positions in that text file. With the source-target position mapping table, we can directly convert this information into the corresponding positions inside the handwriting source code file. This provides the basis for our entire handwriting programming environment and related auxiliary tools.
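As an illustration of how the source-target position mapping table might be used, the following sketch translates an offset reported against the standard code target file back to a position in the handwritten source. The table layout (a sorted list of pairs) and all names are assumptions made for this example, not a format given in the text.

```python
from bisect import bisect_right


def to_source_position(position_map, target_offset):
    """Map an offset in the standard code file to a handwritten glyph index.

    position_map: list of (target_start_offset, source_glyph_index) pairs,
    sorted by target_start_offset (an assumed layout).
    """
    starts = [start for start, _ in position_map]
    i = bisect_right(starts, target_offset) - 1
    if i < 0:
        raise ValueError("offset precedes the first mapped element")
    return position_map[i][1]


# Hypothetical table: target offsets 0, 6 and 15 begin at glyphs 0, 1 and 9.
position_map = [(0, 0), (6, 1), (15, 9)]
glyph_index = to_source_position(position_map, 7)
```

An error that a compiler reports at offset 7 of the target file would thus be shown at the second syntax element of the handwritten source.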
The detailed conversion process described above mainly outputs a standard code program text file. In an actual implementation, however, the conversion process may be integrated with an existing compiler front end, so that the step of writing the file is skipped and a standard code character stream is generated in memory for further processing. In addition, the conversion flow above assumes that the glyph interface identifier mapping table has been predefined. In fact, through deep integration with the compiler front end, an optimized conversion process can generate an intermediate file (with number symbol and keyword conversion completed) without a glyph interface identifier mapping table, and then process the handwritten identifiers intelligently based on the results of lexical, syntactic and semantic analysis. For example, the following processing rules may be employed: for a handwritten symbol in a symbol definition, its standard code identifier is generated automatically; for an undefined handwritten symbol, the user is queried interactively for its identifier definition, and the glyph interface identifier mapping table is generated automatically from the user input.
Further, by using such a deeply integrated compiler in the handwritten text editor, functions such as syntax highlighting and syntax-aware intelligent assistance can be realized, so that an integrated development environment based on handwritten text can finally be realized.
Fig. 1K is a schematic diagram of a handwriting program in an embodiment of the method for processing handwriting input characters according to the present invention. The programming language of the handwriting program in fig. 1K is Lua, an embeddable scripting language. The corresponding glyph library codes are shown in tables 1, 2 and 3.
TABLE 1
Figure SMS_1
TABLE 2
Figure SMS_2
TABLE 3
Figure SMS_3
Figure SMS_4
There are three types of codes in the above handwriting program: glyph codes, word spacing codes, and line feed codes. We denote a glyph code as W + (specific glyph code) and a word spacing code as S + (word spacing value). For convenience, a line feed code is not embedded in the content but is represented directly by a new line. Thus, the code corresponding to the above handwriting program can be expressed as follows:
S06 W01 S22 W02 S07 W03 S06 W04 S11 W05 S06 W06 S09 W07 S12 W08 S09 W09
S05 W10 S38 W11 S13 W12 S11 W13 S13 W14
S46 W15 S39 W16 S23 W17 S24 W18 S33 W19
S114 W20 S40 W21
S51 W22
S113 W23 S39 W24 S25 W25 S25 W26 S11 W27 S08 W28 S12 W29 S12 W30 S09 W31
S62 W32
S17 W33
S31 W34 S30 W35 S27 W36 S12 W37 S05 W38 S03 W39
S30 W40 S09 W41 S16 W42 S16 W43 S16 W44 S13 W45 S18 W46 S13 W47
To convert this code, the user prepares a glyph number symbol mapping table as shown in Table 4.
TABLE 4
Figure SMS_5
The glyph keyword mapping table is shown in Table 5.
TABLE 5
Figure SMS_6
The glyph interface identifier mapping table is shown in table 6.
TABLE 6
Figure SMS_7
Figure SMS_8
Here, the system sets the syntax interval threshold to 20. The private identifier auto-generation rule is two underscores (__) followed by the underscore-separated glyph code sequence.
Finally, following the flow described above, the following standard code program can be obtained:
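The role of the threshold can be sketched as follows. This is a hedged illustration: it assumes that a spacing code greater than the threshold starts a new syntax element, and it reads the generation rule as two underscores followed by the glyph codes joined by underscores. Both readings, and the function names, are assumptions; the tables themselves are only available as figures.

```python
import re

SYNTAX_INTERVAL_THRESHOLD = 20  # value stated in the text


def split_into_elements(line, threshold=SYNTAX_INTERVAL_THRESHOLD):
    """Group the glyph codes of one line into syntax elements.

    Assumption: a spacing code greater than the threshold separates elements.
    """
    elements, current = [], []
    for kind, value in re.findall(r"([SW])(\d+)", line):
        if kind == "S":
            if int(value) > threshold and current:
                elements.append(current)
                current = []
        else:
            current.append(value)
    if current:
        elements.append(current)
    return elements


def private_identifier(glyph_codes):
    """Assumed reading of the rule: '__' plus the codes joined by '_'."""
    return "__" + "_".join(glyph_codes)


first_line = "S06 W01 S22 W02 S07 W03 S06 W04 S11 W05 S06 W06 S09 W07 S12 W08 S09 W09"
elements = split_into_elements(first_line)
```

Under these assumptions the first line of the example splits into two elements: glyph 01 alone, and glyphs 02 to 09 together.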
Figure SMS_9
It can be seen that four private identifiers are generated, the generated private identifiers being shown in table 7.
TABLE 7
Figure SMS_10
The first identifier is actually comment content and has no significance. If we employ the optimized conversion process, its conversion can be omitted as soon as it is identified as comment content.
The generated program can be interpreted and executed normally by a conventional Lua interpreter, and its execution semantics are identical to those of the handwritten source code.
Further, in the present invention, based on fig. 1A, the method may further include:
when a storage request is received, acquiring metadata of the stored handwritten characters according to a preset metadata stripping protocol, and stripping the acquired metadata from the handwritten characters;
dividing the handwritten text into at least two data segments according to a preset data content splitting protocol.
Still further, the method may further comprise:
querying a coding warehouse, selecting or creating an encoding protocol according to at least a part of the metadata, and generating a meta code corresponding to the metadata according to the encoding protocol; encoding the handwritten characters according to the coding rules to obtain an instance code, and obtaining a character code corresponding to the handwritten characters from the meta code and the instance code;
or,
transmitting the handwritten characters and the metadata to the coding warehouse, so that the coding warehouse selects or creates an encoding protocol according to at least a part of the metadata and generates a meta code corresponding to the metadata according to the encoding protocol; encodes the handwritten characters according to the coding rules to obtain an instance code, and obtains a character code corresponding to the handwritten characters from the meta code and the instance code; and receiving the character code returned by the coding warehouse, wherein the character code is in a reference code form or a content code form.
It should be noted that, for the processing flow of data splitting, reference may be made to the embodiments of the data splitting method described later, and for the specific flow of encoding, to the embodiments of the encoding processing method described later; these are not repeated here.
Fig. 1L is a schematic structural diagram of an embodiment of a processing device for handwriting input characters according to the present invention. As shown in fig. 1L, the processing apparatus for handwriting input characters in the present embodiment may include:
the acquisition module 1001A is configured to acquire a stroke input by a user and corresponding input information in a first target row/column that is currently activated; wherein the input information includes an input position of the stroke in the first target row/column;
The attribution module 1002A is configured to, for each stroke, create a new character for the stroke or determine the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with the characters specified in the first target row/column.
The processing device for handwriting input characters in this embodiment may be used to execute the processing method embodiment for handwriting input characters shown in fig. 1A, and the specific implementation principle thereof may refer to the above embodiment, which is not described herein again.
The processing device for handwriting input characters provided by this embodiment acquires the strokes input by the user and the corresponding input information in the currently activated first target row/column and, according to the input position of each stroke in the first target row/column, or according to that input position together with the characters specified in the first target row/column, creates a new character for the stroke or determines the character to which the stroke belongs. Strokes are thus grouped into characters automatically, and the user does not need to separate characters by explicit or implicit commands that start or end the input of a single character. The user therefore need not pause after each character or interact with the system while writing, so writing is smooth and more efficient. In addition, the character to which a stroke belongs is determined directly from the input position of the stroke, without recognition as standard characters, so the personalized information, writing style and characteristics of the user's handwriting input are preserved.
On the basis of the technical solution provided in the foregoing embodiment, preferably, the acquisition module 1001A is further configured to:
acquiring size information of a handwriting input screen and row height/column width information;
dividing the handwriting input screen into at least one row/column according to the size information of the handwriting input screen and the row height/column width information, and determining the position range of each row/column;
wherein the row height/column width information is a default value or is determined by user input, and the position range of each row/column refers to the opposite top edge and bottom edge positions of each row in the handwriting input screen, or the opposite left side and right side positions of each column in the handwriting input screen;
receiving a target row/column selection message input by a user, wherein the target row/column selection message comprises a target row/column identifier to be input by the user;
and according to the target row/column selection message, taking the row/column corresponding to the target row/column identification to be input by the user as the first target row/column currently activated.
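The division into rows can be sketched as below. This is a minimal sketch for the row case only, assuming pixel units and rows stacked from the top of the screen; the function name and parameters are hypothetical.

```python
def divide_into_rows(screen_height, row_height):
    """Return the (top, bottom) position range of each row of the screen."""
    if row_height <= 0:
        raise ValueError("row_height must be positive")
    # At least one row is always produced, even on a very small screen.
    count = max(1, screen_height // row_height)
    return [
        (i * row_height, min((i + 1) * row_height, screen_height))
        for i in range(count)
    ]


rows = divide_into_rows(screen_height=100, row_height=30)
first_target_row = rows[1]  # row selected by a target row identifier of 1
```

Columns would be handled the same way with the screen width and the column width.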
Alternatively, the acquisition module 1001A is further configured to:
acquiring at least one character input by a user;
taking the row/column where the at least one character is located as the first target row/column which is activated currently;
Setting a position range of the currently activated first target row/column according to the character boundary of the at least one character;
wherein the position range refers to the opposite top edge and bottom edge positions of the first target row in the handwriting input screen, or the opposite left side and right side positions of the first target column in the handwriting input screen.
On the basis of the technical solution provided in the foregoing embodiment, preferably, the acquisition module 1001A is further configured to:
receiving a row/column break command input by a user;
and taking a second target row/column as the currently activated target row/column according to the row/column break command, wherein the second target row/column is the next row/column of the first target row/column.
Alternatively, the acquisition module 1001A is further configured to:
judging whether the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is smaller than a first preset threshold value or not;
if the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is smaller than the first preset threshold, taking a second target row/column as a currently activated target row/column so as to acquire the stroke input by a user in the second target row/column;
Wherein the second target row/column is the next row/column to the first target row/column.
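The automatic switch to the next row can be sketched as a single comparison. The sketch assumes horizontal rows written left to right, with the distance measured along the x axis; the names are hypothetical.

```python
def next_active_row(active_row, stroke_x, row_end_x, first_threshold):
    """Return the index of the row to activate after a stroke at stroke_x."""
    if row_end_x - stroke_x < first_threshold:
        return active_row + 1  # close to the row end: activate the next row
    return active_row


row = next_active_row(active_row=0, stroke_x=95, row_end_x=100, first_threshold=10)
```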
Alternatively, the acquisition module 1001A is further configured to:
judging whether the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is smaller than a first preset threshold value or not;
if the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is smaller than the first preset threshold value, the first target row/column and the second target row/column are simultaneously used as the currently activated target row/column;
acquiring at least one stroke subsequently input by the user in the first target row/column and/or the second target row/column, and, once a first stroke is acquired in the second target row/column, taking only the second target row/column as the currently activated target row/column;
wherein the second target row/column is the next row/column to the first target row/column.
When the first target row/column and the second target row/column are simultaneously used as the currently activated target rows/columns, each of them is activated only in a partial region;
the starting position of the activation region of the first target row/column is arranged between the ending position of the activation region of the second target row/column and the ending position of the activation region of the first target row/column.
On the basis of the technical solution provided in the foregoing embodiment, it is preferable that the attribution module 1002A is specifically configured to:
comparing the input position of the stroke in the first target row/column with position information corresponding to the character appointed in the first target row/column, and judging the relevance between the stroke and the character;
if the stroke is not associated with any character, creating a new character for the stroke, the stroke being attributed to the new character;
if the stroke is associated with at least one character, attributing the stroke according to the associated at least one character.
Wherein the specified characters are all characters already existing in the first target row/column;
or the designated character is a character in a region to be compared in the first target row/column, wherein the distance between the boundary position of the region to be compared and the stroke is smaller than a second preset threshold value.
Specifically, comparing the input position of the stroke in the first target row/column with the position information corresponding to the character specified in the first target row/column, and judging the relevance between the stroke and the character may include:
Comparing the input position of the stroke in the first target row/column with position information corresponding to a character appointed in the first target row/column, and judging whether the stroke is overlapped with at least one stroke in the character;
if the stroke is overlapped with at least one stroke in the character, judging that the stroke is related to the character;
and if the strokes are not overlapped with all strokes in the character, judging that the strokes are not associated with the character.
Or comparing the input position of the stroke in the first target row/column with the position information corresponding to the character designated in the first target row/column, and judging the relevance between the stroke and the character may include:
comparing the input position of the stroke in the first target row/column with the position information corresponding to the character for each character appointed in the first target row/column, and judging whether the distance between the stroke and the boundary of the character is smaller than a third preset threshold value or not;
if the distance between the stroke and the boundary of the character is smaller than the third preset threshold value, judging that the stroke is associated with the character;
and if the distance between the stroke and the boundary of the character is not smaller than the third preset threshold value, judging that the stroke is not associated with the character.
Or comparing the input position of the stroke in the first target row/column with the position information corresponding to the character designated in the first target row/column, and judging the relevance between the stroke and the character may include:
comparing the input position of the stroke in the first target row/column with the position information corresponding to each stroke in the character for each character designated in the first target row/column, acquiring the minimum interval value in the interval between the stroke and each stroke corresponding to the character, and judging whether the minimum interval value is smaller than a third preset threshold value or not;
if so, the stroke is associated with the character.
If not, the stroke is not associated with the character.
Wherein the attributing of the strokes according to the associated at least one character may comprise:
if there are at least two characters associated with the stroke, combining the at least two characters and attributing the stroke to the combined characters.
Alternatively, the attributing the strokes according to the associated at least one character may include:
acquiring a character with strongest stroke relevance from at least one associated character;
if there is exactly one character with the strongest association with the stroke, attributing the stroke to that character;
and if at least two characters with the strongest relevance to the strokes exist, merging at least two characters, and attributing the strokes to the merged characters.
Wherein the acquiring the character with the strongest association with the stroke from the associated at least one character comprises:
sorting the at least one character associated with the stroke in ascending order of the distance between the stroke and the boundary of each character, and taking the character with the minimum distance as the character with the strongest association with the stroke; or,
sorting the at least one character associated with the stroke in ascending order of the minimum spacing between the stroke and the strokes of each character, and taking the first character as the character with the strongest association with the stroke.
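The attribution rules above can be sketched as follows. The sketch makes several simplifying assumptions: strokes and characters are represented only by axis-aligned bounding boxes, the distance to a character boundary is the gap between boxes, and the strongest-association variant with merging on ties is used. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box of a stroke or a character."""
    left: float
    top: float
    right: float
    bottom: float


def gap(a, b):
    """Distance between two boxes; 0 when they overlap or touch."""
    dx = max(a.left - b.right, b.left - a.right, 0)
    dy = max(a.top - b.bottom, b.top - a.bottom, 0)
    return (dx * dx + dy * dy) ** 0.5


def union(a, b):
    return Box(min(a.left, b.left), min(a.top, b.top),
               max(a.right, b.right), max(a.bottom, b.bottom))


def attribute_stroke(stroke, characters, threshold):
    """Attribute a stroke to a character box in `characters` (edited in place).

    Returns the index of the character that now contains the stroke.
    """
    distances = [(gap(stroke, c), i) for i, c in enumerate(characters)]
    associated = [(g, i) for g, i in distances if g < threshold]
    if not associated:
        characters.append(stroke)          # no association: new character
        return len(characters) - 1
    best = min(g for g, _ in associated)
    strongest = [i for g, i in associated if g == best]
    merged = stroke
    for i in strongest:
        merged = union(merged, characters[i])
    keep = strongest[0]
    characters[keep] = merged
    for i in reversed(strongest[1:]):      # tie: merge the tied characters
        del characters[i]
    return keep
```

A full implementation would keep the strokes of each character rather than only its bounding box, so that the per-stroke minimum spacing variant could also be computed.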
On the basis of the technical solution provided in the foregoing embodiment, preferably, the acquisition module 1001A is further configured to:
Dividing the first target row/column into a plurality of composition grids before acquiring strokes input by a user and corresponding input information;
accordingly, the attribution module 1002A may be specifically configured to:
determining a composition grid where the stroke is located according to the input position of the stroke in the first target row/column;
judging whether characters exist in the composition grid or not;
if so, attributing the stroke to the existing character in the composition grid; otherwise, creating a new character in the composition grid and attributing the stroke to the new character.
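A minimal sketch of grid-based attribution follows, assuming equal-width cells along a horizontal row and a stroke located by a single x coordinate. The dictionary layout and names are hypothetical.

```python
def attribute_by_grid(stroke_x, row_start_x, cell_width, cells):
    """Attribute a stroke to the character of the grid cell it falls in.

    cells: dict mapping a cell index to the list of strokes of its character.
    Returns (cell_index, created_new_character).
    """
    index = int((stroke_x - row_start_x) // cell_width)
    created = index not in cells
    cells.setdefault(index, []).append(stroke_x)
    return index, created


cells = {}
first = attribute_by_grid(12, row_start_x=0, cell_width=10, cells=cells)
second = attribute_by_grid(18, row_start_x=0, cell_width=10, cells=cells)
```

Here the first stroke creates a new character in cell 1 and the second stroke joins it.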
On the basis of the technical solution provided in the foregoing embodiment, preferably, the acquisition module 1001A is further configured to:
receiving a search command input by a user, wherein the search command comprises characters to be searched input by the user;
and respectively comparing the character to be searched with the locally stored characters according to the stroke quantity and the stroke characteristics of the character to be searched, and obtaining the characters matched with the character to be searched.
On the basis of the technical solution provided in the foregoing embodiment, preferably, the acquisition module 1001A is further configured to:
storing, at preset time intervals, the new characters created for the acquired strokes or the characters to which they are attributed;
or,
when the currently activated target row/column on a page is switched from one target row/column to another target row/column on the same page, storing the new characters created for, or the characters attributed to, the strokes acquired in the one target row/column;
or,
when the current page is switched from one page to another page, storing the new characters created for, or the characters attributed to, the strokes acquired on the one page.
On the basis of the technical solution provided in the foregoing embodiment, preferably, the acquisition module 1001A is further configured to:
storing strokes input by the user and corresponding input information in a first memory;
storing the saved characters in a second memory, wherein for each saved character, the characters comprise strokes constituting the character and indexes corresponding to the strokes;
and the index corresponding to the stroke points to the input information corresponding to the stroke in the first memory.
The input information corresponding to the strokes also comprises one or a combination of the following: the input time of the stroke, the input force of the stroke and the input speed of the stroke.
The input time comprises the pen falling time and the pen lifting time of the strokes and the stay time of each point in the handwriting of the strokes;
the input location includes at least: the position when the pen is dropped, the position when the pen is lifted, and the coordinate position of each point in the handwriting of the stroke.
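The two-memory arrangement can be sketched as below: raw stroke input is kept once in a first store, and saved characters hold only indexes into it. The class and field names are hypothetical, and in-memory lists stand in for the first and second memories.

```python
from dataclasses import dataclass


@dataclass
class StrokeInput:
    """Input information of one stroke."""
    points: list            # (x, y) of each point, pen-down first, pen-up last
    pen_down_time: float
    pen_up_time: float
    pressure: float = 0.0   # optional input force


class StrokeMemory:
    """First memory: strokes and their input information."""

    def __init__(self):
        self._inputs = []

    def add(self, stroke_input):
        self._inputs.append(stroke_input)
        return len(self._inputs) - 1   # index used by the second memory

    def get(self, index):
        return self._inputs[index]


class CharacterMemory:
    """Second memory: each character is a list of stroke indexes."""

    def __init__(self, stroke_memory):
        self._stroke_memory = stroke_memory
        self._characters = []

    def save(self, stroke_indexes):
        self._characters.append(list(stroke_indexes))
        return len(self._characters) - 1

    def strokes_of(self, character_id):
        return [self._stroke_memory.get(i) for i in self._characters[character_id]]
```

Because characters store indexes only, merging or splitting characters rearranges index lists without copying the stroke data.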
On the basis of the technical solution provided in the foregoing embodiment, preferably, the acquisition module 1001A is further configured to:
respectively acquiring and displaying the boundary of each character stored locally;
receiving a correction request input by a user, wherein the correction request comprises characters to be corrected, or characters to be corrected and strokes to be corrected;
and carrying out corresponding correction processing on the character to be corrected according to the correction request.
The correction request is a combination correction request, and the characters to be corrected are at least two characters to be combined;
correspondingly, the correcting process is carried out on the character to be corrected according to the correcting request, and the correcting process comprises the following steps:
and combining the at least two characters to be combined into one character.
Or the correction request is a splitting correction request, and the character to be corrected is one character to be split;
Correspondingly, the correcting process is carried out on the character to be corrected according to the correcting request, and the correcting process comprises the following steps:
and splitting the character to be split into at least two characters.
Or the correction request is an attribution correction request, the character to be corrected is one character to which strokes are to be attributed, and the stroke to be corrected is at least one stroke to be corrected;
correspondingly, the correcting process is carried out on the character to be corrected according to the correcting request, and the correcting process comprises the following steps:
and attributing the at least one stroke to be corrected to the character to be attributed.
On the basis of the technical solution provided in the foregoing embodiment, preferably, the acquisition module 1001A is further configured to:
receiving an insertion request input by a user, wherein the insertion request comprises a target row/column to be inserted, a position to be inserted in the target row/column to be inserted and a character to be inserted;
activating the target row/column to be inserted, and inserting the character to be inserted into the position to be inserted;
and correspondingly adjusting the characters after the position to be inserted.
On the basis of the technical solution provided in the foregoing embodiment, preferably, the acquisition module 1001A is further configured to:
Acquiring at least one character selected by the user;
receiving a selection processing command input by a user, and processing the at least one character according to the selection processing command;
wherein the selection processing command comprises any one or a combination of the following: copying the at least one character, cutting the at least one character, replacing the at least one character, and merging the at least one character.
On the basis of the technical solution provided in the foregoing embodiment, it is preferable that the number of the first target rows/columns is plural;
the activation areas corresponding to the first target rows/columns are all non-overlapping and are not in contact with each other.
On the basis of the technical solution provided in the foregoing embodiment, preferably, the acquisition module 1001A is further configured to:
receiving a mode switching request input by a user, wherein the mode switching request comprises a target mode;
and switching the handwriting mode to the target mode, and receiving at least one standard character input by a user in the target mode.
On the basis of the technical solution provided in the foregoing embodiment, preferably, the acquisition module 1001A is further configured to:
receiving a coding request, and determining the glyph corresponding to a handwritten character in a handwriting input program according to the coding request;
and querying a mapping table in the coding warehouse to obtain the standard language parameter corresponding to the glyph.
Wherein the standard language parameters include one or a combination of the following: numbers, symbols, keywords, public identifiers, and private identifiers.
The data splitting and data merging will be described in detail below.
The data splitting of the present invention is a solution that can effectively solve the above-mentioned problems. Fig. 2A is a flowchart illustrating a data splitting method according to an exemplary embodiment, and as shown in fig. 2A, the present invention provides a data splitting method, including:
step 101B, when a storage request carrying a data identifier to be stored is received, acquiring metadata in a data object corresponding to the data identifier to be stored according to a preset metadata stripping protocol.
Step 102B, stripping the acquired metadata from the data object.
Step 103B, dividing the data content into at least two data segments according to a preset data content splitting protocol.
Optionally, the method may further include:
Step 104B, storing the metadata and the data fragments into different memory banks or different secure channels respectively.
According to the data splitting method, when a storage request carrying a data identifier to be stored is received, metadata in a data object corresponding to the data identifier to be stored is obtained according to a preset metadata stripping protocol, and the metadata is stripped from the data object; dividing the data content into a plurality of data fragments according to a preset data content splitting protocol; and storing the metadata and each data fragment into different storage bodies or different secure channels respectively. Therefore, the difficulty of illegally acquiring the original data of the user is increased, and the safety of data storage is more reliably realized.
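For illustration only, steps 101B to 104B can be sketched in Python as follows. The function name `split_data_object`, the fixed segment size, and the dictionary standing in for the separate storage bodies are all assumptions made for this sketch; the method itself leaves the stripping protocol and splitting protocol open.

```python
import os


def split_data_object(file_name: str, content: bytes, segment_size: int = 4) -> dict:
    """Sketch of steps 101B-104B: strip metadata, split content, store separately."""
    # Steps 101B/102B: treat the file name, file type and size as the stripped metadata.
    stem, extension = os.path.splitext(file_name)
    metadata = {"name": stem, "type": extension, "size": len(content)}

    # Step 103B: divide the data content into fixed-size segments (assumed rule).
    segments = [content[i:i + segment_size] for i in range(0, len(content), segment_size)]

    # Step 104B: each dictionary key stands in for a different storage body or secure channel.
    stores = {"metadata_store": metadata}
    for index, segment in enumerate(segments):
        stores[f"segment_store_{index}"] = segment
    return stores


stores = split_data_object("report.txt", b"ACBDACBD")
```

In this sketch no single store holds both the file type and the complete content, which is the property the method relies on.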
Fig. 2B-1 is a flowchart illustrating a data splitting method according to another exemplary embodiment, and as shown in fig. 2B-1, the present invention provides a data splitting method, including:
step 201B, a storage request carrying a data identifier to be stored is received.
The data splitting method can be applied to devices such as a terminal (client device) or a network end (server device). The storage request carrying the data identifier to be stored may be triggered by a terminal application program, such as a mail system or a desktop agent. For example, when the mail system sends file data, it receives the storage request carrying the data identifier to be stored, and the data splitting device of the mail system splits the file data in advance, so that the recipient of the mail needs to acquire the file data fragments from each designated storage body to obtain the complete file data. Alternatively, the storage request may be triggered by a user: if the user wants to split a certain file before storing it, the data splitting device receives the storage request carrying the data identifier to be stored and then splits the file. The data identifier to be stored may be the name of the file data, a code (such as the MD5 code of the file, MD5 being version 5 of the Message Digest Algorithm), or other identifying information.
Step 202B, if the metadata agreed in the preset metadata stripping protocol includes attribute information, determining the attribute information content in the data object corresponding to the data identifier to be stored that matches the attribute information as metadata.
The process of stripping metadata is to strip the metadata of the data object, especially the key metadata, from the original position of the data object, so as to achieve the purpose that the original data object cannot be accessed, identified, correctly read out or used only through the data content and/or other remained metadata information. Where the key metadata are security-related metadata, the system will not be able to read, identify, decode or restore the corresponding data object normally once the key metadata are missing.
For example, for data in the form of files in Windows systems, the file type is a key metadata. When we remove the file type information (or the file extension in the Windows system), the system cannot normally open the file content. Storing the type information and the file content data of the file in different cloud storage respectively can cause a certain difficulty for a malicious attacker or a service provider to acquire complete data. Different types of data have different key metadata, for example, for table data (e.g., a spreadsheet or database table, etc.), the header (field name) is a key metadata. In practice, metadata may also cover a wider range, and any information related to the data content may be stripped off from the data content itself as metadata, as long as the security of the data is advantageous. Wherein the metadata includes: attribute information; attribute information is information that can identify some unique property of the data object, and is composed of descriptive information to help find and open the data object. The attributes are not contained in the actual content of the data object (data content), but rather provide information about the data object. Numerous information such as the size of the data object, the data type, the date of creation modification, the author, and the rating may be included. Since the attribute information can be set by one skilled in the art according to the nature of the data object, the content contained in the attribute information is only an example and is not a limitation on the content of the attribute information.
Or, if the metadata agreed in the preset metadata stripping protocol includes a data content identifier and a keyword, determining, according to the data content identifier, the data content matched with the keyword from the data content in the data object as metadata.
The data content identification is used for prompting that the extraction position of the metadata comes from a data content part, and the keyword is used for indicating the specific data content to be extracted; the data content matching the keywords may be key information or sensitive information contained in the data. For example: in the bank statement, several keywords associated with the account information may be set so that sensitive information in the account is extracted as metadata for storage. For example: account numbers, user identification cards, user telephones, addresses, etc.
Or, if the metadata agreed in the preset metadata stripping protocol includes: attribute information, a data content identifier and a keyword, determining the attribute information content matched with the attribute information in the data object as metadata, and determining the data content matched with the keyword from the data content in the data object as metadata according to the data content identifier.
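The three cases above can be sketched as a single routine. This is a minimal sketch under stated assumptions: the protocol is modelled as a dictionary with the hypothetical keys `attribute_info`, `data_content_identifier` and `keywords`, the data object is modelled as a dictionary of attributes plus keyed content fields, and the bank statement values are placeholders.

```python
def strip_metadata(data_object: dict, protocol: dict):
    """Return (metadata, remaining_object) according to a stripping protocol."""
    metadata = {}
    attributes = dict(data_object["attributes"])  # copies, so the input is not modified
    content = dict(data_object["content"])

    # Attribute information named by the protocol becomes metadata.
    for name in protocol.get("attribute_info", []):
        if name in attributes:
            metadata[name] = attributes.pop(name)

    # The data content identifier signals that metadata also comes from the content part;
    # the keywords indicate which content fields to extract.
    if protocol.get("data_content_identifier"):
        for keyword in protocol.get("keywords", []):
            if keyword in content:
                metadata[keyword] = content.pop(keyword)

    return metadata, {"attributes": attributes, "content": content}


# Hypothetical bank statement and protocol, for illustration only.
statement = {
    "attributes": {"file_type": ".csv", "author": "bank"},
    "content": {"account_number": "0000-0000", "balance": "100.00"},
}
protocol = {
    "attribute_info": ["file_type"],
    "data_content_identifier": True,
    "keywords": ["account_number"],
}
metadata, remaining = strip_metadata(statement, protocol)
```

Leaving out `data_content_identifier` gives the attribute-only case, and leaving out `attribute_info` gives the keyword-only case.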
The strategy for generating the preset metadata stripping protocol can be determined by a developer, and the user can be allowed to define the protocol applicable to the user, so that the system needs to present the metadata to the user as comprehensively as possible, and the user can preset the most appropriate metadata stripping protocol according to the information. The preset metadata stripping specification is built into the data splitting system, as in the previous mail client example, and may be built into the application of the mail system. Of course, the preset metadata stripping protocol can also be stored as a part of metadata content along with metadata, so that when the receiver performs data merging, the receiver performs merging of data objects by referring to the preset metadata stripping protocol.
Further, by way of example of mail client, splitting an attachment file (data object) to be sent, metadata of the attachment file may be: such as file name, file type, file size, creation time, etc. The result of file metadata stripping is stored in the file meta-information system, and the method of file content segmentation and the segmented result information, such as hash value or ID of the file segment, storage location of the file segment, etc., are also stored in the file meta-information system and associated with the corresponding file meta-data. In fact, all of the content stored in the file meta-information system, as mentioned above, constitutes in its entirety this splitting/stripping convention instance.
Step 203B, stripping the obtained metadata from the data object.
Stripping may also be referred to as splitting, meaning that metadata is selected from the data objects that is relevant to implementing the splitting/stripping process of the data objects. The system will separate the metadata from the data objects according to a preset metadata stripping specification (which may be system default or user-selected or user-defined). The protocol has information such as rules, constraints, and methods relating to metadata splitting/stripping processing recorded therein. Such as, but not limited to: stripping location information for metadata, stripping methods for metadata, coding schemes, stripping code related information, content splitting rules, and other content splitting related data and/or information. Wherein the metadata may be a full set or subset of metadata for the data object. For specific information about the type of metadata, reference is made to the various cases in step 202B above.
Various methods for splitting data are available, for example, the data object is split directly into a plurality of fragments according to a predetermined rule, and the fragments are stored separately. However, this method cannot achieve a finer granularity encryption means, nor can it strip important information (metadata) closely related to the data object from the data content itself. The invention adopts a brand new data splitting method to realize the splitting of the data objects. This approach not only breaks the data object into finer granularity (e.g., in characters or even bits), but also strips important information (i.e., metadata) that is closely related to the data object from the data content itself. The stripped metadata, data content and/or the later mentioned codes can be stored separately in different storage positions or spaces or under different security channels, so that the security of data storage is realized more reliably.
Step 204B, dividing the data content into at least two data segments according to a preset data content splitting protocol.
Content splitting refers to dividing the data content in a data object into a plurality of (more than one) segments according to a certain rule. The visual metaphor is like tearing a sheet of paper into multiple pieces. However, the content splitting is not necessary, and the application with low security requirement on the content may not split the content according to the actual requirement. The content splitting method can divide data into a plurality of blocks by adopting a RAID disk array technology and write the blocks into a plurality of disks in parallel so as to improve the read-write speed and throughput of the disks.
The content splitting may be divided into a domain-related content splitting and a domain-independent content splitting. The splitting of the domain related content is mainly to split the data according to the characteristics of the data in the specific domain. Such as structural splitting for a particular file format, or splitting key information or sensitive information within the data. The latter may have some overlap with metadata stripping (when metadata is within the data). For example: the statement of bank can strip account information out as metadata, and can split account information out as data fragments for split storage.
Further, the preset data content splitting protocol may include at least one of a disk array (RAID) splitting algorithm and an information dispersal algorithm (IDA). The algorithm researcher Michael O. Rabin first proposed the information dispersal algorithm in 1989. It fragments data at the bit level, so that the data is not identifiable when transmitted over the network or stored in an array, and only users or devices with the correct key can access it. The information is reassembled when accessed using the correct key. In the field of distributed storage, the information dispersal algorithm and related derivative algorithms have been widely used.
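As a simplified illustration of a RAID-style splitting protocol, the sketch below distributes fixed-size blocks round-robin over several stores and reassembles them. It is only a striping example in the manner of RAID 0; it is not Rabin's information dispersal algorithm, which additionally encodes the data with redundancy. The function names `stripe` and `unstripe` are assumptions.

```python
def stripe(data: bytes, disk_count: int, block_size: int) -> list:
    """Distribute consecutive blocks of data round-robin over disk_count stores."""
    disks = [[] for _ in range(disk_count)]
    for block_index, start in enumerate(range(0, len(data), block_size)):
        disks[block_index % disk_count].append(data[start:start + block_size])
    return disks


def unstripe(disks: list) -> bytes:
    """Reassemble the original data by reading the stores row by row."""
    blocks = []
    longest = max(len(disk) for disk in disks)
    for row in range(longest):
        for disk in disks:
            if row < len(disk):
                blocks.append(disk[row])
    return b"".join(blocks)


disks = stripe(b"ACBDACBDAC", disk_count=3, block_size=2)
restored = unstripe(disks)
```

Each store alone holds only scattered blocks, and the block order needed for reassembly is implicit in the store index and row index.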
Step 205B, respectively performing coding processing on each data segment according to a preset coding separation protocol to obtain a code corresponding to each data segment.
In this embodiment, optionally, the encoding processing is performed on each data segment according to a preset encoding separation protocol to obtain a code corresponding to each data segment, which includes:
inquiring a coding warehouse according to a preset coding separation protocol, selecting or creating a coding protocol according to at least one part of the metadata, and generating a metadata code corresponding to the metadata according to the coding protocol; according to the coding protocol, coding each data segment to obtain an instance code corresponding to each data segment;
or, alternatively,
according to a preset coding separation protocol, each data fragment and the metadata are sent to the coding warehouse, so that the coding warehouse can select or create a coding protocol according to at least one part of the metadata, and a metadata code corresponding to the metadata is generated according to the coding protocol; coding each data fragment according to the coding protocol to obtain an instance code; and receiving the meta-code and instance code returned by the code repository.
It should be noted that, the specific flow of the encoding process may refer to the specific description of the embodiment part of the encoding processing method described later, and will not be repeated here.
Step 206B, arranging each code according to the original sequence of each data segment in the data content, so as to obtain the arrangement sequence information of the codes.
As described above, the data splitting method of the present invention covers two different data processing means, namely, stripping of metadata and encoding and splitting of data content. The stripping of metadata is described above, where code stripping refers to that after splitting the data content into n pieces of data, the n pieces of data are concentrated or stored separately, and the corresponding n codes (numbers) are obtained, which may be repeated, and the codes (numbers) are arranged in the order in which the data pieces appear. The code (number) sequence contains both code information and code arrangement order information, and the code result can be stored in another secure channel. The code is different from the previous data fragment in nature, and splitting it out can be called stripping. Meanwhile, in most cases, only the data content part of the data object needs to be split, and the stripped metadata part and/or the stripped coded part does not need to be split again, but if necessary, the stripped metadata part and/or the stripped coded part can also be further split, so as to achieve a protection effect with finer granularity. The stripping and splitting can be combined infinitely, and depend on the system requirements and processing capacity.
In most cases, code stripping is based on content splitting: part or all of the data content is split according to a certain rule, and the addressing mode of each piece of split data is encoded. The final encoding results form separate data. In the computer field, reference encoding of data is ubiquitous, for example keys in a database that address data records; shortened URLs (http://dwz.cn/mzot4) that make a website address easier to enter and reference; and access identifiers used in cloud storage programming interfaces (APIs). All of these coding modes can be used for the codes described above. If a code is the result of splitting part of the data content, the code replaces the corresponding original data. Sometimes, however, the encoding is not based on content splitting. For example, for data with a low security level it is unnecessary to split the data content; in that case it is sufficient, if necessary, to give the entire data content one code, although it may still be necessary to separate that code from the data content. It can be seen that the code stripping of this embodiment differs both from conventional content splitting and from existing data reference coding, and is a combination of the two. Separating the encoding results (including the codes themselves and their order of combination) from the data content reduces the security risk of the data to some extent. For example, consider the 6 bytes of data ACBDAC, split into two-byte segments that are placed in the database. AC returns code 1 and BD returns code 2. The encoding result of this data is the sequence 1 2 1, not just 1 and 2. Here the numerals 1 and 2 are the codes, and the arrangement 1, 2, 1 is the code arrangement order information.
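The ACBDAC example can be reproduced with a short sketch. The dictionary `repository` stands in for the database or coding warehouse, and the function names and the rule of numbering codes in order of first appearance are assumptions made for illustration.

```python
def encode_segments(content: bytes, segment_size: int = 2):
    """Split content into segments and return (repository, code sequence)."""
    repository = {}  # segment -> code; stands in for the coding warehouse
    sequence = []    # the code arrangement order information
    for start in range(0, len(content), segment_size):
        segment = content[start:start + segment_size]
        if segment not in repository:
            repository[segment] = len(repository) + 1  # a repeated segment reuses its code
        sequence.append(repository[segment])
    return repository, sequence


def decode_sequence(repository: dict, sequence: list) -> bytes:
    """Restore the content; this needs both the repository and the code sequence."""
    by_code = {code: segment for segment, code in repository.items()}
    return b"".join(by_code[code] for code in sequence)


repository, sequence = encode_segments(b"ACBDAC")
# repository == {b"AC": 1, b"BD": 2}; sequence == [1, 2, 1]
restored = decode_sequence(repository, sequence)
```

Holding only the repository (AC and BD) or only the sequence (1 2 1) is not enough to restore ACBDAC, which is why the two are stored in different secure channels.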
In practical applications, the above-mentioned methods of stripping/splitting metadata, encoding and data content are not mutually exclusive, and they may be used in combination. For example, but not limited to, as previously described, metadata may be merely split from data content; it is also possible to separate only the encoded part from the data content part; the encoded parts can also be regarded as a kind of special metadata together with other metadata, as long as they are separated from the data content parts; more preferably, the three parts (metadata, encoded part, data content) are separated according to respective separation specifications.
In addition, the order of steps 202B to 206B, that is, of content splitting, metadata stripping, and code stripping, is not limited: they may be performed separately, interleaved with one another, or performed simultaneously. Typically, however, the encoding operations of the present invention need to be performed during or after the content splitting process, and when the content splitting process is not required, the encoding operation need not be performed either. Metadata stripping can be done before content splitting, or after content splitting and code distribution are completed. Other data processing methods, such as data compression and encryption, may also be mixed in before and after each splitting step, that is, between step 202B and step 206B. Descriptions of compression and encryption may also be added to the various specifications described above, but it is preferable to perform the metadata splitting step after the compression and/or encryption has been performed.
Step 207B, storing metadata, codes corresponding to the data segments, and arrangement order information of the codes into different memory banks or different secure channels, respectively.
Further, if the metadata specified in the preset metadata stripping specification includes a data object identification, the step of obtaining metadata in the data object corresponding to the data identifier to be stored according to the preset metadata stripping protocol comprises: parsing the data object to generate a data object identification that uniquely corresponds to the data object.
Further, when the data object is audio data, the dividing of the data content into at least two data segments according to the preset data content splitting protocol in step 204B may include: splitting the audio data by a time domain analysis method or a frequency domain analysis method to obtain audio data objects to be encoded, wherein the audio data objects to be encoded comprise sound wave segments and/or silence segments.
Specifically, speech is an earlier and more natural form of expression than text. However, in the world of computers and the internet, which is increasingly bound up with human production and life, voice data and its processing have always been second-class citizens. The main reasons are the current input, storage and processing modes of voice data and the corresponding technical limitations. People now process and use voice input through computers and networks in mainly two ways: voice call and voice recognition.
Voice communication mainly refers to converting the voice signal produced by a person into a digital signal through a computer sound capture device, then processing, transmitting and storing the digital signal through a computer and a computer network or communication network (packet-switched voice technology such as VoLTE is mainly meant here; circuit-switched voice technology is irrelevant to the problems discussed), and finally playing back the digital signal through a digital audio playback device. The voice call may be real-time or non-real-time, and either unidirectional or bidirectional. The most important problem of current voice calls is that the data volume is large, so the data is not easy to transmit and store. The common audio sampling rates of current sound cards are mainly 11 kHz, 22 kHz and 44.1 kHz. Sound sampled at 11 kHz is called telephone quality (telephones use an 8 kHz sampling rate) and basically allows people to recognize the caller's voice; 22 kHz is called broadcast quality; 44.1 kHz is CD quality. The higher the sampling rate, the better the quality of the audio data and the more storage it occupies. Another sampling parameter is the sampling resolution, which refers to the size of the data used for one sample of the sound signal (typically the amplitude of the sound wave). Two resolutions are common, 8 bits and 16 bits: 8 bits divides the sound signal into 256 levels and 16 bits divides it into 65,536 levels. It can be calculated that 1 second of an 8-bit stereo (left and right channel) audio signal sampled at 11 kHz occupies 22 KB, which corresponds to roughly ten thousand Chinese characters at two bytes per character. In the most commonly used bidirectional, real-time voice call applications, users rarely record and save call data. 
The main reason is that the audio data occupies a large amount of storage and cannot be searched or queried. Some applications can retain the results of a one-way call, but they generally limit the size of the retained data. For example, the "push-to-talk" function of WeChat has a limit of 1 minute, whereas WeChat text messages have no such limit and even very long texts can be sent without a problem; similarly, Skype has a voice message function whose message duration is limited to at most 10 minutes. At present, most common voice data consists of digital audio items such as commentaries, speeches, lectures and audio electronic books. They are typically stored in audio files (e.g., MP3, WMA, MOV and other formats) or accessed in real time via network streaming protocols (e.g., RTSP, MMS, RTP, RSVP, etc.). People generally learn about such audio data through metadata outside the audio data itself (such as the ID3v1 and ID3v2 information in MP3). As for the content of audio data that is being listened to for the first time, unless auxiliary text positioning information (such as a subtitle file) exists, it cannot be randomly searched and positioned, and can only be listened to sequentially.
Speech recognition: as is well known, text data is the first-class citizen of current computer systems. Text data is standardized and easy to store, view, search, retrieve and process. Speech recognition, which converts speech input into text data, can therefore make more effective use of the input data. However, there are two problems here: one is that information is lost, and the other is the recognition rate. Natural human speech contains information beyond the corresponding text content. At present, after speech is recognized and converted into standard text content, the original speech data is generally not retained, so this information is actually lost. Such information mainly includes speech, intonation, mood, timbre, pauses and the like, which may carry emotion and other implied meaning. The recognition rate problem has always been a major obstacle to speech recognition becoming the preferred way for people to give input to computers. For speech recognition that has been trained for a specific person, the recognition rate is quite high and can exceed 90%. Digital voice assistant applications such as Apple's Siri, Amazon's Echo, Microsoft's Cortana and Google Now have therefore grown particularly fast in recent years, and part of the population has been able to replace traditional search engines with digital voice assistants. However, we also see that problems of language and accent keep many people away from these applications. Voice training and voice recognition have a chicken-and-egg relationship: for lack of voice training data, the recognition rate of voice recognition for a specific population is not very high; in turn, because of the low recognition rate, that population has little enthusiasm for using speech recognition, so the system lacks sufficient sample data to analyze and optimize. 
In addition, speech recognition for the purpose of text entry has difficulty in recognition of punctuation marks and text control, and the input efficiency is affected.
In summary, we have seen that the data of a voice call maintains the original voice information, but the data size is large and is unfavorable for automatic analysis and processing by a computer. Although the voice recognition can generate text data, so that the transmission, storage and analysis processing of a computer are facilitated, some original voice information is lost in the process; in addition, the accuracy and reliability of the current voice recognition are not guaranteed, and an effective method for acquiring voice sample data of most people is not available to improve the recognition rate.
The present embodiment proposes a compromise method for processing the original voice data, so that the original voice data is retained while text data is generated that is convenient for a computer to transmit, store and analyze. The key here is that this text data is not a standard character code but a proprietary code for a specific person. The voice data corresponding to the codes is stored in a specific character code warehouse, and the voice data in the code warehouse is distinguished and coded according to different users. The user can set access rights to his voice data for different users. As shown in fig. 2B-2, the system is largely divided into two parts: the code repository and the related services surrounding the data. The voice input process is as follows:
1. the user logs in to the code warehouse and selects a voice text input system;
2. the voice text input system registers a series of encoders with the code warehouse according to the current user;
3. the user inputs continuous voice to the voice text input system;
4. the voice text input system stores the input of the user in an input cache;
5. the voice text input system divides the voice data in the input cache according to a certain rule to form different data objects;
6. the voice text input system submits the data to the code warehouse through the corresponding encoder and obtains the corresponding codes;
7. the voice text input system stores the obtained codes in the text input result and clears the corresponding input cache content;
8. steps 3 to 7 are repeated, and the voice text input system continuously obtains user input and the corresponding codes;
9. when the user stops inputting and there is no data left in the input cache, the whole voice input process is completed.
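The input process above might be sketched as the following loop. Everything named here is hypothetical: the `CodeRepository` class, its `submit` and `fetch` methods, the `V001`-style numbering, and in particular the stand-in segmentation rule `split_on_pause`, which simply treats a zero byte as a pause marker instead of performing real endpoint detection.

```python
class CodeRepository:
    """Hypothetical per-user store that maps codes to voice data objects."""

    def __init__(self):
        self._voice_data = {}

    def submit(self, user: str, data_object: bytes) -> str:
        count = sum(1 for owner, _ in self._voice_data if owner == user)
        code = f"V{count + 1:03d}"
        self._voice_data[(user, code)] = data_object
        return code

    def fetch(self, user: str, code: str) -> bytes:
        return self._voice_data[(user, code)]


def split_on_pause(cache: bytes):
    """Stand-in rule for step 5: completed objects end at a zero byte; the rest stays cached."""
    *complete, rest = cache.split(b"\x00")
    return [part for part in complete if part], rest


def voice_text_input(user, chunks, repository, segment=split_on_pause):
    result = []          # the text input result: a list of codes
    input_cache = b""
    for chunk in chunks:                                   # step 3: continuous input
        input_cache += chunk                               # step 4: cache the input
        data_objects, input_cache = segment(input_cache)   # step 5; consumed cache is dropped (step 7)
        for data_object in data_objects:                   # step 6: submit and obtain codes
            result.append(repository.submit(user, data_object))
    if input_cache:                                        # step 9: flush what remains
        result.append(repository.submit(user, input_cache))
    return result


repository = CodeRepository()
codes = voice_text_input("alice", [b"ni\x00ha", b"o\x00", b"ma"], repository)
```

Only the codes are kept in the input result; the voice data itself stays in the repository under the user's name.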
It can be seen that segmenting the voice data in the input cache is the key step here. In practice, this is a well-established voice data processing technique called "endpoint detection" or "voice detection". There are two common methods: time domain analysis and frequency domain analysis. Time domain analysis is used as the example here. Fig. 2B-3 is a time domain analysis graph of a piece of audio data, in which silence is defined as an amplitude below a certain threshold (here 0.005) lasting at least a certain duration (here 20 ms). For silence shorter than 50 ms, we divide directly at the middle: the part before the midpoint belongs to one segment and the part after it belongs to the next. For silence of 50 ms or longer, we divide at the beginning and at the end of the silence. This audio is divided into nine segments: 901 ms silence, 949 ms sound, 421 ms silence, 2558 ms sound, 337 ms sound, 578 ms sound, 368 ms silence, 1209 ms sound, 679 ms silence. Two coding types are used here: sound segment coding, written as the letter V followed by a corresponding number, and silence coding, written as the letter S followed by the duration of the silence in milliseconds. The data in the phonetic text coding table of the coding warehouse corresponding to the user is shown in fig. 2B-4. Thus we obtain the corresponding text code as follows: S901 V001 S421 V002 V003 V004 S368 V005 S679
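A minimal time domain sketch of this segmentation rule is given below. The function names, the tuple representation of segments, and the synthetic test signal (sampled at an assumed 1 kHz, with short pauses of 30 ms and 40 ms chosen so that the segment lengths match the example) are assumptions for illustration; a practical endpoint detector would work on short-time energy rather than on single samples.

```python
def segment_audio(samples, sample_rate, threshold=0.005,
                  min_silence_ms=20, long_silence_ms=50):
    """Return a list of (kind, start, end) with kind "S" or "V"; end is exclusive."""
    n = len(samples)
    min_len = sample_rate * min_silence_ms // 1000
    long_len = sample_rate * long_silence_ms // 1000

    # 1. Collect quiet runs that last at least min_silence_ms.
    runs, i = [], 0
    while i < n:
        if abs(samples[i]) < threshold:
            j = i
            while j < n and abs(samples[j]) < threshold:
                j += 1
            if j - i >= min_len:
                runs.append((i, j))
            i = j
        else:
            i += 1

    # 2. Long silences become their own segments; short ones are cut at the middle.
    segments, cursor = [], 0
    for start, end in runs:
        if end - start >= long_len:
            if start > cursor:
                segments.append(("V", cursor, start))
            segments.append(("S", start, end))
            cursor = end
        elif start > 0 and end < n:  # a short pause at either edge stays with its neighbour
            middle = (start + end) // 2
            segments.append(("V", cursor, middle))
            cursor = middle
    if cursor < n:
        segments.append(("V", cursor, n))
    return segments


def encode(segments, sample_rate):
    """Write silence as S<milliseconds> and sound as V<running number>."""
    codes, voice_count = [], 0
    for kind, start, end in segments:
        if kind == "S":
            codes.append(f"S{(end - start) * 1000 // sample_rate}")
        else:
            voice_count += 1
            codes.append(f"V{voice_count:03d}")
    return " ".join(codes)


# Synthetic 8-second signal: (amplitude, duration in ms) pairs at 1 sample per ms.
layout = [(0.0, 901), (0.5, 949), (0.0, 421), (0.5, 2543), (0.0, 30), (0.5, 302),
          (0.0, 40), (0.5, 558), (0.0, 368), (0.5, 1209), (0.0, 679)]
samples = [amplitude for amplitude, duration in layout for _ in range(duration)]
segments = segment_audio(samples, sample_rate=1000)
code_string = encode(segments, sample_rate=1000)
# code_string == "S901 V001 S421 V002 V003 V004 S368 V005 S679"
```

The two short pauses do not produce silent characters; each is cut at its midpoint, which is why three sound characters appear back to back.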
In this way, we have converted 8 seconds of audio data into 9 special text characters. Calculated at four bytes per character (the actual figure depends on the specific coding scheme; with context-dependent object-based coding, an average character length of four bytes is entirely achievable), the overall coding result is 36 bytes, almost 1/5000 of the original 176 KB of audio data (22 KB/s × 8 s). The encoding result is therefore much more convenient and efficient to store, transmit, edit, and mix with other data. Only a user who finally needs to play the sound content has to acquire the corresponding data from the code warehouse to restore the audio content.
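The arithmetic behind these figures can be checked with a few lines. The exact sampling rate of 11,025 Hz (the usual sound card value behind the nominal "11 kHz") is an assumption, as is the helper name `pcm_bytes_per_second`.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * bits_per_sample // 8 * channels


rate = pcm_bytes_per_second(11_025, bits_per_sample=8, channels=2)  # about 22 KB/s
audio_bytes = rate * 8      # 8 seconds of audio, about 176 KB
code_bytes = 9 * 4          # 9 characters at an assumed 4 bytes each
ratio = audio_bytes / code_bytes
```

With these assumptions the ratio comes to 4900, consistent with the "almost 1/5000" stated above.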
It should be noted that the method of separating the code and the content can easily place the code and the data content in different secure channels respectively, and has natural security.
Meanwhile, the voice data stored in the coding warehouse are directly related to specific people, and can be well analyzed and arranged as training samples naturally. The existing voice analysis and recognition technology can analyze and recognize a lot of useful information such as pitch, timbre, tone, syllable and the like; more efficient feature parameters, such as MFCC parameters, LPCC parameters, etc., can also be extracted. These can be stored in a coding repository, providing further coding services for the corresponding speech codes. Such as a content lookup matching service, a content normalization service, a content selection service, etc.
The speech-text output may have two different output modes for the obtained speech-text content, i.e. the encoding result, one is graphic output mainly based on text display output and the other is audio play mainly based on speech play.
The graphic output of the voice characters refers to outputting the voice characters in the presentation mode of ordinary characters, namely as typeset text. Its advantage is that existing word processing methods and tools can be used to process the voice characters. In addition, supporting graphic output allows voice characters, traditional characters and other forms of characters (such as graphic characters, picture characters and the like) to appear in the same text document, which supports richer applications.
The specific presentation mode of the phonetic text can be different according to different access rights of users.
1. For a text output system supporting multiple text types, if the user does not have any access rights to the text code (including text type information), the user can see only the code itself, which may be presented as in fig. 2B-5.
2. If the user is able to obtain the type information of the codes but cannot access the specific content of each voice character code, the system may present the successive phonetic text codes (including speech data codes, silence duration codes, etc.) as a whole, for example: "+ an unauthorized phonetic text (9 characters, 4 silent characters; total duration of silence 2'369)". When the user expands the content in the quotation marks, more details can be output, as shown in fig. 2B-6.
As shown in fig. 2B-6 above, we can see not only each phonetic character, but also the silence period visually. With this information, the system can also provide relevant search functions, such as silent searches (with or without duration constraints).
3. Still further, if the user has access to the voice data corresponding to the voice characters, the system can display more relevant information and allow the user to play the voice content, for example: "+ voice content, duration 8' (5 voice characters, 4 mute characters; total mute duration 2'369) [Figure SMS_11]". When the user expands the phonetic text, more detail is available, as shown in fig. 2B-7.
The user can click on any phonetic character to play it. In graphic output, phonetic characters can be visualized in various forms, such as a waveform chart, a spectrogram, or a visualized duration, depending on the specific application requirements. In addition, the results of analyzing the phonetic characters, or semantic tags added to the characters by the user, can be presented at the same time. As shown in fig. 2B-8, the third and fourth audio characters also show the results of the pinyin-based analysis.
Because the code repository information of the audio characters can be accessed, the related text search can also offer more search control, such as searching by semantic tags entered by the user.
The output process for a single phonetic character (including mute characters) is as follows:
1. The user logs into the code repository.
2. The system extracts the meta code from the target character code.
3. The system submits the character's meta code to the code repository.
4. The code repository checks the access rights according to the meta code and the current user. If access is forbidden, an error message is returned to the system, the system graphically outputs the character code itself, and the process ends. If access is allowed, the corresponding encoding metadata is returned to the system and the process continues.
5. The system extracts the instance code from the target character code.
6. The system parses the instance code based on the encoding metadata. Specifically, if the character is a mute character, the instance code is resolved into a mute duration; if the character is an audio character, the character code is submitted to the code repository. The code repository checks the access rights according to the audio coding settings and the current user; if access is forbidden, an error message is returned; if access is allowed, the corresponding voice data is acquired and returned to the system.
7. The system graphically outputs the character according to the parsed or obtained data.
8. If the system receives a play request from the user, the waveform data is recovered from the voice data and played.
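The eight steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the `CodeRepository` class, the `render_character` function, the access-control dictionaries, and the representation of a character code as a `(meta_code, instance_code)` pair are all assumptions made for the example and are not specified in the description.

```python
from dataclasses import dataclass, field


@dataclass
class CodeRepository:
    """Hypothetical in-memory stand-in for the code repository."""
    metadata: dict                                     # meta code -> encoding metadata
    voice_data: dict                                   # character code -> voice bytes
    meta_readers: dict = field(default_factory=dict)   # meta code -> users allowed
    audio_readers: dict = field(default_factory=dict)  # character code -> users allowed

    def get_metadata(self, meta_code, user):
        if user not in self.meta_readers.get(meta_code, set()):
            raise PermissionError("metadata access forbidden")
        return self.metadata[meta_code]

    def get_voice_data(self, char_code, user):
        if user not in self.audio_readers.get(char_code, set()):
            raise PermissionError("voice data access forbidden")
        return self.voice_data[char_code]


def render_character(repo, user, char_code):
    """Return a text rendering of one phonetic character for the given user."""
    meta_code, instance_code = char_code              # steps 2 and 5 (assumed layout)
    try:
        meta = repo.get_metadata(meta_code, user)     # steps 3 and 4
    except PermissionError:
        return f"[code {meta_code}:{instance_code}]"  # only the code itself is shown
    if meta["kind"] == "mute":                        # step 6, mute character
        return f"[mute {instance_code} ms]"
    try:
        data = repo.get_voice_data(char_code, user)   # step 6, audio character
    except PermissionError:
        return "[audio character, content not accessible]"
    return f"[audio character, {len(data)} bytes]"    # step 7
```

The three return paths correspond to the three presentation levels described earlier: code only, type information without content, and full access to the voice data.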
If multiple consecutive characters are to be output, the system needs to obtain all the corresponding phonetic characters and their related data, and output them in visual form according to a certain typesetting rule. If a play request from the user is received, a play buffer is established and the audio data is played in sequence (taking the playing of mute characters into account).
Voice playback. Voice playback output of phonetic characters is similar to the playback of traditional audio data, and the graphic typesetting of characters need not be considered. However, playback of the phonetic text is also subject to user access rights: phonetic characters can be played only if the user has obtained access to the data corresponding to them.
Besides time positioning similar to traditional voice playback, rich search-based positioning can be performed on phonetic characters, such as searching by voice duration, mute duration, semantic tags, or traditional characters mixed into the phonetic text.
It is worth mentioning that mixing phonetic characters with traditional characters makes possible many effects that traditional voice playback cannot achieve, such as embedded subtitles, embedded structured navigation information, embedded photo links, and embedded graphics.
Voice text editing. Encoding audio data as text makes it possible to edit voice data in the traditional text editing manner. In the graphic output state of the voice text, a user can conveniently delete, insert, or modify any character, and can also perform traditional text operations such as search, replace, copy, and paste.
Some of these operations require specialized audio services, for example changing a mute duration, dividing one audio character into several, or merging several phonetic characters into one.
From the above, we can see that the text of audio data provides more opportunities for people to safely and effectively use computers to express and communicate with speech. However, some questions may be raised about this approach.
Noise cancellation. Audio data recorded in a normal environment generally contains ambient noise. If, after slicing and encoding, noisy phonetic character data is played back together with noiseless mute characters, will the result not sound strange?
This is indeed a problem. The solution is straightforward: perform unified denoising before the audio data is stored. Existing automatic denoising technology is mature, and noise elimination for pure voice is comparatively easy.
The frequency range of sound that the human ear can recognize is 20 Hz to 20 kHz. The sound produced by the human vocal organs ranges from about 80 Hz to 3400 Hz, whereas the signal frequency when a person speaks is typically 300 Hz to 3000 Hz. For a particular individual this range is generally narrower. In addition, the volume of a normal indoor conversation is approximately between 20 and 60 dB. Based on this frequency range, high-frequency and low-frequency noise can be removed automatically. Voice detection based on sustained low-dB intervals can automatically locate the mute sections. Spectral analysis of the mute sections then allows noise filtering of the entire audio data. Note that noise in some mute sections may fall in the same frequency range as the speech, so automatic filtering must ensure that the audio of non-mute sections is not processed into low-dB mute sections.
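As a rough illustration of locating mute sections from sustained low levels, the following sketch marks a run of frames as silence when their level stays below a threshold for a minimum duration. The function names, the frame length, the threshold of -40 dB, and the minimum silence length are assumptions chosen for the example; levels are measured relative to full scale for samples in [-1, 1], which is not the same scale as the 20 to 60 dB conversation volume mentioned above.

```python
import math


def rms_db(frame):
    """Level of a frame of samples in [-1, 1], in dB relative to full scale."""
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def find_silence(samples, rate, frame_ms=20, threshold_db=-40.0, min_silence_ms=200):
    """Return (start, end) sample ranges whose level stays below the threshold."""
    frame_len = rate * frame_ms // 1000
    min_frames = max(1, min_silence_ms // frame_ms)
    n_frames = len(samples) // frame_len
    segments, run_start = [], None
    for i in range(n_frames + 1):          # one extra pass closes a trailing run
        quiet = (i < n_frames and
                 rms_db(samples[i * frame_len:(i + 1) * frame_len]) < threshold_db)
        if quiet and run_start is None:
            run_start = i
        elif not quiet and run_start is not None:
            if i - run_start >= min_frames:
                segments.append((run_start * frame_len, i * frame_len))
            run_start = None
    return segments
```

The detected ranges could then serve both as the mute characters and as the material for the spectral noise analysis the paragraph describes; that analysis itself is not shown here.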
The fully denoised voice data and the completely silent mute characters can then be played together harmoniously.
In an actual application environment, segmentation and denoising generally cannot wait until the voice data has been completely obtained. Instead, a buffer holding several seconds of audio can be kept in memory and analyzed. The noise characteristics identified can be accumulated, then reused and updated in later audio processing.
Real-time voice calls. Since this method is based on segmenting all of the voice data, is it not inapplicable to voice applications with high real-time requirements? In fact, the method remains applicable to speech applications that can tolerate delays of several seconds. If the real-time requirement is stricter, voice segmentation cannot be performed during the call. Even for these applications, however, the method can be used to record the voice, which avoids the large data volume and editing difficulty of conventional voice recording.
Voice transmission. In conventional voice call applications, voice data may be delivered directly to the recipient. In this method, the phonetic text is transmitted to the receiver, and the receiver then acquires the actual voice data from the code repository. Will this process not be inefficient?
In practice, the code repository for voice over Internet Protocol applications should be deployed in a cloud-based data center. Current data centers typically provide CDN (content delivery network) services, which automatically select the fastest route to deliver data. The process can therefore be highly efficient; this depends entirely on the deployment scenario of the code repository.
On the other hand, because coding and data are separated, the sender can completely hide part or all of the voice data even after the phonetic text has been transmitted. The receiver then cannot play back all or part of the speech code even though it has received it. This is not possible in conventional voice call applications.
The encoded data is actually much smaller than the original audio data, but for a user who ultimately needs to use or play the original audio content, the amount of data is not reduced but increased (by the phonetic text coding part). Can this not be called a defect of the approach? Admittedly, for a particular piece of speech, the amount of data is not reduced (leaving noise cancellation aside) if the final playback must be able to recover the original input. However, when personalized speech data is stored centrally in the code repository, it in fact contains a great deal of redundancy. Processing this redundant information can greatly improve storage and transmission efficiency. This is described in detail below.
For a particular individual, the sounds that can be made over a lifetime are limited. Given the constraints of language, the basic phones and syllables are even more limited, and so are the combinations of these primitives. Leaving volume aside, the specific phonemes that can be formed are limited. On this basis, voice data can be reused through further segmentation when it is stored. As in existing audio processing, the speech data is split into successive frames. A frame is typically 10 ms to 40 ms long, and adjacent frames may overlap. Appropriate frame segmentation facilitates audio analysis, further parameterization of the audio data, and ultimately reuse.
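Splitting speech into overlapping frames, as mentioned above, can be sketched as follows. The function name `split_frames` and the default values (a 25 ms frame advanced by 10 ms) are assumptions for illustration; the description only states that frames are typically 10 ms to 40 ms long and may overlap.

```python
def split_frames(samples, rate, frame_ms=25, hop_ms=10):
    """Split a sample sequence into fixed-length frames that overlap.

    Each frame is frame_ms long and starts hop_ms after the previous one,
    so consecutive frames share (frame_ms - hop_ms) of audio.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = rate * frame_ms // 1000
    hop = rate * hop_ms // 1000
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]
```

Each frame could then be parameterized (for example into MFCC or LPCC features) and compared with frames already held in the code repository; that comparison is outside the scope of this sketch.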
Some existing audio fingerprint extraction and matching methods, such as Google's Waveprint method (patent US 841977 B1), can detect redundant voice data well and so support services such as content normalization, searching, and matching in the code repository.
It is envisioned that by the method of the present embodiment, all voice data of a person for a lifetime can be easily recorded to complete applications that were previously not imaginable.
Tampering with coded content. Audio data in text form is actually easier to modify, so who guarantees the security and reliability of the audio data? How can it be guaranteed that the audio character sequence is the original character sequence? In fact, this is not a new problem; conventional text faces the same problem, and it can be solved with existing solutions (e.g. digital signatures).
Non-speech audio data. Speech data has been emphasized here; is this method also applicable to non-speech audio data, such as music or the audio tracks of video?
First, the method does not alter the original data; it slices and encodes it, splitting the original content into an encoded stream and the corresponding audio data in the code repository. Final playback can still fully recover and play the original audio. In this sense, the method poses no problem at all.
However, from the text perspective, the text obtained by this method is personalized and tied to a particular user. This is also what enables subsequent voice analysis, recognition, and other highly personalized services for the user. If music or other sounds unrelated to the individual user were stored in the code repository and associated with the user, later personalized services would be affected. Therefore, it is preferable to separate voice data from other audio data into different audio channels, and to use corresponding coding classifications for the other audio data, such as instrument-related coding for music. Finally, the data of the audio characters divided across the channels are mixed together.
Mixing text types. Since the segmented and encoded content of speech data is referred to as text, can it be mixed with traditional text and other types of text codes? Indeed it can, and this is one of the advantages of this solution. A person's natural output is multi-channel; for example, a person can speak while writing or typing on a keyboard. Existing systems can only scatter the results into separate data for storage and processing, losing their natural synchronization. With a suitable coding method, the different data can be stored, processed, and correlated as text.
With the development of cloud computing and big data technology, the computer system can analyze, summarize and even predict the production and life of human beings more systematically and deeply. However, the data that computer systems are able to analyze and process today is mainly data generated inside the digital world. The output of humans is mainly entered into the digital world through keyboards, which is a huge bottleneck. And the keyboard is not a friendly, easy-to-use device for most people. The method provided herein is based on human natural output, and segments and encodes the output voice data. The coding result can be processed by using the traditional text method and tool, and the corresponding data of the coding is stored in a coding warehouse. The code warehouse can be placed in cloud storage, so that analysis and utilization are facilitated. This approach will greatly improve the efficiency of human speech output digitization. And as voice data accumulates, the coding warehouse has the opportunity to provide more intelligent, personalized voice data services. Eventually allowing humans to seamlessly blend with the digital world.
Further, the method further comprises: generating a unique identifier of the coding sequence information based on the coded ranking sequence information and/or generating a unique identifier of each data segment based on each data segment, and storing the unique identifier of the coding sequence information and/or the unique identifier of each data segment as part of the metadata.
The data object identifier, the unique identifier of the coding sequence information, and the unique identifier of the data fragment which are uniquely corresponding to the data object are respectively hash values (such as MD5, SHA1, etc.) corresponding to the data object, the coded arrangement sequence information, the content of each data fragment, or are globally unique identifiers (UUID/GUID) generated by the system or any other globally unique codes. The identifier may be used to perform an integrity check on its corresponding respective content to verify whether the identifier matches its corresponding information and whether the corresponding information is complete.
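A minimal sketch of how such identifiers might be generated and used for an integrity check is given below. The function names, the metadata keys, and the way the arrangement order is serialized are assumptions for the example. SHA-1 is used because the text names it; it and MD5 are no longer considered collision-resistant, so a stronger hash such as SHA-256 could be substituted where tamper resistance matters.

```python
import hashlib
import uuid


def build_identifiers(fragments, order):
    """Build metadata entries holding identifiers for the order and each fragment."""
    order_bytes = ",".join(str(i) for i in order).encode("ascii")  # assumed serialization
    return {
        "order_id": hashlib.sha1(order_bytes).hexdigest(),
        "fragment_ids": [hashlib.sha1(f).hexdigest() for f in fragments],
        "object_uuid": str(uuid.uuid4()),   # system-generated alternative identifier
    }


def fragment_is_intact(metadata, position, fragment):
    """Check a fragment against the hash-based identifier stored in the metadata."""
    return hashlib.sha1(fragment).hexdigest() == metadata["fragment_ids"][position]
```

Only the hash-based identifiers can confirm that content is complete and unmodified; a randomly generated UUID identifies an item but carries no information about its content.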
In summary, data splitting refers to splitting a complete piece of data into two or more parts and then storing those parts in different storage systems. It should be noted that, although in the above embodiments the data is split and then stored in separate locations (as in step 104B and step 207B), the data splitting of the present invention is not limited to storage; it is a data splitting process for data security. A user may not trust data storage at a cloud provider, but through data splitting a piece of data can be scattered across one or more providers, and the data is disclosed only if all of its parts (the metadata and every data fragment) are leaked. This greatly increases the difficulty for an unauthorized party to reassemble the data. The data splitting of the present invention allows direct intervention and control by the end user of the data (i.e., the user who owns the data). The data splitting method is built on top of an operating system (including a cloud operating system), specifically in a dedicated splitting application or in the splitting service of another application system. The storage system is an infrastructure built above the physical storage devices and below the operating system. The data splitting method of the invention can ultimately be applied to a data storage system. FIG. 2C shows the position of the data splitting method, and hence of the application domain of the present invention, in the computer system hierarchy.
The splitting and merging of data may be performed at the terminal or by a server or service provider. Thus, the data obtained from any single cloud storage server, whether by an attacker or by the data service provider itself, is incomplete and insufficient to threaten the user's privacy and confidentiality. An attacker would need to obtain the same user's identities in the different cloud storage services to collect the data fragments that make up the complete data, which is usually much harder than cracking a single system. In addition, the correct merging protocol is required to restore the fragments to the original complete data, which gives the user's data a further layer of protection. Of course, a hacker may attack the user's terminal system to obtain the complete data before it is split or after it is merged. This risk exists regardless of whether cloud storage is used. In general, terminal devices, particularly mobile terminals, expose fewer services to the outside and are not continuously online, so their risk of direct attack is generally smaller than that of an always-online server. In addition, an application system with the data splitting and merging function can split and merge data in real time at run time, so the data before splitting or after merging need not be stored in the terminal system. In this case, even if the terminal system is attacked, the split and stored data remains safe; and when the terminal system fails, maintenance personnel and staff of the enterprise IT department cannot obtain data protected in this way. Take a mail system with a data splitting function as an example: when the data is not in use, the terminal side need not hold any piece of the data, and when a document is sent to someone, it exists on the terminal side only after the recipient downloads it.
Still further, consider a hypothetical enhanced mail client that uses the data splitting and merging method of the present invention; the mail server may be a conventional mail server. When an attachment is added to a mail, the content of the attachment file is split into multiple parts, of which some are saved in the cloud storage specified by the user and the others are saved in the mail as ordinary attachments. When the user selects the recipient and sends the mail, the mail cloud application system registers the metadata of the original attachment file and the splitting information (the preset metadata stripping protocol and the like) in a file meta-information base (an online service in which both the sender and the recipient must have accounts), and can automatically set the corresponding data access links for the recipient according to the user-side settings. On the recipient's side, the terminal holds no fragment of the data before the attachment is downloaded. The data is actually stored across the cloud storage, the mail server, and the corresponding metadata in the file meta-information base. Of course, the data also exists on the sender's terminal (if the sender is not using a distributed file system and has not deleted the file). When a recipient using the enhanced mail client opens the attachment, the system automatically locates the corresponding entry in the file meta-information base from the part of the content stored in the mail as an ordinary attachment, then locates the part of the content in the cloud storage, applies the inverse of the corresponding splitting method, and finally restores the original data at the recipient's client. This process completes automatically if the required account information is preset in the recipient's mail client. At least three accounts are involved: the mail system, the cloud storage system, and the file meta-information base system.
In accordance with the data splitting of the present invention, fig. 2D is a flowchart illustrating a data merging method according to an exemplary embodiment, and as shown in fig. 2D, the present invention provides a data merging method, including:
Step 401B, receiving a data object acquisition request carrying identification information.
The identification information comprises positioning information, and the positioning information is used for positioning a storage address of partial data information in the data object.
Step 402B, acquiring storage content corresponding to the positioning information, and acquiring data information in other storage contents according to the positioning information in the acquired storage content until all data information of the data object is acquired.
Step 403B, merging the obtained data information according to a preset merging specification in the obtained data information to obtain a data object.
According to the data merging method, a data object acquisition request carrying identification information is received, storage content indicated by the positioning information is acquired according to the positioning information in the identification information, and data information in other storage contents is acquired according to the positioning information in the storage content until all data information forming the data object is acquired. And combining the acquired data information according to a preset combination rule to obtain a complete data object. Therefore, the difficulty of illegally acquiring the original data of the user is increased, and even if part of user data is acquired by an illegal means, the complete and correct data object is difficult to obtain, so that the security of data storage is more reliably realized.
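Steps 401B to 403B can be illustrated with a small sketch in which each stored item carries positioning information pointing to further items. The dictionary layout (`kind`, `index`, `payload`, `next`), the location strings, and the merging rule used here (concatenate fragments by index) are all assumptions for the example; the description leaves the concrete merging protocol open.

```python
def fetch_all(stores, start_location):
    """Follow positioning information from one stored item to the next (step 402B)."""
    pending, seen, items = [start_location], set(), []
    while pending:
        location = pending.pop(0)
        if location in seen:          # guard against circular references
            continue
        seen.add(location)
        item = stores[location]
        items.append(item)
        pending.extend(item.get("next", []))
    return items


def merge(items):
    """Assumed merging rule (step 403B): join fragments in index order."""
    fragments = sorted((i for i in items if i["kind"] == "fragment"),
                       key=lambda i: i["index"])
    return b"".join(f["payload"] for f in fragments)


# Hypothetical layout: the mail attachment points to the metadata entry,
# which in turn points to a fragment kept in cloud storage.
stores = {
    "mail://att-1": {"kind": "fragment", "index": 1, "payload": b"world",
                     "next": ["meta://obj-1"]},
    "meta://obj-1": {"kind": "metadata", "payload": {"name": "note.txt"},
                     "next": ["cloud://part-0"]},
    "cloud://part-0": {"kind": "fragment", "index": 0, "payload": b"hello "},
}
restored = merge(fetch_all(stores, "mail://att-1"))
```

Holding only one of the three locations yields at most one fragment, which is the property the paragraph above relies on.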
Fig. 2E is a flowchart illustrating a data merging method according to another exemplary embodiment, and as shown in fig. 2E, the present invention provides a data merging method, including:
Step 501B, receiving a data object acquisition request carrying identification information.
The identification information comprises positioning information, and the positioning information is used for positioning a storage address of partial data information in the data object. The type of the data information is one or more of the following combination modes: metadata, data fragments, encoding order.
Step 502B, acquiring storage content corresponding to the positioning information, and acquiring data information in other storage contents according to the positioning information in the acquired storage content until all data information of the data object is acquired.
Step 503B, merging the acquired data information according to a preset merging protocol in the acquired data information to obtain a data object.
Specifically, one or more pieces of data information (each of which may be a split data fragment, part or all of the metadata, or part or all of the codes and coding sequence) are obtained according to the positioning information. Starting from these, the remaining data information is obtained step by step according to a specific rule, namely the preset merging protocol, and all the pieces (metadata, data fragments, coding sequence, and so on) are merged to recover the original data object. The specific merging cases are as follows:
A. When the type of the data information is the combination of the data fragments, the codes and the coding sequence, decoding the codes according to a merging algorithm in a preset merging protocol to obtain the data fragments corresponding to the codes; and arranging the decoded data fragments according to the coding sequence to obtain the data objects arranged according to the original sequence of the data fragments.
B. When the type of the data information is a combination of metadata and a data fragment:
b1. If the metadata specified in the preset merging protocol comprises attribute information, integrity verification is performed on the data object obtained by merging the data fragments, according to the attribute information, to confirm that the attributes of the data object match the attribute information in the metadata; or
b2. If the metadata specified in the preset merging protocol comprises a data content identifier and keywords, the data matching the keywords is merged into the data fragment corresponding to the data content identifier, and the data fragments are then combined to form the data object; or
b3. If the metadata specified in the preset merging protocol comprises attribute information, a data content identifier, and keywords, the data matching the keywords is merged into the data content corresponding to the data content identifier, and integrity verification is performed on the data object obtained by merging the data fragments, according to the attribute information, to confirm that the attributes of the merged data object match the attribute information in the metadata.
Step 504B, if the metadata includes the unique identifier of the data object, performing integrity verification on the combined data object according to the unique identifier.
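The verification in case b1 and step 504B might look like the following sketch, in which the merged object is checked against a size attribute and a hash-based unique identifier. The metadata keys `size` and `object_id`, the use of SHA-1 for the identifier, and simple concatenation as the merge are assumptions for illustration.

```python
import hashlib


def merge_and_verify(fragments, metadata):
    """Merge ordered fragments, then verify the result against the metadata."""
    data = b"".join(fragments)

    # Case b1: attribute information, here only the size of the data object.
    expected_size = metadata.get("size")
    if expected_size is not None and len(data) != expected_size:
        raise ValueError("merged object size does not match metadata")

    # Step 504B: unique identifier, assumed here to be the SHA-1 of the content.
    expected_id = metadata.get("object_id")
    if expected_id is not None and hashlib.sha1(data).hexdigest() != expected_id:
        raise ValueError("merged object does not match its unique identifier")

    return data
```

A missing or altered fragment makes one of the two checks fail, so the caller learns that the restored object is not the original rather than silently receiving corrupted data.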
The data merging process is in effect the inverse of the data splitting process, and works according to a preset merging protocol. In practice, the preset merging protocol (hereinafter the merging protocol) may have the same content as the preset splitting protocols (including the preset metadata stripping protocol, the preset data content splitting protocol, the preset code separating protocol, etc., hereinafter the splitting/stripping protocol). Like the splitting protocol, the merging protocol is data information prepared for recovering the data; the two may together be called a splitting and merging protocol, because splitting must ensure that the split data can be recovered. The splitting protocol therefore often includes or implies the merging protocol.
Taking the mail client as an example: after the client receives a mail attachment, it uses the attachment name (that is, the unique identifier of the data object) to locate the data information held in each storage location, such as the file meta-information base, the mail system, and the cloud storage. This data information includes the splitting algorithm, the data fragments, the positioning information, the related file metadata items, and so on. The mail system can locate and download the data fragments according to the acquired data information, derive the inverse algorithm from the splitting algorithm to combine the data fragments and the metadata, and, if codes exist, recover the data fragments from the codes, thereby obtaining the content of the original user data object. If the metadata contains a unique identifier of the data object, the file size, the restored file name, the file type, the creation time, etc. can also be verified against the file metadata. In the mail client example, the splitting protocol information may itself serve as the merging protocol; the specific merging protocol, namely the inverse process, can be deduced from the data splitting description document.
Therefore, the original data cannot be recovered from the data fragments alone; at least the splitting/stripping protocol established during data splitting must be obtained, and the merging protocol is then either derived from it by reverse analysis or obtained directly as a preset merging protocol. Typically, the system retains the corresponding splitting/stripping protocol after the data splitting process and stores the relevant location information (e.g., its storage location) in the split data fragments or in any designated accessible storage space. Alternatively, the merging protocol corresponding to the splitting/stripping protocol may be generated directly during data splitting and stored in each split data fragment or another specified location, in which case the merging process only needs to acquire the merging protocol directly. The system then searches for or extracts the corresponding splitting metadata according to the obtained splitting/stripping protocol or merging protocol, and splices all data fragments together based on that protocol, the metadata, and related information, so as to restore the original data.
Further, decoding the codes according to a merging algorithm in a preset merging protocol to obtain the data fragments corresponding to the codes includes:
disassembling the data information, according to the merging algorithm in the preset merging protocol, to obtain a meta code, or a meta code and an instance code;
querying the code repository, and acquiring the corresponding metadata and coding protocol according to the meta code;
acquiring the data object corresponding to the data information according to the metadata and the coding protocol, or according to the metadata, the coding protocol, and the instance code.
It should be noted that, for the specific decoding flow, reference may be made to an embodiment portion of a processing method of the subsequent decoding in the description, which is not described herein again.
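To make the decoding steps above concrete, the sketch below assumes a 32-bit code whose upper 8 bits are the meta code and whose lower 24 bits are the instance code. This bit layout, the repository contents, and the two character types shown are hypothetical and serve only to illustrate the lookup sequence.

```python
# Hypothetical code repository: meta code -> metadata and coding protocol details.
REPOSITORY = {
    0x01: {"type": "mute", "unit_ms": 1},                   # instance = duration
    0x02: {"type": "audio", "table": {7: b"\x00\x01"}},     # instance = data key
}


def split_code(code):
    """Disassemble a 32-bit code into (meta code, instance code), assumed layout."""
    return code >> 24, code & 0xFFFFFF


def decode(code, repository):
    """Resolve a code to its data using the metadata found under its meta code."""
    meta_code, instance_code = split_code(code)
    metadata = repository[meta_code]                  # query the code repository
    if metadata["type"] == "mute":
        return ("mute", instance_code * metadata["unit_ms"])
    return ("audio", metadata["table"][instance_code])
```

For example, `decode(0x010003E8, REPOSITORY)` yields a mute duration of 1000 ms, because the instance code 0x3E8 is interpreted through the metadata registered under meta code 0x01.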
The following describes the splitting and merging process of a whole data object with a specific example. It should be noted that the specific data, algorithms, etc. in this example are for illustration only and do not limit the present invention. Splitting target: divide the information of the data object into three parts: a metadata block, data blocks (i.e., data fragments), and an index block (i.e., codes). Any information distribution algorithm, such as the IDA algorithm, may be used; in this example the losslessly compressed source file content is divided into four-byte (32-bit) units, although compression is not required. The divided results are sorted and merged, that is, duplicates are eliminated, and the distinct items are stored as a data block file. For each divided data block (data fragment), its index (code) in the data block file is recorded, and the indexes are stored in the original order as an index file (the codes and their arrangement order information). The file names of the data block file and the index file may be hash values (MD5, SHA-1, etc.) of the corresponding file contents, a system-generated Globally Unique Identifier (GUID), or any other globally unique code. The file name, size, date, etc. of the source file, together with the file names of the data block file and the index file, may be stored in the metadata base. The three parts (metadata block; data blocks, i.e. data fragments; index block, i.e. codes and coding sequence information) can be stored in separate cloud storage systems to achieve the intended security protection.
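The example in this paragraph can be sketched as a round trip in Python. The sketch uses plain fixed-size chunking with de-duplication rather than an information dispersal algorithm, and the function names, metadata keys, and the choice of zlib for lossless compression are assumptions made for illustration.

```python
import hashlib
import zlib

BLOCK_SIZE = 4  # four bytes (32 bits), as in the example


def split_object(name, content, compress=True):
    """Split content into a metadata block, unique data blocks, and an index."""
    payload = zlib.compress(content) if compress else content
    chunks = [payload[i:i + BLOCK_SIZE] for i in range(0, len(payload), BLOCK_SIZE)]
    blocks = sorted(set(chunks))                       # sort and remove duplicates
    position = {block: i for i, block in enumerate(blocks)}
    index = [position[chunk] for chunk in chunks]      # codes in original order
    metadata = {
        "name": name,
        "size": len(content),
        "compressed": compress,
        "sha1": hashlib.sha1(content).hexdigest(),
    }
    return metadata, blocks, index


def merge_object(metadata, blocks, index):
    """Rebuild the original content from the three parts and verify it."""
    payload = b"".join(blocks[i] for i in index)
    content = zlib.decompress(payload) if metadata["compressed"] else payload
    if hashlib.sha1(content).hexdigest() != metadata["sha1"]:
        raise ValueError("restored content does not match metadata")
    return content
```

With `compress=False`, the content `b"ABCDABCDABCDxyz"` produces the data blocks `[b"ABCD", b"xyz"]` and the index `[0, 0, 0, 1]`. Neither part alone reveals the original, and the index is meaningless without the matching data block file.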
The deployment scheme is flexible: the data block file and the index file can be placed in one file-based cloud storage and the metadata in a separate cloud database; the three kinds of data can also be stored in three different cloud storages; and to improve availability, a separate redundant backup may be provided for each. In addition, in usage modes where multiple parties share and collaborate on data, the sharing scheme is likewise flexible: the three kinds of data can be shared through a combination of communication and sharing channels, such as email, cloud sharing, instant messaging, and FTP. After the three kinds of data, or access authorization to the storage systems holding them, are obtained, the system can restore the target file through the data merging process: for example, the four-byte contents at the indexed positions of the data block (data fragment) file are spliced together according to the codes and their arrangement order in the index file, and the spliced result is decompressed (if compression was applied earlier) to obtain the target file. A desktop agent may also be built on such a generic split storage system. This desktop agent builds on the desktop agents of the underlying cloud storage and automates the splitting and merging processes, which brings convenience to users. For example, a split storage desktop agent runs in the background on the user's client, and its underlying cloud storages are, for example, Google Drive and Microsoft OneDrive. Google Drive's directory C:\GDrive is automatically synchronized with Google's cloud storage, and OneDrive's directory C:\MDrive is automatically synchronized with Microsoft's cloud storage. The synchronized directory of the split storage desktop agent is C:\DDrive.
When a user saves a file to C:\DDrive, the desktop agent service detects the file system change, automatically splits the file, stores the data block (data fragment) file in C:\GDrive, stores the index file (codes and their arrangement order information) in C:\MDrive, and stores the metadata in a dedicated database cloud service. The Google and Microsoft desktop agent services automatically synchronize the data block file and the index file to the Google and Microsoft cloud storages, respectively, and to the user's other terminal directories. If another terminal runs the split storage desktop agent, it detects the changes in the C:\GDrive and C:\MDrive directories, automatically obtains the metadata, merges the metadata, the data block file, and the index file into the original file, and stores it in the C:\DDrive directory, thereby synchronizing the split/merged storage.
Fig. 2F is a schematic structural diagram of a data splitting device according to an exemplary embodiment, and as shown in fig. 2F, the present invention provides a data splitting device, including: the obtaining stripping module 61B is configured to obtain metadata in a data object corresponding to the data identifier to be stored according to a preset metadata stripping specification when receiving a storage request carrying the data identifier to be stored, and strip the obtained metadata from the data object. The dividing module 62B is configured to divide the data content into at least two data segments according to a preset data content splitting protocol. The storage module 63B is configured to store the metadata and the data fragments in different storage banks or different secure channels respectively.
According to the data splitting device, when a storage request carrying a data identifier to be stored is received, metadata in a data object corresponding to the data identifier to be stored is obtained according to a preset metadata stripping protocol, and the metadata is stripped from the data object; dividing the data content into a plurality of data fragments according to a preset data content splitting protocol; and storing the metadata and each data fragment into different storage bodies or different secure channels respectively. Therefore, the difficulty of illegally acquiring the original data of the user is increased, and the safety of data storage is more reliably realized.
Further, fig. 2G is a schematic structural diagram of a data splitting device according to another exemplary embodiment. As shown in fig. 2G, the obtaining stripping module 61B includes a receiving submodule 611B, a determining submodule 612B, and a stripping submodule 613B. The receiving submodule 611B is configured to receive a storage request carrying an identifier of data to be stored. The determining submodule 612B is configured to act as follows when the receiving submodule 611B receives the storage request: when the metadata agreed in the preset metadata stripping specification includes attribute information, determine as metadata the attribute information content that matches the attribute information in the data object corresponding to the identifier of the data to be stored; or, when the agreed metadata includes a data content identifier and keywords, determine as metadata the data matching the keywords within the data content that the data content identifier designates in the data object; or, when the agreed metadata includes attribute information, a data content identifier, and keywords, determine as metadata both the matching attribute information content and the data content matching the keywords. The stripping submodule 613B is configured to strip the metadata determined by the determining submodule 612B from the data object.
Further, the obtaining stripping module 61B includes a parsing submodule 614B, configured to, when the metadata agreed in the preset metadata stripping specification includes a data object identifier, parse the data object to generate the data object identifier uniquely corresponding to the data object.
Further, the apparatus further comprises: the encoding module 64B is configured to perform encoding processing on each data segment according to a preset encoding separation protocol, so as to obtain an encoding corresponding to each data segment. The arrangement module 65B is configured to arrange each code according to the original sequence of each data segment in the data content, so as to obtain arrangement sequence information of the codes. The storage module 63B is specifically configured to store metadata, codes corresponding to each data segment, and arrangement order information of the codes into different storage banks or different secure channels, respectively.
Further, the apparatus further comprises: an identifier generation module 66B for generating a unique identifier of the coding sequence information based on the coded arrangement sequence information and/or generating a unique identifier of each data segment based on each data segment; the storage module 63B is further configured to store the unique identifier of the coding sequence information and/or the unique identifier of each data fragment as part of the metadata.
The preset data content splitting protocol includes at least one of a disk array (RAID) splitting algorithm and an information dispersal algorithm (IDA).
The implementation method and principle of the data splitting device are similar to those of the data splitting method, and are not repeated here.
Fig. 2H is a schematic structural diagram of a data merging device according to an exemplary embodiment, and as shown in fig. 2H, the present invention provides a data merging device, including:
a receiving module 81B, configured to receive a data object acquisition request carrying identification information; the identification information comprises positioning information, and the positioning information is used for positioning a storage address of partial data information in the data object.
The obtaining module 82B is configured to obtain the storage content corresponding to the positioning information, and obtain the data information in the other storage content according to the positioning information in the obtained storage content until all the data information of the data object is obtained.
The processing module 83B is configured to combine the obtained data information according to a preset combination rule in the obtained data information, so as to obtain a data object.
The data merging device of the embodiment obtains the storage content indicated by the positioning information by receiving the data object obtaining request carrying the identification information and according to the positioning information in the identification information, and obtains the data information in other storage contents according to the positioning information in the storage content until all the data information forming the data object is obtained. And combining the acquired data information according to a preset combination rule to obtain a complete data object. Therefore, the difficulty of illegally acquiring the original data of the user is increased, and even if part of user data is acquired by an illegal means, the complete and correct data object is difficult to obtain, so that the security of data storage is more reliably realized.
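The locator-following retrieval performed by the obtaining module 82B can be sketched as below. This is a hedged illustration only: the record layout (`kind`, `data`, `next`), the store names, and the combination rule (splicing blocks in index order) are assumptions made for the example and are not defined by the text.

```python
def fetch_all(stores, start):
    """Follow locators from `start` until every referenced part has been read."""
    parts, pending, seen = {}, [start], set()
    while pending:
        locator = pending.pop()
        if locator in seen:
            continue
        seen.add(locator)
        store_name, key = locator
        record = stores[store_name][key]
        parts[record["kind"]] = record["data"]
        # Stored content may itself carry locators for further parts.
        pending.extend(record.get("next", []))
    return parts


def combine(parts):
    """Assumed combination rule: splice the blocks in the order given by the index."""
    return b"".join(parts["blocks"][code] for code in parts["index"])


# Three hypothetical stores; only the metadata record knows where the rest lives.
stores = {
    "meta_db": {"doc1": {"kind": "metadata", "data": {"name": "a.txt"},
                         "next": [("cloud_a", "blk"), ("cloud_b", "idx")]}},
    "cloud_a": {"blk": {"kind": "blocks", "data": [b"abcd", b"xyz"]}},
    "cloud_b": {"idx": {"kind": "index", "data": [0, 0, 1]}},
}
parts = fetch_all(stores, ("meta_db", "doc1"))
data_object = combine(parts)
```

A party holding only `cloud_a` or only `cloud_b` in this sketch obtains unordered blocks or bare index numbers, which is the property the embodiment relies on.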
Further, fig. 2I is a schematic structural diagram of a data merging device according to another exemplary embodiment, where, as shown in fig. 2I, the types of data information are one or more of the following combinations: metadata, data fragments, encoding order.
A. When the type of the data information is a combination of data fragments, codes, and a coding order, the processing module 83B includes a decoding submodule 831B and an arrangement submodule 832B. The decoding submodule 831B is configured to decode the codes according to a merging algorithm in a preset merging protocol, so as to obtain the data fragment corresponding to each code. The arrangement submodule 832B is configured to arrange the decoded data fragments according to the coding order, so as to obtain the data object with its data fragments in their original order.
B. When the type of the data information is a combination of metadata and data fragments, the processing module 83B is specifically configured as follows. When the metadata agreed in the preset merging specification includes attribute information, it performs integrity verification, according to the attribute information, on the data object obtained by merging the data fragments, so as to confirm that the attributes of the data object match the attribute information in the metadata. Or, when the agreed metadata includes a data content identifier and keywords, it merges the data matching the keywords into the data fragments corresponding to the data content identifier, and then merges the data fragments to form the data object. Or, when the agreed metadata includes attribute information, a data content identifier, and keywords, it merges the data matching the keywords into the data content corresponding to the data content identifier, and performs integrity verification on the merged data object according to the attribute information, so as to confirm that the attributes of the merged data object match the attribute information in the metadata.
Further, the apparatus further comprises: the integrity verification module 84B is configured to, when the metadata includes a unique identifier of the data object, perform integrity verification on the combined data object according to the unique identifier.
The implementation method and principle of the data merging device are similar to those of the data merging method, and are not repeated here.
In the following, a software/hardware implementation method of the present invention will be given as a specific example in connection with the above embodiments of the splitting and merging method and apparatus.
For split-based application systems, the main architectural question is how the system distributes data among multiple stores. Such systems typically split data by metadata, by encoding, and by domain-related data content. Data can therefore be split naturally along the lines of the application domain, i.e., using a domain-related splitting method. The splitting/stripping and merging flow is usually built into the data access layer of the system and is tied to the domain's business logic. Whether the data is split in a domain-related or a domain-independent way, many splitting/stripping methods are possible. We therefore introduce the concept of a "data splitting description language" (which may be part of the splitting/merging protocol) to configure the splitting process. In this way, the system or the user can split/strip data at run time using a dynamic splitting/stripping method. The description of the splitting/stripping method itself (which may be part of the splitting protocol) may be stored in a specific store as part of the stripped-out metadata. Different data may have different splitting/stripping methods. Consequently, the merging of data also varies from data to data, and the merging process must be based on an understanding of the description of the splitting/stripping method. The data splitting/stripping/merging engine is the system component that parses and executes the data splitting/stripping description information to complete the splitting/stripping/merging. The core of the data splitting description language and of the data splitting/stripping/merging model is the data processor model. A data processor is a software/hardware component that processes data. The processor that splits data is called a splitter, and the corresponding processor that merges data is called a merger; both are data processors.
Compressors, decompressors, encryptors, decryptors, storers, extractors, and the like are also data processors. The core of a data processor is its processing procedure; in addition, a data processor has a number of input ports (of two types: data input ports and parameter input ports) and a number of output ports. A data input port receives data input, an output port delivers data output, and a parameter input port receives parameter information needed during processing. For example, a compressor has one data input port (plus a password parameter input port when a compression password is used) and one data output; a splitter has one data input and several data outputs; a merger has several data inputs and one data output; a storer has one data input, several parameter inputs (corresponding to the storage location, access information, etc.), and no output (its processing procedure is to submit the input to storage); an extractor has no data input and one data output. There is also a special class of data processors, the generators, which have no data input (though sometimes parameter inputs) and one or more data outputs; their outputs often take part in the overall process as parameters of other data processors. A distributor has one data input and several data outputs, each of which is identical to the input data. The output of one processor must be connected to an input of another processor (either a data input or a parameter input).
In addition, almost every data processor has a corresponding inverse processor; otherwise the data merging process could not be derived from the data splitting description. (The only exception is the data generator: data generation is generally irreversible, and its inverse in the system is that the generated data is obtained, directly or indirectly, from storage or from other processors.) Generally, the data inputs of a data processor are the data outputs of its inverse processor, and its data outputs are the data inputs of its inverse processor; the parameter inputs remain unchanged. A splitter corresponds to a merger, an encryptor to a decryptor, a compressor to a decompressor, a storer to an extractor, a distributor to a selector (the inverse of distribution is a process that selects one of its data input ports), and so on. The whole data splitting/stripping/merging process is in fact carried out by a network of data processors, and its essence can be characterized with a Petri net model: a processing procedure is a transition, an input port is a place, and a connection from an output to the next input port is a directed arc. The directed arc from a data processor's input ports to its own processing procedure is implicit inside the processor: when all of its input ports hold data (tokens), the processing procedure is activated automatically and the data flows onward.
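The data processor model described above (ports as places, the processing procedure as a transition that fires once every input port holds a token) can be sketched as follows. This is a simplified, single-threaded illustration under assumptions: the `Processor` class, its method names, the `level` parameter port, and the even/odd byte splitter are hypothetical and chosen only to keep the example short.

```python
import zlib


class Processor:
    """A data processor: input ports (places) feeding one procedure (transition)."""

    def __init__(self, name, func, inputs, outputs=1):
        self.name = name
        self.func = func                      # returns a tuple, one item per output
        self.ports = {port: None for port in inputs}
        self.filled = set()
        self.targets = [[] for _ in range(outputs)]  # directed arcs per output port

    def connect(self, out_idx, target, port):
        self.targets[out_idx].append((target, port))

    def put(self, port, value):
        self.ports[port] = value
        self.filled.add(port)
        if self.filled == set(self.ports):    # every port holds a token: fire
            self.fire()

    def fire(self):
        args = dict(self.ports)
        self.filled.clear()
        for out_idx, value in enumerate(self.func(**args)):
            for target, port in self.targets[out_idx]:
                target.put(port, value)


collected = {}


def make_storer(label):
    def store(data):                          # a storer has no output
        collected[label] = data
        return ()
    return Processor(label, store, ["data"], outputs=0)


# compressor (data port + parameter port) -> splitter -> two storers
compressor = Processor("compressor",
                       lambda data, level: (zlib.compress(data, level),),
                       ["data", "level"])
splitter = Processor("splitter", lambda data: (data[0::2], data[1::2]),
                     ["data"], outputs=2)
compressor.connect(0, splitter, "data")
splitter.connect(0, make_storer("even"), "data")
splitter.connect(1, make_storer("odd"), "data")

compressor.put("level", 6)                    # parameter alone does not fire
compressor.put("data", b"hello hello hello")  # now all ports are filled
```

The inverse network would interleave the two stored halves (the merger) and then decompress them, with the `level` parameter port left unchanged.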
The aforementioned data splitting description language is mainly used to describe how data processors are assembled into a flow graph. A document written in the data splitting description language is called a data splitting description document. A data flow graph described in such a document is itself essentially a data processor, so one data flow graph may be used inside another as a data processor. A data splitting description document defines one or more data flow graphs. For a document that is used directly for data splitting, the final entry flow graph must be specified. Each data flow graph comprises a number of data processors and the connections between them. The connections are described at the data output ports of the data processors. Each data flow graph has a specified starting data processor. The data splitting description document may be presented and edited graphically. The data splitting and merging engine splits and merges data according to the data splitting description document. The corresponding data splitting flow is shown in fig. 2J: step 1001B, obtaining the metadata of the data object to be split; step 1002B, creating a data separation storage document according to the metadata; step 1003B, reading in the data separation storage document; step 1004B, instantiating the data separation storage document as a data flow graph (instantiating the data processors and establishing the connections between them); step 1005B, passing the data to be split to the starting data processor of the data flow graph; and step 1006B, destroying the data flow graph after its execution has finished.
It can be seen that the main work of data splitting is done by the data processors in the data flow graph, while the data splitting and merging engine is mainly responsible for loading the data splitting description document, instantiating it as an executable data flow graph, and finally passing the data to the flow graph for processing. A data processor is an active object: the instantiated processor object has its own thread/process, which continuously checks its own execution condition, executes automatically once all input ports are found to hold data, and passes the results to other data processors. It is destroyed automatically after it has finished running. As shown in fig. 2K: step 1101B, determining whether data has been passed to an input port; if so, step 1102B is executed, and if not, step 1103B is executed; step 1102B, receiving the input data; step 1103B, determining whether all input ports hold data; if an empty input port (typically a parameter port) is found, i.e., an input port without any data source, the user is prompted through an interactive interface to enter the corresponding information. If all ports hold data, step 1104B is executed; otherwise the flow returns to step 1101B; step 1104B, executing the data processing procedure; step 1105B, passing the processing result to the data processors connected to the outputs.
The corresponding data merging flow is shown in fig. 2L: step 1201B, locating the corresponding data separation storage document according to the input information; step 1202B, reading in the data separation storage document; step 1203B, instantiating the data separation storage document as the corresponding inverse data flow graph; and step 1204B, destroying the data flow graph after its execution has finished.
When split data is recovered, the input information can be the reference code of the data splitting document or part of the split data content. In the latter case, the hash value obtained by applying a hash function to that content can also serve as a reference code for the document. (A hash function creates a small digital "fingerprint" from the data content; the same content always yields the same fingerprint, and the fingerprint is assumed not to collide with the fingerprints of other content.) Through this code, the corresponding data splitting document can be obtained. The data splitting document describes the splitting flow of the data, and the corresponding inverse flow must be derived when the data is merged. Inversion starts from the starting data processor and traverses the related data processors along the output ports, inverting each one. The inversion of a data processor varies with its type, but in general the type is changed to the inverse processor type, each data input port becomes an output port, and each output port becomes a data input port. The parameter input ports are unchanged.
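The inversion rule just described (swap each processor type for its inverse, turn inputs into outputs and outputs into inputs, keep parameters unchanged) can be sketched on a small graph. The dictionary representation of the flow graph, the type names, and the function `invert` are assumptions for illustration; the actual data splitting description language is defined in the figures and is not reproduced here.

```python
INVERSE_TYPE = {
    "encryptor": "decryptor", "decryptor": "encryptor",
    "compressor": "decompressor", "decompressor": "compressor",
    "splitter": "merger", "merger": "splitter",
    "storer": "extractor", "extractor": "storer",
}


def invert(graph, start):
    """Build the inverse flow graph by traversing from the starting processor.

    graph maps a processor name to
    {"type": ..., "params": {...}, "outputs": [downstream processor names]}.
    """
    inverted, stack = {}, [start]
    while stack:
        name = stack.pop()
        if name in inverted:
            continue
        node = graph[name]
        inverted[name] = {
            "type": INVERSE_TYPE[node["type"]],  # type becomes the inverse type
            "params": node["params"],            # parameter ports are unchanged
            "outputs": [],
        }
        stack.extend(node["outputs"])
    # Every forward arc a -> b becomes the inverse arc b -> a.
    for name, node in graph.items():
        if name in inverted:
            for downstream in node["outputs"]:
                inverted[downstream]["outputs"].append(name)
    return inverted


# Hypothetical splitting flow: encrypt, split, then store the two results.
graph = {
    "encrypt": {"type": "encryptor", "params": {"key": "from system configuration"},
                "outputs": ["split"]},
    "split": {"type": "splitter", "params": {"block_bytes": 4},
              "outputs": ["store_codes", "store_blocks"]},
    "store_codes": {"type": "storer", "params": {"target": "cloud"}, "outputs": []},
    "store_blocks": {"type": "storer", "params": {"target": "local file"}, "outputs": []},
}
merge_graph = invert(graph, "encrypt")
```

In the resulting graph the two extractors feed the merger, which feeds the decryptor, so executing it retraces the splitting flow backwards.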
For example, the definition of the data splitting description language is shown in fig. 2M; a visualized flow chart of the data splitting description language is shown in fig. 2N; and a sample data splitting description document is shown in Table 1:
Table 1, data splitting description document sample
Figure SMS_12
Figure SMS_13
Figure SMS_14
The specific splitting process is as follows: the data to be split is firstly subjected to DES encryption, and an encryption key is from system configuration storage; the encrypted data is split into block data and encoded data by 4-byte partition encoding; the encoded data is stored in Amazon S3 cloud storage, its corresponding SHA1 hash value is stored in the metadata database as a key to address the corresponding metadata; the block data is stored in a local file with a file name that is a system generated GUID that is also stored as a key in the metadata database. Metadata database related records are shown in table 2; the split items and metadata mapping table are shown in table 3;
Table 2, metadata table:
Figure SMS_15
Table 3, split item and metadata mapping table:
Figure SMS_16
When either of the two key values is obtained, the corresponding data splitting description document can be obtained, and the data can thus be recovered.
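The two-key lookup can be sketched as follows. Since Tables 2 and 3 are only available as images, the structures below (`documents`, `mapping`) and the functions `register` and `lookup` are purely hypothetical stand-ins; only the use of a SHA-1 hash of the encoded data and a system-generated GUID as keys comes from the text.

```python
import hashlib
import uuid


def register(mapping, documents, doc_id, description, encoded_data: bytes):
    """Record a splitting description document under both of its keys."""
    sha1_key = hashlib.sha1(encoded_data).hexdigest()  # key for the encoded data
    guid_key = str(uuid.uuid4())                       # file name of the block data
    documents[doc_id] = description
    mapping[sha1_key] = doc_id
    mapping[guid_key] = doc_id
    return sha1_key, guid_key


def lookup(mapping, documents, key):
    """Either key leads to the same data splitting description document."""
    return documents[mapping[key]]


mapping, documents = {}, {}
sha1_key, guid_key = register(
    mapping, documents, "doc-1",
    {"flow": ["encrypt", "split", "store"]},  # placeholder description
    b"\x00\x01\x00\x02",
)
```

A holder of only the encoded data can recompute the SHA-1 key, and a holder of only the block file knows the GUID from its file name, so either part suffices to locate the description needed to start the merge.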
From the above description, it is readily apparent that the invention involves three concepts: (1) handwriting input systems and methods; (2) an object-based data encoding scheme; and (3) an object-based data splitting scheme. Each technical scheme can be implemented independently to obtain its own technical effects. Preferably, these concepts may be combined with one another, or one or more of them may be combined with other applications, in which case the value and benefits of these inventive concepts are better realized. Fig. 2O shows the associations between the various ideas under the three concepts above, as well as some specific examples of applications that may be derived from them. These specific applications are merely exemplary, and many more variations are possible in practice, so the invention has very broad application prospects.
After decades of development, information technology has entered a network age in which it is highly converged with communication technology. Traditional standardized coded data processing systems laid a solid foundation for modern computer technologies, but they cannot meet the requirements of networked personal computing, such as personalization, security, and efficiency. To keep pace with the times and make up for these shortcomings, the invention not only provides a novel handwriting input method and system, but also combines it with a data processing method and system using an object-based open coding and decoding scheme and with an object-based data splitting/stripping/merging method, thereby building, on the basis of traditional data processing systems, a future-oriented data processing system for the network environment that is open, secure, and efficient in the true sense.
In addition, regarding the codec processing method mentioned below, some basic background is described first. The emergence and development of the computer are inseparable from encoding technology, and a wide variety of coding techniques are currently available. As a foundation of the computer, coding technology is widely applied to the transmission, storage, and processing of data, and its importance is self-evident. At the same time, the rise of cloud computing and big data and the growing momentum of the Internet of Things bring new opportunities and challenges to coding technology.
In essence, coding schemes can be divided into two categories: content encoding and reference encoding.
Content encoding is a method of digitizing or converting the content of the encoded object. Base64 encoding, the various data compression encodings (lossless compression, lossy compression, etc.), image encodings (JPEG, SVG, etc.), and audio and video encodings (PCM, MP3, MP4, etc.) all belong to content encoding. The digital content of the data itself is directly contained in the result of content encoding and can be analyzed and processed by a computer. There is also a class of structured coding techniques that describe the structural information of data; they mainly encode the content of structured data/documents. HTML, MathML, and SVG, for example, are specific structured description languages, and the corresponding coding specification is the metalanguage XML. Similar coding specifications include JSON and Protocol Buffers.
Unlike content encoding, the result of reference encoding is not the data content itself but a reference to the content, or a description of the addressing path for accessing the object. Huffman coding is a method of creating optimized reference codes for source symbols (the content itself). URLs, IP addresses, RFID tags, barcodes, two-dimensional codes, ISBNs, zip codes, etc. are all reference codes. It is worth mentioning that character codes (in particular, standard codes) are in essence also reference codes: each is a code for a specific character position in the character coding scheme. The data that make up the character itself, such as its sound, shape, and meaning, appear only in the code specification.
Once reference codes (rather than coding methods) have been standardized to some extent, a computer program may process the codes directly without the corresponding content (or the program has the corresponding content built in). ASCII and Unicode are examples. Such codes in themselves already constitute data content at a higher level. Standardized character coding is a typical example, and many of today's text-based coding specifications (e.g., JSON, CSV, XML) are built on it.
Regarding objects and models: an object (rendered with a different Chinese term in Taiwan) is a term from object-oriented technology that denotes both a specific thing in the problem space of the objective world and a basic element in the solution space of a software system.
The OMG (Object Management Group), a non-profit standardization organization in the computer field, has successfully defined a series of languages and standards for object modeling. The OMG divides models into four levels of abstraction: the meta-metamodel layer (M3), the metamodel layer (M2), the model layer (M1), and runtime data objects (M0). The meta-metamodel layer contains the elements required to define a modeling language; the metamodel layer defines the structure and grammar of a modeling language, and may correspond specifically to UML (Unified Modeling Language) or to object-based programming languages such as Java and C#; the model layer defines the model of a particular system, which is what we usually call a class or object model; the runtime layer contains the state of a model's objects at runtime, i.e., what we call objects or instances.
FIG. 3 is a schematic diagram of a metamodel in the prior art. As shown in FIG. 3, the Meta-Object Facility (MOF) is a set of standardized specifications defined by the OMG for building metamodels (M2). The MOF includes a meta-modeling language (the M3 model) and methods for creating and manipulating metamodels.
The object model has multiple layers, a static model that represents structure and function, and a dynamic model that describes runtime behavior. Of interest herein are mainly static models related to coding, including data and interfaces.
Regarding reference coding and object identification: the identifier (ID) of an object is in fact a reference code. Within the context in which an object identifier is used, the identifier must be unique and in one-to-one correspondence with the object. In this way, the system can locate the corresponding object by addressing it through the identifier.
In most cases, the reference code of an object and the object identifier are the same concept, because they serve the same purpose. Sometimes, however, a reference code cannot serve as an object identifier. A reference code only guarantees that the target can be addressed correctly; it does not necessarily guarantee a one-to-one correspondence with the object, and many-to-one situations (one object, multiple codes) can occur. For example, a host may have multiple IP addresses, and the same website may have multiple URLs.
In addition, in the field of computer science, reflection refers to a class of applications that are self-describing and self-controlling. That is, such applications use some mechanism to implement self-representation and examination of their own behavior, and can adjust or modify the state and related semantics of the behavior they describe according to the state and results of that behavior.
Reflection is supported by modern software development platforms, tools, and programming languages. For example, on the Java and .NET platforms, reflection can be used at runtime to obtain the metadata of running objects directly.
In addition, in the present invention, encoding and decoding are performed by an object-based encoding system. Fig. 4 is a schematic diagram of the architecture of the encoding system of the present invention. As shown in fig. 4, the encoding system is mainly divided into three parts: a client, an encoding server, and a data storage end. The encoding server and the data storage end together form a coding warehouse.
As shown in fig. 4, the client may obtain the corresponding data object by sending a code to the coding warehouse, and may obtain the corresponding code by sending a new data object to the coding warehouse. Inside the coding warehouse, the encoding server provides services to the client. A coding warehouse may include one or more data storage ends, in which the actual data is stored. The encoding server may send data requests to a data storage end to retrieve, update, or insert the relevant data.
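The client/warehouse interaction in fig. 4 can be sketched with an in-memory stand-in. The classes `DataStore` and `CodingServer`, the integer codes, and the rule for choosing a storage end are assumptions made for illustration only; the text does not specify how codes are allocated or how data is distributed among storage ends.

```python
import itertools


class DataStore:
    """Stand-in for one data storage end (file store, database, cloud storage...)."""

    def __init__(self):
        self._rows = {}

    def insert(self, code, obj):
        self._rows[code] = obj

    def get(self, code):
        return self._rows.get(code)

    def find(self, obj):
        for code, stored in self._rows.items():
            if stored == obj:
                return code
        return None


class CodingServer:
    """Stand-in for the encoding server that fronts one or more storage ends."""

    def __init__(self, stores):
        self._stores = stores
        self._next_code = itertools.count(1)

    def encode(self, obj):
        """Return the existing code for obj, or register it and return a new code."""
        for store in self._stores:
            code = store.find(obj)
            if code is not None:
                return code
        code = next(self._next_code)
        # Hypothetical placement rule: spread objects across the storage ends.
        self._stores[code % len(self._stores)].insert(code, obj)
        return code

    def decode(self, code):
        """Return the data object that the reference code points to."""
        for store in self._stores:
            obj = store.get(code)
            if obj is not None:
                return obj
        raise KeyError(code)


server = CodingServer([DataStore(), DataStore()])
code_a = server.encode("handwritten glyph A")
code_b = server.encode("handwritten glyph B")
```

A client that later sends `code_a` receives the stored object back, while an encoded stream carrying only the codes reveals nothing about the objects without access to the warehouse.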
The coding warehouse provides a centralized coding service, so that different clients can code the shared data object and the coding meta object by referring to the coding. Still further, various systems may register new code meta-objects with the code repository to meet various different coding requirements. This centralized coding service facilitates the integration and exchange of data for the various systems. In general, the coding warehouse is internally provided with a data access control system, so that different access rights can be provided for different data objects and coding meta-objects. In particular, the encoded meta-object and the data object may be stored on different data stores and/or set with different data access rights. In an object-based encoding system, the encoding meta-information is stored in an encoding repository, the data object itself may be present in the encoded stream (content encoding) or in a storage system of the encoding repository (reference encoding), and the reference encoding of the data object is present in the encoded stream. The data objects in the encoded stream and the encoded repository may be placed in different secure channels. The separation of this information has on the one hand a natural security and on the other hand a better coding efficiency.
In a specific implementation, the data storage end can be implemented by using different storage systems such as file storage, a relational database, a NoSQL database, cloud storage and the like.
In particular, the invention proposes a novel object-based coding and decoding scheme and system, which is also an open solution. In contrast to standard coding schemes, object-based open coding schemes may be fully personalized, non-standard. Such non-standards are different from the traditional standards that were first established and then used by organizations or institutions, but are essentially based on the de facto standards (code specifications) of the code warehouse. The scheme not only can provide more flexible and various data services, but also can provide more reliable security guarantee for the data.
The coding scheme of the invention can code data of any type and any length, can have any coding format and any coding word length, and can ensure that the coding rule is not fixed, i.e. the coding rule can be randomly changed according to the requirement. So that a fully personalized code can be created. In other words, the encoding scheme of the present invention is a coding scheme that can encode an arbitrary object and that can be independent of the length of the object data, the encoding rule, the encoding word length, and the like. This breaks through the inherent form and limitations of the existing standard codes. This coding scheme can be arbitrarily extended. The same code can be reused in different coding processes without mutual influence, so that the utilization rate of the code is greatly improved.
The idea of the coding scheme of the invention is to create a coding specification for a data object based on metadata of the data object and to generate a code for the data object based on the coding specification. In other words, the present invention may obtain characteristics or structures of a data object in an encoded manner and generate a corresponding encoding for the data object based on the characteristics and/or structures of the encoded object.
Furthermore, when data is based on an existing standard text coding scheme, every party involved in transmitting, receiving, or storing it has the opportunity to obtain all of the information it contains. This undermines the confidentiality of the data. It also makes the volume of transmitted data very large, which increases the network bandwidth and CPU load, especially for large blocks of data, and thus reduces transmission efficiency.
Another feature of the present invention is that the data object to be transmitted is stored in the code repository, corresponding data access rights are set, and the corresponding reference code is obtained. During transmission, only the reference code of the data object needs to be sent, and only a final recipient holding the data access right can obtain the complete data. This can greatly reduce the amount of data transmitted while increasing the security and reliability of the data.
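As a rough sketch of this idea, the following example stores a large data object together with a list of permitted readers and then transmits only the reference code. `GuardedRepository`, `AccessDenied`, and the user names are hypothetical, and a simple set of user names stands in for whatever access control system a real code repository would use.

```python
class AccessDenied(Exception):
    """Raised when a user lacks the right to decode a reference code."""


class GuardedRepository:
    def __init__(self):
        self._objects = {}  # reference code -> data object
        self._readers = {}  # reference code -> users allowed to decode it

    def encode(self, data, readers):
        code = len(self._objects) + 1
        self._objects[code] = data
        self._readers[code] = set(readers)
        return code

    def decode(self, code, user):
        if user not in self._readers[code]:
            raise AccessDenied(f"{user} may not decode {code}")
        return self._objects[code]


repo = GuardedRepository()
report = b"x" * 1_000_000  # a large block of data
code = repo.encode(report, readers={"bob"})

# Only `code` travels over the network; a relay that sees it cannot decode it.
received = repo.decode(code, user="bob")
try:
    repo.decode(code, user="relay")
    relay_got_data = True
except AccessDenied:
    relay_got_data = False
```

In this sketch the transmitted value is a single small integer, while the one-megabyte object stays in the repository until an authorized recipient asks for it.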
In addition, unlike the encryption process of data in the prior art, in general, the encryption process of data does not require any participation of metadata, but only requires conversion of original data into contents which are not normally identifiable or displayable by an encryption algorithm. Although the invention can also achieve encryption effects, on the one hand, the invention achieves data protection in a completely different way. In particular the data content is protected in an encoding isolated manner by means of metadata of the data object. On the other hand, the encrypted ciphertext data tends to be the same size or larger than the original plaintext, but the present invention requires only a very small amount of information, such as the corresponding reference code, to be transmitted. Furthermore, thanks to the inventive concept, more beneficial functions and operating space are provided for data processing in addition to security. For example, but not limited to, transmission of data may be reduced, reducing network load; the flexibility of encoding also provides greater convenience for subsequent data processing, and so on.
Moreover, although encryption also requires the key and the encrypted data to be stored or transmitted separately, encryption must convert the original data, by a predetermined rule or algorithm, into a code or data completely different from the original so that a third party cannot easily recognize it. The invention, by contrast, can fully preserve the original form of the data content and secure the data without changing the content at all, which a conventional encryption system cannot do.
In addition, an encryption process typically uses a single key, whereas in the encoding process the open system of the invention can assign a different code to each data segment and set different access rights for different users, so that finer-grained security can be achieved.
As previously described, because object reference codes resemble standard character codes, the basic coding form of object codes can be extended from that of standard character codes. A standard character thus becomes a special object (an object numbered by built-in coding metadata), and an object reference code becomes a special character, namely a non-standard character. Unlike the prior art, the present invention can directly accept the digitized result of natural human output, divide it into data objects according to certain rules, and place them in the code repository to form non-standard characters. Here a non-standard character is a reference code based on the code repository, and a data object is a data segment obtained by splitting the digitized result of natural output. The system need not be concerned with the content of each character or with the relationship between adjacent characters, so data can be stored and processed character by character, as in existing systems based on standard text. This also leaves considerable room for flexibility in subsequent editing, encoding, and storage operations.
Preferably, the invention can build a personal font library for each writer by assigning a custom unique code, or form of code, to all or part of the digitized result naturally output by each individual. Because no information entered in advance by the user is required as a reference, the user can build or supplement his or her own font library at any time simply by writing, which avoids the need, disclosed in Chinese patent CN103136769A, to enter information such as a reference font library in advance.
The invention can also place object reference codes in different coding spaces. Examples include user coding spaces divided by user, in which different users can use the same reference code for different data objects in the code repository, as well as coding spaces divided by date, by geographic location, by department, by online session, and so on. A coding space divided by session has a very strong security property: the reference codes of the data exist only in the coding space of that session, and when the session ends the coding space disappears and none of the codes in it can be decoded correctly. This property can be used to achieve a "burn after reading" effect. Preferably, introducing coding spaces and adopting variable-length coding can greatly reduce the storage consumed by reference codes and improve the efficiency of transmission, processing, and storage.
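The behaviour of coding spaces can be sketched as follows. The class `SpacedRepository`, the space identifiers such as `"user:alice"` and `"session:42"`, and the per-space counter are illustrative assumptions, not part of the described system.

```python
class SpacedRepository:
    def __init__(self):
        self._spaces = {}  # coding space id -> {reference code: data object}

    def encode(self, space, data):
        objects = self._spaces.setdefault(space, {})
        code = len(objects) + 1  # codes are only unique within one space
        objects[code] = data
        return code

    def decode(self, space, code):
        return self._spaces[space][code]

    def close(self, space):
        """Discard a coding space; its codes can no longer be decoded."""
        self._spaces.pop(space, None)


repo = SpacedRepository()
a = repo.encode("user:alice", "alice's glyph")
b = repo.encode("user:bob", "bob's glyph")
s = repo.encode("session:42", "one-time note")

note = repo.decode("session:42", s)
repo.close("session:42")  # the session ends
try:
    repo.decode("session:42", s)
    still_readable = True
except KeyError:
    still_readable = False
```

Here the same reference code resolves to different data objects in the two user spaces, and the note stored in the session space becomes undecodable once the session is closed.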
Due to the rapid development of modern storage technology, the continuous expansion of storage means makes large-capacity and mass storage possible, and particularly under the background of taking cloud storage as strong support, the original local preservation of the digitized content naturally output by all human beings becomes possible.
It has been estimated that even if a person wrote every day for 60 years, the total storage needed for the handwritten information would not exceed 250 GB. For existing mass storage and cloud storage technology, this is a trivial amount. This makes it possible to preserve original works (e.g., novels, musical compositions, prints, etc.) in full.
In addition, when combining the handwriting input system described herein above with the object-based coding scheme concept, a new data processing system can be built as follows. The new data processing system introduces the concept of an encoding repository, and an application can not only query and use existing encoding meta-objects in the encoding repository, but also register and use new encoding meta-objects. The new system breaks through the limitations of the existing systems from four different levels.
First level, built-in security
In the new data processing system, literal coding is non-standardized. The literal code and the corresponding decoded information are stored in the application system and the code repository, respectively. The code repository can support different levels of code isolation for users, applications, content, etc. simultaneously. Thus, access and use of text content may be authorized by access control management of the code repository. That is, the new data processing system has built-in security.
This security is multi-level. We can set different access rights for different users, different applications, different literal content, and even different encodings. This is completely impossible in conventional data processing systems based on standardized literal coding.
In addition, not only the simple text content, but also the application system using the new data processing system code and the data have corresponding security.
Second level, comprehensive coding capability
In existing data processing systems, various general-purpose and specialized text formats, such as XML, JSON, CSV, and RTF, have been established to describe general-purpose and specialized data structures. However, these formats use the same coding standard for both markup and content, which imposes many limitations on both content text and markup text and makes them relatively inefficient to store and parse. For example, in XML the characters ">", "<", and "&" have special meanings and cannot be used directly in text content. We have to replace them with the escape sequences "&gt;", "&lt;", and "&amp;", or protect the text by placing it between "<![CDATA[" and "]]>" or inside quotation marks.
In new data processing systems, open coding allows us to fully break through these limitations. We can use some coding types for the tags and other types for the text content, and the corresponding text parser can distinguish which text is the tag and which text is the content based on the coding metadata.
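A minimal sketch of how a parser might separate markup from content using coding metadata rather than reserved characters is given below. The meta-code values `META_TAG` and `META_TEXT`, the tuple representation of the encoded stream, and the tag payloads are assumptions chosen for illustration.

```python
META_TAG = 1   # hypothetical meta-code for markup
META_TEXT = 2  # hypothetical meta-code for text content

# Each element of the encoded stream carries its own meta-code, so the
# content may contain "<", ">" and "&" without any escaping.
stream = [
    (META_TAG, "open:formula"),
    (META_TEXT, "a < b && c > d"),
    (META_TAG, "close:formula"),
]


def split_markup(encoded):
    """Separate tags from content using only the meta-code of each element."""
    tags = [payload for meta, payload in encoded if meta == META_TAG]
    text = [payload for meta, payload in encoded if meta == META_TEXT]
    return tags, text


tags, text = split_markup(stream)
```

Because the parser never inspects the characters of the content, no escape sequences or CDATA sections are needed in this sketch.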
Meanwhile, because coding in the new system is arbitrary, anything that can be encoded as a sequence can be stored and encoded by the system, such as musical melodies, dance movements, chess records, video subtitles, and even computer instructions. The stored result has two parts: the data objects in the code repository, which may be multimedia data or proprietary data, and the resulting code sequence. Such reference coding of data objects is not unique to the present system; conventional data processing systems based on standardized coding can also encode arbitrary data, but far less simply, efficiently, and naturally than an object-based coding system.
Third level, conciseness and high efficiency
An object code in an object-based coding system can consist of a meta-code part and an instance-code part. For a given system the number of meta-codes is very limited: two bytes (16 bits) can encode 65,536 meta-codes, corresponding to more than 60,000 object types, which is enough for most application systems. For a specific object, because object codes are arbitrary, the instance code can simply be a number: 4 bytes (32 bits) can encode more than 4 billion individual objects, and since reference codes can also be placed in different coding spaces, 32 bits is sufficient for most systems. That is, 6 bytes can represent the reference code of an object in most application systems. Furthermore, with variable-length coding, default meta-codes, client-side codes, and similar measures, an object reference code can often be expressed in even fewer bytes. This is much more concise and efficient than current cloud storage schemes, which use a dozen or even several dozen bytes to reference a single data block in order to prevent collisions between data blocks.
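The 6-byte layout can be sketched with Python's standard `struct` module. The field order (meta-code first), the big-endian byte order, and the helper names `pack_reference` and `unpack_reference` are assumptions; the description fixes only the field widths. For reference, 2^16 = 65,536 and 2^32 = 4,294,967,296.

```python
import struct

# Assumed layout: big-endian, 2-byte meta-code followed by 4-byte instance code.
REFERENCE = struct.Struct(">HI")


def pack_reference(meta_code, instance_code):
    """Pack a (meta-code, instance code) pair into a fixed 6-byte reference."""
    return REFERENCE.pack(meta_code, instance_code)


def unpack_reference(raw):
    """Recover the (meta-code, instance code) pair from a 6-byte reference."""
    return REFERENCE.unpack(raw)


raw = pack_reference(7, 123456)
meta_code, instance_code = unpack_reference(raw)
```

A variable-length scheme, as mentioned in the text, would shorten this further for small values; the fixed-width form is shown only because it is the simplest to verify.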
In addition, in the new data processing system, the object reference code corresponding data object can be stored in the code warehouse, so that the storage efficiency of the data object can be greatly improved, and the transmission and processing efficiency of the data can be improved. For example, the HTML of the web page is recoded by using an object coding technology, elements and attributes of various tags of the standard HTML are subjected to object coding, related meta information is put into a coding warehouse, the size of the obtained web page document is greatly reduced, and the traffic can be saved for network transmission of the web page.
Fourth level, personalized literal code
In contrast to standard literal coding schemes, the coding schemes used by an object-based data processing system can be personalized and non-standard. This is achieved mainly by isolating context coding spaces: different users, different applications, and so on each have their own context coding space, and a personalized code is resolved by accessing the corresponding personalized context coding space. Each object reference code has a one-to-one correspondence with a data object in the code repository. On text input, the content of the input data object is stored in the code repository, and its location there is converted into the corresponding object reference code. On text output, the system finds the corresponding data object content in the code repository according to the object code and outputs the content to a specific device.
Because of the openness of the object-based coding system, the digitized results output by human beings can be divided and coded in any mode, and any content to be expressed can be expressed, and only the content and the code need to be correspondingly arranged. That is, the data processing system may dynamically add the data object type and its encoding.
Thus, under this system, one can provide input in the most natural manner. Such input is not limited to the handwriting input described above; it can be any data stream, such as, but not limited to, voice, images, multimedia streams, braille, sign language, lip language, semaphore, or even a burst with or without meaning. As content is input, the system automatically stores it in the code repository and encodes it according to its location there. On output, the system retrieves the input content from the code repository according to the object reference code and plays it back naturally.
The handwriting input system described earlier is again taken as an example. In a handwritten text input scenario, the writer writes under a natural writing constraint (such as a row or column constraint), and the system divides the written content according to a natural character segmentation rule (such as the square-block segmentation of Chinese characters) or word segmentation rule (such as the spaces between words in alphabetic languages). It stores the shape of each segmented character or word in the code repository and generates a corresponding reference code. These codes are stored, in a specific typesetting order, as a collection of text content, i.e., literal codes.
It can be seen that the handwritten text input process described above lies between recognition-based and non-recognition handwriting input. Like a text recognition system, it requires the input to be divided into characters and words. Unlike a recognition system, however, it does not analyze the input to find a corresponding standard code; it simply captures the input as it is. The method therefore has no recognition-rate problem: the input is always reproduced 100%, just as in a non-recognition system. It differs from a non-recognition system in that the input content is divided and each part is encoded separately. This allows the encoded results to be processed in the new system like ordinary text, for example edited, copied, pasted, transmitted, searched, and retrieved.
Similarly, a data processing system based on open coding can also be used in input systems based on optical recognition. In particular, when handwriting is captured optically, it does not matter whether the handwriting is sloppy: an optical system based on open coding only needs to divide the input image into lines and words, store the resulting image segments in the code repository, and generate the corresponding image object reference codes. It is worth noting that, because the codes are personalized, the data objects accumulated in the code repository make good training samples. The results of training on them can in turn improve conventional text recognition rates for that particular individual.
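The segmentation and encoding of one handwritten line might be sketched as follows. The gap-based rule, the threshold value, the representation of a stroke as a list of (x, y) points, and the use of a plain dictionary as the code repository are all simplifying assumptions; they are not the segmentation rule of the invention, which depends on the writing constraint and language.

```python
def split_into_characters(strokes, gap=10):
    """Group strokes into characters wherever the horizontal gap exceeds `gap`.

    Assumes the strokes of one row arrive in left-to-right writing order.
    """
    groups = []
    right_edge = None
    for stroke in strokes:
        left = min(x for x, _ in stroke)
        right = max(x for x, _ in stroke)
        if right_edge is None or left - right_edge > gap:
            groups.append([stroke])  # a new character starts here
            right_edge = right
        else:
            groups[-1].append(stroke)  # the stroke belongs to the current character
            right_edge = max(right_edge, right)
    return groups


def encode_line(strokes, repository):
    """Store each segmented shape and return the reference codes in order."""
    codes = []
    for shape in split_into_characters(strokes):
        code = len(repository) + 1
        repository[code] = shape
        codes.append(code)
    return codes


repository = {}
line = [
    [(0, 0), (5, 10)],    # first character, stroke 1
    [(3, 0), (8, 10)],    # first character, stroke 2 (overlaps stroke 1)
    [(30, 0), (35, 10)],  # second character, far to the right
]
codes = encode_line(line, repository)
```

The returned list of codes plays the role of the literal codes described above, while the stroke shapes themselves remain in the repository.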
The data processing system is also suitable for a voice input system, the input voice signals do not need to be identified, and the input voice signals can be stored in a coding warehouse and can be correspondingly coded only by simple processing and dividing.
The data processing system can also be applied to other text input methods, such as braille, lip language, sign language, and semaphore input. Furthermore, new text input methods can be created on the basis of this new data processing system. For example, on a small touch-screen device, particular gestures may be designated as line-break, word-break, and end marks, with content then entered by full-screen handwriting or by voice. The input content is divided according to the word-break marks, and each part is stored in the code repository to obtain the corresponding character code. As another example, a sign language input method based on a 3D glove can be designed. The motion information of the 3D glove is stored as text content in the code repository, the codes correspond to characters, and a certain time interval serves as the separator between actions. The sign language is output by playing back the 3D glove motion information from the code repository through a three-dimensional model.
In summary, the new data processing system has the following advantages:
First aspect, simple nature
The new data processing system does not need to generate specific standard codes, so that the simplest and natural input mode can be designed for common users, and the result codes can be directly coded into personalized codes.
Without the limitations of a coding standard, users can input any content they want to express, including multimedia data such as graphics, symbols, sound, and video. Unlike in conventional text recognition systems, text input in the new data processing system does not require recognition, which ensures uninterrupted, efficient input and a smooth, natural user experience.
Second aspect, safety
The new data processing system uses non-standardized, object-based reference codes. The content cannot be understood from a literal code sequence alone; the specific content that each code refers to must also be obtained from the code repository. The access control of the code repository ensures the security of the data content. At the same time, because the reference code and the data object are separated, the readability or visibility of non-standard text once a code sequence has been obtained depends entirely on the security settings of the corresponding code repository. The code repository is thus essentially an all-purpose cryptographic server. Furthermore, the code sequence and the data in the code repository can be placed in different secure channels, which makes it much harder for a data thief to obtain all of the data. In addition, unlike traditional standardized literal codes, which are context-free, non-standard text based on object coding can be context-dependent. Through isolation of context spaces, the same code can vary by person, application, document, time, place, and so on. An application system, or even an individual user, can register new context specifications with the code repository, thereby introducing new coding spaces that further isolate the literal codes. Compared with traditional data processing systems, the new system has inherent security and privacy.
A software developer can store the encoded non-standard text for the user and can further process it, for example by searching or analyzing it, but cannot understand the actual non-standard text content. Likewise, the code repository provider may analyze, process, or even recognize the content in the code repository, but because it does not have the final ordering of the object reference codes, it cannot know the non-standard text content. Only users who have access rights to both the corresponding application system and the code repository can obtain the complete text. Thus, for a network application that requires authorized access, the user must hold both rights, the application right and the code repository right, to obtain the complete non-standard text.
Because of the openness of the object code, the data content (comprising the traditional standardized literal code) to be protected can be directly recoded, and the authorized access service of the code warehouse can specially control the special codes, so that the encryption of the specific conditions and the specific literal code is realized. The specific conditions here may be context-based rules (time, place, environment, user, application, etc.), thus implementing complex, flexible literal code security.
The code warehouse can also provide services in terms of identity authentication and digital copyright protection of users or systems on the basis of context-aware security.
Third aspect, open
From object reference encoding to non-standard literal content, from encoding services to non-standard literal services, object-based encoding data processing systems are an entirely open system. Any data object may be placed in the code repository and its reference code recorded by non-standard text. The software developer may register a new context object specification, a new coding space, a new coding meta object, a new data object with the system, may add a new coding service, a new non-standard text service (including a new non-standard text input/output, non-standard text editing, etc. system), etc. to the system.
Meanwhile, due to the more efficient and safe general text data (including nonstandard text and standardized text) solution brought by the new data processing system, a model in any specific field can be built by using the general text data. That is, different application systems may encode their domain models using the object encoding data processing system and deploy the encoding in the encoding warehouse. Thus, the application system and the corresponding data object content not only have the advantages of the new data processing system, such as high efficiency, safety and the like, but also can fully utilize various word services to process the data.
Fourth aspect, flexibility
In a non-recognition handwriting application system, people can input arbitrary text and graphic content; voice recording software can record a person's speech; video recording software can record a person's movements (including sign language). Unlike these full-content recording systems, the new data processing system divides the same content and then stores and encodes the parts separately. In the process, the system can directly filter out useless information, such as noise in audio or specks in scanned characters, and retain only the important information that people care about. In addition, by normalizing content so that identical content maps to the same data object, repeated content does not need to be stored repeatedly, which greatly reduces storage space and improves transmission speed. More importantly, existing word processing infrastructure and tools can be used to process and manipulate the literal code content formed in the new data processing system, for example for searching, indexing, and editing.
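One possible way to avoid storing repeated content is to key data objects by a hash of their bytes, as in the sketch below. The use of SHA-256 from the standard `hashlib` module and the class name `DedupRepository` are assumptions, and the sketch detects only byte-identical content; handwritten shapes are rarely identical, so a practical system would need some normalization step before comparison.

```python
import hashlib


class DedupRepository:
    def __init__(self):
        self._by_digest = {}  # content digest -> reference code
        self._objects = {}    # reference code -> data object

    def encode(self, data: bytes) -> int:
        """Return the existing code for identical content, or store it once."""
        digest = hashlib.sha256(data).digest()
        code = self._by_digest.get(digest)
        if code is None:
            code = len(self._objects) + 1
            self._by_digest[digest] = code
            self._objects[code] = data
        return code

    def stored_objects(self) -> int:
        return len(self._objects)


repo = DedupRepository()
codes = [repo.encode(word) for word in (b"the", b"cat", b"the")]
```

The repeated segment is stored once and simply referenced twice in the code sequence.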
In addition, flexibility is also manifested in the deployment of the code and access control. Flexibility in code deployment means that for the same code type, we can selectively configure it to different code spaces, thus possessing different security levels and visibility. The flexibility of access control means that a user or an administrator of an application system can very flexibly configure access to object codes by setting access control to a code repository: on the one hand, the access control can be configured to different coding levels, can be coding space, or coding metadata, even specific data objects; the access control to the code on the other hand may be based on different conditions, such as time, place, user, application, state of the domain model, etc.
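The two kinds of flexibility in access control, namely different coding levels and different conditions, can be sketched as a small rule table. The rule keys, the predicate helpers `office_hours` and `is_user`, and the allow-by-default policy are assumptions made for this example; a production system might well deny by default.

```python
from datetime import time


def office_hours(ctx):
    """Condition on time: allow only between 09:00 and 18:00."""
    return time(9, 0) <= ctx["time"] <= time(18, 0)


def is_user(name):
    """Condition on user: allow only the named user."""
    return lambda ctx: ctx["user"] == name


# Rules attached at three levels: coding space, coding metadata, data object.
rules = {
    ("space", "finance"): [office_hours],
    ("meta", 7): [is_user("alice")],
    ("object", ("finance", 7, 42)): [],
}


def may_decode(space, meta, instance, ctx):
    """Allow decoding only if every applicable condition holds."""
    checks = (
        rules.get(("space", space), [])
        + rules.get(("meta", meta), [])
        + rules.get(("object", (space, meta, instance)), [])
    )
    return all(check(ctx) for check in checks)


alice_at_ten = {"user": "alice", "time": time(10, 0)}
bob_at_ten = {"user": "bob", "time": time(10, 0)}
alice_at_night = {"user": "alice", "time": time(20, 0)}
```

With these rules, `may_decode("finance", 7, 42, alice_at_ten)` is allowed, while the same request from `bob_at_ten` or from `alice_at_night` is refused.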
Fifth aspect, high efficiency
In a networked environment, efficient storage and transmission is guaranteed by the split storage of data object encodings and content in new data processing systems. The content of the data object needs to be transferred from the coding store to the user only when it is really needed.
In non-standard word processing systems, the content of unidentified data objects formed in the new data processing system may be a very well personalized recognition training sample. The trained character recognition system can more efficiently recognize personalized non-standard characters into corresponding standard codes.
In non-standard word data processing systems, word format information may be stored in a code repository. Characters in a text format are coded in a non-standard way, and text data can be randomly used without escape, so that efficient text data transmission and processing are achieved.
Further, the new data processing system has the following main significance:
First aspect, facilitating popularization and penetration of personal computing
The new data processing system makes it possible to enter text on a computer in a way that resembles natural writing, which solves the problem of computer input being difficult. A secure, natural data processing system is more acceptable to the average person. Computer text entry no longer depends on educational background or familiarity with keyboards, which favors the spread and deeper adoption of personal computing.
Second aspect, facilitating popularization and penetration of cloud computing
In recent years, more and more internet applications and services have been moving to cloud computing, a computing model based on on-demand consumption and dynamic allocation. However, security is a challenge that cloud-based systems, especially public clouds, cannot ignore. In the new data processing system, separating data object codes from their content can greatly raise the security level of the system. As long as the code repository is deployed inside the enterprise's firewall, an enterprise can use public-cloud applications and services with confidence and can let its employees freely use their personal mobile devices inside the enterprise. All enterprise data stored in the public cloud is meaningless gibberish to anyone outside the firewall. Similarly, a household or individual only has to protect their own code repository, and the information stored in the public cloud remains safe and reliable. Here, the code repository plays the role of a codebook. This high level of security can accelerate the adoption of public cloud services by businesses and individuals.
Third aspect, facilitating development and popularization of the Internet of Things
The Internet of Things integrates intelligent sensing technology, identification technology, and general computing technology, and has been called the third wave of the information industry after the computer and the internet. The Internet of Things is an extension of the internet. On the one hand, it has an urgent need for object addressing codes and identifiers at each of its three layers (the perception layer, the network layer, and the application layer), and its nodes are huge in number, varied in type, and limited in processing capacity, so the related coding faces great challenges and no general standard has yet emerged. A compact and flexible object coding mechanism can meet these needs well.
On the other hand, the large number of sensors in the sensing layer need to store the sensed data records, and the object coding technology can effectively provide relevant coding storage support.
Fourth aspect, is beneficial to cultural protection and inheritance
There are now about seven thousand languages in use worldwide, and even more dialects. Unicode covers only a few hundred of them. Under existing computer data processing systems, the scripts of many languages are difficult to input into a computer system. In the new data processing system, there is almost no restriction on the use of languages and scripts (for handwritten text the only restriction is the typesetting mode, which must be specified in advance). One can store any non-standard text content directly in a computer system or communicate it to others through a computer. This breaks the unreasonable constraint of existing computer text, namely that characters must be standardized before they can be used.
Keyboard input of existing computer characters has caused many people to forget how to write characters when they pick up a pen. The new data processing system is capable of preserving the original human tradition of handwriting.
Fifth aspect, environmental protection
The new data processing system makes the direct input and use of text on electronic equipment more natural, convenient, and safe. This favors a paperless environment and can ultimately reduce the use of paper.
The encoding processing method and decoding processing method provided in the following embodiments of the present invention can be implemented based on the above-described encoding system. The technical scheme of the invention is further described in detail through the attached drawings and specific embodiments.
Fig. 5A is a flowchart of an embodiment of an encoding processing method provided by the present invention, where, as shown in fig. 5A, an execution body of the method in this embodiment is an encoding system, and the method includes:
step 101C, according to the received encoding processing request, acquiring the data object to be encoded and the metadata thereof.
In the present embodiment, the metadata of the acquisition object is mainly encoded metadata of the acquisition object. The encoded metadata may be a subset or a full set of metadata. Such as, but not limited to: type of object, corresponding data structure, constraints for storage and transmission, control, etc. Metadata of an object is the basis of the present system and must be extracted from the data in some way. Metadata for objects can be obtained automatically using reflection mechanisms in modern software platforms such as Java, # Net, etc.
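As an illustration of obtaining metadata by reflection, the sketch below uses Python's standard `dataclasses` module in place of the Java or .NET reflection APIs mentioned in the text. The `Stroke` type, its fields, and the shape of the returned dictionary are hypothetical and serve only as an example of a data object.

```python
from dataclasses import dataclass, fields


@dataclass
class Stroke:
    """Hypothetical data object: one pen stroke of handwritten input."""
    pen_width: int
    points: list


def extract_metadata(obj):
    """Collect the object's type name and field types at runtime."""
    return {
        "type": type(obj).__name__,
        "fields": {f.name: f.type.__name__ for f in fields(obj)},
    }


metadata = extract_metadata(Stroke(pen_width=2, points=[(0, 0), (5, 10)]))
```

The resulting dictionary describes the structure of the object without containing any of its content data, which matches the distinction the description draws between the metadata part and the content data part.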
In addition, in this embodiment, a data object (also referred to herein simply as an object) is the basic unit of data processing in the present invention, that is, the target that the present invention encodes. It may take any data form: individual words or symbols or parts of them, audio, video, multimedia streams or fragments thereof, codes themselves, documents, and so on. It comprises at least a metadata part (the metadata of the data object) and typically also a content data part, which is what remains of the data object after the metadata is stripped away and is also called the content of the data object, the data content, or the content data. The content data may or may not be related to the metadata part.
Metadata is data about a data object: a description of the characteristics, attributes, inherent logical relationships and/or structure of the data object. Metadata may reside inside the data, exist independently of the data, accompany the data, or be combined with the data. Metadata may include information such as the type of the object, creation and/or modification dates, historical version information, data structure, interfaces, storage constraints, transmission constraints, encoding context constraints, and so forth. Specific metadata examples may include, but are not limited to, the following: a description of an assembly, including its identity (name, version, culture, public key), the types it exports, the other assemblies it depends on, and the security permissions it needs to run; a description of a type, including its name, visibility, base class and implemented interfaces, and its members (methods, fields, properties, events, nested types); attributes, that is, additional descriptive elements that modify types and members; the header and/or structure information of a table; the palette in a picture file; and the like.
Metadata is different for different data objects. For example, for the metadata portion of a data object we refer to metadata of the data object; and for the later mentioned metadata part of the encoded object we can refer to encoded metadata. Metadata corresponding to a data object can be obtained or added at runtime, which is the basis for the present system to encode the data object.
Step 102C, according to the code warehouse and the data object and metadata thereof, obtaining the object code of the data object.
In this embodiment, the data object to be encoded and the metadata thereof are obtained according to the received encoding processing request, and the object encoding of the data object is obtained according to the encoding warehouse, the data object and the metadata thereof.
Further, for example, fig. 5B is a flowchart of one implementation of the step 102C in fig. 5A, and as shown in fig. 5B, one implementation of the step 102C is as follows:
step 102C1, selecting or creating a coding protocol according to the coding warehouse and at least a part of the metadata, and generating a meta code corresponding to the metadata according to the coding protocol.
In this embodiment, metadata related to the following encoding process may be further selected from among the metadata based on a predetermined extraction rule, and then a corresponding encoding protocol may be created or generated based on these selected metadata.
In addition, a coding protocol is selected or created, and saved, based on the metadata extracted from the object. The code for the object is then generated using this coding protocol. A default coding protocol may also be set for the system to use for encoding and decoding, in which case a protocol only needs to be selected and no new one is created. Some or all of the coding protocols may be selected or created interactively by the user. It should be noted that a coding protocol generated during encoding can either be destroyed automatically once encoding is complete (after the encoding factory has delivered its output) or be stored.
The process of adding or creating coding conventions may be performed at the time of object modeling; or may be performed while a particular application system is running. The method can be automatically performed by a certain rule or performed in an interactive mode.
The coding protocol mainly comprises a coding mode of an object, coding constraints of an internal structure of the object and the like.
Step 102C2, encoding the data content of the data object according to the encoding protocol, obtaining an instance code, and obtaining an object code corresponding to the data object according to the meta code and the instance code.
Wherein the object code is a reference code form or a content code form.
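Steps 102C1 and 102C2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method of the embodiment: the `CodeRepository` class, the JSON serializer and the 2-byte meta code prefix are all hypothetical choices made for the example.

```python
import json


class CodeRepository:
    """Hypothetical in-memory stand-in for the coding warehouse."""

    def __init__(self):
        self._protocols = {}   # meta code -> coding protocol
        self._by_type = {}     # object type -> meta code

    def select_or_create_protocol(self, metadata):
        """Step 102C1: reuse the protocol registered for this type, or create one."""
        obj_type = metadata["type"]
        if obj_type not in self._by_type:
            meta_code = len(self._protocols) + 1
            self._protocols[meta_code] = {"type": obj_type, "serializer": "json"}
            self._by_type[obj_type] = meta_code
        meta_code = self._by_type[obj_type]
        return meta_code, self._protocols[meta_code]


def encode_object(repo, content, metadata):
    """Step 102C2: encode the content, then join meta code and instance code."""
    meta_code, protocol = repo.select_or_create_protocol(metadata)
    if protocol["serializer"] != "json":
        raise ValueError("this sketch only knows the json serializer")
    instance_code = json.dumps(content, sort_keys=True).encode("utf-8")
    # Assumed joining rule: 2-byte big-endian meta code, then the instance code.
    return meta_code.to_bytes(2, "big") + instance_code
```

Encoding two objects of the same type reuses the same meta code, while a new type causes a new protocol to be created; this mirrors the "selecting or creating" wording of step 102C1.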
Further, as can be seen from fig. 3, the encoding system mainly includes an encoding warehouse and a client, and the encoding process flow can be implemented in two ways, which are described in detail below.
the first implementation mode:
step 1a, the client obtains a data object to be encoded and metadata thereof according to the received encoding processing request.
Step 2a, the client transmits the data object to be encoded and its metadata to the encoding warehouse.
Step 3a, the encoding warehouse selects or creates a coding protocol according to at least a part of the metadata, and generates a meta code corresponding to the metadata according to the coding protocol.
In this embodiment, the object coding conventions (also called simply the coding conventions or the coding protocol) are the specifications and constraints on how data objects are encoded and decoded. They may include the coding form of the data object (content coding, reference coding, or a mixture of both), the coding constraints of the object metadata (for example, details of the associated data serialization such as the scheme, word length, byte order and data alignment), and so on. The object coding conventions may also be part of the metadata of the data object.
The object coding conventions may be added manually (by a modeler) or automatically (by a tool) at the time of object modeling, or may be added interactively (by a user) or automatically (by a system policy) at runtime.
Encoding metadata refers to metadata related to the encoding and decoding of a data object. The encoded metadata may be part or all of the metadata. The encoded metadata of a data object is the basis for the system to encode and decode the data object.
Step 4a, the encoding warehouse encodes the data content of the data object according to the coding protocol to obtain an instance code, and obtains the object code corresponding to the data object according to the meta code and the instance code.
In this embodiment, the data objects and their metadata are stored in an encoding repository. In addition, the code repository generates a corresponding object code that is actually a reference code for the data object in the code repository.
Step 5a, the client receives the object code returned by the encoding warehouse.
The second implementation mode is as follows:
step 1b, the client obtains the data object to be encoded and the metadata thereof according to the received encoding processing request.
Step 2b, the client queries the encoding warehouse to select or create a coding protocol according to at least a part of the metadata, and generates a meta code corresponding to the metadata according to the coding protocol.
In this embodiment, the client side makes an encoding processing request to the encoding server side in the encoding repository, and obtains the meta-encoding corresponding to the encoding meta-object (actually, the reference encoding of the encoding meta-object in the encoding repository).
Optionally, the meta-code may include one or a combination and/or nesting of: type coding, spatial coding, and context coding.
Step 3b, the client encodes the data content of the data object according to the coding protocol to obtain an instance code, and obtains an object code corresponding to the data object according to the meta code and the instance code.
In the present embodiment, in the above step 3b, the generation of the instance code is divided into two cases according to the two forms of object coding, content coding and reference coding. For an instance code in the content coding form, the coding client directly serializes the content of the data object into the instance code according to the coding protocol. For an instance code in the reference coding form, the coding client sends a coding request to the coding server; the coding server obtains the corresponding data object, coding protocol and related information from the request, stores the data object in the coding warehouse according to the coding protocol and related information, and generates a corresponding instance code, which is returned to the client.
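The two ways of producing an instance code can be contrasted in a small sketch. The `CodeServer` class, the JSON serialization and the 4-byte reference width are assumptions made purely for illustration.

```python
import json


class CodeServer:
    """Hypothetical coding server that stores objects and hands out reference codes."""

    def __init__(self):
        self._store = {}

    def store(self, content):
        """Keep the object and return a locator for it."""
        ref = len(self._store) + 1
        self._store[ref] = content
        return ref.to_bytes(4, "big")   # assumed 4-byte reference

    def load(self, instance_code):
        """Resolve a reference-form instance code back to the stored object."""
        return self._store[int.from_bytes(instance_code, "big")]


def make_instance_code(content, form, server):
    """Return an instance code in 'content' or 'reference' form."""
    if form == "content":
        # Content form: the client serializes the content itself.
        return json.dumps(content).encode("utf-8")
    if form == "reference":
        # Reference form: the server stores the object and returns a locator.
        return server.store(content)
    raise ValueError(f"unknown coding form: {form}")
```

In the content form the instance code carries the data; in the reference form it is only meaningful together with the server that issued it.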
Correspondingly, the decoding of an object code is the inverse of the encoding process. Generally, the coding server obtains the object code to be decoded from the decoding request of the coding client, locates the data object in the code repository according to that code, and returns it to the client.
Specifically, for an object code that is read in multiple steps, the coding client parses the object code into a meta code and an instance code according to a preset rule and sends a decoding request for the meta code to the coding server. It thereby obtains the corresponding coding meta-object, decodes the instance code according to the coding protocol and related information in the coding meta-object, and combines the result with the coding meta-object to obtain the corresponding data object.
For the two forms of object coding, content coding and reference coding, the decoding of the instance code is likewise divided into two cases. For the content coding form, the coding client can decode the instance code directly into the content of the corresponding data object according to the coding protocol. For the reference coding form, the coding client sends an instance code decoding request to the coding server; the coding server obtains the corresponding instance code, coding protocol and related information from the request, uses them to locate the data object in the code repository, and returns the data object to the client.
In addition, in the decoding process based on object codes, the system first acquires the encoding metadata and then obtains the corresponding content code based on this metadata. Specifically, the encoding metadata may include coding type information for locating, loading or transmitting the encoded content, constraint information for the target coding space to which the code belongs, and the like. The encoding metadata is itself encoded as a meta code, through which it can be obtained. In practice, what a meta code refers to in the code repository is mainly a coding meta-object. The meta code is typically an integral part of the code. After the decoder parses the meta code out of the code, it can acquire the corresponding encoding metadata according to a certain mechanism.
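The decoding flow just described can be illustrated as follows. The preset splitting rule (a 2-byte meta code prefix), the two module-level tables standing in for the code repository, and the JSON decoding are all hypothetical.

```python
import json

# Hypothetical repository state: coding meta-objects keyed by meta code,
# and stored data objects keyed by reference number.
META_OBJECTS = {
    1: {"type": "Note", "form": "content"},
    2: {"type": "Glyph", "form": "reference"},
}
STORED_OBJECTS = {7: {"strokes": 3}}


def decode_object(object_code):
    """Split an object code and decode the instance code according to its form."""
    # Assumed rule: the first 2 bytes are the meta code, the rest the instance code.
    meta_code = int.from_bytes(object_code[:2], "big")
    instance_code = object_code[2:]
    meta_object = META_OBJECTS[meta_code]
    if meta_object["form"] == "content":
        # Content form: the instance code carries the serialized content.
        content = json.loads(instance_code.decode("utf-8"))
    else:
        # Reference form: the instance code locates the object in the repository.
        content = STORED_OBJECTS[int.from_bytes(instance_code, "big")]
    return meta_object["type"], content
```

In a real system the two lookups would be requests to the coding server; here they are dictionary accesses so that the example stays self-contained.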
In this embodiment, since this is an encoding system, we can also treat the encoding metadata as a data object, that is, a data object whose content is the encoding metadata. It may be referred to as a coding meta-object and may have its own metadata. Therefore, the encoding metadata, as a data object, may also have a corresponding code, which is called the meta code.
Fig. 6 shows the relationship among data objects, metadata, coding conventions and coding meta-objects. As shown in fig. 6, the coding meta-object is also a data object (relative to a common data object, it is an object at the M1 abstraction level), and the model formed by its metadata (at abstraction level M2) is called the coding metamodel. The encoding metadata of the coding meta-object is part of the coding metamodel.
The coding metamodel is the cornerstone of an object coding system. In general, it is relatively stable in operation and seldom changes dynamically, but it is extensible. That is, the encoding metadata of the coding meta-objects is built into the system. Thus, the system can directly store, transmit, and encode these coding meta-objects.
An object coding system may correspond to a unique core coding element model (there may be an augmentation mechanism). Specifically, fig. 7 is a schematic diagram of the core coding element model.
In addition, does a meta code, being the object code of a coding meta-object, also have its own meta code? This depends on the specific design of the coding metamodel and of the encoding and decoding method. If there is only one coding meta-object in the coding metamodel, then the meta code corresponds entirely to that coding meta-object and needs no meta code of its own. If there are multiple coding meta-objects in the metamodel and they can all be encoded into the same meta code at once, a meta code of the meta code is not required either. Otherwise, a meta code of the meta code is required to distinguish them. Sometimes there is a hierarchical relationship between the coding meta-objects, and multi-level decoding may be required to obtain the coding meta-object of the final data object.
In general, variable-length coding expresses such a meta-object hierarchy more directly and flexibly and is easier to handle: each code word is the meta code of the code word that follows it, which in turn is the meta code of the next, so that multiple levels can be nested.
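The nesting of code words, where each word selects how the next one is read, can be sketched as a table walk. The three tables, their 16-bit values and their names are hypothetical and serve only to show multi-level decoding.

```python
# Hypothetical meta-object hierarchy: each entry maps a 16-bit code word to
# (name, table for interpreting the next code word). None ends the meta code.
TYPE_TABLE = {0x0001: ("glyph", None), 0x0002: ("audio", None)}
USER_TABLE = {0x00A1: ("alice", TYPE_TABLE), 0x00A2: ("bob", TYPE_TABLE)}
ROOT_TABLE = {0x0000: ("no-owner", TYPE_TABLE), 0x0001: ("owned", USER_TABLE)}


def parse_meta_chain(words):
    """Walk code words level by level; return the meta names and the remaining words."""
    table = ROOT_TABLE
    names = []
    position = 0
    while table is not None:
        # The current word is read in the table chosen by the previous word.
        name, table = table[words[position]]
        names.append(name)
        position += 1
    return names, words[position:]
```

Because the number of levels depends on the words themselves, the meta code has a variable length: two words for an object without an owner and three for an owned object in this sketch.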
Specifically, fig. 8 is a conceptual model of object coding, meta coding, instance coding (i.e., the object coding removes the meta coding part) and the data object and the coded meta object, and as shown in fig. 8, the following layer relationships are shown:
1. The coding meta-object may also be used as a data object.
2. The meta code itself can also be an object code.
3. Data objects and coding meta-objects are interrelated.
4. An object code includes a meta code and an instance code.
5. Object codes are associated with their corresponding data objects, which implies the same correspondence between meta codes and coding meta-objects (mainly through relations 1 and 2 above).
In addition, the meta-code includes various examples of the encoded meta-object, and fig. 9 is a diagram showing an example of the meta-code in the present embodiment. As shown in fig. 9, the object code is a 128-bit fixed-length code, and there are only two code meta-objects in the code meta-model: an owner of the object, and an object type. They may or may not be dependent on the definition in the coding metamodel. The corresponding coding logic is different whether it is dependent or independent.
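A fixed-length 128-bit object code with an owner field and a type field might be packed as below. Fig. 9 is not reproduced in this text, so the field widths (32 bits for the owner, 32 for the type, 64 for the instance code) and their order are assumptions chosen only to make the example concrete.

```python
OWNER_BITS, TYPE_BITS, INSTANCE_BITS = 32, 32, 64   # assumed split of the 128 bits


def pack_object_code(owner, obj_type, instance):
    """Pack the three fields into one 128-bit integer, owner in the high bits."""
    for value, bits in ((owner, OWNER_BITS), (obj_type, TYPE_BITS), (instance, INSTANCE_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit its bit width")
    return (owner << (TYPE_BITS + INSTANCE_BITS)) | (obj_type << INSTANCE_BITS) | instance


def unpack_object_code(code):
    """Recover (owner, obj_type, instance) from a 128-bit object code."""
    instance = code & ((1 << INSTANCE_BITS) - 1)
    obj_type = (code >> INSTANCE_BITS) & ((1 << TYPE_BITS) - 1)
    owner = code >> (TYPE_BITS + INSTANCE_BITS)
    return owner, obj_type, instance
```

If the type depends on the owner, the same type value under two owners would name two different types; if it is independent, the type field can be interpreted on its own.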
By way of further example, FIG. 10 is an exemplary diagram of a similar layer-by-layer correlation of code meta-objects (variable length coding of 16-bit word length).
Further, fig. 11 is a schematic diagram of the metamodel of the corresponding code. As shown in fig. 11, there are two coding meta-objects: user and coding type. The coding type may have one owner (01) or no owner (00). Thus, both of the above coding forms are legal. An object code whose meta code consists only of a type code corresponds to a data object without an owner; the other form represents a data object with an owner.
In this embodiment, a meta code is generated based on the metadata and the coding conventions, and an instance code is generated from the data content. These specific steps may be implemented using an encoding factory. The encoding factory is another important component of the system; it can be created dynamically by the code repository, or can exist across components or across systems. The encoding factory may provide direct encoding and decoding services for the related objects.
The code repository may provide two important sets of services: registration of and access to encoding metadata; and encoding and decoding of object reference codes.
The encoding repository may also use an external storage service to store encoding metadata, object data, and the like.
The final object code is generated from the meta code and the instance code based on a predetermined rule. The meta code and the instance code may be formed into an object code in any manner, for example by splicing or by some other operation, so long as the result can be taken apart and the two codes restored at decoding time. The generation of the object code can take place at the client or be performed automatically by the encoding factory, depending on the actual design. Furthermore, a code representing the way the meta code and the instance code are combined or spliced may be included in the final object code. If necessary, the code representing the combination or splicing mode and the object code can be stored separately under different secure channels, each with its own access rights, so that both can be obtained only after authorization and verification, and only then can the meta code and the instance code be correctly separated during decoding.
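One reversible way of splicing the two codes, with the splicing mode kept outside the object code, is sketched below. The two modes and the 2-byte length prefix are hypothetical; the text only requires that the combination can be undone at decoding time.

```python
def combine(meta_code, instance_code, mode):
    """Join two byte strings; mode 0 puts the meta code first, mode 1 the instance code."""
    if mode not in (0, 1):
        raise ValueError("mode must be 0 or 1")
    first, second = (meta_code, instance_code) if mode == 0 else (instance_code, meta_code)
    # A 2-byte length prefix for the first part keeps the splice reversible.
    return len(first).to_bytes(2, "big") + first + second


def split(object_code, mode):
    """Undo combine(); the caller must supply the same mode that was used to join."""
    if mode not in (0, 1):
        raise ValueError("mode must be 0 or 1")
    length = int.from_bytes(object_code[:2], "big")
    first, second = object_code[2:2 + length], object_code[2 + length:]
    return (first, second) if mode == 0 else (second, first)
```

A decoder that holds the object code but not the mode cannot tell which part is the meta code, which is the point of storing the mode under a separate access right.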
In this embodiment, the content data may be the application object itself, or may be positioning and index information of the application object. In the latter case, the data access component of the application system can obtain the corresponding application data according to the content data through a certain way or algorithm, thereby obtaining the final application object.
In addition, the content of the data object may preferably be stored in a third party storage system that interfaces with the code repository, in which case the code repository needs to store relevant information for accessing the data object in the third party storage system.
In this embodiment, the process of encoding data objects is referred to as object-based encoding. Data serialization, simply called serialization, is the process of content encoding data. The metadata of the data object and the content data are finally serialized, and are stored in a result based on object coding (content coding mode) or in a storage outside the result (reference coding mode). In addition, in the encoding and decoding process, the content of the data object and the content of the metadata need to be serialized before transmission in the system.
Indeed, the serialization of data objects, that is, content encoding itself, may also be carried out entirely with the object-based encoding method. The key point is that the encoding metadata is stored in the code repository by this method, yielding the reference code of the corresponding coding meta-object, namely the meta code. The subsequent serialization of data objects can then proceed smoothly with the participation of the encoding metadata that the meta code identifies. Thus, it can be said that object-based reference encoding is the basis of the present method. On this basis, a coding meta-object can be reference-encoded to obtain a meta code. On the basis of the meta code, we can perform reference encoding of data objects and serialization of data objects, namely content encoding. In implementing reference encoding, the content code of the data object must be obtained first (the method is applied to itself) and transmitted to the code repository for storage, after which the reference code is obtained.
In the present embodiment, object encoding refers to encoding of an arbitrary object. The objects herein may be either physical objects such as data, content information, images, speech, etc. (reference encoding may be generally used for them), value objects (e.g., dates, instance encoding may be generally used for them), or high-level objects including internal object structures such as tuple objects, table objects, tree/document objects, etc. Object encoding is one of outputs of the system after encoding an arbitrary object, and is also one of inputs when decoding an object.
For example, fig. 12 is a conceptual model schematic diagram of the object code, and as shown in fig. 12, the object code may include two parts, one is meta code and the other is instance code. Meta-coding is the coding of a coded meta-object. Meta-coding is typically an integral part of object coding. After the decoder analyzes the meta-code from the code, the decoder can acquire the corresponding code meta-data according to a certain mechanism. Content encoding is the encoding of data content under corresponding encoding constraints.
Fig. 13 is a flowchart of a second embodiment of an encoding processing method according to the present invention, where, on the basis of the embodiment shown in fig. 5A, as shown in fig. 13, the method of this embodiment further includes:
Step 201C, setting access rights for the data in the code warehouse.
In this embodiment, the data may be metadata, a data object, or the like. Optionally, the metadata includes one or a combination of the following:
the type of data object, the creation time of the data object, the modification time of the data object, the historical version information of the data object, the data structure of the data object, the interface of the data object, the storage constraint of the data object, the transmission constraint of the data object, and the encoding constraint of the data object (including the constraint of the encoding space).
Further, the method may further include:
step 202C, the object code is sent to the target client.
Fig. 14 is a flowchart of a third embodiment of an encoding processing method according to the present invention, and based on the embodiment shown in fig. 5B, as shown in fig. 14, a specific implementation manner of step 102C2 is as follows:
step 301C, obtaining a context object.
Step 302C, acquiring a corresponding coding space according to the context object and the coding protocol.
Step 303C, encoding the data content in the data object in the coding space to obtain an instance code.
Step 304C, obtaining an object code corresponding to the data object according to the meta code and the instance code.
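Steps 301C to 303C can be pictured with a small registry in which the context object selects the coding space. The `CodingSpaces` class, the context keys `user` and `application`, and the use of list positions as instance codes are assumptions for this sketch; the final joining with the meta code (step 304C) is left out.

```python
class CodingSpaces:
    """Hypothetical registry that keeps one code table per (context, type) pair."""

    def __init__(self):
        self._spaces = {}

    def space_for(self, context, protocol):
        """Step 302C: look up (or create) the coding space for this context."""
        key = (context["user"], context["application"], protocol["type"])
        return self._spaces.setdefault(key, [])

    def encode(self, context, protocol, content):
        """Step 303C: give the content the next free instance code in its space."""
        space = self.space_for(context, protocol)
        if content not in space:
            space.append(content)
        return space.index(content)
```

Two users encoding different glyphs both receive instance code 0, yet the codes resolve to different objects because they live in different coding spaces.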
In this embodiment, the code repository (also referred to herein as the coding warehouse) may be a repository storing encoding metadata, coding meta-objects and object data, and it may also provide various related services. Similarly to the font library of a standardized coding system, the glyphs corresponding to the character codes in the handwriting input system can be stored in the code repository. Fig. 15 is a schematic diagram of the glyphs corresponding to non-standard character codes in the handwriting input system of this embodiment being stored in the code repository. As shown in fig. 15, by accessing the glyph information in the code repository, an application program using the new data processing system can render arbitrary text glyphs.
However, unlike a conventional font library, the code repository stores more than glyph information. The new data processing system employs a solution based on open coding of objects. Graphics, voice or other multimedia data may be encoded, and data from different domains may be encoded. The corresponding encoding metadata is also stored in the code repository. An application system can not only query and use the various codes in the code repository, but also register new coding types with the code repository and submit coded data to it.
FIG. 16 is a core conceptual diagram of an encoding metamodel of an exemplary context-dependent object encoding system, as shown in FIG. 16, illustrating the relationship between some of the core concepts in the encoding metamodel. Definitions of these specific concepts are given later.
The coding space refers to a logical space for isolating object codes. The same instance code of the same object type corresponds to different objects in different coding spaces. A coding space is directly related to one or several coding objects (only one in the coding metamodel above); these coding object(s) are called the direct context of the space and of the coded objects within it. Conversely, the coding space is referred to as the coding space of those object(s).
The coding space of the coding object within the coding space is called subspace. The coding space is called the parent space of its subspace. The coding space without parent space is called the root space. The root space is typically the coding space of the coding warehouse.
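The parent, subspace and root relationships can be modeled as a simple tree. The `CodingSpace` class and the space names used below are hypothetical.

```python
class CodingSpace:
    """A node in a hypothetical coding-space tree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.subspaces = []
        if parent is not None:
            # Registering with the parent makes this space one of its subspaces.
            parent.subspaces.append(self)

    def is_root(self):
        """A space without a parent space is a root space."""
        return self.parent is None

    def path(self):
        """Names from the root space down to this space."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))
```

For instance, a repository space containing a user space that in turn contains an application space yields a three-element path, with the repository space as the root.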
In the computer world, everything is encoded in binary bits. Given a sufficient number of bits, we can have as many codes as we need, meta codes included. In an implementation, however, a larger number of bits carries performance and memory costs. In addition, flat meta codes are hard to manage. This is one of the reasons why programming languages (e.g., C++, Java, etc.) and XML technology employ namespaces. Similarly, we have introduced the concept of a coding space to manage codes more efficiently. In practice, the coding space is a means of hierarchically classifying and isolating encoding metadata. The coding space is hierarchical, that is, a coding space may itself have subspaces. The same code in different coding spaces may correspond to different objects, and the same meta code may likewise mean something completely different in different spaces. In practice, different coding spaces provide different levels of security isolation for the codes.
The division of the coding space can be done in different ways. But some basic objects are inevitably involved in the use and processing of the code. For example, fig. 17 is a schematic diagram of a basic object that can be applied to a basic coding space.
In the present invention, every code resides in a code repository, with the exception, of course, of standard codes. In practice, different code repositories correspond to different coding spaces, and the coding space corresponding to a code repository is the root space of all codes of that repository.
Also, within the same code repository, each code has its own owner. The codes of different users belong to different user coding spaces. The partitioning of user spaces may become more complex as the user model of the code repository grows more complex. For example, there may be a group space shared by multiple users.
Also, a data object is often used by different applications. For a specific user of a certain code repository, different applications can share the same codes, or each application can use its own separate codes. In the former case, the same text content can be handled and used by different applications without conversion. In the latter case, independent codes improve the security of the data: codes leaked from a malicious or cracked application affect only the data belonging to that application. Of course, the advantages of the former are the disadvantages of the latter and vice versa. Interoperability and security are two sides of the same coin. Here, however, we can see that the introduction of the space concept gives us flexibility of choice.
Further, the code is to be serialized into a specific data store. The data store may be a file, a database field, or a string transmitted in a network. Isolating the code against this data content itself maximizes the security of the code. In practice, this data content based isolation of content space is a codebook that establishes a one-to-one correspondence of content to code.
Finally, the code may be divided into different fields for ease of management, which may be referred to as management space. A name/identifier can be used to distinguish between different administrative spaces, and is therefore also called a namespace.
Of the two kinds of coding space described above (named coding spaces and context coding spaces), the latter may be implied by the context in which a code is formed and used. We refer to such a space as a context space.
In a code repository, the permutation and combination of the different kinds of context objects determines the final context space. For example, different combinations of users and applications correspond to different context spaces. In general, however, the codes in non-standard text content correspond uniquely to the content, and the content itself implies the corresponding applications and users (except, of course, for multi-application, multi-user content), so there is no need to subdivide application subspaces or user subspaces within the content space. Among all the context spaces there is one special space, the context-independent coding space, whose codes we call public codes. In practice, standardized codes are all public codes. The codes in the root space are not public codes but codes tied to the code repository, their coding space being the root space corresponding to that repository.
For a coding system, everything is ultimately embodied as a code. The code that ultimately corresponds to a coding space is a meta code, which we can call a space code. A coding space is in fact also a special coding meta-object, one whose object instances are themselves coding meta-objects. For a context-free space code, the code itself has no coding space. For a context-dependent space code, however, the code may correspond to different coding spaces depending on the context object. Thus, for context-free coding spaces such as namespaces, we can use the space code directly, with the corresponding instance code being a subspace code or another meta code. For a context coding space, we can directly use the code of the context object as the corresponding space code: the code repository space corresponds to the code repository code, the content space to the instance code, the application space to the application code, and the user space to the user code.
For example, fig. 18 is a schematic diagram of the coding structure of a 128-bit coding scheme. The arrangement and combination of the codes is not unique; for example, the instance code may be placed at any position in the object code, as long as the position is clearly defined in advance.
In actual use, a context space code is implied by the context in which the code is used and does not need to appear in the final object code. For example, the code repository currently in use implies the code repository code; the application program currently using the code implies the corresponding application code; and the document content in which the code is located implies the instance code and the user code of the code owner (assuming a single-user document). However, when codes from several spaces of the same kind are present in the same text content at the same time, context space codes must appear in the text to set up the different coding contexts and isolate the different spaces. For example, the text in a document may include codes from multiple code repositories. In this case, the corresponding code repository codes must appear in the document content to distinguish the different code repository spaces. Of course, a code repository that supports code repository codes must provide the information needed to access the repository that a repository code refers to. Likewise, the text content of multiple users must use user codes, and application codes must be used in content that can be read and written by multiple applications and that uses application space isolation. The content space is an exception, because the content code is the code of the document content itself and is in one-to-one correspondence with that content. No content can correspond to more than one content code, and therefore the content code does not need to appear explicitly in the codes. In an implementation, the content code may be a hash of the document content, or a hash of the application code and a timestamp. The content code is thus either calculated in real time or stored as content metadata.
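Both ways of deriving a content code mentioned above can be sketched with the standard `hashlib` module. SHA-256 is an assumed choice, since the text speaks only of "a hash", and the function names and the `application_code:timestamp` layout are likewise hypothetical.

```python
import hashlib


def content_code_from_document(document_bytes):
    """Derive a content code from the document content itself."""
    return hashlib.sha256(document_bytes).hexdigest()


def content_code_from_app_and_time(application_code, timestamp):
    """Derive a content code from an application code and a creation timestamp."""
    material = f"{application_code}:{timestamp}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

The first variant changes whenever the document changes and can be recomputed on demand; the second is fixed at creation time and would therefore have to be stored as content metadata.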
It is mentioned above that in general, spatial coding need not be included in the coding, but it is necessary to specify which spatial coding is used, which may be specified using spatial bits in the coding. This spatial bit actually corresponds to the encoding context in the encoding scheme.
Further, fig. 19 is a schematic diagram of four binary bits, namely the four space bits. As shown in fig. 19, the code repository bit may also be called the reserved bit. As an illustrative example, when the reserved bit is 0, the code comes from the current code repository; otherwise, additional information is required to define the code or to specify its source, such as the client code mentioned later. When the content bit is 0, the code is independent of the content; when it is 1, the code exists only for that particular content. When the application bit is 0, the code is application independent; when it is 1, the code is application specific. When the user bit is 0, the code is a public code; when it is 1, the code is owned by the current document user. The opposite assignment of 0 and 1 is equally possible. Any other coding scheme may be used as long as it effectively distinguishes the different spaces.
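The four space bits can be illustrated with a minimal sketch. The bit order used here (repository, content, application, user, from high to low) is an assumption, since the actual layout is defined in fig. 19; the names `SpaceBits` and `parse_space_bits` are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpaceBits:
    repository: bool   # reserved bit: False means "current code repository"
    content: bool      # True means the code exists only for this content
    application: bool  # True means the code is application specific
    user: bool         # True means the code is owned by the document user


def parse_space_bits(nibble: int) -> SpaceBits:
    """Split a 4-bit value into the four space flags.

    Assumed bit order, high to low: repository, content, application, user.
    """
    if not 0 <= nibble <= 0b1111:
        raise ValueError("space bits must fit in four binary bits")
    return SpaceBits(
        repository=bool(nibble & 0b1000),
        content=bool(nibble & 0b0100),
        application=bool(nibble & 0b0010),
        user=bool(nibble & 0b0001),
    )
```

Under these assumptions, `parse_space_bits(0b0011)` describes an application-specific code owned by the current user that comes from the current repository and is independent of the content.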
It should be noted that, as with the normal coding, the type coding also has coding space. And the space of type codes and instance codes may be different. For example, using public codes for user space may serve as a secure barrier to that user space. In this example, the code type of the code is user space, and the example code is public space. Since instance codes must be affiliated with a certain code type, the spatial bits of instance codes of the same type are all the same. In the specific decoding process, metadata of the coding type in the coding warehouse can be accessed according to the type coding. Therefore, the type code must include a corresponding space to ensure that the decoder can obtain the correct code type information from the code repository. The type information in the code repository may contain spatial bits corresponding to the instance code, so that the spatial bits do not need to be present in the instance code.
The context space is the main means of securely isolating codes. The parties that manage and set it for applications and that generate the target code space should be the individual (such as the user) and the administrators (such as the system administrator and the application administrator) corresponding to the context object. The management space, by contrast, exists to facilitate hierarchical management of codes and is registered and used by applications.
The code word length is the minimum number of bits required to encode one character in a character coding system. For example, UTF-8 has a code word length of 8 binary bits, or one byte, and UTF-16 has a code word length of two bytes. In a coding system with a given code word length, not every code has exactly this length, but the length of every code must be an integer multiple of the code word length. For coding systems with a multi-byte word length, the byte order within one code word must also be considered. This problem does not arise with a single-byte word length, where all data is arranged byte by byte in order from low to high.
Regarding fixed-length and variable-length coding: a coding system in which the length of every code equals the code word length is called a fixed-length coding system; otherwise it is called a variable-length coding system.
In an object coding system, the code word length and the related coding method are closely tied to the encoding and decoding processes but are independent of the coding meta-model. That is, object coding systems based on the same coding meta-model can choose different code word lengths and correspondingly different coding methods. A combination of several word lengths, and hence several coding methods, can even be supported at the same time, provided that an effective mechanism is designed to distinguish them.
It should be noted that the coding word length and coding method of the system are not directly related to the serialization word length and method specified in the specific object coding specification. Except that if the serialization results are part of the object code, the compatibility of the object code word length and method needs to be considered.
Like Unicode, the object coding system may be independent of the code word length. That is, coding schemes with different word lengths may be based on the same code repository. In a coding scheme with a short word length, one code word often cannot hold a complete code (which, as described above, includes three parts: space code, type code, and instance code). In this case variable-length coding can be used, in which one code comprises several words. For example, the meta-code portion and the instance code portion are split into several consecutive code words. Even so, a code of one word length sometimes cannot cover all code instances. The variable-length coding technique of Unicode can then be used, in which marker bits define the code length. For example, for a code with a word length of one byte, fig. 20 is an exemplary diagram of a coding scheme that enables the encoder to obtain the corresponding code length automatically from the first one or two bytes. The scheme can represent a coding range of 0 to 2^65 - 1.
Fig. 21 is an exemplary diagram of the coding scheme of UTF-8. Comparing the two coding schemes shows that their coding results do not conflict with each other and may appear in the same document. When the first bit of the first byte of a code is 0, the byte corresponds to the ASCII part of UTF-8; when the first two bits of the first byte are 10, the code is an object code; when the first two bits of the first byte are 11, the code is a Unicode code. In this way, hybrid coding of object codes and Unicode can be achieved.
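The dispatch on the leading bits of the first byte can be sketched as follows. This is a minimal illustration rather than the scheme of fig. 20 itself; the function name `classify_lead_byte` and the returned labels are hypothetical. It relies on the fact that in standard UTF-8 a byte of the form 10xxxxxx is only ever a continuation byte and never starts a character, which is why that prefix is free to introduce an object code.

```python
def classify_lead_byte(byte: int) -> str:
    """Classify the first byte of a code in a mixed object/Unicode stream."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value")
    if byte & 0x80 == 0x00:   # 0xxxxxxx: ASCII part of UTF-8
        return "ascii"
    if byte & 0xC0 == 0x80:   # 10xxxxxx: object code
        return "object"
    return "unicode"          # 11xxxxxx: multi-byte UTF-8 (Unicode) code
```

A decoder would call this on the byte at the current position and then hand control to the matching sub-decoder.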
Similarly, other variable length coding schemes of one byte word length and multiple byte word lengths can be designed.
In addition, as for the coding type, the coding type is an object type to which a related coding specification is added.
In addition, the encoding context is an abstraction of a context object. In effect, it is a selection condition for choosing the context object at runtime. The above coding meta-model uses the coding type plus the object role name. Within the same coding context (generally a specific application), role names of the same type must be unique.
For example, in a web blog application, there are authors and readers, who are all user objects, but different roles. The encoding context of the data object in the blog content should be the author user. Thus, when any reader opens the content, the problem of decoding errors caused by the fact that the current login user is not an author is avoided. Of course, the premise of proper decoding is that the encoding context object is set properly. For the example of blogs, it is when each specific blog content is opened that the corresponding author user object is set to the encoded context object.
In addition, the encoding context path, abbreviated as the coding path, corresponds to a sequence of encoding context specifications and constrains the coding space to which the instance code of the corresponding data object belongs. By definition, coding spaces form a hierarchy associated with the coded objects: a subspace may itself have subspaces. The coding path is the path through the coding spaces that locates the coded object. For example, the picture coding path in a personalized diary may be:
Figure SMS_17
The picture corresponding to the picture object code can be found in the final application space.
The coding paths exemplified above are run-time specific paths. The coding path in the coding element model is a higher level of abstraction coding path, corresponding to:
Figure SMS_18
at run-time, this encoding path will be instantiated as the encoding path instance above by selecting the corresponding context object.
By context object is meant a specific object that is assigned to a context specification, which object must conform to the constraints of the context specification and must be accessible during the corresponding encoding process. For example, there is an "author" context constraint, whose corresponding type is "user". When setting the context constraint, the current application cannot be set to the corresponding context object. The settings must be made with an object of the "user" type. Typically, after the author information corresponding to the document is obtained, it may be set as a context object corresponding to this "author" context constraint. If the author object is not accessible to the current user, the context object cannot be instantiated, that is, the encoding context constraint is not satisfied, and the next relevant instance encoding cannot be decoded. This is also one embodiment of context-based coding security in the present method.
In fact, in an implementation of the system, the coding path instance is directly related to the coding space in the code repository in which the corresponding data object instance is coded. Optionally, the storage location of the corresponding data object in the code repository may also be constrained by the coding space. The code repository can implement the coding path in various ways depending on the storage scheme. A specific example follows. In a code repository implemented with relational database technology, a simple approach is to splice the simple context names together to form the table names for context-dependent data objects. In the above example, the table name of the picture table may be:
Figure SMS_19
the instance encoding of the corresponding data object may directly use the keys of the table.
Another way to implement the coding space is to store the data objects uniformly and to distinguish only the coding spaces used for coding. A specific example follows. In a code repository implemented with relational database technology, the system maintains a coding space table as follows:
Coding space ID | Parent space ID | Context object reference code
0 | Null | Null
8 | 0 | (reference code of user 001)
100 | 8 | (reference code of application 005)
Wherein the coded space ID field is the table primary key; the parent space ID is the foreign key of the table and is used for representing the nesting relation of the coding space.
For each data object placed in the data warehouse, there are two tables. One is the data table itself of the data object, such as a picture table:
picture ID Field 1
Wherein the picture ID field is the table primary key. The data for all pictures is placed in the table. The other is a corresponding picture coding table:
coding space ID Encoding Picture ID
100 001
100 002
The coding space ID field is a foreign key referencing the system coding space table, and the picture ID field is a foreign key referencing the picture table. The coding space ID field plus the code field form the primary key of the table.
In addition, a code directory entry is a specific encoding meta-object of context-dependent object coding. There is one and only one code directory in each coding space, and the code directory is a list of code directory entries. Each code directory entry has a unique number within the code directory, namely its meta-code. In the above coding meta-model, the code directory entry is specifically the coding type plus the coding path. The coding path may be a relative path, based on the current space of the code directory entry, or an absolute path, based on the root space. Both may be supported at the same time, provided that a mechanism for distinguishing them is established.
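The three tables described above can be sketched with an in-memory SQLite database. This is a minimal sketch under stated assumptions: the table and column names, the sample picture IDs and the `field_1` values, the pairing of codes with picture IDs, and the helper `decode_picture` are all hypothetical, chosen only to show how the same code is resolved relative to a coding space.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE code_space (
    space_id         INTEGER PRIMARY KEY,
    parent_space_id  INTEGER REFERENCES code_space(space_id),
    context_ref_code TEXT
);
CREATE TABLE picture (
    picture_id INTEGER PRIMARY KEY,
    field_1    TEXT
);
CREATE TABLE picture_code (
    space_id   INTEGER NOT NULL REFERENCES code_space(space_id),
    code       TEXT    NOT NULL,
    picture_id INTEGER NOT NULL REFERENCES picture(picture_id),
    PRIMARY KEY (space_id, code)
);
""")

# Nested coding spaces: root (0) -> user 001 (8) -> application 005 (100).
conn.executemany(
    "INSERT INTO code_space VALUES (?, ?, ?)",
    [(0, None, None), (8, 0, "user-001"), (100, 8, "app-005")],
)
# Illustrative picture rows; the real table would hold the picture data.
conn.executemany(
    "INSERT INTO picture VALUES (?, ?)", [(1, "sunset"), (2, "portrait")]
)
conn.executemany(
    "INSERT INTO picture_code VALUES (?, ?, ?)",
    [(100, "001", 1), (100, "002", 2)],
)


def decode_picture(space_id, code):
    """Resolve an instance code inside one coding space to a picture row."""
    return conn.execute(
        "SELECT p.picture_id, p.field_1 "
        "FROM picture_code AS c JOIN picture AS p USING (picture_id) "
        "WHERE c.space_id = ? AND c.code = ?",
        (space_id, code),
    ).fetchone()
```

Because the primary key of the coding table is the pair (coding space ID, code), the short code "001" can be reused in every coding space without conflict, and a lookup in the wrong space simply finds nothing.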
That is, in the context-dependent object coding system, the meta code (code corresponding to the code directory entry) and the instance code in the object code may not be in one code space.
The code directory entry can unify the space coding and type coding described above. If the coding type of the code directory entry to which a meta-code corresponds is itself the code directory entry type, the meta-code corresponds to a coding space, and the instance code following this meta-code is in fact another meta-code. A meta-code can thus represent either a space code or the code of a code directory entry, depending on whether the corresponding coding type is the code directory entry type. With this design, the meta-code of one object code may be a combination of one or more meta-codes: the last meta-code corresponds to an ordinary encoding meta-object, and all preceding meta-codes correspond to coding spaces. Furthermore, the concept of space bits described above can be hidden inside the code repository through code directory entries instead of being exposed directly in the code. The coding path is more flexible and safer than space bits, and different combinations of context objects can be set.
In addition, code directory entry instantiation is mainly the process, at runtime of the context-dependent object coding system, of instantiating a coding path (a sequence of context specifications) into a target coding space. With different context objects in the codec, the same meta-code (the code corresponding to the code directory entry) therefore corresponds to different target coding spaces, and the object instance code is coded into different coding spaces accordingly (of course, only the reference code form corresponds to a coding space). For a code directory entry whose coding path is empty there is no instantiation process, and the corresponding target coding space is the space in which the directory entry is located.
Code directory entry instantiation is the key to how a context-dependent object coding system achieves context dependence.
In addition, the encoding factory is the runtime object codec that results from instantiating a code directory entry. It includes the corresponding code directory entry, the current coding space (the space of the code directory), and the target coding space (the space of the object instance data, which is instantiated from the coding path through the corresponding context objects). The encoding factory contains all the information needed to encode and decode the data object, except the data content of the object itself. The encoding factory provides a codec service for the data objects corresponding to the code directory entry (in effect, objects of a particular type in a particular target space).
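The instantiation of a code directory entry into an encoding factory can be sketched as follows. This is a minimal sketch and not the patented implementation: the class names, the representation of a coding space as a tuple of names, and the `instantiate` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CodeDirectoryEntry:
    code_type: str
    code_path: Tuple[str, ...]  # sequence of context role names, may be empty


@dataclass(frozen=True)
class EncodingFactory:
    entry: CodeDirectoryEntry
    current_space: Tuple[str, ...]  # space that holds the code directory
    target_space: Tuple[str, ...]   # space that holds the instance codes


def instantiate(entry: CodeDirectoryEntry,
                current_space: Tuple[str, ...],
                context_objects: Dict[str, str]) -> EncodingFactory:
    """Resolve each context role on the path to its current context object."""
    target = list(current_space)
    for role in entry.code_path:
        if role not in context_objects:
            # The context object is not set or not accessible: no decoding.
            raise LookupError(f"no context object for role {role!r}")
        target.append(context_objects[role])
    return EncodingFactory(entry, tuple(current_space), tuple(target))
```

With an entry whose path is `("author",)`, the same meta-code yields the target space `("root", "user-001")` for one author and `("root", "user-002")` for another, while an entry with an empty path keeps the target space equal to the current space.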
The coding space can be used as a special coding factory, and the coding type of the corresponding coding directory entry is the coding directory entry type. That is, the encoding space provides a codec service for encoding directory entries, i.e., encoding meta-objects.
The final output of the encoding factory should be the object code, which includes the meta-code and the instance code. The process of combining or splicing the meta-code and the instance code into the object code may be placed at the user's end or in the code repository, depending on the actual design. Furthermore, a code representing the way in which the meta-code and the instance code are combined or spliced may be included in the final object code. If necessary, the code representing the combination or splicing mode and the object code can be stored separately under different secure channels, each with its own access rights, so that both can be obtained only through authorization and verification, and only then can the meta-code and the instance code be correctly separated.
In addition, for the system coding of the context-dependent object coding system, the realization of the variable length coding method is more direct due to the multi-level meta coding combination characteristic of the context-dependent object coding system. Both directory entry encoding and instance encoding may be one word length.
In addition, the context object setting code is a system code used to set the current context object (at the moment of encoding or decoding). The setting takes effect for subsequent data objects whose code directory entries use the relevant context.
Possible forms of this coding:
Figure SMS_20
in the above encoding meta-model core conceptual table, the encoding context object needs to be modified to one encoding object to support this systematic encoding, that is, the encoding of the context object is the basis of the above encoding form.
Another possible form is:
Figure SMS_21
the encoded context identification may be a combination of a context type name and a context role name.
For the termination code: the termination code informs the decoding process that the parsing of an object code has ended. A termination code is not always necessary. In most cases an object code ends with an instance code; without an instance code, parsing would simply continue. The system may therefore take the end of the instance code as the marker that parsing of the code has ended. This implies that coding spaces cannot be nested cyclically and must form a strict tree structure. The termination code may be a marker one word in length.
For root space coding: after a default factory has been set by a space code, it is sometimes still necessary to use codes outside the default factory. In that case, the root space code can be used to switch the current code to another space. The root space code is the starting point of every complete code, and all other codes, including meta-codes, can be decoded starting from the root space. A given piece of text content can correspond to only one root space. If no default factory is set, the default factory is the root space. The root space code may be a special marker one word in length, followed by the complete object code of the object, from the root directory code to the instance code.
For the default meta-code setting code: this code in effect sets the coding space or the encoding factory. The root space code can override this setting. Except for object codes that begin with the root space code, decoding is performed by the encoding factory that has been set.
Since this encoding must end up with meta-encoding, it must be terminated using termination encoding.
Possible forms of this coding:
Figure SMS_22
Context-dependent object coding improves the expressiveness of codes and shortens code length. It is well suited to the coded storage and transmission of big data, of the rich data types found in cloud storage, and of massive numbers of data objects with complex relationships, and it also meets the lightweight and diverse identification requirements of the Internet of Things.
For object codes and text: a standard character code is in effect a reference code to a character object, so an object code sequence can be regarded as a special kind of text content. Operating concepts and processing tools for traditional text, such as text search, retrieval, editing, and replacement, can be borrowed and reused in combination with the characteristics of object codes.
Meanwhile, the object code and the character code can be mixed, and the character code is only used as a special object code.
When object codes and character codes are mixed, three encoding and decoding methods are possible:
1. Assign a specific meta-code to the character code.
2. Use a specific character code and, where an object code is needed, switch to object coding through a designated escape character.
3. Extend the specific character code into an extended character code capable of expressing object codes.
For structured object coding, the object coding sequence can be regarded as a special text content, and on the basis of standard text, there are a large number of coding standards and formats of structured documents, such as comma separated text table format CSV, structured document standard SGML/XML based on markup language, JSON format of packaging data structure by JavaScript grammar, and the like. On the one hand, we can directly use the relevant format and standard, and mix the object coding characters as content.
On the other hand, a structured document formed from an object code sequence can itself be regarded as a special object and coded in the object coding manner; the coding result is the serialization of the codes of all the sub-objects that make up the object. To encode and decode a structured object, its structure information can be treated as part of the encoding metadata, stored in the code repository like an ordinary data object, and used to encode and decode the content according to that metadata. The codec process assembles and parses the object code sequence, which is the serialized content of the structured object, and then encodes and decodes each object code in the sequence. This process may be recursive and nested. In addition, the coding and decoding of structured objects may be defined in other forms, as follows:
Encoding of object arrays
Generally refers to encoding a set of objects whose meta-codes are identical. In the variable length coding method, an array system code is defined, and redundant meta codes can be removed. An array system code may be defined as follows:
Array code = array system code + array length n + object code of the first element of the array (meta-code + instance code) + instance codes of the n-1 remaining elements
Under this definition, an array system code may be considered to be a meta-code of an array object. The meta information of the array object is implicit in the whole array code, including array length, array type, etc.
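The array definition above can be sketched as follows. This is a minimal sketch under stated assumptions: an object code is modelled as a `(meta_code, instance_code)` pair, the array code as a flat list of symbols, and `ARRAY_SYSTEM_CODE` is a hypothetical marker value; none of these representations come from the original text.

```python
ARRAY_SYSTEM_CODE = "ARRAY"  # hypothetical marker for the array system code


def encode_array(objects):
    """Encode (meta_code, instance_code) pairs that share one meta-code."""
    if not objects:
        raise ValueError("this sketch needs at least one element")
    meta = objects[0][0]
    if any(m != meta for m, _ in objects):
        raise ValueError("all elements of an array must share one meta-code")
    # The meta-code is written once, with the first element only.
    return [ARRAY_SYSTEM_CODE, len(objects), meta] + [i for _, i in objects]


def decode_array(codes):
    """Rebuild the (meta_code, instance_code) pairs from an array code."""
    if not codes or codes[0] != ARRAY_SYSTEM_CODE:
        raise ValueError("not an array code")
    n, meta = codes[1], codes[2]
    instances = codes[3:3 + n]
    if len(instances) != n:
        raise ValueError("array code is truncated")
    return [(meta, inst) for inst in instances]
```

For three elements with meta-code 7, the encoded form carries the meta-code once instead of three times, which is the redundancy the array system code is meant to remove.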
Encoding of object two-dimensional tables
Generally refers to encoding a two-dimensional array in which every column has the same meta-code. Likewise, a table system code may be defined so that redundant meta-codes can be removed.
Table code = table system code + number of data rows n + object codes of the elements of the first row (meta-code + instance code) + instance codes of the n-1 remaining rows
Under this definition, the table system code may be considered the meta-code of the table object. The meta information of the table object is implicit in the entire table code and may include the table length, the table type, etc.
Coding of object trees
Tree structures are commonly used and can represent complex combinations of objects, such as document trees and abstract syntax trees. A special class of tag codes may be defined for them. A tag code marks the start of a tree node, and the corresponding end tag is specified in the metadata of the tag object. When the decoder parses the end tag, the data objects between the tag and the end tag are combined to form a tree node object. Tree node objects may be nested and combined.
In addition to the fact that the tree structure information can be put in the encoded metadata corresponding to the root node encoding, the tree structure information can also be put in the encoded metadata corresponding to the tree nodes in a hierarchical manner.
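Decoding a tag-delimited stream into nested tree nodes can be sketched with a stack. This is a minimal sketch: the token strings, the `END_TAG_OF` table (standing in for the tag metadata held in the code repository), and the function `decode_tree` are hypothetical.

```python
# Hypothetical tag metadata: each start tag names its own end tag.
END_TAG_OF = {"<doc>": "</doc>", "<sec>": "</sec>"}


def decode_tree(tokens):
    """Combine the objects between a tag and its end tag into a tree node."""
    stack = [("root", [])]  # each frame is (tag, children collected so far)
    for tok in tokens:
        if tok in END_TAG_OF:
            stack.append((tok, []))           # a tag code opens a new node
        elif len(stack) > 1 and tok == END_TAG_OF[stack[-1][0]]:
            tag, children = stack.pop()       # end tag closes the open node
            stack[-1][1].append((tag, children))
        else:
            stack[-1][1].append(tok)          # ordinary data object (leaf)
    if len(stack) != 1:
        raise ValueError("stream ended with an unclosed tag")
    return stack[0][1]
```

Nested tags produce nested `(tag, children)` pairs, so a document node can contain section nodes, which in turn contain ordinary data objects.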
Meta-element encoding
Meta-element coding is the coding of the metadata that describes the encoding metadata. It is also a kind of meta-code.
Mark encoding
For object coding, there is also the case in which there is no instance code, that is, the object code consists of the meta-code portion only. This kind of code is called a token code and corresponds only to encoding metadata. Its main role is to provide semantic tags to the decoder, and it is heavily used in structured coded streams.
Further, one specific implementation manner of step 304C is:
the object code is generated from the meta code and the instance code using a predetermined rule.
In the present embodiment, the manner in which object codes are constituted by meta codes and instance codes may be varied. The object code may be constructed by directly combining or stitching together the meta code and the instance code. For example, FIG. 22 is a schematic diagram of object encoding constructed by meta-encoding and instance encoding.
In addition, object encoding may also be obtained, for example, by some operation between meta-encoding and instance encoding or other feasible hybrid manner, as follows:
Figure SMS_23
Thus we can strip the object code into meta code and instance code by corresponding operations:
Figure SMS_24
thus, any way of obtaining object codes from meta-codes and instance codes is applicable to the present invention as long as the meta-codes and instance codes can be retrieved in a reversible manner.
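One reversible way of mixing a meta-code and an instance code is sketched below. It is not the operation shown in the figures, which are not reproduced in this text; the fixed width `INSTANCE_BITS`, the XOR masking, and the function names `combine` and `split` are assumptions chosen only to show that the two parts can be recovered exactly.

```python
INSTANCE_BITS = 16                  # assumed fixed width of the instance code
MASK = (1 << INSTANCE_BITS) - 1


def combine(meta_code: int, instance_code: int) -> int:
    """Build an object code from a meta-code and an instance code."""
    if meta_code < 0 or not 0 <= instance_code <= MASK:
        raise ValueError("code out of range for this sketch")
    mixed = instance_code ^ (meta_code & MASK)   # mask instance with meta bits
    return (meta_code << INSTANCE_BITS) | mixed


def split(object_code: int):
    """Recover (meta_code, instance_code) by undoing the steps of combine."""
    meta_code = object_code >> INSTANCE_BITS
    instance_code = (object_code & MASK) ^ (meta_code & MASK)
    return meta_code, instance_code
```

Because XOR with the same mask is its own inverse, `split(combine(m, i))` always returns `(m, i)` for values within the assumed ranges.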
Both meta-code and instance-code are used internally by the object-code system, are typically automatically generated internally within the system, and are not visible to the application systems built upon the system. Depending on the relevance of the metadata portion to the data content portion, instance encoding may or may not be related to metadata encoding.
Type coding is a typical meta-coding. Type information of the object instance and coding specifications of the relevant type can be obtained through type coding.
Preferably, the method may further comprise:
adding a code representing the predetermined rule to the object code.
Or, alternatively:
storing codes representing the predetermined rule and the object codes under different secure channels, respectively, and setting different access rights for the codes of the predetermined rule and the object codes, respectively.
In this embodiment, regarding context-dependent coding: as mentioned above, object-based coding already provides type-based coding isolation. However, for a given type of data object, a single unified coding space has two major drawbacks. First, the coding is not secure enough: by directly modifying a code or using random codes, it is possible to gain direct access to data objects of the same type that belong to other users. Second, the coding is not efficient enough: to ensure that the codes of data objects of the same type do not conflict with each other, the storage space occupied by the object codes themselves grows as the number of data objects grows, which eventually reduces coding efficiency.
Context-dependent coding introduces the concept of a context-dependent coding space, which solves both of the above problems.
The coding space is an abstract concept for isolating the coding of data objects. The encoding of certain defined types of data objects in a defined encoding space is unique. But it may correspond to different codes in different coding spaces. At the same time, the same type, the same coding, may correspond to different data objects in different coding spaces.
Context objects refer to data objects related to the encoding use environment, such as users, application systems, time, place, domain, etc. Some data object encodings are closely related to these usage environments. For example, a data object that is private to a user is closely related to that user, and thus, the corresponding code should also be related to that user.
The context-dependent coding space refers to the coding space that is subordinate to the context object. By using the information of the context object in the meta information of the data object, we can specify the coding space of the corresponding data object. In this way, we can directly encode the data object with the encoding in the encoding space. In the process of coding use and analysis, the same object coding can correspond to different coding spaces according to different context objects. This further improves the effectiveness of the encoding.
In addition, a certain security access mechanism is provided for some key context objects, so that the security of the corresponding coding space can be ensured, and the security of the codes in the space is ensured.
More importantly, in the present embodiment, the key to object-based coding is the meta information of the data object. Serialization (content coding), transmission, and storage of data objects are all controlled by their meta information. The type of a data object is an important piece of meta information. Data objects have different data types, and these types are related to one another: for example, complex types are composed of simple types, and several data objects of one or more types may be arranged according to a certain convention to form a special structure. All of these types together constitute a type system. An object-based coding system is built on top of a complete type system. That is, in the corresponding coding system every data object has an object type. The type system is also extensible: users can define custom types based on the existing types and on the type definition and extension mechanisms. The type system provides three main benefits to the corresponding coding system:
First, type inspection
With the object type, the validity of the data of the corresponding object can be verified. This is extremely important for the reliability of data coding and transmission.
Second, type derivation
With the object type we can deduce its local type or related type. Thus, this partial type or related type may be omitted during the encoding process. Thus, the coding efficiency is greatly improved.
Third, code isolation
With object types, we can reuse the coding (in particular, reference coding) for different types. This also improves the effectiveness of the encoding, and the security.
In addition, in this embodiment we introduce OTF-8 coding, first, with respect to literal coding in OTF-8 coding, the target coding here is a literal coding. But unlike conventional literal coding, the coding and decoding process requires the participation of a coding warehouse. Thus, the encoding result and the decoding source can support non-standard characters. The data of the nonstandard characters exists in the code repository.
This literal code is based on UTF-8 and we call OTF-8.OTF-8 takes one byte as a unit, and no byte order problem exists. It is backward compatible with UTF-8. That is, any UTF-8 content can be directly decoded in the form of OTF-8 encoding, and the decoding result is completely consistent with the UTF-8 decoding result.
Second, with respect to the encoded digital representation of OTF-8, OTF-8 can encode digits of 0 to 128 bits in addition to conventional UTF-8 characters. Variable length coding is used here: for 0 to 31, one byte is used; for 32 to 255, two bytes are used; 2 8 To 2 16 -1, represented by three bytes; and so on. Specifically, bytes starting with 100 represent 0 to 31, and the five following binary digits correspond to specific numbers. For example, 0x80 (binary representation of bytes 10000000) represents 0,0x81 (10000001) represents 1,0x82 (10000010) represents 2. For numbers greater than or equal to 32 we use the first byte starting at 101 to represent the number of bytes afterThe number is followed by the big-end digital code of the corresponding byte number (the high order is in front, the low order is in back, and the high order is complemented by 0). 0xA0 (10100000) indicates that 1 byte follows for representing a number; 0xA1 (10100001) indicates a two byte number followed; 0xA2 (10100010) indicates the last three bytes … … and so on until 0xAF (10101111) indicates 16 bytes, i.e. 128 bits, later. For example, 0x a 0x20 (10100000 00100000) represents the number 32;0xA0 0xFF (10100000 11111111) represents the number 255;0x a 10 x01 0x00 (0x10100001 0000000100000000) denotes the number 256;0xA2 0x01 0x00 0x00 (10100010 00000001 0000000000000000) indicates the number 65536. The corresponding coding details are shown in fig. 23.
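The number format described above can be sketched as an encoder and decoder pair. This is a minimal sketch that follows the byte layout given in the text; the function names `encode_number` and `decode_number` are hypothetical, and the choice to always emit the shortest form is an assumption (the decoder also accepts values padded with leading zero bytes).

```python
def encode_number(n: int) -> bytes:
    """Encode an integer of up to 128 bits in the OTF-8 number format."""
    if n < 0 or n >= 1 << 128:
        raise ValueError("number must fit in 128 bits")
    if n < 32:
        return bytes([0x80 | n])               # 100xxxxx: value in 5 bits
    size = (n.bit_length() + 7) // 8           # payload bytes needed (1..16)
    return bytes([0xA0 | (size - 1)]) + n.to_bytes(size, "big")


def decode_number(data: bytes, pos: int = 0):
    """Decode one number at data[pos]; return (value, next position)."""
    lead = data[pos]
    if lead & 0xE0 == 0x80:                    # 100xxxxx: small number
        return lead & 0x1F, pos + 1
    if lead & 0xF0 == 0xA0:                    # 0xA0..0xAF: length prefix
        size = (lead & 0x0F) + 1
        end = pos + 1 + size
        if end > len(data):
            raise ValueError("truncated number")
        return int.from_bytes(data[pos + 1:end], "big"), end
    raise ValueError("not an OTF-8 number lead byte")
```

The examples from the text round-trip under this sketch: 32 becomes the bytes A0 20, 256 becomes A1 01 00, and 65536 becomes A2 01 00 00.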
Finally, regarding object reference coding in OTF-8: unless there is a special tag or a special context, a number appearing in OTF-8 is by default used as the reference code of an object in the code repository.
The following briefly describes the coding space, the code directory entry, and the meta-code.
This coding is done mainly through numeric numbering, and the numbering is hierarchical. The hierarchy is reflected mainly in the layering of the coding spaces within the code repository.
To give access to the various codes in a coding space, every OTF-8 coding space has one and only one code directory. Each code directory entry includes a code type and a coding path. The coding path may be a sequence of contexts leading from the current coding space to another coding space. For example, when the coding path is "current user", the corresponding coding space is the current user's subspace of the current space. When the coding path is empty (contains no context), the coding space to which the code belongs is the coding space in which the code directory entry is located. The coding path may also be a string, i.e. a name, in which case the corresponding coding space is the named subspace of the current space. When the code type of a code directory entry is the coding space, the data object corresponding to the code is the target space, and the code is called a space code.
The number corresponding to a code directory entry is the directory entry code.
Both directory entry codes and space codes are meta-codes: they do not correspond to specific data object instances, but to the metadata objects of those objects, namely the directory entry and the coding space respectively. A meta-code followed by an instance code forms the complete object code.
By default, coding starts from the root coding space of the current code repository. For example, the code directory of the root space of a code repository is shown in Table 1 below:
Table 1
Number | Type | Coding path
00 | Code directory entry | (empty)
01 | Type provider | (empty)
02 | Storage drive | (empty)
03 | Coding type | (empty)
04 | Coding context | (empty)
05 | User | (empty)
06 | Application | (empty)
07 | Document | (empty)
08 | Coding space | User
09 | Coding space | Application
10 | Coding space | Document
11 | Handwritten text | (empty)
12 | Handwritten text | User
We can then represent the user numbered 256 with the two-level number 05|256. With the OTF-8 number coding scheme described above, the reference code of this user object can be represented in four bytes:
Figure SMS_25
Here, the directory entry code "10000101" is the meta-code of the user object code; the following "10100001 00000001 00000000" is the instance code of the object code.
Assume that the code directory of the current user's coding space is as shown in Table 2 below:
Table 2
Number | Type | Coding path
00 | Code directory entry | (empty)
01 | Application | (empty)
02 | Document | (empty)
03 | Coding space | Application
04 | Coding space | Document
05 | Handwritten text | (empty)
We can then represent the 256th handwritten text of the current user with the three-level number 08|05|256. The reference code of this handwritten text object can be represented in five bytes:
Figure SMS_26
Here, the directory entry code "10001000" of the root space corresponds to the user coding space, i.e. it is a space code. The following "10000101" is the directory entry code for number 05 in the user space. The space code and the directory entry code together thus form the meta-code "10001000 10000101" of the handwritten text object; the following "10100001 00000001 00000000" is the instance code of the object code.
Note that the code directory entry numbered 11 in the root space code directory has the same type as the entry numbered 05 in the current user space, but their data objects come from different coding spaces: one from the root space and one from the current user space. In fact, the data object pointed to by the entry numbered 12 in the root space code directory is the handwritten text in the current user space. Therefore, the data object corresponding to the code above can also be represented by the two-level number 12|256, as follows:
Figure SMS_27
Here one byte is saved; only four bytes are needed.
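The byte counts quoted for the three hierarchical numbers can be checked with a short Python sketch. It assumes, as the examples in the text suggest, that a reference code is simply the concatenation of the OTF-8 number codes of each level; the names `encode_reference` and `as_bits` are hypothetical helpers introduced only for illustration.

```python
def encode_number(n: int) -> bytes:
    """OTF-8 number: one byte for 0..31, else a length byte plus big-endian payload."""
    if n < 32:
        return bytes([0x80 | n])
    length = (n.bit_length() + 7) // 8
    return bytes([0xA0 | (length - 1)]) + n.to_bytes(length, "big")


def encode_reference(path: str) -> bytes:
    """Encode a hierarchical number such as '08|05|256' level by level."""
    levels = [int(part) for part in path.split("|")]
    return b"".join(encode_number(level) for level in levels)


def as_bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit groups, as in the text."""
    return " ".join(f"{byte:08b}" for byte in data)


for path in ("05|256", "08|05|256", "12|256"):
    print(path, "->", as_bits(encode_reference(path)))
```

Under this assumption, 05|256 and 12|256 each take four bytes and 08|05|256 takes five, matching the counts stated above.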
In addition, regarding the coding context and its setting: comparing the two codes above for the handwritten text object, their meta-codes differ in one respect. The former may correspond to different coding types depending on the current user, whereas the latter always corresponds to the handwritten text coding type. This is because the code directories of different users' coding spaces are not necessarily identical.
In fact, the coding space corresponding to code directory entry number 08 in the root space code directory is not a fixed coding space, but the coding space of the user determined by the current "user" context object. The corresponding coding space differs depending on the current user.
A context is a role that appears in the system while the coding is in use; it actually corresponds to a specific object, called the context object. The context object may be determined before the coding is used; for example, a user login determines the current "user" context. Context objects can also be switched dynamically during encoding; for example, in a multi-user chat application, the current user has to be switched back and forth within the chat-record document. We specify a current context object with a coding sequence starting with the specific byte 0xBD (10111101). This coding sequence is called the context-setting code, and its specific syntax is as follows:
Figure SMS_28
If the contexts in the root space are as shown in Table 3 below:
Table 3
Number | Type | Name
00 | Code repository | (empty)
01 | Coding meta-object | Default meta-object
02 | User | Current user
03 | Application | Current application
04 | Document | Current document
05 | (empty) | (empty)
Then, the following codes:
Figure SMS_29
sets the user object numbered 256 (05|256) as the current user (04|02). This 7-byte setting affects all user-related codes that follow it, until the context is set again.
Further, regarding the coding terminator: the "current user" is just one coding context, and many different coding contexts may occur from application to application. One common system context is the "default meta-object". As mentioned earlier, the system's default meta-object is the root space of the current code repository. This "default meta-object" can be changed with the context-setting code described above.
In conventional text coding there is the concept of a code point, with one code point corresponding to one character. OTF-8 has a similar concept, except that an OTF-8 code point may correspond to a Unicode code point, an OTF-8 number, or a complete setting such as the context setting above. How, then, is a meta-object represented in the encoding? If the meta-code were used directly, the codes following it could be misinterpreted as instance codes. We therefore use a specific byte, called the coding terminator, to tell the decoder where the code point ends. This byte is 0xB8 (10111000). The following code sets the meta-object corresponding to code directory entry number 12 in the root space code directory as the default meta-object:
Figure SMS_30
After this setting, the original two-level number 12|256 becomes the one-level number 256. The previous encoding:
Figure SMS_31
is then decoded as two object codes: the first is the current user's private handwritten text numbered 12, and the second is the current user's private handwritten text numbered 256.
It follows that the coding terminator is used mainly in codes that correspond to meta-objects.
Further, regarding the root space prefix: after the system default meta-object has been changed, some objects still need to be encoded starting from the root space. In OTF-8 we use a special byte to represent the root space, called the root space prefix. Thus, the following code is independent of the current default meta-object:
Figure SMS_32
It corresponds to the two-level number 12|256 starting from the root space.
For all object reference codes in OTF-8, a code without the root space prefix is decoded starting from the current default meta-object.
Still further, regarding system client codes: we have seen that setting the default meta-object shortens the codes and improves coding efficiency. In some cases, however, several kinds of codes belonging to different coding spaces occur in one document, and the system default meta-object can improve coding efficiency for only one of them. OTF-8 therefore provides 8 system client codes that can be bound to arbitrary coded objects (including coding meta-objects). Each is one byte:
10110000
10110001
10110010
10110011
10110100
10110101
10110110
10110111
We likewise specify the data object bound to a client code with a coding sequence starting with the specific byte "10111101". This coding sequence is called the client-code-setting code, and its specific syntax is as follows:
Figure SMS_33
For example, the following setting code binds the client code "10110000" to the user object corresponding to the two-level code 05|256.
Figure SMS_34
Once a client code has been defined, it can replace the code of the data object bound to it. The following code:
Figure SMS_35
then has exactly the same semantics as the 7-byte context-setting code above. Here, the original four-byte object code is replaced by a one-byte client code.
Still further, regarding the representation of objects in OTF-8: as mentioned above, numbers appearing in OTF-8 are by default reference codes for objects in the code repository. How, then, is a number itself represented directly in OTF-8? More generally, how is an object itself encoded directly, rather than its reference number?
The answer is automatic type derivation, together with direct object coding using type codes.
Regarding type derivation: in decoding OTF-8 content, the classical unification algorithm may be used for type derivation. All OTF-8 content has a type; the default type is the OTF-8 string type, i.e. an array of root (generic) objects. During decoding, the system maintains a decoding type stack. The top of the stack is the specific type currently to be parsed; once the data object corresponding to the current type has been parsed, the top of the stack is replaced by the type of the next element of the current structure. When the current structure is complete, the top of the stack is popped, and the new top is the next element of the parent structure.
For example, there is the following structure:
Figure SMS_36
When this type is parsed, the first number encountered is parsed as an integer rather than as an object reference code. If the content parsed at this point is not an OTF-8 number, it is a data type error. The type information here thus also provides a basis for type checking.
When the second element of the type is parsed, the system accepts either an integer or a string according to the type. Because the coding formats of numbers and strings in OTF-8 are completely different, the parser can determine the actual type of the data object from the coding format.
When the third element is parsed, the coded forms of its two types overlap, since byte is a subset of int, so type inference by the parser alone is difficult. OTF-8 provides a system context, "current parse type", which allows the type of the next data object to be pinned down. One can use
Figure SMS_37
to specify that the next data object is of the byte type, or use
Figure SMS_38
to specify that the next data object is of the int type.
When the "current parse type" context is set, incompatible types cannot be used. In this example, int32 is compatible with int and can therefore be used; the string type, however, is compatible with neither byte nor int, and setting it as the "current parse type" results in a type error.
Regarding direct object coding: besides coding an object directly by setting the "current parse type" as described above, OTF-8 also allows the reference code of a coding type, or of a code directory entry, to be followed immediately by the code of the corresponding data content.
For a parameterized type, the type code must be followed immediately by the list of codes of the types corresponding to its type parameters.
Thus, the basic types of all data objects that need to be represented in OTF-8 must be stored in the code repository. In the root space code directory above, the code directory entry numbered 03 is the coding type. The corresponding information is shown in Table 4 below:
Table 4
Number | Coding type
00 | Type
01 | Unsigned integer
02 | Signed integer
03 | Floating point number
04 | GUID
05 | Boolean
06 | UTF-8 character
07 | UTF-8 character string
08 | Object reference
09 | Nullable object
10 | Array
11 | Tuple
12 | Dictionary
The various types of data objects are then represented as follows:
1. Representation of numbers
2. Representation of unsigned integers
For unsigned integers, the data is placed directly after the unsigned integer type code. For example, the following code represents the number 256:
Figure SMS_39
3. Representation of signed integers
Signed integers need to be represented by unsigned integers; ZigZag coding is used here.
ZigZag coding represents non-negative integers by even numbers and negative integers by odd numbers, as shown in the following table:
Figure SMS_40
Figure SMS_41
A ZigZag-coded unsigned integer n can be decoded into the corresponding signed integer by the following algorithm: (n >> 1) ^ -(n & 1)
The following code represents the signed number 128:
Figure SMS_42
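As a hedged illustration of the ZigZag mapping, the sketch below uses the conventional ZigZag scheme (0, -1, 1, -2, ... map to 0, 1, 2, 3, ...), which matches the even/odd description above; the function names are hypothetical, and Python's unbounded integers stand in for whatever fixed width an implementation would choose.

```python
def zigzag_encode(value: int) -> int:
    """Map a signed integer to an unsigned one: non-negative -> even, negative -> odd."""
    return value * 2 if value >= 0 else -value * 2 - 1


def zigzag_decode(n: int) -> int:
    """Inverse mapping, using the shift-and-xor form of the decoding algorithm."""
    return (n >> 1) ^ -(n & 1)


for signed in (0, -1, 1, -2, 128, -128):
    unsigned = zigzag_encode(signed)
    assert zigzag_decode(unsigned) == signed
    print(f"{signed:5d} -> {unsigned}")
```

Under this mapping the signed number 128 becomes the unsigned number 256, which would then be written as an ordinary OTF-8 number after the signed integer type code.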
4. Representation of floating point numbers
For floating point numbers, OTF-8 directly adopts the IEEE 754 standard. The common single-precision 32-bit (four-byte) and double-precision 64-bit (eight-byte) formats are supported, represented by four-byte and eight-byte OTF-8 numbers respectively. The numeric part is encoded big-endian. The specific forms are:
Figure SMS_43
and
Figure SMS_44
Half-precision and quadruple-precision floating point numbers may also be supported if desired.
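A minimal Python sketch of the floating point forms follows. It assumes that "a four-byte OTF-8 number" means the length byte 0xA3 followed by the four IEEE 754 bytes, and that "an eight-byte OTF-8 number" means 0xA7 followed by eight bytes, which follows from the number rules given earlier (0xA0 announces one byte). The type code that would precede the value is omitted, and the function names are hypothetical.

```python
import struct


def encode_float32(x: float) -> bytes:
    """Length byte 0xA3 (4 bytes follow) + big-endian IEEE 754 single precision."""
    return b"\xA3" + struct.pack(">f", x)


def encode_float64(x: float) -> bytes:
    """Length byte 0xA7 (8 bytes follow) + big-endian IEEE 754 double precision."""
    return b"\xA7" + struct.pack(">d", x)


def decode_float(data: bytes) -> float:
    """Decode either form, selecting the precision from the length byte."""
    lead, payload = data[0], data[1:]
    if lead == 0xA3 and len(payload) == 4:
        return struct.unpack(">f", payload)[0]
    if lead == 0xA7 and len(payload) == 8:
        return struct.unpack(">d", payload)[0]
    raise ValueError("not a 4-byte or 8-byte OTF-8 floating point number")


print(encode_float32(1.0).hex(" "))
print(len(encode_float64(6e23)), "bytes for a double")
```

A double therefore occupies nine bytes in this form: one length byte plus eight payload bytes.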
Representation of GUID
Similarly, the GUID can be directly represented by a 16 byte number in the form:
Figure SMS_45
5. Representation of Boolean values
OTF-8 directly defines two special bytes to represent Boolean values.
Byte 0xBB (10111011) represents logical true; byte 0xBC (10111100) represents logical false.
Representation of characters and character strings
OTF-8 can directly represent UTF-8 characters and strings. To separate consecutive strings, by convention a string in OTF-8 may end with "0x0" (if it does not end with "0x0", the string ends at the last consecutive character); a string consisting of a single "0x0" character is an empty string.
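The string convention can be illustrated with a small Python sketch. It handles only a run of consecutive strings (no numbers or markers in between), and the names `encode_strings` and `decode_strings` are hypothetical; terminating every string, including the last, is a simplifying assumption.

```python
def encode_strings(strings: list[str]) -> bytes:
    """Encode consecutive strings as UTF-8, each terminated by a 0x00 byte."""
    out = bytearray()
    for text in strings:
        if "\x00" in text:
            raise ValueError("0x00 is reserved as the string terminator")
        out += text.encode("utf-8") + b"\x00"
    return bytes(out)


def decode_strings(data: bytes) -> list[str]:
    """Split on 0x00; an unterminated final piece still counts as a string."""
    parts = data.split(b"\x00")
    if parts[-1] == b"":
        parts.pop()  # artefact of the terminator that closes the last string
    return [part.decode("utf-8") for part in parts]


encoded = encode_strings(["ab", "", "c"])
print(encoded)                  # a lone 0x00 between strings is an empty string
print(decode_strings(encoded))
```

Note how the empty string in the middle is represented by a single 0x00 byte, as the text describes.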
6. Representation of complex objects
Complex objects are composed of simple objects combined according to certain rules. OTF-8 needs two special system markers for them: the object start marker, represented by byte 0xFE (11111110), and the object end marker, represented by byte 0xFF (11111111). The content of the data object is encoded between the start and end markers.
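The special bytes named in this description can be collected into a single lookup, which also shows why they do not collide with UTF-8: none of them can begin a UTF-8 character. The sketch below is only a summary of the byte values stated in the text (the separator 0xBA and the default value marker 0xBE are defined in other paragraphs of this description); the function name and the category labels are hypothetical, and the classification applies to a byte at the start of a code point only.

```python
SPECIAL_BYTES = {
    0xB8: "coding terminator",
    0xBA: "separator",
    0xBB: "true",
    0xBC: "false",
    0xBD: "setting code",
    0xBE: "default value marker",
    0xFE: "object start",
    0xFF: "object end",
}


def classify_byte(byte: int) -> str:
    """Classify the first byte of an OTF-8 code point."""
    if byte < 0x80:
        return "ASCII character"
    if byte <= 0x9F:
        return "number 0..31"
    if byte <= 0xAF:
        return "number length prefix"
    if byte <= 0xB7:
        return "client code"
    if byte in SPECIAL_BYTES:
        return SPECIAL_BYTES[byte]
    if 0xC2 <= byte <= 0xF4:
        return "UTF-8 multi-byte lead"
    return "not described in the text"


for sample in (0x41, 0x85, 0xA1, 0xB0, 0xB8, 0xE4, 0xFE):
    print(f"0x{sample:02X}: {classify_byte(sample)}")
```

Because bytes 0x80 to 0xBF, 0xFE and 0xFF never start a character in valid UTF-8, a decoder can tell OTF-8 extensions from ordinary text by looking at the first byte alone.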
Further, in this embodiment, regarding OTF-8 and its type system: as we have seen, the coding type is critical to the representation of objects in OTF-8. In fact, OTF-8 is built on an extensible, complete type system. OTF-8 has some basic types built in: the integer type, Unicode character type, Boolean type, floating point type, Unicode string type, and OTF-8 string type (in fact, an object array). OTF-8 also supports parameterized types; the built-in ones include the coding reference type, nullable type, tuple type, array type, and dictionary type. OTF-8 allows users to define their own structures, interfaces and services, and to inherit from and extend existing types. In addition, users may introduce external coding methods to extend the existing types.
OTF-8 defines a coding type definition language, through which users can define new types. This definition language is independent of any existing programming language, but mappings can be established between it and the elements of existing programming languages, enabling automatic conversion between them: for example, generating type declarations of a specific programming language from the type descriptions in the code repository, or extracting coding type definitions from the source code or build output (executable file) of a specific programming language. In this type definition language, the built-in types have compact notations; the correspondence is shown in Table 5 below:
Table 5
Figure SMS_46
Figure SMS_47
Regarding type identifiers: every OTF-8 type has a unique type identifier. To guarantee uniqueness, specific naming conventions are typically employed, such as prescribed separators, namespaces, and naming rules.
Regarding the root type: all data objects that OTF-8 can express share a common root type. The standard UTF-8 string thus corresponds to the object string of OTF-8. This root type is the "OpenCode.Object" type. From any OpenCode.Object, its coding type and coding space can be obtained.
Figure SMS_48
In the type definition syntax of OTF-8, the root type is represented by an asterisk (*), which in effect stands for any type.
Regarding the null type: the null type is the type that corresponds to no data object. For example, the context settings and client code settings described above correspond to the null type. In the type definition syntax of OTF-8, the null type is represented by the symbol "()"; the return type of a method or function may be omitted if it is null. For example, the following function:
Start()
denotes a function whose input type and return type are both null. The corresponding type is
()->()
Simple types and complex types
Whether a type is simple or complex is a matter of how it is expressed in the coding. In OTF-8, the simple types are: the coding reference type, integer type, Boolean type, floating point type, Unicode character type, Unicode string type, and their extension types. Except for the Unicode string type, which corresponds to multiple objects, each of these corresponds to a single object. Simple types can be expressed directly in OTF-8 coding.
Regarding type aliases: a type alias defines a new type as another name for an existing type. The corresponding coding type definition syntax is as follows:
<new type identifier>:type <existing type identifier>
Such as:
MyTypes.YesOrNo:type OpenCode.Boolean
Regarding constraint types: existing simple types (mainly numeric, character, and string types) can be restricted by type constraints, yielding new constrained numeric, character and string types. The corresponding coding type definition syntax is as follows:
<new type identifier>:type <numeric, character, or string type>{constraint}
For a numeric type, the constraint is a range of values, such as:
OpenCode.Byte:type OpenCode.Integer{[0,255]}
This represents the integer type with values 0 to 255.
For character types, the constraint is a range of Unicode characters.
For string types, the constraints are a length limit on the string and a regular-expression matching pattern, such as:
PostalCode:type OpenCode.String{[0-9]{6}}
This represents the type of strings of exactly 6 digits.
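How a decoder might check such constraints can be sketched in Python. This is an assumption-laden illustration, not the type system of the embodiment: the helper names `make_range_type` and `make_pattern_type` are hypothetical, and treating a string pattern as a whole-string match is an assumption.

```python
import re


def make_range_type(low: int, high: int):
    """Build a checker for a numeric type constrained to [low, high]."""
    def check(value) -> bool:
        return isinstance(value, int) and low <= value <= high
    return check


def make_pattern_type(pattern: str):
    """Build a checker for a string type constrained by a regular expression."""
    compiled = re.compile(pattern)
    def check(value) -> bool:
        # fullmatch: the whole string must satisfy the pattern (assumption).
        return isinstance(value, str) and compiled.fullmatch(value) is not None
    return check


is_byte = make_range_type(0, 255)               # integer constrained to 0..255
is_postal_code = make_pattern_type(r"[0-9]{6}")  # string of exactly 6 digits

print(is_byte(255), is_byte(256))
print(is_postal_code("100080"), is_postal_code("10008"))
```

A value that fails its checker would be reported as a type error during decoding.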
Regarding parameterized types: OTF-8 also supports parameterized types, also known as generic types or template types. In a parameterized type, the sub-elements that make up the type are parameters rather than fixed types; the final type is determined once the parameters are specified. Take a general array type: specifying integer as its parameter gives an integer array type, and specifying string gives a string array type. All complex types in OTF-8 can be defined as parameterized types, and parameterized types can also be used directly within type definitions. In a definition, the type parameters are enclosed in angle brackets "<" and ">" after the type keyword (class, enum, type, etc.), with multiple parameters separated by commas. In use, the parameterized type identifier is followed by a comma-separated list of type arguments enclosed in angle brackets "<" and ">".
All or some of the parameters of a parameterized type can be fixed directly in the definition of a type alias.
For example, the parameterized dictionary type has two type parameters: a key type and a value type. A string-to-string dictionary can be defined as follows:
StringDictionary:type Dictionary<string, string>
A parameterized dictionary whose key type is integer can be defined as follows:
IntegerKeyDictionary:type<T> Dictionary<int, T>
Where T is a type parameter, corresponding to the value type of the dictionary.
When a data object of a parameterized type is encoded, the reference or type code of each type corresponding to a parameter must be given before the data object itself is encoded. The type references and the data object are separated by a special separator. The system separator object of OTF-8 is byte 0xBA (10111010); it is used to separate different syntax elements within a structure. For example, a data object of a parameterized dictionary type can be encoded directly as follows:
Figure SMS_49
Since none of the bytes 0xFE, 0x00 and 0xFF is a normally displayable character, they are highlighted here for distinction.
Regarding merge types: a merge type is a type that admits the coded forms of several types at once. A merge type is defined with the following syntax:
<new type identifier>:type <existing type identifier 1>{constraint 1}|<existing type identifier 2>{constraint 2}|…
Such as:
OpenCode.SmartFloat:type OpenCode.Float64|OpenCode.String{[+-]?[0-9]*(\.[0-9]+)?|-?[1-9]\.?[0-9]+([eE][-+]?[0-9]+)?}
With this type, a double-precision floating point number, which would otherwise require 9 bytes, can be expressed in fewer bytes where appropriate. For example, "1" takes only one byte, ".24356" only 6 bytes, and "6e23" only 4 bytes.
When a merge type is defined, recursive definition is allowed, i.e. the type being defined can be used directly in its own definition body. For example, a tree type can be defined as follows:
Tree:type<T> (T, Tree[])|T
A corresponding string tree data object is encoded as follows:
Figure SMS_50
Figure SMS_51
As can be seen, this is a tree of Chinese administrative divisions. The line breaks and spaces/tabs were added by hand for readability; these control characters are not present in the actual encoded content. Nevertheless, given the tree type defined above, an OTF-8 parser is able to encode, decode and validate the corresponding data objects.
Regarding the null object: unlike the null type, the null object is an object, not a type. The null object has its own special type (as opposed to the null type, which has no instances), which we call Null. This type has only one instance, namely the null object, and it is not used directly.
The null object indicates that the corresponding data object does not exist. We represent the null object directly with the coding terminator (0xB8).
Regarding nullable types: a nullable type is formed by merging any type with Null. It corresponds to data that may be absent. The type syntax can be described as follows:
Nullable:type<T> T|Null
The OTF-8 type system has built-in support for nullable types, with a shorthand in the type definition syntax: appending a question mark "?" directly to a type turns it into a nullable type. For example:
string?
represents a nullable string. The null object of this type and the empty string are two entirely different objects: the former indicates absence, the latter indicates that the content is an empty string.
Regarding array types: the array type is also a parameterized type, in which multiple data objects of a given type are arranged in sequence. The OTF-8 type system provides built-in support for array types, with a compact notation: appending a pair of square brackets to a type turns it into the corresponding array type.
A number inside the brackets limits the number of elements in the array.
For example, the following type is an integer array with no limit on the number of elements:
int[]
The following type is a string array with exactly 5 strings:
string[5]
When parsing the corresponding data object, the OTF-8 decoding system raises a type-checking error if the number of elements obtained is not 5.
The following type is a Boolean array whose number of elements can only be 5, 6 or 7:
bool[5..7]
In addition, OTF-8 supports the definition of multi-dimensional arrays, such as:
string[3][4..5]
This is a two-dimensional array of 3 rows and 4 or 5 columns. A particular array object can only be a 3×4 or a 3×5 array; it cannot have 4 columns in one row and 5 columns in another.
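The array bounds can be checked mechanically. The following Python sketch parses only the bracket suffixes shown above (`[]`, `[n]`, `[m..n]`) and checks a nested list against them, including the rule that all rows of a multi-dimensional array have the same length. The function names are hypothetical, and the parser ignores tuples, nullable marks and other parts of the type syntax.

```python
import re

_DIM = re.compile(r"\[(\d*)(?:\.\.(\d+))?\]")


def parse_array_type(spec: str):
    """Split e.g. 'string[3][4..5]' into ('string', [(3, 3), (4, 5)])."""
    base, bracket, rest = spec.partition("[")
    suffix = bracket + rest
    dims, pos = [], 0
    while pos < len(suffix):
        match = _DIM.match(suffix, pos)
        if match is None:
            raise ValueError(f"bad array suffix in {spec!r}")
        low, high = match.group(1), match.group(2)
        if low == "":
            if high is not None:
                raise ValueError(f"missing lower bound in {spec!r}")
            dims.append((0, None))  # '[]': any number of elements
        else:
            dims.append((int(low), int(high) if high is not None else int(low)))
        pos = match.end()
    return base, dims


def shape_ok(value, dims) -> bool:
    """Check element counts per dimension; rows must all have the same length."""
    if not dims:
        return True
    low, high = dims[0]
    if len(value) < low or (high is not None and len(value) > high):
        return False
    if len(dims) > 1:
        if len({len(row) for row in value}) > 1:
            return False  # mixed row lengths, e.g. 4 and 5 columns
        return all(shape_ok(row, dims[1:]) for row in value)
    return True


_, dims = parse_array_type("string[3][4..5]")
print(dims, shape_ok([["x"] * 4 for _ in range(3)], dims))
```

A 3×4 or 3×5 list passes the check, while a list mixing 4-column and 5-column rows is rejected.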
Regarding tuple types: a tuple type is also a parameterized type, whose parameters may be any number of arbitrary types. Its data is the corresponding data objects arranged in sequence. A tuple type with only one type is equivalent to that type. The null type is the tuple type that contains no types.
OTF-8 has built-in support for tuple types: a tuple is written as a list of type parameters enclosed in parentheses "(" and ")" and separated by commas.
For example, (int, string)[]? is the nullable array type of tuples consisting of an integer and a string.
When serialized/encoded, tuple objects must also be enclosed by the start (0xFE) and end (0xFF) markers.
Regarding dictionary types: the dictionary type is also a parameterized type, with two parameters: the key type and the value type. In essence it is an array of the corresponding tuple type, with one additional constraint: the keys of the array elements must be unique. OTF-8 has built-in support for dictionary types: the key and value types are separated by a colon (":") and enclosed in square brackets ("[" and "]"). For example:
[string:int]
represents a dictionary type mapping strings to numbers. The individual elements of a dictionary are not enclosed by start and end markers.
Regarding classes: like classes in object-oriented languages, classes in OTF-8 include members and methods. The syntax of a class definition is as follows:
Figure SMS_52
Figure SMS_53
When an object of the class is encoded, the contents of its member data objects are encoded in the order in which the members are declared. In addition, when a member has its default value, a system-defined special marker can be used to tell the system so. The default value marker is the special byte 0xBE (10111110).
When class members are defined, the system keyword context may be used. Member data marked with this keyword is stored in the corresponding coding space; member data without the marker is stored in the unified store.
For example, consider the following contact class:
Figure SMS_54
A corresponding data object is then encoded as follows:
Figure SMS_55
The data object is ultimately stored in the code repository. A contact is often stored in the address books of several users, so the contact's main information can be kept in shared storage and referenced by different users. The "nickname", however, generally varies from user to user; the "context" keyword here indicates that this field is stored in the target context space. On one possible data storage server, the context-independent part of the contact is stored as follows:
Contact ID | Name | Email address | Contact telephone
4623478 | Zhang San | zhangsan12345@sina.com | 13234567890
The context-dependent part is stored as follows:
Coding space ID | Contact number | Contact ID | Nickname
(coding space ID of user 1) | 005 | 4623478 | Elder Zhang
(coding space ID of user 1) | 007 | 4623478 | Three children
This ensures that different users can share the same contact, while each user's number and nickname for that contact are kept apart by coding space. Storage utilization improves because the context-independent part of the data object is not stored multiple times.
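The split between shared and context-dependent storage can be modelled with two dictionaries. This is a sketch under stated assumptions: the coding space IDs "space-A" and "space-B" are hypothetical placeholders (the text does not give concrete IDs), and the dictionary layout and the `load_contact` helper are illustrative, not the storage format of the embodiment.

```python
# Context-independent part: stored once, keyed by contact ID.
shared_contacts = {
    4623478: {
        "name": "Zhang San",
        "email": "zhangsan12345@sina.com",
        "phone": "13234567890",
    },
}

# Context-dependent part: keyed by (coding space ID, contact number).
contextual_contacts = {
    ("space-A", 5): {"contact_id": 4623478, "nickname": "Elder Zhang"},
    ("space-B", 7): {"contact_id": 4623478, "nickname": "Three children"},
}


def load_contact(space_id: str, number: int) -> dict:
    """Merge the shared record with the fields private to one coding space."""
    private = contextual_contacts[(space_id, number)]
    record = dict(shared_contacts[private["contact_id"]])
    record["nickname"] = private["nickname"]
    return record


print(load_contact("space-A", 5))
print(load_contact("space-B", 7))
```

Both lookups return the same name, email address and telephone number from the single shared record, but a different nickname per coding space.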
Unlike object methods in object-oriented programming languages, methods in OTF-8 are syntactic definitions only. A defined method can be applied directly in OTF-8 encoded documents. The definition determines the type of the method, and both client and server must use this type information to verify that each application of the method is syntactically correct. The method itself is ultimately executed by a remote service.
Regarding interfaces: an interface contains only methods. An interface is an abstract type that mainly defines the roles of objects and the interaction protocols between them. An interface is ultimately implemented by a class.
Figure SMS_56
For example:
Figure SMS_57
Inheritance and implementation
As with classes in an object-oriented programming language, a class may be a subclass of another class, and an interface may be a sub-interface of another interface. Classes may also implement interfaces. Interfaces in OTF-8 support single inheritance; classes likewise support only single inheritance, i.e. a class can derive from at most one class, but it can implement several interfaces at once.
The members of a subclass object are encoded in inheritance-chain order starting from the root object: the members of all ancestor classes, then those of the parent class, then its own. The methods of a subclass are likewise numbered in inheritance-chain order: the methods of all ancestor and parent classes, the methods of the implemented interfaces, and then the methods it defines itself.
Figure SMS_58
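The member ordering along the inheritance chain can be sketched as follows. The class names (Object, Person, Contact) and member names are hypothetical examples, and the `ClassDef` structure is an illustrative stand-in for a class definition in the code repository; only the ordering rule itself comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassDef:
    name: str
    members: list[str]
    parent: Optional["ClassDef"] = None  # single inheritance: at most one parent


def encoding_order(cls: ClassDef) -> list[str]:
    """List members root-first: ancestors, then parent, then the class itself."""
    chain = []
    current: Optional[ClassDef] = cls
    while current is not None:
        chain.append(current)
        current = current.parent
    order = []
    for definition in reversed(chain):  # start from the root object
        order.extend(f"{definition.name}.{member}" for member in definition.members)
    return order


root = ClassDef("Object", ["id"])
person = ClassDef("Person", ["name", "email"], parent=root)
contact = ClassDef("Contact", ["nickname"], parent=person)
print(encoding_order(contact))
```

An encoder following this rule would write the root's members first and the subclass's own members last.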
Regarding the coding reference type, the coding reference type is a parameterized type, and its parameters can only be a class. The content of the data object is the corresponding number of the type storage in the corresponding coding space of the object. The coding reference type is the most important type in OTF-8. With this type, we can refer to the data in the code repository by code number. The code directory entries in the code repository also appear as meta-codes in the form of code references. In the type grammar definition of OTF-8, we represent the corresponding coding reference type with a class identification followed by a "#". For example:
Figure SMS_59
This denotes the reference type corresponding to the "contact" class; an instance of it is the reference code of the corresponding entry in the code repository.
Enumeration
There are two types of enumeration in OTF-8, one is symbol enumeration and one is object enumeration.
Symbol enumeration is the same as the enumeration type in common programming languages: a list of numbered symbols, defined as a set of named integers. The grammar form is as follows:
< New type identification >: enum { < name 1[ =number 1] >, < name 2[ =number 2] >, … }
Unlike the enumeration type in a common programming language, the object enumeration type of OTF-8 is a parameterized type, which defines a set of named objects. The grammar form is as follows:
< New type identification >: enum < < enumerated type > > { < object 1[ =number 1] >, < object 2[ =number 2] >, … }
For example:
week: enum < string > { "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday" }
When the objects have no explicit numbers, the first object is encoded as 0 and the following objects are numbered consecutively.
The number corresponding to the name may also be explicitly specified, such as:
Poker.Figure:enum<string|int>{3=3,4=4,5=5,6=6,7=7,8=8,9=9,10=10,“Jack”=11,“Queen”=12,“King”=13,“A”=14,“2”=15,“Black Joker”=16,“Red Joker”=17}
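The numbering of enumerated objects can be illustrated with a short sketch. The function name `number_enum` is hypothetical, and the rule that an unnumbered object following an explicitly numbered one receives the previous number plus one is an assumption borrowed from common programming languages; the text only states that numbering starts at 0 and that numbers may be given explicitly.

```python
from typing import Dict, Hashable, List, Optional, Tuple


def number_enum(items: List[Tuple[Hashable, Optional[int]]]) -> Dict[Hashable, int]:
    """Assign a code number to each enumerated object.

    Each item is (object, explicit_number_or_None).
    Assumption: an object without an explicit number gets the previous
    number plus one, and the very first defaults to 0.
    """
    numbers: Dict[Hashable, int] = {}
    next_number = 0
    for obj, explicit in items:
        number = explicit if explicit is not None else next_number
        numbers[obj] = number
        next_number = number + 1
    return numbers


days = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]
week = number_enum([(day, None) for day in days])

# Hypothetical mixed definition: explicit and implicit numbers together.
mixed = number_enum([("low", None), ("high", 10), ("max", None)])
```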
In practice, the type definition language of OTF-8 supports object descriptions for all types; these are mainly used in object enumeration type definitions and as default values in class definitions.
Service
Unlike an object method, a service does not belong to any particular object; it is a set of functions, typically corresponding to a network service on some node in the network.
Figure SMS_60
For example, a digital weather forecast network service may be defined as follows:
Figure SMS_61
Regarding external types: in addition to the built-in types described above, OTF-8 can support external types through type providers, and can thereby accommodate any existing encoding format.
Existing encoding formats fall into just two kinds: text encoding and binary encoding. Text encoding corresponds to the string type and can be expressed directly in OTF-8. For binary encoding, OTF-8 uses a specific flag byte, 0xBF (10111111), to introduce a binary byte stream. The flag is followed by an OTF-8 integer giving the size of the byte stream, and then by the byte stream itself.
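The framing of an embedded binary block (flag byte, length, payload) can be sketched as below. The flag value 0xBF comes from the text; the encoding of the length does not. This excerpt does not define the OTF-8 integer format, so the sketch substitutes a LEB128-style variable-length integer purely as a stand-in, and all function names are hypothetical.

```python
from typing import Tuple

BINARY_FLAG = 0xBF  # flag byte named in the text (10111111)


def encode_uint(value: int) -> bytes:
    """Stand-in length encoding (LEB128 varint), NOT the real OTF-8 integer."""
    if value < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a stand-in varint at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_binary_block(payload: bytes) -> bytes:
    return bytes([BINARY_FLAG]) + encode_uint(len(payload)) + payload


def decode_binary_block(buf: bytes, pos: int = 0) -> Tuple[bytes, int]:
    """Return (payload, position just after the block)."""
    if buf[pos] != BINARY_FLAG:
        raise ValueError("not a binary block")
    length, pos = decode_uint(buf, pos + 1)
    if pos + length > len(buf):
        raise ValueError("truncated binary block")
    # The payload is skipped over as a unit and never scanned for code points.
    return buf[pos:pos + length], pos + length


block = encode_binary_block(b"\xbf\x00\xff")
payload, end = decode_binary_block(block)
```

Because the decoder jumps over the payload by length, a 0xBF byte inside the payload is never mistaken for a new flag.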
Building on this support for text and binary content, the OTF-8 coding system supports the specific syntax and semantics of different encodings by providing a different coding driver for each encoding type.
Specifically, in the present embodiment, two examples are described below.
The first example concerns XML encoding.
XML is a text-based markup language. In OTF-8, support can be provided in two ways.
One way is to embed the content of the XML document directly into the OTF-8 document, where it is in effect an OTF-8 string object. Through the (embedded) XML type provider, however, the Document Object Model (DOM) of that object can be obtained and accessed.
The other way is to extend the XML type system directly into OTF-8. XML is a metalanguage; the grammatical structure of a specific XML document can be defined in languages such as DTD, XML Schema, and RELAX NG. For example, SVG, the standard vector graphics format for the web, is defined by a DTD. Through the DTD type provider, the DTD definition of SVG can be read in and parsed to generate a corresponding series of element types and attribute types. There are certain relationships and constraints between these types, which can be used for syntax checking and type derivation. The DTD type provider (a mapping type provider) generates a corresponding space in the coding repository according to the DTD definition of SVG and encodes the resulting type objects directly in it. An SVG document can then be encoded directly as data objects of the SVG types (the corresponding element types and attribute types) in the coding repository. This encoding is much more efficient than the traditional XML text form, and the existing XML technology heritage can be reused to the greatest extent.
For example:
Figure SMS_62
Figure SMS_63
The above is the content of an SVG file; its rendering result is shown in Fig. 24.
By means of the DTD type provider we obtain a series of SVG element and attribute types. As shown in Fig. 24, it is easy to see that most of the redundancy in XML comes from the element names and attribute names that serve as syntax marks, and from the system characters that separate node names from node values, such as ">", "<", "/", and "=". Since OTF-8 can use open coding, free of the restrictions of standard coding, to encode the information items of the corresponding XML information set (XML Infoset) directly, this redundancy can be greatly reduced.
Some properties of the XML information items can be placed in the code repository and referenced directly by their codes. The resulting type information in the code repository is as follows:
Figure SMS_64
The relevant code repository data for the type xml.info set.element is as follows:
Figure SMS_65
The relevant code repository data for the type xml.info set.attribute is as follows:
Figure SMS_66
Figure SMS_67
With OTF-8 encoding, the original SVG document can be represented as follows:
Figure SMS_68
The document object model is identical to the previous one, but the data content of the latter is only 380 bytes, compared with more than 980 bytes for the former, a saving of more than 60% in data volume.
Looking at the above OTF-8 document and comparing it with the earlier example of the string tree of Chinese administrative divisions, we find that this document contains many type labels, such as the green element labels and the blue-green attribute labels. This is because the type expressiveness of DTD is limited and attribute types are mostly string types, so the correct type is hard to obtain by type derivation, and type labels are therefore indispensable. In fact, the types generated by type providers based on XML Schema or RELAX NG are richer, so the resulting XML OTF-8 documents are more compact and efficient.
The second example concerns Protocol Buffers encoding.
Google's Protocol Buffers is likewise an object serialization format with a schema. Its type definition language can be used directly to define the corresponding types, and through the Protocol Buffers type provider, a binary data object encoded with Protocol Buffers can be mapped to a data object of an OTF-8 type. Specifically, OTF-8 defines a system code, 0xBF (10111111), as the start marker of an embedded binary data block. This flag byte is followed by an integer (in open-coded form) giving the number of bytes in the binary data block, and then by the binary byte stream itself.
In fact, given type derivation, it would be sufficient for a binary data type to be followed directly by the block length and the data block. The binary data block marker is introduced mainly to guarantee the reliability of code parsing. Because any code point of OTF-8 (including system codes) may occur in a binary stream, the parser must never attempt to interpret an embedded binary stream without data meta-information (including type information). The binary marker system code ensures this.
It can be seen that in OTF-8, the "type provider" is the key to implementing an existing coding standard or custom coding scheme.
In fact, OTF-8 defines the corresponding types and the type combination rules for all code points, which together constitute the OTF-8 type system. There are two kinds of type provider. One is the mapping kind, which maps the specific types in an external type definition into the OTF-8 type system, so that the external encoding can be reconstructed in the OTF-8 way. Its advantage is that, while the original schema definition is kept, the benefits of the code repository are added, such as a safer authorized-access model for metadata, centralized metadata sharing, and a more compact code form. The "DTD type provider" in the earlier SVG example is of this mapping kind.
The other kind of type provider is the embedded kind, in which data in the external encoding is embedded whole into the OTF-8 code as a single data type. The original encoder and decoder encode and decode the corresponding content directly to form the corresponding OTF-8 object. Specifically, for a text-based serialization method, a UTF-8 string is embedded (if the original encoding is not UTF-8, a code conversion is needed); for a binary serialization method, what is embedded is the block described above: the 0xBF binary flag, then the block length, then the binary block content. The XML type provider described earlier embeds text, and the Protocol Buffers type provider embeds binary data.
In summary, OTF-8 is a specific coding system based on an object-based, context-dependent coding method. Building on a complete built-in type system, it can encode references to data objects in the coding repository, and it can also encode the content of objects directly, efficiently and securely (with the coding metadata, including type information, kept in the coding repository).
Referring to Fig. 25, the code points of OTF-8 beyond those of UTF-8 are listed. In addition, under this code definition a number of codes remain to be defined for system extension; they are all shown in Fig. 26. For example, the double byte 0x a 0x00 can be defined as a function/method application. In this way, support for Remote Procedure Call (RPC) can be provided on top of OTF-8, which is much more efficient than existing approaches such as XML-RPC and SOAP.
Similarly, in this embodiment, further Unicode extensions such as OTF-16 and OTF-32 may be introduced, extending UTF-16 and UTF-32 respectively. Concepts and components such as the coding warehouse, the object-based context coding method, and the type system are identical to those of OTF-8. The main difference is that the specific definition of the open codes (mainly the number codes and the system codes) differs according to the corresponding Unicode encoding form; the details are not repeated here.
Further, the method may further include:
Carrying out normalization processing on the data content whose corresponding coded content is a reference code.
In this embodiment, in addition to the most basic codec services, the processing system based on the coding warehouse of the present invention can provide various analysis and processing services for coded data (byte streams) by using the coding metadata and related services of the coding warehouse. These services fall into two levels. The first is a code analysis and processing service that does not depend on the specific coded data content. It mainly performs statistical analysis on specific codes of specific users and specific types, and stores the results for further use, for example by a text retrieval service. This level is called the literal code service layer. Because it processes only the codes and needs no information about the corresponding text content, the security of the text content and the personal privacy of the user are fully protected, which is difficult to achieve with standardized text. The second level provides various services over both the literal codes and their corresponding data, to make it easier for applications to use the new data processing system. This level is called the literal content service layer. The analysis results of the first level can be used directly by the second level.
In conventional data processing systems, literal coding is used not only for word processing but also for expressing and delivering general-purpose data. General structured-text and special-purpose text processing techniques have emerged one after another, such as the SGML/XML family (and HTML, SVG, MathML, etc. built on it), programming language processing techniques, domain-specific modeling languages, and so forth. The new data processing system is built entirely on top of the traditional data processing system. Besides bringing a brand-new concept of personalized word processing, it allows open coded text based on a coding warehouse to be introduced into existing text data processing technology, so that a safer and more efficient text data processing technology can be formed with only slight modification of the prior art. The word processing system in the new data processing system therefore has two aspects: a new word processing system and a new text data processing system. These two aspects may also be combined, for example in processing based on a handwriting programming language.
Optionally, other services or applications may also be provided, including but not limited to a data content normalization service.
Specifically, data content normalization refers to merging identical or similar data content in a coding warehouse to allow them to use the same code. For example, the same person writes the same word at different times, and although the final glyphs are not necessarily identical, they may be normalized by some sort of feature classification.
Normalization may be automated according to certain rules. For example, normalization of sounds may keep only the instance with the highest sampling frequency among identical sounds, from which versions with lower sampling frequencies can be generated. Normalization may also be semi-automated with manual intervention: the content normalization service finds identical or similar content items in the coding repository and presents them to a designated user (e.g. the content item owner), who decides by his or her own criteria which content item is ultimately kept.
The normalization service may run in real time. In this case, each time the coding repository receives input content, the content normalization service looks for identical or similar items in the repository; if one exists, it returns that item's code directly and, where necessary (according to certain rules), replaces the original content item with the new content. The normalization service may also run offline rather than in real time. In this case, after the service finds content in the code repository that can be normalized, it establishes a correspondence between the original instance codes and the normalized codes, and uses this correspondence to convert an input code string into the normalized code string.
The normalization service relies on a specific content matching algorithm. For example, matching handwritten content requires a pattern matching or image matching algorithm, matching voice content requires a voice matching algorithm, and so on.
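A minimal sketch of the real-time case follows. It assumes that each content item has already been reduced to a fixed-length numeric feature vector and that two items count as "similar" when the Euclidean distance between their vectors is within a threshold. Both the feature representation and the distance test are placeholders for whatever handwriting or voice matching algorithm is actually used, and the class and method names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


class NormalizingRepository:
    """Toy coding repository that reuses the code of a similar stored item."""

    def __init__(self, threshold: float = 0.1) -> None:
        self.threshold = threshold
        self.items: Dict[int, Tuple[float, ...]] = {}  # code -> feature vector
        self._next_code = 0

    def encode(self, features: Sequence[float]) -> int:
        # Real-time normalization: look for an identical or similar item first.
        for code, stored in self.items.items():
            if math.dist(stored, features) <= self.threshold:
                return code  # return the existing code directly
        # Nothing similar: store the content under a newly allocated code.
        code = self._next_code
        self._next_code += 1
        self.items[code] = tuple(features)
        return code


repo = NormalizingRepository(threshold=0.1)
first = repo.encode([0.00, 1.00])    # first instance of a written character
second = repo.encode([0.05, 1.00])   # same character, slightly different glyph
other = repo.encode([1.00, 0.00])    # a clearly different character
```

The sketch keeps the first stored instance; a rule for replacing it with newer content, as the text allows, would be added inside `encode`.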
Although content normalization is an optional service, a code repository that implements it can minimize code redundancy and thereby make the most of the existing text infrastructure and related tools.
In addition, further, some other services or applications may be provided, including but not limited to the following service options:
1. Code management service
The content in the coding warehouse can be of various types, which gives the system great flexibility and openness: different input and output methods can be mixed; different implementations of the same type of input method can be mixed; different kinds of coding may be used in a particular input/output scheme; new coding schemes can be added dynamically; and so on. The encoding therefore needs to be managed to some extent.
The code management is mainly the access and maintenance of the code metadata. Including the management of coding space, coding type, coding conventions, etc.
Because the new data processing system is personalized and its codes are arbitrary, a mechanism for registering and querying code types needs to be introduced. In this way, an application system can add coding types dynamically. Existing coding types and their associated metadata, such as the details of the corresponding coding conventions, can also be queried and used.
2. Content selection service
In different environments, the output of the text content has different requirements. For example, a high-precision character printing apparatus requires high-precision font information; low bandwidth network devices have to find a balance between font quality and data size; a system with high safety requirements hopes that the written content hides the stroke order information; movie dubbing and video chat require different quality audio output; etc. These all require a content selection service.
Content selection is in effect a conditional output of content. The output content may be a data object taken directly from the coding warehouse. There may be multiple data objects in the code repository corresponding to the same code (the normalization service may retain multiple data objects for the same code), and the content selection service must select the most appropriate one for output. The output data object may also be generated dynamically. For example, text image output can be obtained by dynamically rendering text graphic data, and low-sample-rate audio can be obtained by downsampling high-sample-rate audio.
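As a rough illustration, the sketch below selects an audio data object for a given code under a maximum sample rate. The `AudioObject` record, the `derived` marker, and the selection rule (the highest stored rate that fits, otherwise a dynamically generated lower-rate version) are assumptions made for the example; no actual resampling is performed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AudioObject:
    code: int
    sample_rate_hz: int
    derived: bool = False  # True if generated dynamically, not stored


def select_audio(store: List[AudioObject], code: int,
                 max_rate_hz: int) -> Optional[AudioObject]:
    """Pick the most suitable data object for `code` under a rate limit."""
    matching = [obj for obj in store if obj.code == code]
    if not matching:
        return None  # content that was never input cannot be output
    within = [obj for obj in matching if obj.sample_rate_hz <= max_rate_hz]
    if within:
        # Several stored objects share the code: take the best one that fits.
        return max(within, key=lambda obj: obj.sample_rate_hz)
    # Otherwise stand in for downsampling a stored higher-rate object.
    return AudioObject(code, max_rate_hz, derived=True)


store = [AudioObject(7, 48000), AudioObject(7, 16000), AudioObject(8, 8000)]
chat = select_audio(store, code=7, max_rate_hz=44100)
low_bandwidth = select_audio(store, code=7, max_rate_hz=8000)
missing = select_audio(store, code=9, max_rate_hz=8000)
```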
3. Content caching service
The specific implementation of the code warehouse can be storage and related services in a certain application program, can be services shared by a system, and can also be services in public cloud or private cloud.
When the encoding repository is shared in a network environment, the content needs to be downloaded locally over the network. Sometimes, due to network transmission reliability, bandwidth, etc., it is necessary to provide local buffering of the coding warehouse. The local cache may cache some or all of the data objects of the shared code repository in the network at the client or intermediate node to support fast, reliable output. Also, in case the code repository access is unreliable or even offline, the input may also be done directly in the local cache, resulting in a temporary code. When the content cache is synchronized with the code repository, the temporary code is updated to the formal code, and the corresponding code content is updated accordingly.
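The temporary-code mechanism can be sketched as follows. The use of negative integers for temporary codes, and the class and method names, are assumptions made only for this illustration; the text does not specify how temporary codes are distinguished from formal ones.

```python
from typing import Dict, List


class Repository:
    """Toy shared code repository that allocates formal codes."""

    def __init__(self) -> None:
        self.objects: Dict[int, str] = {}
        self._next = 1

    def store(self, data: str) -> int:
        code = self._next
        self._next += 1
        self.objects[code] = data
        return code


class LocalCache:
    """Toy local cache; temporary codes are negative (an assumption)."""

    def __init__(self) -> None:
        self.objects: Dict[int, str] = {}
        self._next_temp = -1

    def input_offline(self, data: str) -> int:
        code = self._next_temp
        self._next_temp -= 1
        self.objects[code] = data
        return code

    def sync(self, repo: Repository, text: List[int]) -> List[int]:
        """Upload temporary objects and rewrite the coded text."""
        mapping: Dict[int, int] = {}
        for temp in [c for c in self.objects if c < 0]:
            formal = repo.store(self.objects.pop(temp))
            self.objects[formal] = repo.objects[formal]
            mapping[temp] = formal
        # Replace every temporary code in the text with its formal code.
        return [mapping.get(code, code) for code in text]


repo = Repository()
existing = repo.store("existing glyph")
cache = LocalCache()
t1 = cache.input_offline("stroke-a")
t2 = cache.input_offline("stroke-b")
synced = cache.sync(repo, [existing, t1, t2, t1])
```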
4. Transcoding service
Based on the new data processing system, the computer system is able to decompose various inputs into data objects in the code repository and encoded content. The computer system can then, based on the code repository, restore these into output that a human (at least the inputter) can understand.
However, because the character encoding of the present system is non-standard, the encoded character content cannot be understood by any person or machine in an environment without the coding warehouse. The transcoding service mainly converts personalized literal codes into standard literal codes. The result of the conversion is traditional standard text that can be used in a traditional application environment outside the coding warehouse.
Specifically, converting a handwriting-based object code into a standard character code means performing handwriting recognition on the corresponding text content; converting a speech-based object code into a standard text code means performing speech recognition on the corresponding text content. The recognition results may also be used to implement content normalization.
Once the correspondence between object-based codes and standard literal codes is established, the system can implement the conversion from standard codes to object-based codes to some extent.
Further, different object codes can be converted into one another. The conversion may be between different text output modes of the same person, for example outputting the text of handwriting input as speech. It may also be between different users, for example converting a secretary's handwritten draft directly into the manager's handwriting. There are two ways to convert between object codes. One uses standard literal codes as an intermediate: one object code is converted to a standard literal code, which is then converted to the other object code. The other directly establishes a mapping between the two codes.
In addition, some object codes are themselves based on standard text codes, such as sensitive-word codes for encryption purposes and common-word codes for compression purposes. Such codes can be converted to standard literal codes directly.
It is worth mentioning that the relationship between different encodings is not necessarily a one-to-one mapping. For example, in many languages it is common for one sound to correspond to several meanings, so a one-to-many relationship often arises between codes formed from speech input and standard text codes.
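The two conversion routes between object codes can be sketched with plain dictionaries. The tables and code numbers are invented for the example, and each table is assumed to be one-to-one, which the preceding paragraph notes is not always the case in practice.

```python
from typing import Dict, List, Optional


def convert_via_standard(codes: List[int],
                         to_standard: Dict[int, str],
                         from_standard: Dict[str, int]) -> List[Optional[int]]:
    """Route 1: source object code -> standard character -> target object code."""
    result: List[Optional[int]] = []
    for code in codes:
        char = to_standard.get(code)          # e.g. handwriting recognition result
        result.append(from_standard.get(char) if char is not None else None)
    return result


def build_direct_mapping(to_standard: Dict[int, str],
                         from_standard: Dict[str, int]) -> Dict[int, int]:
    """Route 2: precompute a direct source-code -> target-code table."""
    return {src: from_standard[char]
            for src, char in to_standard.items() if char in from_standard}


# Hypothetical tables: the secretary's handwriting codes and the manager's.
secretary_to_standard = {101: "A", 102: "B", 103: "C"}
standard_to_manager = {"A": 901, "B": 902}

via_standard = convert_via_standard([101, 103, 102],
                                    secretary_to_standard, standard_to_manager)
direct = build_direct_mapping(secretary_to_standard, standard_to_manager)
```

A `None` in the result marks a character for which the target user has no corresponding object code.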
5. Access control service
For a security-demanding environment, access to the code repository needs to be protected by a system-level access control system. Of course, this access control is optional. In some single-user systems, it is not necessary to separately set up the content access control service.
In a multi-user environment, the access control system confirms the identity of the user of the system and for that identity, either allows or prohibits use of the services provided by the code repository according to rules set by the code repository. For example, a user with a code repository text entry account may store their entered data objects to the code repository. And only the user, and other users authorized by the user, have access to the user's data objects in the code repository.
Codes in the code repository are used within relevant context models, such as a document model, a user model, or an application model. The rights to access different codes can therefore be set entirely according to these models, and at different levels: the coding space level, the meta-code level, or even the instance code level. Unlike conventional resource access control (e.g. for files or computers) and website access control, permission setting at the code level enables finer-grained control of information access.
It should be emphasized that the access control system does not protect the encoded content itself (the set of object codes), only the data objects in the corresponding coding repository. An authorized user can therefore restore the original input content in conjunction with the data objects in the code repository, whereas an unauthorized user cannot correctly output the same coded content and obtains only unordered content or garbled text.
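The effect described above can be illustrated with a toy repository in which the coded text is freely readable but the data objects are guarded. The owner/grant model, the class name, and the use of U+FFFD as the placeholder returned to unauthorized users are assumptions made for this sketch.

```python
from typing import Dict, List, Set, Tuple

PLACEHOLDER = "\ufffd"  # what an unauthorized reader sees (an assumption)


class GuardedRepository:
    def __init__(self) -> None:
        self._objects: Dict[int, Tuple[str, str]] = {}  # code -> (owner, data)
        self._grants: Dict[str, Set[str]] = {}          # owner -> allowed users

    def store(self, owner: str, code: int, data: str) -> None:
        self._objects[code] = (owner, data)

    def grant(self, owner: str, user: str) -> None:
        self._grants.setdefault(owner, set()).add(user)

    def can_read(self, user: str, code: int) -> bool:
        owner, _ = self._objects[code]
        return user == owner or user in self._grants.get(owner, set())

    def decode(self, user: str, codes: List[int]) -> List[str]:
        # The code sequence itself is not protected; only the objects are.
        return [self._objects[c][1] if self.can_read(user, c) else PLACEHOLDER
                for c in codes]


repo = GuardedRepository()
repo.store("alice", 1, "he")
repo.store("alice", 2, "llo")
text = [1, 2]                        # coded content, visible to anyone

as_owner = repo.decode("alice", text)
as_stranger = repo.decode("bob", text)
repo.grant("alice", "bob")
as_authorized = repo.decode("bob", text)
```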
6. Text service
The object-based coded text system may also include service subsystems to provide advanced text services based on the coded services provided by the coding repository.
7. Text search and replacement
As with conventional literal searches, object code can be searched (literal code layer) in the new data processing system, especially for literal content after normalization. In addition, text searches may also be content-based searches, as new data processing system encodings and content are in a one-to-one correspondence. Taking handwriting input characters as an example, searching (a character content layer) can be performed according to partial contents (such as radicals) of the characters; fuzzy search can be performed according to the content; a lookup may be performed by the number of strokes, etc.
In addition, because the new data processing system is open, any kind of data can be object-coded through the coding warehouse, and the new text search service can also search and replace according to the type of the object code and the field characteristics of the related type.
8. Text conversion
The text conversion service converts open codes into standard codes. It is based on the transcoding service of the coding repository. Unlike transcoding in the repository, however, text conversion must also select the best result from several candidate target codes on the basis of syntactic and semantic analysis. It is in effect a more comprehensive, higher-level recognition system.
9. Text matching
Because the new data processing system is capable of supporting highly personalized text input, the application program can formulate matching rules based on the personalized input, corresponding the input to a particular output. For example, an internet browser may correspond different characters or icons of the handwriting input to different websites; the handwriting programming system may map specific inputs to corresponding keywords, and so on.
10. Literal data service
The security and efficiency of the new data processing system apply equally to structured text technology. Text data technology reworked on the basis of open coding offers performance and efficiency comparable to existing binary data formats: the metadata can be stored entirely in the coding warehouse, and object codes that do not conflict with one another keep the code word length to a minimum. It is entirely reasonable for an application to describe text content, structured data, and semi-structured data uniformly with one object coding system. The literal data service provides conversion between coded strings and application-specific models.
In addition, unlike traditional text entry, text entry in the new data processing system does not need to produce standard codes; instead, input comes first and the code is generated afterwards. The text input system can therefore accept input in the most natural and efficient way. The input result must be divided into minimum units in a natural and reasonable manner, such as characters or words of text, or fragments of speech. These units are then sent to the encoder of the coding system to obtain the corresponding codes.
It can be seen that the input subsystem comprises at least two functions, namely the reception of an input and the segmentation of a content unit.
It should be noted that, due to the privacy and openness of the personalized code, different input methods can be mixed, and the personalized code can be put into the same text by only using different code types or different code spaces. For example, text for speech input is inserted into text for handwriting input.
The input of the new data processing system allows for a variety of input content, such as graphics, images, video, and sound. It also allows the input content to be multidimensional, for example reading the written content aloud while handwriting it. The content selection service of the coding repository can select an appropriate form in which to output such multidimensional content. Multidimensional content also provides more information to help the system with content segmentation and content recognition.
For the output system, the output subsystem is to restore the text encoding to the original information of the input. Unlike conventional output systems, the output of the new system relies entirely on an open code warehouse. The output form and content of which depends on the form and content of the input. No output can be made for content that has not been input.
For editing systems, appropriate modification adjustments often need to be made while inputting. As with conventional editing systems, editing systems based on personalized object coding also provide basic add, delete, and change functions. But in a different way, the new editing system may also provide functions such as modification and adjustment of the input content and management of the splitting of content units.
It should be noted that the new data processing system is neither intended nor able to replace existing data processing systems. On the contrary, with proper design, the infrastructure and tools of existing systems can be used to the greatest extent and the two systems merged organically. Such use and fusion include at least the following aspects:
First aspect, standard control characters
Many existing word processing systems and tools are general-purpose data tools that perform no special processing for specific codes, such as tools for compression, encryption, and storage. These can be used directly in the new data processing system.
However, some word processing systems and tools treat certain characters specially. The most common are control characters such as line feed, space, and tab. For example, a text line counter counts the line-feed characters in the text; text version management systems and text comparison and merging tools also work line by line; and word counting and word segmentation for English-based index and retrieval systems split words at standard control characters and punctuation marks.
Thus, as long as methods are provided for entering such standard control symbols and punctuation marks in new text input systems, more conventional text processing systems and tools are available for use in new data processing systems.
Second aspect, hybrid coding
Furthermore, if compatibility of conventional standard literal coding is considered in literal coding of new data processing systems, we can easily mix conventional literal with new literal. The existing text can be directly and effectively used, and the existing text input editing system and the new text input editing system can be mixed. A simple hybrid coding scheme is an extension of the existing standard literal coding scheme directly, distinguishing the object code from the standard code in some way. In this way, object-encoded characters, and even other voice or multimedia streams, may appear in text at the same time as standard characters.
With hybrid coding, existing text data techniques can be effectively retrofitted. In conventional text data technology, both data characters and format characters come from the standard text codes, so format characters cannot be used directly as data characters and must instead be escaped, which is inconvenient and inefficient. For example, in CSV tabular text data, commas are used as separators between fields. If a field contains a comma, the field must be protected by enclosing it in quotation marks, and if quotation marks appear in the field, they too need special treatment. Hybrid coding solves this problem well: since object codes can be distinguished from standard literal codes, object codes can serve as the format characters. Standard characters can then be used freely in the text data without any restriction, and the parsing program can process the data directly without any character escape handling. Furthermore, the data schema and the details of the format data can be placed in the coding warehouse, which greatly reduces data redundancy and improves transmission and processing efficiency.
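The escape-free table format can be illustrated with a small sketch. Here two Unicode private-use code points stand in for the object codes that would act as format characters; this choice, and the function names, are assumptions for the example only. The one requirement is that the separator codes never occur as data characters.

```python
from typing import List

# Stand-ins for object codes used as format characters (assumed values).
FIELD_SEP = "\uE000"
RECORD_SEP = "\uE001"


def dump_table(rows: List[List[str]]) -> str:
    """Serialize rows with no quoting or escaping of data characters."""
    return RECORD_SEP.join(FIELD_SEP.join(row) for row in rows)


def load_table(text: str) -> List[List[str]]:
    """Parse by plain splitting; no escape processing is needed."""
    return [record.split(FIELD_SEP) for record in text.split(RECORD_SEP)]


rows = [
    ["name", "remark"],
    ["Smith, J.", 'He said "hi"'],   # comma and quotes used freely
    ["two\nlines", "a,b,c"],         # even a line feed is ordinary data
]
encoded = dump_table(rows)
decoded = load_table(encoded)
```

In CSV the second and third rows would each need quoting and quote doubling; here the standard characters pass through unchanged.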
Third aspect, keyword mapping
One direct benefit of hybrid coding is that the new data processing system can be used with traditional structured text and grammar-based text. Keywords and special symbols are still encoded with the original standard text code, while identifiers or data content are encoded as objects. This means that both handwriting programming and voice programming become possible.
In such a hybrid coding system, a new text input system can be used to input all of the text. It is only necessary to define the keywords and special symbols of the system together with the object-coded text content corresponding to each of them. Other standard characters may also be encoded by escaping. During text input or text data processing, the system automatically converts object codes into the corresponding standard text codes according to the content matching result, hands these standard codes to a traditional text processing tool, maps the returned result back to object codes, and presents it to the user visually. A typical example is a handwriting programming system: only a mapping between object codes and standard codes needs to be provided at the front end, and the back end can use a conventional tool chain such as a compiler, linker, and debugger to achieve the intended effect.
Standard codes can likewise be mapped to object codes. A standard text code sequence defined in advance can then be entered with a traditional text input system, and the system automatically matches it to the corresponding object code. This is significant for editing and modifying object codes. For example, with an XML editor that supports object coding, an XML document may be edited and modified in the conventional manner and stored as object codes when the document is serialized.
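The two-way mapping described above can be sketched as a pair of lookup tables. This is a minimal sketch under stated assumptions: the object code values and the table contents are hypothetical, and a real system would hold them in the code repository.

```python
# Hypothetical table: object code -> standard-code keyword.
OBJECT_TO_STANDARD = {0x8001: "if", 0x8002: "then", 0x8003: "end"}
# Reverse table, used when standard text is entered from a keyboard.
STANDARD_TO_OBJECT = {word: code for code, word in OBJECT_TO_STANDARD.items()}


def to_standard(object_codes):
    """Front end -> back end: hand standard text to conventional tools."""
    return [OBJECT_TO_STANDARD[code] for code in object_codes]


def to_object(words):
    """Back end -> front end: store keyboard input as object codes."""
    return [STANDARD_TO_OBJECT[word] for word in words]


handwritten = [0x8001, 0x8002, 0x8003]
assert to_standard(handwritten) == ["if", "then", "end"]
assert to_object(to_standard(handwritten)) == handwritten
```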
Fig. 27 is a flowchart of a fourth embodiment of an encoding processing method according to the present invention, where, on the basis of the embodiment shown in fig. 5A, as shown in fig. 27, the method further includes:
Step 401C, when there are a plurality of object codes of the same type that belong to the same owner, mapping those object codes, or the meta codes in those object codes, to a specified system code.
Wherein the system code comprises one of: default meta-code setting code, root space code, and client code setting code.
In the present embodiment, a system code is a code capable of changing the encoding and decoding behavior of the system. Its corresponding data object is directly related to the components of the system codec. In general, system codes are built into the codec system, and a certain extension mechanism is allowed. The terminal code, default meta-code setting code, root space code, and client code setting code mentioned below are all system codes.
For example, following the example above, if there are a large number of data objects of the same type all belonging to the same owner, then their corresponding object encodings are all three encoding points (user encoding+type encoding+instance encoding), where the first two encoding points are all the same, which is a redundancy.
We can introduce a system code to reduce this redundancy to some extent, for example a client code setting code. A client code is a reference code which indicates, for some purpose, that a data object has already been decoded. The code corresponds directly to the data object and requires no additional decoding process. In general, a client code is shorter than the original code of its corresponding data object. The encoding warehouse does not take part in encoding or decoding a client code. A client code is directly distinguishable, by its code value, from other ordinary codes. A client code may correspond to a data object or to a code meta-object.
The client code set code is a system code that sets the client code. The general form is:
Figure SMS_69
the specified object code/meta code is mapped to the specified client code. In this way, any subsequent occurrence of the client code can represent the corresponding object code/meta-code.
In this example, the effect of this client code set code is to define the meta-code of two code points as a code of one word length. This one word length meta-code can then be used instead of the previous two code point meta-code. The corresponding coding metamodel update is shown in fig. 28.
Based on this coding metamodel, the system adds two new coding combinations, as shown in fig. 29: the target element code in the figure corresponds to the alternative type code.
In this way, the encoded storage of the above case can be reduced by one third.
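The one-third saving can be checked with a small sketch. All numeric code values and the `expand` helper are hypothetical, and the one-time cost of the client code setting code itself is ignored.

```python
USER, TYPE = 0x0101, 0x0202            # hypothetical user code and type code
instances = [0x8001, 0x8002, 0x8003, 0x8004]

# Original form: three code points per object (user + type + instance).
original = [cp for inst in instances for cp in (USER, TYPE, inst)]

CLIENT = 0xF000                        # hypothetical one-word client code
client_table = {CLIENT: (USER, TYPE)}  # effect of the client code setting code

# Compact form: two code points per object (client code + instance).
compact = [cp for inst in instances for cp in (CLIENT, inst)]


def expand(stream, table):
    """Replace every client code by the meta code it stands for."""
    out = []
    for cp in stream:
        out.extend(table.get(cp, (cp,)))
    return out


assert expand(compact, client_table) == original
print(len(original), len(compact))  # 12 8
```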
If necessary, system codes with different roles can also be designed in different object coding systems.
Further, the method may further include:
and encrypting the object code.
Or alternatively,
and compressing or encrypting the data object to be encoded.
Fig. 30 is a flowchart of a fifth embodiment of an encoding processing method according to the present invention, where, on the basis of the embodiment shown in fig. 5A, as shown in fig. 30, if the data object to be encoded is a handwritten text, the method further includes:
Step 501C, receiving a code conversion request, querying a mapping table in the code warehouse according to the code conversion request, and acquiring the standard language parameters corresponding to the handwritten characters by glyph matching.
Step 502C, performing transcoding processing on the object code corresponding to the handwritten text according to the standard language parameter corresponding to the handwritten text and the object code corresponding to the handwritten text, so as to obtain the standard text corresponding to the handwritten text.
Wherein the standard language parameters include one or a combination of several: numbers, symbols, keywords, public identifiers, and private identifiers.
In this embodiment, for example, fig. 31 is a handwriting input program, and the corresponding programming language is Lua, which is an embedded script language. The corresponding glyph library code is as follows:
Figure SMS_70
There are three types of codes in the handwritten program shown in fig. 31: glyph codes, word-spacing codes, and line-feed codes. We denote a glyph code as w+ (specific glyph code) and a word-spacing code as s+ (spacing value). For convenience, a line feed is not embedded in the content as a code but is represented directly by a new line. Thus, the code corresponding to the handwritten program can be expressed as follows:
Figure SMS_71
To convert the code, the user prepares a glyph-to-number/symbol mapping table as follows:
Figure SMS_72
the glyph keyword mapping table is as follows:
Figure SMS_73
The glyph interface identifier mapping table is as follows:
Figure SMS_74
Here, the system sets a syntax interval threshold of 20. The rule for automatically generating a private identifier is two underscores (__) followed by the glyph-code sequence joined with underscores.
Finally, from the previous flow, such standard code program code can be obtained:
Figure SMS_75
it can be seen that four private identifiers are generated:
Figure SMS_76
The first of these identifiers is actually comment content and has no significance. With an optimized conversion process, its conversion can be omitted as soon as it is identified as comment content.
The generated program can be normally interpreted and executed by a traditional Lua interpreter, and the execution semantics of the generated program are identical to those of handwriting source codes.
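The conversion flow can be sketched as follows. Because the mapping tables of the original figures are not reproduced here, every glyph code and table entry below is hypothetical. The sketch also assumes that a spacing at or above the threshold of 20 separates two lexical tokens, and that a private identifier is formed as two underscores followed by the glyph codes joined with underscores.

```python
GAP_THRESHOLD = 20  # syntax interval threshold stated in the text

# Hypothetical mapping tables keyed by tuples of glyph codes.
SYMBOLS = {(10,): "=", (11,): "1"}
KEYWORDS = {(20, 21, 22, 23, 24): "local"}
INTERFACE = {(30, 31, 32, 33, 34): "print"}


def split_tokens(line):
    """Group ('w', code) items into tokens, breaking at wide ('s', gap) items."""
    tokens, current = [], []
    for kind, value in line:
        if kind == "w":
            current.append(value)
        elif kind == "s" and value >= GAP_THRESHOLD and current:
            tokens.append(tuple(current))
            current = []
    if current:
        tokens.append(tuple(current))
    return tokens


def convert(token):
    """Map a glyph sequence to standard text, or generate a private identifier."""
    for table in (SYMBOLS, KEYWORDS, INTERFACE):
        if token in table:
            return table[token]
    return "__" + "_".join(str(code) for code in token)


def convert_line(line):
    return " ".join(convert(token) for token in split_tokens(line))


line = [("w", 20), ("w", 21), ("w", 22), ("w", 23), ("w", 24), ("s", 30),
        ("w", 40), ("s", 5), ("w", 41), ("s", 25),
        ("w", 10), ("s", 25), ("w", 11)]
print(convert_line(line))  # local __40_41 = 1
```

The output line is ordinary Lua source, so it could be passed unchanged to a conventional interpreter.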
Fig. 32 is a flowchart of a decoding processing method according to an embodiment of the present invention, as shown in fig. 32, where the method includes:
Step 601C, receiving a decoding processing request, and acquiring an object code to be decoded according to the decoding processing request.
Step 602C, disassembling the object code to obtain a meta code, or the meta code and the instance code.
Step 603C, querying a coding warehouse, and acquiring the corresponding metadata and coding specification according to the meta code.
Step 604C, obtaining a data object corresponding to the object code according to the metadata and the coding specification, or the metadata, the coding specification and the instance code.
In this embodiment, the object code includes or implies the meta code of the related code meta-object. The code repository obtains the corresponding code metadata through this meta code and returns or creates a code meta-object for it. If authorization information or other control information was set for access to the encoded object during or after encoding, these access control rights must be verified before decoding.
In addition, after the object code is obtained, it needs to be disassembled, so as to obtain the meta code and/or the instance code therein. After obtaining the meta-code, corresponding encoded metadata and/or encoding conventions are obtained in accordance with the obtained meta-code. And recovering the original data object according to the encoded metadata and/or the encoding specifications, and the instance encoding.
Wherein the decoding of the data object will be performed according to the content of the encoding specifications. Direct content decoding may be included, or decoding by reference to an encoding warehouse, or both.
The system is an open system: existing content encoding and decoding techniques can be used by code meta-objects (provided that they are described in the coding specification), and can also be used for the transmission and storage of the coding warehouse.
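Steps 601C to 604C can be sketched as a short function. This is a simplified illustration: the `MetaEntry` structure, the in-memory `REPOSITORY`, the boolean `authorized` flag, and the rule that the first code point is the meta code are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class MetaEntry:
    type_name: str     # metadata: the type of the data object
    instance_len: int  # coding specification: instance code points expected
    items: dict        # instance code (tuple) -> stored content


# Hypothetical coding warehouse keyed by meta code.
REPOSITORY = {
    0x01: MetaEntry("com.sample.handwriting.word", 1,
                    {(0x41,): "<glyph 0x41>"}),
}


def decode(object_code, authorized=True):
    if not authorized:                       # access rights are verified first
        raise PermissionError("not authorized to decode this object code")
    # Step 602C: disassemble into meta code and instance code.
    meta_code, instance = object_code[0], tuple(object_code[1:])
    # Step 603C: look up metadata and coding specification.
    entry = REPOSITORY[meta_code]
    if len(instance) != entry.instance_len:
        raise ValueError("instance code length violates the specification")
    # Step 604C: recover the data object.
    return entry.type_name, entry.items[instance]


assert decode([0x01, 0x41]) == ("com.sample.handwriting.word", "<glyph 0x41>")
```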
Fig. 33 is a flowchart of a second embodiment of a decoding processing method according to the present invention, and based on the foregoing fig. 32, as shown in fig. 33, a specific implementation manner of step 602C is as follows:
step 701C, obtaining a predetermined rule corresponding to the object code.
Step 702C, disassembling the object code according to the predetermined rule, so as to obtain the meta code, or the meta code and the instance code.
Further, the method further comprises:
performing access right authentication on the preset rule;
the specific implementation of step 702C is:
and if the access authority authentication of the preset rule is successful, the object code is disassembled according to the preset rule to acquire the meta code or the meta code and the instance code.
Fig. 34 is a flowchart of a third embodiment of a decoding processing method according to the present invention, where, on the basis of the foregoing fig. 32, as shown in fig. 34, the method further includes:
Step 801C, performing access right authentication on the meta-code.
One specific implementation of step 603C is:
and step 802C, if the access authority authentication of the preset rule is successful, the object code is disassembled according to the preset rule to obtain the meta code or the meta code and the instance code.
Fig. 35 is a flowchart of a fourth embodiment of a decoding processing method according to the present invention, and based on the foregoing fig. 32, as shown in fig. 35, a specific implementation manner of the step 604C is as follows:
step 901C, obtaining a context object.
Step 902C, acquiring a corresponding coding space according to the context object and the coding specification.
Step 903C, decoding the instance code from the coding space to obtain the corresponding data content.
Step 904C, obtaining a data object corresponding to the object code according to the metadata and the data content.
Based on the description of the embodiments described above, a specific application of the handwriting input system based on the encoding process will be schematically described below taking the handwriting input system of the present invention as an example.
For example, taking handwriting input with spacing-based word segmentation, suppose the user has entered the text shown in fig. 36 in the current line. The input system then forms four characters according to the spacing-based word segmentation algorithm and stores them in the code warehouse (assuming that 64 characters, 0x1-0x40, already exist in the code warehouse):
Figure SMS_77
Here, 0x41, 0x42, 0x43, and 0x44 are hexadecimal notations representing the decimal numbers 65, 66, 67, and 68, respectively. The object code may be the position of the data object in the code warehouse directly, or a hash value of that position. The specific content of each encoded item is graphical data, which may be in a common format, such as SVG, or in a proprietary format.
Correspondingly, the input system also generates corresponding text data as follows:
0x41 0x20 0x42 0x20 0x43 0x20 0x44
where 0x20 is the space character in standard ASCII codes (assuming the system uses standard spaces to separate characters). The above words are seen in a conventional text viewing environment as follows:
A B C D
this is because 0x41,0x42,0x43,0x44 correspond to A, B, C, D characters in ASCII codes, respectively, and when a conventional text is output, the corresponding character outline is extracted from the corresponding standard-based code word stock by these codes.
In the new data processing system, the text output will go to the code repository to fetch the corresponding graphics and draw them in sequence to the output display. The drawing results are shown in fig. 36.
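The two readings of the same byte sequence can be sketched as follows. The file names in the `repository` dictionary are hypothetical stand-ins for the stored graphical data.

```python
text_data = bytes([0x41, 0x20, 0x42, 0x20, 0x43, 0x20, 0x44])

# Conventional viewer: each byte is looked up in the standard ASCII table.
print(text_data.decode("ascii"))  # A B C D

# New system: every non-space code indexes a repository of handwritten glyphs.
repository = {0x41: "glyph-1.svg", 0x42: "glyph-2.svg",
              0x43: "glyph-3.svg", 0x44: "glyph-4.svg"}  # hypothetical content
drawn = [repository[code] for code in text_data if code != 0x20]
assert drawn == ["glyph-1.svg", "glyph-2.svg", "glyph-3.svg", "glyph-4.svg"]
```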
In addition, with respect to encoding types, as mentioned above, in the new data processing system, multiple types of encoding may exist simultaneously. We can uniformly code different types of characters/logograms. However, the unified coding has a problem that the system needs to acquire corresponding coding type information from a coding warehouse for each code during decoding so as to correctly decode and output the codes. This greatly affects system performance.
Another approach is to encode the type, storing the encoded type information in an encoding repository. Thus, the object-based literal code will include two parts: coding type coding (meta coding) and specific coding under that type (instance coding). This may increase the size of the encoding result, but may greatly improve the flexibility and openness of the codec.
Based on the previous example, the coding warehouse needs to add type coding information (coding meta information):
Figure SMS_78
meanwhile, all the coding items need to be coded according to the corresponding types and placed at different positions of a coding warehouse. For example, for database-based implementations, codes of different coding types may be placed into different tables, and the object factory may find the corresponding table from the type codes (meta-codes) according to system conventions (e.g., using the type ID as the table name for the corresponding code).
In this example, the "com.sample.handwriting.word" table is as follows.
Figure SMS_79
Figure SMS_80
Accordingly, the text data generated by the input system becomes encoded as follows:
0x01 0x41 0x02 0x01 0x42 0x02 0x01 0x43 0x02 0x01 0x44
where 0x02 corresponds to a space. This is a control symbol and does not require specific text content nor a corresponding table in the code repository.
We can use dynamic coding for the coding type, which can achieve the efficiency, security and openness of the new data processing system. Multiple input methods and coding modes can be mixed in the same application system. Unauthorized systems or individuals cannot obtain any information from the encoded results. New input methods, coding types, applications can be dynamically added to the new data processing system.
In addition, regarding the encoding of data: for a system capable of encoding any data object, it is often not sufficient to encode only the text content itself. Other related information must also be encoded, that is, data must be encoded. Unlike the encoding of object data, such data content need not be stored in the text code repository; it can be encoded directly in the object code, which is the aforementioned content code.
A typical example is the spacing of text. In conventional ASCII encoding systems, a space is a control character. In the corresponding text output, the width of one space is fixed. The distance between characters separated by spaces is determined by the number of spaces between them, so this distance can only be an integer multiple of the space width. In naturally written text, however, the spacing between characters or words is arbitrary (within the bounds of the paper, of course). Careful observation of the previous example shows that the handwritten input and the corresponding output are not identical, and the main difference is the spacing between characters, because the example encodes every spacing between characters with the same code. To achieve a what-you-see-is-what-you-get effect, the length of the character spacing can also be encoded into the object coding result. We could put this length information into the coding repository and then encode the position of the content item into the text. Obviously, it is much more efficient to encode the spacing in binary and put it directly into the text. Fig. 37 visualizes the length of the character spacing. As shown in fig. 37, the length uses a logical unit and can adapt to different devices and to output at different font sizes. We update the coding type information as follows:
Figure SMS_81
Figure SMS_82
Wherein, the encoding length of the space is changed from 0 to 1, which means that there is one byte of length encoding after space encoding. The null encoded data type indicates that decoding of the length encoding does not require access to the encoding warehouse. The encoding program may directly convert the interval length between characters into bytes to be stored in the encoding result. The corresponding literal code is as follows:
0x01 0x41 0x02 0x0C 0x01 0x42 0x02 0x10 0x01 0x43 0x02 0x0A 0x01 0x44
in this way, the text output subsystem can fully restore the original input content according to the code.
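A table-driven parser for this mixed stream can be sketched as follows. The sketch assumes that each space code 0x02 is followed by exactly one inline length byte and each word code 0x01 by one repository reference byte, so the last spacing in the example is read as 0x02 0x0A. The `TYPES` table layout and the type label "space" are hypothetical.

```python
# meta code -> (type name, instance length in bytes, stored in repository?)
TYPES = {
    0x01: ("com.sample.handwriting.word", 1, True),
    0x02: ("space", 1, False),  # one inline length byte, no repository lookup
}


def parse(data):
    """Split the byte stream into (kind, type name, value) items."""
    items, i = [], 0
    while i < len(data):
        name, length, referenced = TYPES[data[i]]
        payload = data[i + 1:i + 1 + length]
        if len(payload) != length:
            raise ValueError("truncated instance code")
        value = int.from_bytes(payload, "big")
        items.append(("ref" if referenced else "inline", name, value))
        i += 1 + length
    return items


stream = bytes([0x01, 0x41, 0x02, 0x0C, 0x01, 0x42, 0x02, 0x10,
                0x01, 0x43, 0x02, 0x0A, 0x01, 0x44])
items = parse(stream)
gaps = [value for kind, _, value in items if kind == "inline"]
assert gaps == [12, 16, 10]  # spacing lengths in logical units
```

Only the items tagged "ref" require a lookup in the coding warehouse; the inline spacing values are usable as soon as they are parsed.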
It is worth mentioning that the interval in the example is the length spacing between handwritten characters. However, other kinds of spacing exist for other input methods, such as the time spacing between sound units in a speech input. We can provide different coding types to support different kinds of pitch coding.
In this example we see the effect of directly object encoding the data. Here we encode the integer. In fact, in computer systems, binary representation/encoding of various data is the basis for data storage, processing, and these techniques are well established. For example, the IEEE 754 standard is a standard for binary encoding floating point numbers. We can use all these techniques to directly encode arbitrary data into the object encoding results.
Thus, in the coding scheme of the new data processing system, the data content of our data objects may not only be stored in the coding repository, but may also be placed in some way directly into the object code. Thus, the literal code of the new data processing system may actually be a mixture of reference code and content code. We can distinguish between them by coding type. Still further, it is also possible to determine whether the code meets the type constraint through a type security check of the code type, and to determine the specific type of code through type derivation.
In addition, for hybrid coding, the new data processing system allows us to create object-based coded literals from beginning to end with new coding. In many cases it is desirable to be able to directly utilize existing text resources, directly modify them over existing standard code-based text. It is also sometimes desirable to be able to mix keyboards with new input methods to modify and edit text. This requires that the new literal coding scheme be compatible with existing standard coding so that the literals of both systems can appear mixed in the same document.
There are many schemes for implementing hybrid coding. A straightforward solution is to put each standard code sequence as object data content into a code repository, defining new object codes for these contents. Another approach is to place a type code before each standard literal code in the text content, which tells the decoder that the code is a standard literal code. The two schemes have a main problem that the existing standard code text contents can be converted into target codes, and the code results are completely incompatible with the original standard codes. It is difficult to use existing text infrastructure and tools for processing and analysis.
A better solution is to base the new literal code directly on the existing standard code. A specific UTF-16 based literal coding scheme is presented herein:
1. All UTF-16 standard codes are encoded according to the original coding standard, including the BOM, surrogate pairs, and the like.
2. The meta codes of all object codes use the private-use codes of UTF-16 (from U+E000 to U+F8FF).
3. The word length of the instance code following the type code (here one word is 2 bytes) is determined by information in the code repository.
4. The high-order bit of each instance code word following the type code is 1 (i.e., from 0x8000 to 0xFFFF), so as to avoid conflict with other control symbols.
For this encoding scheme, the decoding process is shown in fig. 38.
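The four rules above can be sketched as a decoder over 16-bit code units. This is only an illustration of the rules as stated, not the flow of fig. 38: the `INSTANCE_WORDS` table (one instance word for each of the meta codes U+E000 and U+E001) and the token format are assumptions.

```python
import struct

PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF

# Hypothetical repository information: meta code -> instance length in words.
INSTANCE_WORDS = {0xE000: 1, 0xE001: 1}


def decode_units(units):
    """Split 16-bit code units into standard text runs and object codes."""
    tokens, buffer, i = [], [], 0

    def flush():
        # Standard units are decoded by the ordinary UTF-16 codec (rule 1).
        if buffer:
            raw = struct.pack(f">{len(buffer)}H", *buffer)
            tokens.append(("text", raw.decode("utf-16-be")))
            buffer.clear()

    while i < len(units):
        unit = units[i]
        if PUA_FIRST <= unit <= PUA_LAST:          # rule 2: meta code
            flush()
            count = INSTANCE_WORDS[unit]           # rule 3: length from repository
            instance = units[i + 1:i + 1 + count]
            if len(instance) != count or any(w < 0x8000 for w in instance):
                raise ValueError("malformed instance code")  # rule 4
            tokens.append(("object", unit, tuple(instance)))
            i += 1 + count
        else:
            buffer.append(unit)
            i += 1
    flush()
    return tokens


units = [0x0049, 0x0020, 0x0061, 0x006D, 0x0020, 0xE001, 0x8000, 0x002E]
assert decode_units(units) == [("text", "I am "),
                               ("object", 0xE001, (0x8000,)),
                               ("text", ".")]
```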
In addition, a specific example is given here. As shown in fig. 39, this is a hybrid coded content display.
The corresponding text code contains five standard Unicode characters: U+0049 (I), U+0020 (space), U+0061 (a), U+006D (m) and U+002E (.). The others are non-standard codes. Correspondingly, we have the following coding information:
Figure SMS_83
the type "com. Sample. Handwriting. Word" code content in the code repository is:
Figure SMS_84
the type "com. Sample. Photo" code content is:
Figure SMS_85
The codes corresponding to the text content in the example are as follows:
U+0049 U+0020 U+0061 U+006D U+0020 U+E001 0x8000 U+002E U+0020 U+E000 0x8041 U+0020 U+E000 0x8042 U+0020 U+E000 0x8043 U+0020 U+E000 0x8044
this code will appear in a conventional UTF-16 data processing system as:
I am
Figure SMS_86
Here the two type codes U+E000 and U+E001 are private-use characters, which standard UTF-16 fonts do not support, so the output varies from implementation to implementation. Here, they are output as blanks (the blanks before the five Chinese characters above). Some systems display them as boxes or black blocks.
It can be seen that based on this coding scheme, our conventional UTF-16 text can be used directly in the new data processing system without any conversion. The encoding results of the new data processing system may also be processed with the infrastructure and tools supporting UTF-16. For example, in a traditional text editor, "I am" in the example is replaced with "I am". The corresponding modifications can be directly embodied by the new data processing system output, as shown in fig. 40.
That is, the processing power and tools of the original UTF-16 may be inherited and preserved in the new system. At the same time, the new encoding results may be stored intact in any UTF-16-enabled storage system.
Similarly, other standard coding systems such as UTF-8, UTF-32, etc. can be extended to support new data processing systems.
In addition, with respect to transform coding, in the new object coding system, we can put the content of the data object in the coding warehouse, and can put the code itself as the data content in the coding warehouse. This type of coding that converts other codes is called transcoding. The specific content stored in the code repository for transcoding is text. One simple application is the conversion of standard codes. As follows we define a transcoding:
Figure SMS_87
Figure SMS_88
Thus, our original ASCII string "This is a SECRET!" will be encoded as "0x41 0x42 0x43 0x44 0x45 0x43 0x44 0x45 0x46 0x45 0x47 0x48 0x49 0x50 0x48 0x41 0x51" under the new data processing system. A person without access rights to the corresponding code repository cannot produce the correct output in the new data processing system even if he has obtained the text code. In a conventional ASCII system, this code is output as "ABCDECDEFEGHIPHAQ". Thus, a user who is not authorized by the code repository cannot obtain the real content. This is in effect an encryption function, but it differs from conventional encryption. Traditional encryption encrypts the text data as a whole. Transcoding-based content protection relies on authorized access to the coding repository and therefore allows fine-grained content protection, such as transcoding only the characters or words that need to be protected, or granting different access rights to different codes.
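The transcoding of this string can be sketched with an explicit table. The table below is reconstructed from the code sequence quoted in the text, since the original figure is not reproduced here, so it should be read as an assumption; the `authorized` flag stands in for the repository's access control.

```python
# Character -> code, inferred from the example sequence (assumption).
TABLE = {"T": 0x41, "h": 0x42, "i": 0x43, "s": 0x44, " ": 0x45, "a": 0x46,
         "S": 0x47, "E": 0x48, "C": 0x49, "R": 0x50, "!": 0x51}
REVERSE = {code: ch for ch, code in TABLE.items()}


def transcode(text):
    """Replace each standard character by its repository-defined code."""
    return bytes(TABLE[ch] for ch in text)


def restore(data, authorized):
    """Only a caller with repository access can map the codes back."""
    if not authorized:
        raise PermissionError("no access to the code repository")
    return "".join(REVERSE[code] for code in data)


coded = transcode("This is a SECRET!")
print(coded.decode("ascii"))  # ABCDECDEFEGHIPHAQ
assert restore(coded, authorized=True) == "This is a SECRET!"
```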
For example, based on the aforementioned UTF-16 hybrid encoding, we can re-encode only part of the content in the text, and the other content is encoded using UTF-16. Here a new type of coding is used:
Figure SMS_89
the corresponding code repository is as follows:
Figure SMS_90
Figure SMS_91
The original UTF-16 string "This is a SECRET!" is encoded as "U+0054 U+0068 U+0069 U+0073 U+0020 U+0069 U+0073 U+0020 U+0061 U+0020 U+E002 0x8000 U+0021" in the new data processing system. In the new data processing system, special display output may be made by different users for the type "com. For example, for an authorized user, the content corresponding to U+E002 0x8000 can be obtained normally, and the result is shown as follows:
This is a SECRET!
For unauthorized users, the content corresponding to U+E002 0x8000 cannot be acquired, and the result is shown as follows:
This is a!
the code is output in UTF-16 text environment as follows:
This is a ■耀!
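The three views of the same code sequence can be sketched as follows. The `PROTECTED` dictionary is a hypothetical stand-in for the repository content, and the boolean `authorized` flag replaces real access control.

```python
units = [0x0054, 0x0068, 0x0069, 0x0073, 0x0020, 0x0069, 0x0073, 0x0020,
         0x0061, 0x0020, 0xE002, 0x8000, 0x0021]

# Content held in the code repository for the protected object code.
PROTECTED = {(0xE002, 0x8000): "SECRET"}


def render(units, authorized):
    """Render the text; protected content appears only for authorized users."""
    out, i = [], 0
    while i < len(units):
        if units[i] == 0xE002:              # type code of the protected content
            if authorized:
                out.append(PROTECTED[(units[i], units[i + 1])])
            i += 2                          # skip type code and instance code
        else:
            out.append(chr(units[i]))
            i += 1
    return "".join(out)


assert render(units, authorized=True) == "This is a SECRET!"
assert render(units, authorized=False) == "This is a !"

# A plain UTF-16 environment sees a private-use character followed by U+8000.
plain = "".join(chr(u) for u in units)
assert plain == "This is a \ue002\u8000!"
```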
Here we can see that this flexibility is difficult to achieve with conventional encryption. In addition, conventional encryption and transcoding can be used together: the text code can be encrypted as a whole, or the text content can be encrypted, and so on. In this way, the content security of the system can reach a higher level. After obtaining the ciphertext, the user needs a key to recover the plaintext code, but that code still cannot be understood; the user must also pass the identity verification of the code warehouse to obtain the corresponding content, and if the content itself is encrypted, the user must decrypt it as well to finally obtain the information.
Meanwhile, it should be noted that the practice of changing a plurality of characters into one code here also achieves the effect of compressing text in practice.
Transcoding is not limited to standard codes, for which it achieves encryption and compression. Any other code can likewise be grouped and converted through transcoding.
There is a specific example: as mentioned above, the new data processing system code results and the characters entered by the conventional keyboard may be mixed together. Assuming that at this time we use handwriting input methods, what would result if handwriting input were performed directly on top of the content of a traditional character? If such interaction is allowed, then the intuitive result is that the handwritten stroke falls over the result of the character output. As shown in fig. 41.
Here we can mix different types of codes together with transform coding to form one code. The types of codes used are as follows:
Figure SMS_92
the content item of the coding type "com.
Figure SMS_93
The related content item of the coding type "com.sample.handwriting.mixedword" is as follows:
Figure SMS_94
Figure SMS_95
Here, the code U+E003 0x8000 corresponds to mixed content in which UTF-16 codes and handwritten character object codes are combined. When the code repository retrieves the content and detects that the content itself contains codes held in the repository, it fetches all directly or indirectly referenced object data content and sends it to the client. This minimizes the number of accesses to the service and also makes it easier to detect circular references (a code that directly or indirectly references itself). The corresponding text output system breaks the encoded content into two parts. The first part is a handwritten code, which may be preceded by a spacing code; this spacing code is the spatial distance of the handwritten content from the previous position. The handwritten code is followed by the second part, which is any mix of UTF-16 codes and spacing codes. The two parts are rendered in turn to obtain the correct result.
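The fetching of directly and indirectly referenced content, together with circular reference detection, can be sketched as a depth-first traversal. The repository layout (a code maps to a list whose elements are either literal content or further codes) and all code values are hypothetical.

```python
# Hypothetical repository: code -> parts; a part is either literal content
# (str) or another code (tuple) that must itself be resolved.
REPO = {
    ("mixed", 0x8000): [("word", 0x8041), "am"],
    ("word", 0x8041): ["<strokes>"],
}


def collect(code, repo, active=None, result=None):
    """Return every item reachable from `code`, rejecting circular references."""
    active = set() if active is None else active
    result = {} if result is None else result
    if code in active:
        raise ValueError(f"circular reference at {code!r}")
    if code in result:
        return result                   # already fetched via another path
    active.add(code)
    for part in repo[code]:
        if isinstance(part, tuple):     # a referenced code: resolve it too
            collect(part, repo, active, result)
    active.discard(code)
    result[code] = repo[code]
    return result


bundle = collect(("mixed", 0x8000), REPO)
assert set(bundle) == {("mixed", 0x8000), ("word", 0x8041)}
```

Returning the whole bundle in one response is what allows the client to render the mixed content without further round trips.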
In this embodiment, it will be appreciated that personalized text coding makes the text depend on its code repository for correct output. This has a natural security advantage. We can deploy the text code and the text code repository separately in two different systems. Thus, only a user who has access rights to both systems can obtain the final text information. This is the concept of split storage described earlier. For example, in a traditional web microblog system, a website administrator or system database administrator can easily see any microblog content stored in the system, whether that content is public or private. However, if the microblog content is handwritten text based on object codes, and the corresponding code repository is provided by another internet service provider, an administrator without access rights to the code repository can see the text codes of a microblog but cannot obtain its text content. Meanwhile, although administrators of the code repository service provider can obtain the glyph corresponding to each text code, they do not have the text codes of the whole microblog, so the microblog content is unknown to them. Similarly, hackers who mount a man-in-the-middle attack on such a handwritten microblog system must break into both the microblog system and the code repository system at the same time to fully intercept the microblog information. This greatly increases the cost of an attack.
In addition to non-standard text coding, we can also use the transcoding mentioned above to re-encode standard codes through a coding warehouse, so that they are no longer standard, in order to protect content.
In addition to the security afforded by the splitting of such code and content based on object-coded data processing systems, the new system can provide more careful protection to literal content through other mechanisms such as, but not limited to, coding space, access control, encryption coding, content verification coding, etc.
In addition, as mentioned above, the code access space can completely isolate codes of different security levels. For example, an encoding warehouse deployed inside an enterprise rejects any direct request for private coded content. Likewise, an encoding repository deployed in a public cloud rejects requests for both enterprise-coded and private-coded text.
We can specify the corresponding coding space by specifying the range of type codes. For example, in some open code-based data processing system, we define 0-99 to be public codes, 100-199 to be enterprise codes, and 200-255 to be private codes. Thus, type codes above 99 cannot be directly supported by public code warehouses. For a private cloud-based code repository inside an enterprise, type codes larger than 199 are unsupported codes, 100-199 type codes are directly stored supported code types, and 0-99 type codes are indirectly supported code types. Such indirect support may be implemented as a content caching service of a public cloud code repository.
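The range-based code spaces in this example can be sketched as follows. The numeric ranges come from the text; the repository names and the `SUPPORT` table layout are hypothetical.

```python
def space_of(type_code):
    """Classify a type code using the ranges given in the example."""
    if 0 <= type_code <= 99:
        return "public"
    if 100 <= type_code <= 199:
        return "enterprise"
    if 200 <= type_code <= 255:
        return "private"
    raise ValueError(f"type code out of range: {type_code}")


# Which code spaces each kind of repository serves, and how.
SUPPORT = {
    "public-cloud": {"public": "direct"},
    "enterprise-cloud": {"enterprise": "direct", "public": "indirect"},
}


def support(repository, type_code):
    return SUPPORT[repository].get(space_of(type_code), "unsupported")


assert support("public-cloud", 42) == "direct"
assert support("public-cloud", 150) == "unsupported"
assert support("enterprise-cloud", 42) == "indirect"   # e.g. via a content cache
assert support("enterprise-cloud", 150) == "direct"
assert support("enterprise-cloud", 230) == "unsupported"
```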
For a given person, only one public code repository exists in the public cloud; specifically, it resides in an internet service. Private and enterprise code repositories, however, may be numerous, each residing in a different network environment and computer system. Different code repository identifiers must be generated for these repositories, and the corresponding text file or text data must store the identifier of its code repository to ensure correct encoding, decoding, input, and output.
Different non-public code warehouses will lead to the occurrence of information islands. Under certain conditions, the closed code repository is also allowed to submit content to the open code repository to facilitate sharing of the content.
Sometimes, the three-level coding access space cannot meet the actual requirements. For example, some applications may wish to establish a department-level sharing mechanism, at which point the application may define a finer subspace within the enterprise encoding space. The management of the subspace is done by the application system.
A specific example is given here:
A personal handwriting diary application uses a local private code repository. The text content of the diary is kept in internet cloud storage, while the code repository is kept on a USB flash drive carried by the user. Thus, even if hackers obtain the diary content from cloud storage, they cannot recover the information in it without the corresponding USB flash drive. When the user publishes diary content as a blog, the system must convert the corresponding text from the private coding space to the public personal coding space. In practice, this means taking the corresponding coded content out of the code repository on the USB flash drive, storing it in the public code repository, and obtaining the corresponding public codes.
In addition, protection of the encoded content in the encoding warehouse is accomplished primarily through the access control services of the encoding warehouse. Access control is primarily directed to coded metadata and specific data objects. Unlike ordinary access control, object-coded access control can achieve fine-grained control of literal content access. Encryption of portions of text content has been illustrated above in connection with access control and transcoding.
In addition, consider encryption coding. When part of the text content is encrypted by conversion coding, the code repository used for the conversion stores the codes of the sensitive text. A system administrator of the code repository, or a hacker who has intruded into it, can therefore recover all of the sensitive text from the repository on the basis of the coded content. Moreover, the plaintext obtained from the code repository may be transmitted directly over the network, which is a further security risk. An alternative is encryption coding. An encryption code is a special type of code whose corresponding content in the repository is a key. The encryption code is followed by the length of the encrypted content, and the codes of that length which follow are the ciphertext produced with this key. When the text is output, if the key corresponding to the encryption code can be obtained, the decryption process restores the ciphertext to the original codes, which are then output correctly. Access control over the encryption code thus provides dynamic access control over the encrypted content. Conventional encryption and decryption techniques may be used here. As an example, we define a simple encryption scheme: the key is a pseudo-random number (which can be generated automatically when encryption is set up), and the encryption and decryption functions are identical, namely each code is XORed with the key.
Applying this scheme to the previous example, the updated coding type information is as follows:
Figure SMS_96
the "com. Sample. Scanning" coding warehouse is as follows:
Figure SMS_97
Figure SMS_98
The original UTF-16 string "This is a SECRET!" is encoded in the new data processing system as "U+0054 U+0068 U+0069 U+0073 U+0020 U+0069 U+0073 U+0020 U+0061 U+0020 U+E004 0x8000 0x0006 U+FFAC U+FFBA U+FFBC U+FFCD U+FFAC U+FFCA U+0021". Here "U+E004 0x8000 0x0006" is the encryption code. When the decoder reads U+E004, it recognizes a code of the encryption type. Two parameters follow: 0x8000 is the specific code, whose entry in the code repository is the decryption key; 0x0006 is the length of the data covered by the encryption code, here 6 words (one word being 2 bytes). The decryption program attempts to read the content corresponding to 0x8000 from the code repository; if the key is available, it is used to decrypt the 6 following 16-bit values, yielding the codes U+0053 U+0045 U+0043 U+0052 U+0045 U+0054.
Otherwise, the following 6 words remain encrypted and cannot be displayed correctly. The decoder skips these 6 words, and the displayed output is:
This is a [12 bytes are encrypted here]!
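The decoding behaviour described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the key value 0xFFFF is an assumption (the repository table is only available as a figure), so the ciphertext words produced here need not match the ones printed in the example, and the helper names `encrypt_span` and `decode` are hypothetical.

```python
ENCRYPTION_MARKER = 0xE004  # encryption type code from the example


def encrypt_span(words, key_code, key):
    """Wrap 16-bit words as: marker, key code, length, XOR-encrypted words."""
    return [ENCRYPTION_MARKER, key_code, len(words)] + [w ^ key for w in words]


def decode(words, repository):
    """Decode a sequence of 16-bit words; repository maps key code -> key."""
    out = []
    i = 0
    while i < len(words):
        if words[i] == ENCRYPTION_MARKER:
            key_code, length = words[i + 1], words[i + 2]
            cipher = words[i + 3:i + 3 + length]
            key = repository.get(key_code)
            if key is None:
                # Key not accessible: skip the ciphertext and show a placeholder.
                out.append(f"[{2 * length} bytes are encrypted here]")
            else:
                out.append("".join(chr(c ^ key) for c in cipher))
            i += 3 + length
        else:
            out.append(chr(words[i]))
            i += 1
    return "".join(out)


message = ([ord(c) for c in "This is a "]
           + encrypt_span([ord(c) for c in "SECRET"], 0x8000, 0xFFFF)  # assumed key
           + [ord("!")])

print(decode(message, {0x8000: 0xFFFF}))  # reader holding the key
print(decode(message, {}))                # reader without access to the key
```

XOR is used only because the text names it as a simple example; a conventional cipher could take its place without changing the framing of marker, key code, and length.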
This coding method makes real-time authorization of text easy. For example, suppose we send encrypted text by email and later, for some reason, no longer want the recipient to be able to read the content. We only need to set the corresponding encryption code so that the recipient is denied access, and the mail already sent becomes unreadable. This mechanism can be used to implement a mail recall function. It is also worth noting that search engines are ineffective on encrypted text codes, because the codes have been changed.
For content verification coding, similar to encryption coding, verification information for part or all of the text codes can be placed in a coding warehouse to form a code. This encoding is called content verification encoding. By means of the content verification code, whether the text content is tampered or not can be monitored.
For example, when a leader gives an explicit instruction on some matter in an email, he can mark that text as "tamper-proof". The system then applies a hash algorithm to the text to produce a 128-bit number that, for practical purposes, uniquely corresponds to that passage. The system stores the 128-bit number in the code repository, forms a content verification code (which includes the length of the text), and places the code before the text. After the mail has been forwarded many times, the decoder can compare the verification value obtained through the content verification code with the hash of the corresponding text to determine whether the text is still the original author's wording. If verification succeeds, the result can be visualized in some form, so that the final reader knows the information has not been tampered with.
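A content verification code of this kind might be sketched as below. The text does not name a hash algorithm; BLAKE2b truncated to 16 bytes is chosen here only because it yields a 128-bit digest, and the repository is modelled as a plain dictionary. The names `protect` and `verify` and the way codes are allocated are assumptions.

```python
import hashlib


def digest128(text: str) -> bytes:
    """128-bit digest of the text (BLAKE2b is an assumed choice of hash)."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()


def protect(text: str, repository: dict):
    """Store the digest in the repository and return (code, length) to place before the text."""
    code = len(repository) + 1  # hypothetical code allocation
    repository[code] = digest128(text)
    return code, len(text)


def verify(verification, following_text: str, repository: dict) -> bool:
    """Check the text that follows a verification code against the stored digest."""
    code, length = verification
    return repository.get(code) == digest128(following_text[:length])


repository = {}
instruction = "Approve the budget."
mark = protect(instruction, repository)

print(verify(mark, "Approve the budget. Regards", repository))  # True
print(verify(mark, "Approve the budgit. Regards", repository))  # False
```

Because the digest lives in the repository rather than in the mail, a forwarder who alters the text cannot also alter the stored value without repository access.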
For multi-user coding schemes, in a multi-user environment, the literal content of multiple users is stored in a literal code repository. At this time, only the user identification is needed to distinguish the text content of different users. The coding type information may also be differentiated according to different users if necessary. In this way, different users may code the same code type differently, further increasing the security of the system.
For the code home space, sometimes different users need to share codes. We distinguish by different coding spaces. As mentioned above, personal codes vary from person to person, and the sharing codes are identical. In an enterprise code repository, the corresponding code is typically a shared code if the enterprise logo is placed therein. The various standard codes existing are typically common shared codes. In addition, some control codes, such as space codes of handwritten text, and systematic codes, such as codes representing user IDs, may employ shared codes. In this way, some system tools (e.g., retrieval systems) may use these codes more efficiently. In fact, unicode also has the concept of code attribution space, most of which is shared code, but a private area is reserved, and in fact, we say personal code here.
As described above, in an object-coded data processing system the code type itself can be coded, so an object code consists of two parts: the type code (meta code) and the code of the specific instance within that type. Applying the code home space to these two parts yields three concrete coding modes: fully shared codes, shared-type personal codes, and fully personal codes. A fully shared code is shared by all users of the code repository and is independent of any individual user; the code and its content are usually managed by the repository administrator. A shared-type personal code is still a personal code that varies from person to person, but its type code is shared: the type-code portion is identical for all users, while the remainder differs per user. One benefit is that text processing tools can obtain the type information of a text code without any personal information and then process the code on that basis. In a fully personal code, both parts vary from person to person. Such a code is therefore the most secure but also the least operable: a text processing tool must use the code owner's user information to obtain the code type information before it can obtain the full code information. Note that, for the same code type, all three kinds of code may coexist in one code repository.
For the same user, his personal code and the available shared code will appear simultaneously in his text. At this time, it is necessary to distinguish by the coding space. The following are examples:
In the previous example, we add a standard smiley-face icon at the end of the sentence, as shown in Fig. 42. This smiley-face icon also comes from the code repository, and its code is an emoticon code shared by all users. The space code here is also a shared code, while the handwriting codes are shared-type personal codes. The shared type information is as follows (assuming here that shared type codes are 0x01-0x7F):
Figure SMS_99
in the above table, the encoding type 0x01 and type 0x02 are identical except for the home space. In practice, types 0x01 and 0x03 are both shared codes, while type 0x02 is a personal code. But these three types are shared, and in the same code repository, the personal type information will have one more user ID than the shared type information.
The following is a content item of type 0x02:
Figure SMS_100
The following is a content item of type 0x03:
Figure SMS_101
therefore, the corresponding codes of the characters are as follows:
0x03 0x41 0x04 0x03 0x42 0x04 0x03 0x43 0x04 0x02 0x05
In addition, regarding the coding of the user: from the above example, note that every personally coded content item contains user ID information. In a multi-user code repository, personally coded data objects vary from person to person, and the codes of different users can be distinguished by the user ID of the data object. However, text codes may exist separately from the code repository, so where should the corresponding user ID information be placed? There are two cases.
For single user literal codes, one case is that the individual codes in the literal code all come from the same user (shared codes are not accessible by user ID). There may be different implementations, one way is to set up the system code using the context object mentioned before; another way is to explicitly define the user type as a context object type in the coding model, in which case we need only to encode the user ID information into a shared code and place it at the forefront of the literal code content.
The sharing type information of the above example adds the user ID code, updated as follows:
Figure SMS_102
Figure SMS_103
Accordingly, a two-byte user ID is used directly as the coding parameter of type 0x01. The final encoding of the above example is:
0x01 0x0C3F 0x03 0x41 0x04 0x03 0x42 0x04 0x03 0x43 0x04 0x02 0x05
Thus, after reading the first three bytes, 0x01 0x0C3F, the text code reader knows which user the subsequent personal codes belong to.
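Reading this leading user code might look like the sketch below. It only splits off the prefix and leaves the remaining codes untouched, because the parameter layout of the other types is defined in tables that are available only as figures. Big-endian byte order for the two-byte user ID and the function name `split_user_prefix` are assumptions.

```python
USER_ID_TYPE = 0x01  # shared type code carrying the user ID, as in the example


def split_user_prefix(data: bytes):
    """Return (user_id, remaining codes); user_id is None if no prefix is present."""
    if len(data) >= 3 and data[0] == USER_ID_TYPE:
        # Assumption: the two ID bytes are stored most significant byte first.
        return int.from_bytes(data[1:3], "big"), data[3:]
    return None, data


encoded = bytes([0x01, 0x0C, 0x3F,
                 0x03, 0x41, 0x04, 0x03, 0x42, 0x04,
                 0x03, 0x43, 0x04, 0x02, 0x05])
user_id, rest = split_user_prefix(encoded)
print(hex(user_id), rest.hex(" "))
```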
Sometimes, this user code may be omitted, in effect an implicit code context. For example, in a personal handwriting application, each user's literal code content is the user's personal code. In such a system, the user ID of the literal code repository corresponds uniquely to the system account. The ID may be stored elsewhere than in the literal code.
For multi-user literal coding, another case is multi-user hybrid coding, that is, coding in which multiple literal code repository users may occur in the same document. We can use the above scheme, but different user codes can appear multiple times in the text. The personal code after each user code is the personal code of that user. In addition, we can also take the user ID as an attribute of text in a structured document (e.g., XML-based document: XHTML, SVG, etc.).
There is of course also a most straightforward context-free coding scheme, namely to have the user ID directly as part of the coding.
For multi-user multi-application coding schemes, in multi-user systems, as a repository for data object data content, the coding repository is often shared by multiple application systems. The developer of the application system is given the opportunity to obtain the object code that the user stores in his system. If the same user adopts the same coding mode for different applications, if a hacker or a malicious application developer analyzes the object code of a certain user in a certain application, the corresponding relation between the user code and the content can be established. This correspondence can be used directly for other application systems. Thus, the code isolation between different applications greatly enhances the security of the system. Coding segregation is the fact that the data content of the same data object is different for different applications. To achieve coding isolation and sharing between applications, coding space associated with the applications may be used herein. Different applications may use different coding spaces or the same coding space when applying for a certain coding.
Examples of some applications of handwriting input systems that may incorporate the coding scheme of the present invention are further shown below:
1. handwriting systems in specific fields, such as handwriting diary, handwriting account book, handwriting Sudoku, handwriting character filling game, etc.;
2. a handwriting-based command line input system;
3. a handwriting-based formula editor;
4. handwriting-based programming systems.
In addition, to further illustrate implementations of the coding scheme, consider personalized DSL documents. Because coding in the new data processing system is open, we can also encode user interactions in a specific domain. The user's interaction data can then be stored, processed, and transmitted as text. One benefit is that this interaction data can be stored and processed together with the user's other text, and it can also be handled with existing text processing tools. In addition, the coding schemes mentioned above can be used to personalize the user data and thereby secure the interaction data.
Specifically, take online Go as an example, as shown in Fig. 43.
We can define four shared coding types. The first is the user code type, which encodes the user's code repository user ID. The second is the game-start code, a domain-specific (application) code followed by the user IDs of the black and white players. The third is the move code, followed by the position of the stone; as shown above, the position can be expressed in two bytes, for example 0x00, 0x00 for the upper-left corner and 0x09, 0x09 for the center point. The last is the delay code, which records the number of seconds elapsed since the previous move. Here we use an 8-bit word length and a scheme compatible with ASCII, so every non-ASCII code uses bytes whose most significant bit is 1. The type information (coding metadata) is as follows:
Figure SMS_104
here, all six encodings are content encodings, so there are no data objects in the encoding repository. The following is an example of the game (in this example, codes other than ASCII codes are all represented by hexadecimal):
0x81 0x85 0x83
0x80 0x85 0x83 0x85 0x82 0x8F 0x83
0x83 0x82 0x84 0x86
0x80 0x83 0x83 0x8A 0x82 0x83 0x83
0x80 0x86 0x83 0x87 Hello,everybody!
0x80 0x85 0x83 0x88 0x82 0x8F 0x90
0x80 0x83 0x83 0x8F 0x82 0x83 0x8F
0x85 0x86
0x80 0x85 0x83 0x83 0x82 0x90 0x8A
0x80 0x83 0x83 0x8F 0x82 0x8D 0x82
This object code sequence is stored in the website storage of the Go application. Because the new data processing system is used, game data and chat data can be mixed together. From this content, the application can render the following in the user's chat log (assuming that user ID 0x05 has the user name "Xiao Ming", user ID 0x03 the user name "Xiao Liang", and user ID 0x06 the user name "Xiao Qiang"):
System: Xiao Ming plays black and Xiao Liang plays white. The game starts.
(5 seconds after the start of the game) Xiao Ming: move P4
(7 seconds after the start of the game) System: Xiao Qiang enters as a spectator
(15 seconds after the start of the game) Xiao Liang: move D4
(22 seconds after the start of the game) Xiao Qiang: Hello,everybody!
(23 seconds after the start of the game) Xiao Ming: move P17
(38 seconds after the start of the game) Xiao Liang: move D16
(38 seconds after the start of the game) System: Xiao Qiang leaves
(41 seconds after the start of the game) Xiao Ming: move Q11
(56 seconds after the start of the game) Xiao Liang: move N3
The playing process can also be visualized in a graphical mode.
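How the byte records map to the chat log can be sketched with a small decoder. The type table exists only as a figure, so the assignments used here are inferred from the sample record and its rendered log and should be read as assumptions: 0x80 user, 0x81 game start, 0x82 move, 0x83 delay, 0x84 spectator enters, 0x85 spectator leaves; parameters carry their value in the low 7 bits; coordinates are zero-based; and the delay is counted from the previous move. The romanized user names are placeholders.

```python
# Type bytes inferred from the sample record (assumption; the type table is a figure).
USER, START, MOVE, DELAY, ENTER, LEAVE = 0x80, 0x81, 0x82, 0x83, 0x84, 0x85
# 19 zero-based column letters; the letter I is kept because 0x8F is rendered as P.
COLUMNS = "ABCDEFGHIJKLMNOPQRS"


def value(byte):
    """Parameter bytes carry their value in the low 7 bits (assumption)."""
    return byte & 0x7F


def render(records, names):
    """Turn a list of byte records into (seconds, speaker, text) tuples."""
    log, last_move_time = [], 0
    for record in records:
        speaker, delay, parts, is_move = "System", 0, [], False
        i = 0
        while i < len(record):
            code = record[i]
            if code == USER:
                speaker, i = names[value(record[i + 1])], i + 2
            elif code == DELAY:
                delay, i = value(record[i + 1]), i + 2
            elif code == START:
                black, white = names[value(record[i + 1])], names[value(record[i + 2])]
                parts.append(f"{black} plays black and {white} plays white. The game starts.")
                i += 3
            elif code == MOVE:
                x, y = value(record[i + 1]), value(record[i + 2])
                parts.append(f"move {COLUMNS[x]}{y + 1}")
                is_move, i = True, i + 3
            elif code == ENTER:
                parts.append(f"{names[value(record[i + 1])]} enters as a spectator")
                i += 2
            elif code == LEAVE:
                parts.append(f"{names[value(record[i + 1])]} leaves")
                i += 2
            elif code < 0x80:
                j = i  # run of plain ASCII chat text
                while j < len(record) and record[j] < 0x80:
                    j += 1
                parts.append(record[i:j].decode("ascii"))
                i = j
            else:
                raise ValueError(f"unknown type byte 0x{code:02X}")
        seconds = last_move_time + delay  # delay is relative to the previous move
        if is_move:
            last_move_time = seconds
        log.append((seconds, speaker, " ".join(parts)))
    return log


RECORDS = [
    bytes([0x81, 0x85, 0x83]),
    bytes([0x80, 0x85, 0x83, 0x85, 0x82, 0x8F, 0x83]),
    bytes([0x83, 0x82, 0x84, 0x86]),
    bytes([0x80, 0x83, 0x83, 0x8A, 0x82, 0x83, 0x83]),
    bytes([0x80, 0x86, 0x83, 0x87]) + b"Hello,everybody!",
    bytes([0x80, 0x85, 0x83, 0x88, 0x82, 0x8F, 0x90]),
    bytes([0x80, 0x83, 0x83, 0x8F, 0x82, 0x83, 0x8F]),
    bytes([0x85, 0x86]),
    bytes([0x80, 0x85, 0x83, 0x83, 0x82, 0x90, 0x8A]),
    bytes([0x80, 0x83, 0x83, 0x8F, 0x82, 0x8D, 0x82]),
]
NAMES = {0x05: "Xiao Ming", 0x03: "Xiao Liang", 0x06: "Xiao Qiang"}

for seconds, speaker, text in render(RECORDS, NAMES):
    print(f"({seconds} s) {speaker}: {text}")
```

Each record is modelled as one `bytes` object; how records are delimited in the stored stream is not stated in the text and is left open here.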
According to the text record, the go application can play back the whole playing process. If the privacy of players is considered, only the playing process authorized by both players can be played back normally. For traditional applications, implementing this function requires much effort in the application system: establishing a user authorization system, maintaining user authorization information, and the like. The playing data separated from the authorization system does not have any privacy protection. Therefore, leakage of application data due to any cause may lead to leakage of user privacy. In the new data processing system, the key data is placed in the protection of the context coding space of the coding warehouse, so that the safety of the application and the data can be greatly enhanced, and the complexity of the application system can be reduced.
Returning to the Go example, we need only replace the move type with a context-dependent type:
Figure SMS_105
In the user space corresponding to the code repository (in fact, the document space within the user space), the move code data for this game belonging to Xiao Ming are:
Code  X  Y
1     P  4
2     P  17
3     Q  11
Xiao Liang's move code data, consistent with the game record above (D4, D16, N3), are:
Code  X  Y
1     D  4
2     D  16
3     N  3
Thus, the corresponding literal code is:
Figure SMS_106
Thus, as long as Xiao Ming and Xiao Liang grant appropriate authorization on their respective coding spaces in the code repository, they can control access to the game by the system or by other people.
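The effect of context-dependent move codes can be sketched with a toy repository in which each player's moves live in that player's own space. The class `CodeRepository`, its methods, the `replay` helper, and the reader name "spectator" are all hypothetical; a real repository would sit behind an authenticated service rather than in-process dictionaries.

```python
class CodeRepository:
    """Toy repository: one coding space per owner, with a reader grant list."""

    def __init__(self):
        self._spaces = {}   # owner -> {code: (column, row)}
        self._readers = {}  # owner -> set of authorized readers

    def store(self, owner, code, position):
        self._spaces.setdefault(owner, {})[code] = position

    def grant(self, owner, reader):
        self._readers.setdefault(owner, set()).add(reader)

    def resolve(self, owner, code, reader):
        if reader != owner and reader not in self._readers.get(owner, set()):
            raise PermissionError(f"{reader} may not read codes of {owner}")
        return self._spaces[owner][code]


def replay(moves, repository, reader):
    """Render (owner, code) pairs; moves the reader may not resolve stay hidden."""
    shown = []
    for owner, code in moves:
        try:
            column, row = repository.resolve(owner, code, reader)
            shown.append(f"{owner}: {column}{row}")
        except PermissionError:
            shown.append(f"{owner}: [hidden]")
    return shown


repo = CodeRepository()
for code, position in enumerate([("P", 4), ("P", 17), ("Q", 11)], start=1):
    repo.store("Xiao Ming", code, position)
repo.store("Xiao Liang", 1, ("D", 4))
repo.grant("Xiao Ming", "spectator")  # only one player has authorized the spectator

game = [("Xiao Ming", 1), ("Xiao Liang", 1), ("Xiao Ming", 2)]
print(replay(game, repo, "spectator"))
```

With only one player's grant in place the game cannot be fully replayed, which matches the requirement that playback be authorized by both players.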
As already mentioned above, the coding repository may be regarded as a font repository for the new data processing system. The font library is not necessarily standard font information, but can be any other type of information; the location where the information is stored is also not specific, but arbitrary. Such a font library is of course also capable of storing standardized coded font information, i.e. the content of a conventional font library. Taking the example of a vector outline word stock, the vector outline information of each word (or letter) may be stored in a specific store in the code repository according to the location of its standard code (e.g., unicode code). Other information needed in text output, such as Hinting, kerning, etc., can also be stored in the code warehouse.
The code repository may be deployed in a network, and the networked word repository may be more easily maintained, upgraded, new fonts added, etc. A conventional word stock file may be used as a local cache of the corresponding content of the code stock. Meanwhile, the content selection service of the coding warehouse can also select font content with different quality according to different output devices.
The text display client only needs to obtain rendering information or rendering results corresponding to the characters from the coding warehouse according to font information when rendering the characters of standard codes, and can correctly render the traditional characters.
In computer systems, people use text data not only to record their own or others' words but also to represent models and data in different fields. Models and data are generally recorded as formatted text, which has the benefit of facilitating automated analysis and processing by computer. XML is a typical formatted text that can express arbitrary models through a tree structure. Because XML is readable by both humans and machines, extensible, and flexible, text formats based on the XML specification are in widespread use. XHTML, SVG, RDF, and other formats used on the web are all XML-based. In fact, the XML standard is one of the cornerstones of the Internet.
However, XML has a deadly weakness in that it is too redundant, resulting in too great a cost for file storage, transmission, and processing. This is also the reason why the world wide web consortium (W3C) has formulated the EXI (Efficient XML Interchange) standard. This is a binary XML standard.
Similarly, representing XML files in a new data processing system can also circumvent its deadly vulnerabilities. However, unlike the full binarization of EXI, the XML file in the new data processing system remains in text format, but the corresponding code has only changed to object code. As can be seen from the SVG example in OTF-8, we reduce the redundant information on XML syntax by object coding. In combination with metadata in the code repository, the converted result is fully equivalent to the information before conversion. With the aforementioned "hybrid coded universal display and edit" text service, one can easily view and edit text content. We can use the encoding repository to a greater extent, with the values of XML elements, attributes as the data parameters of the corresponding encoding, and directly encode using object encoding. This can further compress the storage space and reduce the likelihood of errors. Of course, we can also store the XML content or fragments directly in the encoding repository and use the encodings in the XML file, but this is just the use of XML by the encoding repository and not the optimization of the XML encoding itself.
Using an object-encoded XML file, we need only make small changes to the XML parser so that it obtains the relevant information from the code repository. On this basis, all existing XML technologies such as SAX, DOM, XPath, XSLT, and XSL-FO can be used directly. For application developers, all changes occur in the storage and parsing layers of the XML file; if the API remains unchanged, applications that use XML need no changes and directly benefit from smaller files and faster transmission.
In fact, the existing XML specification uses the same set of characters to express both syntactic markup and text content. This imposes many restrictions when generating XML files, for example: certain system characters ("<", ">", "&", and so on) must be escaped; unparsed data must be wrapped in "<![CDATA[" and "]]>"; and so on. Object coding makes these restrictions entirely unnecessary, because whether something is markup or content is determined not by the code itself but by the information in the code repository that corresponds to the code. We can therefore simplify XML and the corresponding parsing process.
Similarly, we can encode existing arbitrary text format (e.g., CSV, RTF, CSS, JSON, even programming language, etc.) objects using the same approach:
1. placing the corresponding content of the grammar tag/key in an encoding repository, encoding in the file using the corresponding object;
2. any character restrictions in the data/text content are removed.
As mentioned above, object coding easily eliminates the conflict between formatting codes and text content that exists under standard encoding. Moreover, the separation of code and content in open coding, together with the openness of coding types, makes it possible to mix several different text formats together. Some existing text format specifications already allow for this: for example, JavaScript can be embedded in XHTML, as can Base64-encoded binary data, and OLE objects can be embedded in RTF. However, these formats are constrained by standard text encoding, so data in a different format requires transcoding or character escaping; and existing format mixing is limited, with one format dominant and the others merely embedded data. With object coding, arbitrary formats can be mixed easily: for example, table data can be embedded in a node of an XML document (in effect a tree document); conversely, a tree document can be placed in a table cell; or document data in two different formats can be placed side by side. Of course, such mixing of multiple formats is subject to rules:
1. Each format must have an explicit format start and end code.
2. The beginning and ending of different formats cannot be interleaved together. That is, one format starts inside another format and must end inside it.
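The two rules amount to requiring that format start and end codes be properly nested, which a stack can check. In this sketch the decoded stream is modelled as `(kind, name)` tuples; that token shape and the function name `is_well_nested` are assumptions made for illustration.

```python
def is_well_nested(tokens):
    """Check that every format has matching start/end codes and that formats never interleave.

    tokens: iterable of ("start", fmt), ("end", fmt) or ("data", payload) tuples.
    """
    open_formats = []
    for kind, name in tokens:
        if kind == "start":
            open_formats.append(name)
        elif kind == "end":
            # The format being closed must be the most recently opened one.
            if not open_formats or open_formats[-1] != name:
                return False
            open_formats.pop()
    return not open_formats  # nothing may remain open at the end


nested = [("start", "xml"), ("start", "table"), ("data", "1,2,3"),
          ("end", "table"), ("end", "xml")]
interleaved = [("start", "xml"), ("start", "table"),
               ("end", "xml"), ("end", "table")]

print(is_well_nested(nested))       # True
print(is_well_nested(interleaved))  # False
```

Two formats placed side by side are simply two consecutive start/end pairs at the same depth and pass the same check.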
Furthermore, object coding also allows us to embed binary data directly into the coding result. In fact, it is the content encoding mode of the data object data content. Only the corresponding binary encoding method needs to be described in the corresponding encoded metadata. The composition of such object codes may be in the form of:
Figure SMS_107
in fact, the implementation of mixed format encoding is very natural for object-encoded data processing systems. In an open object coding system, different encoders and decoders are originally required for different coding types, and in one object coding document, the encoders and decoders are dynamically loaded according to requirements. The encoder encodes the object into a byte stream and the decoder decodes the byte stream into the object. And the different formats are to divide the codec into different groups. Thus, the encoding of a format actually encodes the corresponding memory model into a byte stream, while the decoding of the format decodes the byte stream into the memory model, i.e., higher level objects. Thus, the format codec is actually a more macroscopic object codec, which can be managed in the same way in the new data processing system.
In essence, the object encoding system encodes an object string with a byte stream. The object strings, i.e. the objects in the object array, may be as simple as a single character, or may be as complex as an abstract syntax tree corresponding to the program code, or a tree structure corresponding to XML.
In addition, for handwriting-based programming systems, the objects of interest to the compiler and interpreter are primarily symbols in the programming system. As to whether this symbol corresponds to a word or a graphic, it does not affect the progress of compilation and interpretation execution. In this process, symbol matching is extremely important. Therefore, in handwriting data processing system, we can reuse the existing programming language infrastructure by only making graphic matching of literal content and using the same code for matching content. This pattern matching is largely divided into two types: keyword matching and identifier matching. The result of the keyword matching is a system keyword (typically standard code for conventional programming languages); the result of the identifier matching is the same custom code or extension code.
In addition, text files are currently used in most programming languages. Also, the program source code object can be encoded using the method described above. Object coding of program source code may provide the following benefits:
1. The file size is reduced. This is particularly important for source code, such as JavaScript, that needs to be transmitted in the network.
2. The programming can be performed using non-standard coding. This makes possible, for example, handwriting programming, voice programming.
3. The security features of open code may be used to place the code in the source code in the author's or copyrighter's associated context space, which only authorized users can use.
4. When source code whose keywords are open-coded is parsed, lexical scanning and analysis of keywords reduce to direct code recognition, which is more efficient.
As with the object coding of most text files, the object coding of program source code is done primarily at the tool level, and is completely transparent to the end user.
In addition, open coding itself opens new possibilities for programming languages. We can build computer software in a completely new way: the data may exist in an encoding warehouse, to which direct reference may be made in the program; the program can also exist in a code warehouse, and can be referenced in a code manner; the data may also be mixed with the program in some form.
In addition, regarding machine instruction encoding: the code repository is in effect a natural codebook, and data encoded through it is strongly protected. We can therefore use the code repository to encode not only text but also binary data. One typical application is context-dependent object coding of machine instructions. The binary files of the same application then differ from user to user, and a user cannot run another user's executable files. This is in effect a digital rights protection scheme for applications. The scheme can also prevent viruses or malicious programs from tampering with executable files.
This scheme is implemented mainly by modifying the program execution engine or virtual machine. Taking the Java virtual machine as an example: the standard Java virtual machine instruction codes are re-encoded per user by some method (such as a random algorithm), and the re-encoded instruction codes are placed in the code repository with appropriate protection permissions; executable Java bytecode is then encoded according to the re-encoded instruction codes; and during execution, the Java virtual machine dynamically restores the current bytecode to standard instruction codes according to the current user's information. In this way, only the corresponding user can correctly execute the corresponding Java bytecode.
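The per-user re-encoding can be illustrated with a random permutation of the 256 possible opcode values. This is a toy sketch, not a modified virtual machine: it treats the program as a stream of operand-free opcodes, seeds Python's `random.Random` with a per-user secret purely for reproducibility (a real system would need a cryptographically secure source), and all function names are hypothetical.

```python
import random


def make_opcode_table(user_secret: int):
    """Per-user permutation of opcode values: table[standard] -> recoded."""
    table = list(range(256))
    random.Random(user_secret).shuffle(table)  # not cryptographically secure
    return table


def invert(table):
    """Build the reverse mapping: inverse[recoded] -> standard."""
    inverse = [0] * 256
    for standard, recoded in enumerate(table):
        inverse[recoded] = standard
    return inverse


def recode(opcodes: bytes, table) -> bytes:
    """Map every opcode through the given table."""
    return bytes(table[op] for op in opcodes)


# iconst_1, iconst_2, iadd, ireturn: standard JVM opcodes without operands.
program = bytes([0x04, 0x05, 0x60, 0xAC])

table_a = make_opcode_table(user_secret=1234)  # would be kept in the user's coding space
table_b = make_opcode_table(user_secret=5678)

stored = recode(program, table_a)             # what is written to disk for user A
restored = recode(stored, invert(table_a))    # what user A's engine executes
garbled = recode(stored, invert(table_b))     # what user B's engine would see

print(restored == program, garbled == program)
```

Real bytecode also contains operands, so an actual implementation would have to walk instruction boundaries and remap only the opcode bytes.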
For binary format coding, as with executable files, some or all of the key information in other binary data files can be placed in the coding region, thereby providing copyright protection: only authorized users can obtain the key information and use the corresponding binary data.
Taking video files as an example, many video file formats are actually container formats that can hold video and audio streams in different coding formats. The industry typically identifies the coding format with a four-byte code called "FourCC". A video player uses this FourCC to select the correct decoder for decoding and playing the video and audio streams. Several hundred FourCCs are registered. We can replace the FourCCs in a video file with object codes, while the actual stream coding identifiers are stored in the corresponding code repository. Thus, by controlling the access rights of the code repository, we can control the playback of video files or streams.
In addition, regarding data compression, with the use of the coding warehouse, a data compression function can also be realized: the repeated portions of the data are placed in the coding region and the corresponding open codes are used.
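As a minimal sketch of this idea, the following assumes the data is cut into fixed-size chunks and that the coding warehouse is a plain in-memory dictionary; both are illustrative assumptions, as are the function names. Each distinct chunk is stored once, and the data itself is reduced to a list of codes.

```python
def compress(data, warehouse, chunk_size=4):
    """Replace each chunk by its code; store unseen chunks in the warehouse."""
    codes = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # A chunk that is already stored reuses its existing code.
        code = warehouse.setdefault(chunk, len(warehouse))
        codes.append(code)
    return codes


def decompress(codes, warehouse):
    """Look every code up in the warehouse and concatenate the chunks."""
    chunk_by_code = {code: chunk for chunk, code in warehouse.items()}
    return b"".join(chunk_by_code[code] for code in codes)


warehouse = {}
codes = compress(b"abcdabcdabcdxy", warehouse)
```

Here the repeated chunk `abcd` is stored once and referenced three times; the gain depends entirely on how much repetition the data contains.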
In addition, for network digital stores, we have seen that the security mechanism built in the object code repository allows digital rights management, identity authentication, etc. to be easily implemented on the basis of the code repository. We can use it for the construction of network digital stores.
The network digital store system is mainly an application system that provides digital content transaction services to network users; application stores, electronic libraries, and the like fall into this category. Its users fall mainly into two categories: providers of digital content and consumers of digital content. The network digital store system can be built directly on the code warehouse: all users are users of the code warehouse, and the built-in security of the code warehouse is used by associating the digital content with user-related context codes.
Specifically, consumer consumption of digital content is largely in two modes: rental mode and purchase mode.
In the rental mode, the digital content or digital asset is owned by the provider, and the consumer gains only temporary access or use by some means, typically payment. Rented digital content is typically time-limited, and expired content is no longer accessible to the consumer. By incorporating provider-related context codes into the digital content, rental-mode access control (access authorization based on each user's rental period) can be achieved.
In the purchase mode, the consumer obtains access to the digital content in some way, such as a paid purchase. The main problem here is digital rights protection, that is, preventing illegal copies. With the code repository, this is implemented by embedding in the purchased digital content a special context code that resides in the user's personal space. The code is accessible only to that user, and the user cannot change its access rules. Thus, other users cannot use the same digital copy normally even if they obtain it.
As can be seen from the above description, the most central part of the object-based coded data processing system is the coding warehouse (or coding library). Various encoded metadata may be stored therein; the actual content of the text may also be stored therein. Through various services provided by the code repository, the new text input system may convert various text content, or other content (e.g., user interactive content, domain-specific content, application content, etc.), into text codes that are stored and processed by the application system. In the process of generating the literal code, part or all of the literal content is stored in the code repository. Also, through the service of the code repository, the new text output system can convert the character strings sent by the application into text content that can be rendered or played, or an object model that can be used by the application.
Of course, the code repository need not be a single library or storage space. A generalized code repository may be a combination of multiple storage volumes, or even storage held by cloud storage service providers under different secure channels.
In the new system, the encoding and decoding systems or functions are the basis of both coding-layer processing and text data processing. As the core of the new coding system, the coding warehouse provides at least two basic services. The first receives the content to be encoded, ensures that the content is correctly stored in the coding warehouse, and returns the corresponding code; this is called the encoding service, and the encoding system uses it to obtain the correct text code. The second returns the content item corresponding to a given code; this is called the decoding service, and the decoding system needs it to obtain content that the output system can render correctly. Of course, in a single-user system, the encoding and decoding functions or services may be provided directly on the user side rather than on the code repository side.
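The two basic services can be pictured with a small in-memory sketch. The class and method names are hypothetical, codes are modelled as consecutive integers, and identical content is assumed to map to the same code; none of these details is prescribed by the text.

```python
class CodingWarehouse:
    """Toy warehouse offering an encoding service and a decoding service."""

    def __init__(self):
        self._content_by_code = {}
        self._code_by_content = {}

    def encode(self, content):
        """Store the content if it is new and return its code."""
        code = self._code_by_content.get(content)
        if code is None:
            code = len(self._content_by_code) + 1
            self._content_by_code[code] = content
            self._code_by_content[content] = code
        return code

    def decode(self, code):
        """Return the content item stored under the given code."""
        return self._content_by_code[code]


warehouse = CodingWarehouse()
hello = warehouse.encode("hello")
world = warehouse.encode("world")
```

An application would keep only `hello` and `world` (the codes) in its own storage and call `decode` when the content has to be output.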
Fig. 44 is a schematic diagram of a first embodiment of an encoding processing system according to the present invention. As shown in Fig. 44, the encoding processing system includes: a receiving unit 11C, a metadata extraction unit 12C, a metadata code generation unit 13C, a coding protocol selection or creation unit 14C, an instance code generation unit 15C, and an object code generation unit 16C. Specifically, the receiving unit 11C is configured to receive an encoding processing request and obtain the data object to be encoded according to the request; the metadata extraction unit 12C is configured to obtain metadata from the data object to be encoded; the metadata code generation unit 13C is configured to query a code repository according to the metadata and obtain the meta-code corresponding to the metadata; the coding protocol selection or creation unit 14C is configured to select or create a corresponding coding protocol according to the meta-code; the instance code generation unit 15C is configured to encode the data content of the data object according to the coding protocol to obtain an instance code; and the object code generation unit 16C is configured to obtain the object code corresponding to the data object based on the meta-code and the instance code.
In this embodiment, the encoding processing system may execute the technical solutions of the method embodiments shown in fig. 5A and fig. 5B, and its implementation principle and effect are similar, and will not be described herein again.
In addition, further, the encoding processing system may further include: the data compression unit is used for compressing data before data transmission and storage, and corresponding compression processing can be described or embodied in the coding protocol; and an encryption unit for encrypting the data object or code to be encrypted.
Fig. 45 is a schematic structural diagram of a first embodiment of a decoding processing system according to the present invention. As shown in Fig. 45, the system includes: a receiving unit 21C, a disassembling unit 22C, an acquiring unit 23C, and a recovery unit 24C. The receiving unit 21C is configured to receive a decoding processing request and obtain the object code to be decoded according to the request; the disassembling unit 22C is configured to disassemble the object code to obtain a meta-code, or a meta-code and an instance code; the acquiring unit 23C is configured to query a code repository and obtain the corresponding metadata and coding protocol according to the meta-code; and the recovery unit 24C is configured to obtain the data object corresponding to the object code according to the metadata and the coding protocol, or according to the metadata, the coding protocol, and the instance code.
In this embodiment, the decoding processing system may execute the technical solution of the method embodiment shown in fig. 32, and its implementation principle and effect are similar, which is not described herein again.
Further, the decoding processing system may include a corresponding data decryption unit, a corresponding data decompression unit, and the like, corresponding to the encoding processing system.
This embodiment describes in detail, as an example, a word processing system based mainly on the object coding system. Fig. 46 is a schematic diagram of such a word processing system. As shown in Fig. 46, the new system is generally divided into two parts: a coding warehouse and a corresponding processing system.
The coding warehouse may comprise two parts: the encoded data, and the related services built around that data.
In particular, the open-coding model lends itself to an object-based implementation. Because codes are persistent, we can use an object database, or store the objects in other kinds of databases through object-relational mapping techniques.
For the encoding service, the encoding service is actually a process in which an encoding warehouse receives object data, stores it in a library, and returns a corresponding encoding. As can be seen from the previous coding model, this coding is divided into two parts: meta-coding and instance-coding. For the more common short word length codes, we generally provide two corresponding sub-services.
For the meta-code object registration sub-service: after obtaining a registered named coding space, the client can register a coding type in that space. The coding type includes the target coding space for codes of that type, which is in fact specified by the meta-code space corresponding to the type data. On receiving the registration request, the code warehouse verifies the security and legitimacy of the request according to the system and user settings and, once verification passes, returns the corresponding code to the client.
The naming code space is not the only target space for code type registration, and the client can also register directly with the root space of the code repository. Similar to registering a namespace type, the code repository places the code type into a particular code space based on system and user settings and returns the corresponding entire code space path and type code to the client.
For the object coding sub-service: when a client makes a coding request to the coding warehouse, it must also provide the corresponding meta-code and type code. The code repository may store the object in the data store corresponding to the coding type and return the storage location of the object to the client.
The decoding service is the inverse of the encoding service: the code warehouse receives a code and returns the corresponding data object to the client.
In particular, the coding warehouse provides two sets of decoding sub-services. In a short word length implementation of the decoding service, we give a simple constraint: the meta-code and the instance-code are each represented by a separate code point, the instance-code only appearing after the meta-code. Thus, the decoding service can be completed through two sub-services.
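The constraint can be illustrated by a small parser that splits a stream of code points into (meta-code, instance code) pairs. The way meta-codes are told apart from instance codes is not specified here; the sketch assumes, purely for illustration, that meta-codes occupy a reserved range, and it assumes that each meta-code is followed by at most one instance code.

```python
# Hypothetical reserved range for meta-code points (an assumption of this sketch).
META_RANGE = range(0xE000, 0xF000)


def split_object_codes(code_points):
    """Group a code point stream into (meta_code, instance_code_or_None) pairs."""
    objects = []
    pending_meta = None
    for point in code_points:
        if point in META_RANGE:
            if pending_meta is not None:
                # The previous meta-code had no instance code of its own.
                objects.append((pending_meta, None))
            pending_meta = point
        else:
            if pending_meta is None:
                raise ValueError("instance code without a preceding meta-code")
            objects.append((pending_meta, point))
            pending_meta = None
    if pending_meta is not None:
        objects.append((pending_meta, None))
    return objects


pairs = split_object_codes([0xE001, 0x41, 0xE002, 0xE001, 0x42])
```

Each pair can then be resolved in two steps: the meta-code through the meta-code decoding sub-service, and the instance code through the object decoding sub-service.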
For the meta-code decoding sub-service: when a client makes a decoding request to the code repository for a code in a particular coding space (or the root space if none is specified), the code repository first performs a security check to see whether the current context object satisfies the system security settings. If it does, the repository returns to the client the coding metadata for that meta-code in the designated coding space. This coding metadata includes the type information corresponding to the type code and the target coding space for the corresponding code instances. If the type is itself a coding-metadata type, its coding space is a subspace of the current space.
For the decoding encoding object sub-service, similarly, after the client obtains the encoding metadata, a decoding request for a specific encoding space, a specific encoding type, and a specific encoding may be made to the encoding warehouse. On the basis of meeting the security setting, the code warehouse returns object data of the corresponding position of the code to the client.
For the content caching service: the service may be implemented by object-coding the code repositories themselves. Specifically, object codes for one or more other code warehouses are created in one code warehouse; the content of such a code-warehouse object is mainly a reference to the target warehouse, such as a URL or a connection string. Each target warehouse then corresponds to one coding space. During encoding and decoding, a cache code warehouse set up for this purpose can act as a caching proxy, storing the target codes and their contents in the coding space that corresponds to the target code warehouse.
For environment-aware authorized access systems, the security of the new system is based primarily on encoding warehouse authorized access services. Other services of the code repository are provided on the basis of the authorized access to the service.
Unlike typical authorized access systems, the granularity of authorized access by the code repository may be very fine, and may be some specific code. Moreover, the use of codes has a specific context, such as the author, reader, application, document, etc. of the code. Thus, based on this context model and its associated extension model, various rules may be defined to facilitate access settings to various coding services within the coding warehouse.
The implementation of the environment (context) aware authorized access system does not have any technical difficulties and can meet the needs using conventional rule-based techniques.
The access authorization rule base is maintained mainly by the system administrator and the code authors, who set code access rules beyond the system defaults.
Authorization rules are set on the basis of the coding model and the coding context model, for example coding type, coding space, coding context, time, location (GPS), code author, and code reader. In addition, an application system that uses the coding warehouse can provide it with an extended model of the coding context. Code access rules can be set on the basis of all these models.
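A conventional rule-based check over such a context model might look like the sketch below. The context fields, the three example rules, the "first decisive rule wins" policy, and the default-deny fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessContext:
    reader: str
    author: str
    code_space: str
    application: str
    hour: int  # local hour of the request, 0-23


def author_may_always_read(ctx) -> Optional[bool]:
    return True if ctx.reader == ctx.author else None


def payroll_office_hours_only(ctx) -> Optional[bool]:
    if ctx.code_space == "payroll" and not 9 <= ctx.hour < 17:
        return False
    return None


def editor_application_may_read(ctx) -> Optional[bool]:
    return True if ctx.application == "editor" else None


RULES = [author_may_always_read, payroll_office_hours_only, editor_application_may_read]


def is_allowed(ctx, rules=RULES, default=False):
    """Return the decision of the first rule that has an opinion, else the default."""
    for rule in rules:
        decision = rule(ctx)
        if decision is not None:
            return decision
    return default
```

Because each rule inspects only the context object, an application could extend the context with its own fields and add rules over them without changing the evaluation loop.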
Applications combined with the object-based context-sensitive encoding scheme of the present invention may also include, but are not limited to: handwritten logins, security authentication models, text services, text codec serialization services, etc.
In addition, unlike the above-mentioned coded codec service, the text codec serialization service converts objects in an application system and codes into each other. The serialization services of text codecs are based on the codec services of the coding warehouse. The serialization service of the text codec is actually the content encoding service of the data object. In addition, the main difference between the text codec and the coding warehouse codec is that the corresponding models of the codec data are different: the literal codec corresponds to the application model and the code repository codec corresponds to the storage model. Of course, in some cases, both models are identical.
For text input/output services: as mentioned, the new data processing system has two main coding capabilities, one for coding personalized text and the other for recoding traditional text data. The text input/output service is aimed mainly at the former. Input and output for the latter are handled mainly through the "general display and editing service" described later.
Common personalized text consists mainly of handwritten characters and phonetic characters. Of course, it can also be any other form of text that can be stored and transmitted by a computer system, such as sign language, gestures, semaphore, or lip language.
The personalized text is presented here primarily through a description of the handwritten text, as opposed to traditional computer text.
Personalized handwriting can take various forms. By input method, it can be graphic/stroke information entered directly into a computer system, called on-line handwriting, or a scanned image of writing on paper, conventionally called off-line handwriting. By stroke detail, there is hard-pen handwriting, soft-pen handwriting, and so on.
The most essential difference between personalized handwriting and existing handwriting input is that personalized text uses personalized codes, which differ from person to person, and does not need to be recognized into standard codes. The input and output of personalized text is therefore mainly a natural writing process, in which the computer adapts to the individual's writing habits as far as possible and preserves the written result to the greatest extent. This contrasts with conventional keyboard entry, in which the person adapts to the computer.
The output of the personalized handwriting is mainly the display output of a computer screen, and of course, the output of later printing and the like. The input is mainly direct writing of a finger or pen type device on a touch screen of a computer. There are two natural writing constraints to ensure that we input text, not graphics:
1. Overall layout constraint based on rows or columns. That is, when a user performs an input, a target row (or column, hereinafter collectively referred to as a row) must be activated in some manner before input can be made in that row. In this way, the text input system can determine the overall sequence of the text very effectively.
2. Space-based inline layout constraint. Within a row, the text input system must be able to identify the most basic text units to ensure efficient text storage, encoding, and reuse. In data processing systems for phonographic scripts, the distance between words tends to be significantly greater than the spacing between letters or components within a word. We can therefore use the word as the most basic text unit of the corresponding data processing system, and divide the words within a line by analyzing the spacing. The length of each space is also encoded to ensure correct playback of the text content. As a result, the output is identical to the input even if the spacing analysis is not entirely correct (mainly because this process differs from human recognition in that it lacks letter recognition and semantic analysis). To allow for errors in the spacing analysis, the text input system may also provide tools for correcting its results. In data processing systems for ideographic scripts, individual characters are of comparable size, and the spacing between characters and between words is similar and relatively small. In this case, the text input system may add an auxiliary grid to help it segment the characters. For example, for Chinese characters, we can display grid-shaped auxiliary lines during input to help the user write each character in its own cell, and character segmentation can then be based on the grid. We refer to this as a grid layout constraint. In fact, typesetting rules vary considerably between cultures and often from language to language. In the new system, different input/output systems may be provided for different languages and cultures.
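The spacing analysis in the second constraint can be sketched as a simple gap threshold over the horizontal extents of the strokes in one row. The stroke representation (a pair of left and right x coordinates), the fixed threshold, and the function name are assumptions of this sketch; a practical system would likely derive the threshold from the writer's own spacing.

```python
def segment_row(stroke_extents, gap_threshold):
    """Group strokes into words and record the length of each gap between words.

    stroke_extents: iterable of (left, right) x coordinates, in any order.
    Returns (words, gaps); gaps[i] is the space between words[i] and words[i + 1].
    """
    words, gaps = [], []
    current, right_edge = [], None
    for left, right in sorted(stroke_extents):
        if current and left - right_edge > gap_threshold:
            # The gap is wide enough to separate two words; keep its length
            # so that the row can be played back exactly as written.
            words.append(current)
            gaps.append(left - right_edge)
            current = []
        right_edge = right if not current else max(right_edge, right)
        current.append((left, right))
    if current:
        words.append(current)
    return words, gaps


words, gaps = segment_row([(0, 4), (5, 9), (20, 24), (3, 6), (25, 30)], gap_threshold=5)
```

Since the gap lengths are stored alongside the words, a wrong split changes how the content is divided into codes but not how the row looks when it is output again.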
For the general display and editing of hybrid codes, one of the main benefits of standard-based coded data processing systems is its readability, which is the ability to understand the corresponding text content. This readability is based on the coding standard being universally supported by various software and hardware systems. The most widely supported coding standard is ASCII coding.
The new data processing system can be fully compatible with existing coding standards, for example through the support for UTF coding by OTF coding mentioned previously. In addition to display support for UTF standard text, we can also provide a generic text display and editing service for directly displaying and editing open-coded text. The display and editing referred to here is neither full text display and editing nor binary display and editing, but a generic service in between. The service has the following features:
1. the UTF standard characters can be displayed and edited correctly;
2. for non-UTF codes, code type IDs (including space type IDs) and numbers corresponding to codes can be displayed and edited;
3. for some common public open codes, such as XML, JSON, HTML, SVG, the original text content is directly displayed and edited.
The universal text display and editing service supports the traditional text input and output devices: monochrome text terminals (which can use reverse video to distinguish a code from the display of its content) and keyboards (which can distinguish the code editing state from the content editing state). It is mainly intended to let developers and system maintainers view and modify text data in the traditional manner.
The universal display and editing service of text is an important guarantee that the new system maintains human readability.
For the matching (service) of code repository content: taking personalized handwriting content as an example, matching code repository content means shape matching.
At present, matching technology for graphics and images is mature, and various algorithms can be used for glyph matching, including methods based on stroke curve fitting, methods based on contour lines, matching methods based on feature analysis, and methods based on machine learning. They are not described in detail here. In addition, because the invention can record the time and position of each stroke, input content can also be matched using the input time and position information of the strokes.
For code repository content normalization, code repository content normalization is based on code repository content matching to ensure that the same or similar content corresponds to only one code. Taking personalized handwriting content as an example, the optimal normalization result is that handwriting of the same content by the same user always corresponds to the same code of the code warehouse.
Normalization of code warehouse content can be performed automatically according to a set threshold, or interactively with the user. For example, with personalized handwriting, when the user's written content is submitted to the code repository, the repository retrieves all similarly shaped glyphs and lets the user confirm whether to normalize and which glyph to normalize to.
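Both variants, automatic and interactive, can be sketched as below. Real shape matching is far more elaborate; here each glyph is reduced to a hypothetical numeric feature vector and similarity is plain Euclidean distance, purely to show where the threshold and the user confirmation enter. All names are illustrative.

```python
import math


class GlyphWarehouse:
    """Toy warehouse that normalizes glyphs by feature-vector distance."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.glyphs = []  # the list index serves as the code

    def candidates(self, features):
        """Return (distance, code) pairs within the threshold, closest first."""
        scored = [(math.dist(features, stored), code)
                  for code, stored in enumerate(self.glyphs)]
        return sorted(pair for pair in scored if pair[0] <= self.threshold)

    def normalize(self, features, confirm=None):
        """Return an existing code for a similar glyph, or store a new one.

        With confirm=None the closest candidate is accepted automatically;
        otherwise confirm(code) is asked for each candidate in turn.
        """
        for _distance, code in self.candidates(features):
            if confirm is None or confirm(code):
                return code
        self.glyphs.append(tuple(features))
        return len(self.glyphs) - 1


warehouse = GlyphWarehouse(threshold=0.5)
first = warehouse.normalize((0.0, 0.0))
again = warehouse.normalize((0.1, 0.1))            # close enough: same code
other = warehouse.normalize((3.0, 3.0))            # too far: new code
declined = warehouse.normalize((0.2, 0.0), confirm=lambda code: False)
```

In the interactive case the `confirm` callback stands in for the dialog in which the user is shown the similar glyphs.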
For searching and matching of object codes, a traditional character string pattern matching algorithm can be directly used for searching and matching of the object codes. However, there are two points to be noted:
1. A simple binary comparison cannot be used to determine whether codes in the source string and the target string are identical; instead, the coding space, coding type, and instance code of the source code and the target code must all be identical.
2. Spacing codes (i.e., spaces between characters) in the source and target strings can simply be ignored.
Thus, existing string matching algorithms, such as the classical KMP algorithm, can be used with the new data processing system with only minor modifications. Note that searching object codes does not require the corresponding text content, only the corresponding code metadata, mainly the coding type information and the coding space information.
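A KMP search adapted in the two ways noted above might look like the following sketch. The `ObjectCode` triple, the `SPACE_TYPE` marker used to recognize spacing codes, and the function names are assumptions; the matching itself is the standard KMP failure-table algorithm with the equality test swapped out.

```python
from typing import NamedTuple, Optional


class ObjectCode(NamedTuple):
    space: str                # coding space
    type_id: str              # coding type
    instance: Optional[int]   # instance code (for spacing codes: the length)


SPACE_TYPE = "space"  # hypothetical type id marking spacing codes


def same_code(a, b):
    """Codes match only if space, type and instance all agree."""
    return (a.space, a.type_id, a.instance) == (b.space, b.type_id, b.instance)


def find_codes(source, target):
    """Return the index in target where source first matches, or -1.

    Spacing codes are skipped on both sides; an empty source matches at 0.
    """
    pattern = [c for c in source if c.type_id != SPACE_TYPE]
    text = [(i, c) for i, c in enumerate(target) if c.type_id != SPACE_TYPE]
    if not pattern:
        return 0
    # Standard KMP failure table, built with same_code instead of ==.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and not same_code(pattern[i], pattern[k]):
            k = failure[k - 1]
        if same_code(pattern[i], pattern[k]):
            k += 1
        failure[i] = k
    k = 0
    for pos, (_index, code) in enumerate(text):
        while k and not same_code(code, pattern[k]):
            k = failure[k - 1]
        if same_code(code, pattern[k]):
            k += 1
        if k == len(pattern):
            return text[pos - k + 1][0]  # index in the original target
    return -1
```

Only code metadata is compared; the text content behind each code is never fetched from the warehouse.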
For retrieval of object codes, the retrieval of object codes can be based entirely on existing retrieval methods, similar to the search matching of object codes. It is also necessary to modify the existing methods for the above-mentioned features.
For input search of personalized words, in the new data processing system, all coded contents can be stored in the coding warehouse, so that search of user input contents can be optimized on the basis of the coding warehouse content normalization service. The search process is as follows:
1. The text content to be searched for (the source text) is entered through the text input system;
2. The code warehouse performs normalization matching on the source text;
3. If the source text contains a new (unmatched) code, search failure is returned immediately;
4. If the source text contains text codes that do not appear in the target text, search failure is returned immediately;
5. Otherwise, the code string corresponding to the source text is searched for in the target codes.
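The five steps can be sketched as follows. The `match` callback stands in for the normalization matching service and is assumed to return an existing code or `None`; the final step uses a naive scan for brevity, where any string matching algorithm could be substituted. Names and signatures are hypothetical.

```python
def search_input(source_units, target_codes, match):
    """Return the start index of the source text in target_codes, or -1.

    source_units: the handwritten units entered by the user (step 1).
    match: callable mapping a unit to an existing code, or None if unmatched.
    """
    source_codes = [match(unit) for unit in source_units]   # step 2
    if any(code is None for code in source_codes):          # step 3
        return -1
    if not set(source_codes) <= set(target_codes):          # step 4
        return -1
    width = len(source_codes)                               # step 5
    for start in range(len(target_codes) - width + 1):
        if target_codes[start:start + width] == source_codes:
            return start
    return -1


known = {"he": 1, "llo": 2, "wor": 3}   # toy stand-in for the warehouse
target = [3, 1, 2, 3]
```

Steps 3 and 4 are cheap early exits: a unit the warehouse has never seen, or a code absent from the target, cannot possibly be found.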
For personalized text recognition: recognition of personalized text is a subset of traditional text recognition, and the recognition results may be stored in the code repository. Note that the same code may have multiple recognition results; for example, what is written as the capital letter I may also correspond to the digit 1 or the lowercase letter l. This situation also arises in conventional text recognition. Only a small change to the traditional recognition process is needed: whole-sentence and whole-text recognition is performed using the per-character or per-word recognition information in the code warehouse.
For a multi-level output system, in an object coding warehouse, we do not have any limitation on the corresponding literal content of the coding. Thus, two situations may occur:
1. the corresponding text content of the code is vectorized/parameterized information, and different output can be realized according to different conditions/parameters;
2. the same code may correspond to multiple copies of literal content.
Either case would necessitate the use of some content selection mechanism in the decoding service of the encoding repository. For the first case, the encoding repository dynamically generates corresponding encoded content based on the information of the decoding request. In the second case, the code repository will select the most appropriate literal content based on the system settings and the decode request.
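A content selection step covering both cases could be sketched as below. How parameterized content and alternative copies are actually stored is not specified; this sketch assumes that parameterized content is a callable taking the decode request, that alternative copies are a list of dictionaries tagged with an output medium, and that the first copy is the system default.

```python
def select_content(entry, request):
    """Pick the literal content for one code according to the decode request."""
    if callable(entry):
        # Case 1: parameterized content is generated from the request.
        return entry(request)
    # Case 2: several stored copies; choose the one for the requested medium.
    wanted = request.get("medium")
    for variant in entry:
        if variant["medium"] == wanted:
            return variant["content"]
    return entry[0]["content"]  # assumed system default: the first copy


def scaled_outline(request):
    """Example of vectorized content: an outline scaled to the requested size."""
    size = request.get("size", 1)
    return [(x * size, y * size) for x, y in [(0, 0), (1, 2)]]


copies = [
    {"medium": "screen", "content": "glyph@96dpi"},
    {"medium": "print", "content": "glyph@600dpi"},
]
```

The same code thus yields different output on a screen and on a printer without the application storing more than one code.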
For visual touch control editing of personalized characters, under a new data processing system, visual mixed editing typesetting of the personalized characters and the traditional characters is possible. Traditional visual text editing is designed by using a keyboard as main editing equipment. There are two core concepts:
1. The input focus, i.e., the position at which the current text is inserted or overwritten. For a text stream it is a one-dimensional position coordinate, but in the visual edit area it corresponds to a two-dimensional coordinate (row and column). Its position is typically shown by a blinking cursor and is changed with the direction keys; a system that supports a pointing device (e.g., a mouse) can also use that device to position the focus directly.
2. The selected text, i.e., the text to be operated on. For a text stream it is a pair of one-dimensional position coordinates. Generally, the input focus and selected text do not exist at the same time; the input focus can be understood as a selection of length zero. Selected text is typically shown in reverse video or highlighted. With the keyboard, the start and end of a selection are defined mainly by combinations of direction keys and specific function keys. Pointing devices, such as mice, select text mainly by "press and hold, drag, release".
Conventional what-you-see-is-what-you-get visual text editing works by applying commands to the selected text. Such a user interface, however, is not natural on the increasingly popular touch devices, and the conventional visual editing method is not well suited to handwriting input. Touch devices, by contrast, are very natural input devices for handwritten text. Therefore, building on existing visual text editing, visual text editing on touch devices can be improved by introducing input modes, so that different input methods can be switched between, and by expanding the input focus into a region in the touch input mode. The input modes and input areas contemplated by the present invention are as follows.
1. Input mode. In addition to the original keyboard input mode, we also allow a handwriting (touch) input mode. Input always takes place in one of these two modes, and the user can switch freely between them. In keyboard input mode, the user types text directly with a keyboard (virtual keyboard or numeric keyboard) and uses a conventional visual editing interface. In handwriting input mode, the user writes in a specific area with a touch device (stylus or finger) and uses a touch-friendly visual editing interface.
2. Input area (i.e., input panel). The input area is effective only in the handwriting input mode, where it corresponds to the input focus of the keyboard input mode. Unlike the input focus in a conventional editing system, the input area corresponds not to a one-dimensional position coordinate but to a two-dimensional area of the edit display. In the handwriting input mode, the user can write text directly in the input area. The written text is presented immediately in a what-you-see-is-what-you-get manner and takes part in typesetting and editing. The input area carries the row information of the current text layout, so that text written in the area maps directly to a position in the laid-out text. Absent any other restriction, the most direct and natural input area is the display area of the current row or column. The user can change the current input area by tapping outside it; the position of the input area can also be changed directly by a move command.
For typesetting, different languages, cultures, and scripts have different typesetting rules. For example, Arabic text is set horizontally, running right to left with lines from top to bottom, while traditional Chinese text is set vertically, running top to bottom with columns from right to left. Personalized text must also follow the corresponding typesetting rules.
However, regardless of the typesetting rule, line wrapping within a paragraph is based on accumulating character lengths. Like standardized text, personalized text based on open codes carries length information; unlike standardized text codes, however, it has no special fixed-length space character and instead uses space characters of different lengths (the space length being a coding parameter).
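Length-accumulating line wrapping with variable-length spaces can be sketched with a simple greedy loop. The item representation (a kind tag plus a length), the rule that a line never starts with a space, and the choice to let trailing spaces overhang are assumptions of this sketch; it is offered only as an illustration of the accumulation principle, not as one of the layout algorithms of the invention.

```python
def wrap(items, line_length):
    """Greedily break a paragraph into lines by accumulated length.

    items: sequence of ("word", length) or ("space", length) tuples.
    Returns a list of lines, each a list of the same tuples.
    """
    lines, current, used = [], [], 0
    for kind, length in items:
        if kind == "space" and not current:
            continue  # a line does not start with a space
        if kind == "word" and current and used + length > line_length:
            lines.append(current)
            current, used = [], 0
        current.append((kind, length))
        used += length
    if current:
        lines.append(current)
    return lines


paragraph = [("word", 4), ("space", 2), ("word", 5), ("space", 1), ("word", 3)]
lines = wrap(paragraph, line_length=10)
```

For a vertical script the same loop applies with "line" read as "column" and lengths measured along the column.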
In addition, punctuation marks often participate in typesetting. However, in handwritten text punctuation marks do not necessarily need to be recognized. Thus, personalized punctuation marks are often treated as common characters, synthesized with other characters.
Two typical layout algorithms are given below; algorithms for other layout rules can be derived from them by modification.
For input, in the handwriting input mode, handwriting can be entered directly in the input area. The input result does not need to be recognized; it is converted directly into personalized text based on open codes. In this process, the characters and the spacing between them need to be identified. Typesetting rules also constrain this identification process.
Regarding deployment schemes for the object coding system: an open-code-based computer data processing system separates the object codes from the content of the data objects. As in conventional data processing systems, literal codes can reside in different stores: memory, documents, databases, networks, or clouds. The storage scheme adopted for the literal codes is therefore determined entirely by the requirements and architecture of the application system and is independent of the storage scheme of the corresponding code repository. What is discussed here is not the storage scheme of the literal codes but the deployment scheme of the corresponding literal code repository. Moreover, storing the literal codes and the code repository in different storage systems can effectively improve the security of the system: as mentioned above, an attacker can then obtain the text information only if both systems are compromised at the same time.
In addition, the system architecture of the application system is independent of the code repository deployment scheme, regardless of whether the application is stand-alone or networked, single-user or multi-user, browser-based or rich-client-based, and so on. Of course, in the new data processing system, the same application system will offer different security levels and performance characteristics under different code repository deployment schemes.
FIG. 47 is a schematic diagram of an in-application deployment architecture. As shown in FIG. 47, in-application deployment means that each application system has its own dedicated code repository. In this deployment scheme, the text content of an application can be recognized and displayed only by that application; to other application systems it is uninterpretable "garbled code".
The security level of text content in this deployment scheme is high: at the least, different applications are isolated from one another. It is suitable for personal applications with higher security requirements. A "personal diary" is a typical example, in which diary content can be opened only by an authorized application. The disadvantage of in-application deployment is the flip side of its security: data is difficult to share.
FIG. 48 is a schematic diagram of a terminal deployment architecture. As shown in FIG. 48, unlike in-application deployment, a terminal-deployed code repository is shared as a system service of the terminal system and can be used by multiple applications at the same time. This deployment scheme also offers high security, because text content that leaves the terminal cannot be used.
FIG. 49 is a schematic diagram of a mobile external device deployment architecture. Terminal deployment of the code repository is well suited to personal applications with low sharing requirements. However, with the popularity of mobile terminals and tablet devices, more and more individuals own multiple computing devices, so personal information frequently needs to be shared among several devices. As shown in FIG. 49, this need is met directly by deploying the code repository on an accessible mobile device. The mobile device may be a smart mobile terminal running the code repository service, a mobile storage device storing the code repository, or a dedicated code repository device.
Regarding network deployment: written language is used mainly to communicate with others, so the primary deployment of the code repository is likewise network deployment. For an Internet-wide network, this is cloud deployment. As shown in FIG. 50, all applications share the same code repository, so everyone using the applications can use and exchange text information under the access control of that repository.
For a local area network or enterprise intranet, the network deployment of the code repository is a private cloud deployment or an internal server deployment, as shown in FIG. 51. The code repository is thus isolated from the outside by a firewall, and the corresponding code content can be used only inside the organization.
FIG. 52 is a schematic architecture diagram of a point-to-point deployment, which is a special case of network deployment. As shown in FIG. 52, the code repository is shared temporarily or permanently with other users on the basis of an in-application or terminal deployment. A typical application is personal instant messaging: during a conversation, the two parties share their code repositories with each other, so both can communicate normally. If one party stops sharing its code repository at the end of the conversation, the other party can no longer read that party's side of the conversation record. In real life, such a security effect is sometimes needed.
The code repository deployment scheme used by an application is neither fixed nor exclusive: an application system can combine different schemes at the same time. FIG. 53 is a schematic diagram of a hybrid deployment architecture. As shown in FIG. 53, the same application may use three different code repositories. The application can thus be used in three different environments simply by switching to the corresponding code repository.
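The effect of switching repositories in a hybrid deployment can be sketched as follows. This is only an illustration under stated assumptions: the classes `CodeRepository` and `Application`, the environment names, and the convention that a code records the name of the repository that issued it are all hypothetical and are not specified by the embodiment.

```python
class CodeRepository:
    """Toy in-memory code repository (hypothetical interface)."""

    def __init__(self, name):
        self.name = name
        self._contents = {}

    def put(self, content):
        number = len(self._contents) + 1   # next free object number
        self._contents[number] = content
        return number

    def get(self, number):
        return self._contents[number]


class Application:
    """One application that can work against several code repositories."""

    def __init__(self, repositories):
        self._repositories = dict(repositories)   # environment -> repository
        self._active = None

    def switch(self, environment):
        self._active = self._repositories[environment]

    def write(self, text):
        # The stored code names its repository; the text itself stays there.
        return (self._active.name, self._active.put(text))

    def read(self, code):
        repository_name, number = code
        if repository_name != self._active.name:
            return None   # the code cannot be interpreted in this environment
        return self._active.get(number)
```

In this sketch, text written while one repository is active becomes unreadable after switching to another, which mirrors the isolation between environments described above.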
Building on the above description, this embodiment gives concrete examples from practical application of how a conventional information system can be enhanced and modified to support a text system based on object coding.
As shown in FIG. 54, text in a conventional information system is generally input and output directly through the text services provided by the operating system. Since object codes in the new data processing system can be fully compatible with legacy literal codes, support for the new data processing system can be added by modifying the text services of the operating system. The conventional information system can then directly support the input and output of nonstandard characters (such as personalized handwritten characters) without itself being modified.
Regarding the modification of back-end storage based on object coding: in existing software application systems, the loading and storing of persistent data objects is performed by a data access module/component. When storing, the data access component stores the data corresponding to the application object directly in the application store; when loading, the data access component obtains the corresponding data by accessing the application store, then loads and instantiates the data as an application object.
The object coding system of the present invention may be implemented as follows, but the specific implementation method is not limited thereto. For example, the code repository may be provided on the user side, on a third party server, or anywhere in cloud storage, etc.
Referring to FIG. 55, the object coding system systematically numbers the data to be loaded and stored, thereby obtaining the corresponding object codes. The application store therefore mainly stores object codes and object code sequences; the actual application data must be obtained through the object coding system using these codes. An indirection layer of "encoding" is thus introduced between the application system and the application data. This naturally adds runtime and even storage overhead, but it also brings advantages such as security, flexibility, and efficiency, which is very beneficial in certain applications.
As shown in FIG. 55, an application based on the object coding system of the present invention stores the codes/code sequences it uses in the application store. When storing, the data access component converts the data corresponding to the application object into content to be encoded according to the specific application logic; the object coding system converts the data object into the corresponding code, returns the code to the data access component, and stores the content of the data object within the object coding system; the data access component then stores the resulting code/code sequence in the application store. When loading, the data access component obtains the required codes from the application store and restores them into data objects through the object coding system; finally, the data access component converts the data objects into application objects.
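The storing and loading flow of FIG. 55 can be sketched as follows, as a minimal illustration rather than the implementation of the invention. The class names, the `obj:N` code format, the use of JSON as the serialized form, and the example `Note` application object are all assumptions introduced for the sketch.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Note:
    """Example application object (hypothetical)."""
    title: str
    body: str


class ObjectCodingSystem:
    """Keeps the data content and hands out codes for it."""

    def __init__(self):
        self._contents = {}
        self._next_number = 1

    def encode(self, data_object):
        code = f"obj:{self._next_number}"       # systematic numbering
        self._next_number += 1
        self._contents[code] = json.dumps(data_object, sort_keys=True)
        return code

    def decode(self, code):
        return json.loads(self._contents[code])


class DataAccessComponent:
    """Stores only codes in the application store."""

    def __init__(self, coding_system, application_store):
        self._coding = coding_system
        self._store = application_store

    def save(self, key, application_object):
        data_object = asdict(application_object)   # application object -> data object
        self._store[key] = self._coding.encode(data_object)

    def load(self, key, cls):
        data_object = self._coding.decode(self._store[key])
        return cls(**data_object)                  # data object -> application object
```

After a save, the application store holds only the code for each key, so reading the store without access to the object coding system reveals no content.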
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (10)

1. A coding processing method, comprising:
acquiring a data object to be coded and metadata thereof according to the received coding processing request; the metadata includes: attribute information; and/or, the metadata includes: data content identification and keywords; wherein the attribute information is used for describing the property of the data object; the data content identifier is used for indicating that the extraction position of the metadata is from a data content part, and the keyword is used for indicating the extracted data content;
and acquiring the object code of the data object according to a code repository, the data object, and the metadata thereof.
2. The method of claim 1, wherein the obtaining the object code of the data object based on the code repository and the data object and metadata thereof comprises:
selecting or creating a coding protocol according to the code repository and at least one part of the metadata, and generating a meta code corresponding to the metadata according to the coding protocol;
encoding the data content of the data object according to the coding protocol to obtain an instance code, and obtaining an object code corresponding to the data object according to the meta code and the instance code;
the object code is either a reference code form or a content code form.
3. The method of claim 2, wherein encoding the data content of the data object according to the coding protocol to obtain an instance code comprises:
carrying out serialization processing on the data content of the data object according to the coding protocol to obtain a serialization result; wherein the instance code is the serialization result;
or,
carrying out serialization processing on the data content of the data object according to the coding protocol to obtain a serialization result, and storing the serialization result in the code repository to obtain an object number in the code repository; wherein the instance code is the object number.
4. A method according to any one of claims 1 to 3, further comprising:
and setting access rights for the data in the code repository.
5. A method according to claim 2 or 3, wherein said encoding the data content of said data object according to said coding protocol to obtain an instance code comprises:
acquiring a context object;
acquiring a corresponding coding space according to the context object and the coding protocol;
and coding the data content in the data object in the coding space to obtain an instance code.
6. The method of claim 2, wherein the meta code comprises a combination and/or nesting of one or more of the following: a type code, a space code, and a context code.
7. A method according to any of claims 1-3, characterized in that it further comprises, prior to the encoding process, a data splitting process method comprising the steps of:
acquiring, according to a preset metadata stripping protocol, metadata in a data object corresponding to a data identifier to be stored, and stripping the acquired metadata from the data object to obtain stripped data content;
dividing the stripped data content into at least two data fragments according to a preset data content splitting protocol, wherein the data fragments are the data objects to be encoded in the encoding processing method.
8. A decoding processing method, comprising:
receiving a decoding processing request, and acquiring an object code to be decoded according to the decoding processing request;
disassembling the object code to obtain a meta code, or a meta code and an instance code;
querying a code repository, and acquiring corresponding metadata and a coding protocol according to the meta code; the metadata includes: attribute information; and/or, the metadata includes: data content identification and keywords; wherein the attribute information is used for describing the property of the data object; the data content identifier is used for indicating that the extraction position of the metadata is from a data content part, and the keyword is used for indicating the extracted data content;
and acquiring the data object corresponding to the object code according to the metadata and the coding protocol, or according to the metadata, the coding protocol, and the instance code.
9. The method of claim 8, wherein the acquiring the data object corresponding to the object code according to the metadata and the coding protocol, or according to the metadata, the coding protocol, and the instance code, comprises:
acquiring a context object;
acquiring a corresponding coding space according to the context object and the coding protocol;
decoding the instance code from the coding space to obtain corresponding data content;
and acquiring a data object corresponding to the object code according to the metadata and the data content.
10. The method according to claim 8 or 9, further comprising a data combining processing method after the decoding processing, the data combining processing method comprising:
obtaining each split data fragment, a splitting/stripping protocol or a preset merging protocol;
obtaining split metadata of the data object according to the obtained data fragment and/or the split/stripping protocol or the preset merging protocol;
and combining the data fragments based on the data splitting/stripping protocol or the preset merging protocol and the split metadata to obtain the data object.
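For orientation only, the encoding, decoding, splitting, and merging steps recited in the claims above can be sketched as follows. The sketch is not part of the claims and does not limit them. It assumes a fixed coding protocol (UTF-8 bytes written as hexadecimal), an object code of the hypothetical form `meta:form:instance`, metadata supplied separately from the content, and fragment order carried by list position; every name in it is illustrative.

```python
import json


class CodeRepository:
    """Toy repository holding metadata and referenced contents (hypothetical)."""

    def __init__(self):
        self.metadata = {}   # meta code -> serialized metadata
        self.objects = {}    # object number -> serialized content

    def meta_code_for(self, metadata):
        key = json.dumps(metadata, sort_keys=True)
        for meta_code, known in self.metadata.items():
            if known == key:
                return meta_code             # reuse an existing entry
        meta_code = f"M{len(self.metadata) + 1}"
        self.metadata[meta_code] = key       # create a new entry
        return meta_code


def encode(repo, content, metadata, by_reference):
    """Return an object code in reference form or in content form."""
    meta_code = repo.meta_code_for(metadata)
    serialized = content.encode("utf-8").hex()
    if by_reference:
        number = len(repo.objects) + 1
        repo.objects[number] = serialized
        return f"{meta_code}:R:{number}"     # instance code = object number
    return f"{meta_code}:C:{serialized}"     # instance code = serialization result


def decode(repo, object_code):
    """Disassemble an object code and restore content and metadata."""
    meta_code, form, instance_code = object_code.split(":", 2)
    metadata = json.loads(repo.metadata[meta_code])
    if form == "R":
        serialized = repo.objects[int(instance_code)]
    else:
        serialized = instance_code
    return bytes.fromhex(serialized).decode("utf-8"), metadata


def split(content, size):
    """Divide stripped data content into fixed-size fragments."""
    return [content[i:i + size] for i in range(0, len(content), size)]


def merge(fragments):
    """Recombine fragments in their original order."""
    return "".join(fragments)
```

In the content form the object code can be decoded with the metadata entry alone, whereas the reference form additionally requires the stored content in the repository, which is where access rights on repository data would take effect.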
CN202310088220.3A 2014-08-11 2015-08-11 Processing, data splitting and merging and coding and decoding processing method for handwriting input characters Pending CN116185209A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201410392557 2014-08-11
CN2014103925574 2014-08-11
CN201580042761.6A CN106575166B (en) 2014-08-11 2015-08-11 Method for processing hand input character, splitting and merging data and processing encoding and decoding
PCT/CN2015/086672 WO2016023471A1 (en) 2014-08-11 2015-08-11 Methods for processing handwritten inputted characters, splitting and merging data and encoding and decoding processing

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201580042761.6A Division CN106575166B (en) 2014-08-11 2015-08-11 Method for processing hand input character, splitting and merging data and processing encoding and decoding

Publications (1)

Publication Number Publication Date
CN116185209A true CN116185209A (en) 2023-05-30

Family

ID=55303878

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310088220.3A Pending CN116185209A (en) 2014-08-11 2015-08-11 Processing, data splitting and merging and coding and decoding processing method for handwriting input characters
CN201580042761.6A Active CN106575166B (en) 2014-08-11 2015-08-11 Method for processing hand input character, splitting and merging data and processing encoding and decoding

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201580042761.6A Active CN106575166B (en) 2014-08-11 2015-08-11 Method for processing hand input character, splitting and merging data and processing encoding and decoding

Country Status (2)

Country Link
CN (2) CN116185209A (en)
WO (1) WO2016023471A1 (en)

Also Published As

Publication number Publication date
WO2016023471A1 (en) 2016-02-18
CN106575166A (en) 2017-04-19
CN106575166B (en) 2022-11-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination