WO2016023471A1

WO2016023471A1 - Methods for processing handwritten inputted characters, splitting and merging data and encoding and decoding processing

Info

Publication number: WO2016023471A1
Application number: PCT/CN2015/086672
Authority: WO
Inventors: 张锐
Original assignee: 张锐
Priority date: 2014-08-11
Filing date: 2015-08-11
Publication date: 2016-02-18
Also published as: CN106575166A; CN116185209A; CN106575166B

Abstract

Provided are methods for processing handwritten input characters, for splitting and merging data, and for encoding and decoding. An object-based open encoding and decoding solution can encode and decode any data object using any freely and openly defined encoding scheme. In an object-based data splitting/merging method, the metadata and/or encoded data of a data object are split or stripped from the corresponding data content so as to ensure the security of the data content. The methods can be implemented individually or in combination, and can also be combined, alone or together, with applications in other technical fields.

Description

Handwritten input character processing, data splitting and merging, and codec processing method

Technical field

The present invention relates to data processing technologies, and in particular, to a method for processing handwritten input characters, data splitting and merging, and encoding and decoding.

Background art

At present, with the development of computers, more and more types of encoding technologies have emerged. Computer-based encoding technology is widely used in data transmission, storage, and processing.

Among these, text encoding is the most basic: it lets humans input, view, edit, and modify data, and lets computers analyze and process it. From the early ASCII standard to today's Unicode, standardized text encoding has been a basis for transferring information between people and machines and among different systems. However, as a tool for recording human output, existing standardized text encoding is far from sufficient. With the spread of computers and the development of human-computer interaction technology, standard text encoding and its corresponding text input methods have gradually become a bottleneck in bringing natural human output into the digital world.

Based on standard text encoding, a series of general-purpose and specialized encoding methods have been developed that express structured data and documents, or data of specialized domains, as characters and character sequences by means such as markup, control characters, and escaping. We call these text encodings, and the corresponding data formats text formats. Common examples include XML/SGML, which describe complex structures as tagged trees; JSON, which describes complex objects using JavaScript syntax; dedicated markup formats such as HTML for describing pages, MathML for mathematical expressions, and SVG for vector graphics; CSV for tabular data; and RTF, Markdown, and similar formats for formatted documents. Programming languages also mainly use text formats. Encoding based on standard text allows humans to take part in creating, viewing, debugging, and modifying data, which facilitates integration and exchange between different systems, speeds up system development, and reduces the cost of troubleshooting. On the other hand, text formats are redundant when expressing symbolized data and binary data. As the structures a system must express grow more complex, the markup and syntax of a text-based encoding grow much more complex as well, and data redundancy increases. In addition, because a given text encoding standard has a limited number of code points, conflicts between data content and syntax markers are inevitable, and text escaping adds further redundancy.
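The redundancy described above can be illustrated with a small sketch. It assumes Base64 as the text representation of binary content and JSON string escaping as the escaping mechanism; both choices are illustrative assumptions and are not named in the original text.

```python
import base64
import json

# 300 bytes of arbitrary binary data (hypothetical payload).
payload = bytes(range(256)) + bytes(44)

# Base64 maps every 3 input bytes to 4 text characters.
text_form = base64.b64encode(payload).decode("ascii")

# Escaping: a quote inside a JSON string clashes with the string
# delimiter, so it must be escaped, which lengthens the encoded form.
raw = 'a"b'
escaped = json.dumps(raw)

print(len(payload), len(text_form))  # binary size vs. text size
print(len(raw), len(escaped))        # raw size vs. escaped size
```

In this sketch the text form of the binary payload is one third larger than the payload itself, and the escaped string is longer than the raw string even though both carry the same information.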

The world inside a computer is a world of numbers, and binary data is its natural form of representation. Human-defined text-format data is also converted into binary data to reduce redundancy and improve processing and transmission efficiency. There are also general-purpose binary encoding methods, such as ASN.1, standardized by the International Organization for Standardization and the International Telecommunication Union, Google's Protocol Buffers, Apache Thrift and Avro, as well as BSON and MessagePack. However, in contrast to text-based encoding, binary data has the disadvantages of being relatively closed, harder to exchange, and harder for humans to work with.

Whether textual or binary, encoding serves two purposes. One is to describe the data object itself, also known as serialization; this is referred to herein as the content encoding of the data object. The encoding standards and methods mentioned above are mainly used for content encoding.

The other purpose of encoding is to describe the address of, or a reference to, a data object; this is referred to herein as the reference encoding of a data object. Text-based reference encodings include URNs, URLs, and object identifiers (OIDs) in ASN.1; binary-based reference encodings include database keys, UUIDs/GUIDs, IP addresses, MAC addresses, and MD5 and SHA-1 digests. There are even graphics-based encodings such as one-dimensional barcodes and two-dimensional codes (which are in fact converted into text or binary encodings through recognition).

Existing reference encodings have two main problems. The first is that they hinder integration and exchange: different fields use different encoding standards, and given today's development of the Internet and the Internet of Things, this situation works against a unified way of referencing objects across fields. The second is encoding efficiency: as the world becomes more interconnected, massive numbers of digital objects are always online. Encodings such as UUID (16 bytes) and SHA-1 (20 bytes) are in theory sufficient to provide a uniform reference encoding, but transmitting, processing, and storing such a massive volume of reference encodings itself consumes a large amount of resources, causing unnecessary waste.
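The size figures quoted above can be checked with Python's standard library, and a rough storage estimate follows from them. The object count of one billion is a hypothetical figure chosen only for illustration.

```python
import hashlib
import uuid

uuid_size = len(uuid.uuid4().bytes)        # size of one UUID in bytes
sha1_size = hashlib.sha1(b"").digest_size  # size of one SHA-1 digest in bytes


def reference_bytes(num_objects: int, ref_size: int) -> int:
    """Bytes needed to hold one reference per object, payloads excluded."""
    return num_objects * ref_size


one_billion = 10**9  # hypothetical number of online objects
print(uuid_size, sha1_size)
print(reference_bytes(one_billion, uuid_size))
print(reference_bytes(one_billion, sha1_size))
```

Under this assumption, the references alone occupy 16 GB (UUID) or 20 GB (SHA-1) before any object content is stored, and every transmission of a reference carries the same fixed cost.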

Summary of the invention

A first aspect of the present invention provides a method for processing handwritten input characters, including:

Acquiring, in a currently activated first target row/column, strokes input by a user and corresponding input information, wherein the input information includes the input position of each stroke in the first target row/column;

For each stroke, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with the characters specified in the first target row/column.

The technical effect of the first aspect of the present invention is to provide a method for processing handwritten input characters in which characters are split as they are written: the user does not need to issue explicit or implicit commands such as "start single character input" or "end single character input" to distinguish different characters. It is therefore unnecessary to pause for a period of time or to interact with the system during writing, so the writing process is smooth and efficient. Moreover, the method determines the character to which a stroke belongs directly from the stroke's input position and does not require recognition as a standard character, so the personalized information, writing style, and characteristics of the user's handwriting are retained.

A second aspect of the present invention provides a data splitting method, including:

When a storage request carrying an identifier of data to be stored is received, obtaining, according to a preset metadata stripping protocol, the metadata in the data object corresponding to the identifier, and stripping the obtained metadata from the data object to obtain the data content;

The data content is divided into at least two data segments according to a preset data content splitting specification.

The technical effect of the second aspect of the present invention is to provide a data splitting method that separates the metadata in the user's original data from the data content and divides the data content into a plurality of data segments, thereby making it harder to acquire the user's original data illegally and making data storage more secure and reliable.
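The two steps of the second aspect can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary layout of the data object, the `metadata_keys` set standing in for the metadata stripping protocol, and the fixed `segment_size` standing in for the content splitting specification are all assumptions. The content is assumed to be longer than `segment_size`, so that at least two segments result.

```python
from dataclasses import dataclass


@dataclass
class SplitResult:
    metadata: dict   # metadata stripped from the data object
    segments: list   # data content cut into byte segments


def split_data_object(data_object: dict, metadata_keys: set,
                      segment_size: int) -> SplitResult:
    """Strip metadata, then cut the remaining content into segments."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    # Step 1: strip the fields named by the (hypothetical) stripping protocol.
    metadata = {k: v for k, v in data_object.items() if k in metadata_keys}
    # Step 2: split the content according to the (hypothetical) fixed size.
    content = data_object["content"]
    segments = [content[i:i + segment_size]
                for i in range(0, len(content), segment_size)]
    return SplitResult(metadata, segments)


obj = {"name": "note.txt", "owner": "alice", "content": b"0123456789"}
result = split_data_object(obj, {"name", "owner"}, segment_size=4)
print(result.metadata)
print(result.segments)
```

Storing `result.metadata` and each entry of `result.segments` in different places means that no single store holds enough to reconstruct the original object.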

A third aspect of the present invention provides a data merging method comprising:

Receiving a data object acquisition request carrying the identification information; wherein the identification information includes positioning information, and the positioning information is used to locate a storage address of part of the data information in the data object;

Acquiring the storage content corresponding to the positioning information, and acquiring data information in other storage content according to positioning information contained in the acquired storage content, until all data information of the data object is obtained;

Merging the acquired data information according to a preset merge rule contained in the acquired data information, to obtain the data object.

The technical effect of the third aspect of the present invention is to provide a data merging method in which the split data information stored in each storage body is obtained by stepwise positioning according to the positioning information included in the identification information of the data object acquisition request, and the pieces of data information are then merged according to a preset merge rule to obtain the user's original data object. This ensures that data dispersed across storage bodies can be acquired efficiently and securely, and that the user can reliably merge the scattered data back into the original data.
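The stepwise positioning described for the third aspect can be sketched as a chain walk. The storage layout below (a dictionary keyed by address, each entry holding a data piece and the address of the next piece) and the merge rule (concatenation in chain order) are assumptions made for illustration; the addresses are hypothetical.

```python
def merge_data_object(storage: dict, first_address: str) -> bytes:
    """Follow positioning information from piece to piece, then merge."""
    pieces = []
    visited = set()
    address = first_address
    while address is not None:
        if address in visited:
            raise ValueError("cyclic positioning information")
        visited.add(address)
        entry = storage[address]
        pieces.append(entry["data"])
        # Each stored piece carries the location of the next piece.
        address = entry.get("next")
    # Assumed merge rule: concatenate the pieces in the order visited.
    return b"".join(pieces)


storage = {
    "store-a/1": {"data": b"0123", "next": "store-b/7"},
    "store-b/7": {"data": b"4567", "next": "store-c/3"},
    "store-c/3": {"data": b"89", "next": None},
}
restored = merge_data_object(storage, "store-a/1")
print(restored)
```

Only the first address needs to appear in the acquisition request; every later location is discovered from the content retrieved in the previous step.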

A fourth aspect of the present invention provides a coding processing method, including:

Acquiring the data object to be encoded and its metadata according to the received encoding processing request;

Obtaining an object encoding of the data object according to an encoding repository, the data object, and its metadata.

The technical effect of the fourth aspect of the present invention is as follows: the data object to be encoded and its metadata are acquired according to the received encoding processing request, and an object encoding of the data object is obtained according to the encoding repository, the data object, and its metadata. Because the data object can be encoded according to its metadata and the encoding repository, a flexible and diverse encoding method is realized.

A fifth aspect of the present invention provides a decoding processing method, including:

Receiving a decoding processing request, and acquiring an object encoding to be decoded according to the decoding processing request;

Disassembling the object encoding to obtain a meta encoding, or the meta encoding and an instance encoding;

Querying an encoding repository, and obtaining corresponding metadata and an encoding specification according to the meta encoding;

Obtaining the data object corresponding to the object encoding according to the metadata and the encoding specification, or according to the metadata, the encoding specification, and the instance encoding.

The technical effect of the fifth aspect of the present invention is as follows: a decoding processing request is received and the object encoding to be decoded is acquired according to it; the object encoding is disassembled to obtain a meta encoding, or the meta encoding and an instance encoding; the encoding repository is queried to obtain the corresponding metadata and encoding specification according to the meta encoding; and the data object corresponding to the object encoding is obtained according to the metadata and the encoding specification, or according to the metadata, the encoding specification, and the instance encoding. Because data objects are encoded by means of metadata and the encoding repository, the encoding method is flexible and also saves space to a certain extent. Correspondingly, decoding based on the disassembled meta encoding and the encoding repository effectively improves decoding efficiency.
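The fourth and fifth aspects can be sketched together as a round trip. This is a simplified illustration under stated assumptions: the `EncodingRepository` class, the one-byte meta encoding, the `"type"` metadata key, and the two registered specifications are all hypothetical and are not taken from the original text.

```python
class EncodingRepository:
    """Maps a meta encoding to metadata plus an encoding specification."""

    def __init__(self):
        self._by_meta_code = {}
        self._by_type = {}

    def register(self, meta_code, metadata, to_bytes, from_bytes):
        self._by_meta_code[meta_code] = (metadata, to_bytes, from_bytes)
        self._by_type[metadata["type"]] = meta_code

    def meta_code_for(self, metadata):
        return self._by_type[metadata["type"]]

    def lookup(self, meta_code):
        return self._by_meta_code[meta_code]


def encode(repo, data_object, metadata) -> bytes:
    """Object encoding = one-byte meta encoding + instance encoding."""
    meta_code = repo.meta_code_for(metadata)
    _, to_bytes, _ = repo.lookup(meta_code)
    return bytes([meta_code]) + to_bytes(data_object)


def decode(repo, object_code: bytes):
    """Disassemble the object encoding, query the repository, rebuild."""
    meta_code, instance_code = object_code[0], object_code[1:]
    metadata, _, from_bytes = repo.lookup(meta_code)
    return metadata, from_bytes(instance_code)


repo = EncodingRepository()
repo.register(0x01, {"type": "text"},
              lambda s: s.encode("utf-8"), lambda b: b.decode("utf-8"))
repo.register(0x02, {"type": "uint16"},
              lambda n: n.to_bytes(2, "big"),
              lambda b: int.from_bytes(b, "big"))

code = encode(repo, 513, {"type": "uint16"})
print(code, decode(repo, code))
```

Because the metadata lives in the repository and only a short meta encoding travels with each object, new kinds of objects can be supported by registering a new entry rather than by changing the encoded format.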

Description of the drawings

FIG. 1A is a flowchart of an embodiment of a method for processing handwritten input characters according to the present invention;

FIG. 1B is a schematic diagram 1 of a character in a method for processing handwritten input characters according to an embodiment of the present invention;

FIG. 1C is a schematic diagram 2 of a character in a method for processing handwritten input characters according to an embodiment of the present invention;

FIG. 1 is a schematic diagram of a method for processing handwritten input characters according to an embodiment of the present invention;

FIG. 1 is a schematic diagram of a state in which a character is inserted in a method for processing handwritten input characters according to an embodiment of the present invention;

FIG. 1F is a schematic diagram of an editing mode under a selection processing command in an embodiment of a method for processing handwritten input characters according to the present invention;

FIG. 1G is a schematic diagram of a blank character in an embodiment of a method for processing handwritten input characters according to the present invention;

FIG. 1H is a flowchart of text editing in an embodiment of a method for processing handwritten input characters according to the present invention;

FIG. 1I is a flowchart of a handwriting program source code conversion method in an embodiment of a method for processing handwritten input characters provided by the present invention;

FIG. 1J is a detailed flowchart of "standard code conversion for B" in the handwriting program source code conversion method shown in FIG. 1I;

FIG. 1K is a schematic diagram of a handwriting program in an embodiment of a method for processing handwritten input characters according to the present invention;

FIG. 1L is a schematic structural diagram of an embodiment of a device for processing handwritten input characters according to the present invention;

FIG. 2A is a flowchart of a data splitting method according to an exemplary embodiment;

FIG. 2B-1 is a flowchart of a data splitting method according to another exemplary embodiment;

FIG. 2B-2 is a structural diagram of a system in which a data object of the data splitting method is audio data according to the present invention;

FIG. 2B-3 is a time domain analysis diagram of data objects of the data splitting method according to the present invention;

FIG. 2B-4 is a diagram of a speech text coding table in which a data object of the data splitting method is audio data according to the present invention;

FIG. 2B-5 is a schematic diagram showing a voice text of a data object of the data splitting method according to the present invention;

FIG. 2B-6 is another schematic diagram showing the voice text of the data object in the data splitting method according to the present invention;

FIG. 2B-7 is still another schematic diagram of a voice text of a data object of the data splitting method according to the present invention;

FIG. 2B-8 is still another schematic diagram of a voice text of a data object of the data splitting method according to the present invention;

FIG. 2C is a diagram showing the position of the data splitting method in a computer system hierarchy according to the present invention;

FIG. 2D is a flowchart of a data merging method according to an exemplary embodiment;

FIG. 2E is a flowchart of a data merging method according to another exemplary embodiment;

FIG. 2F is a schematic structural diagram of a data splitting apparatus according to an exemplary embodiment;

FIG. 2G is a schematic structural diagram of a data splitting apparatus according to another exemplary embodiment;

FIG. 2H is a schematic structural diagram of a data merging apparatus according to an exemplary embodiment;

FIG. 2I is a schematic structural diagram of a data merging apparatus according to another exemplary embodiment;

FIG. 2J is an exemplary data splitting flowchart;

FIG. 2K is another exemplary data splitting flowchart;

FIG. 2L is an exemplary data merging flowchart;

FIG. 2M is a schematic diagram of an exemplary data split description language definition;

FIG. 2N is a flowchart of an exemplary data split description language visualization;

FIG. 2O is a diagram showing the relationships among the three concepts of the present invention;

FIG. 3 is a schematic diagram of a meta model in the prior art;

FIG. 4 is a schematic structural diagram of an encoding system of the present invention;

FIG. 5C is a flowchart of Embodiment 1 of a coding processing method according to the present invention;

FIG. 5D is a flowchart of a specific implementation manner of step 102C in FIG. 5C;

FIG. 6 is a diagram of the relationship between data objects, metadata, coding protocols, and coding meta-objects;

FIG. 7 is a schematic diagram of the core coding metamodel;

FIG. 8 is a conceptual model of object coding, meta-encoding, instance coding (that is, object reference coding removes the meta-coded part), and data objects and coding meta-objects;

FIG. 9 is a diagram showing an example of meta-encoding in the embodiment;

FIG. 10 is a diagram showing an example of a layer-by-layer correlation of a coded meta-object (variable-length coding of 16-bit word length);

FIG. 11 is a schematic diagram of a meta model corresponding to a code;

FIG. 12 is a schematic diagram of a conceptual model of the object encoding;

FIG. 13 is a flowchart of Embodiment 2 of an encoding processing method according to the present invention;

FIG. 14 is a flowchart of Embodiment 3 of a coding processing method according to the present invention;

FIG. 15 is a schematic diagram of a glyph corresponding to a non-standard character encoding stored in an encoding warehouse in the handwriting input system of the embodiment;

FIG. 16 is a core conceptual diagram of an encoding metamodel of an exemplary context-dependent object encoding system;

FIG. 17 is a schematic diagram of a basic object that can be applied to a basic coding space;

FIG. 18 is a schematic diagram showing the coding structure of a 128 fixed length coding scheme;

FIG. 19 is a schematic diagram of four binary bits being four spatial bits;

FIG. 20 is a diagram showing an example of a coding scheme;

FIG. 21 is a diagram showing an example of a coding scheme of UTF-8;

FIG. 22 is a schematic diagram of object coding consisting of element coding and example coding;

FIG. 23 is a detailed view of the encoding;

FIG. 24 is a rendering result diagram;

FIG. 25 is a schematic diagram of code points other than UTF-8 of OTF-8;

FIG. 26 is a schematic diagram of the coding to be defined;

FIG. 27 is a flowchart of Embodiment 4 of a coding processing method according to the present invention;

FIG. 28 is an update diagram of a corresponding coding element model;

FIG. 29 is a schematic diagram of coding combination;

FIG. 30 is a flowchart of Embodiment 5 of a coding processing method according to the present invention;

FIG. 31 is a handwriting input program;

FIG. 32 is a flowchart of Embodiment 1 of a decoding processing method according to the present invention;

FIG. 33 is a flowchart of Embodiment 2 of a decoding processing method according to the present invention;

FIG. 34 is a flowchart of Embodiment 3 of a decoding processing method according to the present invention;

FIG. 35 is a flowchart of Embodiment 4 of a decoding processing method according to the present invention;

FIG. 36 is the content of the handwritten input;

FIG. 37 is a schematic view showing the length of the character pitch;

FIG. 38 is a schematic diagram of a decoding process;

FIG. 39 is a diagram showing an example of a mixed encoded content display;

FIG. 40 is a schematic diagram of the contents of the output;

FIG. 41 is a schematic view showing the strobe stroke falling on the result of the character output;

FIG. 42 is a schematic diagram of adding a standard smiley face icon;

FIG. 43 is a schematic view of an online Go;

FIG. 44 is a schematic structural diagram of a first embodiment of an encoding processing system according to the present invention;

FIG. 45 is a schematic structural diagram of a first embodiment of a decoding processing system according to the present invention;

FIG. 46 is a schematic structural diagram of a word processing system mainly based on an object coding system;

FIG. 47 is a schematic diagram of an architecture of an in-application deployment;

FIG. 48 is a schematic structural diagram of terminal deployment;

FIG. 49 is a schematic structural diagram of a mobile external device deployment;

FIG. 50 is a schematic diagram of an architecture in which an application shares the same code repository;

FIG. 51 is a diagram showing an example of a network deployment of a code repository being a private cloud deployment or an internal server deployment;

FIG. 52 is a schematic diagram of the architecture of a point-to-point deployment;

FIG. 53 is a schematic diagram of a hybrid deployment architecture;

FIG. 54 is an architectural diagram of an extended operating system to allow legacy applications to support object encoding;

FIG. 55 is a diagram showing the interaction of an object encoding system and an application system based on the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the embodiments of the present invention. It should be noted that in the drawings or the description, similar or identical elements are denoted by the same reference numerals.

First, the background of the invention is introduced. With the development of the Internet and mobile computing, cloud storage systems and related applications have emerged. A cloud storage system stores user data on servers in the cloud. Users can then access the data from different terminal devices at any time, which eliminates migrating data between terminal systems. At the same time, users do not need to upgrade their storage devices from time to time, because cloud storage services provide enough scalability to handle a variety of storage needs. Traditional data maintenance tasks, such as backup and encryption, are also handed over to cloud storage servers, which are often more professional and efficient. In addition, because cloud storage is reliable and always online, data usage patterns that differ from traditional applications have appeared, such as data sharing and network collaboration. These have greatly improved the efficiency of data transfer between people and between applications.

Cloud storage systems are used in a variety of applications, the most important of which is the desktop agent. A desktop agent is a cloud storage client based on the file system. It synchronizes a specific folder on the terminal with cloud storage: files stored in the folder are automatically uploaded to the server by the agent, and files uploaded from elsewhere and received by the server are automatically downloaded into the corresponding folder. In this way, a user's files are automatically synchronized across terminals, and the user can work with the data in this folder across platforms in the traditional way. The desktop agent can also automatically synchronize shared folders to different users' terminals, making data sharing and collaboration convenient. Dropbox is a typical desktop agent. Microsoft's OneDrive (formerly SkyDrive), Google Drive, Baidu Netdisk, Kingsoft Kuaipan, and others also provide cloud storage desktop agents. Besides desktop agents, there are a variety of cloud-based, cross-device applications.

Cloud storage systems bring convenient and efficient data access and sharing. But storing data in the cloud raises an unavoidable concern: the protection of security and privacy. The security of core data depends entirely on the cloud storage system. For this reason, many organizations and individuals do not put data, or at least critical data, in cloud storage. There are two main risks. First, data in cloud storage is protected by user identity authentication; once a user's identity is stolen, all of that user's cloud data is exposed to the thief. Second, the security of cloud storage rests on complete trust in the cloud storage service provider, and this foundation is not solid. On the one hand, existing computer security technology is weak, security vulnerabilities keep emerging in all kinds of systems, and malicious attackers can easily attack online services. Major data leaks have occurred in recent years, and cloud storage providers are among the victims: in February 2013, Evernote's systems were breached; in November 2013, a large amount of Tencent QQ user data was leaked; in May 2014, the data of 8 million Xiaomi users was leaked. On the other hand, providers themselves may misuse or abuse data and thereby threaten users, as shown by the exposure of the US PRISM program.

The invention mainly relates to data processing methods, systems, and applications, and effectively solves the above problems. In particular, it involves innovation in the following three aspects: (1) a novel handwriting input method and system, in particular a method for splitting handwritten input characters; (2) an object-based open encoding and decoding solution, which can encode or decode any data object with any free and open encoding method; and (3) an object-based data splitting/merging method that splits or strips the metadata and/or encoded data of a data object from the corresponding data content to guarantee the security of the data content. These technical solutions can be implemented separately or in combination, and can be combined, alone or together, with applications in other technical fields. The invention has broad application prospects and great application value. The specific solutions are as follows:

The invention provides a data object based encoding method, the method comprising:

a) extracting metadata from the data object, and/or parsing the data object and creating or generating corresponding metadata for the data object;

b) selecting or creating an encoding specification for the data object based on at least a portion of the metadata to describe the data object in encoded form;

c) generating or returning an object code for the data object in accordance with the coding convention.

Further, on the basis of the foregoing data object-based encoding method, generating the object code in step c) includes: generating a meta code and/or an instance code for the data object according to a predetermined rule, and generating the object code from the meta code and/or the instance code.

Further, on the basis of the foregoing data object-based encoding method, the method further includes a step of compressing and/or encrypting the data object before step a), and a step of encrypting the generated object code after step c).

Further, on the basis of the foregoing data object-based encoding method, the meta code comprises one of the following encodings, or a combination and/or nesting of two or more of them: spatial encoding, context encoding, and type encoding.

Further, on the basis of the foregoing data object-based encoding method, before step a) the method further includes a data splitting step of splitting a large data object into small data blocks (also called data segments) according to a predetermined rule; steps a) to c) are performed on each split data block during or after the data splitting process, until all data blocks have been encoded.

The invention also provides a data object based decoding method, the method comprising:

a) obtaining the object code;

b) disassembling the object code to obtain the meta code and/or the instance code;

c) obtaining corresponding coding metadata and/or coding specifications according to the disassembled meta code;

d) recovering the original data object based on the coding metadata and/or coding specification, and the instance code.

Further, on the basis of the foregoing data object-based decoding method, disassembling the object code in step b) comprises: disassembling the object code into a meta code and/or an instance code according to the predetermined rule used at encoding time.

Further, on the basis of the foregoing data object-based decoding method, before step a) and/or before step b), the method further includes an authorization verification step for obtaining the object code and/or the predetermined rule used at encoding time.

Further, on the basis of the foregoing data object-based decoding method, if compression and/or encryption means are used in the encoding process, corresponding decompression and/or decryption means are needed in the decoding process.
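The pairing of compression at encoding time with decompression at decoding time can be sketched as follows. The use of zlib and the one-byte marker `b"Z"` that stands in for a full object code are assumptions made for illustration; the original text does not name a compression algorithm or a marker format.

```python
import zlib

COMPRESSED_MARKER = b"Z"  # hypothetical stand-in for a meta code


def encode_with_compression(data: bytes) -> bytes:
    """Compress first, then wrap the result in a (simplified) object code."""
    compressed = zlib.compress(data)
    return COMPRESSED_MARKER + compressed


def decode_with_decompression(object_code: bytes) -> bytes:
    """Unwrap the object code, then apply the matching decompression."""
    marker, body = object_code[:1], object_code[1:]
    if marker != COMPRESSED_MARKER:
        raise ValueError("unknown marker; cannot choose a decompressor")
    return zlib.decompress(body)


original = b"abc" * 100
object_code = encode_with_compression(original)
recovered = decode_with_decompression(object_code)
print(len(original), len(object_code), recovered == original)
```

The decoder has to learn from the object code which transformation was applied, which is why the sketch records it in the marker rather than assuming it.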

The invention also provides a handwritten input character splitting method, the method comprising:

a) receiving the user's input with the currently activated target row/column as a constraint, and recording at least the input position of each stroke in the current row/column;

b) judging the correlation or correlation between each stroke and other strokes and/or characters by comparing each stroke with all or part of the strokes and/or characters in the current row/column, if one If the stroke is not associated with any character or stroke, a new character is created for it, otherwise the stroke is attributed to one or more of the most relevant or most relevant characters.

Further, based on the above-described scheme based on the handwritten input character splitting method, wherein the step c) is performed in one of the following cases: 1) in the input and writing process of the current stroke, 2) or at the current After the stroke input is completed (ie, after the pen is lifted), 3) or after the current line is entered.

Further, in the above-mentioned scheme based on the handwritten input character splitting method, after the current stroke input is completed, the current stroke is only compared with the strokes and/or characters within the predetermined range one by one.

Further, based on the foregoing solution based on the handwritten input character splitting method, wherein the step c) comprises:

Determining whether the currently input stroke is the first stroke on the space in the row/column or the last stroke in the space in the current input state;

If the currently entered stroke is the first stroke on the space in the row/column and is in the current row/column Other characters (or strokes) that have been entered are not associated, or if the currently entered stroke is the last stroke in the space in the row/column and is not related to other characters (or strokes) already entered in the current row/column Create a new character for the stroke; if the current stroke is neither the first stroke on the space in the row/column nor the last stroke on the space in the row/column, then the current stroke is entered The spacing between all characters passed is compared and the currently entered stroke is attributed to the associated one or more characters (or strokes).

Further, in the above-described scheme based on the handwritten input character splitting method, in the step c), a threshold (MIN_GAP) of a minimum distance between the stroke and the character or the stroke and the stroke is preset, each of The spacing between the stroke and other characters or strokes that have been entered is compared to the threshold to determine the association between the stroke and other characters or strokes.

Further, in the above-mentioned scheme based on the handwriting input character splitting method, in the step b), the method further includes: recording, when receiving each input stroke, the input time and the input position information of each stroke.

Further, in the above-mentioned scheme of the handwriting input character splitting method, the input time includes a pen-down time and a pen-up time, and the input position includes at least: the position when the pen is put down, the position when the pen is lifted, and the coordinate position of each point in the handwriting of the stroke.
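One possible way to hold the per-stroke input time and input position described above is sketched below. The class and method names (`StrokeRecord`, `StrokeRecorder`, `pen_down`, `pen_move`, `pen_up`) are illustrative assumptions and do not come from the source.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class StrokeRecord:
    """Hypothetical record of one stroke: its times and every sampled point."""
    pen_down_time: float
    pen_up_time: float = 0.0
    points: List[Point] = field(default_factory=list)

    @property
    def pen_down_position(self) -> Point:
        return self.points[0]

    @property
    def pen_up_position(self) -> Point:
        return self.points[-1]


class StrokeRecorder:
    """Collects pen events into StrokeRecord objects, in input order."""

    def __init__(self) -> None:
        self.records: List[StrokeRecord] = []
        self._open: Optional[StrokeRecord] = None

    def pen_down(self, t: float, pos: Point) -> None:
        self._open = StrokeRecord(pen_down_time=t, points=[pos])

    def pen_move(self, pos: Point) -> None:
        self._open.points.append(pos)

    def pen_up(self, t: float, pos: Point) -> None:
        self._open.points.append(pos)
        self._open.pen_up_time = t
        self.records.append(self._open)
        self._open = None
```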

The invention also provides an object-based data object splitting method, the method comprising:

a) obtaining metadata of the data object;

b) selecting or creating a corresponding data split/peel protocol for the data object based on at least a portion of the metadata;

c) splitting at least a portion of the data object into pieces of data, and/or stripping out at least a portion of the data object in accordance with the data split/peel protocol.

Further, on the basis of the foregoing scheme of the object-based data object splitting method, the data split/peel protocol comprises at least one of the following, or a combination of two or more: 1) a data content splitting protocol, which records the method and process of splitting the data content; 2) a metadata stripping protocol, which records the method and process of separating the corresponding metadata from the data object; 3) if an encoding is generated during the data splitting process, an encoding separation protocol, which records the encoding rules and encoding process between the corresponding encoding and the encoded object.

Further, based on the solution of the object-based data object splitting method, after step c), further comprising the step d): reassembling the split data segments.

Further, on the basis of the above-described scheme of the object-based data object splitting method, at least a part of the metadata constitutes split metadata.

The invention also provides an object-based data object merging method, the method comprising:

a) obtaining the fragmented data fragments, and the split/peel protocol or the corresponding merge protocol;

b) obtaining split metadata of the data object according to the obtained data segment and/or the split/peel protocol or the merge specification;

c) Combining the data segments based on the data split/peel protocol or the merge specification, and the split metadata, thereby recovering the original data.

Further, on the basis of the foregoing scheme of the object-based data object merging method, after the splitting process on the data object is completed, the method further includes a storing step in which the split/stripped data segments are stored separately in different banks or under different secure channels.
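The splitting and merging steps above can be illustrated with a small round trip. This is only a sketch under assumptions: the source does not specify any concrete protocol, so the round-robin chunking, the protocol dictionary, and the function names `choose_protocol`, `split_object`, and `merge_object` are all hypothetical.

```python
import math


def choose_protocol(metadata):
    """Select a split protocol from the metadata (here, only the size is consulted)."""
    chunk = 4 if metadata["size"] <= 16 else 8
    return {"method": "round_robin", "parts": 2, "chunk": chunk}


def split_object(content: bytes, metadata: dict):
    """Split the content into fragments and strip the metadata from it."""
    protocol = choose_protocol(metadata)
    parts = [bytearray() for _ in range(protocol["parts"])]
    chunk = protocol["chunk"]
    for i in range(0, len(content), chunk):
        parts[(i // chunk) % protocol["parts"]] += content[i:i + chunk]
    # The split metadata travels separately from the fragments.
    split_metadata = {"name": metadata.get("name"), "size": len(content)}
    return [bytes(p) for p in parts], protocol, split_metadata


def merge_object(fragments, protocol, split_metadata):
    """Recombine the fragments using the protocol and the split metadata."""
    chunk, n = protocol["chunk"], protocol["parts"]
    offsets = [0] * n
    out = bytearray()
    for i in range(math.ceil(split_metadata["size"] / chunk)):
        k = i % n
        out += fragments[k][offsets[k]:offsets[k] + chunk]
        offsets[k] += chunk
    return bytes(out)
```

In this sketch no single fragment contains the full content, and merging needs the fragments, the protocol, and the split metadata together, which is why storing them in separate places can help protect the content.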

The handwriting input method and system will be described in detail below.

FIG. 1A is a flowchart of an embodiment of a method for processing handwritten input characters according to the present invention. The method for processing handwritten input characters provided by the embodiment can be closer to people's natural writing habits than the existing handwriting input system, and at the same time completely and truly preserve the writing style and features of the writer. As shown in FIG. 1A, the method in this embodiment may include:

Step 101A: In the currently activated first target row/column, acquire a stroke input by the user and the corresponding input information, wherein the input information includes the input position of the stroke in the first target row/column.

The execution subject in this embodiment may be a handwriting input device such as a conventional touch screen, handwriting screen, or other suitable handwriting device, or directly adapted to the handwriting system of the present embodiment. Preferably, the present embodiment may employ a touch screen type handwriting input device, that is, an input device that can directly input information on the screen by handwriting or by means of a dedicated or non-dedicated writing tool.

Specifically, the embodiment can be applied to any writing mode, which may be set by the user or left at the default setting. The writing modes described in this embodiment include, but are not limited to, the following: writing in rows (the commonly used horizontal format, written left to right and top to bottom); writing in columns (the vertical format, written top to bottom and right to left); and other user-defined writing formats, for example a right-to-left format for Arabic writers, or a format written top to bottom and left to right, and so on.

Usually, the user writes each character by hand in his or her own stroke order. In this embodiment, each stroke of the user and its input position can be recorded in chronological order. For example, when the user starts writing the character "我" ("I"), the first stroke written is "丿" (撇, a left-falling stroke), and the system automatically records this stroke and its input position on the panel. The pixel position on the handwriting input screen can be used as the input position, for example; other positioning algorithms or position determining methods may also be employed, as long as the input position of each stroke can be uniquely determined.

When the user performs handwriting input, there is a concept of a target row/column, which serves as a constraint range for the user's handwriting input; that is, when a row/column is activated, it becomes the target row/column for input. All strokes entered by the user belong to the target row/column until the target row/column is changed. In this case, the user can be prohibited from handwriting in any area other than the target row/column, or the user can be allowed to input at any position. In the latter case, when a stroke input by the user exceeds the boundary of the target row/column, one of the following processing methods can be used: first, where precision requirements are low, the part of the stroke that exceeds the boundary by more than a predetermined threshold can be discarded; second, where the original input needs to be restored with high precision, the stroke information beyond the boundary, such as time and position, can be recorded in full so as to fully restore the user's original input state.
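The two ways of handling a stroke that crosses the target row boundary can be sketched as below. The function name `apply_row_boundary`, its parameters, and the convention that y grows downward are assumptions made for illustration only.

```python
def apply_row_boundary(points, top, bottom, tolerance=0.0, keep_overflow=False):
    """Return the points of a stroke to be recorded for a target row.

    points        -- list of (x, y) samples of the stroke
    top, bottom   -- vertical range of the target row (y grows downward)
    tolerance     -- how far past the boundary a point may lie and still be kept
    keep_overflow -- True for the high-precision case: record everything
    """
    if keep_overflow:
        return list(points)
    # Low-precision case: drop the part lying beyond boundary + tolerance.
    return [(x, y) for x, y in points if top - tolerance <= y <= bottom + tolerance]
```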

The method provided in this embodiment can constrain input in units of rows (horizontal) or columns (vertical); that is, the current input is limited to a specific row or column, and there are no strokes or text spanning rows or columns. Based on this row or column constraint, the input can form a stream of characters in the order of input. Compared with the prior art, the method provided by the embodiment is closer to people's natural writing habits, so that the writing experience of the user can be more natural and smooth.

When the user inputs, the range of the target row/column may be displayed on the handwriting input screen, for example by highlighting the target row/column, or by displaying rows/columns or a grid pattern in a text or letter format on the handwriting input screen, to indicate the location of the target row/column in which the user can currently input.

Preferably, prior to step 101A, the currently activated first target row/column may be selected or created. Selecting or creating the currently activated first target row/column can take many forms, and the present embodiment gives the following two.

Select target row/column mode one: first determine the range of positions for each row/column, and then the user selects the target row/column. The location range of each row/column is determined, which may specifically include:

Obtaining the size information of the handwriting input screen and the information of the row height/column width;

Dividing the handwriting input screen into at least one row/column according to the size information of the handwriting input screen and the information of the row height/column width, and determining the range of positions of each row/column;

Wherein, the row height/column width information is a default value or is determined by user input, and the position range of each row/column refers to the relative top-edge and bottom-edge positions of each row in the handwriting input screen, or the relative left-edge and right-edge positions of each column in the handwriting input screen.

Through the above steps, the handwriting input screen can be divided into a plurality of rows/columns, and the range of positions of each row/column can be determined. During the subsequent input process of the user, the strokes can be input based on the divided rows/columns.
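A minimal sketch of the division step for the row case follows. The function name `divide_into_rows` and the handling of a trailing partial row (it is dropped) are assumptions; the source only states that the screen is divided according to its size and the row height.

```python
def divide_into_rows(screen_height, row_height):
    """Return the (top, bottom) position range of each row, from top to bottom."""
    if row_height <= 0:
        raise ValueError("row_height must be positive")
    count = screen_height // row_height
    if count == 0:
        # Screen shorter than one row: treat the whole screen as a single row.
        return [(0, screen_height)]
    return [(i * row_height, (i + 1) * row_height) for i in range(count)]
```

The column case is symmetric, with the screen width and column width taking the place of the height values.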

After the range of positions of each row/column is determined, the target row/column can be selected by the user. Selecting the target row/column by the user may specifically include:

Receiving a target row/column selection message input by the user, where the target row/column selection message includes an identifier of the target row/column to be input by the user;

According to the target row/column selection message, a row/column corresponding to the identifier of the target row/column to be input by the user is used as the currently activated first target row/column.

The identifier of the target row/column to be input by the user may be any coordinate point clicked by the user, in which case the row/column where the coordinate point is located is the row/column corresponding to the identifier; or, the identifier may be a row/column number, for example the 10th row or the 10th column, in which case the row/column corresponding to the row/column number is used as the currently activated first target row/column.
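Resolving a selection message into a target row might look like the sketch below. The message layout (a dictionary with either a `"point"` or a `"row_number"` key) and the function name `resolve_target_row` are hypothetical; the source specifies only that the identifier is a clicked coordinate or a row/column number.

```python
def resolve_target_row(rows, selection):
    """Return the index of the selected row, or None if nothing matches.

    rows      -- list of (top, bottom) ranges, one per row
    selection -- {"point": (x, y)} or {"row_number": n}, with n counted from 1
    """
    if "row_number" in selection:
        index = selection["row_number"] - 1
        return index if 0 <= index < len(rows) else None
    _, y = selection["point"]
    for index, (top, bottom) in enumerate(rows):
        if top <= y < bottom:
            return index
    return None
```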

When an external input device is connected, the user can select the target row/column through that device. For example, when an external keyboard is connected, the user can select a target row/column through the keyboard; when an external mouse is connected, the user can select a different target row/column by moving the mouse; or, when an external stylus is connected, the target row/column can be selected by the pointing of the stylus before it touches the handwriting input screen.

Select target row/column mode two: Activate a target row/column based on the characters previously entered by the user. The method may specifically include:

Acquiring at least one character already input by the user;

Using the row/column of the at least one character as the currently activated first target row/column;

Setting a range of locations of the currently activated first target row/column according to a character boundary of the at least one character;

Wherein, the position range refers to the relative top-edge and bottom-edge positions of the first target row in the handwriting input screen, or the relative left-edge and right-edge positions of the first target column in the handwriting input screen.

Due to differences in writing habits, an appropriate threshold can be set for the width of the first target row/column to meet the needs of a particular user. For example, in horizontal writing, the writer's natural writing line may habitually slant toward the upper right or the lower right. In this case, the boundary of the at least one character that the user has input may be appropriately extended upward or downward by a distance and used as the boundary of the first target row/column.
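Deriving the range of the target row from already-written characters, with an allowance for a slanting writing line, can be sketched as follows. The bounding-box representation and the names `row_range_from_characters`, `extend_up`, and `extend_down` are assumptions for illustration.

```python
def row_range_from_characters(char_boxes, extend_up=0.0, extend_down=0.0):
    """Return (top, bottom) of the target row from character bounding boxes.

    char_boxes  -- list of (left, top, right, bottom) boxes; y grows downward
    extend_up   -- extra distance added above the topmost character edge
    extend_down -- extra distance added below the lowest character edge
    """
    if not char_boxes:
        raise ValueError("at least one character is required")
    top = min(box[1] for box in char_boxes) - extend_up
    bottom = max(box[3] for box in char_boxes) + extend_down
    return top, bottom
```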

The two methods of selecting the target row/column provided above are simple and fast; the second method can satisfy the user's personalized input and the handwritten text input in the graphic system.

Step 102A: For each stroke, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, create a new character for the stroke or determine the character to which the stroke belongs.

This embodiment adopts a character segmentation approach different from the prior art: the attribution of the currently input stroke is determined based on the correlation between each input stroke and other characters or strokes. The method provided in this embodiment therefore spares the user tedious interaction when inputting characters, greatly simplifying the input operation.

Here, a character refers to an independent character object having a two-dimensional shape. It includes not only standard characters of ideographic scripts, such as single Chinese, Japanese, Korean, Arabic, Tibetan, or Burmese characters or parts thereof (for example, radicals), and standard words of phonetic scripts, such as English, German, French, Russian, or Spanish; but also computer characters based on traditional standard codes, such as ASCII characters, Unicode characters, or strings; combinations of handwritten characters with standard characters and strings; any graphic or image input by the user, such as a "heart" pattern, a photo, or any graffiti; and any other written expression.

FIG. 1B is a first schematic diagram of characters in a method for processing handwritten input characters according to an embodiment of the present invention. FIG. 1C is a second schematic diagram of characters in a method for processing handwritten input characters according to an embodiment of the present invention. Five characters are shown in FIG. 1B, including "stroke characters", that is, handwritten characters input by the user, such as the first, third, and fourth characters, and "graphic characters", that is, arbitrary graphic or image information input by the user, such as the second and fifth characters. In addition, other characters such as "standard characters" (characters in any existing standard font) and "combined characters" (characters of various kinds mixed together) can be input in this embodiment. A "combined character" can also directly include handwritten strokes: when a handwritten stroke is written directly on a non-"stroke character", a "combined character" is formed. As shown in FIG. 1C, the word "饕餮" is a combination of standard characters and stroke characters.

In this embodiment, it is not necessary to identify the characters input by the user, and it is only necessary to determine which character each stroke belongs to, and the characters are divided. When determining the attribution of the stroke input by the user, the strokes input in the first target row/column can be automatically divided according to the intrinsic convention of the set language (for example, based on the writing or typesetting manner of each language, etc.).

Determining the character to which a stroke belongs is a process of splitting the input characters. The splitting operation (i.e., character segmentation) can be performed while input is in progress; that is, as the user writes naturally, the character to which each already-input stroke belongs can be determined, so that characters are segmented as they are written.

For the trigger condition of character splitting, one of the following methods may be selected: (1) from the moment the user puts the pen down, the attribution of the stroke being input is judged in real time from the points of the stroke input so far; (2) the attribution of each stroke is judged after the input of that stroke is completed (i.e., the pen is lifted); (3) after the input of one line is completed, or when it is determined that the user has made a longer input pause, all previously entered strokes are judged one by one, and the strokes with the highest or strongest correlation are attributed to the same character.

Each of the above three methods has its own advantages and disadvantages. In the order listed, their computational cost goes from large to small: the calculation amount under trigger condition (1) is the largest, and the calculation amounts of the latter two are comparable to each other but smaller than that of the first. In addition, under trigger condition (1), the real-time judgment causes the result to change dynamically: according to the points input so far, the current stroke may appear to belong to the previous character, but as the stroke continues it may turn out that the stroke should form an independent character or be attributed to the next character. In this case, the final assignment of each stroke needs to be updated to avoid assigning the same stroke to different characters, and this processing also increases the amount of calculation. Although in most cases the user does not care about character segmentation, which runs as a background process, the processing under trigger condition (1) can provide a more real-time interactive experience than the latter two methods.

For each stroke, if the stroke is the first stroke of the first target row/column, a new character can be created for the stroke; if the stroke is not the first stroke of the first target row/column, a new character can be created for the stroke, or the character to which the stroke belongs can be determined, according to the input position of the stroke in the first target row/column and the other characters in the first target row/column.

In the method for processing handwritten input characters provided in this embodiment, a stroke input by the user and the corresponding input information are acquired in the currently activated first target row/column, and, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, a new character is created for the stroke or the character to which the stroke belongs is determined. Characters are thus segmented as they are input. The user does not need to separate characters by means of explicit or implicit "start single text input" or "end single text input" commands, so there is no need to pause for a period of time or interact with the system during writing, and the writing process is smooth and efficient. Moreover, the character to which a stroke belongs is determined directly from the input position of the stroke, without standardized character recognition, which retains the personalized information, writing style, and features of the user's handwriting input.

Since the present embodiment can make the handwriting input more natural and smooth, it is more convenient for the elderly and children who are unfamiliar with electronic input devices such as computers, mobile phones, tablet computers, laptop computers, notebooks, and iPads to use these devices.

Different from the traditional keyboard/character stream model, the handwritten input character processing method in this embodiment adopts a pen/paper model. The user can directly activate any line in the page for input. The system can process empty lines between handwritten input and handwritten input as empty paragraphs. For the user, there can be only the command to change the input line, and there is no concept of carriage return or line feed.

When the user's input reaches the end of a row/column, it may be necessary to move the target row/column to the next row/column after the first target row/column so that the user can continue inputting there; this is the line break function provided by this embodiment. Specifically, the line break function can be implemented in multiple manners. In this embodiment, the following four are provided:

Break mode one:

Receiving a line break/column command input by the user;

According to the line break/column command, the second target row/column is set as the currently activated target row/column, where the second target row/column is the next row/column after the first target row/column.

In this mode, the position of the line break can be determined by a preset interaction mode. For example, it may be stipulated in advance that the end of the line is confirmed by continuously clicking a corresponding position or button of the right border of the input box or the screen twice or three times each time the line is naturally written to reach the end of the line. Alternatively, a command button can be set at the end of the first target row/column, and when the user clicks the command button, the next row/column is automatically activated for editing.

Break mode two:

Determining whether a distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is less than a first preset threshold;

If it is determined that the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is less than the first preset threshold, the second target row/column is set as the currently activated target row/column, so that strokes input by the user can be acquired in the second target row/column;

The second target row/column is the next row/column of the first target row/column.
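The automatic line break of mode two can be sketched as a single distance check. The sketch assumes a left-to-right row and takes the rightmost point of the stroke as its "input position"; these choices, and the name `next_active_row`, are assumptions rather than details given in the source.

```python
def next_active_row(current_row, stroke_points, row_end_x, threshold, row_count):
    """Return the index of the row that should be active after this stroke.

    current_row   -- index of the first target row
    stroke_points -- (x, y) samples of the stroke just acquired
    row_end_x     -- x coordinate of the end position of the row
    threshold     -- the first preset threshold
    row_count     -- number of rows available on the screen
    """
    stroke_end = max(x for x, _ in stroke_points)
    near_end = (row_end_x - stroke_end) < threshold
    if near_end and current_row + 1 < row_count:
        return current_row + 1  # activate the second target row
    return current_row
```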

Break mode three:

If it is determined that the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is less than the first preset threshold, the first target row/column and the second target row/column are simultaneously set as the currently activated target rows/columns;

Acquiring at least one stroke subsequently input by the user in the first target row/column and/or the second target row/column, and, once the first stroke is acquired in the second target row/column, taking only the second target row/column as the currently activated target row/column;

In this mode, in order to realize continuous input, it is necessary to solve the problem of the attribution of strokes in adjacent lines. When two or more adjacent lines are activated at the same time, the user's stroke may span multiple rows/columns. In this case, the row/column to which the stroke belongs must be determined by certain rules: it can be the row/column in which the starting point is located, the row/column in which the end point is located, or the row/column containing the largest proportion of the stroke. Of course, this conflict can also be alleviated by increasing the spacing between two adjacent rows/columns.
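The three attribution rules for a stroke that spans adjacent rows can be sketched as follows. Equal-height rows, the rule names, and measuring the "largest proportion" by the number of sampled points are assumptions made for this illustration.

```python
from collections import Counter


def owning_row(points, row_height, rule="proportion"):
    """Return the index of the row to which a stroke is attributed.

    points     -- (x, y) samples of the stroke, in writing order
    row_height -- height of each row (rows are assumed to be equally high)
    rule       -- "start", "end", or "proportion"
    """
    rows = [int(y // row_height) for _, y in points]
    if rule == "start":
        return rows[0]
    if rule == "end":
        return rows[-1]
    if rule == "proportion":
        counts = Counter(rows)
        # On a tie, the row reached first by the stroke wins.
        return max(counts, key=counts.get)
    raise ValueError("unknown rule: " + rule)
```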

Preferably, when the first target row/column and the second target row/column are simultaneously used as the currently activated target rows/columns, each of them is only partially activated;

A starting position of the active area of the first target row/column is set between an end position of an active area of the second target row/column and an end position of an active area of the first target row/column.

Break mode four:

The user decides whether to break the line by fully controlling the position, within the paragraph, of the handwriting panel that represents the active area. The handwriting panel itself automatically breaks lines within the paragraph. When the user, through an interaction (such as a keyboard command or a touch-screen gesture), moves the handwriting panel in the layout direction or the reverse direction, the system moves part or all of the handwriting panel to the next line or the previous line according to its position in the paragraph and its relationship to the current line. As the position within the paragraph changes, the content presented in the handwriting panel changes accordingly. When the handwriting panel has been moved to the last line of the paragraph, triggering its automatic line break again actually breaks the paragraph.

FIG. 1D is a schematic diagram of a method for processing handwritten input characters according to an embodiment of the present invention, in which two adjacent rows are simultaneously activated. The position in the box in the figure is the active area. As shown in FIG. 1D, the active area is a logically continuous area within two adjacent rows/columns, and the user can only input in the active area. Since the active areas of two adjacent rows/columns overlap, this avoids the occurrence of cross-row/column strokes. At the same time, the active area can also be switched to the full row/column range (the first target row/column or the second target row/column) according to the user's interaction.

For the case of simultaneously activating two adjacent rows/columns, there is a constraint that there is no corresponding forward or backward detour feature for the first row/column or tail row/column of the paragraph. The details are explained below.

Within the same paragraph, if the currently activated target line is not the first line of the paragraph, the relevant areas of the target line and the previous line may be activated simultaneously when the distance between the input position of the stroke in the target line and the start position of the line is less than a certain threshold; if the currently activated target line is not the last line of the paragraph, the relevant areas of the target line and the next line can be activated simultaneously when the distance between the input position of the stroke in the target line and the end position of the line is less than a certain threshold.

However, for the first and last lines of a paragraph, if there are other paragraphs before and after this paragraph, then when the user inputs in the first line of this paragraph, the first line and the line before it cannot be activated at the same time, because the previous line belongs to another paragraph; likewise, when the user inputs in the last line of this paragraph, the last line and the line after it cannot be activated at the same time, because the next line belongs to the next paragraph.

In particular, for the end of the paragraph, the user may need to issue a "line extension" command, followed by a blank line that belongs to this paragraph, in order to enable the function of simultaneously activating two adjacent lines.

Among the above four line break modes, modes one and four are active line breaks by the user, in which the target row/column is transferred through interaction with the user, which is more accurate; modes two and three are automatic line breaks, which need no additional interaction with the user. As long as the user's writing fully conforms to the row or column constraints, the end position of each row/column can be recognized automatically without the user having to confirm the end of each row/column interactively, so that the entire handwriting input screen can even be used like ordinary paper, greatly improving the user's input experience.

For the processing method of handwritten input characters in this embodiment, there are two important concepts: line break (soft carriage return) and paragraph end (hard carriage return). A line break means that the current paragraph is not over, but since handwritten input has reached the end of the line, the next line needs to be activated. A paragraph end means that the paragraph is finished; when the end of a paragraph is determined, a blank line can be inserted after the current line, and the line following the blank line is then activated as the first line of the next paragraph so that the user can input there; alternatively, when the end of a paragraph is determined, the next row/column after the current line can be activated directly as the first line of the next paragraph for input.

In order to distinguish between line breaks and paragraph ends, different interaction modes can be set: for example, clicking one button to break the line and another button to end the paragraph; or breaking the line automatically at the end of the line and ending the paragraph by manual interaction; or ending the paragraph automatically when the end position of a line is reached and breaking the line by manual interaction. This embodiment does not limit this.

For example, any one of the above-mentioned line break modes one, two, and three may be used to perform line break. For the end of the paragraph, some interaction with the user is required.

Alternatively, when the user inputs on different lines, the system can automatically assign the different lines to different paragraphs and create empty paragraphs for the empty lines between paragraphs; extending a paragraph to the next line (i.e., a line break) then requires an explicit interactive command. In general, the paragraph extension command only makes sense on the last line of the paragraph or on the most recently inserted line. The current edit line and all other lines of its paragraph are given some visual state that distinguishes them from other paragraphs.

On the basis of the technical solutions provided by the above embodiments, it is preferable that the characters input by the user can also be saved.

The saving function in this embodiment may specifically include:

Saving, at preset time intervals, the new characters created for the acquired strokes or the characters to which the acquired strokes are attributed;

or,

On the same page, when the currently activated target row/column is switched from one target row/column to another target row/column, saving the new characters created for, or the characters attributed to, the strokes acquired on the one target row/column;

or,

When the current page is switched from one page to another page, saving the new characters created for, or the characters attributed to, the strokes acquired on the one page.

Specifically, when saving, the strokes input by the user and the corresponding input information may be saved in a first memory, and the characters may be saved in a second memory, where each saved character includes the strokes composing the character and an index corresponding to each stroke; the index corresponding to a stroke points to the input information for that stroke in the first memory. Alternatively, the strokes, their input information, and the corresponding characters may all be stored in one memory, which is not limited in this embodiment.
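The two-memory arrangement described above can be sketched with two small in-memory stores, where a character holds only indexes into the stroke store. The class names `StrokeMemory` and `CharacterMemory` and their methods are hypothetical.

```python
class StrokeMemory:
    """First memory: each entry holds a stroke and its input information."""

    def __init__(self):
        self._entries = []

    def save(self, points, input_info):
        """Store a stroke and return the index that refers to it."""
        self._entries.append({"points": list(points), "info": dict(input_info)})
        return len(self._entries) - 1

    def load(self, index):
        return self._entries[index]


class CharacterMemory:
    """Second memory: each character is a list of indexes into the stroke memory."""

    def __init__(self):
        self.characters = []

    def new_character(self, stroke_index):
        self.characters.append([stroke_index])
        return len(self.characters) - 1

    def attach(self, char_id, stroke_index):
        self.characters[char_id].append(stroke_index)

    def strokes_of(self, char_id, stroke_memory):
        """Follow the indexes back to the stroke entries in the first memory."""
        return [stroke_memory.load(i) for i in self.characters[char_id]]
```

Keeping only indexes in the character store means the raw stroke data is stored once, and a stroke can be reassigned to another character by moving its index.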

For the storage order or sequence of strokes and characters, any suitable storage method may be employed as long as it can effectively distinguish the characters to which each stroke belongs and each different character. Preferably, information such as the input strokes and the divided characters can be stored in a temporary storage location or space of the system (such as the system's RAM or flash memory) during input, and, after the input of each target row/column has ended, all of the divided character and stroke information in the target row/column is stored in the specified permanent storage location or space.

On the basis of the technical solutions provided by the foregoing embodiments, it is preferable that the input information corresponding to the stroke further includes one or a combination of the following: the input time of the stroke, the input strength of the stroke, and the input speed of the stroke.

The input time includes the pen-down time and pen-up time of the stroke and the dwell time at each point in the handwriting of the stroke; the input position includes at least: the position when the pen is put down, the position when the pen is lifted, and the coordinate position of each point in the handwriting of the stroke.

In this embodiment, information such as input time, velocity, and speed of each stroke can be recorded as needed to further refine the input information. The strokes and corresponding input time, velocity and speed can be stored in a separate stroke database in the form of a list.

Since the present embodiment can record and retain the detailed input information of each stroke in accordance with the stroke order at the time of writing while receiving each input stroke, it is possible to completely record and retain all the writing styles associated with each user. And almost all the information that is used to it, such as stroke order style, stroke style, word spacing and other writing features, making for example handwriting identification a breeze.

This embodiment also shows great advantages for missing strokes. For example, when the user enters the word "I", he forgets to input "丶" (dot) in the upper right corner, and finds the missing stroke "丶" after inputting other characters. At this time, the user can be as normal. Writing on paper is like "I" The "丶" is added to the corresponding upper right corner position of the original position of the word. Although the input time of the "丶" is different from the input time of other strokes of the "I" character, it can be judged from the position information that the "丶" belongs to The previously entered part of the "I" word.

When the user draws a custom graphic or character in a graffiti manner during the input process, as with the regular character, the input time and input position of each stroke are also recorded.

Since the present embodiment can completely retain all the input information, including the input time, position, strength, speed, and word spacing of each stroke, it also provides a wider space for application services such as subsequent editing and other processing.

Based on the technical solution provided by the foregoing embodiment, it is preferable that, in step 102A, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, may specifically include:

Comparing the input position of the stroke in the first target row/column with the position information corresponding to the character specified in the first target row/column, and determining the correlation between the stroke and the character;

If the stroke is not associated with any character, a new character is created for the stroke, the stroke being attributed to the new character;

If the stroke is associated with at least one character, the stroke is attributed according to the associated at least one character.

The specified characters in this embodiment may be all the characters already present in the first target row/column; or the specified characters may be the characters within a to-be-compared region in the first target row/column, wherein the distance between a boundary of the to-be-compared region and the stroke is less than a second preset threshold. Comparing the stroke only with characters within a certain surrounding range can effectively reduce the amount of calculation and improve the efficiency of the stroke attribution determination.

There are a variety of ways to compare the input position of the stroke in the first target row/column with the position information corresponding to the characters specified in the first target row/column and to determine the relevance between the stroke and a character; they are described separately below.

Relevance judgment mode 1: the relevance of the stroke to a character is determined by judging whether the stroke overlaps the character. Specifically, in step 102A, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, may specifically include:

Comparing the input position of the stroke in the first target row/column with the position information corresponding to the character specified in the first target row/column, and determining whether the stroke overlaps with at least one stroke of the character;

If the stroke overlaps with at least one stroke of the character, determining that the stroke is associated with the character;

If the stroke does not overlap with any stroke of the character, determining that the stroke is not associated with the character;

In this method, strokes that intersect each other can be used as strokes of the same character, and the strokes are assigned to the same character, which is simple and quick.
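As a rough illustration of relevance judgment mode 1, the sketch below treats each stroke as a polyline of sampled (x, y) points and tests whether any two segments intersect. The polyline representation and the function names are assumptions made for this example; the embodiment does not prescribe a particular overlap test.

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): -1, 0, or 1."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _in_box(p, q, r):
    """True if r lies inside the bounding box of segment pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a, b, c, d):
    """True if segment ab and segment cd share at least one point."""
    o1, o2 = _orient(a, b, c), _orient(a, b, d)
    o3, o4 = _orient(c, d, a), _orient(c, d, b)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint of one segment lies on the other.
    return ((o1 == 0 and _in_box(a, b, c)) or (o2 == 0 and _in_box(a, b, d))
            or (o3 == 0 and _in_box(c, d, a)) or (o4 == 0 and _in_box(c, d, b)))


def _segments(stroke):
    # A single-point stroke (a dot) becomes one zero-length segment.
    if len(stroke) == 1:
        return [(stroke[0], stroke[0])]
    return list(zip(stroke, stroke[1:]))


def strokes_overlap(s1, s2):
    return any(segments_intersect(a, b, c, d)
               for a, b in _segments(s1) for c, d in _segments(s2))


def associated_by_overlap(stroke, character):
    """Mode 1: associated if the stroke overlaps any stroke of the character."""
    return any(strokes_overlap(stroke, s) for s in character)
```

Here `character` is simply a list of strokes. A stroke that crosses none of them is reported as not associated, in which case a new character would be created for it.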

Relevance judgment mode 2: the relevance of the stroke to a character is determined by calculating the distance between the stroke and the character boundary. In this mode, in step 102A, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, may specifically include:

For each character specified in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information corresponding to the character, and determining whether the distance between the stroke and the boundary of the character is less than a third preset threshold;

If the distance between the stroke and the boundary of the character is less than the third preset threshold, determining that the stroke is associated with the character;

If the distance between the stroke and the boundary of the character is not less than the third preset threshold, determining that the stroke is not associated with the character;

For example, for characters with an obvious left-right or top-bottom structure, such as the character "温" ("warm"), differences in personal writing habits may cause the gap between the left part "氵" (the three-dot water radical) and the right part "昷" to be too large during writing. In this case, the character to which the strokes belong can be determined by comparison with the third preset threshold. When the distance between the currently input stroke and the adjacent character is less than the third preset threshold, the stroke may be considered to belong to the adjacent character; otherwise, a new character may be created for the stroke.
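A minimal sketch of relevance judgment mode 2 follows. It assumes that the "boundary" of a character is its axis-aligned bounding box and that the distance is the gap between the stroke's bounding box and the character's bounding box; both are assumptions made for illustration, and `threshold` stands in for the third preset threshold.

```python
import math


def bbox(points):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def box_gap(a, b):
    """Shortest distance between two boxes; 0 if they touch or overlap."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return math.hypot(dx, dy)


def associated_by_boundary(stroke, character, threshold):
    """Mode 2: associated if the gap to the character boundary is below threshold."""
    char_points = [p for s in character for p in s]
    return box_gap(bbox(stroke), bbox(char_points)) < threshold
```

With a character occupying x from 0 to 10 and a new stroke starting at x = 13, the gap is 3, so the stroke is attributed to the character when the threshold is 5 and starts a new character when the threshold is 2.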

Relevance judgment mode 3: the relevance of the stroke to a character is determined by calculating the distance between the stroke and each stroke in the character. In this mode, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, may specifically include:

For each character specified in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information corresponding to each stroke in the character, determining the minimum spacing value among the spacings between the stroke and each stroke of the character, and determining whether the minimum spacing value is less than a fourth preset threshold;

If less than, the stroke is associated with the character;

If not less than, the stroke is not associated with the character;
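Relevance judgment mode 3 can be sketched as below. The spacing between two strokes is approximated here by the smallest distance between their sampled points, which is an assumption of this example rather than something the embodiment specifies; `threshold` stands in for the fourth preset threshold.

```python
import math


def stroke_spacing(a, b):
    """Approximate spacing: smallest distance between sampled points of a and b."""
    return min(math.dist(p, q) for p in a for q in b)


def min_spacing(stroke, character):
    """Minimum spacing between the stroke and each stroke of the character."""
    return min(stroke_spacing(stroke, s) for s in character)


def associated_by_spacing(stroke, character, threshold):
    """Mode 3: associated if the minimum spacing is below threshold."""
    return min_spacing(stroke, character) < threshold
```

The minimum spacing value computed here can also serve as the measure of association strength when several characters are candidates.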

In relevance judgment modes 1, 2, and 3, performing the attribution processing on the stroke according to the at least one associated character may include:

If there is one character associated with the stroke, assigning the stroke to one character associated with the stroke;

If there are at least two characters associated with the stroke, merging the at least two characters and attributing the stroke to the merged character.

In this embodiment, when a stroke can be attributed to both the character on its left and the character on its right, this indicates that the stroke should be merged with the characters on both sides to form one glyph, for example, the positional relationship between the middle component of the character "树" ("tree") and the "木" ("wood") on its left side and the "寸" ("inch") on its right side. Of course, if no subsequent recognition operation is required, the preset threshold need not be set, as long as the characters can be divided.

In addition, in relevance judgment modes 2 and 3, the association between the stroke and a character can be classified as strong or weak, and the attribution of the stroke is judged according to the strength of the association.

Specifically, the performing the attribution processing on the stroke according to the at least one associated character may include:

Obtaining the character most strongly associated with the stroke from the associated at least one character;

If exactly one character has the strongest association with the stroke, the stroke is attributed to that character;

If there are at least two characters with the strongest association with the stroke, at least two characters are merged, and the stroke is attributed to the merged character.

Correspondingly, obtaining the character most strongly associated with the stroke from the associated at least one character may include:

Sorting the at least one character associated with the stroke in ascending order of the distance between the stroke and the boundary of each character, and taking the character corresponding to the minimum distance as the character most strongly associated with the stroke; or,

Sorting the at least one character associated with the stroke in ascending order of the minimum spacing value between the stroke and each character, and taking the first character in that order as the character most strongly associated with the stroke.
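The attribution by association strength described above can be sketched as follows. The data layout (a dictionary from character id to a list of stroke ids, and a dictionary of distances for the associated characters only) and the function name `attribute_stroke` are assumptions made for this example.

```python
def attribute_stroke(stroke_id, distances, characters):
    """Attribute a stroke to its most strongly associated character.

    distances:  {char_id: distance}, only for characters associated with the
                stroke (boundary distance or minimum spacing value).
    characters: {char_id: [stroke ids]}, updated in place.
    Returns the id of the character that receives the stroke.
    """
    if not distances:
        # No associated character: create a new character for the stroke.
        new_id = max(characters, default=-1) + 1
        characters[new_id] = [stroke_id]
        return new_id
    best = min(distances.values())
    # Characters tied for the smallest distance are all "strongest".
    # Exact equality is used here; a tolerance may suit real input better.
    strongest = sorted(c for c, d in distances.items() if d == best)
    target = strongest[0]
    for other in strongest[1:]:
        # At least two strongest characters: merge them into one.
        characters[target].extend(characters.pop(other))
    characters[target].append(stroke_id)
    return target
```

A smaller distance is treated as a stronger association, matching the ascending sort described in the text.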

When input is constrained to rows, strokes in an upper-lower positional relationship are by default attributed to the same character, and only the positional relationship between the stroke and the adjacent characters on its left and right needs to be judged. Similarly, when input is constrained to columns, strokes in a left-right positional relationship are by default attributed to the same character, and only the positional relationship between the stroke and the adjacent characters above and below it needs to be judged.

In actual application, when the attribution of a stroke needs to be judged, the modes described above may be used in combination; for example, relevance judgment mode 1 is used for some strokes, relevance judgment mode 2 for others, and relevance judgment mode 3 for the remaining strokes.

For example, if the currently input stroke is spatially the first or the last stroke in the first target row/column, one of the relevance judgment modes described above may be used to determine whether the stroke is associated with the other characters already entered in the first target row/column, and if it is not associated, a new character is created for the stroke. If the current stroke is spatially neither the first nor the last stroke in the first target row/column, the distances between the currently input stroke and all the characters or strokes that have been input may be compared according to relevance judgment mode 2 or relevance judgment mode 3, and the currently input stroke is attributed to the associated one or more characters based on the result of the comparison.

The first preset threshold, the second preset threshold, the third preset threshold, and the fourth preset threshold may be determined by the user according to their own writing habits, and may also adopt a system default value.

In addition, the system can also provide visual aids to assist automatic segmentation, such as grid-based character segmentation: the character to which the currently input stroke belongs is determined based on the relationship between the stroke and the corresponding composition grid in the current input line.

In this embodiment, composition grids can also be used to determine the attribution of the stroke. Specifically, before the stroke input by the user and the corresponding input information are collected and acquired in step 101A, the first target row/column may be divided into multiple composition grids.

Correspondingly, in step 102A, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, includes:

Determining, according to the input position of the stroke in the first target row/column, the composition grid in which the stroke is located;

Determining whether a character already exists in the composition grid;

If present, the stroke is attributed to the existing character in the composition grid; otherwise, a new character is created in the composition grid, the stroke being attributed to the new character.

Specifically, if the stroke lies within a single composition grid, it is determined whether a character exists in that composition grid; if so, the stroke is attributed to the character in the composition grid, and if not, a new character is created for the stroke, the new character belonging to that composition grid. If the stroke spans at least two composition grids, it is determined whether characters exist in the at least two composition grids: if none of them contains a character, a new character is created for the stroke, the new character belonging to the at least two composition grids; if only one of them contains a character, the stroke is attributed to the character in that composition grid; if several of them contain characters, the characters in those composition grids are merged, and the stroke is attributed to the merged character.
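The composition-grid rules above can be sketched for a horizontal row as follows. The fixed cell width, non-negative x coordinates, and the two dictionaries `cell_to_char` and `characters` are assumptions made for this example; after attribution, every cell the stroke spans is recorded as belonging to the receiving character, which is also an assumption.

```python
def cells_spanned(stroke, cell_width):
    """Indices of the grid cells covered by the stroke's horizontal extent."""
    xs = [x for x, _ in stroke]
    return list(range(int(min(xs) // cell_width), int(max(xs) // cell_width) + 1))


def attribute_by_grid(stroke_id, stroke, cell_width, cell_to_char, characters):
    """Attribute a stroke using composition grids.

    cell_to_char: {cell index: char_id}, characters: {char_id: [stroke ids]};
    both are updated in place. Returns the id of the receiving character.
    """
    cells = cells_spanned(stroke, cell_width)
    existing = sorted({cell_to_char[c] for c in cells if c in cell_to_char})
    if not existing:
        # No character in any spanned cell: create a new character.
        target = max(characters, default=-1) + 1
        characters[target] = []
    else:
        target = existing[0]
        for other in existing[1:]:
            # Several spanned cells hold characters: merge them.
            characters[target].extend(characters.pop(other))
            for cell, ch in cell_to_char.items():
                if ch == other:
                    cell_to_char[cell] = target
    characters[target].append(stroke_id)
    for c in cells:
        cell_to_char[c] = target
    return target
```

For example, with a cell width of 10, two strokes written in cells 0 and 1 create two characters, and a third stroke running from x = 8 to x = 13 spans both cells and merges them into one character.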

Using composition grids to assist in judging the character to which a stroke belongs is not only simple and convenient, but also better constrains the user's input, making the judgment result more accurate.

The above describes how to judge which character a stroke belongs to. Automatic division inevitably produces division errors, such as one character being divided into several characters or several characters being taken as one. However, in this embodiment it is not necessary to recognize characters; the input characters are recognized only when specifically required. This is because, on the one hand, each input character in this embodiment is divided and stored as a glyph object (a non-standard, i.e., handwritten, character); in other words, each segmented input character is treated as a non-standard glyph object. On the other hand, if the handwritten content is ultimately only used for human reading (where retaining the original form of the input matters more), division errors do not need to be corrected.

However, if a character splitting error occurs at a row/column wrap, for example, at the end of a line the input character "的" is erroneously split into the two characters "白" ("white") and "勺" ("spoon"), which are placed in different rows or columns, some way of correcting this erroneous split is required. Likewise, when the user reviews previously entered characters and finds characters that were incorrectly split, they can be corrected in some way.

For the above correction function, the erroneous split can be corrected interactively, and the same effect can be achieved by other feasible methods. This embodiment provides a correction method, which specifically includes:

Obtaining and displaying the boundaries of each locally saved character;

Receiving a correction request input by a user, the correction request including a character to be corrected, or a character to be corrected and a stroke to be corrected;

Performing corresponding correction processing on the character to be corrected according to the correction request.

Specifically, the specific content of the correction request may be different according to different scenarios. In this embodiment, the following scenarios are provided:

Scenario 1: Combining two characters into one, that is, the correction request is a merge correction request, and the character to be corrected is at least two characters to be merged;

Correspondingly, the correcting processing is performed on the character to be corrected according to the correcting request, including:

Combining the at least two characters to be merged into one character.

Scenario 2: splitting a character into a plurality of characters, that is, the correction request is a split correction request, and the character to be corrected is a character to be split;

Splitting one character to be split into at least two characters.

Scenario 3: changing a stroke attributed to one character so that it is attributed to another character, that is, the correction request is an attribution correction request, the character to be corrected is the character to which the stroke is to be attributed, and the stroke to be corrected is at least one stroke to be corrected;

Attributing the at least one stroke to be corrected to the character to which it is to be attributed.
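The three correction scenarios can be sketched as operations on a mapping from character id to stroke ids. The mapping, the function names, and the way a split is specified (the caller names the strokes that move to the new character) are assumptions made for this example; the embodiment leaves the interaction details open.

```python
def merge_characters(characters, ids):
    """Scenario 1: merge the characters in ids into the first of them."""
    target, *rest = ids
    for other in rest:
        characters[target].extend(characters.pop(other))
    return target


def split_character(characters, char_id, strokes_to_move):
    """Scenario 2: move the given strokes of char_id into a new character."""
    moved = [s for s in characters[char_id] if s in strokes_to_move]
    characters[char_id] = [s for s in characters[char_id]
                           if s not in strokes_to_move]
    new_id = max(characters) + 1
    characters[new_id] = moved
    return new_id


def reassign_strokes(characters, stroke_ids, target_id):
    """Scenario 3: attribute the given strokes to target_id instead."""
    for cid in list(characters):
        if cid != target_id:
            characters[cid] = [s for s in characters[cid]
                               if s not in stroke_ids]
    characters[target_id].extend(
        s for s in stroke_ids if s not in characters[target_id])
```

Since only the grouping of stroke ids changes, the recorded input information of each stroke is left untouched by a correction.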

Through the above correction function, characters that have already been split can be re-split through interaction with the user, thereby improving the accuracy of character splitting.

Since each character (possibly a combination of one or more characters or words) has been split into a separate individual during the division of characters, it is easy to distinguish between the characters. Further, since the method provided by the embodiment can also record the stroke order (based on time) of each stroke written by the user and the shape features of the corresponding stroke, it is easy to find, from this information, characters with the same or similar stroke order and stroke shape features, which can be treated as the same character if appropriate threshold conditions are met. This makes matching and searching for characters straightforward, including searching among the characters entered by the user.

In this embodiment, the functions of finding and inserting can also be added.

The search function may specifically include the following steps:

Receiving a search command input by the user, where the search command includes a character to be searched by the user;

The characters to be searched are compared with the locally saved characters according to the number of strokes of the character to be searched and the stroke feature, and characters matching the characters to be searched are obtained.

After the content input by the user is divided by the method provided in this embodiment, the segmented handwritten characters are obtained. On this basis, handwritten text search based on pattern matching can be performed. The main task is to match each character in the search source against the character to be found, one by one. Matching characters can be found by matching the number of strokes and the stroke order.

An exemplary flow for matching a single character based on strokes in the present embodiment is given below:

Determining whether the number of strokes in the character to be searched is the same as the number of strokes in a locally saved character. If they are different, the matching fails. If they are the same, the next step is performed;

Matching the strokes of the character to be searched one-to-one against the strokes of the locally saved character, that is, matching the curves; if any pair of strokes is inconsistent, the final matching result is a failure, and if all pairs are consistent, the final matching result is a success.
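The two-step flow above can be sketched as follows. The curve comparison used here (normalizing each character to its bounding box, resampling each stroke to a fixed number of points, and comparing point by point against a tolerance) is only one possible technique chosen for illustration; the names `resample`, `normalize`, `characters_match`, and the default tolerance are assumptions.

```python
import math


def resample(stroke, n=8):
    """Resample a polyline to n points evenly spaced along its length."""
    lengths = [math.dist(p, q) for p, q in zip(stroke, stroke[1:])]
    total = sum(lengths)
    if total == 0:
        return [stroke[0]] * n   # a dot, or repeated identical points
    out = []
    for k in range(n):
        target = total * k / (n - 1)
        acc = 0.0
        for (p, q), seg in zip(zip(stroke, stroke[1:]), lengths):
            if seg > 0 and acc + seg >= target:
                t = (target - acc) / seg
                out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
                break
            acc += seg
        else:
            out.append(stroke[-1])   # guard against rounding at the end
    return out


def normalize(character):
    """Translate and scale a character so its bounding box starts at (0, 0)."""
    pts = [p for s in character for p in s]
    x0 = min(x for x, _ in pts)
    y0 = min(y for _, y in pts)
    size = max(max(x for x, _ in pts) - x0, max(y for _, y in pts) - y0) or 1.0
    return [[((x - x0) / size, (y - y0) / size) for x, y in s] for s in character]


def characters_match(query, saved, tol=0.2):
    # Step 1: the number of strokes must be the same.
    if len(query) != len(saved):
        return False
    # Step 2: compare strokes one-to-one, in stroke order.
    for a, b in zip(normalize(query), normalize(saved)):
        if max(math.dist(p, q) for p, q in zip(resample(a), resample(b))) > tol:
            return False
    return True
```

Because the strokes are compared in the order they were written, two characters with the same shape but a different stroke order would not match under this sketch.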

Of course, any character analysis or other matching method in the prior art can be used to implement the character search function, which is not limited in this embodiment. The function of replacing characters can also be implemented based on the same principle as the search function, and will not be described here.

In this embodiment, the insertion function of the handwritten text input editing may specifically include the following steps:

Receiving an insertion request input by a user, the insertion request including a target row/column to be inserted, a to-be-inserted position in the target row/column to be inserted, and a character to be inserted;

Activating the target row/column to be inserted, and inserting the character to be inserted into the to-be-inserted position;

The characters after the position to be inserted are adjusted accordingly.

To insert new characters in the middle of existing content, an explicit command is needed to enter and exit the insert mode, rather than inserting automatically as in traditional character input. In addition, since the inserted characters can be handwritten characters, standard characters input using a keyboard, or non-standard characters input using other input devices, corresponding insertion control or switching instructions, as well as identification and editing instructions for the inserted content, are also required.

If the user needs to add a character at a position in a line that has become inactive, for example, to insert a character between the 3rd and 4th characters of a line, the user first needs to activate the line, and the system provides, at the blank characters in the line, an auxiliary interface that accepts user input. The user activates the auxiliary interface between the 3rd and 4th characters of the line and chooses to perform the insertion operation at that character gap.

Insertion can be performed before or after any character. For handwriting systems, insertion can be further constrained to blank characters. FIG. 1E is a schematic diagram of a state in which a character is inserted in an embodiment of a method for processing handwritten input characters according to the present invention. As shown in FIG. 1E, after entering the insert editing state, the existing characters after the insertion position can be moved to the next line, and the span from the insertion position to the end of the current line becomes space for writing. The inserted line is marked with a right arrow, and clicking the right arrow exits the insertion state. Before the insertion is complete, the user can only input between the two insertion markers.

The characters before the insertion position and the characters after the insertion position are read-only (but selectable) until the end of the insertion. After the insertion is complete, the lines are re-wrapped according to the inserted characters. The last inserted line can be extended (at the start of insertion, the last inserted line is the line in which the insertion begins), and the extended line becomes the new last inserted line. In theory, insertions can be nested, that is, an insertion can be made within an insertion. Inserted lines have a different visual state from normal lines to help users see the current editing state.

In addition to the above search and insertion functions, other processing can be performed on the characters handwritten by the user, and the processing may include the following steps:

Collecting and acquiring at least one character selected by the user;

Receiving a selection processing command input by the user, and performing a processing operation on the at least one character according to the selection processing command;

The selection processing command includes any one or a combination of the following: performing copy processing on the at least one character, performing cut processing on the at least one character, performing replacement processing on the at least one character, and performing merge processing on the at least one character.

FIG. 1F is a schematic diagram of an editing mode under a selection processing command in an embodiment of a method for processing handwritten input characters according to the present invention. As shown in FIG. 1F, functions such as inserting, pasting, selecting all, selecting, and merging can be displayed on the handwriting input screen to facilitate the user to perform corresponding operations.

In addition, the embodiment may also insert or add a stroke, a comment, or delete some characters or the like on the input character. The functions of searching, inserting, and copying provided in this embodiment can effectively avoid the disadvantages of the existing handwriting input system being less intuitive and difficult to modify.

On the basis of the technical solutions provided by the foregoing embodiments, it is preferable that the number of the first target rows/columns is plural;

The active areas corresponding to the plurality of the first target rows/columns do not overlap and are not in contact with each other.

In this case, multiple users can input in the active areas corresponding to the plurality of first target rows/columns, respectively, satisfying the function that the large-size handwriting input screen allows multiple people to simultaneously input.

Based on the technical solutions provided by the foregoing embodiments, it is preferable that the embodiment is compatible with the existing keyboard, mouse, and other existing input devices, and the hybrid input is implemented by performing mode switching. The mode switching method in this embodiment may specifically include:

Receiving a mode switching request input by a user, where the mode switching request includes a target mode;

The handwriting mode is switched to the target mode, and in the target mode, at least one standard character input by the user is received.

The target mode may be a keyboard input mode, a mouse input mode, or other existing input modes. For example, a mixed typesetting can be implemented by adding standard code characters or inserting other symbols or information into the input limits of a row or column in combination with an existing keyboard (see handwritten text mixing in the example of the present application).

In particular, other connected input devices, such as a keyboard, can be activated by means of appropriate touch buttons or operations (e.g., clicks) to allow the user to freely switch between handwriting input and other conventional input devices such as a keyboard. For the division of the keyboard input content, the division form of a standard code may be used, or the division manner of characters in the present invention may be used.

In addition, during the handwriting input process, the active area can also automatically move with the user's input. For example, the active area is always repositioned with the position of the user's last stroke as the midpoint of the active area. In this way, in most cases, the active area will automatically move as the user writes, so that the location of the active area does not need to be manually set.

In the traditional standard code input state, the system shows a flashing cursor to indicate the current input position. In the handwritten text input state, the system displays the active area to indicate the range in which input is currently possible. When the user switches input modes, the two can be converted to each other according to certain rules. For example, when switching from standard character input to handwriting input, the system sets the position of the active area with the cursor position as the midpoint of the active area; when switching from handwriting input to standard character input, the character position closest to the midpoint of the active area is set as the current input position.
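The conversion rule in this example can be sketched in one dimension along the row. The function names, the fixed active-area width, and the list of candidate character positions are assumptions made for this illustration.

```python
def active_area_from_cursor(cursor_x, area_width):
    """Standard input -> handwriting: center the active area on the cursor."""
    half = area_width / 2
    return (cursor_x - half, cursor_x + half)


def cursor_from_active_area(area, char_positions):
    """Handwriting -> standard input: pick the character position closest
    to the midpoint of the active area and return its index."""
    mid = (area[0] + area[1]) / 2
    return min(range(len(char_positions)),
               key=lambda i: abs(char_positions[i] - mid))
```

For instance, a cursor at x = 100 with an active-area width of 60 gives the area (70, 130); converting back with candidate positions 0, 40, 90, and 140 selects the position at 90.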

On the basis of the technical solutions provided by the foregoing embodiments, it is preferable to introduce the concept of control characters to solve the problem of typesetting and editing handwritten text content. Control characters exist in standard code character sets (such as ASCII). Similarly, the concept of control characters can be introduced into handwritten text, which makes the output and processing of handwritten text content more convenient and flexible.

Specifically, the control characters may be standard control characters, such as spaces, tabs, line breaks, and the like; or non-standard control characters, such as blank characters. The standard control characters are similar to the prior art. The blank characters are described below, taking the sixth embodiment as an example.

In addition, this embodiment provides the function of blank characters. Specifically, in this embodiment, the spacing information between characters can be retained, for example, the size of the gap between left and right characters in a horizontal layout, or the size of the gap between upper and lower characters in a vertical layout, and the blank gap can be directly created as a blank character carrying blank spacing information.

For the characters handwritten by the user, when the writing style is from left to right and top to bottom, the horizontal baseline of the target line in which the character is located may be taken as the horizontal baseline of the character, and the position of the leftmost part of the character (such as a graphic, image, or stroke) is set as the starting position of the character. The position of each part of the character is recorded relative to the baseline and the starting position, with the typesetting direction as the positive direction. In this way, the same character content can appear at different positions in the text. As long as the origin coordinates of the character are correctly calculated from the line in which the character lies and the position of the character in that line, all of its internal parts can be drawn correctly. Similarly, for other writing styles, the starting position of each character can be set in a similar manner, and the positions of the character's internal parts are expressed as coordinates relative to that starting position.

These starting positions are only needed when the characters are drawn. When the divided characters are stored, the starting position is not stored. However, the spacing between characters that is associated with them is separated out to form blank characters, which are stored in the character sequence corresponding to the text.

FIG. 1G is a schematic diagram of a blank character in an embodiment of a method for processing handwritten input characters according to the present invention. As shown in FIG. 1G, in this embodiment, a custom blank character is introduced, and the word spacing is saved as its parameter/content. The numbers 12, 16, and 10 in FIG. 1G are the numerical values of the blank characters, indicating the length of each blank character. Blank characters can be treated differently during analysis and processing (such as recognition and line wrapping). Similarly, time-based blank characters can be added to the text of voice input.

In general, the maximum coordinate of a character entered by the user along the layout direction is the width of the character. The character width may be stored, or it may be left unstored and recovered from the position information of all internal parts of the character. When formatting text, once the width information of all characters (including control characters) is obtained, the starting positions of all characters in the row/column can be restored, providing a basis for further text rendering.
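As a small worked example of restoring starting positions from widths, the sketch below lays out a horizontal row in which handwritten characters alternate with blank characters. The blank lengths 12, 16, and 10 are the values shown in FIG. 1G; the character widths 30, 28, and 25 and the function names are hypothetical values chosen for illustration.

```python
def char_width(strokes):
    """Width recovered from internal positions: the maximum coordinate along
    the layout direction, relative to the character's starting position."""
    return max(x for s in strokes for x, _ in s)


def start_positions(widths, origin=0):
    """Starting position of each item in a row, given all widths in order
    (handwritten characters and blank characters alike)."""
    positions = []
    x = origin
    for w in widths:
        positions.append(x)
        x += w
    return positions


# Row: char(30), blank(12), char(28), blank(16), char(25), blank(10)
row_widths = [30, 12, 28, 16, 25, 10]
row_starts = start_positions(row_widths)
```

Only widths are needed for this step, which is why the starting positions themselves do not have to be stored with the characters.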

In this embodiment, standard control characters and blank characters are introduced. These control characters have models, codes, glyphs, and meanings similar to those of the characters handwritten by the user. Therefore, the theory, methods, and tools for processing handwritten input characters can be applied directly or indirectly to control characters. Further, the characters handwritten by the user and the control characters can be mixed and processed together. On this basis, the splitting of characters becomes even more significant.

The object processed in this embodiment may be a stroke character, a standard character, a graphic character, a combined character or a control character input by the user, or may be a mixture of a plurality of characters.

FIG. 1H is a flow chart of text editing in an embodiment of a method for processing handwritten input characters according to the present invention. As shown in FIG. 1H, the text editing in this embodiment may specifically include the following steps:

Step 601A: Determine the open mode: if the existing document is opened, step 602A is performed; if the new document is created, step 603A is performed.

This embodiment is mainly used to provide personalized handwritten character input for related documents, and there are mainly two ways of entering the handwriting input system: a method with document data and a method without document data. The former is to open an existing document, and the latter is to create a new document.

Step 602A, loading document data and performing typesetting according to the typesetting constraint, and executing step 604A.

Specifically, the related data of the characters may be loaded hierarchically. For example, when typesetting characters, all that is required is the width of each character (or its height, for column-based layout), so in this step only the width information of the characters needs to be loaded. Other information, such as stroke information or contour information used for drawing, can be loaded on demand later, which saves system resources (memory, network traffic, etc.).

Step 603A, initializing the handwritten document, and executing step 604A.

Step 604A: Initialize (empty) the sequence of handwritten text objects representing the character input lines.

The sequence of handwritten text objects representing the character input lines is hereinafter referred to as AL (Active Line), and AL is the core data to be processed in the method provided in this embodiment.

Step 605A, presenting the document content, and performing step 606A.

The presented content includes multiple parts: visual information of the document itself (including visual information of handwritten characters, such as the position and shape of characters), visual information of the document presentation environment (such as background, shading, paper border, etc.), Visual information related to document editing (such as selected area, cursor or active area indicating input focus, auxiliary lines, etc.). It is mentioned in step 602A that the visualized data of the handwritten characters must be loaded when it needs to be presented. For characters that do not need to be rendered, their corresponding visualization data may not be loaded.

Similar to a conventional data processing system, in this embodiment, after the character stream is loaded from the storage area into memory, it needs to be typeset before being displayed. For simple unformatted text, typesetting here refers to line breaking.

Specifically, a line is broken at a paragraph mark/newline (hard return). Within each row/column, the position of each character is calculated and the total length of the text content is accumulated; when the position exceeds the maximum position of the line, the line is broken (soft return). The break is placed at the last breakable position.

A series of rules determines the positions at which a line can be broken:

A line can be broken after a punctuation mark (a punctuation mark cannot be the first character after a soft return);

A line can be broken at blank spacing (blank characters, tabs, etc.), and the first character of the next line is the following non-blank character (a blank character cannot be the first character after a soft return);

A line can be broken directly before or after an East Asian character;

A line cannot be broken directly in the middle of an English word (in a simple system, the whole word is moved to the next line; in a more complex system with a recognition function, the line can also be broken according to the syllables of the word, with a hyphen added);

A line can be broken directly before or after a handwritten character.

In a practical implementation, whitespace characters can be converted to blank spaces with standard lengths. Continuous blank spaces can be merged directly, so the typesetting algorithm is much simpler. Blank spacing is handled in the same way as whitespace characters.
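The hard-return and soft-return behavior described above can be sketched as a greedy procedure. The sketch below is a simplification under stated assumptions: every unbreakable item (an English word, an East Asian character, or a handwritten character) is assumed to have been reduced to a "unit" with a width, the punctuation rule is omitted, and the names `Token` and `break_lines` are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, float]  # (kind, width); kind is "unit", "blank" or "newline"


def break_lines(tokens: List[Token], max_width: float) -> List[List[Token]]:
    """Greedy line breaking over unbreakable units, blanks and hard returns."""
    lines: List[List[Token]] = [[]]
    used = 0.0
    for kind, width in tokens:
        if kind == "newline":
            # Hard return: always start a new line.
            lines.append([])
            used = 0.0
        elif kind == "blank":
            lines[-1].append((kind, width))
            used += width
        else:
            if lines[-1] and used + width > max_width:
                # Soft return: the blanks before the break stay behind and
                # are dropped, so the new line starts with a non-blank unit.
                while lines[-1] and lines[-1][-1][0] == "blank":
                    lines[-1].pop()
                if lines[-1]:
                    lines.append([])
                used = 0.0
            lines[-1].append((kind, width))
            used += width
    return lines


tokens = [("unit", 4), ("blank", 1), ("unit", 4), ("blank", 1),
          ("unit", 4), ("newline", 0), ("unit", 3)]
for line in break_lines(tokens, max_width=10):
    print(line)
```

In this sketch a unit wider than the line is still placed on a line of its own, which is one possible design choice; the embodiment does not specify how such a case is handled.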

The document model after typesetting includes information for each display line. The line includes words with position (including characters, East Asian characters, and handwritten characters). Blank characters do not need to appear in this model, and the relevant information is implicit in the position attribute of the word (left border, right border (left border + width)). Therefore, blank characters (including white space, standard white space, tab characters, etc. caused by handwriting pitch) can be discarded after typesetting.

For the document model after typesetting, the spacing information between characters is implicit in the coordinates of the characters. For example, suppose that in a line the left end of a character has a coordinate of 12 and the character width is 2.5, and the left end of the next character is at 16. The spacing between the two characters can then be calculated as 16 - 12 - 2.5 = 1.5. The text in each line changes with the user's input: strokes entered or erased by the user may change the spacing of characters or generate new characters. As long as the character coordinates are correct, the spacing is correct as well. Blank characters need to be calculated, generated, and inserted at the appropriate positions only when the edited content is to be stored.
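The generation of blank characters at storage time can be sketched as follows, using the coordinates from the example above. The tuple layout `(code, left, width)`, the function name `to_stored_sequence`, and the decision to emit a blank for the gap before the first character are assumptions made for illustration.

```python
from typing import List, Tuple

PositionedChar = Tuple[str, float, float]  # (code, left, width), hypothetical


def to_stored_sequence(line: List[PositionedChar]) -> List[Tuple[str, object]]:
    """Rebuild the stored sequence, inserting a blank wherever there is a gap."""
    stored: List[Tuple[str, object]] = []
    cursor = 0.0  # right edge of the previous character
    for code, left, width in line:
        gap = left - cursor
        if gap > 0:
            stored.append(("blank", gap))
        stored.append(("char", code))
        cursor = left + width
    return stored


# Character at 12 with width 2.5, next character at 16: the gap is 1.5.
line = [("g1", 12.0, 2.5), ("g2", 16.0, 3.0)]
print(to_stored_sequence(line))
# [('blank', 12.0), ('char', 'g1'), ('blank', 1.5), ('char', 'g2')]
```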

Step 606A, receiving the command, and performing different operations according to the command.

The commands here can be commands entered by the user, or they can be system commands or commands passed by other application systems.

There are various ways to send commands. Commands can be sent directly through traditional interactive devices, or they can be sent by gesture. For example, when it is recognized that the user has drawn a horizontal line through several consecutive characters in the horizontal direction, the input can be recognized as a gesture whose operation is to delete these characters. Commands can also be issued automatically through certain settings, such as automatically starting handwriting input after a document is created or opened, and automatically ending handwriting input after content is selected.

Specifically, if the command is a text encoding typesetting command, step 607A is performed; if the command is a command to start handwriting input, step 608A is performed; if the command is a command to end handwriting input, step 610A is performed; if the command is a system exit command, step 612A is performed.

Step 607A: Typesetting the text content according to the command.

In the process of storing character information, the typesetting constraint and the typesetting direction can also be stored in the information of each character. Thus, when the same character appears in the text of different typesetting modes, the internal relative position of all the characters in the current typesetting mode can be adjusted according to this information, thereby correctly drawing the character.

The following two examples illustrate the mutual conversion of different typesetting methods.

One example is using characters originally written horizontally in a vertical layout, or vice versa. Horizontally typeset characters advance according to their width (that is, the line length is accumulated from left to right along the typesetting direction), and vertically typeset characters advance according to their height. Therefore, in a specific implementation, it is necessary to distinguish between horizontal characters and vertical characters. For horizontal characters, an internal coordinate system may be used in which the line baseline (alignment line) is the horizontal axis and the vertical axis passes through the leftmost stroke point. For vertical characters, an internal coordinate system may be used in which the column axis and the highest stroke point define the axes in the corresponding way. In this way, each character keeps its original alignment when drawn in its own layout. When horizontal text is changed to vertical, or vertical to horizontal, the system can use the typesetting meta information of each character to perform the coordinate conversion automatically. Although the original alignment between characters cannot be preserved, each character can still be rendered normally.

Another example is changing text layout into ordinary typesetting. In text layout, the layout type is marked in the type information of each character, and the origin of the internal coordinate system of each character can then be the lower left corner of the corresponding cell (or in fact any point, such as the center point). Thus, each character is aligned with its corresponding cell. Handwritten text in the text layout contains no character-spacing blank characters (although it may contain space characters). When text layout is changed into ordinary typesetting, each character can be recalculated in a new coordinate system (such as the system described above, whose origin is the intersection of the baseline and the leftmost point), and the corresponding spacing characters can be inserted between the characters according to the new coordinate system.

Step 608A, activate the target row/column, and perform step 609A.

In this step, the target row/column can be activated, and the text object in the target row/column is activated (loading stroke information), and the object sequence is assigned to AL.

In this embodiment, the input of the handwritten characters is performed under the constraint of the row/column. Even if the input spans multiple rows/columns, the corresponding characters must eventually be stored in a specific location on a particular row. Therefore, the target row/column of character input can be presented in a visual manner, and the user can also avoid cross-line input through specific settings, such as auxiliary panel, full-screen line editing, and the like.

Step 609A: Perform handwriting input under the constraint of the activated target row/column, and return to step 605A.

In this step, handwriting input can be performed under the constraint of the activated target row/column, and each stroke input is automatically combined with the AL according to a certain rule to form a new sequence of handwritten characters (ie, the AL is updated).

The input process of the handwritten characters mainly consists of automatically combining the input strokes into different characters according to the spatial constraints in the row/column. For the implementation, refer to the foregoing embodiment; the character spacing effect can be realized by the character spacing constraint or the text constraint.

Step 610A: Store the content of the character objects in the AL, and execute step 611A.

In this step, the content of the character objects in the AL is stored, and if necessary, the text content related to the AL can be typeset again.

At the end of the handwritten character input, the character objects in the AL are finalized (previously they changed dynamically with stroke input). Some of these character objects have not changed, some have changed content (strokes), and some are brand new characters. Both changed characters and brand new characters are treated as new characters. The sequence of characters corresponding to the final AL needs to be written back to the corresponding position in the document. If a storage method that splits encoding and content is used here, the content of each new character first needs to be stored in the encoding library to obtain the corresponding code. The new code sequence is then saved to the appropriate location in the document (typically the in-memory document model).
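A minimal sketch of this storage step is given below, under several assumptions that are not part of the embodiment: the encoding library is modeled as an in-memory dictionary that assigns integer codes, identical content is assumed to reuse an existing code, and the names `CodeLibrary`, `CharObject`, and `commit_active_line` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


class CodeLibrary:
    """Toy encoding library mapping character content to an integer code."""

    def __init__(self) -> None:
        self._by_content: Dict[bytes, int] = {}
        self._next_code = 1

    def store(self, content: bytes) -> int:
        """Return the existing code for this content, or assign a new one."""
        if content not in self._by_content:
            self._by_content[content] = self._next_code
            self._next_code += 1
        return self._by_content[content]


@dataclass
class CharObject:
    content: bytes              # stroke data, kept opaque here
    code: Optional[int] = None  # None marks a new or changed character


def commit_active_line(al: List[CharObject], library: CodeLibrary,
                       document: Dict[int, List[int]], line_no: int) -> None:
    """Store new characters, write the code sequence back, then clear the AL."""
    codes = []
    for obj in al:
        if obj.code is None:          # new or changed: obtain a code first
            obj.code = library.store(obj.content)
        codes.append(obj.code)
    document[line_no] = codes         # update the in-memory document model
    al.clear()                        # corresponds to clearing the AL


library = CodeLibrary()
known = library.store(b"strokes-A")
al = [CharObject(b"strokes-A", known), CharObject(b"strokes-B")]
document: Dict[int, List[int]] = {}
commit_active_line(al, library, document, line_no=0)
print(document)  # {0: [1, 2]}
```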

Since this handwritten character method uses a row/column space constraint, in general the length of the row/column does not change. However, at the end of editing that inserts content or extends a line (soft return), the layout information of the current line and the subsequent lines needs to be updated, that is, the text is typeset again starting from the current line.

In step 611A, the AL is cleared, and the process returns to step 605A.

After the handwriting input is finished, there is no target row/column of handwriting input, and the corresponding data structure can be cleared.

Step 612A, the end.

The processing method of the handwritten input word provided by the embodiment facilitates the user to edit and process the handwritten character, thereby further improving the user's input experience.

In addition to editing, formatting, and the splitting, merging, recognition, insertion, searching, and replacement of characters, this embodiment may also perform other processing on the document content, such as saving and printing the document, as well as processing operations specific to handwritten character input, such as, but not limited to, the following examples.

In order to come closer to the effect of writing on paper, reference may also be made to the scroll bars in existing conventional text editing tools or software. In this embodiment, corresponding row and column scrolling rulers are provided to expand the input range of the panel, that is, the input space of the rows and columns, upward, downward, to the left, or to the right. Also, when a ruler is moved, the corresponding target row/column can be displayed and/or activated accordingly.

It is also possible to associate the line height at the time of handwriting input with the size of a specific font size of the standard font, thereby standardizing or adjusting the font size of the handwritten input word.

It is also possible to discard the blank information between the characters after the characters are recognized, and even to selectively discard the partial character spacing information and the position information, thereby saving a certain storage space.

The function of encoding can also be added in this embodiment.

Specifically, the coding function in this embodiment may include:

Receiving an encoding request, and determining a glyph corresponding to the handwritten character in the handwriting input program according to the encoding request;

Query the mapping table in the encoding warehouse to obtain the standard language parameters corresponding to the glyphs.

Wherein, the standard language parameters include one or several combinations: numbers, symbols, keywords, public identifiers, and private identifiers.

This embodiment can implement the function of encoding characters generated during the handwriting input process, which will be described in detail below.

In the present invention, the input text or data object is abstracted into the concept of a "character". A character can refer to a handwritten character of an ideographic script, such as a single character of Chinese, Japanese, Korean, Arabic, Tibetan, Burmese, etc., or a part thereof (such as a radical); or to handwritten text of a phonetic script, such as Western letters or words in English, German, French, Russian, Spanish, etc. It can also be a computer character based on a traditional standard code, such as an ASCII character or a Unicode character or string, or even a control character such as a space, tab, or line break, or another special character. It can also refer to a non-standard control character, such as the spacing between handwritten characters described herein; to a character or string that mixes and/or combines handwritten characters and standard characters; or to any graphic or image input by the user, such as a "heart" pattern, a photo, or any graffiti, or any other written expression. In the input scheme or system of the present invention, all character objects input in the above manner are treated as characters with non-standard glyphs.

The glyphs referred to in the present invention are similar to the concept of characters in a standard font, except that the present invention generates non-standard glyphs. Since the object of the present invention is not to generate a standard font or font, the resulting glyphs of the system of the present invention are likely to include erroneous splitting of various characters or words or merging between them, and may also include user input. Any graphics or images, etc.

Modern high-level programming languages can be divided into two types according to how they are executed: compiled and interpreted. In the former, the source code goes through a series of compilation and conversion steps to produce a binary file that encapsulates the instruction sequence of the target machine (which can be a virtual machine), and the binary file needs to be loaded into the target system for execution. In the latter, an interpreter running in the target system reads the source code and runs it directly through a series of internal processing steps.

Languages based on interpretation are generally called scripting languages, typically JavaScript, Lua, Tcl, and so on. Many traditional programming languages are compiled languages, such as C, C++, Objective-C, Java, C#, Go, Swift, and so on. Some languages support both modes, such as Python, Ruby, Lua, Haskell, Scheme, F#, etc.

The core processors of program source code, whether compilers or interpreters, have very similar, even identical, front-end structures. The so-called front end converts source code into an internal intermediate form. Correspondingly, for a compiler, the back end converts the intermediate form into machine code, while for an interpreter, the intermediate form is executed by an execution engine. In some systems, there is also processing and optimization of the intermediate form, which is called the middle end. The focus here is on the front end, so in general no distinction is made between compilation and interpretation, and the front end is collectively referred to as the compilation front end.

The compilation front end generally includes four processes: lexical scanning, syntax analysis, semantic analysis, and intermediate code generation. The lexical scanner converts the source code into a token stream; the parser converts the token stream into an abstract syntax tree; semantic analysis annotates the abstract syntax tree with semantic information; and the intermediate code generator converts the annotated abstract syntax tree into the compiler's intermediate form.

In a programming environment, in addition to the core processor of the source code (compiler/interpreter), there are other supporting systems/platforms and tools, such as a code editor for entering and modifying source code, a debugger for debugging code execution, and source control tools for managing code versions.

The so-called Integrated Development Environment (IDE) is the integration of all these systems and tools to provide an integrated application interface.

For the programming environment of a handwritten text system, the handwritten text system brings a new way of text input that is safe and convenient. However, the result of input and editing is still a character stream, encoded not in a standard code but in the personal code of the writer.

For handwritten text, a special programming language can be designed; alternatively, the glyph matching service in the handwritten text system can be used to generate source code based on standard codes. With the latter, a large number of existing programming environments and tools can be reused directly. This embodiment mainly illustrates the latter scheme.

In fact, the solution is quite straightforward: the personal proprietary encoding is converted into standard encoding. That is to say, the handwritten source code is converted into source code that can be recognized by an ordinary compilation front end. Therefore, a conversion process is placed before the traditional compilation front end to process the handwritten source code, so that the entire process generally includes five processes: handwritten source code conversion, lexical scanning, syntax analysis, semantic analysis, and intermediate code generation.

This encoding conversion process mainly converts and matches the handwritten source code according to established rules and generates the corresponding standard code content, a process that depends on the glyphs in the font library. The process is mainly divided into two parts: control character conversion and glyph conversion.

For control character conversion, the control characters in the programming language mainly include spaces, tabs, carriage returns, line feeds, and so on. Since the handwritten text can use the same or similar control characters as the normal text, this conversion is very straightforward. For example, the handwritten interval code is directly converted into a standard blank character. If the handwritten line break uses a standard line feed code directly, it can be retained without conversion.

Glyph conversion mainly converts the personalized glyph codes in the handwritten source code into the corresponding standard codes. The basis of this conversion is the glyphs in the corresponding text font library, and the glyph matching service of the handwritten text system is needed here. Glyph conversion includes digital symbol mapping, keyword mapping, interface identifier mapping, and private identifier generation and mapping.

About digital symbol mapping: The source code for most high-level programming languages exists as text files. The main difference with ordinary text content is the grammatical constraints. This constraint is embodied in strict keyword and grammatical symbol restrictions.

The digital symbol mapping is based on the user-defined glyph digital symbol mapping table, and the glyph search matching is performed in the handwritten source code, and replaced with the corresponding standard code numbers and symbols. The symbols referred to herein refer to punctuation marks used in programming languages, such as addition, subtraction, multiplication and division, greater than, equal to, less than symbols, various brackets, and the like.

It can be seen that the glyph digital symbol mapping table is the key to digital symbol mapping. This table is a personalized setting. Everyone's writing habits, strokes, and glyphs are different, so searching and matching make sense only among the glyphs of the same person. Therefore, each programmer has his or her own glyph digital symbol mapping table, which can map only the handwritten source code written by that programmer. In a team software development environment, a programmer needs to authorize specific users/accounts to share his or her glyph digital symbol mapping table before the handwritten source code can be compiled/run by others. In fact, this extends the security of handwritten text to the software development/running process.

Because handwriting is variable, the glyph digital symbol mapping table can be a many-to-one mapping. In other words, multiple glyphs can correspond to the same number or symbol.

Because program source code remains valid for a long time, the glyph digital symbol mapping table of a specific user for a specific programming language should in principle only be added to, and not deleted from or modified. Moreover, its entries must not conflict with each other; for example, the same glyph is not allowed to correspond to different numbers or symbols.
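The properties of the mapping table stated above (many-to-one, additions only, no conflicting entries) can be sketched as a small data structure. The class name `GlyphSymbolTable`, the use of integers as glyph codes, and the sample entries are assumptions for illustration only.

```python
from typing import Dict, Optional


class GlyphSymbolTable:
    """Append-only, many-to-one map from glyph codes to numbers/symbols."""

    def __init__(self) -> None:
        self._map: Dict[int, str] = {}

    def add(self, glyph: int, symbol: str) -> None:
        """Add an entry; reject a glyph that already maps to another symbol."""
        existing = self._map.get(glyph)
        if existing is not None and existing != symbol:
            raise ValueError(f"glyph {glyph} already maps to {existing!r}")
        self._map[glyph] = symbol

    def lookup(self, glyph: int) -> Optional[str]:
        """Return the mapped number/symbol, or None if the glyph is unknown."""
        return self._map.get(glyph)

    # Deliberately no remove() or update(): entries are only ever added.


table = GlyphSymbolTable()
table.add(101, "1")
table.add(102, "1")  # a second way of writing "1": many-to-one is allowed
table.add(103, "+")
print(table.lookup(102))  # 1
```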

Unlike keywords and identifiers, numbers and symbol characters in standard codes are not composed of characters of the alphabet. Therefore, during lexical scanning in the compilation front end, symbol characters are often handled specially: a symbol can directly terminate the preceding lexical token, and an identifier often cannot start with a numeric character. Similarly, a special convention is also needed for handwriting in order to facilitate processing. For example, it can be agreed that numbers and symbols can correspond only to single glyphs, and not to combinations of multiple glyphs.

Due to the particularity of the symbols, the glyph digital symbol mapping table is generally predefined by the user.

About keyword mapping: Like digital symbol mapping, keyword mapping also maps glyphs to standard codes on the basis of a mapping table. This mapping table, the glyph keyword mapping table, is a personal, many-to-one table.

Keywords are also crucial for the recognition and parsing of programming languages. Keywords determine the location and number of related syntax elements. Therefore, the content of the glyph keyword mapping table is generally pre-defined by the user, and can also be interactively performed during handwriting source conversion.

Unlike digital symbol mapping, keyword mapping allows one keyword to correspond to a combination of multiple glyphs, that is, different combinations of the same glyphs can correspond to different keywords.
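A keyword table keyed by glyph combinations might look like the sketch below. It assumes that the handwritten source has already been divided into glyph groups by the character spacing, that a glyph code is an integer, and that order-sensitive tuples are used as keys; the names `KeywordTable` and `map_keywords` are hypothetical.

```python
from typing import Dict, List, Tuple, Union

GlyphGroup = Tuple[int, ...]
KeywordTable = Dict[GlyphGroup, str]


def map_keywords(groups: List[GlyphGroup],
                 table: KeywordTable) -> List[Union[str, GlyphGroup]]:
    """Replace glyph groups found in the table; leave other groups untouched."""
    return [table.get(group, group) for group in groups]


table: KeywordTable = {
    (7,): "if",
    (9,): "if",         # another glyph for the same keyword (many-to-one)
    (7, 8): "else",     # a keyword written as a combination of two glyphs
    (8, 7): "while",    # the same glyphs in a different combination
}

print(map_keywords([(7, 8), (8, 7), (5,)], table))
# ['else', 'while', (5,)]
```

Groups that are not keywords, such as `(5,)` here, are left for the later identifier mapping stages.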

About interface identifier mapping: Similarly, interface identifier mapping also maps glyphs to standard codes. The key here is also a mapping table - glyph identifier mapping table. For traditional high-level programming languages, there are more or less built-in or third-party libraries. We need to use the corresponding identifiers to access system constants, system functions, standard library functions, class libraries, and so on. These identifiers are often composed of standard code characters. The glyph identifier mapping table is a mapping table between the user's handwriting and the corresponding identifier. In addition, some of the symbols in the handwritten code may also become interfaces - used and accessed by others, in which case we also need to provide the corresponding standard code identifier.

In the glyph keyword mapping table, for a particular programming language, the set of target keywords (including system punctuation) is a well-defined, closed, finite set. In the glyph identifier mapping table, the set of target identifiers is an unbounded, open set, which grows as the number of system/external interfaces accessed by the user increases and as the number of interfaces provided to the outside increases.

Like the glyph keyword mapping table, the content of the glyph identifier mapping table can be predefined by the user or defined interactively during handwritten source conversion.

In fact, we can also put common strings and code snippets into this mapping table and correspond to them with a suitable sequence of glyphs. This will increase programming efficiency and improve program readability.

About private identifier generation and mapping: There are two cases in which private identifiers appear in the source code, one is a definition or a declaration, and the other is a reference. The code conversion for the defined symbol is for the user-defined or declared private symbol (non-interface symbol), which is automatically generated according to the established rules of the system. This standard code identifier does not need to have a specific literal meaning. It only needs to guarantee the uniqueness of the identifier, that is, different glyphs generate different standard code identifiers.

The encoding conversion for reference symbols is actually similar to the conversion based on the mapping table above, except that this mapping table is automatically generated by the system. The content of this mapping table is the correspondence between the glyphs of the above defined symbols and the corresponding generated standard code identifiers.
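Private identifier generation and the automatically generated mapping table can be sketched as follows. The naming rule (`_hw` followed by a counter) is only one possible "established rule" and is an assumption, as are the class name `PrivateIdentifiers` and the tuple representation of glyph sequences.

```python
from typing import Dict, Tuple

GlyphGroup = Tuple[int, ...]


class PrivateIdentifiers:
    """Generates unique standard-code identifiers for private handwritten symbols."""

    def __init__(self, prefix: str = "_hw") -> None:
        self._prefix = prefix
        self._table: Dict[GlyphGroup, str] = {}  # the generated mapping table

    def define(self, glyphs: GlyphGroup) -> str:
        """Called at a definition/declaration: generate an identifier once."""
        if glyphs not in self._table:
            self._table[glyphs] = f"{self._prefix}{len(self._table) + 1}"
        return self._table[glyphs]

    def reference(self, glyphs: GlyphGroup) -> str:
        """Called at a reference: look up the identifier generated earlier."""
        return self._table[glyphs]  # raises KeyError if never defined


ids = PrivateIdentifiers()
counter = ids.define((21, 22))
total = ids.define((23,))
print(counter, total, ids.reference((21, 22)))  # _hw1 _hw2 _hw1
```

The generated names carry no literal meaning; only their uniqueness matters. A real implementation would also need to make sure the chosen prefix cannot collide with identifiers already present in standard-code parts of the source.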

In our handwritten text scheme, handwritten text encoding and standard encoding are allowed to be mixed in the same content. Such content is also allowed in handwriting programming. During source code conversion, the parts in standard code are skipped directly and no conversion is performed. Here, in order to prevent interference between the standard code generated from the handwritten text and the original standard code, a blank character needs to be inserted during conversion between standard text and non-control-character handwritten text that are directly adjacent to each other.
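The blank insertion rule can be sketched as follows. It is assumed that the content has already been divided into segments labeled by origin ("std" for original standard code, "hw" for text converted from handwriting) and that a converted control character appears as whitespace; the function name `join_converted` and the labels are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[str, str]  # (origin, text); origin is "std" or "hw"


def join_converted(segments: List[Segment]) -> str:
    """Concatenate segments, separating adjacent std/hw text with one blank."""
    out: List[str] = []
    prev_origin = None
    prev_text = ""
    for origin, text in segments:
        boundary = prev_origin is not None and origin != prev_origin
        if boundary and prev_text and text:
            # Insert a blank only if neither side already supplies whitespace.
            if not prev_text[-1].isspace() and not text[0].isspace():
                out.append(" ")
        out.append(text)
        prev_origin, prev_text = origin, text
    return "".join(out)


print(join_converted([("std", "int"), ("hw", "_hw1"), ("std", "=1;")]))
# int _hw1 =1;
```

Without the inserted blank, `int` and `_hw1` would merge into the single token `int_hw1` during lexical scanning.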

Most programming languages are based primarily on natural language based on phonetic characters, such as English. Therefore, identifiers often correspond to words. One of the benefits of using handwritten programming is that it is not limited by this natural language, as long as it is mapped to the target language through a mapping table. For example, we can use Chinese. In Chinese, there is no concept of words, especially in handwritten Chinese characters, each character can have a certain spacing. If we treat a single character as an identifier based on this spacing, this result is obviously wrong. Therefore, we need to define a large character spacing to ensure that multiple characters can form an identifier.

In traditional programs, the input, output, and related processing of standard-code strings are unavoidable, and the corresponding code more or less embeds standard-code string content. A characteristic of handwritten text, however, is that it does not undergo handwriting recognition, so standard-code strings cannot be generated from it in real time. Embedding a standard-code string in handwritten program code is therefore indeed a problem. It can be solved or circumvented by the following methods:

1. Put the string into the glyph interface identifier mapping table and use the corresponding glyph when programming; the required string is then obtained through the standard code conversion process;

2. Put the string into a resource file (many systems support this practice, and in view of internationalization it is the recommended practice) and load the string at runtime through its corresponding ID, which avoids embedding strings in the source code of the program;

3. Consider adding runtime support for handwritten text to the program, so that the program can directly support input and output based on handwritten text.

In the glyph digital symbol mapping table, glyphs corresponding to the ten digits 0-9 and the decimal point can be defined directly. However, one problem with handwritten numbers is that the glyphs of certain digits are difficult to distinguish from other symbols or letters, which causes deviations in the results of the text search and matching service. For example, the glyphs of the number 1, the parentheses ( and ), the uppercase letter I, and the lowercase letter l are highly similar; the number 0 and the uppercase and lowercase letter O are hard to distinguish; and the number 7 and the letter T may look the same. To address this problem, users need to deliberately distinguish the glyphs of digits from other symbols and letters when entering handwritten numbers. This is also what people usually do in daily life.

One advantage of handwritten text is that it is not constrained by the glyphs of standard coded text, and the user can use any glyph or symbol. So in handwriting programming, any glyph or symbol can be used as a keyword or identifier. In use, however, attention must be paid to conflicts between keywords and identifiers. If an identifier uses the same glyph as a keyword, the conversion will often produce a syntax error. By using special glyphs or symbols for keywords, this conflict can be avoided effectively.

FIG. 1I is a flowchart of a handwriting program source code conversion method in a method for processing handwritten input characters according to an embodiment of the present invention. FIG. 1J is a detailed flowchart of “standard code conversion for B” in the handwriting program source code conversion method shown in FIG. 1I.

As shown in FIG. 1I and FIG. 1J, the entire conversion process has five inputs: a handwritten program source file, a handwritten character library, a glyph digital symbol mapping table, a glyph keyword mapping table, and a glyph interface identifier mapping table. There are three conversion results: the standard code object file, the source target location mapping table, and the glyph private identifier mapping table. The glyph private identifier mapping table is only needed during the conversion process and need not be retained afterwards. The source target location mapping table, however, is very important: compilation or interpreted execution after the conversion takes the generated standard code object file as input, so the resulting system messages report positions within that text file. With the source target location mapping table, these positions can be converted directly into the corresponding positions in the handwritten source file. This provides the foundation for the entire handwriting programming environment and its supporting tools.
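The role of the source target location mapping table can be illustrated with a minimal Python sketch. The class name, the (line, column) and (line, glyph index) position formats, and the lookup rule ("the last recorded token that starts at or before the reported position") are assumptions for illustration, not the structure defined by the embodiment.

```python
from bisect import bisect_right


class SourceTargetLocationMap:
    """Maps positions in the generated standard code file back to the handwritten source."""

    def __init__(self):
        self._target_positions = []  # (line, column) in the generated file, ascending
        self._source_positions = []  # (line, glyph index) in the handwritten source

    def record(self, target_pos, source_pos):
        """Record where a converted token starts; call in output order."""
        if self._target_positions and target_pos < self._target_positions[-1]:
            raise ValueError("positions must be recorded in ascending order")
        self._target_positions.append(target_pos)
        self._source_positions.append(source_pos)

    def to_source(self, target_line, target_col):
        """Return the source position of the token covering the given target position."""
        i = bisect_right(self._target_positions, (target_line, target_col)) - 1
        if i < 0:
            raise LookupError("position precedes the first recorded token")
        return self._source_positions[i]


# Usage: the converter records each token as it writes it to the object file.
location_map = SourceTargetLocationMap()
location_map.record((1, 0), (1, 0))  # first token, produced from glyph 0 of source line 1
location_map.record((1, 6), (1, 3))  # second token, produced from glyph 3 of source line 1
```

A compiler message reported at line 1, column 8 of the generated file would then be resolved with `location_map.to_source(1, 8)` to the glyph at which the second token starts.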

In the detailed conversion process described above, the output is mainly a standard code program text file. In an actual implementation, however, the conversion process can be integrated with an existing compiler front end: the step of writing a file is skipped and a standard code character stream is generated in memory for further processing. In addition, the conversion process above assumes that the glyph interface identifier mapping table is predefined. Through deep integration with the compiler front end, an optimized conversion process can instead generate an intermediate file (in which the conversion of digit symbols and keywords is already complete) without the glyph identifier mapping table, and then handle handwritten identifiers intelligently according to the results of lexical analysis, parsing, and semantic analysis. For example, the following processing rule can be used: for a handwritten symbol that has a definition, its standard code identifier is generated automatically; for an undefined handwritten symbol, the user is asked interactively for its identifier definition, and the glyph interface identifier mapping table is generated automatically from the user's input.

Further, when the deeply integrated compiler is used inside the handwritten text editor, functions such as syntax coloring and syntax-aware assistance can also be implemented, ultimately yielding an integrated development environment based on handwritten characters.

FIG. 1K is a schematic diagram of a handwriting program in an embodiment of the method for processing handwritten input characters according to the present invention. The handwriting program in FIG. 1K is written in Lua, an embeddable scripting language. The corresponding font library codes are shown in Table 1, Table 2, and Table 3.

Table 1

Table 2

Table 3

The handwriting above contains three types of codes: glyph codes, word spacing codes, and line feed codes. A glyph code is written as W followed by the specific glyph code, and a word spacing code as S followed by the word spacing value. For convenience, line feeds are not embedded as codes in the content but are represented directly by starting a new line. The code corresponding to the handwriting program above can therefore be expressed as follows:

S06 W01 S22 W02 S07 W03 S06 W04 S11 W05 S06 W06 S09 W07 S12 W08 S09 W09

S05 W10 S38 W11 S13 W12 S11 W13 S13 W14

S46 W15 S39 W16 S23 W17 S24 W18 S33 W19

S114 W20 S40 W21

S51 W22

S113 W23 S39 W24 S25 W25 S25 W26 S11 W27 S08 W28 S12 W29 S12 W30 S09 W31

S62 W32

S17 W33

S31 W34 S30 W35 S27 W36 S12 W37 S05 W38 S03 W39

S30 W40 S09 W41 S16 W42 S16 W43 S16 W44 S13 W45 S18 W46 S13 W47

To convert this code, the user prepares the glyph digital symbol mapping table shown in Table 4.

Table 4

The glyph keyword mapping table is shown in Table 5.

Table 5

The glyph interface identifier mapping table is shown in Table 6.

Table 6

Here, the system sets a syntax interval threshold of 20. The private identifier auto-generation rule is two underscores (_) followed by a glyph code sequence connected by an underscore.
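As a hedged illustration of how the syntax interval threshold and the private identifier rule might be applied to one line of the encoded handwriting, consider the following Python sketch. It assumes that a word spacing value greater than or equal to the threshold starts a new syntactic unit, and that the "glyph code sequence" in a private identifier consists of the numeric parts of the W codes; neither detail is stated explicitly in the text, and the function names are hypothetical.

```python
import re

SYNTAX_INTERVAL_THRESHOLD = 20  # value stated in the text


def parse_line(line):
    """Parse 'S06 W01 S22 W02' into [(spacing_before_glyph, glyph_code), ...]."""
    pairs, spacing = [], 0
    for kind, value in re.findall(r"([SW])(\d+)", line):
        if kind == "S":
            spacing = int(value)
        else:
            pairs.append((spacing, value))
            spacing = 0
    return pairs


def group_glyphs(line, threshold=SYNTAX_INTERVAL_THRESHOLD):
    """Group glyph codes into syntactic units.

    Assumption: a spacing >= threshold separates two units, while a smaller
    spacing keeps the glyph in the current unit.
    """
    groups = []
    for spacing, glyph in parse_line(line):
        if not groups or spacing >= threshold:
            groups.append([glyph])
        else:
            groups[-1].append(glyph)
    return groups


def private_identifier(glyph_codes):
    """Two underscores followed by the glyph codes joined by underscores."""
    return "__" + "_".join(glyph_codes)


# First line of the encoded handwriting program shown above.
first_line = "S06 W01 S22 W02 S07 W03 S06 W04 S11 W05 S06 W06 S09 W07 S12 W08 S09 W09"
units = group_glyphs(first_line)
```

Under these assumptions the first line splits into two units: the single glyph W01, and the run W02 to W09, which would receive one private identifier if it is not found in any of the mapping tables.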

Finally, according to the previous process, you can get such standard code program code:

As you can see, four private identifiers are generated, and the generated private identifiers are shown in Table 7.

Table 7

The first of these identifiers is actually comment content and carries no meaning. With an optimized conversion process, the conversion can be skipped as soon as the content is identified as a comment.

This generated program can be interpreted and executed normally by the traditional Lua interpreter, and its execution semantics are exactly the same as those in the handwritten source code.

Further, based on the foregoing FIG. 1A, the method may further include:

When a storage request is received, obtaining the metadata of the saved handwritten text according to the preset metadata stripping protocol, and stripping the obtained metadata from the handwritten text;

The handwritten text is divided into at least two pieces of data according to a preset data content splitting specification.

Further, the method may further include:

Querying an encoding repository, selecting or creating an encoding specification according to at least a part of the metadata, and generating a meta encoding corresponding to the metadata according to the encoding specification; encoding the handwritten text according to the encoding specification to obtain an instance encoding; and acquiring a text encoding corresponding to the handwritten text according to the meta encoding and the instance encoding;

or,

Transmitting the handwritten text and the metadata to the encoding repository, so that the encoding repository selects or creates an encoding specification according to at least a part of the metadata, generates a meta encoding corresponding to the metadata according to the encoding specification, encodes the handwritten text according to the encoding specification to obtain an instance encoding, and acquires a text encoding corresponding to the handwritten text according to the meta encoding and the instance encoding; and receiving the text encoding returned by the encoding repository, where the text encoding is in reference code form or content code form.

It should be noted that, for the data splitting procedure, reference may be made to the embodiments of the data splitting method described later in this specification, and for the encoding procedure, to the embodiments of the encoding processing method described later in this specification. Details are not repeated here.

FIG. 1L is a schematic structural diagram of an embodiment of a device for processing handwritten input characters according to the present invention. As shown in FIG. 1L, the processing device for handwriting input characters in this embodiment may include:

The acquisition module 1001A is configured to acquire, in the currently activated first target row/column, a stroke input by the user and the corresponding input information, where the input information includes the input position of the stroke in the first target row/column;

The attribution module 1002A is configured to, for each stroke, create a new character for the stroke or determine the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to the input position of the stroke in the first target row/column and the characters specified in the first target row/column.

The handwriting input character processing device in this embodiment may be used to perform the method for processing the handwritten input character shown in FIG. 1A. The specific implementation principle may refer to the foregoing embodiment, and details are not described herein again.

The processing device for handwritten input characters provided in this embodiment acquires, in the currently activated first target row/column, a stroke input by the user and the corresponding input information, and, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, creates a new character for the stroke or determines the character to which the stroke belongs. This achieves typesetting while inputting. The user does not need to separate characters by means of explicit or implicit "start single text input" or "end single text input" commands, so there is no need to pause or interact with the system during writing, and the writing process is smooth and efficient. Because the character to which a stroke belongs is determined directly from the input position of the stroke, no standardized character recognition is required, and the personalized information, writing style, and features of the user's handwriting are preserved.

Based on the technical solution provided by the foregoing embodiment, it is preferable that the acquisition module 1001A is further configured to:

Wherein, the row height/column width information is a default value or is determined by user input, and the position range of each row/column refers to the relative top edge and bottom edge positions of each row in the handwriting input screen, or the relative left edge and right edge positions of each column in the handwriting input screen;

Alternatively, the acquisition module 1001A is further configured to:

Collecting at least one character obtained by the user;

Wherein, the position range refers to a relative top side position and a bottom side position of the first target line in the handwriting input screen or a relative left side position and a right side position of the first target column in the handwriting input screen.

Receiving a line break/column command input by the user;

Alternatively, the acquisition module 1001A is further configured to:

When the first target row/column and the second target row/column are both currently activated target rows/columns, each of them is only partially activated;

On the basis of the technical solutions provided by the foregoing embodiments, it is preferred that the attribution module 1002A is specifically configured to:

The specified character is all characters that are already in the first target row/column;

Or the specified character is a character in the area to be compared in the first target row/column, wherein a distance between a boundary position of the area to be compared and the stroke is less than a second preset threshold.

Specifically, comparing the input position of the stroke in the first target row/column with the position information corresponding to the character specified in the first target row/column, and determining the relevance between the stroke and the character, may include:

If the stroke does not overlap with all the strokes in the character, it is determined that the stroke is not associated with the character.

Alternatively, comparing the input position of the stroke in the first target row/column with the position information corresponding to a character specified in the first target row/column, and determining the relevance between the stroke and the character, may include:

If the distance between the stroke and the boundary of the character is not less than a third preset threshold, it is determined that the stroke is not associated with the character.

For each character specified in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information corresponding to each stroke in the character, determining the minimum spacing value among the spacings between the stroke and each stroke of the character, and determining whether the minimum spacing value is less than a third preset threshold;

If less than, the stroke is associated with the character.

If not less than, the stroke is not associated with the character.
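The minimum spacing test described above can be sketched in Python as follows. The sketch assumes that a stroke is a list of (x, y) points and that the spacing between two strokes is the smallest Euclidean distance between any of their points; the embodiment does not fix the spacing measure, so both choices and all names are illustrative.

```python
import math


def stroke_spacing(stroke_a, stroke_b):
    """Smallest point-to-point distance between two strokes (assumed spacing measure)."""
    return min(math.dist(p, q) for p in stroke_a for q in stroke_b)


def minimum_spacing(stroke, character):
    """Minimum spacing between the new stroke and every stroke of an existing character."""
    return min(stroke_spacing(stroke, existing) for existing in character)


def is_associated(stroke, character, third_threshold):
    """The stroke is associated with the character if the minimum spacing is below the threshold."""
    return minimum_spacing(stroke, character) < third_threshold


# Usage: an existing character made of two strokes, and two candidate strokes.
character = [[(0.0, 0.0), (0.0, 10.0)], [(1.0, 5.0)]]
near_stroke = [(3.0, 0.0), (3.0, 10.0)]
far_stroke = [(30.0, 0.0)]
```

With a third preset threshold of 5.0, `near_stroke` would be associated with the character and `far_stroke` would not.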

Performing attribution processing on the stroke according to the at least one associated character may include:

Alternatively, performing attribution processing on the stroke according to the at least one associated character may include:

Obtaining, from the at least one associated character, the character most strongly associated with the stroke includes:

Sorting the at least one character associated with the stroke in ascending order of the minimum spacing value between the stroke and each character, and taking the first character as the character most strongly associated with the stroke.
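The selection of the most strongly associated character can be sketched as follows, assuming the minimum spacing value between the stroke and each candidate character has already been computed. The dictionary input, the character IDs, and the convention of returning None when no character is associated (in which case a new character would be created for the stroke) are assumptions for illustration.

```python
def most_associated_character(min_spacings, third_threshold):
    """Pick the character with the smallest minimum spacing below the threshold.

    min_spacings maps a character ID to the minimum spacing between the new
    stroke and that character. Returns None if no character is associated.
    """
    associated = [
        (spacing, char_id)
        for char_id, spacing in min_spacings.items()
        if spacing < third_threshold
    ]
    if not associated:
        return None
    associated.sort()  # ascending by minimum spacing value
    return associated[0][1]


def attribute_stroke(min_spacings, third_threshold, next_char_id):
    """Return (character ID, created_new) for the stroke."""
    best = most_associated_character(min_spacings, third_threshold)
    if best is None:
        return next_char_id, True
    return best, False
```

For example, with spacings `{"c1": 4.0, "c2": 1.5, "c3": 9.0}` and a threshold of 5.0, the stroke would be attributed to `c2`.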

Before acquiring the stroke input by the user and the corresponding input information, the first target row/column is divided into a plurality of composition grids;

Correspondingly, the attribution module 1002A can be specifically configured to:

Determining whether a character already exists in the composition grid;

or,

When the current page is switched from one page to another, the new character created for the acquired stroke, or the character attribution determined for it, is saved for the page being left.

Based on the technical solution provided by the foregoing embodiment, it is preferred that the acquisition module 1001A is also used to:

Saving the stroke input by the user and the corresponding input information in the first memory;

The saved characters are stored in the second memory, and for each saved character, the characters include a stroke constituting the character and an index corresponding to the stroke;

The index corresponding to the stroke points to the input information corresponding to the stroke in the first memory.

The input information corresponding to the stroke further includes one or a combination of the following: an input time of the stroke, an input strength of the stroke, and an input speed of the stroke.

The input time includes the pen down time and pen up time of the stroke, and the dwell time of each point in the handwriting of the stroke;

The input position includes at least: a position when the pen is dropped, a position when the pen is lifted, and a coordinate position of each point in the handwriting of the stroke.
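The two-level storage described above (stroke input information in a first memory, characters in a second memory that refer to strokes by index) could be modelled as in the following Python sketch. The class and field names are hypothetical, only a subset of the listed input information is modelled, and each character is reduced to a list of stroke indices for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StrokeInfo:
    """Input information of one stroke, kept in the first memory."""
    points: List[Tuple[float, float]]  # coordinate of each point in the handwriting
    pen_down_time: float
    pen_up_time: float
    strength: float = 0.0              # input strength, if the device reports it


@dataclass
class Character:
    """A saved character, kept in the second memory; strokes are referenced by index."""
    stroke_indices: List[int] = field(default_factory=list)


class HandwritingStore:
    def __init__(self):
        self.first_memory: List[StrokeInfo] = []
        self.second_memory: List[Character] = []

    def save_stroke(self, info):
        """Save stroke information and return its index in the first memory."""
        self.first_memory.append(info)
        return len(self.first_memory) - 1

    def new_character(self, stroke_index):
        """Create a new character for a stroke and return the character index."""
        self.second_memory.append(Character([stroke_index]))
        return len(self.second_memory) - 1

    def attach(self, char_index, stroke_index):
        """Attribute a stroke to an existing character."""
        self.second_memory[char_index].stroke_indices.append(stroke_index)

    def strokes_of(self, char_index):
        """Follow the indices to the stroke information in the first memory."""
        return [self.first_memory[i] for i in self.second_memory[char_index].stroke_indices]
```

Keeping only indices in the characters means that a merge, split, or re-attribution correction changes the second memory alone; the recorded stroke information is left untouched.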

Get and display the boundaries of each character saved locally;

The correction request is a merge correction request, and the character to be corrected is at least two characters to be merged;

Combining the at least two characters to be merged into one character.

Alternatively, the correction request is a split correction request, and the character to be corrected is a character to be split;

Splitting one character to be split into at least two characters.

Or the correction request is a home correction request, the character to be corrected is a character to be vested, and the stroke to be corrected is at least one stroke to be corrected;

The characters after the position to be inserted are adjusted accordingly.

Acquiring at least one character selected by the user;

The data splitting and data merging will be described in detail below.

The data splitting of the present invention is a solution that can effectively solve the above problems. FIG. 2A is a flowchart of a data splitting method according to an exemplary embodiment. As shown in FIG. 2A, the present invention provides a data splitting method, including:

Step 101B: When a storage request carrying an identifier of the data to be stored is received, obtain, according to the preset metadata stripping protocol, the metadata in the data object corresponding to the identifier of the data to be stored.

Step 102B: Strip the acquired metadata from the data object.

Step 103B: Divide the data content into at least two data segments according to the preset data content splitting protocol.

Optionally, the method may further include:

Step 104B: Store the metadata and each data segment separately in different storage bodies or through different secure channels.

In the data splitting method of this embodiment, when a storage request carrying the identifier of the data to be stored is received, the metadata in the corresponding data object is obtained according to the preset metadata stripping protocol and stripped from the data object; the data content is divided into multiple data segments according to the preset data content splitting protocol; and the metadata and each data segment are stored separately in different storage bodies or through different secure channels. This makes it harder to obtain the user's original data illegally and makes data storage more reliably secure.
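Steps 101B to 104B can be illustrated with a minimal Python sketch. The representation of a data object as a dictionary, the stripping protocol as a list of attribute names, the fixed-size splitting rule, and the use of in-memory dictionaries as stand-ins for separate storage bodies are all assumptions made for this example.

```python
def strip_metadata(data_object, stripping_protocol):
    """Steps 101B/102B: take the attributes named by the protocol out of the object."""
    attributes = data_object["attributes"]
    metadata = {name: attributes[name] for name in stripping_protocol if name in attributes}
    remaining = {name: value for name, value in attributes.items() if name not in metadata}
    return metadata, remaining


def split_content(content, segment_size):
    """Step 103B: divide the data content into at least two fixed-size segments."""
    if len(content) <= segment_size:
        raise ValueError("content too short to yield at least two segments")
    return [content[i:i + segment_size] for i in range(0, len(content), segment_size)]


def split_and_store(data_object, stripping_protocol, segment_size, stores):
    """Step 104B: metadata goes to stores[0], segments are spread over stores[1:]."""
    metadata, _remaining = strip_metadata(data_object, stripping_protocol)
    segments = split_content(data_object["content"], segment_size)
    object_id = data_object["id"]
    stores[0][object_id] = metadata
    content_stores = stores[1:]
    for n, segment in enumerate(segments):
        content_stores[n % len(content_stores)][(object_id, n)] = segment
    return len(segments)


# Usage with three hypothetical storage bodies.
stores = [{}, {}, {}]
document = {
    "id": "doc1",
    "attributes": {"type": ".txt", "author": "example"},
    "content": b"hello world",
}
segment_count = split_and_store(document, ["type"], 4, stores)
```

After the call, no single store holds both the file type and the complete content, which is the property the method relies on.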

FIG. 2B-1 is a flowchart of a data splitting method according to another exemplary embodiment. As shown in FIG. 2B-1, the present invention provides a data splitting method, including:

Step 201B: Receive a storage request carrying an identifier of the data to be stored.

The data splitting method may be applied to a device such as a terminal (client device) or a network-side device (server device). The storage request carrying the identifier of the data to be stored may be triggered by a terminal application, for example the mail system, desktop agent, or other applications mentioned above. Taking the mail system as an example: when the mail system sends file data, the data splitting device of the mail system receives a storage request carrying the identifier of the data to be stored and first splits the file data, so that the recipient of the mail has to obtain the file data fragments from each specified storage body in order to get the complete file data. The storage request may also be triggered by the user: if the user wants to split a file before storing it, the data splitting device receives the storage request carrying the identifier of the data to be stored and then splits the file. The identifier of the data to be stored may be the name of the file data, or identifying information such as a message digest (MD5 code).

Step 202B: If the metadata specified in the preset metadata stripping protocol includes: attribute information, determine, in the data object corresponding to the data identifier to be stored, the attribute information content that matches the attribute information as metadata.

Stripping metadata means separating the metadata of a data object, especially the key metadata, from the data object and moving it away from its original location, so that the remaining data content and/or other metadata alone are not sufficient to access, identify, correctly read, or use the original data object. Key metadata is security-related metadata: once it is missing, the system cannot read, identify, decode, or restore the corresponding data object.

For example, for data stored as files in a Windows system, the file type is key metadata. When the type information of a file is removed (in Windows, by removing the file extension), the system cannot open the file content normally. Storing the file type information and the file content in different cloud storage services makes it harder for a malicious attacker or a service provider to obtain the complete data. Different types of data have different key metadata. For tabular data (a spreadsheet, a database table, and so on), for instance, the header (field names) is key metadata. In practical applications, metadata can also cover a wider range: as long as it benefits data security, any information related to the data content can be separated from the data content itself as metadata. The metadata includes attribute information. Attribute information identifies the distinguishing properties of the data object and consists of descriptive information that helps to find and open the data object. Attributes are not part of the actual content (data content) of the data object; they provide information about the data object, such as its size, its data type, its creation and modification dates, its author, and its rating. Since a person skilled in the art can define the attribute information according to the nature of the data object, the items listed above are only examples and do not limit the content of the attribute information.

Alternatively, if the metadata agreed in the preset metadata stripping protocol includes a data content identifier and a keyword, the data content matching the keyword is determined as metadata from the data content in the data object according to the data content identifier.

The data content identifier is used to prompt the extraction location of the metadata from the data content portion, and the keyword is used to indicate the data content that needs to be extracted specifically; the data content matched with the keyword may be key information or sensitive information contained in the data. For example, in a bank statement, a number of keywords associated with the account information can be set to extract sensitive information in the account as metadata storage. For example: account number, user ID, user phone, address, etc.
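The keyword-based variant can be sketched as follows for the bank statement example. The sketch assumes that the data content is a dictionary of named sections, that the data content identifier is simply the name of the section to search, and that the field values are made-up placeholders; none of this is prescribed by the text.

```python
def strip_by_keywords(data_content, content_identifier, keywords):
    """Move the fields matching the keywords out of the identified section.

    Returns (metadata, remaining_content); the input is left unmodified.
    """
    section = data_content[content_identifier]
    metadata = {key: value for key, value in section.items() if key in keywords}
    remaining = dict(data_content)
    remaining[content_identifier] = {
        key: value for key, value in section.items() if key not in keywords
    }
    return metadata, remaining


# Usage: placeholder statement data, not real account information.
statement = {
    "header": {"account number": "0000", "user phone": "0000-0000", "currency": "CNY"},
    "transactions": [{"date": "2015-01-01", "amount": 100}],
}
sensitive_keywords = {"account number", "user ID", "user phone", "address"}
metadata, remaining = strip_by_keywords(statement, "header", sensitive_keywords)
```

The extracted `metadata` would then be stored apart from `remaining`, so that the transactions can no longer be tied to an account from the content alone.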

Alternatively, if the metadata agreed in the preset metadata stripping protocol includes: attribute information, a data content identifier, and a keyword, the attribute information content matching the attribute information in the data object is determined as metadata, and according to the data content identifier, From the data content in the data object, the data content matching the keyword is determined as metadata.

The default metadata stripping protocol can be determined by the developer, or the user can be allowed to define a suitable protocol. In the latter case the system needs to present the metadata to the user as comprehensively as possible, so that the user can preset the most appropriate metadata stripping protocol based on that information. The preset metadata stripping protocol is built into the data splitting system. In the mail client example above, it can be built into the mail system application. The preset metadata stripping protocol may also be stored together with the metadata as part of the metadata content, so that the recipient can refer to it when merging the data object.

Continuing the mail client example, the attachment file (data object) to be sent is split, and the metadata of the attachment file may include the file name, file type, file size, creation time, and the like. The result of stripping the file metadata is stored in the file meta information system. The method used to split the file content and the splitting result information, such as the hash value or ID of each file fragment and the storage location of each file fragment, are also stored in the file meta information system and associated with the corresponding file metadata. In fact, as mentioned above, all of the content stored in the file meta information system constitutes an instance of this split/strip protocol.
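A record of the kind kept in the file meta information system might be built as in the following Python sketch. The record layout, the fixed-size splitting method, the round-robin choice of storage locations, and the use of SHA-256 as the fragment hash are assumptions chosen for this example; the text only requires that the splitting method, the fragment hash values or IDs, and the fragment storage locations be recorded and associated with the file metadata.

```python
import hashlib


def build_meta_record(file_name, file_type, content, segment_size, locations):
    """Build one file meta information record for a split attachment."""
    segments = [content[i:i + segment_size] for i in range(0, len(content), segment_size)]
    fragments = [
        {
            "hash": hashlib.sha256(segment).hexdigest(),  # assumed hash function
            "location": locations[n % len(locations)],    # hypothetical storage names
        }
        for n, segment in enumerate(segments)
    ]
    return {
        "file_name": file_name,
        "file_type": file_type,
        "file_size": len(content),
        "split_method": {"kind": "fixed-size", "segment_size": segment_size},
        "fragments": fragments,  # listed in original order, which is needed for merging
    }


record = build_meta_record("report", ".txt", b"abcdefgh", 3, ["store-a", "store-b"])
```

The recipient can use such a record to fetch each fragment from its location, verify it against the hash, and concatenate the fragments in the recorded order.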

Step 203B: Detach the acquired metadata from the data object.

Stripping, also referred to as splitting off, means selecting from the data object the metadata relevant to its split/strip processing and separating it out. The system separates the metadata from the data object according to the preset metadata stripping protocol (which may be the system default, user-selected, or user-defined). The protocol records the rules, constraints, methods, and other information related to metadata split/strip processing, for example but not limited to: the stripping location of the metadata, the stripping method of the metadata, the encoding scheme, information related to encoding stripping, content splitting rules, and other data and/or information related to content splitting. The metadata may be the complete set or a subset of the metadata of the data object. For the types of metadata, refer to the cases described in step 202B above.

There are various methods for splitting data, such as splitting a data object into multiple segments according to a predetermined rule and saving them separately. That approach, however, does not achieve finer-grained encryption, and it cannot separate the important information (metadata) closely related to the data object from the data content itself. The invention adopts a new data splitting method. This method not only splits the data object at a finer granularity (for example, by character or even by bit), but also strips the important information closely related to the data object (that is, the metadata) from the data content itself. Finally, the stripped metadata, the data content, and/or the codes described later can be stored separately in different storage locations or spaces, or through different secure channels, which makes data storage more reliably secure.

Step 204B: Divide the data content into at least two data segments according to the preset data content splitting protocol.

Content splitting means dividing the data content in a data object into several (more than one) segments according to certain rules, much like tearing a sheet of paper into pieces. Content splitting is not mandatory, however, and can be decided according to actual needs; applications that do not require high content confidentiality may skip it. Content splitting can use RAID disk array technology, which divides data into multiple blocks and writes them to multiple disks in parallel to improve disk read/write speed and throughput.
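The block-wise distribution used by striping-style RAID can be illustrated with a short Python sketch. It models only the round-robin placement of blocks (as in RAID 0), with in-memory lists standing in for disks; parity, mirroring, and parallel I/O are outside the sketch, and the function names are hypothetical.

```python
def stripe(data, disk_count, block_size):
    """Distribute fixed-size blocks of data over the disks in round-robin order."""
    disks = [[] for _ in range(disk_count)]
    for offset in range(0, len(data), block_size):
        block_number = offset // block_size
        disks[block_number % disk_count].append(data[offset:offset + block_size])
    return disks


def unstripe(disks):
    """Reassemble the data by reading one block from each disk in turn."""
    blocks = []
    longest = max(len(disk) for disk in disks)
    for row in range(longest):
        for disk in disks:
            if row < len(disk):
                blocks.append(disk[row])
    return b"".join(blocks)


disks = stripe(b"ABCDEFGHIJ", disk_count=3, block_size=2)
```

Each disk on its own holds only every third block, so a single disk does not contain a contiguous copy of the content.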

Content splitting can be divided into domain-related content splitting and domain-independent content splitting. Domain-related content splitting splits the data according to the characteristics of data in a specific domain, for example structural splitting for a specific file format, or splitting off key or sensitive information within the data. The latter may overlap to some extent with metadata stripping (when the metadata lies within the data content). For example, the account information in a bank statement can be stripped as metadata, or split off as a data segment and stored separately.

Further, the preset data content splitting protocol may include at least one of a RAID disk array splitting algorithm and an information dispersal algorithm (IDA). Michael O. Rabin first proposed the information dispersal algorithm in 1989. It slices data at the bit level so that the data is unrecognizable while it is transmitted or stored in the array, and only a user/device with the correct key can access it. The information is reassembled when accessed with the correct key. In the field of distributed storage, IDA and its derivative algorithms are widely used.

Step 205B: Encode each data segment separately according to a preset encoding separation protocol to obtain a code corresponding to each data segment.

In this embodiment, optionally, encoding each data segment separately according to the preset encoding separation protocol to obtain a code corresponding to each data segment includes:

According to the preset encoding separation protocol, querying an encoding repository, selecting or creating an encoding specification according to at least a part of the metadata, and generating a meta encoding corresponding to the metadata according to the encoding specification; and encoding each data segment separately according to the encoding specification to obtain an instance code corresponding to each data segment;

or,

According to the preset encoding separation protocol, transmitting each data segment and the metadata to the encoding repository, so that the encoding repository selects or creates an encoding specification according to at least a part of the metadata, generates a meta encoding corresponding to the metadata according to the encoding specification, and encodes each data segment separately according to the encoding specification to obtain an instance code; and receiving the meta encoding and the instance codes returned by the encoding repository.

It should be noted that the specific process of the encoding process can be referred to the specific description of the embodiment of the subsequent encoding processing method of the specification, and details are not described herein again.

Step 206B: Arrange the respective codes according to the original order of the data segments in the data content to obtain the coded arrangement order information.

As described above, the data splitting method of the present invention covers two different kinds of data processing: one is the stripping of metadata and codes, and the other is the splitting of data content. The stripping of metadata has been explained above. The stripping of codes works as follows: the data content is split into n data blocks, the n blocks are stored, and a corresponding code (number) is obtained for each block. The codes may repeat, and they are arranged in the order in which the blocks appear. This code (number) sequence contains both the codes and their arrangement order information, and it can be stored through yet another secure channel. Because the codes differ in nature from the data fragments themselves, splitting them off is called stripping. In most cases only the data content part of the data object needs to be split, not the metadata part and/or the code part that has already been stripped; if necessary, however, the stripped metadata part and/or code part can be split further to achieve finer-grained protection. Stripping and splitting can be combined freely, depending on system requirements and processing capabilities.

In most cases, code stripping is based on content splitting: content splitting splits some or all of the data content according to certain rules, and the addressing information of each resulting piece is encoded. The final encoding result forms separate data. In the computer field, reference codes for data are ubiquitous, such as the key (address) of a data record in a database; the shortened URL (http://dwz.cn/mzot4) used to enter and reference a URL; the access identifier used in a cloud storage programming interface (API); and so on. All of these encoding methods can be used for the encoding mentioned above. If the pieces resulting from splitting the data are encoded, the encoding result replaces the original corresponding data. Sometimes, however, the encoding is not based on content splitting. For example, for data with a low security level it is not necessary to split the data content; it is then sufficient to give the entire data content one code if necessary, although it may still be necessary to separate that code from the data content. It can be seen that the code stripping of this embodiment is neither traditional content splitting nor existing data reference encoding, but a combination of the two. As long as the encoding result (including the codes themselves and their arrangement order) is separated from the data content, the security risk to the data is reduced to some extent. For example, suppose the 6 bytes of data ACBDAC are split into two-byte pieces and stored in a database: AC returns code 1, and BD returns code 2. The encoding result for this data is the sequence 1, 2, 1, not just 1 and 2. Here the numbers 1 and 2 are the codes, and the arrangement 1, 2, 1 is the code arrangement order information.
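The ACBDAC example can be illustrated with a minimal sketch. It assumes fixed-size pieces and codes numbered 1, 2, 3, ... in order of first appearance; the function names strip_codes and merge are hypothetical and are not defined by this embodiment.

```python
def strip_codes(data: bytes, piece_size: int):
    """Split data into fixed-size pieces; return (code table, code sequence)."""
    code_of = {}    # piece -> code; a repeated piece reuses its code
    sequence = []   # codes in order of appearance (the ordering information)
    for i in range(0, len(data), piece_size):
        piece = data[i:i + piece_size]
        if piece not in code_of:
            code_of[piece] = len(code_of) + 1   # assumed numbering: 1, 2, 3, ...
        sequence.append(code_of[piece])
    table = {code: piece for piece, code in code_of.items()}
    return table, sequence


def merge(table, sequence) -> bytes:
    """Rebuild the data; this needs BOTH the code table and the sequence."""
    return b"".join(table[code] for code in sequence)


table, sequence = strip_codes(b"ACBDAC", 2)
print(table)      # {1: b'AC', 2: b'BD'}
print(sequence)   # [1, 2, 1]
print(merge(table, sequence))   # b'ACBDAC'
```

In this sketch the table would be kept with the data content and the sequence in a separate store or channel, so that neither alone reveals the original data.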

In practical applications, the above metadata, code, and data content stripping/splitting methods are not mutually exclusive and can be used in combination. For example, but without limitation: only the metadata may be separated from the data content; only the code portion may be separated from the data content; the code portion may be treated as a special kind of metadata and kept together with the other metadata, as long as both are separated from the data content; or, more preferably, the three parts (metadata, codes, data content) are each separated according to their own splitting rules.

In addition, steps 202B to 206B do not fix the order in which content splitting, metadata stripping, and code stripping are performed; these operations may be performed separately, one after another, or simultaneously. Usually, however, the encoding operation of the present invention needs to be performed during or after the content splitting process, and when no content splitting is required the encoding operation may be omitted. Metadata stripping can be done before the content is split, or after content splitting and code assignment are complete. Other data processing, such as compression and encryption, may also be interleaved before and after each splitting step, that is, between steps 202B to 206B. A description of the compression and encryption can also be added to the various protocols mentioned above, but in that case it is best to perform the metadata splitting step again after the compression and/or encryption.

Step 207B: Store the metadata, the code corresponding to each data segment, and the coded sequence information into different storage banks or different secure channels.

On the basis of the foregoing embodiment, further, if the metadata agreed in the preset metadata stripping protocol includes a data object identifier, then obtaining the metadata of the data object to be stored according to the preset metadata stripping protocol includes: parsing the data object to generate a data object identifier uniquely corresponding to the data object.

Further, when the data object is audio data, dividing the data content into at least two data segments according to the preset data content splitting specification in step 204B may include: performing splitting processing on the audio data using a time domain analysis method or a frequency domain analysis method to obtain audio data objects to be encoded, wherein the audio data objects to be encoded include sound segments and/or silent segments.

Specifically, speech is an earlier and more natural form of expression than writing. However, in the world of computers and the Internet, which is increasingly bound up with human production and life, voice data and its processing have always been second-class citizens. The main reasons are the current ways of inputting, storing and processing voice data and the corresponding technical limitations. People now mainly use two methods to process and use voice input through computers and networks: voice calls and speech recognition.

A voice call mainly refers to converting the voice signal produced by a person into a digital signal through a computer sound capture device, processing, transmitting and storing it through computers and a computer network or communication network, and finally playing it back through a digital audio playback device. (Here we mainly mean packet-switched voice technology such as VoLTE; circuit-switched voice technology is unrelated to the problems discussed here.) Voice calls can be real-time or non-real-time, and one-way or two-way. The main problem with current voice calls is the large amount of data, which is not easy to transfer and store. The audio sampling rates of current sound cards are mainly 11 kHz, 22 kHz, and 44.1 kHz. Sound sampled at 11 kHz is called telephone quality (the telephone itself uses an 8 kHz sampling rate) and is basically enough to recognize the caller's voice; 22 kHz is called broadcast quality; 44.1 kHz is CD quality. The higher the sampling rate, the better the sound quality and the larger the storage required. Another sampling parameter is the sampling resolution, which is the number of bits used to represent the magnitude of the sound signal (generally the amplitude of the sound wave). The common values are 8 bits and 16 bits: 8 bits divide the sound signal into 256 levels, and 16 bits divide it into more than 60,000 levels. It can be calculated that one second of 8-bit stereo (left and right channel) audio sampled at 11 kHz occupies 22 KB, which is equivalent to the data of more than 10,000 Chinese characters. In today's most commonly used two-way, real-time voice communication applications, users rarely save recordings of their calls. The main reason is that audio data occupies a large amount of storage and cannot be searched or queried.
There are also applications that preserve the results of one-way calls, and they generally limit the size of the retained data. For example, WeChat's "press and talk" function has a limit of 1 minute, whereas WeChat text messages have no such limit and can run to millions of words. Similarly, Skype has a voice message function whose message duration is also limited, to at most 10 minutes. At present, the most common voice data are digital audio books, such as storytelling, cross talk, lectures, and audio e-books. They are generally stored in audio files (such as MP3, WMA, MOV, etc.) or accessed in real time through network streaming protocols (such as RTSP, MMS, RTP, RSVP, etc.). People generally learn about the audio data through metadata outside the audio content itself (such as the ID3v1 and ID3v2 information in MP3); as for the interior of audio data being listened to for the first time, unless auxiliary text positioning information (such as a subtitle file) is available, it cannot be searched or accessed at random and can only be listened to in sequence.
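The storage figure quoted above for uncompressed audio can be checked with a short calculation. This is a sketch that assumes uncompressed PCM samples and takes the nominal 11 kHz rate to be the common 11,025 Hz sound card rate; the function name pcm_bytes is hypothetical.

```python
def pcm_bytes(sample_rate_hz: int, bits_per_sample: int,
              channels: int, seconds: int) -> int:
    """Size in bytes of uncompressed PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * seconds


# 8-bit stereo at 11,025 Hz for one second: about 22 KB
print(pcm_bytes(11_025, 8, 2, 1))   # 22050

# The same stream for eight seconds: about 176 KB
print(pcm_bytes(11_025, 8, 2, 8))   # 176400
```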

Speech recognition. As noted above, text data is the first-class citizen of current computer systems. Text data is standardized and easy to store, view, find, retrieve, and process. Speech recognition, which converts speech input into text data, can therefore make more efficient use of the input data. However, there are two problems: loss of information, and the recognition rate. Natural human voice output contains information beyond the corresponding text content. When speech recognition converts it into standard text, the original speech data is generally not retained, so this information is lost. It mainly includes voice quality, intonation, tone, pauses, and so on, which may carry mood and emotion. The recognition rate problem is the main obstacle keeping speech recognition from becoming a primary means of human computer input. For speech recognition targeting a specific person after some recognition training, the recognition rate is quite high and can exceed 90%. As a result, digital voice assistants such as Apple's Siri, Amazon's Echo, Microsoft's Cortana (Xiaona), and Google Now have grown particularly fast in recent years, and some people have been able to replace traditional search engines with them. However, language and accent problems keep many people away from these applications. Speech training and speech recognition have a chicken-and-egg relationship: for lack of training data, the recognition rate for a particular group of people is low; conversely, because the recognition rate is low, that group has little enthusiasm for using speech recognition, leaving the system without enough sample data for analysis and optimization.
In addition, speech recognition for the purpose of text entry also has difficulty in identifying punctuation and text control, which affects the efficiency of input.

In summary, voice call data preserves the original voice information, but the amount of data is large and does not lend itself to automatic analysis and processing by computer. Speech recognition generates text data that is convenient for computers to transmit, store, analyze and process, but some of the original speech information is lost in the process; moreover, the accuracy and reliability of current speech recognition are not guaranteed, and there is no effective way to obtain sound sample data from most people in order to improve the recognition rate.

This embodiment proposes a compromise method for processing the original voice data so that both the original voice data and text data are saved, which facilitates transmission, storage, analysis and processing by computer. The key is that this text data is not a standard text encoding but a private encoding for a specific person. The voice data corresponding to the codes is stored in a dedicated text code warehouse, where it is separated and encoded per user. Users can set other users' access rights to their own voice data. As shown in Figure 2B-2, the system is roughly divided into two parts: the code warehouse, and the related services surrounding its data. The process of voice input is as follows:

1. The user logs in to the code warehouse and selects the voice text input system.

2. The voice text input system registers a series of encoders with the code warehouse for the current user.

3. The user inputs continuous speech into the voice text input system.

4. The voice text input system stores the user's input in the input buffer.

5. The voice text input system divides the voice data in the input buffer according to certain rules to form different data objects.

6. The voice text input system submits the data to the code warehouse through the corresponding encoder and obtains the corresponding code.

7. The voice text input system stores the obtained code in the text input result and clears the corresponding input buffer content.

8. Steps 3 to 7 are repeated, so that the voice text input system continuously obtains the user's input and its corresponding codes.

9. When the user stops inputting and there is no data left in the input buffer, the entire voice input process is complete.
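The input process above can be sketched roughly as follows. Everything here is illustrative: CodeWarehouse, register_encoder and voice_input are hypothetical names, the warehouse is an in-memory stand-in, login is omitted, and the segmentation rule is passed in as a function because this passage leaves it to "certain rules".

```python
class CodeWarehouse:
    """Hypothetical in-memory stand-in for the code warehouse."""

    def __init__(self):
        self._voice = {}   # user -> list of stored voice data objects

    def register_encoder(self, user):
        """Step 2: return an encoder bound to the current user."""
        table = self._voice.setdefault(user, [])

        def encode(obj):
            kind, payload = obj
            if kind == "silence":
                return f"S{payload}"      # payload: duration in ms
            table.append(payload)         # payload: raw voice bytes
            return f"V{len(table):03d}"   # code of the stored clip

        return encode


def voice_input(warehouse, user, chunks, segmenter):
    encode = warehouse.register_encoder(user)    # step 2
    buffer, result = [], []
    for chunk in chunks:                         # step 3: incoming input
        buffer.append(chunk)                     # step 4: buffer it
        for obj in segmenter(buffer):            # step 5: form data objects
            result.append(encode(obj))           # steps 6-7: store the code
        buffer.clear()                           # step 7: clear the buffer
    return " ".join(result)                      # step 9: input finished


# Demo with input that is already segmented, so the segmenter is trivial.
chunks = [("silence", 901), ("voice", b"..."), ("silence", 421), ("voice", b"...")]
print(voice_input(CodeWarehouse(), "alice", chunks, list))
# S901 V001 S421 V002
```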

It can be seen that segmenting the voice data in the input buffer is a key step. This is in fact a mature voice data processing technique called "endpoint detection" or "voice detection". Common approaches are time domain analysis and frequency domain analysis. Here is an example of the time domain approach. Figure 2B-3 is a time domain analysis of a piece of audio data, in which a stretch whose amplitude stays below a threshold (here 0.005) for at least a minimum time (here 20 ms) is defined as silence. A silence shorter than 50 ms is divided in the middle: the first half belongs to the preceding segment and the second half to the following segment. A silence of 50 ms or longer is cut out at its beginning and end as a segment of its own. This divides the audio into nine segments: 901 ms of silence, a 949 ms sound clip, 421 ms of silence, a 2558 ms sound clip, a 337 ms sound clip, a 578 ms sound clip, 368 ms of silence, a 1209 ms sound clip, and 679 ms of silence. Two code types are used here: the sound segment code, represented by the letter V followed by a number, and the silence code, represented by the letter S followed by the silence length in milliseconds. The data in the speech text code table for this user in the code warehouse is shown in FIG. 2B-4. The corresponding text code is then: S901 V001 S421 V002 V003 V004 S368 V005 S679
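The time domain rule described above can be sketched as follows. This is an illustrative implementation, not the one behind Figure 2B-3: the function names are hypothetical, and the demo signal is synthetic (one sample per millisecond, constant-amplitude "speech"), constructed so that its segment lengths match the nine segments listed in the text.

```python
def find_quiet_runs(samples, rate, threshold=0.005, min_ms=20):
    """Sample ranges where |amplitude| < threshold for at least min_ms."""
    runs, start = [], None
    # A loud sentinel at the end closes a quiet run that reaches the end.
    for i, s in enumerate(list(samples) + [float("inf")]):
        if abs(s) < threshold:
            if start is None:
                start = i
        elif start is not None:
            if (i - start) * 1000 >= min_ms * rate:
                runs.append((start, i))
            start = None
    return runs


def segment(samples, rate, split_ms=50):
    """Return [(kind, start, end)], kind 'S' for silence or 'V' for voice."""
    segments, pos = [], 0
    for a, b in find_quiet_runs(samples, rate):
        if (b - a) * 1000 < split_ms * rate:
            a = b = (a + b) // 2          # short silence: cut in the middle
        if a > pos:
            segments.append(("V", pos, a))
        if b > a:
            segments.append(("S", a, b))  # long silence: its own segment
        pos = b
    if pos < len(samples):
        segments.append(("V", pos, len(samples)))
    return segments


def to_text(segments, rate):
    """Encode segments as S<milliseconds> and V<running number>."""
    out, voice_no = [], 0
    for kind, a, b in segments:
        if kind == "S":
            out.append(f"S{(b - a) * 1000 // rate}")
        else:
            voice_no += 1
            out.append(f"V{voice_no:03d}")
    return " ".join(out)


def quiet(ms): return [0.0] * ms
def tone(ms): return [0.5] * ms

rate = 1000  # synthetic signal: one sample per millisecond
samples = (quiet(901) + tone(949) + quiet(421)
           + tone(2538) + quiet(40) + tone(297) + quiet(40) + tone(558)
           + quiet(368) + tone(1209) + quiet(679))
print(to_text(segment(samples, rate), rate))
# S901 V001 S421 V002 V003 V004 S368 V005 S679
```

The two 40 ms gaps in the synthetic signal are shorter than 50 ms, so each is cut in the middle and its halves are absorbed into the neighbouring sound clips, which is how the three adjacent clips V002, V003 and V004 arise.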

In this way, 8 seconds of audio data are converted into 9 special text characters. At four bytes per character (this depends on the specific coding scheme; context-dependent object-based coding can achieve an average code length of four bytes), the entire encoding result is 36 bytes, which is only about one five-thousandth of the original 176 KB of audio data (22 KB/s × 8 s). The encoding result is therefore much more convenient and efficient to store, transmit, edit, and mix with other data. Only a user who finally needs to play the sound content has to obtain the corresponding data from the code repository and restore the audio content.

It is worth mentioning that the method of separating the encoding and the content can easily place the encoding and the data content in different secure channels, and has natural security.

At the same time, the voice data stored in the code warehouse is directly tied to a specific person and can naturally serve as training samples for analysis and organization. Existing speech analysis and recognition technology can analyze and identify a lot of useful information, such as pitch, tone, and syllables, and can extract further effective feature parameters, such as MFCC parameters and LPCC parameters. These can be stored in the code repository to provide further services for the corresponding speech codes, such as content search and matching services, content normalization services, content selection services, and the like.

Voice text output. For the obtained voice text content, that is, the encoding result, there are two different output modes: graphic output based on text display, and audio output based on voice playback.

Graphic output. Graphic output of voice text means presenting the voice text in the manner of ordinary text, that is, as typeset text output. The advantage is that the text can be processed using existing word processing methods and tools. In addition, support for voice text output allows voice text, traditional text, and other forms of text (such as graphic text, image text, etc.) to appear in the same text document, supporting richer applications.

The specific presentation of voice text will vary depending on the user's access rights.

1. For a text output system that supports multiple text types, if the user has no access rights at all to the text code (including its type information), the user can only see the code itself, which can be presented as shown in FIG. 2B-5.

2. If the user can obtain the type information of the code but cannot access the specific content of each audio text code, the system can present a continuous run of speech text codes (including voice data codes, mute duration codes, etc.) as a whole, for example: "+ an unauthorized speech text (9 characters, 4 silent characters; total mute duration 2'369)". When the user expands the quoted item, more details can be output, as shown in Figure 2B-6.

As shown above, we can not only see each phonetic character, but also visually see the duration of the mute. Using this information, the system can also provide relevant search functions, such as a silent search (with or without constraints).
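The summary line and the silent search mentioned above need only the text code, not the voice data. A minimal sketch, assuming the S/V notation of the earlier example; the function names summarize and find_silences are hypothetical.

```python
def summarize(voice_text: str):
    """Return (characters, silent characters, total silence in ms)."""
    chars = voice_text.split()
    silent = [c for c in chars if c.startswith("S")]
    return len(chars), len(silent), sum(int(c[1:]) for c in silent)


def find_silences(voice_text: str, min_ms: int = 0, max_ms=None):
    """Silent search: (position, duration) of silences within the bounds."""
    hits = []
    for pos, c in enumerate(voice_text.split()):
        if c.startswith("S"):
            ms = int(c[1:])
            if ms >= min_ms and (max_ms is None or ms <= max_ms):
                hits.append((pos, ms))
    return hits


text = "S901 V001 S421 V002 V003 V004 S368 V005 S679"
print(summarize(text))                  # (9, 4, 2369)
print(find_silences(text, min_ms=500))  # [(0, 901), (8, 679)]
```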

3. Further, if the user has the right to obtain the voice data corresponding to the voice characters, the system can display more information and allow the user to play the voice content, for example by displaying "+ voice content, duration 8″ (5 voice characters, 4 mute characters; total mute duration 2'369)". When the user expands the voice text, more details can be obtained, as shown in Figure 2B-7.

Users can click on any voice character to play it. Voice text is graphically output and can be visualized in a variety of formats, such as displaying waveforms, spectrograms, visualization durations, etc., depending on the specific application requirements. In addition, the results of the analysis of the phonetic characters, or the semantic tags added by the user to the characters, can also be presented simultaneously. As shown in FIG. 2B-8, the third and fourth audio characters are also displayed based on the results of the Chinese Pinyin phonetic analysis.

Due to the ability to access the encoded warehouse information for audio characters, the associated system text search can also provide more search control, such as searching based on semantic tags entered by the user.

Among them, the output process of a single phonetic character (including silent characters) is as follows:

1. The user logs in to the code repository.

2. The system extracts the meta code from the target character code.

3. The system submits the character's meta code to the code repository.

4. The code repository checks the access rights according to the meta code and the current user. If access is denied, an error message is returned to the system; the system performs graphical output based on the character code alone, and the process ends. If access is allowed, the corresponding code metadata is returned to the system, and the process continues.

5. The system extracts the instance code from the target character code.

6. The system parses the instance code according to the encoded metadata. Specifically, if it is a mute character, the instance code is parsed into a mute duration; if it is an audio character, the character code is submitted to the code repository. The encoding repository checks the access rights according to the audio encoding settings and the current user. If access is disabled, an error message is returned; if access is allowed, the corresponding voice data is obtained and returned to the system.

7. The system outputs the characters according to the parsed or obtained data.

8. If the system obtains the user's play request, the waveform data is recovered according to the voice data, and played out.
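The output steps above, together with the three access levels described earlier, can be sketched as follows. This is only an illustration: the repository is an in-memory stand-in, access rights are plain user sets, the meta code is assumed to be the leading letter and the instance code the remaining digits, and all class and function names are hypothetical. Playback (step 8) is omitted.

```python
class AccessDenied(Exception):
    pass


class CodeRepository:
    """Hypothetical in-memory code repository with per-user access sets."""

    def __init__(self, metadata, voice_data, meta_users, voice_users):
        self.metadata = metadata        # meta code -> character type
        self.voice_data = voice_data    # character code -> voice bytes
        self.meta_users = meta_users    # users allowed to read metadata
        self.voice_users = voice_users  # users allowed to fetch voice data

    def get_metadata(self, user, meta_code):
        if user not in self.meta_users:
            raise AccessDenied(meta_code)
        return self.metadata[meta_code]

    def get_voice(self, user, char_code):
        if user not in self.voice_users:
            raise AccessDenied(char_code)
        return self.voice_data[char_code]


def render(repo, user, char_code):
    meta_code = char_code[0]                      # step 2 (assumed layout)
    try:
        kind = repo.get_metadata(user, meta_code)  # steps 3-4
    except AccessDenied:
        return f"[{char_code}]"                   # show the bare code only
    instance_code = char_code[1:]                 # step 5
    if kind == "silence":                         # step 6: parse duration
        return f"[silence {int(instance_code)} ms]"
    try:
        data = repo.get_voice(user, char_code)    # step 6: fetch voice data
    except AccessDenied:
        return f"[voice {instance_code}: no access]"
    return f"[voice {instance_code}: {len(data)} bytes]"   # step 7


repo = CodeRepository({"S": "silence", "V": "voice"},
                      {"V001": b"\x00" * 10},
                      meta_users={"alice", "bob"}, voice_users={"alice"})
for user in ("carol", "bob", "alice"):
    print(user, render(repo, user, "S901"), render(repo, user, "V001"))
```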

If multiple consecutive characters are output, the system needs to obtain all corresponding phonetic characters and related data, and graphically output the visualized form according to certain typographic rules. If the user's play request is obtained, the play buffer is established, and the audio data is played back in turn (while taking into account the play of the silent characters).

Voice playback. The playback output of voice text is similar to the playback of traditional audio data and does not need to consider the graphic layout of the text. However, playback of voice text also depends on the user's access rights: the voice text can be played only if the user has obtained access rights to the data corresponding to it.

In addition to time positioning similar to traditional voice playback, rich search positioning can be performed on voice text, such as searching according to voice duration, mute duration, semantic tags, mixed text in voice text, and the like.

It is worth mentioning that through the mixture of voice text and traditional text, many effects that traditional voice playback cannot achieve can be achieved. For example, embedding subtitles, embedding structured navigation information, embedding photo links, embedding graphics, and more.

Voice text editing. Encoding audio data as text makes it possible to edit voice data in the manner of traditional text editing. When voice text is output graphically, the user can conveniently delete, insert, or modify any character, and can also perform traditional text operations such as searching, replacing, copying and pasting.

Some of these operations require the use of specialized audio services. For example, change the mute duration, divide an audio character into multiples, combine multiple speech characters into one, and so on.

From the above, we can see that the textualization of audio data gives people more opportunities to use computers for safe and effective voice expression and communication. However, some people will also have doubts about this method.

Noise cancellation. Audio data recorded in normal environments generally contains ambient noise. After it is segmented, encoded and then played back, will it sound strange to have noisy voice characters played together with noiseless mute characters?

This is indeed a problem. The solution to this problem is straightforward, that is, unified denoising of audio data before storage. At present, the technology of automatic denoising is relatively mature, and the noise cancellation for pure speech is easier.

The sound frequencies that the human ear can perceive range from 20 Hz to 20 kHz. The sound produced by the human vocal organs is about 80 Hz to 3400 Hz, while ordinary speech usually lies between 300 Hz and 3000 Hz. For a specific individual, this frequency range is generally even narrower. In addition, the volume of normal indoor conversation is between 20 and 60 decibels. Based on this frequency range, we can automatically remove high frequency and low frequency noise. By looking for stretches where a low decibel level persists, we can perform voice detection and automatically obtain the silent sections. Spectrum analysis of the silent sections then allows noise filtering of the entire audio data. It should be noted that some of the noise in the silent segments will lie in the same frequency range as the speech; when filtering automatically, we must make sure that audio in non-silent segments is not processed into low-decibel silence.

The voice data with the overall noise cancellation and the completely muted mute characters will play together in harmony.

In an actual application environment, it is generally not necessary to wait until the voice data has been completely obtained before performing segmentation and denoising. We can keep a buffer of a few seconds in memory and analyze it. Moreover, the noise characteristics already identified can be accumulated, reused and updated in subsequent audio processing.

Real-time voice calls. Since this method is based on the segmentation of voice data, is it unsuitable for voice applications with high real-time requirements? In fact, the method is still applicable to voice applications that can tolerate a delay of a few seconds. If the real-time requirements are stricter, speech segmentation is not possible during the call. For such applications, however, the method can still be used to record the call, which avoids the large data volume and editing difficulty of traditional voice recordings.

Voice transmission. In traditional voice call applications, voice data can be transmitted directly to the receiver. In this method, the voice text is transmitted to the receiver, and the receiver obtains the real voice data from the code warehouse. Will this process be inefficient?

In fact, the code repository for VoIP-based calling applications should be deployed in a cloud data center. Today's data centers generally provide CDN (Content Delivery Network) services, which automatically select the fastest way to transfer data. So this process can be highly efficient; it all depends on how the code repository is deployed.

On the other hand, because the code and the data are separated, the sender can still hide some or all of the voice data even after the voice text has been transmitted. The receiver then cannot play the content in whole or in part even though it has received the voice code. This is not possible in traditional voice call applications.

The actual amount of data. The encoded content of the audio data is indeed much smaller than the original audio data, but for users who ultimately need to use or play the original voice content, the amount of data has not decreased; it has actually increased (by the voice text encoding part). So is this a defect of the method? It is undeniable that for one specific segment of speech, if final playback is to restore the original input, the amount of data is not reduced (ignoring noise cancellation). However, by centralizing personalized voice data in the code repository, a significant amount of redundancy becomes available. By processing this redundant information, storage efficiency and transmission efficiency can be greatly improved. This is explained below.

For a specific individual, the sounds that can be produced in a lifetime are limited. Considering the constraints of language, the basic phonemes/syllables are even more limited, and the combinations of these elements are also very limited. Regardless of volume, the specific phonemes that can be formed are limited. Based on this, when storing voice data we can segment it further for reuse. In existing audio processing, voice data is cut into consecutive sound frames. A frame is generally 10 ms to 40 ms long, and adjacent frames can overlap. Appropriate frame segmentation facilitates audio analysis and further parameterization of the audio data for ultimate reuse.
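Cutting audio into overlapping frames can be sketched as below. The 25 ms frame length and 10 ms hop are assumed values inside the 10 ms to 40 ms range mentioned above, not values given by this embodiment, and the function name frames is hypothetical.

```python
def frames(samples, rate, frame_ms=25, hop_ms=10):
    """Cut samples into fixed-length frames; consecutive frames overlap
    by (frame_ms - hop_ms) milliseconds. A trailing partial frame is dropped."""
    size = rate * frame_ms // 1000   # samples per frame
    hop = rate * hop_ms // 1000      # samples between frame starts
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, hop)]


audio = list(range(800))             # 100 ms of dummy samples at 8 kHz
result = frames(audio, rate=8000)
print(len(result), len(result[0]))   # 8 200
print(result[1][0])                  # 80: the second frame starts one hop later
```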

Some existing audio fingerprint extraction and matching methods can be used to detect redundant voice data well, to implement content normalization, search matching and other services in the code warehouse. For example, Google's Waveprint method (patent US 8411977 B1).

It can be foreseen that by the method of the present embodiment, it is possible to easily record all the voice data of a person's life to complete some applications that were previously unimaginable.

Tampering with the encoded content. Textualized audio data is actually easier to modify, so what ensures the security and reliability of the audio data? How can we make sure that an audio character sequence is the original character sequence? In fact, this is not a new problem; traditional text faces the same problem, and it can be solved with existing solutions such as digital signatures.

Non-speech audio data. The emphasis here is on voice data; is the method also applicable to non-speech audio data, such as music or the audio tracks of video?

First of all, this method does not change the original data; it only divides and encodes it. The original content is divided into the code stream and the corresponding audio data in the code warehouse. Final playback can still fully restore and play the original audio. In this sense, there is no problem with using the method.

However, from a textual point of view, the text obtained by this method is personal and tied to a particular user. This is also what makes subsequent speech analysis, recognition and other highly personalized services possible for that user. If music or other sounds unrelated to the individual user are stored in the code repository and associated with the user, the subsequent personalized services will actually be impaired. It is therefore better to find ways to separate the voice data from the other audio into different channels, and to use other code classes for the other audio data, such as instrument-related codes for music. For final playback, the audio characters from the different channels are mixed together.

A mixture of multiple text types. Since we turn voice data into text by separating the code from the encoded content, can it be mixed with traditional text and other types of text encoding? Indeed, this is one of the strengths of the scheme. A person's natural output is multi-channel; for example, a person can speak while writing or typing on a keyboard. Existing systems can only scatter these results into different data for storage and processing, losing their natural synchronization. With appropriate encoding methods, we can textualize the different kinds of data and then store, process, and correlate them.

With the development of cloud computing and big data technology, computer systems can analyze, summarize and even predict human production and life more systematically and deeply. However, the data that can be analyzed and processed by computer systems at present is mainly data generated inside the digital world. Human output is mainly through the keyboard into the digital world, which is a huge bottleneck. And for most people, the keyboard is not a friendly, easy-to-use device. The method provided in this paper is based on the natural output of human beings, and the output speech data is segmented and encoded. The coding results can be processed using traditional text methods and tools, and the corresponding data of the code is stored in the code repository. The code repository can be placed in cloud storage for easy analysis and utilization. This method will greatly improve the efficiency of digitizing human speech output. And with the accumulation of voice data, the code warehouse has the opportunity to provide smarter, personalized voice data services. Ultimately, humans are seamlessly integrated with the digital world.

Further, the method further comprises: generating a unique identifier for the code order information based on the code arrangement order information, and/or generating a unique identifier for each data fragment based on that data fragment; the code order information unique identifier and/or each data fragment unique identifier is stored as part of the metadata.

The data object identifier uniquely corresponding to the data object, the code order information unique identifier, and the data fragment unique identifier are, respectively, hash values (e.g., MD5, SHA1, etc.) of the data object, the code order information, and the content of each data fragment, or globally unique identifiers (UUID/GUID) generated by the system, or any other globally unique codes. An identifier can be used to perform an integrity check on its corresponding content, verifying whether the identifier matches its corresponding information and whether that information is complete.
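A minimal sketch of such identifiers, using SHA-1 as one of the hash options named above. The function names, the metadata field names, and the comma-joined serialization of the code order are illustrative assumptions. In this sketch only the hash-based identifiers support an integrity check, because a randomly generated UUID is not derived from the content.

```python
import hashlib
import uuid


def content_id(content: bytes, algorithm: str = "sha1") -> str:
    """Identifier derived from the content itself."""
    return hashlib.new(algorithm, content).hexdigest()


def verify(content: bytes, identifier: str, algorithm: str = "sha1") -> bool:
    """Integrity check: does the content still match its identifier?"""
    return content_id(content, algorithm) == identifier


fragments = [b"AC", b"BD"]   # data fragments from the earlier example
order = [1, 2, 1]            # code arrangement order information

metadata = {
    "object_id": content_id(b"ACBDAC"),
    "order_id": content_id(",".join(map(str, order)).encode()),
    "fragment_ids": [content_id(f) for f in fragments],
    "system_id": str(uuid.uuid4()),   # alternative: system-generated UUID
}

print(verify(b"AC", metadata["fragment_ids"][0]))   # True
print(verify(b"AX", metadata["fragment_ids"][0]))   # False: fragment altered
```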

In summary, data splitting refers to splitting a complete piece of data into two or more parts, which are then stored in different storage systems. It should be noted that although steps 104B and 207B in the above embodiments store the split data after splitting, the purpose of the data splitting of the present invention is not merely storage but data security. Users may not trust a cloud provider with their data, but through data splitting a piece of data can be stored with one or more vendors, and the data is compromised only if all of its parts (including the metadata and every data fragment) are leaked. This greatly increases the difficulty of illegally merging the data. The data splitting of the present invention allows the end user of the data (i.e., the user entitled to own the data) to intervene and control directly. The data splitting method is built above the operating system (including a cloud operating system), specifically in an application system dedicated to splitting or in the splitting service of another application system. The storage system is built on physical storage devices and is infrastructure below the operating system. The data splitting method of the present invention ultimately uses a data storage system. FIG. 2C is a diagram showing the position of the data splitting method of the present invention in the computer system hierarchy, that is, where the field of application of the invention sits in that hierarchy.
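The idea that no single storage system holds the complete data can be sketched as follows. The round-robin placement rule, the function names, and the use of in-memory dictionaries in place of separate storage vendors are all assumptions made for illustration.

```python
def split_across(data: bytes, n_stores: int, piece_size: int):
    """Distribute fixed-size pieces round-robin over n_stores.
    Returns the stores plus the placement list (kept as metadata)."""
    stores = [dict() for _ in range(n_stores)]   # stand-ins for vendors
    placement = []                               # (store index, key), in order
    for key, i in enumerate(range(0, len(data), piece_size)):
        store = key % n_stores
        stores[store][key] = data[i:i + piece_size]
        placement.append((store, key))
    return stores, placement


def merge_from(stores, placement) -> bytes:
    """Merging needs every store and the placement metadata."""
    return b"".join(stores[store][key] for store, key in placement)


stores, placement = split_across(b"confidential report", 3, 4)
print(stores[0])                      # {0: b'conf', 3: b' rep'}
print(merge_from(stores, placement))  # b'confidential report'
```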

The splitting and merging of data can be performed at the terminal, or by a server or service provider. In this way, for an attacker and for the data service provider itself alike, the data obtained from any one cloud storage server is incomplete and is not enough to threaten the user's privacy and confidentiality. An attacker needs to obtain the same user's identity in each of the different cloud storage services in order to get the different pieces that make up the complete data, which is often much harder than cracking a single system. In addition, only with the correct merge specification can the fragment data be restored to the original complete data. This gives the user's data an extra layer of protection. Of course, a hacker can attack the user's terminal system to obtain the complete data before the user splits it or after the user merges it. This risk has always existed and has nothing to do with whether cloud storage is used. In general, terminal devices, especially mobile terminals, expose fewer services and are not continuously online, so their risk of direct attack is generally smaller than that of online servers. In addition, an application system with data splitting and merging can split and merge data in real time at runtime, and does not necessarily need to store the pre-split or merged data on the terminal system. In this case, even if the terminal system is attacked, the split and stored data remains safe; and when the terminal system fails, maintenance personnel and the personnel of the enterprise IT department cannot obtain data protected in this way. Take a mail system with a data splitting function as an example: when the data is not in use, there need not be any fragment of it on the terminal side. When a document is sent to someone, the document exists on the recipient's terminal side only after the recipient downloads it. 
Further, consider a hypothetical use of an enhanced mail client based on the data splitting and merging method of the present invention, where the mail server can be a conventional mail server. When an attachment is added to a mail, the content of the attached file is split into multiple parts; several of them are stored in the cloud storage specified by the user, and the others are saved in the mail as ordinary attachments. The user then selects the recipient and sends the email. The mail cloud application system can register the metadata of the original attachment file and the split information (the default metadata stripping protocol, etc.) in the file meta-information database (an online service system in which both the sender and the recipient must have an account), and the corresponding data access link can be set automatically for the sender according to the settings of the client. On the recipient's side, there is no fragment of the data on the terminal before the attachment is downloaded. The actual data is distributed among the cloud storage, the mail server, and the corresponding metadata in the file meta-information database. Of course, the data also exists on the sender's terminal (if the sender is not using a distributed file system and the file has not been deleted). When the recipient uses the same enhanced email client and opens the attachment, the system can automatically locate the corresponding item in the file meta-information database according to the content stored in the email as an ordinary attachment, then locate the part of the content held in the cloud storage, restore it according to the corresponding split method, and finally recover the original data on the recipient's client. Of course, for this process to complete automatically, the account information required by the recipient's mail client must be set in advance. 
There are at least three accounts involved here: the mail system, the cloud storage system, and the file meta-information system.

Corresponding to the data splitting of the present invention, FIG. 2D is a flowchart of a data merging method according to an exemplary embodiment. As shown in FIG. 2D, the present invention provides a data merging method, including:

Step 401B: Receive a data object acquisition request carrying the identification information.

The identification information includes positioning information, and the positioning information is used to locate a storage address of part of the data information in the data object.

Step 402B: Acquire the storage content corresponding to the positioning information, and acquire the data information in other storage content according to the positioning information contained in the acquired storage content, until all data information of the data object is obtained.

Step 403B: Combine the acquired data information according to the preset merge rule in the acquired data information to obtain a data object.

The data merging method of this embodiment receives a data object acquisition request carrying identification information, acquires the storage content indicated by the positioning information in the identification information, and then acquires the data information in other storage content according to the positioning information contained in that storage content, until all the data information constituting the data object has been acquired. The acquired data information is then combined according to the preset merge specification to obtain the complete data object. This increases the difficulty of illegally obtaining the user's original data: even if some of the user's data is obtained through illegal means, it is difficult to obtain a complete and correct data object, so that the security of data storage is realized more reliably.
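The chained acquisition in steps 401B and 402B can be sketched as follows. This is only an illustration under stated assumptions: the stores are in-memory dictionaries, each stored item carries a hypothetical `locators` list pointing at further items, and the names `STORES` and `collect_data_information` are invented for the example.

```python
from collections import deque

# Hypothetical in-memory stand-ins for independent storage systems.
STORES = {
    "meta_db": {"doc-1": {"kind": "metadata", "payload": {"name": "a.txt"},
                          "locators": [("cloud_a", "idx-1")]}},
    "cloud_a": {"idx-1": {"kind": "encoding_order", "payload": [0, 1, 0],
                          "locators": [("cloud_b", "blk-1")]}},
    "cloud_b": {"blk-1": {"kind": "data_fragments", "payload": [b"ab", b"cd"],
                          "locators": []}},
}


def collect_data_information(stores, first_locator):
    """Follow positioning information from item to item until nothing is left."""
    collected = {}
    pending = deque([first_locator])
    seen = set()
    while pending:
        locator = pending.popleft()
        if locator in seen:
            continue  # guard against locators that point back at earlier items
        seen.add(locator)
        store_name, key = locator
        content = stores[store_name][key]
        collected[content["kind"]] = content["payload"]
        pending.extend(content["locators"])
    return collected


# The request carries only the first locator; the rest is discovered on the way.
info = collect_data_information(STORES, ("meta_db", "doc-1"))
```

An attacker holding only one of the three stores sees a single kind of data information and, at most, a pointer to a store for which separate credentials are needed.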

FIG. 2E is a flowchart of a data merging method according to another exemplary embodiment. As shown in FIG. 2E, the present invention provides a data merging method, including:

Step 501B: Receive a data object acquisition request carrying the identification information.

The identification information includes positioning information, and the positioning information is used to locate a storage address of part of the data information in the data object. The type of the data information is one of, or a combination of, the following: metadata, data fragments, encoding, and encoding order.

Step 502B: Acquire the storage content corresponding to the positioning information, and acquire the data information in other storage content according to the positioning information contained in the acquired storage content, until all data information of the data object is obtained.

Step 503B: Combine the acquired data information according to the preset merge rule in the acquired data information to obtain a data object.

Specifically, one or more pieces of data information are obtained according to the positioning information (a piece of data information may be a split data fragment, part or all of the metadata, or part or all of the encoding and encoding order). Then, according to a specific rule, namely the preset merge specification, the remaining data information is acquired step by step starting from those pieces, and all the data information (i.e., the metadata, data fragments, encoding, encoding order, etc.) is combined to recover the original data object. The specific merging is as follows:

A. When the type of the data information is a combination of data segments, encoding, and encoding order, a decoding operation is performed on the encoding according to the merging algorithm in the preset merge specification to obtain the data segment corresponding to each code; the decoded data segments are then arranged according to the encoding order to obtain a data object in which the data segments are in their original order.

B. When the type of data information is a combination of metadata and data fragments:

B1. If the metadata agreed in the preset merge specification includes: attribute information, integrity verification is performed on the data objects merged by each data segment according to the attribute information, to confirm that the attribute of the data object matches the attribute information in the metadata; or,

B2. If the metadata agreed in the preset merge specification includes a data content identifier and a keyword, the data matching the keyword is merged into the data segment corresponding to the data content identifier, and then the data segments are merged to form the data object; or,

B3. If the metadata agreed in the preset merge specification includes attribute information, a data content identifier, and a keyword, the data matching the keyword is merged into the data content corresponding to the data content identifier, and integrity verification is performed, according to the attribute information, on the data object merged from the data segments, to confirm that the attributes of the merged data object match the attribute information in the metadata.

Step 504B: If the metadata includes a unique identifier of the data object, perform integrity verification on the merged data object according to the unique identifier.
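The verification in case B1 and in step 504B can be sketched together as follows. This is a minimal illustration that assumes the only attribute is a size, that the unique identifier is a SHA-1 digest of the object content, and that the fragments arrive already in order; the metadata keys `attributes` and `object_id` and the function `merge_and_verify` are hypothetical.

```python
import hashlib


def merge_and_verify(fragments, metadata):
    """Join fragments in order, then check attributes and the unique identifier."""
    merged = b"".join(fragments)

    # B1-style check: attribute information (here only a hypothetical "size").
    attributes = metadata.get("attributes", {})
    if "size" in attributes and attributes["size"] != len(merged):
        raise ValueError("merged object does not match the size attribute")

    # Step 504B-style check, assuming the unique identifier is a SHA-1 digest.
    object_id = metadata.get("object_id")
    if object_id is not None and hashlib.sha1(merged).hexdigest() != object_id:
        raise ValueError("merged object does not match its unique identifier")
    return merged


original = b"split me into pieces"
metadata = {
    "attributes": {"size": len(original)},
    "object_id": hashlib.sha1(original).hexdigest(),
}
fragments = [original[:7], original[7:14], original[14:]]
restored = merge_and_verify(fragments, metadata)
```

The size check catches missing or extra fragments cheaply, while the digest check also catches fragments that have the right length but altered content.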

The data merging process is in fact the reverse of the data splitting process and operates according to the preset merge specification. In actual operation, the preset merge specification (hereinafter, the merge specification) and the preset split specification (including the preset metadata stripping protocol, the preset data content splitting specification, the preset encoding separation specification, and other split/strip protocols) may be one and the same content. Like the split specification, the merge specification is data information prepared for data recovery; it may also be called a split-merge specification, because the split data must be recoverable. Therefore, the split specification often includes or implies the merge (restoration) specification.

For example, in the mail client, after the client obtains the email attachment, it can locate, according to the attachment name (i.e., the unique identifier of the data object), the data information held in the stored content of the file meta-information database, the mail system, the cloud storage, and so on. The data information includes the splitting algorithm, the data segments, the positioning information, the related file metadata items, and the like. According to the obtained data information, the mail system can locate and download the data segments, derive the inverse algorithm from the splitting algorithm, and merge the data segments and the metadata. If encoding is present, the data segments can be restored according to the encoding to obtain the content of the user's original data object; if the metadata includes the unique identifier of the data object, integrity can be verified, and the file size can be checked and the file name, file type, creation time, etc. recovered based on the file metadata. The split protocol information in this mail client example can serve as the merge specification, where the specific merge specification, that is, the inversion process, can be derived from the data split description document.

It can be seen that, when merging data, the original data cannot be recovered from the data segments alone. At a minimum, the split/strip protocol established during data splitting must be obtained and the merge specification derived from it by reverse parsing, or the preset merge specification must be obtained directly. Typically, the system retains the relevant split/strip protocol after data splitting and stores the related positioning information (such as its storage location) in the split data segments or in any designated, accessible storage space. Of course, it is also possible to generate the merge specification corresponding to the split/strip protocol directly during data splitting and store it in each of the split data segments or in another specified location; in that case, the merge process only needs to obtain the merge specification directly. Subsequently, the system finds or extracts the corresponding split metadata according to the obtained split/strip protocol or merge specification, and splices the data segments together based on the split/strip protocol or merge specification, the metadata, and other such information, thereby recovering the original data.

Further, performing the decoding operation on the encoding according to the merging algorithm in the preset merge specification to obtain the data segment corresponding to the encoding includes:

disassembling the data information according to the merging algorithm in the preset merge specification to obtain a meta code, or a meta code and an instance code; and

obtaining the data object corresponding to the data information according to the meta code and the encoding specification, or according to the meta code, the encoding specification, and the instance code.

It should be noted that, for the specific decoding process, reference may be made to the embodiments of the decoding processing method later in this specification; details are not repeated here.

The following specific example illustrates the complete process of splitting and merging a data object. It should be noted that the specific data, algorithms, and the like in this example are merely exemplary and are not intended to limit the present invention. Split target: the information of the data object is divided into three parts: a metadata block, a data block (i.e., the data segments), and an index block (i.e., the encoding). Any information dispersal algorithm, such as IDA, can be used. In this example, the content of the source file, after lossless compression, is divided into four-byte (32-bit) units. It should be noted that compression is not required. The divided units are sorted and deduplicated, that is, duplicates are eliminated, and the distinct units are saved as a data block file. Each divided data block (data segment) is then replaced by its index (encoding) in the data block file, and the indexes are saved in the original order as an index file (the encoding and its arrangement order information). The file names of the data block file and the index file may be hash values (MD5, SHA-1, etc.) of the corresponding file contents, system-generated globally unique identifiers (GUIDs), or any other globally unique codes. The file name, size, date, and other information of the source file, together with the file names of the data block file and the index file, can be stored in the metadata database. As long as these three parts (the metadata block; the data block, i.e., the data segments; and the index block, i.e., the encoding and encoding order information) are stored separately in multiple cloud storage systems, the intended security protection is achieved. 
The deployment is flexible: the data block file and the index file can be placed in one file-based cloud storage and the metadata in a separate cloud database; the three pieces of data can also be stored in three different cloud storage services; and, to increase availability, separate redundant backups can be provided for each piece of data. In addition, in multi-person data sharing and collaboration, sharing is even more flexible, since the three pieces of data can be shared through a combination of communication and sharing methods: email, cloud sharing, instant messaging, FTP, etc. After obtaining the three pieces of data, or access authorization to the storage systems that hold them, the system can restore the target file through the data merge process: for example, following the encoding and the arrangement order information in the index file, the four-byte contents at the corresponding index positions of the data block (data segment) file are spliced together, and the spliced result is decompressed (if it was previously compressed) to obtain the target file. In this general-purpose split storage system, a desktop agent can also be provided. This desktop agent is built on top of the desktop agents of the underlying cloud storage services; it automates the above splitting and merging process and brings convenience to users. For example, the split storage desktop agent of the user client runs in the background of the system, as do the desktop agents of Google Drive and Microsoft's OneDrive. Google Drive has a directory C:\GDrive that automatically syncs with Google's cloud storage, and OneDrive has a directory C:\MDrive that automatically syncs with Microsoft's cloud storage. The sync directory corresponding to the split storage desktop agent is C:\DDrive. 
When the user saves a file to C:\DDrive, the desktop agent service detects the change in the file system and automatically splits the file, saving the data block (data fragment) file to C:\GDrive, the index file (the encoding and its arrangement order information) to C:\MDrive, and the metadata to the dedicated database cloud service. The Google and Microsoft desktop agent services then automatically sync the data block file and the index file to Google's and Microsoft's cloud storage respectively, and from there to the corresponding directories on the user's other terminals. If such a terminal runs the split storage desktop agent, the agent detects the changes in the C:\GDrive and C:\MDrive directories, automatically obtains the metadata, merges it with the data block file and the index file into the original file, and saves that file in the C:\DDrive directory, thereby synchronizing the split/merge storage.
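The split and merge in this example can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes zlib for the lossless compression, zero-padding of the compressed data to a multiple of four bytes, four-byte big-endian indexes, and SHA-1 file names. The names `split_object` and `merge_object` and the metadata keys are hypothetical.

```python
import hashlib
import zlib

BLOCK_SIZE = 4  # four bytes (32 bits) per data block


def split_object(source: bytes, file_name: str):
    """Split a source file into metadata, a data block file, and an index file."""
    compressed = zlib.compress(source)
    # Assumption: pad with zero bytes so every block is exactly four bytes long.
    padded = compressed + b"\x00" * (-len(compressed) % BLOCK_SIZE)
    blocks = [padded[i:i + BLOCK_SIZE] for i in range(0, len(padded), BLOCK_SIZE)]

    unique_blocks = sorted(set(blocks))  # sort and deduplicate
    position = {block: n for n, block in enumerate(unique_blocks)}
    block_file = b"".join(unique_blocks)
    # One index per original block, kept in the original order.
    index_file = b"".join(position[block].to_bytes(4, "big") for block in blocks)

    metadata = {
        "file_name": file_name,
        "size": len(source),
        "compressed_size": len(compressed),
        "block_file": hashlib.sha1(block_file).hexdigest(),
        "index_file": hashlib.sha1(index_file).hexdigest(),
    }
    return metadata, block_file, index_file


def merge_object(metadata, block_file: bytes, index_file: bytes) -> bytes:
    """Rebuild the source file from the three separately stored parts."""
    indexes = [int.from_bytes(index_file[i:i + 4], "big")
               for i in range(0, len(index_file), 4)]
    padded = b"".join(block_file[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE] for n in indexes)
    source = zlib.decompress(padded[:metadata["compressed_size"]])
    if len(source) != metadata["size"]:
        raise ValueError("restored file does not match the recorded size")
    return source


original = b"hello hello hello " * 50
meta, block_data, index_data = split_object(original, "greeting.txt")
restored = merge_object(meta, block_data, index_data)
```

In the desktop-agent scenario, `block_data` would be written to C:\GDrive, `index_data` to C:\MDrive, and `meta` to the database service. Without the index file, the data block file is only a sorted set of four-byte values; without the data block file, the index file is only a list of positions.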

FIG. 2F is a schematic structural diagram of a data splitting apparatus according to an exemplary embodiment. As shown in FIG. 2F, the present invention provides a data splitting apparatus, including: an obtaining and stripping module 61B, configured to, upon receiving a storage request carrying an identifier of data to be stored, obtain the metadata in the data object corresponding to the identifier according to the preset metadata stripping protocol, and strip the obtained metadata from the data object; a segmentation module 62B, configured to split the data content into at least two data segments according to the preset data content splitting protocol; and a storage module 63B, configured to store the metadata and the individual data segments in different storage bodies or in different secure channels.

The data splitting apparatus of this embodiment, upon receiving a storage request carrying an identifier of data to be stored, obtains the metadata in the data object corresponding to the identifier and strips it from the data object; splits the data content into multiple data segments according to the preset data content splitting protocol; and stores the metadata and each data segment separately in different storage bodies or in different secure channels. This increases the difficulty of illegally obtaining the user's original data, so that the security of data storage is realized more reliably.

On the basis of the above embodiment, FIG. 2G is a schematic structural diagram of a data splitting apparatus according to another exemplary embodiment. As shown in FIG. 2G, the obtaining and stripping module 61B includes: a receiving submodule 611B, configured to receive a storage request carrying an identifier of data to be stored; a determining submodule 612B, configured to, when the receiving submodule 611B receives such a storage request: if the metadata agreed in the preset metadata stripping protocol includes attribute information, determine as metadata the attribute information content, in the data object corresponding to the identifier, that matches the attribute information; or, if the metadata agreed in the preset metadata stripping protocol includes a data content identifier and a keyword, determine as metadata, according to the data content identifier, the data that matches the keyword within the data content of the data object corresponding to the identifier; or, if the metadata agreed in the preset metadata stripping protocol includes attribute information, a data content identifier, and a keyword, determine as metadata both the attribute information content in the data object that matches the attribute information and, according to the data content identifier, the data that matches the keyword within the data content of the data object; and a stripping submodule 613B, configured to strip the metadata determined by the determining submodule 612B from the data object.

Further, the obtaining and stripping module 61B includes: a parsing submodule 614B, configured to, when the metadata agreed in the preset metadata stripping protocol includes a data object identifier, parse the data object to generate a data object identifier uniquely corresponding to the data object.

Further, the apparatus includes: an encoding module 64B, configured to encode each data segment according to a preset encoding separation protocol to obtain the code corresponding to each data segment; and an arranging module 65B, configured to arrange the codes according to the original order of the data segments in the data content to obtain the encoding order information. The storage module 63B is specifically configured to store the metadata, the code corresponding to each data segment, and the encoding order information in different storage bodies or in different secure channels.

Further, the apparatus includes: an identifier generating module 66B, configured to generate an encoding order information unique identifier based on the encoding order information, and/or generate a data segment unique identifier for each data segment based on that data segment. The storage module 63B is further configured to store the encoding order information unique identifier and/or the data segment unique identifiers as part of the metadata.

The preset data content splitting protocol includes at least one of a disk array (RAID) splitting algorithm and an information dispersal algorithm (IDA).

The implementation method and principle of the above data splitting device are similar to the data splitting method, and are not described here.

FIG. 2H is a schematic structural diagram of a data merging device according to an exemplary embodiment. As shown in FIG. 2H, the present invention provides a data merging device, including:

The receiving module 81B is configured to receive a data object acquisition request that carries the identification information, where the identification information includes positioning information, and the positioning information is used to locate a storage address of the partial data information in the data object.

The obtaining module 82B is configured to acquire the storage content corresponding to the positioning information, and acquire the data information in other storage content according to the positioning information contained in the acquired storage content, until all the data information of the data object is obtained.

The processing module 83B is configured to combine the acquired data information according to the preset merge protocol in the acquired data information to obtain a data object.

The data merging device of this embodiment receives a data object acquisition request carrying identification information, acquires the storage content indicated by the positioning information in the identification information, and then acquires the data information in other storage content according to the positioning information contained in that storage content, until all the data information constituting the data object has been acquired. The acquired data information is then combined according to the preset merge specification to obtain the complete data object. This increases the difficulty of illegally obtaining the user's original data: even if some of the user's data is obtained through illegal means, it is difficult to obtain a complete and correct data object, so that the security of data storage is realized more reliably.

On the basis of the foregoing embodiment, FIG. 2I is a schematic structural diagram of a data merging device according to another exemplary embodiment. As shown in FIG. 2I, the type of the data information is one of, or a combination of, the following: metadata, data fragments, encoding, and encoding order.

A. When the type of the data information is a combination of data segments, encoding, and encoding order, the processing module 83B includes: a decoding submodule 831B, configured to perform a decoding operation on the encoding according to the merging algorithm in the preset merge specification to obtain the data segment corresponding to each code; and an arranging submodule 832B, configured to arrange the decoded data segments according to the encoding order to obtain a data object in which the data segments are in their original order.

B. When the type of the data information is a combination of metadata and data segments, the processing module 83B is specifically configured to: if the metadata agreed in the preset merge specification includes attribute information, perform integrity verification, according to the attribute information, on the data object merged from the data segments, to confirm that the attributes of the data object match the attribute information in the metadata; or, if the metadata agreed in the preset merge specification includes a data content identifier and a keyword, merge the data matching the keyword into the data segment corresponding to the data content identifier, and then merge the data segments to form the data object; or, if the metadata agreed in the preset merge specification includes attribute information, a data content identifier, and a keyword, merge the data matching the keyword into the data content corresponding to the data content identifier, and perform integrity verification, according to the attribute information, on the data object merged from the data segments, to confirm that the attributes of the merged data object match the attribute information in the metadata.

Further, the apparatus includes: an integrity verification module 84B, configured to, if the metadata includes a unique identifier of the data object, perform integrity verification on the merged data object according to the unique identifier.

The implementation method and principle of the foregoing data merging device are similar to the data merging method, and are not described herein again.

A software/hardware implementation in accordance with the present invention is presented below as a specific example, in conjunction with the embodiments of the splitting and merging methods and apparatuses described above.

For split-based applications, splitting is primarily a matter of how the system distributes data across multiple stores within the system architecture. Such systems typically use metadata, encoding, and domain-related splitting of data content. It is therefore natural to split along the lines of the application domain, that is, to use a domain-related split method. The data split/strip and merge processes are often built into the system's data access layer and tied to domain-related business logic. Whether data splitting is domain-related or domain-independent, the split/strip methods can vary widely. We therefore introduce the concept of a "data split description language" (which can be used as part of the split/merge protocol) to configure the data splitting process. In this way, the system or the user can split/strip data at runtime using a dynamically chosen split/strip method. The description of the split/strip method itself (which can be part of the split specification) can be stored in a particular store as part of the stripped-out metadata. Different data can have different split/strip methods. Consequently, the merging of data also varies from data to data, and the merging process must be based on an understanding of the split/strip method description. The data split/strip/merge engine is a system component that parses and executes the data split/strip description information to complete the splitting, stripping, and merging of data. At the heart of the data split description language and the data split/strip/merge model is the data processor model. A data processor is a software/hardware component that processes data. The processor that implements the split function is called the splitter, and the processor that performs the corresponding merge is called the combiner; both are data processors. Compressors, decompressors, encryptors, decryptors, savers, extractors, and the like are also data processors. 
The core of a data processor is its processing; in addition, it has several input ports (data input ports and parameter input ports) and several output ports. A data input port corresponds to a data input, an output port corresponds to a data output, and a parameter input port corresponds to parameter information needed during data processing. For example, a compressor has one data input port (plus an additional password parameter input port when a compression password is used) and one data output; a splitter has one data input and multiple data outputs; a combiner has multiple data inputs and one data output; a saver has one data input, multiple parameter inputs (for the storage location, access information, etc.), and no output (its processing is to submit the input to storage); an extractor has no data input and one data output. There is also a special kind of data processor, the generator, which has no data input (though sometimes a parameter input) and one or more data outputs; its data output often takes part in the overall data processing as a parameter of other processing. A distributor has one data input and multiple data outputs, and each output is identical to the input data. The output of one processor must be connected to an input of another processor (either a data input or a parameter input). We can also see that almost every data processor has a corresponding reverse processor; otherwise, the data merge process could not be derived from the data split description. The only exception is the data generator: data generation is generally irreversible, so in the reverse process the generated data is treated as given (it can be obtained directly or indirectly from storage or from other processors). In general, the data input of a data processor is the data output of its corresponding reverse processor, and its data output is the data input of its reverse processor; the parameter inputs remain unchanged. 
The splitter corresponds to the combiner, the encryptor to the decryptor, the compressor to the decompressor, the saver to the extractor, the distributor to a reversed distributor (whose processing selects among its data input ports), and so on. The whole process of data splitting/stripping/merging is in fact carried out by a network of data processors, and in essence it can be characterized by the Petri net model: a processing is a transition, an input port is a place, and the connection from an output to the next input port is a directed arc. The directed arc from a data processor's input port to its processing is implicit inside the processor: when all data ports hold data (tokens), the processing is automatically activated and the data flows onward.
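The Petri net reading of the processor network can be sketched as follows. This is a minimal single-threaded illustration under stated assumptions: ports are modeled as places keyed by (processor, port), tokens are plain Python values, and the processors `compress`, `split_in_two`, and `save`, the `TRANSITIONS` table, and the `storage` dictionary are all hypothetical.

```python
import zlib

storage = {}  # hypothetical stand-in for external storage systems


def compress(data):
    return (zlib.compress(data),)


def split_in_two(data):
    half = len(data) // 2
    return (data[:half], data[half:])


def save(data, location):
    storage[location] = data
    return ()  # a saver has no output


# name -> (processing, input ports, arc from each output to (processor, port))
TRANSITIONS = {
    "compressor": (compress, ["data"], [("splitter", "data")]),
    "splitter": (split_in_two, ["data"], [("saver_a", "data"), ("saver_b", "data")]),
    "saver_a": (save, ["data", "location"], []),
    "saver_b": (save, ["data", "location"], []),
}


def run(transitions, marking):
    """Fire enabled processors until none is enabled; return the firing order."""
    marking = dict(marking)  # place (processor, port) -> token
    fired = []
    progress = True
    while progress:
        progress = False
        for name, (processing, in_ports, out_arcs) in transitions.items():
            places = [(name, port) for port in in_ports]
            if all(place in marking for place in places):
                tokens = [marking.pop(place) for place in places]  # consume tokens
                outputs = processing(*tokens)
                for arc, value in zip(out_arcs, outputs):
                    marking[arc] = value  # token flows to the next input port
                fired.append(name)
                progress = True
    return fired


source = b"some user data " * 20
order = run(TRANSITIONS, {
    ("compressor", "data"): source,
    ("saver_a", "location"): "cloud_a",  # parameter ports are filled up front
    ("saver_b", "location"): "cloud_b",
})
```

Each saver is enabled only once both its data port and its parameter port hold a token, which is the activation rule described above.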

The aforementioned data split description language is mainly used to describe how data processors are assembled into a data flow graph. A document written in the data split description language is called a data split description document. A data flow graph described in a data split description document is itself essentially a data processor, so one data flow graph can be used as a data processor within another. A data split description document defines one or more data flow graphs. For a document that is used directly for data splitting, the entry flow graph must be specified. Each data flow graph includes multiple data processors and their connections; the connections are described at the data output ports of the data processors. Each data flow graph has a designated starting data processor. Data split description documents can be rendered and edited graphically. The data splitting and merging engine splits and merges data according to the data split description document. The corresponding data splitting process is shown in FIG. 2J: step 1001B, acquiring the metadata of the data object to be separated; step 1002B, creating a data separation storage document according to the metadata; step 1003B, reading the data separation storage document; step 1004B, instantiating the data separation storage document into a data flow graph (instantiating the data processors and establishing the connections between them); step 1005B, passing the data to be separated to the starting data processor of the data flow graph; and step 1006B, destroying the data flow graph after it has executed.

As can be seen, the main work of data splitting is actually performed by the data processors in the data flow graph. The data splitting and merging engine is mainly responsible for loading the data split description document, instantiating it as an executable data flow graph, and finally passing the data to the flow graph for processing. The data processor is an active object: the instantiated processor object has its own thread/process, which constantly checks its own execution condition. Once it finds that all input ports have data, it executes automatically and passes the result to other data processors. After completing these operations, it destroys itself. The flow is shown in FIG. 2K: step 1101B, determining whether data has been transmitted to an input port; if so, step 1102B is performed, and if not, step 1103B is performed; step 1102B, receiving the input data; step 1103B, determining whether all data ports have data (if an empty input port, usually a parameter port, is found, that is, an input port without any data source, the user is allowed to enter the corresponding information through an interactive interface); if so, step 1104B is performed, and if not, execution returns to step 1101B; step 1104B, executing the data processing procedure; step 1105B, passing the processing result to the data processor connected to the output.

The corresponding data merging process is shown in FIG. 2L: step 1201B, locating the corresponding data separation storage document according to the input information; step 1202B, reading the data separation storage document; step 1203B, instantiating the data separation storage document into the corresponding reverse data flow graph; step 1204B, destroying the data flow graph after the flow graph has executed.

When restoring split data, the input information may be a reference code of the data split document, or it may be a part of the data content after splitting. In the latter case, the hash value of that content, obtained by a hash function, can also be used as a reference code for the document. (A hash function is a method of creating a small digital "fingerprint" from data content. The same content always yields the same fingerprint, and the fingerprint is considered not to conflict with other fingerprints.) With this code, the corresponding data split document can be obtained. The data split document describes the splitting process; for merging, the corresponding reverse process must be derived. The inversion starts from the starting data processor and proceeds by traversing the related data processors along their output ports. How a data processor is inverted varies by type, but in general its type is changed to the inverse process type, its data input ports become output ports, and its data output ports become input ports. The input parameter ports are unchanged.
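The general inversion rule (inverse type, data input and output ports swapped, parameter ports unchanged) can be sketched as follows. The `ProcessorSpec` structure, the port names, and the restriction to a linear chain of processors are assumptions made for illustration; the source describes traversal of a general flow graph along output ports.

```python
from dataclasses import dataclass, field

# Processor types and their inverses, following the pairs named in the text.
INVERSE_TYPE = {
    "splitter": "combiner", "combiner": "splitter",
    "encryptor": "decryptor", "decryptor": "encryptor",
    "compressor": "decompressor", "decompressor": "compressor",
    "saver": "extractor", "extractor": "saver",
}


@dataclass
class ProcessorSpec:
    type: str
    data_inputs: list
    data_outputs: list
    params: dict = field(default_factory=dict)


def invert(spec):
    """Change the type to its inverse and swap the data ports; keep the parameters."""
    return ProcessorSpec(
        type=INVERSE_TYPE[spec.type],
        data_inputs=list(spec.data_outputs),
        data_outputs=list(spec.data_inputs),
        params=dict(spec.params),
    )


def invert_chain(specs):
    """Invert a linear chain of processors: reverse the order, invert each one."""
    return [invert(spec) for spec in reversed(specs)]


split_chain = [
    ProcessorSpec("encryptor", ["plain"], ["cipher"], {"key": "from-config"}),
    ProcessorSpec("splitter", ["cipher"], ["block", "encoded"]),
]
merge_chain = invert_chain(split_chain)
```

In the resulting merge chain, the combiner takes the block data and encoded data as inputs and feeds the decryptor, which keeps the same key parameter port.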

For example, the data split description language definition is shown in Figure 2M; the data split description language visualization flow chart is shown in Figure 2N; the data split description document sample is shown in Table 1:

Table 1, data split description document sample

The specific splitting process is as follows: the data to be split is first DES-encrypted, with the encryption key taken from the system configuration storage; the encrypted data is split into block data and encoded data by 4-byte split coding; the encoded data is stored in Amazon S3 cloud storage, and its SHA1 hash value is stored in the metadata database as a key value for addressing the corresponding metadata; the block data is stored in a local file whose file name is a system-generated GUID, and this GUID is also stored in the metadata database as a key value. The related records of the metadata database are shown in Table 2, and the mapping between split items and metadata is shown in Table 3.

Table 2, metadata table:

Table 3, split items, metadata mapping table:

When either of these two key values is obtained, the corresponding data split description document can be located, and the data can thereby be recovered.
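A minimal sketch of this two-key lookup is given below. The in-memory dictionary `metadata_db` and the functions `register_split` and `find_document` are hypothetical stand-ins for the metadata database; only the use of a SHA-1 hash of the encoded data and of a generated GUID as key values comes from the description above.

```python
import hashlib
import uuid

# Hypothetical stand-in for the metadata database:
# key value -> reference to the data split description document.
metadata_db = {}


def register_split(encoded_data, document_ref):
    """Record both key values for one split operation and return them."""
    sha1_key = hashlib.sha1(encoded_data).hexdigest()  # hash of the encoded data
    guid_key = str(uuid.uuid4())                       # file name of the block data
    metadata_db[sha1_key] = document_ref
    metadata_db[guid_key] = document_ref
    return sha1_key, guid_key


def find_document(key=None, encoded_data=None):
    """Locate the split description document from either key value."""
    if key is None:
        key = hashlib.sha1(encoded_data).hexdigest()
    return metadata_db.get(key)


sha1_key, guid_key = register_split(b"encoded part", "split-doc-1")
```

Either the GUID (taken from the local file name) or the encoded data itself (from which the hash is recomputed) is enough to find the document.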

It is not difficult to see from the above description that, for the three concepts of the present invention, namely (1) the handwriting input system and method, (2) the object-based data encoding scheme, and (3) the object-based data splitting scheme, the respective technical effects can be obtained by implementing each technical solution separately. Preferably, these concepts can be combined with one another, or one or more of them can be combined with other applications, in which case their value and beneficial effects are realized more fully. FIG. 2O illustrates the relationships between the various concepts under the above three concepts, together with some specific application examples that can be derived from them. These specific applications are merely exemplary, and there are more variations in practice, so the present invention has very broad application prospects.

After decades of development, information technology has now entered a network era in which it is highly integrated with communication technologies. The traditional data processing system based on standardized encoding laid a solid foundation for modern computer technology, but it does not meet the various needs of networked personal computing, such as personalization, security, and efficiency. To adapt to these developments and make up for these deficiencies, the present invention not only provides a novel handwriting input method and system, but also combines it with the object-based open codec scheme and the object-based data splitting/stripping/merging method of the present invention. Building on the traditional data processing system, the resulting data processing method and system constitute a future-oriented, network-based data processing system that is open, secure, and efficient in the true sense.

Regarding the codec processing method of the present invention described below, some basic background is introduced first. The emergence and development of the computer are inseparable from coding techniques, and many coding techniques are available. As a foundational computer technology, coding is widely used in the transmission, storage, and processing of data, and its importance is self-evident. At the same time, the rise of cloud computing, big data, and the Internet of Things is poised to bring new opportunities and challenges to coding technology.

Specifically, encoding methods can in essence be divided into two categories: content encoding and reference encoding.

Content encoding is a method of digitizing or converting the content of the encoded object. Base64 encoding, the various data compression encodings (lossless, lossy, etc.), image encodings (JPEG, SVG, etc.), and video and audio encodings (PCM, MP3, MP4, etc.) all belong to the category of content encoding. The digitized content of the data itself is directly included in the result of content encoding and can be analyzed and processed by the computer. There is also a class of structured coding techniques for describing the structural information of data, which mainly encode the content of structured data/documents. For example, HTML, MathML, and SVG are specific structured description languages, and the corresponding coding specification is the metalanguage XML. Similar coding specifications include JSON and Protocol Buffers.

Unlike content encoding, the result of reference encoding is not the data content itself, but a reference to the content or a description of the addressing path used to access the object. Huffman coding is a method of establishing an optimized reference encoding for source symbols (the content itself). URLs, IP addresses, RFID tags, barcodes, QR codes, ISBNs, zip codes, and the like are all reference codes. It is worth mentioning that text encoding (especially standard encoding) is essentially a reference encoding: each code corresponds to the position of a specific character in the text encoding scheme. The sound, shape, meaning, and other data of the character itself are reflected only in the coding specification.

With the standardization of some reference encodings (rather than encoding methods), a computer program can process the codes directly without the corresponding content (or with the corresponding content already built into the program). Standardized coding systems such as ASCII and Unicode are examples. Such codes and combinations of codes themselves already constitute a higher level of data content, and standardized text encoding is a typical example. Many of today's text-based coding conventions (such as JSON, CSV, and XML) are built on this.

Regarding objects and models: an object (a term that is translated differently in Taiwan) is a term of object orientation. It represents a specific thing in the problem space of the objective world and is the fundamental element in the solution space of the software system.

The OMG (Object Management Group), a non-profit standardization organization in the computer field, has successfully defined a set of languages and standards for object modeling. The OMG divides models into four levels of abstraction: the meta-metamodel layer (M3), the metamodel layer (M2), the model layer (M1), and the runtime data object layer (M0). The meta-metamodel layer contains the elements needed to define a modeling language; the metamodel layer defines the structure and syntax of a modeling language, and can be mapped concretely to UML (Unified Modeling Language) or to object-based programming languages such as Java and C#; the model layer defines a specific system model, namely the class or object model in common usage; the runtime layer contains the state of model objects at run time, that is, what are usually called objects or instances.

FIG. 3 is a schematic diagram of a metamodel in the prior art. As shown in FIG. 3, the Meta-Object Facility (MOF) is a standardized specification defined by the OMG for establishing metamodels (M2). MOF includes a metamodeling language (the M3 model) and methods for creating and manipulating models and metamodels.

The object model has multiple aspects: static models that represent structure and functionality, and dynamic models that describe runtime behavior. The main focus of this document is on the static models related to coding, including data and interfaces.

Regarding reference encodings and object identifiers: an object's identifier (ID) is actually a reference encoding. Within the context in which the object identifier is used, the identifier must be unique and correspond one-to-one with the object. In this way, the system can locate the corresponding object by addressing it through its identifier.

Most of the time, object reference codes and object identifiers are the same concept, because they serve the same purpose. Sometimes, however, a reference code cannot be used as an object identifier: a reference code is only guaranteed to address the target correctly and does not necessarily correspond one-to-one with the object. There can be a many-to-one situation (one object, multiple codes). For example, a host can have multiple IP addresses, and the same website can have multiple URLs.

In addition, in the field of computer science, reflection refers to a class of applications that are self-describing and self-controlling. That is, such applications use some mechanism to represent and examine their own behavior, and can adjust or modify the state and related semantics of the behavior they describe according to the state and results of that behavior.

Reflection is supported by modern software development platforms, tools, and programming languages. For example, on the Java and .NET platforms, reflection can be used to obtain metadata directly from running objects at run time.

In addition, the encoding and decoding method of the present invention is implemented as an object-based encoding system. FIG. 4 is a schematic diagram of the architecture of the encoding system of the present invention. As shown in FIG. 4, the encoding system is mainly divided into three parts: the client, the encoding server, and the data storage end. The encoding server and the data storage end together constitute the encoding repository.

As shown in FIG. 4, the client can obtain the corresponding data object by sending a code to the encoding repository, and can obtain the corresponding code by sending a new data object to the encoding repository. Inside the encoding repository, the encoding server provides services to the client. An encoding repository can include one or more data storage ends, in which the real data is stored. The encoding server can send data queries to the data storage ends to obtain, update, and insert the relevant data.

The code repository provides a centralized encoding service that allows different clients to share data objects and encoding meta-objects through reference encoding. Further, a variety of different systems can register new encoding meta-objects with the code repository to meet a variety of coding requirements. This centralized coding service makes data integration and exchange between systems easier. In general, the code repository has a built-in data access control system that provides different access rights for different data objects and encoding meta-objects. In particular, the encoding meta-objects and the data objects can be stored on different data storage ends and/or be given different data access rights. In an object-based coding system, the encoding meta-information is stored in the encoding repository, while the data object itself may exist either in the encoded stream (content encoding) or in the storage system of the encoding repository (reference encoding), in which case the reference code of the data object exists in the encoded stream. The encoded stream and the data objects in the code repository can be placed in different secure channels. This separation of information provides inherent security on the one hand and better coding efficiency on the other.

In a specific implementation, the data storage end can be implemented by using different storage systems such as file storage, relational database, NoSQL database, and cloud storage.

Specifically, the present invention proposes a new object-based encoding and decoding scheme and system, which is also an open solution. In contrast to standard coding schemes, an object-based open coding scheme can be completely personal and non-standard. "Non-standard" here means that it differs from traditional standards that are developed and promoted by an institution or organization; in essence, it is a de facto standard (coding protocol) based on the encoding repository. This solution not only provides more flexible and diverse data services, but also provides more reliable security for data.

The coding scheme of the present invention can encode data of any type and any length, can have any coding format and arbitrary coding word length, and the coding rules can be not fixed, that is, the coding rules can be randomly changed as needed. This makes it possible to create fully personalized coding. In other words, the coding scheme of the present invention is an encoding scheme that can encode an arbitrary object and is independent of the length of the object data, the encoding rule, and the length of the encoded word. This greatly breaks through the inherent form and limitations of existing standard coding. This coding scheme can be arbitrarily expanded. The same code can also be reused in different encoding processes without affecting each other, thus greatly improving the utilization of the code.

The concept of the coding scheme of the present invention consists in creating an encoding protocol for the data object based on the metadata of the data object and generating the encoding according to the encoding specification. In other words, the present invention can acquire the features or structures of the data objects in an encoded manner and generate corresponding codes for the data objects in accordance with the features and/or structures of the encoded objects.

Furthermore, for data based on existing standard text encoding schemes, every party involved in the transmission, as well as the receiving and storing parties, has the opportunity to obtain all the information in the data. This is not conducive to the confidentiality of the data. It also makes the volume of transmitted data large, which increases the burden on network bandwidth and CPU processing, especially for large-scale data transmission, and thus reduces transmission efficiency.

Another feature of the present invention is that it is only necessary to store the data objects to be transmitted in the code repository and set the corresponding data access rights in order to obtain the corresponding reference codes. When transmitting, only the reference code of the data object needs to be passed, and only a receiver that has the data access right can obtain the complete data. This can greatly reduce the amount of data transferred while increasing the security and reliability of the data.
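The idea of transmitting only a reference code, with the repository enforcing access rights, can be sketched as follows. The class `CodeRepository`, its methods, and the per-object set of allowed user names are illustrative assumptions, not the interface of the system described here.

```python
class AccessDenied(Exception):
    """Raised when a user without the access right tries to decode a code."""


class CodeRepository:
    """Hypothetical in-memory code repository with per-object access rights."""

    def __init__(self):
        self._objects = {}  # reference code -> (data object, allowed users)
        self._next_code = 1

    def store(self, data, allowed_users):
        """Store a data object and return its reference code."""
        code = self._next_code
        self._next_code += 1
        self._objects[code] = (data, frozenset(allowed_users))
        return code

    def fetch(self, code, user):
        """Return the data object only to a user who holds the access right."""
        data, allowed = self._objects[code]
        if user not in allowed:
            raise AccessDenied(f"{user} may not read object {code}")
        return data


repo = CodeRepository()
document = b"x" * 10_000                 # large data object kept in the repository
code = repo.store(document, {"bob"})     # only the small code is transmitted
received = repo.fetch(code, "bob")       # an authorized receiver gets the data
```

An intermediary that intercepts `code` learns nothing about the content unless it also holds the access right in the repository.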

In addition, the present invention differs from the encryption of data in the prior art. In general, encryption does not require the participation of any metadata; it is only necessary to convert the original data into content that cannot be normally recognized or displayed. Although the present invention can also achieve the effect of encryption, it protects data in a completely different way: the data content is protected by isolating it, through encoding, by means of the metadata of the data object. Moreover, encrypted ciphertext is usually the same size as or larger than the original plaintext, whereas the present invention only needs to transmit a very small amount of information, such as the corresponding reference code. Beyond security, the concept of the present invention also provides more useful functions and more room for operation in data processing. For example, but without limitation, it can reduce the amount of data transmitted and lighten the network load, and the flexibility of the coding makes subsequent data processing more convenient.

Although the secret key and the encrypted data need to be stored or transmitted separately after encryption, encryption must convert the original data, by a predetermined rule or algorithm, into code or data completely different from the original so that it cannot easily be identified by a third party. The present invention, by contrast, can fully preserve the original form of the data content and achieve security and confidentiality without any modification to the content, which a conventional encryption system cannot do.

In addition, in the encryption process, usually only one secret key is needed, and the open system of the present invention can assign different encodings to each data segment in the encoding process, and can also set different access rights for different users. This allows for more granular security.

As mentioned earlier, because object reference encoding is similar to standard text encoding, the basic coding form of object-based encoding can be extended from the standard text coding form. A standard character thus becomes a special object (one whose encoding metadata is built in), and an object reference code becomes a special character, a non-standard character. Unlike the prior art, the present invention can directly accept the digitized result of natural human output, divide it into different data objects according to certain rules, and place them in the encoding repository to form non-standard characters. (In this document, a non-standard character is an object reference code based on the encoding repository, with the emphasis that the data object is a piece of data obtained by splitting the digitized result of human output.) There is no need to be concerned with the content of each character or with the relationship between adjacent characters, so data can be stored and processed with the character as the basic unit, just as in existing systems based on standard text. This also greatly expands the flexibility of subsequent editing, encoding, and storage operations.

Preferably, the present invention can establish a proprietary font for a writer by assigning custom unique codes to all or to fragments of the digitized result of each individual's natural output. In this case, since no information input by the user is required as a reference datum, the user can input or add to his or her own font at any time, which eliminates the trouble of inputting reference font information in advance, as required in Chinese Patent No. CN103136769A.

The invention can also place object reference codes in different coding spaces, such as coding spaces divided by user (different users can use the same reference code to correspond to different data objects in the encoding repository), coding spaces divided by date, coding spaces divided by geographic location, coding spaces divided by department, and coding spaces divided by online session. A coding space divided by session has a very strong security property: the reference codes of the data exist in the coding space corresponding to the session, and when the session ends the corresponding coding space disappears, so that none of the codes in that space can be decoded correctly any longer. With this feature, a "burn after reading" effect can be achieved. Preferably, introducing coding spaces and adopting variable-length coding can greatly reduce the storage consumed by reference codes and improve the efficiency of transmission, processing, and storage.
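The behavior of coding spaces can be illustrated with a small sketch. The class `CodingSpaces` and its method names are hypothetical; the sketch assumes that each space numbers its own codes independently and that closing a space discards its contents.

```python
class CodingSpaces:
    """Hypothetical repository in which each coding space numbers its own codes."""

    def __init__(self):
        self._spaces = {}  # space id -> {reference code: data object}

    def open_space(self, space_id):
        self._spaces[space_id] = {}

    def encode(self, space_id, data):
        space = self._spaces[space_id]
        code = len(space) + 1  # codes are only unique within one space
        space[code] = data
        return code

    def decode(self, space_id, code):
        return self._spaces[space_id][code]

    def close_space(self, space_id):
        # Once the space is gone, its codes can no longer be decoded.
        del self._spaces[space_id]


spaces = CodingSpaces()
spaces.open_space("user:alice")
spaces.open_space("session:42")
alice_code = spaces.encode("user:alice", "alice's glyph")
session_code = spaces.encode("session:42", "one-time message")

same_code_reused = alice_code == session_code  # both spaces issued code 1
message = spaces.decode("session:42", session_code)
spaces.close_space("session:42")               # the session ends
```

After `close_space`, decoding `session_code` in that space raises `KeyError`, which corresponds to the "burn after reading" effect, while the same numeric code in the user's space is unaffected.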

Due to the rapid development of modern storage technology and the expansion of storage means, large-capacity and mass storage have become possible. Especially with the strong support of cloud storage, retaining the digitized content of all natural human output has become possible.

It has been calculated that if a person wrote every day for 60 years, all of that person's handwritten information would occupy only about 250 GB. This is trivial for existing mass storage and cloud storage technology, and it makes the complete retention of original works (such as novels, musical arrangements, and prints) possible.

In addition, when the handwriting input system described earlier in this document is combined with the concept of the object-based coding scheme, a new data processing system can be established as follows. The new data processing system introduces the concept of an encoding repository. An application can not only query and use the encoding meta-objects already in the encoding repository, but also register and use new encoding meta-objects. The new system breaks through the limitations of existing systems at four different levels.

First level, built-in security

In the new data processing system, text encoding is non-standardized. The text codes and the corresponding decoding information are stored in the application system and the code repository, respectively. The code repository can support different levels of code isolation for users, applications, and content. Access to and use of text content can therefore be authorized through the access control management of the code repository. In other words, the new data processing system has built-in security.

This security is multi-layered. We can set different access rights for different users, different applications, different text content, and even different encodings. This is completely impossible in traditional data processing systems based on standardized text encoding.

In addition, not only simple text content, but also the application system and data that are encoded by the new data processing system will have corresponding security.

The second level, comprehensive coding ability

In existing data processing systems, various general-purpose and proprietary text formats have been created to describe various general-purpose and proprietary data structures, for example XML, JSON, CSV, and RTF. However, these formats use the same coding standard for both markup and content, which imposes many limitations on content text and markup text and makes storage and parsing less efficient. For example, in XML, characters such as ">", "<", and "&" have special meanings and cannot be used directly in text content. The escape sequences "&gt;", "&lt;", and "&amp;" have to be used instead, or the text has to be protected by enclosing it between "<![CDATA[" and "]]>" or in quotation marks.
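The escaping burden in XML can be seen in a short example using Python's standard library (`xml.sax.saxutils.escape` and `xml.etree.ElementTree`). The sample text and the element name `note` are arbitrary.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

text = "if a < b && b > c"

# Option 1: replace the special characters with escape sequences.
escaped = escape(text)
from_escaped = ET.fromstring(f"<note>{escaped}</note>")

# Option 2: wrap the raw text in a CDATA section.
from_cdata = ET.fromstring(f"<note><![CDATA[{text}]]></note>")

# Without either protection the document is not well-formed.
try:
    ET.fromstring(f"<note>{text}</note>")
    raw_is_well_formed = True
except ET.ParseError:
    raw_is_well_formed = False
```

Both protected forms parse back to the original text, while the unprotected form is rejected by the parser.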

In the new data processing system, open coding allows us to completely break through these limitations. We can use some encoding types for the markup, and use another type for the text content. The corresponding text parser can distinguish which text is the mark and which is the content according to the encoded metadata.
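As a rough sketch of this separation, the following parser tells markup from content purely by a type tag attached to each code, so content may contain any character without escaping. The two meta-codes `MARKUP` and `CONTENT` and the representation of a code as a `(meta_code, value)` pair are assumptions made for illustration.

```python
MARKUP, CONTENT = 0, 1  # hypothetical meta-codes for the two encoding types


def parse(codes):
    """Group a code sequence into ("tag", name) and ("text", string) events."""
    events = []
    buffer = []
    for meta_code, value in codes:
        if meta_code == CONTENT:
            buffer.append(value)
            continue
        if buffer:  # a markup code ends the current run of content
            events.append(("text", "".join(buffer)))
            buffer = []
        events.append(("tag", value))
    if buffer:
        events.append(("text", "".join(buffer)))
    return events


codes = [(MARKUP, "note")]
codes += [(CONTENT, ch) for ch in "a<b&c"]  # no escaping needed
codes += [(MARKUP, "/note")]
events = parse(codes)
```

The characters "<" and "&" pass through as ordinary content because the parser never inspects content values to find markup.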

At the same time, because coding in the new system is arbitrary, anything that can be serialized can be stored and encoded by the system, such as musical melodies, dance movements, game data, video subtitles, and even computer instructions. The stored result is divided into two parts: one part is the data objects in the encoding repository, which can be multimedia data or proprietary data, and the other part is the sequence of codes. Reference encoding of such data objects is not unique to this system; traditional data processing systems based on standardized encoding can also encode arbitrary data, but they are far less simple, efficient, and natural than a system based on object coding.

The third level, simple and efficient

An object code in the object-based coding system may consist of a meta-code part and an instance code part. For a given system, the number of meta-codes is very limited. For example, two bytes (16 bits) can encode more than 60,000 meta-codes, which can correspond to more than 60,000 object types; this is enough for most applications. For a specific object, because object codes are arbitrary, a number can be used directly as its instance code. For example, 4 bytes (32 bits) can encode more than 4 billion individual objects, and since reference codes can also be placed in different coding spaces, 32 bits is sufficient for most systems. That is, 6 bytes can represent the reference code of an object in most applications. In addition, if variable-length encoding is used, an object reference code can often be expressed in fewer bytes by setting a default meta-code, using client-side encoding, and so on. Compared with cloud storage, which uses a dozen or even dozens of bytes to reference a data block in order to prevent data block conflicts, this is much simpler and more efficient.
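The 6-byte layout can be checked with a short sketch using Python's `struct` module. The big-endian byte order, the field order (meta-code first), and the function names are assumptions; the text only fixes the sizes of the two parts.

```python
import struct

# Big-endian, no padding: 16-bit meta-code followed by 32-bit instance code.
REFERENCE_FORMAT = ">HI"

META_CODE_COUNT = 2 ** 16      # 65,536 possible object types
INSTANCE_CODE_COUNT = 2 ** 32  # 4,294,967,296 possible objects per type


def pack_reference(meta_code, instance_code):
    """Pack a (meta-code, instance code) pair into a 6-byte reference code."""
    return struct.pack(REFERENCE_FORMAT, meta_code, instance_code)


def unpack_reference(raw):
    """Recover the (meta-code, instance code) pair from a 6-byte reference code."""
    return struct.unpack(REFERENCE_FORMAT, raw)


reference = pack_reference(7, 123_456_789)
decoded = unpack_reference(reference)
```

For comparison, a SHA-1 digest used as a block reference occupies 20 bytes, against 6 bytes here.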

In addition, in the new data processing system, the data object corresponding to an object reference code can be stored in the encoding repository, which can greatly improve the storage efficiency of data objects and thereby the efficiency of data transmission and processing. For example, if the HTML of a web page is re-encoded using the object encoding technique, with the elements and attributes of the standard HTML tags encoded and the relevant meta-information placed in the encoding repository, the size of the resulting web page document is greatly reduced, which saves traffic in the network transmission of web pages.

The fourth level, personalized text encoding

In contrast to standard text encoding schemes, the encoding scheme used by an object-based data processing system can be personalized and non-standard. This is mainly achieved by isolating context coding spaces. Different users and different applications have their own context coding spaces, and personalized coding is achieved by accessing a personalized context coding space. Each object reference code has a one-to-one correspondence with a data object in the encoding repository. When text is input, the content of the input data object is stored in the encoding repository, and the location of the content in the encoding repository is converted into the corresponding object reference code. When text is output, the system finds the content of the corresponding data object in the encoding repository according to the object code and outputs the content to a specific device.

Due to the openness of the object-based coding system, we can divide and encode the digitized results of human output in any way, and can also express any content that we want to express, and only need to associate the content with the code. That is, the data processing system can dynamically add data object types and their encodings.

Therefore, under this system, people can input in the most natural way. Such input is not limited to the handwriting input described above; it may be any data stream, such as but not limited to voice, images, multimedia streams, Braille, sign language, lip language, semaphore, or even meaningful or meaningless bursts. The system automatically stores the input in the encoding repository as it is entered and encodes the location of the content in the encoding repository. On output, the input content is retrieved from the code repository according to the object reference code and played back naturally.

Take the handwriting input system described earlier as an example again. In a handwritten text input scenario, the writer writes under a natural writing constraint (such as a row constraint or a column constraint). The system splits the written content according to natural character segmentation rules (such as the segmentation of Chinese characters) or word segmentation rules (such as the segmentation of words in phonetic languages), stores the shape of each resulting character or word in the code repository, and generates the corresponding reference code. These codes are stored as a text content, that is, a collection of text codes in a specific typographical order.

It can be seen that the above handwritten text input process lies between recognizing and non-recognizing handwriting input. Like a text recognition system, this process requires the division of characters and words. The difference is that there is no need to work out the standard code corresponding to the input; instead, "what is input is what you get". This method has no recognition-rate problem, since the result is always exactly what was written, which is the same as in a non-recognizing system. The difference from a non-recognizing system is that the process divides the input content and encodes the pieces separately. This makes it possible to perform word processing on the coding results in the new system, such as editing, copying, pasting, transferring, searching, and retrieving, just as with ordinary text.
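A minimal sketch of such text-like operations on unrecognized handwriting is given below. The stroke representation, the dictionary `glyph_store`, and the functions `input_glyph` and `find` are hypothetical. The sketch also assumes that searching compares reference codes only, so it finds reuse of the same stored glyph (for example after copy and paste), not visually similar writing.

```python
# Hypothetical glyph store: reference code -> stroke data of one written unit.
glyph_store = {}


def input_glyph(strokes):
    """Store one segmented character or word shape and return its reference code."""
    code = len(glyph_store) + 1
    glyph_store[code] = strokes
    return code


def find(text, pattern):
    """Return the positions at which the code sequence `pattern` occurs in `text`."""
    n = len(pattern)
    return [i for i in range(len(text) - n + 1) if text[i:i + n] == pattern]


# Three handwritten units, each a list of (x, y) stroke points.
text = [
    input_glyph([(0, 0), (1, 2)]),
    input_glyph([(2, 0), (3, 1)]),
    input_glyph([(4, 0), (5, 3)]),
]

clipboard = text[0:2]                # copy the first two units
edited = text + clipboard            # paste them at the end
positions = find(edited, clipboard)  # search works on the codes alone
```

The text content is just a list of codes, so editing operations never need to interpret the stroke data held in the store.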

Similarly, a data processing system based on open coding can also be used in an input system based on optical recognition. For handwriting in particular, it does not matter whether the handwriting is scribbled or not: an optical recognition system based on open coding only needs to divide the input image, store the pieces in the code repository, and generate the corresponding image object reference codes. It is worth mentioning that, because the codes are personalized, the corresponding data objects that the system accumulates in the code repository can serve as good training samples, and the results of analysis and training can in turn improve the conventional text recognition rate for that particular individual.

Similarly, the data processing system is also applicable to a voice input system. The input sound signal does not need to be identified, and only needs to be simply processed and divided, and can be stored in the code warehouse and encoded accordingly.

The data processing system can also be applied to other text input methods, such as Braille, lip language, sign language, and semaphore input. In addition, new text input methods can be created based on this new data processing system. For example, on a device with a small touch screen, specific gestures can be designed as line breaks, word separators, and end markers, with the content itself input by full-screen handwriting or by voice. The input content is divided according to the word separators and stored in the code repository, and the corresponding text codes are obtained. As another example, a 3D-glove-based sign language input method can be designed. The motion information of the 3D glove is stored as text content in the code repository, each code corresponds to a character, and a certain time interval is used to separate the actions. Sign language output then consists of playing back the 3D glove motion information from the code repository through a 3D model.
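As a minimal sketch of using a time interval to separate actions, the following groups timestamped motion samples into separate actions whenever the pause between two samples exceeds a threshold. The function name `split_by_pause`, the sample format, and the threshold value are assumptions for illustration only.

```python
def split_by_pause(samples, max_gap):
    """Group (timestamp, frame) samples, sorted by timestamp, into actions.

    A new action starts whenever the gap to the previous sample exceeds max_gap.
    """
    groups, current, last_t = [], [], None
    for t, frame in samples:
        if last_t is not None and t - last_t > max_gap:
            groups.append(current)
            current = []
        current.append(frame)
        last_t = t
    if current:
        groups.append(current)
    return groups


# Two bursts of glove motion separated by a pause of more than half a second.
samples = [(0.0, "a1"), (0.1, "a2"), (1.5, "b1"), (1.6, "b2")]
actions = split_by_pause(samples, max_gap=0.5)
```

Each resulting group could then be stored in the code repository as one data object and referenced by one text code.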

In summary, the new data processing system has the following advantages:

The first aspect, simple and natural

The new data processing system does not require the generation of specific standard encodings, so the simplest and most natural input method can be designed for the average user to directly encode the result into a personalized encoding.

Since there is no restriction on the coding standard, the user can input any content he wants to express, including graphics, symbols, sounds, videos and other multimedia data. Unlike traditional text recognition systems, the text output in the new data processing system does not need to be recognized, which ensures uninterrupted and efficient input. A smooth and natural user input experience is guaranteed.

Second aspect, security

The new data processing system uses a non-standardized, object-based reference encoding. People cannot understand the content from the text coding sequence alone; they need to obtain the specific content of each code from the code repository. The access control of the code repository ensures the security of the data content. At the same time, because the reference code and the data object are separated, the readability and visibility of non-standard text once the code sequence has been obtained depend entirely on the security settings of the corresponding code repository. Therefore, the code repository is essentially a full-featured cryptographic server. Further, the code sequence and the data in the code repository can be placed in different secure channels, which greatly increases the difficulty for a data thief of obtaining all the data. Furthermore, unlike traditional standardized text encoding, which is context-independent, non-standard text based on object encoding can be context-sensitive. Through the isolation of context spaces, the same encoding can vary from person to person, from application to application, from document to document, from time to time, from location to location, and so on. The application system, and even the individual user, can register a new context specification with the code repository, thereby introducing a new coding space to further isolate the text codes. Compared with traditional data processing systems, the new system has inherent security and privacy.
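The isolation of context spaces combined with access control can be sketched as follows. The class `ContextualRepository`, its method names, and the context names are hypothetical; the sketch only shows that the same code value can refer to different data objects in different contexts, and that a code sequence is unreadable without repository permission.

```python
class ContextualRepository:
    """Hypothetical code repository with one isolated coding space per context."""

    def __init__(self):
        self._spaces = {}     # context name -> {code: data object}
        self._readers = {}    # context name -> set of authorized users

    def register_context(self, context, readers):
        self._spaces[context] = {}
        self._readers[context] = set(readers)

    def store(self, context, obj):
        space = self._spaces[context]
        code = len(space) + 1          # codes are unique only within one context
        space[code] = obj
        return code

    def load(self, context, code, user):
        if user not in self._readers[context]:
            raise PermissionError(f"{user!r} may not read context {context!r}")
        return self._spaces[context][code]


repo = ContextualRepository()
repo.register_context("alice/notes", readers={"alice"})
repo.register_context("bob/notes", readers={"bob"})
code_a = repo.store("alice/notes", b"alice's glyph")
code_b = repo.store("bob/notes", b"bob's glyph")
```

Here `code_a` and `code_b` have the same value, yet they resolve to different content, and neither user can resolve the other's codes.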

Software developers can store encoded non-standard text information for users, and they can further process the non-standard text, for example by retrieval and analysis. But they cannot understand the real non-standard text content. Similarly, the code repository provider can analyze, process, and even recognize the content in the code repository, but because it does not have the final order of the object reference codes, the non-standard text content remains unknown to it as well. Only users who have access to both the corresponding application system and the code repository can obtain the complete text content. Therefore, for a network application with authorized access, the user must hold both permissions, application permission and code repository permission, in order to obtain the full non-standard text information.

Due to the openness of object-based coding, we can also directly re-encode data content that needs to be protected (including traditional standardized text encodings). The authorized access service of the code repository can apply specific controls to these special encodings so as to encrypt specific text encodings under specific conditions. The specific conditions here may be rules based on context (time, place, environment, user, application, etc.), which makes complex and flexible text encoding security possible.

Based on the context-aware security, the encoding repository can also provide users or systems for identity authentication and digital copyright protection.

The third aspect, open

From object reference coding to non-standard text content, and from encoding services to non-standard text services, the object-based coded data processing system is a fully open system. Any data object can be placed in the code repository, and its reference code can be recorded in non-standard text. Software developers can register new context object specifications, new encoding spaces, new encoding meta-objects, and new data objects, or add new encoding services to the system, including new non-standard text services (such as new non-standard text input and output systems and non-standard text editing systems).

At the same time, we can use it to build models of any specific domain because of the more efficient and secure common text data (including non-standard text and standardized text) solutions brought by the new data processing system. That is to say, different application systems can use the object coded data processing system to encode their domain model and deploy the code in the code repository. In this way, the application system and the corresponding data object content not only have the advantages of the new data processing system - efficient, secure, etc., but also make full use of various text services to process its data.

The fourth aspect, flexible

In a non-recognition handwriting application system, people can input arbitrary text and graphic content; sound recording software can record people's voice information; video recording software can record people's motion information (including sign language). Unlike these full-content recording systems, the new data processing system divides, splits, and encodes the same content. In this process, the system can directly filter out useless information and retain only the important information that people care about, for example by filtering out noise in audio or scanning noise points in text. Moreover, through the content normalization service, duplicate content does not need to be stored repeatedly, which greatly reduces storage space and improves transmission speed. More importantly, we can use the existing word processing infrastructure and tools to process the text-encoded content formed in the new data processing system, for example by searching, indexing, and editing.
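One possible form of content normalization is sketched below. It assumes that two pieces of content count as duplicates only when their bytes are identical, and it detects this with a SHA-256 digest from Python's standard `hashlib` module. The class name `NormalizingRepository` is hypothetical; a practical system might normalize content before comparing, which is not shown here.

```python
import hashlib


class NormalizingRepository:
    """Hypothetical repository that stores identical content only once."""

    def __init__(self):
        self._by_digest = {}   # content digest -> reference code
        self._objects = {}     # reference code -> content

    def store(self, content: bytes) -> int:
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._by_digest:
            return self._by_digest[digest]    # reuse the existing reference code
        code = len(self._objects) + 1
        self._by_digest[digest] = code
        self._objects[code] = content
        return code

    def stored_count(self) -> int:
        return len(self._objects)


repo = NormalizingRepository()
codes = [repo.store(c) for c in (b"glyph-A", b"glyph-B", b"glyph-A")]
```

The repeated content receives the same reference code, so only two objects are stored for three inputs.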

In addition, flexibility is also reflected in coding deployment and access control. The flexibility of coding deployment means that the same encoding type can be selectively configured into different encoding spaces and thus have different security levels and visibility. The flexibility of access control means that the user or the administrator of the application system can configure access to object codes very flexibly through the access control settings of the code repository. On the one hand, access control can be configured at different coding levels: an encoding space, encoding metadata, or even a specific data object. On the other hand, access control for an encoding can be based on different conditions, such as time, location, user, application, and the state of the domain model.
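A minimal sketch of such multi-level, condition-based access control follows. The rule table, the level names, and the policy that the most specific matching rule decides (with denial as the default) are all assumptions for illustration, not something the system prescribes.

```python
from datetime import time


def office_hours(ctx):
    """Condition based on time: allow access between 09:00 and 18:00."""
    return time(9, 0) <= ctx["time"] < time(18, 0)


# Hypothetical rules keyed by (level, target); each value is a condition on the context.
rules = {
    ("object", 42): lambda ctx: ctx["user"] == "alice",        # one data object
    ("type", "signature"): lambda ctx: ctx["app"] == "contracts",  # encoding metadata
    ("space", "corp"): office_hours,                           # whole encoding space
}


def may_access(obj, ctx):
    """Check object-level, then type-level, then space-level rules."""
    for key in (("object", obj["code"]), ("type", obj["type"]), ("space", obj["space"])):
        rule = rules.get(key)
        if rule is not None:
            return rule(ctx)      # the most specific matching rule decides
    return False                  # deny by default


note = {"code": 7, "type": "note", "space": "corp"}
secret = {"code": 42, "type": "note", "space": "corp"}
morning = {"time": time(10, 0), "user": "bob", "app": "mail"}
night = {"time": time(20, 0), "user": "bob", "app": "mail"}
```

With these rules, `note` is readable by anyone during office hours, while `secret` is readable only by the user "alice" regardless of time.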

The fifth aspect, efficient

In a networked environment, the data object encoding and the split storage of content in the new data processing system ensure efficient storage and transmission. The content of the data object needs to be transferred from the encoding repository to the consumer only when it is really needed.

In non-standard word processing systems, the unidentified data object content formed in the new data processing system can be a good personalized identification training sample. The trained text recognition system can more effectively identify personalized non-standard text into corresponding standard codes.

In a non-standard text data processing system, the format information of the text can be stored in the code repository. Text format characters use non-standard encoding, text data can use standard characters arbitrarily without escaping, which will bring efficient text data transmission and processing.

Further, the new data processing system is significant mainly in the following aspects:

The first aspect is conducive to the popularity and depth of personal computing.

The new data processing system makes possible text input methods that are close to natural, traditional writing, solving the "computer input is difficult" problem that many people face. A safe, natural data processing system is more acceptable to ordinary people. Computer text input then no longer depends on the individual's cultural background or degree of familiarity with the keyboard, which is conducive to the popularity and depth of personal computing.

The second aspect is conducive to the popularity and depth of cloud computing.

In recent years, more and more Internet applications and services have moved to cloud computing, a computing model based on on-demand consumption and dynamic allocation. However, for cloud-based systems, especially public clouds, security is a challenge that cannot be ignored. In the new data processing system, splitting data object encodings from their content can greatly improve the security level of the system. By deploying the code repository within the enterprise's firewall, companies can confidently use a variety of public cloud-based applications and services, and can allow their employees to use their private mobile devices at will. All enterprise data stored in the public cloud is meaningless "garbled" text to people outside the firewall. Similarly, families or individuals only need to protect the security of their home or personal code repository; the information stored in the public cloud is then safe and reliable. Here, the code repository acts as a codebook. This high level of security accelerates the acceptance and adoption of public cloud services by businesses and individuals.

The third aspect is conducive to the development and popularization of the Internet of Things.

The Internet of Things combines intelligent sensing technology, recognition technology, and pervasive computing technology, and is called the third wave of information industry development after computers and the Internet. The Internet of Things is an extension of the Internet. On the one hand, the Internet of Things has an urgent need for object addressing, coding, and identification at the three levels of the sensing layer, the network layer, and the application layer. The large number of nodes, their wide variety, and their limited processing capability pose a huge challenge, and no common standard has yet been formed. A simple and flexible object coding mechanism can meet these needs well.

On the other hand, a large number of sensors in the sensing layer need to store the perceptual data records, and the object encoding technology can effectively provide relevant encoding storage support.

The fourth aspect is conducive to cultural protection and inheritance

There are now more than 7,000 languages in common use in the world, and dialects are countless. Unicode covers only a few hundred of them. Under the existing computer data processing system, the scripts of many languages are difficult to input into a computer system. In the new data processing system, there is almost no limit on the language and script used (for handwritten text, the typesetting method is the only restriction, and it needs to be specified in advance). People can directly store any non-standard text content in a computer system or communicate with others through a computer. This breaks the unreasonable constraint of “standardization first, then use” imposed on computer text.

Keyboard input of computer text has caused many people to forget how to write characters once they pick up a pen. The new data processing system maintains the original human tradition of writing.

The fifth aspect is conducive to environmental protection

The new data processing system makes the direct input and use of text on electronic devices more natural, convenient, and secure. This is conducive to the formation of a paperless environment and ultimately reduces the use of paper.

The encoding processing method and the decoding processing method provided by the following embodiments of the present invention can be implemented based on the above encoding system. The technical solution of the present invention will be further described in detail below through the accompanying drawings and specific embodiments.

FIG. 5C is a flowchart of Embodiment 1 of an encoding processing method provided by the present invention. As shown in FIG. 5C, the execution body of the method in this embodiment is an encoding system, and the method includes:

Step 101C: Acquire a data object to be encoded and its metadata according to the received encoding processing request.

In this embodiment, the metadata acquired is mainly the encoding metadata of the object. The encoding metadata can be a subset of the metadata or all of it, including, but not limited to, the type of the object, the corresponding data structure, and constraints and controls on storage and transmission. The metadata of the object is the basis of the system and must be extracted from the data in some way. It can be obtained automatically using facilities of modern software platforms, such as the reflection mechanisms in Java and .NET.
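As an analogue of the reflection mechanisms mentioned above, the following sketch uses Python's standard `dataclasses.fields` function to extract the type name and field structure of an object at runtime. The `Stroke` class and the shape of the returned dictionary are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Stroke:
    """Hypothetical data object: one sampled point of a handwritten stroke."""
    x: int
    y: int
    pressure: float


def extract_metadata(obj):
    """Collect the object's type name and field structure by introspection."""
    def type_name(t):
        # Annotations may be stored as strings or as type objects.
        return t if isinstance(t, str) else t.__name__

    return {
        "type": type(obj).__qualname__,
        "structure": [(f.name, type_name(f.type)) for f in fields(obj)],
    }


metadata = extract_metadata(Stroke(x=3, y=4, pressure=0.5))
```

The resulting dictionary describes the object without containing any of its content data, which is the separation the encoding steps below rely on.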

In addition, in the present embodiment, the data object (also referred to herein as an object) is the basic object of data processing in the present invention, that is, the target to be encoded by the present invention. It can be data in any form: a single word or symbol or part of one, an audio, video, or multimedia stream or a fragment thereof, or an encoding itself or a document. It includes at least the metadata portion (or metadata) of the data object, and usually also includes the content data portion of the data object, which is the remainder of the data object after the metadata is stripped; this portion is also called the content of the object, the data content, or the content data. The content data can be related or unrelated to the metadata portion.

Metadata is data about data objects, and is a description of the characteristics, attributes, intrinsic logical relationships, and/or structures of data objects. Metadata can reside inside the data, outside the data, or alongside the data. Metadata may include such things as the type of the object, creation and/or modification dates, historical version information, data structures, interfaces, storage constraints, transmission constraints, encoding constraints, and encoding context constraints. Specific metadata examples may include, but are not limited to, information on the following: description of the assembly; identity (name, version, culture, public key); the types that are exported; other assemblies that the assembly depends on; security permissions; description of types; name, visibility, base class, and implemented interfaces; members (methods, fields, properties, events, nested types); attributes; other descriptive elements that modify types and members; header and/or table structure information for tables; palettes in drawing files; and more.

Metadata is different for different data objects. For example, for the metadata portion of the data object we call it the metadata of the data object; for the metadata portion of the encoding object mentioned later we can call it the encoding metadata. The ability to acquire or add metadata corresponding to a data object at runtime is the basis for the system to encode data objects.

Step 102C: Acquire an object code of the data object according to the encoding warehouse and the data object and metadata thereof.

In this embodiment, the data object to be encoded and its metadata are obtained according to the received encoding processing request, and the object encoding of the data object is obtained according to the encoding repository, the data object, and its metadata. Because the data object is encoded according to its metadata and the encoding repository, flexible and diverse encoding is achieved.

Further, for example, FIG. 5D is a flowchart of a specific implementation manner of step 102C in FIG. 5C. As shown in FIG. 5D, a specific implementation manner of step 102C is as follows:

Step 102C1: Select or create an encoding protocol according to the encoding repository and at least a portion of the metadata, and generate a meta encoding corresponding to the metadata according to the encoding protocol.

In this embodiment, based on the predetermined extraction rule, metadata related to the subsequent encoding process may be further selected from the metadata, and then the corresponding encoding specification may be created or generated based on the selected metadata.

In addition, based on the metadata extracted from the object, an encoding protocol is selected or created, and the encoding protocol is saved. The encoding protocol is then used to generate the corresponding encoding. A default encoding protocol can also be set for the system to perform the corresponding encoding and decoding; in this case, it is only necessary to select the protocol, without creating a new one. Some or all of the encoding protocols can be selected or created by the user interactively. It is worth mentioning that an encoding protocol generated during the encoding process can be automatically destroyed after the encoding process is completed (after the encoding factory is done), or it can be saved.

The process of adding or creating a coding specification can be done while the object is being modeled; it can also be done while the specific application is running. It can be done automatically by certain rules or by interaction.

The coding protocol mainly includes the coding mode of the object, and the coding constraints of the internal structure of the object.

Step 102C2: Encode the data content of the data object according to the encoding protocol to obtain an instance code, and acquire an object code corresponding to the data object according to the meta code and the instance code.

Wherein, the object coding is a reference coding form or a content coding form.
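Steps 102C1 and 102C2 can be sketched as follows. This is a minimal illustration under several assumptions: the class `EncodingRepository`, the choice of the object type as the key for selecting a protocol, the use of JSON as the only serializer, and the representation of the object code as a (meta code, instance code) pair are all hypothetical.

```python
import json


class EncodingRepository:
    """Hypothetical repository of encoding protocols, keyed by meta code."""

    def __init__(self):
        self.protocols = {}      # meta code -> encoding protocol
        self._meta_codes = {}    # object type -> meta code

    def meta_code_for(self, metadata):
        """Step 102C1: select the protocol registered for this type, or create one."""
        key = metadata["type"]
        if key not in self._meta_codes:
            meta_code = len(self._meta_codes) + 1
            self._meta_codes[key] = meta_code
            self.protocols[meta_code] = {"type": key, "serializer": "json"}
        return self._meta_codes[key]


def encode(repo, content, metadata):
    """Step 102C2: encode the content per the protocol and pair it with the meta code."""
    meta_code = repo.meta_code_for(metadata)
    protocol = repo.protocols[meta_code]
    if protocol["serializer"] != "json":
        raise ValueError("this sketch only implements the JSON serializer")
    instance_code = json.dumps(content, sort_keys=True).encode("utf-8")
    return meta_code, instance_code      # object code in content-encoding form


repo = EncodingRepository()
first = encode(repo, {"x": 1, "y": 2}, {"type": "Point"})
second = encode(repo, {"x": 5, "y": 6}, {"type": "Point"})
third = encode(repo, {"text": "hi"}, {"type": "Note"})
```

Objects of the same type share one meta code because the existing protocol is selected rather than created again.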

Further, as can be seen from FIG. 3, the encoding system mainly includes an encoding repository and a client, and the encoding processing flow can be implemented in two ways, detailed as follows.

The first implementation is:

Step 1a: The client acquires the data object to be encoded and its metadata according to the received encoding processing request.

Step 2a: The client sends the data object to be encoded and its metadata to the code repository.

Step 3a: The encoding repository selects or creates an encoding specification according to at least a part of the metadata, and generates a meta encoding corresponding to the metadata according to the encoding specification.

In this embodiment, the object encoding protocol (which may be referred to simply as the encoding protocol) refers to the specification of, and constraints on, how the data object is encoded and decoded. It can include the encoding mode of the data object (content encoding, reference encoding, or a mixture of both) and the encoding constraints of the object metadata (such as the scheme for data serialization, word length, endianness, and data alignment). The object encoding protocol can also be part of the metadata of the data object.

Object encoding conventions can be added manually (through the modeler) or automatically (via the tool) when the object is modeled, or interactively (by the user) or automatically (via system policy) at runtime.

Encoding metadata refers to metadata associated with a data object codec. The encoded metadata can be part or all of the metadata. The encoding metadata of the data object is the basis for the system to encode and decode the data object.

Step 4a: The code repository encodes the data content of the data object according to the coding protocol, obtains an instance code, and acquires an object code corresponding to the data object according to the meta code and the instance code.

In this embodiment, the data object and its metadata are stored in an encoding repository. In addition, the corresponding object code generated by the code repository is actually the reference code of the data object in the code repository.

Step 5a: The client receives the object code returned by the encoding warehouse.

The second implementation is:

Step 1b: The client obtains the data object to be encoded and its metadata according to the received encoding processing request.

Step 2b: The client queries the encoding warehouse to select or create an encoding specification according to at least a part of the metadata, and generates a meta encoding corresponding to the metadata according to the encoding specification.

In this embodiment, the client proposes an encoding process request to the encoding server in the encoding repository to obtain a meta-encoding corresponding to the encoding meta-object (actually a reference encoding of the encoding meta-object in the encoding repository).

Optionally, the meta-encoding may include one or a combination and/or nesting of: type coding, spatial coding, and context coding.

Step 3b: The client encodes the data content of the data object according to the coding protocol, obtains an instance code, and obtains an object code corresponding to the data object according to the meta code and the instance code.

In this embodiment, in the above step 3b, the generation of the instance code differs for the two forms of object encoding, content encoding and reference encoding. For an instance code in content encoding form, the encoding client directly serializes the content of the data object into the instance code according to the encoding protocol. For an instance code in reference encoding form, the encoding client sends an encoding request to the encoding server; the encoding server obtains the corresponding data object, the encoding protocol, and related information according to the request, stores the data object in the encoding repository according to the encoding protocol and related information, generates the corresponding instance code, and returns it to the client.

Correspondingly, the decoding process of the object encoding is the inverse of the encoding process. Generally, the encoding server obtains the object code to be decoded according to the decoding processing request of the encoding client. The data object in the encoding repository is located according to the encoding and returned to the client.

Specifically, decoding an obtained object encoding involves several steps. The encoding client parses the object encoding into a meta code and an instance code according to a preset rule, and sends a meta-code decoding request to the encoding server. After obtaining the corresponding encoding meta-object, the client decodes the instance code according to the encoding protocol and related information in the encoding meta-object, and combines the result with the encoding meta-object to obtain the corresponding data object.

The decoding of the instance code likewise differs for the two forms of object encoding, content encoding and reference encoding. For the content encoding form, the encoding client can directly decode the instance code into the corresponding data object content according to the encoding protocol. For the reference encoding form, the encoding client issues an instance-code decoding request to the encoding server; the encoding server obtains the corresponding instance code, the encoding protocol, and related information according to the request, locates the data object in the encoding repository, and returns it to the client.
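The two forms of instance code and their decoding can be contrasted in a minimal sketch. The class `EncodingServer`, the one-byte tag that stands in for the meta code, the 4-byte reference format, and the use of JSON for serialization are assumptions for illustration only.

```python
import json

CONTENT, REFERENCE = 0, 1    # assumed one-byte tag standing in for the meta code


class EncodingServer:
    """Hypothetical encoding server in front of the encoding repository."""

    def __init__(self):
        self._store = {}

    def put(self, payload: bytes) -> bytes:
        ref = len(self._store) + 1
        self._store[ref] = payload
        return ref.to_bytes(4, "big")        # instance code is only a reference

    def get(self, instance_code: bytes) -> bytes:
        return self._store[int.from_bytes(instance_code, "big")]


def encode(obj, form, server):
    payload = json.dumps(obj).encode("utf-8")             # serialize the content
    instance = payload if form == CONTENT else server.put(payload)
    return bytes([form]) + instance


def decode(object_code, server):
    form, instance = object_code[0], object_code[1:]      # split by the preset rule
    payload = instance if form == CONTENT else server.get(instance)
    return json.loads(payload.decode("utf-8"))


server = EncodingServer()
by_content = encode({"word": "hello"}, CONTENT, server)
by_reference = encode({"word": "hello"}, REFERENCE, server)
```

The content-encoded form carries the data itself and can be decoded by the client alone, while the reference-encoded form has a fixed small size and cannot be decoded without access to the server.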

In addition, in the decoding process based on the object encoding, the system first acquires the encoded metadata; and then obtains the corresponding content encoding according to the metadata. Specifically, the encoding metadata may include encoding type information for locating, loading, or transmitting the encoded content, and constraint information for the target encoding space to which the encoding belongs. The encoding metadata is encoded to obtain a meta-encoding. In fact, the encoded content of the meta-encoding in the encoding repository is mainly the encoding meta-object. Meta-encoding is generally an integral part of encoding. After the decoder parses the meta-encoding from the encoding, the corresponding encoding metadata can be obtained according to a certain mechanism.

In this embodiment, it is worth mentioning that, in an encoding system, the encoding metadata itself can also be regarded directly as a data object, that is, a data object whose content is encoding metadata, which may be referred to as an encoding meta-object. Therefore, the encoding metadata, as a data object, can also have its own corresponding encoding, called the meta-encoding.

Preferably, FIG. 6 shows the relationship between data objects, metadata, encoding protocols, and encoding meta-objects. As shown in FIG. 6, the encoding meta-object is also a data object (relative to an ordinary data object, it is an object at the M1 abstraction level), and the model of its metadata (at the M2 abstraction level) is called the encoding metamodel. The encoding metadata of the encoding meta-object is part of the encoding metamodel.

The encoding metamodel is the cornerstone of the object coding system. Generally speaking, the encoding metamodel is relatively stable at runtime and does not change dynamically, but it can be extended. That is to say, the encoding metadata of the encoding meta-object is built into the system. Therefore, the system can directly store, transfer, encode, and decode these encoding meta-objects.

An object coding system can correspond to a unique core encoding metamodel (which can have an extension mechanism). Specifically, FIG. 7 is a schematic diagram of the core encoding metamodel.

In addition, the meta-encoding, as the object encoding of the encoding meta-object, does it also have its own meta-encoding? This is actually related to the specific design of the coding metamodel and the codec method. If there is only one encoding meta-object in the encoding metamodel, the meta-encoding is all of the encoding meta-object. If there are multiple encoding meta objects in the metamodel and they can be encoded into the same metacode at the same time, then this case does not require metacoded metacode. Otherwise, metacoded metacodes are needed to distinguish them. Sometimes, there is a certain hierarchical relationship between the encoded meta-objects. In this case, multi-level decoding may be required to obtain the encoded meta-object of the final data object.

In general, variable-length coding expresses this meta-object hierarchy more directly and flexibly, and is easy to handle: each code word is the meta code of the code word that follows it, so that multiple levels can be nested.

Specifically, FIG. 8 is a conceptual model of object encoding, meta-encoding, and instance encoding (that is, the object encoding with the meta-encoding part removed), together with the data object and the encoding meta-object. FIG. 8 shows the following relationships:

1. The encoding meta object can also be used as a data object.

2. Meta-encoding itself can also be used as an object encoding

3. Data objects and encoding meta objects are related to each other

4. Object coding includes meta coding and instance coding

5. The object encoding is associated with the corresponding data object, which implies the same correspondence between the meta-encoding and the encoding meta-object (mainly implied by relationships 1 and 2 above).

In addition, as an example of a meta-encoding that includes a plurality of encoding meta-objects, FIG. 9 is an exemplary diagram of the meta-encoding in the present embodiment. As shown in FIG. 9, the object encoding is a 128-bit fixed-length encoding. There are only two encoding meta-objects in the encoding metamodel: the owner of the object and the object type. They can be related or unrelated, depending on the definition in the encoding metamodel. The corresponding coding logic differs depending on whether they are related.
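A 128-bit fixed-length object encoding with two meta-code fields can be sketched as follows. The source gives only the total length; the split into a 32-bit owner field, a 32-bit type field, and a 64-bit instance field, as well as the function names, are assumptions chosen for illustration and do not reproduce the layout of FIG. 9.

```python
OWNER_BITS, TYPE_BITS, INSTANCE_BITS = 32, 32, 64    # assumed split of the 128 bits


def pack(owner: int, type_: int, instance: int) -> bytes:
    """Pack owner and type (the meta code) plus the instance code into 16 bytes."""
    for value, bits in ((owner, OWNER_BITS), (type_, TYPE_BITS), (instance, INSTANCE_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
    code = (owner << (TYPE_BITS + INSTANCE_BITS)) | (type_ << INSTANCE_BITS) | instance
    return code.to_bytes(16, "big")


def unpack(encoded: bytes):
    """Recover (owner, type, instance) from a 16-byte object encoding."""
    code = int.from_bytes(encoded, "big")
    instance = code & ((1 << INSTANCE_BITS) - 1)
    type_ = (code >> INSTANCE_BITS) & ((1 << TYPE_BITS) - 1)
    owner = code >> (TYPE_BITS + INSTANCE_BITS)
    return owner, type_, instance


encoded = pack(owner=7, type_=3, instance=123456789)
```

Because every field has a fixed position, the decoder can split the meta code from the instance code without consulting the repository.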

As another example, FIG. 10 is an exemplary diagram of a similar layer-by-layer correlation of coded meta-objects (variable-length coding of 16-bit word length).

Further, FIG. 11 is a schematic diagram of the metamodel corresponding to this encoding. As shown in FIG. 11, there are two kinds of encoding meta-objects: user and encoding type. The encoding type can have one owner (01) or no owner (00). Therefore, both of the above encoding forms are legal. An object encoding whose meta-encoding consists only of the type encoding corresponds to a data object without an owner; the other form represents a data object with an owner.
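Multi-level decoding of a variable-length meta-encoding with 16-bit code words can be sketched as follows. The registry contents, the code word values, and the function name are hypothetical and do not reproduce FIG. 10; the sketch assumes that each code word is looked up in the space defined by the previous one, and that the meta-encoding ends when a meta-object with no nested space is reached.

```python
import struct

# Hypothetical registry: code word -> (encoding meta-object, nested coding space).
REGISTRY = {
    0x0001: ("user:alice", {0x0010: ("type:glyph", {}), 0x0011: ("type:audio", {})}),
    0x0002: ("type:glyph", {}),      # a type with no owner, encoded at the top level
}


def split_meta_chain(encoded: bytes):
    """Walk 16-bit code words level by level; return (meta-object path, instance code)."""
    level, path, offset = REGISTRY, [], 0
    while level:                      # stop once the current meta-object is a leaf
        (word,) = struct.unpack_from(">H", encoded, offset)
        name, level = level[word]
        path.append(name)
        offset += 2
    return path, encoded[offset:]


owned = struct.pack(">HH", 0x0001, 0x0010) + b"instance"
unowned = struct.pack(">H", 0x0002) + b"instance"
```

Both forms are legal in this sketch: the first decodes through two levels (owner, then type), and the second through one level (type only).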

In the present embodiment, the meta-encoding is generated based on the metadata and the encoding protocol, and the instance encoding is generated based on the data content. These specific steps can be implemented using an encoding factory. The encoding factory is another important component of the system; it can be created dynamically by the encoding repository and can be used across components or across systems. The encoding factory provides direct encoding and decoding services for the related objects.

The code repository can provide two important services: registration and access to encoded metadata; encoding and decoding of object reference encoding.

The encoding repository can also use external storage services to store encoding metadata, object data, and so on.

The final object encoding is generated from the meta code and the instance code based on predetermined rules. The meta code and the instance code may be combined into an object code in an arbitrary manner, such as by splicing or by some kind of operation, as long as the two can be separated and restored at the time of decoding. The object encoding can be generated on the client side or automatically by the encoding factory, depending on the actual design. Moreover, the final object code may also include a code representing the manner in which the meta code and the instance code are combined or spliced. If necessary, the code representing the combination or splicing manner can be stored separately from the object code, under a different secure channel, with separate access rights for each. Only after authorization and verification can both the object code and the corresponding code representing the combination or splicing manner be obtained, so that the meta code and the instance code can be correctly separated during decoding.
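A reversible combination of meta code and instance code, with the combination manner kept as a separate code, can be sketched as follows. The two splicing modes, the one-byte length prefix, and the function names are assumptions made for illustration; any other reversible operation would serve equally.

```python
META_FIRST, INSTANCE_FIRST = 0, 1    # hypothetical codes for the splicing manner


def combine(meta: bytes, instance: bytes, mode: int) -> bytes:
    """Splice meta code and instance code; a one-byte prefix records the meta length."""
    if len(meta) > 255:
        raise ValueError("meta code too long for a one-byte length prefix")
    if mode == META_FIRST:
        return bytes([len(meta)]) + meta + instance
    if mode == INSTANCE_FIRST:
        return bytes([len(meta)]) + instance + meta
    raise ValueError(f"unknown combination mode {mode}")


def separate(object_code: bytes, mode: int):
    """Inverse of combine: return (meta code, instance code) for the given mode."""
    n, body = object_code[0], object_code[1:]
    if mode == META_FIRST:
        return body[:n], body[n:]
    if mode == INSTANCE_FIRST:
        cut = len(body) - n
        return body[cut:], body[:cut]
    raise ValueError(f"unknown combination mode {mode}")


object_code = combine(b"\x00\x07", b"payload", INSTANCE_FIRST)
```

If the mode is stored apart from the object code, a party holding only the object code cannot reliably tell which bytes are the meta code; separating with the wrong mode yields different, incorrect parts.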

In this embodiment, the content data may also be the application object itself, or may be positioning and index information of the application object. In the latter case, the data access component of the application system can obtain the corresponding application data through some means or algorithm according to the content data, thereby obtaining the final application object.

Additionally, preferably, the content of the data object can be stored in a third party storage system that interfaces with the encoding repository, in which case the encoding repository needs to store information about accessing data objects in the third party storage system.

In this embodiment, the process of encoding a data object is referred to as object-based encoding. Data serialization, referred to as serialization, is the process of encoding content into data. The metadata and the content data of the data object ultimately need to be serialized, and are then stored either in the result of the object encoding (content encoding method) or in a storage other than the result (reference encoding method). In addition, during the encoding and decoding process, the content of the data object and the content of the metadata need to be serialized before being transmitted in the system.

In fact, the serialization of data objects, that is, content encoding itself, can also be built entirely on the object-based encoding method. The key is that the encoding metadata is stored in the encoding warehouse by this method, yielding the reference encoding of the corresponding encoding meta-object, that is, the meta-encoding. With the encoding metadata that corresponds to the meta-code, subsequent data objects can be serialized smoothly. Object-based reference encoding is therefore the basis of this method. On this basis, the encoding meta-object can be reference encoded to obtain the meta-encoding. On the basis of the meta-encoding, a data object can be both reference encoded and serialized, that is, content encoded. When implementing reference encoding, it is preferable to first obtain the content encoding of the data object (applying this method to itself), transfer the content encoding to the encoding warehouse for storage, and then obtain the reference encoding.

In the present embodiment, object encoding refers to the encoding of an arbitrary object. The objects here can be entity objects such as data, content information, images, and voices (which can generally be reference encoded), value objects (for example, dates, which can be instance encoded), or high-level objects that include internal object structures, such as array objects, table objects, tree/document objects, and more. Object encoding is one of the outputs of this system for encoding arbitrary objects, and is also one of the inputs for object decoding.

For example, FIG. 12 is a schematic diagram of a conceptual model of the object encoding. As shown in FIG. 12, the object encoding may include two parts: a meta-encoding and an instance encoding. Meta-encoding is the encoding of an encoding meta-object and is generally an integral part of object encoding. After the decoder parses the meta-encoding from the encoding, the corresponding encoding metadata can be obtained according to a certain mechanism. Content encoding is the encoding of data content under the corresponding encoding constraints.

FIG. 13 is a flowchart of Embodiment 2 of an encoding processing method according to the present invention. On the basis of the foregoing embodiment shown in FIG. 5C, as shown in FIG. 13, the method in this embodiment further includes:

Step 201C: Set access rights to data in the encoding warehouse.

In this embodiment, the data may be metadata, data objects, and the like. Optionally, the metadata includes one or a combination of the following:

Type of data object, creation time of data object, modification time of data object, historical version information of data object, data structure of data object, interface of data object, storage constraint of data object, transmission constraint of data object, and encoding constraint of data object (including constraints on the encoding space).

Further, the method may further include:

Step 202C: Send the object code to the target client.

FIG. 14 is a flowchart of Embodiment 3 of an encoding processing method according to the present invention. On the basis of the foregoing embodiment shown in FIG. 5D, as shown in FIG. 14, a specific implementation manner of step 102C2 is:

Step 301C: Acquire a context object.

Step 302C: Acquire a corresponding coding space according to the context object and the coded protocol.

Step 303C: Encode the data content in the data object in the coding space to obtain an instance code.

Step 304C: Acquire an object code corresponding to the data object according to the meta code and the instance code.
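Steps 301C to 304C can be sketched as follows. This is an illustrative outline only: the classes `CodingSpace` and `Repository`, the sequential allocation of instance codes, and the representation of an object code as a `(meta_code, instance_code)` pair are all assumptions introduced here, not details taken from the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class CodingSpace:
    """One isolated coding space: maps data content to an instance code."""
    name: str
    codes: Dict[bytes, int] = field(default_factory=dict)

    def encode(self, content: bytes) -> int:
        # Reuse the code of known content, otherwise allocate the next one.
        if content not in self.codes:
            self.codes[content] = len(self.codes) + 1
        return self.codes[content]


class Repository:
    """Hypothetical repository that keeps one space per (context, protocol)."""

    def __init__(self) -> None:
        self.spaces: Dict[Tuple[str, str], CodingSpace] = {}

    def space_for(self, context: str, protocol: str) -> CodingSpace:
        key = (context, protocol)
        if key not in self.spaces:
            self.spaces[key] = CodingSpace(name=f"{protocol}/{context}")
        return self.spaces[key]


def encode_object(repo: Repository, context: str, protocol: str,
                  meta_code: int, content: bytes) -> Tuple[int, int]:
    # Step 301C: the context object is supplied by the caller.
    space = repo.space_for(context, protocol)   # Step 302C
    instance_code = space.encode(content)       # Step 303C
    return (meta_code, instance_code)           # Step 304C
```

With this sketch, the same content encoded for two different context objects lands in two different coding spaces, so the resulting instance codes are unrelated to each other.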

In this embodiment, the encoding repository (also referred to herein as an encoding warehouse) may be a repository that stores encoded metadata, encoded meta-objects, and object data, and may also provide related services. Similar to the font library of a standardized encoding system, the glyphs corresponding to the character encodings in the handwriting input system of the present invention can also be stored in the encoding warehouse. FIG. 15 is a schematic diagram of glyphs corresponding to non-standard character encodings stored in an encoding warehouse in the handwriting input system of the embodiment. As shown in FIG. 15, by accessing the glyph information in the encoding warehouse, an application using the new data processing system can render any text glyph.

However, unlike traditional fonts, not only glyph information is stored in the encoding repository. The new data processing system uses a solution based on object open coding. You can encode graphics, voice, or other multimedia data, as well as encode different domain data. These encoded metadata are also stored in the encoding repository. The application system can not only query and use various encodings in the encoding warehouse, but also register new encoding types with the encoding warehouse and submit encoded data to them.

FIG. 16 is a core conceptual diagram of an encoding metamodel of an exemplary context-dependent object encoding system. As shown in FIG. 16, the diagram illustrates the relationships between some of the core concepts in the encoding metamodel. The definitions of these concepts are given below.

The encoding space is the logical space that isolates object encodings. The same instance code of the same object type corresponds to different objects in different encoding spaces. An encoding space is directly associated with one or several encoded objects (only one in the encoding metamodel above). That object (or those objects) is called the direct context of the space and of the encoded objects in the space, and the space is called the encoding space of that object (or those objects).

The encoding space of an encoded object that itself lies in an encoding space is called a subspace, and the enclosing encoding space is called the parent space of that subspace. An encoding space without a parent space is called the root space. The root space is generally the encoding space of the encoding repository.

In the computer world, we encode with binary bits. Given enough bits, we can use as many encodings as needed, including meta-codes. But in practice, more bits mean a cost in performance and storage. In addition, flat meta-coding is not conducive to management. This is also one of the reasons why programming languages (such as C++ and Java) and XML technology use namespaces. Similarly, we introduce the concept of coding space to manage coding more effectively. In fact, the coding space is a means of hierarchically classifying and isolating the encoded metadata. The coding space is hierarchical, that is, a coding space can also have subspaces. The same code belonging to different coding spaces can correspond to different objects, and the same meta-code can mean something completely different in different spaces. In fact, different coding spaces provide different levels of security isolation for encoding.
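The hierarchy of coding spaces can be pictured as a tree in which each space holds its own code table. The fragment below is a rough sketch under assumed names (`EncodingSpace`, `objects`, `path`); it only shows that the same code value can refer to different objects in sibling spaces.

```python
from typing import Dict, Optional


class EncodingSpace:
    """A node in the tree of coding spaces (illustrative only)."""

    def __init__(self, name: str,
                 parent: Optional["EncodingSpace"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: Dict[str, "EncodingSpace"] = {}
        self.objects: Dict[int, str] = {}  # code -> object, local to this space
        if parent is not None:
            parent.children[name] = self

    def is_root(self) -> bool:
        # A space without a parent space is the root space.
        return self.parent is None

    def path(self) -> str:
        # Walk up to the root and join the names from root to this space.
        parts = []
        node: Optional["EncodingSpace"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "|".join(reversed(parts))


root = EncodingSpace("repository")
alice = EncodingSpace("user-alice", parent=root)
bob = EncodingSpace("user-bob", parent=root)

# The same code (1) is registered independently in two sibling spaces.
alice.objects[1] = "glyph drawn by alice"
bob.objects[1] = "glyph drawn by bob"
```

Here code 1 is meaningful only together with the space in which it is looked up, which is the isolation property described above.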

We can divide the coding space in different ways. However, in the process of using and processing the code, some basic objects are inevitably involved. For example, Figure 17 is a schematic diagram of a base object that can be applied to a basic coding space.

For the purposes of the present invention, any code is present in the code repository, with the exception of standard codes. In fact, different encoding warehouses correspond to different encoding spaces. The encoding space corresponding to an encoding warehouse is the root space of all encodings of this encoding warehouse.

Similarly, in the same code repository, each code has its own owner. Then the coding of different users belongs to different user coding spaces. With the complexity of user models in the coding warehouse, the division of user space can be more complicated. For example, there may be a group space shared by multiple users.

The same kind of data objects are often used by different applications. For a specific user of a code repository, different applications can share the same code; they can also use separate codes for these applications. For the former, the same text content can be processed and used by different applications without conversion. For the latter, independent coding increases the security of the data—the code that leaks from a malicious application or a compromised application only affects the data corresponding to that application. Of course, the advantage of the former corresponds to the disadvantage of the latter, and vice versa. Interoperability and security have always been two sides of a coin. But here, we can see that the introduction of the concept of space gives us the flexibility of choice.

Further, the encoding is to be serialized into a specific data store. This data store can be a file, a database field, or a string that is transmitted over the network. Isolating the encoding per data content maximizes the security of the encoding. In fact, a content space based on this per-content isolation is a codebook that establishes a content-to-code correspondence.

Finally, the coding can be divided into different areas for management. Such an area can be called a management space. Names/identifiers can often be used to distinguish between different management spaces, so they are also called named encoding spaces.

In the context of encoding formation and use, the above two encoding spaces (named encoding space, context encoding space) may be implicitly present. We call this the context space.

In an encoding repository, the permutation of different kinds of context objects determines the final context space. For example, different user and application permutations combine to correspond to different context spaces. But in general, the code in the non-standard text content is uniquely corresponding to the content, and the content itself implies the corresponding application and user (except, of course, multi-application, multi-user content). Therefore, it is not necessary to divide the application subspace or the user subspace in the content space. In all context spaces, there is a special space, which is a context-independent coding space, which we call public coding. In fact, standardized coding is public coding. The encoding in the root space is not a common encoding, but an encoding related to the encoding warehouse. The encoding space is the root space corresponding to the encoding warehouse.

For an encoding system, anything will eventually be embodied as a code. The last code corresponding to the coding space is a meta-code, which we can call spatial coding. The encoding space is actually a special encoding meta-object - its corresponding object instance is still an encoding meta-object. For context-independent spatial coding, there is no coding space for this encoding. However, for context-dependent spatial coding, the coding can correspond to different coding spaces depending on the context object. Therefore, for context-independent coding spaces, such as named encoding space, we can directly use spatial encoding, and the corresponding instance encoding is subspace encoding or other metacoding. For the context coding space, we can directly use the encoding of the context object as the corresponding spatial encoding. For example, the code corresponding to the coded warehouse space is the coded warehouse code. The content space corresponds to the instance code. The application space corresponds to the application code. User space corresponds to the user code.

For example, Figure 18 is a schematic diagram of the coding structure of a 128 fixed length coding scheme. In addition, the arrangement and combination of the above codes are not unique. For example, the instance code can be placed at any position in the object code as long as the position is clearly defined in advance.

In actual use, context space encoding is implicit in the context in which the encoding is used and does not need to appear in the final object encoding. For example, the encoding repository currently in use implies the encoding warehouse encoding; the application currently in use implies the corresponding application encoding; the document content currently being encoded implies the instance encoding and the user encoding of the encoding owner (assuming a single-user document). However, when encodings from multiple spaces of the same kind appear in the same text content, context space encoding must appear in the text to set the different encoding contexts and isolate the different spaces. For example, if the text in a document includes encodings from multiple encoding repositories, the corresponding encoding warehouse codes must appear in the document content to distinguish the different encoding warehouse spaces. Of course, an encoding repository that supports encoding warehouse encoding must provide the information needed to access the encoding repository identified by that encoding. Similarly, multi-user text content must use user encoding, and application encoding must be used in content that can be read and written by multiple applications and that uses application space isolation. Content space is an exception: content encoding is the encoding of the document content itself and corresponds one-to-one with it. No content can contain the encodings of multiple contents, so the content encoding does not need to appear in the encoding. In terms of implementation, the content encoding can be a hash value of the document content, or a hash value of the application encoding and a time stamp. Content encoding is therefore either calculated in real time or stored as content metadata.
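The two ways of deriving a content encoding mentioned above can be sketched with a standard hash function. The choice of SHA-256, the function names, and the `app_code:timestamp` payload format are assumptions for illustration; the embodiment only states that a hash value is used.

```python
import hashlib


def content_code_from_document(document: bytes) -> str:
    """Content encoding computed in real time from the document content."""
    return hashlib.sha256(document).hexdigest()


def content_code_from_app_and_time(app_code: str, timestamp: int) -> str:
    """Content encoding derived once from the application encoding and a
    time stamp; the result would be stored as content metadata."""
    payload = f"{app_code}:{timestamp}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

The first variant changes whenever the document changes and so must be recomputed, while the second stays fixed for the life of the document and therefore has to be stored with it.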

As mentioned above, in general, encoding does not need to include spatial encoding, but it is necessary to indicate which spatial encoding is used, which can be specified by using spatial bits in the encoding. This space bit actually corresponds to the coding context specification in the coding protocol.

In addition, for example, FIG. 19 is a schematic diagram in which four binary bits serve as four space bits. As shown in FIG. 19, the encoding repository bit may also be called a reserved bit. As an illustrative example, when the reserved bit is 0, the encoding comes from the current encoding repository; otherwise, additional information is required to define the encoding or specify its source, such as the client encoding mentioned later. When the content bit is 0, the encoding is independent of the content; when it is 1, the encoding exists for the specific content. When the application bit is 0, the code is independent of the application; when it is 1, the code is application-specific. When the user bit is 0, the code is a public code; when it is 1, the code is owned by the current document user. The meanings of 0 and 1 may also be reversed. Any other coding scheme can be used as long as it can effectively distinguish the different spaces.
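A decoder might read the four space bits as shown in the sketch below. The bit order (repository/reserved bit first, then content, application, and user) is an assumption made here, since the actual order is defined by FIG. 19; the names `SpaceBits`, `parse_space_bits`, and `pack_space_bits` are likewise hypothetical.

```python
from typing import NamedTuple


class SpaceBits(NamedTuple):
    repository: bool   # reserved bit: True means "not the current repository"
    content: bool      # True means the code exists for specific content
    application: bool  # True means the code is application-specific
    user: bool         # True means the code is owned by the document user


def parse_space_bits(nibble: int) -> SpaceBits:
    """Split a 4-bit value into the four space flags (assumed bit order)."""
    if not 0 <= nibble <= 0b1111:
        raise ValueError("space bits must fit in four binary bits")
    return SpaceBits(
        repository=bool(nibble & 0b1000),
        content=bool(nibble & 0b0100),
        application=bool(nibble & 0b0010),
        user=bool(nibble & 0b0001),
    )


def pack_space_bits(bits: SpaceBits) -> int:
    """Inverse of parse_space_bits()."""
    return ((bits.repository << 3) | (bits.content << 2)
            | (bits.application << 1) | int(bits.user))
```

For instance, under this assumed order the value 0b0101 denotes a code from the current repository that is content-specific, application-independent, and owned by the current document user.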

It is worth mentioning that a type encoding, like an ordinary encoding, also has a coding space. Moreover, the spaces of the type encoding and the instance encoding can be different. For example, using public coding for user space can serve as a security isolation for the user space. In this example, the encoding type of the encoding is user space, and the instance encoding is public space. Since an instance code must belong to a certain encoding type, instance codes of the same type share the same space bits. In the specific decoding process, the metadata of the encoding type in the encoding warehouse can be accessed according to the type encoding. Therefore, the type encoding must contain the corresponding space to ensure that the decoder can get the correct encoding type information from the encoding repository. The type information in the encoding repository can contain the space bits corresponding to the instance encoding, so the space bits do not need to appear in the instance code.

Context space is the main means to securely isolate the code. The main body that manages and sets the application with the generated encoding target space should be the individual corresponding to the context object (such as the user) and the administrator (such as system administrator and application administrator). The management space is a hierarchical management that facilitates coding and is registered and used by the application.

The code word length is the minimum number of bits required to encode a character in a text encoding system. For example, the code word length of UTF-8 is 8 binary bits, or one byte, and the code word length of UTF-16 is two bytes. In an encoding with a given code word length, not all codes are of this length, but the length of every code must be an integer multiple of the code word length. For an encoding system with a multibyte word length, the endianness within one code word must also be considered. This problem does not exist for single-byte word lengths, where all data is arranged in bytes from low to high.
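The relationship between code length, code word length, and byte order can be checked with Python's built-in UTF-8 and UTF-16 codecs. The helper name `encoded_length_in_words` is introduced here purely for illustration.

```python
def encoded_length_in_words(text: str, encoding: str, word_bytes: int) -> int:
    """Return how many code words the encoded text occupies."""
    data = text.encode(encoding)
    if len(data) % word_bytes != 0:
        raise ValueError("encoded length is not a multiple of the word length")
    return len(data) // word_bytes


sample = "A\u4e2d"  # the letter 'A' followed by the CJK character U+4E2D

utf8 = sample.encode("utf-8")          # one-byte words, no byte-order issue
utf16_le = sample.encode("utf-16-le")  # two-byte words, low byte first
utf16_be = sample.encode("utf-16-be")  # two-byte words, high byte first
```

In UTF-8 the two characters occupy one and three code words respectively, while in UTF-16 each occupies a single two-byte word whose byte order depends on the chosen endianness.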

In addition, regarding fixed length coding and variable length coding: an encoding system in which all code lengths are equal to the code word length is called a fixed length coding system. Otherwise, it is called a variable length coding system.

In the object coding system, the coding word length and the associated coding method are closely related to the coding and decoding process, and are independent of the coding element model. That is to say, the object coding system corresponding to the same coding element model can select different coding word lengths and corresponding different coding methods. It is even possible to support multiple word lengths or combinations of encoding methods at the same time. Of course, it is necessary to design an effective mechanism to distinguish them.

It should be pointed out that the coding length and encoding method of the system are not directly related to the serialization word length and method specified in the specific object coding protocol. However, if the serialization result is part of the object encoding, the compatibility of the object encoding word length and the method needs to be considered.

Similar to Unicode, the object encoding system can be independent of the encoding word length. That is to say, different word length coding schemes can exist on top of the same code repository. In a short word length coding scheme, one code word often cannot hold a complete code (which, as mentioned above, includes spatial coding, type coding, and instance coding). In this case, variable length coding can be used, in which one code spans multiple words; for example, the meta-code portion and the instance code portion are split across several consecutive code words. Even so, a single word length sometimes cannot cover all encoding instances. The variable length encoding technique of Unicode can then be used, in which flag bits define the length of the code. For example, for a code with a word length of one byte, FIG. 20 is an example diagram of a coding scheme. As shown in FIG. 20, the scheme enables the encoder to obtain the corresponding code length automatically from the first byte or the first two bytes. The scheme can represent a coding range of 0 to 265-1.

FIG. 21 is an exemplary diagram of the encoding scheme of UTF-8. Comparing the scheme above with UTF-8 (as shown in FIG. 21) shows that the encoding results of the two schemes do not conflict with each other and may appear in the same document. When the first bit of the first byte of a code is 0, the byte corresponds to the ASCII portion of UTF-8; when the first two bits of the first byte are 10, the code is an object encoding; when the first two bits of the first byte are 11, the code is a Unicode encoding. In this way, hybrid encoding of object encoding and Unicode can be achieved.
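The dispatch on the leading bits of the first byte can be sketched as a small classifier. The function name and the returned labels are illustrative. The scheme works because, in standard UTF-8, a byte of the form 10xxxxxx only ever occurs as a continuation byte and never as the first byte of a character, so that prefix is free to mark an object encoding.

```python
def classify_lead_byte(byte: int) -> str:
    """Classify the first byte of a code in a mixed object/Unicode stream."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value")
    if byte & 0b1000_0000 == 0:
        return "ascii"     # 0xxxxxxx: ASCII portion of UTF-8
    if byte & 0b1100_0000 == 0b1000_0000:
        return "object"    # 10xxxxxx: start of an object encoding
    return "unicode"       # 11xxxxxx: start of a multi-byte UTF-8 sequence
```

A decoder for the hybrid stream would call such a classifier on the first byte of each code and then hand the following bytes to the matching parser.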

Similarly, another variable length coding scheme with one byte word length and multiple byte word lengths can be designed.

In addition, for the encoding type, the encoding type is the object type to which the relevant encoding convention is added.

In addition, for an encoding context, the encoding context is an abstraction of the context object. It is actually the selection criteria for the selection of context objects at runtime. The above encoding metamodel uses the encoding type plus the object role name. In the same encoding context (generally a specific application), the same type of role name must be unique.

For example, in a web blog application, there are authors and readers. They are all user objects, but they are different roles. The encoding context of the data object in the blog content should be the author user. In this way, when any reader opens the content, there is no problem that the decoding error occurs because the currently logged in user is not the author. Of course, the premise of correct decoding is to correctly set the encoding context object. For the blog example, when opening each specific blog content, the corresponding author user object is set as the encoding context object.

In addition, for the encoding path, the encoding context path is referred to as the encoding path, and corresponds to a series of encoding context conventions, which is a constraint on the encoding space to which the instance code of the corresponding data object belongs. The definition of the coding space indicates that the coding space is a hierarchy associated with the encoded object with the associated encoding - the subspace can also have subspaces. The encoding path is the encoding space path that is positioned to determine the encoding object. For example, the image encoding path in a personalized journal might look like this:

Coding warehouse space | user 001 space | application personalized diary space

The image corresponding to the image object encoding can be found in the final application space.

The encoding path exemplified above is a specific runtime path. The encoding path in the encoding metamodel is an encoding path at a higher level of abstraction, corresponding to:

Root space|author space|application space

At runtime, this encoding path is instantiated to the above encoded path instance by selecting the corresponding context object.

The so-called context object is a concrete object corresponding to the context specification. The object must conform to the constraints of the context specification and must be accessible during the corresponding encoding and decoding process. For example, suppose there is an "author" context constraint whose corresponding type is "user". When this context constraint is set, the current application cannot be set as the corresponding context object; an object of the "user" type must be used. In general, after the author information corresponding to the document is obtained, it can be set as the context object corresponding to the "author" context constraint. If the author object is inaccessible to the current user, the context object cannot be instantiated, which means that the encoding context constraint is not satisfied and the subsequent related instance encoding cannot be decoded. This is also a concrete manifestation of context-based coding security in this method.
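Instantiating an abstract encoding path with concrete context objects, including the type check on each context constraint, might look like the sketch below. `ContextObject`, `ABSTRACT_PATH`, and `instantiate_path` are hypothetical names, and the space names are taken from the diary example above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ContextObject:
    type_name: str   # for example "user" or "application"
    space_name: str  # name of the encoding space belonging to this object


# Abstract path below the root space: (role name, required object type).
ABSTRACT_PATH: List[Tuple[str, str]] = [
    ("author", "user"),
    ("application", "application"),
]


def instantiate_path(root_space: str,
                     abstract_path: List[Tuple[str, str]],
                     context: Dict[str, ContextObject]) -> str:
    """Turn an abstract encoding path into a runtime encoding path."""
    parts = [root_space]
    for role, required_type in abstract_path:
        obj = context.get(role)
        if obj is None:
            # Inaccessible context object: the constraint is not satisfied.
            raise LookupError(f"no context object available for role {role!r}")
        if obj.type_name != required_type:
            raise TypeError(f"role {role!r} requires type {required_type!r}")
        parts.append(obj.space_name)
    return " | ".join(parts)
```

If either error is raised, no target encoding space exists for the instance code, which corresponds to the case in which the instance encoding cannot be decoded.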

In fact, in the implementation of the system, the encoding path instance is directly related to the encoding space of the corresponding data object instance code in the encoding warehouse. Optionally, the storage location of the corresponding data object in the encoding warehouse may also be restricted by the encoding space. The specific implementation of the encoding path for the encoding warehouse can have multiple choices depending on the storage scheme. Here is a concrete implementation example. In a code repository implemented with relational database technology, a simple implementation is to use simple context name splicing to form table names for context-sensitive data objects. In the example above, the table name of this picture table can be:

User_001_Application_005_Picture Table

The instance code of the corresponding data object can directly use the keys of the table.
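The name splicing described above can be sketched as a small helper. The function name, the argument layout, and the restriction to letters, digits, and underscores are assumptions added for illustration; the check is there because a table name built from context data should not be passed to a database unchecked.

```python
import re
from typing import List, Tuple


def context_table_name(contexts: List[Tuple[str, str]], object_type: str) -> str:
    """Build a table name by splicing context kinds, identifiers and type."""
    parts: List[str] = []
    for kind, identifier in contexts:
        parts.extend([kind, identifier])
    parts.append(object_type)
    name = "_".join(parts)
    # Reject anything that is not a plain identifier.
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError(f"unsafe table name: {name!r}")
    return name
```

For the example above, `context_table_name([("User", "001"), ("Application", "005")], "Picture")` yields a name of the same shape as the picture table name.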

Another implementation of the coding space is to uniformly store the data objects, and only distinguish the coding space for coding. Here is a concrete implementation example. In an encoding repository implemented with relational database technology, the system maintains a table of encoding spaces as follows:

Encoding space ID	Parent space ID	Context object reference encoding
0	Null	Null
8	0	(reference code for user 001)
100	8	(reference code for application 005)

...

The code space ID field is the table primary key; the parent space ID is a foreign key of the table, and is used to represent the nested relationship of the code space.

There are two tables for each data object placed in the data warehouse. One is the data table itself of the data object, such as a picture table:

Picture ID	Field 1	…
…	…	…

The picture ID field is the primary key of the table. The data for all images is placed in the table. The other is the corresponding picture encoding table:

Encoding space ID	Encoding	Picture ID
100	001	…
100	002	…
…	…	…

The code space ID field is a foreign key of the system code space table, and the picture ID field is a foreign key of the picture table. The Encoding Space ID field plus the Encoding field is the primary key of the table.
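The three tables and their key relationships can be tried out with an in-memory SQLite database. The table and column names, the sample file names, and the helper functions `decode_picture` and `space_path` are assumptions chosen for this sketch; only the key structure follows the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE encoding_space (
    space_id        INTEGER PRIMARY KEY,
    parent_space_id INTEGER REFERENCES encoding_space(space_id),
    context_ref     TEXT
);
CREATE TABLE picture (
    picture_id INTEGER PRIMARY KEY,
    field1     TEXT
);
CREATE TABLE picture_encoding (
    space_id   INTEGER NOT NULL REFERENCES encoding_space(space_id),
    code       TEXT    NOT NULL,
    picture_id INTEGER NOT NULL REFERENCES picture(picture_id),
    PRIMARY KEY (space_id, code)
);
""")
conn.executemany("INSERT INTO encoding_space VALUES (?, ?, ?)",
                 [(0, None, None), (8, 0, "ref:user001"), (100, 8, "ref:app005")])
conn.executemany("INSERT INTO picture VALUES (?, ?)",
                 [(1, "cat.png"), (2, "dog.png")])
conn.executemany("INSERT INTO picture_encoding VALUES (?, ?, ?)",
                 [(100, "001", 1), (100, "002", 2)])


def decode_picture(space_id: int, code: str):
    """Resolve (encoding space, instance code) to the stored picture data."""
    row = conn.execute(
        "SELECT p.field1 FROM picture_encoding e "
        "JOIN picture p ON p.picture_id = e.picture_id "
        "WHERE e.space_id = ? AND e.code = ?", (space_id, code)).fetchone()
    return row[0] if row else None


def space_path(space_id: int):
    """Follow parent links up to the root space; return root-first IDs."""
    path = []
    current = space_id
    while current is not None:
        path.append(current)
        current = conn.execute(
            "SELECT parent_space_id FROM encoding_space WHERE space_id = ?",
            (current,)).fetchone()[0]
    return list(reversed(path))
```

Because the primary key of the encoding table is the pair (space, code), the same code value can be reused in another encoding space without conflict, while a duplicate within one space is rejected.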

Additionally, for an encoded directory entry, the encoded directory entry is a specific encoded meta-object encoded by the context-dependent object. There is one and only one encoding directory in each encoding space, and the encoding directory is a list of encoding directory entries. Each encoding directory entry has a unique number in the encoding directory, which is the metacode. In the above encoding metamodel, the encoding directory entry is specifically the encoding type plus the encoding path. The encoding path can be a relative path, that is, the current space of the encoding directory item, or an absolute path-based root space; or both can be supported at the same time, and only a mechanism for distinguishing the two needs to be established.

That is to say, in the context-dependent object encoding system, the meta-encoding (encoding corresponding to the encoding directory entry) and the instance encoding in the object encoding may not be in one encoding space.

The encoding directory entry can unify the spatial encoding and type encoding mentioned above. If the encoding type in the object data corresponding to a meta-encoding (that object data is actually an encoding directory entry) is itself the encoding directory entry type, then the meta-encoding corresponds to an encoding space, and the instance encoding that follows it is actually another meta-encoding. In this way, a meta-encoding can represent either a spatial encoding or the encoding of an encoding directory entry, depending on whether the corresponding encoding type is the encoding directory entry type. With the support of this design, the meta-encoding of an object encoding can therefore be a group of one or more meta-codes: the last meta-code corresponds to an ordinary encoding meta-object, and the preceding meta-codes correspond to encoding spaces. In addition, the concept of space bits described earlier can be hidden inside the code repository by means of encoding directory entries instead of being exposed directly in the code. The encoding path is more flexible and secure than space bits, and different combinations of context objects can be set.
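The way a group of meta-codes is resolved, with every meta-code except the last naming an encoding space and the last naming an ordinary type, can be sketched as follows. The `Space` class, its `directory` mapping, and the `resolve` function are hypothetical, and integer codes stand in for whatever word format a real system would use.

```python
from typing import Dict, List, Tuple, Union


class Space:
    """An encoding space together with its single encoding directory."""

    def __init__(self, name: str) -> None:
        self.name = name
        # meta-code -> a child Space (space entry) or a type name (type entry)
        self.directory: Dict[int, Union["Space", str]] = {}


def resolve(root: Space, codes: List[int]) -> Tuple[str, str, List[int]]:
    """Walk meta-codes until one names an ordinary type.

    Returns (space name, type name, remaining codes). The remaining codes
    are the instance code.
    """
    space = root
    for i, code in enumerate(codes):
        entry = space.directory[code]
        if isinstance(entry, Space):
            space = entry  # spatial encoding: descend into the subspace
        else:
            return space.name, entry, codes[i + 1:]
    raise ValueError("object code ended before an ordinary type meta-code")


root = Space("root")
user = Space("user001")
root.directory[1] = user       # meta-code 1 in root denotes a subspace
root.directory[2] = "date"     # meta-code 2 in root denotes an ordinary type
user.directory[1] = "picture"  # meta-code 1 means something else in user001
```

Resolving `[1, 1, 42]` from the root first enters the `user001` space, then finds the `picture` type, and leaves `[42]` as the instance code.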

In addition, for the encoding directory item instantiation, the instantiation of the encoding directory entry is mainly the process of instantiating the encoding path (a series of context conventions) into the target encoding space when the context-related object encoding system is running. Thus, with the different context objects in the encoding and decoding process, the same meta-encoding (encoding corresponding to the encoding directory entry) will correspond to different target encoding spaces, and the object instance encoding will be encoded into different encoding spaces. (Of course, only the reference encoding form will correspond to the encoding space). For an encoded directory entry whose encoding path is empty, there is no instantiation process, and the corresponding target encoding space is the space where the directory entry is located.

Encoding directory entry instantiation is the key to how the context-dependent object encoding system implements context dependence.

In addition, for the coding factory, the coding factory is the object codec of the runtime corresponding to the coded directory entry instantiation. It includes the corresponding encoding directory entry, the current encoding space (the space where the encoding directory is located), and the target encoding space (the space where the object instance data is located, which is actually instantiated by the encoding path through the corresponding context-related object). The encoding factory contains all the information that encodes and decodes the data object in addition to the data content of the object. The encoding factory provides a codec service for data objects corresponding to the encoded directory entry (actually a specific type of specific target space).

The encoding space can be used as a special encoding factory, and the encoding type of the corresponding encoding directory item is the encoding directory item type itself. That is to say, the encoding space provides a codec service for encoding directory entries, that is, encoding meta objects.

The final output of the coding factory should be the object code, which includes the meta-code and the instance code. However, the process of combining or splicing the meta-code with the instance code can be placed on the client side or in the code repository, depending on the actual design. The final object code may also include a code that indicates how the meta-code and the instance code were combined or spliced. If necessary, this combination code can be stored separately from the object code, over a different secure channel, with its own access rights. Only after authorization and verification can both the object code and the corresponding combination code be obtained, so that the meta-code and the instance code can be correctly separated.

In addition, regarding the system coding of the context-sensitive object coding system: because of the multi-level code combination feature of the context-sensitive object coding system, it is relatively straightforward to use a variable-length coding method. Both the directory entry code and the instance code can be one word long.

In addition, regarding the context object setting encoding: this system encoding is used to set the current context object (at encoding and decoding time). The setting takes effect on the data objects in the encoding directory that use the relevant context.

Possible forms of the encoding:

[The system code mark] [Code Context Code] [Object Code]

In the above core diagram of the coding metamodel, the coding context object needs to be modified into an encoding object to support the system coding, that is, the coding of the context object is the basis of the above coding form.

Another possible form is:

[The system code mark] [Code Context Identifier] [Object Code]

The encoding context identifier can be a combination of a context type name and a context role name.

Regarding the final encoding: the final encoding tells the decoder where the parsing of an object encoding ends. The final encoding is not always required. In most cases an object encoding terminates with its instance encoding, and the final encoding is needed only when there is no instance encoding. The system can therefore be set to use the end of the instance code as the end marker for encoding resolution. This implies that encoding spaces cannot be nested in a loop and must form a strict tree structure. The final encoding can be a one-word-long mark.

For root space encoding, after setting the default factory with spatial encoding, it is sometimes necessary to use encodings outside the default factory. At this point, we can use root space encoding to convert the current encoding to another space. Root-space coding is the starting point for all complete encodings, and all other encodings as well as meta-encodings can be decoded from the root space. A text content can only correspond to a single root space. In the case where the default factory is not set, the default factory is the root space. The root space encoding can be a special token of a word length, which can be followed by the object's full object encoding from the root encoding to the instance encoding.

Regarding the default meta-encoding setting code: the default meta-encoding setting is actually a setting of the encoding space or the encoding factory. Root space coding can override this setting. Except for object encodings that begin with the root space encoding, all object encodings are decoded by this encoding factory.

Since this encoding must end in meta-encoding, it must be terminated using final encoding.

Possible forms of the encoding:

[The system code mark] [multiple directory code] [terminator code]

Context-sensitive object coding can improve the expressiveness of the coding while shortening its length. It is well suited to encoding, storing and transmitting massive numbers of data objects with rich data types and complex relationships in big-data and cloud-storage settings, and it also suits the lightweight, diverse identification needs of the Internet of Things.

For object encoding and text, standard literal encoding is actually a reference encoding of a character object. So we can think of the object coding sequence as a special text content. For some of the operational concepts and processing tools of traditional text, we can learn and reuse them, combined with the characteristics of object coding. Such as text search, retrieval, editing, replacement, and so on.

At the same time, object encoding and text encoding can also be mixed, as long as the text encoding is a special object encoding.

When object encoding and text encoding are mixed together, there are three ways to do the corresponding encoding and decoding methods:

1. Assign a special metacode to the text encoding

2. Use a specific text encoding to escape to the object encoding with the specified escape character when object encoding is required.

3. Extend the specific text encoding and extend it to an extended text encoding that expresses the object encoding.

Regarding structured object coding: as mentioned above, the object coding sequence can be regarded as a special text content. On the basis of standard text, there are already a large number of coding standards and formats for structured documents, such as the comma-separated text table format CSV, the markup-language-based structured document standards SGML/XML, the JSON format that packs data structures using JavaScript syntax, and so on. On the one hand, we can directly use these formats and standards and mix object-encoded characters in as content.

On the other hand, we can also regard the structured document composed of the object coding sequence as a special object, which is itself encoded by the object encoding method. The encoding result is the serialization of the encodings of all the sub-objects that make up the object. The encoding and decoding process of this structured object can be treated as part of the encoding metadata, stored as an ordinary data object in the encoding warehouse, and the content is encoded and decoded according to that encoding metadata. This codec process synthesizes and parses the object code sequence that forms the serialized content of the structured object, and further encodes and decodes each object code. The process can be recursive and nested. In addition, the codec for structured objects can also be defined in other forms, as follows:

Encoding of an array of objects

Generally refers to the encoding of a set of objects that share the same meta-encoding. In the variable-length coding method, an array system code is defined so that the redundant meta-encodings can be removed. The array encoding can be defined as follows:

Array encoding := array system code + array length n + object encoding of the first element (meta-encoding + instance encoding) + instance encodings of the remaining n-1 elements

Under this definition, the array system encoding can be thought of as the meta-encoding of array objects. The meta information of an array object is implicit in the entire array encoding, including array length, array type, and so on.
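As a hedged illustration of the array definition above, the following Python sketch assembles an array encoding from a system code, a length, the full object code of the first element, and the instance codes of the remaining elements. The marker value 0xFE, the one-byte length field and the function name are assumptions made for illustration only; the source does not fix them.

```python
ARRAY_SYSTEM_CODE = b"\xFE"  # hypothetical marker; not specified in the source


def encode_array(meta_code: bytes, instance_codes: list) -> bytes:
    """Array := system code + length n + (meta + first instance) + n-1 instances."""
    n = len(instance_codes)
    if n == 0 or n > 255:
        raise ValueError("this sketch handles 1..255 elements (one-byte length)")
    out = bytearray(ARRAY_SYSTEM_CODE)
    out.append(n)                      # array length n (assumed to fit one byte)
    out += meta_code                   # meta-encoding is written once only
    for instance in instance_codes:    # first element, then the remaining n-1
        out += instance
    return bytes(out)


encoded = encode_array(b"\x85", [b"\x01", b"\x02", b"\x03"])
```

Writing the shared meta code once saves one meta code per additional element compared with concatenating three complete object codes.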

Object two-dimensional table encoding

Generally refers to the encoding of a two-dimensional array in which the elements of each column share the same meta-encoding. Similarly, a table system code is defined so that the redundant meta-encodings can be removed.

Table encoding := table system code + number of data rows n + object codes of the elements of the first row (each including meta code + instance code) + instance codes of the remaining n-1 rows

Under this definition, the table system code can be thought of as the meta-encoding of table objects. The meta information of the table object is implicit in the entire table encoding, and can include the number of rows, the column types, and so on.

Object tree encoding

The tree structure is very common and can represent complex object combinations, such as document trees, abstract syntax trees, and so on. A special type of tag encoding can be defined. The tag encoding is actually the tag at the beginning of the tree node, and the tag end tag is specified in the tag object's metadata. When the decoder parses the end tag, the data object between the tag and the end tag is combined to form a tree node object. Tree node objects can be nested and combined.

All the tree structure information can be placed in the encoding metadata corresponding to the root node code, or it can be distributed hierarchically into the encoding metadata corresponding to each tree node.

Meta-meta-encoding

The meta-meta-encoding is the encoding of the metadata that describes the encoding metadata itself. It is also part of the meta code.

Tag encoding

In object coding there is also the case where no instance encoding is present, that is, the object encoding consists only of the meta-encoding part. This type of encoding is called tag encoding and corresponds only to the encoding metadata. Its main role is to provide semantic tags to the decoder. It is used extensively in structured coded streams.

Further, a specific implementation manner of step 304C is:

The meta-encoding and instance encoding are used to generate the object encoding using predetermined rules.

In the present embodiment, the object encoding can be constituted from the meta-encoding and the instance encoding in various ways. The object encoding can be constructed by directly combining or splicing the meta code with the instance code. For example, FIG. 22 is a schematic diagram of an object encoding composed of a meta-encoding and an instance encoding.

In addition, the object encoding can also be obtained by some kind of operation on the meta-encoding and the instance encoding, or by another feasible hybrid, for example:

Object encoding = instance encoding × 101 + meta encoding

In this way, the object code can be split back into the meta code and the instance code through the corresponding inverse operations (integer division and remainder by 101, provided the meta code is smaller than 101).

Therefore, any manner of obtaining the object encoding from the meta-encoding and the instance encoding can be applied to the present invention, as long as the meta-encoding and the instance encoding can be recovered in a reversible manner.
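The reversibility requirement can be illustrated with the "× 101" example. The following Python sketch is only one possible reading: it assumes that both codes are non-negative integers and that the meta code is smaller than 101, conditions the source does not state but which are needed for the inverse to be unique. The function names are hypothetical.

```python
BASE = 101  # multiplier taken from the example formula


def combine(meta_code: int, instance_code: int) -> int:
    """Object encoding = instance encoding * 101 + meta encoding."""
    if not 0 <= meta_code < BASE:
        raise ValueError("meta code must lie in 0..100 for the split to be unique")
    if instance_code < 0:
        raise ValueError("instance code must be non-negative")
    return instance_code * BASE + meta_code


def split(object_code: int) -> tuple:
    """Recover (meta code, instance code) by integer division and remainder."""
    instance_code, meta_code = divmod(object_code, BASE)
    return meta_code, instance_code


object_code = combine(5, 256)
```

Here `split(object_code)` returns the original pair `(5, 256)`, which is the reversibility the paragraph above requires.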

Both the meta-encoding and the instance encoding are used internally by the object encoding system; they are typically generated automatically within the system and are invisible to applications built on top of it. Depending on how the metadata portion relates to the data content portion, the instance code may or may not be related to the meta code.

Type coding is a typical meta code. Type coding can be used to obtain type information of an object instance, as well as a coding convention for related types.

Preferably, the method may further include:

A code representing the predetermined rule is added to the object code.

or,

The code representing the predetermined rule and the object code are respectively stored under different secure channels, and different access rights are respectively set for the encoding of the predetermined rule and the object encoding.

In the present embodiment, regarding context-dependent coding: as mentioned above, object-based coding already has type-based coding isolation. However, for a given type of data object, a unified coding space has two major drawbacks. First, the coding is not secure enough. By directly modifying a code or using random codes, one may gain direct access to other users' data objects of the same type. Second, the coding is not efficient enough. To ensure that the encodings of data objects of the same type do not conflict with each other, the storage space occupied by the object encoding itself grows as the number of data objects increases, which eventually reduces coding efficiency.

Context-sensitive coding introduces the concept of a context-sensitive coding space, which solves both of these problems.

The so-called coding space is an abstract concept that isolates the encoding of data objects. The encoding of a certain type of data object in a certain encoding space is unique. But it may correspond to different encodings in different coding spaces. At the same time, the same type, the same code, may correspond to different data objects in different coding spaces.

A context object refers to a data object related to the encoding usage environment, such as a user, an application system, a time, a place, a domain, and the like. The encoding of some data objects is closely related to these usage environments. For example, a user-private data object is closely related to the user, so the corresponding encoding should also be relevant to the user.

The context-dependent coding space refers to the coding space that belongs to the context object. By using the information of the context object in the meta information of the data object, we can specify the encoding space of the corresponding data object. In this way, we can directly encode the data object with the encoding in the encoding space. In the process of encoding use and parsing, the same object encoding can correspond to different encoding spaces with different context objects. This further improves the effectiveness of the coding.
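A minimal Python sketch of this idea follows. It models each context-dependent coding space as a dictionary keyed by a context object, so that the same code resolves to different data objects for different users. The repository layout, the user names and the stored values are all hypothetical.

```python
# Hypothetical repository: one coding space per context object.
CODING_SPACES = {
    ("user", "alice"): {256: "handwritten glyph of alice"},
    ("user", "bob"): {256: "handwritten glyph of bob"},
}


def resolve(context_object: tuple, instance_code: int) -> str:
    """Look up an instance code inside the coding space owned by context_object."""
    space = CODING_SPACES[context_object]   # KeyError: unknown context object
    return space[instance_code]             # KeyError: code unused in that space


alice_object = resolve(("user", "alice"), 256)
bob_object = resolve(("user", "bob"), 256)
```

Because code 256 is only unique within one space, the code itself can stay short, and access to a space can be guarded by guarding its context object.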

In addition, by providing certain security access mechanisms for some key context objects, the security of the corresponding coding space can be guaranteed, thereby ensuring the security of coding in the space.

More importantly, in this embodiment, the key to object-based coding is the meta information of the data object. The serialization (content encoding), transmission, and storage of data objects are all controlled by their meta information. The type of a data object is an important piece of meta information. Data objects have many different data types, and these types are related to one another. For example, complex types are composed of simple types, and multiple data objects of one or more types can be combined according to certain conventions into some kind of special structure, and so on. All of these types together form a type system. The object-based coding system is built on top of a complete type system. That is to say, in the corresponding coding system, every data object has an object type. This type system is extensible: users can define their own custom types on the basis of existing types and the type definition and extension mechanisms. The type system mainly gives the corresponding coding system three benefits:

First, type checking

With the object type, we have a basis for verification of the data legitimacy of the corresponding object. This is extremely important for the reliability of data encoding and transmission.

Second, type derivation

With object types, we can derive their local types or related types. Therefore, this local type or related type can be omitted during the encoding process. This greatly improves the coding efficiency.

Third, code isolation

With object types, we can reuse encodings for different types (specifically, reference encodings). This also improves the validity and security of the code.

In addition, in the present embodiment, we introduce OTF-8 encoding. First, regarding the character encoding in OTF-8 encoding, the target encoding here is a text encoding. However, unlike traditional text encoding, the encoding and decoding process requires the participation of the encoding warehouse. Therefore, the encoding result and the decoding source can support non-standard characters. Data for non-standard characters exists in the encoding repository.

This text code is based on UTF-8, and we call it OTF-8. OTF-8 uses the single byte as its code unit, so there is no byte-order (endianness) problem. It is backward compatible with UTF-8. That is to say, any UTF-8 content can be decoded directly as OTF-8, and the decoding result is exactly the same as the UTF-8 decoding result.

Second, regarding the numeric representation in OTF-8: in addition to traditional UTF-8 characters, OTF-8 can encode numbers of up to 128 bits (0 to 2^128-1). Variable-length coding is used: one byte for 0 to 31; two bytes for 32 to 255; three bytes for 2^8 to 2^16-1; and so on. Specifically, a byte starting with 100 represents 0 to 31, and its remaining five bits give the specific number. For example, 0x80 (binary 10000000) means 0, 0x81 (10000001) means 1, 0x82 (10000010) means 2, and so on, up to 0x9F (10011111), which corresponds to 31. For numbers greater than or equal to 32, a first byte starting with 101 indicates the number of bytes that follow, and it is followed by the big-endian encoding of the number in that many bytes (high-order byte first, low-order byte last, padded with leading zeros). 0xA0 (10100000) indicates that a one-byte number follows; 0xA1 (10100001) indicates that a two-byte number follows; 0xA2 (10100010) indicates that a three-byte number follows; and so on, up to 0xAF (10101111), which indicates that a 16-byte (128-bit) number follows. For example, 0xA0 0x20 (10100000 00100000) represents the number 32; 0xA0 0xFF (10100000 11111111) represents the number 255; 0xA1 0x01 0x00 (10100001 00000001 00000000) represents the number 256; 0xA2 0x01 0x00 0x00 (10100010 00000001 00000000 00000000) represents the number 65536. The corresponding coding details are shown in Figure 23.
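The number format described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, the encoder always picks the shortest form, and only lead bytes 0x80 to 0x9F and 0xA0 to 0xAF are treated as numbers, since other bytes beginning with 101 are used for other purposes elsewhere in this description.

```python
def encode_number(n: int) -> bytes:
    """Encode a non-negative integer of at most 128 bits as an OTF-8 number."""
    if n < 0:
        raise ValueError("OTF-8 numbers are non-negative")
    if n < 32:
        return bytes([0x80 | n])                 # 100xxxxx: value in five bits
    size = (n.bit_length() + 7) // 8             # payload bytes needed
    if size > 16:
        raise ValueError("number exceeds 128 bits")
    return bytes([0xA0 | (size - 1)]) + n.to_bytes(size, "big")


def decode_number(data: bytes, pos: int = 0) -> tuple:
    """Decode one number at data[pos]; return (value, position after it)."""
    lead = data[pos]
    if lead & 0xE0 == 0x80:                      # 100xxxxx
        return lead & 0x1F, pos + 1
    if lead & 0xF0 == 0xA0:                      # 1010xxxx: length byte
        size = (lead & 0x0F) + 1
        end = pos + 1 + size
        if end > len(data):
            raise ValueError("truncated number")
        return int.from_bytes(data[pos + 1:end], "big"), end
    raise ValueError("not an OTF-8 number lead byte")


examples = {n: encode_number(n).hex() for n in (0, 31, 32, 255, 256, 65536)}
```

The dictionary `examples` reproduces the byte sequences listed in the paragraph above, for instance `a10100` for 256.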

Finally, regarding object reference encoding in OTF-8: unless there is a special mark or a special context, the numbers appearing in OTF-8 are by default interpreted as reference encodings of objects in the encoding repository.

Below, a brief description of the encoding space, encoding directory items and metacode:

This code is mainly done by numerical numbering and is a hierarchical number. This layering is mainly reflected in the layering of the coding space in the coding warehouse.

In order to access the various encodings in an encoding space, each OTF-8 encoding space has one and only one encoding directory. Each encoding directory entry includes an encoding type and an encoding path. The encoding path can be a sequence of contexts leading from the current encoding space to another encoding space. For example, when the encoding path is “current user”, the corresponding encoding space is the subspace of the current user in the current space. When the encoding path is empty (does not contain any context), the encoding space to which the corresponding encoding belongs is the encoding space where the encoding directory entry is located. The encoding path can also be a string, that is, a name, in which case the corresponding encoding space is the named subspace of the current space. When the encoding type of the encoding directory entry is an encoding directory entry, the data object corresponding to the encoding is the target space, and the encoding is called a spatial encoding.

The number corresponding to the encoding directory entry is the directory entry encoding.

The directory entry code and the spatial code are both meta-codes: they do not correspond to a specific data object instance but to a metadata object of the object, namely the corresponding directory entry and encoding space. A meta-code must be followed by an instance code to form a complete object encoding.

The default encoding starts with the root encoding space of the current encoding repository. For example, the encoding directory of the root space of the encoding repository is shown in Table 1 below:

Table 1

Number	Type	Encoding path
00	Encoding directory entry	
01	Type provider	
02	Storage driver	
03	Coding type	
04	Coding context	
05	User	
06	Application	
07	Document	
08	Coding space	User
09	Coding space	Application
10	Coding space	Document
11	Handwritten text	
12	Handwritten text	User

Then we can use the two-level number 05|256 to represent the user numbered 256. With the aforementioned OTF-8 digital encoding scheme, the reference encoding of this user object can be represented in four bytes:

10000101 10100001 00000001 00000000

Here, the protocol code "10000101" is the metacode of the user object code; the latter "10100001 00000001 00000000" is the instance code of the object code.
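The four-byte result can be reproduced with a short Python sketch that encodes each level of a multi-level number as an OTF-8 number and concatenates the results. The helper names are hypothetical, and the number encoder is a compact restatement of the number format described earlier.

```python
def encode_number(n: int) -> bytes:
    """OTF-8 number: one byte for 0..31, otherwise length byte + big-endian bytes."""
    if n < 0:
        raise ValueError("OTF-8 numbers are non-negative")
    if n < 32:
        return bytes([0x80 | n])
    size = (n.bit_length() + 7) // 8
    return bytes([0xA0 | (size - 1)]) + n.to_bytes(size, "big")


def encode_reference(*levels: int) -> bytes:
    """Concatenate the number codes of a multi-level number such as 05|256."""
    return b"".join(encode_number(level) for level in levels)


def as_bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit binary groups."""
    return " ".join(f"{byte:08b}" for byte in data)


user_256 = encode_reference(5, 256)
```

Calling `as_bits(user_256)` yields the binary string shown above, with the first byte as the meta code and the remaining three as the instance code.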

Assume that the encoding directory of the current user's encoding space is as shown in Table 2 below:

Table 2

Number	Type	Encoding path
00	Coding protocol	
01	Application	
02	Document	
03	Coding space	Application
04	Coding space	Document
05	Handwritten text	

Then we can use the three-level number 08|05|256 to represent the current user's 256th handwritten text. The reference code of this handwritten text object can be represented by five bytes:

10001000 10000101 10100001 00000001 00000000

Here, the root-space protocol code "10001000" corresponds to the user coding space, that is, it is a spatial code. The subsequent "10000101" is the protocol code numbered 05 in the user space. The spatial code and the protocol code together constitute the meta-encoding of the handwritten text object, "10001000 10000101"; the trailing "10100001 00000001 00000000" is the instance code of the object encoding.

We noticed that the encoding directory entry with the root space encoding directory number 11 is the same as the encoding directory entry with the encoding directory number 05 in the current user space. But their corresponding data objects are from different coding spaces, one is the root space and the other is the current user space. In fact, the data object pointed to by the encoding directory entry numbered 12 in the root space encoding directory is the handwritten text in the current user space. Therefore, the data object corresponding to the above encoding can also be represented by the secondary number 12|256, and the specific form is as follows:

10001100 10100001 00000001 00000000

This saves one byte and only takes four bytes.
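To show why 08|05|256 and 12|256 denote the same object, the following Python sketch walks the encoding directories level by level. The dictionaries copy only the relevant rows of Tables 1 and 2; the data layout, the space naming ("root/user") and the function name are assumptions made for illustration.

```python
# Relevant rows of Table 1 (root space) and Table 2 (current user's space):
# entry number -> (type, encoding path or None)
ROOT_DIRECTORY = {
    5: ("user", None),
    8: ("coding space", "user"),
    11: ("handwritten text", None),
    12: ("handwritten text", "user"),
}
USER_DIRECTORY = {5: ("handwritten text", None)}
SUBSPACE_DIRECTORIES = {"user": USER_DIRECTORY}


def resolve(levels: list) -> tuple:
    """Return (type, space, instance number) for a multi-level number."""
    *entry_numbers, instance = levels
    space, directory = "root", ROOT_DIRECTORY
    for position, number in enumerate(entry_numbers):
        entry_type, path = directory[number]
        if path is not None:
            space = space + "/" + path               # follow the encoding path
        if entry_type == "coding space":
            directory = SUBSPACE_DIRECTORIES[path]   # spatial code: descend
            continue
        if position != len(entry_numbers) - 1:
            raise ValueError("only a spatial code may precede another meta code")
        return entry_type, space, instance
    raise ValueError("meta codes end in a spatial code without a type entry")


long_form = resolve([8, 5, 256])
short_form = resolve([12, 256])
```

Both calls resolve to handwritten text number 256 in the user's subspace, whereas `resolve([11, 256])` stays in the root space.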

In addition, regarding the encoding context and its setting: comparing the two encodings of the above handwritten text object, besides the different meta-encoding there is another difference. In the former, the encoding type may differ depending on the current user, whereas in the latter the encoding type is always handwritten text. This is because the encoding directories of different user encoding spaces are not necessarily the same.

In fact, the code space corresponding to the coded directory entry numbered 08 in the root space code directory is not a certain code space, but a code space of the user determined according to the current context "user" object. The corresponding coding space is different depending on the current user.

A context is a role that appears in the system while the encoding is in use. It actually corresponds to a specific object, which is called a context object. The context object can be determined before the encoding is used; for example, a user login determines the current "user" context. The context object can also be switched dynamically during the encoding process. For example, in a multi-person chat application, the current user needs to switch back and forth within the chat record document. A code sequence starting with the specific byte 0xBD (10111101) is used to specify the current context object. This code sequence is called the context setting code, and its syntax is as follows:

0xBD<context encoding or context name><context object encoding>

If the context of the root space is as shown in Table 3 below:

Table 3

Number	Type	Name
00	Coding warehouse	
01	Encoding meta object	Default meta object
02	User	Current user
03	Application	Current application
04	Document	Current document
05	…...	…...

Then, the following code:

0xBD 0x84 0x82 0x85 0xA1 0x01 0x00

That is, the user object numbered 256 (05|256) is set as the current user (04|02). This 7-byte setting affects all user-related encodings until the context is set again.
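The context setting code can be assembled mechanically from the syntax given above. The Python sketch below is illustrative only: the helper names are hypothetical, and it uses the high-byte-first number form defined earlier, under which 256 is written 0xA1 0x01 0x00.

```python
CONTEXT_SET = 0xBD  # lead byte of the context setting code


def encode_number(n: int) -> bytes:
    """OTF-8 number: one byte for 0..31, otherwise length byte + big-endian bytes."""
    if n < 0:
        raise ValueError("OTF-8 numbers are non-negative")
    if n < 32:
        return bytes([0x80 | n])
    size = (n.bit_length() + 7) // 8
    return bytes([0xA0 | (size - 1)]) + n.to_bytes(size, "big")


def reference(*levels: int) -> bytes:
    """Reference encoding of a multi-level number such as 04|02."""
    return b"".join(encode_number(level) for level in levels)


def context_setting(context_ref: bytes, object_ref: bytes) -> bytes:
    """0xBD <context encoding> <context object encoding>."""
    return bytes([CONTEXT_SET]) + context_ref + object_ref


# Set user 05|256 as the "current user" context 04|02.
setting = context_setting(reference(4, 2), reference(5, 256))
```

The resulting sequence is seven bytes long: one lead byte, two bytes naming the context and four bytes naming the user object.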

Further, for the encoding terminator, the "current user" is just an encoding context, and depending on the application, a variety of different encoding contexts may occur. A common system context is the "default meta object." As mentioned earlier, the default meta object of the system is the root space of the current encoding repository. This root space is our "default meta object", which we can change by the above "context setting encoding".

In traditional text encoding there is the concept of a code point, and one code point corresponds to one character. OTF-8 has a similar concept, except that an OTF-8 code point may correspond to a Unicode code point, to an OTF-8 number, or to a complete setting code such as the context setting code described above. So how is a meta object itself represented in the coding? Using the meta-encoding directly would cause the decoder to mistake the subsequent encoding for an instance encoding. Here we use a specific byte called the "encoding terminator" to tell the decoder where the code point ends. This byte is 0xB8 (10111000). The following encoding sets the meta object corresponding to the 12th encoding directory entry in the root space encoding directory as the default meta object:

10111101 10000100 10000001 10001100 10111000

After this setting, the original secondary number 12|256 becomes the primary number 256. Previous code:

10001100 10100001 00000001 00000000

It becomes two object encodings, the first is the current user's private handwritten character numbered 12, and the second is the current user's private handwritten character numbered 256.

It can be seen that the encoding terminator is mainly used for the encoding corresponding to the meta object.
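The effect of the terminator and of the default meta object on how numbers are grouped into object codes can be sketched as follows. This Python fragment is a deliberately simplified, single-level model: it works on already-decoded tokens, and the flag `default_is_directory` and the token `END` (standing for byte 0xB8) are hypothetical names.

```python
END = "END"  # stands for the terminator byte 0xB8 after tokenizing


def group_code_points(tokens: list, default_is_directory: bool) -> list:
    """Group decoded numbers into object codes.

    default_is_directory=True: the default meta object is a coding space, so a
    number is a directory entry code followed by an instance code, unless the
    terminator follows it. False: the default meta object is already a concrete
    type, so every number is an instance code on its own.
    """
    result, i = [], 0
    while i < len(tokens):
        if not default_is_directory:
            result.append((tokens[i],))
            i += 1
        elif i + 1 < len(tokens) and tokens[i + 1] == END:
            result.append((tokens[i],))            # meta object by itself
            i += 2
        elif i + 1 < len(tokens):
            result.append((tokens[i], tokens[i + 1]))
            i += 2
        else:
            raise ValueError("meta code without instance code or terminator")
    return result


before = group_code_points([12, 256], default_is_directory=True)
after = group_code_points([12, 256], default_is_directory=False)
```

Before the default meta object is changed, 12 and 256 form one two-level object code; afterwards they are two separate one-level codes, as stated above.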

Further, regarding the root space prefix: after the system default meta object has been changed, some means is still needed to encode objects starting from the root space. In OTF-8, the special byte "10111001" represents the root space and is called the root space prefix. The following code is therefore independent of the current default meta object:

10111001 10001100 10100001 00000001 00000000

It also corresponds to the secondary number 12|256 starting from the root space.

For all object reference encodings in OTF-8, the encoding without the root space prefix is decoded by the current default meta object.
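The rule can be expressed as a small dispatch step at the start of decoding. In the Python sketch below, the function name and the string labels used for meta objects are hypothetical.

```python
ROOT_PREFIX = 0xB9  # 10111001


def starting_meta_object(code: bytes, default_meta_object: str) -> tuple:
    """Return (meta object to decode from, remaining bytes of the code)."""
    if code and code[0] == ROOT_PREFIX:
        return "root space", code[1:]        # prefix overrides the default
    return default_meta_object, code


prefixed = bytes([0xB9, 0x8C, 0xA1, 0x01, 0x00])
start, rest = starting_meta_object(prefixed, "current user's handwritten text")
```

With the prefix present, decoding starts from the root space regardless of the default; without it, the same bytes would be interpreted relative to the default meta object.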

Further, for system client coding. We have already seen that by setting the default meta-object, the encoding length can be shortened and the encoding efficiency can be improved. However, sometimes, within a document, multiple kinds of codes may appear, which belong to different coding spaces. The system default meta-object can only improve coding efficiency for one of the codes. OTF-8 provides 8 system client encodings to bind arbitrary encoding objects (including encoding meta objects), which are all one byte, respectively:

10110000	10110001
10110010	10110011
10110100	10110101
10110110	10110111

We still use the same specific byte "10111101" to start the encoding sequence to specify the data object corresponding to the client encoding. This code sequence is called client code set code, and its specific syntax is as follows:

10111101<Client Encoding><Data Object Encoding>

For example, the following setting code sets the client code "10110000" to the user object corresponding to the secondary code 05|256.

10111101 10110000 10000101 10100001 00000001 00000000

Once the extended client code is defined, we can use it instead of the encoding of the data object it corresponds to. Then, the following code:

10111101 10000100 10000010 10110000

has exactly the same semantics as the earlier 7-byte context setting encoding. Here the original four-byte object encoding is replaced with a one-byte client encoding.
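A decoder can treat a client code as a macro that expands to the bound object encoding. The Python sketch below illustrates this under stated assumptions: the function name and the bindings dictionary are hypothetical, and the scanner skips over number payloads and UTF-8 multi-byte characters so that payload or continuation bytes that happen to fall in the range 0xB0 to 0xB7 are not mistaken for client codes.

```python
def expand_client_codes(data: bytes, bindings: dict) -> bytes:
    """Replace each client code byte (0xB0..0xB7) with its bound encoding."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if 0xB0 <= byte <= 0xB7:
            out += bindings[byte]            # KeyError if the code is unbound
            i += 1
            continue
        if byte & 0xF0 == 0xA0:
            size = 2 + (byte & 0x0F)         # length byte + number payload
        elif byte >= 0xF0:
            size = 4                         # UTF-8 four-byte character
        elif byte >= 0xE0:
            size = 3                         # UTF-8 three-byte character
        elif byte >= 0xC0:
            size = 2                         # UTF-8 two-byte character
        else:
            size = 1                         # ASCII, small number, marker byte
        out += data[i:i + size]
        i += size
    return bytes(out)


bindings = {0xB0: bytes([0x85, 0xA1, 0x01, 0x00])}   # client code -> user 05|256
expanded = expand_client_codes(bytes([0xBD, 0x84, 0x82, 0xB0]), bindings)
```

After expansion, the four-byte sequence becomes the full seven-byte context setting code.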

Further, for the OTF-8 encoded object representation, as mentioned earlier, the numbers appearing in OTF-8 are by default used to represent the reference encoding of objects in the encoding repository. So how do you directly represent numbers in OTF-8? Further, how do you directly encode the object itself instead of its reference/number?

The answer is automatic type derivation and direct object coding with type coding.

Regarding type derivation: during OTF-8 content decoding, type derivation can be performed using the classic "unification algorithm". All OTF-8 content has a type; the default type is the OTF-8 string type, that is, an array of root (generic) objects. During decoding, the system maintains a decoding type stack. The top of the stack is the specific type currently to be parsed. After the data object corresponding to the current type has been parsed, the top of the stack is replaced with the type of the next element of the current type structure. If the current structure is complete, the top of the stack is popped, and the new top of the stack is the next element of the parent structure.
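The stack discipline described above can be sketched in Python. The fragment models only the push, advance and pop behaviour of the decoding type stack, not the type unification itself; representing structures as nested tuples of type names is an assumption made purely for illustration.

```python
def leaf_parse_order(structure: tuple) -> list:
    """List leaf types in the order a decoder would expect to parse them."""
    stack = [iter(structure)]        # top of stack: structure being parsed
    order = []
    while stack:
        try:
            element = next(stack[-1])    # advance to the next element type
        except StopIteration:
            stack.pop()                  # structure complete: back to parent
            continue
        if isinstance(element, tuple):
            stack.append(iter(element))  # nested structure becomes the new top
        else:
            order.append(element)        # leaf type: parse one data object
    return order


order = leaf_parse_order(("int", ("string", "byte"), "bool"))
```

For the nested example, the decoder expects an int, then the two members of the inner structure, and finally returns to the parent structure to expect a bool.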

For example, there are the following structures:

When parsing this type, the first number encountered will be parsed into an integer instead of the object reference encoding. And at this time, if the parsed content is not an OTF-8 number, it is actually a data type error. The type information here also provides us with the basis for type checking.

When parsing the second element of the type, the system automatically accepts either an integer or a string according to the type. Since the encoding formats of numbers and strings in OTF-8 are completely different, the parser can automatically determine the actual type of the data object from the encoding format.

When parsing the third element, since byte is a subset of int, there will be some overlap between the two types of encoding. Therefore, the type inference of the parser will have certain difficulties. OTF-8 provides the system context "current parsing type" to allow refinement of the type of data object that follows. At this point, you can use

To specify that the next data object is of type byte. Or use

0xBD<"current parsing type" context reference encoding><"int" type reference encoding>

To specify that the next data object is of type int.

When setting this "current parsing type" context, we can't use incompatible types. For example, in this example, int32 is a type that is compatible with int, so it can be used. However, the string type is not compatible with both byte and int, and setting it to "current parsing type" will result in a type error.

Regarding direct object encoding: in addition to setting the "current parsing type" as described above, OTF-8 also allows the encoding of the corresponding data content to follow directly after the reference encoding of an encoding type or of an encoding directory entry.

For parameterized types, a list of the encodings of the types bound to the type parameters must follow immediately after the type encoding.

Therefore, the basic types of all data objects that need to be represented in OTF-8 must be stored in the code repository. In the root space encoding directory mentioned above, the encoding directory entry numbered 03 is the encoding type. The corresponding information is shown in Table 4 below:

Table 4

Number	Coding type
00	Type
01	Unsigned integer
02	Signed integer
03	Floating-point number
04	GUID
05	Boolean
06	UTF-8 character
07	UTF-8 string
08	Object reference
09	Nullable object
10	Array
11	Tuple
12	Dictionary

Then, the representation of the various types of data objects is as follows:

1, the representation of the number

2, the representation of unsigned integer

For unsigned integers, place the data directly after the unsigned integer type encoding. For example, the following code represents the number 256:

0x83 0x81 0xA1 0x00 0x01

3, the representation of signed integers

Signed integers are represented as unsigned integers by means of ZigZag encoding.

ZigZag uses even numbers to represent non-negative integers and odd numbers to represent negative integers, as shown in the following table:

Signed integer	Encoding result (unsigned integer)
0	0
-1	1
1	2
-2	3
2147483647	4294967294
-2147483648	4294967295

ZigZag encoding can decode unsigned integers into corresponding signed integers by the following algorithm: (n>>1)^(-(n&1))
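The mapping in the table and the decoding formula above can be checked with a short sketch. This is only an illustration for 32-bit values; the function names are hypothetical, and the encoding direction (which the text does not spell out) is the usual ZigZag formula, assumed here.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits


def zigzag_encode32(value: int) -> int:
    """Map a signed 32-bit integer to an unsigned 32-bit integer."""
    if not -2**31 <= value <= 2**31 - 1:
        raise ValueError("value out of signed 32-bit range")
    # In Python, >> on a negative int is an arithmetic shift,
    # so (value >> 31) is 0 for non-negative values and -1 for negative ones.
    return ((value << 1) ^ (value >> 31)) & MASK32


def zigzag_decode32(n: int) -> int:
    """Inverse mapping, using the formula quoted in the text: (n>>1)^(-(n&1))."""
    return (n >> 1) ^ -(n & 1)


if __name__ == "__main__":
    for signed in (0, -1, 1, -2, 2147483647, -2147483648):
        encoded = zigzag_encode32(signed)
        print(signed, encoded, zigzag_decode32(encoded))
```

Small magnitudes, whether positive or negative, map to small unsigned numbers, which is why the signed value 128 becomes the unsigned value 256.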

The following code represents the signed integer 128:

0x83 0x82 0xA1 0x00 0x01

4, the representation of floating point numbers

For floating point numbers, OTF-8 directly uses the IEEE 754 standard. It supports the common single-precision 32-bit (four-byte) and double-precision 64-bit (eight-byte) floating point formats, which are represented by the four-byte and eight-byte numbers of OTF-8, respectively. The numerical part is encoded big-endian. The specific numerical forms are:

0x83 0x83 0xA3 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx

as well as

Half-precision and quadruple-precision floating point can also be supported if needed.
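As an illustration of the single-precision form, the following sketch prepends the three prefix bytes shown in the example above (0x83 0x83 0xA3) to a big-endian IEEE 754 value. The function names are hypothetical, and only the four-byte case is covered because its byte layout is the one given in the text.

```python
import struct

# Prefix bytes copied from the single-precision example in the text.
FLOAT32_PREFIX = bytes([0x83, 0x83, 0xA3])


def encode_float32(value: float) -> bytes:
    """Prefix followed by the big-endian IEEE 754 single-precision bytes."""
    return FLOAT32_PREFIX + struct.pack(">f", value)


def decode_float32(data: bytes) -> float:
    """Check the prefix and unpack the four big-endian payload bytes."""
    if len(data) != 7 or data[:3] != FLOAT32_PREFIX:
        raise ValueError("not a single-precision float in the assumed layout")
    return struct.unpack(">f", data[3:])[0]


if __name__ == "__main__":
    print(encode_float32(1.0).hex(" "))
```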

GUID representation

Similarly, the GUID can be represented directly by a 16-byte number, which has the following form:

5, the representation of Boolean

OTF-8 directly defines two special bytes to represent booleans.

Byte 0xBB (10111011) represents logically true; byte 0xBC (10111100) represents logically false.

Character and string representation

OTF-8 can directly represent UTF-8 characters and strings. In order to separate multiple consecutive strings, by convention an OTF-8 string may end with "0x0" (if it does not end with "0x0", the OTF-8 string ends with the last consecutive OTF-8 character); a string consisting of only a single "0x0" character is an empty string.

6, the representation of complex objects

Complex objects are composed of simple objects by some sort of rule. In OTF-8, two special system objects need to be marked, one is the object start tag, which is represented by byte 0xFE (11111110); the other is the object end tag, which is represented by byte 0xFF (11111111). The content of the data object is encoded between the start and end tags.

Further, in this embodiment, regarding the OTF-8 encoding and type system, we can see that the encoding type is crucial for the object representation of the OTF-8. In fact, OTF-8 is built on a scalable, complete type system. OTF-8 has some basic types built in: integer, Unicode character, boolean, float, Unicode string, OTF-8 string (actually an array of objects). At the same time, OTF-8 also supports parameterized types. Some built-in parameterized types include: coded reference type, nullable type, tuple type, array type, and dictionary type. OTF-8 allows users to customize structures, interfaces, and services; it also allows users to inherit and extend on top of existing types. In addition, users are allowed to introduce external encoding methods to augment existing types.

OTF-8 defines an encoding type definition language, through which users can define new types. This definition language is independent of any existing programming language. However, a mapping can be established between it and the elements of an existing programming language, thereby realizing automatic conversion between the two, such as generating type declarations for a specific programming language from the type descriptions in an encoding repository, or extracting descriptions of encoding type definitions from the source code or build results (executable files) of a specific programming language. In this type definition language, the built-in types have abbreviated names, as shown in Table 5 below:

Table 5

Actual type	Abbreviated type
OpenCode.Object	*
OpenCode.Integer	int
OpenCode.Char	char
OpenCode.String	string
OpenCode.Boolean	bool
OpenCode.Float	float
OpenCode.Object[]	STRING

In this embodiment, regarding the type identifier, the type of OTF-8 has a unique type identifier. To ensure the uniqueness of the type identifier, a specific naming convention is generally adopted, such as specifying a separator, a namespace, a naming rule, and the like.

Regarding the root type, the data objects that OTF-8 can express have a common root type. Thus UTF-8's standard string corresponds to the OTF-8 object string. This root type is of type "OpenCode.Object". The encoding type and encoding space can be obtained by any OpenCode.Object.

In the type definition syntax of OTF-8, an asterisk (*) is used to represent this root type, which in effect means any type.

Regarding empty types, an empty type is a type that does not correspond to any data object. For example, the aforementioned context settings, extension code settings, etc., correspond to an empty type. In the type definition syntax of OTF-8, the symbol "()" represents an empty type; in this syntax, if the return type of the method or function is empty, it can be omitted. For example, the following function:

Start()

denotes a function whose input type is empty and whose return type is empty. Its corresponding type is

()->()

Simple types and complex types

Simple or complex here is in terms of coding expression. In OTF-8, simple types include: encoded reference types, integer types, boolean types, floating point types, Unicode character types, Unicode string types, and their extended types. Among them, except for the Unicode string type corresponding to multiple objects, other types correspond to a single object. Simple types can be directly encoded in OTF-8.

Regarding type aliases, a type alias defines a new type as a different name for an existing type. The corresponding encoding type definition syntax is as follows:

Such as:

MyTypes.YesOrNo:type OpenCode.Boolean

With regard to constraint types, existing simple types (mainly including numeric types, character types, and string types) can be qualified by type constraints to obtain a new constrained numeric value and string type. The corresponding encoding type definition syntax is as follows:

<new type identifier>: type<numeric type, character type or string type>{constraint}

For numeric types, the constraint is the range of values, such as:

OpenCode.Byte:type OpenCode.Integer{[0,255]}

Represents an integer type from 0 to 255.

For character types, the constraint is a range of characters in Unicode.

For string types, the constraint is the length limit of the string, and the regular expression matching pattern, such as:

Postal code: type OpenCode.String{[0-9]{6}}

A string type representing 6 digits.
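The two constraint examples above (an integer range and a string pattern) can be sketched as simple validity checks. This is a minimal illustration, not the OTF-8 type checker; the function names are hypothetical, and it is assumed that a string constraint must match the whole string.

```python
import re


def check_integer_range(value: int, low: int, high: int) -> bool:
    """Range constraint such as OpenCode.Integer{[0,255]}."""
    return isinstance(value, int) and low <= value <= high


def check_string_pattern(value: str, pattern: str) -> bool:
    """Pattern constraint such as OpenCode.String{[0-9]{6}} (whole-string match assumed)."""
    return isinstance(value, str) and re.fullmatch(pattern, value) is not None


def is_byte(value: int) -> bool:
    """OpenCode.Byte: an integer from 0 to 255."""
    return check_integer_range(value, 0, 255)


def is_postal_code(value: str) -> bool:
    """Postal code: a string of exactly 6 digits."""
    return check_string_pattern(value, r"[0-9]{6}")
```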

Regarding parameterized types, OTF-8 also supports parameterized types, also known as generic types. A parameterized type is one whose constituent element types are parameters rather than fixed types; the final type is determined once the parameters are specified. For example, if the parameter of a generic array type is specified as integer, the resulting type is an integer array type; if it is specified as string, the resulting type is a string array type. All complex types in OTF-8 can be defined as parameterized types, and parameterized types can also be used directly within a definition. In the definition of a parameterized type, the parameter list is enclosed in angle brackets "<" and ">" after the type keyword (class, enum, type, etc.), with multiple parameters separated by ",". Within a type definition, a parameterized type can be used directly by specifying its parameters: the identifier of the parameterized type is followed by the argument list, enclosed in angle brackets "<" and ">" and separated by ",".

The definition of a type alias can directly determine all or part of the parameters of the parameterized type.

For example, for a parameterized dictionary type, there are two type parameters, one is a key type and the other is a value type. We can define a dictionary of strings to strings as follows:

String dictionary: type dictionary <string, string>

You can also define a parameterized dictionary whose key type is an integer, as follows:

Integer key dictionary: type<T> dictionary <int,T>

Here T is a type parameter that corresponds to the value type of the dictionary.

When encoding a data object of a parameterized type, it is necessary to give a reference to, or the type code of, the type corresponding to each parameter before encoding the data object itself. Type references and data objects are distinguished by a special separator. The system separator object of OTF-8 is byte 0xBA (10111010). This separator is used to separate different syntax elements in a structure. For example, an example of directly encoding a data object of a parameterized dictionary type is as follows:

Since the bytes 0xFE, 0x00, and 0xFF are not characters that can be displayed normally, they are highlighted here to show the difference.

Regarding the merge type, a merge type is a type whose data objects may be encoded as any one of several types. The definition of a merge type has the following syntax:

<new type identifier>: type<existing type identifier 1>{constraint 1}|<existing type identifier 2>{constraint 2}|...

Such as:

OpenCode.SmartFloat:type OpenCode.Float64|OpenCode.String{[+-]?[0-9]*(\.[0-9]+)?|-?[1-9]\.?[0-9]+([eE][-+]?[0-9]+)?}

A double-precision floating point that would otherwise require 9 bytes can be represented with fewer bytes when appropriate. For example, "1" has only one byte, ".24356" has only 6 bytes, and "6e23" has only 4 bytes.

When defining a merge type, a recursive definition is allowed, i.e. the type being defined can be used directly in the body of its own definition. For example, a tree type is defined as follows:

Tree: type<T>(T, tree[])|T

The encoding of a corresponding string tree data object is as follows:

It can be seen that this is a tree structure of the administrative division of China. The newline characters and whitespace/tabs are added for the convenience of reading, and these control symbols do not exist in the actual encoded content. However, based on the previously defined tree type, the OTF-8 parser is capable of encoding, decoding, and verifying the corresponding data object.
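The recursive tree type defined above, where a tree is either a bare value of type T or a pair of a value and an array of subtrees, can be illustrated with a small structural check. This is a sketch under assumptions: a tuple and a list stand in for the OTF-8 tuple and array, the function name is hypothetical, and the sample data is invented for illustration rather than taken from the original example.

```python
def is_tree(obj, leaf_type) -> bool:
    """Check obj against Tree<T> = (T, Tree[]) | T for T = leaf_type."""
    # Second alternative: a bare value of type T.
    if isinstance(obj, leaf_type):
        return True
    # First alternative: a tuple of (value of type T, array of subtrees).
    if isinstance(obj, tuple) and len(obj) == 2:
        label, children = obj
        return (
            isinstance(label, leaf_type)
            and isinstance(children, list)
            and all(is_tree(child, leaf_type) for child in children)
        )
    return False


# Hypothetical sample: a string tree with two levels of nesting.
sample = ("China", [("Guangdong", ["Guangzhou", "Shenzhen"]), "Beijing"])

if __name__ == "__main__":
    print(is_tree(sample, str))
```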

Regarding an empty object, unlike an empty type, an empty object is an object rather than a type. An empty object has its own special type (as opposed to the empty type, which has no instances), which we denote as Null. This type has only one instance, namely the empty object, and this special type is not used directly.

An empty object indicates that the corresponding data object does not exist. We use an encoding terminator (0xB8) to represent an empty object.

Regarding the nullable type, the nullable type is actually a type formed by combining any type and Null. The nullable type corresponds to a data type that can have no data. The type syntax can be described as follows:

Nullable type: type<T>T|Null

The OTF-8 type system has built-in direct support for nullable objects, which can be written in a simplified form in the type definition syntax: a "?" placed directly after a type denotes the corresponding nullable type. For example:

String?

represents a nullable string. For this type, the empty object and the empty string are two completely different objects: the former indicates that the string does not exist, while the latter indicates that the content is an empty string.
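The difference between the empty object and the empty string can be made concrete with a small sketch. It uses only bytes stated in the text (0xB8 for the empty object, "0x0" as the string terminator) and assumes that the string type is already known from the parsing context, so no type code is emitted; the function names are hypothetical.

```python
from typing import Optional

NULL_OBJECT = b"\xB8"  # empty object marker given in the text
STRING_END = b"\x00"   # string terminator given in the text


def encode_nullable_string(value: Optional[str]) -> bytes:
    """Encode a value of type string?, assuming the type is known from context."""
    if value is None:
        return NULL_OBJECT
    if "\x00" in value:
        raise ValueError("the terminator character cannot appear inside the string")
    return value.encode("utf-8") + STRING_END


def decode_nullable_string(data: bytes) -> Optional[str]:
    """Inverse of encode_nullable_string for a single encoded value."""
    if data == NULL_OBJECT:
        return None
    if not data.endswith(STRING_END):
        raise ValueError("missing string terminator")
    return data[:-1].decode("utf-8")
```

Byte 0xB8 is a UTF-8 continuation byte and can never start a valid UTF-8 string, so the two cases cannot be confused.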

Regarding array types, an array type is also a parameterized type, which can be used to hold multiple data objects of any one type in sequence. The OTF-8 type system also provides built-in support for array types, as well as a concise notation: placing square brackets after a type turns it into the corresponding array type.

The number in square brackets can be used to limit the number of elements in the array.

For example, the following type is an array of integers, and the number of array elements is not limited:

Int[]

The following type is an array of strings with only 5 strings:

String[5]

When the OTF-8 decoding system parses the corresponding data object, if there are not five elements, a type check error will occur.

The following type is a boolean array, where the number of elements can only be 5, 6 or 7

Bool[5..7]

In addition, OTF-8 also supports the definition of multidimensional arrays. Such as:

String[3][4..5]

This is a two-dimensional array of 3 rows and either 4 or 5 columns. A specific two-dimensional array object can only be a 3×4 or a 3×5 array; it cannot have some rows with 4 columns and other rows with 5 columns.
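The size rules for array types can be sketched as a parser for the bracket notation plus a shape check on nested lists. This is a minimal illustration with hypothetical function names; it checks only element counts (including the rule that all rows of a multidimensional array have the same length), not element types.

```python
import re

# Matches "[]", "[5]" or "[5..7]".
_DIM = re.compile(r"\[(?:(\d+)(?:\.\.(\d+))?)?\]")


def parse_array_type(text):
    """Split e.g. 'String[3][4..5]' into ('String', [(3, 3), (4, 5)]).

    An unbounded dimension '[]' is returned as (0, None).
    """
    start = text.find("[")
    if start < 0:
        return text, []
    base, rest = text[:start], text[start:]
    dims, pos = [], 0
    while pos < len(rest):
        match = _DIM.match(rest, pos)
        if match is None:
            raise ValueError("malformed array dimension in " + repr(text))
        low, high = match.group(1), match.group(2)
        if low is None:
            dims.append((0, None))
        else:
            dims.append((int(low), int(high) if high is not None else int(low)))
        pos = match.end()
    return base, dims


def shape_ok(value, dims) -> bool:
    """Check nested lists against the parsed dimensions."""
    level = [value]
    for low, high in dims:
        if not level:
            return True
        if not all(isinstance(item, list) for item in level):
            return False
        lengths = {len(item) for item in level}
        if len(lengths) != 1:  # e.g. some rows with 4 columns, others with 5
            return False
        count = next(iter(lengths))
        if count < low or (high is not None and count > high):
            return False
        level = [child for item in level for child in item]
    return True
```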

Regarding the tuple type, the tuple type is also a parameterized type, and it can take any number of type parameters of any types. The corresponding data consists of data objects of the corresponding types arranged in order. A tuple type with only one data type is equivalent to that data type. A tuple type with no data types is the empty type.

OTF-8 has built-in support for tuple types. The list of type parameters is surrounded by parentheses "(" and ")", which can be separated by commas to represent a tuple.

For example, (int,string)[]? is a nullable array type whose elements are tuples of an integer and a string.

When the tuple object is serialized/encoded, it also needs to be surrounded by the start (0xFE) and end (0xFF) flags.

Regarding the dictionary type, the dictionary type is also a parameterized type with two parameters: the key type and the value type. In essence it is an array of the corresponding tuple type, with one additional constraint: the key parts of the array elements must be unique. OTF-8 has built-in dictionary type support. The key type and value type, separated by a colon (":") and surrounded by square brackets ("[", "]"), represent the corresponding dictionary type. Such as:

[string:int]

represents a dictionary type that maps strings to integers. A single element of a dictionary is not surrounded by start and end tags.

Regarding classes, like object-oriented classes, classes in OTF-8 include members and methods. The syntax of the class definition is as follows:

When the corresponding object is encoded, the contents of the member data objects are encoded in the order in which the members appear. In addition, when a member takes its default value, a system-defined special tag can be used to inform the system. This default value tag is the special byte 0xBE (10111110).

When defining a class member, you can use a system keyword context. The member data content with the keyword tag is stored in the corresponding encoding space; the member data content without the tag is stored in the unified storage.

For example, consider the following contact class:

Then a corresponding data object will be encoded as follows:

This data object is finally saved to the encoding repository. Because the same contact often appears in the address books of different users, the main information of the contact is kept in shared storage and referenced by the different users. The "nickname", however, generally varies from person to person, so the "context" keyword here means that this field is stored in the target context space. A possible context-independent storage of the contact on a data storage server is as follows:

Contact ID	Name	Email address	Contact number
…	…	…	…
4623478	Zhang San	zhangsan12345@sina.com	13234567890
…	…	…	…

The context-dependent storage of this type is as follows:

Encoding space ID	Contact number	Contact ID	Nickname
(User 1's encoding space ID)	005	4623478	Lao Zhang
(User 1's encoding space ID)	007	4623478	Xiao San'er

In this way, different users can share the same contact, but the number and nickname of the same contact are separated by the coding space. This can increase the utilization of storage space, and the context-independent part of the data object does not need to be stored multiple times.
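The split between context-independent and context-dependent storage can be sketched with two in-memory tables. This is an illustration only: the dictionary layout, the function name, and the encoding space IDs "space-a" and "space-b" are hypothetical, while the contact data follows the tables above.

```python
# Context-independent part: stored once, keyed by contact ID.
shared_contacts = {
    4623478: {
        "name": "Zhang San",
        "email": "zhangsan12345@sina.com",
        "phone": "13234567890",
    },
}

# Context-dependent part: keyed by (encoding space ID, contact number).
# The space IDs are placeholders for real encoding space identifiers.
context_entries = {
    ("space-a", "005"): {"contact_id": 4623478, "nickname": "Lao Zhang"},
    ("space-b", "007"): {"contact_id": 4623478, "nickname": "Xiao San'er"},
}


def resolve_contact(space_id: str, number: str) -> dict:
    """Combine the shared record with the fields private to one encoding space."""
    entry = context_entries[(space_id, number)]
    merged = dict(shared_contacts[entry["contact_id"]])
    merged["nickname"] = entry["nickname"]
    return merged


if __name__ == "__main__":
    print(resolve_contact("space-a", "005"))
    print(resolve_contact("space-b", "007"))
```

Both lookups return the same name, email address and phone number, which are stored only once, while the nickname depends on the encoding space used for the lookup.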

Unlike object methods in object-oriented programming languages, the methods in OTF-8 are just grammatical definitions. The methods in the definition can be directly applied in the OTF-8 encoded document. The definition of the method determines the type of the method. Both the client and the server need to verify the correctness of the method application syntax based on the type information. The specific implementation of the final method is performed by the remote service.

Regarding interfaces, an interface has only methods. An interface is an abstract type that defines the interactions between objects. An interface is eventually implemented by a class.

E.g:

Inheritance and implementation

Like classes in an object-oriented programming language, a class can be a subclass of another class, and an interface can be a subinterface of another interface. Classes can also implement interfaces. The OTF-8 interface supports single inheritance; the class also supports only single inheritance, that is, it can only be derived from one class at most, but multiple interfaces can be implemented at the same time.

The members of a subclass are encoded starting from the root object: the members of all ancestor classes, of the parent class, and of the subclass itself are encoded in order along the inheritance chain. The methods of a subclass are numbered along the inheritance chain in the same way: first the methods of all ancestor classes and the parent class, then the methods of the implemented interfaces, and finally the methods defined by the subclass itself.

Regarding the encoding reference type, the encoding reference type is a parameterized type whose argument can only be a class. The content of the data object is the number under which the object is stored, for that type, in the corresponding encoding space. The encoding reference type is the most important type in OTF-8. With this type, data in the encoding repository can be referenced by its encoding number. Encoding directory entries in the encoding repository are also presented as metacodes in the form of encoding references. In the type definition syntax of OTF-8, the class identifier followed by a "#" denotes the corresponding encoding reference type. E.g:

Contact#

This is the reference type corresponding to the "Contact" class; an instance of it is a reference code into the corresponding encoding repository.

Enumeration

There are two types of enumerations in OTF-8, one is symbol enumeration and the other is object enumeration.

Symbol enumeration, like the enumeration type in an ordinary programming language, is a list of numbered symbols. Its definition is a set of named integers. Its grammatical form is as follows:

<new type identifier>: enum{<name 1[=number 1]>, <name 2[=number 2]>,...}

Unlike the enumerated types in ordinary programming languages, the object enumeration type of OTF-8 is a parameterized type whose definition is a set of numbered objects. Its grammatical form is as follows:

<new type identifier>: enum<<enumerated type>>{<object 1[=number 1]>, <object 2[=number 2]>,...}

Such as:

Week: enum<string>{"Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"}

When the objects are not given corresponding numbers, the first object is encoded as 0 and the following objects are numbered consecutively.

The number corresponding to each object can also be specified explicitly, such as:

Poker.Figure:enum<string|int>{3=3,4=4,5=5,6=6,7=7,8=8,9=9,10=10, "Jack"=11, "Queen"=12, "King"=13, "A"=14, "2"=15, "Black Joker"=16, "Red Joker"=17}
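The numbering rule for object enumerations can be sketched as follows. The text only states the two pure cases (no numbers given, or explicit numbers given); for a mixed list this sketch assumes, as many programming languages do, that an unnumbered object takes the previous number plus one. The function name is hypothetical.

```python
from typing import Hashable, Iterable, Optional, Tuple


def assign_enum_numbers(entries: Iterable[Tuple[Hashable, Optional[int]]]) -> dict:
    """Map each enumerated object to its number.

    Each entry is (object, explicit_number_or_None).
    """
    numbers = {}
    used = set()
    next_number = 0
    for obj, explicit in entries:
        number = next_number if explicit is None else explicit
        if number in used:
            raise ValueError("duplicate enumeration number: %d" % number)
        numbers[obj] = number
        used.add(number)
        next_number = number + 1
    return numbers


# The Week example: no explicit numbers, so numbering starts at 0.
week = assign_enum_numbers(
    (day, None)
    for day in ("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")
)

# Part of the Poker.Figure example: explicit numbers, mixed int and string objects.
figures = assign_enum_numbers([(3, 3), (10, 10), ("Jack", 11), ("A", 14), ("2", 15)])
```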

In fact, OTF-8's type definition language supports all types of object descriptions, mainly for object descriptions in object enumeration type definitions and default value descriptions in class definitions.

Service

A service is different from an object method. A service is not attached to an object; it is a collection of functions. It usually corresponds to a network service on a node in the network.

For example, a digital weather forecasting network service can be defined as follows:

Regarding external types, in addition to the built-in support for the above types, OTF-8 can also support external types through type providers, so as to accommodate any existing encoding format.

Existing encoding formats are nothing more than two types of encoding: text encoding and binary encoding. The text encoding corresponds to the string type. It can be expressed directly in OTF-8. For binary encoding, there is a specific tag byte 0xBF (10111111) in OTF-8 for representing a binary byte stream. This is followed by an OTF-8 integer representing the size of the byte stream, and then the content is the specific binary byte stream.
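The framing of an embedded binary byte stream can be sketched as below. The tag byte 0xBF comes from the text; the length field, however, is only a stand-in: the actual OTF-8 integer encoding is not reproduced here, so a fixed four-byte big-endian unsigned length is assumed purely for illustration. The function names are hypothetical.

```python
import struct

BINARY_TAG = 0xBF  # tag byte for a binary byte stream, as given in the text


def wrap_binary(payload: bytes) -> bytes:
    """Tag byte, then the payload size, then the payload itself.

    ASSUMPTION: the size is written as 4 big-endian bytes instead of a
    real OTF-8 integer.
    """
    return bytes([BINARY_TAG]) + struct.pack(">I", len(payload)) + payload


def read_binary(data: bytes, offset: int = 0):
    """Return (payload, offset just past the block)."""
    if data[offset] != BINARY_TAG:
        raise ValueError("no binary tag at offset %d" % offset)
    (size,) = struct.unpack_from(">I", data, offset + 1)
    start = offset + 5
    end = start + size
    if end > len(data):
        raise ValueError("truncated binary block")
    return data[start:end], end
```

Because the reader skips exactly the announced number of bytes, payload bytes that happen to coincide with system codes such as 0xFE, 0xFF or 0xBF are never interpreted as markers.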

Based on support for text and binary encoded content, the OTF-8 encoding system supports specific encoding syntax and semantics by providing different encoding drivers for encoding types.

Specifically, in the present embodiment, in combination with the above description, the following two specific examples are used for explanation:

The first example is about XML coding.

XML is a text-based markup language. There are two ways to support it in OTF-8.

One is to directly embed the contents of the XML document into the OTF-8 document, which is actually a string object corresponding to an OTF-8. But with the XML type provider (embedded), we can get and access the object's Document Object Model (DOM).

Another way is to extend the XML type system directly into OTF-8. XML is a meta-language; the syntax structure of a specific kind of XML document can be defined in languages such as DTD, XML Schema, and RelaxNG. For example, the standard network vector graphics format SVG is defined by a DTD. Through the DTD type provider, we can read and parse the SVG DTD definition to generate a corresponding series of element types and attribute types. There are certain relationships and constraints between these types, on the basis of which syntax checking and type derivation can be performed. The DTD type provider (mapping type) generates a corresponding space in the encoding repository according to the DTD definition of SVG, and directly encodes the resulting type objects there. Therefore, a data object of the corresponding SVG type, that is, an SVG document, can be encoded directly using the SVG types (the corresponding element types and attribute types) in the encoding repository. This encoding is much more efficient than the traditional XML text approach, and it maximizes the reuse of existing XML technology.

E.g:

As a content of an SVG file, the rendering result is as shown in Figure 24.

Through the DTD type provider, we get a series of SVG element and attribute types. It is easy to see that the large amount of redundancy in XML consists mainly of the syntax markup: the element names, the attribute names, and the system characters that distinguish node names from node values, such as ">", "<", "/", "=", etc. In OTF-8, the information items of the corresponding XML information set (XML Infoset) can be encoded directly using open coding, without the limitations of standard coding, which greatly reduces this redundancy.

We can put some XML information item attributes into the code repository and use the corresponding code directly. We get the content of the coded warehouse type as follows:

The associated encoding repository data for type xml.infoset.element is as follows:

The associated encoding repository data for type xml.infoset.attribute is as follows:

With OTF-8 encoding, the original SVG document can be represented as follows:

Its document object model is exactly the same as before, but the latter's data content is only 380 bytes, saving more than 60% of the data volume than the former's 980 bytes.

Observe the above OTF-8 document and compare the previous example of the string tree in the administrative division of China. We will find that there are many types of labels in this document, such as green element labels, green and blue attribute labels. This is because the type expression in DTD is more limited, and the attribute types are mostly string type, so type derivation is difficult to derive the correct type. Therefore type labels are essential. In fact, type providers based on XML Schema or RelaxNG will generate more types, and eventually the corresponding XML OTF-8 documents will be more compact and efficient.

The second example is about Protocol Buffers encoding.

Google's Protocol Buffers is also an object serialization format with a schema. Types defined in its type definition language can be mapped directly to corresponding OTF-8 types. With a Protocol Buffers type provider, a binary data object encoded with Protocol Buffers can be mapped to a typed data object in OTF-8. Specifically, in OTF-8 we define a system code 0xBF (10111111) as the starting tag for embedded binary data blocks. Following this tag byte is an integer representing the number of bytes of the binary data block (encoded in open coding), followed by the corresponding binary byte stream.

In fact, depending on the type derivation, it is sufficient for the binary data type to directly correspond to the data block length plus the data block. We introduce this binary block mark here mainly to ensure the reliability of code parsing. Because any code point of OTF-8 (including system code) may appear in the binary stream, we need to avoid parsing the embedded binary stream without any data element information (including type information). This binary markup system code does exactly that.

It can be seen that in OTF-8, the "type provider" is the key to supporting existing coding standards and custom coding schemes.

In fact, for all code points OTF-8 defines the corresponding types and the rules for combining these types, which together form the OTF-8 type system. There are two kinds of "type provider". One is the mapping type, which maps the specific types in an external type definition onto the type system of OTF-8, so that data of the external type can be re-encoded in the OTF-8 manner. While retaining the original encoding schema definition, this adds the benefits of the encoding repository, such as a more secure metadata authorization and access model, centralized metadata sharing, a more streamlined coding form, and the like. The "DTD type provider" in the previous SVG example is of this mapping type.

The other kind of "type provider" is the embedded type, in which the data in the external coding format is embedded as a whole into the OTF-8 code, corresponding to one data type. The original encoder and decoder directly encode and decode the corresponding content to form the corresponding OTF-8 object. Specifically, for text-based data serialization methods, what is embedded is a UTF-8 string (if the original encoding is not UTF-8, a corresponding encoding conversion is needed); for binary data serialization methods, what is embedded is, as mentioned earlier, the 0xBF binary mark followed by the block length and then the binary block content. The XML type provider mentioned above is an embedded text encoding, and the Protocol Buffers type provider is an embedded binary encoding.

In summary, OTF-8 is a specific coding system based on the object-based, context-dependent coding method. On the basis of its complete built-in type system, it can not only reference the data objects in the encoding repository, but also encode objects directly and efficiently (with the encoding metadata, including type information, placed in the encoding repository).

Referring to Figure 25, the code points of OTF-8 other than those of UTF-8 are listed there. In addition, under this coding definition many codes remain undefined and are available for system expansion; they are all listed in Figure 26. For example, the double-byte code 0xA0 0x00 can be defined as applying a function/method. Implementing it makes it possible to support remote procedure calls (RPC) on the basis of OTF-8, which will be much more efficient than existing methods such as XML-RPC and SOAP.

Similarly, in this embodiment, Unicode extension schemes such as OTF-16 and OTF-32 can further be introduced, extended from UTF-16 and UTF-32 respectively. Compared with OTF-8, the concept and composition of the encoding repository, the object-based context encoding method, and the type system are identical. The main difference is that the specific definition of the open coding (mainly the number coding and system coding) varies with the corresponding Unicode encoding method, and is not described again here.

Further, the method may further include:

The data content corresponding to the encoded content is normalized by reference coding.

In this embodiment, in addition to the most basic codec service, the processing system based on the encoding repository of the present invention can also use the encoding metadata of the encoding repository and the various related services to provide a variety of analysis and processing services for the encoded data (byte streams). These fall into two service levels. The first is a code analysis and processing service that does not rely on the specific encoded data. This service mainly performs statistical analysis of the codes of specific users and specific types, and the analysis results are stored for further use, for example by a text retrieval service. We call this level the text encoding service layer. This kind of service processes only the codes themselves and does not need the corresponding text content, so the security of the user's text content and personal privacy is fully guaranteed, which is difficult to achieve with standardized encoding. The second level provides a variety of related services on top of the text encoding and its corresponding data, to make it easier for applications to use the new data processing system. We call it the text content service layer. The results of the first-level analysis can be used directly by the second level.

In traditional data processing systems, text encoding is not only used for text processing, but is also widely used for the expression and delivery of general data. Some common structured-text and domain-specific text processing techniques have also emerged, such as the SGML/XML family of technologies (including HTML, SVG, MathML, and others built on them), programming language processing techniques, domain-specific modeling languages, etc. The new data processing system is completely built on the traditional data processing system. In addition to the new concept of personalized word processing, it can also introduce open coded text based on the encoding repository into existing text data processing technology. With only a little modification of the existing technology, a new text data processing technology that is safer and more efficient can be formed. Therefore, the word processing system in the new data processing system actually includes two aspects: one is a new word processing system, and the other is a new text data processing system. Of course, these two aspects can also be combined, such as in the processing of handwritten programming languages.

Optionally, some other services or applications may also be provided, including but not limited to the following service options: data content normalization service.

Specifically, data content normalization refers to merging identical or similar data content in an encoding repository, allowing them to use the same encoding. For example, the same word written by the same person at different times, although the final glyphs are not necessarily identical, can be grouped according to a certain feature.

Normalization can be done automatically according to certain rules. For example, normalization of sound may retain only the instance with the highest sampling frequency among identical sounds, since versions at lower sampling frequencies can be generated from it. Normalization can also be done semi-automatically with manual intervention, i.e., the content normalization service finds identical or similar content items in the code repository and outputs them to a specified user (e.g., the content item owner), who then specifies, according to their own criteria, which content item is finally retained.

The normalization service can be performed in real time. In this case, whenever the encoding repository receives input, the content normalization service looks for identical or similar items in the encoding repository. If such content items exist, the input is encoded directly with the existing code; if necessary (according to certain rules), the original content item is also replaced with the new content. The normalization service can also run offline rather than in real time. In this case, after the content normalization service finds content that can be normalized in the encoding warehouse, it establishes the correspondence between the original instance encoding and the normalized encoding. According to this correspondence, the normalization service converts an input string into a string that uses the normalized encoding.
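The offline mode described above can be illustrated with a minimal sketch. The correspondence table, the code values, and the function name `normalize` are hypothetical; in practice the table would be produced by a content matching step that is not shown here.

```python
# Hypothetical correspondence between original instance codes and the
# normalized code each one maps to (built offline by content matching).
NORMALIZED = {
    0x41: 0x41,  # first sample of a word, kept as the canonical item
    0x45: 0x41,  # later sample of the same word, merged into 0x41
    0x42: 0x42,
}


def normalize(codes):
    """Rewrite a code sequence using the normalization correspondence.

    Codes without an entry in the table are passed through unchanged.
    """
    return [NORMALIZED.get(code, code) for code in codes]


normalized_text = normalize([0x45, 0x42, 0x41, 0x50])
```

After normalization, both samples of the same word share one code, so a plain code-level search finds every occurrence.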

The normalization service needs to be done using a specific content matching algorithm. For matching handwritten content, a pattern matching or image matching algorithm is required. Matching voice content requires the use of a sound matching algorithm, and so on.

Although content normalization is an optional service, a code repository that implements content normalization can minimize code redundancy and make maximum use of existing text infrastructure and related tools.

In addition, further, some other services or applications may be provided, including but not limited to the following service options:

First, the code management service

The content in the code repository can be of various types, which brings great flexibility and openness to the system: different input and output methods can be mixed; different concrete implementations of the same type of input method can be mixed; different kinds of encodings can be used in a specific input/output scheme; new encoding schemes can be dynamically added; and so on. In this case, some management of the encoding is required.

Encoding management is mainly the access and maintenance of encoding metadata. This includes management of the coding space, coding type, coding protocol, and the like.

Due to the personalization of the new data processing system and the arbitrariness of encoding, it is necessary to introduce a mechanism for encoding type registration and query. In this way, the application system can dynamically increase the encoding type. It is also possible to query and use existing coding types, and related metadata, such as the specific details of the corresponding coding specification.
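A registration and query mechanism of this kind might look like the following sketch. The class name, the way meta codes are allocated, and the shape of the stored specification are all assumptions made for illustration; only the type ID string is taken from the example later in this description.

```python
class EncodingTypeRegistry:
    """Hypothetical registry of encoding types and their metadata."""

    def __init__(self):
        self._types = {}      # type ID -> {"meta_code": int, "spec": dict}
        self._next_code = 1   # next free meta code (assumed allocation scheme)

    def register(self, type_id, spec):
        """Register a new encoding type and return its meta code.

        Registering an existing type ID returns the code already assigned.
        """
        if type_id in self._types:
            return self._types[type_id]["meta_code"]
        meta_code = self._next_code
        self._next_code += 1
        self._types[type_id] = {"meta_code": meta_code, "spec": spec}
        return meta_code

    def query(self, type_id):
        """Return the stored metadata for a type ID, or None if unknown."""
        return self._types.get(type_id)


registry = EncodingTypeRegistry()
word_code = registry.register("com.sample.handwriting.word", {"content": "svg"})
```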

Second, content selection service

Different environments have different requirements for the output of text content. For example, high-precision text printing equipment requires high-precision glyph information; low-bandwidth network equipment has to find a balance between glyph quality and data size; systems with high security requirements want text content to hide stroke information; movie dubbing and video chat require audio output of different quality; and so on. These all require a content selection service.

Content selection is actually conditional output. The output can directly be a data object in the encoding repository. For the same code, there may be multiple data objects in the code repository (the normalization service can retain multiple data objects for the same code), and the content selection service needs to select the most suitable one for output. The output data object can also be dynamically generated. For example, text image output can be dynamically rendered from text graphics data; low sample rate audio can be derived from high sample rate audio by downsampling; and so on.
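As a rough illustration of conditional output, the sketch below picks, among several data objects stored for the same code, the best one that fits a size budget. The `Variant` fields, the quality scale, and the selection rule are assumptions; a real content selection service could use any criteria.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    quality: int      # higher is better; the scale is hypothetical
    size_bytes: int
    payload: str


def select_variant(variants, max_bytes):
    """Return the highest-quality variant within the size budget, or None."""
    fitting = [v for v in variants if v.size_bytes <= max_bytes]
    if not fitting:
        return None
    return max(fitting, key=lambda v: v.quality)


# Three hypothetical data objects stored for one and the same code.
glyph_variants = [
    Variant(quality=1, size_bytes=200, payload="low"),
    Variant(quality=2, size_bytes=4000, payload="medium"),
    Variant(quality=3, size_bytes=90000, payload="high"),
]
```

A low-bandwidth client would call `select_variant(glyph_variants, 5000)`, while a printing device could pass a much larger budget.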

Third, the content cache service

The specific implementation of the code repository may be a storage and related service within an application, and may be a service shared by the system, or a service in a public cloud or a private cloud.

When the code repository is shared in a network environment, the content needs to be downloaded to the local machine via the network. Sometimes, due to limitations in network transmission reliability, bandwidth, etc., it is necessary to provide a local cache of the encoding repository. The local cache can hold some or all of the data objects of the shared code repository on the client or on intermediate nodes to support fast and reliable output. Similarly, when access to the code repository is unreliable or the system is offline, input can be cached locally and given a temporary encoding. When the content cache is synchronized with the encoding repository, the temporary encoding is updated to the official encoding, and the corresponding encoded content is updated accordingly.
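The temporary-encoding behaviour can be sketched as follows. The classes, the use of negative integers to mark temporary codes, and the starting official code 0x41 are assumptions chosen only to keep the example small.

```python
class Repository:
    """Hypothetical shared code repository that assigns official codes."""

    def __init__(self):
        self.objects = {}
        self._next = 0x41

    def store(self, data):
        code = self._next
        self._next += 1
        self.objects[code] = data
        return code


class LocalCache:
    """Hypothetical local cache that issues temporary codes while offline."""

    def __init__(self):
        self.pending = {}       # temporary code -> data object
        self._next_temp = -1    # negative values mark temporary codes

    def store_offline(self, data):
        code = self._next_temp
        self._next_temp -= 1
        self.pending[code] = data
        return code

    def sync(self, repo, text):
        """Upload pending objects and rewrite temporary codes in `text`."""
        mapping = {temp: repo.store(data) for temp, data in self.pending.items()}
        self.pending.clear()
        return [mapping.get(code, code) for code in text]
```

Text written while offline keeps its structure; only the temporary codes are replaced once `sync` has run.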

Fourth, the code conversion service

Based on the new data processing system, the computer system is capable of decomposing various inputs into data objects in the code repository and encoded content. Based on the encoding repository, the computer system can then restore the encoded content to output that a human (at least the person who made the input) can understand.

However, due to the non-standard nature of the text encoding of this system, the encoded text content cannot be understood by any person or machine in an environment without the encoding warehouse. Code conversion mainly provides a service for converting personalized text encoding into standard text encoding. The result of the conversion is traditional standard text, which can be used in traditional application environments that are detached from the code repository.

Specifically, converting the handwritten object encoding into standard text encoding is to perform handwriting recognition on the corresponding text content; converting the speech-based object encoding into standard text encoding is to perform speech recognition on the corresponding text content. The result of this identification can also be used to implement a content normalization service.

Once the correspondence between object coding and standard text coding is established, the system can realize the conversion from standard coding to object-based coding to a certain extent.

Furthermore, different object encodings can also be converted to each other. It can be a conversion between different text output methods of the same person. For example, the result text of the handwritten input is voice output. It can also be a code conversion between different users. For example, the secretary's handwritten draft is directly converted into the manager's handwriting. There are two ways to implement conversion between object encodings. One is to convert the standard text code as an intermediate code. Convert an object code to a standard text encoding and then convert the standard text encoding to another object encoding. Another method of converting between object encodings is to directly establish a mapping relationship between the two encodings.
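The two conversion routes described above can be sketched with plain lookup tables. The tables and code values are hypothetical, and a one-to-one mapping is assumed for simplicity even though real mappings may be one-to-many.

```python
# Hypothetical mapping tables: the secretary's handwriting codes to
# standard characters, and standard characters to the manager's codes.
SECRETARY_TO_STANDARD = {0x41: "h", 0x42: "i"}
STANDARD_TO_MANAGER = {"h": 0x91, "i": 0x92}


def convert_via_standard(codes, to_standard, from_standard):
    """Route 1: use standard text encoding as the intermediate code."""
    standard = [to_standard[code] for code in codes]
    return [from_standard[ch] for ch in standard]


def build_direct_mapping(to_standard, from_standard):
    """Route 2: precompute a direct mapping between the two encodings."""
    return {
        src: from_standard[ch]
        for src, ch in to_standard.items()
        if ch in from_standard
    }


direct = build_direct_mapping(SECRETARY_TO_STANDARD, STANDARD_TO_MANAGER)
converted = [direct[code] for code in [0x41, 0x42]]
```

The direct table avoids the intermediate step at conversion time, at the cost of maintaining one table per pair of encodings.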

In addition, some object encodings are based on standard text encoding, such as sensitive word encoding for encryption purposes, common word encoding for compression purposes, and so on. These codes are themselves used to convert to standard text encoding.

It is worth mentioning that the relationship between different codes is not necessarily a one-to-one mapping. For example, in many languages polyphony is very common, so there is often a one-to-many relationship between codes based on speech input and standard text codes.

Fifth, the access control service

For a security-critical environment, access to the code repository needs to be protected by a system-level access control system. Of course, this access control is optional. In some single-user systems, there is no need to set up the content access control service separately.

In a multi-user environment, the access control system confirms the identity of each user of the system and, for that identity, allows or prohibits the use of services provided by the code repository in accordance with rules set by the code repository. For example, a user with a text entry account in the encoding warehouse can store their input data objects in the encoding repository. Only that user, and other users authorized by that user, have permission to obtain the user's data objects in the encoding repository.

Codes in the code repository have associated context models during use, such as document models, user models, application models, and so on. Therefore, we can set permissions for different encodings based on these models, and these permissions can be set at different levels: the encoding space level, the meta encoding level, or even the instance encoding level. Unlike traditional resource access control (for files, computers, etc.) and website access control, such code-level permissions enable more granular access control.

It should be emphasized here that the access control system does not protect the encoded content itself (the set of object encodings); it only protects the data objects in the corresponding encoding warehouse. Thus, an authorized user can restore the original input by combining the encoded content with the data objects in the encoding repository. An unauthorized user cannot correctly output the same encoded content and obtains only unordered content or "garbled" text.
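The point that the codes circulate freely while only the data objects are protected can be shown with a small sketch. The class, the owner/grant model, and the "?" placeholder for unreadable content are assumptions for illustration only.

```python
class ProtectedRepository:
    """Hypothetical repository that guards data objects, not codes."""

    def __init__(self):
        self._objects = {}   # code -> (owner, data object)
        self._grants = {}    # owner -> set of users authorized by the owner

    def store(self, owner, code, data):
        self._objects[code] = (owner, data)

    def grant(self, owner, user):
        self._grants.setdefault(owner, set()).add(user)

    def fetch(self, user, code):
        """Return the data object if `user` may read it, else None."""
        owner, data = self._objects[code]
        if user == owner or user in self._grants.get(owner, set()):
            return data
        return None


def render(repo, user, codes):
    """Decode a code sequence; unreadable items become a placeholder."""
    output = []
    for code in codes:
        data = repo.fetch(user, code)
        output.append(data if data is not None else "?")
    return output
```

Two users holding the same encoded text therefore see different results, depending only on what the repository lets them fetch.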

Sixth, the text service

Based on the encoding services provided by the encoding repository, the object encoding based text system can also include some service subsystems to provide advanced text services.

Seventh, text search and replace

As with traditional text search, object encodings can be searched in the new data processing system (at the text encoding layer), especially for normalized text content. In addition, since codes and content in the new data processing system are in one-to-one correspondence, text lookup can also be content-based (at the text content layer). Taking handwritten input text as an example, you can search by part of a character (such as a radical), perform fuzzy search based on content, search by number of strokes, and so on.

In addition, due to the openness of the new data processing system, any type of data can be encoded by the encoding warehouse, and the new text search service can also perform search and replace according to the type of object encoding and the domain characteristics of the relevant type.

Eighth, text conversion

A text conversion service is a service that converts open code into standard code. The service is based on the coding transformation of the code repository. However, unlike the coding conversion of the coding warehouse, the text conversion needs to be based on grammatical semantic analysis to select the optimal result among multiple candidate target codes. It is actually a more integrated, higher level recognition system.

Ninth, text matching

Because the new data processing system can support highly personalized text input, the application can formulate matching rules based on personalized input to map the input to a specific output. For example, an Internet browser can map different characters or icons input by handwriting to different websites; a handwriting programming system can map specific inputs into corresponding keywords and the like.

Tenth, the text data service

The security and efficiency of the new data processing system also apply to structured text technology. Text data technology based on open coding transformation will bring performance and efficiency comparable to existing binary data: metadata can be stored entirely in the encoding warehouse, and object codes that do not conflict with each other can keep the encoding word length to a minimum. Applications have every reason to unify textual content and structured or semi-structured data under the object coding system. The text data service provides services for converting back and forth between open coded strings and application-specific models.

In addition, unlike traditional text input, text input in the new data processing system does not need to generate a standard encoding; instead, content is input first and the encoding is generated afterwards. Therefore, the text input system can accept input in the most natural and efficient manner. The input result needs to be divided into minimum units in a natural and reasonable manner, such as individual characters or words, segments of speech, and the like. These units are then sent to the encoding system to obtain the corresponding encoding.

We can see that the input subsystem includes at least two functions, namely the reception of the input and the segmentation of the content unit.

It is worth mentioning that, owing to the privacy and openness of personalized coding, different input methods can still be mixed in the same text, as long as they use different encoding types or different encoding spaces. For example, voice-input text can be inserted into text input by handwriting.

The input to the new data processing system allows for the diversity of input content such as graphics, images, video, sound, and the like. It also allows the multi-dimensionality of the input content, such as reading the pronunciation of the written content at the same time during the handwriting process. The content selection service of the code repository can output the appropriate form for multi-dimensional content selection. Multidimensional content also provides more information to help the system to segment content and identify content.

For the output system, the output subsystem restores the text encoding to the information originally input. Unlike traditional output systems, the output of the new system depends entirely on an open code repository. The form and content of the output depend on the form and content of the input. It is not possible to output content that has not been input.

For the editing system, it is often necessary to make appropriate modification adjustments while inputting. As with traditional editing systems, editing systems based on personalized object encoding also provide basic addition, deletion, and modification functions. But the difference is that the new editing system can also provide functions such as modifying the input content and managing the segmentation of the content unit.

It should be noted that the new data processing system does not and cannot replace the existing data processing system. Instead, with the right design, we can also make the most of the infrastructure and tools of the existing system and organically combine the two systems. This use and integration includes at least the following aspects:

First aspect, standard control

Among the existing word processing systems and tools, some are just general-purpose data tools, and do not do any special processing for specific encodings, such as compression, encryption, storage, and so on. In the new data processing system, we can use them directly.

However, some word processing systems and tools require special handling for some characters. The most common are control characters such as line breaks, spaces, tabs, and so on. For example, a text line counter counts the number of newline characters in the text; text version management systems and text comparison and merge tools also operate in units of lines; index systems based on English words, word counting, and English word segmentation likewise use standard control characters and punctuation as word delimiters.

Therefore, as long as methods are provided to input such standard control symbols and punctuation in a new text input system, more conventional word processing systems and tools can be used in new data processing systems.

The second aspect, hybrid coding

In addition, if the compatibility of traditional standard text encoding is taken into account in the text encoding of the new data processing system, we can easily mix traditional text with new text. Existing text can be used directly or effectively, and existing and new text input editing systems can be mixed. A simple hybrid coding scheme is to directly expand on the existing standard text encoding scheme, and the object encoding is distinguished from the standard encoding in some way. In this way, the object-encoded characters, even other speech or multimedia streams, can appear in the text at the same time as the standard characters.

With hybrid coding, existing text data technologies can be efficiently modified. In traditional text data technology, both data characters and format characters come from the standard text encoding, so format characters cannot be used directly within data and must be escaped, which is inconvenient and inefficient. For example, in CSV tabular text data, a comma is used as the separator between fields. Therefore, if a field contains a comma, the field must be enclosed in quotation marks, and if quotation marks appear in the field, they in turn must be escaped. Hybrid coding is a good solution to this problem: since object encoding can be distinguished from standard text encoding, we can use it for format characters. In this way, standard characters can be used arbitrarily in the text data without any limitation, and the corresponding parsing program can process the data directly without any character escaping. Furthermore, the schema of the data and the details of the format data can be placed in the code repository, which greatly reduces data redundancy and improves the efficiency of transmission and processing.
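A minimal sketch of the tabular case follows. As a stand-in for object codes, it assumes two Unicode Private Use Area code points (U+E000 and U+E001) serve as the field and record separators; this choice and the function names are illustrative and are not part of the described scheme.

```python
# Assumed stand-ins for object-encoded format characters: code points
# from the Unicode Private Use Area, which never occur in ordinary data.
FIELD_SEP = "\ue000"
RECORD_SEP = "\ue001"


def dump(rows):
    """Serialize rows of text fields without any escaping."""
    return RECORD_SEP.join(FIELD_SEP.join(fields) for fields in rows)


def load(text):
    """Parse text produced by `dump` back into rows of fields."""
    return [record.split(FIELD_SEP) for record in text.split(RECORD_SEP)]


rows = [
    ["Smith, John", 'said "hi"'],   # comma and quotes need no protection
    ["line one\nline two", ""],     # even a newline is plain data
]
round_trip = load(dump(rows))
```

Because the separators cannot collide with standard characters, the parser is a pair of `split` calls with no quoting or escaping rules.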

The third aspect, keyword mapping

A direct benefit of hybrid coding is that we can apply new data processing systems to traditional structured text and grammatical text. Keywords and special symbols still use the original standard text encoding, and identifiers or data content are encoded using objects. This means handwritten programming or voice programming is possible.

In this hybrid coding system, we can use the new text input system to complete all text input. It is only necessary to define the corresponding object-coded text content for the system's keywords and special symbols. Other characters can also be encoded as standard characters by escaping. During text input or the processing of text data, the system can automatically convert to the corresponding standard text code according to the result of content matching and then hand it to traditional word processing tools; the returned result is mapped back to object codes and presented to the user in visual form. A typical example is a handwriting programming system. We only need to provide this mapping between object encoding and standard encoding on the front end, and the desired results can then be achieved with a traditional toolchain of compilers, linkers, debuggers, and so on.
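The front-end mapping described above might be sketched as follows. The glyph codes, the keyword table, and the naming rule for generated identifiers (`id_1`, `id_2`, ...) are hypothetical; the keywords themselves are ordinary Lua keywords.

```python
# Hypothetical glyph codes that the user has mapped to standard keywords.
KEYWORDS = {0x41: "local", 0x42: "function", 0x43: "end"}


def to_standard(codes, private_names):
    """Convert object codes into standard text for a traditional toolchain.

    Glyphs with a keyword mapping become that keyword. Any other glyph is
    treated as an identifier and given a generated private name, which is
    recorded in `private_names` so that results can be mapped back.
    """
    tokens = []
    for code in codes:
        if code in KEYWORDS:
            tokens.append(KEYWORDS[code])
        else:
            name = private_names.setdefault(code, f"id_{len(private_names) + 1}")
            tokens.append(name)
    return " ".join(tokens)


names = {}
source = to_standard([0x41, 0x50, 0x51, 0x50], names)
```

The same glyph always yields the same generated name, so the compiler sees consistent identifiers even though the user never typed them.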

Similarly, we can also map standard encodings to object encodings. In this way, a conventional text input system can be used to input a preset standard text encoding sequence, and the system automatically matches the corresponding object encoding. This has important implications for the editing and modification of object coding. For example, for an XML editor that supports object encoding, we can edit and modify the XML document in the traditional way, and store it as object encoding when the document is serialized.

FIG. 27 is a flowchart of Embodiment 4 of an encoding processing method according to the present invention. On the basis of the foregoing embodiment shown in FIG. 5C, as shown in FIG. 27, the method further includes:

Step 401C: When there are multiple object encodings of the same type that belong to the same owner, map the plurality of object encodings of the same type and belonging to the same owner, or the meta codes in the plurality of object encodings of the same type and belonging to the same owner, to a specified system code.

The system coding includes the following: a default meta code setting code, a root space code, and a client code set code.

In the present embodiment, system coding refers to a code capable of changing the coding and decoding behavior of the system. The corresponding data object is directly related to the components of the system codec. In general, system coding is built into the codec system and allows for certain extension mechanisms. The finalization code, the default metacode setting encoding, the root space encoding, and the client encoding setting encoding, which will be mentioned later, are all system encodings.

For example, following the above example, if there are a large number of data objects of the same type belonging to the same owner, then each of their object encodings consists of three code points (user encoding + type encoding + instance encoding), of which the first two code points are always the same, which is a kind of redundancy.

We can introduce a system code to reduce this redundancy to a certain extent, such as the client encoding setting code. The so-called client-side encoding is a reference code that indicates a data object that has been decoded for some purpose. This encoding corresponds directly to the data object without the need for an additional decoding process. In general, a client-side encoding is shorter than the original encoding of its corresponding data object. The coding and decoding of this code does not involve the code repository. In form, the client code is directly distinguishable from other ordinary codes. The client code can correspond to a data object or to an encoding meta object.

The client encoding setting code is a system code that sets the client encoding. Its general form is:

Client encoding setting encoding + client encoding + object encoding / meta encoding

It is to map the specified object encoding/metacoding to the specified client encoding. Thus, any occurrence of the client-side encoding can then represent the corresponding object encoding/meta-encoding.

In this example, the purpose of this client-side encoding setting code is to define the meta-encoding of the two code points as a code of one word length. Then the meta-encoding of this word length can be used instead of the previous two-encoding point element encoding. The corresponding coding element model update is shown in FIG.

According to this coding element model, the system adds two new coding combinations, as shown in Figure 29: The target element coding in the figure corresponds to the replacement type coding.

In this way, the code storage in the above case can be reduced by one third.
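The saving can be checked with a short count. The sketch assumes every code point has the same word length and that the setting sequence is emitted once, following the general form given above (setting code + client code + two-point meta encoding); the symbolic code values are placeholders.

```python
SET_CLIENT_CODE = "SET"  # placeholder for the client encoding setting code


def encode_plain(user, type_code, instances):
    """Three code points per object: user + type + instance."""
    out = []
    for instance in instances:
        out += [user, type_code, instance]
    return out


def encode_with_client_code(user, type_code, instances, client_code):
    """One setting sequence, then two code points per object."""
    out = [SET_CLIENT_CODE, client_code, user, type_code]
    for instance in instances:
        out += [client_code, instance]
    return out


instances = list(range(300))
plain = encode_plain("U", "T", instances)
short = encode_with_client_code("U", "T", instances, "C")
saving = 1 - len(short) / len(plain)
```

For N objects the lengths are 3N and 2N + 4 code points, so the saving approaches one third as N grows; with 300 objects it is 900 versus 604.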

When necessary, system codes with different functions can also be designed in different object coding systems.

Further, the method may further include:

The object encoding is encrypted.

or,

Compressing or encrypting the data object to be encoded.

FIG. 30 is a flowchart of Embodiment 5 of an encoding processing method according to the present invention. On the basis of the embodiment shown in FIG. 5C, as shown in FIG. 30, if the data object to be encoded is handwritten text, the method further includes:

Step 501C: Receive a code conversion request, and query a mapping table in the coding warehouse according to the code conversion request, and obtain a standard language parameter corresponding to the handwritten character by using a glyph matching manner.

Step 502C: Perform encoding conversion processing on the object code corresponding to the handwritten character according to the standard language parameter corresponding to the handwritten character and the object code corresponding to the handwritten character, to obtain a standard text corresponding to the handwritten character.

The standard language parameters include one or several combinations: numbers, symbols, keywords, public identifiers, and private identifiers.

In this embodiment, for example, FIG. 31 shows a handwriting input program. The corresponding programming language is Lua, an embedded scripting language. The corresponding font library is encoded as follows:

There are three types of codes in the handwriting program shown in FIG. 31: glyph coding, word spacing coding, and line feed coding. We represent the glyph encoding as W+ (specific glyph encoding) and the word spacing encoding as S+ (word spacing value). For line breaks, for convenience, we do not embed the code in the content but represent it directly with a new line. Therefore, the code corresponding to the above handwriting program can be expressed as follows:

To convert the code, the user prepares the glyph number symbol mapping table as follows:

The glyph keyword mapping table is as follows:

The glyph interface identifier mapping table is as follows:

As you can see, there are four private identifiers generated:

FIG. 32 is a flowchart of Embodiment 1 of a decoding processing method according to the present invention. As shown in FIG. 32, the method includes:

Step 601C: Receive a decoding processing request, and acquire an object code to be decoded according to the decoding processing request.

Step 602C: Decompose the object code to obtain a meta code, or the meta code and the instance code.

Step 603C: Query an encoding warehouse, and obtain corresponding metadata and a coding specification according to the meta code.

Step 604C: Acquire a data object corresponding to the object encoding according to the metadata and the encoding protocol, or the metadata, the encoding protocol, and the instance encoding.

In this embodiment, the object code contains or implicitly contains the meta code of the associated encoding meta object. It is through this meta-encoding that the encoding repository obtains the corresponding encoding metadata and returns or creates an encoding meta-object for it. If authorization information or other control information was set for access to the object code during or after the encoding process, the corresponding access rights must be verified before decoding.

In addition, after the object encoding is obtained, it needs to be disassembled to obtain the meta-encoding and/or instance encoding therein. After the meta-encoding is obtained, corresponding encoding metadata and/or encoding conventions are obtained in accordance with the obtained meta-encoding. The original data object is restored according to the encoding metadata and/or the encoding specification and the instance encoding.
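Steps 602C to 604C can be sketched as a small lookup pipeline. The repository layout, the representation of an object code as a (meta code, instance code) pair, and the "reference" convention name are assumptions made for the example; access verification is omitted.

```python
# Hypothetical repository: encoding metadata keyed by meta code, and
# data content keyed by (meta code, instance code).
REPOSITORY = {
    "meta": {
        0x01: {"type": "com.sample.handwriting.word", "spec": "reference"},
    },
    "instances": {
        (0x01, 0x41): "<glyph data for 0x41>",
    },
}


def split_object_code(object_code):
    """Step 602C: decompose the object code (assumed to be a pair)."""
    meta_code, instance_code = object_code
    return meta_code, instance_code


def decode(object_code, repo=REPOSITORY):
    meta_code, instance_code = split_object_code(object_code)
    meta = repo["meta"][meta_code]               # step 603C: query metadata
    if meta["spec"] == "reference":              # step 604C: follow the spec
        content = repo["instances"][(meta_code, instance_code)]
    else:
        content = instance_code                  # content carried in the code
    return {"type": meta["type"], "content": content}


decoded = decode((0x01, 0x41))
```

The branch on `spec` mirrors the two decoding modes mentioned in the text: decoding by reference to the repository and direct content decoding.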

The decoding of the data object is performed according to the content of the coding protocol. It can include direct content decoding, or decoding by reference to the encoding repository, or both.

The system is an open system, and the existing content codec technology can be used by the encoded meta-object (as long as there is a corresponding description in the coding protocol), and can also be used for the transmission and storage of the encoding warehouse.

FIG. 33 is a flowchart of Embodiment 2 of a decoding processing method according to the present invention. On the basis of the foregoing FIG. 32, as shown in FIG. 33, a specific implementation manner of one of the steps 602C is:

Step 701C: Acquire a predetermined rule corresponding to the object code.

Step 702C: Decompose the object code according to the predetermined rule to obtain the meta code, or the meta code and the instance code.

Further, the method further includes:

Performing access authority authentication on the predetermined rule;

Then the specific implementation manner of step 702C is:

After the authentication of the predetermined rule access authority is successful, the object code is disassembled according to the predetermined rule to obtain the meta code, or the meta code and the instance code.

FIG. 34 is a flowchart of Embodiment 3 of a decoding processing method according to the present invention. On the basis of the foregoing FIG. 32, as shown in FIG. 34, the method further includes:

Step 801C: Perform access authority authentication on the meta code.

Then a specific implementation manner of step 603C is:

Step 802C: After the access authority of the meta code is successfully authenticated, query the encoding warehouse, and obtain the corresponding metadata and coding specification according to the meta code.

FIG. 35 is a flowchart of Embodiment 4 of a decoding processing method according to the present invention. On the basis of the foregoing FIG. 32, as shown in FIG. 35, a specific implementation manner of the step 604C is:

Step 901C: Acquire a context object.

Step 902C: Acquire a corresponding coding space according to the context object and the coding protocol.

Step 903C: Decode the instance code from the coding space to obtain corresponding data content.

Step 904C: Acquire a data object corresponding to the object encoding according to the metadata and the data content.

Based on the description of the above embodiments, the specific application of the handwriting input system based on the encoding process will be schematically described below by taking the handwriting input system of the present invention as an example.

For example, taking the handwriting input based on line and spacing word segmentation as an example, the user inputs the current line as shown in FIG. Then, the input system forms four characters according to the spacing word segmentation algorithm and stores it in the encoding repository (assuming there are 64 characters 0x1–0x40 in the encoding repository):

Among them, 0x41, 0x42, 0x43, 0x44 are hexadecimal notation, which means 65, 66, 67, 68 in decimal. The object encoding can be directly the location of the data object in the encoding repository, or it can be a hash of the location. The specific content of each code item is graphic data, which can be a common format, such as SVG, or a proprietary format.

Correspondingly, the input system also generates corresponding text data, as follows:

0x41 0x20 0x42 0x20 0x43 0x20 0x44

Where 0x20 is a space character in the standard ASCII code (assuming the system uses standard spaces to separate characters). The above text is seen in the traditional text viewing environment:

A B C D

This is because 0x41, 0x42, 0x43, and 0x44 respectively correspond to the four characters A, B, C, and D in the ASCII code. When the conventional text is output, the corresponding character contours are extracted from the corresponding standard code-based fonts by these codes.
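The effect is easy to reproduce: interpreting the same code sequence as ASCII yields the letters rather than the handwritten glyphs.

```python
# The text data generated by the input system in this example.
codes = [0x41, 0x20, 0x42, 0x20, 0x43, 0x20, 0x44]

# A traditional viewer interprets each value as an ASCII character.
text = bytes(codes).decode("ascii")
print(text)  # prints: A B C D
```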

In the new data processing system, the text output will be taken out of the code repository and the corresponding graphics will be drawn to the output display in order. The result of the drawing is shown in Figure 36.

In addition, for type encoding, as mentioned earlier, multiple types of encoding will exist at the same time in the new data processing system. We can uniformly encode different types of characters/terms. But the problem with unified coding is that the decoding system needs to query the encoding warehouse for the encoding type information of each code in order to decode and output it correctly. This greatly affects system performance.

Another solution is to encode the type itself and store the encoding type information in the encoding repository. Text encoding based on object encoding then consists of two parts: the type code (meta encoding) and the specific code under that type (instance encoding). This may increase the size of the encoding result, but it greatly improves the flexibility and openness of the codec.

Based on the previous example, the encoding repository needs to add type encoding information (encoding meta information):

At the same time, all code items need to be encoded according to their type and placed in different locations in the encoding repository. For example, in a database-based implementation, codes of different encoding types can be placed in different tables, and the object factory can find the corresponding table from the type code (meta encoding) according to a system convention (for example, using the type ID as the table name).

The contents of the "com.sample.handwriting.word" table in this example are as follows.

Correspondingly, the text data generated by the input system will become the following code:

0x01 0x41 0x02 0x01 0x42 0x02 0x01 0x43 0x02 0x01 0x44

Here 0x02 corresponds to a space. It is a control character with no text content of its own, so it has no corresponding table in the encoding repository.
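A table-driven decoder for this two-part encoding might look like the sketch below. It assumes a type table that records, for each type code, how many instance bytes follow; the name `com.sample.space` and the glyph placeholders are hypothetical, since the description only names the word type.

```python
# Type (meta) table: type code -> (type ID, number of instance bytes that follow).
# "com.sample.space" is a hypothetical name; the text only names the word type.
TYPES = {
    0x01: ("com.sample.handwriting.word", 1),
    0x02: ("com.sample.space", 0),
}

# One table per type ID, following the database convention described above.
TABLES = {
    "com.sample.handwriting.word": {
        0x41: "glyph-1", 0x42: "glyph-2", 0x43: "glyph-3", 0x44: "glyph-4",
    },
}


def decode(stream: bytes) -> list:
    """Split a stream into (type ID, content) pairs using the type table."""
    out, i = [], 0
    while i < len(stream):
        type_id, length = TYPES[stream[i]]
        instance = stream[i + 1 : i + 1 + length]
        i += 1 + length
        if length == 0:
            out.append((type_id, None))  # control code, no repository lookup
        else:
            out.append((type_id, TABLES[type_id][instance[0]]))
    return out


stream = bytes([0x01, 0x41, 0x02, 0x01, 0x42, 0x02, 0x01, 0x43, 0x02, 0x01, 0x44])
print(decode(stream))
```

Because the type code selects the table, the decoder needs one lookup in the type table per code rather than a repository query to discover the type.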

We can use dynamic coding for the encoding type to achieve efficient, secure and open new data processing systems. A variety of input methods and encoding methods can be mixed in the same application system. Unauthorized systems or individuals cannot obtain any information from the encoded results. New input methods, encoding types, and applications can be dynamically added to new data processing systems.

In addition, regarding the encoding of data: for a system that can encode arbitrary data objects, providing only the encoding of the text content itself is often not enough. We also need to encode other related information, that is, to encode data. Unlike object encoding, this data content need not be stored in the encoding repository; it can be encoded directly into the object encoding, which is the content encoding mentioned above.

A typical example is the spacing of text. In traditional ASCII encoding systems, a space is a control character. In the corresponding text output, the width of a space is fixed, so the distance between characters separated by spaces is determined by the number of spaces between them and can only be an integer multiple of the space width. In naturally written text, however, the spacing between characters or words is arbitrary (within the bounds of the paper, of course). A closer look at the previous example reveals that the handwritten input and the corresponding output are not consistent, mainly because the character spacing differs: the encoding in the example uses the same code for every gap between characters. To ensure a WYSIWYG effect, the length of the character spacing can also be encoded into the character object encoding result. We could put this length information into the encoding repository and then encode the location of the content item into the text. Obviously, it is much more straightforward and efficient to encode the length in binary and place it directly into the text. Figure 37 visualizes the length of the character spacing. As shown in Figure 37, the length is in logical units and can be adapted to different devices and font sizes. We update the encoding type information as follows:

Here the encoding length of the space is changed from 0 to 1, which means that the space code is followed by a one-byte length code. The encoded data type is null, indicating that decoding the length code does not require access to the encoding repository. The encoder can directly convert the gap length between characters into a byte and store it in the encoding result. The corresponding text encoding is as follows:

0x01 0x41 0x02 0x0C 0x01 0x42 0x02 0x10 0x01 0x43 0x02 0x0A 0x01 0x44

In this way, the text output subsystem can completely restore the original input based on this encoding.
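The inline length code can be illustrated with a short Python sketch. The function names are hypothetical, and the gap values 0x0C, 0x10 and 0x0A are taken from the example encoding; the sketch assumes every word code and every gap occupies exactly one byte after its type code.

```python
WORD, SPACE = 0x01, 0x02  # type codes from the example


def encode_line(words, gaps):
    """Interleave word codes with gap lengths (one byte each, logical units)."""
    out = bytearray()
    for i, code in enumerate(words):
        if i > 0:
            out += bytes([SPACE, gaps[i - 1]])  # length stored inline
        out += bytes([WORD, code])
    return bytes(out)


def decode_line(stream):
    """Recover (words, gaps) without consulting the repository for gaps."""
    words, gaps, i = [], [], 0
    while i < len(stream):
        kind, value = stream[i], stream[i + 1]
        (words if kind == WORD else gaps).append(value)
        i += 2
    return words, gaps


encoded = encode_line([0x41, 0x42, 0x43, 0x44], [0x0C, 0x10, 0x0A])
print(encoded.hex(" "))
print(decode_line(encoded))
```

Only the word codes require a repository lookup at output time; the gap lengths are read straight from the stream.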

It is worth mentioning that the interval in the example is the length spacing between handwritten characters. However, for other input methods, there are other kinds of spacing, such as the time interval between sound units in the speech input. We can provide different encoding types to support different kinds of spacing encoding.

In this example, we saw the effect of encoding data directly into the object encoding. Here the data are integers. In fact, binary representation/encoding of all kinds of data is the basis of data storage and processing in computer systems, and these technologies are very mature. For example, IEEE 754 is a standard for the binary encoding of floating-point numbers. We can use any of these techniques to encode arbitrary data directly into the object's encoding result.
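As a small illustration of such content encoding, the following Python sketch packs a floating-point value into four bytes using IEEE 754 single precision. The value 1.5 and the big-endian byte order are arbitrary choices made for the example.

```python
import struct

# Big-endian IEEE 754 single precision: 4 bytes that could be placed
# directly in the encoding result as content encoding.
encoded = struct.pack(">f", 1.5)
decoded = struct.unpack(">f", encoded)[0]
print(encoded.hex(), decoded)
```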

Therefore, in the encoding scheme of the new data processing system, the data content of a data object can be stored not only in the encoding repository but also, in some form, directly in the object encoding. The text encoding of the new data processing system may thus be a mixture of reference encoding and content encoding, which we can distinguish by encoding type. Furthermore, a type safety check on the encoding type can determine whether an encoding conforms to its type constraints, and type inference can determine the specific type of an encoding.

In addition, regarding hybrid encoding: the new data processing system allows us to create object-based encoded text content from beginning to end with the new encoding. In many cases, however, people want to reuse existing text resources and make changes directly on existing text based on standard codes. Sometimes users also want to edit text by mixing keyboard input with the new input methods. This requires the new text encoding scheme to be compatible with existing standard encodings, so that text from both systems can be mixed in the same document.

There are many ways to implement mixed coding. A simple and straightforward solution is to put each standard code sequence into the code repository as object data content, defining a new object code for the content. Another solution is to place a type code before each standard text encoding in the text content. This type of code tells the decoder that the code is a standard text code. One of the main problems with these two schemes is that the existing standard encoded text content needs to be converted to become the target encoding, and the encoding result is completely incompatible with the original standard encoding. It is difficult to use existing text infrastructure and tools to process and analyze.

A better solution is to base the new text encoding directly on the existing standard coding. Here is a specific UTF-16 based text encoding scheme:

1. All standard UTF-16 codes are encoded according to the original standard, including the BOM and surrogate pairs.

2. The meta encoding (type code) of every object encoding uses the UTF-16 private use area (U+E000 to U+F8FF).

3. The length in words of the instance code that follows a type code (here a word is 2 bytes) is determined by the information in the encoding repository.

4. The highest bit of each instance code word that follows a type code is 1 (that is, 0x8000 to 0xFFFF), so as to avoid conflicts with other control characters.

For this encoding scheme, the decoding process is shown in FIG.

In addition, a specific example is given here. As shown in Figure 39, this is a mixed-coded content display.

In the corresponding text encoding, five standard Unicode characters U+0049(I), U+0020 (space), U+0061(a), U+006D(m) and U+002E(.) are used. Others are non-standard codes. Correspondingly, we have the coding information as follows:

The code type "com.sample.handwriting.word" in the code repository is:

The type "com.sample.photo" is encoded as:

The code corresponding to the text content is:

U+0049 U+0020 U+0061 U+006D U+0020 U+E001 0x8000 U+002E U+0020 U+E000 0x8041 U+0020 U+E000 0x8042 U+0020 U+E000 0x8043 U+0020 U+E000 0x8044

In a traditional UTF-16 data processing system, this code appears as:

I am 耀. 聁聂聃聄

Since the two type codes U+E000 and U+E001 are private-use characters that standard UTF-16 fonts do not support, their output varies with the implementation. Here they are output as blanks (the blank before each of the five Chinese characters above). Some systems display them as boxes or black blocks.
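The decoding of such a mixed sequence can be sketched in Python as follows. The sketch assumes that the two type codes are U+E000 and U+E001 (inside the private use range given in the scheme) and that each is followed by exactly one instance word; in the scheme itself that length comes from the encoding repository. Surrogate pairs are ignored for brevity.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # UTF-16 private use range for type codes

# Hypothetical type table: type code -> (type ID, instance length in 16-bit words).
# The lengths are assumed to be 1; the scheme reads them from the repository.
TYPES = {
    0xE000: ("com.sample.handwriting.word", 1),
    0xE001: ("com.sample.photo", 1),
}


def decode_units(units):
    """Split UTF-16 code units into plain text runs and object references."""
    tokens, text, i = [], [], 0
    while i < len(units):
        u = units[i]
        if PUA_START <= u <= PUA_END and u in TYPES:
            if text:
                tokens.append(("text", "".join(text)))
                text = []
            type_id, n = TYPES[u]
            instance = units[i + 1 : i + 1 + n]
            # Instance words carry a high bit of 1 (0x8000 to 0xFFFF).
            assert all(w & 0x8000 for w in instance)
            tokens.append((type_id, instance))
            i += 1 + n
        else:
            text.append(chr(u))  # ordinary UTF-16 code unit
            i += 1
    if text:
        tokens.append(("text", "".join(text)))
    return tokens


units = [0x0049, 0x0020, 0x0061, 0x006D, 0x0020, 0xE001, 0x8000, 0x002E, 0x0020,
         0xE000, 0x8041, 0x0020, 0xE000, 0x8042, 0x0020, 0xE000, 0x8043,
         0x0020, 0xE000, 0x8044]
print(decode_units(units))
```

A legacy UTF-16 tool sees the same units as ordinary characters, which is why the instance words show up there as CJK ideographs.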

We can see that based on this encoding scheme, our traditional UTF-16 text can be used directly without any conversion in the new data processing system. The coding results of the new data processing system can also be handled with infrastructure and tools that support UTF-16. For example, in the traditional text editor, replace "I am" in the example with "I am". Through the output of the new data processing system, the corresponding changes can be directly reflected, as shown in Figure 40.

In other words, the original UTF-16 processing power and tools can be inherited and retained in the new system. At the same time, the new encoding results can be stored intact in any storage system that supports UTF-16.

Similarly, we can also extend other standard encoding systems such as UTF-8 and UTF-32 to support new data processing systems.

In addition, regarding conversion encoding: in the new object encoding system, besides putting the content of a data object into the encoding repository, we can also put a code itself into the repository as data content. An encoding that converts other encodings in this way is called a conversion encoding (transcoding); the content stored in the encoding repository is itself text encoding. A simple application is the conversion of standard codes. As shown below, we define a conversion encoding:

Encoding (content ID)	Content	Other attributes
0x41	0x54 (T)	......
0x42	0x68 (h)	......
0x43	0x69 (i)	......
0x44	0x73 (s)	......
0x45	0x20 (space)	......
0x46	0x61 (a)	......
0x47	0x53 (S)	......
0x48	0x45 (E)	......
0x49	0x43 (C)	......
0x50	0x52 (R)	......
0x51	0x21 (!)	......
......	......	......

Thus, the original ASCII string "This is a SECRET!" is encoded as "0x41 0x42 0x43 0x44 0x45 0x43 0x44 0x45 0x46 0x45 0x47 0x48 0x49 0x50 0x48 0x41 0x51" in the new data processing system. Anyone without access to the corresponding encoding repository who obtains this text encoding cannot output it correctly in the new data processing system. In a conventional ASCII system, the code is output as "ABCDECDEFEGHIPHAQ". Users who are not authorized by the encoding repository therefore cannot obtain the real content, which in effect implements an encryption function. This differs from traditional encryption, which encrypts the text data as a whole. Content protection based on conversion encoding relies on authorized access to the encoding repository and can therefore be fine-grained: for example, only the characters or words that need protection are encoded or converted, or different access rights are granted to different codes.
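The conversion can be reproduced with a short Python sketch. The dictionary below copies the conversion table from the example; the function names are hypothetical, and the table is held in memory here rather than in an access-controlled repository.

```python
# Conversion table from the example: code (content ID) -> ASCII content.
CONVERSION = {
    0x41: "T", 0x42: "h", 0x43: "i", 0x44: "s", 0x45: " ", 0x46: "a",
    0x47: "S", 0x48: "E", 0x49: "C", 0x50: "R", 0x51: "!",
}
REVERSE = {content: code for code, content in CONVERSION.items()}


def encode(text):
    """Replace each character with its code from the conversion table."""
    return bytes(REVERSE[ch] for ch in text)


def decode(codes, table):
    """Only a holder of the conversion table can recover the content."""
    return "".join(table[c] for c in codes)


codes = encode("This is a SECRET!")
print(codes.hex(" "))
print(decode(codes, CONVERSION))
```

Decoding the same bytes as plain ASCII, for example with `codes.decode("ascii")`, shows what a reader without the table would see.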

For example, based on the aforementioned UTF-16 hybrid encoding, we can re-encode only part of the text, and other content is encoded in UTF-16. Here we use the new type encoding:

Corresponding to the following code warehouse:

The original UTF-16 string "This is a SECRET!" is encoded in the new data processing system as "U+0054 U+0068 U+0069 U+0073 U+0020 U+0069 U+0073 U+0020 U+0061 U+0020 U+E002 0x8000 U+0021". In the new data processing system, content of type "com.sample.secrete" can be displayed differently for different users. For example, an authorized user can obtain the content corresponding to U+E002 0x8000 normally, and the result is displayed as:

This is a SECRET!

An unauthorized user cannot obtain the content corresponding to U+E002 0x8000, and the result is displayed as:

This is a !

The code is output in the UTF-16 text environment as:

This is a 耀!
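The per-user display can be sketched in Python as follows. The user names, the `AUTHORIZED` set and the `fetch` function are hypothetical stand-ins for the repository's access control service, and the type code is assumed to be U+E002 followed by one instance word.

```python
SECRET_TYPE = 0xE002  # type code for "com.sample.secrete" in the example

# Hypothetical repository content and access list.
CONTENT = {(SECRET_TYPE, 0x8000): "SECRET"}
AUTHORIZED = {"alice"}


def fetch(user, type_code, instance):
    """Return the protected content, or None when access is denied."""
    if user not in AUTHORIZED:
        return None
    return CONTENT.get((type_code, instance))


def render(units, user):
    """Render plain code units as text and protected codes per user."""
    out, i = [], 0
    while i < len(units):
        if units[i] == SECRET_TYPE:
            out.append(fetch(user, units[i], units[i + 1]) or "")
            i += 2
        else:
            out.append(chr(units[i]))
            i += 1
    return "".join(out)


units = [ord(c) for c in "This is a "] + [0xE002, 0x8000, ord("!")]
print(render(units, "alice"))
print(render(units, "mallory"))
```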

Here we can see that this flexibility is difficult to achieve with traditional encryption. Traditional encryption and conversion encoding can also be used together, by encrypting the entire text encoding or by encrypting the text content. In this way, the content security of the system reaches a higher level. After obtaining the ciphertext, the user needs a key to get the plaintext, but that plaintext is still an incomprehensible encoding. The user must then pass the identity verification of the encoding repository to obtain the corresponding content. If the content itself is encrypted, it must be decrypted as well before the information is finally obtained.

At the same time, it should be pointed out that the practice of turning multiple characters into one code here actually achieves the effect of compressing the text.

Conversion encoding is not limited to encrypting and compressing standard codes. Any other encoding can also be grouped and converted by means of conversion encoding.

Here is a concrete example. As mentioned earlier, the encoding results of the new data processing system can be mixed with characters entered from a traditional keyboard. Suppose the user is now using the handwriting input method. What happens if the user writes directly on top of traditional character content? If this interaction is allowed, the intuitive result is that the stylus strokes fall on top of the rendered characters, as shown in Figure 41.

Here, we can use conversion encoding to mix different types of codes together into a single code. The encoding types used are as follows:

The content items of the encoding type "com.sample.handwriting.word" are as follows:

The related content items of the encoding type "com.sample.handwriting.mixedword" are as follows:

Here, the code U+E003 0x8000 corresponds to this mixed content of UTF-16 encoding and handwritten character object encoding. When the encoding repository fetches this content, it detects that the encoded content itself contains codes stored in the encoding repository, so it takes out the data content of all objects referenced directly or indirectly and sends it to the client together. This minimizes the number of service accesses, but it also requires detecting circular references (a code that references itself directly or indirectly). The text output system decomposes this code into two parts. The first part is a handwriting code, which may be preceded by an interval code giving the spatial offset of the handwritten content from the previous position. The second part, which follows the handwriting code, is any mixture of UTF-16 codes and interval codes. Rendering the two parts in turn gives the correct result.
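Resolving nested references in a single pass, with a guard against circular references, could be sketched as follows. The repository layout (a dictionary whose items are either literal content or integer references) and all of the codes other than 0x8000 and 0x8041 are assumptions made for the illustration.

```python
# Hypothetical repository: code -> list of items, where an item is either
# literal content (str) or a reference to another code (int).
REPO = {
    0x8000: ["mixed:", 0x8041, "UTF-16 text"],  # mixed word referencing a glyph
    0x8041: ["<handwritten glyph>"],
    0x9000: [0x9001],
    0x9001: [0x9000],  # circular reference, for demonstration only
}


def resolve(code, repo, visiting=None):
    """Expand a code into flat content, following nested references.

    Raises ValueError if a code refers to itself directly or indirectly.
    """
    visiting = set() if visiting is None else visiting
    if code in visiting:
        raise ValueError(f"circular reference at {code:#06x}")
    visiting.add(code)
    flat = []
    for item in repo[code]:
        if isinstance(item, int):
            flat.extend(resolve(item, repo, visiting))
        else:
            flat.append(item)
    visiting.discard(code)  # leaving this branch of the reference chain
    return flat


print(resolve(0x8000, REPO))
```

Tracking only the codes on the current reference chain lets the same code be referenced from several places without being mistaken for a cycle.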

In this embodiment, personalized text encoding makes the text depend on its encoding repository for correct output and human understanding, which is a natural security advantage. We can deploy the text encoding and the text encoding repository in two different systems, so that only users with access to both systems can obtain the final text information. This is the split storage concept described earlier. For example, in a traditional web microblogging system, a webmaster or database administrator can easily see any microblog content stored in the system, whether public or private. However, if the microblog content is handwritten text based on object encoding, and the corresponding encoding repository is provided by another Internet service provider, then an administrator without access to the encoding repository can see the text encoding of a microblog but still cannot obtain its text content. Conversely, administrators at the encoding repository service provider can obtain the glyph corresponding to each text code, but they do not hold the text encoding of the whole microblog, so its content remains unknown to them. Similarly, a hacker mounting a man-in-the-middle attack on this handwritten microblogging system must compromise both the microblog system and the encoding repository to fully intercept the microblog information. This greatly increases the cost of an attack.

In addition to non-standard text encoding, we can also use the conversion encoding mentioned above to convert standard codes through the encoding repository and thereby protect content.

Beyond the inherent security of object-based encoding, the new system can also provide finer-grained protection for text content through other mechanisms (such as, but not limited to, encoding spaces, access control, encryption encoding and content verification encoding).

In addition, as mentioned earlier, the coded access space can completely isolate the code of different security levels. For example, for an encoding repository deployed inside the enterprise, any direct request for privately encoded content is rejected. Similarly, an encoding repository deployed in a public cloud will also reject textual requests for enterprise encoding and private encoding.

We can define an encoding space by specifying a range of type codes. For example, in a data processing system based on open encoding, we define 0–99 as public encoding, 100–199 as enterprise encoding, and 200–255 as private encoding. Type codes above 99 are then not directly supported by the public encoding repository. For a cloud-based encoding repository inside an enterprise, type codes greater than 199 are unsupported, type codes 100–199 are supported with direct storage, and type codes 0–99 are indirectly supported. This indirect support can be implemented as a content caching service for the public cloud encoding repository.
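The range-based encoding spaces of this example can be expressed as a small lookup, sketched below in Python. The repository kind names `public-cloud` and `enterprise` and the support labels are hypothetical.

```python
def space_of(type_code):
    """Map a type code to its encoding space using the ranges in the example."""
    if 0 <= type_code <= 99:
        return "public"
    if 100 <= type_code <= 199:
        return "enterprise"
    if 200 <= type_code <= 255:
        return "private"
    raise ValueError("type code out of range")


# Support level per repository kind, as described for the example.
SUPPORT = {
    "public-cloud": {"public": "direct"},
    "enterprise": {"public": "indirect", "enterprise": "direct"},
}


def support(repo_kind, type_code):
    """Return how a repository of the given kind handles the type code."""
    return SUPPORT[repo_kind].get(space_of(type_code), "unsupported")


print(support("enterprise", 42), support("enterprise", 150), support("enterprise", 220))
```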

From this we can know that for the same person, there can only be one public code warehouse, which exists in the public cloud. Specifically, it exists in an Internet service. However, there may be multiple private code repositories and enterprise code repositories, which exist in different network environments and computer systems. For these different encoding repositories, it is necessary to generate different encoding warehouse identities. The corresponding text file or text data needs to store the identifier of the corresponding encoding warehouse to ensure correct encoding, decoding, input and output.

Different non-public code repositories will lead to the emergence of information silos. Therefore, under certain conditions, the closed code repository is also allowed to submit content to the open code repository to facilitate content sharing.

Sometimes, the three-level code access space does not meet the actual needs. For example, some application systems also want to establish a department-level sharing mechanism. In this case, the application system can define a finer subspace within the enterprise coding space. The management of the subspace is done by the application system.

Here is a concrete example:

Consider a personal handwritten diary application that uses a local private encoding repository. The body content of the diary is stored in Internet cloud storage, while the encoding repository is stored on a USB flash drive that the user carries. In this way, even if a hacker obtains the diary content from cloud storage, without the corresponding USB flash drive they cannot obtain the information inside. In the same application, when the user publishes diary content as a blog, the system needs to convert the corresponding text content from the private encoding space to the public personal encoding space. This process takes the corresponding encoded content out of the encoding repository on the USB flash drive, stores it in the public encoding repository, and obtains the corresponding public codes.

In addition, the encoded content in the encoding repository is protected mainly by the repository's access control service. Access control applies primarily to encoding metadata and to specific data objects. Unlike ordinary access control, access control over object encoding enables fine-grained control over access to text content. The previous example already combined access control with conversion encoding to encrypt part of a text.

Further, regarding encryption encoding: in the example of encrypting part of the text content, the encoding repository for the conversion encoding stores the sensitive text content itself. A system administrator of the encoding repository, or a hacker who breaks into it, can therefore obtain all of the sensitive text from the repository. Moreover, the plaintext fetched from the encoding repository is transmitted directly over the network, which is a further security risk. Another option is to use encryption encoding. Encryption encoding is a special type of encoding whose corresponding content is a key. The encryption code is followed by the length of the encrypted content, and the codes of that length that follow are the ciphertext produced with this key. During text output, if the key corresponding to the encryption code can be obtained, the ciphertext is restored to the original codes by decryption and the content is output correctly. Access control over the encryption code therefore provides dynamic access control over the encrypted codes. Traditional encryption and decryption techniques can be used here. As an example, we define a simple encryption scheme: the key is a pseudo-random number (which can be generated automatically when encryption is set up), and the encryption and decryption functions are identical, namely each instance code is XORed with the key.

Using this scheme in the previous example, update the encoding type information as follows:

The "com.sample.scrambling" code repository is as follows:

The original UTF-16 string "This is a SECRET!" is encoded in the new data processing system as "U+0054 U+0068 U+0069 U+0073 U+0020 U+0069 U+0073 U+0020 U+0061 U+0020 U+E004 0x8000 0x0006 U+FFAC U+FFBA U+FFBC U+FFCD U+FFAC U+FFCA U+0021". Here U+E004 0x8000 0x0006 is the encryption code. When the decoder reads U+E004, it finds that this is an encryption encoding type with two parameters. 0x8000 is the instance code, whose content in the encoding repository is the decryption key. 0x0006 is the data length of the encrypted content, here 6 words (a word is 2 bytes). The decryption program attempts to read the content of 0x8000 from the encoding repository. If the key is available, it is used to decrypt the six 16-bit words and obtain the corresponding codes: U+0053 U+0045 U+0043 U+0052 U+0045 U+0054.

Otherwise, the next 6 words are encrypted text, which cannot be displayed correctly. The decoding program will skip 6 words directly, and the display output is as follows:

This is a [Encrypt 12 bytes here]!
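The structure of the encryption code (type code, key code, length, ciphertext) and the XOR scheme can be sketched in Python as follows. The key value 0x1234 is purely hypothetical, since the key is not stated in the description, and the `keys` dictionary stands in for what the reader is allowed to fetch from the encoding repository.

```python
ENC_TYPE = 0xE004   # encryption type code from the example
KEY_CODE = 0x8000   # instance code whose repository content is the key
KEY = 0x1234        # hypothetical key value; the text does not state the key


def encrypt(plain, prefix, suffix):
    """Build units: prefix + [type, key code, length, ciphertext...] + suffix."""
    cipher = [ord(c) ^ KEY for c in plain]
    head = [ord(c) for c in prefix]
    tail = [ord(c) for c in suffix]
    return head + [ENC_TYPE, KEY_CODE, len(cipher)] + cipher + tail


def render(units, keys):
    """Decode; `keys` maps key codes to keys the reader is allowed to fetch."""
    out, i = [], 0
    while i < len(units):
        if units[i] == ENC_TYPE:
            key_code, n = units[i + 1], units[i + 2]
            body = units[i + 3 : i + 3 + n]
            if key_code in keys:
                out.append("".join(chr(w ^ keys[key_code]) for w in body))
            else:
                out.append(f"[Encrypt {2 * n} bytes here]")  # skip n words
            i += 3 + n
        else:
            out.append(chr(units[i]))
            i += 1
    return "".join(out)


units = encrypt("SECRET", "This is a ", "!")
print(render(units, {KEY_CODE: KEY}))
print(render(units, {}))
```

Because the length word tells the decoder how many words to skip, a reader without the key can still output the surrounding text correctly.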

This encoding method makes it easy to change the authorization of text in real time. For example, we encrypt a text and send it by email. Later, for some reason, we no longer want the recipient to see the content of the message. We then only need to deny the recipient access to the corresponding encryption code, and the mail that was already sent becomes unreadable. This mechanism can be used to implement mail revocation. It is also worth mentioning that, because the encrypted text encoding differs from the original, search engines cannot index the encrypted content.

For content verification encoding, similar to encryption encoding, we can place verification information for some or all of the text encoding into the encoding repository and form a code for it. This code is called a content verification code. With content verification encoding, we can detect whether text content has been tampered with.

For example, a leader gives a clear instruction on some matter in an email and marks the text as "tamper-proof". The system can then apply a hash algorithm to the text to produce a 128-bit number that, for practical purposes, uniquely identifies the text. The system stores this 128-bit number in the encoding repository, forms a content verification code (including the length of the text), and places the code before the text. After the mail has been forwarded several times, the decoder can compare the hash value retrieved through the content verification code with the hash value computed from the corresponding text to determine whether the text is the original author's unmodified message. If the verification succeeds, the result can be visualized in some form, so that the final reader knows the information has not been tampered with.
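A content verification code of this kind could be sketched in Python as follows. The description does not name a hash algorithm; BLAKE2b truncated to 16 bytes is used here only as one possible 128-bit hash, and the `repository` dictionary, the function names and the sample text are hypothetical.

```python
import hashlib

repository = {}  # hypothetical store: verification code -> 128-bit digest


def digest128(text):
    # BLAKE2b truncated to 16 bytes is one possible 128-bit hash (an assumption).
    return hashlib.blake2b(text.encode("utf-16-be"), digest_size=16).digest()


def protect(code, text):
    """Store the digest and return (verification code, text length, text)."""
    repository[code] = digest128(text)
    return code, len(text), text


def verify(code, length, text):
    """Check the text against the digest stored under the verification code."""
    return len(text) == length and repository.get(code) == digest128(text)


msg = protect(0x8000, "Approve item 3.")
print(verify(*msg))
print(verify(msg[0], msg[1], "Approve item 4."))
```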

For multi-user encoding schemes: in a multi-user environment, the text encoding repository stores the text content of multiple users. User IDs are then sufficient to distinguish the text content of different users. If necessary, the encoding type information of different users can be kept separate as well. In this way, the same code may denote different encoding types for different users, which further increases the security of the system.

Regarding the encoding home space: sometimes users need to share codes. Personal and shared codes are distinguished by their encoding spaces. As mentioned earlier, personal codes vary from person to person, while a shared code is the same for everyone. If a corporate logo is placed in an enterprise encoding repository, its code is a typical shared code. The existing standard codes are typical public shared codes. In addition, some control codes, such as the interval codes of handwritten text, and system codes, such as codes representing user IDs, may use shared encoding, so that system tools such as retrieval systems can use these codes more efficiently. Unicode in fact has a similar notion of home space: most of it is shared code, but it also reserves a private use area, which corresponds to the personal code discussed here.

As mentioned earlier, in the object encoding data processing system we can encode the encoding type, so that a code consists of two parts: the type code (meta encoding) and the specific instance code within that type. Applying the encoding home space to these two parts yields three specific encoding methods: fully shared encoding, shared-type personal encoding, and fully personal encoding. With fully shared encoding, the entire code is shared by all users of the encoding repository and is not associated with any user; the codes and their content are generally managed by the repository administrator. A shared-type personal code is still a personal code that varies from person to person, but its type code is shared. That is, when different users use such an encoding, the type code portions are the same while the remaining portions vary from person to person. One advantage is that a word processing tool can obtain the type information of a text encoding without any personal information and then process the text encoding based on it. Fully personal encoding means that both parts of the code are personalized and differ from person to person. It is therefore the most secure, but also the least operable: the word processing tool must obtain the encoding type information using the user information of the encoding owner before it can obtain all the encoding information. For the same encoding type, these three kinds of code may coexist in one encoding repository.

For the same user, his personal code and the shared code available will appear in his text content. At this time, it is necessary to distinguish by coding space. Here are some examples:

In the previous example, we added a standard smiley icon at the end of the sentence, as shown in Figure 42. This smiley icon also comes from the code repository, and the corresponding code is the expression code shared by all users. At the same time, the space encoding here uses the shared encoding. Handwritten coding uses type shared personal coding. The share type information is as follows (here the shared type code is assumed to be 0x01-0x7F):

In the above table, encoding types 0x01 and 0x02 are identical except for the home space. Types 0x01 and 0x03 are shared encodings, while type 0x02 is a personal encoding; the type code itself, however, is shared in all three cases. In the same encoding repository, personal type information carries one more field than shared type information, namely the user ID.

The following is the content item of type 0x02:

The following is the content item of type 0x03:

Therefore, the code corresponding to the text is:

0x03 0x41 0x04 0x03 0x42 0x04 0x03 0x43 0x04 0x02 0x05

In addition, regarding user encoding: in the above example, each personal content item carries a user ID. In a multi-user encoding repository, personally encoded data objects vary from person to person, and the codes of different users are distinguished by the user ID of the data object. But text encoding can exist independently of the text encoding repository, so where should the corresponding user ID information be placed? There are two cases.

The first case is single-user text encoding, in which all personal codes in the text encoding come from the same user (shared codes do not require a user ID to access). Different implementations are possible. One way is to use the context object mentioned above to set the system encoding. The other is to define the user type explicitly as a context object type in the encoding metamodel. In that case, we only need to encode the user ID information as a shared code and place it at the beginning of the text encoding content.

The shared type information of the above example adds the user ID code, which is updated as follows:

Correspondingly, the two-byte user ID is directly used as the encoding parameter of type 0x01. The final encoding for the above example is:

0x01 0x0C3F 0x03 0x41 0x04 0x03 0x42 0x04 0x03 0x43 0x04 0x02 0x05

In this way, the character encoding read program can know which user the personal code belongs to after reading the first three bytes 0x01 0x0C3F.
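Reading such a stream can be sketched in Python as follows. The parameter lengths per type code are assumptions inferred from the byte stream of the example (a 2-byte user ID after 0x01, one byte after 0x02 and 0x03, none after 0x04), because the type table itself is not reproduced here; the function name is hypothetical.

```python
# Assumed parameter lengths in bytes per type code, inferred from the byte
# stream of the example (the type table itself is not reproduced here):
#   0x01 user ID (2 bytes), 0x02 and 0x03 content codes (1 byte), 0x04 space (0).
PARAM_LEN = {0x01: 2, 0x02: 1, 0x03: 1, 0x04: 0}
USER_TYPE = 0x01


def parse(stream):
    """Return (user ID, list of (type code, parameter)) for a stream."""
    user_id, items, i = None, [], 0
    while i < len(stream):
        t = stream[i]
        n = PARAM_LEN[t]
        param = int.from_bytes(stream[i + 1 : i + 1 + n], "big") if n else None
        i += 1 + n
        if t == USER_TYPE:
            user_id = param  # context for the personal codes that follow
        else:
            items.append((t, param))
    return user_id, items


stream = bytes([0x01, 0x0C, 0x3F, 0x03, 0x41, 0x04, 0x03, 0x42, 0x04,
                0x03, 0x43, 0x04, 0x02, 0x05])
user_id, items = parse(stream)
print(hex(user_id), items)
```

The user ID read at the start then serves as the lookup context for every personal code in the rest of the stream.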

Sometimes, this user code can also be omitted, which is actually the implicit encoding context. For example, in a personal handwriting application system, each user's textually encoded content is the user's personal code. In this system, the user ID of the text encoding warehouse uniquely corresponds to the system account. This ID can be stored elsewhere than the text encoding.

The second case is multi-user hybrid encoding, in which codes from multiple encoding repository users may appear in the same document. We can still use the above scheme, except that different user codes can appear multiple times in the text; the personal codes that follow each user code belong to that user. In addition, we can also use the user ID as a property of the text in a structured document (for example, XML-based documents such as XHTML and SVG).

Of course, there is also a most direct context-independent coding scheme that directly uses the user ID as part of the encoding.

For multi-user, multi-application coding schemes: in a multi-user system, an encoding repository, as the store for the data content of data objects, is often shared by multiple application systems. The developer of an application system has the opportunity to obtain the object codes that users store in that system. If a user uses the same encoding in different applications, a hacker or malicious application developer who analyzes that user's object encoding in one application can establish the correspondence between the user's codes and the content, and this correspondence can then be used directly in other applications. Code isolation between applications therefore greatly enhances the security of the system. Encoding isolation means that the data content of the same data object corresponds to different object encodings in different applications. To achieve both isolation and sharing of codes between applications, application-related coding spaces can be used: when requesting an encoding, different applications can use different coding spaces or share the same one.
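A minimal sketch of encoding isolation through application-related coding spaces is given below. The class name `IsolatedRepository`, the space names, and the choice of random 8-byte codes are all assumptions made for illustration; nothing here prescribes how codes are actually allocated.

```python
import secrets


class IsolatedRepository:
    """Toy repository: the same content gets unrelated codes per coding space."""

    def __init__(self):
        # space name -> (code by (user, content), content by code)
        self._spaces = {}

    def encode(self, space, user, content):
        codes, contents = self._spaces.setdefault(space, ({}, {}))
        key = (user, content)
        if key not in codes:
            code = secrets.token_bytes(8)      # assumption: random 8-byte codes
            while code in contents:            # retry on the unlikely collision
                code = secrets.token_bytes(8)
            codes[key] = code
            contents[code] = content
        return codes[key]

    def decode(self, space, code):
        # Raises KeyError when the code does not exist in this space.
        return self._spaces[space][1][code]


repo = IsolatedRepository()
code_a = repo.encode("app_a", 0x0C3F, "hello")
code_b = repo.encode("app_b", 0x0C3F, "hello")
```

Two applications that pass the same space name share codes; applications that pass different names are isolated, so a code-to-content table recovered from one application is of no use in the other.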

Further examples of the application of handwriting input systems that can incorporate the coding scheme of the present invention are listed below:

1. Handwriting systems in specific fields, such as handwritten diaries, handwritten books, handwritten sudoku, handwritten crosswords, etc.;

2. A handwriting-based command line input system;

3. A handwritten formula editor;

4. A handwriting based programming system.

In addition, to further illustrate implementations of the coding scheme, consider DSL personalized documents. Because the coding of the new data processing system is open, we can also encode the user's interactions in a specific domain. In this way, the user's interaction data can be stored, processed and transmitted in text form. One benefit is that this interaction data can be mixed with other text from the user for storage and processing, and it can also be processed with existing word processing tools. We can also personalize the user data using the various encoding schemes mentioned above to secure the interaction data.

Specifically, online Go is taken as an example, as shown in FIG. 43.

We can define four shared encoding types. The first is the user encoding type, in which the user's encoding repository user ID is encoded. The second is the opening code, a domain-specific (application) code followed by the user IDs of the black and white players. The third is the drop code (a stone placement), followed by the position of the drop. As shown above, the position takes two bytes: for example, 0x00 0x00 is the upper left corner and 0x09 0x09 is the tengen (center point). The last is the delay code, which records the number of seconds since the last drop. Here we use an 8-bit word length scheme compatible with ASCII encoding, so every non-ASCII code here has its highest bit set to 1. The type information (encoding metadata) is as follows:

Here, these six encodings are content encodings, so there are no data objects in the encoding repository. The example game text is as follows (in this example, the code except ASCII code is expressed in hexadecimal):

0x81 0x85 0x83

0x80 0x85 0x83 0x85 0x82 0x8F 0x83

0x83 0x82 0x84 0x86

0x80 0x83 0x83 0x8A 0x82 0x83 0x83

0x80 0x86 0x83 0x87 Hello, everybody!

0x80 0x85 0x83 0x88 0x82 0x8F 0x90

0x80 0x83 0x83 0x8F 0x82 0x83 0x8F

0x85 0x86

0x80 0x85 0x83 0x83 0x82 0x90 0x8A

0x80 0x83 0x83 0x8F 0x82 0x8D 0x82

...

The object code sequence will be stored in the storage of the Go application's website. Thanks to the new data processing system, game data and chat data can be mixed together. From this content, the application can present it as the users' chat history (here, the user with ID 0x05 is named "Xiao Ming", the user with ID 0x03 is named "Xiao Liang", and the user with ID 0x06 is named "Xiaoqiang"):

System: Xiao Ming is black, Xiao Liang is white. The game begins.

(5 seconds after the start) Xiao Ming: plays P4

(7 seconds after the start) System: Xiaoqiang joined as a spectator

(15 seconds after the start) Xiao Liang: plays D4

(22 seconds after the start) Xiaoqiang: Hello, everybody!

(23 seconds after the start) Xiao Ming: plays P17

(38 seconds after the start) Xiao Liang: plays D16

(38 seconds after the start) System: Xiaoqiang left

(41 seconds after the start) Xiao Ming: plays Q11

(56 seconds after the start) Xiao Liang: plays N3

...

The game process can also be visualized in a graphical way.
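The decoding of the example stream into the history above can be sketched as follows. The type table is not reproduced here, so the code values are inferred from the example stream and the transcript: 0x80 user, 0x81 opening, 0x82 drop, 0x83 delay, and, as an assumption, 0x84 for a spectator joining and 0x85 for a spectator leaving. The sketch also assumes that parameter bytes carry their value in the low seven bits, that column letters run A, B, C, ... without skipping I, and that the records are concatenated without separators. All function names are hypothetical.

```python
USER, OPEN, DROP, DELAY, JOIN, LEAVE = 0x80, 0x81, 0x82, 0x83, 0x84, 0x85
# Number of parameter bytes that follow each type code (inferred, see lead-in).
PARAM_COUNT = {USER: 1, OPEN: 2, DROP: 2, DELAY: 1, JOIN: 1, LEAVE: 1}


def tokenize(data):
    """Split the stream into (type code, params) and ("text", str) tokens."""
    tokens, i = [], 0
    while i < len(data):
        code = data[i]
        if code < 0x80:                      # run of plain ASCII chat text
            j = i
            while j < len(data) and data[j] < 0x80:
                j += 1
            tokens.append(("text", data[i:j].decode("ascii")))
            i = j
        else:                                # type code plus fixed parameters
            n = PARAM_COUNT[code]
            params = tuple(b & 0x7F for b in data[i + 1:i + 1 + n])
            tokens.append((code, params))
            i += 1 + n
    return tokens


def point_label(x, y):
    return chr(ord("A") + x) + str(y + 1)    # (15, 3) -> "P4"


def replay(data):
    """Return (seconds, actor, detail) events; delays count from the last drop."""
    events, user, now, last_drop = [], None, 0, 0
    for kind, value in tokenize(data):
        if kind == USER:
            user = value[0]
        elif kind == DELAY:
            now = last_drop + value[0]
        elif kind == OPEN:
            events.append((now, "open", value))      # (black ID, white ID)
        elif kind == DROP:
            last_drop = now
            events.append((now, user, point_label(*value)))
        elif kind == JOIN:
            events.append((now, "join", value[0]))
        elif kind == LEAVE:
            events.append((now, "leave", value[0]))
        else:                                        # chat text
            events.append((now, user, value))
    return events


game = (bytes([0x81, 0x85, 0x83,
               0x80, 0x85, 0x83, 0x85, 0x82, 0x8F, 0x83,
               0x83, 0x82, 0x84, 0x86,
               0x80, 0x83, 0x83, 0x8A, 0x82, 0x83, 0x83,
               0x80, 0x86, 0x83, 0x87]) + b"Hello, everybody!"
        + bytes([0x80, 0x85, 0x83, 0x88, 0x82, 0x8F, 0x90]))
events = replay(game)
```

Because a parameter byte such as 0x85 (user ID 5) has the same value as a type code, the reader has to be positional: it decides how to interpret a byte from the type code that precedes it rather than from the byte value alone.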

From this transcript, the Go application can play back the entire game. If the players' privacy is to be protected, only a game whose playback both players have authorized can be played back normally. In traditional applications, implementing this functionality requires a lot of work in the application system: users authorize the system, the system maintains user authorization information, and so on. Game data that leaves the authorization system carries no privacy protection of its own, so leakage of application data for any reason leads to leakage of user privacy. In the new data processing system, the key data is placed under the protection of a context coding space in the encoding repository, which can greatly enhance the security of the application and its data and can also reduce the complexity of the application system.

Returning to the Go app example, we only need to replace the drop type with a context-dependent type:

Correspondingly, Xiao Ming's user space in the encoding repository (in practice, the document space within the user space) holds Xiao Ming's encoded data for the game:

Encoding	X	Y
1	P	4
2	P	17
3	Q	11

Xiao Liang's encoded data is:

Encoding	X	Y
1	D	4
2	D	17
3	N	11

Thus, the corresponding text encoding is:

In this way, as long as Xiao Ming and Xiao Liang properly authorize the respective coding spaces in the code repository, they can control the system or other people's access to the game.

As mentioned earlier, the encoding repository can be thought of as a font library for the new data processing system. However, this font library is not limited to standard glyph information and may contain any other type of information; likewise, the storage location of the information is not fixed but arbitrary. This font library can of course also store glyph information for standardized encodings, which is the content of a traditional font library. Taking a vector outline font as an example, the vector outline information of each character (or letter) can be stored in a specific store of the encoding repository according to the position of its standard code (such as its Unicode code point). Other information needed for text output, such as hinting and kerning, can also be stored in the encoding repository.

Code repositories can be deployed on the network, and networked fonts make it easier to maintain, upgrade, add new fonts, and more. A traditional font file can be used as a local cache for the content corresponding to the encoding repository. At the same time, the content selection service of the encoding warehouse can also select different quality glyph contents according to different output devices.

To render standard encoded text, the text display client only needs to obtain the corresponding rendering information or rendering result from the encoding repository according to the font information; it can then render traditional text correctly.

In computer systems, people use text data not only to record their own words or actions, but also to describe models and data in different fields. In general, we use formatted text to record models and data. The advantage of formatted text is that it facilitates automatic analysis and processing by the computer. XML is a typical formatted text that can express arbitrary models through a tree structure. Because XML is human-readable, extensible, and flexible, XML-compliant text formats are widely used. For example, XHTML, SVG, and RDF/XML, which are used in Internet web pages, are all based on the XML format. In fact, the XML standard is one of the cornerstones of the Internet.

However, XML has a serious weakness: it is highly redundant, which makes file storage, transmission, and processing expensive. For this reason, the World Wide Web Consortium (W3C) developed the EXI (Efficient XML Interchange) standard, which is a binary XML standard.

Similarly, representing an XML file in the new data processing system also avoids this weakness. Unlike the full binarization of EXI, however, the XML file in the new data processing system remains in text format, except that the corresponding encoding becomes object encoding. As can be seen from the SVG example in OTF-8, object encoding reduces the redundant information in the XML syntax. Combined with the metadata in the encoding repository, the converted result is completely equivalent to the information before the conversion. People can easily view and edit the text content through the previously mentioned "hybrid coded universal display and editing" text service. We can make even greater use of the encoding repository by treating the values of XML elements and attributes as the data parameters of the corresponding encoding and encoding them directly with object encoding. This further reduces the storage space and the possibility of errors. Of course, we can also store XML content or fragments directly in the encoding repository and use their codes in the XML file, but that is merely a use of the encoding repository by XML, not an encoding of XML itself.

With object-encoded XML files, we only need to make a few changes in the XML parser so that it obtains the relevant information from the encoding repository. On this basis, all existing XML technologies, such as SAX, DOM, XPath, XSLT, and XSL-FO, can be used directly. For application developers, all changes occur in the storage and parsing layers of the XML file. If the API remains the same, an application that uses XML needs no changes and directly benefits from smaller file sizes and faster transfer speeds.

In fact, in the existing XML specification, the same character set is used to express both syntax and textual content. Therefore, generating XML files is subject to a number of restrictions: some system characters ("<", ">", "&", etc.) cannot be used directly and must be escaped through entities; non-parsed data has to be encapsulated between "<![CDATA[" and "]]>"; and so on. Object encoding makes these restrictions unnecessary, because whether an item is a tag or content is determined not by the code itself but by the encoding repository information corresponding to the code. This simplifies XML and the corresponding parsing process.

Similarly, we can use the same method to encode any existing text format (such as CSV, RTF, CSS, JSON, and even programming languages):

1. Place the corresponding content of the grammar tag/keyword in the encoding repository, and use the corresponding object encoding in the file;

2. Remove any character restrictions in the data/text content.

As mentioned above, object encoding can easily eliminate the conflict between the formatting codes of the original standard encoding and the text content. Similarly, the separation of encoding from content, together with the openness of the encoding types, makes it possible to mix many different text formats together. Some existing text format specifications already allow for this possibility. For example, XHTML can embed JavaScript or Base64-encoded binary data, and RTF can embed OLE objects. However, on the one hand, these formats are limited by standard text encoding: data in a different format requires some encoding conversion or character escaping. On the other hand, the existing format mixing is limited, with one format being primary and the other formats merely embedded data. With object encoding, we can easily mix any formats. For example, tabular data can be embedded in a node of an XML document (actually a tree document); conversely, a tree document can be embedded in one cell of a table; or documents of two different forms can be placed side by side. Of course, this mixture of multiple formats is also subject to certain rules:

1. Each format must have an explicit format-start encoding and format-end encoding.

2. The beginning and end of different formats cannot be intertwined. That is, if one format starts inside another, it must end inside it.
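The two rules above amount to requiring that format start and end codes nest properly, which can be checked with a stack. The token representation below (tuples of `"start"`, `"end"` or `"data"` with a name or value) and the function name are assumptions made for this sketch.

```python
def formats_properly_nested(tokens):
    """Check that every format start has a matching end and formats never interleave.

    tokens: iterable of ("start", fmt), ("end", fmt) or ("data", value).
    """
    open_formats = []
    for kind, name in tokens:
        if kind == "start":
            open_formats.append(name)
        elif kind == "end":
            # The format being closed must be the most recently opened one.
            if not open_formats or open_formats.pop() != name:
                return False
    return not open_formats  # nothing may remain open at the end


table_in_tree = [("start", "xml"), ("data", "node"),
                 ("start", "csv"), ("data", "1,2"), ("end", "csv"),
                 ("end", "xml")]
interleaved = [("start", "xml"), ("start", "csv"), ("end", "xml"), ("end", "csv")]
```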

In addition, object encoding allows us to embed binary data directly into the encoded result. In effect, this is content encoding of the data object's data content. It is only necessary to describe the binary encoding method in the corresponding encoding metadata. Such an object encoding can take the following form:

Metacode + binary content encoded data length + specific binary content encoded data
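As a hedged illustration of this layout, the sketch below packs a metacode, a length field and the raw bytes. The field widths (a one-byte metacode and a four-byte big-endian length) and the function names are assumptions; the text only fixes the order of the three parts.

```python
import struct

HEADER = ">BI"                          # assumed: 1-byte metacode, 4-byte length
HEADER_SIZE = struct.calcsize(HEADER)   # 5 bytes, no padding with ">"


def pack_binary(metacode: int, payload: bytes) -> bytes:
    """Metacode + payload length + payload bytes."""
    return struct.pack(HEADER, metacode, len(payload)) + payload


def unpack_binary(data: bytes, offset: int = 0):
    """Return (metacode, payload, offset of the next item)."""
    metacode, length = struct.unpack_from(HEADER, data, offset)
    start = offset + HEADER_SIZE
    payload = data[start:start + length]
    if len(payload) != length:
        raise ValueError("truncated binary content")
    return metacode, payload, start + length


blob = pack_binary(0x90, b"\x00\xff<&>")
```

Because the reader knows the length in advance, the payload may contain any byte values, including those that would otherwise need escaping in a text format.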

In fact, mixed format encoding comes naturally to a data processing system based on object encoding. In an open object coding system, different coding types already require different encoders and decoders, which are dynamically loaded as needed when an object-encoded document is encoded or decoded. An encoder encodes objects into a byte stream, and a decoder decodes a byte stream into objects. Different formats simply divide the encoders and decoders into different groups. Encoding in a certain format therefore means encoding the corresponding in-memory model into a byte stream, and decoding that format means decoding the byte stream back into the in-memory model, which is a higher-level object. A format codec is thus a more macroscopic object codec that can be managed in the same way in the new data processing system.

Essentially, the object encoding system encodes a string of objects into a byte stream. Each object in the object string (that is, in the object array) can be as simple as a single character or as complex as the abstract syntax tree of program code or the tree structure of an XML document.

In addition, for handwriting-based programming systems, the objects of interest to the compiler and the interpreter are primarily the symbols of the programming system. Whether a symbol corresponds to a word or a graphic does not affect compilation or interpretation. In this process, symbol matching is extremely important. Therefore, in the handwritten data processing system, we can reuse the existing programming language infrastructure simply by pattern-matching the handwritten content and giving matching content the same encoding. There are two main types of pattern matching: keyword matching and identifier matching. The result of keyword matching is a system keyword (generally in standard encoding for traditional programming languages); the result of identifier matching is the same custom encoding or extended encoding.

In addition, for programming languages, most programming languages currently use text files. Similarly, the program source code object can be encoded using the above method. Object encoding of program source code can bring the following benefits:

1. Reduce the file size. This is especially important for source code that needs to be transmitted over the network, such as JavaScript.

2. Programs can be written using non-standard encodings. This makes approaches such as handwriting programming and voice programming possible.

3. The security features of open coding can be used to place the codes in the source code in the relevant context space of the author or copyright owner, so that the source code can only be used by authorized users.

4. When parsing source code whose keywords are open-coded, lexical scanning and analysis of keywords becomes direct code recognition, which is more efficient.

As with object encoding for most text files, object encoding of program source code is primarily at the tool level and is completely transparent to the end user.

In addition, open coding itself brings new possibilities to programming languages. We can build computer software in a completely new way: data can reside in the encoding repository and be referenced directly in the program; the program can also reside in the encoding repository and be referenced by its code; and data can be mixed with the program in some form.

In addition, for machine instruction encoding, the encoding repository is in effect a natural cipher library, and data encoded through the encoding repository is highly secure. Therefore, we can use the encoding repository to encode not only text but also binary data. A typical application is to apply context-dependent object encoding to machine instructions. The binary of the same application is then completely different for different users, and users cannot run the executables of other users. This is in effect a digital rights protection solution for applications. This solution can also prevent viruses or malicious programs from tampering with executable files.

This scheme is mainly implemented by modifying the program execution engine or virtual machine. Take the Java virtual machine as an example. First, the standard Java virtual machine instruction codes are re-encoded for each user according to some method (such as a random algorithm), the result is placed in the encoding repository, and appropriate protection rights are set. Second, the Java bytecode is encoded according to the re-encoded instruction codes. Finally, during execution, the Java virtual machine dynamically restores the current bytecode to the standard instruction codes according to the current user information. In this way, only the corresponding user can correctly execute the corresponding Java bytecode.
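The per-user re-encoding of instruction codes can be sketched as a byte permutation. This is a simplified illustration under stated assumptions: it treats the input as a plain sequence of opcodes (real bytecode also contains operands, which a full implementation would have to skip), the table is derived from a seed with Python's `random` module purely for reproducibility (not a cryptographically secure choice), and all names are hypothetical.

```python
import random


def make_user_table(seed):
    """Per-user table: index = standard opcode, value = personal opcode."""
    table = list(range(256))
    random.Random(seed).shuffle(table)   # assumption: seeded shuffle as the "random algorithm"
    return table


def invert(table):
    inverse = [0] * 256
    for standard, personal in enumerate(table):
        inverse[personal] = standard
    return inverse


def encode_opcodes(opcodes, table):
    return bytes(table[op] for op in opcodes)


def decode_opcodes(encoded, table):
    inverse = invert(table)              # what the modified virtual machine would apply
    return bytes(inverse[b] for b in encoded)


standard = bytes([0x2A, 0xB7, 0xB1])     # aload_0, invokespecial, return (opcodes only)
table = make_user_table(seed=12345)      # the table would live in the user's protected space
personal = encode_opcodes(standard, table)
restored = decode_opcodes(personal, table)
```

Only a party that can read the user's table from the encoding repository can invert the mapping, which is what ties execution to the access rights set on that table.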

For binary format encoding, as with executable files, we can also place some or all of the key information of other binary data files in the encoding area, thereby providing copyright protection: only authorized users can obtain the key information and use the corresponding binary data.

Taking video files as an example, many video file formats are actually container formats, which can accommodate video and audio streams in different encoding formats. The industry generally uses a four-byte encoding format designation called "FourCC." The video player will decode and play the video and audio streams using the correct decoder based on this FourCC. There are currently hundreds of registered FourCCs. We can replace the FourCC in the video file with the object encoding, and the real stream encoding identifier is stored in the corresponding encoding repository storage. In this way, by controlling the corresponding access rights of the code repository, we can control the playback of video files or video streams.

In addition, with regard to data compression, it is also possible to implement a data compression function by using an encoding warehouse: the repeated portions of the data are placed in the encoding area and the corresponding open coding is used.
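A toy sketch of this idea follows: words that occur more than once are stored in the repository once and replaced by their codes. Splitting on words, using integers as codes, and modelling the repository as a plain dictionary are all assumptions chosen to keep the example small.

```python
from collections import Counter


def compress(words, repository):
    """Replace repeated words by repository codes; repository maps content -> code."""
    counts = Counter(words)
    packed = []
    for word in words:
        if counts[word] > 1:
            if word not in repository:
                repository[word] = len(repository)   # next free integer code
            packed.append(repository[word])
        else:
            packed.append(word)                      # unique words stay literal
    return packed


def decompress(packed, repository):
    by_code = {code: content for content, code in repository.items()}
    return [by_code[item] if isinstance(item, int) else item for item in packed]


repository = {}
words = "to be or not to be".split()
packed = compress(words, repository)
```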

In addition, for the network digital store, we have already seen that the security mechanism built into the object encoding warehouse makes digital rights management, identity authentication, etc. easy to implement on the basis of the encoding warehouse. We can use it for the construction of a network digital store.

The network digital store system is mainly an application system that provides digital content transaction services to network users. Applications such as app stores and e-libraries fall into this category. There are two main types of users here: providers of digital content and consumers of digital content. The network digital store system can be built directly on the encoding repository, with all users being users of the encoding repository. By linking the digital content to context codes associated with the user, the security built into the encoding repository can be used.

Specifically, consumers consume digital content in two main modes: rental mode and purchase mode.

In rental mode, the digital content or digital assets are owned by the provider, and consumers only obtain temporary access or usage rights by some means (generally paid). Rented digital content is generally time-limited, and content whose rental has expired is inaccessible to the consumer. By placing a provider-related context code in the digital content, access control for rental mode can be implemented: access is authorized based on each user's rental period.

In purchase mode, the consumer obtains the rights to the digital content in a certain way (such as a paid purchase). The main issue here is digital copyright protection, that is, preventing the creation of illegal copies. One specific implementation using the encoding repository is to place, in the digital content purchased by the user, a special context code located in the user's personal space. This code can only be accessed by this user, and the user cannot change its access rules. In this way, other users cannot use a digital copy of the same content.

As can be seen from the above description, the core part of the data processing system based on object encoding is the encoding repository (or encoding library). Various encoding metadata can be stored in it, and the actual content of the text can also be stored in it. Through the various services provided by the encoding repository, the new text input system can convert various text content, or other content (such as user interaction content, domain-specific content, application content, etc.), into text encoding, which is stored and processed by the application system. In the process of generating the text encoding, part or all of the text content is stored in the encoding repository. Similarly, through the services of the encoding repository, the new text output system can convert the string sent by the application into text content that can be rendered or played, or into an object model that the application can use.

Of course, the encoding repository need not be a single store or storage space. In the general case, an encoding repository can be a combination of multiple repositories, or even of cloud storage service providers reached over different secure channels.

In the new system, whether for coding layer processing or text data processing, the encoding and decoding systems or functions are the cornerstones. As the core of the new coding system, the encoding repository provides at least two basic services. One is to receive the content to be encoded, ensure that the content is properly stored in the encoding repository, and return the corresponding code; this is called the encoding service. The encoding system uses this service to obtain the correct text encoding. The other is to return the corresponding content item according to a code; this is called the decoding service. The decoding system needs this service to obtain content that can be correctly output by the output system. Of course, for a single-user system, the encoding/decoding function or service can also be provided directly on the client side rather than at the encoding repository.
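The two basic services can be sketched as a pair of methods on an in-memory store. The class name, the use of sequential integers as codes, and the absence of any access control are simplifying assumptions for illustration only.

```python
class EncodingRepository:
    """Toy repository exposing the encoding and decoding services."""

    def __init__(self):
        self._content_by_code = []     # code (list index) -> content
        self._code_by_content = {}     # content -> code

    def encode(self, content):
        """Encoding service: make sure content is stored, return its code."""
        if content not in self._code_by_content:
            self._code_by_content[content] = len(self._content_by_code)
            self._content_by_code.append(content)
        return self._code_by_content[content]

    def decode(self, code):
        """Decoding service: return the content item stored under the code."""
        return self._content_by_code[code]


repo = EncodingRepository()
first = repo.encode("hello")
second = repo.encode("world")
```

For a single-user system, an object like this could live on the client side, as the paragraph above notes.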

FIG. 44 is a schematic structural diagram of a first embodiment of an encoding processing system according to the present invention. As shown in FIG. 44, the encoding processing system includes: a receiving unit 11C, a metadata extracting unit 12C, a meta code generating unit 13C, an encoding specification selection or creation unit 14C, an instance encoding generating unit 15C, and an object encoding generating unit 16C. Specifically, the receiving unit 11C is configured to receive an encoding processing request and acquire the data object to be encoded according to the request; the metadata extracting unit 12C is configured to acquire metadata according to the data object to be encoded; the meta code generating unit 13C is configured to query an encoding repository according to the metadata and obtain the meta code corresponding to the metadata; the encoding specification selection or creation unit 14C is configured to select or create a corresponding encoding specification according to the meta code; the instance encoding generating unit 15C is configured to encode the data content of the data object according to the encoding specification and obtain an instance encoding; and the object encoding generating unit 16C is configured to acquire, according to the meta code and the instance encoding, the object encoding corresponding to the data object.

In this embodiment, the coding processing system can perform the technical solutions of the method embodiments shown in FIG. 5C and FIG. 5D, and the implementation principles and effects thereof are similar, and details are not described herein again.

In addition, the encoding processing system may further include: a data compression unit, configured to compress data before transmission and storage, where the corresponding compression process may be described or embodied in the encoding specification; and an encryption unit, configured to encrypt data objects or encodings that need to be encrypted.

FIG. 45 is a schematic structural diagram of a first embodiment of a decoding processing system according to the present invention. As shown in FIG. 45, the decoding processing system includes: a receiving unit 21C, a disassembling unit 22C, an obtaining unit 23C, and a restoring unit 24C. The receiving unit 21C is configured to receive a decoding processing request and obtain the object encoding to be decoded according to the request; the disassembling unit 22C is configured to disassemble the object encoding and obtain a meta code, or the meta code and an instance encoding; the obtaining unit 23C is configured to query an encoding repository and obtain the corresponding metadata and encoding specification according to the meta code; and the restoring unit 24C is configured to obtain, according to the metadata and the encoding specification, the data object corresponding to the object encoding.

In this embodiment, the decoding processing system can perform the technical solution of the method embodiment shown in FIG. 32, and the implementation principle and effect are similar, and details are not described herein again.

Further, corresponding to the encoding processing system, the decoding processing system may also include a corresponding data decrypting unit, a data decompressing unit, and the like.
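The flow through the encoding units (11C to 16C) and decoding units (21C to 24C) can be sketched as two functions over a small registry. Everything concrete here is an assumption for illustration: the metadata is reduced to the Python type name, meta codes and instance lengths are single bytes (so an instance is limited to 255 bytes), and the encoding specification is a pair of functions.

```python
class Repository:
    """Toy encoding repository: metadata -> meta code -> encoding specification."""

    def __init__(self):
        self.meta_codes = {}   # metadata (type name) -> meta code
        self.specs = {}        # meta code -> (encode function, decode function)

    def register(self, type_name, encode_fn, decode_fn):
        meta_code = len(self.meta_codes) + 1
        self.meta_codes[type_name] = meta_code
        self.specs[meta_code] = (encode_fn, decode_fn)
        return meta_code


def encode_object(repo, obj):
    metadata = type(obj).__name__                  # metadata extracting unit
    meta_code = repo.meta_codes[metadata]          # meta code generating unit
    encode_fn, _ = repo.specs[meta_code]           # encoding specification selection
    instance = encode_fn(obj)                      # instance encoding generating unit
    # Object encoding generating unit: meta code + instance length + instance.
    return bytes([meta_code, len(instance)]) + instance


def decode_object(repo, data):
    meta_code, length = data[0], data[1]           # disassembling unit
    instance = data[2:2 + length]
    _, decode_fn = repo.specs[meta_code]           # obtaining unit
    return decode_fn(instance)                     # restoring unit


repo = Repository()
repo.register("str", lambda s: s.encode("utf-8"), lambda b: b.decode("utf-8"))
repo.register("int", lambda n: n.to_bytes(4, "big", signed=True),
              lambda b: int.from_bytes(b, "big", signed=True))
text_code = encode_object(repo, "Go")
number_code = encode_object(repo, 42)
```

Compression or encryption units, where present, would wrap the instance bytes before they are appended and unwrap them before `decode_fn` is applied.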

In this embodiment, a word processing system based mainly on the object coding system is taken as an example for detailed description. FIG. 46 is a schematic structural diagram of such a word processing system. As shown in FIG. 46, the new system is roughly divided into two parts: the encoding repository and the corresponding processing system.

The encoding repository (code base) can consist of two parts: the encoded data, and the related services built around that data.

Specifically, the open coding model shows that it can easily be implemented using an object-based approach. Because encodings must be persisted, we can use object databases, or store objects in various databases through object-relational mapping techniques.

For an encoding service, the encoding service is actually a process in which the encoding repository receives the object data, stores it in the library, and returns the corresponding encoding. As can be seen from the previous coding model, this code is divided into two parts: meta-encoding and instance coding. For the more common short word length coding, we generally provide two corresponding sub-services.

For the sub-service that registers encoding meta-objects: after obtaining a registered named coding space, the client can register encoding types with it. An encoding type includes the target coding space of the corresponding codes, which is actually specified by the meta coding space of the type data. After receiving the registration request, the encoding repository verifies the security and legality of the request according to the system and user settings. After the verification passes, the corresponding code is returned to the client.

The named encoding space is not the only target space for encoding type registration, and the client can also register directly with the root space of the encoding repository. Similar to the registered named encoding space type, the encoding repository will place the encoding type in a specific encoding space according to the system and the user's settings, and return the corresponding encoding space path and type encoding to the client.

For the object encoding sub-service, when the client makes an encoding request to the encoding repository, it must provide the corresponding meta code and type code. The encoding repository stores the object in the data store corresponding to the encoding type and returns to the client the code of the location where the object is stored.

The decoding service is the reverse of the encoding service: the encoding repository receives a code and returns the corresponding data object to the client.

Specifically, the encoding repository provides two decoding sub-services. In the short-word-length implementation of the decoding service, we impose a simple constraint: the meta code and the instance code are represented by separate code points, and an instance code can only appear after its meta code. In this way, the decoding service can be carried out through two sub-services.

For the sub-service that decodes meta codes, when the client submits a decoding request to the encoding repository for a specific coding space (the root space if none is specified), the encoding repository first performs a security check to see whether the current context object satisfies the system's security settings. If it does, the encoding metadata for the code in the specified coding space is returned to the client. This encoding metadata includes the type information corresponding to the type code and the target coding space of the corresponding encoding instances. If the corresponding type is itself an encoding metadata type, its coding space is a subspace of the current space.

The sub-service that decodes encoded objects works similarly: after the client has obtained the encoding metadata, it can submit to the encoding repository a decoding request for a specific coding space, a specific encoding type, and a specific code. If the security settings are satisfied, the encoding repository returns the object data at the corresponding location to the client.

The content caching service can be implemented by object-encoding encoding repositories themselves. Specifically, an object encoding for another encoding repository is established in one encoding repository. The content of such an encoding repository object is mainly a reference to the target repository, such as a URL or a connection string. Each target repository then corresponds to one coding space. In this way, once a cache encoding repository is configured, the content caching service can act as a caching proxy during encoding and decoding, storing each target code and its content in the coding space that corresponds to the target repository within the cache encoding repository.
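A minimal sketch of the proxy caching described above: the cache keeps one coding space per target repository, keyed by the target's reference, and contacts the target only on a miss. The class name, the `fetch` callback standing in for a network call, and the example URL are hypothetical.

```python
class CacheRepository:
    """Toy cache repository with one coding space per target repository."""

    def __init__(self):
        self._spaces = {}   # target reference (e.g. a URL) -> {code: content}
        self.fetches = 0    # number of times a target repository was contacted

    def decode(self, target_ref, code, fetch):
        space = self._spaces.setdefault(target_ref, {})
        if code not in space:
            space[code] = fetch(code)   # cache miss: ask the target repository
            self.fetches += 1
        return space[code]


remote = {7: "glyph data"}              # stands in for a remote target repository
cache = CacheRepository()
first = cache.decode("https://repo.example/a", 7, remote.__getitem__)
again = cache.decode("https://repo.example/a", 7, remote.__getitem__)
```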

For the environment-aware authorized access system, the security of the new system rests mainly on the authorized access service of the encoding repository. The other services of the encoding repository are all provided on top of this authorized access service.

Unlike a general authorized access system, the granularity of authorized access in the encoding repository can be very fine, down to a specific code. Moreover, a code is always used in a specific context, such as the author of the code, the reader, the application using the code, the document, and so on. Therefore, based on this context model and its associated extended models, various rules can be defined to configure access to the various encoding services within the encoding repository.

Implementing the environment (context)-aware authorized access system poses no particular technical difficulty; conventional rule-based system technology meets the need.

In addition to the system defaults, the access authorization rule base is set mainly by the system administrator and by code authors, who configure access to their own codes.

The authorization rules are set based on the coding model and the coding context model, using attributes such as coding type, coding space, coding context, time, location (GPS), code author, and code reader. In addition, an application system that uses the encoding repository can provide an extension model of the encoding context to the repository, and encoding access rules can be built on all of these models.
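A rule base of this kind can be sketched as an ordered list of predicates over the access context. The sketch below is only an assumption about how such rules might be expressed: `AccessContext`, `author_may_read_own`, `deny_space` and `is_allowed` are hypothetical names, and the "first rule that gives a verdict wins" policy is one possible choice among many.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AccessContext:
    """A small subset of the context attributes listed above."""
    space: str
    code_type: str
    author: str
    reader: str
    application: str


# A rule returns True (allow), False (deny) or None (no opinion).
Rule = Callable[[AccessContext], Optional[bool]]


def author_may_read_own(ctx):
    """Code authors may always access their own codes."""
    return True if ctx.reader == ctx.author else None


def deny_space(space):
    """Build a rule that denies access to everything in one encoding space."""
    def rule(ctx):
        return False if ctx.space == space else None
    return rule


def is_allowed(ctx, rules, default=False):
    """Apply the rules in order; the first rule with a verdict decides."""
    for rule in rules:
        verdict = rule(ctx)
        if verdict is not None:
            return verdict
    return default
```

An application-specific extension model would simply add fields to the context and further rules to the list.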

Applications that are combined with the object-based context-dependent coding scheme of the present invention may also include, but are not limited to, handwritten login, secure authentication model, text service, text codec serialization service, and the like.

In addition, unlike the encoding and decoding services of the encoding repository mentioned above, the text codec serialization service converts between objects in the application system and their text codes. It is built on the codec service of the encoding repository and is, in effect, the content encoding service for data objects. The main difference between text encoding/decoding and encoding-repository encoding/decoding is the model to which the codec data corresponds: the text codec corresponds to the application model, whereas the encoding-repository codec corresponds to the storage model. Of course, in some cases the two models are identical.

For text input and output services, we mentioned earlier that the new data processing system mainly has two aspects of coding ability, one is the coding ability of personalized text, and the other is the re-encoding ability of traditional text data. The text input and output services we mentioned here are mainly for the former. The input and output of the latter is mainly through the "general display editing service" mentioned later.

Common personalized texts are mainly handwritten text and voice text. Of course, it can also be any other form of text that can be stored and transmitted by means of a computer system, such as sign language, gestures, semaphores, lips, and the like.

Here, the description of handwritten characters is mainly used to show the difference between personalized text and traditional computer text.

There are many kinds of personalized handwritten text. Depending on the input method, it may be graphic/stroke information entered directly into the computer system, called online handwriting, or a scanned image of writing on paper, called offline handwriting. Depending on the stroke details, there is hard-pen handwriting, soft-pen handwriting, and so on.

The most essential difference between this kind of personalized handwritten text and existing handwriting input is that personalized text is personal: it varies from person to person and does not need to be recognized as a standard code. The input and output of personalized text is therefore mainly a natural writing process, in which the computer should adapt to the individual's writing habits as much as possible and preserve the written result as faithfully as possible. This is the opposite of traditional keyboard input, where the human adapts to the computer.

The output of personalized handwritten text is mainly the display output of the computer screen, of course, there are subsequent printouts. The input is primarily the direct writing of a finger or pen device on a computer touch screen. There are two natural writing constraints to ensure that we are entering text, not graphics:

1. Based on row or column overall layout constraints. That is to say, when the user makes an input, the target row (or column, which will be collectively referred to as a row) must be activated in some way before the input can be made in the row. In this way, the text input system can effectively determine the overall order of the text.

2. Interval-based inline typesetting constraints. Within a line, the text input system must be able to recognize the most basic text units to ensure efficient text storage, encoding, and reuse. In a phonetic data processing system, the distance between words is usually significantly larger than the distance between the letters within a word. Therefore, we can use the word as the most basic text unit of the corresponding data processing system and divide the line into words by analyzing the spacing. At the same time, we also encode the length of each space to ensure that the text content is reproduced correctly. In this case, even if the result of the spacing analysis is not completely correct (mainly because the process differs from human recognition in lacking letter recognition and semantic analysis), the output can still be exactly the same as the input. To handle errors in the spacing analysis, the text input system can also provide tools for correcting its results. In an ideographic data processing system, individual characters are of similar size, and the spacing between characters is similar and relatively small. In this case, the text input system can add an auxiliary grid to help segment the characters. For example, for Chinese characters, auxiliary lines in the form of character cells can be provided during input to help the user write each character into its own cell, and the cells can then be used to separate the characters during spacing analysis. We call this the text layout constraint. In fact, text typesetting rules are highly cultural and often vary from language to language. In the new system, different input and output systems can be provided for different language cultures.
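The spacing analysis for phonetic scripts can be illustrated with a minimal sketch. It assumes that each stroke in the active row is reduced to its horizontal extent `(x_min, x_max)` and that a single fixed `gap_threshold` separates words; both assumptions, and the function name `segment_words`, are illustrative only. The measured gap lengths are returned alongside the words so that they can be encoded for faithful reproduction.

```python
def segment_words(strokes, gap_threshold):
    """Group strokes of one row into words by analyzing horizontal gaps.

    strokes: list of (x_min, x_max) extents within the active row.
    Returns (words, gaps): the stroke groups and the gap length measured
    before each word after the first.
    """
    words, gaps = [], []
    current, right_edge = [], None
    for stroke in sorted(strokes):
        x_min, x_max = stroke
        # A gap wider than the threshold closes the current word.
        if current and x_min - right_edge > gap_threshold:
            words.append(current)
            gaps.append(x_min - right_edge)
            current, right_edge = [], None
        current.append(stroke)
        right_edge = x_max if right_edge is None else max(right_edge, x_max)
    if current:
        words.append(current)
    return words, gaps
```

Because the gap lengths are kept, a wrong split or merge changes only the grouping of strokes into words, not the reproduced appearance of the line.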

For general purpose display and editing of hybrid coding, one of the main benefits of a standard coded data processing system is its readability, which means that people can understand the corresponding text content. This readability is based on the fact that coding standards are generally supported by various hardware and software systems. The most widely supported coding standard is ASCII encoding.

In the new data processing system, we are fully compatible with existing coding standards, for example through the support for UTF encoding by OTF encoding mentioned earlier. In addition to display support for UTF standard text, we also offer a universal text display and editing service that provides direct display and editing of open-coded text. The display and editing mentioned here is neither a complete text display editor nor a binary display and editing tool, but a general service between the two. The service has the following characteristics:

1. Can correctly display and edit UTF standard text;

2. For non-UTF encoding, it is possible to display and edit the encoding type ID (including the ID of the spatial type) and the number corresponding to the encoding;

3. For some commonly used public open coding, such as XML, JSON, HTML, SVG, etc., directly display and edit the original text content.

This universal text display and editing service can support traditional text input and output devices: a monochrome text terminal (reverse display can be used to distinguish codes from their corresponding content) and a keyboard (the code-editing state can be distinguished from the content-editing state). It is primarily intended to give developers and system maintainers a convenient way to view and modify text data in the traditional manner.

The universal display and editing services of text are important guarantees for the new system to maintain human readability.
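As a rough illustration of such a display service, the sketch below renders a mixed code sequence on a plain text terminal: UTF text is shown as is, and any other code is shown by its space ID, type ID and number. The item representation (a `str` for UTF text, a three-element tuple otherwise) and the bracket notation `[space/type#number]` are assumptions made for this example only.

```python
def render(codes):
    """Render a mixed sequence of UTF text and open codes as plain text.

    codes: items are either str (UTF standard text) or a tuple
    (space_id, type_id, number) describing a non-UTF code.
    """
    parts = []
    for code in codes:
        if isinstance(code, str):
            parts.append(code)  # UTF text is displayed unchanged
        else:
            space_id, type_id, number = code
            parts.append(f"[{space_id}/{type_id}#{number}]")
    return "".join(parts)
```

For example, `render(["Hi ", (3, 12, 845), "!"])` yields a line in which the handwritten unit appears as a bracketed identifier between the two pieces of standard text.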

For the matching (service) of the encoding-repository content: taking personalized handwritten content as an example, the matching that underlies normalization of the repository content is shape matching.

At present, the matching technology for graphics and images is relatively mature. For glyphs, various matching algorithms exist, including methods based on stroke curve fitting, contour-based methods, matching methods based on feature analysis, machine-learning-based methods, and the like. They are not described further here. In addition, since the present invention can record the time and position information of each input stroke, it can also use this information to match the input content.

For the normalization of the encoding-repository content: normalization builds on content matching to ensure that the same or similar content corresponds to a unique code. Taking personalized handwritten content as an example, the ideal case is that the same user's handwriting of the same content always maps to the same code in the encoding repository.

Normalization of the encoding-repository content can be performed automatically according to a set threshold or interactively with the user. For example, in the case of personalized handwriting, when the user's written content is submitted to the encoding repository, the repository finds all similarly shaped glyphs and lets the user confirm whether they should be normalized and which glyph they should be normalized to.
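The automatic, threshold-based variant can be sketched as follows. The sketch assumes that each glyph has already been reduced to a feature vector and that a distance function is available; the toy Euclidean distance used in the usage note stands in for any of the shape-matching methods mentioned above. The names `normalize` and `repo`, and the sequential integer codes, are hypothetical.

```python
import math


def normalize(repo, glyph, distance, threshold):
    """Return the code for `glyph`, reusing an existing code when close enough.

    repo: dict mapping integer code -> representative glyph (feature vector).
    distance: function(glyph_a, glyph_b) -> non-negative float.
    """
    best_code, best_dist = None, None
    for code, representative in repo.items():
        d = distance(glyph, representative)
        if best_dist is None or d < best_dist:
            best_code, best_dist = code, d
    if best_dist is not None and best_dist <= threshold:
        return best_code          # similar content maps to the existing code
    new_code = max(repo, default=0) + 1
    repo[new_code] = glyph        # otherwise register a new code
    return new_code
```

With `repo = {}` and `math.dist` as the distance, submitting `(0.0, 0.0)` and then `(0.1, 0.1)` with a threshold of `0.5` returns the same code twice, while `(3.0, 4.0)` receives a new code. An interactive variant would present the candidates to the user instead of applying the threshold directly.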

For the search and matching of object encoding, the traditional string pattern matching algorithm can be directly used in the search and matching of object encoding. However, there are two things to note:

1. A binary comparison alone cannot be used to decide whether a code in the source string is the same as a code in the target string; instead, the encoding space, encoding type, and instance code of the source code and the target code must all be identical.

2. The encoding of the interval between the source string and the target string (ie, the space between characters) can be directly ignored.

Therefore, existing string matching algorithms, such as the classic KMP algorithm, can be used in the new data processing system with minor modifications. It is worth mentioning that searching object codes does not require decoding the corresponding text content; only the corresponding encoding metadata needs to be decoded, mainly the encoding type information and the encoding space information.
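The two modifications can be illustrated with a minimal KMP sketch. It assumes that every code has already been resolved to a `(space, type_id, instance)` triple using the encoding metadata, and that spacing codes are marked by a reserved type ID; the `Code` tuple, the `SPACING` marker and the function names are hypothetical. Note that the returned index refers to the target sequence after spacing codes have been removed.

```python
from typing import NamedTuple


class Code(NamedTuple):
    space: str     # encoding space
    type_id: str   # encoding type
    instance: int  # instance code


SPACING = "spacing"  # hypothetical type ID marking inter-character spacing


def strip_spacing(codes):
    """Note 2: spacing codes are ignored during matching."""
    return [c for c in codes if c.type_id != SPACING]


def failure_table(pattern):
    """Standard KMP failure table over a list of codes."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def find(source, target):
    """Index of the first match in the spacing-stripped target, or -1."""
    pattern, text = strip_spacing(source), strip_spacing(target)
    if not pattern:
        return 0
    table, k = failure_table(pattern), 0
    for i, code in enumerate(text):
        # Note 1: tuple equality compares space, type and instance together.
        while k and code != pattern[k]:
            k = table[k - 1]
        if code == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

The matching loop itself is unchanged KMP; only the equality test and the pre-filtering of spacing codes differ from ordinary string search.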

Retrieval of object codes is similar to search and matching of object codes and can be based entirely on existing retrieval methods, which likewise need to be modified to account for the characteristics above.

For the input search of personalized text, in the new data processing system, all the encoded content can be stored in the encoding warehouse, so the search for the user input content can be optimized based on the encoding warehouse content normalization service. The search process is as follows:

1. Enter the text content to be found (source text) through the text input system;

2. The encoding repository performs normalized matching on the source text;

3. If the source text contains a new code (a code with no match), the search fails immediately;

4. If the source text contains a code that does not appear in the target text, the search fails immediately;

5. Search the target code sequence for the code string corresponding to the source text.
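The search process above can be sketched as follows, under the assumption that normalized matching is available as a function `match(glyph)` returning an existing code or `None`. The function names and the plain-integer codes are illustrative, and the final scan is a naive one that could be replaced by any string matching algorithm.

```python
def search(source_glyphs, target_codes, match):
    """Search the target code sequence for handwritten source text.

    source_glyphs: input units written by the user (step 1).
    target_codes: list of codes making up the target text.
    match: function(glyph) -> existing code, or None if unmatched (step 2).
    Returns the index of the first occurrence, or -1 if the search fails.
    """
    source_codes = []
    for glyph in source_glyphs:
        code = match(glyph)
        if code is None:
            return -1  # step 3: the source contains a new, unmatched code
        source_codes.append(code)
    if not set(source_codes) <= set(target_codes):
        return -1      # step 4: a source code never appears in the target
    n = len(source_codes)
    for i in range(len(target_codes) - n + 1):
        if target_codes[i:i + n] == source_codes:
            return i   # step 5: code string found in the target
    return -1
```

Steps 3 and 4 act as cheap early exits: no sequence comparison is needed when the repository already shows that a match is impossible.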

For the recognition of personalized text: recognition of personalized text is a subset of traditional text recognition, and the recognition results can be stored in the encoding repository. Note that the same code may have more than one recognition result. For example, a glyph resembling the capital letter I may also correspond to the digit 1 or the lowercase letter l. This ambiguity also arises in traditional text recognition. Here the traditional recognition process needs only slight modification: it is combined with the per-character or per-word recognition information in the encoding repository to recognize whole sentences and whole texts.

For multi-level output systems, we do not have any restrictions on the text content of the encoding in the encoding library of the object encoding. Therefore, these two situations may occur:

1. The corresponding text content of the encoding is vectorized/parameterized information, which can have different outputs according to different conditions/parameters;

2. The same code may correspond to multiple pieces of text content.

In either case, a content selection mechanism must be used in the decoding service of the encoding repository. For the first case, the encoding repository dynamically generates the corresponding encoded content based on the information of the decoding request. For the second case, the encoding repository will select the most appropriate text content based on system settings and decoding requests.
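A content selection mechanism covering both cases might look like the sketch below. The representation is an assumption made for illustration: parameterized content is modeled as a callable that generates output from the decoding request, and multiple contents are modeled as a list of variants tagged with a hypothetical `medium` key, with the first variant serving as the system default.

```python
def decode(entry, request):
    """Select or generate the content for one code.

    entry: a callable (case 1, parameterized content), a list of variant
    dicts with "medium" and "content" keys (case 2), or plain content.
    request: dict of decoding-request parameters.
    """
    if callable(entry):
        # Case 1: generate the content dynamically from the request.
        return entry(request)
    if isinstance(entry, list):
        # Case 2: pick the variant that best fits the request.
        wanted = request.get("medium")
        for variant in entry:
            if variant["medium"] == wanted:
                return variant["content"]
        return entry[0]["content"]  # fall back to the default variant
    return entry
```

For instance, an entry `lambda req: f"glyph@{req['size']}px"` produces size-dependent output, while a list holding a `screen` and a `print` variant returns the one named in the request.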

For visual touch editing of personalized text, under the new data processing system, visual hybrid editing and formatting of personalized text and traditional text becomes possible. Traditional visual text editing is designed with the keyboard as the main editing device. There are two core concepts:

1. Input focus, which is the position at which text is currently inserted or overwritten. For a text stream, it is a one-dimensional position coordinate; for a visual editing area, it corresponds to a two-dimensional coordinate (row and column). A flashing cursor is typically used to visualize its position. It can be moved with the arrow keys. Systems that support pointing devices (such as mice) can also use the pointing device to position the focus directly.

2. Selected text (that is, the text to be operated on). For a text stream, it is a pair of one-dimensional position coordinates. In general, the input focus and selected text cannot exist at the same time; the input focus can be understood as a selected text of zero length. The selected text is typically visualized by highlighting or reverse display. With the keyboard, the start and end of the selection are defined mainly by combining the arrow keys with specific function keys. With a pointing device such as a mouse, text is selected mainly by "press and hold, drag, release".

Traditional WYSIWYG visual text editing is based on applying commands to selected text. This kind of user interface is not natural for the increasingly popular touch devices, and handwriting input is also incompatible with existing visual editing methods. Touch devices, on the other hand, are very natural input devices for handwritten text. Therefore, building on existing visual text editing, we introduce an input mode to allow switching between different input methods, and in touch input mode we extend the "input focus" to an area, which improves visual text editing on touch devices. The concepts of input mode and input area introduced by the present invention are as follows.

1. Input mode. In addition to the original keyboard input, we also allow handwriting input. During input, the system must be in one of these two modes, and users can switch between them freely. In keyboard input mode, the user types text directly using a keyboard (virtual keyboard or numeric keypad) and uses the traditional visual editing interface. In touch input mode, the user writes in a specific area using a touch device (stylus or finger) and uses a touch-friendly visual editing interface.

2. Input area (i.e., input panel), valid only in handwriting input mode. It corresponds to the input focus in keyboard input mode. Unlike the input focus in a traditional editing system, the input area corresponds not to a one-dimensional position coordinate but to a two-dimensional region of the editing display. In handwriting input mode, the user can write text directly in the input area. The written text is presented directly in a WYSIWYG manner and participates in typesetting. The input area carries the line information of the current text layout, so that text written in the area corresponds directly to its position after typesetting. If there are no other restrictions, the most direct and natural input area is the display area of the current row or column. The user can change the current input area by touching outside it, or move the input area directly with movement commands.

For typesetting, different language cultures and different words have different typesetting rules. For example, Arabic characters are horizontally arranged from top to bottom and from right to left, while traditional Chinese is vertically from right to left and top to bottom. Personalized text must also follow the corresponding typographic rules.

However, no matter which typesetting rule is used, line segmentation is performed by accumulating character lengths. Like standardized text, personalized text based on open coding also has length information; but unlike standardized text encoding, open-coded personalized text has no special fixed-length space character. Instead, it can have space characters of different lengths (the length being an encoding parameter of the space character).
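Segmentation by accumulated length can be illustrated with a minimal greedy sketch. It is not one of the typesetting algorithms of the present description: it assumes that every text unit, including each variable-length space, is given as a `(kind, length)` pair measured along the writing direction, and it ignores language-specific rules such as punctuation handling or the treatment of spaces at line edges. The name `break_lines` is hypothetical.

```python
def break_lines(units, line_width):
    """Greedily split text units into lines by accumulating their lengths.

    units: list of (kind, length) pairs, kind being "char" or "space";
    a space carries its own length as an encoding parameter.
    """
    lines, current, used = [], [], 0
    for unit in units:
        kind, length = unit
        # Start a new line when the next unit would overflow the current one.
        if current and used + length > line_width:
            lines.append(current)
            current, used = [], 0
        current.append(unit)
        used += length
    if current:
        lines.append(current)
    return lines
```

The same accumulation applies to rows or columns; only the axis along which the lengths are measured changes.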

In addition, punctuation often participates in typesetting. However, in handwritten text, punctuation does not necessarily need to be recognized. Therefore, personalized punctuation marks are often combined with other characters and treated as ordinary characters.

Two typical typesetting algorithms are given below; algorithms for other typesetting rules can be derived by modifying them.

For input, in the handwriting input mode, you can write directly in the input area. The results entered do not need to be identified, but are instead translated into personalized text based on open coding. In this process, you need to identify the text and the spacing of the text. The typesetting rules also have a restrictive effect on this identification process.

For the deployment scheme of the object coding system: a computer data processing system based on open coding separates the object codes from the content of the data objects. As in traditional data processing systems, the text codes can reside in different kinds of storage: memory, files, databases, the network, or the cloud. The specific storage scheme for the text codes is therefore determined entirely by the requirements and architecture of the application system and is independent of the storage scheme of the corresponding encoding repository. What we discuss here is not the storage scheme of the text codes but the deployment scheme of the corresponding encoding repository. In addition, storing the text codes and the encoding repository in different storage systems can effectively improve system security: as mentioned above, an attacker must then break into both systems to finally obtain the text information.

In addition, the system architecture of the application system (stand-alone or networked, single-user or multi-user, browser-based or rich-client, and so on) is independent of the encoding repository deployment scheme. Of course, in the new data processing system, the same application system will have different security levels and performance characteristics under different encoding repository deployment schemes.

Figure 47 is a schematic diagram of the architecture of the in-app deployment. As shown in Figure 47, in-app deployment means that each application system has its own dedicated code repository. In such a deployment scenario, text content in an application can be identified and displayed only by that application; in other applications it appears as uninterpretable garbled text.

Text content in this deployment scenario has a higher level of security, at least between different applications, and the scheme can be used for personal applications with high security requirements. A "personal diary" is a typical example: the diary content can be opened only by the authorized application. The downside of in-app deployment is the other side of its security: data is hard to share.

FIG. 48 is a schematic diagram of the architecture of the terminal deployment. As shown in FIG. 48, unlike the in-application deployment, the terminal deployment of the coding warehouse is shared as a system service of the terminal system, and can be used by multiple applications at the same time. This deployment scheme also has a high level of security, because the text content that leaves the terminal cannot be used.

Figure 49 is a schematic diagram of the architecture of the mobile external device deployment. Terminal deployment of the encoding repository is suitable for personal applications with little need for sharing. However, with the popularity of mobile terminals and tablet devices, more and more individuals own multiple computing devices, which creates a need to share personal information among them. As shown in Figure 49, deploying the encoding repository on an accessible mobile device directly addresses this need. The mobile device can be a smart mobile terminal running an encoding repository service, a mobile storage device storing an encoding repository, or a dedicated encoding repository device.

For web deployments, language text is primarily used to communicate with others. Therefore, the main deployment of the code repository is network deployment. For Internet-wide networks, it is cloud deployment. As shown in Figure 50, all applications share the same code repository. In this way, all people using the application can use and exchange text information under the access control of the same code repository.

For a local area network or a corporate intranet, the network deployment of the coded warehouse is either a private cloud deployment or an internal server deployment, as shown in Figure 51. In this way, the encoding warehouse is isolated from the outside by the firewall, and the corresponding encoded content can only be used internally by the organization.

Figure 52 is a schematic diagram of the architecture of a peer-to-peer deployment, a special case of network deployment. As shown in FIG. 52, the code repository is temporarily or permanently shared with other users on the basis of in-application deployment or terminal deployment. A typical application is personal instant messaging: during a call, both parties share their code repositories, so they can communicate normally. If one party stops sharing its code repository when the call ends, the other party can no longer read that party's call history. In real life, this kind of security effect is sometimes needed.

The coding repository deployment scheme used by an application is not absolute and static. Application systems can mix different scenarios at the same time. Figure 53 is a schematic diagram of a hybrid deployment architecture. As shown in Figure 53, the same application can use three different code repositories. In this way, the application can be used in three different environments, and only the corresponding code repository needs to be switched.

Combined with the above description, the present embodiment is specifically combined with practical applications to implement enhancements and modifications to the conventional information system, and support for the text system based on the object encoding.

As shown in FIG. 54, the text in the traditional information system generally uses the text service provided by the operating system directly for input and display output. Since the object encoding in the new data processing system can be fully compatible with traditional text encoding, as shown in Figure 54, we can add support for the new data processing system by changing the text service of the operating system. In this way, traditional information systems can directly support the input and output of non-standard text (such as personalized handwritten text) without modification.

Specifically, regarding the transformation of object-based back-end storage: in existing software application systems, loading and storing of persistent data objects is handled by the data access module/component. When storing, the data access component writes the data of the application object directly to the application storage; when loading, the data access component reads the corresponding data from the application storage and instantiates it as an application object.

The system application level of the object coding system of the present invention can be implemented as follows, and the specific implementation method is not limited thereto. For example, the code repository can be set on the user side, on the third party server, or anywhere in the cloud storage.

See Figure 55: the object encoding system assigns codes to the data that needs to be loaded and stored, yielding the corresponding object codes. The application storage therefore holds mainly object codes and object code sequences, and the real application data must be obtained through the object encoding system. An indirection layer of "encoding" is thus introduced between the application system and the application data. Although this adds runtime and storage overhead, it also brings benefits such as security, flexibility, and efficiency, which is valuable in some applications.

As shown in FIG. 55, an application based on the object encoding system of the present invention stores the codes/code sequences it uses in the application storage. When storing, the data access component converts the data of the application object into data objects to be encoded according to the specific application logic; the object encoding system converts each data object into the corresponding code and returns it to the data access component, while the content of the data object itself is stored in the object encoding system; the data access component then stores the resulting code/code sequence in the application storage. When loading, the data access component obtains the required codes from the application storage and restores them to data objects through the object encoding system; finally, the data access component converts the data objects into an application object.
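The store and load paths described above can be sketched as follows. This is a minimal sketch under assumptions: the class names `ObjectEncodingSystem` and `DataAccess`, the sequential integer codes, and the splitting of an application object into a list of word strings are all hypothetical simplifications of the application logic.

```python
class ObjectEncodingSystem:
    """Holds the content of data objects and hands out codes for them."""

    def __init__(self):
        self._by_code = {}
        self._by_content = {}

    def encode(self, content):
        """Return the code for `content`, registering it on first use."""
        if content not in self._by_content:
            code = len(self._by_code) + 1
            self._by_content[content] = code
            self._by_code[code] = content
        return self._by_content[content]

    def decode(self, code):
        """Restore the data object content for a code."""
        return self._by_code[code]


class DataAccess:
    """Data access component: application storage holds only codes."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.app_storage = {}

    def store(self, key, words):
        # Content goes to the encoding system; only the code sequence is kept.
        self.app_storage[key] = [self.encoder.encode(w) for w in words]

    def load(self, key):
        # Codes are read from application storage and restored to content.
        return [self.encoder.decode(c) for c in self.app_storage[key]]
```

After `store("note", ["hello", "world", "hello"])`, the application storage contains only the code sequence `[1, 2, 1]`; without access to the object encoding system, the stored note cannot be read.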

A person skilled in the art will understand that all or part of the steps of the above method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or substitutions do not cause the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

A method for processing handwritten input characters, comprising:

Acquiring, in a currently activated first target row/column, a stroke input by a user and corresponding input information, wherein the input information includes an input position of the stroke in the first target row/column;

For each stroke, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to the input position of the stroke in the first target row/column and a character specified in the first target row/column.
The method according to claim 1, wherein creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to the input position of the stroke in the first target row/column and the character specified in the first target row/column, comprises:

Comparing the input position of the stroke in the first target row/column with the position information corresponding to the character specified in the first target row/column, and determining the correlation between the stroke and the character;

If the stroke is not associated with any character, a new character is created for the stroke, the stroke being attributed to the new character;

If the stroke is associated with at least one character, the stroke is attributed according to the associated at least one character.
The method according to claim 2, wherein said designated character is all characters existing in said first target row/column;

Or the specified character is a character in the area to be compared in the first target row/column, wherein a distance between a boundary position of the area to be compared and the stroke is less than a second preset threshold.
The method according to claim 2 or 3, wherein comparing the input position of the stroke in the first target row/column with the position information corresponding to the character specified in the first target row/column, and determining the correlation between the stroke and the character, comprises:

Comparing the input position of the stroke in the first target row/column with the position information corresponding to the character specified in the first target row/column, and determining whether the stroke overlaps with at least one stroke of the character; if the stroke overlaps with at least one stroke of the character, determining that the stroke is associated with the character; if the stroke overlaps with none of the strokes of the character, determining that the stroke is not associated with the character;

or,

For each character specified in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information corresponding to the character, and determining whether the distance between the stroke and the boundary of the character is less than a third preset threshold; if the distance between the stroke and the boundary of the character is less than the third preset threshold, determining that the stroke is associated with the character; if the distance between the stroke and the boundary of the character is not less than the third preset threshold, determining that the stroke is not associated with the character;

or,

For each designated character in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information corresponding to each stroke in the character, determining a minimum spacing value among the spacings between the stroke and the strokes of the character, and determining whether the minimum spacing value is less than a third preset threshold; if it is less, the stroke is associated with the character; if it is not less, the stroke is not associated with the character.
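As an illustration of the minimum-spacing alternative, the sketch below approximates each stroke by its axis-aligned bounding box and compares the smallest gap against a threshold. The bounding-box approximation, the `Box` type, and the function names are assumptions made for this example; the claim does not prescribe how spacing is measured.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Box:
    left: float
    top: float
    right: float
    bottom: float


def bounding_box(points: List[Point]) -> Box:
    """Axis-aligned bounding box of a stroke given as (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return Box(min(xs), min(ys), max(xs), max(ys))


def box_gap(a: Box, b: Box) -> float:
    """Euclidean gap between two boxes; 0.0 when they overlap or touch."""
    dx = max(a.left - b.right, b.left - a.right, 0.0)
    dy = max(a.top - b.bottom, b.top - a.bottom, 0.0)
    return (dx * dx + dy * dy) ** 0.5


def is_associated(stroke: List[Point],
                  character_strokes: List[List[Point]],
                  threshold: float) -> bool:
    """True when the minimum spacing to any stroke of the character is below the threshold."""
    stroke_box = bounding_box(stroke)
    min_gap = min(box_gap(stroke_box, bounding_box(s)) for s in character_strokes)
    return min_gap < threshold
```

With a positive threshold, an overlapping stroke has a gap of 0.0 and is therefore also treated as associated, so the overlap alternative is covered as a special case of this sketch.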
The method according to any one of claims 1 to 4, further comprising:

When a storage request is received, obtaining the metadata of the saved handwritten text according to a preset metadata stripping protocol, and stripping the obtained metadata from the handwritten text;

Dividing the handwritten text into at least two pieces of data according to a preset data content splitting specification.
The method of claim 5, further comprising:

Querying an encoding repository, selecting or creating an encoding specification according to at least a part of the metadata, and generating a meta code corresponding to the metadata according to the encoding specification; encoding the handwritten text according to the encoding specification to obtain an instance code; and acquiring a text code corresponding to the handwritten text according to the meta code and the instance code;

or,

Transmitting the handwritten text and the metadata to the encoding repository, so that the encoding repository selects or creates an encoding specification according to at least a part of the metadata, generates a meta code corresponding to the metadata according to the encoding specification, encodes the handwritten text according to the encoding specification to obtain an instance code, and acquires a text code corresponding to the handwritten text according to the meta code and the instance code; and receiving the text code returned by the encoding repository. The text code is in a reference encoding form or a content encoding form.
A data splitting method, comprising:

When a storage request carrying an identifier of data to be stored is received, obtaining, according to a preset metadata stripping protocol, the metadata in the data object corresponding to the identifier, and stripping the obtained metadata from the data object;

Dividing the data content into at least two data segments according to a preset data content splitting specification.
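The two steps of the splitting method can be sketched as below. The dictionary layout of the data object, the `content` key, and fixed-size segmentation are assumptions chosen for illustration; the claim leaves both the stripping protocol and the splitting specification open.

```python
from typing import Dict, List, Tuple


def strip_metadata(data_object: Dict[str, object],
                   metadata_keys: List[str]) -> Tuple[Dict[str, object], bytes]:
    """Separate the metadata fields from the data content of a data object."""
    metadata = {k: data_object[k] for k in metadata_keys if k in data_object}
    content = data_object["content"]  # hypothetical key holding the raw content
    if not isinstance(content, bytes):
        raise TypeError("content must be bytes in this sketch")
    return metadata, content


def split_content(content: bytes, segment_size: int) -> List[bytes]:
    """Split the content into fixed-size segments; at least two are required."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    segments = [content[i:i + segment_size]
                for i in range(0, len(content), segment_size)]
    if len(segments) < 2:
        raise ValueError("content must yield at least two segments")
    return segments
```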
The method according to claim 7, wherein after the dividing the data content into at least two data segments, the method further comprises:

Encoding each data segment separately according to a preset encoding separation protocol to obtain a code corresponding to each data segment;

Arranging the codes according to the original order of the data segments in the data content to obtain encoding order information.
The method of claim 8 further comprising:

Generating a unique identifier of the encoding order information based on the encoding order information, and/or generating a unique identifier of each data segment based on that data segment, wherein the unique identifier of the encoding order information and/or the unique identifier of each data segment is stored as part of the metadata.
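A minimal sketch of per-segment encoding with order information follows. Using a SHA-256 digest both as the code of a segment and as the unique identifier of the order information is an assumption made for this example; the claims do not name a particular encoding or identifier scheme.

```python
import hashlib
from typing import Dict, List, Tuple


def encode_segments(segments: List[bytes]) -> Tuple[Dict[str, bytes], List[str], str]:
    """Return (code -> segment store, codes in original order, order identifier)."""
    store: Dict[str, bytes] = {}
    order: List[str] = []
    for segment in segments:
        code = hashlib.sha256(segment).hexdigest()  # assumed per-segment code
        store[code] = segment                       # identical segments share one entry
        order.append(code)                          # preserves the original order
    # Assumed unique identifier of the encoding order information.
    order_id = hashlib.sha256("".join(order).encode("ascii")).hexdigest()
    return store, order, order_id
```

Because the segments and the order information are kept apart, neither the store nor the order list alone is enough to reconstruct the content.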
The method according to claim 8 or 9, wherein encoding each data segment separately according to the preset encoding separation protocol to obtain a code corresponding to each data segment comprises:

According to the preset encoding separation protocol, querying an encoding repository, selecting or creating an encoding specification according to at least a part of the metadata, and generating a meta code corresponding to the metadata according to the encoding specification; and encoding each data segment separately according to the encoding specification to obtain an instance code corresponding to each data segment;

or,

According to the preset encoding separation protocol, transmitting each data segment and the metadata to the encoding repository, so that the encoding repository selects or creates an encoding specification according to at least a part of the metadata, generates a meta code corresponding to the metadata according to the encoding specification, and encodes each data segment separately according to the encoding specification to obtain an instance code; and receiving the meta code and the instance code returned by the encoding repository.
A data merging method, comprising:

Receiving a data object acquisition request carrying the identification information; wherein the identification information includes positioning information, and the positioning information is used to locate a storage address of part of the data information in the data object;

Acquiring the storage content corresponding to the positioning information, and acquiring data information in other storage content according to the positioning information contained in the acquired storage content, until all data information of the data object is obtained;

Merging the acquired data information according to a preset merge protocol in the acquired data information, to obtain the data object.
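The merging flow can be illustrated with a small sketch in which each stored record carries a piece of data plus positioning information for the next record. The in-memory `storage` dictionary, the `data` and `next` field names, and concatenation as the merge rule are all assumptions for this example.

```python
from typing import Dict, List, Optional


def collect_data_information(storage: Dict[str, dict], start_address: str) -> List[bytes]:
    """Follow positioning information from record to record until none remains."""
    collected: List[bytes] = []
    seen = set()
    address: Optional[str] = start_address
    while address is not None:
        if address in seen:
            raise ValueError("cyclic positioning information")
        seen.add(address)
        record = storage[address]
        collected.append(record["data"])
        address = record.get("next")  # positioning information for the next record
    return collected


def merge(collected: List[bytes]) -> bytes:
    """Assumed merge rule: concatenate the pieces in the order they were located."""
    return b"".join(collected)
```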
The method according to claim 11, wherein, when the type of the data information is a combination of data segments, encodings, and an encoding order, merging the acquired data information according to the preset merge protocol in the acquired data information to obtain the data object comprises:

Performing a decoding operation on each encoding according to the merging algorithm in the preset merge protocol to obtain the data segment corresponding to the encoding; and arranging the decoded data segments according to the encoding order to obtain the data object in which the data segments are arranged in their original order.
The method according to claim 12, wherein performing the decoding operation on the encoding according to the merging algorithm in the preset merge protocol to obtain the data segment corresponding to the encoding comprises:

Disassembling the data information according to a merge algorithm in a preset merge protocol, obtaining a meta code, or the meta code and an instance code;

Querying an encoding repository, and obtaining corresponding metadata and an encoding specification according to the meta code;

Obtaining a data object corresponding to the data information according to the metadata and the encoding specification, or the metadata, the encoding specification, and the instance encoding.
An encoding processing method, comprising:

Acquiring the data object to be encoded and its metadata according to the received encoding processing request;

Obtaining an object encoding of the data object according to the encoding repository and the data object and its metadata.
The method of claim 14, wherein obtaining the object code of the data object according to the encoding repository and the data object and its metadata comprises:

Selecting or creating an encoding specification according to the encoding repository and at least a portion of the metadata, and generating a meta code corresponding to the metadata according to the encoding specification;

Encoding the data content of the data object according to the encoding specification to obtain an instance code, and acquiring an object code corresponding to the data object according to the meta code and the instance code;

The object code is a reference coded form or a content coded form.
The method according to claim 15, wherein the encoding the data content of the data object according to the encoding specification to obtain an instance code comprises:

Performing serialization processing on the data content of the data object according to the encoding specification to obtain a serialization result, wherein the instance code is the serialization result;

or,

Performing serialization processing on the data content of the data object according to the encoding specification to obtain a serialization result, and saving the serialization result in the encoding repository to obtain an object number in the encoding repository, wherein the instance code is the object number.
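The two instance-code alternatives can be sketched as follows: in the content form the instance code is the serialization result itself, and in the reference form it is the object number under which the result is stored. The `EncodingRepository` class, the use of JSON for serialization, the selection of a specification by a `type` metadata field, and integer meta codes are all assumptions for illustration.

```python
import json
from typing import Any, Dict, List, Tuple


class EncodingRepository:
    """Hypothetical in-memory encoding repository."""

    def __init__(self) -> None:
        self._specs: List[Dict[str, Any]] = []  # list index serves as the meta code
        self._objects: List[str] = []           # list index serves as the object number

    def meta_code_for(self, metadata: Dict[str, Any]) -> int:
        """Select an existing specification by part of the metadata, or create one."""
        key = {"type": metadata["type"]}        # assumption: only 'type' selects the spec
        for code, spec in enumerate(self._specs):
            if spec == key:
                return code
        self._specs.append(key)
        return len(self._specs) - 1

    def encode(self, content: Any, metadata: Dict[str, Any],
               by_reference: bool = False) -> Tuple[int, Any]:
        """Return the object code as a (meta code, instance code) pair."""
        meta_code = self.meta_code_for(metadata)
        serialized = json.dumps(content, sort_keys=True)
        if by_reference:
            # Reference form: store the serialization, return its object number.
            self._objects.append(serialized)
            return meta_code, len(self._objects) - 1
        # Content form: the serialization result itself is the instance code.
        return meta_code, serialized
```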
The method according to any one of claims 14 to 16, further comprising:

Setting access rights to the data in the encoding repository.
The method according to claim 15 or 16, wherein encoding the data content of the data object according to the encoding specification to obtain an instance code comprises:

Acquiring a context object;

Obtaining a corresponding coding space according to the context object and the encoding specification;

In the coding space, the data content in the data object is encoded to obtain an instance code.
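One way to picture context-dependent encoding is a set of coding spaces keyed by context and specification, where each space assigns its own codes. The `CodingSpaces` class, string identifiers for contexts and specifications, and sequential integer codes are hypothetical choices for this sketch.

```python
from typing import Dict, Tuple


class CodingSpaces:
    """Hypothetical registry of coding spaces, one per (context, specification) pair."""

    def __init__(self) -> None:
        self._spaces: Dict[Tuple[str, str], Dict[str, int]] = {}

    def space_for(self, context_id: str, spec_name: str) -> Dict[str, int]:
        """Obtain the coding space for a context and specification, creating it if absent."""
        return self._spaces.setdefault((context_id, spec_name), {})

    def encode(self, context_id: str, spec_name: str, content: str) -> int:
        """Encode content inside its coding space; codes are assigned sequentially."""
        space = self.space_for(context_id, spec_name)
        if content not in space:
            space[content] = len(space)
        return space[content]
```

In this sketch the same content can receive different instance codes in different contexts, so an instance code is only meaningful together with its coding space.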
The method according to any one of claims 14 to 18, wherein the meta code comprises a combination and/or nesting of one or more of the following: a type code, a space code, and a context code.
A decoding processing method, comprising:

Receiving a decoding processing request, and acquiring an object encoding to be decoded according to the decoding processing request;

Decomposing the object code to obtain a meta code, or the meta code and an instance code;

Querying an encoding repository, and obtaining corresponding metadata and an encoding specification according to the meta code;

Obtaining a data object corresponding to the object encoding according to the metadata and the encoding specification, or the metadata, the encoding specification, and the instance encoding.
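The decoding steps can be sketched as below. The `meta:instance` string layout of the object code, the `REPOSITORY` dictionary, and JSON as the decoding routine of the specification are assumptions for this example; the claim does not define a concrete object code format.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical repository: meta code -> metadata plus a decoding routine
# standing in for the encoding specification.
REPOSITORY: Dict[str, Dict[str, Any]] = {
    "m1": {"metadata": {"type": "note"}, "decode": json.loads},
}


def decode_object_code(object_code: str,
                       repository: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Decompose an object code, look up its meta code, and rebuild the data object."""
    # Split at the first ':' only, so the instance code may itself contain colons.
    meta_code, separator, instance_code = object_code.partition(":")
    entry = repository[meta_code]
    metadata = entry["metadata"]
    if not separator:
        # Only a meta code: the object is described by metadata and specification alone.
        return {"metadata": metadata, "content": None}
    decode: Callable[[str], Any] = entry["decode"]
    return {"metadata": metadata, "content": decode(instance_code)}
```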
The method according to claim 20, wherein acquiring the data object corresponding to the object code according to the metadata and the encoding specification, or according to the metadata, the encoding specification, and the instance code, comprises:

Acquiring a context object;

Obtaining a corresponding coding space according to the context object and the encoding specification;

Decoding the instance code in the coding space to obtain corresponding data content;

Obtaining a data object corresponding to the object encoding according to the metadata and the data content.