WO2016023471A1 - Methods for processing handwritten inputted characters, splitting and merging data and encoding and decoding processing - Google Patents

Publication number
WO2016023471A1
WO2016023471A1 PCT/CN2015/086672 CN2015086672W WO2016023471A1 WO 2016023471 A1 WO2016023471 A1 WO 2016023471A1 CN 2015086672 W CN2015086672 W CN 2015086672W WO 2016023471 A1 WO2016023471 A1 WO 2016023471A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
encoding
stroke
character
metadata
Prior art date
Application number
PCT/CN2015/086672
Other languages
French (fr)
Chinese (zh)
Inventor
张锐
Original Assignee
张锐
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 张锐
Priority to CN201580042761.6A (patent CN106575166B)
Priority to CN202310088220.3A (patent CN116185209A)
Publication of WO2016023471A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048: Interaction techniques based on graphical user interfaces [GUI]

Definitions

  • the present invention relates to data processing technologies, and in particular, to a method for processing handwritten input characters, data splitting and merging, and encoding and decoding.
  • text encoding is the most basic form of encoding: humans can input, view, edit and modify it, and computers can analyze and process it.
  • ASCII text encoding standards to today's Unicode
  • standardized text encoding is a basis for transferring information between people and machines and various systems.
  • the existing standardized text encoding is far from enough.
  • standard text encoding and its corresponding text input methods have gradually become the bottleneck of human natural output into the digital world.
  • Standard text-based coding allows humans to participate in the process of data creation, viewing, debugging, and modification, facilitating integration and exchange between different systems, improving the speed of system development, and reducing the cost of system troubleshooting.
  • the text format is redundant for the expression of symbolized data and binary data.
  • as the complexity of the structures a system must express increases, the complexity of the markup and syntax built on text encoding increases greatly, and data redundancy increases with it.
  • because a specific text encoding standard has a limited number of codes, conflicts between the data content and the syntax markers in the encoding are inevitable, and text escaping adds further data redundancy.
  • for computers, binary data is the natural form of data representation. Human-defined text-format data is also converted into binary data to reduce redundancy and improve processing and transmission efficiency. There are also general-purpose binary encoding methods, such as ASN.1 (a standard of the International Organization for Standardization and the International Telecommunication Union), Google's Protocol Buffers, Apache Thrift and Avro, as well as BSON, MessagePack and so on. However, in contrast to text-based encoding, binary data is relatively closed, hard to exchange, and hard for humans to work with directly.
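The trade-off described above can be made concrete with a small sketch. It uses only the Python standard library; the sample values and the choice of JSON as the text format are assumptions for illustration and are not taken from the patent.

```python
import json
import struct

# Hypothetical sample data: three unsigned 32-bit integers.
values = [1234567890, 2000000000, 42]

# Text encoding: readable and editable by a person, but longer.
text_form = json.dumps(values).encode("utf-8")

# Binary encoding: three fixed-width little-endian 32-bit fields.
binary_form = struct.pack("<3I", *values)

# Escaping: characters that collide with the syntax markers of the text
# format (here the quote and the backslash) must be escaped, which adds bytes.
raw = 'say "hi" \\ bye'
escaped = json.dumps(raw)

print(len(text_form), len(binary_form))
print(len(raw), len(escaped))
```

The binary form is compact but cannot be read without knowing the layout `"<3I"`, which is the closedness the text refers to.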
  • encoding, whether text encoding or binary encoding, serves two purposes. One is to describe the data object itself; this is also called serialization and is referred to here as the content encoding of the data object. The other is to refer to a data object, which is referred to here as reference encoding.
  • the aforementioned coding standards and methods are mainly used for content coding.
  • Text-based reference encodings include URNs, URLs, the object identifier (OID) in ASN.1, etc.; binary-based reference encodings include database keys, UUID/GUID, IP addresses, MAC addresses, MD5, SHA-1, etc.
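A minimal sketch of the distinction between content encoding and reference encoding, using only the Python standard library. The sample document and the in-memory `store` dictionary are hypothetical and serve only to illustrate the two roles.

```python
import hashlib
import json
import uuid

# Hypothetical data object.
document = {"title": "note", "body": "hello"}

# Content encoding (serialization): the bytes describe the object itself.
content = json.dumps(document, sort_keys=True).encode("utf-8")

# Reference encodings: short identifiers that point at the object.
random_ref = uuid.uuid4()                       # UUID, independent of the content
digest_ref = hashlib.sha1(content).hexdigest()  # SHA-1, derived from the content

# A reference is only useful together with a store that resolves it.
store = {digest_ref: content}
restored = json.loads(store[digest_ref])
```

The content encoding can be decoded on its own, whereas a reference has to be resolved through some store or registry before the object is available.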
  • a first aspect of the present invention provides a method for processing handwritten input characters, including:
  • the technical effect of the first aspect of the present invention is to provide a method for processing handwritten input characters in which characters are segmented while the user writes: the user does not need to distinguish different characters by explicitly or implicitly issuing a "start a single text input" or "end a single text input" command. It is therefore unnecessary to pause for a period of time or to interact with the system during writing, and the writing process is smooth and efficient.
  • in the method, the character to which a stroke belongs is determined directly by the input position of the stroke, and recognition against standard characters is not required, so the personalized information, writing style and characteristics of the user's handwriting input can be retained.
  • a second aspect of the present invention provides a data splitting method, including:
  • according to a preset metadata stripping protocol, the metadata in the data object corresponding to the identifier of the data to be stored is obtained, and the obtained metadata is stripped from the data object.
  • the data content is divided into at least two data segments according to a preset data content splitting specification.
  • the technical effect of the second aspect of the present invention is to provide a data splitting method that separates the metadata in the user's original data from the data content and divides the data content into a plurality of data segments, thereby making it harder to acquire the user's original data illegally and making data storage more secure.
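The splitting steps of the second aspect can be sketched as follows. This is a minimal illustration under stated assumptions: the data object is modelled as a dictionary with a `content` field, the "metadata stripping protocol" is reduced to a set of key names, and the "splitting specification" is reduced to a fixed segment size. None of these names or choices comes from the patent.

```python
def split_data_object(obj: dict, metadata_keys: set, segment_size: int):
    """Strip the metadata from a data object, then cut its content into segments.

    metadata_keys stands in for the preset metadata stripping protocol and
    segment_size for the preset data content splitting specification
    (both hypothetical simplifications).
    """
    # Step 1: strip the metadata from the data object.
    metadata = {k: v for k, v in obj.items() if k in metadata_keys}
    content = obj["content"]

    # Step 2: divide the data content into fixed-size segments.
    segments = [content[i:i + segment_size]
                for i in range(0, len(content), segment_size)]
    if len(segments) < 2:
        raise ValueError("content too short to yield at least two segments")
    return metadata, segments


example = {"name": "a.txt", "owner": "u1", "content": b"abcdefghij"}
meta, parts = split_data_object(example, {"name", "owner"}, 4)
```

In a real deployment the metadata and each segment would then be stored separately; the fixed-size cut is only the simplest possible splitting rule.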
  • a third aspect of the present invention provides a data merging method comprising:
  • the identification information includes positioning information, and the positioning information is used to locate a storage address of part of the data information in the data object;
  • the technical effect of the third aspect of the present invention is to provide a data merging method in which the pieces of data information split across the storage bodies are obtained by stepwise positioning, according to the positioning information included in the identification information of the data object acquisition request, and are then combined according to a preset merge rule to obtain the user's original data object. This ensures that the data dispersed across the storage bodies can be acquired efficiently and safely, and that the user can reliably merge the scattered data back into the original data.
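The stepwise positioning described for the merging method can be sketched as a chain of locators. The layout below is an assumption made for illustration: each storage body is a dictionary, each entry holds one segment plus the locator of the next segment, and the "preset merge rule" is plain concatenation in chain order.

```python
def merge_data_object(stores: dict, first_locator: tuple) -> bytes:
    """Follow locators from segment to segment and concatenate the segments.

    stores maps a storage-body name to {key: (segment, next_locator)};
    a locator is a (storage_body_name, key) pair or None at the end.
    """
    parts = []
    seen = set()
    locator = first_locator
    while locator is not None:
        if locator in seen:
            raise ValueError("cyclic locator chain")
        seen.add(locator)
        store_name, key = locator
        # Each step locates one piece of data and the address of the next.
        segment, locator = stores[store_name][key]
        parts.append(segment)
    return b"".join(parts)


# Hypothetical storage bodies holding two segments of one object.
stores = {
    "A": {"k1": (b"hel", ("B", "k7"))},
    "B": {"k7": (b"lo", None)},
}
merged = merge_data_object(stores, ("A", "k1"))
```

Only the first locator has to be carried in the acquisition request; every later address is discovered from the previously fetched piece.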
  • a fourth aspect of the present invention provides a coding processing method, including:
  • the technical effect of the fourth aspect of the present invention is as follows: a data object to be encoded and its metadata are obtained according to a received encoding processing request, and an object encoding of the data object is obtained according to the encoding warehouse, the data object and its metadata. Because the data object can be encoded according to its metadata and the encoding warehouse, a flexible and diverse encoding method is realized.
  • a fifth aspect of the present invention provides a decoding processing method, including:
  • the technical effect of the fifth aspect of the present invention is as follows: a decoding processing request is received and the object encoding to be decoded is obtained from it; the object encoding is disassembled to obtain a meta encoding, or a meta encoding and an instance encoding; the code repository is queried to obtain the corresponding metadata and coding specification according to the meta encoding; and the data object corresponding to the object encoding is obtained according to the metadata and the coding specification, or according to the metadata, the coding specification and the instance encoding. Because data objects are encoded by means of the metadata and the encoding warehouse, the encoding method is flexible and also saves space to a certain extent, and disassembling the meta encoding and consulting the encoding warehouse effectively improves decoding efficiency.
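The relationship between object encoding, meta encoding, instance encoding and the code repository can be sketched as below. Everything concrete here is an assumption for illustration: the one-byte meta encoding, the two repository entries, and the use of `struct` formats as the "coding specification" are not taken from the patent.

```python
import struct

# Hypothetical code repository: meta encoding -> (metadata, coding specification).
CODE_REPOSITORY = {
    0x01: ({"type": "uint16"}, ">H"),
    0x02: ({"type": "float64"}, ">d"),
}


def encode_object(value, type_name: str) -> bytes:
    """Object encoding = one-byte meta encoding + instance encoding."""
    for meta_code, (metadata, fmt) in CODE_REPOSITORY.items():
        if metadata["type"] == type_name:
            return bytes([meta_code]) + struct.pack(fmt, value)
    raise KeyError(type_name)


def decode_object(object_code: bytes):
    """Disassemble the object encoding and look the meta encoding up."""
    meta_code, instance_code = object_code[0], object_code[1:]
    metadata, fmt = CODE_REPOSITORY[meta_code]
    (value,) = struct.unpack(fmt, instance_code)
    return metadata, value


encoded = encode_object(513, "uint16")
decoded = decode_object(encoded)
```

The encoded form carries no type description of its own; the decoder recovers it from the repository, which is where the space saving mentioned above comes from.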
  • FIG. 1A is a flowchart of an embodiment of a method for processing handwritten input characters according to the present invention
  • FIG. 1B is a first schematic diagram of a character in a method for processing handwritten input characters according to an embodiment of the present invention
  • FIG. 1C is a second schematic diagram of a character in a method for processing handwritten input characters according to an embodiment of the present invention
  • FIG. 1D is a schematic diagram of a method for processing handwritten input characters according to an embodiment of the present invention
  • FIG. 1E is a schematic diagram of a state in which a character is inserted in a method for processing handwritten input characters according to an embodiment of the present invention
  • FIG. 1F is a schematic diagram of an editing mode under a selection processing command in an embodiment of a method for processing handwritten input characters according to the present invention
  • FIG. 1G is a schematic diagram of a blank character in an embodiment of a method for processing handwritten input characters according to the present invention
  • FIG. 1H is a flowchart of text editing in an embodiment of a method for processing handwritten input characters according to the present invention
  • FIG. 1I is a flowchart of a handwriting program source code conversion method in an embodiment of a method for processing handwritten input characters provided by the present invention
  • FIG. 1J is a detailed flowchart of “standard code conversion for B” in the handwriting program source code conversion method shown in FIG. 1I;
  • FIG. 1K is a schematic diagram of a handwriting program in an embodiment of a method for processing handwritten input characters according to the present invention
  • FIG. 1L is a schematic structural diagram of an embodiment of a device for processing handwritten input characters according to the present invention.
  • FIG. 2A is a flowchart of a data splitting method according to an exemplary embodiment
  • FIG. 2B-1 is a flowchart of a data splitting method according to another exemplary embodiment
  • FIG. 2B-2 is a structural diagram of a system in which the data object of the data splitting method is audio data according to the present invention
  • FIG. 2B-3 is a time domain analysis diagram of data objects of the data splitting method according to the present invention.
  • FIG. 2B-4 is a diagram of a speech text coding table in which the data object of the data splitting method is audio data according to the present invention
  • FIG. 2B-5 is a schematic diagram showing a voice text of a data object of the data splitting method according to the present invention.
  • FIG. 2B-6 is another schematic diagram showing the voice text of the data object in the data splitting method according to the present invention.
  • FIG. 2B-7 is still another schematic diagram of a voice text of a data object of the data splitting method according to the present invention.
  • FIG. 2B-8 is yet another schematic diagram of a voice text of a data object of the data splitting method according to the present invention
  • FIG. 2C is a diagram showing the positional relationship of a data splitting method in a computer system hierarchy according to the present invention.
  • FIG. 2D is a flowchart of a data merging method according to an exemplary embodiment
  • FIG. 2E is a flowchart of a data merging method according to another exemplary embodiment
  • FIG. 2F is a schematic structural diagram of a data splitting apparatus according to an exemplary embodiment
  • FIG. 2G is a schematic structural diagram of a data splitting apparatus according to another exemplary embodiment
  • FIG. 2H is a schematic structural diagram of a data merging apparatus according to an exemplary embodiment
  • FIG. 2I is a schematic structural diagram of a data merging apparatus according to another exemplary embodiment
  • FIG. 2J is an exemplary data splitting flowchart
  • FIG. 2K is another exemplary data splitting flowchart
  • FIG. 2L is an exemplary data merge flowchart
  • FIG. 2M is a schematic diagram of an exemplary data split description language definition
  • FIG. 2N is a flow chart of an exemplary data split description language visualization
  • Figure 2O is a diagram showing the relationship between concepts in the three concepts of the present invention.
  • FIG. 3 is a schematic diagram of a meta model in the prior art
  • FIG. 4 is a schematic structural diagram of an encoding system of the present invention.
  • FIG. 5C is a flowchart of Embodiment 1 of a coding processing method according to the present invention.
  • FIG. 5D is a flowchart of a specific implementation manner of step 102C in FIG. 5C;
  • Figure 7 is a schematic diagram of the core coding metamodel
  • FIG. 8 is a conceptual model of object coding, meta-encoding, instance coding (that is, object reference coding with the meta-coded part removed), and data objects and coding meta-objects;
  • FIG. 9 is a diagram showing an example of meta-encoding in the embodiment.
  • Figure 10 is a diagram showing an example of a layer-by-layer correlation of a coded meta-object (variable-length coding of 16-bit word length);
  • FIG. 11 is a schematic diagram of a meta model corresponding to a code
  • Figure 12 is a schematic diagram of a conceptual model of the object encoding
  • FIG. 13 is a flowchart of Embodiment 2 of an encoding processing method according to the present invention.
  • FIG. 14 is a flowchart of Embodiment 3 of a coding processing method according to the present invention.
  • FIG. 15 is a schematic diagram of a glyph corresponding to a non-standard character encoding stored in an encoding warehouse in the handwriting input system of the embodiment;
  • FIG. 16 is a core conceptual diagram of an encoding metamodel of an exemplary context-dependent object encoding system
  • FIG. 17 is a schematic diagram of a basic object that can be applied to a basic coding space
  • FIG. 18 is a schematic diagram showing the coding structure of a 128 fixed length coding scheme
  • Figure 19 is a schematic diagram of four binary bits being four spatial bits
  • Figure 20 is a diagram showing an example of a coding scheme
  • FIG. 21 is a diagram showing an example of a coding scheme of UTF-8;
  • Figure 22 is a schematic diagram of an object encoding consisting of a meta encoding and an instance encoding
  • Figure 23 is a detailed view of the encoding
  • Figure 24 is a rendering result diagram
  • FIG. 25 is a schematic diagram of code points other than UTF-8 of OTF-8;
  • Figure 26 is a schematic diagram of the coding to be defined
  • FIG. 27 is a flowchart of Embodiment 4 of a coding processing method according to the present invention.
  • Figure 29 is a schematic diagram of coding combination
  • FIG. 30 is a flowchart of Embodiment 5 of a coding processing method according to the present invention.
  • Figure 31 is a handwriting input program
  • FIG. 33 is a flowchart of Embodiment 2 of a decoding processing method according to the present invention.
  • FIG. 34 is a flowchart of Embodiment 3 of a decoding processing method according to the present invention.
  • FIG. 35 is a flowchart of Embodiment 4 of a decoding processing method according to the present invention.
  • Figure 36 is the content of the handwritten input
  • Figure 37 is a schematic view showing the length of the character pitch
  • Figure 38 is a schematic diagram of a decoding process
  • Figure 39 is a diagram showing an example of a mixed encoded content display
  • Figure 40 is a schematic diagram of the contents of the output
  • Figure 41 is a schematic view showing the strobe stroke falling on the result of the character output
  • Figure 42 is a schematic diagram of adding a standard smiley face icon
  • Figure 43 is a schematic view of an online Go
  • FIG. 44 is a schematic structural diagram of a first embodiment of an encoding processing system according to the present invention.
  • FIG. 45 is a schematic structural diagram of a first embodiment of a decoding processing system according to the present invention.
  • FIG. 46 is a schematic structural diagram of a word processing system mainly based on an object coding system
  • FIG. 47 is a schematic diagram of an architecture of an in-application deployment
  • FIG. 49 is a schematic structural diagram of a mobile external device deployment
  • Figure 50 is a schematic diagram of an architecture in which an application shares the same code repository
  • Figure 51 is a diagram showing an example of a network deployment of a code repository being a private cloud deployment or an internal server deployment;
  • Figure 52 is a schematic diagram of the architecture of a point-to-point deployment
  • Figure 53 is a schematic diagram of a hybrid deployment architecture
  • Figure 54 is an architectural diagram of an extended operating system to allow legacy applications to support object encoding
  • Figure 55 is a diagram showing the interaction of an object encoding system and an application system based on the present invention.
  • cloud storage systems and related applications have emerged.
  • the so-called cloud storage system refers to storing user data in a server in the cloud.
  • users can use different terminal devices to access data in the cloud storage at any time, eliminating the migration of data between different terminal systems.
  • users do not need to keep upgrading their storage devices, because cloud storage services provide sufficient scalability to handle a variety of storage needs.
  • Traditional data maintenance tasks, such as data backup and encryption are also transferred to cloud storage servers, which are often more professional and efficient.
  • some data usage patterns different from traditional applications also appear, such as data sharing and network collaboration.
  • a desktop agent is a cloud storage client that is based on a file system.
  • the desktop agent synchronizes a specific folder on the terminal with the cloud storage: files stored in the folder are automatically uploaded to the server by the agent, and other uploaded files received by the server are automatically downloaded to the corresponding folder through the agent.
  • files of the same user are automatically synchronized on different terminals. Users can seamlessly use the data in this folder across platforms in a traditional way.
  • the desktop agent can also automatically synchronize shared folders to different users' terminals, thus facilitating convenient data sharing and cooperation.
  • Dropbox is a typical desktop agent.
  • Cloud storage systems bring convenient and efficient data access and sharing. But storing data in the cloud raises an inevitable concern: the protection of security and privacy. The security of core data depends completely on the cloud storage system. For this reason, many organizations and individuals do not put their data, or at least their critical data, in cloud storage systems. There are two main hidden dangers here: one is that the data in the cloud storage is protected by the user's identity authentication. Once the user's identity is stolen, all of that user's cloud data is exposed to the thief.
  • the invention mainly relates to a data processing method, system and application, and effectively solves the above problems in the following aspects.
  • it involves the following three areas of innovation: (1) a novel handwriting input method and system, especially a method for splitting handwritten input characters; (2) an object-based open encoding/decoding solution that can encode or decode any data object with any free and open encoding method; and (3) an object-based data splitting/merging method that splits or strips the metadata and/or encoded data of a data object from the corresponding data content to guarantee the security of the data content.
  • These technical solutions can be implemented separately or in combination, or combined with other technical fields, alone or in combination.
  • the invention has broad application prospects and great application value.
  • the specific plan is as follows:
  • the invention provides a data object based encoding method, the method comprising:
  • the object encoding generation in step c) includes: generating a meta-code and/or an instance code for the data object according to a predetermined rule, and generating the object encoding from the meta-code and/or the instance code.
  • a step of compressing and/or encrypting the data object is further included before step a), and a step of encrypting the generated object encoding is further included after step c).
  • the meta-coding comprises one of the following encodings, or a combination and/or nesting of two or more of them: spatial encoding, context encoding, and type encoding.
  • the method further includes a data splitting step of splitting a large data object into small data blocks (or data segments) according to a predetermined rule; steps a) to c) are performed on each of the split data blocks during or after the data splitting process until all the data blocks have been encoded.
  • the invention also provides a data object based decoding method, the method comprising:
  • the step of decoding the object in step b) comprises: disassembling the object code into a meta-code and/or an instance code according to the rule predetermined at the time of encoding.
  • an authorization verification step for acquiring the object encoding and/or the rule predetermined at the time of encoding is further included.
  • the invention also provides a handwritten input character splitting method, the method comprising:
  • step c) is performed in one of the following cases: 1) during the input and writing of the current stroke; 2) after the input of the current stroke is completed (i.e., after the pen is lifted); or 3) after the input of the current line is completed.
  • the current stroke is only compared with the strokes and/or characters within the predetermined range one by one.
  • step c) comprises:
  • if the currently entered stroke is spatially the first stroke in the row/column and is not associated with other characters (or strokes) already entered in the current row/column, or if it is spatially the last stroke in the row/column and is not associated with other characters (or strokes) already entered in the current row/column, a new character is created for the stroke;
  • if the current stroke is spatially neither the first nor the last stroke in the row/column, the spacing between the current stroke and all the characters it passes is compared, and the stroke is attributed to the associated one or more characters (or strokes).
  • a threshold (MIN_GAP) for the minimum distance between a stroke and a character, or between two strokes, is preset; the spacing between each stroke and the other characters or strokes already entered is compared with the threshold to determine the association between the stroke and those characters or strokes.
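The MIN_GAP comparison can be sketched for horizontal writing as follows. This is a simplified illustration, not the patent's procedure: strokes and characters are reduced to horizontal extents `(x_min, x_max)`, the threshold value is arbitrary, and merging several associated characters into one is an assumption about how "attributed to one or more characters" might be handled.

```python
MIN_GAP = 10  # hypothetical threshold, in pixels


def horizontal_gap(a: tuple, b: tuple) -> int:
    """Gap between two (x_min, x_max) extents; 0 if they overlap."""
    return max(0, a[0] - b[1], b[0] - a[1])


def assign_stroke(characters: list, stroke: tuple) -> list:
    """Attribute a stroke to the existing characters or create a new one.

    characters is a list of (x_min, x_max) character extents in the row.
    Returns the updated list, sorted from left to right.
    """
    related = [c for c in characters if horizontal_gap(c, stroke) < MIN_GAP]
    unrelated = [c for c in characters if horizontal_gap(c, stroke) >= MIN_GAP]
    # With no related character this is a new character; otherwise the
    # stroke and every related character are merged into one extent.
    merged = (min([stroke[0]] + [c[0] for c in related]),
              max([stroke[1]] + [c[1] for c in related]))
    return sorted(unrelated + [merged])


row = []
row = assign_stroke(row, (0, 20))    # first stroke: new character
row = assign_stroke(row, (25, 40))   # gap 5 < MIN_GAP: same character
row = assign_stroke(row, (60, 80))   # gap 20 >= MIN_GAP: new character
```

Because only positions are compared, no recognition of the written shape is needed, which is the property the method relies on.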
  • the method further includes: recording, when receiving each input stroke, the input time and the input position information of each stroke.
  • the input time includes a pen down time and a pen up time
  • the input position includes at least: the position when the pen is put down, the position when the pen is lifted, and the coordinate position of each point in the handwriting of the stroke.
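A stroke record holding the input time and input position described above might look like the following sketch. The class name, the field names and the derived bounding box are hypothetical; the text only requires that the pen-down time, pen-up time and the point coordinates be recorded.

```python
from dataclasses import dataclass


@dataclass
class Stroke:
    """One recorded stroke (hypothetical field names)."""
    pen_down_time: float   # time at which the pen touched the screen
    pen_up_time: float     # time at which the pen was lifted
    points: list           # (x, y) coordinates of every point, in order

    @property
    def pen_down_position(self) -> tuple:
        return self.points[0]

    @property
    def pen_up_position(self) -> tuple:
        return self.points[-1]

    def bounding_box(self) -> tuple:
        """Return (left, top, right, bottom) of the stroke's handwriting."""
        xs = [x for x, _ in self.points]
        ys = [y for _, y in self.points]
        return (min(xs), min(ys), max(xs), max(ys))


stroke = Stroke(pen_down_time=0.00, pen_up_time=0.35,
                points=[(12, 40), (15, 52), (9, 61)])
```

The pen-down and pen-up positions are derived from the point list rather than stored twice, so the record cannot become inconsistent.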
  • the invention also provides an object-based data object splitting method, the method comprising:
  • the data splitting/stripping protocol comprises at least one of the following, or a combination of two or more: 1) a data content splitting protocol, which records the method and process of splitting the data content; 2) a metadata stripping protocol, which records the method and process of separating the corresponding metadata from the data object; and 3) if an encoding is generated during the data splitting process, an encoding separation protocol, which records the encoding rules and encoding process between the encoding and the encoded object.
  • step c) further comprising the step d): reassembling the split data segments.
  • At least a part of the metadata constitutes split metadata.
  • the invention also provides an object-based data object merging method, the method comprising:
  • the method further includes a storing step of storing the split/stripped data segments separately in different storage bodies or under different secure channels.
  • FIG. 1A is a flowchart of an embodiment of a method for processing handwritten input characters according to the present invention.
  • the method for processing handwritten input characters provided by the embodiment can be closer to people's natural writing habits than the existing handwriting input system, and at the same time completely and truly preserve the writing style and features of the writer.
  • the method in this embodiment may include:
  • Step 101A: in the currently activated first target row/column, acquire a stroke input by the user and the corresponding input information, wherein the input information includes the input position of the stroke in the first target row/column.
  • the execution subject in this embodiment may be a handwriting input device such as a conventional touch screen, handwriting screen, or other suitable handwriting device, or directly adapted to the handwriting system of the present embodiment.
  • the present embodiment may employ a touch screen type handwriting input device, that is, an input device that can directly input information on the screen by handwriting or by means of a dedicated or non-dedicated writing tool.
  • the embodiment can be applied to any writing mode, and the writing mode can be set by the user or the default setting.
  • the writing modes described in this embodiment may include, but are not limited to: writing in rows (the common horizontal format, written left to right and top to bottom); writing in columns (the vertical format, written top to bottom and right to left); and other user-defined writing formats, for example a right-to-left format set for writers of Arabic, or a top-to-bottom, left-to-right format, and so on.
  • each stroke of the user and its input position can be recorded in chronological order.
  • the system automatically records the stroke and the input position of the stroke on the panel; for example, the pixel position on the handwriting input screen can be used as the corresponding input position.
  • other positioning algorithms or position determining methods may also be employed, as long as the input position of each stroke can be uniquely determined.
  • a target row/column which can be used as a constraint range for the user's handwriting input, that is, when a row/column is activated, it becomes a target row/column.
  • the user can be prohibited from handwriting input in an area other than the target row/column, or the user is allowed to input at any position, but when the stroke input by the user exceeds the boundary of the target row/column, it can be used.
  • the method provided in this embodiment constrains input in units of rows (horizontal) or columns (vertical): the current input is limited to a specific row or column, and no stroke or text spans rows or columns. Under this row or column constraint, the input forms a stream of characters in input order.
  • the method provided by the embodiment is closer to the natural writing habits of the people, so that the writing experience of the user can be more natural and smooth.
  • the range of the target row/column may be displayed on the handwriting input screen, for example by highlighting the target row/column, or by displaying row/column lines or a grid pattern in the style of writing or letter paper, to indicate the location of the target row/column in which the user can currently input.
  • the currently activated first target row/column may be selected or created. Selecting or creating the currently activated first target row/column can take many forms, and the present embodiment gives the following two.
  • the location range of each row/column is determined, which may specifically include:
  • the row height/column width information is a default value or determined by the user input
  • the position range of each row/column refers to the relative top-edge and bottom-edge positions of each row in the handwriting input screen, or the relative left-edge and right-edge positions of each column in the handwriting input screen.
  • the handwriting input screen can be divided into a plurality of rows/columns, and the range of positions of each row/column can be determined.
  • the strokes can be input based on the divided rows/columns.
  • the target row/column can be selected by the user.
  • the target row/column selected by the user may specifically include:
  • receiving a target row/column selection message input by the user, where the target row/column selection message includes an identifier of the target row/column to be input by the user;
  • a row/column corresponding to the identifier of the target row/column to be input by the user is used as the currently activated first target row/column.
  • the identifier of the target row/column to be input by the user may be any coordinate point clicked by the user, in which case the row/column where the coordinate point is located is the row/column corresponding to the identifier; or the identifier may be a row/column number, for example the 10th row or the 10th column, in which case the row/column corresponding to that number is used as the currently activated first target row/column.
  • the user can select the target row/column through a connected input device. For example, when an external keyboard is connected, the user can select a target row/column through the keyboard; when an external mouse is connected, the user can select a different target row/column by moving the mouse; and when an external stylus is connected, the target row/column can be selected by where the stylus points before it touches the handwriting input screen.
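The first selection method (divide the screen by row height, then activate a row by clicked point or by row number) can be sketched for the horizontal case as follows. The function names, the pixel units and the 1-based row numbering are assumptions for illustration.

```python
def row_ranges(screen_height: int, row_height: int) -> list:
    """Divide the screen into rows; each entry is (top_edge, bottom_edge)."""
    return [(top, min(top + row_height, screen_height))
            for top in range(0, screen_height, row_height)]


def select_target_row(ranges: list, click_y=None, row_number=None) -> int:
    """Return the 0-based index of the row to activate.

    The row is identified either by a 1-based row number or by the
    y coordinate of a point the user clicked.
    """
    if row_number is not None:
        if not 1 <= row_number <= len(ranges):
            raise ValueError("row number out of range")
        return row_number - 1
    if click_y is None:
        raise ValueError("either click_y or row_number is required")
    for index, (top, bottom) in enumerate(ranges):
        if top <= click_y < bottom:
            return index
    raise ValueError("point lies outside every row")


ranges = row_ranges(screen_height=100, row_height=40)
active = select_target_row(ranges, click_y=85)
```

Columns work the same way with x coordinates and a column width in place of y coordinates and a row height.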
  • Method 2 for selecting the target row/column: activate a target row/column based on the characters previously entered by the user.
  • the method may specifically include:
  • the position range refers to the relative top-edge and bottom-edge positions of the first target row in the handwriting input screen, or the relative left-edge and right-edge positions of the first target column in the handwriting input screen.
  • an appropriate threshold can be set for the width of the first target row/column to meet the needs of a particular user.
  • the natural writing line of the writer may be habitually inclined to the right or to the lower right.
  • the boundary of at least one character that the user has input may be appropriately extended upward or downward by a distance.
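  • As a rough illustration of selection mode 2, the position range of a row can be derived from the bounding boxes of characters the user has already written, extended upward and downward by a margin and widened to a minimum size. The box representation, the parameter names `margin` and `min_height`, and the symmetric widening are assumptions made for this sketch.

```python
from typing import Iterable, Tuple

# (left, top, right, bottom) in screen coordinates, with y growing downward
Box = Tuple[float, float, float, float]


def row_range_from_characters(boxes: Iterable[Box],
                              margin: float = 0.0,
                              min_height: float = 0.0) -> Tuple[float, float]:
    """Return (top, bottom) of the row implied by previously entered characters."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one previously entered character is required")
    # Extend the characters' boundary upward and downward by `margin`.
    top = min(b[1] for b in boxes) - margin
    bottom = max(b[3] for b in boxes) + margin
    # Enforce the size threshold by widening the range symmetrically.
    shortfall = min_height - (bottom - top)
    if shortfall > 0:
        top -= shortfall / 2
        bottom += shortfall / 2
    return top, bottom
```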
  • the two methods of selecting the target row/column provided above are simple and fast; the second method can satisfy the user's personalized input and the handwritten text input in the graphic system.
  • Step 102A: for each stroke, according to the input position of the stroke in the first target row/column, or according to that input position together with the characters specified in the first target row/column, create a new character for the stroke or determine the character to which the stroke belongs.
  • This embodiment adopts a text segmentation manner different from the prior art: the attribution of the currently input stroke is determined based on the correlation between each input stroke and other characters or strokes. The method provided in this embodiment therefore spares the user tedious interaction while inputting characters, greatly simplifying the input operation.
  • the character refers to an independent character object having a two-dimensional shape. It includes not only standard characters of ideographic scripts, such as single Chinese characters, Japanese, Korean, Arabic, Vietnamese, Burmese, etc., or parts thereof (for example, radicals); standard words of phonetic scripts, such as English letters, German, French, Russian, Spanish, etc.; and computer characters based on traditional standard codes, such as ASCII characters, Unicode characters, or a string; but also combinations of handwritten characters with standard characters and strings; any graphic or image input by the user, such as a "heart" pattern, a photo, or any graffiti; and any other written expression.
  • FIG. 1B is a first schematic diagram of a character in a method for processing handwritten input characters according to an embodiment of the present invention.
  • FIG. 1C is a second schematic diagram of a character in a method for processing handwritten input characters according to an embodiment of the present invention.
  • Five characters are shown in FIG. 1B, including "stroke characters", that is, handwritten characters input by the user, such as the first, third, and fourth characters, and "graphic characters", that is, arbitrary graphic or image information input by the user, such as the second and fifth characters.
  • other characters, such as "standard characters" (any of the existing standard fonts) and "combined characters" (characters formed by mixing several kinds of characters together), can also be input in this embodiment.
  • "combined characters" can also directly include stylus strokes: when a handwritten stroke is written directly on a non-"stroke character", a "combined character" is formed. The character shown in FIG. 1C is a combination of standard characters and stroke characters.
  • the strokes input in the first target row/column can be automatically divided according to the intrinsic convention of the set language (for example, based on the writing or typesetting manner of each language, etc.).
  • determining the character to which the stroke belongs is a process of splitting the input character.
  • the splitting operation on the input characters (i.e., the character segmentation operation) can be performed while the user is inputting; that is, as the user writes naturally, the character to which each input stroke belongs is determined, so that the effect of segmenting characters while inputting is achieved.
  • to determine the attribution of a stroke, one of the following methods may be selected: (1) from the moment the user puts the pen down, the stroke being input is judged in real time from its dot matrix to determine its attribution; (2) the attribution of each stroke is judged after the input of that stroke is completed (i.e., when the pen is lifted); (3) after the input of one line is completed, or when it is determined that the user has paused input for a longer time, all previously entered strokes are judged one by one, and the strokes with the strongest correlation are attributed to the same character.
  • if the stroke is the first stroke in the first target row/column, a new character can be created for the stroke; if the stroke is not the first stroke in the first target row/column, a new character may be created for the stroke, or the character to which the stroke belongs may be determined, according to the input position of the stroke in the first target row/column and the other characters in the first target row/column.
  • in the method for processing handwritten input characters provided in this embodiment, a stroke input by the user and its corresponding input information are acquired in the currently activated first target row/column, and a new character is created for the stroke, or the character to which the stroke belongs is determined, according to the input position of the stroke in the first target row/column, or according to that input position together with the characters specified in the first target row/column. This achieves the effect of segmenting characters while inputting: the user does not need to separate different characters by explicit or implicit "start single character input" or "end single character input" commands, and the writing process is smooth and efficient. Moreover, the character to which a stroke belongs is determined directly from the input position of the stroke, without standardized character recognition, so the personalized information, writing style, and features of the user's handwriting input are retained.
  • because the present embodiment makes handwriting input more natural and smooth, it is easier for the elderly and children who are unfamiliar with electronic input devices, such as computers, mobile phones, tablet computers, laptop computers, notebooks, and iPads, to use these devices.
  • the handwritten input character processing method in this embodiment adopts a pen/paper model.
  • the user can directly activate any line in the page for input.
  • the system can process empty lines between handwritten input and handwritten input as empty paragraphs. For the user, there can be only the command to change the input line, and there is no concept of carriage return or line feed.
  • the line break function can be implemented in multiple manners. In this embodiment, the following four types are provided:
  • the second target row/column is the currently activated target row/column, and the second target row/column is the next row/column of the first target row/column.
  • the position of the line break can be determined by a preset interaction mode. For example, it may be stipulated in advance that the end of the line is confirmed by continuously clicking a corresponding position or button of the right border of the input box or the screen twice or three times each time the line is naturally written to reach the end of the line.
  • a command button can be set at the end of the first target row/column, and when the user clicks the command button, the next row/column is automatically activated for editing.
  • the second target row/column is set as the currently activated target row/column, so that strokes input by the user in the second target row/column can be acquired;
  • the second target row/column is the next row/column of the first target row/column.
  • the first target row/column and the second target row/column are simultaneously set as the currently activated target rows/columns;
  • the second target row/column is the next row/column of the first target row/column.
  • the user's stroke may span multiple rows/columns.
  • the row/column to which the stroke belongs must be determined by certain rules: it can be the row/column where the starting point of the stroke is located, the row/column where the end point is located, or the row/column containing the largest proportion of the stroke.
  • this contradiction can also be alleviated by increasing the row/column spacing between adjacent two rows/columns.
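  • The three rules for a stroke that spans several rows (starting point, end point, largest proportion) might be sketched as follows. Equal-height rows, the rule names, and the use of sampled point counts as a stand-in for "proportion" are assumptions of this example.

```python
from collections import Counter
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), with y growing downward


def row_of(point: Point, row_height: float) -> int:
    """0-based index of the row containing `point`, assuming equal-height rows."""
    return int(point[1] // row_height)


def assign_row(points: List[Point], row_height: float, rule: str = "majority") -> int:
    """Pick the row a multi-row stroke belongs to, using one of three rules."""
    if not points:
        raise ValueError("a stroke needs at least one point")
    if rule == "start":
        return row_of(points[0], row_height)
    if rule == "end":
        return row_of(points[-1], row_height)
    if rule == "majority":
        # Approximates "largest proportion" by counting sampled points per row;
        # ties go to the row the stroke entered first.
        counts = Counter(row_of(p, row_height) for p in points)
        return counts.most_common(1)[0][0]
    raise ValueError(f"unknown rule: {rule}")
```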
  • when the first target row/column and the second target row/column are simultaneously used as the currently activated target rows/columns, only a partial area of each of them is activated;
  • a starting position of the active area of the first target row/column is set between an end position of an active area of the second target row/column and an end position of an active area of the first target row/column.
  • the user decides whether or not to break the line by fully controlling the position of the handwriting panel representing the active area within the segment.
  • the handwriting panel itself has the feature of automatically breaking lines within the paragraph.
  • the system will move some or all of the handwriting panel to the next line or the previous line according to its position in the paragraph and its relationship with the current line. As its position within the paragraph changes, the content presented in the handwriting panel changes accordingly.
  • when the handwriting panel has been moved to the last line of the paragraph, triggering the panel's automatic line break again actually breaks the paragraph.
  • FIG. 1D is a schematic diagram of a method for processing handwritten input characters according to an embodiment of the present invention, in which two adjacent rows are simultaneously activated.
  • the position in the box in the figure is the active area.
  • the active area is a logically continuous area within two adjacent rows/columns, and the user can only input in the active area. Since the active areas of two adjacent rows/columns overlap, this avoids the occurrence of cross-row/column strokes.
  • the active area can also be switched to the full row/column range (the first target row/column or the second target row/column) according to the user's interaction.
  • the target line and the relevant area of the previous line may be simultaneously activated when the distance between the input position of the stroke in the target line and the start position of the line is less than a certain threshold; if the currently activated target line is not the end of the paragraph, the target line and the relevant area of the next line can be simultaneously activated when the distance between the input position of the stroke in the target line and the end position of the line is less than a certain threshold.
  • the user may need to issue a "line extension" command, followed by a blank line that belongs to this paragraph, in order to enable the function of simultaneously activating two adjacent lines.
  • the first and fourth methods are active line breaks made by the user, in which the target row/column is switched through interaction with the user, which is more accurate; the second and third methods are automatic line breaks that require no additional interaction. As long as the user's writing fully meets the requirements of the rows or columns, the end position of each row/column can be recognized automatically without the user having to confirm it interactively, so that the entire handwriting input screen can even be used like ordinary paper, greatly improving the user's input experience.
  • A line break means that the current paragraph is not over, but since handwritten characters have been entered up to the end of the line, the next line needs to be activated; the end of a paragraph means that the paragraph is finished. When the paragraph is judged to have ended, a blank line can be inserted after the current line, and the line following the blank line is then activated as the first line of the next paragraph, so that the user can input there; or, when the paragraph is judged to have ended, the next row/column after the current line can be directly activated as the first line of the next paragraph for input.
  • any one of the above-mentioned line break modes one, two, and three may be used to perform line break.
  • some interaction with the user is required.
  • paragraph extension command only makes sense on the last line of the paragraph or the last line inserted.
  • the current edit line and all other lines in the corresponding paragraph of the line will have some sort of visual state to distinguish them from other paragraphs.
  • the new character created for the acquired stroke, or the attribution determined for it, is saved at preset time intervals;
  • the strokes input by the user and the corresponding input information may be saved in a first memory; the saved characters are stored in a second memory, together with the strokes that compose each saved character.
  • the strokes and their input information and corresponding characters may all be stored in one memory, which is not limited in this embodiment.
  • any suitable storage method may be employed as long as it can effectively distinguish the characters to which each stroke belongs and each different character.
  • information such as input strokes and divided characters can be stored in a temporary storage location or space of the system (such as the RAM or flash memory of the system) while inputting; after the input of each target row/column ends, all of the divided character and stroke information in that target row/column is stored in the specified permanent storage location or space.
  • the input information corresponding to the stroke further includes one or a combination of the following: an input time of the stroke, an input strength of the stroke, and an input speed of the stroke.
  • the input time includes a pen-down time and a pen-up time of the stroke, and a dwell time at each point along the stroke;
  • the input position includes at least: the position when the pen is put down, the position when the pen is lifted, and the coordinate position of each point along the stroke.
  • information such as the input time, strength, and speed of each stroke can be recorded as needed to further refine the input information.
  • the strokes and the corresponding input time, strength, and speed can be stored in a separate stroke database in the form of a list.
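  • One possible in-memory shape for the per-stroke input information listed above is sketched below. The class and field names are hypothetical, timestamps are assumed to be in seconds, and the mean speed is derived from the recorded points rather than read from a device.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StrokePoint:
    x: float
    y: float
    t: float               # time at which the point was sampled, in seconds
    strength: float = 0.0  # input strength (pen pressure), if the device reports it


@dataclass
class Stroke:
    points: List[StrokePoint]  # in writing order: first is pen down, last is pen up

    @property
    def pen_down(self) -> StrokePoint:
        return self.points[0]

    @property
    def pen_up(self) -> StrokePoint:
        return self.points[-1]

    def length(self) -> float:
        """Total path length along the recorded points."""
        return sum(math.hypot(b.x - a.x, b.y - a.y)
                   for a, b in zip(self.points, self.points[1:]))

    def mean_speed(self) -> float:
        """Path length divided by the time between pen down and pen up."""
        duration = self.pen_up.t - self.pen_down.t
        return self.length() / duration if duration > 0 else 0.0


# Strokes kept as a list, in the order in which they were written.
stroke_database: List[Stroke] = []
```

  • Dwell time at a point can be read off as the difference between consecutive timestamps that share the same coordinates.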
  • because the present embodiment records and retains the detailed input information of each stroke, in the stroke order used at the time of writing, while receiving each input stroke, it can completely preserve almost all the information about each user's writing style and habits, such as stroke order, stroke style, word spacing, and other writing features, which makes tasks such as handwriting identification straightforward.
  • This embodiment also shows great advantages in handling missing strokes. For example, when the user writes the character "I", forgets the dot stroke in the upper right corner, and notices the missing stroke only after inputting other characters, the user can simply add the dot at the corresponding upper-right position of the original "I" character, just as when writing on ordinary paper. Although the input time of the dot differs from the input time of the other strokes of the "I" character, it can be judged from the position information that the dot belongs to the previously entered "I" character.
  • because the present embodiment can completely retain all the input information, including the input time, position, strength, speed, and word spacing of each stroke, it also leaves more room for application services such as subsequent editing and other processing.
  • in step 102A, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with the characters specified in the first target row/column, may specifically include:
  • the stroke is associated with at least one character
  • the stroke is attributed according to the associated at least one character.
  • the specified characters in the embodiment may be all the characters that are already in the first target row/column; or the specified characters may be the characters within a to-be-compared region in the first target row/column, where the distance between the boundary of the to-be-compared region and the stroke is less than a second preset threshold. Comparing the stroke only with characters within a certain surrounding range effectively reduces the amount of calculation and improves the efficiency of determining the stroke's attribution.
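  • One way to read the to-be-compared region is as a filter that keeps only the characters whose bounding box lies within the second preset threshold of the stroke. The sketch below follows that reading; the box representation and the function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def box_gap(a: Box, b: Box) -> float:
    """Shortest distance between two axis-aligned boxes (0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def candidates_near(stroke_box: Box, char_boxes: Sequence[Box],
                    second_threshold: float) -> List[int]:
    """Indices of characters close enough to the stroke to be worth comparing."""
    return [i for i, box in enumerate(char_boxes)
            if box_gap(stroke_box, box) < second_threshold]
```

  • Only the returned candidates then need to go through the more expensive relevance judgment.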
  • Relevance judgment mode 1: determine the relevance of the stroke to a character by judging whether the stroke overlaps the character. Specifically, in step 102A, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with the characters specified in the first target row/column, may specifically include:
  • the stroke is associated with at least one character
  • the stroke is attributed according to the associated at least one character.
  • strokes that intersect each other can be used as strokes of the same character, and the strokes are assigned to the same character, which is simple and quick.
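  • A minimal sketch of the intersection check follows, treating each stroke as a polyline of at least two sampled points and using the standard orientation test for segment intersection. This is one common way to test whether two strokes cross; the original text does not prescribe a particular algorithm.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def _orient(p: Point, q: Point, r: Point) -> int:
    """Sign of the turn p -> q -> r: 1, -1, or 0 when collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _on_segment(p: Point, q: Point, r: Point) -> bool:
    """True if r, known to be collinear with p and q, lies between them."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a: Point, b: Point, c: Point, d: Point) -> bool:
    o1, o2 = _orient(a, b, c), _orient(a, b, d)
    o3, o4 = _orient(c, d, a), _orient(c, d, b)
    if o1 != o2 and o3 != o4:
        return True
    return ((o1 == 0 and _on_segment(a, b, c)) or (o2 == 0 and _on_segment(a, b, d))
            or (o3 == 0 and _on_segment(c, d, a)) or (o4 == 0 and _on_segment(c, d, b)))


def strokes_intersect(s1: Sequence[Point], s2: Sequence[Point]) -> bool:
    """True if any segment of polyline s1 crosses or touches any segment of s2."""
    return any(segments_intersect(s1[i], s1[i + 1], s2[j], s2[j + 1])
               for i in range(len(s1) - 1) for j in range(len(s2) - 1))
```

  • A new stroke would be attributed to a character as soon as it intersects any stroke already belonging to that character.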
  • Relevance judgment mode 2: the relationship between the stroke and a character is determined by calculating the distance between the stroke and the character boundary.
  • specifically, in step 102A, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with the characters specified in the first target row/column, may specifically include:
  • the stroke is associated with at least one character
  • the stroke is attributed according to the associated at least one character.
  • the character to which the stroke belongs can be determined by comparison with a third preset threshold.
  • the stroke may be considered to belong to the adjacent character, otherwise a new attribution character may be created for the stroke.
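  • The boundary-distance rule can be sketched as below for a horizontal row, where only the left-to-right gap between the stroke and each character boundary is compared with the third preset threshold. Restricting the comparison to the horizontal gap, and returning `None` to mean "create a new character", are choices made for this example.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def bounding_box(points: Sequence[Point]) -> Box:
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def horizontal_gap(a: Box, b: Box) -> float:
    """Left-to-right gap between two boxes (0 if they overlap horizontally)."""
    return max(a[0] - b[2], b[0] - a[2], 0.0)


def attribute_stroke(stroke: Sequence[Point], char_boxes: Sequence[Box],
                     third_threshold: float) -> Optional[int]:
    """Index of the nearest character within the threshold, or None for a new character."""
    stroke_box = bounding_box(stroke)
    best_index, best_gap = None, third_threshold
    for i, char_box in enumerate(char_boxes):
        gap = horizontal_gap(stroke_box, char_box)
        if gap < best_gap:
            best_index, best_gap = i, gap
    return best_index
```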
  • Relevance judgment mode 3: determine the correlation between the stroke and a character by calculating the distance between the stroke and each stroke in the character.
  • specifically, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with the characters specified in the first target row/column, may specifically include:
  • the stroke is associated with at least one character
  • the stroke is attributed according to the associated at least one character.
  • the performing the attribution processing on the stroke according to the at least one associated character may include:
  • if at least two characters are associated with the stroke, the at least two characters are merged, and the stroke is attributed to the merged character.
  • when a stroke can be attributed to the characters on its left and right at the same time, this indicates that the stroke should be merged with the characters on both sides to form one glyph; for example, in the Chinese character for "tree" (树), the strokes in the middle sit between the "wood" component (木) on the left and the "inch" component (寸) on the right.
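  • The stroke-to-stroke distance of relevance judgment mode 3, together with the merge rule above, might look like the following sketch. Strokes are lists of sampled points, characters are non-empty lists of strokes, and the distance is the minimum point-to-point distance; the names and the single `threshold` parameter are assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
StrokePts = List[Point]
Character = List[StrokePts]  # a character is the list of strokes attributed to it


def stroke_distance(a: StrokePts, b: StrokePts) -> float:
    """Minimum distance between any sampled point of a and any sampled point of b."""
    return min(math.hypot(p[0] - q[0], p[1] - q[1]) for p in a for q in b)


def attribute_with_merge(stroke: StrokePts, characters: List[Character],
                         threshold: float) -> List[Character]:
    """Return a new character list with `stroke` attributed.

    No associated character: a new character is created.
    One associated character: the stroke joins it.
    Several associated characters: they are merged and the stroke joins the result.
    """
    hits = [i for i, ch in enumerate(characters)
            if min(stroke_distance(stroke, s) for s in ch) < threshold]
    if not hits:
        return characters + [[stroke]]
    merged = [s for i in hits for s in characters[i]] + [stroke]
    result = [ch for i, ch in enumerate(characters) if i not in hits]
    result.insert(hits[0], merged)  # keep the merged character at the first hit's position
    return result
```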
  • the preset threshold may not be set as long as the characters can be divided.
  • the association between the stroke and the character can be divided into strong and weak, and the attribution of the stroke is judged according to the strength of the association.
  • the performing the attribution processing on the stroke according to the at least one associated character may include:
  • if at least two characters have the strongest association with the stroke, the at least two characters are merged, and the stroke is attributed to the merged character.
  • obtaining, from the associated at least one character, the character most strongly associated with the stroke may include:
  • the at least one character associated with the stroke is sorted by distance in ascending order, and the character corresponding to the minimum distance is taken as the character most strongly associated with the stroke; or,
  • the default is that the stroke with the upper and lower positional relationship can be attributed to the same character, and only the positional relationship between the stroke and the adjacent left and right characters needs to be judged.
  • the default is that the stroke with the left and right positional relationship can be attributed to the same character, and only the positional relationship between the stroke and the adjacent upper and lower characters needs to be judged.
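  • The selection of the most strongly associated character by ascending distance, described above, can be illustrated with the short sketch below. It assumes the distances to the candidate characters have already been computed by one of the relevance judgment modes; the `tolerance` parameter, which lets near-equal distances count as a tie, is a hypothetical addition.

```python
from typing import Dict, List


def strongest_associations(distances: Dict[str, float],
                           tolerance: float = 0.0) -> List[str]:
    """Character ids with the smallest distance to the stroke.

    More than one id is returned when several characters tie for the
    strongest association, in which case they would be merged.
    """
    if not distances:
        return []
    ranked = sorted(distances.items(), key=lambda item: item[1])  # ascending distance
    smallest = ranked[0][1]
    return [char_id for char_id, d in ranked if d - smallest <= tolerance]
```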
  • the methods described in the above modes may be used in combination; for example, relevance judgment mode 1 is used for some strokes, relevance judgment mode 2 for other strokes, and relevance judgment mode 3 for the remaining strokes.
  • for example, one of the relevance judgment modes may first be used to determine whether the stroke is associated with the other characters already entered in the first target row/column; if it is not associated, a new character is created for the stroke. If the current stroke is neither the first stroke nor the last stroke in the first target row/column, the distance between the currently input stroke and all the characters or strokes that have been input may be compared according to relevance judgment mode 2 or relevance judgment mode 3, and the currently entered stroke is attributed to the associated one or more characters based on the result of the comparison.
  • the first preset threshold, the second preset threshold, the third preset threshold, and the fourth preset threshold may be determined by the user according to their own writing habits, and may also adopt a system default value.
  • the system can also provide visual aids to assist automatic segmentation, such as segmentation based on composition grids: the character to which the currently input stroke belongs is determined from the correlation between the stroke and the corresponding composition grid in the current input line.
  • composition grids can also be used to determine the attribution of the stroke. Specifically, before the stroke input by the user and the corresponding input information are acquired in step 101A, the first target row/column may be divided into multiple composition grids.
  • in step 102A, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with the characters specified in the first target row/column, includes:
  • if the composition grid in which the stroke falls already contains a character, the stroke is attributed to the existing character in that composition grid; otherwise, a new character is created in the composition grid, and the stroke is attributed to the new character.
  • if the stroke lies within a single composition grid that contains no character, a new character is created for the stroke, and the new character belongs to that composition grid; if the stroke spans at least two composition grids, it is determined whether there is a character in the at least two composition grids. If there is no character in them, a new character is created for the stroke, and the new character belongs to the at least two composition grids; if a character exists in only one of them, the stroke is attributed to the character in that composition grid; if there are multiple characters in the at least two composition grids, those characters are merged, and the stroke is attributed to the merged character.
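  • The composition-grid rules above can be sketched as follows for a horizontal row divided into grids of equal width. The `GridCharacter` type, the equal-width assumption, and the in-place update of the character list are illustrative choices, not part of the original description.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Point = Tuple[float, float]


@dataclass(eq=False)  # identity-based equality, so removal targets the exact object
class GridCharacter:
    grids: Set[int]                                   # composition grids it occupies
    strokes: List[List[Point]] = field(default_factory=list)


def attribute_by_grid(stroke: List[Point], grid_width: float,
                      characters: List[GridCharacter]) -> GridCharacter:
    """Attribute `stroke` using the grids it spans; updates `characters` in place."""
    spanned = {int(x // grid_width) for x, _ in stroke}
    owners = [c for c in characters if c.grids & spanned]
    if not owners:
        # No character in any spanned grid: create one that belongs to those grids.
        target = GridCharacter(grids=set(spanned))
        characters.append(target)
    else:
        # One owner: attribute to it. Several owners: merge them first.
        target = owners[0]
        for other in owners[1:]:
            target.grids |= other.grids
            target.strokes.extend(other.strokes)
            characters.remove(other)
        target.grids |= spanned
    target.strokes.append(stroke)
    return target
```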
  • each input character of the embodiment is divided and stored as a glyph object (a non-standard, i.e., handwritten, character); in other words, in this embodiment each segmented input character is treated as a non-standard glyph object. On the other hand, if the handwritten content is ultimately used only for human reading (where retaining the original form of the input matters more), division errors do not need to be corrected.
  • This embodiment provides a correction method, which specifically includes:
  • the correction request including a character to be corrected, or a character to be corrected and a stroke to be corrected;
  • the specific content of the correction request may be different according to different scenarios.
  • the following scenarios are provided:
  • Scenario 1: combining at least two characters into one, that is, the correction request is a merge correction request, and the characters to be corrected are at least two characters to be merged;
  • the correcting processing is performed on the character to be corrected according to the correcting request, including:
  • Scenario 2: splitting a character into a plurality of characters, that is, the correction request is a split correction request, and the character to be corrected is a character to be split;
  • the correcting processing is performed on the character to be corrected according to the correcting request, including:
  • Scenario 3: moving a stroke attributed to one character to another character, that is, the correction request is an attribution correction request, the character to be corrected is the character that is to receive the stroke, and the stroke to be corrected is at least one stroke whose attribution is to be changed;
  • the correcting processing is performed on the character to be corrected according to the correcting request, including:
  • the at least one stroke to be corrected is attributed to the character that is to receive it.
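  • The three correction scenarios can be expressed as small list operations, as in the sketch below. Characters are modelled as lists of stroke identifiers; since the detailed steps for merging and splitting are not spelled out in the text above, the function names and behaviour here are assumptions.

```python
from typing import List, Sequence

Character = List[str]  # a character is the list of stroke ids attributed to it


def merge_characters(chars: List[Character], indices: Sequence[int]) -> List[Character]:
    """Scenario 1: combine the characters at `indices` into a single character."""
    picked = sorted(set(indices))
    merged = [s for i in picked for s in chars[i]]
    out = [c for i, c in enumerate(chars) if i not in picked]
    out.insert(picked[0], merged)
    return out


def split_character(chars: List[Character], index: int,
                    groups: Sequence[Sequence[str]]) -> List[Character]:
    """Scenario 2: replace one character with several, one per stroke group."""
    if sorted(s for g in groups for s in g) != sorted(chars[index]):
        raise ValueError("groups must partition the character's strokes")
    return chars[:index] + [list(g) for g in groups] + chars[index + 1:]


def reattribute(chars: List[Character], stroke_ids: Sequence[str],
                target_index: int) -> List[Character]:
    """Scenario 3: move strokes to the character at `target_index`."""
    moving = set(stroke_ids)
    out = [[s for s in c if s not in moving] for c in chars]
    out[target_index] = out[target_index] + list(stroke_ids)
    return [c for c in out if c]  # drop characters left without strokes
```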
  • the characters that have been split can be re-split by interacting with the user, thereby improving the accuracy of character splitting.
  • each character possibly a combination of one or more words, words
  • because the method provided by the embodiment can also record the stroke order (based on time) of each stroke written by the user and the shape features of the corresponding strokes, it is easy to find characters with the same or similar stroke order and stroke shape features; such characters can be treated as the same character if appropriate threshold conditions are met. This makes matching and searching for characters, including characters entered by the user, straightforward.
  • the functions of finding and inserting can also be added.
  • the search function may specifically include the following steps:
  • the characters to be searched are compared with the locally saved characters according to the number of strokes of the character to be searched and the stroke feature, and characters matching the characters to be searched are obtained.
  • the handwritten characters produced by splitting can be obtained.
  • handwritten text search based on pattern matching can be performed. The main step is to match each character in the search source with the character to be found, one by one. Matching characters can be found by matching the number of strokes and the stroke order.
  • the strokes of the character to be searched are matched one-to-one against the strokes of the locally saved character, that is, curve matching is performed; if any stroke does not match, the final matching result is a failure, and if all strokes are consistent, the final matching result is a success.
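  • A simplified sketch of the search follows: candidates are first filtered by stroke count, then strokes are compared one-to-one in writing order. As a crude stand-in for curve matching, each stroke is reduced to its start and end points normalized to the character's size; this feature and the tolerance `tol` are assumptions of the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Character = Sequence[Sequence[Point]]  # strokes in writing order


def stroke_features(char: Character) -> List[Tuple[float, float, float, float]]:
    """Start and end point of every stroke, scaled to the character's own size."""
    xs = [x for s in char for x, _ in s]
    ys = [y for s in char for _, y in s]
    x0, y0 = min(xs), min(ys)
    size = max(max(xs) - x0, max(ys) - y0) or 1.0
    return [((s[0][0] - x0) / size, (s[0][1] - y0) / size,
             (s[-1][0] - x0) / size, (s[-1][1] - y0) / size) for s in char]


def matches(query: Character, candidate: Character, tol: float = 0.25) -> bool:
    if len(query) != len(candidate):  # the number of strokes must agree first
        return False
    return all(max(abs(a - b) for a, b in zip(fq, fc)) <= tol
               for fq, fc in zip(stroke_features(query), stroke_features(candidate)))


def find_matches(query: Character, saved: Sequence[Character],
                 tol: float = 0.25) -> List[int]:
    """Indices of locally saved characters that match the character to be searched."""
    return [i for i, c in enumerate(saved) if matches(query, c, tol)]
```

  • Because strokes are compared in writing order, the same shape written with a different stroke order does not match, which is consistent with matching on stroke order as described above.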
  • any character analysis or other matching method in the prior art can be used to implement the character search function, which is not limited in this embodiment.
  • the function of replacing characters can also be implemented based on the same principle as the search function, and will not be described here.
  • the insertion function of the handwritten text input editing may specifically include the following steps:
  • the insertion request including a target row/column to be inserted, a to-be-inserted position in the target row/column to be inserted, and a character to be inserted;
  • when the user needs to add a character at a position in a line that has become inactive, for example to insert a character between the 3rd and 4th characters of a line, the user first activates the line, and the system provides, at the blank gaps between the characters in the line, an auxiliary interface that accepts user input. The user activates the auxiliary interface between the 3rd and 4th characters of the line and chooses to perform an insertion operation at that character gap.
  • FIG. 1E is a schematic diagram of a state in which a character is inserted in an embodiment of a method for processing handwritten input characters according to the present invention.
  • the existing characters after the insertion position can be moved to the next line, so that the space from the insertion position to the end of the current line becomes available for writing. The insertion line is marked with a right arrow, and clicking the right arrow exits the insertion state. Before the insertion is complete, the user can only input between the two insertion markers.
  • insertions can be nested, that is, a further insertion can be made within an insertion. Insertion rows have a different visual state from normal rows to help users recognize the current editing state.
  • the selection processing command includes any one or a combination of the following: performing copy processing on the at least one character, performing cut processing on the at least one character, and performing replacement processing on the at least one character And performing a merge process on the at least one character.
  • FIG. 1F is a schematic diagram of an editing mode under a selection processing command in an embodiment of a method for processing handwritten input characters according to the present invention. As shown in FIG. 1F, functions such as inserting, pasting, selecting all, selecting, and merging can be displayed on the handwriting input screen to facilitate the user to perform corresponding operations.
  • the embodiment may also insert or add a stroke, a comment, or delete some characters or the like on the input character.
  • the functions of searching, inserting, and copying provided in this embodiment can effectively avoid the disadvantages of the existing handwriting input system being less intuitive and difficult to modify.
  • the number of the first target rows/columns is plural;
  • the active areas corresponding to the plurality of the first target rows/columns do not overlap and are not in contact with each other.
  • multiple users can input in the active areas corresponding to the plurality of first target rows/columns, respectively, satisfying the function that the large-size handwriting input screen allows multiple people to simultaneously input.
  • the embodiment is compatible with the existing keyboard, mouse, and other existing input devices, and the hybrid input is implemented by performing mode switching.
  • the mode switching method in this embodiment may specifically include:
  • the handwriting mode is switched to the target mode, and in the target mode, at least one standard character input by the user is received.
  • the target mode may be a keyboard input mode, a mouse input mode, or other existing input modes.
  • a mixed typesetting can be implemented by adding standard code characters or inserting other symbols or information into the input limits of a row or column in combination with an existing keyboard (see handwritten text mixing in the example of the present application).
  • keyboards can be activated by means of appropriate touch buttons or operations (eg, clicks) to allow the user to freely switch between handwriting input and other conventional input devices such as a keyboard.
  • a division form of a standard code may be used, or a division manner of characters in the present invention may be used.
  • the active area can also automatically move with the user's input. For example, the active area is always repositioned with the position of the user's last stroke as the midpoint of the active area. In this way, in most cases, the active area will automatically move as the user writes, so that the location of the active area does not need to be manually set.
  • the system will have a flashing cursor to indicate the current input position.
  • the system displays the active area to indicate the range that can be currently input.
  • the two can be converted to each other according to certain rules. For example, when switching from standard character input to handwriting input, the system sets the position of the active area so that the cursor position is the midpoint of the active area; when switching from handwriting input to standard character input, the character position closest to the midpoint of the active area is set as the current input position.
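  • The two conversion rules above, together with the automatic repositioning of the active area on the last stroke, can be sketched as follows. This is only a minimal illustration: the `ActiveArea` class, its one-dimensional `start`/`length` representation, and the function names are assumptions made for the example and are not defined in this application.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ActiveArea:
    # Hypothetical 1-D model: the area covers [start, start + length]
    # along the layout direction of the row/column.
    start: float
    length: float

    @property
    def midpoint(self) -> float:
        return self.start + self.length / 2.0


def recenter(area: ActiveArea, position: float) -> ActiveArea:
    """Return an area of the same length whose midpoint is `position`.

    Used both when following the user's last stroke and when switching
    from standard character input (cursor position) to handwriting input.
    """
    return ActiveArea(start=position - area.length / 2.0, length=area.length)


def cursor_from_active_area(area: ActiveArea,
                            char_positions: Sequence[float]) -> int:
    """Return the index of the character position closest to the midpoint.

    Assumes `char_positions` is non-empty.
    """
    return min(range(len(char_positions)),
               key=lambda i: abs(char_positions[i] - area.midpoint))
```

  • In this sketch, recentering a 100-unit area on position 230 yields an area starting at 180, and switching back to standard input with character positions 0, 120, 240, and 360 selects the character at 240.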
  • Control characters exist in the standard code (such as ASCII code) character set.
  • control characters may be standard control characters, such as spaces, tabulations, line breaks, and the like; or non-standard control characters, such as white space characters.
  • standard control characters are similar to the prior art.
  • this embodiment additionally provides the function of blank characters.
  • the spacing information between characters can be preserved, for example the size of the gap between left and right neighboring characters in horizontal layout, or between upper and lower neighboring characters in vertical layout, and the blank spacing can be directly created as a whitespace character carrying the blank spacing information.
  • the horizontal baseline of the target line where the character is located may be taken as the horizontal baseline of the character, and the position of the leftmost part of the character (such as a graphic, image, or stroke) is set as the starting position of the character.
  • The position of each part in the character is recorded relative to the baseline and the starting position, with the typesetting direction as the positive direction. In this way, the same character content can appear at different positions in the text.
  • as long as the origin coordinates of a character are correctly calculated from the line of the character and the position of the character in the line, all of its internal components can be correctly drawn.
  • the starting position of each character can be set in a similar manner, and the positions of the internal parts of the character are expressed as coordinates relative to that starting position.
  • FIG. 1G is a schematic diagram of a blank character in an embodiment of a method for processing handwritten input characters according to the present invention.
  • a custom space character is introduced, and the word spacing is saved as a parameter/content.
  • the numbers 12, 16, and 10 in Fig. 1G are the numerical values of the blank characters, indicating the length of each blank character. Blank characters of different lengths can be treated differently during analysis and processing (such as identification, bypass, etc.). Similarly, time-based whitespace characters can be added to the text of voice input.
  • the maximum coordinate of the character entered by the user along the layout direction is the width of the character.
  • the character width may either be stored or be recovered from the position information of all internal parts of the character.
  • when formatting text, once the width information of all characters (including control characters) is available, the starting position of every character in its row/column can be restored, providing a basis for further text rendering.
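  • As a small illustration of how starting positions follow from widths alone, the sketch below accumulates the widths of all items in a row. The function name and the glyph widths are assumptions made for the example; only the blank widths 12, 16, and 10 are taken from Fig. 1G.

```python
def start_positions(widths):
    """Return the starting offset of each item in a row/column.

    `widths` lists the width of every character in layout order,
    including blank and other control characters.
    """
    positions = []
    offset = 0
    for width in widths:
        positions.append(offset)  # each item starts where the previous one ended
        offset += width
    return positions


# Hypothetical row: glyph, blank(12), glyph, blank(16), glyph, blank(10), glyph
row_widths = [30, 12, 25, 16, 40, 10, 20]
row_starts = start_positions(row_widths)
```

  • The same accumulation applies to vertical layout if heights are used in place of widths.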
  • control characters and blank characters are introduced. These control characters have models, codes, glyphs, and meanings similar to those of the characters handwritten by the user. Therefore, the theory, methods, and tools for processing handwritten input characters can be applied directly or indirectly to control characters. Further, the characters handwritten by the user and the control characters can be mixed and processed together. On this basis, the splitting of characters becomes even more significant.
  • the object processed in this embodiment may be a stroke character, a standard character, a graphic character, a combined character or a control character input by the user, or may be a mixture of a plurality of characters.
  • FIG. 1H is a flow chart of text editing in an embodiment of a method for processing handwritten input characters according to the present invention. As shown in FIG. 1H, the text editing in this embodiment may specifically include the following steps:
  • Step 601A Determine the open mode: if the existing document is opened, step 602A is performed; if the new document is created, step 603A is performed.
  • This embodiment is mainly used to provide personalized handwritten character input for related documents, and there are mainly two ways of entering the handwriting input system: a method with document data and a method without document data.
  • the former is to open an existing document, and the latter is to create a new document.
  • Step 602A loading document data and performing typesetting according to the typesetting constraint, and executing step 604A.
  • the related data of the characters may be loaded hierarchically. For example, formatting a character requires only the width of the character (or its height for column-based layout), so in this step only the width information needs to be loaded. Other information, such as stroke drawing information or contour information, can be loaded on demand later, which saves system resources (memory, network traffic, etc.). Step 604A is then performed.
  • Step 603A initializing the handwritten document, and executing step 604A.
  • Step 604A Initialize (empty) the sequence of handwritten text objects representing the character input lines.
  • Here, AL stands for Active Line.
  • Step 605A presenting the document content, and performing step 606A.
  • the presented content includes multiple parts: visual information of the document itself (including visual information of handwritten characters, such as the position and shape of characters), visual information of the document presentation environment (such as background, shading, paper border, etc.), Visual information related to document editing (such as selected area, cursor or active area indicating input focus, auxiliary lines, etc.). It is mentioned in step 602A that the visualized data of the handwritten characters must be loaded when it needs to be presented. For characters that do not need to be rendered, their corresponding visualization data may not be loaded.
  • after the character stream is loaded from the storage area into memory, it needs to be typeset before it is displayed.
  • the typesetting here refers to line breaks.
  • the line can be broken at a paragraph mark/newline (hard return). Otherwise, the position of each character is calculated within each row/column and the total length of the text content is accumulated; the line is broken when the position exceeds the maximum position of the line (soft return). The break is placed at the last breakable position.
  • A line can be broken after punctuation (a punctuation mark cannot be the first character after a soft carriage return);
  • Blank spaces can be broken, and the first character of the next line is the following non-whitespace character (the whitespace character cannot be used as the first character after the soft carriage return);
  • East Asian characters can be directly broken before and after;
  • Handwritten characters can be broken directly before and after.
  • whitespace characters can be converted to blank spaces with standard lengths. Continuous blank spaces can be merged directly, so the typesetting algorithm is much simpler. Blank spacing is handled in the same way as whitespace characters.
  • the document model after typesetting includes information for each display line.
  • the line includes words with position (including characters, East Asian characters, and handwritten characters). Blank characters do not need to appear in this model, and the relevant information is implicit in the position attribute of the word (left border, right border (left border + width)). Therefore, blank characters (including white space, standard white space, tab characters, etc. caused by handwriting pitch) can be discarded after typesetting.
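  • A minimal sketch of this typesetting step is given below. It is a simplified illustration under stated assumptions, not the application's algorithm: items are modeled as hypothetical `(kind, width)` pairs, every glyph is treated as breakable before and after (as for handwritten and East Asian characters), and the punctuation rule is omitted. Blank characters only advance the position and are not kept in the resulting model.

```python
def layout(items, max_len):
    """Break a character stream into display lines.

    items:   list of (kind, width); kind is 'glyph', 'blank', or 'newline'.
    max_len: maximum line length along the layout direction.
    Returns a list of lines, each a list of (item_index, left_edge).
    """
    lines, line, pos = [], [], 0
    for index, (kind, width) in enumerate(items):
        if kind == "newline":
            # Hard return: always close the current line.
            lines.append(line)
            line, pos = [], 0
        elif kind == "blank":
            # Blanks are implicit in the positions of neighboring glyphs.
            pos += width
        else:
            if line and pos + width > max_len:
                # Soft return: the glyph, never a blank, starts the new line.
                lines.append(line)
                line, pos = [], 0
            line.append((index, pos))
            pos += width
    lines.append(line)
    return lines


sample = [("glyph", 30), ("blank", 12), ("glyph", 25), ("blank", 16),
          ("glyph", 40), ("newline", 0), ("glyph", 20)]
sample_lines = layout(sample, max_len=100)
```

  • Because a soft return is only triggered by a glyph, a blank can never become the first item after a soft return, while a blank following a hard return still indents the new line.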
  • the text in each line will change with the user's input. User input and erased strokes may cause the spacing of characters to change or generate new characters. As long as the character coordinates are correct, the spacing will be correctly generated. Only when you need to store the edited content, you need to calculate and generate whitespace characters and insert them into the appropriate locations.
  • Step 606A receiving the command, and performing different operations according to the command.
  • the commands here can be commands entered by the user, or they can be system commands or commands passed by other application systems.
  • if the command is a text encoding typesetting command, step 607A is performed; if the command is a start handwriting input command, step 608A is performed; if the command is an end handwriting input command, step 610A is performed; if the command is a system exit command, step 612A is performed.
  • Step 607A Typesetting the text content according to the command.
  • the typesetting constraint and the typesetting direction can also be stored in the information of each character.
  • the internal relative position of all the characters in the current typesetting mode can be adjusted according to this information, thereby correctly drawing the character.
  • the horizontally typed characters are stepped according to the width (that is, the line length is accumulated from left to right according to the typesetting direction), and the vertically typed characters are stepped according to the height. Therefore, in the specific implementation, it is necessary to distinguish between horizontal characters and vertical characters.
  • for horizontal layout, an internal coordinate system may be used whose horizontal axis is the line baseline (alignment line) and whose vertical axis passes through the leftmost stroke point.
  • for vertical layout, a corresponding internal coordinate system may be used that is based on the column axis and the highest stroke point. In this way, different characters remain in their original alignment state when drawn in the corresponding layout.
  • the system can automatically perform coordinate conversion. Although the original alignment between characters cannot be preserved, each character can still be rendered normally.
  • Another example is the change of text layout into ordinary typesetting.
  • the layout type is marked in the type of the character, and the internal coordinate system of each character can then take the lower left corner of the corresponding composition as its origin (in fact any point, such as the center point, can be used).
  • each character is aligned with the corresponding composition.
  • There is no text space/space character in the handwritten text of the text layout but there is a space character.
  • when we change the text layout into ordinary typesetting, we can recalculate each character: replace its coordinate system (for example, with the system described above whose origin is the intersection of the baseline and the leftmost point), and insert the corresponding interval character between the characters according to the new coordinate system.
  • Step 608A activate the target row/column, and perform step 609A.
  • the target row/column can be activated, and the text object in the target row/column is activated (loading stroke information), and the object sequence is assigned to AL.
  • the input of the handwritten characters is performed under the constraint of the row/column. Even if the input spans multiple rows/columns, the corresponding characters must eventually be stored in a specific location on a particular row. Therefore, the target row/column of character input can be presented in a visual manner, and the user can also avoid cross-line input through specific settings, such as auxiliary panel, full-screen line editing, and the like.
  • Step 609A Perform handwriting input under the constraint of the activated target row/column, and return to step 605A.
  • handwriting input can be performed under the constraint of the activated target row/column, and each stroke input is automatically combined with the AL according to a certain rule to form a new sequence of handwritten characters (ie, the AL is updated).
  • the input process of the handwritten characters mainly consists of automatically combining the input strokes into different characters according to the spatial constraints in the row/column.
  • the word spacing effect can be realized by the word spacing constraint or the text constraint.
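  • The application leaves the exact combination rule open ("a certain rule"). The sketch below shows one possible rule purely as an assumption: a stroke joins an existing character of the AL when the gap between their horizontal extents is smaller than a word spacing threshold, and otherwise starts a new character. The dictionary layout, the function name, and the `gap` parameter are hypothetical, and merging of two characters bridged by one stroke is not handled.

```python
def attribute_stroke(active_line, stroke, gap):
    """Attach `stroke` to a character of the active line or create a new one.

    active_line: list of dicts {'x_min', 'x_max', 'strokes'}, kept sorted by x_min.
    stroke:      (x_min, x_max) extent of the stroke along the row.
    gap:         assumed word spacing threshold.
    Returns the index of the character that received the stroke.
    """
    s_min, s_max = stroke
    for index, char in enumerate(active_line):
        # Distance between the two extents; zero or negative when they overlap.
        distance = max(char["x_min"] - s_max, s_min - char["x_max"])
        if distance < gap:
            char["x_min"] = min(char["x_min"], s_min)
            char["x_max"] = max(char["x_max"], s_max)
            char["strokes"].append(stroke)
            return index
    new_char = {"x_min": s_min, "x_max": s_max, "strokes": [stroke]}
    active_line.append(new_char)
    active_line.sort(key=lambda c: c["x_min"])  # keep characters in row order
    return next(i for i, c in enumerate(active_line) if c is new_char)
```

  • Each call updates the AL in place, which mirrors the description that every stroke input produces a new sequence of handwritten characters.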
  • Step 610A Store the content of the character objects in the AL, and execute step 611A.
  • the content of the character objects in the AL is stored, and if necessary, the text content related to the AL can be typeset again.
  • at this point the character objects in the AL are determined (previously they changed dynamically with stroke input). Some of these character objects have not changed, some have changed content (strokes), and some are brand new. Both changed and brand new characters are treated as new characters.
  • the sequence of characters corresponding to the final AL needs to be updated to their corresponding position in the document. If the storage method of encoding and content splitting is used here, the content of the new character needs to be first stored in the encoding library to obtain the corresponding encoding. The new code sequence is then saved to the appropriate location in the document (typically the in-memory document model).
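  • The store-then-reference step described above can be sketched as follows. The `EncodingLibrary` class, its integer codes, and the reuse of an existing code for identical content are assumptions made for illustration; the application does not prescribe this interface.

```python
class EncodingLibrary:
    """Toy encoding library: stores character content and returns a code."""

    def __init__(self):
        self._code_by_content = {}
        self._content_by_code = []

    def store(self, content):
        """Store `content` (a sequence of hashable stroke records) and return its code."""
        key = tuple(content)
        if key not in self._code_by_content:
            # Assumption: identical content reuses the code it already has.
            self._code_by_content[key] = len(self._content_by_code)
            self._content_by_code.append(key)
        return self._code_by_content[key]

    def load(self, code):
        """Return the content previously stored under `code`."""
        return self._content_by_code[code]


def commit_active_line(document, line_no, active_line, library):
    """Write the codes of the AL characters to line `line_no` of the in-memory document."""
    document[line_no] = [library.store(char) for char in active_line]
```

  • The document then holds only the code sequence, while the stroke content stays in the library and can be loaded on demand.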
  • Step 611A Clear the AL, and return to step 605A.
  • Step 612A End.
  • the processing method of the handwritten input word provided by the embodiment facilitates the user to edit and process the handwritten character, thereby further improving the user's input experience.
  • corresponding row and column scrolling rulers may be provided to scroll or expand the input range of the panel upward, downward, leftward, or rightward, that is, the input range space of the rows and columns. Also, when a ruler is moved, the corresponding target row/column can be displayed and/or activated accordingly.
  • the function of encoding can also be added in this embodiment.
  • the coding function in this embodiment may include:
  • querying a mapping table in the encoding warehouse to obtain the standard language parameters corresponding to the glyphs.
  • the standard language parameters include any one or a combination of the following: numbers, symbols, keywords, public identifiers, and private identifiers.
  • This embodiment can implement the function of encoding characters generated during the handwriting input process, which will be described in detail below.
  • A character can refer to a handwritten character of an ideographic script, such as a single character of Chinese, Japanese, Korean, Arabic, Vietnamese, Burmese, etc., or a part thereof (such as a radical), or to a handwritten word of a phonetic script, such as Western letters or words in English, German, French, Russian, Spanish, etc. It can also be a computer character based on a traditional standard code, such as an ASCII character or a Unicode character or string, including control characters and special characters such as spaces, tabs, and line breaks. It can also refer to a non-standard control character, such as the spacing between handwritten characters described herein; to a character or string mixed or synthesized from handwritten characters and standard characters; or to any graphic or image input by the user, such as a "heart" pattern, a photo, or any graffiti, or any other written expression.
  • all character objects input in the above manner will be recognized as characters in a
  • the glyphs referred to in the present invention are similar to the concept of characters in a standard font, except that the present invention generates non-standard glyphs. Since the object of the present invention is not to generate a standard font, the glyphs produced by the system of the present invention are likely to include erroneous splits of various characters or words, or merges between them, and may also include any graphics or images input by the user.
  • Modern high-level programming languages can be divided into two types: compilation generation and interpretation execution.
  • the former is to convert the source code through a series of compilation and conversion, and generate a binary file that encapsulates the instruction sequence of the target machine (which can be a virtual machine).
  • Binary files need to be loaded into the target system for execution.
  • Interpretation execution refers to an interpreter running in the target system, which reads the source code and runs directly through a series of internal processing.
  • Interpreted languages are typically scripting languages such as JavaScript, Lua, Tcl, and so on.
  • Many traditional programming languages are compiled languages, such as C, C++, Objective-C, Java, C#, Go, Swift, and so on.
  • whether the core component that processes program source code is a compiler or an interpreter, the front-end constructs are very similar, or even identical.
  • the so-called front end refers to the conversion of source code into an internal intermediate form.
  • for a compiler, the back end converts the intermediate form into machine code; for an interpreter, the intermediate form is executed by the execution engine.
  • there is also processing and optimization of the intermediate form, which is called the mid-end.
  • the focus here is on the front-end part, so in general we do not distinguish between compilation and interpretation.
  • the front end is collectively referred to here as the compilation front end.
  • the compilation front end can generally include four processes: lexical scanning, parsing, semantic analysis, and intermediate code generation.
  • the lexical scanner converts the source code into a token stream; the parser converts the token stream into an abstract syntax tree; semantic analysis annotates the abstract syntax tree with semantic tags; and the intermediate code generator converts the annotated abstract syntax tree into the compiler's intermediate form.
  • Here, IDE stands for Integrated Development Environment.
  • the handwritten text system brings a new way of text input, which is safe and convenient.
  • the input and editing results are still character streams, encoded not in the standard code but in the personal code of the person who entered them.
  • the solution is quite straightforward: convert the personal proprietary encoding into the standard encoding. That is to say, the handwritten source code is converted into source code that can be recognized by a normal compilation front end. Therefore, a conversion step that processes the handwritten source code is placed before the traditional compilation front end, so the entire process generally includes five steps: handwritten source code conversion, lexical scanning, syntax analysis, semantic analysis, and intermediate code generation.
  • This encoding conversion process mainly converts and matches the handwritten source code according to the established rules and generates the corresponding standard code content; it depends on the glyphs in the font library.
  • the process is mainly divided into two parts: control character conversion and glyph conversion.
  • control characters in the programming language mainly include spaces, tabs, carriage returns, line feeds, and so on. Since the handwritten text can use the same or similar control characters as the normal text, this conversion is very straightforward. For example, the handwritten interval code is directly converted into a standard blank character. If the handwritten line break uses a standard line feed code directly, it can be retained without conversion.
  • the glyph conversion mainly converts the personalized glyph codes in the handwritten source code into the corresponding standard codes.
  • the basis of this conversion is the glyphs in the corresponding text font library.
  • the glyph matching service of the handwritten text system is needed. These include digital symbol mapping, keyword mapping, interface identifier mapping, and private identifier generation and mapping.
  • the digital symbol mapping is based on the user-defined glyph digital symbol mapping table, and the glyph search matching is performed in the handwritten source code, and replaced with the corresponding standard code numbers and symbols.
  • the symbols referred to herein refer to punctuation marks used in programming languages, such as addition, subtraction, multiplication and division, greater than, equal to, less than symbols, various brackets, and the like.
  • this glyph digital symbol mapping table is the key to digital symbol mapping.
  • This table is a personalized setting.
  • everyone's writing habits, strokes, and glyphs are different, so finding and matching glyphs only makes sense among the glyphs of the same person. Therefore, each programmer has his or her own glyph digital symbol mapping table, which can only map the handwritten source code written by that programmer.
  • programmers need to authorize specific users/accounts to share their glyph digital symbol mapping tables before their handwritten source code can be compiled/run by others. In fact, this extends the security of handwritten text to software development and execution.
  • the glyph digital symbol mapping table can be a many-to-one mapping. In other words, multiple glyphs can correspond to the same number and symbol.
  • the glyph digital symbol mapping table of a specific user for a specific programming language should in principle only be added to, and not deleted from or modified. Moreover, its contents cannot conflict with each other; for example, the same glyph is not allowed to correspond to different numbers or symbols.
  • numbers and symbol characters in standard codes are not composed of characters in the alphabet. Therefore, during the lexical scan of the compilation front end, symbol characters are often specially processed: a symbol can directly terminate the previous lexical token, and an identifier often cannot start with a numeric character. Similarly, we also need a special convention for handwriting in order to facilitate processing. For example, it can be agreed that numbers and symbols can only correspond to independent glyphs and cannot correspond to combinations of multiple glyphs.
  • the glyph digital symbol mapping table is generally predefined by the user.
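  • The constraints on the glyph digital symbol mapping table (personal, many-to-one, add-only, conflict-free) can be illustrated with the sketch below. The class name, the integer glyph codes, and the `owner` field are assumptions made for the example and do not come from this application.

```python
class GlyphSymbolTable:
    """Per-user, add-only, many-to-one table from glyph codes to numbers/symbols."""

    def __init__(self, owner):
        self.owner = owner            # the table belongs to one programmer
        self._symbol_by_glyph = {}

    def add(self, glyph_code, symbol):
        """Register a glyph; re-adding the same pair is allowed, conflicts are not."""
        existing = self._symbol_by_glyph.get(glyph_code)
        if existing is not None and existing != symbol:
            raise ValueError(
                f"glyph {glyph_code} already maps to {existing!r}, not {symbol!r}")
        self._symbol_by_glyph[glyph_code] = symbol

    def lookup(self, glyph_code):
        """Return the standard number/symbol for a single glyph, or None."""
        return self._symbol_by_glyph.get(glyph_code)
```

  • The class deliberately offers no delete or update method, and `lookup` accepts a single glyph only, reflecting the convention that numbers and symbols correspond to independent glyphs.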
  • the keyword mapping is also based on the mapping of the glyphs of the mapping table to the standard code.
  • This mapping table is the glyph keyword mapping table. It is a personalized many-to-one table.
  • Keywords are also crucial for the recognition and parsing of programming languages. Keywords determine the location and number of related syntax elements. Therefore, the content of the glyph keyword mapping table is generally pre-defined by the user, and can also be interactively performed during handwriting source conversion.
  • keyword mapping allows one keyword to correspond to a combination of multiple glyphs, that is, different combinations of the same glyphs can correspond to different keywords.
  • interface identifier mapping also maps glyphs to standard codes.
  • the key here is also a mapping table - glyph identifier mapping table.
  • For traditional high-level programming languages, there are more or less built-in or third-party libraries. We need to use the corresponding identifiers to access system constants, system functions, standard library functions, class libraries, and so on. These identifiers are often composed of standard code characters.
  • the glyph identifier mapping table is a mapping table between the user's handwriting and the corresponding identifier. In addition, some of the symbols in the handwritten code may also become interfaces - used and accessed by others, in which case we also need to provide the corresponding standard code identifier.
  • the set of target keywords (including system punctuation) mapped to is a well-defined closed, finite set.
  • the target identifier set is an infinite, open collection.
  • the content of the glyph identifier can be pre-defined by the user or interactively during handwritten source conversion.
  • the glyph private identifier mapping table is automatically generated by the system.
  • the content of this mapping table is the correspondence between the glyphs of the above defined symbols and the corresponding generated standard code identifiers.
  • In our handwritten text scheme, handwritten text encoding and standard encoding are allowed to be mixed within the same content. In the processing of handwriting programming, we also allow such content. During source code conversion, the standard code parts are skipped directly, and no conversion is performed on them. Here, in order to prevent mutual interference between the standard code generated from the handwritten text and the original standard code, the conversion process needs to insert a blank character between standard text and non-control-character handwritten text that are directly adjacent to each other.
  • handwritten text is not constrained by the glyphs of standard coded text, and the user can use any glyph or symbol. So in handwriting programming, we can use any glyph or symbol as a keyword or identifier. In doing so, however, we need to pay attention to conflicts between keywords and identifiers. If an identifier uses the same glyph as a certain keyword, the conversion will often result in a syntax error. By using special glyphs or symbols for keywords, we can avoid this conflict.
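  • The handling of mixed content can be sketched as follows, under stated assumptions: the source is modeled as a list of hypothetical `(kind, value)` tokens, where `'std'` tokens carry standard text and `'hand'` tokens carry a glyph code, and `glyph_map` stands in for the mapping tables. Treating a boundary that already has whitespace on either side as needing no extra blank is also an assumption of this sketch.

```python
def convert_mixed(tokens, glyph_map):
    """Convert mixed handwritten/standard content to standard text.

    tokens:    list of ('std', text) or ('hand', glyph_code).
    glyph_map: dict from glyph code to standard text (hypothetical).
    """
    out = []
    prev_kind = None
    for kind, value in tokens:
        # Standard parts are passed through; handwritten parts are mapped.
        text = value if kind == "std" else glyph_map[value]
        at_boundary = prev_kind is not None and prev_kind != kind
        if (at_boundary and out[-1]
                and not out[-1][-1].isspace()
                and not text[:1].isspace()):
            # Keep converted handwritten text from fusing with standard text.
            out.append(" ")
        out.append(text)
        prev_kind = kind
    return "".join(out)


mixed = [("std", "local x"), ("hand", 7), ("std", "1"), ("hand", 9), ("hand", 8)]
converted = convert_mixed(mixed, {7: "=", 8: "y", 9: "\n"})
```

  • In the example, a blank is inserted around the handwritten "=" but not before the handwritten line break, which is a control character.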
  • FIG. 1I is a flowchart of a handwriting program source code conversion method in a method for processing handwritten input characters according to an embodiment of the present invention.
  • FIG. 1J is a detailed flowchart of “standard code conversion for B” in the handwriting program source code conversion method shown in FIG. 1I.
  • the entire conversion process has five inputs: a handwritten program source file, a handwritten character library, a glyph numeric symbol mapping table, a glyph keyword mapping table, and a glyph interface identifier mapping table.
  • the glyph private identifier mapping table is only needed during the conversion process and can be left unused.
  • the source target location mapping table is very important, because the compilation or interpretation that follows the conversion takes the generated standard code object file as input, and the corresponding system information is also given in terms of location information in that text file. With the source target location mapping table, we can directly convert this information into the corresponding location within the handwritten source file. This provides the foundation for our entire handwriting programming environment and related aids.
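  • One possible shape of such a location mapping is sketched below, as an assumption only: while each source glyph is converted, the offset at which its standard text starts in the generated output is recorded, and a binary search later maps an output offset back to the index of the source glyph. The function names and the sample glyph codes are hypothetical.

```python
import bisect


def convert_with_locations(glyph_codes, glyph_map):
    """Convert glyph codes to standard text and record where each one lands.

    Returns (text, starts), where starts[i] is the offset in `text`
    at which the output of source glyph i begins.
    """
    parts, starts, offset = [], [], 0
    for code in glyph_codes:
        text = glyph_map[code]
        starts.append(offset)
        parts.append(text)
        offset += len(text)
    return "".join(parts), starts


def source_index(starts, target_offset):
    """Map an offset in the generated text back to the source glyph index."""
    return bisect.bisect_right(starts, target_offset) - 1


sample_map = {1: "local", 2: " ", 3: "__17_42", 4: "=", 5: "10"}
target_text, target_starts = convert_with_locations([1, 2, 3, 2, 4, 2, 5], sample_map)
```

  • With this table, a position reported against the generated text (for instance by a compiler or interpreter) can be translated into a position in the handwritten source.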
  • the output is mainly a standard code program text file.
  • the conversion process can be integrated with the existing compilation front end, and the process of writing a file can be skipped, and a standard code character stream is generated in the memory for further processing.
  • the previous conversion process assumes that the glyph interface identifier mapping table is pre-defined.
  • the optimized conversion process can generate intermediate files (including complete numeric symbol and keyword conversions) without the glyph identifier mapping table, and then handle handwritten identifiers intelligently according to the results of lexical analysis, parsing, and semantic analysis.
  • a processing rule can be employed: for a handwritten symbol that has a definition, its standard code identifier is automatically generated; for an undefined handwritten symbol, the user is queried interactively for its identifier definition, and the glyph interface identifier mapping table is generated automatically according to the user input.
  • when the compiler is deeply integrated inside the handwritten text editor, functions such as syntax coloring and syntax-aware assistance can also be implemented, so as to finally realize an integrated development environment based on handwritten characters.
  • FIG. 1K is a schematic diagram of a handwriting program in an embodiment of a method for processing handwritten input characters according to the present invention.
  • the handwriting program in Fig. 1K corresponds to the programming language Lua language, which is an embedded scripting language.
  • the corresponding font library code can be as shown in Table 1, Table 2 and Table 3.
  • before the code is converted, the user prepares the glyph digital symbol mapping table as shown in Table 4.
  • the glyph keyword mapping table is shown in Table 5.
  • the glyph interface identifier mapping table is shown in Table 6.
  • the system sets a syntax interval threshold of 20.
  • the private identifier auto-generation rule is two underscores (_) followed by a glyph code sequence connected by an underscore.
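  • The auto-generation rule for private identifiers can be written down directly. The token-splitting part of the sketch below rests on an assumption: the application does not spell out how the syntax interval threshold is applied, so the sketch supposes that a blank at least as wide as the threshold separates two tokens, while narrower blanks stay inside one identifier. The item representation and the glyph codes are hypothetical.

```python
SYNTAX_INTERVAL_THRESHOLD = 20  # value set by the system in this example


def private_identifier(glyph_codes):
    """Two underscores followed by the glyph codes joined by underscores."""
    return "__" + "_".join(str(code) for code in glyph_codes)


def split_tokens(items, threshold=SYNTAX_INTERVAL_THRESHOLD):
    """Group glyph codes into tokens (assumed interpretation of the threshold).

    items: list of ('glyph', code) or ('blank', width).
    """
    tokens, current = [], []
    for kind, value in items:
        if kind == "glyph":
            current.append(value)
        elif value >= threshold and current:
            # A wide blank closes the current token; narrow blanks are ignored.
            tokens.append(current)
            current = []
    if current:
        tokens.append(current)
    return tokens


handwritten = [("glyph", 17), ("blank", 8), ("glyph", 42),
               ("blank", 25), ("glyph", 5)]
identifiers = [private_identifier(t) for t in split_tokens(handwritten)]
```

  • Identifiers of this form are valid in Lua, since they consist only of underscores and digits and start with an underscore.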
  • the first identifier is actually comment content and carries no meaning. If we use the optimized conversion process, we can skip its conversion directly once it is identified as a comment.
  • This generated program can be interpreted and executed normally by the traditional Lua interpreter, and its execution semantics are exactly the same as those in the handwritten source code.
  • the method may further include:
  • according to a preset metadata stripping protocol, the metadata of the saved handwritten text is obtained, and the obtained metadata is stripped from the handwritten text;
  • the handwritten text is divided into at least two pieces of data according to a preset data content splitting specification.
  • the method may further include:
  • the encoding repository selects or creates an encoding specification according to at least a portion of the metadata, generates a meta encoding corresponding to the metadata according to the encoding specification, encodes the handwritten text according to the encoding specification to obtain an example encoding, and acquires a text encoding corresponding to the handwritten text according to the meta encoding and the example encoding; and
  • receiving the text encoding returned by the encoding repository, the text encoding being in a reference code form or a content code form.
  • for the data splitting procedure, refer to the detailed description of the data splitting method embodiments in this specification.
  • for the encoding procedure, refer to the detailed description of the encoding processing method embodiments later in this specification; it is not repeated here.
  • FIG. 1L is a schematic structural diagram of an embodiment of a device for processing handwritten input characters according to the present invention.
  • the processing device for handwriting input characters in this embodiment may include:
  • an acquisition module 1001A, configured to collect, in the currently activated first target row/column, a stroke input by the user and the corresponding input information, wherein the input information includes the input position of the stroke in the first target row/column;
  • an attribution module 1002A, configured to, for each stroke, create a new character for the stroke or determine the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column.
  • the handwriting input character processing device in this embodiment may be used to perform the method for processing the handwritten input character shown in FIG. 1A.
  • the specific implementation principle may refer to the foregoing embodiment, and details are not described herein again.
  • the handwriting input character processing apparatus acquires, in the currently activated first target row/column, a stroke input by the user and the corresponding input information, and, according to the input position of the stroke in the first target row/column, or according to that input position and the characters specified in the first target row/column, creates a new character for the stroke or determines the character to which the stroke belongs.
  • this achieves typesetting while inputting: the user does not need to separate characters by means of explicit or implicit "start single text input" or "end single text input" commands.
  • the writing process is smooth and efficient, and the character to which a stroke belongs is determined directly from the input position of the stroke, without standardized character recognition, thus retaining the personalized information, writing style and features of the user's handwriting input.
  • the acquisition module 1001A is further configured to:
  • the row height/column width information is a default value or determined by the user input
  • the position range of each row/column refers to the relative top-edge and bottom-edge positions of each row in the handwriting input screen, or the relative left-edge and right-edge positions of each column in the handwriting input screen;
  • receive a target row/column selection message input by the user, where the target row/column selection message includes an identifier of the target row/column to be input by the user;
  • a row/column corresponding to the identifier of the target row/column to be input by the user is used as the currently activated first target row/column.
  • the acquisition module 1001A is further configured to:
  • the position range refers to a relative top side position and a bottom side position of the first target line in the handwriting input screen or a relative left side position and a right side position of the first target column in the handwriting input screen.
  • the acquisition module 1001A is further configured to:
  • the second target row/column is the currently activated target row/column, and the second target row/column is the next row/column of the first target row/column.
  • the acquisition module 1001A is further configured to:
  • the second target row/column is set as the currently activated target row/column, to enable acquisition of the strokes input by the user in the second target row/column;
  • the second target row/column is the next row/column of the first target row/column.
  • the acquisition module 1001A is further configured to:
  • the first target row/column and the second target row/column are simultaneously the currently activated target rows/columns;
  • the second target row/column is the next row/column of the first target row/column.
  • when the first target row/column and the second target row/column are simultaneously the currently activated target rows/columns, both are only partially activated;
  • a starting position of the active area of the first target row/column is set between an end position of an active area of the second target row/column and an end position of an active area of the first target row/column.
  • the attribution module 1002A is specifically configured to:
  • the stroke is associated with at least one character
  • the stroke is attributed according to the associated at least one character.
  • the specified character is all characters that are already in the first target row/column
  • the specified character is a character in the area to be compared in the first target row/column, wherein a distance between a boundary position of the area to be compared and the stroke is less than a second preset threshold.
  • determining the relevance between the stroke and the character may include:
  • comparing the input position of the stroke in the first target row/column with the position information corresponding to a character specified in the first target row/column, and determining the relevance between the stroke and the character, may include:
  • If there are at least two characters associated with the stroke, the at least two characters are merged and the stroke is attributed to the merged character.
  • the performing the attribution processing on the stroke according to the at least one associated character may include:
  • If at least two characters have the strongest association with the stroke, the at least two characters are merged, and the stroke is attributed to the merged character.
  • The at least one character associated with the stroke is sorted by distance in ascending order, and the character with the minimum distance is taken as the character most strongly associated with the stroke; or,
  • The at least one character associated with the stroke is sorted, and the first character is taken as the character most strongly associated with the stroke.
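  • The distance-based attribution described above can be sketched as follows. This is a hedged illustration only: the `Character` type, the horizontal-gap distance measure, the `threshold` parameter, and the function names are all assumptions chosen for the example, and merging of tied characters is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    left: float                      # left edge of the character in the row
    right: float                     # right edge of the character in the row
    strokes: list = field(default_factory=list)


def gap(stroke_left, stroke_right, ch):
    """Horizontal gap between a stroke and a character (0 if they overlap)."""
    return max(0.0, ch.left - stroke_right, stroke_left - ch.right)


def attribute_stroke(stroke_id, stroke_left, stroke_right, row, threshold):
    """Attach the stroke to the nearest associated character, or create a new one."""
    # A character counts as "associated" when its gap is within the threshold.
    associated = [ch for ch in row if gap(stroke_left, stroke_right, ch) <= threshold]
    if associated:
        # Sort by distance in ascending order and take the closest character.
        associated.sort(key=lambda ch: gap(stroke_left, stroke_right, ch))
        target = associated[0]
        target.left = min(target.left, stroke_left)
        target.right = max(target.right, stroke_right)
    else:
        target = Character(stroke_left, stroke_right)
        row.append(target)
    target.strokes.append(stroke_id)
    return target


row = []
first = attribute_stroke("s1", 0, 10, row, threshold=3)
second = attribute_stroke("s2", 8, 14, row, threshold=3)   # overlaps the first character
third = attribute_stroke("s3", 30, 40, row, threshold=3)   # far away: new character
print(len(row), first.strokes, third.strokes)
```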
  • the acquisition module 1001A is further configured to:
  • the attribution module 1002A can be specifically configured to:
  • the stroke is attributed to an existing character in the composition; otherwise, a new character is created in the composition, the stroke being attributed to the new character.
  • the acquisition module 1001A is further configured to:
  • the characters to be searched are compared with the locally saved characters according to the number of strokes of the character to be searched and the stroke feature, and characters matching the characters to be searched are obtained.
  • the acquisition module 1001A is further configured to:
  • the new character created for, or the attribution determined for, each acquired stroke is saved at preset intervals;
  • the acquisition module 1001A is also used to:
  • the saved characters are stored in the second memory, and for each saved character, the characters include a stroke constituting the character and an index corresponding to the stroke;
  • the index corresponding to the stroke points to the input information corresponding to the stroke in the first memory.
  • the input information corresponding to the stroke further includes one or a combination of the following: an input time of the stroke, an input strength of the stroke, and an input speed of the stroke.
  • the input time includes the pen-down time and pen-up time of the stroke, and the dwell time at each point in the handwriting of the stroke;
  • the input position includes at least: a position when the pen is dropped, a position when the pen is lifted, and a coordinate position of each point in the handwriting of the stroke.
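  • The two-level storage described above (raw stroke input information in a first memory, characters that reference strokes by index in a second memory) can be sketched as follows. The class and field names are hypothetical, and in-memory lists stand in for whatever storage the embodiment actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class StrokeInput:
    points: list              # (x, y) coordinates along the handwriting
    pen_down_time: float
    pen_up_time: float
    pressure: float = 0.0     # optional input strength


@dataclass
class SavedCharacter:
    stroke_indexes: list = field(default_factory=list)


class HandwritingStore:
    def __init__(self):
        self.first_memory = []    # raw input information, one entry per stroke
        self.second_memory = []   # saved characters holding stroke indexes

    def add_stroke(self, stroke):
        """Save the stroke's input information and return its index."""
        self.first_memory.append(stroke)
        return len(self.first_memory) - 1

    def add_character(self, stroke_indexes):
        """Save a character that points at its strokes by index."""
        character = SavedCharacter(list(stroke_indexes))
        self.second_memory.append(character)
        return character

    def strokes_of(self, character):
        """Follow the indexes back to the input information in the first memory."""
        return [self.first_memory[i] for i in character.stroke_indexes]


store = HandwritingStore()
i0 = store.add_stroke(StrokeInput([(0, 0), (5, 5)], 0.00, 0.20))
i1 = store.add_stroke(StrokeInput([(5, 0), (0, 5)], 0.35, 0.50))
ch = store.add_character([i0, i1])
print(len(store.strokes_of(ch)))
```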
  • the acquisition module 1001A is further configured to:
  • the correction request including a character to be corrected, or a character to be corrected and a stroke to be corrected;
  • the correction request is a merge correction request, and the character to be corrected is at least two characters to be merged;
  • the correcting processing is performed on the character to be corrected according to the correcting request, including:
  • the correction request is a split correction request
  • the character to be corrected is a character to be split
  • the correcting processing is performed on the character to be corrected according to the correcting request, including:
  • the correction request is an attribution correction request
  • the character to be corrected is the character to which strokes are to be attributed
  • the stroke to be corrected is at least one stroke to be corrected
  • the correcting processing is performed on the character to be corrected according to the correcting request, including:
  • the at least one stroke to be corrected is attributed to the character to which strokes are to be attributed.
  • the acquisition module 1001A is further configured to:
  • the insertion request including a target row/column to be inserted, a to-be-inserted position in the target row/column to be inserted, and a character to be inserted;
  • the acquisition module 1001A is further configured to:
  • the selection processing command includes any one or a combination of the following: performing copy processing on the at least one character, performing cut processing on the at least one character, and performing replacement processing on the at least one character And performing a merge process on the at least one character.
  • the number of the first target rows/columns is plural;
  • the active areas corresponding to the plurality of the first target rows/columns do not overlap and are not in contact with each other.
  • the acquisition module 1001A is further configured to:
  • the handwriting mode is switched to the target mode, and in the target mode, at least one standard character input by the user is received.
  • the acquisition module 1001A is further configured to:
  • mapping table in the encoding warehouse to obtain the standard language parameters corresponding to the glyphs.
  • the standard language parameters include one or a combination of the following: numbers, symbols, keywords, public identifiers, and private identifiers.
  • the data splitting of the present invention is a solution that can effectively solve the above problems.
  • FIG. 2A is a flowchart of a data splitting method according to an exemplary embodiment. As shown in FIG. 2A, the present invention provides a data splitting method, including:
  • Step 101B When a storage request carrying an identifier of the data to be stored is received, obtain, according to a preset metadata stripping protocol, the metadata in the data object corresponding to the identifier of the data to be stored.
  • Step 102B Strip the acquired metadata from the data object.
  • Step 103B Divide the data content into at least two data segments according to a preset data content splitting protocol.
  • the method may further include:
  • Step 104B Store the metadata and each data segment separately in different storage bodies or in different secure channels.
  • In the data splitting method of this embodiment, when a storage request carrying the identifier of the data to be stored is received, the metadata in the data object corresponding to the identifier is obtained according to the preset metadata stripping protocol and stripped from the data object; the data content is divided into multiple data segments according to the preset data content splitting protocol; and the metadata and each data segment are stored separately in different storage bodies or in different secure channels.
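  • Steps 101B to 104B can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the data object is modelled as a dictionary with `attributes` and `content`, the stripping protocol is reduced to a list of attribute names, splitting is by fixed segment size, and plain dictionary entries stand in for separate storage bodies. None of these names come from the patent.

```python
def strip_metadata(data_object, keys_to_strip):
    """Steps 101B/102B: pull the agreed attributes out of the data object."""
    attributes = data_object["attributes"]
    metadata = {k: attributes[k] for k in keys_to_strip if k in attributes}
    remaining = {k: v for k, v in attributes.items() if k not in metadata}
    return metadata, remaining


def split_content(content, segment_size):
    """Step 103B: cut the content into fixed-size segments (at least two)."""
    segments = [content[i:i + segment_size] for i in range(0, len(content), segment_size)]
    if len(segments) < 2:
        raise ValueError("content must yield at least two segments")
    return segments


def store_separately(metadata, segments):
    """Step 104B: place the metadata and each segment in a different store."""
    stores = {"metadata_store": metadata}
    for n, segment in enumerate(segments):
        stores[f"segment_store_{n}"] = segment
    return stores


data_object = {
    "attributes": {"name": "report", "type": "txt", "size": 11},
    "content": b"hello world",
}
metadata, remaining = strip_metadata(data_object, ["type"])
segments = split_content(data_object["content"], 4)
stores = store_separately(metadata, segments)
print(sorted(stores))
```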
  • FIG. 2B-1 is a flowchart of a data splitting method according to another exemplary embodiment. As shown in FIG. 2B-1, the present invention provides a data splitting method, including:
  • Step 201B Receive a storage request carrying an identifier of the data to be stored.
  • the data splitting method may be applied to a device such as a terminal (client device) or a network (server device).
  • the device receives a storage request carrying an identifier of the data to be stored.
  • the storage request may be triggered by a terminal application, for example, a mail system, a desktop agent, or another application. The mail system is taken as an example here.
  • when the mail system sends file data, it receives the storage request carrying the identifier of the data to be stored, and the data splitting device of the mail system first splits the file data, so that the recipient of the mail needs to obtain the file data fragments from each specified storage body to get the complete file data.
  • the storage request may also be triggered by the user.
  • the data splitting device receives the storage request carrying the identifier of the data to be stored, and then splits the file.
  • the identifier of the data to be stored may be the name of the file data, or identifying information such as a Message Digest Algorithm 5 (MD5) code.
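  • As a small illustration of an identifier built from a file name and an MD5 code, the sketch below uses Python's standard `hashlib` module. The function name `data_identifier` and the dictionary layout are assumptions made for the example.

```python
import hashlib


def data_identifier(name, content):
    """Return an identifier made of the file name and the MD5 digest of the content."""
    return {"name": name, "md5": hashlib.md5(content).hexdigest()}


print(data_identifier("greeting.txt", b"hello"))
```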
  • Step 202B If the metadata specified in the preset metadata stripping protocol includes: attribute information, determine, in the data object corresponding to the data identifier to be stored, the attribute information content that matches the attribute information as metadata.
  • the process of stripping metadata separates the metadata of the data object, especially the key metadata, from its original location in the data object, so that the complete data cannot be obtained from the remaining data content and/or other metadata alone.
  • the key metadata is security-related metadata. Once these key metadata are missing, the system will not be able to read, identify, decode or restore the corresponding data objects.
  • for example, the file type is key metadata: once the file extension is removed, the system cannot open the file content normally.
  • Storing file type information and file content data in different cloud storages will cause certain difficulties for malicious attackers or service providers to obtain complete data.
  • Different types of data have different key metadata.
  • for example, for tabular data (a spreadsheet, a database table, etc.), the header field names are key metadata.
  • metadata can also cover a wider range. As long as it benefits the security of the data, any information related to the data content can be separated from the data content itself as metadata.
  • the metadata includes: attribute information; the attribute information is information capable of identifying a unique property of the data object, and consists of descriptive information that helps find and open the data object. Attributes are not part of the actual content (data content) of the data object, but rather provide information about the data object. They can include information such as the size of the data object, the type of data, the creation and modification dates, the author, and the rating. Since the attribute information can be set by a person skilled in the art according to the nature of the data object, the content listed above is only an example, and is not a limitation on the content of the attribute information.
  • if the metadata agreed in the preset metadata stripping protocol includes a data content identifier and a keyword, the data content matching the keyword is determined as metadata from the data content in the data object according to the data content identifier.
  • the data content identifier is used to prompt the extraction location of the metadata from the data content portion, and the keyword is used to indicate the data content that needs to be extracted specifically; the data content matched with the keyword may be key information or sensitive information contained in the data.
  • a number of keywords associated with account information can be set, so that sensitive information in the account is extracted and stored as metadata, for example: account number, user ID, user phone number, address, etc.
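  • Keyword-based stripping of sensitive content can be sketched as follows. The example assumes, purely for illustration, that the data content is a list of "key: value" lines and that the keywords are matched case-insensitively against the key; the function name `strip_by_keywords` is hypothetical.

```python
def strip_by_keywords(lines, keywords):
    """Move lines whose key matches a keyword into metadata; keep the rest as content."""
    metadata, remaining = {}, []
    for line in lines:
        key, sep, value = line.partition(":")
        normalized = key.strip().lower()
        if sep and normalized in keywords:
            metadata[normalized] = value.strip()
        else:
            remaining.append(line)
    return metadata, remaining


statement = [
    "Account number: 6222-0001",
    "User ID: u42",
    "Balance: 100.00",
    "Note without separator",
]
metadata, remaining = strip_by_keywords(statement, {"account number", "user id"})
print(metadata)
print(remaining)
```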
  • if the metadata agreed in the preset metadata stripping protocol includes attribute information, a data content identifier, and a keyword, the attribute information content matching the attribute information in the data object is determined as metadata, and, according to the data content identifier, the data content matching the keyword is determined as metadata from the data content in the data object.
  • the strategy for generating the default metadata stripping protocol can be determined by the developer, or it can allow the user to define the applicable protocol.
  • in the latter case, the system needs to present the metadata to the user as comprehensively as possible, so that the user can preset the most suitable protocol based on that information.
  • the preset metadata stripping protocol is built into the data splitting system. As in the previous mail client example, the preset metadata stripping protocol can be built into the mail system application.
  • the preset metadata stripping protocol may also be stored with the metadata as part of the metadata content, so that when the recipient merges the data, the data object is merged with reference to the preset metadata stripping protocol.
  • the attachment file (data object) to be sent is split, and the metadata of the attachment file may be: file name, file type, file size, creation time, and the like.
  • the result of file metadata stripping is stored in the file meta information system.
  • the method of dividing the file content and the segmentation result information, such as the hash value or ID of each file fragment and the storage location of each file fragment, are also stored in the file meta information system and associated with the corresponding file metadata.
  • all of the content stored in the file meta-information system constitutes an example of this split/peel protocol.
  • Step 203B Detach the acquired metadata from the data object.
  • Stripping, also referred to as splitting off, refers to selecting and separating from the data object the metadata that is relevant to the data object's splitting/stripping processing.
  • the system will separate the metadata from the data object based on the default metadata stripping protocol (which can be system default or user-selected or user-defined).
  • the protocol records information such as rules, constraints, and methods related to metadata splitting/stripping, for example but not limited to: the stripping location of the metadata, the stripping method of the metadata, the encoding scheme, information related to code stripping, content splitting rules, and other data and/or information related to content splitting.
  • the metadata may be a complete set or a subset of the metadata of the data object.
  • for the types of metadata, refer to the cases described in step 202B above.
  • existing approaches split data by, for example, dividing a data object into multiple segments according to a predetermined rule and saving them separately.
  • however, such an approach cannot achieve finer-grained protection, and cannot separate the important information (metadata) closely related to the data object from the data content itself.
  • the invention adopts a new data splitting method to split data objects. This method not only splits the data object at a finer granularity (for example, by character or even by bit), but can also strip important information (i.e., metadata) closely related to the data object from the data content itself.
  • the stripped metadata, the data content, and/or the codes described later can be stored separately in different storage locations or spaces, or under different secure channels, thereby making data storage more reliably secure.
  • Step 204B Divide the data content into at least two data segments according to a preset data content splitting protocol.
  • Content splitting refers to dividing the data content in a data object into several (more than one) segments according to certain rules.
  • figuratively, it is like tearing a piece of paper into pieces.
  • content splitting is not necessary, and can be determined according to actual needs. Applications that do not require high content confidentiality may not be split.
  • the content splitting method can use RAID disk array technology to divide data into multiple blocks and write multiple disks in parallel to improve the read and write speed and throughput of the disk.
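  • The block-striping idea behind the RAID approach mentioned above can be sketched as follows. This is a simplified round-robin illustration in the spirit of RAID 0, with Python lists standing in for disks; it has no parity or redundancy, and the function names `stripe` and `unstripe` are assumptions.

```python
def stripe(data, disk_count, block_size):
    """Cut data into blocks and deal them out to the disks in round-robin order."""
    disks = [[] for _ in range(disk_count)]
    for n, start in enumerate(range(0, len(data), block_size)):
        disks[n % disk_count].append(data[start:start + block_size])
    return disks


def unstripe(disks):
    """Reassemble the original data by reading one block from each disk in turn."""
    blocks = []
    longest = max(len(disk) for disk in disks)
    for i in range(longest):
        for disk in disks:
            if i < len(disk):
                blocks.append(disk[i])
    return b"".join(blocks)


disks = stripe(b"ABCDEFGHIJ", disk_count=3, block_size=2)
print(disks)
print(unstripe(disks))
```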
  • Content splitting can be divided into domain-related content splitting and domain-independent content splitting.
  • Domain-related content splitting splits the data mainly according to the characteristics of data in a specific domain, for example, structural splitting for specific file formats, or splitting out key or sensitive information within the data. The latter may overlap somewhat with metadata stripping (when the metadata is inside the data).
  • the bank's statement can be stripped of the account information as metadata, or the account information can be split as a data segment for split storage.
  • the preset data content splitting protocol may include at least one of a disk array (RAID) splitting algorithm and an Information Dispersal Algorithm (IDA).
  • Algorithm researcher Michael O. Rabin first proposed the Information Dispersal Algorithm (IDA) in 1989 to slice data at the bit level so that it is unrecognizable while being transmitted or stored in an array; only a user/device with the correct key can access it. The information is reassembled when accessed with the correct key.
  • IDA and related derivative algorithms have been widely used.
  • Step 205B Perform separation processing on each data segment according to a preset encoding separation specification to obtain a code corresponding to each data segment.
  • encoding each data segment separately to obtain a code corresponding to each data segment includes:
  • querying an encoding warehouse according to a preset encoding separation protocol, selecting or creating an encoding specification according to at least a part of the metadata, and generating a meta-code corresponding to the metadata according to the encoding specification; and encoding each data segment according to the encoding specification to obtain an instance code corresponding to each data segment; or,
  • transmitting, according to the preset encoding separation protocol, each data segment and the metadata to the encoding warehouse, so that the encoding warehouse selects or creates an encoding specification according to at least a part of the metadata, generates a meta-code corresponding to the metadata according to the encoding specification, and encodes the respective data segments according to the encoding specification to obtain instance codes; and receiving the meta-code and the instance codes returned by the encoding warehouse.
  • Step 206B Arrange the respective codes according to the original order of the data segments in the data content to obtain the coded arrangement order information.
  • the data splitting method of the present invention covers two different data processing means, one is the stripping of metadata and encoding, and the other is the splitting of data content.
  • the stripping of metadata has been explained in the foregoing.
  • code stripping here refers to splitting the data content into n data blocks, storing the n blocks separately, and obtaining a corresponding code (number) for each block. The codes (numbers), which may repeat, are arranged in the order in which the blocks appear.
  • This encoding (numbering) sequence contains the encoding information as well as the encoding ordering information, and the encoding result can be stored in another secure channel.
  • the code is distinct from the data fragments described earlier, so splitting it out can be called stripping.
  • the metadata portion and/or the encoded portion are further split processed to achieve a finer-grained protection effect.
  • the above-mentioned stripping and splitting can be combined arbitrarily, depending on system requirements and processing capabilities.
  • code stripping is based on content splitting; that is, some or all of the data content is split according to certain rules, and the addressing of each split piece of data is encoded.
  • the final encoded result is formed into separate data.
  • reference codes for data are ubiquitous. Examples include the key (address) of a data record in a database; a shortened URL (http://dwz.cn/mzot4) used for URL input and reference; the access identifiers used in cloud storage programming interfaces (APIs); and so on.
  • These encoding methods can all be used by the encoding mentioned above. If the result of the splitting of the data part is encoded, the encoded result will replace the original corresponding data.
  • the encoding need not be based on content splitting. For example, for data with a low security level, it is not necessary to split the data content. In that case it is sufficient to give the entire data content a single code if necessary, but it may still be necessary to separate the code from the data content. It can be seen that the code stripping of this embodiment is neither the traditional content splitting nor the existing data reference encoding, but a combination of the two. As long as the encoding result (including the codes themselves and their arrangement order) is separated from the data content, the security risk of the data can be reduced to some extent. For example, given the 6 bytes of data ACBDAC, the data is split into two-byte pieces and stored in the database: AC returns code 1, and BD returns code 2. The encoding result of this data is the sequence 1, 2, 1, not just the codes 1 and 2. Here the numbers 1 and 2 are the codes, and the arrangement 1, 2, 1 is the coded arrangement order information.
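  • The ACBDAC example can be reproduced with the short sketch below. An in-memory dictionary stands in for the database or encoding warehouse, and codes are assigned as consecutive integers in order of first appearance; both choices, and the function names, are assumptions made for illustration.

```python
def encode_segments(data, segment_size):
    """Split data into segments, assign each distinct segment a code, and record the order."""
    code_table = {}   # segment -> code; stands in for the database / encoding warehouse
    sequence = []     # coded arrangement order information
    for start in range(0, len(data), segment_size):
        segment = data[start:start + segment_size]
        code = code_table.setdefault(segment, len(code_table) + 1)
        sequence.append(code)
    return code_table, sequence


def decode_segments(code_table, sequence):
    """Rebuild the data from the code table and the arrangement order."""
    by_code = {code: segment for segment, code in code_table.items()}
    return b"".join(by_code[code] for code in sequence)


code_table, sequence = encode_segments(b"ACBDAC", 2)
print(code_table)
print(sequence)
print(decode_segments(code_table, sequence))
```

  • Neither the code table alone nor the sequence alone is enough to rebuild the data, which is why storing them apart from each other and from the data segments reduces exposure.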
  • the above-mentioned metadata, encoding, and data content stripping/split methods are not mutually exclusive, and they can be used in combination.
  • the other metadata are put together, as long as they are separated from the data content portion; more preferably, the three parts (metadata, encoding parts, data content) are separated according to their respective splitting rules.
  • steps 202B to 206B do not impose an order on content splitting, metadata stripping, and code stripping; these may be performed one after another or simultaneously.
  • the encoding operation of the present invention needs to be performed during or after the content splitting process.
  • the metadata stripping can be done before the content is split, or after the content splitting and code assignment are completed.
  • other data processing methods such as data compression, encryption, and the like may be mixed. It is also possible to add a description of compression and encryption to the various protocols mentioned above, but at this time it is best to re-execute after performing compression and/or encryption.
  • the split step for the metadata is also possible.
  • Step 207B Store the metadata, the code corresponding to each data segment, and the coded sequence information into different storage banks or different secure channels.
  • if the metadata agreed in the preset metadata stripping protocol includes a data object identifier, obtaining, according to the preset metadata stripping protocol, the metadata in the data object corresponding to the identifier of the data to be stored includes: parsing the data object to generate a data object identifier uniquely corresponding to the data object.
  • when the data object is audio data, step 204B, dividing the data content into at least two data segments according to the preset data content splitting protocol, may include: splitting the audio data by a time domain analysis method or a frequency domain analysis method to obtain audio data objects to be encoded, wherein the audio data objects to be encoded include sound wave segments and/or silent segments.
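  • One simple time-domain way to separate sound segments from silent segments is to compare the average magnitude of each frame against a threshold, as sketched below. The frame size, the threshold, and the function name `split_audio` are illustrative assumptions; the patent does not specify the analysis method at this level of detail.

```python
def split_audio(samples, frame_size, threshold):
    """Group consecutive frames into 'sound' and 'silence' segments by average magnitude."""
    segments = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        level = sum(abs(s) for s in frame) / len(frame)
        kind = "sound" if level >= threshold else "silence"
        if segments and segments[-1][0] == kind:
            segments[-1][1].extend(frame)      # extend the current segment
        else:
            segments.append([kind, list(frame)])  # start a new segment
    return [(kind, frame) for kind, frame in segments]


samples = [0, 0, 0, 0, 50, -60, 55, -40, 0, 1, 0, 0]
segments = split_audio(samples, frame_size=4, threshold=10)
print([kind for kind, _ in segments])
```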
  • voice data and related processing have always been second-class citizens. This is mainly due to the current input, storage and processing methods for voice data and the corresponding technical limitations. People now mainly process and use voice input through computers and networks in two ways: voice calls and speech recognition.
  • Voice call mainly refers to converting the voice signal produced by a person into a digital signal through a computer sound capture device, which is then processed, transmitted and stored through computers and a computer network or communication network (here mainly packet-switched voice technology, such as VoLTE; circuit-switched voice technology is unrelated to the problem discussed here), and finally played back through a digital audio playback device.
  • Voice calls can be real-time or non-real-time; they can be one-way or two-way.
  • the main problem with current voice calls is the large amount of data, which is not easy to transfer and store.
  • the current audio sampling rates of sound cards are mainly 11 kHz, 22 kHz, and 44.1 kHz.
  • the sound obtained at 11 kHz is called telephone sound quality (the telephone uses an 8 kHz sampling rate), which is basically enough to recognize the speaker's voice; 22 kHz is called broadcast sound quality; 44.1 kHz is CD sound quality.
  • Another sampling parameter is the sampling resolution, which refers to the precision with which the magnitude of the sound signal (generally the amplitude of the sound wave) is represented.
  • the common resolutions are 8 bits and 16 bits: 8 bits can divide the sound signal into 256 levels, and 16 bits can divide it into more than 60,000 levels. It can be calculated that 1 second of 8-bit stereo (left and right channel) audio sampled at 11 kHz occupies about 22 KB, which is equivalent to the amount of data in more than 10,000 Chinese characters.
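  • The figures above follow from simple arithmetic: uncompressed size equals sampling rate times bytes per sample times channels times duration. The sketch below checks them; the function name `pcm_bytes` is an assumption, and 11 kHz is taken as the nominal 11,000 samples per second.

```python
def pcm_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Size in bytes of uncompressed PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * seconds


print(2 ** 8, 2 ** 16)                    # quantization levels for 8 and 16 bits
print(pcm_bytes(11_000, 8, 2, 1))         # 8-bit stereo at 11 kHz for 1 second
print(pcm_bytes(11_000, 8, 2, 1) // 2)    # characters at 2 bytes per Chinese character
```

  • The last line assumes a 2-byte encoding per Chinese character, which gives 11,000 characters and matches the "more than 10,000" estimate.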
  • Audio files such as MP3, WMA, MOV, etc.
  • network streaming protocols such as RTSP, MMS, RTP, RSVP, etc.
  • Speech recognition: as we already know, text data is the first-class citizen of current computer systems. Text data is standardized, easy to store, and easy to view, find, retrieve, and process. Therefore, speech recognition that converts speech input into text data can make more efficient use of the input data.
  • the human natural voice output contains information other than the corresponding text content.
  • after speech recognition converts the input into standard text content, the original speech data is generally not retained, so this information is in fact lost. It mainly includes voice, intonation, tone, pauses, etc., which may convey emotions, moods, and so on.
  • the recognition rate problem is a major obstacle that has kept speech recognition from becoming a mainstream means of human-computer input.
  • the data of the voice call maintains the original voice information, but the amount of data is large, and is not conducive to the automatic analysis and processing of the computer.
  • speech recognition can generate text data, which is convenient for computer transmission, storage, analysis and processing, some original speech information is lost in this process; and the accuracy and reliability of current speech recognition are not guaranteed, and there is no effective Ways to get the sound sample data of most people to improve the recognition rate.
  • This embodiment proposes a compromise method to process the original voice data so that both the original voice data and the text data are saved, which facilitates the transmission, storage and analysis processing of the computer.
  • This text data is not a standard text encoding, but a private encoding for a specific person.
  • the voice data corresponding to the code is stored in a specific text code warehouse, and the voice data in the code warehouse is differentiated and coded according to different users. Users can set access rights for different users for their own voice data.
  • the system is roughly divided into two parts: the code repository and related services surrounding the data.
  • The process of voice input is as follows:
  • 1. The user logs into the code warehouse and selects the voice text input system.
  • 2. The voice text input system registers a series of encoders with the code warehouse according to the current user.
  • 3. The user inputs continuous speech into the voice text input system.
  • 4. The voice text input system stores the user's input in the input buffer.
  • 5. The voice text input system divides the voice data in the input buffer according to certain rules to form different data objects.
  • 6. The voice text input system submits the data to the code warehouse through the corresponding encoder and obtains the corresponding code.
  • 7. The voice text input system stores the obtained code in the text input result and clears the corresponding input buffer content.
  • 8. Steps 3 to 7 are repeated, so that the voice text input system continuously obtains the user's input and its corresponding codes.
  • 9. When the user stops inputting and there is no data left in the input buffer, the entire voice input process is complete.
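  • The input loop in steps 3 to 9 can be sketched as follows. This is only an illustration under stated assumptions: `CodeWarehouse`, `voice_input` and `split_on_zero` are hypothetical names, the code is derived from a hash of the owner and the content, encoder registration (step 2) is not modeled, and the final flush of leftover buffer data is an added simplification.

```python
import hashlib


class CodeWarehouse:
    """Hypothetical in-memory stand-in for the code warehouse."""

    def __init__(self):
        self._store = {}  # code -> (owner, voice data)

    def submit(self, owner: str, data: bytes) -> str:
        # Assumption: the code is derived from owner + content, so identical
        # segments from the same user map to the same code.
        code = hashlib.sha1(owner.encode() + data).hexdigest()[:8]
        self._store[code] = (owner, data)
        return code


def voice_input(warehouse, user, chunks, segmenter):
    """Buffer the input, segment it, encode each segment, clear the buffer."""
    result = []   # the text input result: a list of codes
    buffer = b""
    for chunk in chunks:                       # step 3: continuous speech arrives
        buffer += chunk                        # step 4: store it in the input buffer
        segments, buffer = segmenter(buffer)   # step 5: split; the remainder stays buffered
        for seg in segments:
            result.append(warehouse.submit(user, seg))  # steps 6 and 7
    if buffer:                                 # input stopped: flush what is left
        result.append(warehouse.submit(user, buffer))
    return result


def split_on_zero(buf: bytes):
    """Toy segmenter: a zero byte stands in for a silence boundary."""
    *done, rest = buf.split(b"\x00")
    return [d for d in done if d], rest


wh = CodeWarehouse()
codes = voice_input(wh, "alice", [b"ab\x00c", b"d\x00ef"], split_on_zero)
```

  • Here the two input chunks produce three codes, because the segment `cd` spans the chunk boundary and is only encoded once its closing boundary arrives.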
  • FIG. 2B-3 shows a time-domain analysis of a piece of audio data. A stretch whose amplitude stays below a threshold (here 0.005) for at least a minimum duration (here 20 ms) is defined as silence. A silence shorter than 50 ms is split at its midpoint: the first half belongs to the preceding segment and the second half to the following segment. A silence of 50 ms or longer is cut at its beginning and at its end.
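  • A minimal sketch of this segmentation rule, assuming the audio is a list of floating-point samples in the range -1 to 1. The function names and the choice to drop empty segments are illustrative assumptions, not part of the described method.

```python
def find_silences(samples, rate, threshold=0.005, min_ms=20):
    """Return (start, end) sample indices of runs quieter than threshold for at least min_ms."""
    min_len = rate * min_ms // 1000
    runs, start = [], None
    for i, x in enumerate(samples):
        if abs(x) < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(samples) - start >= min_len:
        runs.append((start, len(samples)))
    return runs


def cut_points(samples, rate, split_ms=50, **kw):
    """Short silences are cut in the middle; long ones at both ends."""
    split_len = rate * split_ms // 1000
    cuts = []
    for start, end in find_silences(samples, rate, **kw):
        if end - start < split_len:
            cuts.append((start + end) // 2)
        else:
            cuts.extend([start, end])
    return cuts


def segment(samples, rate, **kw):
    """Slice the samples at the cut points, dropping empty slices."""
    bounds = [0] + cut_points(samples, rate, **kw) + [len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


# 1 kHz toy signal: 100 ms voice, 30 ms silence, 100 ms voice, 80 ms silence, 100 ms voice.
signal = [0.5] * 100 + [0.0] * 30 + [0.5] * 100 + [0.0] * 80 + [0.5] * 100
print([len(s) for s in segment(signal, 1000)])  # [115, 115, 80, 100]
```

  • The 30 ms silence is shared between its two neighbours, while the 80 ms silence becomes a segment of its own, which could then be encoded as a mute character.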
  • the method of separating the encoding and the content can easily place the encoding and the data content in different secure channels, and has natural security.
  • the voice data stored in the code warehouse is directly related to a specific person, and naturally can be well used as a training sample for analysis and organization.
  • Existing speech analysis and recognition technology can analyze and identify a lot of useful information, such as pitch, tone, syllables, etc., and can extract more effective feature parameters, such as MFCC parameters, LPCC parameters, and so on.
  • These can be stored in the code repository to provide further coding services for the corresponding speech coding.
  • Voice text output: for the obtained voice text content, that is, the encoding result, there are two output modes. One is graphic output, based on text display; the other is audio output, based on voice playback.
  • Graphic output: graphic output of voice text refers to presenting voice text in the manner of ordinary text, that is, as typeset text.
  • The advantage is that the voice text can be handled with existing word processing methods and tools.
  • Support for voice text output also allows voice text, traditional text, and other forms of text (such as graphic text, image text, etc.) to appear in the same document, supporting richer applications.
  • the system can present continuous speech text encoding (including speech data encoding and mute duration encoding, etc.) as a whole, for example: "+ an unauthorised speech text (9 characters, 4 silent characters; mute duration total 2 '369)"
  • the system can also provide relevant search functions, such as a silent search (with or without constraints).
  • the system can display more relevant information and allow the user to play the voice content, for example, display "+ voice content, duration 8" (5 voice characters, 4 mute characters; mute duration total 2'369)". When the user expands the voice text, more details can be obtained, as shown in Figure 2B-7.
  • Voice text is graphically output and can be visualized in a variety of formats, such as displaying waveforms, spectrograms, visualization durations, etc., depending on the specific application requirements.
  • results of the analysis of the phonetic characters, or the semantic tags added by the user to the characters can also be presented simultaneously.
  • the third and fourth audio characters are also displayed based on the results of the Chinese Pinyin phonetic analysis.
  • the associated system text search can also provide more search control, such as searching based on semantic tags entered by the user.
  • the system decomposes its metacode according to the target character encoding.
  • the system submits a character meta code to the code repository.
  • the encoding warehouse checks the access rights according to the meta code and the current user. If access is disabled, an error message is returned to the system; the system performs a graphical output based on the character encoding; the process ends. If access is allowed, the corresponding encoded metadata is returned to the system; the process continues.
  • the system decomposes the instance code according to the target character encoding.
  • the system parses the instance code according to the encoded metadata. Specifically, if it is a mute character, the instance code is parsed into a mute duration; if it is an audio character, the character code is submitted to the code repository.
  • the encoding repository checks the access rights according to the audio encoding settings and the current user. If access is disabled, an error message is returned; if access is allowed, the corresponding voice data is obtained and returned to the system.
  • the system outputs the characters according to the parsed or obtained data.
  • the waveform data is recovered according to the voice data, and played out.
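  • The lookup sequence above (meta code, access check, instance code, audio fetch) can be sketched as follows. Everything here is a hypothetical illustration: the `CodeRepository` class, the `meta:instance` layout of a character code, and the fallback to the bare encoding when audio access is denied are assumptions, not details given in the text.

```python
from dataclasses import dataclass, field


@dataclass
class CodeRepository:
    """Hypothetical in-memory code repository with per-user access lists."""
    metadata: dict = field(default_factory=dict)  # meta code -> {"kind": ..., "readers": set}
    audio: dict = field(default_factory=dict)     # character code -> (readers, voice data)

    def get_metadata(self, meta_code, user):
        entry = self.metadata[meta_code]
        if user not in entry["readers"]:
            raise PermissionError("metadata access denied")
        return entry

    def get_audio(self, char_code, user):
        readers, data = self.audio[char_code]
        if user not in readers:
            raise PermissionError("audio access denied")
        return data


def resolve_character(repo, char_code, user):
    """Return ("mute", ms), ("audio", data) or ("opaque", code) for one character."""
    meta_code, instance_code = char_code.split(":", 1)  # assumed "meta:instance" layout
    try:
        meta = repo.get_metadata(meta_code, user)
    except PermissionError:
        return ("opaque", char_code)         # output based on the bare character encoding
    if meta["kind"] == "mute":
        return ("mute", int(instance_code))  # instance code parsed as a mute duration
    try:
        return ("audio", repo.get_audio(char_code, user))
    except PermissionError:
        return ("opaque", char_code)         # assumed fallback, not specified in the text


repo = CodeRepository()
repo.metadata["M"] = {"kind": "mute", "readers": {"alice"}}
repo.metadata["A"] = {"kind": "audio", "readers": {"alice", "bob"}}
repo.audio["A:1"] = ({"alice"}, b"\x01\x02")
```

  • With this setup, `alice` resolves both characters, while `bob` can read the audio metadata but not the voice data itself, so he only sees the opaque code.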
  • the system needs to obtain all corresponding phonetic characters and related data, and graphically output the visualized form according to certain typographic rules. If the user's play request is obtained, the play buffer is established, and the audio data is played back in turn (while taking into account the play of the silent characters).
  • Voice playback: the voice playback output of voice text is similar to the playback of traditional audio data and does not need to consider the graphic layout of text. However, playback of voice text is also subject to the user's access rights: the voice text can be played only if the user has obtained access rights to the data corresponding to it.
  • rich search positioning can be performed on voice text, such as searching according to voice duration, mute duration, semantic tags, mixed text in voice text, and the like.
  • Voice text editing: encoding audio data as text makes it possible to edit voice data in the manner of traditional text editing.
  • the user can conveniently delete, insert, modify, etc. any character, and can also perform traditional text encoding operations such as searching, replacing, copying and pasting.
  • Some of these operations require the use of specialized audio services. For example, change the mute duration, divide an audio character into multiples, combine multiple speech characters into one, and so on.
  • Noise cancellation: audio data recorded in normal environments generally contains ambient noise. After it is segmented and encoded, it will be played back. Will noisy voice characters sound strange when played alongside noiseless mute characters?
  • The sound frequencies that the human ear can perceive range from 20 Hz to 20 kHz.
  • The frequency of sound produced by the human vocal organs is about 80 Hz to 3400 Hz, while speech usually lies between 300 Hz and 3000 Hz. For a specific individual, this frequency range is generally narrower.
  • The volume of normal indoor conversation is between 20 and 60 decibels. Based on the frequency range above, we can automatically remove high-frequency and low-frequency noise. By detecting stretches of low-decibel sound, we can perform voice detection and automatically obtain the silent segments. Through spectrum analysis of the silent segments, noise filtering can be performed on the entire audio data. Note that some of the noise in the silent segments falls in the same frequency range as the voice data; when filtering automatically, we must ensure that the audio of non-silent segments is not reduced to a low-decibel silent segment.
  • Real-time voice calls: since this method is based on segmenting voice data, is it inapplicable to voice applications with high real-time requirements? In fact, the method remains applicable to voice applications that can tolerate a delay of a few seconds. If the real-time requirements are stricter, speech segmentation is not possible during the call. Even for those applications, however, the method can be used to record the call, which avoids the large data volume and editing difficulty of traditional voice recordings.
  • Voice transmission: in traditional voice call applications, voice data is transmitted directly to the receiver.
  • In this method, the voice text is transmitted to the receiver, and the receiver obtains the actual voice data from the code warehouse. Will this process be inefficient?
  • The sender can still hide some or all of the voice data after the voice text has been transmitted.
  • The receiver then cannot play it, in whole or in part, even though it has received the voice code. This is not possible in traditional voice call applications.
  • Actual data volume: the encoded content of the audio data is indeed much smaller than the original audio data, but for users who ultimately need to use or play the original voice content, the amount of data has not decreased; it has increased by the size of the voice text encoding. Can we say this is a defect of the method? It is undeniable that, for a specific segment of speech, if the final playback is to restore the original input, the amount of data is not reduced (ignoring noise cancellation). However, once the personalized voice data is centralized in the code repository, it actually contains a significant amount of redundancy. By processing this redundant information, storage efficiency and transmission efficiency can be greatly improved, as explained below.
  • the sound that can be emitted in a lifetime is limited.
  • the basic elements/syllables are more limited considering language limitations.
  • the combination of the elements is also very limited.
  • the specific phonemes that can be formed are limited.
  • The voice data is cut into continuous sound frames.
  • A sound frame is generally 10 ms to 40 ms long, and adjacent frames can overlap somewhat. Appropriate frame segmentation facilitates audio analysis and further parameterization of the audio data for eventual reuse.
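  • Overlapping framing can be sketched in a few lines. The 25 ms frame length and 10 ms hop are assumed values chosen within the 10 ms to 40 ms range mentioned above, and the function name is illustrative; a trailing partial frame is simply dropped in this sketch.

```python
def frames(samples, rate, frame_ms=25, hop_ms=10):
    """Cut samples into fixed-length frames; consecutive frames overlap by frame_ms - hop_ms."""
    frame_len = rate * frame_ms // 1000
    hop = rate * hop_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


# 100 ms of audio at 1 kHz (1 sample per ms) gives frames starting at 0, 10, ..., 70 ms.
toy = list(range(100))
print(len(frames(toy, 1000)))  # 8
```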
  • Some existing audio fingerprint extraction and matching methods can be used to detect redundant voice data well, to implement content normalization, search matching and other services in the code warehouse. For example, Google's Waveprint method (patent US 8411977 B1).
  • Non-speech audio data: the emphasis here is on voice data. Is this method also applicable to non-speech audio data, such as music and the audio tracks of videos?
  • The method described here does not change the original data; it only divides and encodes it.
  • The original content is divided into the encoded stream and the corresponding audio data in the code warehouse.
  • Final playback will still be able to fully restore and play the original audio. In this sense, there is no problem with using this method.
  • The text obtained by this method is personal and tied to a particular user. This is also what enables subsequent speech analysis, recognition and other highly personalized services for that user. If music or other sounds unrelated to the individual user are stored in the code repository and associated with the user, they will degrade the subsequent personalized services. It is therefore better to separate voice data from other audio into different channels, and to use other coding classifications for the other audio data, such as instrument-related coding for music. At playback, the audio characters that were divided into multiple channels are mixed back together.
  • The method further comprises: generating a unique identifier for the encoding order information based on the encoding arrangement order information, and/or generating a unique identifier for each data fragment based on that data fragment; the encoding order information unique identifier and/or each data fragment unique identifier is stored as part of the metadata.
  • The data object identifier uniquely corresponding to the data object, the encoding order information unique identifier, and the data fragment unique identifiers are, respectively, hash values (e.g., MD5, SHA-1, etc.) of the content of the data object, the encoding order information, and each data fragment, or globally unique identifiers (UUID/GUID) generated by the system, or any other globally unique encoding.
  • the identifier can be used to perform integrity check on its corresponding content to verify whether the identifier matches its corresponding information, and whether the corresponding information is complete.
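  • A minimal sketch of both kinds of identifier, using Python's standard `hashlib` and `uuid` modules. The helper names are illustrative. Note that only a content-derived hash supports the integrity check; a random UUID identifies the content but cannot verify it.

```python
import hashlib
import uuid


def content_id(content: bytes, algorithm: str = "sha1") -> str:
    """Identifier derived from the content itself (a hash value)."""
    return hashlib.new(algorithm, content).hexdigest()


def random_id() -> str:
    """System-generated globally unique identifier, independent of the content."""
    return str(uuid.uuid4())


def verify(content: bytes, identifier: str, algorithm: str = "sha1") -> bool:
    """Integrity check: does the content still match its hash-based identifier?"""
    return content_id(content, algorithm) == identifier


fragment = b"example data fragment"
fid = content_id(fragment)
print(verify(fragment, fid))         # True
print(verify(fragment + b"!", fid))  # False: the content was altered
```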
  • Data splitting refers to splitting a complete piece of data into two or more parts, which are then stored in different storage systems.
  • The purpose of data splitting in the present invention is not only storage but also data security.
  • Users may not trust a single cloud provider with their data, but through data splitting a piece of data can be stored across one or more vendors, and the data is exposed only if all of its parts (including the metadata and every data fragment) are leaked. This greatly increases the difficulty of illegally merging the data.
  • the data splitting of the present invention allows the end user of the data (i.e., the user entitled to own the data) to directly intervene and control.
  • the data splitting method is built on the operating system (including the cloud operating system), specifically in the application system for splitting purposes, or in the splitting service of other application systems.
  • the storage system is built on the storage physical device, the infrastructure under the operating system.
  • the data splitting method of the present invention will eventually use a data storage system.
  • FIG. 2C is a diagram showing the position of the data splitting method of the present invention in the computer system hierarchy, that is, where the application field of the present invention sits in that hierarchy.
  • the splitting and merging of data can be done at the terminal or by the server or service provider.
  • the data obtained from a cloud storage server is not complete and is not enough to pose a threat to the privacy and confidentiality of the user.
  • An attacker needs to obtain the identity of the same user in different cloud storage services in order to get different pieces of data that make up the complete data. This difficulty is often much greater than cracking a single system.
  • the merged specification can restore the fragment data to the original complete data. This gives the user's data an extra layer of protection.
  • A hacker can still attack the user's terminal system to obtain the complete data before the user splits it or after it has been merged.
  • the mail server can be a conventional mail server
  • the attachment needs to be added to the mail
  • the content of the attached file is split.
  • several of them are stored in the cloud storage specified by the user, and several others are saved in the mail as ordinary attachments.
  • the mail cloud application system can register the metadata of the original attachment file and the split information (the default metadata stripping protocol, etc.) in the file meta-information database (an online service system in which both the sender and the recipient must have an account), and the corresponding data access link can be automatically set for the sender according to the settings of the client.
  • For the recipient, no fragment of the data exists on the terminal side before the attachment is downloaded.
  • The actual storage of the data is distributed among the cloud storage, the mail server, and the corresponding metadata in the file meta-information database. Of course, the data also exists on the sender's terminal (if the sender is not using a distributed file system and the file has not been deleted).
  • the system can automatically locate the corresponding item in the file meta-information database according to the content stored in the email as a normal attachment, then locate the part of the content held in the cloud storage, restore it according to the corresponding split method, and finally recover the original data on the recipient's client.
  • the account information required by the recipient's mail client is pre-set. There are at least three accounts involved here: the mail system, the cloud storage system, and the file meta-information system.
  • FIG. 2D is a flowchart of a data merging method according to an exemplary embodiment.
  • the present invention provides a data merging method, including:
  • Step 401B Receive a data object acquisition request carrying the identification information.
  • the identification information includes positioning information, and the positioning information is used to locate a storage address of part of the data information in the data object.
  • Step 402B Acquire storage content corresponding to the positioning information, and obtain data information in the other storage content according to the obtained positioning information in the stored content until all data information of the data object is obtained.
  • Step 403B Combine the acquired data information according to the preset merge rule in the acquired data information to obtain a data object.
  • The data merging method of this embodiment receives a data object acquisition request carrying identification information, obtains the storage content indicated by the positioning information in the identification information, and obtains the data information in other storage content according to the positioning information in the obtained storage content, until all the data information constituting the data object has been acquired. The acquired data information is then merged to obtain the complete data object.
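  • Steps 401B to 403B can be sketched as a traversal that follows positioning information from one stored item to the next. This is an illustration under stated assumptions: the dictionary layout (`locators`, `merge_rule`, `fragment_id`, `data`), the address strings, and the function names are hypothetical, and the merge rule is reduced to an ordered list of fragment identifiers.

```python
def collect(storage, locator):
    """Steps 401B-402B: follow positioning information until every piece is fetched."""
    pending, seen, pieces = [locator], set(), []
    while pending:
        loc = pending.pop()
        if loc in seen:
            continue
        seen.add(loc)
        content = storage[loc]  # the storage content found at this address
        pieces.append(content)
        pending.extend(content.get("locators", []))
    return pieces


def merge(pieces):
    """Step 403B: apply the merge rule carried inside the acquired data information."""
    rule = next(p["merge_rule"] for p in pieces if "merge_rule" in p)
    fragments = {p["fragment_id"]: p["data"] for p in pieces if "data" in p}
    return b"".join(fragments[i] for i in rule["order"])


# Hypothetical layout: metadata in one place, fragments in two others.
storage = {
    "meta://doc": {"merge_rule": {"order": ["f1", "f2"]},
                   "locators": ["cloudA://f1", "mail://f2"]},
    "cloudA://f1": {"fragment_id": "f1", "data": b"hello "},
    "mail://f2": {"fragment_id": "f2", "data": b"world",
                  "locators": ["meta://doc"]},
}
print(merge(collect(storage, "meta://doc")))  # b'hello world'
```

  • Because the mail fragment also carries a locator back to the metadata, the same object can be recovered starting from `mail://f2`, which mirrors the mail attachment scenario described earlier.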
  • FIG. 2E is a flowchart of a data merging method according to another exemplary embodiment. As shown in FIG. 2E, the present invention provides a data merging method, including:
  • Step 501B Receive a data object acquisition request carrying the identification information.
  • the identification information includes positioning information, and the positioning information is used to locate a storage address of part of the data information in the data object.
  • the type of data information is one or more of the following combinations: metadata, data fragments, encoding, and encoding order.
  • Step 502B Acquire storage content corresponding to the location information, and obtain data information in the other storage content according to the location information in the obtained storage content, until all data information of the data object is obtained.
  • Step 503B Combine the acquired data information according to the preset merge rule in the acquired data information to obtain a data object.
  • One or more pieces of data information are obtained according to the positioning information (a piece of data information may be a split data segment, part or all of the metadata, or part or all of the encoding and encoding order). According to a specific rule, that is, the preset merge protocol, the remaining data information is gradually acquired starting from those pieces, and all the data information (i.e., metadata, data segments, encoding, encoding order, etc.) is merged together to recover the original data object.
  • the specific merger is as follows:
  • A decoding operation is performed on each code according to the merging algorithm in the preset merge protocol to obtain the data segment corresponding to that code; the decoded data segments are then arranged according to the encoding order to obtain the data object with the data segments in their original order.
  • If the metadata agreed in the preset merge specification includes attribute information, integrity verification is performed on the data object merged from the data segments according to the attribute information, to confirm that the attributes of the data object match the attribute information in the metadata; or,
  • if the metadata agreed in the preset merge specification includes a data content identifier and a keyword, the data matching the keyword is merged into the data segment corresponding to the data content identifier, and the data segments are then merged to form the data object; or,
  • if the metadata agreed in the preset merge specification includes attribute information, a data content identifier, and a keyword, the data matching the keyword is merged into the data content corresponding to the data content identifier, and integrity verification is performed on the data object merged from the data segments according to the attribute information, to confirm that the attributes of the merged data object match the attribute information in the metadata.
  • Step 504B If the metadata includes a unique identifier of the data object, perform integrity verification on the merged data object according to the unique identifier.
  • The data merging process is actually the reverse of the data splitting process and works according to the preset merge specification.
  • The preset merge specification (hereinafter, the merge specification) may be combined with the preset split specification (including the preset metadata stripping protocol, the preset data content splitting specification, the preset encoding separation specification, etc.).
  • When they are combined into a merge/peel protocol, they are the same content.
  • A merge specification is data information prepared for data recovery; it can also be called a split/merge specification, because it must ensure that the split data can be recovered. The split specification therefore often includes or implies the merge specification.
  • The client can locate the stored content in the file meta-information system, the mail system, the cloud storage, and the like according to the attachment name (i.e., the unique identifier of the data object), and obtain the data information from it. The data information includes the split algorithm, the data segments, the positioning information, related file metadata items, etc.
  • According to the obtained data information, the mail system can locate and download the data segments, derive the inverse algorithm from the splitting algorithm, and merge the data segments and the metadata. If there is a code, the data segments can be restored according to the code to obtain the original content of the user's data object. If the metadata includes the unique identifier of the data object, the merged result can be verified against it; based on the file metadata, the file size can also be verified and the file name, file type, creation time, etc. recovered.
  • the information of the split protocol in the example of the mail client can be a merge specification. Among them, the specific merge specification, that is, the inversion process, can be derived through the data split description document.
  • The system retains the appropriate split/peel protocol after data splitting and stores the relevant location information (such as its storage location) in the split data segments, or in any designated accessible storage space.
  • The system finds or extracts the corresponding split metadata according to the obtained split/peel protocol or merge specification, and splices the data segments together based on the split/peel protocol or merge specification and the metadata, thus recovering the original data.
  • the decoding operation is performed according to the merging algorithm in the preset merging protocol, and the data segment corresponding to the encoding is obtained, including:
  • The information of the data object to be split is divided into three parts: a metadata block, a data block (i.e., the data segments), and an index block (i.e., the encoding). Any information dispersal algorithm can be used. For example, the IDA algorithm divides the contents of the source file, after lossless compression, into four-byte (32-bit) units. Note that compression is not required. The divided results are sorted, combined and deduplicated, that is, duplicates are eliminated, and the result is saved as a data block file without duplicates.
  • Each divided data block (data segment) is assigned its index (encoding) in the data block file, and the indexes are saved in the original order as an index file (the encodings and their arrangement order information).
  • the file name of the data block file and the index file may be a hash value (MD5, SHA-1, etc.) of the corresponding file content or a system-generated globally unique identifier (GUID) or any other globally unique code.
  • the file name, size, date, and other information of the source file, as well as the file name of the data block file and the index file, can be stored in the metabase.
  • The system can restore the target file through the data merge process: for example, following the encodings in the index file and their arrangement order, the four-byte contents at the corresponding index positions of the data block (data segment) file are spliced together; the spliced result is then decompressed (if it was previously compressed) to obtain the target file.
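  • The four-byte split, deduplication, indexing and merge steps can be sketched as follows. This is an illustration, not the patented implementation: the function names and the metadata keys are assumptions, `zlib` stands in for the unspecified lossless compression, and the data blocks are kept as a Python list rather than a flat file so that a final block shorter than four bytes is preserved.

```python
import hashlib
import zlib

BLOCK = 4  # four bytes (32 bits), as in the text


def split(source: bytes, compress: bool = True):
    """Return (metadata, deduplicated data blocks, index in original order)."""
    payload = zlib.compress(source) if compress else source
    chunks = [payload[i:i + BLOCK] for i in range(0, len(payload), BLOCK)]
    blocks = sorted(set(chunks))                      # sorted, duplicates removed
    position = {block: n for n, block in enumerate(blocks)}
    index = [position[c] for c in chunks]             # one encoding per original chunk
    metadata = {
        "size": len(source),
        "compressed": compress,
        # File names derived from content hashes, as the text suggests.
        "block_file": hashlib.sha1(b"".join(blocks)).hexdigest(),
        "index_file": hashlib.sha1(repr(index).encode()).hexdigest(),
    }
    return metadata, blocks, index


def merge(metadata, blocks, index):
    """Splice blocks in index order, decompress if needed, and check the size."""
    payload = b"".join(blocks[i] for i in index)
    data = zlib.decompress(payload) if metadata["compressed"] else payload
    if len(data) != metadata["size"]:
        raise ValueError("size mismatch after merge")
    return data


meta, blocks, index = split(b"ABCDABCDABCDxy", compress=False)
print(blocks)  # [b'ABCD', b'xy']
print(index)   # [0, 0, 0, 1]
```

  • None of the three parts alone is enough to rebuild the source: the blocks lack the order, the index lacks the content, and the metadata holds neither.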
  • A desktop agent can also be established. It is built on top of the desktop agents of the underlying cloud storage services and automates the splitting and merging process described above, which is more convenient for users.
  • The split-storage desktop agent on the user's client runs in the background of the system, as do the desktop agents of Google Drive and Microsoft OneDrive.
  • Google Drive has a directory C:\GDrive that automatically syncs with Google's cloud storage
  • OneDrive has a directory C:\MDrive that automatically syncs with Microsoft's cloud storage.
  • the sync directory corresponding to the split storage desktop agent is C:\DDrive.
  • The desktop agent service detects changes in the file system, automatically splits the file, saves the data block (data fragment) file to C:\GDrive, saves the index file (the encoding and encoding order information) to C:\MDrive, and saves the metadata to the proprietary database cloud service.
  • The Google and Microsoft desktop agent services then automatically sync the data block file and the index file to Google's and Microsoft's cloud storage respectively, and from there to the corresponding directories on the user's other terminals.
  • If the corresponding terminal also runs the split-storage desktop agent, it detects the changes in the C:\GDrive and C:\MDrive directories, automatically obtains the metadata, merges it with the data block file and the index file into the original file, and saves that file in the C:\DDrive directory, thereby synchronizing the split/merge storage.
  • FIG. 2F is a schematic structural diagram of a data splitting apparatus according to an exemplary embodiment.
  • the present invention provides a data splitting apparatus, including: an extracting and stripping module 61B, configured to, upon receiving a storage request carrying an identifier of data to be stored, obtain the metadata in the data object corresponding to that identifier according to the preset metadata stripping protocol, and strip the obtained metadata from the data object.
  • the segmentation module 62B is configured to split the data content into at least two data segments according to the preset data content splitting protocol.
  • the storage module 63B is configured to store the metadata and the individual data segments in different storage bodies or in different secure channels.
  • The data splitting apparatus of this embodiment, upon receiving a storage request carrying the identifier of the data to be stored, obtains the metadata in the data object corresponding to that identifier and strips it from the data object; splits the data content into multiple data segments according to the preset data content splitting protocol; and stores the metadata and each data segment separately in different storage bodies or in different secure channels.
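  • The three modules can be sketched as three small functions. All names and the dictionary layout of the data object are hypothetical. Round-robin byte striping is used here only as a simple stand-in for the preset data content splitting protocol, and plain dictionaries stand in for the separate storage bodies.

```python
def strip_metadata(obj: dict, attribute_names):
    """Module 61B: separate the agreed attribute fields from the data object."""
    metadata = {k: obj[k] for k in attribute_names if k in obj}
    return metadata, obj["content"]


def split_content(content: bytes, parts: int = 2):
    """Module 62B: split the content into at least two segments (byte striping)."""
    if parts < 2:
        raise ValueError("at least two data segments are required")
    return [content[i::parts] for i in range(parts)]


def store(metadata, segments, stores):
    """Module 63B: put the metadata and each segment into a different storage body."""
    if len(stores) < len(segments) + 1:
        raise ValueError("need one store for the metadata plus one per segment")
    stores[0]["metadata"] = metadata
    for n, (seg, target) in enumerate(zip(segments, stores[1:])):
        target[f"segment-{n}"] = seg


obj = {"name": "a.txt", "type": "text", "content": b"abcdefg"}
meta, content = strip_metadata(obj, ["name", "type"])
segments = split_content(content, 2)
stores = [{}, {}, {}]  # stand-ins for three independent storage bodies
store(meta, segments, stores)
print(segments)  # [b'aceg', b'bdf']
```

  • No single store holds readable content: one has only the metadata and each of the others has only every second byte.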
  • FIG. 2G is a schematic structural diagram of a data splitting apparatus according to another exemplary embodiment.
  • the stripping module 61B is obtained, including: a receiving submodule 611B. And for receiving a storage request carrying the identifier of the data to be stored.
  • the determining sub-module 612B is configured to: when the receiving sub-module 611B receives the storage request carrying the data identifier to be stored, the metadata agreed in the preset metadata stripping protocol includes: attribute information; and the data object corresponding to the data identifier to be stored The attribute information content matching the attribute information is determined as metadata; or the metadata used in the preset metadata stripping protocol includes: a data content identifier and a keyword, and corresponding to the data identifier to be stored according to the data content identifier Among the data contents in the data object, the data matching the keyword is determined as metadata; or the metadata used in the preset metadata stripping protocol includes: attribute information, data content identifier, and keyword, The attribute information content matching the attribute information in the data object corresponding to the to-be-stored data identifier is determined as metadata, and according to the data content identifier, the data content matching the keyword is determined as metadata from the data content in the data object.
  • the stripping sub-module 613B is configured to determine the metadata determined
  • The obtaining-and-stripping module 61B includes: a parsing sub-module 614B, configured to, when the metadata agreed in the preset metadata stripping protocol includes a data object identifier, parse the data object to generate a data object identifier that uniquely corresponds to the data object.
  • the apparatus further includes: an encoding module 64B, configured to perform encoding processing on each data segment according to a preset encoding separation protocol, to obtain a code corresponding to each data segment.
  • The arranging module 65B is configured to arrange the codes according to the original order of the data segments in the data content, to obtain encoding order information.
  • The storage module 63B is specifically configured to store the metadata, the code corresponding to each data segment, and the encoding order information in different storage bodies or different secure channels.
  • The apparatus further includes: an identifier generating module 66B, configured to generate a unique identifier of the encoding order information based on the encoding order information, and/or generate a unique identifier of each data segment based on that data segment; the storage module 63B is further configured to store the unique identifier of the encoding order information and/or the unique identifiers of the data segments as part of the metadata.
  • The preset data content splitting protocol includes at least one of a RAID (disk array) splitting algorithm and an information dispersal algorithm (IDA).
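  • As an illustration of a RAID-style split, the following minimal sketch distributes fixed-size stripes round-robin over several segments, in the manner of RAID 0 striping without parity. The function names, the 4-byte stripe size, and the use of the total length as merge metadata are assumptions made for this example; they are not taken from the embodiment.

```python
from typing import List


def stripe_split(data: bytes, n_stores: int, stripe: int = 4) -> List[bytes]:
    """Distribute fixed-size stripes of `data` round-robin over `n_stores` segments."""
    if n_stores < 1 or stripe < 1:
        raise ValueError("n_stores and stripe must be positive")
    segments = [bytearray() for _ in range(n_stores)]
    for index, start in enumerate(range(0, len(data), stripe)):
        segments[index % n_stores] += data[start:start + stripe]
    return [bytes(segment) for segment in segments]


def stripe_merge(segments: List[bytes], total_len: int, stripe: int = 4) -> bytes:
    """Inverse of stripe_split: read stripes back round-robin until total_len bytes."""
    out = bytearray()
    offsets = [0] * len(segments)
    index = 0
    while len(out) < total_len:
        i = index % len(segments)
        take = min(stripe, total_len - len(out))  # only the last stripe may be short
        out += segments[i][offsets[i]:offsets[i] + take]
        offsets[i] += take
        index += 1
    return bytes(out)
```

  • In this sketch each segment could be sent to a different storage body, while the stripe size and total length would travel with the metadata; an IDA would add redundancy, which this example does not attempt.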
  • FIG. 2H is a schematic structural diagram of a data merging device according to an exemplary embodiment. As shown in FIG. 2H, the present invention provides a data merging device, including:
  • the receiving module 81B is configured to receive a data object acquisition request that carries the identification information, where the identification information includes positioning information, and the positioning information is used to locate a storage address of the partial data information in the data object.
  • The obtaining module 82B is configured to obtain the storage content corresponding to the positioning information, and then to obtain the data information in other storage content according to the positioning information contained in the storage content already obtained, until all the data information of the data object has been obtained.
  • The processing module 83B is configured to combine the acquired data information according to the preset merge protocol contained in the acquired data information, to obtain the data object.
  • The data merging device of this embodiment receives a data object acquisition request carrying identification information, obtains the storage content indicated by the positioning information in the identification information, and then obtains the data information in other storage content according to the positioning information in that storage content, until all the data information constituting the data object has been acquired.
  • The obtained data information is then combined to obtain a complete data object.
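  • The chained retrieval described above can be sketched as follows. The store layout (an address mapped to a pair of data information and the positioning information of the next item) and the name `collect` are hypothetical and serve only to illustrate following positioning information until nothing further is referenced.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical store: address -> (data information, address of the next item or None)
Store = Dict[str, Tuple[bytes, Optional[str]]]


def collect(store: Store, first_address: str) -> List[bytes]:
    """Follow positioning information from item to item and gather all data information."""
    parts: List[bytes] = []
    seen = set()
    address: Optional[str] = first_address
    while address is not None:
        if address in seen:
            raise ValueError(f"positioning information loops back to {address!r}")
        seen.add(address)
        info, address = store[address]  # each item names where the next one lives
        parts.append(info)
    return parts
```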
  • FIG. 2I is a schematic structural diagram of a data merging device according to another exemplary embodiment.
  • The type of the data information is one or a combination of the following: metadata, data segment, encoding, and encoding order.
  • The processing module 83B includes: a decoding sub-module 831B, configured to decode each code according to the merge algorithm in the preset merge protocol, to obtain the data segment corresponding to that code.
  • The arranging sub-module 832B is configured to arrange the decoded data segments according to the encoding order, to obtain a data object in which the data segments are arranged in their original order.
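  • A minimal sketch of how codes and encoding order could work together is given below. Assigning consecutive integers as codes, and reusing one code for identical segments, are assumptions of this example rather than features stated in the embodiment.

```python
from typing import Dict, List, Tuple


def encode_segments(segments: List[bytes]) -> Tuple[Dict[int, bytes], List[int]]:
    """Assign one code per distinct segment; return (code -> segment, encoding order)."""
    code_of: Dict[bytes, int] = {}
    order: List[int] = []
    for segment in segments:
        code = code_of.setdefault(segment, len(code_of))
        order.append(code)
    table = {code: segment for segment, code in code_of.items()}
    return table, order


def decode_and_arrange(table: Dict[int, bytes], order: List[int]) -> bytes:
    """Decode each code to its segment and concatenate in the original order."""
    return b"".join(table[code] for code in order)
```

  • Here the table and the order list can be kept in different stores: neither alone is enough to rebuild the data object.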
  • The processing module 83B is specifically configured to: when the metadata agreed in the preset merge protocol includes attribute information, perform an integrity check, according to the attribute information, on the data object obtained by merging the data segments, to confirm that the attributes of the data object match the attribute information in the metadata;
  • or, when the metadata agreed in the preset merge protocol includes a data content identifier and a keyword, merge the data matching the keyword into the data segment corresponding to the data content identifier, and then merge the data segments to form the data object;
  • or, when the metadata agreed in the preset merge protocol includes attribute information, a data content identifier, and a keyword, merge the data matching the keyword into the data content corresponding to the data content identifier, and perform integrity verification on the data object obtained by merging the data segments, to confirm that the attributes of the merged data object match the attribute information in the metadata.
  • The apparatus further includes: an integrity verification module 84B, configured to, when the metadata includes a unique identifier of the data object, perform integrity verification on the merged data object according to the unique identifier.
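  • One possible form of this verification is sketched below. It assumes that the unique identifier is the hexadecimal SHA-1 digest of the whole data object and that the attribute information is the object size; both choices, and the metadata keys `object_id` and `size`, are assumptions of this example.

```python
import hashlib
import hmac


def object_identifier(data: bytes) -> str:
    """Assumed identifier scheme: hex SHA-1 digest of the whole data object."""
    return hashlib.sha1(data).hexdigest()


def verify_merged(merged: bytes, metadata: dict) -> bool:
    """Check the merged object against the attribute and identifier kept in metadata."""
    if "size" in metadata and metadata["size"] != len(merged):
        return False  # attribute information does not match
    return hmac.compare_digest(object_identifier(merged), metadata["object_id"])
```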
  • A software/hardware implementation of the present invention is presented below as a specific example, in conjunction with the embodiments of the splitting and merging methods and apparatuses described above.
  • splitting is primarily about considering how the system distributes data across multiple stores in the system architecture.
  • Such systems typically use metadata, coding, and domain-related splitting of data content. Data can therefore be split naturally along the lines of the application domain, that is, by a domain-related splitting method.
  • The data splitting/stripping and merging process is often built into the system's data access layer and tied to domain-related business logic. Whether the splitting is domain-related or domain-independent, the splitting/stripping methods can vary widely. We therefore introduce the concept of a "data split description language" (which can be used as part of the split/merge protocol) to configure the data splitting process.
  • The system or the user can split/strip data at runtime using a dynamic data splitting/stripping method.
  • The description of the data splitting/stripping method itself (which can be part of the split specification) can be stored in a particular store as part of the stripped-out metadata. Different data can have different splitting/stripping methods.
  • Accordingly, the merging of data varies from data to data, and the merging process must be based on an understanding of the splitting/stripping method description.
  • The data splitting/stripping/merging engine is a system component that parses and executes the data splitting/stripping description information to complete the splitting/stripping/merging of data.
  • At the heart of the data split description language and data split/peel/merge model is the data processor model.
  • a data processor is a software/hardware component that processes data.
  • The splitter implements the split function, and the processor that correspondingly merges data is called the combiner. Both are data processors.
  • compressors, decompressors, encryptors, decryptors, savers, extractors, etc. are also data processors.
  • The core of a data processor is its processing step; in addition, it has several input ports (data input ports and parameter input ports) and several output ports.
  • A data input port corresponds to a data input, an output port corresponds to a data output, and a parameter input port corresponds to the parameter information needed during data processing.
  • The compressor has one data input port (plus an additional password parameter input port when a compression password is used) and one data output;
  • the splitter has one data input and multiple data outputs;
  • the combiner has multiple data inputs and one data output;
  • the saver has one data input, multiple parameter inputs (storage location, access information, etc.), and no output (its processing submits the input to storage);
  • the extractor has no data input and one data output.
  • There is also a special kind of data processor, the generator, which has no data input (though it sometimes has parameter inputs) and one or more data outputs; its data output often takes part in the overall data processing as a parameter for other processing.
  • The distributor has one data input and multiple data outputs, and each output carries the same data as the input.
  • The output of one processor must be connected to an input of another processor (either a data input or a parameter input).
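  • The port model above can be sketched as a small data structure. The class name `DataProcessor`, its field names, and the three example processors are hypothetical; the sketch only shows that each processor declares how many data inputs, parameter inputs, and data outputs it has, and that these counts are checked around the processing step.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class DataProcessor:
    kind: str
    n_data_in: int
    n_param_in: int
    n_data_out: int
    process: Callable[[Sequence[bytes], Sequence[object]], Tuple[bytes, ...]]

    def run(self, data: Sequence[bytes], params: Sequence[object] = ()) -> Tuple[bytes, ...]:
        """Check the port counts, run the processing step, check the outputs."""
        if len(data) != self.n_data_in or len(params) != self.n_param_in:
            raise ValueError(f"{self.kind}: wrong number of inputs")
        out = self.process(data, params)
        if len(out) != self.n_data_out:
            raise ValueError(f"{self.kind}: wrong number of outputs")
        return out


# One data input, two data outputs: first half and second half of the input.
splitter = DataProcessor(
    "splitter", 1, 0, 2,
    lambda d, p: (d[0][: len(d[0]) // 2], d[0][len(d[0]) // 2:]),
)
# Two data inputs, one data output.
combiner = DataProcessor("combiner", 2, 0, 1, lambda d, p: (d[0] + d[1],))
# One data input, two identical data outputs.
distributor = DataProcessor("distributor", 1, 0, 2, lambda d, p: (d[0], d[0]))
```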
  • The data generation process of a data generator is generally irreversible; in reverse processing, the generated data can instead be obtained, directly or indirectly, from storage or from other processors.
  • the data input of a data processor is the data output of its corresponding reverse processor, and the data output is the data input of its reverse processor; the parameter input remains unchanged.
  • Thus the splitter corresponds to the combiner, the encryptor to the decryptor, the compressor to the decompressor, the saver to the extractor, and the distributor to the distributor (the inverted distributor selects among its data input ports), and so on.
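  • The inversion rule can be expressed in a few lines. The names `PortSpec`, `INVERSE_KIND`, and `invert` are hypothetical; the sketch swaps the data input and data output counts, maps the processor type to its reverse type, and leaves the parameter inputs unchanged.

```python
from dataclasses import dataclass, replace

INVERSE_KIND = {
    "splitter": "combiner", "combiner": "splitter",
    "encryptor": "decryptor", "decryptor": "encryptor",
    "compressor": "decompressor", "decompressor": "compressor",
    "saver": "extractor", "extractor": "saver",
    "distributor": "distributor",
}


@dataclass(frozen=True)
class PortSpec:
    kind: str
    data_inputs: int
    param_inputs: int
    data_outputs: int


def invert(spec: PortSpec) -> PortSpec:
    """Return the reverse processor: data inputs and outputs swap, parameters stay."""
    return replace(
        spec,
        kind=INVERSE_KIND[spec.kind],
        data_inputs=spec.data_outputs,
        data_outputs=spec.data_inputs,
    )
```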
  • The whole process of data splitting/stripping/merging is actually carried out by a network of data processors, and can in essence be described by a Petri net model.
  • Each processing step is a transition, each input port is a place, and each connection from an output to the next input port is a directed arc.
  • The directed arcs from a data processor's input ports to its processing step are hidden inside the processor: when all input ports hold data (tokens), the processing is activated automatically and the data flows onward.
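  • The firing rule can be illustrated with a single transition. The class `Transition` and its methods are hypothetical names for this sketch; it fires only once every input place holds a token, and consumes the tokens when it fires.

```python
from typing import Callable, Dict, Iterable, Optional


class Transition:
    """A processing step that fires once every input place holds a token."""

    def __init__(self, input_places: Iterable[str], action: Callable[..., object]):
        self.input_places = tuple(input_places)
        self.action = action
        self.tokens: Dict[str, object] = {}

    def enabled(self) -> bool:
        return all(place in self.tokens for place in self.input_places)

    def put(self, place: str, token: object) -> Optional[object]:
        """Deposit a token; fire and return the result if all places are now filled."""
        if place not in self.input_places:
            raise KeyError(place)
        self.tokens[place] = token
        if not self.enabled():
            return None
        consumed, self.tokens = self.tokens, {}  # firing consumes the tokens
        return self.action(**consumed)
```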
  • The aforementioned data split description language is mainly used to describe how data processors are assembled into a flow graph.
  • a document described in a data split description language is called a data split description document.
  • The data flow graph described in a data split description document is itself essentially a data processor. Therefore, one data flow graph can be used as a data processor inside another data flow graph.
  • A data split description document actually defines one or more data flow graphs. For a document that is used directly for data splitting, the entry flow graph must be specified.
  • Each data flow graph includes multiple data processors and their connection relationships. The connection relationship is described in the data output port of the data processor.
  • the data flow graph has a specified starting data processor. Data split description documents can be rendered and edited graphically.
  • the data splitting and merging engine splits and merges the data according to the description of the data split description document.
  • The corresponding data splitting process is shown in FIG. 2J: step 1001B, acquiring the metadata of the data object to be separated; step 1002B, creating the data separation storage document according to the metadata; step 1003B, reading the data separation storage document; step 1004B, instantiating the data separation storage document into a data flow graph (instantiating the data processors and establishing the connections between them); step 1005B, passing the data to be separated to the starting data processor of the data flow graph; step 1006B, destroying the data flow graph after execution.
  • The data splitting and merging engine is mainly responsible for loading the data split description document, instantiating it as an executable data flow graph, and finally passing the data to the flow graph for processing.
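  • The engine's life cycle (load the document, instantiate the flow graph, feed it the data, destroy the graph) might look like the sketch below. The document layout, the `REGISTRY` of processor types, and the restriction to processors with a single data input are all assumptions of this example; the actual description language is the one defined in the figures.

```python
from typing import Callable, Dict, List

# Hypothetical processor registry: type name -> function from bytes to a list of outputs
REGISTRY: Dict[str, Callable[[bytes], List[bytes]]] = {
    "reverse": lambda d: [d[::-1]],
    "halve": lambda d: [d[: len(d) // 2], d[len(d) // 2:]],
}

# Hypothetical description document: connections are listed on the output ports.
DOC = {
    "start": "rev",
    "processors": {
        "rev": {"type": "reverse", "outputs": ["cut"]},
        "cut": {"type": "halve", "outputs": ["store_a", "store_b"]},
    },
}


def run_document(document: dict, data: bytes) -> Dict[str, bytes]:
    """Instantiate the described flow graph, push `data` through it, then discard it."""
    graph = {
        name: (REGISTRY[node["type"]], node["outputs"])
        for name, node in document["processors"].items()
    }
    sinks: Dict[str, bytes] = {}
    pending = [(document["start"], data)]
    while pending:
        name, payload = pending.pop()
        process, outputs = graph[name]
        for target, result in zip(outputs, process(payload)):
            if target in graph:
                pending.append((target, result))  # wired to another processor
            else:
                sinks[target] = result            # wired to a named store
    graph.clear()  # destroy the flow graph after execution
    return sinks
```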
  • The data processor is an active object: each instantiated processor object has its own thread/process, which constantly checks its own execution conditions. Once it finds that all input ports have data, it executes automatically and passes the result to other data processors. After completing these operations, it destroys itself.
  • The flowchart is shown in FIG. 2K.
  • Step 1101B, determining whether data has been transmitted to an input port; if so, performing step 1102B, and if not, performing step 1103B; step 1102B, receiving the input data; step 1103B, determining whether all data ports have data, where, if an empty input port (usually a parameter port) is found, that is, an input port without any data source, the user is allowed to enter the corresponding information through an interactive interface; if all ports have data, performing step 1104B, and if not, returning to step 1101B; step 1104B, executing the data processing procedure; step 1105B, passing the processing result to the data processors connected to the outputs.
  • Step 1201B, locating the corresponding data separation storage document according to the input information;
  • step 1202B, reading the data separation storage document;
  • step 1203B, instantiating the data separation storage document into the corresponding reverse data flow graph; step 1204B.
  • the input information may be a reference code of the data split document, or may be a part of the data content after the split.
  • When a hash function is used, the obtained hash value can also serve as a reference code for the document. With this code, the corresponding data split document can be obtained.
  • the data splitting document describes the data splitting process, and the corresponding reverse process needs to be obtained when data is merged.
  • This inversion actually starts from the starting data processor and proceeds by traversing the related data processors along their output ports.
  • the process of reversing the data processor varies by type, but in general, the type is changed to the inverse process type, the data input port becomes the output port, and the output port becomes the data input port.
  • the input parameter port is unchanged.
  • the data split description language definition is shown in Figure 2M; the data split description language visualization flow chart is shown in Figure 2N; the data split description document sample is shown in Table 1:
  • The specific splitting process is as follows: the data to be split is first DES-encrypted, with the encryption key taken from the system configuration storage; the encrypted data is split into block data and encoded data by 4-byte split coding; the encoded data is stored in Amazon S3 cloud storage, and the corresponding SHA1 hash value is stored in the metadata database as the key value for addressing the corresponding metadata; the block data is stored in a local file whose file name is a GUID generated by the system, and the GUID is also stored in the metadata database as a key value.
  • the metadata database related records are shown in Table 2; the split items and metadata mapping tables are shown in Table 3;
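  • The storage step of this example can be sketched as follows. The sketch covers only the saving of already split data: a dictionary stands in for the cloud object store, the SHA-1 value is assumed to be computed over the encoded data, and the record keys `coded_data_key` and `block_file` are hypothetical. DES encryption and the 4-byte split coding themselves are not reproduced here.

```python
import hashlib
import tempfile
import uuid
from pathlib import Path


def store_split_result(coded: bytes, blocks: bytes, cloud: dict, local_dir: Path) -> dict:
    """Save both parts separately and return the record for the metadata database."""
    sha1_key = hashlib.sha1(coded).hexdigest()
    cloud[sha1_key] = coded                  # stand-in for an upload to cloud storage
    file_name = str(uuid.uuid4())            # system-generated GUID used as file name
    (local_dir / file_name).write_bytes(blocks)
    return {"coded_data_key": sha1_key, "block_file": file_name}


def load_split_result(record: dict, cloud: dict, local_dir: Path) -> tuple:
    """Use the key values in the metadata record to fetch both parts again."""
    coded = cloud[record["coded_data_key"]]
    blocks = (local_dir / record["block_file"]).read_bytes()
    return coded, blocks
```

  • Because the two key values live only in the metadata record, neither the cloud store nor the local file system alone reveals how the parts belong together.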
  • FIG. 2O illustrates the correlation between various concepts under the above three concepts, and some specific application examples that can be extended with these concepts and concepts. These specific applications are merely exemplary, and there are more variations in practical applications, so the present invention has a very broad application prospect.
  • The present invention not only provides a novel handwriting input method and system, but also, by combining the object-based open encoding/decoding scheme of the present invention with the object-based data splitting/stripping/merging methods, builds on traditional data processing systems to construct a data processing method and system that is open, secure, and efficient in the true sense, oriented to the future and based on the network environment.
  • The basic background is introduced first. The emergence and development of the computer are inseparable from coding technology.
  • As a foundational computer technology, coding is widely used in the transmission, storage, and processing of data, and its importance is self-evident.
  • the rise of cloud computing, big data, and the Internet of things are poised to bring new opportunities and challenges to coding technology.
  • the content encoding is a method of digitizing or converting the content of the encoding object.
  • Base64 encoding various data compression encoding (including lossless compression, lossy compression, etc.), image encoding (JPEG, SVG, etc.), video and audio encoding (PCM, MP3, MP4, etc.) are all in the category of content encoding.
  • the digitized content of the data itself is directly included in the results of the content encoding and can be analyzed and processed by the computer.
  • Structured coding is a technique for describing the structural information of data. It mainly encodes structured data/document content. For example, HTML, MathML, SVG, etc. are specific structured description languages, and the corresponding coding specification is the meta-language XML. Similar coding specifications include JSON, Protocol Buffers, etc.
  • the result of a reference encoding process is not the data content itself, but a reference to the content or a description of the addressing path of the access object.
  • Huffman coding is an optimized reference encoding method established for source symbols (the content itself). URLs, IP addresses, RFID tags, barcodes, QR codes, ISBNs, zip codes, etc. are all reference codes.
  • Text encoding (especially standardized encoding) is essentially a reference encoding: a code corresponds to the position of a specific character in the text encoding scheme. The sound, shape, meaning, and other data of the character itself are reflected only in the coding specification.
  • A computer program can process such codes directly, without needing the content that the codes correspond to (or with that content already built into the program).
  • standardized coding systems such as ASCII and Unicode.
  • Such encoding and encoding combinations themselves already constitute a higher level of data content.
  • Standardized text encoding is such a typical example. Many of today's text-based coding conventions (such as JSON, CSV, XML, etc.) are based on this.
  • The OMG (Object Management Group), a non-profit standardization organization in the computer field, has successfully defined a set of languages and standards for object modeling.
  • The OMG divides models into four levels of abstraction: the meta-metamodel layer (M3), the metamodel layer (M2), the model layer (M1), and the runtime data object layer (M0).
  • The meta-metamodel layer contains the elements needed to define a modeling language;
  • the metamodel layer defines the structure and syntax of a modeling language, and can be mapped specifically to UML (Unified Modeling Language) or to object-based programming languages such as Java and C#;
  • the model layer defines a specific system model, that is, the class or object model we usually speak of;
  • the runtime layer contains the state of model objects at runtime, that is, what we usually call objects or instances.
  • FIG. 3 is a schematic diagram of a meta model in the prior art.
  • The Meta-Object Facility (MOF) is a standard defined by the OMG for establishing metamodels (M2).
  • MOF includes a metamodeling language (M3 model) and methods for creating, manipulating models, and metamodels.
  • the object model has multiple levels, static models that represent structure and functionality, and dynamic models that describe runtime behavior.
  • the main focus of this paper is on static models related to coding, including data and interfaces.
  • the object's identifier is actually a reference encoding.
  • The identifier must be unique and correspond one-to-one with the object. In this way, the system can locate the corresponding object by identifier addressing.
  • In this sense, object reference encoding and object identifiers are the same concept, because they serve the same purpose.
  • However, a reference code cannot always be used as an object identifier.
  • A reference code only guarantees that the target can be addressed correctly; it does not necessarily guarantee a one-to-one correspondence with the object.
  • A many-to-one situation can exist (one object, multiple codes). For example, a host can have multiple IP addresses, and the same website can have multiple URLs.
  • Reflection refers to a class of applications that are self-describing and self-controlling. That is, such applications use some mechanism to represent and examine their own behavior, and can adjust or modify the state and related semantics of the behavior described by the application according to the state and result of that behavior.
  • Reflection is supported by modern software development platforms, tools, and programming languages. For example, on the Java and .NET platforms, reflection can be used at runtime to obtain metadata directly from running objects.
  • FIG. 4 is a schematic diagram of the architecture of the encoding system of the present invention.
  • The encoding system is mainly divided into three parts: a client, an encoding server, and a data storage end. The encoding server and the data storage end together constitute an encoding warehouse (also called the encoding repository or code repository below).
  • The client can obtain the corresponding data object by sending a code to the encoding warehouse, and can obtain the corresponding code by sending a new data object to the encoding warehouse.
  • the encoding server provides services to the client.
  • An encoding repository can include one or more data stores in which real data is stored.
  • the encoding server can send data queries to the data storage terminal to obtain, update, and insert related data.
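  • The two client operations (send a data object and receive a code; send a code and receive the data object) can be sketched with an in-memory stand-in. The class `EncodingRepository`, its method names, the use of consecutive integers as reference codes, and the reuse of a code for an identical object are assumptions made for this example.

```python
import itertools
from typing import Dict


class EncodingRepository:
    """In-memory stand-in for the encoding server plus one data storage end."""

    def __init__(self) -> None:
        self._objects: Dict[int, bytes] = {}   # plays the role of the data storage end
        self._codes: Dict[bytes, int] = {}
        self._next = itertools.count(1)

    def register(self, data_object: bytes) -> int:
        """Store a new data object and return its reference code."""
        if data_object in self._codes:
            return self._codes[data_object]    # same object, same code (a design choice)
        code = next(self._next)
        self._objects[code] = data_object
        self._codes[data_object] = code
        return code

    def resolve(self, code: int) -> bytes:
        """Return the data object that a reference code points to."""
        return self._objects[code]
```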
  • The code repository provides a centralized encoding service that allows different clients to share data objects and encoding meta-objects through reference codes. Further, a variety of different systems can register new encoding meta-objects with the code repository to meet a variety of coding requirements. This centralized coding service makes data integration and exchange between systems easier.
  • The code repository has a built-in data access control system that provides different access rights for different data objects and encoding meta-objects.
  • The encoding meta-objects and data objects can be stored on different data storage ends and/or be given different data access rights.
  • the encoded meta information is stored in an encoding repository, and the data object itself may exist in the encoding stream (content encoding) or the storage system of the encoding repository.
  • the reference code of the data object exists in the encoded stream.
  • the data objects in the code stream and the code repository can be placed in different secure channels. The separation of this information has natural security on the one hand and better coding efficiency on the other hand.
  • the data storage end can be implemented by using different storage systems such as file storage, relational database, NoSQL database, and cloud storage.
  • the present invention proposes a new object-based coding and decoding scheme and system, and is also an open solution.
  • object-based open coding schemes can be completely personal and non-standard.
  • "Non-standard" here means that the scheme differs from traditional standards that are developed and promoted by an institution or organization; in essence, it is a de facto standard (coding protocol) based on the coding warehouse.
  • This solution not only provides more flexible and diverse data services, but also provides more reliable security for data.
  • the coding scheme of the present invention can encode data of any type and any length, can have any coding format and arbitrary coding word length, and the coding rules can be not fixed, that is, the coding rules can be randomly changed as needed. This makes it possible to create fully personalized coding.
  • the coding scheme of the present invention is an encoding scheme that can encode an arbitrary object and is independent of the length of the object data, the encoding rule, and the length of the encoded word. This greatly breaks through the inherent form and limitations of existing standard coding. This coding scheme can be arbitrarily expanded. The same code can also be reused in different encoding processes without affecting each other, thus greatly improving the utilization of the code.
  • the concept of the coding scheme of the present invention consists in creating an encoding protocol for the data object based on the metadata of the data object and generating the encoding according to the encoding specification.
  • the present invention can acquire the features or structures of the data objects in an encoded manner and generate corresponding codes for the data objects in accordance with the features and/or structures of the encoded objects.
  • any party involved in the transmission, as well as the receiving and storing parties have the opportunity to obtain all the information in the data.
  • This is not only detrimental to the confidentiality of data, but also makes the volume of transmitted data large, increasing the burden on network bandwidth and CPU processing, especially for large-scale data transmission, and thus reducing data transmission efficiency.
  • Another feature of the present invention is that the data objects to be transmitted need only be stored in the code repository, with the corresponding data access rights set, to obtain the corresponding reference codes.
  • The reference code of the data object can then be sent out, and only a receiver that has the data access right can get the complete data. This can greatly reduce the amount of data transferred while increasing the security and reliability of the data.
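  • The effect can be illustrated with a toy repository in which each reference code carries a set of permitted readers. The class `AccessControlledRepository` and the user names are hypothetical; only the short reference code is transmitted, and the payload is released only to a user with access rights.

```python
from typing import Dict, Set, Tuple


class AccessControlledRepository:
    """Toy code repository: each reference code has its own set of permitted readers."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[bytes, Set[str]]] = {}

    def put(self, code: str, data: bytes, readers: Set[str]) -> str:
        """Store a data object under a reference code and set its access rights."""
        self._store[code] = (data, set(readers))
        return code

    def get(self, code: str, user: str) -> bytes:
        """Release the data object only to a user who holds the access right."""
        data, readers = self._store[code]
        if user not in readers:
            raise PermissionError(f"{user} may not read {code}")
        return data
```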
  • The encryption of data does not require the participation of any metadata; it is only necessary to convert the original data into content that cannot be normally recognized or displayed.
  • Although the present invention can also achieve the effect of encryption, it achieves data protection in a completely different way: the data content is protected through coded isolation, by means of the metadata of the data object.
  • the encrypted ciphertext data size is often the same as or larger than the original plaintext, but the present invention only needs to transmit a very small amount of information such as a corresponding reference code.
  • more useful functions and operational space are provided for data processing. For example, but not limited to, it can reduce the transmission of data and reduce the network load; the flexibility of coding also provides greater convenience for subsequent data processing and the like.
  • Encryption needs to convert the original data, by a predetermined rule or algorithm, into codes or data completely different from the original, so that it cannot easily be identified by a third party.
  • The present invention, by contrast, can completely preserve the original form of the data content and achieve data security and confidentiality without any modification to the content, which conventional encryption systems cannot do.
  • the open system of the present invention can assign different encodings to each data segment in the encoding process, and can also set different access rights for different users. This allows for more granular security.
  • the standard character becomes a special object (the object number of the built-in encoding metadata); the object reference encoding becomes a special character - non-standard characters.
  • The present invention can be used to directly accept the digitized result of natural human output, divide it into different data objects according to certain rules, and place them in an encoding warehouse to form non-standard characters (in this document, a non-standard character is an object reference code based on the encoding repository, with the emphasis that the data object is a piece of data obtained by splitting the digitized result of human output).
  • the present invention can establish a proprietary font for the writer by assigning a custom unique code or code to all or a fragment of the digitized result of the natural output of each human individual.
  • The user can input or add his own font at any time, thereby eliminating the trouble of having to input reference font information in advance, as disclosed in Chinese Patent No. CN103136769A.
  • The invention can also place object reference codes in different coding spaces, for example: a user coding space divided by user, in which different users can use the same reference code for different data objects in the coding warehouse; a coding space divided by date; a coding space divided by geographic location; a coding space divided by department; or a coding space divided by online session.
  • The coding space divided by session has a very strong security feature: the reference code of the data exists only in the coding space corresponding to the session. When the session ends, the corresponding coding space disappears, and none of the codes in that space can be decoded correctly any longer. With this feature, a "burn after reading" effect can be achieved.
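  • Both properties (the same code meaning different things in different spaces, and codes becoming undecodable once a session space is closed) can be shown in a few lines. The class `CodingSpaces` and the space identifiers are hypothetical names for this sketch.

```python
from typing import Dict


class CodingSpaces:
    """Reference codes are resolved only inside the coding space they belong to."""

    def __init__(self) -> None:
        self._spaces: Dict[str, Dict[int, bytes]] = {}

    def open(self, space_id: str) -> None:
        self._spaces.setdefault(space_id, {})

    def put(self, space_id: str, code: int, data_object: bytes) -> None:
        self._spaces[space_id][code] = data_object

    def get(self, space_id: str, code: int) -> bytes:
        return self._spaces[space_id][code]   # KeyError once the space is gone

    def close(self, space_id: str) -> None:
        """End a session: its coding space and every code in it disappear."""
        self._spaces.pop(space_id, None)
```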
  • introducing the coding space and adopting variable length coding can greatly reduce the storage consumption of the reference code and improve the efficiency of transmission, processing and storage.
  • the new data processing system introduces the concept of an encoding repository.
  • the application can not only query and use the encoding meta-objects already in the encoding repository, but also register and use new encoding meta-objects.
  • the new system breaks through the limitations of existing systems from four different levels.
  • Text encoding is non-standardized. Text encoding and the corresponding decoding information are stored in the application system and the code repository, respectively.
  • the code repository can support different levels of code isolation for users, applications, and content. Therefore, we can authorize the access and use of text content through the access control management of the code repository. In other words, the new data processing system has built-in security.
  • open coding allows us to completely break through these limitations.
  • the corresponding text parser can distinguish which text is the mark and which is the content according to the encoded metadata.
  • anything that can be serially encoded can be stored and encoded by the system, such as music melody, dance action, game data, video subtitles and even computer instructions.
  • the stored results are divided into two parts, one is the data object in the encoding warehouse, which can be multimedia data, or proprietary data, and the other part is the encoded code sequence.
  • the reference encoding of such data objects is not unique to the system.
  • Traditional data processing systems based on standardized encoding can also encode arbitrary data. But far from being based on object coding systems, it is simple, efficient, and natural.
  • The object coding in the object-based coding system may include a meta-encoding part and an instance coding part. The number of meta codes needed is very limited. For example, two bytes (16 bits) can encode more than 60,000 meta codes (2^16 = 65,536), which can correspond to more than 60,000 object types; this is enough for most applications.
  • For a specific object, because object encoding is arbitrary, we can directly use a number as its instance code. For example, 4 bytes (32 bits) can encode more than 4 billion individual objects (2^32 = 4,294,967,296). Since reference codes can also be placed in different encoding spaces, 32 bits is sufficient for most systems. That is, 6 bytes can represent the reference encoding of objects in most applications.
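  • As a minimal sketch of the 6-byte reference encoding described above, the following Python snippet packs a 2-byte meta code and a 4-byte instance code into one reference code. The function names and the big-endian byte order are illustrative assumptions; the description fixes only the field widths.

```python
import struct

# Assumed layout: 2-byte meta code (object type) followed by a
# 4-byte instance code (individual object), both big-endian.
REFERENCE_FORMAT = ">HI"


def pack_reference(meta_code: int, instance_code: int) -> bytes:
    """Pack a (meta code, instance code) pair into a 6-byte reference code."""
    return struct.pack(REFERENCE_FORMAT, meta_code, instance_code)


def unpack_reference(code: bytes):
    """Split a 6-byte reference code back into (meta code, instance code)."""
    meta_code, instance_code = struct.unpack(REFERENCE_FORMAT, code)
    return meta_code, instance_code


ref = pack_reference(0x0102, 123456789)
print(len(ref), unpack_reference(ref))
```

  • With this layout, 2^16 = 65,536 meta codes and 2^32 = 4,294,967,296 instance codes per meta code are available.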
  • With variable-length encoding, we can often express an object reference encoding in even fewer bytes by setting a default meta-encoding, using client-side encoding, and so on. By contrast, cloud storage systems use a dozen or even dozens of bytes to reference a data block in order to prevent data block conflicts; the object reference encoding is much simpler and more efficient.
  • In the new data processing system, we can store the data object corresponding to the object reference encoding in the encoding warehouse, which can greatly improve the storage efficiency of the data object, thereby improving data transmission and processing efficiency.
  • If the HTML of a webpage is re-encoded using the object encoding technique, with the elements and attributes of the standard HTML tags encoded and the relevant meta-information put into the encoding repository, the size of the resulting webpage document is greatly reduced, which saves traffic in the network transmission of web pages.
  • The encoding scheme used by object-based data processing systems can be personalized and non-standard. This is mainly achieved by isolating the context coding spaces: different users and different applications have their own context coding spaces, and personalized coding is accessed through the corresponding personalized context coding space.
  • Each object reference code has a one-to-one correspondence with the data objects in the encoding repository.
  • the data processing system can dynamically add data object types and their encodings.
  • the system automatically stores the input to the encoding repository as it is entered and encodes the location of the content in the encoding repository.
  • the output process is based on the object reference code, the input content is taken from the code repository, and it is played back naturally.
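  • The input and output processes above can be sketched as follows. This is a minimal in-memory illustration, not the actual repository design: the class name `EncodingRepository`, its methods, and the use of the storage position as the reference code are assumptions made for the example.

```python
class EncodingRepository:
    """Hypothetical in-memory encoding repository."""

    def __init__(self):
        self._objects = []  # the position in this list serves as the reference code

    def store(self, content) -> int:
        """Store input content as-is and return the code of its location."""
        self._objects.append(content)
        return len(self._objects) - 1

    def fetch(self, code: int):
        """Return the content that a reference code points to."""
        return self._objects[code]


repo = EncodingRepository()

# Input: each piece of input is stored unchanged and replaced by its code.
inputs = [b"stroke-data-1", b"audio-clip", b"stroke-data-2"]
codes = [repo.store(item) for item in inputs]

# Output: the code sequence is resolved against the repository and played back.
played_back = [repo.fetch(code) for code in codes]
print(codes, played_back == inputs)
```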
  • The writer writes under a natural writing constraint (such as a row constraint or a column constraint), and the system splits the written content according to natural word segmentation (such as the segmentation of Chinese characters, or the division of words in a phonetic language).
  • The shape of each character or word that is split out is stored in the code warehouse, and its corresponding reference code is generated.
  • These encodings are stored in a textual content, i.e., a collection of text encodings in a specific typographical order.
  • The above handwritten text input process lies between recognition-based handwriting input and non-recognition handwriting input. Like a text recognition system, this process requires segmenting characters and words. The difference is that it does not need to analyze the standard code corresponding to the input; instead, "what you input is what you get." This method therefore has no recognition-rate problem: the result is always 100% faithful, as in a non-recognition system. Unlike a non-recognition system, however, the process divides the input content and encodes the pieces separately. This allows us to perform word processing on the coding results in the new system, such as editing, copying, pasting, transferring, searching, and retrieving, just like ordinary text.
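  • To illustrate why segmented, separately encoded handwriting can be processed like ordinary text, the sketch below stores each segmented shape under its own reference code and then copies, pastes, and searches the resulting code sequence. The glyph representation (tuples of points), `store_glyph`, and `find_subsequence` are hypothetical names introduced for this example only.

```python
glyph_store = {}  # reference code -> stored shape (here: a tuple of (x, y) points)


def store_glyph(shape) -> int:
    """Store one segmented shape and return its reference code."""
    code = len(glyph_store) + 1
    glyph_store[code] = shape
    return code


def find_subsequence(text_codes, query_codes) -> int:
    """Return the index of the first occurrence of query_codes, or -1."""
    n, m = len(text_codes), len(query_codes)
    for i in range(n - m + 1):
        if text_codes[i:i + m] == query_codes:
            return i
    return -1


# Three segmented handwritten characters (made-up stroke points).
segments = [((0, 0), (1, 1)), ((0, 1), (1, 0)), ((0, 0), (0, 1))]
text = [store_glyph(shape) for shape in segments]

copied = text[1:3]      # "copy" two characters
text = text + copied    # "paste" them at the end
position = find_subsequence(text, copied)
print(text, position)
```

  • No shape is ever recognized here; editing and searching operate purely on the code sequence, and the shapes are fetched from the store only for display.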
  • data processing systems based on open coding can also be used in optical recognition based input systems. Especially in the recognition of handwriting input, it is not important whether the handwriting is scribbled or not.
  • The optical recognition system based on open coding only needs to segment the input image, store the resulting image pieces in the code warehouse, and generate the corresponding image object reference encodings. It is worth mentioning that, because the codes are personalized, the corresponding data objects that the system accumulates in the code repository can serve as good training samples, and the results of training on them can in turn increase the conventional text recognition rate for that particular individual.
  • the data processing system is also applicable to a voice input system.
  • the input sound signal does not need to be identified, and only needs to be simply processed and divided, and can be stored in the code warehouse and encoded accordingly.
  • the data processing system can also be applied to other text input methods, such as Braille, lip language, sign language, and semaphore input.
  • New text input methods can be created based on this new data processing system. For example, on a device with a small touch screen, specific gestures can be designed as line breaks, word breaks, and end markers, with the content itself entered by full-screen handwriting or by voice. The input content is divided at the word breaks and stored in the code warehouse, and the corresponding text codes are obtained.
  • A 3D glove-based sign language input method can be designed. The motion information of the 3D glove is stored as text content in the code repository, with each code corresponding to a character, and a certain time interval is used to separate the actions.
  • the output of the sign language is to play back the 3D glove motion information in the code warehouse through the 3D model.
  • the new data processing system has the following advantages:
  • The first aspect: simple and natural
  • the new data processing system does not require the generation of specific standard encodings, so the simplest and most natural input method can be designed for the average user to directly encode the result into a personalized encoding.
  • the user can input any content he wants to express, including graphics, symbols, sounds, videos and other multimedia data.
  • The text input in the new data processing system does not need to be recognized, which ensures uninterrupted and efficient input and guarantees a smooth and natural user input experience.
  • The new data processing system uses non-standardized, object-based reference encoding. People cannot understand the content from the text coding sequence alone; they need to get the specific content of each code from the code repository.
  • the access control of the code repository ensures the security of the data content.
  • The code repository is essentially a full-featured cryptographic server. Further, the code sequence and the data in the code repository can be placed in different secure channels, which greatly increases the difficulty for a data thief of obtaining all the data.
  • non-standard text based on object encoding can be context-sensitive text.
  • the same encoding can vary from person to person, from application to application, from document to document, from time to time, from location to location, and so on.
  • the application system, and even the individual user can register a new context specification with the code repository, thereby introducing a new coding space to further isolate the text code.
  • the new system has natural security and privacy.
  • The authorized access service of the encoding warehouse can specifically control these special encodings, so as to encrypt specific text encodings under specific conditions.
  • the specific conditions here may be rules based on context (time, place, environment, user, application, etc.) to achieve complex, flexible text encoding security.
  • the encoding repository can also provide users or systems for identity authentication and digital copyright protection.
  • The third aspect: open
  • object-based coded data processing systems are a fully open system. Any data object can be placed in the code repository and its reference code can be recorded in non-standard text.
  • Software developers can register new context object specifications, new encoding spaces, new encoding meta objects, and new data objects, or add new encoding services to the system, including new non-standard text services (such as new non-standard text input and output, non-standard text editing, and other systems).
  • The new data processing system divides, splits, and encodes the content. In this process, the system can directly filter out useless information and retain only the important information that people pay attention to, for example by filtering out noise in audio or noise points in scanned text. Moreover, through the content normalization service, duplicate content does not need to be stored repeatedly, which greatly reduces the storage space and improves the transmission speed. More importantly, we can use the existing word processing infrastructure and tools to process the text-encoded content formed in the new data processing system, for example for searching, indexing, and editing.
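  • The description does not say how the content normalization service detects duplicates. One possible approach, shown here purely as an assumption, is to key stored content by a cryptographic digest so that identical content always maps to the same reference code. `NormalizingRepository` and its methods are hypothetical names.

```python
import hashlib


class NormalizingRepository:
    """Hypothetical repository that stores each distinct content only once."""

    def __init__(self):
        self._code_by_digest = {}   # SHA-256 digest -> reference code
        self._content_by_code = {}  # reference code -> stored content

    def store(self, content: bytes) -> int:
        """Return the existing code for known content, or assign a new one."""
        digest = hashlib.sha256(content).digest()
        code = self._code_by_digest.get(digest)
        if code is None:
            code = len(self._content_by_code) + 1
            self._code_by_digest[digest] = code
            self._content_by_code[code] = content
        return code

    def fetch(self, code: int) -> bytes:
        return self._content_by_code[code]

    def stored_objects(self) -> int:
        return len(self._content_by_code)


repo = NormalizingRepository()
codes = [repo.store(chunk) for chunk in (b"the", b"cat", b"the")]
print(codes, repo.stored_objects())
```

  • Exact-match hashing only removes byte-identical duplicates; normalizing similar but not identical handwriting or audio would need a different matching step, which is outside this sketch.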
  • the flexibility of coding deployment means that for the same encoding type, we can selectively configure it into different encoding spaces, thus having different security levels and visibility.
  • The flexibility of access control means that the user or the administrator of the application system can configure access to the object codes very flexibly through the access control settings of the code repository. On the one hand, access control can be configured at different coding levels: a coding space, encoding metadata, or even a specific data object. On the other hand, access control for an encoding can be based on different conditions, such as time, location, user, application, state of the domain model, and so on.
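  • A minimal sketch of such rules, assuming a deny-by-default policy: each rule names a coding space, optionally narrows to one meta code or one data object, and carries a condition evaluated against the request context. `AccessRule`, `is_allowed`, and the context keys `user` and `hour` are illustrative assumptions, not part of the described system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AccessRule:
    space: str                         # coding space the rule applies to
    meta_code: Optional[int]           # None = every meta code in the space
    instance_code: Optional[int]       # None = every object of that type
    condition: Callable[[dict], bool]  # evaluated against the request context


def is_allowed(rules, space, meta_code, instance_code, context) -> bool:
    """Grant access if any matching rule's condition holds; deny otherwise."""
    for rule in rules:
        if rule.space != space:
            continue
        if rule.meta_code is not None and rule.meta_code != meta_code:
            continue
        if rule.instance_code is not None and rule.instance_code != instance_code:
            continue
        if rule.condition(context):
            return True
    return False


rules = [
    # Space-level rule: only alice may read anything in her private space.
    AccessRule("alice/private", None, None, lambda ctx: ctx["user"] == "alice"),
    # Type-level rule: meta code 7 in the shared space is readable in office hours.
    AccessRule("shared", 7, None, lambda ctx: 9 <= ctx["hour"] < 17),
]

print(is_allowed(rules, "shared", 7, 1, {"user": "bob", "hour": 10}))
```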
  • the data object encoding and the split storage of content in the new data processing system ensure efficient storage and transmission.
  • the content of the data object needs to be transferred from the encoding repository to the consumer only when it is really needed.
  • the unidentified data object content formed in the new data processing system can be a good personalized identification training sample.
  • the trained text recognition system can more effectively identify personalized non-standard text into corresponding standard codes.
  • the format information of the text can be stored in the code repository.
  • Because text format characters use non-standard encoding, text data can use standard characters arbitrarily without escaping, which makes text data transmission and processing efficient.
  • the new data processing system mainly has the following aspects:
  • the first aspect is conducive to the popularity and depth of personal computing.
  • the new data processing system makes it possible to access traditional text input methods that are close to nature, solving many people's problems of "computer input is difficult".
  • a safe, natural data processing system is more acceptable to ordinary people.
  • Such computer text input no longer depends on the individual's cultural background or degree of familiarity with the keyboard, which is conducive to the popularity and depth of personal computing.
  • the second aspect is conducive to the popularity and depth of cloud computing.
  • the third aspect is conducive to the development and popularization of the Internet of Things.
  • The Internet of Things combines intelligent sensing technology, recognition technology, and pervasive computing technology, and is called the third wave of information industry development after computers and the Internet.
  • the Internet of Things is an extension of the Internet.
  • the Internet of Things has an urgent need for object addressing coding/identification at the three levels of the sensing layer, the network layer, and the application layer.
  • The number of nodes is large, their variety is great, and their processing capability is limited. This is a huge challenge, and no common standard has yet been formed.
  • a simple and flexible object coding mechanism can well meet these needs.
  • the fourth aspect is conducive to cultural protection and inheritance
  • Keyboard input of computer text has caused many people to "forget how to write characters when they pick up a pen".
  • the new data processing system maintains the original writing tradition of humans.
  • the fifth aspect is conducive to environmental protection
  • The new data processing system makes the direct input and use of text on electronic devices more natural, convenient, and secure, which is conducive to the formation of a paperless environment and ultimately reduces the use of paper.
  • FIG. 5C is a flowchart of Embodiment 1 of an encoding processing method provided by the present invention.
  • an execution body of the method in this embodiment is an encoding system, and the method includes:
  • Step 101C Acquire a data object to be encoded and its metadata according to the received encoding processing request.
  • the metadata of the acquired object is mainly the encoded metadata of the acquired object.
  • The encoded metadata can be a subset or the complete set of the metadata. Examples include, but are not limited to, the type of the object, the corresponding data structure, and constraints on storage, transmission, and control.
  • the metadata of the object is the basis of the system and must be extracted from the data in some way.
  • The object's metadata can be obtained automatically on modern software platforms, for example through the reflection mechanisms of Java, .NET, and the like.
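  • The text names Java and .NET reflection; as an analogous sketch in Python, the standard `dataclasses.fields` function can recover the type name and field structure of an object at runtime. The `Note` class and `extract_metadata` helper are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, fields


@dataclass
class Note:
    """Example data object: one note of a melody."""
    title: str
    pitch: int


def extract_metadata(obj) -> dict:
    """Collect the type name and the (field name, field type) pairs of obj."""
    return {
        "type": type(obj).__name__,
        "fields": [
            (f.name, f.type.__name__ if isinstance(f.type, type) else str(f.type))
            for f in fields(obj)
        ],
    }


metadata = extract_metadata(Note("A4", 69))
print(metadata)
```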
  • The data object (also referred to herein as an object) is the basic object of data processing in the present invention, that is, the target object to be encoded. It can be data in any form: a single word or symbol or part of one, an audio, video, or multimedia stream or a fragment thereof, or an encoding itself or a document. It includes at least the metadata portion (or metadata) of the data object, and usually also includes the content data portion, which is what remains of the data object after the metadata is stripped; this is also called the content of the object, the data content, or the content data. The content data can be related or unrelated to the metadata portion.
  • Metadata is data about data objects, and is a description of the characteristics, attributes, intrinsic logical relationships, and/or structures of data objects. Metadata can appear inside the data, outside the data, or along with the data. Metadata may include such things as the type of object, creation and/or modification dates, historical version information, data structures, interfaces, storage constraints, transmission constraints, encoding constraints, encoding context constraints, and the like.
  • Specific metadata examples may include, but are not limited to, information on the following: the description of an assembly (its identity, i.e., name, version, culture, and public key; the types it exports; other assemblies it depends on; its security permissions); the description of a type (its name, visibility, base class, and implemented interfaces; its members, i.e., methods, fields, properties, events, and nested types); attributes (other descriptive elements that modify types and members); header and/or table structure information for tables; palettes in drawing files; and more.
  • Metadata is different for different data objects. For example, for the metadata portion of the data object we call it the metadata of the data object; for the metadata portion of the encoding object mentioned later we can call it the encoding metadata.
  • the ability to acquire or add metadata corresponding to a data object at runtime is the basis for the system to encode data objects.
  • Step 102C Acquire an object code of the data object according to the encoding warehouse and the data object and metadata thereof.
  • The data object to be encoded and its metadata are obtained according to the received encoding processing request, and the object encoding of the data object is obtained according to the encoding warehouse, the data object, and its metadata. Because the data object can be encoded according to its metadata and the encoding repository, flexible and diverse encoding is enabled.
  • FIG. 5D is a flowchart of a specific implementation manner of step 102C in FIG. 5C.
  • a specific implementation manner of step 102C is as follows:
  • Step 102C1 Select or create an encoding protocol according to the encoding repository and at least a portion of the metadata, and generate a meta encoding corresponding to the metadata according to the encoding specification.
  • metadata related to the subsequent encoding process may be further selected from the metadata, and then the corresponding encoding specification may be created or generated based on the selected metadata.
  • an encoding specification is selected or created, and the encoding specification is saved.
  • The encoding protocol will be used to generate the corresponding encoding. A default or preset encoding protocol can also be configured for the system's encoding and decoding; in that case, the protocol only needs to be selected, and no new encoding protocol is created. Some or all of the encoding protocols can be selected or created by the user interactively. It is worth mentioning that an encoding protocol generated during the encoding process can either be destroyed automatically after the encoding process is completed (after the encoding factory) or be saved.
  • the process of adding or creating a coding specification can be done while the object is being modeled; it can also be done while the specific application is running. It can be done automatically by certain rules or by interaction.
  • the coding protocol mainly includes the coding mode of the object, and the coding constraints of the internal structure of the object.
  • Step 102C2 Encode the data content of the data object according to the coding protocol to obtain an instance code, and acquire an object code corresponding to the data object according to the meta code and the instance code.
  • the object coding is a reference coding form or a content coding form.
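  • Steps 102C1 and 102C2 can be sketched as follows for the content encoding form. The sketch assumes that the meta code is simply the index of the encoding protocol in the repository, that the protocol is selected by object type, and that JSON is the serialization scheme; none of these choices comes from the description, and all names are hypothetical.

```python
import json


class EncodingRepository:
    """Hypothetical repository holding encoding protocols (specifications)."""

    def __init__(self):
        self._protocols = []  # the index of a protocol serves as its meta code

    def select_or_create_protocol(self, metadata: dict) -> int:
        """Step 102C1: reuse a matching protocol or register a new one."""
        protocol = {"type": metadata["type"], "serialization": "json"}
        if protocol in self._protocols:
            return self._protocols.index(protocol)
        self._protocols.append(protocol)
        return len(self._protocols) - 1

    def protocol(self, meta_code: int) -> dict:
        return self._protocols[meta_code]


def encode_object(repo: EncodingRepository, content, metadata: dict) -> bytes:
    meta_code = repo.select_or_create_protocol(metadata)       # step 102C1
    protocol = repo.protocol(meta_code)
    if protocol["serialization"] != "json":
        raise ValueError("this sketch only implements JSON serialization")
    instance_code = json.dumps(content, sort_keys=True).encode("utf-8")  # step 102C2
    return meta_code.to_bytes(2, "big") + instance_code         # meta code + instance code


repo = EncodingRepository()
point_code = encode_object(repo, {"x": 1}, {"type": "Point"})
other_point_code = encode_object(repo, {"x": 2}, {"type": "Point"})
date_code = encode_object(repo, "2015-08-11", {"type": "Date"})
print(point_code, date_code)
```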
  • The encoding system mainly includes an encoding warehouse and a client, and the encoding processing flow can have two implementation manners, detailed as follows. The first implementation is:
  • Step 1a The client acquires the data object to be encoded and its metadata according to the received encoding processing request.
  • Step 2a The client sends the data object to be encoded and its metadata to the code repository.
  • Step 3a The encoding repository selects or creates an encoding specification according to at least a part of the metadata, and generates a meta encoding corresponding to the metadata according to the encoding specification.
  • The object coding protocol (which may be referred to as an encoding protocol) refers to the specification of and constraints on how the data object is encoded and decoded. It can include the encoding form of data objects (content encoding, reference encoding, or a mixture of both), the encoding constraints of object metadata (such as the schemes for data serialization, word length, endianness, data alignment, etc.), and so on.
  • the object encoding protocol can also be used as part of the metadata of the data object.
  • Object encoding conventions can be added manually (through the modeler) or automatically (via the tool) when the object is modeled, or interactively (by the user) or automatically (via system policy) at runtime.
  • Encoding metadata refers to metadata associated with a data object codec.
  • the encoded metadata can be part or all of the metadata.
  • the encoding metadata of the data object is the basis for the system to encode and decode the data object.
  • Step 4a The code repository encodes the data content of the data object according to the coding protocol, obtains an instance code, and acquires an object code corresponding to the data object according to the meta code and the instance code.
  • the data object and its metadata are stored in an encoding repository.
  • the corresponding object code generated by the code repository is actually the reference code of the data object in the code repository.
  • Step 5a The client receives the object code returned by the encoding warehouse.
  • the second implementation is:
  • Step 1b The client obtains the data object to be encoded and its metadata according to the received encoding processing request.
  • Step 2b The client queries the encoding warehouse to select or create an encoding specification according to at least a part of the metadata, and generates a meta encoding corresponding to the metadata according to the encoding specification.
  • the client proposes an encoding process request to the encoding server in the encoding repository to obtain a meta-encoding corresponding to the encoding meta-object (actually a reference encoding of the encoding meta-object in the encoding repository).
  • the meta-encoding may include one or a combination and/or nesting of: type coding, spatial coding, and context coding.
  • Step 3b The client encodes the data content of the data object according to the coding protocol, obtains an instance code, and obtains an object code corresponding to the data object according to the meta code and the instance code.
  • The generation of the instance code is correspondingly divided into two types. For an instance code in the content encoding form, the encoding client directly serializes the content of the data object into an instance code according to the coding protocol.
  • For an instance code in the reference encoding form, the encoding client sends an encoding request to the encoding server; the encoding server obtains the corresponding data object, the encoding specification, and related information according to the request, stores the data object in the encoding warehouse according to the encoding specification and related information, generates the corresponding instance code, and returns it to the client.
  • the decoding process of the object encoding is the inverse of the encoding process.
  • the encoding server obtains the object code to be decoded according to the decoding processing request of the encoding client.
  • the data object in the encoding repository is located according to the encoding and returned to the client.
  • the object encoding obtained for reading multiple steps.
  • the encoding client parses the object encoding into a meta-code and an instance code according to a preset rule.
  • a metacoded decoding request is sent to the encoding server.
  • The decoding of the above instance code is also divided into two types. For the content encoding form, the encoding client can directly decode the instance code into the corresponding data object content according to the encoding protocol.
  • For the reference encoding form, the encoding client issues an instance code decoding request to the encoding server; the encoding server obtains the corresponding instance code, encoding protocol, and related information according to the request, locates the data object in the encoding warehouse, and returns it to the client.
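  • The two instance code forms and their decoding can be sketched as a round trip. In this illustration the first byte of the object code plays the role of the meta code and tells the client which form follows; that one-byte layout, the JSON serialization, and the names `EncodingServer`, `client_encode`, and `client_decode` are all assumptions made for the example.

```python
import json

CONTENT_FORM = 0    # hypothetical meta code: instance code carries the content
REFERENCE_FORM = 1  # hypothetical meta code: instance code points into the repository


class EncodingServer:
    """Stands in for the encoding server in front of the encoding repository."""

    def __init__(self):
        self._objects = {}

    def encode_reference(self, content) -> bytes:
        """Store the data object and return a 4-byte instance code."""
        code = len(self._objects) + 1
        self._objects[code] = content
        return code.to_bytes(4, "big")

    def decode_reference(self, instance_code: bytes):
        """Locate the data object that an instance code refers to."""
        return self._objects[int.from_bytes(instance_code, "big")]


def client_encode(server: EncodingServer, content, form: int) -> bytes:
    if form == CONTENT_FORM:
        instance_code = json.dumps(content).encode("utf-8")  # serialized on the client
    else:
        instance_code = server.encode_reference(content)     # stored on the server
    return bytes([form]) + instance_code


def client_decode(server: EncodingServer, object_code: bytes):
    form, instance_code = object_code[0], object_code[1:]    # parse meta / instance code
    if form == CONTENT_FORM:
        return json.loads(instance_code.decode("utf-8"))     # decoded on the client
    return server.decode_reference(instance_code)            # fetched from the server


server = EncodingServer()
inline_code = client_encode(server, [1, 2, 3], CONTENT_FORM)
reference_code = client_encode(server, {"melody": "C-E-G"}, REFERENCE_FORM)
print(client_decode(server, inline_code), client_decode(server, reference_code))
```

  • The reference form keeps the object code short and fixed in size (5 bytes here), at the cost of a round trip to the server when the content is needed.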
  • the system first acquires the encoded metadata; and then obtains the corresponding content encoding according to the metadata.
  • the encoding metadata may include encoding type information for locating, loading, or transmitting the encoded content, and constraint information for the target encoding space to which the encoding belongs.
  • the encoding metadata is encoded to obtain a meta-encoding.
  • the encoded content of the meta-encoding in the encoding repository is mainly the encoding meta-object.
  • Meta-encoding is generally an integral part of encoding. After the decoder parses the meta-encoding from the encoding, the corresponding encoding metadata can be obtained according to a certain mechanism.
  • The encoding metadata can itself be treated as a data object, that is, a data object whose content is encoding metadata, which may be referred to as an encoding meta object. Encoding meta objects can also have their own meta code. Therefore, the encoding metadata, as a data object, may also have its corresponding metadata encoding, called meta-encoding.
  • FIG. 6 shows the relationship between data objects, metadata, encoding protocols, and encoding meta objects.
  • The encoding meta object is also a data object (relative to a normal data object, it is an object at the M1 abstraction level), and the model of its metadata (at abstraction level M2) is called the encoding metamodel.
  • the encoded metadata of the encoded meta-object is part of the encoding metamodel.
  • The coding metamodel is the cornerstone of the object coding system.
  • The coding metamodel is relatively stable at runtime and does not change dynamically, but it can be extended. That is to say, the encoding metadata of the encoding meta object is built into the system. Therefore, the system can directly store, transfer, encode, and decode these encoding meta-objects.
  • An object coding system can correspond to a unique core coding metamodel (which can have an extension mechanism).
  • FIG. 7 is a schematic diagram of the core coding metamodel.
  • the meta-encoding as the object encoding of the encoding meta-object, does it also have its own meta-encoding? This is actually related to the specific design of the coding metamodel and the codec method. If there is only one encoding meta-object in the encoding metamodel, the meta-encoding is all of the encoding meta-object. If there are multiple encoding meta objects in the metamodel and they can be encoded into the same metacode at the same time, then this case does not require metacoded metacode. Otherwise, metacoded metacodes are needed to distinguish them. Sometimes, there is a certain hierarchical relationship between the encoded meta-objects. In this case, multi-level decoding may be required to obtain the encoded meta-object of the final data object.
  • Variable-length coding expresses this meta-object hierarchy more directly and flexibly, and it is easy to handle: each code word is the meta code of the code word that follows it, so that multiple levels can be nested.
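  • A minimal sketch of this nesting, assuming 16-bit big-endian code words and a small made-up hierarchy: each code word selects an entry at the current level, and that entry determines how the next code word is interpreted. Decoding stops descending at a leaf, and the remaining words form the instance code. The `HIERARCHY` table and `split_code` are hypothetical.

```python
import struct

# Hypothetical meta-object hierarchy: code word -> (name, next level).
# An empty next level marks a leaf, i.e., the last meta code word.
HIERARCHY = {
    0x0001: ("owner:alice", {
        0x0001: ("type:glyph", {}),
        0x0002: ("type:audio", {}),
    }),
    0x0002: ("type:date", {}),  # a type that has no owner level
}


def split_code(code: bytes):
    """Walk the hierarchy and return (meta path, remaining instance words)."""
    words = struct.unpack(f">{len(code) // 2}H", code)
    level, path, i = HIERARCHY, [], 0
    while level:                        # descend until a leaf is reached
        name, level = level[words[i]]   # this word is the meta code of the next
        path.append(name)
        i += 1
    return path, list(words[i:])


owned = split_code(struct.pack(">3H", 0x0001, 0x0002, 0x00FF))
unowned = split_code(struct.pack(">2H", 0x0002, 0x0005))
print(owned, unowned)
```

  • Codes with and without an owner level have different lengths (three words versus two here), which is what makes the encoding variable-length.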
  • FIG. 8 is a conceptual model of object coding, meta-encoding, and instance coding (that is, the object coding with the meta-encoding part removed), together with the data object and the coding meta-object. As shown in FIG. 8, the following relationships hold:
  • the encoding meta object can also be used as a data object.
  • Meta-encoding itself can also be used as an object encoding
  • Object coding includes meta coding and instance coding
  • The object encoding is associated with the corresponding data object, which implies the same correspondence between the meta-encoding and the encoding meta-object (mainly implicit in relationships 1 and 2 above).
  • FIG. 9 is an exemplary diagram of the meta-encoding in the present embodiment.
  • the object encoding is a 128-bit fixed-length encoding.
  • the owner of the object and the object type. They can be related or unrelated, depending on the definition in the encoding metamodel; the corresponding coding logic differs between the two cases.
  • FIG. 10 is an exemplary diagram of a similar layer-by-layer correlation of coded meta-objects (variable-length coding of 16-bit word length).
  • FIG. 11 is a schematic diagram of a meta model corresponding to the encoding.
  • The encoding type can have one owner (01) or no owner (00). Therefore, both of the above encoding forms are legal. An object encoding whose meta-encoding consists only of the type encoding corresponds to a data object without an owner; the other form represents a data object with an owner.
  • the meta-encoding is generated based on the metadata and the encoding protocol, and an instance encoding is generated based on the data content.
  • a coding factory is another important component of a system that can be dynamically created by an encoding repository or across components or across systems.
  • the coding factory can provide direct codec services for related objects.
  • the code repository can provide two important services: registration and access to encoded metadata; encoding and decoding of object reference encoding.
  • The encoding repository can also use external storage services to store encoded metadata, object data, and so on.
  • the final object encoding is generated from the meta-code and the instance code based on predetermined rules.
  • the meta-encoding and the instance coding may be combined into an object coding in an arbitrary manner, such as splicing or by some kind of operation, etc., as long as the two can be reversely disassembled and restored at the time of decoding.
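  • As one example of a reversible combination, the sketch below splices the two codes together behind a one-byte length prefix so that the decoder knows where the meta code ends. The length prefix is an assumption made here; any other scheme that can be undone at decoding time would satisfy the requirement equally well.

```python
def combine(meta_code: bytes, instance_code: bytes) -> bytes:
    """Splice meta code and instance code, prefixed by the meta code length."""
    if len(meta_code) > 255:
        raise ValueError("meta code too long for a one-byte length prefix")
    return bytes([len(meta_code)]) + meta_code + instance_code


def split(object_code: bytes):
    """Reverse combine(): recover (meta code, instance code)."""
    meta_length = object_code[0]
    meta_code = object_code[1:1 + meta_length]
    instance_code = object_code[1 + meta_length:]
    return meta_code, instance_code


object_code = combine(b"\x00\x07", b"\x00\x00\x00\x2a")
print(object_code, split(object_code))
```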
  • the process of generating the object encoding can be placed on the client side or automatically by the encoding factory, depending on the actual design.
  • the content data may also be the application object itself, or may be positioning and index information of the application object.
  • the data access component of the application system can obtain the corresponding application data through some means or algorithm according to the content data, thereby obtaining the final application object.
  • the content of the data object can be stored in a third party storage system that interfaces with the encoding repository, in which case the encoding repository needs to store information about accessing data objects in the third party storage system.
  • the process of encoding a data object is referred to as object-based encoding.
  • Data serialization, referred to as serialization, is the process of encoding content into data.
  • The metadata and the content data of the data object ultimately need to be serialized, and are either stored in the object-based encoding result itself (content encoding method) or stored somewhere other than the result (reference encoding method).
  • the content of the data object and the content of the metadata need to be serialized before being transmitted in the system.
  • the serialization of data objects can also be built entirely on object-based coding methods.
  • the key is that the encoded metadata is stored in the encoding warehouse by the method to obtain the corresponding encoded meta-object reference encoding, that is, the meta-encoding.
  • object-based reference coding is the basis of this method.
  • the encoded meta-object can be reference coded to obtain the meta-encoding.
  • With meta-encoding, we can both reference-encode the data object and serialize the data object (that is, content-encode it). When implementing reference encoding, it is preferable to first obtain the content encoding of the data object (applying this method to itself), transfer the content encoding to the encoding warehouse for storage, and then obtain the reference encoding.
  • object encoding refers to encoding of an arbitrary object.
  • The objects here can be entity objects such as data, content information, images, and voices (generally they can be reference-encoded); value objects (for example, dates, which can be instance-encoded); or high-level objects that include internal object structures, such as array objects, table objects, tree/document objects, and more.
  • Object encoding is one of the outputs of this system for encoding arbitrary objects, and is also one of the inputs for object decoding.
  • FIG. 12 is a schematic diagram of a conceptual model of the object encoding.
  • the object encoding may include two parts: one is a meta-encoding, and the other is an instance encoding.
  • Meta-encoding is the encoding of an encoded meta-object. Meta-encoding is generally an integral part of object encoding. After the decoder parses the meta-encoding from the encoding, the corresponding encoding metadata can be obtained according to a certain mechanism.
  • Content encoding is the encoding of data content under the corresponding encoding constraints.
  • FIG. 13 is a flowchart of Embodiment 2 of an encoding processing method according to the present invention. On the basis of the foregoing embodiment shown in FIG. 5C, as shown in FIG. 13, the method in this embodiment further includes:
  • Step 201C Set access rights to data in the encoding warehouse.
  • the data may be metadata, data objects, and the like.
  • the metadata includes one or a combination of the following:
  • type of the data object, creation time of the data object, modification time of the data object, historical version information of the data object, data structure of the data object, interface of the data object, storage constraints of the data object, transmission constraints of the data object, and encoding constraints of the data object (including constraints on the encoding space).
  • the method may further include:
  • Step 202C Send the object code to the target client.
  • FIG. 14 is a flowchart of Embodiment 3 of an encoding processing method according to the present invention.
  • a specific implementation manner of step 102C2 is:
  • Step 301C Acquire a context object.
  • Step 302C Acquire a corresponding coding space according to the context object and the coded protocol.
  • Step 303C Encode the data content in the data object in the coding space to obtain an instance code.
  • Step 304C Acquire an object code corresponding to the data object according to the meta code and the instance code.
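  • Steps 301C to 304C can be sketched as below. This is a minimal illustration under stated assumptions: the `CodingSpace` class, keying spaces by a context string, sequential instance codes and returning the object code as a `(meta_code, instance_code)` tuple are all hypothetical choices, and the encoding protocol mentioned in Step 302C is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class CodingSpace:
    """Assigns instance codes to content within one isolated space."""
    name: str
    codes: dict = field(default_factory=dict)

    def encode(self, content: bytes) -> int:
        # Reuse the existing code if the content was already seen in this space.
        return self.codes.setdefault(content, len(self.codes))


def encode_object(meta_code, content, context, spaces):
    # Step 301C: the context object is supplied by the caller.
    # Step 302C: pick the coding space for this context (keyed by name here).
    space = spaces.setdefault(context, CodingSpace(context))
    # Step 303C: encode the data content inside that space.
    instance_code = space.encode(content)
    # Step 304C: combine meta code and instance code into the object code.
    return (meta_code, instance_code)


spaces = {}
code_a = encode_object(7, b"glyph-1", "user:alice", spaces)
code_b = encode_object(7, b"glyph-2", "user:alice", spaces)
code_c = encode_object(7, b"glyph-9", "user:bob", spaces)
```

  • Note that `code_a` and `code_c` come out identical even though they denote different content, because the two instance codes live in different coding spaces.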
  • the encoding warehouse (also referred to herein as an encoding repository) may be a repository that stores encoded metadata, encoded meta-objects, and object data, and it may also provide related services. Similar to a font library based on a standardized encoding system, the glyphs corresponding to the character encodings in the handwriting input system of the present invention can also be stored in the encoding warehouse. FIG. 15 is a schematic diagram of glyphs, corresponding to non-standard character encodings, stored in the encoding warehouse in the handwriting input system of the embodiment. As shown in FIG. 15, by accessing the glyph information in the encoding warehouse, an application using the new data processing system can render any text glyph.
  • the new data processing system uses a solution based on object open coding. You can encode graphics, voice, or other multimedia data, as well as encode different domain data. These encoded metadata are also stored in the encoding repository.
  • the application system can not only query and use various encodings in the encoding warehouse, but also register new encoding types with the encoding warehouse and submit encoded data to them.
  • FIG. 16 is a core conceptual diagram of an encoding metamodel of an exemplary context-dependent object encoding system, as shown in FIG. 16, which illustrates the relationship between some of the core concepts in the encoding metamodel. The definition of these specific concepts is then given.
  • the encoding space refers to the logical space that isolates the object encoding. Objects corresponding to different instance codes of the same object type in different coding spaces are different.
  • the coding space is directly related to one or several coding objects (only one in the above-mentioned coding metamodel); this object (or these objects) is called the direct context of the space and of the coding objects in the space, and the space is called the coding space of this object (or these objects).
  • the coding space of the coding object in the coding space is called a subspace.
  • the encoding space is called the parent space of its child space.
  • the encoding space without a parent space is called the root space.
  • the root space is generally the encoding space of the encoding repository.
  • the coding space is a means of hierarchically classifying and isolating the encoded metadata.
  • the coding space is hierarchical, that is, the coding space can also have subspaces.
  • the same code belonging to different coding spaces can correspond to different objects.
  • the same element code can be completely different in different spaces.
  • different coding spaces have different levels of security isolation for encoding.
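  • The hierarchy of coding spaces described above (root space, parent space, subspaces, and the same code meaning different things in different spaces) can be pictured with a small tree structure. The `Space` class and the space names below are assumptions made purely for illustration.

```python
class Space:
    """A coding space that may have a parent space and child subspaces."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # None marks the root space
        self.objects = {}     # code -> object, private to this space

    def subspace(self, name):
        """Create a child space whose parent is this space."""
        return Space(name, parent=self)

    def path(self):
        # Walk up to the root space to build the full space path.
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return "/".join(reversed(names))


root = Space("warehouse")     # root space of a hypothetical warehouse
alice = root.subspace("alice")
bob = root.subspace("bob")
alice.objects[1] = "alice's glyph"
bob.objects[1] = "bob's glyph"  # same code 1, different object
```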
  • Figure 17 is a schematic diagram of a base object that can be applied to a basic coding space.
  • any code is present in the code repository, with the exception of standard codes.
  • different encoding warehouses correspond to different encoding spaces.
  • the encoding space corresponding to an encoding warehouse is the root space of all encodings of this encoding warehouse.
  • each code has its own owner. Then the coding of different users belongs to different user coding spaces. With the complexity of user models in the coding warehouse, the division of user space can be more complicated. For example, there may be a group space shared by multiple users.
  • the encoding is to be serialized into a specific data store.
  • This data store can be a file, a database field, or a string that is transmitted over the network. Separating the encoding for this data content itself maximizes the security of the encoding. In fact, this content space based on data content isolation is a password book that establishes a content-to-code correspondence.
  • in the context in which encodings are formed and used, the above two encoding spaces (named encoding space, context encoding space) may be implicitly present. We call this the context space.
  • the combination of different kinds of context objects determines the final context space. For example, different combinations of user and application correspond to different context spaces. In general, however, a code in non-standard text content corresponds uniquely to that content, and the content itself implies the corresponding application and user (except, of course, for multi-application or multi-user content). Therefore, it is not necessary to divide the content space into application subspaces or user subspaces. Among all context spaces there is one special space, the context-independent coding space, whose codes we call public codes; standardized codes are in fact public codes. A code in the root space is not a public code but a code tied to the encoding warehouse: its encoding space is the root space corresponding to the encoding warehouse.
  • any code will eventually be embodied as a code.
  • the last code corresponding to the coding space is a meta-code, which we can call spatial coding.
  • the encoding space is actually a special encoding meta-object - its corresponding object instance is still an encoding meta-object.
  • For context-independent spatial coding there is no coding space for this encoding.
  • the coding can correspond to different coding spaces depending on the context object. Therefore, for context-independent coding spaces, such as named encoding space, we can directly use spatial encoding, and the corresponding instance encoding is subspace encoding or other metacoding.
  • the code corresponding to the coded warehouse space is the coded warehouse code.
  • the content space corresponds to the instance code.
  • the application space corresponds to the application code.
  • User space corresponds to the user code.
  • Figure 18 is a schematic diagram of the coding structure of a 128-bit fixed-length coding scheme.
  • the arrangement and combination of the above codes are not unique.
  • the instance code can be placed at any position in the object code as long as it is clearly defined in advance.
  • context space coding is implicit in the context in which the encoding is used and does not need to appear in the final object encoding.
  • the currently used encoding repository implies the encoding warehouse encoding; the currently used encoding application implies the corresponding application encoding; the document content currently being encoded implies the instance encoding and the user encoding of the encoding owner (assuming a single-user document).
  • context space encoding must appear in the text to set different encoding contexts to isolate different spaces.
  • the text in a document includes the encoding of multiple encoding repositories.
  • the corresponding encoding warehouse code must appear in the content of the document to distinguish different encoding warehouse spaces.
  • an encoding repository that supports encoding repository encoding must provide information to access the encoding repository for the library encoding.
  • multi-user text content must use user encoding; application encoding must be used in content that can be read and written by multiple applications and that uses application space isolation.
  • Content space is an exception, because content encoding is the encoding of the content of the document itself, one-to-one correspondence with the content of the document. It is not possible to encode multiple content in any content, so the content encoding does not need to be displayed in the encoding.
  • the content encoding can be a hash value of the document content, or a hash value of the application encoding and time stamp. Therefore, content encoding is either calculated in real time or stored as content metadata.
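  • Both options for deriving a content encoding can be sketched with a standard hash function. The choice of SHA-256, the function names and the way the application code and timestamp are joined are assumptions; the text only says that a hash value is used.

```python
import hashlib


def content_code_from_document(document: bytes) -> str:
    """Content encoding as a hash of the document content itself."""
    return hashlib.sha256(document).hexdigest()


def content_code_from_app(app_code: str, timestamp: int) -> str:
    """Content encoding as a hash of an application code plus a timestamp."""
    return hashlib.sha256(f"{app_code}:{timestamp}".encode("utf-8")).hexdigest()


doc_code = content_code_from_document(b"hello")
app_code = content_code_from_app("journal-app", 1439337600)  # hypothetical inputs
```

  • The first variant can be recomputed from the document at any time, while the second depends on values that are not recoverable from the content, so it would have to be stored as content metadata.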
  • encoding does not need to include spatial encoding, but it is necessary to indicate which spatial encoding is used, which can be specified by using spatial bits in the encoding. This space bit actually corresponds to the coding context specification in the coding protocol.
  • FIG. 19 is a schematic diagram of four binary bits being four spatial bits.
  • the coded storage bit may also be called a reserved bit.
  • An illustrative example may be, for example, when the reserved bit is 0, the encoding is from the current encoding repository. Otherwise, additional information is required to define the encoding or specify the encoding source, such as the client encoding that will be mentioned later.
  • when the content bit is 0, the encoding is independent of the content; when it is 1, the encoding exists for the specific content.
  • when the application bit is 0, the code is independent of the application; when the bit is 1, it is an application-specific code.
  • when the user bit is 0, the code is a public code; when it is 1, it is a code owned by the current document user. The opposite convention is equally possible: any other coding scheme can be used as long as it can effectively distinguish different spaces.
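  • The four space bits (reserved, content, application and user) can be modelled as bit flags. The sketch below assumes one particular bit order, which the text does not fix; as noted above, any layout works as long as the spaces can be told apart.

```python
from enum import IntFlag


class SpaceBits(IntFlag):
    """Four space bits; the bit positions are an assumption."""
    RESERVED = 0b1000     # 0: code comes from the current encoding repository
    CONTENT = 0b0100      # 1: code exists for specific content
    APPLICATION = 0b0010  # 1: application-specific code
    USER = 0b0001         # 1: code owned by the current document user


def describe(bits: int) -> list:
    """List the names of the space bits that are set in a code."""
    flags = SpaceBits(bits)
    return [flag.name for flag in SpaceBits if flag in flags]


public_code = describe(0b0000)    # no bit set: context-independent code
user_app_code = describe(0b0011)  # application bit and user bit set
```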
  • the type encoding is the same as the normal encoding, and there is also a coding space.
  • the space of type coding and instance coding can be different.
  • using public coding for user space can serve as a security isolation for the user space.
  • the encoding type of the encoding is in the user space
  • the instance encoding is in the public space. Since an instance code must belong to a certain encoding type, instance codes of the same type have the same spatial bits.
  • the metadata of the encoding type in the encoding warehouse can be accessed according to the type encoding.
  • the type encoding must contain the corresponding space to ensure that the decoder can get the correct encoding type information from the encoding repository.
  • the type information in the encoding repository can contain the spatial bits corresponding to the instance encoding, so the spatial bits do not need to appear in the instance code.
  • Context space is the main means to securely isolate the code.
  • the main body that manages and sets the application with the generated encoding target space should be the individual corresponding to the context object (such as the user) and the administrator (such as system administrator and application administrator).
  • the management space is a hierarchical management that facilitates coding and is registered and used by the application.
  • the code word length is the minimum number of bits required to encode a character in a text encoding system.
  • the encoded word length of UTF-8 is 8 binary bits, or one byte.
  • the encoded word length of UTF-16 is two bytes. In an encoding with a given code word length, not every code has exactly this length, but the length of every code must be an integer multiple of the code word length. For an encoding system with a multibyte word length, it is also necessary to consider the byte order (endianness) within a code word. This problem does not exist for single-byte word lengths. All data is arranged in bytes from low to high.
  • in addition, regarding fixed-length and variable-length coding: a coding system in which all code lengths equal the code word length is called a fixed-length coding system; otherwise, it is called a variable-length coding system.
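  • The word-length statements above can be checked directly with Python's built-in codecs. The sample characters are arbitrary choices for illustration.

```python
samples = {
    "A": "ASCII letter",
    "\u20ac": "euro sign",
    "\U0001d11e": "musical G clef (outside the Basic Multilingual Plane)",
}

for char, label in samples.items():
    utf8 = char.encode("utf-8")       # code word length: 1 byte
    utf16 = char.encode("utf-16-le")  # code word length: 2 bytes, little-endian
    assert len(utf16) % 2 == 0        # always a whole number of code words
    print(label, len(utf8), len(utf16))

# Byte order only matters when a code word spans several bytes.
little = "A".encode("utf-16-le")
big = "A".encode("utf-16-be")
```

  • The two UTF-16 results contain the same two bytes in opposite order, which is exactly the endianness issue that a single-byte word length avoids.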
  • the coding word length and the associated coding method are closely related to the coding and decoding process, and are independent of the coding element model. That is to say, the object coding system corresponding to the same coding element model can select different coding word lengths and corresponding different coding methods. It is even possible to support multiple word lengths or combinations of encoding methods at the same time. Of course, it is necessary to design an effective mechanism to distinguish them.
  • the coding length and encoding method of the system are not directly related to the serialization word length and method specified in the specific object coding protocol. However, if the serialization result is part of the object encoding, the compatibility of the object encoding word length and the method needs to be considered.
  • the object encoding system can be a system that is independent of the encoding word length. That is to say, based on the same code repository, there can be different word length coding schemes.
  • a code word length often cannot put down a complete code (as mentioned above, including spatial coding, type coding, and instance coding).
  • one option is variable word length coding, that is, one code can include multiple words. For example, the metacode portion and the instance code portion are split into a plurality of consecutive code words. Even so, sometimes a word length encoding does not cover all encoding instances.
  • FIG. 20 is an example diagram of a coding scheme. As shown in FIG. 20, the scheme enables the encoder to automatically obtain the corresponding code word length from the first one or two bytes.
  • the scheme can represent a coding range of 0 to 2^65 - 1.
  • FIG. 21 is an exemplary diagram of the encoding scheme of UTF-8. Compared with the encoding scheme of UTF-8 (as shown in FIG. 21), it is found that the encoding results of the two encoding schemes do not conflict with each other and may appear in the same document.
  • the byte corresponds to the ASCII code portion of UTF-8; when the first two bits of the first byte of the code are 10, the corresponding code is the object.
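  • The reason a leading `10` can mark an object code without clashing with UTF-8 is that no UTF-8 character begins with a byte of the form `10xxxxxx`; such bytes occur only as continuation bytes. The sketch below checks this property on a sample of code points. The function name and the labels are assumptions, and the layout of the bytes that follow the lead byte in FIG. 20 is not modelled.

```python
def classify_lead_byte(byte: int) -> str:
    """Classify the first byte of a code by its high bits."""
    if byte < 0x80:          # 0xxxxxxx: ASCII portion of UTF-8
        return "ascii"
    if byte < 0xC0:          # 10xxxxxx: never starts a UTF-8 character
        return "object-code"
    return "utf8-multibyte"  # 11xxxxxx (simplified: not every such byte is valid)


# Sample the Unicode range and collect the lead-byte classes that occur.
lead_kinds = {
    classify_lead_byte(chr(cp).encode("utf-8")[0])
    for cp in range(0, 0x110000, 7)
    if not 0xD800 <= cp <= 0xDFFF  # surrogates cannot be encoded in UTF-8
}
```

  • A decoder relying on this must be positioned at a code boundary, since inside a multibyte UTF-8 character the continuation bytes share the same `10` prefix.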
  • variable length coding scheme with one byte word length and multiple byte word lengths can be designed.
  • the encoding type is the object type to which the relevant encoding convention is added.
  • the encoding context is an abstraction of the context object. It is actually the selection criteria for the selection of context objects at runtime.
  • the above encoding metamodel uses the encoding type plus the object role name. In the same encoding context (generally a specific application), the same type of role name must be unique.
  • the encoding context of the data object in the blog content should be the author user. In this way, when any reader opens the content, there is no problem that the decoding error occurs because the currently logged in user is not the author.
  • the premise of correct decoding is to correctly set the encoding context object. For the blog example, when opening each specific blog content, the corresponding author user object is set as the encoding context object.
  • the encoding context path is referred to as the encoding path, and corresponds to a series of encoding context conventions, which is a constraint on the encoding space to which the instance code of the corresponding data object belongs.
  • the definition of the coding space indicates that the coding space is a hierarchy associated with the encoded object with the associated encoding - the subspace can also have subspaces.
  • the encoding path is the encoding space path that is positioned to determine the encoding object. For example, the image encoding path in a personalized journal might look like this:
  • the image corresponding to the image object encoding can be found in the final application space.
  • the encoding path in the encoding metamodel is an encoding path at a higher level of abstraction, corresponding to:
  • this encoding path is instantiated to the above encoded path instance by selecting the corresponding context object.
  • the so-called context object is a concrete object corresponding to the context specification.
  • the object must conform to the constraints of the context specification and must be accessible during the corresponding encoding and decoding process. For example, there is an "author" context constraint whose corresponding type is "user".
  • when this context constraint is set, the current application cannot be used as the corresponding context object; an object of the "user" type must be used.
  • after the author information corresponding to the document is obtained, the author object can be set as the context object corresponding to the "author" context constraint. If the author object is inaccessible to the current user, the context object cannot be instantiated, which means that the encoding context constraint is not satisfied and the subsequent related instance codes cannot be decoded. This is also a concrete manifestation of the context-based coding security of this method.
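  • The "author" example can be sketched as a small binding check: the context object must have the type the constraint asks for and must be accessible, otherwise decoding cannot proceed. The class names, the `accessible` flag and the choice of exceptions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ContextConstraint:
    name: str           # e.g. "author"
    required_type: str  # e.g. "user"


@dataclass
class ContextObject:
    type: str
    ident: str
    accessible: bool = True  # whether the current user may access this object


def bind(constraint: ContextConstraint, obj: ContextObject) -> ContextObject:
    """Return obj as the context object, or raise if the constraint fails."""
    if obj.type != constraint.required_type:
        raise TypeError(f"{constraint.name!r} needs a {constraint.required_type!r} object")
    if not obj.accessible:
        raise PermissionError(f"{constraint.name!r} context object is not accessible")
    return obj


author = ContextConstraint("author", "user")
bound = bind(author, ContextObject("user", "alice"))
```

  • Passing an application object raises `TypeError`, and an inaccessible author raises `PermissionError`; in both cases the instance codes that depend on this context stay undecodable.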
  • the encoding path instance is directly related to the encoding space of the corresponding data object instance code in the encoding warehouse.
  • the storage location of the corresponding data object in the encoding warehouse may also be restricted by the encoding space.
  • the specific implementation of the encoding path for the encoding warehouse can have multiple choices depending on the storage scheme.
  • a simple implementation is to use simple context name splicing to form table names for context-sensitive data objects.
  • the table name of this picture table can be:
  • the instance code of the corresponding data object can directly use the keys of the table.
  • Another implementation of the coding space is to uniformly store the data objects, and only distinguish the coding space for coding.
  • the system maintains a table of encoding spaces as follows:
  • the code space ID field is the table primary key; the parent space ID is a foreign key of the table, and is used to represent the nested relationship of the code space.
  • the picture ID field is the primary key of the table.
  • the data for all images is placed in the table.
  • the other is the corresponding picture encoding table:
  • the code space ID field is a foreign key of the system code space table, and the picture ID field is a foreign key of the picture table.
  • the Encoding Space ID field plus the Encoding field is the primary key of the table.
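  • The three tables described above (system code space table, picture table, picture encoding table) can be sketched with SQLite. Only the key relationships come from the text; the table names, column names, the extra `name` column and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE code_space (
    space_id  INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES code_space(space_id),  -- nesting of spaces
    name      TEXT NOT NULL
);
CREATE TABLE picture (
    picture_id INTEGER PRIMARY KEY,
    data       BLOB NOT NULL
);
CREATE TABLE picture_code (
    space_id   INTEGER NOT NULL REFERENCES code_space(space_id),
    code       INTEGER NOT NULL,
    picture_id INTEGER NOT NULL REFERENCES picture(picture_id),
    PRIMARY KEY (space_id, code)
);
""")
conn.execute("INSERT INTO code_space VALUES (1, NULL, 'root')")
conn.execute("INSERT INTO code_space VALUES (2, 1, 'alice')")
conn.execute("INSERT INTO code_space VALUES (3, 1, 'bob')")
conn.executemany("INSERT INTO picture VALUES (?, ?)", [(10, b"cat"), (11, b"dog")])
# The same code 1 maps to different pictures in different spaces.
conn.executemany("INSERT INTO picture_code VALUES (?, ?, ?)", [(2, 1, 10), (3, 1, 11)])


def resolve(space_id: int, code: int) -> bytes:
    """Look up the picture data for a code within one coding space."""
    row = conn.execute(
        "SELECT p.data FROM picture_code c JOIN picture p USING (picture_id) "
        "WHERE c.space_id = ? AND c.code = ?", (space_id, code)).fetchone()
    return row[0]
```

  • Because the primary key of the encoding table is the pair (space ID, code), all pictures share one storage table while each coding space keeps its own independent code numbering.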
  • the encoded directory entry is a specific encoded meta-object encoded by the context-dependent object.
  • the encoding directory is a list of encoding directory entries.
  • Each encoding directory entry has a unique number in the encoding directory, which is the metacode.
  • the encoding directory entry is specifically the encoding type plus the encoding path.
  • the encoding path can be a relative path, that is, the current space of the encoding directory item, or an absolute path-based root space; or both can be supported at the same time, and only a mechanism for distinguishing the two needs to be established.
  • the meta-encoding (encoding corresponding to the encoding directory entry) and the instance encoding in the object encoding may not be in one encoding space.
  • the encoding directory entry can unify the spatial encoding and type encoding mentioned above. If the encoding type in the object data corresponding to a meta-encoding (actually an encoding directory entry) is itself an encoding directory entry, then the meta-encoding corresponds to an encoding space, and the instance encoding that follows the meta-encoding is in fact another meta-encoding. In this way, a meta-encoding can represent either a spatial encoding or the encoding of an encoding directory entry, depending on whether the corresponding encoding type is the encoding directory entry type. With this design, the meta-encoding of an object encoding can therefore be a group of one or more meta-encodings.

Abstract

Provided are methods for processing handwritten inputted characters, splitting and merging data and encoding and decoding processing. An object-based open encoding and decoding solution can encode and decode any data object in any free and open encoding manner; and with regard to an object-based data splitting/merging method, metadata and/or encoded data of a data object are split/stripped from corresponding data contents so as to ensure the security of the data contents. The methods can be implemented individually, and can also be implemented in combination, or can be combined with the applications in other technical fields either alone or in combination.

Description

Handwritten input character processing, data splitting and merging, and encoding and decoding processing methods
Technical Field
The present invention relates to data processing technologies, and in particular to methods for processing handwritten input characters, splitting and merging data, and encoding and decoding.
Background
At present, with the development of computers, the variety of coding technologies keeps growing. As a foundation of computing, coding technology is widely used in data transmission, storage and processing.
Among these, text encoding is the most basic: it lets humans input, view, edit and modify text, and lets computers analyze and process it. From the early ASCII standard to today's Unicode, standardized text encoding has been a basis for passing information between people, machines and systems. However, as a tool for recording human output, existing standardized text encoding is far from sufficient. With the spread of computers and the development of human-computer interaction, standard text encoding and the corresponding text input methods have gradually become a bottleneck for bringing natural human output into the digital world.
On top of standard text encoding, a series of general-purpose and special-purpose encoding methods have been developed that use characters and character sequences, through means such as markup, control and escaping, to express structured data and documents as well as domain-specific data. We call this text encoding, and the corresponding data formats text formats. General-purpose examples include XML/SGML, which describe complex structures as trees of tags, and JSON, which describes complex objects with JavaScript syntax. Special-purpose examples include XML-based HTML for web pages, MathML for mathematical expressions and SVG for vector graphics; CSV for tabular data; and RTF, Markdown and the like for formatted documents. Programming languages also mainly use text formats. Encoding based on standard text lets humans take part in creating, viewing, debugging and modifying data, eases integration and exchange between systems, speeds up system development and lowers the cost of troubleshooting. On the other hand, text formats are inherently redundant for expressing symbolic data and binary data. As the structures to be expressed become more complex, the markup and syntax of text-based encoding become much more complex, and data redundancy grows with them. In addition, because a given text encoding standard has a limited number of codes, conflicts between data content and the syntax markers of the encoding are unavoidable, and text escaping adds further redundancy.
The world inside a computer is digital, and binary data is its natural form of representation. Data in human-defined text formats is also often converted to binary data to reduce redundancy and improve processing and transmission efficiency. There are also some general-purpose binary encoding methods, such as ASN.1 (standardized by the International Organization for Standardization and the International Telecommunication Union), Google's Protocol Buffers, Apache Thrift and Avro, BSON, MessagePack and so on. In contrast to text-based encoding, however, binary data is relatively closed, hard to exchange and hard for humans to work with.
Encoding, whether text or binary, serves two purposes. One is to describe the data object itself, which is also called serialization; this specification calls it the content encoding of the data object. The encoding standards and methods mentioned above are mainly used for content encoding.
The other purpose is to describe the address of, or a reference to, a data object; this specification calls it the reference encoding of the data object. Text-based reference encodings include URNs, URLs, object identifiers (OIDs) in ASN.1 and so on; binary reference encodings include database keys, UUIDs/GUIDs, IP addresses, MAC addresses, MD5, SHA-1 and so on; there are even graphical one-dimensional and two-dimensional codes (which are in fact converted to text or binary encoding by recognition).
Existing reference encodings have two main problems. The first is that they hinder integration and exchange: different fields use different coding standards, which, given the development of the Internet and the Internet of Things, stands in the way of uniform references to objects across fields. The second is the efficiency of encoding: as the world becomes more interconnected, massive numbers of digital objects are online at all times. Although encodings such as UUID (16 bytes) and SHA-1 (20 bytes) are in theory sufficient to give them uniform reference encodings, transmitting, processing and storing such massive numbers of reference codes itself consumes a large amount of resources and causes unnecessary waste.
Summary of the Invention
A first aspect of the present invention provides a method for processing handwritten input characters, including:
acquiring, in a currently activated first target row/column, strokes input by a user and corresponding input information, wherein the input information includes the input position of each stroke in the first target row/column;
for each stroke, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with a character specified in the first target row/column.
The technical effect of the first aspect of the present invention is as follows. A method for processing handwritten input characters is provided in which characters are formed as strokes are input. The user does not need explicit or implicit "start single character input" or "end single character input" commands to separate characters, so there is no need to pause for a period of time or interact with the system after each character; writing is fluent and efficient. Moreover, the method determines the character to which a stroke belongs directly from the input position of the stroke, without recognizing standard characters, so the personalized information, writing style and characteristics of the user's handwriting are preserved.
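One simple way to picture position-based stroke assignment is sketched below. It is an illustration only: the horizontal-extent model of a character, the `gap` tolerance and all names are assumptions, and the summary above does not prescribe this particular rule.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    left: float   # left edge of the character in the row
    right: float  # right edge of the character in the row
    strokes: list = field(default_factory=list)


def assign_stroke(row, stroke_id, left, right, gap=5.0):
    """Attach a stroke to a nearby character in the row or start a new one."""
    for char in row:
        # The stroke joins a character if their extents, padded by gap, overlap.
        if left <= char.right + gap and right >= char.left - gap:
            char.strokes.append(stroke_id)
            char.left = min(char.left, left)
            char.right = max(char.right, right)
            return char
    char = Character(left, right, [stroke_id])
    row.append(char)
    return char


row = []
assign_stroke(row, "s1", 0, 20)
assign_stroke(row, "s2", 10, 30)  # overlaps the first character
assign_stroke(row, "s3", 60, 80)  # far to the right: a new character
```

No character recognition takes place here: the grouping depends only on where each stroke lands in the row, which is why the original handwriting can be kept as it was written.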
A second aspect of the present invention provides a data splitting method, including:
upon receiving a storage request carrying an identifier of data to be stored, obtaining, according to a preset metadata stripping rule, the metadata in the data object corresponding to the identifier, and stripping the obtained metadata from the data object;
dividing the data content into at least two data fragments according to a preset data content splitting rule.
The technical effect of the second aspect of the present invention is as follows. A data splitting method is provided that separates the metadata in the user's original data from the data content and divides the data content into multiple data fragments, which makes it harder to obtain the user's original data illegally and secures data storage more reliably.
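The two splitting steps can be sketched as follows. The dictionary representation of a data object, the `content` key, the set of metadata keys and the equal-size splitting rule are all assumptions standing in for the preset stripping and splitting rules.

```python
def split_data_object(data_object: dict, metadata_keys: set, fragments: int = 2):
    """Strip metadata from a data object and cut its content into fragments."""
    if fragments < 2:
        raise ValueError("the content must be split into at least two fragments")
    # Step 1: strip the metadata selected by the (hypothetical) stripping rule.
    metadata = {k: v for k, v in data_object.items() if k in metadata_keys}
    content = data_object["content"]
    # Step 2: split the content into pieces of roughly equal size.
    size = max(1, -(-len(content) // fragments))  # ceiling division
    pieces = [content[i:i + size] for i in range(0, len(content), size)]
    return metadata, pieces


obj = {"type": "note", "created": "2015-01-01", "content": b"handwritten page"}
metadata, pieces = split_data_object(obj, {"type", "created"}, fragments=3)
```

Each piece, and the metadata, could then be stored in a different storage body so that no single store holds enough to reconstruct the original data.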
A third aspect of the present invention provides a data merging method, including:
receiving a data object acquisition request carrying identification information, where the identification information includes location information, and the location information is used to locate the storage address of part of the data information of the data object;
obtaining the stored content corresponding to the location information, and obtaining the data information in other stored content according to the location information found in the stored content already obtained, until all data information of the data object has been obtained;
merging the obtained pieces of data information according to the preset merge specification contained in the obtained data information, to obtain the data object.
The technical effect of the third aspect of the present invention is as follows. It provides a data merging method that uses the location information contained in the identification information of the data object acquisition request to locate, step by step, the pieces of data information that were split and stored in separate storage bodies, and then merges those pieces according to the preset merge specification to obtain the user's original data object. Data scattered across storage bodies can therefore be retrieved efficiently and securely, and the user can reliably merge the scattered data back into the original data.
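The step-by-step retrieval described above can be sketched as a walk along a chain of storage addresses. The record layout (`data`, `index`, `next`, `merge_rule`) and the rule name `concat_by_index` are hypothetical; the source only states that each piece of stored content carries the location information of further pieces and that a merge specification is found among the retrieved information.

```python
def fetch_and_merge(stores: dict, first_address: str) -> bytes:
    """Follow the location information from piece to piece, then merge."""
    pieces = []
    merge_rule = None
    seen = set()
    address = first_address
    while address is not None:
        if address in seen:
            raise ValueError(f"location chain loops back to {address!r}")
        seen.add(address)
        record = stores[address]
        pieces.append((record["index"], record["data"]))
        # The merge rule may be stored with any one of the pieces.
        merge_rule = record.get("merge_rule", merge_rule)
        address = record.get("next")  # location information for the next piece
    if merge_rule == "concat_by_index":
        ordered = sorted(pieces, key=lambda piece: piece[0])
        return b"".join(data for _, data in ordered)
    raise ValueError(f"unknown merge rule: {merge_rule!r}")


stores = {
    "store-1/a": {"index": 1, "data": b"world", "next": "store-2/b"},
    "store-2/b": {"index": 0, "data": b"hello ", "next": None,
                  "merge_rule": "concat_by_index"},
}
merged = fetch_and_merge(stores, "store-1/a")
```

The request only needs to name the first address; everything else is discovered from the stored content itself.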
A fourth aspect of the present invention provides an encoding processing method, including:
obtaining, according to a received encoding processing request, the data object to be encoded and its metadata;
obtaining the object code of the data object according to an encoding repository, the data object, and its metadata.
The technical effect of the fourth aspect of the present invention is as follows. The data object to be encoded and its metadata are obtained according to the received encoding processing request, and the object code of the data object is obtained according to the encoding repository, the data object, and its metadata. Because the data object is encoded on the basis of its metadata and the encoding repository, flexible and diverse encoding schemes can be implemented.
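A minimal sketch of repository-driven encoding follows. The repository contents, the one-byte meta code, and the layout "meta code followed by instance code" are assumptions for illustration; the source only requires that the object code be derived from the encoding repository, the data object, and its metadata.

```python
# Hypothetical encoding repository: metadata "type" -> (meta code, encoding rule).
ENCODING_REPOSITORY = {
    "ascii-text": (b"\x01", lambda obj: obj.encode("ascii")),
    "uint16": (b"\x02", lambda obj: obj.to_bytes(2, "big")),
}


def encode_object(obj, metadata: dict) -> bytes:
    """Pick the encoding rule from the metadata and build the object code."""
    meta_code, to_instance_code = ENCODING_REPOSITORY[metadata["type"]]
    # Assumed layout: 1-byte meta code, then the instance code.
    return meta_code + to_instance_code(obj)


text_code = encode_object("hi", {"type": "ascii-text"})
number_code = encode_object(258, {"type": "uint16"})
```

Adding a new kind of object only requires a new repository entry, which is one way to read the flexibility the aspect claims.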
A fifth aspect of the present invention provides a decoding processing method, including:
receiving a decoding processing request, and obtaining, according to the decoding processing request, the object code to be decoded;
disassembling the object code to obtain a meta code, or the meta code and an instance code;
querying an encoding repository, and obtaining the corresponding metadata and encoding specification according to the meta code;
obtaining the data object corresponding to the object code according to the metadata and the encoding specification, or according to the metadata, the encoding specification, and the instance code.
The technical effect of the fifth aspect of the present invention is as follows. A decoding processing request is received and the object code to be decoded is obtained accordingly; the object code is disassembled into a meta code, or a meta code and an instance code; the encoding repository is queried to obtain the metadata and encoding specification corresponding to the meta code; and the data object corresponding to the object code is obtained from the metadata and the encoding specification, or from the metadata, the encoding specification, and the instance code. Because data objects are encoded using metadata and the encoding repository, the encoding scheme is flexible and saves space to some extent; correspondingly, decoding based on the disassembled meta code and the encoding repository effectively improves decoding efficiency.
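The decoding steps can be sketched in a few lines. The repository contents, the one-byte meta code, and the use of a plain function as the "encoding specification" are hypothetical choices for illustration, not the layout defined by the invention.

```python
# Hypothetical encoding repository: meta code -> (metadata, encoding specification).
# Here the "specification" is simply a function that rebuilds the object
# from the instance code.
ENCODING_REPOSITORY = {
    b"\x01": ({"type": "ascii-text"}, lambda inst: inst.decode("ascii")),
    b"\x02": ({"type": "uint16"}, lambda inst: int.from_bytes(inst, "big")),
    # A meta code that needs no instance code: the object is fully
    # determined by the repository entry itself.
    b"\x03": ({"type": "icon"}, lambda inst: "smiley"),
}


def decode_object(object_code: bytes):
    """Disassemble the object code, query the repository, rebuild the object."""
    # Assumed layout: 1-byte meta code, then an optional instance code.
    meta_code, instance_code = object_code[:1], object_code[1:]
    metadata, rebuild = ENCODING_REPOSITORY[meta_code]
    return metadata, rebuild(instance_code)


decoded_text = decode_object(b"\x01hi")
decoded_number = decode_object(b"\x02\x01\x02")
decoded_icon = decode_object(b"\x03")
```

The third entry illustrates the "meta code only" branch of the method, where no instance code is needed to recover the object.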
Brief description of the drawings
FIG. 1A is a flowchart of an embodiment of a method for processing handwritten input characters provided by the present invention;
FIG. 1B is a first schematic diagram of characters in the embodiment of the method for processing handwritten input characters provided by the present invention;
FIG. 1C is a second schematic diagram of characters in the embodiment of the method for processing handwritten input characters provided by the present invention;
FIG. 1D is a schematic diagram of two adjacent rows being activated simultaneously in the embodiment of the method for processing handwritten input characters provided by the present invention;
FIG. 1E is a schematic diagram of the state when a character is inserted in the embodiment of the method for processing handwritten input characters provided by the present invention;
FIG. 1F is a schematic diagram of the editing mode under a selection processing command in the embodiment of the method for processing handwritten input characters provided by the present invention;
FIG. 1G is a schematic diagram of a blank character in the embodiment of the method for processing handwritten input characters provided by the present invention;
FIG. 1H is a flowchart of text editing in the embodiment of the method for processing handwritten input characters provided by the present invention;
FIG. 1I is a flowchart of a handwritten program source code conversion method in the embodiment of the method for processing handwritten input characters provided by the present invention;
FIG. 1J is a detailed flowchart of "performing standard code conversion on B" in the handwritten program source code conversion method shown in FIG. 1I;
FIG. 1K is a schematic diagram of a handwritten program in the embodiment of the method for processing handwritten input characters provided by the present invention;
FIG. 1L is a schematic structural diagram of an embodiment of an apparatus for processing handwritten input characters provided by the present invention;
FIG. 2A is a flowchart of a data splitting method according to an exemplary embodiment;
FIG. 2B-1 is a flowchart of a data splitting method according to another exemplary embodiment;
FIG. 2B-2 is a system structure diagram for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-3 is a time-domain analysis diagram for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-4 is a speech-text encoding table for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-5 shows one presentation of the speech text for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-6 shows another presentation of the speech text for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-7 shows yet another presentation of the speech text for a data splitting method of the present invention in which the data object is audio data;
FIG. 2B-8 shows still another presentation of the speech text for a data splitting method of the present invention in which the data object is audio data;
FIG. 2C is a diagram of the position of a data splitting method of the present invention within the layers of a computer system;
FIG. 2D is a flowchart of a data merging method according to an exemplary embodiment;
FIG. 2E is a flowchart of a data merging method according to another exemplary embodiment;
FIG. 2F is a schematic structural diagram of a data splitting apparatus according to an exemplary embodiment;
FIG. 2G is a schematic structural diagram of a data splitting apparatus according to another exemplary embodiment;
FIG. 2H is a schematic structural diagram of a data merging apparatus according to an exemplary embodiment;
FIG. 2I is a schematic structural diagram of a data merging apparatus according to another exemplary embodiment;
FIG. 2J is an exemplary data splitting flowchart;
FIG. 2K is another exemplary data splitting flowchart;
FIG. 2L is an exemplary data merging flowchart;
FIG. 2M is a schematic diagram of an exemplary data splitting description language definition;
FIG. 2N is a visualized flowchart of an exemplary data splitting description language;
FIG. 2O is a diagram of the relationships among the concepts in the three inventive concepts of the present invention;
FIG. 3 is a schematic diagram of a meta-model in the prior art;
FIG. 4 is a schematic architecture diagram of the encoding system of the present invention;
FIG. 5C is a flowchart of Embodiment 1 of an encoding processing method provided by the present invention;
FIG. 5D is a flowchart of a specific implementation of step 102C in FIG. 5C;
FIG. 6 shows the relationship among the data object, the metadata, the encoding specification, and the encoding meta-object;
FIG. 7 is a schematic diagram of the core encoding meta-model;
FIG. 8 is a conceptual model of the object code, the meta code, and the instance code (that is, the object reference code with the meta code part removed), as well as the data object and the encoding meta-object;
FIG. 9 is an example of a meta code in this embodiment;
FIG. 10 is a similar example of encoding meta-objects related layer by layer (variable-length encoding with a 16-bit word length);
FIG. 11 is a schematic diagram of the meta-model of the corresponding encoding;
FIG. 12 is a schematic diagram of the conceptual model of the object code;
FIG. 13 is a flowchart of Embodiment 2 of an encoding processing method provided by the present invention;
FIG. 14 is a flowchart of Embodiment 3 of an encoding processing method provided by the present invention;
FIG. 15 is a schematic diagram of glyphs corresponding to non-standard character codes being stored in the encoding repository in the handwriting input system of this embodiment;
FIG. 16 is a core concept diagram of the encoding meta-model of an exemplary context-dependent object encoding system;
FIG. 17 is a schematic diagram of basic objects that can be applied to the basic encoding space;
FIG. 18 is a schematic diagram of the composition of the code in a 128 fixed-length encoding scheme;
FIG. 19 is a schematic diagram in which four binary bits serve as four space bits;
FIG. 20 is an example of an encoding scheme;
FIG. 21 is an example of the UTF-8 encoding scheme;
FIG. 22 is a schematic diagram of an object code composed of a meta code and an instance code;
FIG. 23 is a diagram of encoding details;
FIG. 24 is a diagram of the rendering result;
FIG. 25 is a schematic diagram of the code points of OTF-8 other than those of UTF-8;
FIG. 26 is a schematic diagram of codes to be defined;
FIG. 27 is a flowchart of Embodiment 4 of an encoding processing method provided by the present invention;
FIG. 28 is an updated diagram of the corresponding encoding meta-model;
FIG. 29 is a schematic diagram of encoding combination;
FIG. 30 is a flowchart of Embodiment 5 of an encoding processing method provided by the present invention;
FIG. 31 shows a handwritten input program;
FIG. 32 is a flowchart of Embodiment 1 of a decoding processing method provided by the present invention;
FIG. 33 is a flowchart of Embodiment 2 of a decoding processing method provided by the present invention;
FIG. 34 is a flowchart of Embodiment 3 of a decoding processing method provided by the present invention;
FIG. 35 is a flowchart of Embodiment 4 of a decoding processing method provided by the present invention;
FIG. 36 shows handwritten input content;
FIG. 37 is a schematic diagram that visualizes the length of the character spacing;
FIG. 38 is a schematic diagram of the decoding process;
FIG. 39 is an example of the display of mixed-encoded content;
FIG. 40 is a schematic diagram of the output content;
FIG. 41 is a schematic diagram of handwritten strokes falling on top of the character output;
FIG. 42 is a schematic diagram after a standard smiley-face icon is added;
FIG. 43 is a schematic diagram of an online Go game;
FIG. 44 is a schematic structural diagram of a first embodiment of an encoding processing system of the present invention;
FIG. 45 is a schematic structural diagram of a first embodiment of a decoding processing system of the present invention;
FIG. 46 is a schematic architecture diagram of a word processing system based mainly on the object encoding system;
FIG. 47 is a schematic architecture diagram of in-application deployment;
FIG. 48 is a schematic architecture diagram of terminal deployment;
FIG. 49 is a schematic architecture diagram of mobile external device deployment;
FIG. 50 is a schematic architecture diagram of applications sharing the same encoding repository;
FIG. 51 is an example in which the network deployment of the encoding repository is a private cloud deployment or an internal server deployment;
FIG. 52 is a schematic architecture diagram of peer-to-peer deployment;
FIG. 53 is a schematic architecture diagram of hybrid deployment;
FIG. 54 is an architecture diagram of extending the operating system to allow legacy applications to support object encoding;
FIG. 55 is a diagram of the interaction principle between the object encoding system of the present invention and an application system.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the embodiments. It should be noted that similar or identical elements use the same reference numerals in the drawings and the description.
The background of the invention is introduced first. With the development of the Internet and mobile computing, cloud storage systems and related applications have emerged. A cloud storage system stores the user's data on servers in the cloud. Users can then access that data at any time from different terminal devices, without migrating data between terminal systems. Users also no longer need to keep upgrading storage devices, because cloud storage services scale to meet varied storage needs. Traditional data maintenance tasks such as backup and encryption move to the cloud storage servers as well, where they are usually handled more professionally and efficiently. In addition, because cloud storage is reliable and always online, data usage patterns that differ from traditional applications have appeared, such as data sharing and network collaboration. All of this greatly improves the efficiency of data transfer between people and between application systems.
Applications based on cloud storage systems are varied, and the most important terminal application is the desktop agent. A desktop agent is a client for file-system-based cloud storage. It synchronizes a specific folder on the terminal with the cloud storage: files saved to the folder are uploaded to the server automatically by the agent, and other files uploaded to the server are downloaded automatically to the corresponding folder through the agent. In this way the same user's files are synchronized automatically across terminals, and the user can work with the data in the folder seamlessly across platforms in the traditional way. A desktop agent can also synchronize shared folders to the terminals of different users, which makes data sharing and cooperation convenient. Dropbox is a typical desktop agent. Microsoft's OneDrive (formerly SkyDrive), Google's Google Drive, Baidu Netdisk, and Kingsoft KuaiPan also provide desktop agents for cloud storage. Beyond desktop agents, there is a wide range of cloud-based, cross-device terminal applications.
Cloud storage systems bring convenient and efficient data access and sharing. Storing data in the cloud, however, raises an unavoidable concern about security and privacy: the confidentiality of core data depends entirely on the cloud storage system. For this reason many organizations and individuals do not dare to place data, or at least critical data, in cloud storage. There are two main risks. First, data in cloud storage is protected by user authentication; once a user's identity is stolen, all of that user's cloud data is exposed to the thief. Second, the security of cloud storage rests on complete trust in the cloud storage provider, and that foundation is not solid. On the one hand, existing computer security technology is weak and security vulnerabilities keep appearing in all kinds of systems, so malicious attackers can readily attack online services. Major data leaks have occurred from time to time in recent years, and cloud storage providers have been among the affected parties. For example, Evernote's system was intruded upon in February 2013; a large amount of Tencent QQ user data was leaked in November 2013; and the data of 8 million Xiaomi users was leaked in May 2014. On the other hand, providers themselves may misuse or abuse data and thereby threaten users, as the exposure of the US PRISM program demonstrated.
The present invention mainly relates to data processing methods, systems, and applications, and addresses the above problems effectively through the following aspects. It involves three innovations in particular: (1) a novel handwriting input method and system, in particular a method for splitting handwritten input characters; (2) an object-based open encoding and decoding solution that can encode and decode any data object with any free and open encoding scheme; and (3) an object-based data splitting/merging method, in which the metadata and/or encoded data of a data object are split or stripped from the corresponding data content to protect the security of the data content. These technical solutions can be implemented separately or in combination, and can be combined, individually or together, with applications in other technical fields. The invention has broad application prospects and great application value. The specific solutions are as follows:
The present invention provides a data-object-based encoding method, the method including:
a) extracting metadata from the data object, and/or parsing the data object and creating or generating corresponding metadata for it;
b) selecting or creating an encoding specification for the data object based on at least part of the metadata, so that the data object can be described in encoded form;
c) generating or returning an object code for the data object according to the encoding specification.
Further, on the basis of the above data-object-based encoding method, the step of generating the object code in step c) includes: generating a meta code and/or an instance code for the data object according to a predetermined rule, and generating the object code from the meta code and/or the instance code.
Further, on the basis of the above data-object-based encoding method, a step of compressing and/or encrypting the data object is included before step a), and a step of encrypting the generated object code is included after step c).
Further, on the basis of the above data-object-based encoding method, the meta code includes one of the following codes, or a combination and/or nesting of two or more of them: space code, context code, type code.
Further, on the basis of the above data-object-based encoding method, the method includes, before step a), a data splitting step in which a large data object is split into small data blocks (also called data segments) according to a predetermined rule; steps a) to c) are performed on each data block during or after the splitting, until all data blocks have been encoded.
The present invention also provides a data-object-based decoding method, the method including:
a) obtaining the object code;
b) disassembling the object code to obtain the meta code and/or the instance code;
c) obtaining the corresponding encoding metadata and/or encoding specification according to the disassembled meta code;
d) recovering the original data object according to the encoding metadata and/or encoding specification, and the instance code.
Further, on the basis of the above data-object-based decoding method, the step of disassembling the object code in step b) includes: disassembling the object code into the meta code and/or the instance code according to the predetermined rule used at encoding time.
Further, on the basis of the above data-object-based decoding method, an authorization verification step for obtaining the object code and/or the predetermined rule used at encoding time is included before step a) and/or before step b).
Further, on the basis of the above data-object-based decoding method, if compression and/or encryption was used during encoding, the corresponding decompression and/or decryption is required during decoding.
The present invention also provides a method for splitting handwritten input characters, the method including:
a) receiving the user's input constrained to the currently activated target row/column, and recording at least the input position of each stroke in the current row/column;
b) determining the degree of correlation or association between each stroke and the other strokes and/or characters by comparing the stroke one by one with all or some of the strokes and/or characters in the current row/column; if a stroke is not associated with any character or stroke, creating a new character for it; otherwise, attributing the stroke to the one or more characters with the highest correlation or strongest association.
Further, on the basis of the above method for splitting handwritten input characters, step c) is performed in one of the following situations: 1) while the current stroke is being written; 2) after input of the current stroke is complete (that is, after the pen is lifted); or 3) after input of the current row is complete.
Further, on the basis of the above method for splitting handwritten input characters, after input of the current stroke is complete, the current stroke is compared one by one only with the strokes and/or characters within a predetermined range.
Further, on the basis of the above method for splitting handwritten input characters, step c) includes:
determining whether the currently input stroke is, in the current input state, spatially the first stroke or spatially the last stroke in the row/column;
if the currently input stroke is spatially the first stroke in the row/column and is not associated with any other character (or stroke) already input in the current row/column, or if it is spatially the last stroke in the row/column and is not associated with any other character (or stroke) already input in the current row/column, creating a new character for the stroke; if the current stroke is neither spatially the first nor spatially the last stroke in the row/column, comparing the spacing between the current stroke and each character already input, and attributing the currently input stroke to the associated one or more characters (or strokes).
Further, on the basis of the above method for splitting handwritten input characters, in step c) a threshold (MIN_GAP) for the minimum spacing between a stroke and a character, or between two strokes, is set in advance, and the spacing between each stroke and the other characters or strokes already input is compared with the threshold to determine the association between the stroke and those characters or strokes.
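The threshold comparison described above can be sketched for a horizontal row as follows. The representation of a stroke as an `(x_min, x_max)` extent, the value of `MIN_GAP`, the reading that a spacing below the threshold means "associated", and the choice to attach a stroke to the single nearest character are all assumptions for illustration; the source also allows a stroke to be attributed to more than one character.

```python
MIN_GAP = 10.0  # assumed threshold, in the same units as the x coordinates


def gap(a, b):
    """Horizontal distance between two (x_min, x_max) extents; 0 if they overlap."""
    return max(a[0] - b[1], b[0] - a[1], 0.0)


def assign_stroke(characters, stroke):
    """Attach `stroke` to the nearest character within MIN_GAP, else start a new one.

    `characters` is a list of characters; each character is a list of
    (x_min, x_max) stroke extents in the currently active row.
    """
    best, best_gap = None, None
    for character in characters:
        extent = (min(s[0] for s in character), max(s[1] for s in character))
        g = gap(stroke, extent)
        if g < MIN_GAP and (best_gap is None or g < best_gap):
            best, best_gap = character, g
    if best is None:
        characters.append([stroke])  # not associated with anything: new character
    else:
        best.append(stroke)
    return characters


row = []
for extent in [(0, 20), (5, 25), (40, 60), (62, 70), (22, 24)]:
    assign_stroke(row, extent)
```

In this run the last stroke is written after the second character has been started, yet it still joins the first character because only position, not writing order, decides the grouping.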
Further, on the basis of the above method for splitting handwritten input characters, step b) also includes: recording the input time and input position information of each stroke as each input stroke is received.
Further, on the basis of the above method for splitting handwritten input characters, the input time includes the pen-down time and the pen-up time, and the input position includes at least: the pen-down position, the pen-up position, and the coordinates of each point in the stroke's trace.
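One possible record for the per-stroke information listed above is sketched below. The class name, field names, and the derived `x_extent` helper are hypothetical; the source only lists which times and positions are recorded.

```python
from dataclasses import dataclass


@dataclass
class Stroke:
    pen_down_time: float   # seconds, when the pen touched the surface
    pen_up_time: float     # seconds, when the pen was lifted
    points: list           # ordered (x, y) coordinates of the trace

    @property
    def pen_down_position(self):
        return self.points[0]

    @property
    def pen_up_position(self):
        return self.points[-1]

    @property
    def x_extent(self):
        """Horizontal span of the trace, useful for spacing comparisons."""
        xs = [x for x, _ in self.points]
        return (min(xs), max(xs))


stroke = Stroke(pen_down_time=0.0, pen_up_time=0.35,
                points=[(3, 10), (8, 4), (5, 12)])
```

Because the pen-down and pen-up positions are simply the first and last trace points, storing the full trace is enough to recover all the position information the method names.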
The present invention also provides an object-based data object splitting method, the method including:
a) obtaining the metadata of the data object;
b) selecting or creating a corresponding data splitting/stripping specification for the data object based on at least part of the metadata;
c) according to the data splitting/stripping specification, splitting at least part of the data object into data segments, and/or stripping out at least part of the data object.
Further, on the basis of the above object-based data object splitting method, the data splitting/stripping specification includes at least one of the following, or a combination of two or more: 1) a data content splitting specification, which records the method and process for splitting the data content; 2) a metadata stripping specification, which records the method and process by which the corresponding metadata is stripped from the data object; 3) if codes are generated during data splitting, an encoding separation specification, which records the encoding rules and encoding process between each code and the encoded object.
Further, on the basis of the above object-based data object splitting method, the method also includes, after step c), a step d): recombining the split data segments.
Further, on the basis of the above object-based data object splitting method, at least part of the metadata constitutes the split metadata.
The present invention also provides an object-based method for merging a data object, the method comprising:
a) obtaining the data fragments produced by splitting, together with the splitting/stripping protocol or the corresponding merge protocol;
b) obtaining the split metadata of the data object from the obtained data fragments and/or the splitting/stripping protocol or merge protocol;
c) combining the data fragments on the basis of the splitting/stripping protocol or the merge protocol, together with the split metadata, thereby recovering the original data.
Further, building on the above object-based data object merging scheme, after the splitting of the data object is complete, the method further includes a storing step: storing the split/stripped data fragments separately in different storage units or under different secure channels.
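The splitting, storing, and merging steps above can be illustrated with a minimal Python sketch. The fixed-size chunking rule, the `SplitProtocol` class, the layout of the split metadata, and the two dictionaries standing in for separate storage units are all assumptions made for illustration; the text does not prescribe any of them.

```python
from dataclasses import dataclass


@dataclass
class SplitProtocol:
    """Hypothetical splitting protocol: cut the content into fixed-size chunks."""
    chunk_size: int


def split_object(data: bytes, protocol: SplitProtocol):
    """Split data into fragments and return (fragments, split_metadata)."""
    fragments = [data[i:i + protocol.chunk_size]
                 for i in range(0, len(data), protocol.chunk_size)]
    # Assumed split metadata: fragment order plus the total length.
    split_metadata = {"order": list(range(len(fragments))), "length": len(data)}
    return fragments, split_metadata


def merge_object(stored: dict, split_metadata: dict) -> bytes:
    """Recombine fragments (keyed by index) in the recorded order."""
    merged = b"".join(stored[i] for i in split_metadata["order"])
    if len(merged) != split_metadata["length"]:
        raise ValueError("fragments do not match the split metadata")
    return merged


protocol = SplitProtocol(chunk_size=4)
fragments, meta = split_object(b"handwritten data", protocol)

# Two dictionaries stand in for different storage units or secure channels.
store_a = {i: f for i, f in enumerate(fragments) if i % 2 == 0}
store_b = {i: f for i, f in enumerate(fragments) if i % 2 == 1}

restored = merge_object({**store_a, **store_b}, meta)
```

Neither store alone holds the complete object in this sketch; recovery needs the fragments from both stores plus the split metadata.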
The handwriting input method and system are described in detail below.
FIG. 1A is a flowchart of an embodiment of a method for processing handwritten input characters provided by the present invention. Compared with existing handwriting input systems, the method of this embodiment comes closer to people's natural writing habits while fully and faithfully preserving the writer's style and characteristics. As shown in FIG. 1A, the method of this embodiment may include:
Step 101A: in the currently activated first target row/column, capture the strokes input by the user and the corresponding input information, where the input information includes the input position of each stroke in the first target row/column.
The method of this embodiment may be executed by a handwriting input device, such as a conventional touch screen, a handwriting screen, or another suitable handwriting device, or by a handwriting system adapted directly to this embodiment. Preferably, this embodiment uses a touch-screen handwriting input device, that is, an input device on which information can be entered directly on the screen by hand or with a dedicated or non-dedicated writing tool.
Specifically, this embodiment is applicable to any writing mode; the writing mode may be set by the user or left at its default. The writing modes of this embodiment include, but are not limited to: writing by row (the common horizontal layout, written left to right and top to bottom); writing by column (the vertical layout, written top to bottom and right to left); and other user-defined writing formats, for example a right-to-left format for Arabic writers, or a top-to-bottom, left-to-right format.
Normally, the user handwrites each character in his or her own stroke order. This embodiment records each of the user's strokes and its input position in chronological order. For example, when the user begins to write the character "我", the first stroke written is "丿" (a left-falling stroke); the system automatically records this stroke and its input position on the panel. The pixel position on the handwriting input screen may serve as the input position, or another positioning algorithm or position-determining method may be used, provided that the input position of each stroke can be determined uniquely.
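The chronological stroke record described above might be modeled as follows. This is a minimal sketch, assuming pixel coordinates as the input position; the `Stroke` and `StrokeRecorder` names and their fields are hypothetical and not taken from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # assumed: pixel coordinates (x, y) on the input screen


@dataclass
class Stroke:
    pen_down_time: float          # moment the pen touches the screen
    pen_up_time: float            # moment the pen is lifted
    points: List[Point] = field(default_factory=list)  # every sampled point

    @property
    def pen_down_position(self) -> Point:
        return self.points[0]

    @property
    def pen_up_position(self) -> Point:
        return self.points[-1]


class StrokeRecorder:
    """Keeps strokes ordered by pen-down time."""

    def __init__(self) -> None:
        self.strokes: List[Stroke] = []

    def record(self, stroke: Stroke) -> None:
        self.strokes.append(stroke)
        # Strokes normally arrive in order; sorting is only a safeguard.
        self.strokes.sort(key=lambda s: s.pen_down_time)


recorder = StrokeRecorder()
recorder.record(Stroke(2.0, 2.4, [(30, 10), (30, 40)]))
recorder.record(Stroke(1.0, 1.3, [(20, 10), (15, 20), (10, 30)]))
first = recorder.strokes[0]
```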
Handwriting input relies on the concept of a target row/column, which constrains where the user may write: a row/column accepts input only after it has been activated and has thereby become the target row/column. Until the target row/column is changed, every stroke the user inputs belongs to it. The user may be prohibited from writing outside the target row/column, or may be allowed to write anywhere. In the latter case, a stroke that crosses the boundary of the target row/column can be handled in one of several ways. First, where low precision suffices, the part of the stroke that exceeds the boundary by more than a predetermined threshold can be discarded. Second, where the original input must be restored with high precision, the information for the out-of-bounds part of the stroke, such as time and position, can be recorded in full so that the user's original input can be restored completely.
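The two ways of handling an out-of-bounds stroke can be sketched as below. The function name, the horizontal-row geometry (a row bounded by `row_top` and `row_bottom`), and the point format are assumptions for illustration.

```python
def handle_overflow(points, row_top, row_bottom, threshold, high_precision):
    """Return the stroke points to keep for a horizontal target row.

    points: list of (x, y) pixel positions sampled along one stroke.
    high_precision=True keeps every point so the input can be restored exactly;
    otherwise points more than `threshold` pixels outside the row are dropped.
    """
    if high_precision:
        return list(points)
    return [(x, y) for (x, y) in points
            if row_top - threshold <= y <= row_bottom + threshold]


stroke = [(0, 95), (1, 105), (2, 130)]
clipped = handle_overflow(stroke, row_top=0, row_bottom=100,
                          threshold=10, high_precision=False)
complete = handle_overflow(stroke, row_top=0, row_bottom=100,
                           threshold=10, high_precision=True)
```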
The method of this embodiment can use the row (horizontal layout) or the column (vertical layout) as the unit that constrains input: the current input is confined to one specific row or column, and no stroke or character crosses rows or columns. Under this constraint, the input forms a character stream in input order. Compared with the prior art, the method is closer to people's natural writing habits and makes the writing experience more natural and fluent.
While the user is writing, the extent of the target row/column can be shown on the handwriting input screen, for example by highlighting the target row/column, or by displaying shaded or ruled row/column patterns in the style of composition-grid or letter paper, to indicate where the currently writable target row/column is located.
Preferably, the currently activated first target row/column is selected or created before step 101A. This can be done in many ways; this embodiment describes the following two.
Target row/column selection mode one: first determine the position range of every row/column, then let the user select the target row/column. Determining the position range of every row/column may specifically include:
obtaining the size information of the handwriting input screen and the row height/column width information;
dividing the handwriting input screen into at least one row/column according to the size information and the row height/column width information, and determining the position range of each row/column;
where the row height/column width information is a default value or is determined by user input, and the position range of each row/column means the top-edge and bottom-edge positions of each row relative to the handwriting input screen, or the left-side and right-side positions of each column relative to the handwriting input screen.
Through these steps, the handwriting input screen is divided into multiple rows/columns and the position range of each is determined; during subsequent input, the user enters strokes on the basis of the rows/columns so divided.
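As a rough illustration of the division step, the following Python sketch splits a screen into horizontal rows. The function name is hypothetical, and discarding a partial row at the bottom of the screen is an assumption; the text does not say how leftover space is handled.

```python
def divide_into_rows(screen_height: int, row_height: int):
    """Return the (top, bottom) position range of every row on the screen."""
    if row_height <= 0:
        raise ValueError("row_height must be positive")
    # At least one row is always produced; leftover pixels are ignored (assumed).
    count = max(1, screen_height // row_height)
    return [(i * row_height, (i + 1) * row_height) for i in range(count)]


rows = divide_into_rows(screen_height=480, row_height=100)
```

Columns can be handled the same way by substituting the screen width and column width, which yields (left, right) ranges.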
Once the position range of every row/column has been determined, the user can select the target row/column. This selection may specifically include:
receiving a target row/column selection message input by the user, the message including an identifier of the target row/column in which the user intends to write;
according to the target row/column selection message, taking the row/column corresponding to that identifier as the currently activated first target row/column.
The identifier may be any coordinate point tapped by the user, in which case the row/column containing the point is the one selected; or it may be a row/column number, for example row 10 or column 10, in which case the row/column with that number becomes the currently activated first target row/column.
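Both kinds of identifier can be resolved against the row ranges with a few lines of Python. This is a sketch for horizontal rows only; the function names, the 1-based row numbering, and the half-open ranges are assumptions.

```python
def row_from_point(rows, y):
    """Return the index of the row containing vertical position y, or None."""
    for index, (top, bottom) in enumerate(rows):
        if top <= y < bottom:
            return index
    return None


def row_from_number(rows, number):
    """Return the index for a 1-based row number such as 'row 10', or None."""
    if 1 <= number <= len(rows):
        return number - 1
    return None


rows = [(0, 100), (100, 200), (200, 300)]
tapped = row_from_point(rows, y=150)      # user taps a point at y = 150
numbered = row_from_number(rows, number=3)
```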
When another input device is attached, the user can select the target row/column through that device. For example, with an external keyboard, the user can select the target row/column from the keyboard; with an external mouse, the user can select different target rows/columns by moving the mouse; with an external stylus, the target row/column can be selected by where the stylus points before it touches the handwriting input screen.
Target row/column selection mode two: activate a target row/column on the basis of characters the user has already input. This may specifically include:
capturing at least one character input by the user;
taking the row/column containing the at least one character as the currently activated first target row/column;
setting the position range of the currently activated first target row/column according to the character boundary of the at least one character;
where the position range means the top-edge and bottom-edge positions of the first target row relative to the handwriting input screen, or the left-side and right-side positions of the first target column relative to the handwriting input screen.
Because writing habits differ, a suitable threshold can be set for the width of the first target row/column to meet the needs of particular users. In horizontal writing, for example, a writer's natural line may habitually slant upward or downward to the right; in that case, the boundary of the at least one character already input can be extended upward or downward by a suitable distance and used as the boundary of the first target row/column.
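Selection mode two can be sketched as follows for a horizontal row. The bounding-box format `(left, top, right, bottom)` and the single symmetric `margin` are assumptions; the text only says the boundary may be extended by a suitable distance.

```python
def row_range_from_characters(char_boxes, margin):
    """Derive a row's (top, bottom) range from characters already written.

    char_boxes: bounding boxes as (left, top, right, bottom) in pixels.
    margin: extra distance added above and below to tolerate slanted lines.
    """
    top = min(box[1] for box in char_boxes)
    bottom = max(box[3] for box in char_boxes)
    return (top - margin, bottom + margin)


boxes = [(10, 42, 50, 88), (60, 36, 100, 90)]
row_range = row_range_from_characters(boxes, margin=8)
```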
Of the two ways of selecting a target row/column described above, mode one is simple and quick, while mode two is better suited to personalized input and to handwritten text input within graphics systems.
Step 102A: for each stroke, create a new character for the stroke or determine the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with designated characters in the first target row/column.
This embodiment segments text differently from the prior art: the character to which the current stroke belongs is determined from the relationship between each input stroke and the other characters or strokes. The method therefore spares the user the tedious interaction of entering text one character at a time and greatly simplifies input.
Here, a character means an independent character object with a two-dimensional shape. It includes standard characters of ideographic scripts, such as a single Chinese, Japanese, Korean, Arabic, Tibetan, or Burmese character, or a part of one (for example a radical); standard words of phonetic scripts, such as letters or words in English, German, French, Russian, or Spanish; computer characters based on traditional standard codes, such as ASCII characters and Unicode characters or strings; combined characters or strings that mix handwritten and standard characters; and even any graphic or image input by the user, such as a heart-shaped pattern, a photograph, or a doodle, or any other form of written expression.
FIG. 1B is a first schematic diagram of characters in an embodiment of the method for processing handwritten input characters provided by the present invention, and FIG. 1C is a second such schematic diagram. FIG. 1B shows five characters. The first, third, and fourth are "stroke characters", that is, handwritten characters input by the user; the second and fifth are "graphic characters", that is, arbitrary graphic or image information input by the user. This embodiment also accepts other kinds of characters, such as "standard characters" (any character in the existing standard fonts) and "combined characters" (characters formed by mixing characters of different kinds). A combined character may also directly include handwritten strokes: when handwritten strokes are written directly over a character that is not a stroke character, a combined character is formed. As shown in FIG. 1C, the two characters "饕餮" are combined characters made up of standard characters and stroke characters.
In this embodiment, the characters input by the user need not be recognized; it is only necessary to determine which character each stroke belongs to and thereby separate the characters. When determining where a stroke belongs, the strokes input in the first target row/column can be divided automatically according to the inherent conventions of the selected language (for example, its writing or typesetting conventions).
Determining the character to which a stroke belongs is the process of splitting the input into characters. The splitting operation (that is, character formation) can be performed as the input proceeds: as the user writes naturally, the character to which each stroke already input belongs is determined, so that characters are formed while writing continues.
The trigger condition for character splitting can be one of the following: (1) from the moment the pen touches down, the stroke is evaluated in real time, sample point by sample point, to determine the character to which it should belong; (2) the character to which a stroke belongs is determined after the stroke is complete (that is, after pen-up); (3) after a row is complete, or when a relatively long pause in input is detected, all strokes input so far are evaluated one by one, and the strokes that are most closely related are assigned to the same character.
Each of the three methods has advantages and disadvantages. Trigger condition (1) requires the most computation; the other two require comparable amounts, less than the first. Moreover, under trigger condition (1) the real-time evaluation makes the result change dynamically: the points input so far may indicate that the current stroke belongs to the previous character, but as the stroke continues it may turn out that the stroke should form a character of its own or belong to the next character. The final assignment of each stroke must then be updated so that the same stroke is not assigned to different characters, which adds further computation. In most cases the user is not concerned with character formation, which runs in the background; even so, compared with the other two methods, processing under trigger condition (1) gives a more real-time interactive experience.
For each stroke, if the stroke is the first stroke in the first target row/column, a new character can be created for it; if it is not the first stroke, a new character is created for it, or the character to which it belongs is determined, according to the input position of the stroke in the first target row/column and the other characters in that row/column.
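The assignment step can be sketched in Python under strong simplifying assumptions. The text does not state the criterion used to relate a stroke to existing characters; the rule below (horizontal writing, with a stroke joining the nearest character when the horizontal gap is at most `max_gap` pixels) is purely illustrative, as are the function name and the dictionary layout.

```python
def assign_stroke(characters, stroke, max_gap):
    """Attach a stroke to an existing character or start a new one.

    characters: list of dicts with 'left', 'right', and 'strokes' keys.
    stroke: list of (x, y) points. Returns the index of the owning character.
    """
    left = min(x for x, _ in stroke)
    right = max(x for x, _ in stroke)
    best_index, best_gap = None, None
    for index, ch in enumerate(characters):
        # Horizontal distance between the extents; 0 when they overlap.
        gap = max(left - ch["right"], ch["left"] - right, 0)
        if best_gap is None or gap < best_gap:
            best_index, best_gap = index, gap
    if best_index is not None and best_gap <= max_gap:
        ch = characters[best_index]
        ch["left"] = min(ch["left"], left)
        ch["right"] = max(ch["right"], right)
        ch["strokes"].append(stroke)
        return best_index
    # First stroke in the row, or too far from every existing character.
    characters.append({"left": left, "right": right, "strokes": [stroke]})
    return len(characters) - 1


characters = []
a = assign_stroke(characters, [(10, 5), (20, 30)], max_gap=10)
b = assign_stroke(characters, [(18, 5), (25, 30)], max_gap=10)
c = assign_stroke(characters, [(60, 5), (70, 30)], max_gap=10)
```

Here the first two strokes overlap horizontally and form one character, while the third lies 35 pixels away and starts a new one. No recognition of the written content is involved.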
In the method of this embodiment, the strokes input by the user and the corresponding input information are captured in the currently activated first target row/column, and for each stroke a new character is created, or the character to which the stroke belongs is determined, according to the input position of the stroke in the first target row/column, or according to that position together with designated characters in the first target row/column. Characters are thus formed as the user writes. The user does not need explicit or implicit "start single-character input" or "end single-character input" commands to separate characters, and so does not have to pause after each character or interact with the system while writing; writing is fluent and efficient. In addition, because the method determines the character to which a stroke belongs directly from the stroke's input position, without recognizing standard characters, it preserves the personal information, style, and characteristics of the user's handwriting.
Because this embodiment makes handwriting input more natural and fluent, it makes electronic input devices such as computers, mobile phones, tablets, laptops, notebooks, and iPads easier to use for elderly people and children who are unfamiliar with them.
Unlike the traditional keyboard/character-stream model, the method of this embodiment adopts a pen/paper model. The user can activate any row on the page directly and write in it. The system can treat blank rows before the handwritten content, and between pieces of handwritten content, as empty paragraphs. For the user, there need only be a command for changing the input row; the concepts of carriage return and line feed do not arise.
When the user reaches the end of a row, the target row/column may need to move to the row/column following the first target row/column so that the user can continue there. This is the line-break function of this embodiment. It can be implemented in several ways; this embodiment describes the following four.
Line-break mode one:
receiving a line/column break command input by the user;
according to the line/column break command, taking the second target row/column as the currently activated target row/column, the second target row/column being the row/column following the first target row/column.
In this mode, the position of the line break is determined through a predefined interaction. For example, it can be agreed in advance that whenever the user has written a row naturally and considers its end reached, the end of the row is confirmed by two or three consecutive taps on a corresponding position or button at the right edge of the input box or screen. Alternatively, a command button can be placed at the end of the first target row/column; when the user taps it, the next row/column is activated automatically for editing.
Line-break mode two:
determining whether the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is less than a first preset threshold;
if the distance is less than the first preset threshold, taking the second target row/column as the currently activated target row/column, so that the strokes input by the user are captured in the second target row/column;
where the second target row/column is the row/column following the first target row/column.
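The threshold test of line-break mode two might look like this in Python. The sketch assumes horizontal rows indexed from 0 and a one-dimensional distance along the writing direction; the function and parameter names are hypothetical.

```python
def next_active_row(current_row, stroke_x, row_end_x, threshold, row_count):
    """Return the row to activate after a stroke at horizontal position stroke_x.

    The next row is activated when the stroke lies within `threshold` pixels
    of the end of the current row, provided a next row exists.
    """
    near_end = abs(row_end_x - stroke_x) < threshold
    if near_end and current_row + 1 < row_count:
        return current_row + 1
    return current_row


moved = next_active_row(0, stroke_x=470, row_end_x=480, threshold=20, row_count=4)
stayed = next_active_row(0, stroke_x=100, row_end_x=480, threshold=20, row_count=4)
```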
Line-break mode three:
determining whether the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is less than a first preset threshold;
if the distance is less than the first preset threshold, taking the first target row/column and the second target row/column together as the currently activated target rows/columns;
capturing at least one subsequent stroke input by the user in the first target row/column and/or the second target row/column, and, once the first stroke has been captured in the second target row/column, taking only the second target row/column as the currently activated target row/column;
where the second target row/column is the row/column following the first target row/column.
In this mode, continuous input requires resolving which of the adjacent rows a stroke belongs to. When two or more adjacent rows are active at the same time, a stroke may span several rows/columns, and a definite rule is needed to decide which row/column owns it: the row/column containing the stroke's start point, the row/column containing its end point, or the row/column containing the largest share of the stroke. The conflict can also be reduced by increasing the spacing between adjacent rows/columns.
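The three ownership rules can be compared with a short Python sketch for horizontal rows. Measuring a row's share of the stroke by counting sample points is an assumption, as are the function name and the `rule` strings.

```python
from collections import Counter


def owning_row(points, rows, rule="majority"):
    """Decide which row owns a stroke that may span several rows.

    points: (x, y) samples of the stroke; rows: list of (top, bottom) ranges.
    rule: 'start', 'end', or 'majority' (row holding the most sample points).
    """
    def row_of(y):
        for index, (top, bottom) in enumerate(rows):
            if top <= y < bottom:
                return index
        return None

    if rule == "start":
        return row_of(points[0][1])
    if rule == "end":
        return row_of(points[-1][1])
    counts = Counter(r for r in (row_of(y) for _, y in points) if r is not None)
    return counts.most_common(1)[0][0] if counts else None


rows = [(0, 100), (100, 200)]
stroke = [(0, 90), (1, 110), (2, 120), (3, 130)]
by_start = owning_row(stroke, rows, rule="start")
by_end = owning_row(stroke, rows, rule="end")
by_majority = owning_row(stroke, rows, rule="majority")
```

The example stroke starts in the first row but lies mostly in the second, so the start rule and the majority rule disagree; an implementation has to pick one rule and apply it consistently.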
Preferably, when the first target row/column and the second target row/column are both taken as currently activated target rows/columns, each of them is activated only over part of its area;
the start position of the active area of the first target row/column lies between the end position of the active area of the second target row/column and the end position of the active area of the first target row/column.
Line-break mode four:
The user decides whether to break the line by fully controlling the position, within the paragraph, of the handwriting panel that represents the active area. The handwriting panel itself breaks lines automatically within a paragraph. When the user moves the handwriting panel along the typesetting direction or against it through an interaction (such as a keyboard command or a touch-screen gesture), the system moves part or all of the handwriting panel to the next row or the previous row, depending on the panel's position in the paragraph and its relationship to the current row. The content shown in the handwriting panel changes with its position in the paragraph. Once the handwriting panel has been moved to the last row of the paragraph, triggering the panel's automatic line break again in effect adds a line break to that paragraph.
FIG. 1D is a schematic diagram of two adjacent rows being active simultaneously in an embodiment of the method for processing handwritten input characters provided by the present invention. The boxed area in the figure is the active area. As shown in FIG. 1D, the active area is a logically continuous region within two adjacent rows/columns, and the user can write only inside it. Because the active areas of the two adjacent rows/columns overlap, strokes that cross rows/columns are avoided. The active area can also be switched to the full extent of a row/column (the first or the second target row/column) in response to user interaction.
Simultaneous activation of two adjacent rows/columns is subject to one constraint: the first row/column of a paragraph has no forward wrap, and the last row/column has no backward wrap. This is explained in detail below.
Within one paragraph, if the currently activated target row is not the first row of the paragraph, then when the distance between a stroke's input position in the target row and the start of that row is less than a threshold, the relevant areas of the target row and the preceding row can be activated together; if the currently activated target row is not the last row of the paragraph, then when the distance between a stroke's input position in the target row and the end of that row is less than a threshold, the relevant areas of the target row and the following row can be activated together.
For the first and last rows of a paragraph, however, if other paragraphs precede or follow it, then when the user writes in the first row of the paragraph, that row cannot be activated together with the row before it, because the preceding row belongs to another paragraph; and when the user writes in the last row of the paragraph, that row cannot be activated together with the row after it, because the following row belongs to the next paragraph.
In particular, for the last row of a paragraph, the user may need to issue a "row extension" command, which inserts after it a blank row belonging to the same paragraph, before simultaneous activation of two adjacent rows becomes available.
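The paragraph constraint on simultaneous activation can be sketched as follows. Row indices, the `near_start` and `near_end` flags (standing for the threshold tests described above), and the function name are assumptions for illustration.

```python
def rows_to_activate(row, para_first, para_last, near_start, near_end):
    """Return the rows to activate together, never leaving the paragraph.

    row: index of the current target row.
    para_first, para_last: indices of the paragraph's first and last rows.
    near_start, near_end: whether the stroke lies within the threshold
    distance of the start or the end of the current row.
    """
    active = [row]
    if near_start and row > para_first:
        active.insert(0, row - 1)   # previous row belongs to the same paragraph
    if near_end and row < para_last:
        active.append(row + 1)      # next row belongs to the same paragraph
    return active


first_row = rows_to_activate(2, para_first=2, para_last=4,
                             near_start=True, near_end=False)
last_row = rows_to_activate(4, para_first=2, para_last=4,
                            near_start=False, near_end=True)
middle_row = rows_to_activate(3, para_first=2, para_last=4,
                              near_start=False, near_end=True)
```

In this sketch, a "row extension" command would correspond to increasing `para_last` by one, after which the former last row can be activated together with the newly inserted blank row.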
Of the four line-breaking methods above, methods one and four are user-initiated: the target row/column is changed through interaction with the user, which is more accurate. Methods two and three break lines automatically and require no additional interaction. As long as the user's writing fully conforms to the row or column constraints, the end of each row/column can be recognized automatically without the user confirming it interactively, so the entire handwriting input screen can even be used like ordinary paper, which greatly improves the input experience.
Two concepts are important for the method of processing handwritten input characters in this embodiment: the line break (soft return) and the paragraph end (hard return). A line break means that the current paragraph has not ended, but the handwritten characters have reached the end of the current line, so the next line must be activated. A paragraph end means that the content of the paragraph is complete. Once a paragraph end is determined, a blank line may be inserted after the current line and the line following the blank line activated as the first line of the next paragraph, so that the user writes on that line; alternatively, the row/column immediately following the current line may be activated directly as the first line of the next paragraph.
To distinguish a line break from a paragraph end, different interactions may be assigned: for example, tapping one button for a line break and another button for a paragraph end; or breaking the line automatically at the end of a line and ending the paragraph only through manual interaction; or ending the paragraph automatically at the end of a line and breaking the line only through manual interaction. This embodiment imposes no limitation in this respect.
For example, a line break may be performed using any of line-breaking methods one, two, or three above, whereas ending a paragraph requires some interaction with the user.
Alternatively, when the user writes on different lines, the lines may automatically be assigned to different paragraphs, with empty paragraphs created for the blank lines between paragraphs, while extending a paragraph onto the next line (that is, a line break) requires an explicit interactive command. In general, the paragraph extension command is meaningful only on the last line of a paragraph or on the last inserted line. The line being edited and all other lines of its paragraph share a common visual state that distinguishes them from other paragraphs.
On the basis of the technical solutions provided by the above embodiments, the characters input by the user may preferably also be saved.
The saving function in this embodiment may specifically include:
at preset time intervals, saving the new characters created for the acquired strokes or the characters to which those strokes were assigned;
or,
on the same page, when the currently activated target row/column on the page is detected to switch from one target row/column to another, saving the new characters created for, or the characters assigned to, the strokes acquired on the former target row/column;
or,
when the current page is detected to switch from one page to another, saving the new characters created for, or the characters assigned to, the strokes acquired on the former page.
Specifically, when saving, the strokes input by the user and the corresponding input information may be stored in a first memory, and the saved characters in a second memory. Each saved character comprises the strokes constituting the character and an index for each stroke, where the index points to the input information of that stroke in the first memory. Alternatively, the strokes, their input information, and the corresponding characters may all be stored in a single memory; this embodiment imposes no limitation in this respect.
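The two-memory arrangement described above might look like the following sketch. The class and field names (`HandwritingStore`, `StrokeInfo`, `stroke_ids`) are hypothetical, and plain Python lists stand in for the "first memory" and "second memory"; the original does not prescribe any concrete layout.

```python
from dataclasses import dataclass, field

@dataclass
class StrokeInfo:
    points: list      # (x, y) samples of the stroke's trace
    pen_down: float   # pen-down time
    pen_up: float     # pen-up time

@dataclass
class Character:
    # Indexes pointing at the strokes' input information in the stroke store.
    stroke_ids: list = field(default_factory=list)

class HandwritingStore:
    def __init__(self):
        self.strokes = []      # stands in for the "first memory"
        self.characters = []   # stands in for the "second memory"

    def add_stroke(self, info):
        """Store a stroke's input information and return its index."""
        self.strokes.append(info)
        return len(self.strokes) - 1

    def new_character(self, stroke_id):
        """Create a character owning one stroke and return the character's index."""
        self.characters.append(Character([stroke_id]))
        return len(self.characters) - 1

    def strokes_of(self, char_id):
        """Resolve a character's stroke indexes back to the stored input information."""
        return [self.strokes[i] for i in self.characters[char_id].stroke_ids]
```

Keeping only indexes in the character store means a stroke can later be moved to another character without copying its input information.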
Any suitable storage order or sequence may be used for strokes and characters, provided that the character to which each stroke belongs, and each distinct character, can be told apart effectively. Preferably, while input is in progress, the input strokes and the characters divided out so far are stored in a temporary storage location or space of the system (such as the system's RAM or flash memory), and only after input on a target row/column has finished are all the divided characters and stroke information of that row/column written to the designated permanent storage location or space.
On the basis of the technical solutions provided by the above embodiments, preferably, the input information corresponding to the stroke further includes one or a combination of the following: the input time of the stroke, the input pressure of the stroke, and the input speed of the stroke.
The input time includes the pen-down moment and the pen-up moment of the stroke, as well as the dwell time at each point of the stroke's trace. The input position includes at least: the position at pen-down, the position at pen-up, and the coordinates of each point of the stroke's trace.
In this embodiment, the input time, pressure, speed, and similar information of each stroke may be recorded as needed to refine the input information further. The stroke and its corresponding input time, pressure, and speed may be stored as a list in a separate stroke database.
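One possible shape for a per-stroke record holding this input information is sketched below. The names `Sample` and `StrokeRecord` are hypothetical, and deriving the speed from the trace length and the pen-down/pen-up times is an assumption of this example; the original only states that speed may be recorded.

```python
from dataclasses import dataclass
from math import dist

@dataclass
class Sample:
    x: float
    y: float
    dwell: float      # dwell time at this point of the trace
    pressure: float   # input pressure at this point

@dataclass
class StrokeRecord:
    pen_down: float   # pen-down moment
    pen_up: float     # pen-up moment
    samples: list     # ordered Sample objects; first is pen-down, last is pen-up

    def length(self):
        """Length of the trace, summed over consecutive samples."""
        pts = [(s.x, s.y) for s in self.samples]
        return sum(dist(a, b) for a, b in zip(pts, pts[1:]))

    def mean_speed(self):
        """Average speed over the stroke (assumed derivation, see lead-in)."""
        duration = self.pen_up - self.pen_down
        return self.length() / duration if duration > 0 else 0.0
```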
Because this embodiment can record and retain the detailed input information of every stroke, in the order in which the strokes are written and at the moment each stroke is received, it preserves nearly all the information that characterizes a user's writing style and habits, such as stroke-order style, pen-movement style, and the spacing between characters and words. This makes tasks such as handwriting identification straightforward.
This embodiment is also well suited to handling omitted strokes. For example, suppose the user forgets the dot "丶" at the upper right of the character "我" and notices the omission only after writing other characters. The user can then add the "丶" at the upper right of where "我" was written, just as on paper. Although the input time of the "丶" differs from that of the other strokes of "我", the position information shows that the "丶" is part of the previously written "我".
When the user draws a custom figure or character freehand during input, the input time and input position of each of its strokes are recorded just as for ordinary characters.
Because this embodiment retains all input information in full, including the input time, position, pressure, and speed of each stroke and the spacing between characters and words, it also leaves more room for subsequent editing and other processing services.
On the basis of the technical solutions provided by the above embodiments, preferably, creating a new character for the stroke or determining the character to which the stroke belongs in step 102A, according to the input position of the stroke in the first target row/column, or according to that input position together with the specified characters in the first target row/column, may specifically include:
comparing the input position of the stroke in the first target row/column with the position information of the specified characters in the first target row/column, and determining the association between the stroke and the characters;
if the stroke is not associated with any character, creating a new character for the stroke, the stroke belonging to the new character;
if the stroke is associated with at least one character, performing attribution processing on the stroke according to the associated character or characters.
In this embodiment, the specified characters may be all characters already present in the first target row/column. Alternatively, the specified characters may be the characters within a region to be compared in the first target row/column, where the distance between the boundary of that region and the stroke is less than a second preset threshold. Comparing the stroke only with characters within a certain surrounding range effectively reduces the amount of computation and improves the efficiency of stroke attribution.
There are several ways to compare the input position of the stroke in the first target row/column with the position information of the specified characters and to determine the association between the stroke and the characters. They are described in turn below.
Association mode one: the association between the stroke and a character is determined by checking whether the stroke overlaps the character. Specifically, creating a new character for the stroke or determining the character to which the stroke belongs in step 102A, according to the input position of the stroke in the first target row/column, or according to that input position together with the specified characters in the first target row/column, may specifically include:
comparing the input position of the stroke in the first target row/column with the position information of the specified characters in the first target row/column, and determining whether the stroke overlaps at least one stroke of the character;
if the stroke overlaps at least one stroke of the character, determining that the stroke is associated with the character;
if the stroke overlaps none of the strokes of the character, determining that the stroke is not associated with the character;
if the stroke is not associated with any character, creating a new character for the stroke, the stroke belonging to the new character;
if the stroke is associated with at least one character, performing attribution processing on the stroke according to the associated character or characters.
In this mode, strokes that cross one another are treated as strokes of the same character and are assigned to that character, which is simple and fast.
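A minimal sketch of association mode one follows. It assumes each stroke is a polyline given as a list of `(x, y)` points and treats "overlap" as two polylines sharing at least one point, tested with the standard orientation-based segment intersection check; the function names are hypothetical. A stroke consisting of a single point has no segments and therefore never overlaps anything in this sketch.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def _within(a, b, c):
    """True if c, already known to be collinear with a-b, lies on segment a-b."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))

def segments_intersect(p1, p2, q1, q2):
    o1, o2 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    o3, o4 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    if o1 != o2 and o3 != o4:
        return True   # proper crossing, or an endpoint touching the other segment
    # Remaining case: collinear segments that share a point.
    return ((o1 == 0 and _within(p1, p2, q1)) or (o2 == 0 and _within(p1, p2, q2))
            or (o3 == 0 and _within(q1, q2, p1)) or (o4 == 0 and _within(q1, q2, p2)))

def strokes_overlap(s, t):
    """True if any segment of polyline s meets any segment of polyline t."""
    return any(segments_intersect(s[i], s[i + 1], t[j], t[j + 1])
               for i in range(len(s) - 1) for j in range(len(t) - 1))

def associated_characters(stroke, characters):
    """Indices of the characters having at least one stroke that overlaps `stroke`."""
    return [k for k, ch in enumerate(characters)
            if any(strokes_overlap(stroke, t) for t in ch)]
```

An empty result corresponds to the "create a new character" branch, and a non-empty result to attribution processing.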
Association mode two: the association between the stroke and a character is determined by computing the distance between the stroke and the character's boundary. In this mode, creating a new character for the stroke or determining the character to which the stroke belongs in step 102A, according to the input position of the stroke in the first target row/column, or according to that input position together with the specified characters in the first target row/column, may specifically include:
for each specified character in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information of the character, and determining whether the distance between the stroke and the boundary of the character is less than a third preset threshold;
if the distance between the stroke and the boundary of the character is less than the third preset threshold, determining that the stroke is associated with the character;
if the distance between the stroke and the boundary of the character is not less than the third preset threshold, determining that the stroke is not associated with the character;
if the stroke is not associated with any character, creating a new character for the stroke, the stroke belonging to the new character;
if the stroke is associated with at least one character, performing attribution processing on the stroke according to the associated character or characters.
For example, for a character with a clear left-right or top-bottom structure, such as "温", individual writing habits may leave too large a gap between the left radical "氵" (three-dot water) and the right part "昷". In that case, the character to which these strokes belong can be determined by comparison with the third preset threshold set in advance. When the spacing between the currently input stroke and an adjacent character is less than the third preset threshold, the stroke is considered to belong to that adjacent character; otherwise a new character is created for the stroke.
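Association mode two can be illustrated with the sketch below. It assumes, purely for the example, that a character's boundary is its axis-aligned bounding box `(x_min, y_min, x_max, y_max)` and that the distance is the gap between the stroke's bounding box and the character's; the original does not fix either choice, and the function names are hypothetical.

```python
def bbox(points):
    """Axis-aligned bounding box of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

def box_gap(a, b):
    """Euclidean gap between two boxes; 0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx * dx + dy * dy) ** 0.5

def associated_by_boundary(stroke, char_boxes, third_threshold):
    """Indices of the characters whose boundary is closer than the threshold."""
    sb = bbox(stroke)
    return [k for k, cb in enumerate(char_boxes) if box_gap(sb, cb) < third_threshold]
```

With a larger threshold, a widely separated "氵" would still be associated with the box of "昷"; with a smaller one it would start a new character.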
Association mode three: the association between the stroke and a character is determined by computing the distance between the stroke and each stroke of the character. In this mode, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with the specified characters in the first target row/column, may specifically include:
for each specified character in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information of each stroke of the character, obtaining the minimum of the spacings between the stroke and the strokes of the character, and determining whether this minimum spacing is less than a fourth preset threshold;
if it is less, the stroke is associated with the character;
if it is not less, the stroke is not associated with the character;
if the stroke is not associated with any character, creating a new character for the stroke, the stroke belonging to the new character;
if the stroke is associated with at least one character, performing attribution processing on the stroke according to the associated character or characters.
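A sketch of association mode three is given below. It approximates the spacing between two strokes by the smallest distance between their sampled points, which is an assumption of this example (a segment-to-segment distance would be more exact); the function names are hypothetical.

```python
from math import dist

def min_spacing(stroke, character):
    """Smallest point-to-point distance between `stroke` and any stroke of `character`."""
    return min(dist(p, q) for t in character for p in stroke for q in t)

def associated_by_spacing(stroke, characters, fourth_threshold):
    """Indices of the characters whose minimum spacing is below the threshold."""
    return [k for k, ch in enumerate(characters)
            if min_spacing(stroke, ch) < fourth_threshold]
```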
In association modes one, two, and three, performing attribution processing on the stroke according to the associated character or characters may include:
if exactly one character is associated with the stroke, assigning the stroke to that character;
if at least two characters are associated with the stroke, merging those characters and assigning the stroke to the merged character.
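The attribution step, together with the "no association" branch from the preceding modes, could be written as the following sketch. Characters are modeled as lists of strokes held in one list per row/column; this representation and the name `attribute_stroke` are assumptions of the example.

```python
def attribute_stroke(stroke, characters, associated):
    """Assign `stroke` given the indices of its associated characters.

    Mutates `characters` and returns the index of the character that owns the stroke.
    """
    if not associated:
        # No association: create a new character for the stroke.
        characters.append([stroke])
        return len(characters) - 1
    keep = min(associated)
    # One associated character: this just appends. Several: merge them first.
    merged = [s for k in sorted(associated) for s in characters[k]] + [stroke]
    for k in sorted(associated, reverse=True):
        del characters[k]
    characters.insert(keep, merged)
    return keep
```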
In this embodiment, when a stroke can be assigned to both the character on its left and the character on its right, this indicates that the stroke should be merged with those characters to form a single glyph. An example is the positional relationship between the strokes of the middle component "又" of the character "树" and the left component "木" and the right component "寸". Of course, if no subsequent recognition is required, the above preset thresholds need not be set, as long as the characters can be divided from one another.
In addition, in association modes two and three, the association between a stroke and the characters may be graded by strength, and the attribution of the stroke decided according to that strength.
Specifically, performing attribution processing on the stroke according to the associated character or characters may include:
obtaining, from the associated character or characters, the character most strongly associated with the stroke;
if exactly one character is most strongly associated with the stroke, assigning the stroke to that character;
if at least two characters are most strongly associated with the stroke, merging those characters and assigning the stroke to the merged character.
Correspondingly, obtaining the character most strongly associated with the stroke from the associated character or characters may include:
sorting the characters associated with the stroke in ascending order of the distance between the stroke and each character's boundary, and taking the character with the smallest distance as the character most strongly associated with the stroke; or,
sorting the characters associated with the stroke in ascending order of the minimum spacing between the stroke and each character, and taking the first character as the character most strongly associated with the stroke.
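Selecting the most strongly associated character can be sketched as below. The input is assumed to be a mapping from character index to the distance computed in mode two or mode three. The `tol` parameter, which lets nearly equal distances count as a tie, is an addition of this example; the original does not say how ties are detected.

```python
def strongest(distances, tol=0.0):
    """Return the indices of the characters tied for the smallest distance.

    distances: {character_index: distance to the stroke}
    tol: distances within `tol` of the minimum count as equally strong (assumption).
    """
    best = min(distances.values())
    return sorted(k for k, d in distances.items() if d - best <= tol)
```

A single returned index means the stroke is assigned to that character; two or more mean those characters are merged first.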
When rows are the input constraint, strokes in a vertical (above/below) relationship may by default be assigned to the same character, so only the positional relationship between the stroke and the adjacent characters to its left and right needs to be examined. Likewise, when columns are the input constraint, strokes in a horizontal (left/right) relationship may by default be assigned to the same character, so only the positional relationship between the stroke and the adjacent characters above and below it needs to be examined.
In practice, when the attribution of strokes must be determined, the methods of the above modes may be combined: for example, some strokes may be judged by the method of association mode one, some by the method of association mode two, and the remaining strokes by the method of association mode three.
For example, if the currently input stroke is spatially the first or the last stroke in the first target row/column, the method of association mode one may be used to determine whether the stroke is associated with the other characters already input in the first target row/column; if it is not, a new character is created for the stroke. If the current stroke is spatially neither the first nor the last stroke in the first target row/column, the method of association mode two or association mode three may be used to compare the spacing between the currently input stroke and all previously input characters or strokes, and the stroke is assigned to the associated character or characters according to the result of the comparison.
The first, second, third, and fourth preset thresholds mentioned above may be set by the user according to the user's own writing habits, or system default values may be used.
In addition, the system may provide visual information to assist automatic division, such as character division based on grid cells (as on squared composition paper): the character to which the current input stroke should belong may be determined from the relationship between the stroke and the corresponding grid-cell lines in the current input line.
In this embodiment, grid cells may also be used to determine the attribution of a stroke. Specifically, before the strokes input by the user and the corresponding input information are acquired in step 101A, the first target row/column may be divided into a plurality of grid cells.
Correspondingly, creating a new character for the stroke or determining the character to which the stroke belongs in step 102A, according to the input position of the stroke in the first target row/column, or according to that input position together with the specified characters in the first target row/column, includes:
determining the grid cell in which the stroke lies according to the input position of the stroke in the first target row/column;
determining whether a character already exists in that grid cell;
if one exists, assigning the stroke to the character already in the grid cell; otherwise, creating a new character in the grid cell, the stroke belonging to the new character.
Specifically, if the stroke lies within a single grid cell, it is determined whether a character exists in that cell: if so, the stroke is assigned to that character; if not, a new character is created for the stroke, and the new character belongs to that cell. If the stroke spans at least two grid cells, it is determined whether characters exist in those cells. If none of the cells contains a character, a new character is created for the stroke, and the new character belongs to all of the spanned cells. If exactly one of the cells contains a character, the stroke is assigned to the character in that cell. If several of the cells contain characters, the characters in those cells are merged and the stroke is assigned to the merged character.
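The grid-cell rules above can be sketched as follows for a horizontal row. The sketch assumes equal-width cells starting at coordinate 0, a dictionary `cells` mapping a cell index to the character (a list of strokes) occupying it, and that merged cells all point at the same character object afterwards; these choices and the function names are illustrative only.

```python
def cells_spanned(stroke, cell_width):
    """Indices of the grid cells a stroke touches along the writing direction (x)."""
    xs = [p[0] for p in stroke]
    return list(range(int(min(xs) // cell_width), int(max(xs) // cell_width) + 1))

def attribute_by_cells(stroke, cells, cell_width):
    """Assign `stroke` using the grid cells it spans; returns the owning character."""
    spanned = cells_spanned(stroke, cell_width)
    found = []   # distinct characters already present in the spanned cells
    for c in spanned:
        ch = cells.get(c)
        if ch is not None and all(ch is not f for f in found):
            found.append(ch)
    if len(found) == 1:
        target = found[0]                          # one existing character: reuse it
    else:
        target = [s for ch in found for s in ch]   # none: new; several: merged
    target.append(stroke)
    if not found:
        for c in spanned:                          # new character owns all spanned cells
            cells[c] = target
    elif len(found) > 1:
        for c in list(cells):                      # repoint cells of merged characters
            if any(cells[c] is f for f in found):
                cells[c] = target
    return target
```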
Using grid cells to help determine the character to which a stroke belongs is not only simple and convenient, but also constrains the user's input more effectively, which makes the result more accurate.
The above describes how to determine which character a stroke belongs to. Automatic division, however, inevitably produces division errors, such as one character being taken for several characters or several characters for one. Even so, this embodiment normally does not need to recognize characters; recognition is performed on the input characters only when specifically required. On the one hand, every input character in this embodiment is segmented and stored as a glyph object (a non-standard, that is, handwritten, character); in other words, each divided or segmented input character is handled as a non-standard glyph object. On the other hand, if the handwritten content is ultimately intended only for human reading (where preserving the original form of the input matters more), division errors need not be corrected.
However, if a character is split incorrectly where a row/column wraps, for example when the character "的" written at the end of a line is wrongly split into the two characters "白" and "勺" and these are placed in different rows or columns, the incorrect split needs to be corrected in some way. Likewise, when the user finds an incorrectly split character while browsing previously input characters, it can be corrected in some way.
The correction function may be implemented by modifying the incorrect split interactively, or in any other feasible way that achieves the same effect. This embodiment provides a correction method, which specifically includes:
obtaining and displaying the boundary of each locally saved character;
receiving a correction request input by the user, the correction request including the character to be corrected, or the character to be corrected and the stroke to be corrected;
performing the corresponding correction processing on the character to be corrected according to the correction request.
Specifically, the content of the correction request may differ from scenario to scenario. This embodiment provides the following scenarios:
Scenario one: merging two characters into one. That is, the correction request is a merge correction request, and the characters to be corrected are at least two characters to be merged;
correspondingly, performing the corresponding correction processing on the characters to be corrected according to the correction request includes:
merging the at least two characters to be merged into one character.
Scenario two: splitting one character into several characters. That is, the correction request is a split correction request, and the character to be corrected is one character to be split;
correspondingly, performing the corresponding correction processing on the character to be corrected according to the correction request includes:
splitting the one character to be split into at least two characters.
Scenario three: reassigning a stroke from one character to another. That is, the correction request is an attribution correction request, the character to be corrected is the character that is to receive the stroke, and the stroke to be corrected is at least one stroke;
correspondingly, performing the corresponding correction processing on the character to be corrected according to the correction request includes:
assigning the at least one stroke to be corrected to the receiving character.
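The three correction scenarios can be sketched as operations on a list of characters, each character being a list of stroke identifiers. The function names are hypothetical, and the way a split is specified (the caller names the strokes that move into the new character) is an assumption, since the original leaves the interaction open.

```python
def merge_chars(characters, indices):
    """Scenario one: merge the characters at `indices` into a single character."""
    keep = min(indices)
    merged = [s for k in sorted(indices) for s in characters[k]]
    for k in sorted(indices, reverse=True):
        del characters[k]
    characters.insert(keep, merged)

def split_char(characters, index, moved):
    """Scenario two: move the strokes in `moved` out of character `index` into a new
    character placed right after it. `moved` should be a proper, non-empty subset."""
    old = characters[index]
    stay = [s for s in old if s not in moved]
    go = [s for s in old if s in moved]
    characters[index:index + 1] = [stay, go]

def reassign_stroke(characters, stroke, dst):
    """Scenario three: make `stroke` belong to the character at index `dst`."""
    target = characters[dst]
    for ch in characters:
        if ch is not target and stroke in ch:
            ch.remove(stroke)
    if stroke not in target:
        target.append(stroke)
    # Drop any character left without strokes.
    characters[:] = [ch for ch in characters if ch]
```

For the "的" example, `merge_chars` applied to the two fragments "白" and "勺" restores a single character.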
With the correction function described above, characters that have already been divided can be divided again through interaction with the user, which improves the accuracy of character division.
Because each character (which may be one word or a combination of several characters or words) has been separated into an individual unit during division, the characters are easy to tell apart. Furthermore, because the method of this embodiment can also record the order in which the user writes each stroke (based on time) and the shape features of each stroke, characters with the same or similar stroke order and stroke shape features are easy to find by comparison, and characters that satisfy suitable threshold conditions can be regarded as the same character. This makes matching, searching for, and locating characters straightforward, and a character input by the user can even serve as the search criterion.
In this embodiment, search and insert functions may also be added.
The search function may specifically include the following steps:
receiving a search command input by the user, the search command including the character to be searched for as input by the user;
comparing the character to be searched for with each locally saved character according to its stroke count and stroke features, and obtaining the characters that match it.
After the content input by the user has been divided by the method of this embodiment, separate handwritten characters are available. On this basis, a handwritten text search based on graphical matching can be performed, mainly by matching each character in the search source against the character to be searched for, one by one. Matching characters can be found by comparing stroke count and stroke order.
An exemplary flow for matching a single character based on strokes in this embodiment is given below:
determining whether the number of strokes in the character to be searched for is the same as the number of strokes in a given locally saved character; if they differ, the match fails; if they are the same, the next step is performed;
matching the strokes of the character to be searched for against the strokes of the locally saved character one by one, that is, matching the curves; if any pair is inconsistent, the final matching result is failure; if all pairs are consistent, the final matching result is success.
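The two-step flow above can be sketched as follows. This is only an illustration under stated assumptions: the source says that strokes are matched one by one as curves but does not specify how, so the curve comparison used here (resampling each stroke to a fixed number of points and thresholding the mean point-to-point distance), the function names `resample`, `strokes_match`, and `characters_match`, and the tolerance `tol` are all hypothetical.

```python
from math import hypot

def resample(stroke, n=16):
    """Return n points spaced evenly along the stroke's arc length."""
    if len(stroke) == 1:
        return [stroke[0]] * n
    cum = [0.0]  # cumulative arc length at each input point
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        cum.append(cum[-1] + hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    if total == 0:
        return [stroke[0]] * n
    out, seg = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        while seg < len(cum) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0 else (target - cum[seg]) / span
        (x0, y0), (x1, y1) = stroke[seg], stroke[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out

def strokes_match(a, b, tol):
    """Assumed curve test: mean distance between resampled points <= tol."""
    pa, pb = resample(a), resample(b)
    mean = sum(hypot(xa - xb, ya - yb)
               for (xa, ya), (xb, yb) in zip(pa, pb)) / len(pa)
    return mean <= tol

def characters_match(query, candidate, tol=1.0):
    """Strokes are lists of (x, y) points in writing order,
    expressed in character-relative coordinates."""
    # Step 1: the stroke counts must be equal.
    if len(query) != len(candidate):
        return False
    # Step 2: every stroke must match its counterpart in order.
    return all(strokes_match(q, c, tol) for q, c in zip(query, candidate))
```

Because strokes are compared pairwise in writing order, a character written with the same shapes but a different stroke order does not match in this sketch, which is consistent with the text's use of stroke order as a matching criterion.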
Of course, any graphic analysis or other matching method in the prior art may also be used to implement the character search function; this embodiment places no limitation on this. A character replacement function can be implemented on the same principle as the search function and is not described again here.
In this embodiment, the insertion function for handwritten text input and editing may specifically include the following steps:
receiving an insertion request input by the user, where the insertion request includes the target row/column for the insertion, the insertion position within that target row/column, and the character to be inserted;
activating the target row/column for the insertion, and inserting the character to be inserted at the insertion position;
adjusting the characters after the insertion position accordingly.
To insert new characters in the middle of existing content, an explicit command is needed to enter and exit insertion mode, rather than inserting automatically as in conventional character input. In addition, because the inserted characters may be handwritten characters, standard characters input with a keyboard, or non-standard characters from other input devices, corresponding insertion control or switching instructions are also required, as are instructions for marking and editing the inserted content.
When the user needs to add a character at some position in a row that has already become inactive, for example to insert a character between the 3rd and 4th characters of a row, the user first activates that row, and the system provides an auxiliary interface that accepts user input at the blank characters of the row. By activating the auxiliary interface between the 3rd and 4th characters, the user chooses to perform the insertion at that character gap.
Insertion can take place before or after any character. For a handwriting system, this can be further restricted to insertion at blank characters. FIG. 1E is a schematic diagram of the state during character insertion in an embodiment of the method for processing handwritten input characters provided by the present invention. As shown in FIG. 1E, after the insertion editing state is entered, the existing characters after the insertion position can be moved to the next row, and the space from the insertion position to the end of the current row becomes writable. The row marked with a right arrow is the insertion row; clicking the right arrow exits the insertion state. Until the insertion ends, the user can input only between the two insertion marks.
The characters before and after the insertion position are read-only (though selectable) until the insertion ends. After the insertion is complete, the text is re-typeset and line breaks are recomputed according to the inserted characters. The last row of the insertion (at the start of an insertion, the current insertion row is the last insertion row) can be extended, and the row added by the extension becomes the new last insertion row. In theory, insertions can be nested; that is, a further insertion can be made within inserted content. Insertion rows should have a visual state different from that of ordinary rows, to help the user identify the current editing state.
In addition to the search and insertion functions above, other processing can be applied to the characters handwritten by the user. The processing may include the following steps:
acquiring at least one character selected by the user;
receiving a selection processing command input by the user, and performing a processing operation on the at least one character according to the selection processing command;
where the selection processing command includes any one or a combination of the following: copying the at least one character, cutting the at least one character, replacing the at least one character, and merging the at least one character.
FIG. 1F is a schematic diagram of the editing mode under a selection processing command in an embodiment of the method for processing handwritten input characters provided by the present invention. As shown in FIG. 1F, functions such as insert, paste, select all, select, and merge can be displayed on the handwriting input screen so that the user can perform the corresponding operations.
In addition, this embodiment also allows strokes or annotations to be inserted or added on top of characters that have already been input, and allows characters to be deleted. The search, insertion, copy, and other functions provided in this embodiment effectively avoid the drawbacks of existing handwriting input systems, such as being unintuitive and difficult to modify.
Based on the technical solutions provided by the above embodiments, preferably, there are multiple first target rows/columns;
the activation regions corresponding to the multiple first target rows/columns do not overlap and do not touch one another.
In this case, multiple users can each input within the activation region of one of the first target rows/columns, which allows several people to write at the same time on a large handwriting input screen.
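The non-overlap and non-contact condition on activation regions can be checked with a simple rectangle test. The sketch below assumes that each activation region is an axis-aligned rectangle given as `(left, top, right, bottom)` with y increasing downward; that representation and the function names are illustrative assumptions, not taken from the source.

```python
from itertools import combinations

def separated(a, b):
    """True if rectangles a and b neither overlap nor touch.

    Strict inequalities are used so that a shared edge counts as contact."""
    a_left, a_top, a_right, a_bottom = a
    b_left, b_top, b_right, b_bottom = b
    return (a_right < b_left or b_right < a_left or
            a_bottom < b_top or b_bottom < a_top)

def regions_are_valid(regions):
    """Every pair of activation regions must be strictly separated."""
    return all(separated(a, b) for a, b in combinations(regions, 2))
```

With this check, two users writing in rows whose regions share an edge would be rejected, so that a stroke near the boundary cannot be ambiguous between the two rows.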
Based on the technical solutions provided by the above embodiments, preferably, this embodiment is also compatible with existing keyboards, mice, and other existing input devices, and implements mixed input through mode switching. The mode switching method in this embodiment may specifically include:
receiving a mode switching request input by the user, where the mode switching request includes a target mode;
switching from handwriting mode to the target mode and, in the target mode, receiving at least one standard character input by the user.
The target mode may be a keyboard input mode, a mouse input mode, or another existing input mode. For example, in combination with an existing keyboard, standard encoded characters can be added, or other symbols or information inserted, within the input constraints of a row or column, thereby achieving mixed typesetting (see the mixed handwritten text and graphics layout in the examples of this application).
Specifically, an appropriate touch button or operation (for example, a tap) can be used to activate another connected input device such as a keyboard, so that the user can switch freely between handwriting input and other conventional input devices such as the keyboard. Keyboard input can be divided either in the manner of standard codes or in the manner used for characters in the present invention.
In addition, during handwriting input the activation region can move automatically with the user's input. For example, the activation region is always repositioned so that its midpoint is at the position of the user's last stroke. In most cases the activation region then follows the user's writing automatically, so its position does not have to be set manually.
In the conventional standard-code input state, the system shows a blinking cursor to indicate the current input position. In the handwritten text input state, the system shows the activation region to indicate the range in which input is currently possible. When the user switches input mode, the two can be converted into each other according to certain rules. For example, when switching from standard character input to handwriting input, the system positions the activation region so that its midpoint is at the cursor position; when switching from handwriting input to standard character input, the character position closest to the midpoint of the activation region becomes the current input position.
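The conversion between the cursor and the activation region described above can be sketched in one dimension along the row. The representation of a region as a `(start, end)` pair, the list of candidate character positions, and the function names are assumptions made for illustration.

```python
def center_region_on(x, width):
    """Place an activation region of the given width so its midpoint is x.

    Used both when switching from cursor input to handwriting (x is the
    cursor position) and when following the user's last stroke."""
    half = width / 2
    return (x - half, x + half)

def cursor_index_from_region(region, char_positions):
    """Return the index of the character position closest to the midpoint
    of the activation region (char_positions must be non-empty)."""
    mid = (region[0] + region[1]) / 2
    return min(range(len(char_positions)),
               key=lambda i: abs(char_positions[i] - mid))
```

For example, a cursor at x = 100 with a region width of 40 yields the region (80, 120); switching back with character positions at 0, 30, 95, and 130 selects the position at 95.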
Based on the technical solutions provided by the above embodiments, preferably, the concept of control characters can also be introduced to address the typesetting and editing of handwritten text content. Control characters exist in standard encoding character sets (such as ASCII); similarly, the concept of control characters can be introduced into handwritten text, which makes outputting and processing handwritten text content more convenient and flexible.
Specifically, a control character may be a standard control character, that is, a special character such as a space, tab, or line feed; it may also be a non-standard control character, such as a blank character. Standard control characters are similar to those in the prior art. Blank characters are described in detail below, taking Embodiment 6 as an example.
In addition, this embodiment provides a blank character function. Specifically, in this embodiment the blank spacing between characters can be retained, for example the size of the gap between left and right neighboring characters in horizontal layout, or between upper and lower neighboring characters in vertical layout, and the blank spacing can be created directly as a blank character that carries the spacing information.
For characters handwritten by the user, when the writing style is left to right and top to bottom, the horizontal baseline of the target row containing a character can be taken as the horizontal baseline of that character, and the position of the leftmost component of the character (such as a graphic, image, or stroke) can be taken as the starting position of the character. Each component of the character records its position within the character using the baseline and the starting position as the origin and the typesetting direction as the positive direction. In this way, the same character content can appear at different positions in the text: as long as the coordinates of the character origin are correctly computed from the row containing the character and the character's position in that row, all of its internal components can be drawn correctly. Similarly, for other writing styles, the starting position of each character can be set in an analogous way, with the positions of the character's internal components expressed as coordinates relative to that starting position.
These starting positions are needed only when characters are drawn. When the divided characters are stored, the starting positions are not stored. However, the associated spacing between characters is separated out to form blank characters, which are saved in the character sequence corresponding to the text.
FIG. 1G is a schematic diagram of blank characters in an embodiment of the method for processing handwritten input characters provided by the present invention. As shown in FIG. 1G, this embodiment introduces a custom space character and saves the character spacing as its parameter/content. The numbers 12, 16, and 10 in FIG. 1G are the values of the blank characters and indicate the length of each blank character. Blank characters can be treated differently from other characters during analysis and processing (such as recognition and text wrapping). Similarly, time-based blank characters can be added to text entered by voice.
Generally speaking, the maximum coordinate of a user-input character along the typesetting direction is the width of that character. The character width can be stored, or it can be left unstored and recovered from the position information of all the internal components of the character. When text is typeset, once the widths of all characters (including control characters) are obtained, the starting position of every character within its row/column can be recovered, which provides the basis for subsequent text rendering.
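The recovery of widths and starting positions described above amounts to a maximum and a running sum. In the sketch below, strokes are lists of `(x, y)` points in character-relative coordinates (leftmost point at x = 0), and a row is a list of widths for handwritten and blank characters alike; both representations and the function names are assumptions for illustration.

```python
def char_width(strokes):
    """Width of a character: the largest coordinate of any component
    along the typesetting direction (x, for horizontal layout)."""
    return max(x for stroke in strokes for x, _y in stroke)

def start_positions(widths, line_start=0.0):
    """Recover the starting position of every character in a row from
    the widths alone; blank characters contribute their width too."""
    starts, pos = [], line_start
    for width in widths:
        starts.append(pos)
        pos += width
    return starts
```

Using the blank-character lengths 12, 16, and 10 from FIG. 1G between four handwritten characters of hypothetical widths 20, 25, 18, and 22, the width sequence `[20, 12, 25, 16, 18, 10, 22]` gives the starting positions 0, 20, 32, 57, 73, 91, and 101, so the handwritten characters start at 0, 32, 73, and 101.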
In this embodiment, standard control characters and blank characters are introduced, so that these control characters have a model, encoding, glyph, and meaning similar to those of the characters handwritten by the user. The theory, methods, and tools for processing handwritten input characters can therefore be applied directly or indirectly to control characters. Furthermore, handwritten characters and control characters can be mixed and processed together; on this basis, character splitting becomes all the more significant.
The objects processed in this embodiment may be stroke characters, standard characters, graphic characters, combined characters, or control characters input by the user, or a mixture of several of these.
FIG. 1H is a flowchart of text editing in an embodiment of the method for processing handwritten input characters provided by the present invention. As shown in FIG. 1H, text editing in this embodiment may specifically include the following steps:
Step 601A: determine how the document is opened. If an existing document is being opened, perform step 602A; if a new document is being created, perform step 603A.
This embodiment is mainly intended to provide personalized handwritten character input for documents. There are two main ways of entering the handwriting input system: with document data and without document data. The former opens an existing document; the latter creates a new document.
Step 602A: load the document data, typeset the text according to the typesetting constraints, and perform step 604A.
Specifically, the data of a character can be loaded in tiers. For example, typesetting requires only the width of each character (the height, for column-based typesetting), so this step may load only the width information. Other information, such as the stroke information or outline information needed for drawing, can be loaded later on demand, which saves system resources (memory, network traffic, and so on). Step 604A is then performed.
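Tiered loading of character data can be sketched with a lazily evaluated property. The class `LazyChar`, the loader callback, and the character codes below are hypothetical; the only point taken from the text is that the width is available for typesetting while stroke data is fetched on first use.

```python
class LazyChar:
    """A character whose width is loaded up front and whose stroke data
    is fetched only when it is first needed for drawing."""

    def __init__(self, code, width, load_strokes):
        self.code = code
        self.width = width            # needed for typesetting
        self._load_strokes = load_strokes
        self._strokes = None          # not loaded yet

    @property
    def strokes(self):
        if self._strokes is None:     # load on demand, once
            self._strokes = self._load_strokes(self.code)
        return self._strokes

def line_length(chars):
    """Typesetting touches only the widths, so no strokes are loaded."""
    return sum(c.width for c in chars)
```

Typesetting a document through `line_length` therefore triggers no stroke loading; only characters that are actually drawn pay that cost.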
Step 603A: initialize the handwritten document, and perform step 604A.
Step 604A: initialize (empty) the handwritten text object sequence that represents the text input row.
Hereinafter, the handwritten text object sequence representing the text input row is referred to as the AL (Active Line). The AL is the core data processed by the method provided in this embodiment.
Step 605A: present the document content, and perform step 606A.
The presented content includes several parts: the visual information of the document itself (including the visual information of handwritten characters, such as their positions and shapes), the visual information of the document presentation environment (such as background, shading, and paper borders), and the visual information related to document editing (such as the selected region, the cursor or activation region indicating the input focus, and guide lines). As mentioned for step 602A, the visual data of a handwritten character must be loaded when the character needs to be presented. For characters that do not need to be presented, the corresponding visual data need not be loaded.
As in conventional data processing systems, in this embodiment the character stream is loaded from storage into memory and must be typeset before it is displayed. For simple unformatted text, typesetting here means line breaking.
Specifically, a line can be broken at a paragraph end mark/line feed (hard return). Within each row/column, the position of each character is computed and the total length of the text so far is accumulated. When the position exceeds the maximum position of the row, the line is broken (soft return). The break is placed at the last preceding breakable position.
A series of rules determines the breakable positions:
a line can be broken after a punctuation mark (a punctuation mark cannot be the first character of a line after a soft return);
a line can be broken at blank space (blank characters, tab characters, and so on), and the first character of the next line is the next non-blank character (a blank character cannot be the first character of a line after a soft return);
a line can be broken directly before or after an East Asian character;
a line cannot be broken directly in the middle of an English word (in a simple system, the whole word is moved to the next line; in a more complex system with recognition added, the word can also be broken at a prefix or suffix boundary, with a hyphen added);
a line can be broken directly before or after a handwritten character.
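The break rules above can be sketched as a greedy line breaker. This is a simplified illustration under several assumptions: the input is a list of `(kind, text, width)` items in which an English word is already a single unbreakable item, the kinds `"word"`, `"cjk"`, `"hand"`, `"punct"`, `"blank"`, and `"newline"` are hypothetical labels, and hyphenation is not attempted. An item wider than the row is placed on a line of its own and allowed to overflow.

```python
def break_lines(items, max_width):
    """Greedy line breaking over (kind, text, width) items."""
    lines, line, used = [], [], 0.0
    skip_blanks = False  # True right after a soft return caused by a blank
    for item in items:
        kind, _text, width = item
        if kind == "newline":                 # hard return
            lines.append(line)
            line, used, skip_blanks = [], 0.0, False
            continue
        if kind == "blank" and skip_blanks:
            continue                          # blanks may not start a wrapped line
        skip_blanks = False
        if line and used + width > max_width: # soft return needed
            if kind == "blank":
                lines.append(line)            # break at the blank and drop it
                line, used, skip_blanks = [], 0.0, True
                continue
            carry = [item]
            # Punctuation (or a blank pulled down with it) may not start
            # a line: back up to the last legal break position.
            while line and carry[0][0] in ("punct", "blank"):
                carry.insert(0, line.pop())
            if line:
                lines.append(line)
            line = carry
            used = sum(w for _k, _t, w in line)
        else:
            line.append(item)
            used += width
    if line:
        lines.append(line)
    return lines
```

With a row width of 11, the items `Hello`, a blank, `world`, `!` (widths 5, 1, 5, 1) break into `Hello` plus the blank on the first line and `world!` on the second: the exclamation mark overflows, and since it may not start a line, the preceding word moves down with it.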
In an actual implementation, blank characters can be converted into blank spacings of standard length. Consecutive blank spacings can be merged directly, which simplifies the typesetting algorithm. Blank spacings are handled in the same way as blank characters.
The document model after typesetting includes the information of each display row. A row contains positioned words (including words made up of letters, East Asian characters, and handwritten characters). Blank characters do not need to appear in this model; the relevant information is implicit in the position attributes of the words (left boundary and right boundary, where right boundary = left boundary + width). Blank characters (including blank characters arising from handwritten spacing, standard blank characters, tab characters, and so on) can therefore be discarded after typesetting.
In the document model after typesetting, the spacing between characters is implicit in the characters' coordinates. For example, within a row, one character has a left coordinate of 12 and a width of 2.5, and the next character has a left coordinate of 16. The spacing between the two characters is therefore 16 - 12 - 2.5 = 1.5. The text in each row changes with the user's input: both inputting and erasing strokes can change the spacing between characters or produce new characters. As long as the character coordinates are correct, the spacing is produced correctly. Blank characters need to be computed, generated, and inserted at the appropriate positions only when the edited content is to be stored.
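The spacing calculation above, and the generation of blank characters at storage time, can be sketched as follows. Characters in a row are represented here as `(left, width)` pairs sorted by `left`, and the stored sequence as `("char", width)` and `("blank", width)` tuples; these representations and the function names are assumptions for illustration.

```python
def gaps(chars):
    """Spacing after each character except the last:
    next.left - (left + width)."""
    return [next_left - (left + width)
            for (left, width), (next_left, _w) in zip(chars, chars[1:])]

def with_blank_characters(chars):
    """Build the sequence to store: each positive gap becomes an explicit
    blank character placed between its two neighbors."""
    out = []
    for i, (left, width) in enumerate(chars):
        out.append(("char", width))
        if i + 1 < len(chars):
            gap = chars[i + 1][0] - (left + width)
            if gap > 0:
                out.append(("blank", gap))
    return out
```

For the example in the text, `gaps([(12, 2.5), (16, 3.0)])` evaluates to `[1.5]` (the second width, 3.0, is a made-up value that does not affect the result).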
Step 606A: receive a command, and perform different operations according to the command.
The command may be a command input by the user, a system command, or a command passed from another application system.
Commands can be issued in many ways. They can be issued directly through conventional interaction devices or through gestures; for example, when the user is recognized to have drawn a horizontal line across several consecutive characters, the gesture can be interpreted as an operation to delete those characters. Commands can also be triggered automatically through settings, such as automatically starting handwriting input after a document is created or opened, or automatically ending handwriting input after content is selected.
Specifically, if the command is a text encoding typesetting command, perform step 607A; if the command is a start handwriting input command, perform step 608A; if the command is an end handwriting input command, perform step 610A; if the command is a system exit command, perform step 612A.
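The control flow of FIG. 1H, as described in steps 601A to 612A, can be summarized as a small dispatch table. The sketch below only traces which steps are visited for a given command sequence and does no real work. The command names are hypothetical labels, and the return from step 607A to step 605A is an assumption, since the text does not state where step 607A continues.

```python
# Command -> step entered from 606A (command names are hypothetical).
DISPATCH = {
    "typeset": "607A",
    "begin_handwriting": "608A",
    "end_handwriting": "610A",
    "exit": "612A",
}

# Steps that follow each branch until the document is presented again.
# The 607A -> 605A transition is assumed, not stated in the source.
FOLLOW = {
    "607A": ["605A"],
    "608A": ["609A", "605A"],
    "610A": ["611A", "605A"],
}

def trace(commands, existing_document):
    """Return the list of step labels visited for the given commands."""
    steps = ["601A", "602A" if existing_document else "603A", "604A", "605A"]
    for command in commands:
        steps.append("606A")          # receive the command
        step = DISPATCH[command]
        steps.append(step)
        if step == "612A":            # system exit ends the flow
            break
        steps.extend(FOLLOW[step])
    return steps
```

For a new document followed by start handwriting, end handwriting, and exit, the trace passes through 603A, activates and writes in 608A and 609A, stores and clears the AL in 610A and 611A, and ends at 612A.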
Step 607A: typeset the text content according to the command.
When character information is stored, the typesetting constraints and typesetting direction can also be saved in the information of each character. Then, when the same character appears in text with a different typesetting mode, this information can be used to adjust the internal relative positions of all the character's components under the current typesetting mode, so that the character is drawn correctly.
Two examples of conversion between typesetting modes follow.
One example is using characters originally written in horizontal layout in vertical layout, or vice versa. Horizontally typeset characters advance by width (that is, the row length is accumulated along the typesetting direction, for example left to right), whereas vertically typeset characters advance by height. An implementation therefore needs to distinguish horizontal-layout characters from vertical-layout characters. Horizontal-layout characters may use an internal coordinate system with the row baseline (alignment line) as the horizontal axis and the leftmost stroke point defining the vertical axis; vertical-layout characters may use an internal coordinate system with the central axis of the column as the horizontal axis and the highest stroke point defining the vertical axis. In this way, different characters keep their original alignment when drawn in the corresponding layout. When horizontal text is changed to vertical, or vertical to horizontal, this typesetting metadata lets the system convert coordinates automatically. Although the original alignment between characters cannot be preserved, each character can still be rendered properly.
Another example is changing composition-grid layout to ordinary layout. In composition-grid layout, the character type is marked as grid layout, and the internal coordinate system of each character can take the lower-left corner of its grid cell as the origin (in fact any point, such as the center, would do). Each character is thus aligned with its cell. Handwritten text in grid layout has no spacing/space characters (although it does have empty-cell characters). When grid layout is changed to ordinary layout, each character can be recomputed with a new coordinate system (for example, the system described above whose origin is the intersection of the baseline and the leftmost point), and the corresponding spacing characters are inserted between the characters according to the new coordinate system.
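The grid-to-ordinary conversion in the second example can be sketched as follows. The sketch makes several assumptions not stated in the source: cells are laid out left to right with a uniform `cell_width`, each cell holds a list of strokes in cell coordinates (origin at the cell's lower-left corner) or `None` when empty, the row baseline coincides with the cell bottom so that y values are unchanged, an empty cell is folded into the following gap rather than kept as its own character, and any space before the first character is dropped.

```python
def grid_to_inline(cells, cell_width):
    """Convert grid-layout characters to baseline/leftmost-origin
    characters separated by explicit blank characters."""
    out = []
    prev_right = None                 # absolute right edge of previous char
    for index, strokes in enumerate(cells):
        if not strokes:
            continue                  # empty cell: widens the next gap
        xs = [x for stroke in strokes for x, _y in stroke]
        left, right = min(xs), max(xs)
        abs_left = index * cell_width + left
        if prev_right is not None:
            out.append(("blank", abs_left - prev_right))
        # Shift so the leftmost point of the character is at x = 0.
        shifted = [[(x - left, y) for x, y in stroke] for stroke in strokes]
        out.append(("char", shifted))
        prev_right = index * cell_width + right
    return out
```

With a cell width of 10, a character spanning x = 2..8 in cell 0 and another spanning x = 1..6 in cell 2 (cell 1 empty) are separated by a blank of 21 - 8 = 13.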
Step 608A: activate the target row/column, and perform step 609A.
In this step, the target row/column is activated, the text objects in the target row/column are activated (their stroke information is loaded), and the object sequence is assigned to the AL.
In this embodiment, handwritten characters are input under row/column constraints. Even if the input spans several rows/columns, the corresponding characters must ultimately be stored at a specific position in a specific row. The target row/column for character input can therefore be presented visually, and specific arrangements, such as an auxiliary panel or full-screen row editing, can be used to keep the user from writing across rows.
Step 609A: perform handwriting input under the constraints of the activated target row/column, and return to step 605A.
In this step, handwriting input is performed under the constraints of the activated target row/column. Each input stroke is automatically combined with the AL according to certain rules to form a new handwritten text object sequence (that is, the AL is updated).
The handwritten character input process mainly groups the input strokes automatically into different characters according to the spatial constraints within the row/column. For the implementation, see the foregoing embodiments; specifically, character formation can be achieved through character-spacing constraints or composition-grid constraints.
Step 610A: store the content of the text objects in the AL, and perform step 611A.
In this step, the content of the text objects in the AL is stored; if necessary, the text content related to the AL can also be re-typeset.
When handwritten character input ends, the character objects in the AL are finalized (before that, they change dynamically with each input stroke). Some of these character objects are unchanged, some have changed content (strokes), and some are entirely new. Both changed characters and entirely new characters count as new characters. The character sequence corresponding to the final AL must be written back to its position in the document. If a storage scheme that separates encoding from content is used, the content of each new character must first be stored in the encoding library to obtain the corresponding code; the new code sequence is then saved to the corresponding position in the document (usually the in-memory document model).
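The separation of encoding from content described above can be sketched with a small in-memory library. The class `CodeLibrary`, the use of sequential integers as codes, deduplication by exact stroke content, and the dictionary that stands in for the document model are all assumptions for illustration; the source only states that new character content is stored in an encoding library to obtain a code, and that the code sequence is then saved in the document.

```python
class CodeLibrary:
    """Stores character content once and hands out a code for it."""

    def __init__(self):
        self._codes = {}      # content -> code
        self._contents = []   # code -> content

    def encode(self, strokes):
        """Store the strokes if they are new and return their code."""
        key = tuple(tuple((x, y) for x, y in stroke) for stroke in strokes)
        if key not in self._codes:
            self._codes[key] = len(self._contents)
            self._contents.append(key)
        return self._codes[key]

    def decode(self, code):
        """Return the stored stroke content for a code."""
        return self._contents[code]

def commit_active_line(document, row, active_line, library):
    """Write the final AL back to the document as a sequence of codes."""
    document[row] = [library.encode(strokes) for strokes in active_line]
```

In this sketch an unchanged character re-encodes to its existing code, while a changed or entirely new character receives a new one, which mirrors the text's treatment of both as new characters.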
Because this handwritten character method uses spatial constraints within the row/column, the length of the row/column generally does not change. However, when an insert-content edit or an extended-line (soft return) edit ends, the layout information of the current line and the lines after it must be updated, that is, the layout is recomputed starting from the current line.
Step 611A: Clear the AL, and return to step 605A.
After handwriting input ends, there is no longer a target row/column for handwriting input, so the corresponding data structure can be cleared.
Step 612A: End.
The method for processing handwritten input characters provided by this embodiment makes it convenient for users to edit and process handwritten characters, further improving the user's input experience.
In addition to editing and typesetting document content and splitting, merging, recognizing, inserting, finding, and replacing characters, this embodiment can perform other processing on document content, such as saving and printing the document, as well as processing operations specific to handwritten character input, for example but not limited to the following.
To come closer to the effect of writing on paper, row and column scroll rulers can be provided in this embodiment, modeled on the scroll bars of existing conventional text editing tools or software, so that the input range of the panel (that is, the input space of rows and columns) can be extended upward, downward, leftward, or rightward. When a ruler is moved, the corresponding target row/column can be displayed and/or activated accordingly.
The line height used during handwriting input can also be mapped to a specific font size of a standard font, so that the size of handwritten characters can be standardized or adjusted.
After characters have been recognized, the blank information between them can be discarded, and part of the character spacing and position information can even be selectively discarded, saving storage space.
An encoding function can also be added in this embodiment.
Specifically, the encoding function in this embodiment may include:
receiving an encoding request, and determining, according to the encoding request, the glyph corresponding to a handwritten character in the handwriting input program;
querying the mapping table in the encoding repository to obtain the standard language parameter corresponding to the glyph.
The standard language parameter includes one or a combination of the following: numbers, symbols, keywords, public identifiers, and private identifiers.
This embodiment can encode the characters generated during handwriting input, as described in detail below.
In the present invention, input text or data objects are abstracted into the concept of a "character". A character may be a handwritten character of an ideographic script, such as a single Chinese, Japanese, Korean, Arabic, Tibetan, or Burmese character or a part of one (for example, a radical); or a handwritten word of a phonetic script, such as letters or words in English, German, French, Russian, or Spanish. It may also be a computer character based on a traditional standard code, such as an ASCII character, a Unicode character, or a string, including control characters such as space, tab, and line feed. It may also be a non-standard control character, such as the interval or spacing between handwritten characters described herein; a mixture and/or composite of handwritten and standard characters or strings; or even any graphic or image input by the user, such as the "heart" pattern described later, a photo, or any doodle, or any other form of written expression. In the input scheme or system of the present invention, all character objects input in these ways are recognized as characters in the form of non-standard glyphs.
The glyph referred to in the present invention is similar to the concept of a character in a standard font library; the only difference is that the glyphs generated by the present invention are all non-standard. Because the purpose of the present invention is not to generate a standard font or font library, the glyphs finally generated by the system may well include erroneous splits of characters or words or merges between them, and may also include arbitrary graphics or images input by the user.
Modern high-level programming languages are processed in roughly two ways: compilation and interpretation. In the former, the source code goes through a series of compilation steps to produce a binary file that encapsulates an instruction sequence for the target machine (which may be a virtual machine); the binary file must be loaded into the target system before it can execute. In the latter, an interpreter running on the target system reads the source code and runs it directly through a series of internal processing steps.
Languages based on interpretation are generally called scripting languages; typical examples are JavaScript, Lua, and Tcl. Many traditional programming languages are compiled languages, such as C, C++, Objective-C, Java, C#, Go, and Swift. Some languages support both approaches, such as Python, Ruby, Lua, Haskell, Scheme, and F#.
Whether the core component that processes program source code is a compiler or an interpreter, its front end is constructed in a very similar, or even identical, way. The front end converts source code into an internal intermediate form. Correspondingly, the back end of a compiler converts the intermediate form into machine code, while the back end of an interpreter executes the intermediate form through an execution engine. Some systems also process and optimize the intermediate form in a stage called the middle end. This document focuses on the front end, so in general it does not distinguish between compiled and interpreted languages; the front end is referred to collectively as the compilation front end.
The compilation front end generally comprises four processes: lexical scanning, syntax analysis, semantic analysis, and intermediate code generation. The lexical scanner converts source code into a token stream; the parser converts the token stream into an abstract syntax tree; semantic analysis annotates the abstract syntax tree with semantic labels; and the intermediate code generator converts the labeled abstract syntax tree into the compiler's intermediate form.
Besides the core source code processor (compiler/interpreter), a programming environment contains other related support systems, platforms, and tools, such as a code editor for entering and modifying source code, a debugger for debugging code execution, and a source control tool for managing code versions.
An integrated development environment (IDE) is an application that brings all of these systems and tools together behind one integrated user interface.
For a programming environment, the handwritten text system brings a brand-new way of entering text, with advantages such as security and convenience. Its input and editing results are still a character stream, however; the stream simply uses a proprietary encoding specific to the individual writer rather than a standard code.
For handwritten text, a dedicated programming language can be designed; alternatively, the glyph matching service of the handwritten text system can be used to generate program source code based on a standard code. With the latter approach, a large number of existing programming environments and tools can be reused directly. This embodiment mainly illustrates the latter approach.
The approach is quite direct: convert the personal proprietary encoding into a standard encoding. In other words, the handwritten source code is converted into source code that an ordinary compilation front end can recognize. Adding a conversion process in front of a traditional compilation front end is therefore enough to process handwritten source code, and the overall flow comprises five processes: handwritten source code conversion, lexical scanning, syntax analysis, semantic analysis, and intermediate code generation.
This encoding conversion converts and matches the handwritten source code according to established rules and generates the corresponding standard-code content, which no longer depends on the glyphs in the glyph library. The process has two main parts: control character conversion and glyph conversion.
Control character conversion: the control characters in a programming language mainly include space, tab, carriage return, and line feed. Because handwritten text can use control characters identical or similar to those of ordinary text, this conversion is simple and direct. For example, a handwritten spacing code is converted directly into a standard whitespace character. If a handwritten line break already uses the standard line feed code, it is kept without conversion.
Glyph conversion: glyph conversion mainly converts the personalized glyph codes in the handwritten source code into the corresponding standard codes. The conversion is based on the glyphs in the corresponding glyph library and uses the glyph matching service of the handwritten text system. It has four parts: number and symbol mapping, keyword mapping, interface identifier mapping, and private identifier generation and mapping.
Number and symbol mapping: the source programs of the vast majority of high-level programming languages exist as text files. What chiefly distinguishes them from ordinary text is their syntactic constraints, which take the concrete form of strictly defined keywords and syntactic symbols.
Number and symbol mapping searches the handwritten source code for glyphs that match the user-defined glyph number and symbol mapping table and replaces them with the corresponding standard-code numbers and symbols. The symbols here are the punctuation marks used in programming languages, such as the arithmetic operators, the greater-than, equals, and less-than signs, and the various brackets.
The glyph number and symbol mapping table is therefore the key to number and symbol mapping. The table is a personal setting: writing habits, stroke order, and glyph shapes differ from person to person, so glyph matching is meaningful only among glyphs written by the same person. Each programmer therefore has their own glyph number and symbol mapping table, which can map only handwritten source code written by that programmer. In a team software development environment, a programmer must authorize specific users/accounts and share the mapping table with them before others can compile or run that programmer's handwritten source code. This is in effect an extension of the security of handwritten text into software development and execution.
Because handwritten glyphs are unreliable, the glyph number and symbol mapping table may be a many-to-one mapping; that is, multiple glyphs may correspond to the same number or symbol.
Because program source code remains valid over the long term, a given user's glyph number and symbol mapping table for a given programming language should in principle be append-only: entries can be added but not deleted or modified. Its entries must also not conflict; for example, the same glyph must not correspond to different numbers or symbols.
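The three properties just described (many-to-one, append-only, conflict-free) can be sketched as follows. The class name `GlyphSymbolTable` and the string glyph codes are hypothetical; the disclosure does not prescribe a data structure.

```python
class GlyphSymbolTable:
    """Hypothetical append-only, many-to-one glyph-to-symbol table."""

    def __init__(self):
        self._entries = {}  # glyph code -> standard-code number or symbol

    def add(self, glyph_code, symbol):
        existing = self._entries.get(glyph_code)
        if existing is not None and existing != symbol:
            # One glyph must never map to two different symbols.
            raise ValueError(f"{glyph_code} already maps to {existing!r}")
        self._entries[glyph_code] = symbol

    def lookup(self, glyph_code):
        # Returns None when the glyph is not a number or symbol.
        return self._entries.get(glyph_code)

    # Deliberately no remove() or update(): the table only grows.
```

Several glyph codes may be added for the same symbol, which accommodates variation in how one person writes it.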
Unlike keywords and identifiers, the number and symbol characters of a standard code are not composed of alphabetic characters. A traditional compilation front end therefore often treats symbol characters specially during lexical scanning: a symbol can directly terminate the preceding lexical token, and an identifier usually cannot begin with a digit. Handwritten glyphs likewise need special conventions to make processing easier. For example, it can be agreed that a number or symbol corresponds only to a single glyph and never to a combination of glyphs.
Because of the special role of symbols, the glyph number and symbol mapping table is generally predefined by the user.
Keyword mapping: like number and symbol mapping, keyword mapping maps glyphs to standard codes by means of a mapping table, the glyph keyword mapping table, which is a personal, many-to-one table.
Keywords are also crucial to recognizing and parsing a programming language, because they determine the position and number of the related syntax elements. The content of the glyph keyword mapping table is therefore generally predefined by the user, but it can also be defined interactively during handwritten source code conversion.
Unlike number and symbol mapping, keyword mapping allows one keyword to correspond to a combination of multiple glyphs; different combinations of the same glyphs can therefore correspond to different keywords.
Interface identifier mapping: interface identifier mapping likewise maps glyphs to standard codes, and again the key is a mapping table, the glyph identifier mapping table. Traditional high-level programming languages come with built-in or third-party libraries, and the corresponding identifiers are needed to access their system constants, system functions, standard library functions, class libraries, and so on. These identifiers usually consist of standard-code characters. The glyph identifier mapping table maps the user's handwritten glyphs to the corresponding identifiers. In addition, some symbols in the handwritten code may themselves become interfaces that others use and access; in that case, corresponding standard-code identifiers must be provided for them as well.
In the glyph keyword mapping table, the set of target keywords (including system punctuation marks) for a given programming language is a well-defined, closed, finite set. In the glyph identifier mapping table, by contrast, the set of target identifiers is unbounded and open: it grows as the user accesses more system/external interfaces and exposes more interfaces to others.
Like the glyph keyword mapping table, the glyph identifier mapping table can be predefined by the user or defined interactively during handwritten source code conversion.
Commonly used strings and code snippets can also be placed in this mapping table and associated with suitable glyph sequences, which improves programming efficiency and program readability.
Private identifier generation and mapping: a private identifier appears in source code either as a definition or declaration, or as a reference. For definitions, the encoding conversion automatically generates a standard-code identifier, according to rules established by the system, for each private symbol (non-interface symbol) that the user defines or declares. The generated identifier need not carry any particular textual meaning; it only has to be unique, that is, different glyphs must produce different standard-code identifiers.
For references, the encoding conversion resembles the mapping-table-based conversions above, except that the mapping table is generated automatically by the system. Its content is the correspondence between the glyphs of the defined symbols and the standard-code identifiers generated for them.
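The definition/reference split can be sketched as a table that is filled when a symbol is defined and consulted when it is referenced. The class name `PrivateIdentifierTable`, the counter-based naming scheme, and the `__id` prefix are assumptions chosen only to satisfy the uniqueness requirement.

```python
class PrivateIdentifierTable:
    """Hypothetical system-generated glyph-to-identifier table."""

    def __init__(self, prefix="__id"):
        self._prefix = prefix
        self._table = {}  # tuple of glyph codes -> generated identifier

    def define(self, glyph_codes):
        """Called at a definition or declaration: generate a unique name."""
        key = tuple(glyph_codes)
        if key not in self._table:
            # A running counter guarantees distinct names for distinct glyphs.
            self._table[key] = f"{self._prefix}{len(self._table) + 1}"
        return self._table[key]

    def resolve(self, glyph_codes):
        """Called at a reference: look up the name generated earlier."""
        return self._table[tuple(glyph_codes)]  # KeyError if never defined
```

A `KeyError` from `resolve` corresponds to a handwritten symbol that was referenced without ever being defined.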
The handwritten text scheme allows handwritten text encoding and standard encoding to be mixed within the same content, and handwritten programming allows such content as well. During source code conversion, the standard-code parts are simply skipped without any conversion. To prevent the standard code generated from handwritten text and the original standard code from interfering with each other, a whitespace character must be inserted during conversion wherever standard text and non-control handwritten text are directly adjacent.
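The whitespace-insertion rule can be sketched as follows, assuming the converter has already produced a list of segments tagged by origin. The tags `"std"` (original standard code), `"hw"` (text converted from non-control handwritten glyphs), and `"ctl"` (converted control characters) are hypothetical.

```python
def join_segments(segments):
    """Join (kind, text) segments, separating adjacent std/hw text by a space."""
    out = []
    prev_kind = None
    for kind, text in segments:
        # Standard text directly next to converted handwritten text,
        # in either order, gets one separating whitespace character.
        if {kind, prev_kind} == {"std", "hw"}:
            out.append(" ")
        out.append(text)
        prev_kind = kind
    return "".join(out)
```

For example, `join_segments([("std", "print("), ("hw", "__id1"), ("std", ")")])` yields `print( __id1 )`, so the generated identifier cannot fuse with neighboring standard-code characters.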
Most programming languages are based on natural languages with phonetic scripts, such as English, so an identifier usually corresponds to a word. One benefit of handwritten programming is freedom from this natural-language restriction: anything can be used as long as it maps to the target language through a mapping table. Chinese can be used, for example. Chinese text does not delimit words, and in handwritten Chinese in particular there may be some spacing between any two characters. Treating each character as a separate identifier on the basis of that spacing would clearly be wrong. A larger character spacing must therefore be defined as the separator, so that multiple characters can form one identifier.
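Grouping by a larger spacing can be sketched as follows. The input format (a list of `("space", value)` and `("glyph", code)` pairs) and the choice that a gap greater than or equal to the threshold separates identifiers are assumptions; the disclosure states only that a larger spacing is needed.

```python
def group_glyphs(items, threshold):
    """Split a line into glyph groups at gaps of at least `threshold`."""
    groups, current = [], []
    for kind, value in items:
        if kind == "glyph":
            current.append(value)
        elif value >= threshold and current:
            # A wide gap closes the current group; narrow gaps are ignored.
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


line = [("space", 6), ("glyph", "W01"), ("space", 22),
        ("glyph", "W02"), ("space", 7), ("glyph", "W03")]
print(group_glyphs(line, threshold=20))
```

With a threshold of 20, the narrow gap of 7 keeps `W02` and `W03` together as one candidate identifier, while the gap of 22 separates them from `W01`.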
Traditional programs inevitably need to input, output, and otherwise process standard-code strings, so their code contains more or less embedded standard-code string content. One benefit of handwritten text is that it does not require handwriting recognition to generate standard-code strings in real time. Embedding standard-code strings in handwritten program code is therefore a real problem, which can be solved or avoided in the following ways:
1. Place the string in the glyph interface identifier mapping table and use the corresponding glyphs when programming; the required string is then obtained through the standard code conversion process.
2. Place the string in a resource file (many systems support this, and it is the recommended practice for internationalization) and load it at run time by its ID, which avoids embedding strings in the program source code.
3. Consider adding run-time support for handwritten text to the program, so that the program can directly support input and output based on handwritten text.
In the glyph number and symbol mapping table, glyphs can be defined directly for the ten digits 0-9 and the decimal point. One problem with handwritten digits, however, is that the glyphs of some digits are hard to distinguish from other symbols or letters, which skews the results of the glyph matching service. For example, the digit 1, the parentheses ( and ), the uppercase letter I (i), and the lowercase letter l (L) have highly similar glyphs; the digit 0 is hard to distinguish from the letter O in either case; and the digit 7 may look the same as the letter T. To address this, users need to deliberately write digits so that their glyphs differ from those of other symbols and letters, which is what people usually do in everyday life anyway.
One advantage of handwritten text is that it is not bound by the glyphs of standard-coded text: users can use any glyph or symbol. In handwritten programming, any glyph or symbol can therefore serve as a keyword or identifier. Care is needed, however, to avoid conflicts between keywords and identifiers: if an identifier uses the same glyph as a keyword, the conversion result will usually contain a syntax error. Using special glyphs or symbols for keywords avoids such conflicts.
FIG. 1I is a flowchart of the handwritten program source code conversion method in an embodiment of the method for processing handwritten input characters provided by the present invention. FIG. 1J is a detailed flowchart of "performing standard code conversion on B" in the method shown in FIG. 1I.
As shown in FIG. 1I and FIG. 1J, the conversion process has five inputs: the handwritten program source file, the handwritten glyph library, the glyph number and symbol mapping table, the glyph keyword mapping table, and the glyph interface identifier mapping table. It has three outputs: the standard-code target file, the source-target position mapping table, and the glyph private identifier mapping table. The glyph private identifier mapping table is needed only during conversion and need not be kept. The source-target position mapping table, however, is very important: the compilation or interpretation that follows conversion takes the generated standard-code target file as input, so the system messages it produces refer to positions in that text file. With the source-target position mapping table, these positions can be translated directly into the corresponding positions in the handwritten source file. This provides the basis for the entire handwritten programming environment and its auxiliary tools.
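One possible shape for the source-target position mapping table is sketched below, assuming the converter emits each piece of target text together with the index of the handwritten source item it came from. The function names and the offset-based representation are hypothetical; the disclosure does not specify the table's format.

```python
import bisect


def build_position_map(converted):
    """converted: list of (source_index, target_text) in output order.

    Returns the target text plus two parallel lists: the start offset of
    each piece in the target text and the source index it came from.
    """
    starts, sources, parts = [], [], []
    offset = 0
    for source_index, text in converted:
        starts.append(offset)
        sources.append(source_index)
        parts.append(text)
        offset += len(text)
    return "".join(parts), starts, sources


def source_of(target_offset, starts, sources):
    """Map an offset in the target text back to a source item index."""
    if target_offset < 0 or not starts:
        raise ValueError("offset outside the target text")
    # Find the last piece that starts at or before the offset.
    i = bisect.bisect_right(starts, target_offset) - 1
    return sources[i]
```

A compiler or interpreter message that reports a position in the generated file can then be translated with `source_of` and shown at the matching place in the handwritten source.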
In the detailed conversion process described above, the main output is a standard-code program text file. In an actual implementation, however, the conversion can be integrated with an existing compilation front end so that the file-writing step is skipped and a standard-code character stream is generated in memory for further processing. The conversion flow described above also assumes that the glyph interface identifier mapping table has been defined in advance. With deep integration into the compilation front end, an optimized conversion process can instead generate an intermediate file (with the number/symbol and keyword conversions completed) without a glyph identifier mapping table, and then handle handwritten identifiers intelligently according to the results of lexical, syntax, and semantic analysis. For example, the following rules can be used: for a handwritten symbol in a definition position, automatically generate its standard-code identifier; for an undefined handwritten symbol, ask the user interactively for its identifier definition and automatically build the glyph interface identifier mapping table from the user's input.
Going a step further, using the deeply integrated compiler inside the handwritten text editor makes features such as syntax coloring and syntax IntelliSense possible, ultimately enabling an integrated development environment based on handwritten text.
FIG. 1K is a schematic diagram of a handwritten program in an embodiment of the method for processing handwritten input characters provided by the present invention. The handwritten program in FIG. 1K is written for Lua, an embeddable scripting language. The corresponding glyph library encoding may be as shown in Table 1, Table 2, and Table 3.
Table 1
Figure PCTCN2015086672-appb-000001
Table 2
Figure PCTCN2015086672-appb-000002
Figure PCTCN2015086672-appb-000003
Table 3
Figure PCTCN2015086672-appb-000004
The handwritten program above contains three kinds of codes: glyph codes, character spacing codes, and line break codes. A glyph code is written as W followed by the specific glyph code, and a character spacing code as S followed by the spacing value. For convenience, line breaks are not embedded as codes but are shown directly as new lines. The encoding of the handwritten program above can therefore be written as follows:
S06 W01 S22 W02 S07 W03 S06 W04 S11 W05 S06 W06 S09 W07 S12 W08 S09 W09
S05 W10 S38 W11 S13 W12 S11 W13 S13 W14
S46 W15 S39 W16 S23 W17 S24 W18 S33 W19
S114 W20 S40 W21
S51 W22
S113 W23 S39 W24 S25 W25 S25 W26 S11 W27 S08 W28 S12 W29 S12 W30 S09 W31
S62 W32
S17 W33
S31 W34 S30 W35 S27 W36 S12 W37 S05 W38 S03 W39
S30 W40 S09 W41 S16 W42 S16 W43 S16 W44 S13 W45 S18 W46 S13 W47
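The textual notation above can be read back into structured items with a short sketch. The function name `parse_line` and the tuple representation are hypothetical; only the `S<value>` and `W<code>` notation comes from the text.

```python
import re

# One item is the letter S or W followed by digits, e.g. S114 or W20.
ITEM = re.compile(r"([SW])(\d+)")


def parse_line(line):
    """Parse one encoded line into ("space", int) and ("glyph", str) items."""
    items = []
    for kind, digits in ITEM.findall(line):
        if kind == "S":
            items.append(("space", int(digits)))   # spacing value
        else:
            items.append(("glyph", "W" + digits))  # glyph code
    return items


print(parse_line("S114 W20 S40 W21"))
```

Each line of the listing is parsed separately, since line breaks are represented by the line structure itself rather than by a code.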
To convert this code, the user prepares the glyph number and symbol mapping table shown in Table 4.
Table 4
Figure PCTCN2015086672-appb-000005
The glyph keyword mapping table is shown in Table 5.
Table 5
Figure PCTCN2015086672-appb-000006
Figure PCTCN2015086672-appb-000007
The glyph interface identifier mapping table is shown in Table 6.
Table 6
Figure PCTCN2015086672-appb-000008
Here, the syntax spacing threshold set by the system is 20. The rule for automatically generating a private identifier is two underscores (_) followed by the glyph code sequence joined by underscores.
Following the flow described above, the resulting standard-code program is:
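The identifier generation rule can be sketched in a few lines. Whether the generated name keeps the `W` prefix of each glyph code is an assumption, as is the function name; the exact output format appears only in the figure of the original filing.

```python
def private_identifier(glyph_codes):
    """Two underscores, then the glyph codes joined by single underscores.

    Assumption: each code is used verbatim, including its W prefix.
    """
    if not glyph_codes:
        raise ValueError("an identifier needs at least one glyph")
    return "__" + "_".join(glyph_codes)


print(private_identifier(["W02", "W03"]))  # prints __W02_W03
```

Because the glyph codes themselves contain no underscores, distinct glyph sequences produce distinct names, which meets the uniqueness requirement for private identifiers.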
Figure PCTCN2015086672-appb-000009
As can be seen, four private identifiers are generated, as shown in Table 7.
Table 7
Figure PCTCN2015086672-appb-000010
The first of these identifiers is actually comment content and has no meaning. With the optimized conversion process, its conversion can be skipped once it is recognized as comment content.
The generated program can be interpreted and executed normally by a standard Lua interpreter, and its execution semantics are exactly the same as those of the handwritten source code.
Further, on the basis of FIG. 1A above, the method may further include:
upon receiving a storage request, obtaining the metadata of the saved handwritten text according to a preset metadata stripping specification, and stripping the obtained metadata from the handwritten text;
dividing the handwritten text into at least two data fragments according to a preset data content splitting specification.
Still further, the method may also include:
querying an encoding repository, selecting or creating an encoding specification according to at least part of the metadata, and generating a meta code corresponding to the metadata according to the encoding specification; encoding the handwritten text according to the encoding specification to obtain an instance code; and obtaining the text code corresponding to the handwritten text from the meta code and the instance code;
or,
sending the handwritten text and the metadata to the encoding repository, so that the encoding repository selects or creates an encoding specification according to at least part of the metadata, generates a meta code corresponding to the metadata according to the encoding specification, encodes the handwritten text according to the encoding specification to obtain an instance code, and obtains the text code corresponding to the handwritten text from the meta code and the instance code; and receiving the text code returned by the encoding repository, the text code being in reference-code form or content-code form.
It should be noted that the data splitting process is described in detail in the data splitting method embodiments later in this specification, and the encoding process is described in detail in the encoding processing method embodiments later in this specification; neither is repeated here.
FIG. 1L is a schematic structural diagram of an embodiment of an apparatus for processing handwritten input characters provided by the present invention. As shown in FIG. 1L, the apparatus of this embodiment may include:
an acquisition module 1001A, configured to acquire, in a currently activated first target row/column, a stroke input by the user and the corresponding input information, where the input information includes the input position of the stroke in the first target row/column;
an attribution module 1002A, configured to, for each stroke, create a new character for the stroke or determine the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with the specified characters in the first target row/column.
The apparatus of this embodiment may be used to carry out the embodiment of the method for processing handwritten input characters shown in FIG. 1A. For its implementation principles, refer to that embodiment; details are not repeated here.
The apparatus provided by this embodiment acquires, in the currently activated first target row/column, a stroke input by the user and the corresponding input information, and creates a new character for the stroke or determines the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to that input position together with the specified characters in the first target row/column. Characters are thus formed as they are written. The user does not need explicit or implicit "start single character input" or "end single character input" commands to separate characters, so there is no need to pause or to interact with the system after each character, and writing is fluent and efficient. Moreover, the character to which a stroke belongs is determined directly from the stroke's input position, without recognizing standard characters, so the personalized information, writing style, and characteristics of the user's handwriting are preserved.
On the basis of the technical solution provided by the above embodiment, preferably, the acquisition module 1001A is further configured to:
obtain the size information of the handwriting input screen and the row height/column width information;
divide the handwriting input screen into at least one row/column according to the size information of the handwriting input screen and the row height/column width information, and determine the position range of each row/column;
where the row height/column width information is a default value or is determined by user input, and the position range of each row/column refers to the top-edge and bottom-edge positions of each row relative to the handwriting input screen, or the left and right positions of each column relative to the handwriting input screen;
receive a target row/column selection message input by the user, the message including the identifier of the target row/column in which the user intends to write;
according to the target row/column selection message, take the row/column corresponding to that identifier as the currently activated first target row/column.
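The row division and selection just described can be sketched as follows. This is an illustration only, assuming horizontal rows, pixel units, and a zero-based integer row identifier; the function names are hypothetical.

```python
def divide_into_rows(screen_height, row_height):
    """Return one (top, bottom) position range per row of the screen."""
    if row_height <= 0:
        raise ValueError("row_height must be positive")
    # Always produce at least one row, even if the screen is shorter
    # than a single row height.
    count = max(1, screen_height // row_height)
    return [(i * row_height, (i + 1) * row_height) for i in range(count)]


def activate_row(rows, row_id):
    """Return the position range of the row the user selected."""
    return rows[row_id]


rows = divide_into_rows(screen_height=600, row_height=150)
print(rows)
print(activate_row(rows, 1))
```

Columns can be handled the same way by dividing the screen width by the column width and recording left and right positions instead.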
Alternatively, the acquisition module 1001A is further configured to:
acquire at least one character input by the user;
take the row/column in which the at least one character is located as the currently activated first target row/column;
set the position range of the currently activated first target row/column according to the character boundaries of the at least one character;
where the position range refers to the top-edge and bottom-edge positions of the first target row relative to the handwriting input screen, or the left and right positions of the first target column relative to the handwriting input screen.
On the basis of the technical solution provided by the above embodiment, preferably, the acquisition module 1001A is further configured to:
receive a row/column break command input by the user;
according to the row/column break command, take a second target row/column as the currently activated target row/column, the second target row/column being the row/column following the first target row/column.
Alternatively, the acquisition module 1001A is further configured to:
determine whether the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is less than a first preset threshold;
if the distance is less than the first preset threshold, take a second target row/column as the currently activated target row/column, so that strokes input by the user are acquired in the second target row/column;
where the second target row/column is the row/column following the first target row/column.
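A minimal sketch of this automatic row switch, assuming horizontal rows and a stroke position reduced to a single x coordinate (for example, its rightmost point). The function and parameter names are hypothetical.

```python
def next_active_row(stroke_x, row_end_x, current_row, threshold):
    """Return the index of the row that should be active after this stroke.

    If the stroke lies closer to the end of the current row than the
    first preset threshold, the following row becomes the active one.
    """
    distance = abs(row_end_x - stroke_x)
    if distance < threshold:
        return current_row + 1
    return current_row


print(next_active_row(stroke_x=780, row_end_x=800, current_row=0, threshold=30))
print(next_active_row(stroke_x=400, row_end_x=800, current_row=0, threshold=30))
```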
Alternatively, the acquisition module 1001A is further configured to:
determine whether the distance between the input position of the stroke in the first target row/column and the end position of the first target row/column is less than a first preset threshold;
if the distance is less than the first preset threshold, take both the first target row/column and a second target row/column as currently activated target rows/columns at the same time;
acquire at least one stroke subsequently input by the user in the first target row/column and/or the second target row/column, and, once the first stroke has been acquired in the second target row/column, take only the second target row/column as the currently activated target row/column;
where the second target row/column is the row/column following the first target row/column.
When the first target row/column and the second target row/column are both taken as currently activated target rows/columns, each of them is activated over only part of its area;
the start position of the activated area of the first target row/column is set between the end position of the activated area of the second target row/column and the end position of the activated area of the first target row/column.
On the basis of the technical solution provided by the above embodiment, preferably, the attribution module 1002A is specifically configured to:
compare the input position of the stroke in the first target row/column with the position information of the specified characters in the first target row/column, and determine the association between the stroke and the characters;
if the stroke is not associated with any character, create a new character for the stroke, the stroke belonging to the new character;
if the stroke is associated with at least one character, perform attribution processing on the stroke according to the associated at least one character.
Here, the specified characters are all characters already present in the first target row/column;
or, the specified characters are the characters in a to-be-compared region of the first target row/column, where the distance between the boundary of the to-be-compared region and the stroke is less than a second preset threshold.
Specifically, comparing the input position of the stroke in the first target row/column with the position information of the specified characters in the first target row/column and determining the association between the stroke and the characters may include:
comparing the input position of the stroke in the first target row/column with the position information of the specified characters in the first target row/column, and determining whether the stroke overlaps at least one stroke of a character;
if the stroke overlaps at least one stroke of the character, determining that the stroke is associated with the character;
if the stroke overlaps none of the strokes of the character, determining that the stroke is not associated with the character.
Alternatively, comparing the input position of the stroke in the first target row/column with the position information of the specified characters in the first target row/column and determining the association between the stroke and the characters may include:
for each specified character in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information of the character, and determining whether the distance between the stroke and the boundary of the character is less than a third preset threshold;
if the distance between the stroke and the boundary of the character is less than the third preset threshold, determining that the stroke is associated with the character;
if the distance between the stroke and the boundary of the character is not less than the third preset threshold, determining that the stroke is not associated with the character.
Alternatively, comparing the input position of the stroke in the first target row/column with the position information of the specified characters in the first target row/column and determining the association between the stroke and the characters may include:
for each specified character in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information of each stroke of the character, obtaining the minimum of the gaps between the stroke and the individual strokes of the character, and determining whether this minimum gap value is less than the third preset threshold;
if it is less, the stroke is associated with the character;
if it is not less, the stroke is not associated with the character.
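The third criterion (minimum gap between the new stroke and the strokes of a character) can be sketched as below. This is an illustration under assumptions not stated in the text: each stroke is a list of sampled `(x, y)` points, a character is a list of strokes, and the gap between two strokes is approximated by the smallest distance between their sampled points.

```python
import math


def stroke_gap(stroke_a, stroke_b):
    """Smallest distance between any sampled point of the two strokes."""
    return min(math.dist(p, q) for p in stroke_a for q in stroke_b)


def min_gap_to_character(stroke, character):
    """Smallest gap between the stroke and any stroke of the character."""
    return min(stroke_gap(stroke, s) for s in character)


def is_associated(stroke, character, threshold):
    """True if the minimum gap is below the (third) preset threshold."""
    return min_gap_to_character(stroke, character) < threshold


# A cross-shaped character made of two strokes, and two candidate strokes.
character = [[(0, 0), (10, 0)], [(5, -5), (5, 5)]]
near_stroke = [(12, 0), (12, 6)]
far_stroke = [(40, 0), (40, 6)]
print(is_associated(near_stroke, character, threshold=5))
print(is_associated(far_stroke, character, threshold=5))
```

Because only sampled points are compared, the computed gap can overestimate the true gap between sparsely sampled strokes; a denser sampling or a segment-to-segment distance would tighten it.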
Performing attribution processing on the stroke according to the associated at least one character may include:
if at least two characters are associated with the stroke, merging the at least two characters and assigning the stroke to the merged character.
Alternatively, performing attribution processing on the stroke according to the associated at least one character may include:
obtaining, from the associated at least one character, the character most strongly associated with the stroke;
if exactly one character is most strongly associated with the stroke, assigning the stroke to that character;
if at least two characters are most strongly associated with the stroke, merging the at least two characters and assigning the stroke to the merged character.
Obtaining, from the associated at least one character, the character most strongly associated with the stroke includes:
sorting the at least one character associated with the stroke in ascending order of the distance between the stroke and the boundary of each character, and taking the character with the smallest distance as the character most strongly associated with the stroke; or,
sorting the at least one character associated with the stroke in ascending order of the minimum gap value between the stroke and each character, and taking the first character as the character most strongly associated with the stroke.
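The "strongest association" variant of attribution can be sketched as follows. The input is assumed to be a mapping from character identifier to the distance (or minimum gap) already computed for each associated character; treating exactly equal distances as a tie is also an assumption, as are the names used.

```python
def attribute_stroke(distances):
    """Decide where a stroke goes, given {character_id: distance}.

    `distances` holds only characters already judged to be associated
    with the stroke. Returns ("new", []), ("assign", [id]) or
    ("merge", [ids...]).
    """
    if not distances:
        # No associated character: a new character is created.
        return ("new", [])
    smallest = min(distances.values())
    strongest = sorted(c for c, d in distances.items() if d == smallest)
    if len(strongest) == 1:
        return ("assign", strongest)
    # Several characters tie for the strongest association: merge them.
    return ("merge", strongest)


print(attribute_stroke({}))
print(attribute_stroke({"a": 2.0, "b": 5.0}))
print(attribute_stroke({"a": 2.0, "b": 2.0}))
```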
On the basis of the technical solution provided by the above embodiment, preferably, the acquisition module 1001A is further configured to:
before acquiring the stroke input by the user and the corresponding input information, divide the first target row/column into a plurality of composition grid cells;
correspondingly, the attribution module 1002A may be specifically configured to:
determine, according to the input position of the stroke in the first target row/column, the composition grid cell in which the stroke lies;
determine whether a character already exists in that composition grid cell;
if one exists, the stroke belongs to the character already in the cell; otherwise, a new character is created in the cell and the stroke belongs to the new character.
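A minimal sketch of grid-cell attribution, assuming a horizontal row divided into cells of equal width and a stroke reduced to a single x coordinate; the data layout and names are hypothetical.

```python
def attribute_by_cell(stroke_x, cell_width, cells):
    """Assign a stroke to the character in its grid cell.

    `cells` maps a cell index to the list of strokes (x positions here)
    forming the character in that cell. Returns the cell index.
    """
    index = int(stroke_x // cell_width)
    # setdefault creates a new, empty character if the cell has none yet.
    cells.setdefault(index, []).append(stroke_x)
    return index


cells = {}
for x in (12, 30, 70):
    attribute_by_cell(x, cell_width=50, cells=cells)
print(cells)
```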
On the basis of the technical solution provided by the above embodiment, preferably, the acquisition module 1001A is further configured to:
receive a search command input by the user, the search command including a to-be-found character input by the user;
compare the to-be-found character with each locally saved character according to the number of strokes and the stroke features of the to-be-found character, and obtain the characters that match the to-be-found character.
On the basis of the technical solution provided by the above embodiment, preferably, the acquisition module 1001A is further configured to:
save, at preset intervals, the new characters created for the acquired strokes or the characters to which those strokes were assigned;
or,
on the same page, when the currently activated target row/column on the page switches from one target row/column to another, save the new characters created for the strokes acquired in the former target row/column or the characters to which those strokes were assigned;
or,
when the current page switches from one page to another, save the new characters created for the strokes acquired on the former page or the characters to which those strokes were assigned.
On the basis of the technical solution provided by the above embodiment, preferably, the acquisition module 1001A is further configured to:
save the strokes input by the user and the corresponding input information in a first memory;
store the saved characters in a second memory, where each saved character includes the strokes that make up the character and an index corresponding to each stroke;
where the index corresponding to a stroke points to the input information of that stroke in the first memory.
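The two-memory layout can be illustrated with the sketch below. It is a simplification under assumptions: the "first memory" is a plain list of stroke records, the "second memory" is a list of characters, and each character keeps only the indexes of its strokes. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StrokeInfo:
    points: list          # (x, y) positions along the stroke
    pen_down_time: float  # moment the pen touched the screen
    pen_up_time: float    # moment the pen was lifted


@dataclass
class Character:
    stroke_indexes: list = field(default_factory=list)


stroke_store = []  # "first memory": raw strokes and their input information
characters = []    # "second memory": characters referring to strokes by index


def add_stroke(info):
    """Save a stroke in the first memory and return its index."""
    stroke_store.append(info)
    return len(stroke_store) - 1


def strokes_of(character):
    """Follow a character's indexes back to the stored stroke records."""
    return [stroke_store[i] for i in character.stroke_indexes]


index = add_stroke(StrokeInfo([(0, 0), (1, 1)], pen_down_time=0.0, pen_up_time=0.2))
characters.append(Character([index]))
print(strokes_of(characters[0]))
```

Keeping the raw input information in one place means that merging, splitting, or reassigning characters only moves indexes and never copies stroke data.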
The input information corresponding to the stroke further includes one or a combination of the following: the input time of the stroke, the input pressure of the stroke, and the input speed of the stroke.
The input time includes the pen-down and pen-up moments of the stroke and the dwell time at each point of the stroke's trace;
the input position includes at least: the pen-down position, the pen-up position, and the coordinates of each point of the stroke's trace.
On the basis of the technical solution provided by the above embodiment, preferably, the acquisition module 1001A is further configured to:
obtain and display the boundary of each locally saved character;
receive a correction request input by the user, the correction request including a character to be corrected, or a character to be corrected and a stroke to be corrected;
perform the corresponding correction processing on the character to be corrected according to the correction request.
Here, the correction request may be a merge correction request, and the characters to be corrected are at least two characters to be merged;
correspondingly, performing the corresponding correction processing on the characters to be corrected according to the correction request includes:
merging the at least two characters to be merged into one character.
Alternatively, the correction request is a split correction request, and the character to be corrected is one character to be split;
correspondingly, performing the corresponding correction processing on the character to be corrected according to the correction request includes:
splitting the one character to be split into at least two characters.
Alternatively, the correction request is an attribution correction request, the character to be corrected is a target character for attribution, and the strokes to be corrected are at least one stroke to be corrected;
correspondingly, performing the corresponding correction processing on the character to be corrected according to the correction request includes:
assigning the at least one stroke to be corrected to the target character.
On the basis of the technical solution provided by the above embodiment, preferably, the acquisition module 1001A is further configured to:
receive an insertion request input by the user, the insertion request including the target row/column for insertion, the insertion position within that target row/column, and the characters to be inserted;
activate the target row/column for insertion, and insert the characters to be inserted at the insertion position;
adjust the characters after the insertion position accordingly.
On the basis of the technical solution provided by the above embodiment, preferably, the acquisition module 1001A is further configured to:
acquire at least one character selected by the user;
receive a selection processing command input by the user, and process the at least one character according to the selection processing command;
where the selection processing command includes any one or a combination of the following: copying the at least one character, cutting the at least one character, replacing the at least one character, and merging the at least one character.
On the basis of the technical solution provided by the above embodiment, preferably, there are a plurality of first target rows/columns;
the activated areas corresponding to the plurality of first target rows/columns neither overlap nor touch one another.
On the basis of the technical solution provided by the above embodiment, preferably, the acquisition module 1001A is further configured to:
receive a mode switching request input by the user, the mode switching request including a target mode;
switch from the handwriting mode to the target mode, and, in the target mode, receive at least one standard character input by the user.
On the basis of the technical solution provided by the above embodiment, preferably, the acquisition module 1001A is further configured to:
receive an encoding request and, according to the encoding request, determine the glyphs corresponding to the handwritten characters in the handwritten input program;
query the mapping tables in the encoding repository to obtain the standard language parameters corresponding to the glyphs.
The standard language parameters include one or a combination of the following: numbers, symbols, keywords, public identifiers, and private identifiers.
The data splitting and data merging involved are described in detail below.
The data splitting of the present invention is a solution that can effectively solve the above problems. FIG. 2A is a flowchart of a data splitting method according to an exemplary embodiment. As shown in FIG. 2A, the present invention provides a data splitting method, including:
Step 101B: upon receiving a storage request carrying an identifier of data to be stored, obtain, according to a preset metadata stripping specification, the metadata in the data object corresponding to the identifier of the data to be stored.
Step 102B: strip the obtained metadata from the data object.
Step 103B: divide the data content into at least two data fragments according to a preset data content splitting specification.
Optionally, the method may further include:
Step 104B: store the metadata and each data fragment separately in different storage bodies or in different secure channels.
In the data splitting method of this embodiment, upon receiving a storage request carrying an identifier of data to be stored, the metadata in the data object corresponding to that identifier is obtained according to a preset metadata stripping specification and stripped from the data object; the data content is then divided into a plurality of data fragments according to a preset data content splitting specification; and the metadata and each data fragment are stored separately in different storage bodies or in different secure channels. This makes it harder to obtain the user's original data illegally and makes data storage more reliably secure.
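Steps 101B to 104B can be sketched roughly as follows. The text does not define the stripping or splitting specifications, so this illustration assumes that the data object is a dictionary, that the stripping specification is a set of metadata field names, that the content is split into contiguous byte ranges, and that each "storage body" is a plain list. All names are hypothetical.

```python
def strip_metadata(data_object, metadata_keys):
    """Separate the agreed metadata fields from the rest of the object."""
    metadata = {k: v for k, v in data_object.items() if k in metadata_keys}
    remainder = {k: v for k, v in data_object.items() if k not in metadata_keys}
    return metadata, remainder


def split_content(content, parts=2):
    """Split bytes into `parts` contiguous fragments (at least two)."""
    if parts < 2:
        raise ValueError("content must be split into at least two fragments")
    size = -(-len(content) // parts)  # ceiling division
    return [content[i * size:(i + 1) * size] for i in range(parts)]


def store_separately(metadata, fragments, stores):
    """Place the metadata and each fragment in a different store."""
    items = [metadata] + fragments
    if len(stores) < len(items):
        raise ValueError("need one store per item")
    for store, item in zip(stores, items):
        store.append(item)


data_object = {"name": "report", "type": ".docx", "content": b"abcdefg"}
metadata, remainder = strip_metadata(data_object, {"name", "type"})
fragments = split_content(remainder["content"], parts=2)
stores = [[], [], []]
store_separately(metadata, fragments, stores)
print(stores)
```

Joining the fragments in order restores the content in this sketch; a real splitting specification could interleave or transform the data so that no single fragment is readable on its own.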
FIG. 2B-1 is a flowchart of a data splitting method according to another exemplary embodiment. As shown in FIG. 2B-1, the present invention provides a data splitting method, including:
Step 201B: receive a storage request carrying an identifier of data to be stored.
The data splitting method may be applied in devices such as a terminal (client device) or a network side (server device). The device receives a storage request carrying the identifier of the data to be stored. The storage request may be triggered by a terminal application, such as a mail system or the desktop agent mentioned earlier. Taking a mail system as an example: when sending file data, the mail system receives a storage request carrying the identifier of the data to be stored, and its data splitting apparatus first splits the file data, so that the recipient of the mail must obtain the file data fragments from each designated storage body to recover the complete file data. Alternatively, the storage request is triggered by the user: if the user wants to split a file before storing it, the data splitting apparatus receives a storage request carrying the identifier of the data to be stored and then splits the file. The identifier of the data to be stored may be identifying information such as the name of the file data or a code (for example, the file's MD5 code, where MD5 stands for Message Digest Algorithm 5).
步骤202B、若预设元数据剥离规约中约定的元数据包括:属性信息,则将待存储数据标识对应的数据对象中与该属性信息匹配的属性信息内容确定为元数据。Step 202B: If the metadata specified in the preset metadata stripping protocol includes: attribute information, determine, in the data object corresponding to the data identifier to be stored, the attribute information content that matches the attribute information as metadata.
剥离元数据的过程是将数据对象的元数据、特别是关键元数据从数据对象中、从其原有的位置处剥离出来,来达到仅仅通过数据内容和/或剩下的其他元数据信息无法访问、识别、正确读取出、或使用原始数据对象的目的。其中,关键元数据是与安全相关的元数据,一旦缺少了这些关键元数据,系统将无法正常读取、识别、解码或还原出相应的数据对象。The process of stripping metadata is to separate the metadata of the data object, especially the key metadata, from the data object from its original location, so that only the data content and/or other metadata information remaining cannot be obtained. The purpose of accessing, identifying, correctly reading, or using raw data objects. Among them, the key metadata is security-related metadata. Once these key metadata are missing, the system will not be able to read, identify, decode or restore the corresponding data objects.
For example, for data that exists as files in a Windows system, the file type is a piece of key metadata. When the type information of a file is removed (in Windows, by removing the file extension), the system can no longer open the file content normally. Storing the type information of the file and the file content data in different cloud storage services makes it harder for a malicious attacker or a service provider to obtain the complete data. Different types of data have different key metadata; for tabular data (a spreadsheet, a database table, and so on), the header (field names) is a piece of key metadata. In practical applications, metadata can cover a wider scope: as long as it benefits data security, any information related to the data content can be treated as metadata and stripped from the data content itself. Here, the metadata includes attribute information. Attribute information is information that identifies some distinctive property of the data object; it consists of descriptive information used to help find and open the data object. Attributes are not contained in the actual content (data content) of the data object; instead they provide information about the data object, such as its size, data type, creation and modification dates, author, and rating. Because attribute information can be defined by a person skilled in the art according to the nature of the data object, the items listed above are only examples and do not limit the content of the attribute information.
Alternatively, if the metadata specified in the preset metadata stripping protocol includes a data content identifier and keywords, then, according to the data content identifier, the data content matching the keywords is determined as metadata from the data content of the data object.
The data content identifier indicates that the metadata is to be extracted from the data content portion, and the keywords indicate the specific data content to be extracted. The data content matching the keywords may be key information or sensitive information contained within the data. For example, for a bank statement, several keywords associated with the account information can be set so that the sensitive information in the account, such as the account number, the ID card number of the user, the telephone number of the user, and the address, is extracted and stored as metadata.
Alternatively, if the metadata specified in the preset metadata stripping protocol includes attribute information, a data content identifier, and keywords, then the attribute information content in the data object that matches the attribute information is determined as metadata and, according to the data content identifier, the data content matching the keywords is also determined as metadata from the data content of the data object.
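By way of non-limiting illustration, the three cases of step 202B can be sketched as follows. The dictionary layout of the data object and of the stripping protocol, and the names `strip_metadata`, `from_content`, and `keywords`, are assumptions made for this sketch only and are not defined by the specification.

```python
def strip_metadata(data_object, protocol):
    """Split a data object into (metadata, remainder) per a stripping protocol.

    Hypothetical layout: data_object = {"attributes": {...}, "content": {...}};
    protocol = {"attributes": [...], "from_content": bool, "keywords": [...]}.
    """
    metadata = {"attributes": {}, "content": {}}
    remainder = {
        "attributes": dict(data_object["attributes"]),
        "content": dict(data_object["content"]),
    }
    # Case 1: attribute information named in the protocol becomes metadata.
    for name in protocol.get("attributes", []):
        if name in remainder["attributes"]:
            metadata["attributes"][name] = remainder["attributes"].pop(name)
    # Case 2: the data content identifier says metadata also comes from the
    # content portion; the keywords select which content items to extract.
    if protocol.get("from_content"):
        for keyword in protocol.get("keywords", []):
            if keyword in remainder["content"]:
                metadata["content"][keyword] = remainder["content"].pop(keyword)
    return metadata, remainder


# Illustrative bank statement; all values are made up.
statement = {
    "attributes": {"file_name": "statement.csv", "file_type": "csv", "size": 2048},
    "content": {"account_number": "6222-0000", "phone": "555-0100", "balance": "100.00"},
}
protocol = {
    "attributes": ["file_type"],
    "from_content": True,
    "keywords": ["account_number", "phone"],
}
metadata, remainder = strip_metadata(statement, protocol)
```

In this sketch the third case (attribute information together with keywords) is simply both branches applied to the same object; after the call, `remainder` no longer carries the file type or the sensitive account fields.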
The policy for generating the preset metadata stripping protocol may be decided by the developer, or users may be allowed to define a protocol that suits them. In the latter case, the system needs to present the metadata to the user as comprehensively as possible so that the user can preset the most appropriate metadata stripping protocol based on that information. The preset metadata stripping protocol is built into the data splitting system; in the earlier mail client example, it can be built into the application of the mail system. The preset metadata stripping protocol may also be stored together with the metadata as part of the metadata content, so that the recipient can refer to it when merging the data object.
Returning to the mail client example, the attachment file to be sent (the data object) is split. The metadata of the attachment file may include, for example, the file name, file type, file size, and creation time. The result of stripping the file metadata is stored in a file meta-information system. The method used to split the file content and the resulting split information, such as the hash value or ID of each file fragment and the storage location of each fragment, are also stored in the file meta-information system and associated with the corresponding file metadata. In effect, everything stored in the file meta-information system as described above together constitutes an instance of the splitting/stripping protocol.
Step 203B: Strip the obtained metadata from the data object.
Stripping, which may also be called splitting, refers to separating from the data object the metadata that has been selected from it and that is relevant to the splitting/stripping processing of the data object. The system separates the metadata from the data object according to the preset metadata stripping protocol (which may be the system default, selected by the user, or defined by the user). The protocol records the rules, constraints, methods, and other information involved in metadata splitting/stripping, for example but not limited to: the stripping location of the metadata, the stripping method, the encoding scheme, information related to code stripping, content splitting rules, and other data and/or information related to content splitting. The metadata may be the full set or a subset of the metadata of the data object. For the types of metadata, refer to the cases described in step 202B above.
There are many ways to split data. For example, a data object can be split directly into multiple fragments according to a predetermined rule and the fragments stored separately. That approach, however, can neither support finer-grained encryption measures nor separate the important information closely related to the data object (the metadata) from the data content itself. The present invention adopts a new data splitting method. This method can not only split the data object at a finer granularity (for example per character, or even per bit), but can also strip the important information closely related to the data object (that is, the metadata) from the data content itself. The stripped metadata, the data content, and/or the codes described later can then be stored separately in different storage locations or spaces, or under different secure channels, which secures the stored data more reliably.
Step 204B: Divide the data content into at least two data fragments according to the preset data content splitting protocol.
Content splitting means dividing the data content of a data object into several (more than one) fragments according to certain rules, much like tearing a sheet of paper into pieces. Content splitting is not mandatory, however, and depends on actual needs; applications without strict confidentiality requirements for the content may skip it. Content splitting may use RAID (disk array) technology, which divides data into multiple blocks and writes them to multiple disks in parallel to improve disk read/write speed and throughput.
Content splitting falls into two kinds: domain-specific and domain-independent. Domain-specific content splitting splits data according to the characteristics of the data in a particular domain, for example a structural split based on a specific file format, or a split of the key or sensitive information inside the data. The latter may overlap to some extent with metadata stripping (when the metadata lies inside the data). For a bank statement, for instance, the account information can be stripped out as metadata, or it can be split out as a data fragment and stored separately.
Further, the preset data content splitting protocol may include at least one of a RAID (disk array) splitting algorithm and an information dispersal algorithm (IDA). The researcher Michael O. Rabin first proposed the information dispersal algorithm in 1989. It slices data at the bit level so that the data is unrecognizable while it is transmitted over a network or stored in an array, and only a user or device with the correct key can access it; when the data is accessed with the correct key, the information is reassembled. In the field of distributed storage, IDA and its derivative algorithms are widely used.
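By way of non-limiting illustration, a domain-independent content split in the style of RAID striping can be sketched as follows. This is plain round-robin striping without parity, not Rabin's information dispersal algorithm; the function names `stripe` and `unstripe` and the `block` parameter are assumptions made for this sketch.

```python
def stripe(data: bytes, n: int, block: int = 1) -> list:
    """Distribute consecutive blocks of `data` round-robin over n fragments."""
    if n < 2:
        raise ValueError("content splitting needs at least two fragments")
    fragments = [bytearray() for _ in range(n)]
    for i in range(0, len(data), block):
        fragments[(i // block) % n] += data[i:i + block]
    return [bytes(f) for f in fragments]


def unstripe(fragments: list, block: int = 1) -> bytes:
    """Reassemble the original content from all fragments, in order."""
    n = len(fragments)
    offsets = [0] * n
    out = bytearray()
    k = 0
    while True:
        chunk = fragments[k % n][offsets[k % n]:offsets[k % n] + block]
        if not chunk:  # first exhausted fragment marks the end of the data
            break
        out += chunk
        offsets[k % n] += block
        k += 1
    return bytes(out)


fragments = stripe(b"ABCDEFG", 3)
restored = unstripe(fragments)
```

Each fragment on its own holds only every n-th block of the content, so no single storage body holds the complete data; all fragments are needed to restore it.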
Step 205B: Encode each data fragment according to the preset code separation protocol to obtain the code corresponding to each data fragment.
In this embodiment, optionally, encoding each data fragment according to the preset code separation protocol to obtain the code corresponding to each data fragment includes:
querying the code repository according to the preset code separation protocol, selecting or creating an encoding protocol according to at least part of the metadata, and generating a meta-code corresponding to the metadata according to the encoding protocol; and encoding each data fragment according to the encoding protocol to obtain the instance code corresponding to each data fragment;
or,
sending each data fragment and the metadata to the code repository according to the preset code separation protocol, so that the code repository selects or creates an encoding protocol according to at least part of the metadata, generates a meta-code corresponding to the metadata according to the encoding protocol, and encodes each data fragment according to the encoding protocol to obtain the instance codes; and receiving the meta-code and the instance codes returned by the code repository.
It should be noted that the specific encoding procedure is described in detail in the encoding processing method embodiments later in this specification and is not repeated here.
Step 206B: Arrange the codes according to the original order of the data fragments in the data content to obtain the arrangement order information of the codes.
As described above, the data splitting method of the present invention covers two different kinds of data processing: the stripping of metadata and codes, and the splitting of data content. Metadata stripping has been explained above. Code stripping here means that, after the data content is split into n data fragments, the n fragments are stored together or separately and n corresponding codes (numbers) are obtained; these codes (numbers) may repeat, and they are arranged in the order in which the data fragments appear. The resulting code (number) sequence contains both the code information and the arrangement order information of the codes, and the encoding result can be stored in another secure channel. Because a code differs in nature from the data fragments, separating it out is called stripping. In most cases only the data content portion of the data object needs to be split, and the stripped metadata and/or codes need not be split again; if required, however, they can be split further to achieve finer-grained protection. Stripping and splitting can be combined without limit, depending on system requirements and processing capacity.
In most cases, code stripping is based on content splitting: part or all of the data content is first split according to certain rules, the addressing of each split piece of data is encoded, and the final encoding result forms separate data. In the computer field, reference codes for data are ubiquitous, for example the key used to address a data record in a database, the shortened URL that makes a web address easier to enter and cite (http://dwz.cn/mzot4), and the access identifier used in a cloud storage programming interface (API). All of these encoding schemes can be used for the codes described above. If what is encoded is the result of splitting part of the data content, the encoding result replaces the corresponding original data. Sometimes, however, encoding need not be based on content splitting. For data with a low confidentiality level, for example, the data content need not be split; if required, assigning a single code to the entire data content is sufficient, although the code may still need to be separated from the data content. The code stripping of this embodiment therefore differs both from conventional content splitting and from existing data reference encoding; it is a combination of the two. As long as the encoding result (including the codes themselves and their order of combination) is separated from the data content, the security risk to the data is reduced to some extent. For example, take the 6 bytes of data ACBDAC, split the data every two bytes, and place the fragments in a database. AC returns code 1 and BD returns code 2. The encoding result of this data is the sequence 121, not merely 1 and 2. The digits 1 and 2 are the codes, and the arrangement 1, 2, 1 is the arrangement order information of the codes.
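By way of non-limiting illustration, the ACBDAC example can be reproduced with the following sketch. An in-memory dictionary stands in for the database, and codes are assigned as consecutive integers in order of first appearance; both choices, and the names `strip_codes` and `merge_codes`, are assumptions made for this sketch.

```python
def strip_codes(content: bytes, size: int):
    """Split content into fixed-size fragments and replace each by a code.

    Returns (repository, sequence): repository maps fragment -> code,
    sequence is the list of codes in the original fragment order.
    """
    repository = {}
    sequence = []
    for i in range(0, len(content), size):
        fragment = content[i:i + size]
        if fragment not in repository:
            # A repeated fragment reuses its code, so codes may repeat.
            repository[fragment] = len(repository) + 1
        sequence.append(repository[fragment])
    return repository, sequence


def merge_codes(repository, sequence) -> bytes:
    """Restore the content; needs both the repository and the sequence."""
    by_code = {code: fragment for fragment, code in repository.items()}
    return b"".join(by_code[code] for code in sequence)


repository, sequence = strip_codes(b"ACBDAC", 2)
restored = merge_codes(repository, sequence)
```

The repository alone reveals only the distinct fragments AC and BD; the order 1, 2, 1 is carried solely by the sequence, which is why the two can be kept in different secure channels.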
In practical applications, the above methods of stripping/splitting metadata, codes, and data content are not mutually exclusive and can be used together. For example but without limitation, as described above, only the metadata may be separated from the data content; only the codes may be separated from the data content; or the codes may be treated as a special kind of metadata and kept together with the other metadata, provided they are separated from the data content. More preferably, the three parts (metadata, codes, data content) are each separated according to their own splitting protocols.
In addition, steps 202B to 206B, that is, content splitting, metadata stripping, and code stripping, have no required order of execution; they may be performed separately, interleaved with one another, or performed simultaneously. Usually, however, the encoding operation of the present invention needs to be performed during or after content splitting, and when content splitting is not required the encoding operation may be omitted as well. Metadata stripping may be completed before content splitting, or it may be performed after content splitting and code assignment are complete. Other data processing methods, such as data compression and encryption, may also be mixed in, for example before or after each splitting step, that is, between steps 202B and 206B. Descriptions of compression and encryption may also be added to the protocols above, but in that case the metadata splitting step is preferably performed after compression and/or encryption has been carried out.
Step 207B: Store the metadata, the codes corresponding to the data fragments, and the arrangement order information of the codes in different storage bodies or in different secure channels.
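By way of non-limiting illustration, the separate storage of step 207B can be sketched with three independent in-memory stores. The three-way division used here (metadata, code-to-fragment mapping, code sequence), the use of plain dictionaries as storage bodies, and the names `store_separately` and `restore` are assumptions made for this sketch.

```python
def store_separately(metadata, fragments_by_code, code_sequence):
    """Place the three parts in three independent stores (dicts here)."""
    metadata_store = {"metadata": dict(metadata)}
    code_store = dict(fragments_by_code)          # code -> data fragment
    order_store = {"sequence": list(code_sequence)}
    return metadata_store, code_store, order_store


def restore(metadata_store, code_store, order_store):
    """Rebuild the data object; every one of the three stores is required."""
    content = b"".join(code_store[code] for code in order_store["sequence"])
    return {"metadata": metadata_store["metadata"], "content": content}


# Hypothetical inputs based on the ACBDAC example (AC -> 1, BD -> 2).
stores = store_separately({"file_type": "txt"}, {1: b"AC", 2: b"BD"}, [1, 2, 1])
data_object = restore(*stores)
```

In a real deployment each store would sit behind a different storage service or secure channel, so that compromising one of them yields neither the fragment order nor the key metadata.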
On the basis of the above embodiment, further, if the metadata specified in the preset metadata stripping protocol includes a data object identifier, then obtaining the metadata in the data object corresponding to the identifier of the data to be stored according to the preset metadata stripping protocol includes: parsing the data object to generate a data object identifier that uniquely corresponds to the data object.
Further, when the data object is audio data, step 204B, dividing the data content into at least two data fragments according to the preset data content splitting protocol, may include: splitting the audio data using a time-domain analysis method or a frequency-domain analysis method to obtain the audio data objects to be encoded, where the audio data objects to be encoded include sound segments and/or silence segments.
Specifically, speech is an earlier and more natural form of expression than writing. Yet in the world of computers and the Internet, which is increasingly bound up with human work and life, speech data and its processing have always been second-class citizens. The main cause is the current way speech data is input, stored, and processed, together with the corresponding technical limitations. People currently process and use speech input through computers and networks in two main ways: voice calls and speech recognition.
A voice call means that the speech signal produced by a person is converted into a digital signal by a computer sound capture device, then processed, transmitted, and stored by computers and computer or communication networks (here mainly packet-switched voice technologies such as VoLTE; circuit-switched voice technologies are not relevant to this discussion), and finally played back by a digital audio playback device. A voice call may be real-time or non-real-time, one-way or two-way. The main problem with voice calls today is the large volume of data, which is hard to transmit and store. Common audio sampling rates for current sound cards are 11 kHz, 22 kHz, and 44.1 kHz. Sound sampled at 11 kHz is called telephone quality (telephones use an 8 kHz sampling rate) and is basically enough to recognize the voice of the speaker; 22 kHz is called broadcast quality; 44.1 kHz is CD quality. The higher the sampling rate, the better the quality of the audio data and the more storage it occupies. Another sampling parameter is the sampling resolution, the amount of data occupied by one sound sample (generally the sound wave amplitude). 8 bits and 16 bits are common: 8 bits divide the sound signal into 256 levels, while 16 bits divide it into more than 60,000 levels. It follows that one second of 8-bit stereo (left and right channels) audio sampled at 11 kHz occupies 22 KB, which is equivalent to the data volume of more than ten thousand Chinese characters. In two-way, real-time voice call applications, currently the most common kind, users rarely record and keep the call data, mainly because audio data takes up a lot of storage and cannot be searched or queried. Some applications do keep the result of a one-way call, but they generally limit the size of the retained data. The "hold to talk" function of WeChat, for example, has a 1-minute limit, whereas its text messages have essentially no limit and sending a million characters is no problem; similarly, Skype has a voice message function whose message length is also limited, to at most 10 minutes. The speech data commonly found today is mostly digital audiobooks, such as storytelling (pingshu), crosstalk (xiangsheng), lectures, and audio e-books. They are generally stored in audio files (in formats such as MP3, WMA, and MOV) or accessed in real time through network streaming protocols (such as RTSP, MMS, RTP, and RSVP). People generally learn about the audio data through metadata outside the audio data itself (such as the ID3v1 and ID3v2 information in MP3). As for the inside of audio data being heard for the first time, unless auxiliary text positioning information (such as a subtitle file) exists, it cannot be searched or located at random and can only be listened to sequentially.
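By way of illustration, the 22 KB per second figure follows from multiplying the sampling rate, the sample size, and the number of channels. The sketch below assumes that the nominal 11 kHz rate is 11,025 Hz and that the audio is uncompressed PCM; the function name is hypothetical.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels // 8


# The case discussed in the text: 11 kHz (assumed 11,025 Hz), 8-bit, stereo.
low_rate = pcm_bytes_per_second(11_025, 8, 2)

# For comparison: CD quality, 44.1 kHz, 16-bit, stereo.
cd_rate = pcm_bytes_per_second(44_100, 16, 2)

# Number of amplitude levels for the two common sampling resolutions.
levels_8_bit = 2 ** 8
levels_16_bit = 2 ** 16
```

Under these assumptions `low_rate` is 22,050 bytes per second, that is, about 22 KB, and at two bytes per Chinese character this corresponds to roughly eleven thousand characters.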
Speech recognition: as already noted, text data is the first-class citizen of current computer systems. Text data is standardized and easy to store, view, search, retrieve, and process. Speech recognition, which converts speech input into text data, therefore allows the input data to be used more effectively. There are two problems, however: information loss and recognition rate. Natural human speech contains information beyond the corresponding text content. At present, once speech has been recognized and converted into standard text, the original speech data is generally not kept, so this information is lost. It mainly includes pronunciation, intonation, tone of voice, timbre, pauses, and so on, which may carry mood and emotion. The recognition rate is a major obstacle that keeps speech recognition from being the preferred form of human input to computers. For speaker-dependent speech recognition that has undergone some recognition training, the recognition rate is quite high and can exceed 90%. As a result, digital voice assistants such as Siri from Apple, Echo from Amazon, Cortana from Microsoft, and Google Now have grown especially fast in recent years, and some people can already use a digital voice assistant in place of a traditional search engine. Language and accent problems, however, also keep many people away from these applications. Speech training and speech recognition have a chicken-and-egg relationship: for lack of training data, the recognition rate for a particular group of people will not be very high; conversely, because of the low recognition rate, that group has little enthusiasm for using speech recognition, which leaves the system without enough sample data to analyze and optimize. In addition, speech recognition intended for text entry has difficulty recognizing punctuation and text control, which reduces input efficiency.
In summary, voice call data preserves the original speech information, but the data volume is large and ill-suited to automatic analysis and processing by computer. Speech recognition produces text data that is easy for a computer to transmit, store, analyze, and process, but some of the original speech information is lost along the way; moreover, the accuracy and reliability of current speech recognition are not guaranteed, and there is no effective way to obtain voice sample data from most people in order to improve the recognition rate.
This embodiment proposes a compromise method for processing the original speech data that both preserves the original speech data and produces text data that is easy for a computer to transmit, store, analyze, and process. The key point is that this text data is not a standard text encoding but a private encoding specific to a particular person. The speech data corresponding to the codes is stored in a specific text code repository, and the speech data in the code repository is encoded separately for each user. Users can set access permissions on their own speech data for different users. As shown in FIG. 2B-2, the system is broadly divided into two parts: the code repository and the services built around its data. The speech input process is as follows: 1. the user logs in to the code repository and selects the speech text input system; 2. the speech text input system registers a series of encoders with the code repository for the current user; 3. the user inputs continuous speech to the speech text input system; 4. the speech text input system stores the input of the user in an input buffer; 5. the speech text input system splits the speech data in the input buffer into different data objects according to certain rules; 6. the speech text input system submits the data to the data warehouse through the corresponding encoder and obtains the corresponding code; 7. the speech text input system stores the obtained code in the text input result and clears the corresponding content of the input buffer; 8. steps 3 to 7 are repeated, so that the speech text input system continually obtains user input and the corresponding codes; 9. when the user stops inputting and the input buffer contains no data, the whole speech input process is complete.
As can be seen, segmenting the speech data in the input buffer is a key step. This is in fact a mature technique in speech data processing called "endpoint detection" or "voice activity detection". Two methods are common: time-domain analysis and frequency-domain analysis. Time-domain analysis is used as the example here. FIG. 2B-3 is a time-domain analysis plot of a piece of audio data. Silence is defined as the amplitude staying below a certain threshold (here 0.005) for a sustained period (here 20 ms). A silence shorter than 50 ms is divided directly in the middle: the part before the midpoint belongs to one segment and the part after it to the next. A silence of 50 ms or longer is divided at its start and at its end. The audio is thus divided into nine segments: 901 ms of silence, a 949 ms sound segment, 421 ms of silence, a 2558 ms sound segment, a 337 ms sound segment, a 578 ms sound segment, 368 ms of silence, a 1209 ms sound segment, and 679 ms of silence. Two code types are used here: sound segment codes, written as the letter V followed by the corresponding number, and silence codes, written as the letter S followed by the duration of the silence in milliseconds. The data in the speech text code table of this user in the code repository is shown in FIG. 2B-4. The corresponding text encoding is therefore: S901 V001 S421 V002 V003 V004 S368 V005 S679
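By way of non-limiting illustration, the time-domain segmentation rule and the S/V coding can be sketched as follows. The handling of a short silence at the very beginning or end of the input, the sequential numbering of sound segments (a real code repository might reuse a code for a recurring segment), and the synthetic test signal sampled at 1000 Hz are all assumptions made for this sketch.

```python
def segment(samples, rate, threshold=0.005, min_silence_ms=20, long_silence_ms=50):
    """Return a list of (kind, start, end) with kind 'S' (silence) or 'V' (sound)."""
    n = len(samples)
    min_len = rate * min_silence_ms // 1000
    long_len = rate * long_silence_ms // 1000
    # 1. Find maximal quiet runs that last at least min_silence_ms.
    runs, i = [], 0
    while i < n:
        if abs(samples[i]) < threshold:
            j = i
            while j < n and abs(samples[j]) < threshold:
                j += 1
            if j - i >= min_len:
                runs.append((i, j))
            i = j
        else:
            i += 1
    # 2. Cut at both ends of long silences, in the middle of short ones.
    segments, pos = [], 0
    for start, end in runs:
        if end - start >= long_len:
            if start > pos:
                segments.append(("V", pos, start))
            segments.append(("S", start, end))
            pos = end
        elif start > pos and end < n:
            mid = (start + end) // 2
            segments.append(("V", pos, mid))
            pos = mid
    if pos < n:
        segments.append(("V", pos, n))
    return segments


def encode(segments, rate):
    """Silence -> S<duration in ms>; sound -> V<sequential three-digit number>."""
    codes, voice_no = [], 0
    for kind, start, end in segments:
        if kind == "S":
            codes.append(f"S{round((end - start) * 1000 / rate)}")
        else:
            voice_no += 1
            codes.append(f"V{voice_no:03d}")
    return " ".join(codes)


# Synthetic signal at 1000 Hz (one sample per ms): (amplitude, length in ms).
layout = [(0.0, 901), (0.5, 949), (0.0, 421), (0.5, 2543), (0.0, 30), (0.5, 307),
          (0.0, 30), (0.5, 563), (0.0, 368), (0.5, 1209), (0.0, 679)]
signal = [amp for amp, length in layout for _ in range(length)]
text_codes = encode(segment(signal, rate=1000), rate=1000)
```

The two 30 ms silences in the synthetic signal are split in the middle, which yields the adjacent sound segments of 2558 ms, 337 ms, and 578 ms, so `text_codes` matches the sequence given in the text.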
In this way, 8 seconds of audio data are converted into 9 special text characters. At four bytes per character (this depends on the specific encoding scheme; a context-dependent, object-based encoding can achieve an average length of four bytes), the whole encoding result is only 36 bytes, roughly 1/5000 of the original 176K of audio data (22K/s × 8 s). The encoding result is therefore far more convenient and efficient to store, transmit, edit, and mix with other data. Only a user who ultimately needs to play the sound has to fetch the corresponding data from the encoding repository and restore the audio content.
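The size comparison can be checked with a few lines of arithmetic. The helper name is hypothetical, and "22K/s" is read here as 22,000 bytes per second, which is an assumption about the unit.

```python
def compression_ratio(raw_bytes_per_s: int, seconds: int,
                      chars: int, bytes_per_char: int) -> float:
    """Ratio of raw audio size to the size of its text encoding."""
    raw = raw_bytes_per_s * seconds      # 22,000 * 8 = 176,000 bytes
    encoded = chars * bytes_per_char     # 9 * 4 = 36 bytes
    return raw / encoded


print(round(compression_ratio(22_000, 8, 9, 4)))  # 4889, i.e. roughly 1/5000
```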
It is worth noting that separating the encoding from the content makes it easy to place the codes and the data content in different secure channels, which gives the method an inherent degree of security.
The voice data stored in the encoding repository is also tied directly to a specific person, so it serves well as training samples for analysis and organisation. Existing speech analysis and recognition techniques can already extract a great deal of useful information from it, such as pitch, timbre, tone, and syllables, as well as more effective feature parameters such as MFCC and LPCC parameters. All of these can be stored in the encoding repository to provide further encoding services for the corresponding voice codes, such as content search and matching, content normalisation, and content selection.
Voice text output. The voice text content obtained, that is, the encoding result, can be output in two different ways: graphic output, which is mainly textual display, and audio playback, which is mainly voice playback.
Graphic output. Graphic output of voice text means presenting voice text in the same way as ordinary text, that is, as typeset text. The benefit is that existing text processing methods and tools can be used to process voice text. Supporting graphic output also allows voice text to appear in the same text document as traditional text and other forms of text (such as graphic text and picture text), which supports a richer range of applications.
How voice text is presented depends on the user's access rights.
1. For a text output system that supports multiple text types, if the user has no access rights at all to the text codes (including the text type information), the user sees only the codes themselves, for example as presented in Figure 2B-5.
2. If the user can obtain the type information of the codes but cannot access the content of each audio text code, the system can present a run of consecutive voice text codes (voice data codes, silence duration codes, and so on) as a single unit, for example: "+ a piece of unauthorised voice text (9 characters, 4 of them silent characters; total silence duration 2'369)". When the user expands the quoted item, more detail can be shown, as in Figure 2B-6.
As the figure shows, the user can see not only each voice character but also, directly, the silence durations. With this information the system can also offer related search functions, such as searching for silences (with or without a duration constraint).
3. Furthermore, if the user is entitled to obtain the voice data corresponding to the voice characters, the system can display more information and let the user play the voice content, for example by displaying "+ voice content, duration 8' (5 voice characters, 4 silent characters; total silence duration 2'369)
Figure PCTCN2015086672-appb-000011
". When the user expands this voice text, more detail is shown, as in Figure 2B-7.
The user can click any voice character to play it. In graphic output, voice text can be visualised in several forms, such as a waveform, a spectrogram, or a visualised duration, depending on the needs of the application. The results of analysing a voice character, or semantic tags the user has added to a character, can be shown alongside it. As shown in Figure 2B-8, the third and fourth audio characters also display the results of an analysis based on Hanyu Pinyin sound units.
Because the encoding repository information for the audio characters is accessible, the system's text search can also offer more search controls, such as searching by semantic tags entered by the user.
The output process for a single voice character (including a silent character) is as follows:
1. The user logs in to the encoding repository.
2. The system extracts the meta code from the target character code.
3. The system submits the character's meta code to the encoding repository.
4. The encoding repository checks access rights based on the meta code and the current user. If access is denied, it returns an error to the system; the system produces graphic output from the character code alone, and the process ends. If access is allowed, it returns the corresponding encoding metadata to the system, and the process continues.
5. The system extracts the instance code from the target character code.
6. The system parses the instance code according to the encoding metadata. Specifically, for a silent character, the instance code is parsed as the silence duration; for an audio character, the character code is submitted to the encoding repository. The encoding repository checks access rights based on the audio encoding settings and the current user. If access is denied, it returns an error; if access is allowed, it fetches the corresponding voice data and returns it to the system.
7. The system produces graphic output for the character from the parsed or fetched data.
8. If the system receives a play request from the user, it reconstructs the waveform data from the voice data and plays it.
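The steps above can be sketched as a small access-controlled lookup. Everything here is a hypothetical stand-in: the `CodeRepository` class, the two permission sets, the rendering strings, and the assumption that the meta code is the leading letter of a code such as `S901` or `V001` while the instance code is the remainder. Login (step 1) and playback (step 8) are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


class AccessDenied(Exception):
    """Raised by the repository when the current user may not read an item."""


@dataclass
class CodeRepository:
    """Hypothetical in-memory stand-in for the encoding repository."""
    meta: Dict[str, str] = field(
        default_factory=lambda: {"S": "silence", "V": "voice"})
    audio: Dict[str, bytes] = field(default_factory=dict)
    meta_readers: Set[str] = field(default_factory=set)    # may read type info
    audio_readers: Set[str] = field(default_factory=set)   # may read voice data

    def get_metadata(self, user: str, meta_code: str) -> str:
        if user not in self.meta_readers:
            raise AccessDenied(meta_code)
        return self.meta[meta_code]

    def get_audio(self, user: str, char_code: str) -> bytes:
        if user not in self.audio_readers:
            raise AccessDenied(char_code)
        return self.audio[char_code]


def render_char(repo: CodeRepository, user: str, char_code: str) -> str:
    """Return a textual rendering of one voice or silent character."""
    meta_code, instance_code = char_code[0], char_code[1:]   # steps 2 and 5
    try:
        kind = repo.get_metadata(user, meta_code)            # steps 3 and 4
    except AccessDenied:
        return f"[{char_code}]"                              # bare code only
    if kind == "silence":                                    # step 6
        return f"[silence {int(instance_code)} ms]"
    try:
        data = repo.get_audio(user, char_code)
    except AccessDenied:
        return "[voice, not authorised]"
    return f"[voice {len(data)} bytes]"                      # step 7
```

The three outcomes correspond to the three presentation levels described earlier: no access to the codes' type information, access to type information only, and access to the voice data itself.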
When several consecutive characters are output, the system needs to obtain all the corresponding voice characters and their related data and render them graphically, in visualised form, according to the applicable typesetting rules. If a play request is received from the user, a playback buffer is set up and the audio data is played in sequence (the silent characters must be played as well).
Voice playback. Voice playback of voice text resembles the playback of traditional audio data and does not need to consider the graphic typesetting of text. Playback, however, is also subject to the user's access rights: voice text can be played only if the user has been granted access to the data that corresponds to it.
Besides time-based positioning, as in traditional voice playback, voice text supports rich search-based positioning, for example by voice duration, silence duration, semantic tags, or traditional text mixed into the voice text.
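Because silent characters carry their duration in the code itself, a search by silence duration needs no access to the audio data. The sketch below assumes the space-separated `S<ms>` / `V<number>` notation from the earlier example; the function name and return shape are illustrative only.

```python
import re
from typing import List, Optional, Tuple


def search_silences(text: str, min_ms: int = 0,
                    max_ms: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return (token_index, duration_ms) for silent characters within the bounds."""
    hits = []
    for index, token in enumerate(text.split()):
        match = re.fullmatch(r"S(\d+)", token)
        if match is None:
            continue                      # not a silent character
        ms = int(match.group(1))
        if ms >= min_ms and (max_ms is None or ms <= max_ms):
            hits.append((index, ms))
    return hits


codes = "S901 V001 S421 V002 V003 V004 S368 V005 S679"
print(search_silences(codes, min_ms=500))  # [(0, 901), (8, 679)]
```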
It is worth noting that mixing voice text with traditional text makes possible many effects that traditional voice playback cannot achieve, such as embedded subtitles, embedded structured navigation information, embedded photo links, and embedded graphics.
Voice text editing. Encoding audio data as text makes it possible to edit voice data in the same way as traditional text. With voice text in graphic output, the user can easily delete, insert, and modify any character, and can also perform traditional text operations such as find, replace, copy, and paste.
Some of these operations require dedicated audio services, for example changing a silence duration, splitting one audio character into several, or merging several voice characters into one.
From the above it can be seen that turning audio data into text gives people more opportunities to use computers to express themselves and communicate by voice safely and effectively. The method may nevertheless raise some questions.
Noise removal. Audio recorded in an ordinary environment usually contains ambient noise. If it is segmented, encoded, and then played back, noisy voice character data is played together with noise-free silent characters. Will that sound odd?
This is indeed a problem. The solution is straightforward: apply uniform denoising to the audio data before it is stored. Automatic denoising is already a fairly mature technique, and removing noise from pure speech is easier still.
The human ear can perceive sound frequencies from 20 Hz to 20 kHz. The human vocal organs produce sound at roughly 80 Hz to 3400 Hz, and the signal frequency of speech is usually 300 Hz to 3000 Hz. For a specific individual the range is generally narrower still. In addition, normal indoor conversation is at roughly 20 to 60 decibels. Based on this frequency range, high-frequency and low-frequency noise can be removed automatically. Voice detection based on sustained low-decibel intervals yields the silent sections automatically. Spectral analysis of the silent sections then allows noise to be filtered from the whole audio data. Note that some silent sections contain frequencies in the same range as the speech itself; automatic filtering must make sure that audio in non-silent sections is not reduced to low-decibel silence.
Voice data that has undergone overall noise removal then plays harmoniously together with the completely silent silent characters.
In practice, segmentation and denoising generally do not wait until the voice data has been obtained in full. A buffer of a few seconds can be set up in memory and analysed and processed. The noise characteristics identified can nevertheless be accumulated, then reused and updated in later audio processing.
Real-time voice calls. Since this method is built on segmenting voice data, is it unsuitable for voice applications with strict real-time requirements? To a degree, yes. The method remains applicable to voice applications that can tolerate a delay of a few seconds. If the real-time requirements are very strict, voice segmentation cannot be performed. Even for those applications, however, the method can be used to record the voice, avoiding the large data volumes and editing difficulties of traditional voice recording.
Voice transmission. In a traditional voice call application, voice data can be sent directly to the receiver. In this method, the voice text is sent to the receiver, who then fetches the actual voice data from the encoding repository. Will this process be inefficient?
In practice, the encoding repository for a network-based voice call application should be deployed in a cloud data centre. Data centres today generally provide CDN (content delivery network) services, which automatically select the fastest route for transferring data. The process can therefore be made highly efficient; this depends entirely on how the encoding repository is deployed.
On the other hand, because codes and data are separated, the sender can hide some or all of the voice data even after it has been sent. A receiver who has received the voice codes then cannot play them, in whole or in part. This is not possible in a traditional voice call application.
Actual data volume. Once audio data has been turned into text, the encoded content is indeed much smaller than the original audio data. For a user who ultimately needs to use or play the original voice content, however, the amount of data has not decreased; it has increased (by the voice text codes). Is this a defect of the method? Admittedly, for one specific piece of speech, if the final playback is to reproduce the original input, the amount of data is not reduced (ignoring noise removal for now). But it must be recognised that when personalised voice data is stored centrally in the encoding repository, there is in fact a great deal of redundancy. Handling this redundancy well can greatly improve storage and transmission efficiency. This is explained below.
For a specific individual, the sounds they can produce in a lifetime are limited. Given the constraints of language, the basic sound units and syllables are more limited still, and so are the combinations of sound units. Leaving volume aside, the specific phonemes that can be formed are few. For this reason, if voice data is segmented further when it is stored, the pieces can be reused. In existing audio processing, for example, voice data is cut into a sequence of consecutive frames. A frame is typically 10 ms to 40 ms long, and adjacent frames may overlap to some extent. Suitable framing facilitates audio analysis and allows the audio data to be parameterised further, which ultimately enables reuse.
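Framing with overlap can be sketched as below. The 25 ms frame length and 10 ms hop are illustrative values chosen within the 10 ms to 40 ms range given in the text; the function name is an assumption.

```python
from typing import List, Sequence


def split_frames(samples: Sequence[float], rate_hz: int,
                 frame_ms: int = 25, hop_ms: int = 10) -> List[Sequence[float]]:
    """Cut samples into fixed-length frames; frames overlap when hop_ms < frame_ms."""
    frame_len = rate_hz * frame_ms // 1000
    hop_len = rate_hz * hop_ms // 1000
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop must each cover at least one sample")
    last_start = len(samples) - frame_len
    # Only full frames are kept; a trailing partial frame is dropped.
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop_len)]
```

With a 10 ms hop and 25 ms frames, each frame shares 15 ms with its predecessor, which is one way to realise the overlap the text mentions.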
Some existing audio fingerprint extraction and matching methods can be used to detect redundant voice data well, and so to implement services such as content normalisation and search matching in the encoding repository. One example is Google's Waveprint method (patent US 8411977 B1).
It is foreseeable that, with the method of this embodiment, all the voice data of a person's lifetime could easily be recorded, enabling applications that were previously unimaginable.
Tampering with encoded content. Audio data turned into text is in fact easier to modify. Who, then, guarantees the security and reliability of the audio data? How can one be sure that an audio character sequence is the original sequence? This is not a new problem; traditional text faces the same one. Existing solutions, such as digital signatures, solve it equally well here.
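Tamper detection on a code sequence can be illustrated with Python's standard library. Note the substitution: the text mentions digital signatures, whereas this sketch uses an HMAC (a shared-key message authentication code) as a stand-in, because the standard library has no public-key signature API. The key and function names are assumptions.

```python
import hashlib
import hmac


def sign(code_text: str, key: bytes) -> str:
    """Compute an authentication tag over the character sequence."""
    return hmac.new(key, code_text.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(code_text: str, key: bytes, tag: str) -> bool:
    """Return True only if the sequence is unchanged since it was tagged."""
    return hmac.compare_digest(sign(code_text, key), tag)


key = b"demo-key-not-for-production"   # hypothetical shared secret
original = "S901 V001 S421 V002 V003 V004 S368 V005 S679"
tag = sign(original, key)
print(verify(original, key, tag))                                   # True
print(verify(original.replace("V002 V003", "V003 V002"), key, tag)) # False
```

A real deployment that needs third-party verifiability would use an asymmetric signature scheme instead, so that verifiers do not hold the signing key.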
Non-speech audio data. The discussion so far has focused on voice data. Does the method also apply to non-speech audio data, such as music or the audio tracks of video?
First, the method does not alter the original data. It only segments and encodes it, so that the original content is divided into a code stream and the corresponding audio data in the encoding repository. Final playback can still restore and play the original audio in full. In this sense there is no problem in using the method.
From the point of view of turning data into text, however, the text this method produces is personal and tied to a specific user. That is what makes possible the later speech analysis, recognition, and other highly personalised services for that user. If music or other sounds unrelated to the individual user are stored in the encoding repository together with the speech and associated with the user, the later personalised services will in fact be affected. A better approach is therefore to find a way to separate the voice data and the other audio data into different audio channels, and to encode the other audio data under a category of its own, for example an instrument-related encoding for music. The data of the different audio characters, divided across multiple channels, is finally mixed together.
Mixing multiple text types. Since the segmented and encoded voice data is called text, can it be mixed with traditional text and with other types of text encoding? It can, and this is one of the advantages of the scheme. A person's natural output is multi-channel: a person can speak while writing or typing, for example. Existing systems can only scatter these results into separate data for storage and processing, losing their natural synchronisation. With suitable encoding methods, different kinds of data can be turned into text and then stored, processed, and related to one another in a unified way.
With the development of cloud computing and big data technology, computer systems can analyse, summarise, and even predict human production and daily life more systematically and in greater depth. At present, however, the data that computer systems can analyse and process is mainly data generated inside the digital world. Human output enters the digital world mainly through the keyboard, which is a huge bottleneck, and for most people the keyboard is not a friendly, easy-to-use device. The method provided here is built on natural human output: the voice data produced is segmented and encoded. The encoding result can be processed with the methods and tools of traditional text, while the data corresponding to the codes is stored in the encoding repository. The encoding repository can be placed in cloud storage for ease of analysis and use. The method will greatly improve the efficiency of digitising human voice output, and as voice data accumulates, the encoding repository has the opportunity to provide more intelligent, personalised voice data services, ultimately allowing people to merge seamlessly with the digital world.
Further, the method also includes: generating a unique identifier for the encoding order information based on the arrangement order of the codes, and/or generating a unique identifier for each data fragment based on that fragment, and storing the encoding order information unique identifier and/or the data fragment unique identifiers as part of the metadata.
The data object identifier that uniquely corresponds to the data object, the encoding order information unique identifier, and the data fragment unique identifiers are, respectively, hash values (such as MD5 or SHA1) of the data object, of the arrangement order information of the codes, and of the content of each data fragment; alternatively, they are system-generated globally unique identifiers (UUID/GUID) or any other globally unique codes. An identifier can be used to check the integrity of the content it corresponds to, that is, to verify that the identifier matches its information and that the information is complete.
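The two kinds of identifier can be sketched with the standard library. SHA-1 is used because the text names it as an example; the function names and the newline-joined serialisation of the code order are assumptions. Only a content hash can be recomputed to check integrity; a random UUID identifies an item but carries no information about its content.

```python
import hashlib
import uuid
from typing import List


def fragment_id(fragment: bytes) -> str:
    """Content-derived identifier for one data fragment."""
    return hashlib.sha1(fragment).hexdigest()


def order_id(codes_in_order: List[str]) -> str:
    """Identifier for the arrangement order of the codes."""
    return hashlib.sha1("\n".join(codes_in_order).encode("utf-8")).hexdigest()


def check_fragment(fragment: bytes, expected_id: str) -> bool:
    """Integrity check: recompute the hash and compare with the stored identifier."""
    return fragment_id(fragment) == expected_id


def random_id() -> str:
    """System-generated globally unique identifier (not derived from content)."""
    return str(uuid.uuid4())
```

For example, `check_fragment(b"abc", fragment_id(b"abc"))` holds, while any altered fragment fails the check, and swapping two codes changes the value of `order_id`.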
In summary, data splitting means dividing one complete piece of data into two or more parts, which are then stored in different storage systems. Note that although splitting is followed by separate storage of the split data, as in steps 104B and 207B of the embodiments above, the purpose of data splitting in the present invention is not merely storage; it is data splitting for the purpose of data security. A user may not trust a particular cloud provider to hold their data, but with data splitting a piece of data can be spread across one or more providers, and the data is exposed only if all of its parts (the metadata and every data fragment) are leaked. This greatly increases the difficulty for an unauthorised party to merge the data. The data splitting of the present invention allows the end user of the data (that is, the user entitled to own it) to intervene and control it directly. The data splitting method is built on top of the operating system (including cloud operating systems), specifically in an application system whose purpose is splitting, or in the splitting service of another application system. The storage system, by contrast, is infrastructure built above the physical storage devices and below the operating system. The data splitting method of the present invention ultimately makes use of data storage systems. FIG. 2C is a diagram of where a data splitting method of the present invention sits in the layers of a computer system, showing the position of the invention's field of application within those layers.
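A minimal sketch of splitting one data object across several stores, assuming fixed-size chunks assigned round-robin and SHA-256 identifiers; none of these choices comes from the patent, and the dictionaries stand in for separate storage systems such as different cloud providers.

```python
import hashlib
from typing import Dict, List


def split_data(data: bytes, stores: List[Dict[str, bytes]],
               chunk: int = 4) -> dict:
    """Spread fragments of `data` over `stores`; return the metadata needed to merge."""
    order = []
    for n, offset in enumerate(range(0, len(data), chunk)):
        piece = data[offset:offset + chunk]
        frag_id = hashlib.sha256(piece).hexdigest()
        store_no = n % len(stores)            # round-robin placement (assumed rule)
        stores[store_no][frag_id] = piece
        order.append((store_no, frag_id))     # where each fragment lives, in order
    return {"object_id": hashlib.sha256(data).hexdigest(), "order": order}


provider_a: Dict[str, bytes] = {}
provider_b: Dict[str, bytes] = {}
metadata = split_data(b"hello world!", [provider_a, provider_b])
```

The returned metadata would itself be stored separately from the fragments; without it, a party holding one provider's fragments knows neither the order nor where the remaining fragments are.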
Data can be split and merged on the terminal, or by a server or service provider. Either way, the data that an attacker, or the data service provider itself, can obtain from any one cloud storage server is incomplete and insufficient to threaten the user's privacy and confidentiality. An attacker would need to obtain the same user's identity in several different cloud storage services to get the different fragments that make up the complete data, which is usually far harder than breaking into a single system. The correct merge rule is also needed to restore the fragments to the original complete data. The user's data thus gains an additional layer of protection. A hacker can, of course, attack the user's terminal system and obtain the complete data before it is split or after it is merged. This risk always exists and is unrelated to whether cloud storage is used. In general, terminal devices, especially mobile terminals, expose fewer services and are not permanently online, so their risk of direct attack is generally lower than that of always-online servers. Moreover, an application system with data splitting and merging functions can split and merge data in real time while running, and need not store the pre-split or merged data on the terminal system. In that case the split data remains safe even if the terminal system is attacked, and when the terminal system fails, neither maintenance personnel nor the enterprise IT department can obtain data protected in this way. Take a mail system with a data splitting function as an example: when the data is not in use, no fragment of it need exist on the terminal side. When a document is sent to someone, it exists on the terminal side only after the recipient has downloaded it. Going further, consider a hypothetical enhanced mail client based on the data splitting and merging method of the present invention, where the mail server may still be a traditional mail server. When an attachment is added to a mail, the content of the attachment file is split into several parts; some are saved in cloud storage designated by the user, and the others are saved in the mail as ordinary attachments. The user then selects the sender and sends the mail. The cloud-side mail application can register the metadata of the original attachment file and the split information (the preset metadata stripping rule and so on) in a file meta-information repository (an online service system in which both sender and recipient must hold accounts), and can automatically set up the corresponding data access links for the sender according to the client's settings. As for the recipient, before the attachment is downloaded, no fragment of the data exists on the recipient's terminal. The data is actually stored dispersed across the cloud storage, the mail server, and the corresponding metadata in the file meta-information repository. The data does, of course, still exist on the sender's terminal (if the sender is not using a distributed file system and the file has not been deleted). If the recipient also uses the enhanced mail client, then when the attachment is opened, the system can automatically use the partial content saved in the mail as an ordinary attachment to locate the corresponding entry in the file meta-information repository, from there locate the partial content in cloud storage, restore it according to the corresponding splitting method, and finally recover the original data on the recipient's client. This automatic process presupposes that the account information needed has been configured in the recipient's mail client in advance. At least three accounts are involved: the mail system, the cloud storage system, and the file meta-information repository system.
Corresponding to the data splitting of the present invention, FIG. 2D is a flowchart of a data merging method according to an exemplary embodiment. As shown in FIG. 2D, the present invention provides a data merging method, including:
Step 401B: Receive a data object acquisition request carrying identification information.
The identification information includes positioning information, and the positioning information is used to locate the storage address of part of the data information of the data object.
Step 402B: Obtain the stored content corresponding to the positioning information, and use the positioning information contained in the stored content obtained to fetch the data information in other stored content, until all the data information of the data object has been obtained.
Step 403B: Merge the pieces of data information obtained according to the preset merge rule contained in the data information obtained, to produce the data object.
In the data merging method of this embodiment, a data object acquisition request carrying identification information is received; the stored content indicated by the positioning information in the identification information is obtained; and the positioning information in that stored content is used to fetch the data information in other stored content, until all the data information that makes up the data object has been obtained. The pieces of data information are then merged according to the preset merge rule to produce the complete data object. This makes it harder to obtain the user's original data illegally: even if part of the user's data is obtained by illegal means, it is difficult to arrive at a complete and correct data object, so data storage is secured more reliably.
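Steps 401B to 403B can be sketched as a traversal that starts from one locator and follows the locators found in each stored record. The record layout (`type`, `next`, `order`), the `(store, key)` locator form, and the merge rule of concatenating fragments in the order listed in the metadata are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Locator = Tuple[str, str]   # (store name, key): assumed form of positioning information


def fetch_all(stores: Dict[str, Dict[str, dict]], first: Locator) -> List[dict]:
    """Step 402B: follow locators from record to record until none remain."""
    records, pending, seen = [], [first], set()
    while pending:
        locator = pending.pop(0)
        if locator in seen:
            continue                           # guard against locator cycles
        seen.add(locator)
        store, key = locator
        record = stores[store][key]
        records.append(record)
        pending.extend(record.get("next", []))
    return records


def merge(records: List[dict]) -> bytes:
    """Step 403B: apply the merge rule carried in the metadata record."""
    meta = next(r for r in records if r["type"] == "metadata")
    fragments = {r["id"]: r["data"] for r in records if r["type"] == "fragment"}
    return b"".join(fragments[frag_id] for frag_id in meta["order"])


# Layout loosely modelled on the mail example: attachment -> meta repository -> cloud.
stores = {
    "mail": {"att1": {"type": "fragment", "id": "f2", "data": b"world",
                      "next": [("metadb", "m1")]}},
    "metadb": {"m1": {"type": "metadata", "order": ["f1", "f2"],
                      "next": [("cloud", "c1")]}},
    "cloud": {"c1": {"type": "fragment", "id": "f1", "data": b"hello ",
                     "next": []}},
}
print(merge(fetch_all(stores, ("mail", "att1"))))  # b'hello world'
```

In this layout no single store is sufficient: the mail fragment alone reveals neither the other fragment nor the order, which only the metadata record supplies.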
FIG. 2E is a flowchart of a data merging method according to another exemplary embodiment. As shown in FIG. 2E, the present invention provides a data merging method, including:
Step 501B: Receive a data object acquisition request carrying identification information.
The identification information includes location information, and the location information is used to locate the storage address of part of the data information of the data object. The data information is of one or more of the following types, in any combination: metadata, data fragments, codes, and coding order.
Step 502B: Acquire the stored content corresponding to the location information and, according to the location information contained in the acquired stored content, acquire the data information in other stored content, until all data information of the data object has been acquired.
Step 503B: According to the preset merge protocol contained in the acquired data information, merge the acquired pieces of data information to obtain the data object.
Specifically, one or more pieces of data information are acquired according to the location information (a piece of data information may be a split data fragment, part or all of the metadata, or part or all of the codes and the coding order). Following a specific rule, namely the preset merge protocol, the corresponding data information is acquired step by step from the pieces already obtained, and the pieces of data information are combined (that is, the metadata, data fragments, codes, coding order, and so on are merged), thereby recovering the original data object. The specific merge cases are as follows:
A. When the data information is a combination of data fragments, codes, and coding order: decode the codes according to the merge algorithm in the preset merge protocol to obtain the data fragment corresponding to each code; then arrange the decoded data fragments according to the coding order to obtain a data object whose fragments are in their original order.
B. When the data information is a combination of metadata and data fragments:
B1. If the metadata agreed in the preset merge protocol includes attribute information, verify the integrity of the data object obtained by merging the data fragments against the attribute information, to confirm that the attributes of the data object match the attribute information in the metadata; or,
B2. If the metadata agreed in the preset merge protocol includes a data content identifier and a keyword, merge the data matching the keyword into the data fragment corresponding to the data content identifier, and then merge the data fragments to form the data object; or,
B3. If the metadata agreed in the preset merge protocol includes attribute information, a data content identifier, and a keyword, merge the data matching the keyword into the data content corresponding to the data content identifier, and verify the integrity of the data object obtained by merging the data fragments against the attribute information, to confirm that the attributes of the merged data object match the attribute information in the metadata.
Step 504B: If the metadata contains a unique identifier of the data object, verify the integrity of the merged data object against the unique identifier.
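Case A, case B1, and step 504B can be sketched together as follows. This is only an illustration under stated assumptions: decoding is modelled as a dictionary lookup from code to fragment, the attribute checked is the byte length (`size`), and the unique identifier is taken to be a SHA-256 digest (`object_id`). The specification does not fix any of these choices, and the function and field names are hypothetical.

```python
import hashlib
from typing import Dict, List


def merge_and_verify(fragments: Dict[str, bytes], code_order: List[str], metadata: dict) -> bytes:
    """Arrange decoded fragments by coding order, then verify against metadata."""
    # Case A: look up ("decode") each code and concatenate in the recorded order.
    merged = b"".join(fragments[code] for code in code_order)
    # Case B1: attribute check; here the attribute is assumed to be the size.
    if "size" in metadata and len(merged) != metadata["size"]:
        raise ValueError("size does not match attribute information")
    # Step 504B: unique identifier check; a SHA-256 digest is assumed.
    if "object_id" in metadata and hashlib.sha256(merged).hexdigest() != metadata["object_id"]:
        raise ValueError("merged object does not match unique identifier")
    return merged


fragments = {"c1": b"data ", "c2": b"object"}
order = ["c1", "c2"]
meta = {"size": 11, "object_id": hashlib.sha256(b"data object").hexdigest()}
restored = merge_and_verify(fragments, order, meta)
```

A wrong coding order keeps the size intact but changes the digest, so the identifier check catches reordering that the attribute check alone would miss.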
The data merging process is in fact the inverse of the data splitting process and operates according to the preset merge protocol. In practice, the preset merge protocol (hereinafter, the merge protocol) and the preset splitting protocols (including the preset metadata stripping protocol, the preset data content splitting protocol, the preset code separation protocol, and so on; hereinafter collectively, the split/strip protocol) may be one and the same content. Like the split protocol, the merge protocol is data information prepared for recovering data. It may also be called a split-merge protocol, because splitting must already ensure that the split data can be recovered. The split protocol therefore usually contains or implies the merge protocol.
Take a mail client as an example. After obtaining a mail attachment, the client can use the attachment name (that is, the unique identifier of the data object) to locate the data information held in the various stored contents, in locations such as the file meta-information system library, the mail system, and cloud storage. The data information contains the splitting algorithm, the data fragments, location information, the related file metadata items, and so on. The mail system can locate and download the data fragments according to the acquired data information, derive the inverse of the splitting algorithm, and use it to merge the data fragments and metadata; if codes are present, the data fragments can also be restored from the codes, yielding the content of the original user data object. If the metadata contains the unique identifier of the data object, the file metadata can additionally be used to verify the file size and to restore the file name, file type, creation time, and so on. In this mail client example, the split protocol information can itself serve as the merge protocol: the specific merge protocol can be derived from the data split description document, which amounts to an inversion process.
It follows that, when merging data, the original data cannot be recovered from the data fragments alone. At a minimum, the split/strip protocol established during data splitting must also be obtained, and the merge protocol must either be derived from it by reverse analysis or be obtained directly. Typically, the system retains the split/strip protocol after splitting and stores the related location information (for example, where the protocol is stored) in the split data fragments or in any designated, accessible storage space. Alternatively, the merge protocol corresponding to the split/strip protocol can be generated during splitting and stored in the split data fragments or at another designated location; in that case the merge process only needs to obtain the merge protocol directly. The system then uses the obtained split/strip protocol or merge protocol to find or extract the corresponding split metadata, and, based on the protocol, the metadata, and related information, splices the data fragments together to recover the original data.
Further, decoding the codes according to the merge algorithm in the preset merge protocol to obtain the data fragments corresponding to the codes includes:
disassembling the data information according to the merge algorithm in the preset merge protocol to obtain a meta code, or a meta code and an instance code;
querying the encoding repository and obtaining the corresponding metadata and encoding protocol according to the meta code;
obtaining the data object corresponding to the data information according to the metadata and the encoding protocol, or according to the metadata, the encoding protocol, and the instance code.
It should be noted that the detailed decoding flow is described in the later embodiments of the decoding processing method in this specification and is not repeated here.
The splitting and merging of a whole data object is illustrated below with a specific example. The specific data, algorithms, and so on in this example are illustrative only and do not limit the present invention. Splitting goal: divide the information of the data object into three parts: a metadata block, data blocks (that is, data fragments), and an index block (that is, codes). Any information dispersal algorithm may be used, for example the IDA algorithm. The losslessly compressed source file content is divided into four-byte (32-bit) units; compression is not mandatory. The resulting units are sorted and de-duplicated, that is, repeated items are eliminated, and saved as a data block file of mutually distinct blocks. Each divided data block (data fragment) is mapped to its index (code) in the data block file, and the indexes are saved in the original order as an index file (the codes and the ordering information of the codes). The file names of the data block file and the index file may be hash values of the corresponding file content (MD5, SHA-1, and so on), system-generated globally unique identifiers (GUIDs), or any other globally unique code. The file name, size, date, and similar information of the source file, together with the file names of the data block file and the index file, can be stored in a metadata database.
Storing these three parts (the metadata block; the data blocks, that is, data fragments; and the index block, that is, codes and coding order information) separately across multiple cloud storage systems is enough to provide the intended security protection. The deployment is flexible: the data block file and the index file can both be placed in one file-based cloud storage while the metadata is placed in a separate cloud database; or the three parts can each be stored in a different cloud storage; and, to improve availability, each part can be given its own redundant backup. In multi-user data sharing and collaboration, sharing is more flexible still: the three parts can be shared through any combination of communication and sharing channels, such as email, cloud sharing, instant messaging, and FTP.
After obtaining the three parts, or access authorization to the storage systems that hold them, the system can restore the target file through the data merging process: for example, following the codes and their ordering information in the index file, it concatenates the four-byte contents at the corresponding index positions in the data block (data fragment) file, and then decompresses the concatenated result (if it was compressed earlier) to obtain the target file. A desktop agent can also be built for this general-purpose split storage system. This agent is layered on top of the desktop agents of the underlying cloud storage services and automates the splitting and merging described above for the user's convenience. For example, the split-storage desktop agent on the user's client runs in the background, and its underlying cloud storage services are, say, Google Drive and Microsoft OneDrive. Google Drive has the directory C:\GDrive, which synchronizes automatically with Google's cloud storage, and OneDrive has the directory C:\MDrive, which synchronizes automatically with Microsoft's cloud storage. The synchronization directory of the split-storage desktop agent is C:\DDrive. When the user saves a file to C:\DDrive, the desktop agent service detects the file system change and automatically splits the file: the data block (data fragment) file is saved to C:\GDrive, the index file (the codes and their ordering information) is saved to C:\MDrive, and the metadata is saved to a dedicated database cloud service. The Google and Microsoft desktop agent services automatically synchronize the data block file and the index file to Google's and Microsoft's cloud storage respectively, and to the corresponding directories on the user's other terminals. If such a terminal runs the split-storage desktop agent, it detects the changes in C:\GDrive and C:\MDrive, automatically fetches the metadata, merges it with the data block file and the index file into the original file, and saves the result to C:\DDrive, thereby achieving synchronization of split/merge storage.
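The three-part split in this example (metadata block, de-duplicated data block file, index file) and its inverse can be sketched as below. This is a minimal sketch, not the patented implementation: `zlib` stands in for the unspecified lossless compressor, plain fixed-size chunking is used rather than an actual IDA scheme, and the function names and metadata fields (`compressed`, `size`) are assumptions.

```python
import zlib
from typing import List, Tuple

BLOCK = 4  # four bytes (32 bits), as in the example above


def split(source: bytes, compress: bool = True) -> Tuple[dict, List[bytes], List[int]]:
    """Return (metadata, de-duplicated data blocks, index in original order)."""
    payload = zlib.compress(source) if compress else source
    chunks = [payload[i:i + BLOCK] for i in range(0, len(payload), BLOCK)]
    blocks = sorted(set(chunks))                   # sort and eliminate duplicates
    position = {block: i for i, block in enumerate(blocks)}
    index = [position[chunk] for chunk in chunks]  # codes, in original order
    metadata = {"compressed": compress, "size": len(source)}
    return metadata, blocks, index


def merge(metadata: dict, blocks: List[bytes], index: List[int]) -> bytes:
    """Concatenate blocks by index, decompress if needed, and check the size."""
    payload = b"".join(blocks[i] for i in index)
    data = zlib.decompress(payload) if metadata["compressed"] else payload
    if len(data) != metadata["size"]:
        raise ValueError("restored size does not match metadata")
    return data


meta, blocks, index = split(b"abcdabcdabcdxy", compress=False)
restored = merge(meta, blocks, index)
```

With compression disabled, the sample input yields two distinct blocks and the index `[0, 0, 0, 1]`. The block file alone reveals neither order nor repetition, and the index file alone reveals no content, which is why the text stores them in different places.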
FIG. 2F is a schematic structural diagram of a data splitting apparatus according to an exemplary embodiment. As shown in FIG. 2F, the present invention provides a data splitting apparatus, including: an acquiring and stripping module 61B, configured to, upon receiving a storage request carrying an identifier of data to be stored, acquire the metadata in the data object corresponding to that identifier according to a preset metadata stripping protocol, and strip the acquired metadata from the data object; a dividing module 62B, configured to divide the data content into at least two data fragments according to a preset data content splitting protocol; and a storage module 63B, configured to store the metadata and the data fragments separately in different storage entities or different secure channels.
In the data splitting apparatus of this embodiment, upon receiving a storage request carrying an identifier of data to be stored, the metadata in the corresponding data object is acquired according to the preset metadata stripping protocol and stripped from the data object; the data content is then divided into multiple data fragments according to the preset data content splitting protocol; and the metadata and the data fragments are stored separately in different storage entities or different secure channels. This increases the difficulty of illegally obtaining a user's original data and achieves the security of data storage more reliably.
On the basis of the above embodiment, further, FIG. 2G is a schematic structural diagram of a data splitting apparatus according to another exemplary embodiment. As shown in FIG. 2G, the acquiring and stripping module 61B includes: a receiving sub-module 611B, configured to receive a storage request carrying an identifier of data to be stored; and a determining sub-module 612B, configured to act when the receiving sub-module 611B receives such a storage request, as follows. When the metadata agreed in the preset metadata stripping protocol includes attribute information, it determines as metadata the attribute information content, in the data object corresponding to the identifier, that matches the attribute information. Or, when the agreed metadata includes a data content identifier and a keyword, it determines as metadata, according to the data content identifier, the data matching the keyword within the data content of the data object. Or, when the agreed metadata includes attribute information, a data content identifier, and a keyword, it determines as metadata both the attribute information content in the data object that matches the attribute information and, according to the data content identifier, the data content in the data object that matches the keyword. A stripping sub-module 613B is configured to strip the metadata determined by the determining sub-module 612B from the data object.
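Keyword-based stripping by sub-modules 612B and 613B, and its reversal during merging, might look like the sketch below. It assumes that the "keyword" is expressed as a regular expression and that each stripped match is stored with its offset in the original content. Both are illustrative choices, and the function names are hypothetical.

```python
import re
from typing import List, Tuple


def strip_keywords(content: str, pattern: str) -> Tuple[str, List[Tuple[int, str]]]:
    """Remove every match of `pattern`; return the remainder and the stripped metadata."""
    stripped = [(m.start(), m.group()) for m in re.finditer(pattern, content)]
    remainder = re.sub(pattern, "", content)
    return remainder, stripped


def restore_keywords(remainder: str, stripped: List[Tuple[int, str]]) -> str:
    """Re-insert stripped matches at their original offsets, in ascending order."""
    result = remainder
    for offset, text in stripped:
        result = result[:offset] + text + result[offset:]
    return result


original = "card 1234 and 5678 end"
remainder, stripped = strip_keywords(original, r"\d{4}")
restored = restore_keywords(remainder, stripped)
```

Because offsets refer to the original content and are replayed in ascending order, each insertion lands at the right place once the earlier ones have been restored.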
Further, the acquiring and stripping module 61B includes: a parsing sub-module 614B, configured to, when the metadata agreed in the preset metadata stripping protocol includes a data object identifier, parse the data object to generate a data object identifier that uniquely corresponds to the data object.
Further, the apparatus also includes: an encoding module 64B, configured to encode each data fragment according to a preset code separation protocol to obtain the code corresponding to each data fragment; and an arranging module 65B, configured to arrange the codes according to the original order of the data fragments in the data content to obtain the ordering information of the codes. The storage module 63B is specifically configured to store the metadata, the codes corresponding to the data fragments, and the ordering information of the codes separately in different storage entities or different secure channels.
Further, the apparatus also includes: an identifier generating module 66B, configured to generate a unique identifier for the coding order information based on the ordering information of the codes, and/or to generate a unique identifier for each data fragment based on that fragment. The storage module 63B is further configured to store the unique identifier of the coding order information and/or the unique identifiers of the data fragments as part of the metadata.
The preset data content splitting protocol includes at least one of: a disk array (RAID) splitting algorithm and an information dispersal (IDA) algorithm.
The implementation and principle of the above data splitting apparatus are similar to those of the data splitting method and are not repeated here.
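The identifier generating module 66B derives identifiers from the coding order information and from each data fragment. One way to do this is with content hashes, as sketched below. SHA-256 and the JSON serialization of the order are assumptions made for illustration, since the specification leaves the identifier scheme open, and the function and field names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def build_identifier_metadata(fragments: List[bytes], order: List[int]) -> Dict[str, object]:
    """Derive identifiers for the coding order and for each fragment (hashes assumed)."""
    # Serialize the order deterministically so equal orders give equal identifiers.
    order_bytes = json.dumps(order, separators=(",", ":")).encode("ascii")
    return {
        "order_id": hashlib.sha256(order_bytes).hexdigest(),
        "fragment_ids": [hashlib.sha256(f).hexdigest() for f in fragments],
    }


ids = build_identifier_metadata([b"abcd", b"xy"], [0, 0, 0, 1])
```

Stored as part of the metadata, such identifiers let the merging side check that it fetched the right fragments and ordering before it reassembles anything.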
FIG. 2H is a schematic structural diagram of a data merging apparatus according to an exemplary embodiment. As shown in FIG. 2H, the present invention provides a data merging apparatus, including:
a receiving module 81B, configured to receive a data object acquisition request carrying identification information, where the identification information includes location information and the location information is used to locate the storage address of part of the data information of the data object;
an acquiring module 82B, configured to acquire the stored content corresponding to the location information and, according to the location information contained in the acquired stored content, acquire the data information in other stored content, until all data information of the data object has been acquired;
a processing module 83B, configured to merge the acquired pieces of data information according to the preset merge protocol contained in the acquired data information, to obtain the data object.
In the data merging apparatus of this embodiment, a data object acquisition request carrying identification information is received; the stored content indicated by the location information in the identification information is acquired; and the data information in other stored content is then acquired according to the location information in that stored content, until all data information constituting the data object has been acquired. The acquired pieces of data information are merged according to the preset merge protocol to obtain the complete data object. This increases the difficulty of illegally obtaining a user's original data: even if part of the user data is obtained by illegal means, it is difficult to reconstruct a complete and correct data object, so the security of data storage is achieved more reliably.
On the basis of the above embodiment, further, FIG. 2I is a schematic structural diagram of a data merging apparatus according to another exemplary embodiment. As shown in FIG. 2I, the data information is of one or more of the following types, in any combination: metadata, data fragments, codes, and coding order.
A. When the data information is a combination of data fragments, codes, and coding order, the processing module 83B includes: a decoding sub-module 831B, configured to decode the codes according to the merge algorithm in the preset merge protocol to obtain the data fragment corresponding to each code; and an arranging sub-module 832B, configured to arrange the decoded data fragments according to the coding order to obtain a data object whose fragments are in their original order.
B. When the data information is a combination of metadata and data fragments, the processing module 83B is specifically configured as follows. When the metadata agreed in the preset merge protocol includes attribute information, it verifies the integrity of the data object obtained by merging the data fragments against the attribute information, to confirm that the attributes of the data object match the attribute information in the metadata. Or, when the agreed metadata includes a data content identifier and a keyword, it merges the data matching the keyword into the data fragment corresponding to the data content identifier, and then merges the data fragments to form the data object. Or, when the agreed metadata includes attribute information, a data content identifier, and a keyword, it merges the data matching the keyword into the data content corresponding to the data content identifier, and verifies the integrity of the data object obtained by merging the data fragments against the attribute information, to confirm that the attributes of the merged data object match the attribute information in the metadata.
Further, the apparatus also includes: an integrity verification module 84B, configured to, when the metadata contains a unique identifier of the data object, verify the integrity of the merged data object against the unique identifier.
The implementation and principle of the above data merging apparatus are similar to those of the data merging method and are not repeated here.
Below, with reference to the embodiments of the splitting and merging methods and apparatuses above, a software/hardware implementation of the present invention is described through a specific example.
针对基于拆分的应用系统来说,拆分主要是在系统架构中考虑系统如何将数据分布在多个存储之中。这样的系统一般采用元数据、编码以及领域相关的数据内容拆分。因而可以针对应用领域进行自然地拆解,即使用领域相关的拆分方法。数据的拆分/剥离、合并流程往往是内置在系统的数据访问层,与领域相关的业务逻辑相关联。无论是领域相关的数据拆分还是领域无关的数据拆分,其数据拆分/剥离方法都可以是多种多样的。因此,我们引入“数据拆分描述语言(可作为拆分/合并规约的一部分)”的概念来对数据的拆分过程进行配置。这样,系统或者用户可以在运行时使用动态的数据拆分/剥离方法对数据进行拆分/剥离。数据拆分/剥离方法的描述本身(可作为拆分规约的一部分)作为剥离出元数据的一部分可以存储在特定的存储中。不同的数据就可以有不同的拆分/剥离方法。最后,数据的合并也就会因数据而异,合并过程必须建立在对拆分/剥离方法描述的理解之上。数据拆分/剥离/合并引擎就是对数据拆分/剥离描述信息进行解析、执行来完成数据拆分/剥离/合并的系统组件。对于数据拆分描述语言和数据拆分/剥离/合并模型的核心是数据处理器模型。数据处理器是对数据进行加工处理的软件/硬件组件。用于实现拆分功能的叫做拆分器,相应合并数据的叫做合并器,它们也都是数据处理器。此外,压缩器、解压器、加密器、解密器、保存器、提取器等也都是数据处理器。数据处理器的核心是处理过程,此外还包括若干个输入端口(包括数据输入端口和参数输入端口两种)和若干个输出。数据输入端口对应数据输入,输出端口对应数据输出,参数输入端口对应数据处理过程中需要用到的参数信息。例如,压缩器有一个输入端口(当有压缩密码时,还需要一个额外的密码参数输入端口)、一个数据输出;拆分器有一个数据输入、多个数据输出;合并器有多个数据输入、一个数据输出;保存器有一个数据输入、多个参数输入(对应存储位置、存取访问信息等)、没有输出(其处理过程是将输入提交到存储);提取器没有输入、一个数据输出;还有一类很特别的数据处理器——生成器,没有数据输入(有时候有参数输入),一个或者多个数据输出,其数据输出往往作为数据处理的参数参与整个数据处理过程。分发器是一个数据输入,多个数据输出,每个输出的数据和输入的数据是一样的。一个处理器的输出必需连接到另外一个处理器的输入(可以是数据输入,也可以是参数输入)。此外,我们可以看到,几乎每个数据处理 器都有对应的逆向处理器,否则,我们无法通过数据拆分描述来完成数据合并的过程(唯一的例外是数据生成器,数据生成的过程一般不可逆。在系统进行逆向处理是,生成的数据可以从存储以及其他处理器中直接或者间接获得)。一般说来,一个数据处理器的数据输入就是其对应逆向处理器的数据输出,数据输出是其逆向处理器的数据输入;参数输入保持不变。拆分器对应合并器,加密器对应解密器,压缩器对应解压器,保存器对应提取器,分发器对应的还是分发器(分发器求逆有一个数据输入端口选择的过程),等等。整个数据拆分/剥离/合并的过程实际上是由数据处理器构成的网络实现的,其实质可以用Petri网模型进行刻画。处理过程是变迁(Transition),输入端口是库所(Place),输出到下一个输入端口就是一个有向弧(Connection),从数据处理器输入端口到本处理器处理过程的有向弧是隐含在处理器内部的——当所有数据端口都拥有数据(令牌)时,处理过程自动激活,数据向下流动。For split-based applications, splitting is primarily about considering how the system distributes data across multiple stores in the system architecture. Such systems typically use metadata, coding, and domain-related data content splitting. Therefore, it is possible to naturally disassemble the application domain, that is, to use a domain-related split method. The data split/stripe, merge process is often built into the system's data access layer, associated with domain-related business logic. 
Whether it is domain-related data splitting or domain-independent data splitting, its data splitting/stripping methods can be varied. Therefore, we introduce the concept of "data split description language (which can be used as part of the split/merge protocol)" to configure the data splitting process. In this way, the system or user can split/stripe the data at runtime using a dynamic data split/peel method. The description of the data split/peel method itself (which can be part of the split specification) can be stored in a particular store as part of the stripped out metadata. Different data can have different split/peel methods. Finally, the merging of data will vary from data to data, and the merging process must be based on an understanding of the split/peel method description. The data split/peel/merge engine is a system component that parses and executes the data split/peel description information to complete the data split/peel/merge. At the heart of the data split description language and data split/peel/merge model is the data processor model. A data processor is a software/hardware component that processes data. The splitter is used to implement the split function, and the corresponding merged data is called the combiner. They are also data processors. In addition, compressors, decompressors, encryptors, decryptors, savers, extractors, etc. are also data processors. The core of the data processor is the processing, in addition to several input ports (including data input port and parameter input port) and several outputs. The data input port corresponds to the data input, the output port corresponds to the data output, and the parameter input port corresponds to the parameter information that needs to be used in the data processing process. 
For example, the compressor has an input port (and an additional password parameter input port when there is a compressed password), a data output; the splitter has one data input, multiple data outputs; the combiner has multiple data inputs , a data output; saver has a data input, multiple parameter input (corresponding storage location, access access information, etc.), no output (the process is to submit the input to the storage); the extractor has no input, a data output There is also a very special kind of data processor - generator, no data input (sometimes with parameter input), one or more data output, and its data output often participates in the entire data processing process as a parameter of data processing. The distributor is a data input, multiple data outputs, and each output data is the same as the input data. The output of one processor must be connected to the input of another processor (either data input or parameter input). In addition, we can see that almost every data processing The device has a corresponding reverse processor. Otherwise, we can't complete the data merge process through the data split description. The only exception is the data generator. The data generation process is generally irreversible. The reverse processing in the system is the generated data. Can be obtained directly or indirectly from storage and other processors). In general, the data input of a data processor is the data output of its corresponding reverse processor, and the data output is the data input of its reverse processor; the parameter input remains unchanged. The splitter corresponds to the combiner, the encryptor corresponds to the decryptor, the compressor corresponds to the decompressor, the saver corresponds to the extractor, the distributor corresponds to the distributor (the process of the distributor inversion has a data input port selection), and so on. 
The whole process of data splitting/stripping/merging is implemented by a network of data processors, and in essence it can be characterized by the Petri net model: the processing is a transition, an input port is a place, and the connection from an output to the next input port is a directed arc. The directed arcs from a processor's input ports to the processor are implicit within the processor: when all input ports hold data (tokens), the processing fires automatically and the data flows downstream.
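The firing rule can be sketched in a few lines. This is a simplified, single-threaded illustration under assumptions made here: the `Processor` class, the port names, and the half-and-half split are hypothetical, and a processor fires as soon as its last input port (place) receives a token.

```python
from typing import Callable, Dict, List, Tuple

class Processor:
    """A transition that fires once every input port (place) holds a token."""

    def __init__(self, name: str, inputs: List[str], func: Callable[..., tuple]):
        self.name = name
        self.inputs = inputs
        self.func = func                                  # returns one value per output
        self.tokens: Dict[str, object] = {}               # tokens currently on input ports
        self.links: List[Tuple["Processor", str]] = []    # output i -> (target, port)

    def put(self, port: str, token: object) -> None:
        self.tokens[port] = token
        if all(p in self.tokens for p in self.inputs):
            self.fire()

    def fire(self) -> None:
        args = [self.tokens.pop(p) for p in self.inputs]  # consume the tokens
        results = self.func(*args)
        for (target, port), value in zip(self.links, results):
            target.put(port, value)                       # follow the directed arc

collected = []
sink = Processor("sink", ["in"], lambda d: (collected.append(d),))
combiner = Processor("combiner", ["left", "right"], lambda l, r: (l + r,))
splitter = Processor("splitter", ["in"],
                     lambda d: (d[:len(d) // 2], d[len(d) // 2:]))
combiner.links = [(sink, "in")]
splitter.links = [(combiner, "left"), (combiner, "right")]

splitter.put("in", b"handwriting")   # the combiner fires only after both halves arrive
```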
The data split description language mentioned above is mainly used to describe the assembly flow graph of data processors. A document written in the data split description language is called a data split description document. A data flow graph described in such a document is itself, in essence, a data processor, so one data flow graph can use another data flow graph as a data processor. A data split description document defines one or more data flow graphs. For a document used directly for data split description, the final entry flow graph must be specified. Each data flow graph includes multiple data processors and the connections between them; the connections are described in the data output ports of the data processors. A data flow graph has a designated starting data processor. Data split description documents can be presented and edited graphically. The data split/merge engine splits and merges data according to the data split description document. The corresponding data splitting process is shown in FIG. 2J: step 1001B, acquiring the metadata of the data object to be separated; step 1002B, creating a separation archive document according to the metadata; step 1003B, reading in the data separation archive document; step 1004B, instantiating the data separation storage document as a data flow graph (instantiating the data processors and establishing the connections between them); step 1005B, passing the data to be separated to the starting data processor of the data flow graph; step 1006B, destroying the data flow graph after the flow graph has finished executing.
As can be seen, the main work of data splitting is performed by the data processors within the data flow graph. The data split/merge engine is mainly responsible for loading the data split description document, instantiating it as an executable data flow graph, and finally passing the data to that flow graph for processing. A data processor is an active object: an instantiated processor object has its own thread or process and continually checks its own execution condition. As soon as it finds that all input ports hold data, it executes automatically and passes the results to other data processors; after completing these operations it destroys itself. The flowchart is shown in FIG. 2K. Step 1101B: determine whether data has been delivered to an input port; if so, perform step 1102B; if not, perform step 1103B. Step 1102B: receive the input data. Step 1103B: determine whether all data ports hold data; if an empty input port is found (usually a parameter port), that is, an input port with no data source, the user is asked through an interactive interface to enter the corresponding information. If all ports hold data, perform step 1104B; otherwise, return to step 1101B. Step 1104B: execute the data processing. Step 1105B: pass the processing results to the data processors connected to the outputs.
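The active-object behaviour of FIG. 2K can be sketched with one thread per processor. This is an illustrative simplification under assumptions made here: the `ActiveProcessor` class and its port names are hypothetical, blocking queue reads stand in for the polling loop, and the interactive prompt for empty parameter ports is omitted.

```python
import queue
import threading

class ActiveProcessor(threading.Thread):
    """A processor with its own thread; it runs once and then ends."""

    def __init__(self, func, input_ports, links=()):
        super().__init__()
        self.func = func                          # returns a tuple, one item per output
        self.inbox = {p: queue.Queue() for p in input_ports}
        self.links = list(links)                  # output i -> (processor, port name)

    def run(self):
        # Steps 1101B-1103B: block until every input port has received data.
        args = [self.inbox[p].get() for p in self.inbox]
        # Step 1104B: execute the data processing.
        results = self.func(*args)
        # Step 1105B: pass each result to the connected processor's port.
        for (proc, port), value in zip(self.links, results):
            proc.inbox[port].put(value)
        # Returning from run() ends the thread (the "destroys itself" step).

received = []
sink = ActiveProcessor(lambda d: (received.append(d),), ["in"])
joiner = ActiveProcessor(lambda text, sep: (sep.join(text.split()),),
                         ["data", "separator"], [(sink, "in")])
for p in (sink, joiner):
    p.start()
joiner.inbox["separator"].put("-")            # parameter port
joiner.inbox["data"].put("split then merge")  # data port
for p in (sink, joiner):
    p.join()
```

The processor does nothing until both its data port and its parameter port have been filled, regardless of the order in which they arrive.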
The corresponding data merging process is shown in FIG. 2L: step 1201B, locating the corresponding data separation storage document according to the input information; step 1202B, reading in the data separation storage document; step 1203B, instantiating the data separation storage document as the corresponding reverse data flow graph; step 1204B, destroying the data flow graph after the flow graph has finished executing.
When split data is restored, the input information can be the reference code of the data split document or part of the data content produced by the split. In the latter case, the hash value obtained through a hash function can also serve as the reference code of the document. (A hash function is a method of creating a small digital "fingerprint" from data content; the same data always yields the same fingerprint, and the fingerprint is assumed not to collide with other fingerprints.) With this code, the corresponding data split document can be obtained. The data split document describes the splitting flow, so merging requires the corresponding reverse flow. The inversion starts from the starting data processor and traverses the related data processors through their output ports, inverting each in turn. How a data processor is inverted depends on its type, but in general its type is changed to the type of the reverse process, its data input ports become output ports, and its output ports become data input ports. The parameter input ports remain unchanged.
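Locating a split document from either kind of input can be sketched as below. This is a minimal illustration under assumptions made here: the in-memory `split_documents` dictionary and the function names are hypothetical, and SHA-1 is chosen only because the example later in the text names it.

```python
import hashlib
from typing import Dict, Union

# Hypothetical registry: reference code -> data split description document.
split_documents: Dict[str, str] = {}

def fingerprint(content: bytes) -> str:
    """Hash value of a piece of split data, used as a reference code."""
    return hashlib.sha1(content).hexdigest()

def register(document: str, split_part: bytes) -> str:
    """Index a split document under the fingerprint of one of its parts."""
    code = fingerprint(split_part)
    split_documents[code] = document
    return code

def locate(input_info: Union[str, bytes]) -> str:
    """Accept either a reference code (str) or part of the split data (bytes)."""
    code = fingerprint(input_info) if isinstance(input_info, bytes) else input_info
    return split_documents[code]

code = register("split-description-A", b"coded part of the data")
```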
For example, the definition of the data split description language is shown in FIG. 2M; a visual flowchart of the data split description language is shown in FIG. 2N; and a sample data split description document is shown in Table 1:
Table 1. Sample data split description document
Figure PCTCN2015086672-appb-000012
Figure PCTCN2015086672-appb-000013
Figure PCTCN2015086672-appb-000014
The specific splitting process is as follows. The data to be split is first DES-encrypted, with the encryption key taken from the system configuration store. The encrypted data is then split by 4-byte split coding into block data and coded data. The coded data is stored in Amazon S3 cloud storage, and its SHA1 hash value is stored in the metadata database as the key for addressing the corresponding metadata. The block data is stored in a local file whose name is a system-generated GUID; this GUID is also stored in the metadata database as a key. The related records of the metadata database are shown in Table 2, and the mapping between split items and metadata is shown in Table 3.
Table 2. Metadata table:
Figure PCTCN2015086672-appb-000015
Table 3. Mapping between split items and metadata:
Figure PCTCN2015086672-appb-000016
With either of these two keys, it becomes possible to obtain the corresponding data split description document and thereby recover the data.
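The example above can be sketched end to end as follows. Several parts are assumptions made for illustration and are labeled as such in the code: this passage does not define "4-byte split coding", so the sketch assumes a table of distinct 4-byte blocks plus one index per block; the DES step is skipped (the input is taken to be already encrypted); and in-memory dictionaries stand in for Amazon S3, the local file system, the metadata database, and the split description documents.

```python
import hashlib
import struct
import uuid

def split_4byte(data: bytes):
    """ASSUMED scheme: distinct 4-byte blocks plus a big-endian index per block."""
    blocks, index_of, codes = [], {}, []
    for i in range(0, len(data), 4):
        chunk = data[i:i + 4]
        if chunk not in index_of:
            index_of[chunk] = len(blocks)
            blocks.append(chunk)
        codes.append(index_of[chunk])
    return blocks, struct.pack(f">{len(codes)}I", *codes)

def merge_4byte(blocks, coded: bytes) -> bytes:
    """Reverse of split_4byte: look each index up in the block table."""
    codes = struct.unpack(f">{len(coded) // 4}I", coded)
    return b"".join(blocks[c] for c in codes)

# In-memory stand-ins (hypothetical) for the stores named in the text.
cloud_store, local_files, metadata_db, documents = {}, {}, {}, {}

def split_and_store(encrypted: bytes, document_id: str):
    blocks, coded = split_4byte(encrypted)
    sha1_key = hashlib.sha1(coded).hexdigest()   # key for the coded data
    guid_key = str(uuid.uuid4())                 # file name for the block data
    cloud_store[sha1_key] = coded
    local_files[guid_key] = blocks
    metadata_db[sha1_key] = document_id          # both keys address the same metadata
    metadata_db[guid_key] = document_id
    documents[document_id] = (sha1_key, guid_key)
    return sha1_key, guid_key

def recover(key: str) -> bytes:
    """Either key leads to the split document, and from there to both parts."""
    sha1_key, guid_key = documents[metadata_db[key]]
    return merge_4byte(local_files[guid_key], cloud_store[sha1_key])

payload = b"ABCDABCDWXYZ12"
sha1_key, guid_key = split_and_store(payload, "doc-1")
```

Neither stored part is sufficient on its own: the block table has lost the ordering, and the index sequence has lost the content.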
From the above description it is clear that each of the three concepts of the present invention, namely (1) the handwriting input system and method, (2) the object-based data encoding scheme, and (3) the object-based data splitting scheme, achieves its own technical effect when implemented separately. Preferably, these concepts can be combined with one another, or one or more of them can be combined with other applications, in which case their value and beneficial effects are realized more fully. FIG. 2O shows the relationships among the various notions under these three concepts, together with some specific application examples that can be derived from them. These applications are merely exemplary, and many more variations are possible in practice; the present invention therefore has very broad application prospects.
After decades of development, information technology has entered a network era in which it is deeply integrated with communication technology. Traditional data processing systems based on standardized encoding laid a solid foundation for modern computing, but they cannot meet the needs of networked personal computing: personalization, security, efficiency, and so on. To keep pace with these developments and remedy these shortcomings, the present invention not only provides a novel handwriting input method and system but also combines it with the object-based open encoding/decoding scheme and the object-based data split/strip/merge method of the present invention. On top of traditional data processing systems, this builds a genuinely future-oriented data processing architecture for networked environments that is open, secure, and efficient.
Before the encoding and decoding methods of the present invention are described below, some background is introduced. The emergence and development of computers are inseparable from encoding technology, and a wide variety of encoding techniques now exist. As a foundation of computing, encoding is used throughout data transmission, storage, and processing, and its importance is self-evident. At the same time, the rise of cloud computing and big data, and the imminent arrival of the Internet of Things, bring new opportunities and challenges to encoding technology.
Specifically, encoding methods fall essentially into two categories: content encoding and reference encoding.
Content encoding is a method of digitizing or transforming the content of the object being encoded. Base64 encoding, the various data compression encodings (lossless, lossy, etc.), image encodings (JPEG, SVG, etc.), and audio/video encodings (PCM, MP3, MP4, etc.) all belong to content encoding. The digitized content of the data is contained directly in the encoding result and can be analyzed and processed by a computer. There is also a class of structured encoding techniques used to describe the structure of data; these mainly encode the content of structured data or documents. HTML, MathML, and SVG, for example, are specific structured description languages whose underlying encoding specification is the metalanguage XML. Similar encoding specifications include JSON and Protocol Buffers.
Unlike content encoding, the result of reference encoding is not the data content itself but a reference to the content, or a description of the addressing path to the object being accessed. Huffman coding is a method of building optimized reference codes for source symbols (the content itself). URLs, IP addresses, RFID tags, barcodes, QR codes, ISBNs, and postal codes are all reference codes. Notably, character encoding (especially standard character encoding) is in essence also a reference encoding: each code identifies the position of a particular character within the character encoding scheme. The sound, shape, and meaning that make up the character itself appear only in the encoding specification.
As some reference codes (as opposed to encoding methods) have been standardized, computer programs can process the codes directly without needing the content they refer to (or with that content already built into the program). ASCII and Unicode are examples of such standardized encoding systems. Such codes, and combinations of them, themselves constitute higher-level data content. Standardized character encoding is a typical example, and many of today's text-based encoding specifications (JSON, CSV, XML, etc.) are built on it.
Regarding objects and models: "object" (rendered as 物件 in Taiwan) is a term from object orientation (Object Oriented). It denotes both a specific thing in the problem space (Namespace) of the real world and a basic element in the solution space of a software system.
The OMG, a non-profit standards organization in the computing field, has defined a series of languages and standards for object modeling. The OMG divides models into four levels of abstraction: the meta-metamodel layer (M3), the metamodel layer (M2), the model layer (M1), and the runtime data objects (M0). The meta-metamodel layer contains the elements needed to define a modeling language. The metamodel layer defines the structure and syntax of a modeling language, and corresponds concretely to UML (Unified Modeling Language) or to an object-based programming language such as Java or C#. The model layer defines the model of a specific system, that is, what is usually called a class or an object model. The runtime layer contains the state of a model's objects at run time, that is, what are usually called objects or instances.
FIG. 3 is a schematic diagram of a metamodel in the prior art. As shown in FIG. 3, the Meta-Object Facility (MOF) is a standardized specification defined by the OMG for building metamodels (M2). MOF comprises a metamodeling language (the M3 model) and methods for creating and manipulating models and metamodels.
An object model has several aspects: static models that represent structure and function, and dynamic models that describe runtime behavior. This document is concerned mainly with the static model relevant to encoding, including data and interfaces.
Regarding reference codes and object identifiers: an object's identifier (ID) is in fact a kind of reference code. Within the context in which it is used, an identifier must be unique and correspond one-to-one with an object, so that the system can locate the object by its identifier.
In most cases, an object's reference code and its object identifier are the same concept, because they serve the same purpose. Sometimes, however, a reference code cannot serve as an object identifier. A reference code only guarantees that the target can be addressed correctly; it does not necessarily guarantee a one-to-one correspondence with the object, and many-to-one cases exist (one object, multiple codes). For example, one host can have multiple IP addresses, and one website can have multiple URLs.
In computer science, reflection refers to a class of applications that are self-describing and self-controlling. Such applications use some mechanism to represent (self-representation) and examine (examination) their own behavior, and can adjust or modify the state and semantics of that behavior according to its state and results.
Reflection is supported by modern software development platforms, tools, and programming languages. For example, on the Java and .NET platforms, reflection can be used at run time to obtain the metadata of running objects directly.
In the present invention, encoding and decoding are performed by an object-based encoding system. FIG. 4 is a schematic diagram of the architecture of the encoding system of the present invention. As shown in FIG. 4, the encoding system consists of three main parts: the client, the encoding server, and the data store. The encoding server and the data store together form the encoding repository.
As shown in FIG. 4, a client obtains a data object by sending its code to the encoding repository, and obtains a code by sending a new data object to the encoding repository. Inside the encoding repository, the encoding server provides the service to clients. An encoding repository can include one or more data stores, in which the actual data is kept. The encoding server sends queries to the data stores to retrieve, update, and insert data.
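The client-facing behaviour of the encoding repository can be sketched as follows. This is a minimal in-memory illustration under assumptions made here: the class names, the integer codes, and the rule for choosing a data store are hypothetical and are not taken from FIG. 4.

```python
class DataStore:
    """Stand-in for one data store (file storage, database, cloud storage, ...)."""

    def __init__(self):
        self._objects = {}

    def insert(self, code, obj):
        self._objects[code] = obj

    def fetch(self, code):
        return self._objects[code]

class EncodingServer:
    """Front end of the encoding repository; clients talk only to this."""

    def __init__(self, stores):
        self._stores = list(stores)
        self._next_code = 0

    def _store_for(self, code):
        # Hypothetical placement rule: spread codes across the stores.
        return self._stores[code % len(self._stores)]

    def encode(self, obj) -> int:
        """Send a new data object, receive its code."""
        code = self._next_code
        self._next_code += 1
        self._store_for(code).insert(code, obj)
        return code

    def decode(self, code):
        """Send a code, receive the corresponding data object."""
        return self._store_for(code).fetch(code)

repository = EncodingServer([DataStore(), DataStore()])
code_a = repository.encode({"strokes": [(0, 0), (3, 4)]})
code_b = repository.encode("second object")
```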
The encoding repository provides a centralized encoding service that lets different clients share data objects and encoding meta-objects through reference codes. Furthermore, different systems can register new encoding meta-objects with the encoding repository to meet different encoding needs. This centralized service makes data integration and exchange between systems easier. Typically, the encoding repository has a built-in data access control system that can grant different access rights to different data objects and encoding meta-objects. In particular, encoding meta-objects and data objects can be stored in different data stores and/or given different access rights. In an object-based encoding system, the encoding meta-information is stored in the encoding repository; a data object itself can reside either in the code stream (content encoding) or in the storage system of the encoding repository (reference encoding), in which case its reference code resides in the code stream. The code stream and the data objects in the encoding repository can be placed in different secure channels. This separation of information is inherently secure and also yields better encoding efficiency.
In a specific implementation, the data store can be implemented with different storage systems, such as file storage, a relational database, a NoSQL database, or cloud storage.
Specifically, the present invention proposes a new object-based encoding and decoding scheme and system, which is also an open solution. In contrast to standard encoding schemes, an object-based open encoding scheme can be entirely personal and non-standard. "Non-standard" here means that it differs from traditional standards, which an organization or institution defines first and others then use; in essence it is a de facto standard (an encoding protocol) based on the encoding repository. Such a scheme can provide more flexible and diverse data services and also more reliable protection for data.
The encoding scheme of the present invention can encode data of any type and any length, can use any code format and any code word length, and need not use fixed encoding rules; that is, the rules can change arbitrarily as needed. Fully personalized codes can thus be created. In other words, the scheme can encode any object, independently of the length of the object's data, the encoding rules, and the code word length. This goes well beyond the fixed form and limitations of existing standard encodings. The scheme can be extended arbitrarily. The same code can also be reused in different encoding processes without interference, which greatly improves code utilization.
The idea of the encoding scheme of the present invention is to create an encoding protocol for a data object from the object's metadata, and to generate the object's code according to that protocol. In other words, the present invention obtains the features or structure of a data object as part of encoding, and generates the code for the object according to those features and/or that structure.
Furthermore, with data based on existing standard character encoding schemes, every party involved in transmission, as well as every receiver and storer, has the opportunity to obtain all the information in the data. This harms confidentiality and also makes the volume of transmitted data large, increasing the load on network bandwidth and CPU; for large data transfers in particular, it reduces transmission efficiency.
Another feature of the present invention is that the data object to be transmitted is simply stored in the encoding repository with the appropriate access rights, and its reference code is obtained. Only the reference code needs to be transmitted, and only a receiver that ultimately holds access rights can obtain the complete data. This greatly reduces the volume of transmitted data while improving the security and reliability of the data.
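Transmitting only a reference code can be sketched as follows. This is an illustrative sketch under assumptions made here: the `AccessControlledRepository` class, the user names, and the random 16-character code format are hypothetical.

```python
import secrets

class AccessControlledRepository:
    """Keeps data objects and hands out short reference codes (hypothetical)."""

    def __init__(self):
        self._objects = {}   # reference code -> (data object, permitted users)

    def put(self, obj: bytes, readers) -> str:
        code = secrets.token_hex(8)              # 16 hex characters
        self._objects[code] = (obj, frozenset(readers))
        return code

    def get(self, code: str, user: str) -> bytes:
        obj, readers = self._objects[code]
        if user not in readers:
            raise PermissionError(f"{user} may not read {code}")
        return obj

repo = AccessControlledRepository()
document = b"x" * 10_000                         # large payload stays in the repository
code = repo.put(document, readers={"bob"})

# Only `code` travels through intermediaries; they hold it but cannot resolve it.
message_on_the_wire = code
```

In this sketch the message on the wire is 16 characters long no matter how large the stored object is, and a relay that sees the code gains nothing without access rights.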
Moreover, the present invention differs from conventional data encryption. Encryption normally involves no metadata: an encryption algorithm simply converts the original data into content that cannot be recognized or displayed normally. Although the present invention can achieve an effect similar to encryption, it protects data in an entirely different way, namely by using the metadata of the data object and isolating content through encoding. In addition, encrypted ciphertext is usually as large as or larger than the original plaintext, whereas the present invention needs to transmit only a very small amount of information, such as the reference code. Finally, beyond security, the approach offers further benefits for data processing. For example, and without limitation, it reduces data transmission and network load, and the flexibility of the encoding makes subsequent data processing more convenient.
Encryption also requires the key and the encrypted data to be stored or transmitted separately. However, encryption must convert the original data, by a predetermined rule or algorithm, into code or data completely different from the original so that a third party cannot easily recognize it. The present invention, by contrast, can keep the data content in its original form and still keep it secure and confidential without altering the content in any way, which conventional encryption systems cannot do.
Furthermore, encryption usually uses a single key, whereas the open system of the present invention can assign a different code to every data fragment during encoding and can set different access rights for different users, providing finer-grained security.
As noted above, because object reference codes resemble standard character codes, the basic code form of object-based encoding can be an extension of the form of standard character encoding. A standard character then becomes a special kind of object (an object number with built-in encoding metadata), and an object reference code becomes a special kind of character: a non-standard character. Unlike the prior art, the present invention can directly accept the digitized result of natural human output, divide it into separate data objects according to certain rules, and place them in the encoding repository to form non-standard characters. (In this document, a non-standard character is an object reference code based on the encoding repository; the term merely emphasizes that the data object is a data fragment obtained by splitting the digitized result of natural human output.) The system need not be concerned with the content of each character or the relationship between adjacent characters, so data can be stored and processed with the character as the basic unit, just as in existing systems based on standard text. This also leaves great room for flexibility in subsequent editing, encoding, and storage.
Preferably, the present invention can build a font library specific to a writer by assigning custom unique codes to all or fragments of the digitized result of that individual's natural output. Because no previously entered information is needed as a reference, users can build or extend their own font library at any time as they write, avoiding the need to enter a reference font library in advance as disclosed in Chinese patent CN103136769 A.
The present invention can also place object reference codes in different code spaces. With user code spaces, different users can use the same reference code to refer to different data objects in the encoding repository. Code spaces can also be divided by date, by geographic location, by department, by online session, and so on. A session code space has a very strong security property: the reference codes of the data exist only in the code space of the session, so when the session ends the code space disappears and none of the codes in it can be decoded correctly. This property can be used to achieve a "burn after reading" effect. Preferably, combining code spaces with variable-length codes greatly reduces the storage consumed by reference codes and improves the efficiency of transmission, processing, and storage.
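Code spaces can be sketched as follows. This is a minimal illustration under assumptions made here: the `CodeSpaces` class, the space identifiers, and the use of small integers as codes are hypothetical.

```python
class CodeSpaces:
    """Reference codes are unique only within their own code space."""

    def __init__(self):
        self._spaces = {}                 # space id -> list of data objects

    def open(self, space_id):
        self._spaces[space_id] = []

    def encode(self, space_id, obj) -> int:
        objects = self._spaces[space_id]
        objects.append(obj)
        return len(objects) - 1           # short code, meaningful only in this space

    def decode(self, space_id, code):
        return self._spaces[space_id][code]

    def close(self, space_id):
        # Dropping the space makes every code issued in it undecodable.
        del self._spaces[space_id]

spaces = CodeSpaces()
for space_id in ("user:alice", "user:bob", "session:42"):
    spaces.open(space_id)

alice_code = spaces.encode("user:alice", "alice's stroke data")
bob_code = spaces.encode("user:bob", "bob's stroke data")
note_code = spaces.encode("session:42", "read-once note")

first_read = spaces.decode("session:42", note_code)
spaces.close("session:42")                # session ends: "burn after reading"
```

Because codes restart in every space, they stay short, which is the storage saving the text attributes to combining code spaces with variable-length codes.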
With the rapid development of modern storage technology and the continual expansion of storage capacity, large-capacity mass storage has become feasible. Especially with the strong support of cloud storage, it is now possible to preserve the digitized content of all natural human output in its original form.
有人曾经测算过,假设某人每天不停地书写60年,其全部的手写信息存储容量也不过250GB。这对于现有的海量存储技术以及云存储技术而言,俨然是小巫见大巫了。这使得原创作品(例如小说、编曲、印谱等)的完整保留成为可能。Someone once calculated that assuming that someone writes for 60 years every day, all of their handwritten information storage capacity is only 250GB. This is a slap in the face of the existing mass storage technology and cloud storage technology. This makes complete retention of original works (such as novels, arrangements, prints, etc.) possible.
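To put the 250 GB estimate in perspective, it can be converted into a daily rate. This is only back-of-the-envelope arithmetic, assuming decimal units (1 GB = 1,000 MB) and 365 days per year; neither assumption comes from the source.

```latex
\[
\frac{250\ \text{GB}}{60 \times 365\ \text{days}}
  = \frac{250{,}000\ \text{MB}}{21{,}900\ \text{days}}
  \approx 11.4\ \text{MB per day}
\]
```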
In addition, combining the handwriting input system described earlier with the object-based encoding scheme yields the following new data processing system. The new system introduces the concept of an encoding repository: an application can not only query and use the encoding meta-objects already in the repository, but also register and use new ones. The new system overcomes the limitations of existing systems at four different levels.
First level: built-in security
In the new data processing system, text encoding is non-standardized. The text codes and the corresponding decoding information are stored separately, in the application system and in the encoding repository respectively. The encoding repository can support encoding isolation at several levels at once, such as user, application, and content. Access to and use of text content can therefore be authorized through the repository's access control management. In other words, the new data processing system has built-in security.
This security is multi-layered. Different access rights can be set for different users, different applications, different text content, and even different codes. This cannot be achieved in a traditional data processing system built on standardized text encoding.
Moreover, this security extends beyond plain text content: every application system and all data encoded with the new data processing system obtain the same protection.
Second level: comprehensive encoding capability
In existing data processing systems, a variety of general-purpose and special-purpose text formats have been created to describe general-purpose and special-purpose data structures, for example XML, JSON, CSV, and RTF. However, these formats use one and the same encoding standard for both markup and content, which places many restrictions on content text and markup text alike and makes storage and parsing comparatively inefficient. In XML, for example, characters such as ">", "<", and "&" have special meanings and cannot in general be used directly in text content. They have to be replaced by the escape sequences "&gt;", "&lt;", and "&amp;", or the text has to be protected by enclosing it in "<![CDATA[" and "]]>" or in quotation marks.
In the new data processing system, open encoding removes these restrictions entirely. Certain encoding types can be used for markup and other types for text content, and the corresponding text parser can tell from the encoding metadata which text is markup and which is content.
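The contrast between escaping and typed encoding can be illustrated with a short sketch. The XML side uses `xml.sax.saxutils.escape` from the Python standard library; the typed side is an assumption made for illustration, with the hypothetical meta codes `MARKUP` and `CONTENT` standing in for the encoding types the text mentions.

```python
from xml.sax.saxutils import escape

MARKUP = 1   # hypothetical meta code for markup tokens
CONTENT = 2  # hypothetical meta code for content tokens

text = "if a < b && b > c"

# Conventional XML: content shares one character encoding with markup,
# so reserved characters in the content must be escaped.
xml_form = "<cond>" + escape(text) + "</cond>"

# Typed encoding: every token carries its meta code, so content is stored verbatim.
typed_form = [(MARKUP, "cond"), (CONTENT, text), (MARKUP, "/cond")]


def content_of(tokens):
    """Concatenate only the tokens whose meta code marks them as content."""
    return "".join(value for kind, value in tokens if kind == CONTENT)
```

Because the parser decides by meta code rather than by scanning for reserved characters, `content_of(typed_form)` returns the original string without any unescaping step.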
At the same time, because codes in the new system are arbitrary, anything that can be encoded serially can be stored and encoded by the system, such as musical melodies, dance movements, board-game records, video subtitles, and even computer instructions. The stored result always has two parts: the data objects in the encoding repository, which may be multimedia data or proprietary data, and the resulting code sequence. Reference-encoding data objects in this way is not unique to this system; a traditional data processing system based on standardized encoding can also encode arbitrary data, but far less simply, efficiently, and naturally than an object-based encoding system.
Third level: concise and efficient
In an object-based encoding system, an object code may comprise a meta code part and an instance code part. For a given system the number of meta codes is very limited. For example, two bytes (16 bits) can represent more than 60,000 meta codes (2^16 = 65,536), which corresponds to more than 60,000 kinds of object and is enough for the vast majority of application systems. For a particular kind of object, because object codes are arbitrary, an instance code can simply be a number. For example, four bytes (32 bits) can encode more than 4 billion individual objects, and since reference codes can also be placed in different encoding spaces, 32 bits is enough for most systems. In other words, six bytes suffice to represent an object reference code in most application systems. Furthermore, with variable-length encoding and mechanisms such as a default meta code or client-side encoding, an object reference code can often be expressed in even fewer bytes. This is far more concise and efficient than current cloud storage schemes, which routinely use a dozen or even several dozen bytes to reference a data block in order to avoid collisions between blocks.
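The byte counts above can be checked with a small sketch. The fixed-width layout (2-byte meta code followed by 4-byte instance code) follows the text; the big-endian byte order and the LEB128-style variable-length scheme are assumptions chosen for illustration, since the source does not specify either.

```python
import struct


def pack_reference(meta_code, instance_code):
    """Fixed-width form: 2-byte meta code + 4-byte instance code, big-endian."""
    return struct.pack(">HI", meta_code, instance_code)


def unpack_reference(blob):
    """Inverse of pack_reference; returns (meta_code, instance_code)."""
    return struct.unpack(">HI", blob)


def encode_varint(value):
    """LEB128-style variable-length encoding: 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


fixed = pack_reference(0x0102, 300)   # always 6 bytes
compact = encode_varint(300)          # instance code only; meta code left to a default
```

With a default meta code agreed in advance, a small instance code such as 300 needs only two bytes in this variable-length form instead of the six bytes of the fixed-width form.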
In addition, in the new data processing system the data objects referred to by object reference codes can be stored in the encoding repository, which can greatly improve the storage efficiency of data objects and thus the efficiency of data transmission and processing. For example, the HTML of a web page can be re-encoded with object encoding: the elements and attributes of the standard HTML tags are encoded as objects and the related meta-information is placed in the encoding repository. The resulting web document is much smaller, which saves traffic when the page is transmitted over the network.
Fourth level: personalized text encoding
In contrast to standard text encoding schemes, the encoding scheme used by an object-encoding-based data processing system can be personalized and non-standard. This is achieved mainly by isolating context encoding spaces: different users, different applications, and so on each have their own context encoding space, and personalized codes are reached by accessing the personalized context encoding space. Each object reference code corresponds one-to-one with a data object in the encoding repository. On text input, the content of the input data object is stored in the encoding repository and its location in the repository is converted into the corresponding object reference code. On text output, the system uses the object code to find the corresponding data object content in the repository and outputs that content to a specific device.
Because an object-based encoding system is open, the digitized results of human output can be divided and encoded in any way, and any desired content can be expressed, simply by associating the content with a code. In other words, the data processing system can add kinds of data objects and their codes dynamically.
Under this system, therefore, people can enter input in the way that is closest to natural. Such input is not limited to the handwriting input described above; it can be any data stream, for example but not limited to speech, images, multimedia streams, Braille, sign language, lip language, semaphore, or even bursts that may or may not carry meaning. As input arrives, the system automatically stores the input content in the encoding repository and encodes the location of that content in the repository. Output consists of using the object reference codes to retrieve the input content from the encoding repository and replaying it naturally.
Take the handwriting input system described above as an example again. In a handwritten text input scenario, the writer writes under a natural writing constraint (such as a row or column constraint). The system divides the written content according to natural character segmentation rules (such as the composition-grid cells used for Chinese characters) or word segmentation rules (such as the spaces between words in phonetic languages), stores the shape of each resulting character or word in the encoding repository, and generates the corresponding reference code. These codes are stored in a specific layout order in the text content, that is, the collection of text codes.
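The handwriting scenario can be sketched in a few lines. This is an illustrative simplification under stated assumptions: strokes are lists of (x, y) points written left to right on a single line, words are split wherever the horizontal gap exceeds a threshold, and the names `split_into_words`, `encode_line`, `repository`, and the default `gap` value are all hypothetical.

```python
def split_into_words(strokes, gap):
    """Group strokes into words wherever the horizontal gap exceeds `gap`."""
    words, current, right_edge = [], [], None
    for stroke in strokes:
        left = min(x for x, _ in stroke)
        if current and left - right_edge > gap:
            words.append(current)      # gap is wide enough: close the current word
            current, right_edge = [], None
        current.append(stroke)
        stroke_right = max(x for x, _ in stroke)
        right_edge = stroke_right if right_edge is None else max(right_edge, stroke_right)
    if current:
        words.append(current)
    return words


repository = {}  # reference code -> stored word shape


def encode_line(strokes, gap=10):
    """Store each word shape and return the reference codes in layout order."""
    codes = []
    for word in split_into_words(strokes, gap):
        code = len(repository) + 1
        repository[code] = word
        codes.append(code)
    return codes


line = [[(0, 0), (4, 8)], [(5, 0), (9, 8)], [(30, 0), (36, 8)]]
codes = encode_line(line)
replayed = [repository[code] for code in codes]
```

No recognition takes place: the first two strokes are stored together as one word shape, the third as another, and playback simply looks the shapes up again in code order.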
As can be seen, this handwritten text input process lies between recognition-based handwriting input and non-recognition handwriting input. Like a text recognition system, it needs to segment characters and words. Unlike one, it does not need to determine the standard code corresponding to the input; what is entered is what is obtained. There is no recognition-rate problem with this method; the rate is always 100%. In this respect it is the same as a non-recognition system. Unlike a non-recognition system, however, the process divides the input content and encodes each part separately. The encoded result in the new system can therefore be treated just like ordinary text and subjected to text processing such as editing, copying, pasting, transmission, searching, and retrieval.
Similarly, a data processing system based on open encoding can also be used in input systems based on optical recognition. In the recognition of handwritten input in particular, it does not matter whether the handwriting is sloppy: an optical recognition system based on open encoding only needs to split the input image into lines and words in order to divide the image, store the parts in the encoding repository, and generate the corresponding image object reference codes. It is worth noting that, because the encoding is personalized, the data objects in the resulting encoding repository make good samples. Training on them can in turn improve the conventional text recognition rate for that particular individual.
Likewise, the data processing system is applicable to speech input systems. The input sound signal does not need to be recognized; it only needs simple processing and segmentation before it can be stored in the encoding repository and assigned the corresponding codes.
The data processing system can also be applied to other text input methods, such as Braille, lip language, sign language, and semaphore input. New text input methods can also be created on the basis of this data processing system. For example, on a device with a small touch screen, specific gestures can be designed as line-break, word-break, and end markers, and input is then entered by full-screen handwriting or by voice. The input content is divided at the word-break markers, each part is stored in the encoding repository, and the corresponding text codes are obtained. As another example, a sign language input method based on 3D gloves can be designed. The motion information of the 3D gloves is stored in the encoding repository as text content, each code corresponds to a character, and a certain time interval serves as the separator between actions. Output of the sign language consists of replaying the 3D glove motion information in the encoding repository through a three-dimensional model.
In summary, the new data processing system has the following main advantages:
First aspect: simple and natural
The new data processing system does not need to generate any particular standard encoding, so the simplest and most natural input method can be designed for ordinary users, with the result encoded directly into a personalized encoding.
Because there is no restriction from an encoding standard, users can enter any content they wish to express, including graphics, symbols, sound, video, and other multimedia data. Unlike the various traditional text recognition systems, text in the new data processing system does not need to be recognized, which keeps input uninterrupted and efficient and ensures a smooth, natural input experience.
Second aspect: secure
The new data processing system uses non-standardized, object-based reference encoding. The content cannot be understood from the text code sequence alone; the actual content behind each code must also be obtained from the encoding repository, so access control on the repository protects the data content. Moreover, because reference codes and data objects are separated, the readability and visibility of non-standard text after the code sequence has been obtained depend entirely on the security settings of the corresponding encoding repository. The encoding repository is therefore in effect an all-round cryptographic server. Further, the code sequence and the data in the encoding repository can be sent over different secure channels, which makes it much harder for a data thief to obtain all of the data. In addition, unlike traditional standardized text encoding, which is context-independent, non-standard text based on object encoding can be context-dependent. By isolating context spaces, the same code can mean different things for different people, applications, documents, times, places, and so on. Application systems, and even individual users, can register new context specifications with the encoding repository and thereby introduce new encoding spaces that isolate text codes further. Compared with traditional data processing systems, the new system is inherently secure and private.
Software vendors can store the encoded non-standard text for users and can process it further, for example by searching or analyzing it, but they cannot understand the actual non-standard text content. Likewise, the encoding repository provider can analyze, process, and even recognize the content in the repository, but because it does not hold the final ordering of the object reference codes, the non-standard text content is unknown to it as well. Only users who have access rights to both the application system and the encoding repository can obtain the complete text content. For a network application with authorized access, therefore, a user must hold two permissions at once, the application permission and the encoding repository permission, in order to obtain the complete non-standard text.
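The two-permission requirement can be made concrete with a small sketch. It assumes, purely for illustration, that each side keeps a simple set of authorized user names; the function `read_document` and the variables `app_acl` and `repo_acl` are hypothetical.

```python
def read_document(user, app_acl, repo_acl, code_sequence, repository):
    """Return decoded content only if the user holds both permissions."""
    if user not in app_acl:
        raise PermissionError("no access to the application's code sequence")
    if user not in repo_acl:
        raise PermissionError("no access to the encoding repository")
    return [repository[code] for code in code_sequence]


repository = {1: "hello", 2: "world"}   # held by the repository provider (no ordering)
code_sequence = [2, 1, 2]               # held by the application (no content)
app_acl = {"alice", "bob"}
repo_acl = {"alice"}

alice_view = read_document("alice", app_acl, repo_acl, code_sequence, repository)
try:
    read_document("bob", app_acl, repo_acl, code_sequence, repository)
    bob_allowed = True
except PermissionError:
    bob_allowed = False
```

Neither holder alone can reconstruct the text: the application sees only the order of the codes, and the repository sees only unordered content.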
Because object-based encoding is open, data content that needs protection (including traditional standardized text codes) can also be re-encoded directly, and the repository's authorized access service can apply dedicated controls to these special codes, thereby encrypting specific text codes under specific conditions. The conditions can be rules based on context (time, place, environment, user, application, and so on), which enables sophisticated and flexible text encoding security.
On top of context-aware security, the encoding repository can also provide services for user or system identity authentication and digital rights protection.
Third aspect: open
From object reference codes to non-standard text content, and from encoding services to non-standard text services, the object-based encoding data processing system is fully open. Any data object can be placed in the encoding repository and its reference code recorded as non-standard text. Software vendors can register new context object specifications, new encoding spaces, new encoding meta-objects, and new data objects with the system, and can also add new encoding services and new non-standard text services (including new systems for non-standard text input and output, non-standard text editing, and so on).
At the same time, because the new data processing system offers a more efficient and more secure solution for general text data (both non-standard and standardized text), it can be used to build models for any specific domain. That is, different application systems can use the object encoding data processing system to encode their domain models and deploy the codes in the encoding repository. The application system and its data object content then not only gain the advantages of the new data processing system, such as efficiency and security, but can also make full use of the various text services to process their data.
Fourth aspect: flexible
In non-recognition handwriting applications, people can enter arbitrary text and graphics; sound recording software can record a person's speech; video recording software can record a person's movements (including sign language). Unlike these full-content recording systems, the new data processing system divides the same content, stores the parts separately, and encodes them. In the process, the system can filter out useless information directly and keep only the information that matters, for example by removing noise from audio or noise specks from scanned text. In addition, with a content normalization service, repeated content does not have to be stored more than once, which greatly reduces storage space and speeds up transmission. More importantly, existing text processing infrastructure and tools can be used to process the text-encoded content produced by the new data processing system, for example searching, indexing, and editing.
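One way to picture the content normalization service is a repository that stores each distinct content only once. The sketch below is an assumption-laden simplification: it treats content as byte strings and detects repeats by SHA-256 digest, whereas real handwriting or audio would need some form of similarity matching that the source does not describe. The class name `NormalizingRepository` is hypothetical.

```python
import hashlib


class NormalizingRepository:
    """Stores each distinct content once and hands out small integer codes."""

    def __init__(self):
        self._code_by_digest = {}
        self._content_by_code = {}

    def store(self, content: bytes) -> int:
        """Return the existing code for repeated content, or assign a new one."""
        digest = hashlib.sha256(content).digest()
        code = self._code_by_digest.get(digest)
        if code is None:
            code = len(self._content_by_code) + 1
            self._code_by_digest[digest] = code
            self._content_by_code[code] = content
        return code

    def load(self, code: int) -> bytes:
        return self._content_by_code[code]

    def __len__(self):
        return len(self._content_by_code)


repo = NormalizingRepository()
codes = [repo.store(w) for w in [b"the", b"cat", b"the", b"hat", b"the"]]
decoded = [repo.load(c) for c in codes]
```

Five inputs produce five codes but only three stored objects, and the code sequence still decodes to the original order.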
Flexibility also shows in encoding deployment and access control. Flexible deployment means that the same encoding type can be selectively configured into different encoding spaces and so be given different security levels and visibility. Flexible access control means that a user or an application administrator can configure access to object codes very flexibly through the repository's access control settings. On the one hand, access control can be configured at different encoding levels: an encoding space, encoding metadata, or even a specific data object. On the other hand, access control can depend on different conditions, such as time, place, user, application, or the state of the domain model.
Fifth aspect: efficient
In a networked environment, storing the codes and the content of data objects separately ensures efficient storage and transmission. The content of a data object needs to be transferred from the encoding repository to the consumer only when it is actually needed.
In a non-standard text processing system, the unrecognized data object content produced by the new data processing system makes good personalized training samples for recognition. A text recognition system trained on them can convert personalized non-standard text into the corresponding standard codes more efficiently.
In a non-standard text data processing system, the format information of text can be stored in the encoding repository. Text formatting characters use non-standard codes, so the text data can use any standard characters without escaping, which makes the transmission and processing of text data efficient.
Further, the new data processing system is significant mainly in the following respects:
First aspect: promoting the spread and deepening of personal computing
The new data processing system makes traditional, near-natural ways of entering text possible and solves the problem many people have that "entering text into a computer is hard". A secure and natural data processing system is easier for ordinary people to accept. Entering text into a computer no longer depends on a person's educational background or familiarity with the keyboard, which promotes the spread and deepening of personal computing.
Second aspect: promoting the spread and deepening of cloud computing
In recent years, more and more Internet applications and services have moved to cloud computing, a computing model based on on-demand consumption and dynamic allocation. For cloud-based systems, however, and public clouds in particular, security is a challenge that cannot be ignored. In the new data processing system, separating the codes of data objects from their content can greatly raise the security level of the system. As long as the encoding repository is deployed inside the enterprise firewall, the enterprise can use public-cloud applications and services with confidence, and can also allow its employees to use their personal mobile devices freely within the enterprise. To anyone outside the firewall, all enterprise data stored in the public cloud is meaningless "gibberish". Similarly, a family or an individual only needs to keep the family or personal encoding repository secure for the information stored in the public cloud to be safe and reliable. Here the encoding repository plays the role of a codebook. This high level of security can accelerate the adoption of public cloud services by enterprises and individuals.
Third aspect: promoting the development and spread of the Internet of Things
The Internet of Things combines intelligent sensing, identification, and pervasive computing technologies and has been called the third wave of the information industry after the computer and the Internet. The Internet of Things is an extension of the Internet. On the one hand, it has an urgent need for object addressing codes and identifiers at all three of its layers, namely the perception layer, the network layer, and the application layer. Its huge number of nodes, wide variety of node types, and limited processing capability pose great challenges for such encoding, and no common standard has yet emerged. A concise and flexible object encoding mechanism can meet these needs well.
On the other hand, the large number of sensors in the perception layer need to record and store the data they sense, and object encoding can provide effective encoding and storage support for this.
Fourth aspect: promoting cultural preservation and transmission
More than seven thousand languages are in use around the world, and dialects are too numerous to count. Unicode covers only a few hundred of them. Under existing computer data processing systems, many languages and scripts are difficult to enter into a computer. In the new data processing system, there is almost no restriction on the use of languages and scripts (for handwritten text, the layout direction is the only restriction and must be specified in advance). People can store arbitrary non-standard text directly in a computer system or use it to communicate with others through a computer. This removes the unreasonable "standardize first, use later" constraint of existing computer text.
Keyboard input of computer text has left many people "forgetting the characters when they pick up a pen". The new data processing system preserves humanity's original tradition of writing by hand.
Fifth aspect: benefiting environmental protection
The new data processing system makes entering and using text directly on electronic devices more natural, convenient, and secure. This favors a paperless environment and ultimately reduces paper consumption.
The encoding processing method and the decoding processing method provided by the following embodiments of the present invention can be implemented on the basis of the encoding system described above. The technical solution of the present invention is described in further detail below with reference to the drawings and specific embodiments.
FIG. 5C is a flowchart of Embodiment 1 of an encoding processing method provided by the present invention. As shown in FIG. 5C, the method of this embodiment is executed by an encoding system and includes:
Step 101C: obtain the data object to be encoded and its metadata according to the received encoding processing request.
In this embodiment, obtaining the metadata of an object mainly means obtaining the object's encoding metadata. The encoding metadata can be a subset of the metadata or all of it, for example but not limited to the object's type, the corresponding data structure, and storage and transmission constraints and controls. The metadata of objects is the foundation of this system and must be extracted from the data in some way. On modern software platforms such as Java and .NET, the reflection mechanism can obtain an object's metadata automatically.
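The text names Java and .NET reflection; the same idea can be shown in Python, which also exposes type information at run time. The sketch below is illustrative only: the `Stroke` class and the `extract_metadata` helper are hypothetical, and the metadata collected (type name and field names with their declared types) is just one possible subset.

```python
from dataclasses import dataclass, fields


@dataclass
class Stroke:
    """Hypothetical data object used only for this illustration."""
    points: list
    pressure: float = 1.0


def extract_metadata(obj):
    """Collect the type name and the declared fields through reflection."""
    return {
        "type": type(obj).__name__,
        "fields": {
            f.name: getattr(f.type, "__name__", str(f.type))
            for f in fields(obj)
        },
    }


meta = extract_metadata(Stroke(points=[(0, 0), (1, 1)]))
```

Nothing about `Stroke` is hard-coded in `extract_metadata`, so the same helper works for any dataclass instance handed to the encoder.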
In addition, in this embodiment, the data object (also referred to herein simply as the object) is the basic unit of data processing in the present invention, that is, the target object that the invention encodes. It can take any form: a single character, word, or symbol, or a part of one; audio, video, a multimedia stream, or a fragment thereof; or an encoding itself, a document, and so on. A data object includes at least a metadata portion (the metadata) and usually also a content data portion, which is what remains of the data object once the metadata has been stripped away; it is also called the content of the data object, the data content, or the content data. The content data may or may not be related to the metadata portion.
Metadata is data about a data object: a description of the object's characteristics, attributes, intrinsic logical relationships, and/or structure. Metadata may reside inside the data, exist independently of the data, accompany the data, or be combined with the data. It may include, for example, the object's type, creation and/or modification dates, version history, data structure, interfaces, storage constraints, transmission constraints, encoding constraints, and encoding context constraints. Specific examples of metadata include, but are not limited to: the description of an assembly; identity (name, version, culture, public key); exported types; other assemblies on which the assembly depends; security permissions required to run; the description of a type; name, visibility, base class, and implemented interfaces; members (methods, fields, properties, events, nested types); attributes; other descriptive elements that decorate types and members; the header and/or structure information of a table; the palette in an image file; and so on.
Metadata differs from one data object to another. For example, the metadata portion of a data object is called the metadata of the data object, while the metadata portion of the encoding object mentioned later may be called the encoding metadata. The ability to obtain or add the metadata of a data object at run time is the basis on which the system encodes data objects.
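By way of non-limiting illustration, the split of a data object into a mandatory metadata portion and an optional content data portion can be sketched as follows. The class name `DataObject`, the helper `add_metadata`, and the sample keys and values are assumptions made for this sketch and do not appear in the original disclosure.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class DataObject:
    """A data object: mandatory metadata plus optional content data."""
    metadata: Dict[str, Any]           # data about the object (type, dates, constraints, ...)
    content: Optional[bytes] = None    # what remains once the metadata is stripped away


def add_metadata(obj: DataObject, key: str, value: Any) -> None:
    """Attach or overwrite one metadata entry at run time."""
    obj.metadata[key] = value


# Hypothetical example: a handwritten glyph whose owner is recorded after creation.
glyph = DataObject(metadata={"type": "glyph"}, content=b"\x01\x02")
add_metadata(glyph, "owner", "alice")
```

Keeping the metadata in an open dictionary reflects the statement that metadata can be obtained or added at run time; a fixed schema would serve equally well where the set of metadata fields is known in advance.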
Step 102C: Obtain the object encoding of the data object according to the encoding repository, the data object, and its metadata.
In this embodiment, the data object to be encoded and its metadata are obtained according to the received encoding request, and the object encoding of the data object is obtained according to the encoding repository, the data object, and its metadata. Because the data object is encoded on the basis of its metadata and the encoding repository, flexible and diverse encoding schemes become possible.
Further, by way of example, FIG. 5D is a flowchart of one specific implementation of step 102C in FIG. 5C. As shown in FIG. 5D, step 102C may be implemented as follows:
Step 102C1: Select or create an encoding specification according to the encoding repository and at least part of the metadata, and generate the meta-encoding corresponding to the metadata according to the encoding specification.
In this embodiment, the metadata relevant to the subsequent encoding process may be further selected from the metadata according to a predetermined extraction rule, and the corresponding encoding specification may then be created or generated from the selected metadata.
In addition, an encoding specification is selected or created on the basis of the metadata extracted from the object, and is saved. This encoding specification is then used to generate the encoding for the object. A default encoding specification may also be configured for the system, in which case encoding and decoding only require selecting it, and no new encoding specification needs to be created. The user may select or create part or all of the encoding specification interactively. Note that an encoding specification generated during encoding may either be destroyed automatically once encoding is complete (that is, on leaving the encoding factory) or be retained.
An encoding specification may be added or created when the object is modeled or while the application system is running, and this may be done either automatically by rule or interactively.
An encoding specification mainly comprises the encoding mode of the object and the encoding constraints on the object's internal structure.
Step 102C2: Encode the data content of the data object according to the encoding specification to obtain the instance encoding, and obtain the object encoding corresponding to the data object from the meta-encoding and the instance encoding.
The object encoding takes either the reference-encoding form or the content-encoding form.
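Steps 102C1 and 102C2 can be sketched, purely as an illustration, as follows. The class `EncodingRepository`, the function `encode_object`, the use of the metadata key `"type"` as the encoding-relevant part of the metadata, and the choice of JSON for serialization are all assumptions of this sketch; the disclosure does not prescribe any of them.

```python
import json


class EncodingRepository:
    """Hypothetical in-memory stand-in for the encoding repository."""

    def __init__(self):
        self._specs = []   # registered encoding specifications

    def select_or_create_spec(self, metadata):
        # Only the encoding-relevant part of the metadata is used (assumed: the type).
        spec = {"type": metadata["type"], "mode": "content"}
        if spec not in self._specs:
            self._specs.append(spec)             # create a new specification
        meta_code = self._specs.index(spec)      # meta-encoding: a reference to the spec
        return spec, meta_code


def encode_object(repo, metadata, content):
    spec, meta_code = repo.select_or_create_spec(metadata)   # step 102C1
    instance_code = json.dumps(content).encode("utf-8")      # step 102C2: instance encoding
    return meta_code, instance_code                          # together: the object encoding
```

In this sketch the object encoding is simply the pair of meta-encoding and instance encoding; a second request with the same type selects the existing specification instead of creating a new one.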
Further, as FIG. 3 shows, the encoding system mainly comprises an encoding repository and a client. Its encoding flow can be implemented in two ways, detailed below:
First implementation:
Step 1a: The client obtains the data object to be encoded and its metadata according to the received encoding request.
Step 2a: The client sends the data object to be encoded and its metadata to the encoding repository.
Step 3a: The encoding repository selects or creates an encoding specification according to at least part of the metadata, and generates the meta-encoding corresponding to the metadata according to the encoding specification.
In this embodiment, the object encoding specification (or simply the encoding specification) is the set of rules and constraints governing how a data object is encoded and decoded. It may include the encoding mode of the data object (content encoding, reference encoding, or a mixture of the two), the encoding constraints on the object's metadata (such as the serialization scheme for the relevant data, word length, byte order, and data alignment), and so on. The object encoding specification may itself be part of the data object's metadata.
An object encoding specification may be added at modeling time, either manually (by the modeler) or automatically (by a tool), or at run time, either interactively (by the user) or automatically (by system policy).
Encoding metadata is the metadata relevant to encoding and decoding a data object. It may be part or all of the metadata. The encoding metadata of a data object is the basis on which the system encodes and decodes that object.
Step 4a: The encoding repository encodes the data content of the data object according to the encoding specification to obtain the instance encoding, and obtains the object encoding corresponding to the data object from the meta-encoding and the instance encoding.
In this embodiment, the data object and its metadata are stored in the encoding repository. The object encoding produced by the encoding repository is in fact the reference encoding of the data object within that repository.
Step 5a: The client receives the object encoding returned by the encoding repository.
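The first implementation, in which the repository stores the object and returns a reference encoding, can be illustrated with the following minimal sketch. The class `ReferenceRepository`, its methods, and the use of a running integer as the reference encoding are assumptions of this sketch only.

```python
import itertools


class ReferenceRepository:
    """Hypothetical repository that stores objects and hands out reference encodings."""

    def __init__(self):
        self._objects = {}
        self._next_id = itertools.count(1)

    def encode(self, data_object, metadata):
        ref = next(self._next_id)                    # the reference encoding
        self._objects[ref] = (data_object, metadata) # object and metadata stay in the repository
        return ref

    def decode(self, ref):
        # Decoding a reference encoding is a lookup in the repository.
        return self._objects[ref]


repo = ReferenceRepository()
code = repo.encode(b"stroke data", {"type": "glyph"})   # steps 2a to 4a
```

The client only ever holds `code` (step 5a); the data object itself remains in the repository and is recovered by lookup.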
Second implementation:
Step 1b: The client obtains the data object to be encoded and its metadata according to the received encoding request.
Step 2b: The client queries the encoding repository in order to select or create an encoding specification according to at least part of the metadata, and generates the meta-encoding corresponding to the metadata according to the encoding specification.
In this embodiment, the client submits an encoding request to the encoding server in the encoding repository and obtains the meta-encoding corresponding to the encoding meta-object (in fact, the reference encoding of the encoding meta-object within the encoding repository).
Optionally, the meta-encoding may include one of the following, or a combination and/or nesting of several of them: type encoding, space encoding, and context encoding.
Step 3b: The client encodes the data content of the data object according to the encoding specification to obtain the instance encoding, and obtains the object encoding corresponding to the data object from the meta-encoding and the instance encoding.
In this embodiment, the instance encoding in step 3b is produced in one of two ways, matching the two forms of object encoding (content encoding and reference encoding). For the content-encoding form, the encoding client serializes the content of the data object directly into the instance encoding according to the encoding specification. For the reference-encoding form, the encoding client sends an encoding request to the encoding server; the encoding server obtains the data object, the encoding specification, and related information from the request, stores the data object in the encoding repository accordingly, generates the instance encoding, and returns it to the client.
Correspondingly, decoding an object encoding is the inverse of encoding it. In general, the encoding server obtains the object encoding to be decoded from the encoding client's decoding request, uses that encoding to locate the data object in the encoding repository, and returns it to the client.
In particular, for an object encoding obtained through the multi-step process, the encoding client parses the object encoding into a meta-encoding and an instance encoding according to preset rules and sends a request to decode the meta-encoding to the encoding server. Having obtained the corresponding encoding meta-object, the client decodes the instance encoding according to the encoding specification and related information in the encoding meta-object and, combining the result with the encoding meta-object, obtains the corresponding data object.
The decoding of the instance encoding likewise takes one of two forms, matching the two forms of object encoding (content encoding and reference encoding). For the content-encoding form, the encoding client can decode the instance encoding directly into the content of the data object according to the encoding specification. For the reference-encoding form, the encoding client sends an instance-decoding request to the encoding server; the encoding server obtains the instance encoding, the encoding specification, and related information from the request, locates the data object in the encoding repository, and returns it to the client.
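The two decoding paths described above can be sketched as follows, as an illustration only. The function `decode_object`, the dictionaries standing in for the encoding server (`meta_objects`, `stored_objects`), the `"mode"` field, and JSON as the serialization format are assumptions of this sketch.

```python
import json

CONTENT, REFERENCE = "content", "reference"


def decode_object(object_code, meta_objects, stored_objects):
    meta_code, instance_code = object_code   # parse the object encoding by a preset rule
    spec = meta_objects[meta_code]           # resolve the encoding meta-object
    if spec["mode"] == CONTENT:
        return json.loads(instance_code)     # content form: decoded directly by the client
    return stored_objects[instance_code]     # reference form: located in the repository
```

For example, with `meta_objects = {0: {"mode": CONTENT}, 1: {"mode": REFERENCE}}`, the object encoding `(0, '"hi"')` is deserialized locally, while `(1, 7)` is resolved by looking up entry 7 in the stored objects.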
In addition, when decoding an object encoding, the system first obtains the encoding metadata and then obtains the corresponding content encoding according to that metadata. Specifically, the encoding metadata may include the encoding type information used to locate, load, or transmit the encoded content, as well as constraint information on the target encoding space to which the encoding belongs. Encoding the encoding metadata yields the meta-encoding. In fact, what a meta-encoding stands for in the encoding repository is mainly the encoding meta-object. A meta-encoding is generally a component of the encoding. Once the decoder has parsed the meta-encoding out of the encoding, it can obtain the corresponding encoding metadata by a suitable mechanism.
It is worth noting that, within an encoding system, the encoding metadata can itself be treated as a data object, namely a data object whose content is the encoding metadata. Such an object may be called an encoding meta-object, and it can have a meta-encoding of its own. The encoding metadata, as a kind of data object, can therefore have an encoding of its own metadata, called the meta-meta-encoding.
Preferably, FIG. 6 shows the relationships among data objects, metadata, encoding specifications, and encoding meta-objects. As shown in FIG. 6, an encoding meta-object is also a data object (relative to an ordinary data object, it is an object at abstraction level M1), and the model formed by its metadata (abstraction level M2) is called the encoding metamodel. The encoding metadata of an encoding meta-object is part of the encoding metamodel.
The encoding metamodel is the cornerstone of the object encoding system. In general it is relatively stable at run time: it seldom changes dynamically, although it can be extended. In other words, the encoding metadata of encoding meta-objects is built into the system, so the system can store, transmit, encode, and decode these encoding meta-objects directly.
An object encoding system may correspond to a single core encoding metamodel (which may provide an extension mechanism). FIG. 7 is a schematic diagram of this core encoding metamodel.
Does a meta-encoding, being the object encoding of an encoding meta-object, also have a meta-encoding of its own? That depends on the specific design of the encoding metamodel and of the encoding and decoding methods. If the encoding metamodel contains only one kind of encoding meta-object, the meta-encoding covers that encoding meta-object in its entirety. If the metamodel contains several kinds of encoding meta-objects and they can all be encoded into the same meta-encoding at once, no meta-encoding of the meta-encoding is needed either. Otherwise, a meta-encoding of the meta-encoding is needed to tell them apart. Sometimes the encoding meta-objects form a hierarchy, in which case multi-level decoding may be required to reach the encoding meta-object of the final data object.
In general, variable-length encoding expresses such a meta-object hierarchy more directly and flexibly and is easy to process: each code word is the meta-encoding of the code word that follows it, so multiple levels can be nested.
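The nesting described above, in which each code word provides the context for interpreting the next, can be illustrated with a minimal sketch. The function `resolve`, the nested-dictionary representation of the hierarchy, and the sample code words are assumptions of this sketch.

```python
def resolve(words, root):
    """Walk a sequence of code words; each word selects the context for the next one."""
    node = root
    for word in words:
        node = node[word]   # the current word acts as the meta-encoding of what follows
    return node


# Hypothetical three-level hierarchy: owner word -> type word -> instance word.
root = {0x01: {0x10: {0x2A: "glyph 42 of user 1"}}}
```

Because decoding proceeds one word at a time, the number of levels need not be fixed in advance, which is the flexibility attributed to variable-length encoding above.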
Specifically, FIG. 8 shows a conceptual model of the object encoding, the meta-encoding, and the instance encoding (that is, the object encoding with the meta-encoding part removed), together with the data object and the encoding meta-object. As shown in FIG. 8, it illustrates the following relationships:
1. An encoding meta-object can also serve as a data object.
2. A meta-encoding can itself serve as an object encoding.
3. Data objects and encoding meta-objects are associated with each other.
4. An object encoding comprises a meta-encoding and an instance encoding.
5. An object encoding is associated with its data object; this implies the same correspondence between a meta-encoding and its encoding meta-object (it follows mainly from relationships 1 and 2 above).
As an example of a meta-encoding that contains several kinds of encoding meta-objects, FIG. 9 shows a sample meta-encoding in this embodiment. As shown in FIG. 9, the object encoding is a 128-bit fixed-length encoding, and the encoding metamodel contains only two kinds of encoding meta-objects: the owner of the object and the object type. They may be related or unrelated, depending on the definition in the encoding metamodel, and the encoding logic differs between the two cases.
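A 128-bit fixed-length object encoding carrying an owner and an object type can be sketched as follows. The text does not state how the 128 bits are divided; the split into a 32-bit owner field, a 32-bit type field, and a 64-bit instance field, as well as the function names `pack` and `unpack`, are assumptions of this sketch.

```python
OWNER_BITS, TYPE_BITS, INSTANCE_BITS = 32, 32, 64   # assumed split of the 128 bits


def pack(owner, type_id, instance):
    """Combine the two meta-encoding fields and the instance encoding into one integer."""
    for value, bits in ((owner, OWNER_BITS), (type_id, TYPE_BITS), (instance, INSTANCE_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in its allotted bits")
    return (owner << (TYPE_BITS + INSTANCE_BITS)) | (type_id << INSTANCE_BITS) | instance


def unpack(code):
    """Recover (owner, type_id, instance) from a packed 128-bit code."""
    instance = code & ((1 << INSTANCE_BITS) - 1)
    type_id = (code >> INSTANCE_BITS) & ((1 << TYPE_BITS) - 1)
    owner = code >> (TYPE_BITS + INSTANCE_BITS)
    return owner, type_id, instance
```

With fixed field positions, the owner and type fields are encoded independently of each other; if the metamodel defines them as related, the type field would instead be interpreted relative to the owner.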
As another example, FIG. 10 shows a similar case in which the encoding meta-objects are related level by level (a variable-length encoding with a 16-bit word length).
Further, FIG. 11 is a schematic diagram of the metamodel for that encoding. As shown in FIG. 11, there are two kinds of encoding meta-objects: user and encoding type. An encoding type may have one owner (01) or no owner (00), so both of the encoding forms above are valid. An object encoding whose meta-encoding consists of the type encoding alone corresponds to a data object without an owner; the other form represents a data object that has an owner.
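A variable-length encoding built from 16-bit words, with an optional owner, can be sketched as follows. Only the flag values 01 (owner present) and 00 (no owner) come from the text; the word layout (flag in the top two bits of the type word, followed by an optional owner word and then the instance word) and the function names are assumptions of this sketch and may differ from the layout in FIG. 10.

```python
import struct

HAS_OWNER, NO_OWNER = 0b01, 0b00   # flag values from the text; their position is assumed


def encode(type_id, instance, owner=None):
    """Emit 2 or 3 big-endian 16-bit words; type_id is assumed to fit in 14 bits."""
    flag = HAS_OWNER if owner is not None else NO_OWNER
    words = [(flag << 14) | type_id]   # type word: 2 flag bits + 14 type bits
    if owner is not None:
        words.append(owner)            # owner word is present only when flagged
    words.append(instance)
    return struct.pack(f">{len(words)}H", *words)


def decode(data):
    """Return (type_id, instance, owner); owner is None for the ownerless form."""
    words = struct.unpack(f">{len(data) // 2}H", data)
    flag, type_id = words[0] >> 14, words[0] & 0x3FFF
    if flag == HAS_OWNER:
        return type_id, words[2], words[1]
    return type_id, words[1], None
```

The ownerless form occupies two words and the owned form three, so both forms are valid encodings and the decoder distinguishes them from the flag alone.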
In this embodiment, the meta-encoding is generated from the metadata and the encoding specification, and the instance encoding is generated from the data content. An encoding factory can be used to carry out these steps. The encoding factory is another important component of the system; it may be created dynamically by the encoding repository, or it may exist across components or across systems. An encoding factory can provide encoding and decoding services directly for the relevant objects.
The encoding repository can provide two important groups of services: registration of and access to encoding metadata, and encoding and decoding of object reference encodings.
The encoding repository may also use an external storage service to store encoding metadata, object data, and the like.
The final object encoding is generated from the meta-encoding and the instance encoding according to predetermined rules. The two may be combined in any manner, for example by concatenation or by some operation, provided that they can be separated and recovered during decoding. The object encoding may be generated on the client side or automatically by the encoding factory, depending on the actual design. The final object encoding may also include a code that identifies how the meta-encoding and the instance encoding were combined or concatenated. If necessary, this code may be stored separately from the object encoding, each under its own secure channel with its own access rights, so that only an authorized and authenticated party can obtain both the object encoding and the code identifying the combination, and thus correctly separate the meta-encoding from the instance encoding during decoding.
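One reversible way of combining a meta-encoding with an instance encoding is concatenation with a length prefix, sketched below. The one-byte length prefix and the function names `combine` and `split` are assumptions of this sketch; the text permits any combination rule that can be undone during decoding.

```python
def combine(meta_code: bytes, instance_code: bytes) -> bytes:
    """Concatenate the two parts, prefixed by the meta-encoding length (assumed 1 byte)."""
    if len(meta_code) > 255:
        raise ValueError("meta-encoding too long for a 1-byte length prefix")
    return bytes([len(meta_code)]) + meta_code + instance_code


def split(object_code: bytes):
    """Undo combine(): return (meta_code, instance_code)."""
    n = object_code[0]
    return object_code[1:1 + n], object_code[1 + n:]
```

Here the length prefix plays the role of the code that identifies how the two parts were joined; storing it apart from the rest of the object encoding, as the text suggests, would prevent a party without access to it from separating the two parts.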
In this embodiment, the content data may be the application object itself or the location or index information of the application object. In the latter case, the data access component of the application system can use the content data to obtain the application data by some route or algorithm, and thereby obtain the final application object.
Preferably, the content of the data object may be stored in a third-party storage system that interfaces with the encoding repository, in which case the encoding repository needs to store the information required to access the data object in the third-party storage system.
In this embodiment, the process of encoding a data object is called object-based encoding. Data serialization, or simply serialization, is the process of content-encoding data. The metadata and the content data of a data object must ultimately be serialized and either kept within the result of the object-based encoding (the content-encoding mode) or kept in storage outside the result (the reference-encoding mode). Moreover, during encoding and decoding, both the content and the metadata of a data object must be serialized before they can be transmitted within the system.
In fact, the serialization of a data object, that is, content encoding itself, can also be built entirely on the object-based encoding method. The key is that the encoding metadata is stored in the encoding repository by this method, yielding the reference encoding of the corresponding encoding meta-object, namely the meta-encoding. With the encoding metadata behind the meta-encoding available, the subsequent serialization of the data object proceeds smoothly. Object-based reference encoding can therefore be said to be the foundation of this method. On that foundation, encoding meta-objects are reference-encoded to obtain meta-encodings; on the basis of meta-encodings, a data object can be either reference-encoded or serialized (content-encoded). When performing reference encoding, it is preferable first to obtain the content encoding of the data object (applying the method to itself), transmit the content encoding to the encoding repository for storage, and only then obtain the reference encoding.
In this embodiment, object encoding means the encoding of an arbitrary object. The object may be an entity object such as data, content information, an image, or speech (for which reference encoding is generally suitable); a value object such as a date (for which instance encoding is generally suitable); or a high-level object with internal object structure, such as an array object, a table object, or a tree/document object. The object encoding is one of the outputs of the system when it encodes an object and one of the inputs when it decodes one.
For example, FIG. 12 is a schematic diagram of the conceptual model of the object encoding. As shown in FIG. 12, an object encoding may comprise two parts: the meta-encoding and the instance encoding. The meta-encoding is the encoding of the encoding meta-object and is generally a component of the object encoding. Once the decoder has parsed the meta-encoding out of the encoding, it can obtain the corresponding encoding metadata by a suitable mechanism. Content encoding is the encoding of the data content under the corresponding encoding constraints.
FIG. 13 is a flowchart of Embodiment 2 of an encoding processing method provided by the present invention. Building on the embodiment shown in FIG. 5C, and as shown in FIG. 13, the method of this embodiment further includes:
Step 201C: Set access rights for the data in the encoding repository.
In this embodiment, the data may be metadata, data objects, and so on. Optionally, the metadata includes one or a combination of the following:
the type of the data object, its creation time, its modification time, its version history, its data structure, its interfaces, its storage constraints, its transmission constraints, and its encoding constraints (including constraints on the encoding space).
Further, the method may also include:
Step 202C: Send the object encoding to the target client.
FIG. 14 is a flowchart of Embodiment 3 of an encoding processing method provided by the present invention. Building on the embodiment shown in FIG. 5D, and as shown in FIG. 14, step 102C2 may be implemented as follows:
Step 301C: Obtain a context object.
Step 302C: Obtain the corresponding encoding space according to the context object and the encoding specification.
Step 303C: Encode the data content of the data object in that encoding space to obtain the instance encoding.
Step 304C: Obtain the object encoding corresponding to the data object from the meta-encoding and the instance encoding.
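Steps 301C to 304C can be sketched as follows, for illustration only. The class `EncodingSpace`, the function `encode_in_context`, the `"context_keys"` entry of the encoding specification, and the use of sequential integers as instance encodings are assumptions of this sketch.

```python
class EncodingSpace:
    """Hypothetical encoding space: assigns codes independently of other spaces."""

    def __init__(self):
        self._codes = {}

    def encode(self, content):
        # Reuse the code if the content is already known, otherwise allocate the next one.
        return self._codes.setdefault(content, len(self._codes))


def encode_in_context(spaces, context, spec, content, meta_code):
    key = tuple(context[name] for name in spec["context_keys"])   # 301C: context object
    space = spaces.setdefault(key, EncodingSpace())               # 302C: its encoding space
    instance_code = space.encode(content)                         # 303C: instance encoding
    return meta_code, instance_code                               # 304C: object encoding
```

Because every context has its own space, the same instance encoding can stand for different content in different contexts: with `spec = {"context_keys": ("user", "app")}`, content "A" for user u1 and content "B" for user u2 can both receive instance encoding 0.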
In this embodiment, the encoding repository may be a store that holds encoding metadata, encoding meta-objects, and object data, and it may also provide various related services. Much like the font library in a system based on standardized encodings, the glyphs corresponding to character encodings in the handwriting input system of the present invention can also be stored in the encoding repository. FIG. 15 is a schematic diagram of glyphs for non-standard character encodings stored in the encoding repository in the handwriting input system of this embodiment. As shown in FIG. 15, by accessing the glyph information in the encoding repository, an application that uses the new data processing system can render any character glyph.
Unlike a traditional font library, however, the encoding repository stores more than glyph information. The new data processing system adopts a solution based on open object encoding: it can encode graphics, speech, and other multimedia data, as well as data from different domains. The metadata of these encodings is also stored in the encoding repository. An application system can not only query and use the encodings in the repository but also register new kinds of encoding with it and submit encoded data to it.
FIG. 16 is a diagram of the core concepts of the encoding metamodel of an exemplary context-dependent object encoding system. As shown in FIG. 16, it illustrates the relationships among some of the core concepts in the metamodel. These concepts are defined below.
An encoding space is a logical space that isolates object encodings. In different encoding spaces, instance encodings of the same object type correspond to different objects. An encoding space is directly associated with one or several encoding objects (only one in the encoding metamodel above); this object (or these objects) is called the direct context of the space and of the encoding objects within it. The encoding space, in turn, is called the encoding space of this object (or these objects).
The encoding space of an encoding object that lies within an encoding space is called a subspace, and the enclosing encoding space is the parent space of that subspace. An encoding space with no parent is called a root space. The root space is generally the encoding space of the encoding repository.
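The parent, subspace, and root relationships defined above can be illustrated with a minimal sketch. The class `Space`, its methods, and the sample space names are assumptions of this sketch.

```python
class Space:
    """Hypothetical encoding space with an optional parent space."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent   # None marks a root space

    def is_root(self):
        return self.parent is None

    def path(self):
        """Names from the root space down to this space, joined by '/'."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return "/".join(reversed(names))


# Hypothetical hierarchy: repository root -> user subspace -> application subspace.
root = Space("repo")
user = Space("user:alice", parent=root)
app = Space("app:notes", parent=user)
```

The path of a space plays a role similar to a qualified name in a namespace: the same local encoding under two different paths refers to two different objects.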
计算机世界中,我们是用二进制位来进行编码的。给予足够多的位数,我们就能够使用尽可能多的编码,也包括元编码。但在实现过程中,更多的位数意味着性能和存储的代价。另外,扁平的元编码也不利于管理。这也是 程序设计语言(如C++,Java等)以及XML技术采用名字空间的原因之一。类似的,我们也引入编码空间的概念对编码进行更有效的管理。实际上,编码空间就是对编码元数据进行层次化分类和隔离的一种手段。编码空间是层次化的,也就是说,编码空间还可以有子空间。隶属于不同编码空间的相同编码可以对应不同的对象。同一元编码在不同的空间中也可以是完全不同的。实际上,不同的编码空间对编码进行了不同层次的安全隔离。In the computer world, we encode with binary bits. Given enough digits, we can use as many encodings as possible, including metacode. But in the implementation process, more bits mean the cost of performance and storage. In addition, flat meta-coding is also not conducive to management. This is also One of the reasons for programming languages (such as C++, Java, etc.) and XML technology to use namespaces. Similarly, we also introduce the concept of coding space to manage coding more effectively. In fact, the coding space is a means of hierarchically classifying and isolating the encoded metadata. The coding space is hierarchical, that is, the coding space can also have subspaces. The same code belonging to different coding spaces can correspond to different objects. The same element code can be completely different in different spaces. In fact, different coding spaces have different levels of security isolation for encoding.
我们可以按照不同的方式来进行编码空间的划分。但是在对编码进行使用和处理的过程中,不可避免地涉及到一些基本的对象。举例来说,图17为可以应用到基本编码空间的基本对象的示意图。We can divide the coding space in different ways. However, in the process of using and processing the code, some basic objects are inevitably involved. For example, Figure 17 is a schematic diagram of a base object that can be applied to a basic coding space.
对于本发明而言,任何编码都存在于编码仓库中,当然,标准编码除外。实际上,不同的编码仓库就对应了不同的编码空间,一个编码仓库对应的编码空间就是这个编码仓库所有编码的根空间。For the purposes of the present invention, any code is present in the code repository, with the exception of standard codes. In fact, different encoding warehouses correspond to different encoding spaces. The encoding space corresponding to an encoding warehouse is the root space of all encodings of this encoding warehouse.
同样,在同一编码仓库中,每个编码都有其拥有者。那么不同用户的编码就属于不同的用户编码空间。随着编码仓库中用户模型的复杂程度不同,用户空间的划分也可以更加复杂。例如,可以存在由多个用户所共享的组空间。Similarly, in the same code repository, each code has its own owner. Then the coding of different users belongs to different user coding spaces. With the complexity of user models in the coding warehouse, the division of user space can be more complicated. For example, there may be a group space shared by multiple users.
同样一种数据对象往往会被不同的应用程序来使用,针对某个编码仓库的具体用户,可以让不同的应用程序共享相同的编码;也可以让这些应用使用各自独立的编码。对于前者,同样的文字内容能够被不同的应用程序处理和使用,无需转换。而对于后者,独立的编码提高了数据的安全性——从恶意应用或者被破解的应用中泄漏的编码只影响该应用对应的数据。当然,前者的优势就对应后者的劣势,反之亦然。互操作性和安全性向来就是一个硬币的两面。但此处,我们可以看到,空间概念的引入使我们拥有了选择的灵活性。The same kind of data objects are often used by different applications. For a specific user of a code repository, different applications can share the same code; they can also use separate codes for these applications. For the former, the same text content can be processed and used by different applications without conversion. For the latter, independent coding increases the security of the data—the code that leaks from a malicious application or a compromised application only affects the data corresponding to that application. Of course, the advantage of the former corresponds to the disadvantage of the latter, and vice versa. Interoperability and security have always been two sides of a coin. But here, we can see that the introduction of the concept of space gives us the flexibility of choice.
Furthermore, an encoding is ultimately serialized into a concrete data store, which may be a file, a database field, or a string transmitted over a network. Isolating encodings per data content maximizes their security. In effect, such a content space, isolated by data content, establishes a codebook in which content and encodings correspond one-to-one.
Finally, encodings can be divided into different domains for ease of management; such a domain is called a management space. Management spaces are usually distinguished by names or identifiers, so they are also called named encoding spaces.
In the context in which encodings are created and used, the two kinds of encoding space described above (named encoding spaces and context encoding spaces) may exist implicitly. We refer to this as the context space.
Within an encoding repository, the combination of different kinds of context objects determines the final context space. For example, different combinations of user and application correspond to different context spaces. In general, however, the encodings in non-standard text content correspond uniquely to that content, and the content itself implies the corresponding application and user (multi-application and multi-user content excepted). There is therefore no need to subdivide a content space into application or user subspaces. Among all context spaces there is one special space, the context-independent encoding space, whose encodings we call public encodings. Standardized encodings are all public encodings. Encodings in the root space are not actually public encodings; they are repository-specific encodings whose encoding space is the root space of the repository.
In an encoding system, everything is ultimately expressed as an encoding. The encoding that an encoding space maps to is a kind of meta-encoding, which we call a space encoding. An encoding space is in fact a special encoding meta-object whose object instances are themselves encoding meta-objects. A context-independent space encoding has no encoding space of its own, whereas a context-dependent space encoding can map to different encoding spaces depending on the context object. Therefore, for a context-independent encoding space such as a named encoding space, the space encoding can be used directly, and the corresponding instance encodings are subspace encodings or other meta-encodings. For a context encoding space, the encoding of the context object can serve directly as the space encoding: the repository space corresponds to the repository encoding, the content space to the instance encoding, the application space to the application encoding, and the user space to the user encoding.
For example, FIG. 18 is a schematic diagram of the composition of an encoding in a fixed-length encoding scheme of length 128. The arrangement and combination shown are not the only possibility; for instance, the instance encoding can be placed anywhere within the object encoding, provided the layout is clearly defined in advance.
In actual use, context space encodings are implicit in the context in which an encoding is used and need not appear in the final object encoding. For example, the repository currently in use implies the repository encoding; the application currently using the encoding implies the application encoding; and the document content containing the encoding implies the instance encoding and the user encoding of the encoding's owner (assuming a single-user document). However, when encodings from several spaces of the same kind appear in the same text content, context space encodings must appear in the text to set the different encoding contexts and keep the spaces apart. For example, if the text of a document includes encodings from several encoding repositories, the corresponding repository encodings must appear in the document content to distinguish the repository spaces. An encoding repository that supports repository encodings must, of course, provide the information needed to access the repository that a repository encoding refers to. Likewise, multi-user text content must use user encodings, and content that can be read and written by several applications and uses application space isolation must use application encodings. The content space is an exception: a content encoding encodes the document content itself and corresponds one-to-one with it, so no content can correspond to more than one content encoding, and the content encoding need not appear explicitly in the encoding. In terms of implementation, the content encoding can be a hash of the document content, or a hash of the application encoding and a timestamp. The content encoding is thus either computed on the fly or stored as content metadata.
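The two ways of deriving a content encoding mentioned above can be sketched as follows. This is a minimal illustration only: the choice of SHA-256, the function names, and the way the application encoding and timestamp are joined are assumptions, since the text does not name a hash function or an input format.

```python
import hashlib


def content_code_from_body(body: bytes) -> str:
    """Content encoding computed on the fly from the document content."""
    # SHA-256 is an assumed choice; the source only says "a hash".
    return hashlib.sha256(body).hexdigest()


def content_code_from_app(app_code: str, timestamp: int) -> str:
    """Content encoding derived from an application encoding and a timestamp.

    Such a value cannot be recomputed from the content alone, so it would
    have to be stored as content metadata.
    """
    material = f"{app_code}:{timestamp}".encode("utf-8")  # assumed layout
    return hashlib.sha256(material).hexdigest()


if __name__ == "__main__":
    print(content_code_from_body(b"hello"))
    print(content_code_from_app("app-005", 1_700_000_000))
```

The first variant needs no stored state but changes whenever the content changes; the second stays stable across edits at the cost of keeping the value in metadata.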
As mentioned above, an encoding generally does not need to include the space encoding itself, but it does need to indicate which kinds of space encoding are in use. This can be specified with space bits in the encoding. The space bits in effect correspond to the encoding context specification within the encoding specification.
As a further example, FIG. 19 is a schematic diagram in which four binary bits serve as four space bits. As shown in FIG. 19, the repository bit may also be called the reserved bit. In one illustrative assignment, a reserved bit of 0 means the encoding comes from the current encoding repository; otherwise, additional information is needed to define the encoding or specify its source, such as the client encoding discussed later. A content bit of 0 means the encoding is independent of the content; 1 means the encoding exists specifically for that content. An application bit of 0 means the encoding is independent of the application; 1 means it is specific to that application. A user bit of 0 means the encoding is a public encoding; 1 means it is owned by the user of the current document. The meanings of 0 and 1 may equally be reversed, and any other scheme may be used as long as it distinguishes the different spaces effectively.
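As an illustration of how four space bits might be interpreted, the sketch below models them as flags. The bit positions are an assumption (FIG. 19 is not reproduced here), and the names `SpaceBits` and `describe` are hypothetical; only the 0/1 meanings follow the illustrative assignment in the text.

```python
from enum import IntFlag


class SpaceBits(IntFlag):
    # Bit positions are assumed for illustration only.
    USER = 0b0001         # 1: owned by the current document's user
    APPLICATION = 0b0010  # 1: specific to the current application
    CONTENT = 0b0100      # 1: exists only for this particular content
    RESERVED = 0b1000     # 0: encoding comes from the current repository


def describe(bits: int) -> dict:
    """Interpret a 4-bit space field according to the assignment above."""
    if not 0 <= bits <= 0b1111:
        raise ValueError("space field must fit in four bits")
    flags = SpaceBits(bits)
    return {
        "current_repository": SpaceBits.RESERVED not in flags,
        "content_specific": SpaceBits.CONTENT in flags,
        "application_specific": SpaceBits.APPLICATION in flags,
        "user_owned": SpaceBits.USER in flags,
    }


if __name__ == "__main__":
    print(describe(0b0101))
```

With all four bits at 0, the encoding is a public encoding from the current repository that is independent of content and application.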
It is worth noting that a type encoding, like an ordinary encoding, also has an encoding space, and the spaces of the type encoding and the instance encoding may differ. For example, using public encodings within a user space can serve to isolate that user space securely; in this example the encoding type is in the user space while the instance encoding is in the public space. Because an instance encoding must belong to some encoding type, all instance encodings of the same type have the same space bits. During decoding, the type encoding gives access to the metadata of the encoding type in the repository. The type encoding must therefore include its space, so that the decoder can retrieve the correct encoding type information from the repository. The type information in the repository can include the space bits of the corresponding instance encodings, so those bits need not appear in the instance encoding.
Context spaces are the primary means of securely isolating encodings. The parties that manage and configure which target space an application generates encodings into should be the individual corresponding to the context object (such as the user) and administrators (such as system or application administrators). Management spaces, by contrast, exist to facilitate hierarchical management of encodings and are registered and used by applications.
The encoding word length is the minimum number of bits needed to encode one character in a text encoding system. For example, the encoding word length of UTF-8 is 8 bits (one byte), and that of UTF-16 is two bytes. In an encoding with a given word length, not every encoding has exactly that length, but every length must be an integer multiple of the word length. In an encoding system with a multi-byte word length, byte order within a word must also be considered. The issue does not arise with a single-byte word length, where all data is arranged byte by byte from low to high.
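The relationship between word length, encoding length, and byte order can be checked with Python's built-in codecs. This is a small sketch; the helper name `word_count` is hypothetical.

```python
def word_count(text: str, codec: str, word_bytes: int) -> int:
    """Number of encoding words used to encode `text` with `codec`."""
    data = text.encode(codec)
    if len(data) % word_bytes != 0:
        raise ValueError("length is not a multiple of the word length")
    return len(data) // word_bytes


if __name__ == "__main__":
    # UTF-8 has a one-byte word; a character takes one or more words.
    print(word_count("A", "utf-8", 1), word_count("中", "utf-8", 1))
    # UTF-16 has a two-byte word; most characters take one word,
    # characters outside the Basic Multilingual Plane take two.
    print(word_count("中", "utf-16-le", 2), word_count("😀", "utf-16-le", 2))
    # Byte order matters only when the word is longer than one byte.
    print("中".encode("utf-16-le").hex(), "中".encode("utf-16-be").hex())
```

The last line shows the same two-byte word written in little-endian and big-endian order, which is the byte-order issue that a single-byte word length avoids.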
Regarding fixed-length and variable-length encoding: an encoding system in which every encoding has a length equal to the encoding word length is called a fixed-length encoding system; otherwise it is called a variable-length encoding system.
In an object encoding system, the encoding word length and the associated encoding method are closely tied to the encoding and decoding process but are independent of the encoding metamodel. That is, an object encoding system based on a given encoding metamodel may choose different encoding word lengths and correspondingly different encoding methods. It may even support several combinations of word length and encoding method at once, provided an effective mechanism is designed to tell them apart.
It should be noted that the system's encoding word length and encoding method are not directly related to the serialization word length and method specified in a particular object's encoding specification. However, when the serialization result forms part of the object encoding, compatibility with the object encoding's word length and method must be taken into account.
Like Unicode, the object encoding system can be independent of the encoding word length; that is, encoding schemes with different word lengths can be built on the same encoding repository. In a scheme with a short word length, a single word often cannot hold a complete encoding (which, as described above, comprises a space encoding, a type encoding, and an instance encoding). In that case a variable-length encoding can be used, in which one encoding spans several words; for example, the meta-encoding part and the instance encoding part are split into several consecutive words. Even so, a single word is sometimes not enough to cover all encoding instances. We can then borrow the variable-length technique used for Unicode and use marker bits to define the length of the encoding. For example, FIG. 20 shows a scheme for a word length of one byte that lets the encoder determine the length of an encoding automatically from its first one or two bytes. The range of encodings this scheme can represent is 0 to 2^65 - 1.
FIG. 21 is an example diagram of the UTF-8 encoding scheme. Comparing the two schemes shows that their encoded results do not conflict, so both can appear in the same document. When the first bit of the first byte of an encoding is 0, the byte corresponds to the ASCII portion of UTF-8; when the first two bits of the first byte are 10, the encoding is an object encoding; when the first two bits of the first byte are 11, the encoding is a Unicode encoding. In this way object encodings and Unicode can be mixed in one encoded stream.
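The lead-byte rule described above can be sketched as a small classifier. The scheme works because UTF-8 never starts a character with a byte of the form 10xxxxxx (such bytes are continuation bytes only), which leaves that prefix free for object encodings. The function name is hypothetical, and the sketch only classifies the first byte; how many bytes an object encoding occupies depends on the layout in FIG. 20, which is not reproduced here.

```python
def classify_lead_byte(byte: int) -> str:
    """Classify the first byte of an encoding in the mixed stream."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte & 0x80 == 0x00:   # 0xxxxxxx
        return "ascii"
    if byte & 0xC0 == 0x80:   # 10xxxxxx
        return "object"
    return "unicode"          # 11xxxxxx: multi-byte UTF-8 lead byte


if __name__ == "__main__":
    for b in (0x41, 0x9A, 0xE4):
        print(hex(b), classify_lead_byte(b))
```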
Similarly, other variable-length encoding schemes can be designed for word lengths of one byte or of several bytes.
Regarding the encoding type: an encoding type is an object type together with its associated encoding specification.
Regarding the encoding context: an encoding context is an abstraction of a context object; in practice it is the condition used to select the context object at run time. The encoding metamodel above uses an encoding type plus an object role name. Within one encoding context environment (generally a specific application), role names of the same type must be unique.
For example, a web blog application has authors and readers; both are user objects but play different roles. The encoding context of the data objects in blog content should be the author user. Then, when any reader opens the content, decoding does not fail merely because the logged-in user is not the author. Correct decoding of course depends on the encoding context object being set correctly: in the blog example, whenever a particular blog post is opened, the corresponding author user object is set as the encoding context object.
Regarding the encoding path: an encoding context path, or encoding path for short, corresponds to a sequence of encoding context specifications and constrains the encoding space to which the instance encoding of the corresponding data object belongs. By definition, encoding spaces form a hierarchy associated with encoded objects that have related encodings: a subspace can itself have subspaces. An encoding path is the path through encoding spaces that locates a particular encoded object. For example, the encoding path of a picture in a personalized diary might be:
Space of the encoding repository | Space of user 001 | Space of the personalized diary application
The picture corresponding to a picture object encoding can then be found in the final application space.
The encoding path above is a concrete run-time path. The encoding path in the encoding metamodel is at a higher level of abstraction and corresponds to:
Root space | Author space | Application space
At run time, this encoding path is instantiated into the encoding path instance above by selecting the corresponding context objects.
A context object is a concrete object bound to a context specification. It must satisfy the constraints of that specification and must be accessible during the corresponding encoding and decoding process. For example, suppose there is an "author" context constraint whose type is "user". When this constraint is set, the current application cannot be used as the context object; an object of type "user" must be used. Typically, once the author information of a document has been obtained, it can be set as the context object for the "author" constraint. If the author object is not accessible to the current user, the context object cannot be instantiated; the encoding context constraint is then unsatisfied and the related instance encodings cannot be decoded. This is one concrete manifestation of the context-based encoding security of this method.
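The instantiation of an abstract encoding path, including the type and accessibility checks on context objects, might look like the following sketch. All class and function names are hypothetical, and representing a space as a "type:id" string is an assumption made only to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ContextSpec:
    role: str       # role name, e.g. "author"
    type_name: str  # required object type, e.g. "user"


@dataclass(frozen=True)
class ContextObject:
    type_name: str
    ident: str
    accessible: bool = True  # whether the current user may access it


def instantiate_path(specs: List[ContextSpec],
                     bindings: Dict[str, ContextObject]) -> List[str]:
    """Turn an abstract path (root | author | application) into concrete spaces."""
    path = ["repository"]  # the root space
    for spec in specs:
        obj = bindings.get(spec.role)
        if obj is None:
            raise LookupError(f"no context object set for role {spec.role!r}")
        if obj.type_name != spec.type_name:
            raise TypeError(f"role {spec.role!r} requires type {spec.type_name!r}")
        if not obj.accessible:
            raise PermissionError(f"context object for {spec.role!r} is not accessible")
        path.append(f"{obj.type_name}:{obj.ident}")
    return path


if __name__ == "__main__":
    specs = [ContextSpec("author", "user"), ContextSpec("app", "application")]
    bindings = {
        "author": ContextObject("user", "001"),
        "app": ContextObject("application", "diary"),
    }
    print(instantiate_path(specs, bindings))
```

If any check fails the path cannot be instantiated, and instance encodings that depend on it remain undecodable, which mirrors the security property described above.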
In a system implementation, an encoding path instance is directly tied to the encoding space in the repository that holds the instance encoding of the corresponding data object; optionally, the storage location of the data object in the repository may also be governed by the encoding space. How a repository implements encoding paths can vary with the storage scheme. One concrete example: in an encoding repository implemented with a relational database, a simple approach is to concatenate context names to form the table name for context-dependent data objects. Continuing the example above, the picture table could be named:
User_001_Application_005_Picture_Table
The instance encoding of the corresponding data object can then simply be the key of that table.
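A minimal sketch of the name-concatenation approach follows. The function name and the English name parts are hypothetical; the character check is an added precaution (not something the text specifies), because a table name built from context data ends up inside SQL text.

```python
import re
from typing import Iterable, Tuple

_SAFE = re.compile(r"^[A-Za-z0-9]+$")


def context_table_name(contexts: Iterable[Tuple[str, str]], object_type: str) -> str:
    """Build a table name such as User_001_Application_005_Picture_Table."""
    parts = []
    for kind, ident in contexts:
        parts.extend([kind, ident])
    parts.append(object_type)
    for part in parts:
        # Reject anything that could alter the SQL statement the name is used in.
        if not _SAFE.match(part):
            raise ValueError(f"unsafe name component: {part!r}")
    return "_".join(parts + ["Table"])


if __name__ == "__main__":
    print(context_table_name([("User", "001"), ("Application", "005")], "Picture"))
```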
Another way to implement encoding spaces is to store the data objects together and use the encoding space only to distinguish the encodings. One concrete example: in an encoding repository implemented with a relational database, the system maintains an encoding space table such as the following:
Encoding space ID | Parent space ID | Context object reference encoding
0 | Null | Null
... | ... | ...
8 | 0 | (reference encoding of user 001)
... | ... | ...
100 | 8 | (reference encoding of application 005)
... | ... | ...
The encoding space ID field is the primary key of this table; the parent space ID is a foreign key into the same table and expresses the nesting of encoding spaces.
For every kind of data object placed in the data repository there are two tables. One is the data table of the data object itself, for example a picture table:
Picture ID | Field 1 | ...
... | ... | ...
The picture ID field is the primary key of this table, and the data of all pictures is stored in it. The other is the corresponding picture encoding table:
Encoding space ID | Encoding | Picture ID
... | ... | ...
100 | 001 | ...
100 | 002 | ...
... | ... | ...
The encoding space ID field is a foreign key into the system's encoding space table, and the picture ID field is a foreign key into the picture table. The encoding space ID field together with the encoding field forms the primary key of this table.
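The three tables and their key relationships can be sketched with SQLite from the Python standard library. Table and column names, the sample rows, and the helper `lookup_picture` are illustrative assumptions; only the key structure (self-referencing parent key, composite primary key on space and encoding) follows the description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE encoding_space (
    space_id        INTEGER PRIMARY KEY,
    parent_space_id INTEGER REFERENCES encoding_space(space_id),
    context_ref     TEXT              -- reference encoding of the context object
);
CREATE TABLE picture (
    picture_id INTEGER PRIMARY KEY,
    field1     TEXT
);
CREATE TABLE picture_encoding (
    space_id   INTEGER NOT NULL REFERENCES encoding_space(space_id),
    code       TEXT    NOT NULL,
    picture_id INTEGER NOT NULL REFERENCES picture(picture_id),
    PRIMARY KEY (space_id, code)      -- an encoding is unique only within its space
);
""")
conn.executemany("INSERT INTO encoding_space VALUES (?, ?, ?)",
                 [(0, None, None), (8, 0, "user-001"), (100, 8, "app-005")])
conn.executemany("INSERT INTO picture VALUES (?, ?)",
                 [(1, "sunset"), (2, "portrait")])
conn.executemany("INSERT INTO picture_encoding VALUES (?, ?, ?)",
                 [(100, "001", 1), (100, "002", 2)])


def lookup_picture(space_id: int, code: str):
    """Decode an instance encoding within a given encoding space."""
    row = conn.execute(
        "SELECT p.field1 FROM picture_encoding e "
        "JOIN picture p ON p.picture_id = e.picture_id "
        "WHERE e.space_id = ? AND e.code = ?",
        (space_id, code),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    print(lookup_picture(100, "001"))  # found in application space 100
    print(lookup_picture(8, "001"))    # same encoding, different space: no match
```

Because the space ID is part of the key, the same short encoding can be reused in every space without collision, and a decoder that cannot resolve the space cannot resolve the encoding.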
Regarding the encoding directory entry: an encoding directory entry is the concrete encoding meta-object used by context-dependent object encoding. Each encoding space has exactly one encoding directory, which is a list of encoding directory entries. Each entry has a unique number within the directory; this number is its meta-encoding. In the encoding metamodel above, a directory entry consists of an encoding type plus an encoding path. The encoding path may be relative, that is, relative to the current space of the directory entry, or absolute, that is, starting from the root space; both can be supported at once provided there is a mechanism to tell them apart.
In other words, in a context-dependent object encoding system, the meta-encoding (the encoding of the directory entry) and the instance encoding within an object encoding need not be in the same encoding space.
Encoding directory entries unify the space encoding and type encoding described earlier. If the encoding type in the object data (that is, the directory entry) that a meta-encoding refers to is itself the directory entry type, then that meta-encoding denotes an encoding space, and the instance encoding that follows it is in fact another meta-encoding. A meta-encoding can thus denote either a space encoding or the encoding of a directory entry, depending on whether the corresponding encoding type is the directory entry type. With this design, the meta-encoding of an object encoding can be a sequence of one or more meta-encodings: the last one corresponds to an ordinary encoding meta-object, and all preceding ones correspond to encoding spaces. In addition, directory entries let us hide the space bits described earlier inside the repository rather than exposing them in the encoding. Encoding paths are more flexible and more secure than space bits, and allow different combinations of context objects to be configured.
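The multi-level meta-encoding described above can be sketched as a walk through nested directories: every meta-encoding that resolves to a space descends one level, and the first one that resolves to an ordinary type entry is followed by the instance encoding. The classes `Space` and `TypeEntry` and the function `resolve` are hypothetical names, and integer codes are an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class TypeEntry:
    """Ordinary directory entry: an encoding type with its instances."""
    type_name: str
    instances: Dict[int, str] = field(default_factory=dict)


@dataclass
class Space:
    """Encoding space: its directory maps meta-encodings to entries."""
    directory: Dict[int, Union["Space", TypeEntry]] = field(default_factory=dict)


def resolve(root: Space, codes: List[int]) -> Tuple[str, str]:
    """Decode [meta, meta, ..., instance] starting from the root space."""
    space = root
    for i, code in enumerate(codes):
        entry = space.directory[code]
        if isinstance(entry, Space):
            space = entry            # a space encoding: descend one level
            continue
        rest = codes[i + 1:]         # an ordinary entry: instance encoding follows
        if len(rest) != 1:
            raise ValueError("expected exactly one instance encoding")
        return entry.type_name, entry.instances[rest[0]]
    raise ValueError("object encoding ended on a meta-encoding")


if __name__ == "__main__":
    pictures = TypeEntry("picture", {1: "sunset.png"})
    root = Space({7: Space({2: pictures})})
    print(resolve(root, [7, 2, 1]))
```

Because spaces are stored as nested values rather than shared references, the structure in this sketch is a tree, so the walk always terminates.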
Regarding instantiation of encoding directory entries: instantiating a directory entry is essentially the process, at run time of the context-dependent object encoding system, of instantiating the encoding path (a sequence of context specifications) into a target encoding space. As the context objects vary during encoding and decoding, the same meta-encoding (the encoding of the directory entry) maps to different target encoding spaces, and object instance encodings are accordingly encoded into different encoding spaces (only encodings in reference form correspond to an encoding space, of course). A directory entry with an empty encoding path involves no instantiation; its target encoding space is the space in which the entry resides.
Directory entry instantiation is the key to how a context-dependent object encoding system achieves context dependence.
Regarding the encoding factory: an encoding factory is the run-time object codec that results from instantiating an encoding directory entry. It comprises the directory entry, the current encoding space (the space containing the encoding directory), and the target encoding space (the space containing the object instance data, obtained by instantiating the encoding path with the corresponding context objects). An encoding factory holds all the information needed to encode and decode a data object apart from the object's data content, and provides encoding and decoding services for data objects of its directory entry (that is, of a specific type in a specific target space).
An encoding space can be regarded as a special encoding factory whose directory entry has the directory entry type itself as its encoding type. In other words, an encoding space provides encoding and decoding services for directory entries, that is, for encoding meta-objects.
The final output of an encoding factory should be an object encoding comprising a meta-encoding and an instance encoding. The step of combining or concatenating the meta-encoding and the instance encoding into the object encoding may be performed on the client or in the repository, depending on the design. The final object encoding may also include a code that identifies how the meta-encoding and instance encoding were combined or concatenated. If necessary, this code can be stored separately from the object encoding, over a different secure channel and with its own access rights, so that only an authorized and authenticated party can obtain both the object encoding and the code identifying the combination scheme, and thereby correctly split the object encoding into its meta-encoding and instance encoding.
Regarding the system encodings of a context-dependent object encoding system: because such a system combines multiple levels of meta-encoding, a variable-length encoding method is the most direct way to implement it. Directory entry encodings and instance encodings can each be one word long.
Regarding the context object setting encoding: this system encoding sets the current context object (current at the time of encoding or decoding). The setting applies to subsequent data objects whose directory entries use the relevant context.
One possible form of this encoding is:
[marker of this system encoding][encoding context encoding][object encoding]
To support this system encoding, the encoding context object in the core concept diagram of the encoding metamodel above must be changed into an encoded object; that is, encoding the context object is the basis of the form above.
Another possible form is:
[marker of this system encoding][encoding context identifier][object encoding]
The encoding context identifier can be a combination of the context type name and the context role name.
Regarding the terminator encoding: the terminator encoding tells the decoder that parsing of an object encoding has ended. It is not always required. In most cases an object encoding ends with an instance encoding, and parsing continues until an instance encoding is reached; the system can therefore treat the end of the instance encoding as the end of parsing. This implies that encoding spaces cannot be nested cyclically and must form a strict tree. The terminator encoding can be a marker one word long.
Regarding the root space encoding: once a default factory has been set by means of a space encoding, it is sometimes still necessary to use encodings outside that default factory. The root space encoding can then be used to switch the current encoding to another space. The root space is the starting point of every complete encoding; all other encodings and meta-encodings can be decoded starting from it. A given text content can correspond to only one root space. If no default factory has been set, the default factory is the root space. The root space encoding can be a special marker one word long, followed by the object's complete object encoding, from the root directory encoding down to the instance encoding.
Regarding the default meta-encoding setting encoding: this encoding in effect sets the current encoding space or encoding factory. A root space encoding can override the setting. Apart from object encodings that begin with the root space encoding, all encodings are decoded by the configured encoding factory.
Because this encoding necessarily ends on a meta-encoding, it must be closed with a terminator encoding.
One possible form of this encoding is:
[marker of this system encoding][multiple directory encodings][terminator encoding]
Context-dependent object encoding can shorten encodings while increasing their expressive power. It is well suited to the encoded storage and transmission of the rich data types and massive, complexly related data objects found in big data and cloud storage, and to the lightweight and varied identification needs of the Internet of Things.
Regarding object encoding and text: a standard text encoding is in effect a reference encoding of character objects, so a sequence of object encodings can be regarded as a special kind of text content. Many of the concepts and tools developed for traditional text, such as search, retrieval, editing and replacement, can be borrowed and reused, adapted to the characteristics of object encoding.
Object encodings and text encodings can also be mixed, provided the text encoding is treated as a special kind of object encoding.
When object encodings and text encodings are mixed, the corresponding encoding and decoding method can take one of three approaches:
1. Assign a special meta-encoding to the text encoding.
2. Use a specific text encoding and, whenever an object encoding is needed, switch to object encoding through a designated escape character.
3. Extend a specific text encoding into an extended text encoding that can also express object encodings.
Regarding structured object encoding: as noted above, a sequence of object encodings can be regarded as a special kind of text content, and many encoding standards and formats for structured documents already exist on top of standard text, such as the comma-separated tabular text format CSV, the markup-based structured document standards SGML/XML, and the JSON format, which packages data structures using JavaScript syntax. On the one hand, these formats and standards can be used directly, with object encoding characters mixed in as content.
On the other hand, a structured document made up of object encoding sequences can itself be regarded as a special object and encoded as an object; the result is the serialization of the encodings of all the sub-objects that make up the object. As with an ordinary data object, the structural information of the object can be stored in the encoding repository as part of the encoding metadata, and the content is encoded and decoded according to that metadata. This process composes and parses the object encoding sequence that forms the serialized content of the structured object, and the object encodings inside it are in turn encoded and decoded; the process can be recursive and nested. The encoding and decoding of structured objects can also be defined in other forms, as follows:
Encoding of an object array
This generally refers to the encoding of a group of objects that share the same meta-encoding. In a variable-length encoding method, defining an array system encoding removes the redundant meta-encodings. The array encoding can be defined as follows:
Array encoding := array system encoding + array length n + object encoding of the first element (meta-encoding + instance encoding) + instance encodings of the n-1 remaining elements
Under this definition, the array system encoding can be regarded as the meta-encoding of the array object. The meta-information of the array object, including the array length and the array type, is implicit in the array encoding as a whole.
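The array definition above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the application: objects are modelled as hypothetical (meta, instance) integer pairs, the result is a flat list of values rather than bytes, and `ARRAY_MARK` is a placeholder for the array system encoding, whose actual value the text does not give.

```python
ARRAY_MARK = "ARRAY"  # placeholder: the real array system encoding is not specified


def encode_array(objects):
    """Encode (meta, instance) pairs that all share one meta-encoding."""
    if not objects:
        raise ValueError("the definition needs a first element to carry the meta-encoding")
    meta = objects[0][0]
    if any(m != meta for m, _ in objects):
        raise ValueError("all elements of an array must share the same meta-encoding")
    # marker, length n, the meta-encoding once, then the n instance encodings
    return [ARRAY_MARK, len(objects), meta] + [inst for _, inst in objects]


def decode_array(codes):
    """Rebuild the (meta, instance) pairs from an encoded array."""
    mark, n, meta, *instances = codes
    if mark != ARRAY_MARK or len(instances) != n:
        raise ValueError("malformed array encoding")
    return [(meta, inst) for inst in instances]


users = [(5, 256), (5, 257), (5, 300)]
encoded = encode_array(users)
print(encoded)  # the meta-encoding 5 is stored once instead of three times
```

The saving grows with the array length: n elements carry one meta-encoding instead of n.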
Encoding of a two-dimensional object table
This generally refers to the encoding of a two-dimensional array in which every column has the same meta-encoding. Likewise, defining a table system encoding removes the redundant meta-encodings.
Table encoding := table system encoding + number of data rows n + object encodings of the elements of the first row (meta-encoding + instance encoding) + instance encodings of the n-1 remaining rows
Under this definition, the array system encoding can be regarded as the meta-encoding of the array object. The meta-information of the array object, which may include the array length and the array type, is implicit in the array encoding as a whole.
Encoding of an object tree
Tree structures are widely used and can represent complex compositions of objects, such as document trees and abstract syntax trees. A special class of tag encodings can be defined for them. A tag encoding marks the start of a tree node, and the metadata of the tag object specifies the corresponding end marker. When the decoder reaches the end marker, it combines the data objects between the tag and the end marker into a tree node object. Tree node objects can be nested and combined.
The tree structure information can be placed entirely in the encoding metadata corresponding to the root node encoding, or it can be distributed level by level across the encoding metadata corresponding to the individual tree nodes.
Meta-meta-encoding
A meta-meta-encoding is the encoding of metadata about the encoding metadata. It is also part of the meta-encoding.
Token encoding
An object encoding may also have no instance encoding, that is, it may consist of the meta-encoding part only. Such an encoding is called a token encoding and corresponds only to encoding metadata. Its main role is to provide semantic tags to the decoder, and it is used heavily in structured encoding streams.
Further, one specific implementation of step 304C is:
Generating the object encoding from the meta-encoding and the instance encoding according to a predetermined rule.
In this embodiment, the object encoding can be formed from the meta-encoding and the instance encoding in many ways. It can be formed by directly combining or concatenating the two. For example, FIG. 22 is a schematic diagram of an object encoding composed of a meta-encoding and an instance encoding.
As another example, the object encoding can be obtained through an arithmetic operation on the meta-encoding and the instance encoding, or through any other feasible way of mixing them, as follows:
Object encoding = instance encoding × 101 + meta-encoding
The object encoding can then be separated back into the meta-encoding and the instance encoding by the corresponding inverse operation:
Figure PCTCN2015086672-appb-000017
Therefore, any way of obtaining the object encoding from the meta-encoding and the instance encoding is applicable to the present invention, provided that the meta-encoding and the instance encoding can be recovered from it reversibly.
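The reversibility requirement can be illustrated with the arithmetic rule given above. The sketch below rests on assumptions that the text does not state: the multiplier is read literally as 101, meta-encodings are limited to 0..100 so that they survive as the remainder, and the inverse is taken to be integer division with remainder (the application's own inverse formula appears only in a figure). The function names are hypothetical.

```python
RADIX = 101  # multiplier, read literally from the formula in the text


def combine(meta, instance):
    """Object encoding = instance encoding * 101 + meta-encoding."""
    if not 0 <= meta < RADIX:
        raise ValueError("meta-encoding must be below the multiplier to stay recoverable")
    if instance < 0:
        raise ValueError("instance encoding must be non-negative")
    return instance * RADIX + meta


def split(object_code):
    """Recover (meta, instance) by integer division and remainder."""
    instance, meta = divmod(object_code, RADIX)
    return meta, instance


code = combine(5, 256)
print(code, split(code))
```

The range check in `combine` is what makes the rule reversible: a meta-encoding of 101 or more would spill into the quotient and could no longer be separated from the instance encoding.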
Both the meta-encoding and the instance encoding are used internally by the object encoding system. They are usually generated automatically within that system and are invisible to application systems built on top of it. Depending on how the metadata part relates to the data content part, the instance encoding may or may not depend on the meta-encoding.
A type encoding is a typical meta-encoding. Through the type encoding, the type information of an object instance can be obtained, together with the encoding specification for that type.
Preferably, the method may further include:
Adding to the object encoding a code that represents the predetermined rule.
Or,
Storing the code representing the predetermined rule and the object encoding under separate secure channels, and setting different access rights for each.
In this embodiment, regarding context-dependent encoding: as noted above, object-based encoding already provides type-based isolation of encodings. For data objects of a given type, however, a single unified encoding space still has two major drawbacks. First, the encoding is not secure enough: by directly modifying a code or trying random codes, one may gain direct access to other users' data objects of the same type. Second, the encoding is not efficient enough: to keep the codes of data objects of the same type from colliding, the storage occupied by the object encoding itself grows with the number of data objects, which readily lowers encoding efficiency.
Context-dependent encoding solves both problems by introducing the concept of a context-dependent encoding space.
An encoding space is an abstraction that isolates the encodings of data objects. The encoding of a data object of a given type is unique within a given encoding space, but the same object may have different encodings in different encoding spaces. Likewise, the same type and the same code may correspond to different data objects in different encoding spaces.
A context object is a data object related to the environment in which an encoding is used, such as a user, an application system, a time, a place or a domain. The encodings of some data objects are closely tied to this environment. For example, a user's private data objects are closely tied to that user, so their encodings should also be tied to that user.
A context-dependent encoding space is an encoding space that belongs to a context object. By including information about the context object in the meta-information of a data object, the encoding space of that data object can be specified, and the data object can then be encoded directly with a code from that space. During use and parsing, the same object encoding can map to different encoding spaces depending on the context object, which further improves the effectiveness of the encoding.
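The idea that the same type and code resolve to different objects in different context-dependent spaces can be sketched with nested dictionaries. Everything named here (`SPACES`, `resolve`, the users and the stored strings) is hypothetical and only stands in for an encoding repository.

```python
# Hypothetical repository: one encoding space per context object.
SPACES = {
    ("user", "alice"): {("handwritten text", 256): "Alice's glyph #256"},
    ("user", "bob"): {("handwritten text", 256): "Bob's glyph #256"},
}


def resolve(context, type_name, code):
    """Look up (type, code) inside the encoding space owned by `context`."""
    space = SPACES[context]          # the context object selects the space
    return space[(type_name, code)]  # the code is unique only within that space


print(resolve(("user", "alice"), "handwritten text", 256))
print(resolve(("user", "bob"), "handwritten text", 256))
```

Because the context is supplied by the environment rather than stored in each code, codes can stay short, and a code cannot reach outside the space of the context it is decoded in.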
In addition, providing a secure access mechanism for certain key context objects secures the corresponding encoding spaces, and thereby the encodings within those spaces.
More importantly, in this embodiment the key to object-based encoding is the meta-information of the data object, which controls the object's serialization (content encoding), transmission and storage. The type of a data object is an important piece of meta-information. Data objects come in many types, and the types are related to one another: complex types are composed of simple types, multiple data objects of one or more types arranged by some convention form a particular structure, and so on. Together these types make up a type system. The object-based encoding system is built on a complete type system; that is, every data object in the encoding system has an object type. The type system is also extensible: users can define their own custom types from existing types using the type definition and extension mechanisms. The type system gives the encoding system three main benefits:
First, type checking
With object types, there is a basis for validating the data of an object, which is essential for reliable data encoding and transmission.
Second, type inference
With object types, the local or related types of an object can be inferred, so they can be omitted during encoding, which greatly improves encoding efficiency.
Third, encoding isolation
With object types, encodings (specifically, reference encodings) can be reused across different types, which also improves the effectiveness and security of the encoding.
In addition, this embodiment introduces the OTF-8 encoding. First, regarding text encoding in OTF-8: the target encoding here is a text encoding, but unlike a traditional text encoding, its encoding and decoding involve the encoding repository. The encoded result and the decoding source can therefore contain non-standard characters, whose data resides in the encoding repository.
This text encoding is built on UTF-8 and is called OTF-8. Its unit is a single byte, so there is no byte-order problem. It is backward compatible with UTF-8: any UTF-8 content can be decoded directly as OTF-8, and the result is identical to that of UTF-8 decoding.
Second, regarding the representation of numbers in OTF-8: besides traditional UTF-8 characters, OTF-8 can encode numbers of up to 128 bits. A variable-length encoding is used: 0 to 31 take one byte; 32 to 255 take two bytes; 2^8 to 2^16-1 take three bytes; and so on. Specifically, a byte beginning with the bits 100 represents 0 to 31, with the remaining five bits holding the value. For example, 0x80 (binary 10000000) represents 0, 0x81 (10000001) represents 1, 0x82 (10000010) represents 2, and so on up to 0x9F (10011111), which represents 31. For numbers greater than or equal to 32, a first byte beginning with the bits 101 gives the number of bytes that follow, and is followed by that many bytes holding the number in big-endian order (most significant byte first, padded with leading zeros). 0xA0 (10100000) indicates that 1 byte follows; 0xA1 (10100001) indicates 2 bytes; 0xA2 (10100010) indicates 3 bytes; and so on up to 0xAF (10101111), which indicates 16 bytes, that is, a 128-bit number. For example, 0xA0 0x20 (10100000 00100000) represents 32; 0xA0 0xFF (10100000 11111111) represents 255; 0xA1 0x01 0x00 (10100001 00000001 00000000) represents 256; and 0xA2 0x01 0x00 0x00 (10100010 00000001 00000000 00000000) represents 65536. The encoding details are shown in FIG. 23.
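The number rules above translate directly into a small encoder and decoder. This is a sketch under stated assumptions rather than a reference implementation: the function names are hypothetical, the encoder always picks the shortest form, and the length prefix is taken to be the range 0xA0 to 0xAF given in the text (the low four bits hold the payload length minus one).

```python
def encode_number(n):
    """Encode an integer in 0 .. 2**128 - 1 as an OTF-8 number."""
    if n < 0 or n >= 1 << 128:
        raise ValueError("OTF-8 numbers cover 0 .. 2**128 - 1")
    if n < 32:
        return bytes([0x80 | n])            # 100xxxxx: value in the low five bits
    length = (n.bit_length() + 7) // 8      # payload size in bytes, 1 .. 16
    return bytes([0xA0 | (length - 1)]) + n.to_bytes(length, "big")


def decode_number(data, pos=0):
    """Decode one OTF-8 number at data[pos]; return (value, next position)."""
    lead = data[pos]
    if lead & 0xE0 == 0x80:                 # 100xxxxx
        return lead & 0x1F, pos + 1
    if lead & 0xF0 == 0xA0:                 # 1010xxxx: length prefix
        end = pos + 2 + (lead & 0x0F)
        if end > len(data):
            raise ValueError("truncated number")
        return int.from_bytes(data[pos + 1:end], "big"), end
    raise ValueError("not an OTF-8 number")


for value in (0, 31, 32, 255, 256, 65536):
    print(value, encode_number(value).hex())
```

The decoder also accepts payloads padded with leading zero bytes, which the text permits, so decoding is more lenient than encoding.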
Finally, regarding object reference encoding in OTF-8: unless a special marker or a particular context says otherwise, a number appearing in OTF-8 is by default a reference encoding of an object in the encoding repository.
The encoding space, encoding directory entries and meta-encoding are briefly described below.
This encoding is carried out mainly through numeric numbering, and the numbering is hierarchical. The hierarchy reflects the hierarchy of encoding spaces in the encoding repository.
To give access to the encodings in an encoding space, each OTF-8 encoding space has exactly one encoding directory. Each encoding directory entry consists of an encoding type and an encoding path. The encoding path can be a sequence of contexts leading from the current encoding space to another encoding space. For example, when the encoding path is "current user", the corresponding encoding space is the current user's subspace of the current space. When the encoding path is empty (contains no context), the encoding belongs to the encoding space in which the directory entry resides. The encoding path can also be a string, that is, a name, in which case the corresponding encoding space is the named subspace of the current space. When the encoding type of a directory entry is "encoding directory entry", the data object corresponding to the encoding is the target space, and the encoding is called a space encoding.
The number of an encoding directory entry is its directory entry encoding.
Directory entry encodings and space encodings are both meta-encodings: they do not correspond to a specific data object instance but to a metadata object, namely an encoding directory entry or an encoding space. A meta-encoding must be followed by an instance encoding to form a complete object encoding.
By default, encoding starts from the root encoding space of the current encoding repository. For example, the encoding directory of the repository's root space is shown in Table 1 below:
Table 1
Number | Type | Encoding path
00 | Encoding directory entry | -
01 | Type provider | -
02 | Storage driver | -
03 | Encoding type | -
04 | Encoding context | -
05 | User | -
06 | Application | -
07 | Document | -
08 | Encoding space | User
09 | Encoding space | Application
10 | Encoding space | Document
11 | Handwritten text | -
12 | Handwritten text | User
The two-level number 05|256 then denotes user number 256. With the OTF-8 number encoding described above, the reference encoding of this user object takes four bytes:
10000101 10100001 00000001 00000000
Here the specification encoding "10000101" is the meta-encoding of the user object's encoding, and the following "10100001 00000001 00000000" is its instance encoding.
Suppose the encoding directory of the current user's encoding space is as shown in Table 2 below:
Table 2
Number | Type | Encoding path
00 | Encoding specification | -
01 | Application | -
02 | Document | -
03 | Encoding space | Application
04 | Encoding space | Document
05 | Handwritten text | -
The three-level number 08|05|256 then denotes handwritten text number 256 of the current user. The reference encoding of this handwritten text object takes five bytes:
10001000 10000101 10100001 00000001 00000000
Here the root space specification encoding "10001000" corresponds to the user encoding space, that is, it is a space encoding. The following "10000101" is the specification encoding numbered 05 in the user space. The space encoding and the specification encoding together form the meta-encoding "10001000 10000101" of the handwritten text object; the following "10100001 00000001 00000000" is its instance encoding.
Note that the entry numbered 11 in the root space encoding directory has the same content as the entry numbered 05 in the current user space's encoding directory, but their data objects come from different encoding spaces: one from the root space, the other from the current user space. In fact, the data objects referred to by the entry numbered 12 in the root space encoding directory are precisely the handwritten text of the current user space. The data object encoded above can therefore also be denoted by the two-level number 12|256, in the following form:
10001100 10100001 00000001 00000000
This takes only four bytes, saving one byte.
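How the five-byte and four-byte encodings reach the same object can be sketched by walking the encoding directories. The sketch makes several simplifying assumptions: the dictionaries are abridged, hypothetical renderings of Tables 1 and 2, the "user" path always leads to one fixed user directory (in the text it depends on the current user), and only streams made up entirely of OTF-8 numbers are handled.

```python
def decode_numbers(data):
    """Decode a byte string that consists only of OTF-8 numbers."""
    numbers, pos = [], 0
    while pos < len(data):
        lead = data[pos]
        if lead & 0xE0 == 0x80:                  # 100xxxxx: 0..31 inline
            numbers.append(lead & 0x1F)
            pos += 1
        elif lead & 0xF0 == 0xA0:                # 1010xxxx: length prefix
            end = pos + 2 + (lead & 0x0F)
            if end > len(data):
                raise ValueError("truncated number")
            numbers.append(int.from_bytes(data[pos + 1:end], "big"))
            pos = end
        else:
            raise ValueError(f"not an OTF-8 number at offset {pos}")
    return numbers


# number -> (type, encoding path); abridged from Tables 1 and 2
ROOT = {5: ("user", None), 8: ("encoding space", "user"),
        11: ("handwritten text", None), 12: ("handwritten text", "user")}
USER = {1: ("application", None), 2: ("document", None),
        5: ("handwritten text", None)}
DIRECTORIES = {"root": ROOT, "user": USER}


def resolve(data):
    """Return (space, type, instance number) for a reference encoding."""
    *meta, instance = decode_numbers(data)
    space = "root"
    for number in meta:
        entry_type, path = DIRECTORIES[space][number]
        if path is not None:
            space = path                         # follow the encoding path
        if entry_type != "encoding space":
            return space, entry_type, instance
    raise ValueError("reference ends before reaching a typed entry")


long_form = bytes([0x88, 0x85, 0xA1, 0x01, 0x00])    # 08|05|256
short_form = bytes([0x8C, 0xA1, 0x01, 0x00])         # 12|256
print(resolve(long_form), resolve(short_form))
```

Both forms resolve to handwritten text 256 in the user space; entry 12 simply folds the space switch and the type into a single directory entry.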
In addition, regarding encoding contexts and how they are set: besides their different meta-encodings, the two encodings of the handwritten text object above differ in one more respect. The former may correspond to different encoding types depending on the current user, whereas the encoding type of the latter is always handwritten text. This is because the encoding directories of different users' encoding spaces are not necessarily the same.
In fact, the encoding space of the entry numbered 08 in the root space encoding directory is not a fixed encoding space; it is the encoding space of whichever user is the current "user" context object, and so it differs from user to user.
A context is a role that appears in the system while encodings are in use; the specific object that fills the role is called the context object. A context object can be determined before the encoding is used; a user login, for instance, determines the current "user" context. It can also be switched dynamically during use: in the transcript document of a multi-user chat application, for example, the current user has to switch back and forth. The current context object is specified with an encoding sequence that begins with the specific byte 0xBD (10111101). This sequence is called a context setting encoding, and its syntax is:
0xBD <context encoding or context name> <context object encoding>
Suppose the contexts in the root space are as shown in Table 3 below:
Table 3
Number | Type | Name
00 | Encoding repository | -
01 | Encoding meta-object | Default meta-object
02 | User | Current user
03 | Application | Current application
04 | Document | Current document
05 | ... | ...
Then the following encoding:
0xBD 0x84 0x82 0x85 0xA1 0x01 0x00
sets the user object numbered 256 (05|256) as the current user (04|02). This 7-byte setting affects all user-related encodings that follow it, until the context is set again.
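A context setting sequence can be assembled from the marker byte and two reference encodings. The helper names below are hypothetical, and the sketch assumes that both the context and the object are given as multi-level numbers; 256 is written big-endian as 0xA1 0x01 0x00, following the number rules stated earlier in the text.

```python
SET_MARK = 0xBD  # marker byte that opens a setting sequence


def encode_number(n):
    """Shortest OTF-8 number encoding of n (0 .. 2**128 - 1)."""
    if n < 0 or n >= 1 << 128:
        raise ValueError("OTF-8 numbers cover 0 .. 2**128 - 1")
    if n < 32:
        return bytes([0x80 | n])
    length = (n.bit_length() + 7) // 8
    return bytes([0xA0 | (length - 1)]) + n.to_bytes(length, "big")


def encode_reference(*levels):
    """Concatenate the number encodings of a multi-level number such as 05|256."""
    return b"".join(encode_number(level) for level in levels)


def context_setting(context_levels, object_levels):
    """0xBD <context encoding> <context object encoding>."""
    return (bytes([SET_MARK])
            + encode_reference(*context_levels)
            + encode_reference(*object_levels))


# Make user 05|256 the "current user" context 04|02.
setting = context_setting((4, 2), (5, 256))
print(setting.hex())  # bd848285a10100
```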
Further, regarding the encoding terminator: "current user" is only one encoding context, and different applications may involve many others. One common system context is the "default meta-object". As mentioned earlier, the system's default meta-object is the root space of the current encoding repository. It can be changed with the context setting encoding described above.
Traditional text encodings have the concept of a code point, where one code point corresponds to one character. OTF-8 has a similar concept, except that an OTF-8 code point corresponds to a Unicode code point, an OTF-8 number, or a complete setting such as the context setting above. How, then, is a meta-object itself represented in an encoding? Using the meta-encoding directly would cause the encoding that follows to be mistaken for an instance encoding. A specific byte called the encoding terminator, 0xB8 (10111000), therefore tells the decoder where the code point ends. The following encoding sets the meta-object of entry 12 of the root space encoding directory as the default meta-object:
10111101 10000100 10000001 10001100 10111000
After this setting, the former two-level number 12|256 becomes the one-level number 256, and the earlier encoding:
10001100 10100001 00000001 00000000
becomes two object encodings: the first is the current user's private handwritten character numbered 12, and the second is the current user's private handwritten character numbered 256.
The encoding terminator is thus used mainly for encodings that correspond to meta-objects.
Furthermore, regarding the root space prefix: once the system default meta-object has been changed, some way is still needed to encode certain objects starting from the root space. OTF-8 uses a special byte, "10111001", to denote the root space; it is called the root space prefix. The following encoding is therefore independent of the current default meta-object:
10111001 10001100 10100001 00000001 00000000
It still corresponds to the two-level number 12|256, starting from the root space.
For all object reference encodings in OTF-8, an encoding without the root space prefix is decoded starting from the current default meta-object.
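The marker bytes introduced so far can be told apart with a simple tokenizer. This sketch only labels bytes; it does not interpret settings or track the default meta-object, and the token names are hypothetical. All of these bytes lie in the range 0x80 to 0xBF, which UTF-8 uses only for continuation bytes, so none of them can begin a valid UTF-8 character.

```python
TERMINATOR = 0xB8   # encoding terminator
ROOT_PREFIX = 0xB9  # root space prefix
SET_MARK = 0xBD     # opens a setting sequence
MARKERS = {TERMINATOR: "terminator", ROOT_PREFIX: "root", SET_MARK: "set"}


def tokenize(data):
    """Label the numbers and marker bytes in a stream of OTF-8 extension bytes."""
    tokens, pos = [], 0
    while pos < len(data):
        lead = data[pos]
        if lead & 0xE0 == 0x80:                  # 100xxxxx: number 0..31
            tokens.append(("number", lead & 0x1F))
            pos += 1
        elif lead & 0xF0 == 0xA0:                # 1010xxxx: length-prefixed number
            end = pos + 2 + (lead & 0x0F)
            if end > len(data):
                raise ValueError("truncated number")
            tokens.append(("number", int.from_bytes(data[pos + 1:end], "big")))
            pos = end
        elif lead in MARKERS:
            tokens.append((MARKERS[lead], None))
            pos += 1
        else:
            raise ValueError(f"unhandled byte 0x{lead:02X} at offset {pos}")
    return tokens


# Set entry 12 as the default meta-object, closed by the terminator.
print(tokenize(bytes([0xBD, 0x84, 0x81, 0x8C, 0xB8])))
# Root-prefixed reference 12|256.
print(tokenize(bytes([0xB9, 0x8C, 0xA1, 0x01, 0x00])))
```

In the first stream the terminator is what stops 12 from absorbing a following number as its instance encoding; in the second, the root prefix tells the decoder to ignore the current default meta-object.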
Furthermore, regarding system client encodings: as shown above, setting the default meta-object shortens encodings and improves encoding efficiency. Sometimes, however, a single document contains several kinds of encodings that belong to different encoding spaces, and the system default meta-object can improve efficiency for only one of them. OTF-8 provides eight system client encodings that can be bound to arbitrary encoded objects (including encoding meta-objects). Each is one byte long:
10110000
10110001
10110010
10110011
10110100
10110101
10110110
10110111
The data object bound to a client encoding is specified with an encoding sequence that begins with the same specific byte "10111101". This sequence is called a client encoding setting encoding, and its syntax is:
10111101 <client encoding> <data object encoding>
For example, the following setting binds the client encoding "10110000" to the user object denoted by the two-level number 05|256:
10111101 10110000 10000101 10100001 00000001 00000000
Once a client encoding has been defined, it can stand in for the encoding of the data object bound to it. The following encoding:
10111101 10000100 10000010 10110000
then has exactly the same meaning as the earlier 7-byte context setting encoding. The one-byte client encoding replaces the original four-byte object encoding.
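The substitution performed by a client encoding can be sketched as an expansion pass over a byte stream. This is an illustration under stated assumptions: the bindings are supplied by the caller as a hypothetical dictionary rather than parsed from setting sequences, the stream is assumed to contain no client encoding definitions, and well-formed input is assumed. The pass skips number payloads and UTF-8 characters so that a payload or continuation byte that happens to equal a client code is not rewritten.

```python
CLIENT_CODES = range(0xB0, 0xB8)  # the eight one-byte client encodings


def expand_client_codes(stream, bindings):
    """Replace each client code in `stream` with the encoding bound to it."""
    if any(code not in CLIENT_CODES for code in bindings):
        raise ValueError("client codes must lie in 0xB0 .. 0xB7")
    out, pos = bytearray(), 0
    while pos < len(stream):
        lead = stream[pos]
        if lead & 0xF0 == 0xA0:        # length-prefixed number: copy payload untouched
            size = (lead & 0x0F) + 2
        elif lead >= 0xF0:             # UTF-8 lead byte of a 4-byte character
            size = 4
        elif lead >= 0xE0:             # UTF-8 lead byte of a 3-byte character
            size = 3
        elif lead >= 0xC0:             # UTF-8 lead byte of a 2-byte character
            size = 2
        elif lead in bindings:         # client code: substitute the bound encoding
            out += bindings[lead]
            pos += 1
            continue
        else:                          # ASCII, one-byte number, or other marker
            size = 1
        out += stream[pos:pos + size]
        pos += size
    return bytes(out)


bindings = {0xB0: bytes([0x85, 0xA1, 0x01, 0x00])}   # 0xB0 -> user 05|256
short = bytes([0xBD, 0x84, 0x82, 0xB0])
print(expand_client_codes(short, bindings).hex())
```

Expanding the four-byte sequence yields the seven-byte context setting, which is the equivalence the text describes.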
再进一步的,对于OTF-8编码之对象表示,前面提到了,OTF-8中出现的数字缺省用于表示编码仓库中对象的引用编码。那么,OTF-8中如何直接表示数字呢?更进一步,如何直接编码对象本身而不是其引用/编号呢?Further, for the OTF-8 encoded object representation, as mentioned earlier, the numbers appearing in OTF-8 are by default used to represent the reference encoding of objects in the encoding repository. So how do you directly represent numbers in OTF-8? Further, how do you directly encode the object itself instead of its reference/number?
其答案就是自动类型推导,以及带类型编码的直接对象编码。The answer is automatic type derivation and direct object coding with type coding.
关于类型推导,在OTF-8内容解码过程中,可以用经典的“合一算法”进行类型推导。所有的OTF-8内容都有一个类型,缺省类型为OTF-8字符串类型,即根/通用对象数组。解码时,有一个系统的解码类型栈。栈顶放的是当前要解析的具体类型,当前类型对应的数据对象解析完成后,栈顶就被置换为当前类型结构的下一个元素的类型。如果当前结构完成,栈顶退栈,栈顶内容为父结构的下一个元素。Regarding type derivation, in the OTF-8 content decoding process, type derivation can be performed using the classic "integration algorithm". All OTF-8 content has a type, the default type is OTF-8 string type, which is the root/generic object array. When decoding, there is a system's decoding type stack. The top of the stack is the specific type to be parsed. After the data object corresponding to the current type is parsed, the top of the stack is replaced with the type of the next element of the current type structure. If the current structure is complete, the top of the stack is unstacked and the top of the stack is the next element of the parent structure.
For example, consider the following structure:
Figure PCTCN2015086672-appb-000018
When this type is parsed, the first number encountered is parsed as an integer rather than as an object reference code. Moreover, if the content parsed at this point is not an OTF-8 number, it is a data type error. Type information thus also provides the basis for type checking.
When the second element of the type is parsed, the system accepts either an integer or a string according to the type. Because numbers and strings have entirely different encoding formats in OTF-8, the parser can determine the actual type of the data object from the encoding format alone.
When the third element is parsed, byte is a subset of int, so the encoded forms of the two types overlap to some extent and the parser's type inference becomes difficult. OTF-8 provides the system context "current parsing type" to refine the type of the data object that immediately follows. In this case, one can use
Figure PCTCN2015086672-appb-000019
to specify that the next data object is of type byte, or use
0xBD <reference code of the "current parsing type" context> <reference code of the "int" type>
to specify that the next data object is of type int.
An incompatible type must not be used when setting the "current parsing type" context. In this example, int32 is compatible with int and can therefore be used. The string type, however, is compatible with neither byte nor int, and setting it as the "current parsing type" produces a type error.
Regarding direct object encoding: besides encoding an object directly after setting the "current parsing type" as described above, OTF-8 also allows the reference code of an encoding type, or the reference code of an encoding directory entry, to be followed immediately by the encoding of the corresponding data content.
For a parameterized type, the type must be followed immediately by the list of type reference codes corresponding to its type parameters.
Consequently, the basic types of all data objects to be represented in OTF-8 must be stored in the encoding repository. In the root-space encoding directory mentioned earlier, the directory entry numbered 03 is the encoding type. Its contents are shown in Table 4:
Table 4
Number | Encoding type
00 | Type
01 | Unsigned integer
02 | Signed integer
03 | Floating-point number
04 | GUID
05 | Boolean
06 | UTF-8 character
07 | UTF-8 string
08 | Object reference
09 | Nullable object
10 | Array
11 | Tuple
12 | Dictionary
The various types of data objects are then represented as follows:
1. Representation of numbers
2. Representation of unsigned integers
For an unsigned integer, the data is placed directly after the unsigned-integer type code. For example, the following encoding represents the number 256:
0x83 0x81 0xA1 0x00 0x01
3. Representation of signed integers
A signed integer is represented by an unsigned integer using ZigZag encoding.
ZigZag uses even numbers to represent zero and the positive integers, and odd numbers to represent the negative integers, as shown in the following table:
Signed integer | Encoded result (unsigned integer)
0 | 0
-1 | 1
1 | 2
-2 | 3
2147483647 | 4294967294
-2147483648 | 4294967295
An unsigned integer is decoded into the corresponding signed integer with the following expression: (n>>1)^(-(n&1))
The following encoding represents the signed integer 128:
0x83 0x82 0xA1 0x00 0x01
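The ZigZag mapping in the table can be checked with a short Python sketch. The decoding expression is the one quoted in the text; the matching encoder is the conventional ZigZag formula for a fixed bit width and is an assumption here, since the text does not spell it out. The function names are hypothetical.

```python
def zigzag_encode(value: int, bits: int = 32) -> int:
    """Map a signed integer to an unsigned one (assumed conventional formula)."""
    mask = (1 << bits) - 1
    # Arithmetic right shift yields 0 for non-negative and -1 for negative values.
    return ((value << 1) ^ (value >> (bits - 1))) & mask


def zigzag_decode(n: int) -> int:
    """Decode using the expression given in the text: (n>>1)^(-(n&1))."""
    return (n >> 1) ^ -(n & 1)


if __name__ == "__main__":
    for signed in (0, -1, 1, -2, 2147483647, -2147483648):
        print(signed, zigzag_encode(signed))
```

With this mapping the unsigned value 256 decodes to the signed value 128, which matches the example encoding above.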
4. Representation of floating-point numbers
For floating-point numbers, OTF-8 adopts the IEEE 754 standard directly. It supports the common single-precision 32-bit (four-byte) and double-precision 64-bit (eight-byte) formats, represented by OTF-8 four-byte and eight-byte numbers respectively. The value is encoded in big-endian order. The specific forms are:
0x83 0x83 0xA3 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
and
Figure PCTCN2015086672-appb-000020
Half-precision and quadruple-precision floating point can also be supported if required.
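As an illustration of the four-byte form, the following Python sketch packs a single-precision value in big-endian IEEE 754 order behind the three prefix bytes copied from the form shown above. The helper name `encode_float32` is hypothetical, and the eight-byte form is left out because its prefix appears only in the figure.

```python
import struct

# Prefix bytes copied from the four-byte form given in the text.
FLOAT32_PREFIX = bytes([0x83, 0x83, 0xA3])


def encode_float32(value: float) -> bytes:
    """Return prefix + big-endian IEEE 754 single-precision payload."""
    return FLOAT32_PREFIX + struct.pack(">f", value)


def decode_float32(data: bytes) -> float:
    """Inverse of encode_float32; raises ValueError on a wrong prefix."""
    if data[:3] != FLOAT32_PREFIX or len(data) != 7:
        raise ValueError("not a four-byte float encoding")
    return struct.unpack(">f", data[3:])[0]


if __name__ == "__main__":
    print(encode_float32(1.0).hex(" "))
```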
Representation of GUIDs
Similarly, a GUID can be represented directly by a 16-byte number, in the following form:
Figure PCTCN2015086672-appb-000021
5. Representation of Booleans
OTF-8 defines two special bytes to represent Boolean values.
Byte 0xBB (10111011) represents logical true; byte 0xBC (10111100) represents logical false.
Representation of characters and strings
OTF-8 can represent UTF-8 characters and strings directly. To separate several consecutive strings, OTF-8 adopts the convention that a string may end with "0x0" (a string that does not end with "0x0" ends at the last consecutive OTF-8 character); a string consisting of a single "0x0" character is the empty string.
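The terminator convention can be sketched in Python as below. This is a simplified illustration that assumes the input contains only UTF-8 text and 0x00 terminators; it does not handle the other OTF-8 marker bytes that could also end a string. The function name `split_strings` is hypothetical.

```python
def split_strings(data: bytes) -> list[str]:
    """Split consecutive 0x00-terminated UTF-8 strings.

    A lone 0x00 yields an empty string; a final string may be unterminated.
    """
    parts = data.split(b"\x00")
    if parts and parts[-1] == b"":
        # The input ended with a terminator (or was empty): drop the stub.
        parts.pop()
    return [part.decode("utf-8") for part in parts]


if __name__ == "__main__":
    print(split_strings(b"abc\x00\x00de"))
```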
6. Representation of complex objects
A complex object is composed of simple objects according to some rule. In OTF-8 it must be delimited by two special system objects: the object start marker, byte 0xFE (11111110), and the object end marker, byte 0xFF (11111111). The content of the data object is encoded between the start and end markers.
Further, in this embodiment, regarding OTF-8 encoding and the type system: as can be seen, encoding types are essential to object representation in OTF-8. In fact, OTF-8 is built on an extensible, complete type system. OTF-8 has several built-in basic types: integer, Unicode character, Boolean, floating point, Unicode string, and OTF-8 string (in effect an object array). OTF-8 also supports parameterized types; the built-in parameterized types include the encoding reference type, nullable type, tuple type, array type and dictionary type. OTF-8 allows users to define their own structures, interfaces and services, and to inherit from and extend existing types. Users may also introduce external encoding methods to extend the existing types.
OTF-8 defines an encoding type definition language with which users can define new types. This definition language is independent of any existing programming language, but mappings to the elements of existing programming languages can be established so that conversion between languages is automatic, for example generating type declarations for a particular programming language from the type descriptions in the encoding repository, or extracting encoding type definitions from the source code or build output (executable file) of a particular programming language. In this type definition language, built-in types have concise forms, as listed in Table 5:
Table 5
Actual type | Abbreviated type
OpenCode.Object | *
OpenCode.Integer | int
OpenCode.Char | char
OpenCode.String | string
OpenCode.Boolean | bool
OpenCode.Float | float
OpenCode.Object[] | STRING
In this embodiment, regarding type identifiers: every OTF-8 type has a unique type identifier. To guarantee uniqueness, a specific naming convention is generally adopted, specifying for instance separators, namespaces and naming rules.
Regarding the root type: all data objects that OTF-8 can express share a common root type, so that a standard UTF-8 string corresponds to an OTF-8 object string. This root type is "OpenCode.Object". The encoding type and encoding space of an object can be obtained from any OpenCode.Object.
Figure PCTCN2015086672-appb-000022
In the OTF-8 type definition syntax, an asterisk (*) denotes the root type, which in effect means any type.
Regarding the empty type: the empty type is a type that corresponds to no data object. For example, the context-setting and extension-code-setting encodings mentioned earlier correspond to the empty type. In the OTF-8 type definition syntax, the symbol "()" denotes the empty type; if the return type of a method or function is empty, it can be omitted. For example, the function:
Start()
has an empty input type and an empty return type. Its type is
()->()
Simple types and complex types
Whether a type is simple or complex is judged here in terms of its encoded expression. In OTF-8 the simple types are: the encoding reference type, integer types, the Boolean type, floating-point types, the Unicode character type, the Unicode string type, and their extension types. Apart from the Unicode string type, which corresponds to multiple objects, each of these types corresponds to a single object. Simple types can be encoded directly in OTF-8.
Regarding type aliases: a type alias defines an existing type as a new type under a different type identifier. The corresponding definition syntax is:
<new type identifier>:type <existing type identifier>
For example:
MyTypes.YesOrNo:type OpenCode.Boolean
Regarding constrained types: a type constraint restricts an existing simple type (mainly numeric, character and string types) to yield a new constrained numeric, character or string type. The corresponding definition syntax is:
<new type identifier>:type <numeric, character or string type>{constraint}
For a numeric type, the constraint is the range of permitted values, for example:
OpenCode.Byte:type OpenCode.Integer{[0,255]}
which denotes the integer type ranging from 0 to 255.
For a character type, the constraint is a range of Unicode characters.
For a string type, the constraint is a limit on the string length together with a regular-expression pattern, for example:
邮政编码:type OpenCode.String{[0-9]{6}}
which denotes a string type of six digits (邮政编码 means "postal code").
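The two constraint examples above can be mirrored by simple validators. The sketch below is only an illustration of what a type checker might enforce; the function names are hypothetical and the checks use Python's standard `re` module rather than any OTF-8 facility.

```python
import re

# Pattern copied from the postal-code constraint in the text.
POSTAL_CODE = re.compile(r"[0-9]{6}")


def is_byte(value) -> bool:
    """OpenCode.Byte: an integer constrained to the range [0, 255]."""
    return isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 255


def is_postal_code(text) -> bool:
    """String constrained by the regular expression [0-9]{6}."""
    return isinstance(text, str) and POSTAL_CODE.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_byte(255), is_byte(256), is_postal_code("610000"), is_postal_code("61000"))
```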
Regarding parameterized types: OTF-8 also supports parameterized types, also called generic types. In a parameterized type, the sub-elements that make up the type are parameters rather than fixed types; the final type is determined only once the parameters are made concrete. For example, a generic array type becomes an integer array type when its parameter is specified as integer, and a string array type when its parameter is specified as string. Every complex type in OTF-8 can be defined as a parameterized type, and parameterized types can also be used directly within a definition. Parameters are declared after the type keyword (class, enum, type and so on), enclosed in angle brackets "<" and ">", with multiple parameters separated by ",". Where a parameterized type is used within a type definition, its parameters can be made concrete directly: the identifier of the parameterized type is followed by a parameter list enclosed in angle brackets and separated by ",".
A type alias definition can fix all or only some of the parameters of a parameterized type.
For example, the parameterized dictionary type has two type parameters, the key type and the value type. A string-to-string dictionary can be defined as follows (字符串字典 means "string dictionary" and 字典 means "dictionary"):
字符串字典:type 字典<string,string>
A parameterized dictionary whose key type is integer can also be defined (整数键字典 means "integer-key dictionary"):
整数键字典:type<T> 字典<int,T>
Here T is a type parameter corresponding to the value type of the dictionary.
When a data object of a parameterized type is encoded, the references or type codes of the types corresponding to the parameters must be given before the data object itself. The type references and the data object are separated by a dedicated separator. The OTF-8 system separator object is byte 0xBA (10111010); it separates the different syntactic elements within a structure. An example of directly encoding a data object of a parameterized dictionary type is as follows:
Figure PCTCN2015086672-appb-000023
Because bytes 0xFE, 0x00 and 0xFF are not displayable characters, they are highlighted here to set them apart.
Regarding merged types: a merged type is a type that admits the encoded forms of several types at once. A merged type is defined with the following syntax:
<new type identifier>:type <existing type identifier 1>{constraint 1}|<existing type identifier 2>{constraint 2}|…
For example:
OpenCode.SmartFloat:type OpenCode.Float64|OpenCode.String {[+-]?[0-9]*(\.[0-9]+)?|-?[1-9]\.?[0-9]+([eE][-+]?[0-9]+)?}
Where appropriate, this allows a double-precision floating-point value that would otherwise need 9 bytes to be expressed in fewer bytes. For example, "1" takes only one byte, ".24356" only 6 bytes and "6e23" only 4 bytes.
Recursive definitions are allowed when defining a merged type: the type being defined can be used directly in the body of its own definition. For example, a tree type (树 means "tree") is defined as follows:
树:type<T>(T,树[])|T
A corresponding string-tree data object is encoded as follows:
Figure PCTCN2015086672-appb-000024
Figure PCTCN2015086672-appb-000025
As can be seen, this is a tree of China's administrative divisions. The line breaks and spaces or tabs were added by hand for readability; these control characters do not exist in the actual encoded content. Nevertheless, given the tree type defined above, an OTF-8 parser can encode, decode and validate the corresponding data object.
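The recursive tree type, a node that is either a bare value T or a pair (T, list of trees), can be validated with a short recursive function. The sketch below fixes T to string and models the pair as a Python tuple; the sample data is made up for illustration and is not taken from the figures, and the function name `is_string_tree` is hypothetical.

```python
def is_string_tree(node) -> bool:
    """Check a value against the recursive type (T, tree[]) | T with T = str."""
    if isinstance(node, str):
        # Second alternative: a leaf that is just a value of type T.
        return True
    if isinstance(node, tuple) and len(node) == 2:
        # First alternative: a value of type T plus an array of subtrees.
        value, children = node
        return (
            isinstance(value, str)
            and isinstance(children, list)
            and all(is_string_tree(child) for child in children)
        )
    return False


if __name__ == "__main__":
    # Illustrative data only.
    sample = ("China", [("Sichuan", ["Chengdu", "Mianyang"]), "Beijing"])
    print(is_string_tree(sample))
```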
Regarding the null object: unlike the empty type, the null object is an object rather than a type. The null object has its own special type (as opposed to the empty type, which has no instances), denoted Null. This type has exactly one instance, the null object, and the type is not used directly.
The null object indicates that the corresponding data object does not exist. It is represented directly by an encoding terminator (0xB8).
Regarding nullable types: a nullable type is the type formed by merging an arbitrary type with Null. It corresponds to a data type that may carry no data, and can be described in the type syntax as follows (可空类型 means "nullable type"):
可空类型:type<T>T|Null
The OTF-8 type system has built-in support for nullable objects, which can be used in a simplified form in the type definition syntax: appending a question mark "?" directly after a type turns it into a nullable type. For example:
string?
denotes a nullable string. For this type, the null object and the empty string are two entirely different objects: the former means that no string exists, the latter that the content is an empty string.
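The distinction between the null object and the empty string can be made concrete with a small encoder for `string?`. This is a sketch under the assumption that a present string is written as its UTF-8 bytes followed by the 0x00 terminator; the function names are hypothetical, and the decoder assumes the string itself does not begin with the 0xB8 byte.

```python
NULL_OBJECT = b"\xB8"  # encoding terminator used for the null object


def encode_nullable_string(value) -> bytes:
    """Encode a value of type string?: None becomes the null object."""
    if value is None:
        return NULL_OBJECT
    # Assumption: present strings are UTF-8 bytes plus a 0x00 terminator.
    return value.encode("utf-8") + b"\x00"


def decode_nullable_string(data: bytes):
    """Inverse of encode_nullable_string for a single encoded value."""
    if data == NULL_OBJECT:
        return None
    if not data.endswith(b"\x00"):
        raise ValueError("missing string terminator")
    return data[:-1].decode("utf-8")


if __name__ == "__main__":
    print(encode_nullable_string(None), encode_nullable_string(""))
```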
Regarding array types: the array type is also a parameterized type, which arranges multiple data objects of any one type in sequence. The OTF-8 type system has built-in support for array types too, with a concise notation: placing a pair of square brackets after a type converts it into the corresponding array type.
A number inside the square brackets constrains the number of elements of the array.
For example, the following type is an integer array with no limit on the number of elements:
int[]
The following type is a string array that must contain exactly 5 strings:
string[5]
When the OTF-8 decoding system parses the corresponding data object, a type-checking error is raised if the number of elements obtained is not 5.
The following type is a Boolean array whose number of elements can only be 5, 6 or 7:
bool[5..7]
In addition, OTF-8 supports the definition of multidimensional arrays, for example:
string[3][4..5]
This is a two-dimensional array with 3 rows and either 4 or 5 columns. A concrete two-dimensional array object can only be 3×4 or 3×5; it cannot have some rows with 4 columns and others with 5.
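The element-count constraints for `string[5]`, `bool[5..7]` and `string[3][4..5]` can be expressed as two small checks. The sketch below is an illustration of the rule stated in the text, including the requirement that all rows of a two-dimensional array share one width; the function names are hypothetical.

```python
def check_count(items, low, high=None) -> bool:
    """True when len(items) lies in [low, high]; high defaults to low."""
    high = low if high is None else high
    return low <= len(items) <= high


def check_grid(rows, row_count, min_cols, max_cols) -> bool:
    """True when rows form a row_count x N grid with min_cols <= N <= max_cols."""
    if len(rows) != row_count:
        return False
    widths = {len(row) for row in rows}
    # Every row must have the same width, so exactly one width may occur.
    return len(widths) == 1 and min_cols <= widths.pop() <= max_cols


if __name__ == "__main__":
    print(check_count(["a"] * 5, 5), check_grid([["x"] * 4] * 3, 3, 4, 5))
```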
Regarding tuple types: the tuple type is also a parameterized type, whose parameters can be any number of arbitrary types. Its data is the sequence of data objects of the corresponding types. A tuple type with a single data type is equivalent to that data type; a tuple type with no data types is the empty type.
OTF-8 has built-in support for tuple types: a list of type parameters enclosed in parentheses "(" and ")" and separated by commas denotes a tuple.
For example, (int,string)[]? is the nullable array type of tuples consisting of an integer and a string.
When serialized (encoded), a tuple object must also be enclosed by the start (0xFE) and end (0xFF) markers.
Regarding dictionary types: the dictionary type is also a parameterized type with two parameters, the key type and the value type. In essence it is an array of the corresponding tuple type, with one additional constraint: the key parts of the array element objects must be unique and may not repeat. OTF-8 has built-in support for dictionary types: a key type and a value type separated by a colon (":") and enclosed in square brackets ("[" and "]") denote the corresponding dictionary type. For example:
[string:int]
denotes a dictionary type mapping strings to numbers. The individual elements of a dictionary are not enclosed by start and end markers.
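Since a dictionary is described as an array of key/value tuples with unique keys, a decoder would need to reject repeated keys. A minimal Python sketch of that check follows; the function name `build_dictionary` is hypothetical and the input is assumed to be an already-decoded list of pairs.

```python
def build_dictionary(pairs):
    """Turn decoded (key, value) pairs into a dict, rejecting duplicate keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            # The uniqueness constraint on keys is violated.
            raise ValueError(f"duplicate dictionary key: {key!r}")
        result[key] = value
    return result


if __name__ == "__main__":
    print(build_dictionary([("one", 1), ("two", 2)]))
```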
Regarding classes: as with classes in object-oriented languages, a class in OTF-8 comprises members and methods. The syntax of a class definition is as follows:
Figure PCTCN2015086672-appb-000026
When an object of a class is encoded, the contents of its member data objects are encoded in the order in which the members appear. In addition, when a member has its default value, a system-defined special marker can be used to tell the system so. This default-value marker is the special byte 0xBE (10111110).
A class member can be defined with the system keyword context. The data of a member marked with this keyword is stored in the corresponding encoding space, whereas the data of members without the marker is stored in the shared store.
For example, consider the following contact class:
Figure PCTCN2015086672-appb-000027
Figure PCTCN2015086672-appb-000028
A corresponding data object is then encoded as follows:
Figure PCTCN2015086672-appb-000029
This data object is ultimately saved in the encoding repository. The same contact often appears in the address books of different users, so the contact's main information is kept in shared storage and referenced by different users. The "nickname" (昵称), however, generally differs from person to person, so the "context" keyword here indicates that this field is stored in the target context space. One possible context-independent store of the contact on the data storage server is as follows:
Contact ID | Name | Email address | Phone number
... | ... | ... | ...
4623478 | 张三 (Zhang San) | zhangsan12345@sina.com | 13234567890
The context-dependent storage for this type is as follows:
Encoding space ID | Contact number | Contact ID | Nickname
(encoding space ID of user 1) | 005 | 4623478 | 老张 (Lao Zhang)
(encoding space ID of user 1) | 007 | 4623478 | 小三儿 (Xiao San'er)
In this way different users can share the same contact, while their numbers and nicknames for that contact are isolated from one another by encoding space. This improves storage utilization, since the context-independent part of a data object does not have to be stored more than once.
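The split between shared and context-dependent storage can be sketched with two in-memory tables. The contact data below is taken from the tables above; the encoding space identifiers `space-A` and `space-B`, the field names and the function `resolve_contact` are hypothetical placeholders, since the text does not give concrete space IDs or an API.

```python
# Context-independent store: one record per contact, shared by all users.
SHARED_CONTACTS = {
    4623478: {
        "name": "张三",
        "email": "zhangsan12345@sina.com",
        "phone": "13234567890",
    },
}

# Context-dependent store: keyed by (encoding space ID, contact number).
# The space IDs are placeholders invented for this sketch.
SPACE_CONTACTS = {
    ("space-A", 5): {"contact_id": 4623478, "nickname": "老张"},
    ("space-B", 7): {"contact_id": 4623478, "nickname": "小三儿"},
}


def resolve_contact(space_id, number):
    """Combine the shared record with the fields private to one space."""
    local = SPACE_CONTACTS[(space_id, number)]
    merged = dict(SHARED_CONTACTS[local["contact_id"]])
    merged["nickname"] = local["nickname"]
    return merged


if __name__ == "__main__":
    print(resolve_contact("space-A", 5))
```

Both lookups return the same shared name, email address and phone number, and differ only in the nickname held in each encoding space.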
Unlike object methods in object-oriented programming languages, a method in OTF-8 is only a syntactic definition. Methods from a definition can be applied directly in an OTF-8 encoded document. The definition of a method determines its type, and both client and server must use this type information to verify that a method application is syntactically correct. The concrete implementation of the method is ultimately executed by a remote service.
Regarding interfaces: an interface has only methods. An interface is an abstract type that mainly defines the roles of objects and the interaction protocol between them. Interfaces are ultimately implemented by classes.
Figure PCTCN2015086672-appb-000030
For example:
Figure PCTCN2015086672-appb-000031
Inheritance and implementation
As with classes in object-oriented programming languages, a class can be a subclass of another class and an interface can be a sub-interface of another interface. Classes can also implement interfaces. Interfaces in OTF-8 support single inheritance; classes likewise support only single inheritance, that is, a class can derive from at most one class, but it can implement several interfaces at once.
The members of a subclass are encoded starting from the root object, following the inheritance chain: the members of all ancestor classes, the parent class and the class itself are encoded in that order. Method numbers of a subclass are likewise assigned in order along the inheritance chain: the methods of all ancestor classes and the parent class, the methods of the implemented interfaces, and then the methods the class itself defines.
Figure PCTCN2015086672-appb-000032
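The member ordering rule for subclasses, root first and then down the single-inheritance chain, can be sketched as follows. The class names, member names and the function `member_encoding_order` are hypothetical and serve only to illustrate the ordering; they are not taken from the figure.

```python
def member_encoding_order(class_name, parents, members):
    """List members in encoding order: root ancestor first, the class itself last.

    parents maps a class to its single parent (or None for a root class);
    members maps a class to the members it declares itself.
    """
    chain = []
    current = class_name
    while current is not None:
        chain.append(current)
        current = parents.get(current)
    ordered = []
    for name in reversed(chain):  # walk from the root down to class_name
        ordered.extend(members.get(name, []))
    return ordered


if __name__ == "__main__":
    parents = {"Employee": "Person", "Person": None}
    members = {"Person": ["name", "email"], "Employee": ["company"]}
    print(member_encoding_order("Employee", parents, members))
```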
Regarding encoding reference types: the encoding reference type is a parameterized type whose parameter can only be a class. The content of its data object is the number assigned to the object in the storage for that type within the corresponding encoding space. The encoding reference type is the most important type in OTF-8: through it, data in the encoding repository can be referenced by code number. Encoding directory entries in the encoding repository also appear as meta-codes in the form of encoding references. In the OTF-8 type definition syntax, a class identifier followed immediately by "#" denotes the corresponding encoding reference type. For example:
联系人#
denotes the reference type of the "contact" (联系人) class; its instances are the corresponding reference codes in the encoding repository.
Enumerations
OTF-8 has two kinds of enumeration: symbol enumerations and object enumerations.
A symbol enumeration is the same as an enumeration type in an ordinary programming language: a numbered list of symbols, defined as a set of named integers. Its syntax is:
<new type identifier>:enum{<name 1[=number 1]>,<name 2[=number 2]>,…}
Unlike enumeration types in ordinary programming languages, the OTF-8 object enumeration type is a parameterized type, defined as a set of named objects. Its syntax is:
<new type identifier>:enum<<enumerated type>>{<object 1[=number 1]>,<object 2[=number 2]>,…}
For example (星期 means "day of the week"; the values are Sunday through Saturday):
星期:enum<string>{“周日”,“周一”,“周二”,“周三”,“周四”,“周五”,“周六”}
When the objects have no explicit numbers, the first object is coded as 0 and the following objects are numbered consecutively.
The number corresponding to each name can also be specified explicitly, for example:
Poker.Figure:enum<string|int>{3=3,4=4,5=5,6=6,7=7,8=8,9=9,10=10,“Jake”=11,“Queen”=12,“King”=13,“A”=14,“2”=15,“Black Joker”=16,“Red Joker”=17}
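The two numbering modes for object enumerations can be sketched in Python. The sketch covers only the cases stated in the text, either no explicit numbers (numbering starts at 0) or every number given explicitly; how mixed declarations are numbered is not specified and is not modelled. The function names are hypothetical.

```python
def number_implicitly(objects):
    """Assign 0, 1, 2, ... to objects in declaration order."""
    return {obj: index for index, obj in enumerate(objects)}


def number_explicitly(pairs):
    """Use the declared (object, number) pairs, rejecting repeated numbers."""
    numbers = [number for _obj, number in pairs]
    if len(set(numbers)) != len(numbers):
        raise ValueError("enumeration numbers must be unique")
    return dict(pairs)


if __name__ == "__main__":
    weekdays = number_implicitly(["周日", "周一", "周二", "周三", "周四", "周五", "周六"])
    print(weekdays["周日"], weekdays["周六"])
    # A subset of the Poker.Figure example; keys may be int or string.
    print(number_explicitly([(3, 3), (10, 10), ("Queen", 12), ("A", 14)]))
```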
In fact, the OTF-8 type definition language supports object descriptions for all types; these are mainly used for the objects in object enumeration definitions and for default values in class definitions.
Services
A service differs from an object method: it does not belong to any object but is a collection of functions, and usually corresponds to a network service on some node of the network.
Figure PCTCN2015086672-appb-000033
For example, a numerical weather forecast network service can be defined as follows:
Figure PCTCN2015086672-appb-000034
关于外部类型,OTF-8除了内置支持上述类型以外,还可以通过类型供应器来支持外部类型,从而实现容纳任意现有编码格式。Regarding external types, in addition to the built-in support for the above types, OTF-8 can also support external types through type providers, so as to accommodate any existing encoding format.
现有编码格式无外乎两类编码方式:文本编码和二进制编码。文本编码对应string类型。在OTF-8中可以直接表达。而对于二进制编码,OTF-8中有一个特定标记字节0xBF(10111111)用于表示二进制字节流。其后为一个OTF-8整数,表示该字节流的大小,再之后的内容就是具体的二进制字节流。Existing encoding formats are nothing more than two types of encoding: text encoding and binary encoding. The text encoding corresponds to the string type. It can be expressed directly in OTF-8. For binary encoding, there is a specific tag byte 0xBF (10111111) in OTF-8 for representing a binary byte stream. This is followed by an OTF-8 integer representing the size of the byte stream, and then the content is the specific binary byte stream.
On top of this support for text and binary content, the OTF-8 encoding system supports the specific syntax and semantics of different encodings by providing a separate encoding driver for each encoding type.
Specifically, in this embodiment, two concrete examples illustrate the description above:
The first example concerns XML encoding.
XML is a text-based markup language. OTF-8 can support it in two ways.
One way is to embed the content of the XML document directly in the OTF-8 document, where it corresponds to an OTF-8 string object. Through the XML type provider (embedded type), however, we can still obtain and access the Document Object Model (DOM) of that object.
The other way is to extend the XML type system directly into OTF-8. XML is a metalanguage; the syntactic structure of a specific kind of XML document can be defined in languages such as DTD, XML Schema, and RelaxNG. For example, the standard web vector graphics format SVG is defined by a DTD. Through the DTD type provider, we can read and parse the SVG DTD and generate a corresponding series of element types and attribute types. These types are linked by certain relationships and constraints, which can be used for syntax checking and type inference. Following the SVG DTD, the DTD type provider (mapping type) creates a corresponding space in the encoding repository and encodes the resulting type objects directly in it. A data object of an SVG type, that is, an SVG document, can then be encoded directly against the SVG types (the corresponding element and attribute types) in the encoding repository. This is much more efficient than the traditional XML text form, and it reuses the existing body of XML technology to the greatest possible extent.
For example:
Figure PCTCN2015086672-appb-000035
Figure PCTCN2015086672-appb-000036
The above is the content of an SVG file; its rendering result is shown in Figure 24.
Through the DTD type provider, we obtain a series of SVG element and attribute types. As shown in Figure 24, it is easy to see that most of the redundancy in XML comes from the element names and attribute names that serve as syntactic markup, together with the delimiter characters that separate node names from node values, such as ">", "<", "/", and "=". Because OTF-8 is not bound by the limits of a standard encoding, we can encode the information items of the XML Information Set (XML Infoset) directly with open encoding, which greatly reduces this redundancy.
We can place some of the XML information item properties in the encoding repository and use the corresponding codes directly. The resulting type information in the encoding repository is as follows:
Figure PCTCN2015086672-appb-000037
Figure PCTCN2015086672-appb-000038
The encoding repository data for the type xml.infoset.element is as follows:
Figure PCTCN2015086672-appb-000039
The encoding repository data for the type xml.infoset.attribute is as follows:
Figure PCTCN2015086672-appb-000040
Figure PCTCN2015086672-appb-000041
With OTF-8 encoding, the original SVG document can be represented as follows:
Figure PCTCN2015086672-appb-000042
Figure PCTCN2015086672-appb-000043
Its document object model is identical to that of the original, but the encoded data occupies only 380 bytes, a saving of more than 60% compared with the original's 980-odd bytes.
Looking at the OTF-8 document above and comparing it with the earlier string-tree example of China's administrative divisions, we find that this document contains many type tags, such as the element tags shown in green and the attribute tags shown in blue-green. This is because DTDs have limited means of expressing types: most attribute types are strings, so type inference can rarely derive the correct type, and type tags are therefore indispensable. In practice, a type provider based on XML Schema or RelaxNG would produce richer types, and the resulting XML OTF-8 documents would be more compact and efficient.
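The idea of moving element and attribute names out of the document and into a shared table can be sketched with the Python standard library. This is not the OTF-8 encoding itself: the nested-tuple form, the "@" prefix used to distinguish attribute names, and the sample SVG fragment are assumptions for illustration, and the sketch ignores text nodes and namespaces.

```python
import xml.etree.ElementTree as ET


def encode(elem, names):
    """Replace tag and attribute names by integer codes from a shared table."""
    def code(name):
        # Assign the next free code the first time a name is seen.
        return names.setdefault(name, len(names))

    attrs = [(code("@" + key), value) for key, value in elem.attrib.items()]
    children = [encode(child, names) for child in elem]
    return (code(elem.tag), attrs, children)


def decode(node, names_by_code):
    """Rebuild the element tree from the coded form and the name table."""
    tag_code, attrs, children = node
    elem = ET.Element(names_by_code[tag_code])
    for attr_code, value in attrs:
        elem.set(names_by_code[attr_code][1:], value)  # strip the "@" prefix
    for child in children:
        elem.append(decode(child, names_by_code))
    return elem


SAMPLE = (
    '<svg width="100" height="100">'
    '<circle cx="50" cy="50" r="40"/>'
    '<circle cx="20" cy="20" r="5"/>'
    '</svg>'
)

root = ET.fromstring(SAMPLE)
name_table = {}                      # plays the role of the repository
coded = encode(root, name_table)
rebuilt = decode(coded, {v: k for k, v in name_table.items()})
```

Each name is stored once in `name_table` however often it occurs in the document, and the rebuilt tree serializes to the same XML as the original, which mirrors the claim that the document object model is unchanged.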
The second example concerns Buffer Protocol encoding.
Google's Buffer Protocol is likewise an object serialization format with a schema. Its type definition language can serve directly as the type definition of the corresponding types. Through the Buffer Protocol type provider, a binary data object encoded with Buffer Protocol can be mapped to a data object of an OTF-8 type. Specifically, OTF-8 defines the system code 0xBF (10111111) as the start marker of an embedded binary data block. The marker byte is followed by an integer (encoded in open encoding) giving the number of bytes in the block, and then by the binary byte stream itself.
In principle, given type inference, a binary data type could be represented by the block length followed by the block alone. The binary block marker is introduced mainly to make parsing reliable. Any OTF-8 code point, including the system codes, may appear inside a binary stream, so a parser must be able to skip an embedded binary stream without any metadata (including type information). The binary marker system code serves exactly this purpose.
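How a parser can step over embedded binary blocks without any type information can be sketched as follows. The sketch assumes a stream that contains only ordinary UTF-8 text and 0xBF-introduced blocks, and it again uses an unsigned LEB128 varint as a stand-in for the OTF-8 integer; other OTF-8 system codes are not handled. Note that 0xBF is also a legal UTF-8 continuation byte (for example, "¿" is C2 BF), so the marker is only recognized where a new code is expected to begin.

```python
def read_length(data: bytes, pos: int):
    """LEB128 stand-in for an OTF-8 integer; returns (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def utf8_sequence_length(lead: int) -> int:
    """Length of a UTF-8 sequence, determined from its lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"unsupported lead byte 0x{lead:02X} in this sketch")


def split_stream(data: bytes):
    """Split a stream into ("text", str) and ("binary", bytes) parts."""
    parts, pos, text_start = [], 0, 0
    while pos < len(data):
        if data[pos] == 0xBF:                    # marker at a code boundary
            if pos > text_start:
                parts.append(("text", data[text_start:pos].decode("utf-8")))
            size, start = read_length(data, pos + 1)
            parts.append(("binary", data[start:start + size]))
            pos = text_start = start + size      # jump past the payload
        else:
            pos += utf8_sequence_length(data[pos])
    if pos > text_start:
        parts.append(("text", data[text_start:pos].decode("utf-8")))
    return parts
```

The payload bytes are never interpreted, so values inside the block that happen to look like markers or lead bytes cannot confuse the scan.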
As can be seen, in OTF-8 the "type provider" is the key to supporting existing encoding standards and custom encodings.
In fact, OTF-8 defines a type for every code point, together with rules for combining these types; together they form the OTF-8 type system. There are two kinds of type provider. The mapping kind maps the specific types in an external type definition onto the OTF-8 type system, so that the external encoding can be reconstructed in OTF-8 form. This keeps the original schema definition while adding the benefits of the encoding repository, such as a more secure authorization model for metadata access, centralized metadata sharing, and a more compact encoded form. The DTD type provider in the SVG example above is of the mapping kind.
The other kind of type provider is the embedded kind, which embeds data in the external encoding directly into the OTF-8 encoding as a single data type. The original encoder and decoder encode and decode the content to form a corresponding OTF-8 object. For a text-based serialization method, what is embedded is a UTF-8 string (if the original encoding is not UTF-8, it must first be transcoded); for a binary serialization method, what is embedded is, as described above, the 0xBF binary marker followed by the block length and the binary block content. The XML type provider mentioned above is an embedded text encoding, and the Buffer Protocol type provider is an embedded binary encoding.
In summary, OTF-8 is a concrete encoding system built on the object-based, context-dependent encoding method. With its complete built-in type system, it can encode references to data objects in the encoding repository, and it can also encode object content directly, efficiently, and securely (with the encoding metadata, including type information, kept in the encoding repository).
Referring to Figure 25, all OTF-8 code points other than those of UTF-8 are listed there. Under this definition, many codes remain undefined and are available for system extension; they are listed in Figure 26. For example, the two-byte sequence 0xA0 0x00 could be defined as function/method application. With that implemented, OTF-8 could support remote procedure calls (RPC) far more efficiently than existing approaches such as XML-RPC and SOAP.
Similarly, this embodiment can be extended to further Unicode-based schemes such as OTF-16 and OTF-32, which extend UTF-16 and UTF-32 respectively. Compared with OTF-8, the concepts and structure of the encoding repository, the object-based context encoding method, and the type system are identical. The main difference is that the specific definition of the open encoding (chiefly the encoding of numbers and the system codes) varies with the underlying Unicode encoding form; the details are not repeated here.
Further, the method may also include:
normalizing the data content whose corresponding encoded content is a reference code.
In this embodiment, the processing system built on the encoding repository of the present invention can offer, beyond the basic encoding and decoding services, various analysis and processing services for encoded data (byte streams), using the encoding metadata of the repository and its related services. These services fall into two layers. The first consists of encoding analysis services that do not depend on the content behind the codes. These services mainly perform statistical analysis of the codes of a particular user or of a particular kind and store the results for later use, for example by a text retrieval service. We call this the text encoding service layer. Because these services process only the codes themselves and do not need the corresponding content, the security of the user's text content and personal privacy is fully preserved, which is hard to achieve with standardized text. The second layer provides services on top of the text codes and their corresponding data, to make the new data processing system easier for applications to use. We call this the text content service layer. The analysis results of the first layer can be used directly by the second.
For traditional data processing systems, text encoding is used not only for processing text but also widely for expressing and transmitting general data. Techniques for processing general structured text and domain-specific text are abundant, for example the SGML/XML family (and HTML, SVG, MathML, and so on built on it), programming language processing techniques, and domain-specific modeling languages. The new data processing system is built entirely on top of the traditional one. Besides bringing a new concept of personalized word processing, it can introduce repository-based open-encoded text into existing text data processing technology. With only slight modification of the existing technology, a more secure and efficient text data processing technology results. The text processing system in the new data processing system therefore has two aspects: a new word processing system and a new text data processing system. The two can of course be combined, as in processing a handwritten programming language.
Optionally, other services or applications may also be provided, including but not limited to the following: a data content normalization service.
Specifically, data content normalization means merging identical or similar data content in the encoding repository so that it shares a single code. For example, instances of the same character written by the same person at different times may not have identical glyphs, but they can be classified by some feature and normalized to one.
Normalization can be performed automatically according to a rule. For sound, for example, only the instance with the highest sampling rate need be kept, since versions at lower sampling rates can be generated from it. Normalization can also be semi-automatic, with manual intervention: the content normalization service finds identical or similar content items in the encoding repository and presents them to a designated user (for example, the owner of the content items), who decides by his or her own criteria which item to keep.
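The automatic rule for sound (keep only the instance with the highest sampling rate) can be sketched as follows. The `AudioItem` record and its `content_key` field, which stands for the outcome of some sound matching algorithm, are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioItem:
    code: int          # instance code in the repository
    content_key: str   # hypothetical result of a sound matching algorithm
    sample_rate: int   # samples per second


def normalize_by_sample_rate(items):
    """Map every instance code to the code of the best item with the same content."""
    best = {}
    for item in items:
        current = best.get(item.content_key)
        if current is None or item.sample_rate > current.sample_rate:
            best[item.content_key] = item
    return {item.code: best[item.content_key].code for item in items}
```

For instance, two recordings of the same word at 8000 and 44100 samples per second both map to the code of the 44100 recording, and the lower-rate version can be regenerated from it when needed.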
The normalization service can run in real time. In that case, whenever the encoding repository receives input content, the service searches the repository for identical or similar items; if one exists, its code is returned directly and, if necessary (according to certain rules), the existing content item is replaced with the new content. The normalization service can also run offline, not in real time. In that case, after finding content that can be normalized, the service establishes a correspondence between the original instance codes and the normalized codes. Using this correspondence, it converts an input string into a string that uses the normalized codes and returns it.
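Both modes can be sketched briefly. The class name `NormalizingRepository`, the `is_similar` callback (standing in for a real matching algorithm), and the sequential integer codes are assumptions for illustration; the optional replacement of an existing item by new content is omitted.

```python
class NormalizingRepository:
    """Real-time mode: reuse the code of a similar item when one exists."""

    def __init__(self, is_similar):
        self._is_similar = is_similar   # caller-supplied matching function
        self._items = {}                # code -> stored content
        self._next_code = 0

    def put(self, content):
        for code, existing in self._items.items():
            if self._is_similar(existing, content):
                return code             # identical or similar item found
        code = self._next_code
        self._next_code += 1
        self._items[code] = content
        return code


def apply_mapping(codes, mapping):
    """Offline mode: rewrite a code string using original -> normalized codes."""
    return [mapping.get(code, code) for code in codes]
```

In the real-time case the caller never sees a second code for matching content; in the offline case existing documents keep their codes until `apply_mapping` is run over them.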
The normalization service relies on a concrete content matching algorithm: matching handwritten content requires a graphic or image matching algorithm, matching speech content requires a sound matching algorithm, and so on.
Although content normalization is an optional service, an encoding repository that implements it can minimize encoding redundancy and thereby make the fullest use of the existing text infrastructure and related tools.
In addition, other services or applications may be provided, including but not limited to the following:
1. Encoding management service
The content of the encoding repository can be of many types, which gives the system great flexibility and openness: different input and output methods can be mixed; different implementations of the same type of input method can be mixed; different kinds of codes can be used within one input/output scheme; new encoding schemes can be added dynamically; and so on. This calls for some management of the codes.
Encoding management is mainly the access to and maintenance of encoding metadata, including management of encoding spaces, encoding types, encoding specifications, and the like.
Because the new data processing system is personalized and its codes are arbitrary, a mechanism for registering and querying encoding types is needed. Applications can then add encoding types dynamically, and can query and use existing encoding types and their metadata, such as the details of the corresponding encoding specification.
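A registration and query mechanism of this kind can be sketched as a small in-memory registry. The class name, the method names, and the shape of the specification dictionary are assumptions for illustration and do not describe an interface defined in this document.

```python
class EncodingTypeRegistry:
    """Minimal registry of encoding types and their metadata."""

    def __init__(self):
        self._types = {}

    def register(self, name, spec):
        """Add a new encoding type; refuse to overwrite an existing one."""
        if name in self._types:
            raise ValueError(f"encoding type {name!r} is already registered")
        self._types[name] = dict(spec)

    def query(self, name):
        """Return a copy of the metadata for a type, or None if unknown."""
        spec = self._types.get(name)
        return dict(spec) if spec is not None else None

    def names(self):
        """List all registered encoding type names."""
        return sorted(self._types)
```

An application could, for example, register a hypothetical "handwriting.stroke" type at run time and later look up its unit of segmentation before decoding.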
2. Content selection service
Different environments place different requirements on the output of text content. High-precision printing equipment needs high-precision glyph information; low-bandwidth network devices must balance glyph quality against data size; systems with high security requirements want the text content to hide stroke-order information; film dubbing and video chat need audio output of different quality; and so on. All of these require a content selection service.
Content selection is in effect conditional output of content. The output can be a data object taken directly from the encoding repository. Several data objects may exist in the repository for the same code (the normalization service may keep several data objects for one code), and the content selection service must choose the most suitable one. The output data object can also be generated dynamically: a text image can be rendered from the text's graphic data, low-sampling-rate audio can be produced by downsampling high-sampling-rate audio, and so on.
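Choosing among several data objects stored for one code can be sketched as follows. The `GlyphObject` record and the selection policy (highest resolution that fits a size budget, otherwise the smallest object) are assumptions chosen to illustrate the trade-off between glyph quality and data size; dynamic generation of new objects is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlyphObject:
    code: int             # the code these objects all belong to
    resolution_dpi: int   # rendering quality
    size_bytes: int       # cost of transmitting the object


def select_for_output(candidates, max_bytes):
    """Pick the best-quality object within the budget, else the smallest one."""
    fitting = [c for c in candidates if c.size_bytes <= max_bytes]
    if fitting:
        return max(fitting, key=lambda c: c.resolution_dpi)
    return min(candidates, key=lambda c: c.size_bytes)
```

A printing device would call this with a large budget and receive the highest-resolution object, while a low-bandwidth client with a small budget receives a coarser one for the same code.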
3. Content caching service
An encoding repository may be implemented as storage and related services inside an application, as a service shared across a system, or as a service in a public or private cloud.
When the encoding repository is shared over a network, content must be downloaded to the local machine. Because of limits on network reliability and bandwidth, a local cache of the repository is sometimes essential. The local cache can hold some or all of the data objects of the shared repository on the client or on an intermediate node, to support fast and reliable output. Likewise, when access to the repository is unreliable or the system is offline, input can be made directly against the local cache, yielding temporary codes. When the cache is synchronized with the repository, the temporary codes are replaced by official codes and the encoded content is updated accordingly.
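The replacement of temporary codes on synchronization can be sketched as follows. Representing temporary codes as negative integers, and the `repository_put` callback that returns an official code for a piece of content, are assumptions made for illustration.

```python
class LocalCache:
    """Offline input yields temporary codes that are replaced on sync."""

    def __init__(self):
        self._pending = {}     # temporary code -> content awaiting upload
        self._next_temp = -1   # temporary codes are negative in this sketch

    def put_offline(self, content):
        """Store content locally and return a temporary code for it."""
        temp = self._next_temp
        self._next_temp -= 1
        self._pending[temp] = content
        return temp

    def sync(self, repository_put, document):
        """Upload pending content and rewrite the document's codes."""
        mapping = {
            temp: repository_put(content)
            for temp, content in self._pending.items()
        }
        self._pending.clear()
        return [mapping.get(code, code) for code in document]
```

Codes that were already official pass through `sync` unchanged, so a document may freely mix official and temporary codes while the client is offline.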
4. Encoding conversion service
With the new data processing system, a computer system can decompose any input into data objects in the encoding repository plus the encoded content. Using the repository, it can later restore that encoded content to something a human (at least the person who entered it) can understand.
However, because the text codes of this system are non-standard, encoded text cannot be understood by any person or machine without the encoding repository. Encoding conversion mainly provides the service of converting personalized text codes into standard text codes. The result is conventional standard text, which can be used in traditional application environments without the repository.
Specifically, converting handwriting-based object codes to standard text codes amounts to handwriting recognition of the corresponding content, and converting speech-based object codes amounts to speech recognition. The recognition results can also be used to implement the content normalization service.
Once a correspondence from object codes to standard text codes has been established, the system can also, to some extent, convert from standard codes back to object codes.
Furthermore, different object encodings can be converted into one another. The conversion may be between different output forms for the same person, for example speech output of text that was entered by handwriting. It may also be between users, for example converting a secretary's handwritten draft directly into the manager's handwriting. There are two ways to convert between object encodings. One uses standard text codes as an intermediate: one object encoding is converted to standard text codes, which are then converted to the other object encoding. The other establishes a direct mapping between the two encodings.
In addition, some object encodings are themselves built on standard text codes, for example sensitive-word codes used for encryption and common-word codes used for compression. Such encodings exist precisely to be converted to and from standard text codes.
It is worth noting that the relationship between different encodings is not one-to-one. In many languages, for example, homophones (one sound with several meanings) are common, so codes formed from speech input often have a one-to-many relationship with standard text codes.
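Conversion through standard text as an intermediate, including the one-to-many case, can be sketched as follows. The two lookup tables and the code values in the example are hypothetical; in practice they would come from recognition results stored with the encoding repository.

```python
def convert_via_standard(codes, to_standard, from_standard):
    """Convert object codes to another object encoding via standard text.

    to_standard maps a source code to a list of candidate characters;
    from_standard maps a character to a list of target codes.
    The result holds, for each source code, all candidate target codes.
    """
    result = []
    for code in codes:
        candidates = []
        for character in to_standard.get(code, []):
            candidates.extend(from_standard.get(character, []))
        result.append(candidates)
    return result


# Hypothetical tables: speech code 1 sounds like "tā", which may be 他 or 她.
speech_to_text = {1: ["他", "她"], 2: ["好"]}
text_to_handwriting = {"他": [501], "她": [502], "好": [610]}
converted = convert_via_standard([1, 2], speech_to_text, text_to_handwriting)
```

Here `converted` keeps both handwriting candidates for the homophone, leaving the final choice to a later step such as syntactic and semantic analysis.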
5. Access control service
In an environment with security requirements, access to the encoding repository must be protected by a system-level access control system. This access control is optional; in some single-user systems, a separate content access control service is unnecessary.
In a multi-user environment, the access control system verifies the identity of each user and, for that identity, permits or forbids use of the services provided by the encoding repository according to the rules the repository sets. For example, a user with a text input account on the repository can store the data objects he or she enters in the repository, and only that user and the users he or she authorizes may retrieve them.
Codes in the encoding repository are used within associated context models, such as a document model, a user model, or an application model. Access permissions for different codes can therefore be set according to these models, and at different levels: the encoding space level, the meta-code level, or even the instance code level. Unlike traditional resource access control (for files, computers, and so on) and website access control, such code-level permissions allow finer-grained control of access to information.
It should be emphasized that the access control system does not protect the encoded content itself (the set of object codes); it protects only the data objects in the encoding repository that the codes refer to. An authorized user can therefore combine the codes with the repository's data objects to restore the original input, while an unauthorized user who obtains the same encoded content cannot output it correctly and gets only disordered content or "garbled text".
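Instance-level permissions, and the fact that only the data objects are protected while the codes remain freely readable, can be sketched as follows. The class and method names are assumptions for illustration, user identities are taken as already verified, and the sketch models only the instance code level, not the encoding space or meta-code levels.

```python
class AccessControlledRepository:
    """Data objects are protected per instance code; codes are not secret."""

    def __init__(self):
        self._objects = {}   # code -> data object
        self._allowed = {}   # code -> set of users allowed to read it

    def store(self, owner, code, data_object):
        self._objects[code] = data_object
        self._allowed[code] = {owner}

    def grant(self, owner, code, other_user):
        """Let the owner of a code authorize another user to read it."""
        if owner not in self._allowed.get(code, set()):
            raise PermissionError("only a permitted user may grant access")
        self._allowed[code].add(other_user)

    def fetch(self, user, code):
        """Return the data object, or None when access is not permitted."""
        if user in self._allowed.get(code, set()):
            return self._objects[code]
        return None


def render(repository, user, codes):
    """Resolve codes to content; unreadable codes come out as U+FFFD."""
    return "".join(repository.fetch(user, code) or "\ufffd" for code in codes)
```

Two users holding the same code sequence thus see different output: the owner sees the original text, and an unauthorized user sees replacement characters for every code he or she may not resolve.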
6. Text services
On top of the encoding services provided by the encoding repository, an object-encoding-based text system can also include service subsystems that provide higher-level text services.
7. Text search and replace
As with traditional text search, the new data processing system can search over object codes (the text encoding layer), especially over normalized text. In addition, because codes and content correspond one to one in the new system, search can also be content-based. For handwritten text, for example, one can search by part of a character (such as a radical) at the text content layer, perform fuzzy search on content, search by stroke count, and so on.
Moreover, because the new data processing system is open and data of any category can be object-encoded through the encoding repository, the new text search service can also search and replace by the type of object code and by the domain features of that type.
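A content-based search by stroke count can be sketched as follows. The repository is modeled here as a dictionary from code to a list of strokes, each stroke being a list of points; that representation and the function name are assumptions for illustration.

```python
def find_by_stroke_count(document, repository, stroke_count):
    """Return the positions in the document whose content has that many strokes."""
    return [
        position
        for position, code in enumerate(document)
        if len(repository[code]) == stroke_count
    ]


# Hypothetical handwritten data: each stroke is a list of (x, y) points.
strokes_by_code = {
    1: [[(0, 0), (1, 0)]],
    2: [[(0, 0), (1, 0)], [(0, 1), (1, 1)]],
    3: [[(0, 0), (1, 0)], [(0, 1), (1, 1)], [(0, 2), (1, 2)]],
}
matches = find_by_stroke_count([2, 1, 3, 2], strokes_by_code, 2)
```

The search never compares codes with one another; it looks through each code to the stored content, which is what distinguishes the text content layer from the text encoding layer.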
8. Text conversion
The text conversion service converts open codes into standard codes. It builds on the encoding conversion of the encoding repository but, unlike that conversion, it also uses syntactic and semantic analysis to choose the best result among several candidate target codes. It is in effect a more comprehensive, higher-level recognition system.
9. Text matching
Because the new data processing system supports highly personalized text input, an application can define matching rules on that input to map inputs to specific outputs. For example, a web browser can map different handwritten characters or icons to different websites, and a handwriting-based programming system can map particular inputs to the corresponding keywords.
10. Text data service
The security and efficiency of the new data processing system apply equally to structured text technology. Text data technology rebuilt on open encoding can match the performance and efficiency of existing binary data formats: metadata can be stored entirely in the encoding repository, and object codes that never conflict with one another keep code lengths to a minimum. Applications have every reason to describe text content and structured and semi-structured data uniformly with the object encoding system. The text data service converts back and forth between open-encoded strings and application-specific models.
In addition, unlike traditional text input, text input in the new data processing system does not need to produce standard codes: input comes first and codes are generated afterwards. The text input system can therefore accept input in the most natural and efficient way. The input must be divided in a natural and reasonable way into minimal units, such as characters or words of text, or segments of speech. These units are then sent to the encoder or encoding system to obtain the corresponding codes.
The input subsystem thus has at least two functions: receiving input and segmenting it into content units.
It is worth noting that, because personalized codes are private and open, different input methods can be mixed; as long as they use different encoding types or different encoding spaces, their results can be placed in the same text, for example speech-input text inserted into handwritten text.
Input to the new data processing system can be diverse in content (graphics, images, video, sound, and so on) and can also be multidimensional, as when the writer reads the content aloud while writing it by hand. The content selection service of the encoding repository can choose an appropriate form in which to output multidimensional content. Multidimensional content also carries more information, which helps the system segment and recognize content.
As for output, the output subsystem restores text codes to the information originally entered. Unlike a traditional output system, the new system's output depends entirely on the open encoding repository. The form and content of the output are determined by the form and content of the input; content that was never entered cannot be output.
As for editing, input usually needs to be revised and adjusted as it is entered. Like a traditional editing system, an editing system based on personalized object encoding provides the basic insert, delete, and modify functions. In addition, it can provide functions for revising input content and for managing the segmentation of content units.
It should be noted that the new data processing system does not, and cannot, replace existing data processing systems. On the contrary, with appropriate design the infrastructure and tools of existing systems can be used to the full and the two kinds of system combined organically. This use and combination covers at least the following aspects:
First aspect: standard control characters
Some existing text processing systems and tools are general-purpose data tools that apply no special handling to any particular encoding, such as tools for compression, encryption, and storage. These can be used directly in the new data processing system.
Other text processing systems and tools, however, handle certain characters specially. The most common are control characters such as line breaks, spaces, and tabs. For example, a text line counter counts the newline characters in the text; text version control systems, text comparison and merge tools, and indexing systems based on English words likewise operate line by line; word counting and word segmentation for English-language retrieval split words on standard control characters and punctuation marks.
Therefore, as long as the new text input system provides a way to enter these standard control characters and punctuation marks, many more conventional text processing systems and tools can be used in the new data processing system.
Second aspect: hybrid coding
In addition, if the text coding of the new data processing system is designed for compatibility with conventional standard text encodings, conventional text and new text can easily be mixed. Existing text can be used directly and effectively, and existing and new text input and editing systems can be used together. A simple hybrid coding scheme extends an existing standard text encoding directly and distinguishes object codes from standard codes in some way. Object-coded characters, and even other speech or multimedia streams, can then appear in the text alongside standard characters.
Hybrid coding makes it possible to improve existing text data technologies. In conventional text data technologies, data characters and format characters both come from the standard text encoding, so format characters cannot be used directly in the data; they must be escaped, which is inconvenient and inefficient. For example, in CSV tabular text data, the comma serves as the delimiter that separates fields. If a field contains a comma, the field must be enclosed in quotation marks to protect it; if the field contains quotation marks, those must be given special treatment as well. Hybrid coding solves this problem well: because object codes can be distinguished from standard text codes, they can be used as the format characters. Standard characters can then be used freely in the text data without restriction, and the parser can process the data directly without handling any character escapes. Furthermore, the schema of the data and the details of the format data can be placed in the encoding repository, which greatly reduces data redundancy and improves the efficiency of transmission and processing.
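The contrast with CSV escaping can be illustrated with a minimal Python sketch. It is not the patented implementation: the separator `FIELD_SEP` is a hypothetical stand-in for an object-coded format character, chosen here from the Unicode Private Use Area so that it cannot collide with ordinary data characters.

```python
import csv
import io

# Hypothetical stand-in for an object-coded format character: a code point
# from the Unicode Private Use Area that does not occur in ordinary text.
FIELD_SEP = "\ue000"


def join_fields(fields):
    """Serialize one record; data characters need no quoting or escaping."""
    return FIELD_SEP.join(fields)


def split_fields(record):
    """Parse one record by splitting on the out-of-band separator."""
    return record.split(FIELD_SEP)


row = ["Smith, John", 'said "hello"', "42"]

# Conventional CSV: the comma and the quotation marks in the data force
# quoting and quote doubling.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue().strip()

# Hybrid coding: the separator lies outside the data alphabet, so the
# fields are stored and recovered verbatim.
hybrid_line = join_fields(row)
restored = split_fields(hybrid_line)
```

In this sketch the CSV line has to quote two of the three fields, whereas the hybrid line is split with a single `str.split` call and no escape handling.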
Third aspect: keyword mapping
A direct benefit of hybrid coding is that the new data processing system can be applied to conventional structured text and text with a grammar. Keywords and special symbols continue to use the original standard text encoding, while identifiers and data content use object codes. This makes handwritten programming or voice programming possible.
In such a hybrid coding system, the new text input system can be used to enter all of the text. It is only necessary to define, for the keywords and special symbols of the system, the corresponding object-coded text content. Other characters can also be encoded as standard characters by means of escaping. During text input or during processing of the text data, the system automatically converts content to the corresponding standard text codes according to the result of content matching and hands it to conventional text processing tools; the results they return are mapped back to object codes and presented to the user in visual form. A typical example is a handwritten programming system: only the mapping between object codes and standard codes has to be provided at the front end, while the back end can use a conventional toolchain of compiler, linker, debugger, and so on to achieve the intended effect.
Likewise, standard codes can be mapped to object codes. A conventional text input system can then be used to enter a preset standard text code sequence, which the system automatically matches to the corresponding object code. This is important for editing and modifying object codes. For example, an XML editor that supports object coding lets the XML document be edited and modified in the conventional way and stores it as object codes when the document is serialized.
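The two-way mapping between object codes and standard codes described in this aspect might be sketched as follows. The object code values, the mapping table, and the `__<hex>` escape form are all assumptions made for illustration; the source does not specify them at this point.

```python
# Hypothetical object codes (integers) for handwritten glyphs, mapped to the
# standard-text keywords or symbols they stand for.
OBJECT_TO_STANDARD = {0x101: "if", 0x102: "then", 0x103: "end", 0x104: "=="}
STANDARD_TO_OBJECT = {text: code for code, text in OBJECT_TO_STANDARD.items()}


def to_standard(object_codes):
    """Front end to back end: replace mapped object codes by standard text.

    Unmapped object codes (identifiers, data) are escaped as '__<hex>' so
    that a conventional tool still receives an ordinary token.
    """
    return [OBJECT_TO_STANDARD.get(code, f"__{code:x}") for code in object_codes]


def to_object(standard_tokens):
    """Back end to front end: map standard tokens back to object codes."""
    codes = []
    for token in standard_tokens:
        if token in STANDARD_TO_OBJECT:
            codes.append(STANDARD_TO_OBJECT[token])
        elif token.startswith("__"):
            codes.append(int(token[2:], 16))
        else:
            raise ValueError(f"no object code for token {token!r}")
    return codes


handwritten = [0x101, 0x2A, 0x104, 0x2B, 0x102]
standard = to_standard(handwritten)
round_trip = to_object(standard)
```

The list `standard` is what a conventional compiler or editor would see; `round_trip` shows that the results can be mapped back to the original object codes for display.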
FIG. 27 is a flowchart of Embodiment 4 of an encoding processing method provided by the present invention. On the basis of the embodiment shown in FIG. 5C, as shown in FIG. 27, the method further includes:
Step 401C: When there are multiple object codes of the same type that belong to the same owner, map these object codes, or the meta codes within them, to a specified system code.
The system code includes one of the following: a default meta code setting code, a root space code, and a client code setting code.
In this embodiment, a system code is a code that can change the encoding and decoding behavior of the system. The corresponding data object relates directly to the components of the system codec. System codes are generally built into the codec system, although a certain extension mechanism is also allowed. The termination code, default meta code setting code, root space code, and client code setting code mentioned later are all system codes.
For example, continuing the example above, if a large number of data objects of the same type belong to the same owner, each of their object codes consists of three code points (user code + type code + instance code), of which the first two are always identical. This is a form of redundancy.
A system code can be introduced to reduce this redundancy to some extent, for example the client code setting code. A client code is a reference code, created for some purpose, to a data object that has already been decoded. The code corresponds directly to the data object, so no additional decoding is required. In general, a client code is shorter than the original code of the data object it refers to. The encoding repository takes no part in encoding or decoding a client code. In form, a client code is directly distinguishable from ordinary codes. A client code may correspond to a data object or to a coding meta-object.
The client code setting code is a system code that sets a client code. Its general form is:
client code setting code + client code + object code / meta code
This maps the specified object code or meta code to the specified client code. Every subsequent occurrence of the client code then stands for the corresponding object code or meta code.
In this example, the client code setting code defines the two-code-point meta code as a code that is one word long. This one-word meta code can then be used in place of the former two-code-point meta code. The correspondingly updated coding meta-model is shown in FIG. 28.
According to this coding meta-model, the system gains two new code combinations, as shown in FIG. 29, where the target meta code corresponds to the replacement type code.
In this way, the stored codes in the above case can be reduced by one third.
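The saving of roughly one third can be checked with a small Python sketch of the client code mechanism. The numeric values `SET_CLIENT_CODE` and `CLIENT_BASE` are hypothetical; the source only states that client codes are distinguishable in form from ordinary codes, which the sketch models with a reserved range.

```python
SET_CLIENT_CODE = 0xF000  # hypothetical value of the client code setting code
CLIENT_BASE = 0xF100      # hypothetical range reserved for client codes


def compress(objects):
    """Encode (user, type, instance) triples as a flat list of code points.

    Each distinct (user, type) meta code is declared once in the form
    'setting code + client code + meta code' and afterwards replaced by
    its one-code-point client code.
    """
    aliases = {}
    stream = []
    for user, type_code, instance in objects:
        meta = (user, type_code)
        if meta not in aliases:
            aliases[meta] = CLIENT_BASE + len(aliases)
            stream += [SET_CLIENT_CODE, aliases[meta], user, type_code]
        stream += [aliases[meta], instance]
    return stream


def expand(stream):
    """Inverse of compress(); needs no repository lookup for client codes."""
    aliases = {}
    objects = []
    i = 0
    while i < len(stream):
        if stream[i] == SET_CLIENT_CODE:
            aliases[stream[i + 1]] = (stream[i + 2], stream[i + 3])
            i += 4
        else:
            user, type_code = aliases[stream[i]]
            objects.append((user, type_code, stream[i + 1]))
            i += 2
    return objects


objects = [(7, 3, n) for n in range(100)]  # same owner, same type
packed = compress(objects)
plain_length = 3 * len(objects)            # three code points per object
```

For 100 objects the plain form needs 300 code points, while the packed form needs 204 (four for the one-time declaration plus two per object), which approaches the one-third reduction as the number of objects grows.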
Where necessary, system codes with different functions can also be designed for different object coding systems.
Further, the method may also include:
encrypting the object code;
or
compressing or encrypting the data object to be encoded.
FIG. 30 is a flowchart of Embodiment 5 of an encoding processing method provided by the present invention. On the basis of the embodiment shown in FIG. 5C, as shown in FIG. 30, if the data object to be encoded is handwritten text, the method further includes:
Step 501C: Receive a code conversion request and, according to the request, query the mapping table in the encoding repository and use glyph matching to obtain the standard language parameters corresponding to the handwritten text.
Step 502C: According to the standard language parameters and the object codes corresponding to the handwritten text, perform code conversion on the object codes to obtain the standard text corresponding to the handwritten text.
The standard language parameters include one or a combination of the following: numbers, symbols, keywords, public identifiers, and private identifiers.
In this embodiment, for example, FIG. 31 shows a handwritten program whose programming language is Lua, an embedded scripting language. The corresponding glyph library codes are as follows:
Figure PCTCN2015086672-appb-000044
The handwritten program shown in FIG. 31 contains three kinds of codes: glyph codes, word spacing codes, and line break codes. A glyph code is written as W + (the specific glyph code), and a word spacing code as S + (the spacing value). For convenience, line breaks are not embedded in the content as codes but are shown directly as new lines. The codes corresponding to the handwritten program above can therefore be written as follows:
Figure PCTCN2015086672-appb-000045
Figure PCTCN2015086672-appb-000046
To convert this code, the user prepares the following mapping table from glyphs to numbers and symbols:
Figure PCTCN2015086672-appb-000047
The mapping table from glyphs to keywords is as follows:
Figure PCTCN2015086672-appb-000048
Figure PCTCN2015086672-appb-000049
The mapping table from glyphs to interface identifiers is as follows:
Figure PCTCN2015086672-appb-000050
Here, the syntax spacing threshold set by the system is 20. The rule for automatically generating a private identifier is two underscores (_) followed by the sequence of glyph codes joined by underscores.
Finally, following the process described above, the following program code in standard encoding is obtained:
Figure PCTCN2015086672-appb-000051
As can be seen, four private identifiers were generated:
Figure PCTCN2015086672-appb-000052
Figure PCTCN2015086672-appb-000053
The first identifier is actually comment content and carries no meaning. With an optimized conversion process, its conversion can be skipped as soon as it is recognized as a comment.
The generated program can be interpreted and executed normally by a conventional Lua interpreter, and its execution semantics are identical to those of the handwritten source code.
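The conversion described above (splitting on the syntax spacing threshold, looking up mapping tables, and generating private identifiers) might be sketched in Python as follows. The glyph codes and the contents of `GLYPH_MAP` are hypothetical, since the actual tables appear only in the figures; treating a gap of at least 20 as a token boundary is likewise an assumption, because the source gives only the threshold value.

```python
SYNTAX_GAP_THRESHOLD = 20  # threshold stated in the text

# Hypothetical mapping table: sequences of glyph codes -> standard text.
GLYPH_MAP = {("12",): "local", ("15",): "=", ("7",): "1"}


def split_tokens(line):
    """Group glyphs into tokens.

    `line` is a list of items such as 'W12' (glyph code 12) or 'S25'
    (spacing 25). Assumption: a gap of at least the threshold ends a token.
    """
    tokens, current = [], []
    for item in line:
        kind, value = item[0], item[1:]
        if kind == "W":
            current.append(value)
        elif kind == "S" and int(value) >= SYNTAX_GAP_THRESHOLD and current:
            tokens.append(tuple(current))
            current = []
    if current:
        tokens.append(tuple(current))
    return tokens


def private_identifier(glyphs):
    """Two underscores followed by the glyph codes joined by underscores."""
    return "__" + "_".join(glyphs)


def convert_line(line):
    """Map known glyph sequences to standard text, others to private identifiers."""
    words = [GLYPH_MAP.get(tok) or private_identifier(tok) for tok in split_tokens(line)]
    return " ".join(words)


line = ["W12", "S30", "W40", "S5", "W41", "S28", "W15", "S22", "W7"]
lua_source = convert_line(line)
```

With these hypothetical tables, the glyphs 40 and 41 are only 5 units apart and therefore form one token, which has no table entry and becomes the private identifier `__40_41`.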
FIG. 32 is a flowchart of Embodiment 1 of a decoding processing method provided by the present invention. As shown in FIG. 32, the method includes:
Step 601C: Receive a decoding request and, according to the request, obtain the object code to be decoded.
Step 602C: Disassemble the object code to obtain the meta code, or the meta code and the instance code.
Step 603C: Query the encoding repository and obtain the corresponding metadata and coding convention according to the meta code.
Step 604C: Obtain the data object corresponding to the object code according to the metadata and the coding convention, or according to the metadata, the coding convention, and the instance code.
In this embodiment, the object code contains, or implies, the meta code of the relevant coding meta-object. The encoding repository uses this meta code to obtain the corresponding coding metadata and returns, or creates, the coding meta-object for it. If authorization information or other control information was set for access to the object code during or after encoding, these access control rights must be verified before decoding.
In addition, once the object code has been obtained, it is disassembled to obtain the meta code and/or instance code it contains. The corresponding coding metadata and/or coding convention is then obtained from the meta code, and the original data object is restored from the coding metadata and/or coding convention together with the instance code.
The data object is decoded according to the content of the coding convention. Decoding may involve direct content decoding, reference decoding through the encoding repository, or both.
The system is open: any existing content codec technology can be used by a coding meta-object, provided the coding convention describes it, and can also be used for the transmission and storage of the encoding repository.
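Steps 602C to 604C can be outlined with the following Python sketch. The `Repository` class, the `CodingSpec` fields, and the rule that the first code point is the meta code are assumptions made for illustration only; the source leaves the concrete data structures open.

```python
from dataclasses import dataclass


@dataclass
class CodingSpec:
    """Hypothetical coding convention for one meta code."""
    instance_length: int  # code points that follow the meta code
    by_reference: bool    # True: the instance code refers to the repository


class Repository:
    """Hypothetical in-memory encoding repository."""

    def __init__(self):
        self.meta = {}   # meta code -> (metadata, CodingSpec)
        self.items = {}  # (meta code, instance code) -> stored content

    def lookup_meta(self, meta_code):
        return self.meta[meta_code]


def decode_object(object_code, repo):
    """Decode one object code given as a list of code points."""
    # Step 602C: disassemble (assumed rule: the first code point is the meta code).
    meta_code, instance = object_code[0], object_code[1:]
    # Step 603C: obtain metadata and coding convention from the repository.
    metadata, spec = repo.lookup_meta(meta_code)
    if len(instance) != spec.instance_length:
        raise ValueError("instance code length does not match the convention")
    # Step 604C: reference decoding or direct content decoding.
    if spec.instance_length == 0:
        content = None
    elif spec.by_reference:
        content = repo.items[(meta_code, instance[0])]
    else:
        content = instance[0]
    return {"type": metadata["name"], "content": content}


repo = Repository()
repo.meta[0x01] = ({"name": "com.sample.handwriting.word"}, CodingSpec(1, True))
repo.meta[0x02] = ({"name": "space"}, CodingSpec(1, False))
repo.items[(0x01, 0x41)] = "<glyph 0x41>"

word = decode_object([0x01, 0x41], repo)
gap = decode_object([0x02, 0x0C], repo)
```

The word is resolved by reference through the repository, while the gap value is taken directly from the code, which corresponds to the two decoding paths named in the text.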
FIG. 33 is a flowchart of Embodiment 2 of a decoding processing method provided by the present invention. On the basis of FIG. 32, as shown in FIG. 33, one specific implementation of step 602C is:
Step 701C: Obtain the predetermined rule corresponding to the object code.
Step 702C: Disassemble the object code according to the predetermined rule to obtain the meta code, or the meta code and the instance code.
Further, the method also includes:
performing access authorization on the predetermined rule;
in which case step 702C is specifically implemented as follows:
if access authorization for the predetermined rule succeeds, disassemble the object code according to the predetermined rule to obtain the meta code, or the meta code and the instance code.
FIG. 34 is a flowchart of Embodiment 3 of a decoding processing method provided by the present invention. On the basis of FIG. 32, as shown in FIG. 34, the method further includes:
Step 801C: Perform access authorization on the meta code.
One specific implementation of step 603C is then:
Step 802C: If access authorization for the predetermined rule succeeds, disassemble the object code according to the predetermined rule to obtain the meta code, or the meta code and the instance code.
FIG. 35 is a flowchart of Embodiment 4 of a decoding processing method provided by the present invention. On the basis of FIG. 32, as shown in FIG. 35, one specific implementation of step 604C is:
Step 901C: Obtain a context object.
Step 902C: Obtain the corresponding coding space according to the context object and the coding convention.
Step 903C: Decode the instance code within the coding space to obtain the corresponding data content.
Step 904C: Obtain the data object corresponding to the object code according to the metadata and the data content.
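Steps 901C to 904C can be illustrated with a short Python sketch in which the context object selects a per-owner coding space. The dictionary layout of the context, the convention, and the spaces is hypothetical and serves only to show why the same instance code can decode to different content in different contexts.

```python
def resolve_space(context, convention, spaces):
    """Step 902C: select the coding space for this context and convention."""
    owner = context["owner"] if convention["scope"] == "owner" else None
    return spaces[(owner, convention["space"])]


def decode_instance(context, metadata, convention, instance_code, spaces):
    """Steps 902C to 904C for one instance code."""
    space = resolve_space(context, convention, spaces)     # step 902C
    content = space[instance_code]                         # step 903C
    return {"type": metadata["name"], "content": content}  # step 904C


# Hypothetical coding spaces, one per owner.
spaces = {
    ("alice", "handwriting"): {0x41: "glyph drawn by alice"},
    ("bob", "handwriting"): {0x41: "glyph drawn by bob"},
}
metadata = {"name": "com.sample.handwriting.word"}
convention = {"scope": "owner", "space": "handwriting"}

# Step 901C: the context object is supplied by the caller.
from_alice = decode_instance({"owner": "alice"}, metadata, convention, 0x41, spaces)
from_bob = decode_instance({"owner": "bob"}, metadata, convention, 0x41, spaces)
```

Both calls decode the instance code 0x41, but each returns the glyph stored in the coding space of the respective owner.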
Based on the embodiments described above, the handwriting input system of the present invention is taken as an example below to illustrate a specific application of a handwriting input system based on the encoding processing.
For example, consider handwriting input with segmentation based on lines and spacing, where the user has entered the content shown in FIG. 36 on the current line. Using the spacing-based segmentation algorithm, the input system forms four characters and stores them in the encoding repository (assuming that the repository already holds 64 characters, 0x1 to 0x40):
Figure PCTCN2015086672-appb-000054
Here 0x41, 0x42, 0x43, and 0x44 are in hexadecimal notation and denote decimal 65, 66, 67, and 68. An object code can be the position of the data object in the encoding repository itself, or a hash of that position. The content of each coded item is graphic data, in a common format such as SVG or in a private format.
Correspondingly, the input system generates the following text data:
0x41 0x20 0x42 0x20 0x43 0x20 0x44
Here 0x20 is the space character of standard ASCII (assuming the system uses standard spaces to separate characters). In a conventional text viewing environment, this text appears as:
A B C D
This is because 0x41, 0x42, 0x43, and 0x44 correspond to the ASCII characters A, B, C, and D. When conventional text is output, these codes are used to fetch the character outlines from a font based on the standard encoding.
In the new data processing system, text output fetches the corresponding graphics from the encoding repository and draws them in order on the output display. The rendered result is as shown in FIG. 36.
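The difference between the two views of the same bytes can be shown with a few lines of Python. The `repository` dictionary and its placeholder strings are hypothetical; in the described system each entry would hold the graphic data of a handwritten glyph.

```python
data = bytes([0x41, 0x20, 0x42, 0x20, 0x43, 0x20, 0x44])

# Conventional view: the bytes are interpreted as ASCII characters.
ascii_view = data.decode("ascii")

# Hypothetical encoding repository: object code -> stored glyph graphic.
repository = {
    0x41: "<glyph 1>",
    0x42: "<glyph 2>",
    0x43: "<glyph 3>",
    0x44: "<glyph 4>",
}


def render(stream, repo):
    """Fetch each glyph from the repository; 0x20 stays a standard space."""
    return [" " if code == 0x20 else repo[code] for code in stream]


object_view = render(data, repository)
```

`ascii_view` is the string a conventional viewer shows, while `object_view` lists, in drawing order, the graphics that the new output subsystem would fetch.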
In addition, regarding the coding of types: as mentioned earlier, multiple types of codes coexist in the new data processing system. Characters or tokens of different types could be coded uniformly. The problem with uniform coding, however, is that during decoding the system must fetch the code type information for every code from the encoding repository in order to decode and output it correctly, which severely affects system performance.
Another approach is to code the types themselves and store the code type information in the encoding repository. A text code based on object coding then has two parts: the code of the code type (meta code) and the specific code within that type (instance code). This may increase the size of the coded result, but it greatly improves the flexibility and openness of encoding and decoding.
Building on the previous example, type coding information (coding meta-information) has to be added to the encoding repository:
Figure PCTCN2015086672-appb-000055
At the same time, all coded items must be placed at different locations in the encoding repository according to their type codes. In a database-based implementation, for example, codes of different code types can be placed in different tables, and the object factory can find the corresponding table from the type code (meta code) according to a system convention (for example, using the type ID as the name of the table for that type).
In this example, the content of the table "com.sample.handwriting.word" is as follows.
Figure PCTCN2015086672-appb-000056
Figure PCTCN2015086672-appb-000057
Correspondingly, the text data generated by the input system becomes the following code:
0x01 0x41 0x02 0x01 0x42 0x02 0x01 0x43 0x02 0x01 0x44
Here 0x02 corresponds to the space. It is a control character that needs no specific text content, and there is no corresponding table in the encoding repository.
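A table-driven parser for this type-coded stream might look like the following Python sketch. The contents of `TYPE_TABLE` are reconstructed from the surrounding text (type 0x01 with a one-byte instance code, type 0x02 for the space with no payload) and should be read as an assumption, since the actual type table is given only in the figure.

```python
# Assumed type table: type code -> (type name, number of payload bytes).
TYPE_TABLE = {
    0x01: ("com.sample.handwriting.word", 1),
    0x02: ("space", 0),
}


def parse(stream, type_table):
    """Split a type-coded byte stream into (type name, payload) units."""
    units = []
    i = 0
    while i < len(stream):
        name, length = type_table[stream[i]]
        payload = stream[i + 1 : i + 1 + length]
        if len(payload) != length:
            raise ValueError("truncated stream")
        units.append((name, tuple(payload)))
        i += 1 + length
    return units


stream = bytes([0x01, 0x41, 0x02, 0x01, 0x42, 0x02, 0x01, 0x43, 0x02, 0x01, 0x44])
units = parse(stream, TYPE_TABLE)
```

Because the payload length comes from the type table, the parser can walk the stream without consulting the repository for each code; only the word payloads would later be resolved in the table named after their type.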
Code types can be coded dynamically, which makes the new data processing system efficient, secure, and open. Multiple input methods and coding schemes can be mixed in the same application system. Unauthorized systems or individuals cannot obtain any information from the coded result. New input methods, code types, and applications can be added to the new data processing system dynamically.
In addition, regarding the coding of data: for a system that can encode arbitrary data objects, coding the text content alone is often not enough; other related information must be coded as well, that is, data must be coded. Unlike the coding of object data, such data content need not be stored in the text encoding repository and can instead be coded directly into the object code, which is the content coding mentioned earlier.
A typical example is the spacing of text. In the conventional ASCII encoding, the space is a control character, and in the rendered text the width of one space is fixed. The distance between characters separated by spaces is determined by the number of spaces between them, so it can only be an integer multiple of the space width. In naturally written text, however, the spacing between characters or words is arbitrary (within the bounds of the paper, of course). A close look at the previous example shows that the handwritten input and the corresponding output do not match, mainly because the spacing between characters differs: the coded result in the example uses the same code for every gap between characters. To achieve a what-you-see-is-what-you-get result, the length of the character spacing can also be coded into the object code of the text. The length information could be placed in the encoding repository and the position of that content item then coded into the text, but it is clearly far more direct and efficient to encode the spacing in binary and place it directly in the text. FIG. 37 visualizes the lengths of the character spacing. As shown in FIG. 37, lengths are expressed in logical units, which accommodate different devices and different font sizes. The code type information is updated as follows:
Figure PCTCN2015086672-appb-000058
Figure PCTCN2015086672-appb-000059
Here the code length of the space changes from 0 to 1, meaning that the space code is followed by a one-byte length code. An empty coded data type indicates that decoding this length code does not require access to the encoding repository. The encoder can convert the spacing between characters directly into a byte and store it in the coded result. The corresponding text code is as follows:
0x01 0x41 0x02 0x0C 0x01 0x42 0x02 0x10 0x01 0x43 0x02 0x0A 0x01 0x44
The text output subsystem can then restore the original input exactly from this code.
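The spacing content code can be sketched in Python as an encoder and decoder pair. The constants `WORD` and `SPACE` follow the type codes used in the example; the function names and the restriction of each gap to one byte (0 to 255 logical units) are assumptions of this sketch.

```python
WORD, SPACE = 0x01, 0x02  # type codes used in the example


def encode(glyph_codes, gaps):
    """Interleave word codes with space codes that carry a one-byte length."""
    if len(gaps) != len(glyph_codes) - 1:
        raise ValueError("need exactly one gap between neighbouring glyphs")
    out = bytearray()
    for index, glyph in enumerate(glyph_codes):
        if index:
            out += bytes([SPACE, gaps[index - 1]])  # content code: the length itself
        out += bytes([WORD, glyph])                 # reference code: repository entry
    return bytes(out)


def decode(stream):
    """Recover glyph codes and gap lengths; every unit is two bytes long."""
    if len(stream) % 2:
        raise ValueError("truncated stream")
    glyphs, gaps = [], []
    for i in range(0, len(stream), 2):
        kind, value = stream[i], stream[i + 1]
        if kind == WORD:
            glyphs.append(value)
        elif kind == SPACE:
            gaps.append(value)
        else:
            raise ValueError(f"unknown type code {kind:#04x}")
    return glyphs, gaps


encoded = encode([0x41, 0x42, 0x43, 0x44], [0x0C, 0x10, 0x0A])
glyphs, gaps = decode(encoded)
```

The gap lengths are read straight from the stream, without a repository lookup, while the glyph codes remain references that the output subsystem resolves when drawing.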
It is worth mentioning that the spacing in this example is the distance between handwritten characters. Other input methods have other kinds of spacing, such as the time interval between sound units in voice input. Different code types can be provided to support the different kinds of spacing codes.
This example shows the use of coding data directly into the object code. Here the data are integers. In computer systems, the binary representation and encoding of all kinds of data is the foundation of data storage and processing, and these techniques are very mature; the IEEE 754 standard, for example, defines the binary encoding of floating-point numbers. All of these techniques can be used to encode arbitrary data directly into the object coding result.
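As an illustration of reusing an existing binary encoding as a content code, the following Python sketch packs a floating-point value in IEEE 754 single precision with the standard `struct` module. The choice of big-endian byte order and the function names are assumptions of the sketch, not requirements stated in the source.

```python
import struct


def encode_float(value):
    """Encode a number as 4 bytes of IEEE 754 single precision, big-endian."""
    return struct.pack(">f", value)


def decode_float(payload):
    """Decode 4 bytes of IEEE 754 single precision, big-endian."""
    return struct.unpack(">f", payload)[0]


payload = encode_float(1.5)
value = decode_float(payload)
```

The four payload bytes could follow a type code in the object code in the same way that the one-byte spacing length follows the space code.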
因此,在新数据处理系统的编码方案中,我们的数据对象的数据内容不光可以存储到编码仓库中,还可以以某种方式直接放置到对象编码中。因此,新数据处理系统的文字编码实际上可能是引用编码和内容编码的混合体。我们可以通过编码类型来区分它们。更进一步,还可以通过编码类型的类型安全检查来判断编码是否符合类型约束,以及通过类型推导来确定编码 的具体类型。Therefore, in the coding scheme of the new data processing system, the data content of our data object can be stored not only in the encoding warehouse, but also in some way directly into the object encoding. Therefore, the text encoding of the new data processing system may actually be a mixture of reference encoding and content encoding. We can distinguish them by coding type. Furthermore, it is also possible to determine whether the encoding conforms to the type constraint by type type security check of the encoding type, and to determine the encoding by type derivation. The specific type.
另外,针对混合编码,新的数据处理系统允许我们用新的编码从头到尾创建基于对象编码的文字内容。但是在很多情况下,人们希望能够直接利用现有文字资源,在已有的基于标准编码的文字之上直接进行修改。有时候,也希望能够混用键盘和新的输入方法对文字进行修改和编辑。这就要求新的文字编码方案能够同现有标准编码兼容,这样,两种系统的文字能够混合出现在同一个文档中。In addition, for hybrid coding, the new data processing system allows us to create object-based encoded text content from beginning to end with new encoding. But in many cases, people want to be able to directly use existing text resources to make changes directly on existing standard code-based text. Sometimes, I also want to be able to modify and edit the text by mixing keyboards and new input methods. This requires that the new text encoding scheme be compatible with existing standard encodings so that the text of the two systems can be mixed in the same document.
混合编码的实现可以有很多种方案。一种简单直接的方案就是将每个标准编码序列作为对象数据内容放入编码仓库中,为这些内容定义新的对象编码。另一种方案就是在文本内容中的每个标准文字编码之前放置一个类型编码,由这个类型编码告诉解码器之后编码是标准文字编码。这两种方案有一个主要的问题,就是现有的标准编码文字内容都需要转换才能成为目标编码,而且编码结果同原有标准编码完全不兼容。很难使用现有的文字基础设施和工具去处理和分析。There are many ways to implement mixed coding. A simple and straightforward solution is to put each standard code sequence into the code repository as object data content, defining a new object code for the content. Another solution is to place a type code before each standard text encoding in the text content. This type of code tells the decoder that the code is a standard text code. One of the main problems with these two schemes is that the existing standard encoded text content needs to be converted to become the target encoding, and the encoding result is completely incompatible with the original standard encoding. It is difficult to use existing text infrastructure and tools to process and analyze.
一个更好的方案就是将新的文字编码直接建立在已有的标准编码基础之上。这里给出一个具体的基于UTF-16的文字编码方案:A better solution is to base the new text encoding directly on the existing standard coding. Here is a specific UTF-16 based text encoding scheme:
1.所有UTF-16标准码均采用原有编码标准进行编码,如BOM和Surrogate Pair等。1. All UTF-16 standard codes are encoded using the original coding standards, such as BOM and Surrogate Pair.
2.所有对象编码的元编码均采用UTF-16的私有扩展编码(从U+E000到U+F8FF)2. The meta-encoding of all object encodings is based on UTF-16's private extension encoding (from U+E000 to U+F8FF)
3.类型编码之后的实例编码字长(这里一个字为2字节)以编码仓库中信息为准3. The example code word length after type encoding (here a word is 2 bytes) is subject to the information in the code repository.
4.类型编码之后的文字实例编码字高位均为1(即从0x8000到0xFFFF),以免出现同其他控制符相冲突的情况。4. After the type encoding, the code example code word height is 1 (ie from 0x8000 to 0xFFFF), so as to avoid conflicts with other control characters.
针对这个编码方案,解码过程如图38所示。For this encoding scheme, the decoding process is shown in FIG.
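The four rules can be sketched as a small tokenizer in Python. This is an illustration under stated assumptions, not the flow of Figure 38: the `instance_words` mapping stands in for the type metadata held in the encoding repository, the type code 0xE001 and its one-word instance length are hypothetical, and the function name is invented for this sketch.

```python
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # private use area reserved for type codes (rule 2)


def tokenize(units, instance_words):
    """Split 16-bit code units into ("text", str) and ("object", type, words) tokens.

    instance_words maps a type code to the number of instance words that follow it,
    standing in for the metadata kept in the encoding repository (rule 3).
    """
    tokens, text, i = [], [], 0

    def flush():
        # Standard UTF-16 runs are decoded unchanged, surrogate pairs included (rule 1).
        if text:
            raw = b"".join(u.to_bytes(2, "big") for u in text)
            tokens.append(("text", raw.decode("utf-16-be")))
            text.clear()

    while i < len(units):
        unit = units[i]
        if PUA_FIRST <= unit <= PUA_LAST:
            flush()
            n = instance_words[unit]
            words = units[i + 1:i + 1 + n]
            # Rule 4: every instance word must have its high bit set.
            if len(words) != n or any(w < 0x8000 for w in words):
                raise ValueError("malformed instance code")
            tokens.append(("object", unit, tuple(words)))
            i += 1 + n
        else:
            text.append(unit)
            i += 1
    flush()
    return tokens


units = [0x0049, 0x0020, 0x0061, 0x006D, 0x0020, 0xE001, 0x8000, 0x002E]
tokens = tokenize(units, {0xE001: 1})
```

The object tokens are what the decoder would resolve against the encoding repository; the text tokens need no lookup at all.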
A concrete example follows. Figure 39 shows the display of mixed-encoded content.
The corresponding text encoding uses five standard Unicode characters: U+0049 (I), U+0020 (space), U+0061 (a), U+006D (m), and U+002E (.). All other codes are non-standard. The corresponding encoding information is as follows:
Figure PCTCN2015086672-appb-000060
The content of type "com.sample.handwriting.word" in the encoding repository is:
Figure PCTCN2015086672-appb-000061
The content of type "com.sample.photo" is:
Figure PCTCN2015086672-appb-000062
The encoding corresponding to the text content in the example is:
U+0049 U+0020 U+0061 U+006D U+0020 U+E001 0x8000 U+002E U+0020 U+E000 0x8041 U+0020 U+E000 0x8042 U+0020 U+E000 0x8043 U+0020 U+E000 0x8044
In a conventional UTF-16 data processing system this encoding is displayed as:
I am 耀. 聁 聂 聃 聄
Because the two type codes U+E000 and U+E001 are private-use characters that standard UTF-16 fonts do not support, their output varies by implementation. Here they are output as blanks (the blank before each of the five Chinese characters above); some systems show a box or a black block instead.
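A short Python check illustrates why a conventional system shows a blank followed by a Chinese character. It assumes the type code is the private-use value U+E001 followed by the instance word 0x8000, and it simply reads each 16-bit unit as a character.

```python
import unicodedata

# First part of the example, read naively as UTF-16 code units.
units = [0x0049, 0x0020, 0x0061, 0x006D, 0x0020, 0xE001, 0x8000, 0x002E]
legacy = "".join(chr(u) for u in units)

# "Co" is the private-use category (no standard glyph); "Lo" is an ordinary letter.
categories = [unicodedata.category(chr(u)) for u in (0xE001, 0x8000)]
```

The private-use character has no glyph in standard fonts, while the instance word 0x8000 happens to be the code point of an ordinary CJK ideograph, so it is drawn as one.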
We can see that with this encoding scheme, conventional UTF-16 text can be used directly in the new data processing system without any conversion, and the encoding results of the new data processing system can be processed with infrastructure and tools that support UTF-16. For example, if "I am" in the example is replaced with "我是" (Chinese for "I am") in a conventional text editor, the change appears directly in the output of the new data processing system, as shown in Figure 40.
In other words, existing UTF-16 processing capabilities and tools are inherited and preserved in the new system. At the same time, the new encoding results can be stored intact in any storage system that supports UTF-16.
Similarly, other standard encoding systems such as UTF-8 and UTF-32 can be extended to support the new data processing system.
Regarding conversion encoding: in the new object encoding system, besides placing the content of data objects in the encoding repository, codes themselves can be placed in the encoding repository as data content. An encoding type that converts other codes in this way is called a conversion encoding. What a conversion encoding stores in the encoding repository is text. A simple application is the conversion of standard codes. A conversion encoding is defined as follows:
Code (content ID)    Content         Other attributes
...                  ...             ...
0x41                 0x54 (T)        ...
0x42                 0x68 (h)        ...
0x43                 0x69 (i)        ...
0x44                 0x73 (s)        ...
0x45                 0x20 (space)    ...
0x46                 0x61 (a)        ...
0x47                 0x53 (S)        ...
0x48                 0x45 (E)        ...
0x49                 0x43 (C)        ...
0x50                 0x52 (R)        ...
0x51                 0x21 (!)
...                  ...             ...
Thus the original ASCII string "This is a SECRET!" is encoded in the new data processing system as "0x41 0x42 0x43 0x44 0x45 0x43 0x44 0x45 0x46 0x45 0x47 0x48 0x49 0x50 0x48 0x41 0x51". Someone without access to the corresponding encoding repository who obtains the text encoding cannot output it in the new data processing system, and in a conventional ASCII system the codes are displayed as "ABCDECDEFEGHIPHAQ". A user not authorized by the encoding repository therefore cannot obtain the real content. In effect this implements an encryption function, but it differs from conventional encryption. Conventional encryption encrypts the text data as a whole, whereas content protection based on conversion encoding relies on authorized access to the encoding repository and can be fine-grained: for example, only the characters or words that need protection are converted, or different codes are granted different access rights.
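The conversion table can be exercised with a minimal Python sketch. The table entries are the ones listed above; the function names and the boolean `authorized` flag are hypothetical stand-ins for the repository's access control.

```python
# Conversion table as listed: code -> original ASCII character.
TABLE = {
    0x41: "T", 0x42: "h", 0x43: "i", 0x44: "s", 0x45: " ", 0x46: "a",
    0x47: "S", 0x48: "E", 0x49: "C", 0x50: "R", 0x51: "!",
}
REVERSE = {ch: code for code, ch in TABLE.items()}


def convert(text):
    """Replace each character with its conversion code."""
    return bytes(REVERSE[ch] for ch in text)


def restore(codes, authorized):
    """Look the codes up again; only possible with access to the repository."""
    if not authorized:
        raise PermissionError("no access to the encoding repository")
    return "".join(TABLE[c] for c in codes)


codes = convert("This is a SECRET!")
legacy_view = codes.decode("ascii")  # what a plain ASCII system would show
```

With the codes as listed, 0x50 and 0x51 are the ASCII letters P and Q, so the legacy view of the converted string is "ABCDECDEFEGHIPHAQ".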
For example, based on the UTF-16 hybrid encoding described above, only part of the text can be re-encoded while the rest remains in UTF-16. A new type code is used here:
Figure PCTCN2015086672-appb-000063
It corresponds to the following encoding repository:
Figure PCTCN2015086672-appb-000064
Figure PCTCN2015086672-appb-000065
The original UTF-16 string "This is a SECRET!" is encoded in the new data processing system as "U+0054 U+0068 U+0069 U+0073 U+0020 U+0069 U+0073 U+0020 U+0061 U+0020 U+E002 0x8000 U+0021". For the type "com.sample.secrete", the new data processing system can produce different display output for different users. For an authorized user, the content corresponding to U+E002 0x8000 is retrieved normally and the result is displayed as:
This is a SECRET!
For an unauthorized user, the content corresponding to U+E002 0x8000 cannot be retrieved, and the result is displayed as:
This is a     !
In a UTF-16 text environment the encoding is output as:
This is a 耀!
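The two displays produced by the new system can be sketched in Python as follows. The repository content "SECRET" for the code pair (0xE002, 0x8000), the `authorized` flag, and the five-space placeholder are assumptions made for illustration; the actual repository entries are given in the figures.

```python
SECRET_TYPE = 0xE002
# Assumed repository entry: (type code, instance code) -> protected content.
REPOSITORY = {(SECRET_TYPE, 0x8000): "SECRET"}


def render(units, authorized, placeholder_width=5):
    """Render code units, resolving protected spans only for authorized users."""
    out, i = [], 0
    while i < len(units):
        if units[i] == SECRET_TYPE:
            key = (units[i], units[i + 1])
            out.append(REPOSITORY[key] if authorized else " " * placeholder_width)
            i += 2
        else:
            out.append(chr(units[i]))
            i += 1
    return "".join(out)


units = [ord(c) for c in "This is a "] + [0xE002, 0x8000, ord("!")]
for_authorized = render(units, authorized=True)
for_others = render(units, authorized=False)
```

Only the protected span depends on the access decision; the surrounding UTF-16 text is rendered identically for every reader.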
This flexibility is hard to achieve with conventional encryption. Conventional encryption and conversion encoding can also be used together: the text encoding as a whole can be encrypted, or the text content in the repository can be encrypted, and so on, raising the content security of the system to a higher level. After obtaining the ciphertext, a user needs a key to recover the plaintext encoding, but that encoding is still unintelligible; the user must then pass the encoding repository's authentication to obtain the corresponding content, and if the content itself is encrypted, it must also be decrypted before the information is finally obtained.
It should also be noted that replacing multiple characters with a single code, as done here, also compresses the text.
Conversion encoding can serve encryption and compression for standard codes, and it can equally be used to group and convert any other codes.
Here is a concrete example. As mentioned above, the encoding results of the new data processing system can be mixed with characters entered from a conventional keyboard. Suppose a handwriting input method is in use: what happens if the user writes by hand directly on top of conventional character content? If this interaction is allowed, the intuitive result is that the handwritten strokes lie on top of the rendered characters, as shown in Figure 41.
Here conversion encoding can be used to combine codes of different types into a single code. The encoding types used are as follows:
Figure PCTCN2015086672-appb-000066
The content items of encoding type "com.sample.handwriting.word" are as follows:
Figure PCTCN2015086672-appb-000067
The relevant content items of encoding type "com.sample.handwriting.mixedword" are as follows:
Figure PCTCN2015086672-appb-000068
Here the code U+E003 0x8000 corresponds to content that mixes UTF-16 codes with a handwritten character object code. When the encoding repository retrieves this content, it detects that the content itself contains codes held in the repository, and it fetches the data of all directly or indirectly referenced objects and sends it to the client together. This minimizes the number of service calls and also makes it easier to detect circular references (a code that directly or indirectly references itself). The text output system then splits the content into two parts. The first part is a handwriting code, possibly preceded by a spacing code that gives the spatial offset of the handwritten content from the preceding position. The second part, after the handwriting code, is an arbitrary mixture of UTF-16 codes and spacing codes. Rendering the two parts in turn gives the correct result.
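The repository-side step of gathering every referenced object in one response, while rejecting circular references, can be sketched in Python. The repository layout is an assumption for this sketch: each code is a `(type, instance)` tuple, and a content list may contain plain payloads or further code tuples. The sample entries are placeholders, not the figure's data.

```python
def collect(repository, root):
    """Return {code: content} for root and everything it references, directly or not.

    Raises ValueError when a code references itself directly or indirectly.
    """
    bundle, in_progress = {}, set()

    def visit(code):
        if code in in_progress:
            raise ValueError(f"circular reference through {code!r}")
        if code in bundle:
            return  # already gathered via another path
        in_progress.add(code)
        content = repository[code]
        for item in content:
            if isinstance(item, tuple):  # a nested object code
                visit(item)
        in_progress.discard(code)
        bundle[code] = content

    visit(root)
    return bundle


repository = {
    # Mixed word: one handwriting code followed by ordinary UTF-16 code units.
    (0xE003, 0x8000): [(0xE000, 0x8041), 0x0054, 0x0068],
    (0xE000, 0x8041): ["<stroke data placeholder>"],
}
bundle = collect(repository, (0xE003, 0x8000))
```

Returning the whole bundle lets the client render the mixed word after a single request instead of one request per nested code.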
In this embodiment, personalized text encoding means that text can be correctly output and understood only with its encoding repository. This is a natural security advantage: the text encoding and the text encoding repository can be deployed in two separate systems, so that only users with access rights to both systems can obtain the final text. This is the split-storage concept described earlier. Take a conventional microblogging service: a site administrator or database administrator can easily read any microblog content stored in the system, whether public or private. But if the microblog content is handwritten text based on object encoding, and the corresponding encoding repository is provided by a different Internet service provider, then an administrator without access to the encoding repository can see the text encoding of a post but still cannot obtain its content. Meanwhile, administrators at the encoding repository provider can obtain the glyph for each character code, but they do not have the text encoding of the posts, so the content is unknown to them as well. Likewise, an attacker mounting a man-in-the-middle attack on such a handwritten microblogging system must compromise both the microblog system and the encoding repository to fully intercept the posts, which greatly raises the cost of an attack.
Besides non-standard text encodings, the conversion encoding described above can be used to re-encode standard codes through the encoding repository, de-standardizing them to protect the content.
Beyond the security gained from separating codes from content in an object-encoding data processing system, the new system can protect text content in finer detail through other mechanisms, including but not limited to encoding spaces, access control, encryption encoding, and content verification encoding.
As mentioned earlier, encoding access spaces can completely isolate codes of different security levels. For example, an encoding repository deployed inside an enterprise rejects any direct request for privately encoded content. Likewise, an encoding repository deployed in a public cloud rejects requests for the content of enterprise codes and private codes.
The encoding space can be made explicit by specifying ranges of type codes. For example, in a data processing system based on open encoding, 0–99 may be defined as public codes, 100–199 as enterprise codes, and 200–255 as private codes. Type codes above 99 are then not directly supported by the public encoding repository. For an enterprise's private-cloud encoding repository, type codes above 199 are unsupported, type codes 100–199 are directly stored and supported, and type codes 0–99 are supported indirectly. The indirect support can be implemented as a content caching service for the public-cloud encoding repository.
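The range-based spaces and the support rules of the two repository kinds can be written down as a small Python sketch. The ranges are the ones given in the example; the enum, the repository labels "public" and "enterprise", and the returned strings are hypothetical names chosen for illustration.

```python
from enum import Enum


class Space(Enum):
    PUBLIC = "public"
    ENTERPRISE = "enterprise"
    PRIVATE = "private"


def space_of(type_code):
    """Map a type code to its encoding space using the example ranges."""
    if 0 <= type_code <= 99:
        return Space.PUBLIC
    if 100 <= type_code <= 199:
        return Space.ENTERPRISE
    if 200 <= type_code <= 255:
        return Space.PRIVATE
    raise ValueError("type code out of range")


def support(repository_kind, type_code):
    """Return "direct", "indirect" or "unsupported" for a repository kind."""
    space = space_of(type_code)
    if repository_kind == "public":
        return "direct" if space is Space.PUBLIC else "unsupported"
    if repository_kind == "enterprise":
        if space is Space.ENTERPRISE:
            return "direct"
        # Public codes are served indirectly, e.g. through a cache of the public repository.
        return "indirect" if space is Space.PUBLIC else "unsupported"
    raise ValueError("unknown repository kind")
```

A request router could call `support` before any lookup, so that a repository never even attempts to resolve a code outside the spaces it serves.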
It follows that one person can have only one public encoding repository, which resides in the public cloud, that is, in an Internet service. The same person may, however, have multiple private and enterprise encoding repositories in different network environments and computer systems. Distinct encoding repository identifiers must be generated for these repositories, and a text file or text data must store the identifier of its encoding repository to ensure correct encoding, decoding, input, and output.
Separate non-public encoding repositories lead to information silos. Under certain conditions, therefore, a closed encoding repository is allowed to submit content to an open encoding repository to make content sharing easier.
Sometimes three levels of encoding access space are not enough. An application may, for example, need department-level sharing; it can then define finer subspaces within the enterprise encoding space. The application manages these subspaces.
A concrete example is given here:
A personal handwritten diary application uses a local private encoding repository. The diary text is stored in Internet cloud storage, while the encoding repository is kept on a USB flash drive that the user carries. Even if an attacker obtains the diary content from cloud storage, the information in it cannot be recovered without the flash drive. When the user publishes a diary entry as a blog post in the same application, the system must convert the text from the private encoding space to the public personal encoding space. In practice this means retrieving the encoded content from the repository on the flash drive, storing it in the public encoding repository, and obtaining the corresponding public codes.
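The publishing step can be sketched in Python as copying each referenced object from the private repository to the public one and rewriting the text to use the new public codes. This is a simplified illustration: both repositories are plain dictionaries keyed by integer codes, type codes are ignored, and the allocation of the next free public code is a hypothetical policy.

```python
def publish(text_codes, private_repo, public_repo):
    """Copy referenced content to public_repo and return the re-coded text."""
    mapping = {}      # private code -> newly assigned public code
    published = []
    for code in text_codes:
        if code not in mapping:
            # Hypothetical allocation: take the next unused public code.
            new_code = max(public_repo, default=-1) + 1
            public_repo[new_code] = private_repo[code]
            mapping[code] = new_code
        published.append(mapping[code])
    return published


private_repo = {0x41: "glyph A", 0x42: "glyph B"}  # kept on the flash drive
public_repo = {0: "existing public glyph"}          # kept in the public cloud
blog_codes = publish([0x41, 0x42, 0x41], private_repo, public_repo)
```

The private repository is left untouched, so the diary itself remains readable only with the flash drive while the blog post resolves entirely against the public repository.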
Protection of encoded content in the encoding repository is provided mainly by the repository's access control service. Access control applies to the encoding metadata and to individual data objects. Unlike ordinary access control, access control over object codes allows fine-grained control of access to text content. The earlier example showed access control combined with conversion encoding to encrypt part of a text.
Regarding encryption encoding: in the earlier example of encrypting part of a text, the repository of the conversion encoding stores the codes of the sensitive text. A system administrator of that repository, or an attacker who breaks into it, can therefore use this content to retrieve everything about the sensitive text from the text encoding repository. Moreover, plaintext retrieved from the encoding repository travels over the network directly, which is also a security risk. An alternative is encryption encoding. An encryption encoding is a special encoding type whose content in the repository is a key. The encryption code is followed by the length of the encrypted content, and the codes within that length are ciphertext encrypted with the key. On output, if the key for the encryption code can be obtained, the ciphertext is decrypted back to the original codes and output correctly. Access control over the encryption code thus provides dynamic access control over the encrypted codes. Conventional encryption and decryption techniques can all be used here. As an example, a simple scheme is defined: the key is a pseudo-random number (which can be generated automatically when encryption is set up), and encryption and decryption are the same function, XORing each instance code with the key.
Applying this scheme to the previous example, the encoding type information is updated as follows:
Figure PCTCN2015086672-appb-000069
The "com.sample.scrambling" encoding repository is as follows:
Figure PCTCN2015086672-appb-000070
The original UTF-16 string "This is a SECRET!" is encoded in the new data processing system as "U+0054 U+0068 U+0069 U+0073 U+0020 U+0069 U+0073 U+0020 U+0061 U+0020 U+E004 0x8000 0x0006 U+FFAC U+FFBA U+FFBC U+FFCD U+FFAC U+FFCA U+0021". Here U+E004 0x8000 0x0006 is the encryption code. When the decoder reads U+E004, it recognizes an encryption encoding type, which is followed by two parameters: 0x8000 is the instance code, whose content in the encoding repository is the decryption key, and 0x0006 is the length of the data covered by the encryption code, here 6 words (one word is 2 bytes). The decoder tries to read the content for 0x8000 from the encoding repository. If it succeeds, the key is used to decrypt the following six 16-bit values, yielding the codes U+0053 U+0045 U+0043 U+0052 U+0045 U+0054.
Otherwise, the following 6 words are ciphertext that cannot be displayed correctly; the decoder skips the 6 words and the output is displayed as:
This is a [12 bytes encrypted here]!
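The XOR-based encryption code can be sketched in Python as a round trip. The stream layout (type code, key code, length, ciphertext words) follows the description above; the key value 0xA5C3, the `keys` mapping that stands in for the repository's access decision, and the function names are hypothetical, so the ciphertext words differ from those in the example.

```python
ENCRYPTION_TYPE = 0xE004


def protect(units, key_code, key):
    """Return: type code, key code, length in words, then the XOR-encrypted words."""
    return [ENCRYPTION_TYPE, key_code, len(units)] + [u ^ key for u in units]


def decode(stream, keys):
    """Render a stream; keys maps key codes to keys the reader is allowed to fetch."""
    out, i = [], 0
    while i < len(stream):
        if stream[i] == ENCRYPTION_TYPE:
            key_code, length = stream[i + 1], stream[i + 2]
            cipher = stream[i + 3:i + 3 + length]
            if key_code in keys:
                # XOR is its own inverse, so decryption reuses the same operation.
                out.append("".join(chr(w ^ keys[key_code]) for w in cipher))
            else:
                out.append(f"[{2 * length} bytes encrypted here]")
            i += 3 + length  # skip the whole protected span either way
        else:
            out.append(chr(stream[i]))
            i += 1
    return "".join(out)


KEY = 0xA5C3  # hypothetical key stored in the repository under code 0x8000
stream = ([ord(c) for c in "This is a "]
          + protect([ord(c) for c in "SECRET"], 0x8000, KEY)
          + [ord("!")])
with_key = decode(stream, {0x8000: KEY})
without_key = decode(stream, {})
```

Revoking a reader's access amounts to removing the key from what the repository will return, after which the same stream decodes to the placeholder.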
This encoding makes real-time authorization of text straightforward. Suppose encrypted text has been sent by email and, for some reason, the sender no longer wants the recipient to read it. The sender only needs to deny the recipient access to the corresponding encryption code, and the message already sent becomes unreadable. This mechanism can be used to implement message recall. Note also that because the codes of the encrypted text have changed, search engines cannot index it.
Regarding content verification encoding: as with encryption encoding, verification information for some or all of the text codes can be placed in the encoding repository and given a code, called a content verification code. With content verification codes, tampering with text content can be detected.
For example, a manager who gives explicit instructions on a project in an email can mark that passage as tamper-proof. The system then runs a hash algorithm over the passage to produce a 128-bit number that, in practice, identifies the passage uniquely. The system stores this number in the encoding repository, forms a content verification code (which includes the length of the passage), and places the code before the passage. After the message has been forwarded several times, the decoder can retrieve the stored value through the content verification code and compare it with the hash of the corresponding text to determine whether the text is the author's original. If verification succeeds, the result can be shown visually in some form so that the final reader knows the text has not been tampered with.
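A minimal Python sketch of the verification step follows. The text does not name a hash algorithm; BLAKE2b truncated to 16 bytes is used here only as one readily available way to obtain a 128-bit digest. The repository is a plain dictionary and the function names are hypothetical.

```python
import hashlib
import hmac


def fingerprint(text):
    """128-bit digest of the passage (algorithm choice is an assumption)."""
    return hashlib.blake2b(text.encode("utf-16-be"), digest_size=16).digest()


def seal(repository, code, text):
    """Store the length and digest of the passage under a content verification code."""
    repository[code] = (len(text), fingerprint(text))


def is_untampered(repository, code, text):
    """Compare the passage against the stored length and digest."""
    length, digest = repository[code]
    return len(text) == length and hmac.compare_digest(digest, fingerprint(text))


repository = {}
seal(repository, 0x8000, "Approved. Proceed with phase two.")
original_ok = is_untampered(repository, 0x8000, "Approved. Proceed with phase two.")
forged_ok = is_untampered(repository, 0x8000, "Approved. Proceed with phase six.")
```

Because the digest lives in the repository and not in the message, someone who edits the forwarded text cannot also update the value it is checked against.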
Regarding multi-user encoding schemes: in a multi-user environment, the text encoding repository stores the text content of multiple users. User identifiers are then used to separate the content of different users. If needed, encoding type information can also be kept per user, so that different users may have different type codes for the same kind of encoding, which further improves the security of the system.
Regarding encoding ownership spaces: different users sometimes need to share codes, and shared and personal codes are distinguished by encoding space. As noted earlier, personal codes differ from person to person, while shared codes are the same for everyone. If a company logo is placed in an enterprise encoding repository, its code is a typical shared code. Existing standard encodings are typical public shared codes. In addition, control codes such as the spacing codes of handwritten text, and system codes such as codes representing user IDs, can be shared codes, so that system tools such as search systems can use them more efficiently. Unicode has a similar notion of ownership space: most of it is shared codes, but a private use area is reserved, which corresponds to the personal codes described here.
As mentioned earlier, in an object-encoding data processing system the encoding type is itself encoded, so an object code has two parts: the type code (meta code) and the instance code within that type. Applying ownership spaces to these two parts yields three kinds of encoding: fully shared codes, personal codes of a shared type, and fully personal codes. A fully shared code is shared in its entirety by all users of the encoding repository and is not tied to any user; the code and its content are generally managed by the repository administrator. A personal code of a shared type is still a personal code and differs from person to person, but its type code is shared: different users use the same type code, and the remaining part varies by user. The advantage is that a text processing tool can obtain the type information of a code without any personal information and then process the code accordingly. A fully personal code is personalized in both parts. It is the most secure and the least convenient to work with: a text processing tool must use the owner's user information to obtain the type information before it can obtain the rest of the encoding information. For the same kind of encoding, therefore, all three kinds of type code may coexist in one encoding repository.
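The three ownership modes can be illustrated with a small Python lookup sketch. The class, the `SHARED` owner marker, the type codes, and the sample entries are all hypothetical; the sketch only shows which lookups need a user ID.

```python
SHARED = None  # owner marker for entries that are not tied to any user


class Repository:
    def __init__(self):
        self.types = {}      # (owner, type_code) -> type name
        self.instances = {}  # (owner, type_code, instance_code) -> content

    def type_name(self, type_code, user=None):
        """Shared types resolve without a user; personal types need the owner's ID."""
        if (SHARED, type_code) in self.types:
            return self.types[(SHARED, type_code)]
        return self.types[(user, type_code)]  # KeyError when the type is unknown

    def content(self, type_code, instance_code, user=None):
        """Try the shared entry first, then the user's personal entry."""
        for owner in (SHARED, user):
            key = (owner, type_code, instance_code)
            if key in self.instances:
                return self.instances[key]
        raise KeyError((type_code, instance_code))


repo = Repository()
# Fully shared: type and instance are the same for everyone.
repo.types[(SHARED, 0x03)] = "emoticon"
repo.instances[(SHARED, 0x03, 0x05)] = "smiley"
# Personal code of a shared type: shared type code, per-user instances.
repo.types[(SHARED, 0x02)] = "handwriting"
repo.instances[("alice", 0x02, 0x41)] = "glyph written by alice"
repo.instances[("bob", 0x02, 0x41)] = "glyph written by bob"
# Fully personal: even the type code belongs to one user.
repo.types[("alice", 0x90)] = "alice's private type"
repo.instances[("alice", 0x90, 0x01)] = "private content"
```

A tool that knows nothing about the user can still call `repo.type_name(0x02)`, whereas `repo.type_name(0x90)` fails until the owner's ID is supplied, which mirrors the trade-off between convenience and security described above.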
For a single user, the text content contains both that user's personal codes and the shared codes available to the user, and these must be distinguished by encoding space. An example follows:
A standard smiley icon is added at the end of the sentence from the earlier example, as shown in Figure 42. The smiley also comes from the encoding repository, and its code is an emoticon code shared by all users. The space codes here are also shared codes, and the handwriting codes are personal codes of a shared type. The shared type information is as follows (shared type codes are assumed to be 0x01–0x7F):
Figure PCTCN2015086672-appb-000071
In the table above, code types 0x01 and 0x02 are identical except for their ownership space. Types 0x01 and 0x03 are shared codes, while type 0x02 is a personal code. All three types themselves are shared, however. Within the same code repository, personal type information carries one more field than shared type information: a user ID.
The content items of type 0x02 are as follows:
Figure PCTCN2015086672-appb-000072
Figure PCTCN2015086672-appb-000073
The content items of type 0x03 are as follows:
Figure PCTCN2015086672-appb-000074
The code for this text is therefore:
0x03 0x41 0x04 0x03 0x42 0x04 0x03 0x43 0x04 0x02 0x05
Regarding the encoding of users: in the example above, every content item of a personal code carries a user ID. In a multi-user code repository, the data objects of personal codes differ from user to user, and the codes of different users can be told apart by the user ID of the data object. But a text code can exist on its own, independently of the text code repository; where should the corresponding user ID be placed in that case? There are two situations.
The first situation is single-user text encoding: all personal codes in the text come from the same user (shared codes can be accessed without a user ID). This can be implemented in different ways. One is to use the context object mentioned earlier to set a system code. The other is to define the user type explicitly as a context object type in the code metamodel; in that case we only need to encode the user ID as a shared code and place it at the very beginning of the text code content.
Adding a user ID code to the shared type information of the example above gives the following update:
Figure PCTCN2015086672-appb-000075
Figure PCTCN2015086672-appb-000076
Accordingly, the two-byte user ID is used directly as the code parameter of type 0x01. The final code for the example is:
0x01 0x0C3F 0x03 0x41 0x04 0x03 0x42 0x04 0x03 0x43 0x04 0x02 0x05
In this way, a program reading the text code knows, after reading the first three bytes 0x01 0x0C3F, which user the subsequent personal codes belong to.
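A minimal Python sketch of how a reading program might handle this prefix. It assumes the two-byte user ID is stored most significant byte first (the text does not state the byte order), and `read_user_prefix` is a hypothetical helper name, not part of the described system.

```python
# Type 0x01 is the user-ID code from the example above.
# Assumption: the two-byte user ID parameter is big-endian.
USER_TYPE = 0x01

def read_user_prefix(data: bytes):
    """Return (user_id, remaining_codes); user_id is None if no prefix is present."""
    if len(data) >= 3 and data[0] == USER_TYPE:
        user_id = int.from_bytes(data[1:3], "big")
        return user_id, data[3:]
    # No prefix: the user is an implicit context kept outside the text code.
    return None, data

text_code = bytes([0x01, 0x0C, 0x3F, 0x03, 0x41, 0x04, 0x03, 0x42, 0x04,
                   0x03, 0x43, 0x04, 0x02, 0x05])
user_id, body = read_user_prefix(text_code)
```

When the prefix is absent, the function returns `None` and leaves the data untouched, which corresponds to the case where the user ID is supplied by the surrounding system rather than by the text code.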
Sometimes this user code can be omitted, in which case it becomes an implicit encoding context. For example, in a personal handwriting application, all of a user's text code content consists of that user's personal codes. In such a system the user ID in the text code repository corresponds one-to-one with the system account, and the ID can be stored somewhere outside the text code.
The second situation is multi-user mixed encoding, in which codes of several code repository users may appear in the same document. The scheme above still applies, except that different user codes may appear several times in the text; the personal codes following each user code belong to that user. Alternatively, in a structured document (for example an XML-based document such as XHTML or SVG), the user ID can be carried as an attribute of the text.
There is of course also the most direct, context-free scheme: making the user ID part of the code itself.
Next, consider encoding schemes for multiple users and multiple applications. In a multi-user system, the code repository, as the store of data object content, is often shared by several applications. An application's developer has the opportunity to obtain the object codes that users store in that application. If a user's codes are the same across applications, a hacker or a malicious developer who analyzes that user's object codes in one application can establish the correspondence between the user's codes and their content, and that correspondence can then be used directly against other applications. Code isolation between applications therefore greatly strengthens system security. Code isolation means that the same data content of the same data object is given different object codes in different applications. To support both isolation and sharing between applications, application-specific code spaces can be used: when applications request a given kind of code, they may use different code spaces or the same one.
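The idea of application-specific code spaces can be sketched as follows. This is a toy in-memory model, not the repository design of the invention: the class name `CodeRepository`, the four-byte random codes, and the (user, application) keying are all assumptions made for illustration.

```python
import secrets

class CodeRepository:
    """Toy repository with one code table per (user, application) code space."""

    def __init__(self):
        self._spaces = {}     # (user, app) -> {code: content}
        self._issued = set()  # every code handed out so far, to keep codes unique

    def encode(self, user, app, content):
        space = self._spaces.setdefault((user, app), {})
        for code, stored in space.items():
            if stored == content:
                return code   # same content in the same space reuses its code
        code = secrets.token_bytes(4)
        while code in self._issued:
            code = secrets.token_bytes(4)
        self._issued.add(code)
        space[code] = content
        return code

    def decode(self, user, app, code):
        # A code is only meaningful inside the space that issued it.
        return self._spaces.get((user, app), {}).get(code)

repo = CodeRepository()
diary_code = repo.encode("alice", "diary", "hello")
chess_code = repo.encode("alice", "chess", "hello")
```

Here the same content receives unrelated codes in the two applications, so a code-to-content mapping learned in one application does not resolve in the other. Two applications that pass the same space key would share codes instead.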
Some examples of handwriting input systems that can incorporate the encoding scheme of the present invention are listed below:
1. Domain-specific handwriting systems, such as handwritten diaries, handwritten account books, handwritten Sudoku, handwritten crossword puzzles, and so on;
2. Handwriting-based command-line input systems;
3. Handwriting-based formula editors;
4. Handwriting-based programming systems.
To further describe the various implementations of the encoding scheme, consider DSL-based personalized documents. Because the encoding of the new data processing system is open, a user's interactions in a specific domain can also be encoded, so that interaction data can be stored, processed, and transmitted as text. One benefit is that such interactions can be stored and processed together with the user's other text, and handled with existing text processing tools. In addition, the encoding schemes described above can be used to personalize the encoding of user data and thereby secure the interaction data.
An online Go game serves as a concrete example, as shown in Figure 43.
We can define four shared code types. The first is the user code type, which encodes the user's code repository user ID. The second is the opening code, a domain-specific (application-specific) code followed by the user IDs of the players taking black and white. The third is the move code, followed by the position of the move, which can be expressed in two bytes: for example, 0x00 0x00 is the upper-left corner and 0x09 0x09 is the center point (tengen). The last is the delay code, which records the number of seconds since the previous move. Here we adopt a scheme with an 8-bit word length that is compatible with ASCII, so every non-ASCII code uses bytes whose leading bit is 1. The type information (code metadata) is as follows:
Figure PCTCN2015086672-appb-000077
Here, all six codes are content codes, so no data objects for them exist in the code repository. An example game text follows (in this example, all codes other than ASCII are written in hexadecimal):
0x81 0x85 0x83
0x80 0x85 0x83 0x85 0x82 0x8F 0x83
0x83 0x82 0x84 0x86
0x80 0x83 0x83 0x8A 0x82 0x83 0x83
0x80 0x86 0x83 0x87 Hello,everybody!
0x80 0x85 0x83 0x88 0x82 0x8F 0x90
0x80 0x83 0x83 0x8F 0x82 0x83 0x8F
0x85 0x86
0x80 0x85 0x83 0x83 0x82 0x90 0x8A
0x80 0x83 0x83 0x8F 0x82 0x8D 0x82
...
This object code sequence is stored in the Go application's website storage. Because the new data processing system is used, game data and chat data can be mixed. From this content the application can render the sequence within the users' chat log (assuming here that user ID 0x05 is named "Xiao Ming", user ID 0x03 is named "Xiao Liang", and user ID 0x06 is named "Xiao Qiang"):
System: Xiao Ming plays black, Xiao Liang plays white. The game begins.
(5 seconds after the start) Xiao Ming: move P4
(7 seconds after the start) System: Xiao Qiang joins the audience
(15 seconds after the start) Xiao Liang: move D4
(22 seconds after the start) Xiao Qiang: Hello,everybody!
(23 seconds after the start) Xiao Ming: move P17
(38 seconds after the start) Xiao Liang: move D16
(38 seconds after the start) System: Xiao Qiang leaves
(41 seconds after the start) Xiao Ming: move Q11
(56 seconds after the start) Xiao Liang: move N3
...
The game itself can also be visualized graphically.
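The chat-log rendering of the sample stream can be reproduced with a small decoder, sketched here in Python. Because the type table of Figure PCTCN2015086672-appb-000077 is not reproduced in this text, the code assignments are inferred from the sample stream and its rendering: 0x80 user, 0x81 opening, 0x82 move, 0x83 delay, 0x84 join audience, 0x85 leave. The sketch further assumes that operands carry their value in the low seven bits, that columns are lettered consecutively from A (including I), and that rows are one-based.

```python
USER, OPENING, MOVE, DELAY, JOIN, LEAVE = range(0x80, 0x86)  # inferred assignments

def decode_game(stream: bytes):
    """Decode the mixed game/chat stream into (seconds, user_id, text) events."""
    events, user, delay, last_move, i = [], None, 0, 0, 0

    def arg(k):
        return stream[k] & 0x7F  # operands have the leading bit set

    while i < len(stream):
        b = stream[i]
        if b == USER:
            user, i = arg(i + 1), i + 2
        elif b == DELAY:  # seconds since the previous move
            delay, i = arg(i + 1), i + 2
        elif b == OPENING:
            events.append((0, None, f"black={arg(i + 1)} white={arg(i + 2)}"))
            i += 3
        elif b == MOVE:
            last_move += delay
            col, row = chr(ord("A") + arg(i + 1)), arg(i + 2) + 1
            events.append((last_move, user, f"move {col}{row}"))
            delay, i = 0, i + 3
        elif b in (JOIN, LEAVE):
            action = "joins" if b == JOIN else "leaves"
            events.append((last_move + delay, arg(i + 1), action))
            delay, i = 0, i + 2
        elif b < 0x80:  # a run of plain ASCII chat text
            j = i
            while j < len(stream) and stream[j] < 0x80:
                j += 1
            events.append((last_move + delay, user, stream[i:j].decode("ascii")))
            delay, i = 0, j
        else:
            raise ValueError(f"unknown code 0x{b:02X}")
    return events

sample = bytes.fromhex(
    "818583" "80858385828F83" "83828486" "8083838A828383" "80868387"
) + b"Hello,everybody!" + bytes.fromhex(
    "80858388828F90" "8083838F82838F" "8586" "8085838382908A" "8083838F828D82"
)
events = decode_game(sample)
```

Each event time is the time of the previous move plus the pending delay, which matches the timestamps in the rendered log.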
From this text record the Go application can replay the entire game. If players' privacy is to be protected, only games authorized by both players should be replayable. In a traditional application this requires considerable work: building a user authorization system, maintaining authorization information, and so on. Moreover, game data separated from the authorization system has no privacy protection of its own, so an application data leak for any reason exposes users' private data. In the new data processing system, placing the key data under the protection of a contextual code space in the code repository greatly strengthens the security of the application and its data, and also reduces the complexity of the application.
Returning to the Go example, we only need to replace the move type with a context-dependent type:
Figure PCTCN2015086672-appb-000078
In Xiao Ming's user space in the code repository (more precisely, the document space within the user space), Xiao Ming's move code data for this game is stored:
Code  X  Y
1     P  4
2     P  17
3     Q  11
Xiao Liang's move code data is:
Code  X  Y
1     D  4
2     D  17
3     N  11
The corresponding text code is then:
Figure PCTCN2015086672-appb-000079
Figure PCTCN2015086672-appb-000080
In this way, by granting appropriate authorization on their respective code spaces in the code repository, Xiao Ming and Xiao Liang can control access to the game by the system or by other people.
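A rough Python sketch of this access control, using Xiao Ming's move table from above. The class `UserSpace`, its `grant` and `resolve` methods, and the reader name "replay-service" are hypothetical; the text does not specify how authorization is represented in the repository.

```python
class UserSpace:
    """Toy per-user code space: move data plus the set of authorized readers."""

    def __init__(self, owner, moves):
        self.owner = owner
        self.moves = moves       # move code -> (column, row)
        self.readers = {owner}   # the owner can always read

    def grant(self, user):
        self.readers.add(user)

    def resolve(self, code, requester):
        if requester not in self.readers:
            raise PermissionError(f"{requester} may not read {self.owner}'s codes")
        return self.moves[code]

xiao_ming = UserSpace("Xiao Ming", {1: ("P", 4), 2: ("P", 17), 3: ("Q", 11)})
xiao_ming.grant("replay-service")
first_move = xiao_ming.resolve(1, "replay-service")
```

Without a grant, the text code only exposes the opaque move codes 1, 2, 3, so a leaked game record reveals no positions.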
As mentioned earlier, the code repository can be regarded as the font library of the new data processing system. The difference is that it need not hold standard glyph information but can hold information of any other type, and the information need not be stored at a fixed location. Such a font library can of course also store glyph information for standardized codes, that is, the content of a traditional font. Taking a vector outline font as an example, the outline of each character (or letter) can be stored in a dedicated store of the code repository at the position given by its standard code (such as its Unicode code point). Other information needed for text output, such as hinting and kerning data, can also be stored in the code repository.
The code repository can be deployed on a network, and a networked font library is easier to maintain, upgrade, and extend with new fonts. Traditional font files can serve as a local cache of the repository content. The repository's content selection service can also choose glyph content of different quality depending on the output device.
When rendering text in a standard encoding, a text display client only needs to obtain the rendering information, or the rendered result, for each character from the code repository according to the font information in order to render traditional text correctly.
In computer systems, people use text data not only to record what they or others say and do, but also to describe models and data in various domains. Models and data are generally recorded as formatted text, which lends itself to automatic analysis and processing by computers. XML is a typical formatted text that can express arbitrary models through a tree structure. Because XML is human- and machine-readable, extensible, and flexible, XML-based text formats are in widespread use. XHTML, SVG, and RDF/XML, for example, are XML-based formats used on the Web. The XML standard is in fact one of the cornerstones of the Internet.
However, XML has a fatal weakness: it is so verbose that storing, transmitting, and processing files is costly. For this reason the World Wide Web Consortium (W3C) developed the EXI (Efficient XML Interchange) standard, a binary XML format.
Similarly, representing XML files in the new data processing system avoids this weakness. Unlike the fully binary EXI, however, an XML file in the new data processing system remains text; only its codes become object codes. As the SVG example in OTF-8 shows, object encoding removes the redundancy of XML syntax. Combined with the metadata in the code repository, the converted result is fully equivalent to the original. Through the "universal display and editing of mixed codes" text service mentioned earlier, people can conveniently view and edit the text. The code repository can be exploited further by treating the values of XML elements and attributes as data parameters of the corresponding codes and encoding them directly as object codes, which compresses storage further and reduces the chance of errors. XML content or fragments can of course also be stored directly in the code repository and referenced by code in the XML file, but that is merely XML making use of the repository, not an optimization of the XML encoding itself.
For XML files that use object encoding, only small changes to the XML parser are needed so that it fetches the relevant information from the code repository. On that basis all existing XML technologies, such as SAX, DOM, XPath, XSLT, and XSL-FO, can be used directly. For application developers, all changes happen in the storage and parsing layers of the XML file; as long as the API is unchanged, applications that use XML need no modification and directly benefit from smaller files and faster transmission.
In the existing XML specification, the same character set expresses both syntax markup and text content. Generating XML files is therefore subject to many restrictions: some reserved characters ("<", ">", "&", and so on) cannot be used directly and must be escaped as entities; unparsed data must be wrapped in "<![CDATA[" and "]]>"; and so on. Object encoding makes these restrictions unnecessary, because whether a code is markup or content is determined not by the code itself but by the information the code repository holds for it. The complexity of XML and of the parsing process can therefore be reduced.
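The separation of markup codes from content can be illustrated with a small Python sketch. The tag codes, the END and TEXT codes, and the two-byte length prefix are hypothetical choices for illustration only; the actual codes would come from the code repository's metadata.

```python
TAGS = {"svg": 0x80, "text": 0x81}  # hypothetical repository entries: tag -> code
NAMES = {code: tag for tag, code in TAGS.items()}
END, TEXT = 0xFE, 0xFF              # hypothetical structural codes

def encode(node) -> bytes:
    """Encode a (tag, children) tree; strings are text content."""
    if isinstance(node, str):
        raw = node.encode("utf-8")  # content is length-prefixed, never escaped
        return bytes([TEXT]) + len(raw).to_bytes(2, "big") + raw
    tag, children = node
    return bytes([TAGS[tag]]) + b"".join(encode(c) for c in children) + bytes([END])

def decode(data: bytes, i: int = 0):
    """Return (node, index of the first byte after the node)."""
    if data[i] == TEXT:
        n = int.from_bytes(data[i + 1:i + 3], "big")
        return data[i + 3:i + 3 + n].decode("utf-8"), i + 3 + n
    tag, children, i = NAMES[data[i]], [], i + 1
    while data[i] != END:
        child, i = decode(data, i)
        children.append(child)
    return (tag, children), i + 1

doc = ("svg", [("text", ["a < b & c"])])
encoded = encode(doc)
decoded, _ = decode(encoded)
```

The content string contains "<" and "&" verbatim: because markup is carried by dedicated codes, the decoder never has to scan content for reserved characters.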
Similarly, the same method can be used to object-encode any existing text format (such as CSV, RTF, CSS, JSON, and even programming languages):
1. Place the content corresponding to syntax markup and keywords in the code repository, and use the corresponding object codes in the file;
2. Remove any character restrictions on the data or text content.
As noted above, object encoding easily eliminates the conflict between formatting codes and text content found in standard encodings. Likewise, the separation of code and content in open encoding, together with the openness of code types, makes it possible to mix several arbitrary text formats. Some existing text format specifications already allow for this: XHTML can embed JavaScript or Base64-encoded binary data, RTF can embed OLE objects, and so on. But on the one hand these formats are constrained by standard text encodings, so data in a different format needs some encoding conversion or character escaping; on the other hand, existing format mixing is limited, with one format dominant and the others merely embedded data. With object encoding, arbitrary mixing becomes easy: tabular data can be embedded in a node of an XML document (in effect a tree document); conversely, a tree document can be placed in a table cell; or two documents of different forms can be placed side by side. Such mixing is of course subject to certain rules:
1. Each format must have an explicit format-start code and format-end code.
2. The starts and ends of different formats must not interleave. That is, if one format starts inside another, it must also end inside it.
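The two rules amount to a proper-nesting check, which can be sketched with a stack. The specific start and end code values used here are hypothetical placeholders.

```python
def formats_properly_nested(codes, starts, ends):
    """Check rules 1 and 2 for a code sequence.

    starts / ends map a format-start / format-end code to its format name.
    """
    stack = []
    for code in codes:
        if code in starts:
            stack.append(starts[code])
        elif code in ends:
            # An end code must close the most recently started format.
            if not stack or stack.pop() != ends[code]:
                return False
    return not stack  # every started format must also end

# Hypothetical codes: 0x90/0x91 delimit a tree format, 0x92/0x93 a table format.
STARTS = {0x90: "tree", 0x92: "table"}
ENDS = {0x91: "tree", 0x93: "table"}

nested_ok = formats_properly_nested([0x90, 0x92, 0x41, 0x93, 0x91], STARTS, ENDS)
interleaved = formats_properly_nested([0x90, 0x92, 0x91, 0x93], STARTS, ENDS)
```

The first sequence embeds a table inside a tree node and passes; the second ends the tree before the table it contains and is rejected.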
In addition, object encoding allows binary data to be embedded directly in the encoded result; this is in effect content encoding of the data object's content. The corresponding binary encoding method only needs to be described in the code metadata. Such an object code can take the following form:
meta code + length of the binary content data + the binary content data itself
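A minimal Python sketch of this layout. The one-byte meta code, the four-byte big-endian length field, and the example meta code 0x7A are assumptions; in the described system the code metadata would define the actual field widths.

```python
def encode_binary(meta_code: int, payload: bytes) -> bytes:
    """meta code (1 byte) + payload length (4 bytes, big-endian) + payload."""
    return bytes([meta_code]) + len(payload).to_bytes(4, "big") + payload

def decode_binary(data: bytes, offset: int = 0):
    """Return (meta_code, payload, offset of the next code)."""
    meta_code = data[offset]
    length = int.from_bytes(data[offset + 1:offset + 5], "big")
    start = offset + 5
    payload = data[start:start + length]
    if len(payload) != length:
        raise ValueError("truncated binary content")
    return meta_code, payload, start + length

blob = bytes(range(256))                       # arbitrary bytes, no escaping needed
encoded = encode_binary(0x7A, blob) + b"tail"  # 0x7A is a made-up meta code
meta, payload, next_offset = decode_binary(encoded)
```

Because the length is explicit, the payload may contain any byte values, including ones that would otherwise look like codes, and the reader resumes at `next_offset`.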
Implementing mixed-format encoding is in fact very natural for an object-encoding data processing system. In an open object-encoding system, different code types already require different encoders and decoders, which are loaded dynamically as needed while an object-encoded document is encoded or decoded. An encoder encodes objects into a byte stream, and a decoder decodes a byte stream into objects. Different formats simply divide the encoders and decoders into groups. Encoding a format thus amounts to encoding the corresponding in-memory model into a byte stream, and decoding the format amounts to decoding the byte stream into the in-memory model, that is, a higher-level object. A format codec is therefore just an object codec at a larger scale, and the new data processing system can manage it in the same way.
In essence, an object-encoding system uses a byte stream to encode a string of objects. The objects in this string, that is, in the object array, can be as simple as a single character or as complex as the abstract syntax tree of program code or the tree structure of an XML document.
For handwriting-based programming systems: what a compiler or interpreter cares about is mainly symbols. Whether a symbol corresponds to a word or a drawing does not affect compilation or interpretation. What matters greatly in this process is symbol matching. In a handwriting data processing system, therefore, as long as the graphical matching of text content is done well and matching content is given the same code, the existing programming language infrastructure can be reused. Graphical matching is of two kinds: keyword matching and identifier matching. Keyword matching yields a system keyword (for traditional programming languages, generally a standard code); identifier matching yields the same custom code or extended code.
As for programming languages, the vast majority currently use text files. The method above can likewise be used to object-encode program source code, which brings the following benefits:
1. Smaller files. This is especially important for source code that is transmitted over the network, such as JavaScript.
2. Programming with non-standard codes becomes possible, which enables handwriting programming, voice programming, and the like.
3. The security features of open encoding can be used: the codes in the source can be placed in a context space belonging to the author or copyright holder, so that only authorized users can use them.
4. When parsing source code whose keywords are open-encoded, lexical scanning and analysis of keywords reduces to direct code recognition, which is more efficient.
As with the object encoding of most text files, object encoding of program source code takes place mainly at the tool level and is completely transparent to the end user.
Open encoding itself also opens new possibilities for programming languages. Software can be built in an entirely new way: data can reside in the code repository and be referenced directly from programs; programs can also reside in the code repository and be referenced by code; and data can be mixed with programs in some form.
Regarding machine instruction encoding: the code repository is in effect a natural codebook. Data encoded through the code repository is highly secure, so the repository can be used to encode not only text but also binary data. A typical application is context-dependent object encoding of machine instructions. The binary file of the same application is then completely different for different users, and a user cannot run another user's executable. This is in effect a digital rights protection scheme for applications. It can also help prevent viruses or malicious programs from tampering with executables.
Such a scheme is implemented mainly by modifying the program execution engine or virtual machine. Taking the Java virtual machine as an example: the standard JVM opcodes are re-encoded per user by some method (such as a random algorithm), stored in the code repository, and given appropriate protection; executable Java bytecode is encoded using the re-encoded opcodes; and during execution the JVM dynamically restores the current bytecode to standard opcodes according to the current user's information. Only the corresponding user can then execute the bytecode correctly.
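The per-user re-encoding can be sketched in Python as a permutation of opcode values. This is an illustration only: a PRNG seeded with the user ID stands in for whatever random method and protected storage a real repository would use, and the sketch works on a list of operand-free opcodes rather than parsing real class files, where only opcode bytes (not operands) would be remapped.

```python
import random

def user_opcode_table(user_id: int):
    """Per-user permutation of the 256 opcode values (standard -> personal)."""
    table = list(range(256))
    random.Random(user_id).shuffle(table)  # illustration only, not secure
    return table

def invert(table):
    """Build the personal -> standard table the VM would use at run time."""
    inverse = [0] * 256
    for standard, personal in enumerate(table):
        inverse[personal] = standard
    return inverse

def recode(opcodes, table):
    return [table[op] for op in opcodes]

# Standard JVM opcodes: iconst_1, istore_1, iload_1, ireturn (none take operands).
standard = [0x04, 0x3C, 0x1B, 0xAC]
table = user_opcode_table(0x0C3F)
personal = recode(standard, table)            # what is stored in the user's file
restored = recode(personal, invert(table))    # what the VM executes
```

A virtual machine holding a different user's table would map the same stored bytes to other opcodes, so the file would not run correctly for that user.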
Regarding binary format encoding: as with executables, some or all of the key information in other binary data files can be placed in the code area to provide copyright protection, so that only authorized users can obtain the key information and use the binary data.
Take video files as an example. Many video file formats are really container formats that can hold video and audio streams in different encodings. The industry commonly identifies the encoding with a four-byte code called a "FourCC", which the video player uses to select the right decoder for decoding and playing the streams. Several hundred FourCCs are currently registered. The FourCC in a video file can be replaced with an object code, with the real stream encoding identifier stored in the corresponding code repository store. By controlling access rights in the code repository, playback of the video file or stream can then be controlled.
Regarding data compression: the code repository can also provide compression, by placing repeated portions of the data in the code area and using the corresponding open codes in their place.
Regarding online digital stores: as we have seen, the security mechanisms built into the object code repository make digital rights management, identity authentication, and the like easy to implement on top of the repository. This can be applied to building online digital stores.
An online digital store is an application system that provides digital content transaction services to network users; app stores and electronic libraries fall into this category. Its users are mainly of two kinds: providers of digital content and consumers of digital content. The store can be built directly on the code repository, with all users being repository users; associating digital content with user-related contextual codes then makes the repository's built-in security available.
Specifically, consumers consume digital content in two main modes: rental and purchase.
In the rental mode, the digital content or asset is owned by the provider, and the consumer obtains only temporary access or usage rights in some way (usually by paying). Rented content is generally time-limited, and content past its term is no longer accessible to the consumer. Embedding provider-related contextual codes in the digital content makes rental access control possible: access is authorized according to each user's rental period.
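A brief Python sketch of rental-period authorization. The class `RentalSpace`, its methods, and the idea that the protected item is a single content key are assumptions for illustration; the text only states that access is granted according to each user's rental period.

```python
from datetime import datetime, timedelta

class RentalSpace:
    """Toy provider-owned code space whose content resolves only during a lease."""

    def __init__(self, content_key: bytes):
        self._key = content_key   # the protected information behind the code
        self._leases = {}         # user -> expiry time

    def rent(self, user, start: datetime, days: int):
        self._leases[user] = start + timedelta(days=days)

    def resolve(self, user, now: datetime) -> bytes:
        expiry = self._leases.get(user)
        if expiry is None or now >= expiry:
            raise PermissionError("no valid lease for this user")
        return self._key

space = RentalSpace(b"stream-key")
start = datetime(2015, 8, 11)
space.rent("bob", start, days=7)
key_during_lease = space.resolve("bob", start + timedelta(days=3))
```

After the expiry time the same call raises `PermissionError`, so a copy of the content kept by the consumer stops being usable once the lease ends.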
购买模式是指消费者通过某种方式(如付费购买)获得了数字内容的所用权。那么这里主要是数字版权保护的问题——防止非法拷贝的产生。利用编码仓库的具体实现就是,在用户购买到的数字内容中置入该用户个人空间中的特殊上下文编码。该编码只能由该用户访问,且该用户无法更改编码访问规则。这样,其他用户即使获得了同样内容的数字拷贝,也无法正常使用。The purchase model refers to the right of the consumer to obtain digital content in a certain way (such as paid purchase). So here is mainly the issue of digital copyright protection - preventing the creation of illegal copies. A specific implementation of the use of the code repository is to place a special context code in the user's personal space in the digital content purchased by the user. This encoding can only be accessed by this user, and the user cannot change the encoding access rules. In this way, other users will not be able to use the digital copy of the same content.
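The two consumption modes can be sketched as a single access check on the context encoding embedded in the content. The `ContextCode` class, its fields, and the use of an expiry timestamp per user are hypothetical; the source only states that access is authorized per rental period (rental) or restricted to the purchasing user (purchase).

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ContextCode:
    """Hypothetical context encoding embedded in a piece of digital content."""
    owner: str                                   # user whose space holds the encoding
    leases: dict = field(default_factory=dict)   # user -> rental expiry (rental mode only)

    def may_access(self, user: str, now: datetime) -> bool:
        if user == self.owner:
            return True                          # the owner can always access the encoding
        expiry = self.leases.get(user)
        # Other users need an unexpired rental; with no lease, access is refused.
        return expiry is not None and now <= expiry
```

In rental mode the owner is the provider and each consumer has a lease entry. In purchase mode the owner is the buyer and the lease table stays empty, so a copy held by anyone else cannot be decoded.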
As can be seen from the above description, the core of the object-encoding-based data processing system is the encoding repository (or encoding library). The metadata of the various encodings can be stored in it, and so can the actual content of the text. Through the services provided by the repository, the new text input system can convert text content, or other content (such as user interaction content, domain-specific content, or application content), into text encodings that are stored and processed by application systems. While the text encodings are generated, part or all of the text content is stored in the repository. Likewise, through the repository's services, the new text output system can convert strings sent by an application into text content that can be rendered or played, or into an object model that the application can use.
Of course, the encoding repository is not limited to a single storage body or storage space. In a broad sense, an encoding repository can be a combination of multiple storage bodies, or even cloud storage service providers reached over different secure channels in cloud storage.
Metadata. In the new system, whether for encoding-layer processing or text data processing, the encoding and decoding systems or functions are the cornerstone. As the core of the new encoding system, the encoding repository provides at least two basic services. The first, called the encoding service, receives the content to be encoded, ensures that the content is correctly stored in the repository, and returns the corresponding encoding. The encoding system uses this service to obtain the correct text encoding. The second, called the decoding service, returns the content item corresponding to a given encoding. The decoding system needs this service to obtain content that the output system can output correctly. Of course, for a single-user system, the encoding and decoding functions or services can also be provided directly on the client side rather than on the repository side.
FIG. 44 is a schematic structural diagram of a first embodiment of an encoding processing system according to the present invention. As shown in FIG. 44, the encoding processing system includes a receiving unit 11C, a metadata extraction unit 12C, a meta-encoding generation unit 13C, an encoding specification selection or creation unit 14C, an instance encoding generation unit 15C, and an object encoding generation unit 16C. Specifically, the receiving unit 11C is configured to receive an encoding processing request and to acquire, according to the request, the data object to be encoded; the metadata extraction unit 12C is configured to acquire metadata from the data object to be encoded; the meta-encoding generation unit 13C is configured to query the encoding repository according to the metadata and acquire the meta-encoding corresponding to the metadata; the encoding specification selection or creation unit 14C is configured to select or create the corresponding encoding specification according to the meta-encoding; the instance encoding generation unit 15C is configured to encode the data content of the data object according to the encoding specification to obtain an instance encoding; and the object encoding generation unit 16C is configured to obtain, from the meta-encoding and the instance encoding, the object encoding corresponding to the data object.
In this embodiment, the encoding processing system can carry out the technical solutions of the method embodiments shown in FIG. 5C and FIG. 5D; its implementation principle and effects are similar and are not repeated here.
Further, the encoding processing system may also include a data compression unit, configured to compress data before it is transmitted or stored (the corresponding compression processing can be described or reflected in the encoding specification), and an encryption unit, configured to encrypt data objects or encodings that require encryption.
FIG. 45 is a schematic structural diagram of a first embodiment of a decoding processing system according to the present invention. As shown in FIG. 45, the system includes a receiving unit 21C, a disassembling unit 22C, an acquisition unit 23C, and a restoring unit 24C. The receiving unit 21C is configured to receive a decoding processing request and to acquire, according to the request, the object encoding to be decoded; the disassembling unit 22C is configured to disassemble the object encoding to obtain the meta-encoding, or the meta-encoding and the instance encoding; the acquisition unit 23C is configured to query the encoding repository and acquire the metadata and encoding specification corresponding to the meta-encoding; and the restoring unit 24C is configured to obtain the data object corresponding to the object encoding from the metadata and the encoding specification, or from the metadata, the encoding specification, and the instance encoding.
In this embodiment, the decoding processing system can carry out the technical solution of the method embodiment shown in FIG. 32; its implementation principle and effects are similar and are not repeated here.
Further, corresponding to the encoding processing system, the decoding processing system may also include a corresponding data decryption unit, data decompression unit, and so on.
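The flow through the units of the encoding and decoding processing systems can be sketched as a round trip. Everything concrete here is an assumption made for illustration: the in-memory `EncodingRepository`, the use of the Python type name as metadata, and JSON as the only encoding specification. The optional compression and encryption units are omitted.

```python
import json


class EncodingRepository:
    """Hypothetical in-memory repository mapping metadata to meta-encodings."""

    def __init__(self):
        self._meta = []  # list index = meta-encoding, value = (metadata, spec name)

    def meta_encoding_for(self, metadata: str, spec: str = "json") -> int:
        entry = (metadata, spec)
        if entry not in self._meta:
            self._meta.append(entry)
        return self._meta.index(entry)

    def lookup(self, meta_encoding: int):
        return self._meta[meta_encoding]


# Encoding specifications: name -> (encode function, decode function).
SPECS = {"json": (json.dumps, json.loads)}


def encode(repo, obj):
    metadata = type(obj).__name__            # metadata extraction unit
    meta = repo.meta_encoding_for(metadata)  # meta-encoding generation unit
    _, spec = repo.lookup(meta)              # encoding specification selection unit
    instance = SPECS[spec][0](obj)           # instance encoding generation unit
    return (meta, instance)                  # object encoding generation unit


def decode(repo, object_encoding):
    meta, instance = object_encoding         # disassembling unit
    metadata, spec = repo.lookup(meta)       # acquisition unit
    return metadata, SPECS[spec][1](instance)  # restoring unit
```

An object encoding is modeled as the pair (meta-encoding, instance encoding), so decoding needs only the repository lookup plus the instance part.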
In this embodiment, a word processing system based mainly on the object encoding system is taken as an example and described in detail. FIG. 46 is a schematic architecture diagram of such a word processing system. As shown in FIG. 46, the new system is broadly divided into two parts: the encoding repository and the corresponding processing systems.
Encoding repository (encoding library). The encoding repository can include two parts: the encoded data, and the services built around that data.
Specifically, as the encoding model of the open encoding shows, the model can easily be implemented with an object-based approach. Because encodings are persistent, an object database can be used, or the objects can be stored in various databases through object-relational mapping.
Regarding the encoding service: the encoding service is in effect the process by which the encoding repository receives object data, stores it in the repository, and returns the corresponding encoding. As the encoding model above shows, this encoding has two parts: the meta-encoding and the instance encoding. For the more common short-word-length encodings, two corresponding sub-services are generally provided.
Regarding the sub-service for registering encoding meta-objects: once a client has obtained a registered named encoding space, it can register encoding types in that space. An encoding type includes the target encoding space of the corresponding encodings, which is in fact specified by the meta-encoding space corresponding to the type data. On receiving a registration request, the encoding repository verifies the security and legitimacy of the request according to the system and user settings. Once verification passes, it returns the corresponding encoding to the client.
A named encoding space is not the only target space for registering an encoding type; the client can also register directly with the root space of the repository. As with registering a type in a named encoding space, the repository places the encoding type in a specific encoding space according to the system and user settings, and returns the full encoding space path and the type encoding to the client.
Regarding the object encoding sub-service: when a client submits an encoding request to the repository, it must also supply the corresponding meta-encoding and type encoding. The repository stores the object in the data store corresponding to the encoding type and returns the object's position in that store to the client.
Regarding the decoding service: conversely to the encoding service, the decoding service is the process by which the repository receives an encoding and returns the corresponding data object to the client.
Specifically, the repository provides two groups of decoding sub-services. In the short-word-length implementation of the decoding service, a simple constraint is imposed: the meta-encoding and the instance encoding are each represented by a separate code point, and an instance encoding can appear only after a meta-encoding. The decoding service can thus be carried out by two sub-services.
Regarding the meta-encoding decoding sub-service: when a client asks the repository to decode a meta-encoding in a specific encoding space (the root space if none is specified), the repository first performs a security check to see whether the current context object satisfies the system security settings. If the security settings are satisfied, the encoding metadata of that meta-encoding in the specified encoding space is returned to the client. This encoding metadata includes the type information corresponding to the type encoding and the target encoding space of the corresponding encoding instances. If the corresponding type is itself an encoding metadata type, its encoding space is a subspace of the current space.
Regarding the sub-service for decoding encoded objects: similarly, after obtaining the encoding metadata, the client can ask the repository to decode a specific encoding of a specific encoding type in a specific encoding space. If the security settings are satisfied, the repository returns the object data at the position corresponding to the encoding to the client.
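The four sub-services described above (type registration, object encoding, meta-encoding decoding, and object decoding) can be sketched as one small class. The class and method names, the dictionary-based stores, and the allow-list standing in for the security check are all hypothetical; the source does not prescribe an API.

```python
class Repository:
    """Hypothetical repository offering the encoding and decoding sub-services."""

    def __init__(self, allowed_users):
        self.allowed = set(allowed_users)  # stand-in for the security check
        self.types = {}   # (space, type code) -> type name (encoding metadata)
        self.store = {}   # (space, type code) -> list of stored objects

    def _check(self, user):
        if user not in self.allowed:
            raise PermissionError(user)

    def register_type(self, user, type_name, space="root"):
        """Register an encoding type; return the space path and type encoding."""
        self._check(user)
        type_code = sum(1 for s, _ in self.types if s == space)
        self.types[(space, type_code)] = type_name
        self.store[(space, type_code)] = []
        return space, type_code

    def encode_object(self, user, space, type_code, obj):
        """Store the object and return its position in the type's data store."""
        self._check(user)
        items = self.store[(space, type_code)]
        items.append(obj)
        return len(items) - 1

    def decode_meta(self, user, type_code, space="root"):
        """Return the encoding metadata for a type encoding in a space."""
        self._check(user)
        return self.types[(space, type_code)]

    def decode_object(self, user, space, type_code, instance_code):
        """Return the object stored at the position given by the instance encoding."""
        self._check(user)
        return self.store[(space, type_code)][instance_code]
```

Every call passes through the same check first, mirroring the statement that decoding results are returned only when the security settings are satisfied.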
Regarding the content caching service: it can be implemented by applying object encoding to encoding repositories themselves. Specifically, object encodings for one or more other encoding repositories are created within one encoding repository; the content of such a repository object is mainly a reference to the target repository, such as a URL or a connection string. Each target repository then corresponds to one encoding space. In this way, once a cache repository is configured, the content caching service can, during encoding and decoding, act as a caching proxy and store the target encodings and their content in the encoding space of the cache repository that corresponds to the target repository.
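A minimal sketch of the caching proxy follows. The `CachingRepository` class is hypothetical, and a plain callable stands in for the reference to the target repository (the URL or connection string mentioned above), since the source does not describe the transport.

```python
class CachingRepository:
    """Hypothetical cache repository: one encoding space per target repository."""

    def __init__(self):
        self.targets = {}      # space name -> fetch function for the target repository
        self.cache = {}        # space name -> {encoding: content}
        self.remote_calls = 0  # counts how often a target repository was contacted

    def add_target(self, space, fetch):
        """Register a target repository under its own encoding space."""
        self.targets[space] = fetch
        self.cache[space] = {}

    def decode(self, space, code):
        """Return content for an encoding, contacting the target only on a miss."""
        cached = self.cache[space]
        if code not in cached:
            self.remote_calls += 1
            cached[code] = self.targets[space](code)
        return cached[code]
```

Repeated decoding of the same encoding is served from the space that mirrors the target repository, so the target is contacted once per encoding.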
Regarding the environment-aware authorized access system: the security of the new system rests mainly on the repository's authorized access service. All other services of the repository are provided on top of the authorized access service.
Unlike a general authorized access system, the granularity of authorized access in the encoding repository can be very fine, down to a single specific encoding. Moreover, every use of an encoding has a specific context, such as the author and reader of the encoding, and the application or document that uses it. Based on this context model and its related extension models, various rules can be defined to configure access to the encoding services within the repository.
Implementing an environment-aware (context-aware) authorized access system poses no technical difficulty; conventional rule-based system techniques are sufficient.
Apart from the system defaults, the access authorization rule base is populated mainly by system administrators and by encoding authors, who set the rules for access to their own encodings.
Authorization rules are defined on the basis of the encoding model and the encoding context model, for example the encoding type, encoding space, encoding context, time, location (GPS), encoding author, and encoding reader. In addition, an application system that uses the repository can supply the repository with extension models of the encoding context, and access rules can be built on all of these models.
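A rule base of this kind can be sketched as an ordered list of condition/decision pairs evaluated against the encoding context. The rule format, the attribute names (`reader`, `code_type`, `code_space`), first-match semantics, and the default-deny fallback are assumptions for illustration, not details given in the source.

```python
def is_allowed(rules, context):
    """Return the decision of the first rule whose conditions all match the context."""
    for conditions, decision in rules:
        if all(context.get(key) == value for key, value in conditions.items()):
            return decision
    return False  # assumed system default: deny when no rule matches


# Example rule base: author-defined rules first, then a catch-all default.
rules = [
    ({"reader": "alice", "code_type": "handwriting"}, True),
    ({"code_space": "private"}, False),
    ({}, True),  # an empty condition set matches every context
]
```

Because the context is an open dictionary, an application can add its own attributes (the extension models mentioned above) without changing the evaluator.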
Applications that can be combined with the object-based, context-dependent encoding scheme of the present invention also include, but are not limited to, handwritten login, security authentication models, text services, and text encoding/decoding serialization services.
In addition, unlike the encoding and decoding services for encodings described above, the text encoding/decoding serialization service converts between objects in the application system and encodings. It is built on top of the repository's encoding and decoding services, and is in effect the content encoding service for data objects. The main difference between text encoding/decoding and repository encoding/decoding lies in the model to which the data corresponds: text encoding/decoding corresponds to the application model, whereas repository encoding/decoding corresponds to the storage model. In some cases, of course, the two models are identical.
Regarding text input and output services: as mentioned earlier, the new data processing system has two main encoding capabilities, the encoding of personalized text and the re-encoding of conventional text data. The text input and output services discussed here mainly concern the former. Input and output of the latter are handled mainly by the "universal display and editing service" described below.
Common personalized text consists mainly of handwritten text and spoken text. It can, of course, also be any other form of text that can be stored and transmitted by a computer system, such as sign language, gestures, semaphore, or lip language.
Handwritten text is used here to illustrate how personalized text differs from conventional computer text.
Personalized handwritten text takes many forms. Depending on the input method, it may be graphic/stroke information entered directly into the computer system, called online handwriting, or a scanned image of conventional writing on paper, called offline handwriting. Depending on stroke detail, there is hard-pen handwriting, soft-pen handwriting, and so on.
The most essential difference between such personalized handwritten text and existing handwriting input is that personalized text uses personalized encodings, which vary from person to person and need not be recognized as standard encodings. The input and output of personalized text is therefore mainly a process of natural writing, in which the computer must adapt as far as possible to the individual's writing habits and preserve the written result to the greatest extent. This is the exact opposite of conventional keyboard input, in which the human adapts to the computer.
Personalized handwritten text is output mainly by display on a computer screen, and of course by subsequent printing and the like. It is input mainly by writing directly on a touch screen with a finger or a pen device. Two natural writing constraints ensure that what is entered is text rather than graphics:
1. An overall layout constraint based on rows or columns. When entering text, the user must first activate a target row (or column; both are referred to below as rows) in some way, and only then can input be made in that row. This lets the text input system determine the overall order of the text efficiently.
2. An inline layout constraint based on spacing. Within a row, the text input system must be able to identify the most basic text units so that text can be stored, encoded, and reused effectively. In a phonetic data processing system, the distance between words is usually clearly larger than the spacing between the letters or components within a word. The word can therefore serve as the most basic text unit of the corresponding data processing system, and the words in a row can be separated by analyzing the spacing. The lengths of the gaps are encoded as well, so that the text content can be played back correctly. In that case, even if the result of the spacing analysis is not entirely correct (mainly because the process differs from human recognition, lacking letter recognition and semantic analysis), the output is still identical to the input. To handle errors in the spacing analysis, the text input system can also provide tools for correcting its results. In an ideographic data processing system, individual characters are of similar size, and the spacing between characters is uniform and small. In this case, the text input system can add an auxiliary grid to help it separate the characters. For Chinese characters, for example, guide lines in the form of a composition grid can be provided during input to help the user place each character in its own cell, and the character spacing analysis can then separate characters on the basis of that grid. This is called the composition-grid layout constraint. In fact, typesetting rules vary greatly between cultures and often differ from language to language. The new system can provide different input and output systems for different languages and cultures.
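The two segmentation strategies in the spacing constraint can be sketched as follows. The representation of a stroke as a horizontal extent `(x_min, x_max)`, the fixed gap threshold, and the function names are assumptions for illustration; real stroke data would carry full point sequences.

```python
def segment_by_gaps(strokes, gap_threshold):
    """Split one row of strokes into words wherever the horizontal gap is large.

    strokes: list of (x_min, x_max) extents. Returns (words, gaps), where gaps
    holds the width of each inter-word gap so it can be encoded for playback.
    """
    words, gaps = [], []
    current, right_edge = [], None
    for x_min, x_max in sorted(strokes):
        if current and x_min - right_edge > gap_threshold:
            words.append(current)            # close the current word
            gaps.append(x_min - right_edge)  # record the gap length
            current = []
        current.append((x_min, x_max))
        right_edge = x_max if right_edge is None else max(right_edge, x_max)
    if current:
        words.append(current)
    return words, gaps


def segment_by_grid(strokes, cell_width):
    """Assign each stroke to the grid cell containing its horizontal center."""
    cells = {}
    for x_min, x_max in strokes:
        index = int((x_min + x_max) / 2 // cell_width)
        cells.setdefault(index, []).append((x_min, x_max))
    return [cells[i] for i in sorted(cells)]
```

Keeping the gap widths alongside the words is what lets playback reproduce the input exactly even when a word boundary was guessed wrongly.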
Regarding universal display and editing of mixed encodings: a major benefit of a data processing system based on standard encodings is readability, that is, people can understand the text content. This readability rests on the encoding standard being widely supported by hardware and software systems. The most widely supported encoding standard is ASCII.
The new data processing system can be fully compatible with existing encoding standards, for example through the support for UTF encodings via OTF encoding mentioned earlier. Beyond display support for UTF standard text, a universal text display and editing service can be provided for directly displaying and editing open-encoded text. The display and editing meant here is neither full text display and editing nor binary display and encoding, but a general-purpose service in between. The service has the following features:
1. It can correctly display and edit UTF standard text;
2. For non-UTF encodings, it can display and edit the encoding type ID (including the ID of the space type) and the number corresponding to the encoding;
3. For some commonly used public open encodings, such as XML, JSON, HTML, and SVG, it directly displays and edits their original text content.
This universal text display and editing service supports conventional text input and output: a monochrome text terminal (reverse video can be used to distinguish encodings from their content) and a keyboard (the encoding-editing state can be distinguished from the content-editing state). It mainly serves developers and system maintainers, allowing them to view and modify text data in the conventional way.
The universal text display and editing service is an important guarantee that the new system remains human-readable.
Regarding matching of repository content (service): taking personalized handwritten content as an example, normalization of repository content amounts to shape matching.
Matching techniques for graphics and images are now fairly mature, and a wide variety of algorithms exist for matching glyphs: methods based on stroke curve fitting, contour-based methods, matching based on feature analysis, machine-learning-based methods, and so on. They are not described further here. In addition, because the present invention can record the time and position of each input stroke, it can also use stroke timing and position information to match input content.
Regarding normalization of repository content: normalization builds on content matching to ensure that identical or similar content corresponds to exactly one encoding. Taking personalized handwritten content as an example, the ideal result is that the same user's handwriting of the same content always corresponds to the same encoding in the repository.
Normalization of repository content can be performed automatically according to a set threshold, or interactively with the user. Taking personalized handwriting as an example, when the user's written content is submitted to the repository, the repository finds all glyphs of similar shape and lets the user confirm whether to normalize, and which glyph to use after normalization.
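Threshold-based and interactive normalization can be sketched with one function. The list-backed repository, the caller-supplied `distance` function (a stand-in for whichever shape-matching method is used), and the optional `confirm` callback are hypothetical names introduced for illustration.

```python
def normalize(repository, glyph, distance, threshold, confirm=None):
    """Return the encoding for glyph, reusing a similar stored glyph when possible.

    repository: list of stored glyphs; the list index serves as the encoding.
    distance:   function giving the dissimilarity of two glyphs (assumed supplied).
    confirm:    optional callback asking the user whether to reuse an encoding.
    """
    candidates = [code for code, stored in enumerate(repository)
                  if distance(glyph, stored) <= threshold]
    if candidates:
        best = min(candidates, key=lambda code: distance(glyph, repository[code]))
        if confirm is None or confirm(best):
            return best              # normalized onto an existing encoding
    repository.append(glyph)         # no acceptable match: allocate a new encoding
    return len(repository) - 1
```

With `confirm=None` the function behaves as the automatic, threshold-only variant; passing a callback gives the interactive variant in which the user decides.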
Regarding searching and matching of object encodings: conventional string pattern matching algorithms can be applied directly. Two points should be noted, however:
1. A simple binary comparison cannot be used to decide whether an encoding in the source string equals one in the target string; instead, the encoding space, encoding type, and instance encoding of the source and target encodings must all be identical.
2. Spacing encodings (the blanks between characters) in the source and target strings can simply be ignored.
Existing string matching algorithms, such as the classic KMP algorithm, can therefore be used in the new data processing system with only minor modification. It is worth noting that searching for object encodings does not require the text content corresponding to the encodings, only their encoding metadata, mainly the encoding type information and the encoding space information.
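The two adaptations can be sketched on top of the standard KMP algorithm. Modeling an object encoding as a tuple `(space, type, instance)` and marking spacing encodings with the type name `"gap"` are assumptions for illustration; tuple equality then compares space, type, and instance together, as point 1 requires.

```python
def strip_gaps(codes):
    """Drop spacing encodings, which are ignored during matching (point 2)."""
    return [code for code in codes if code[1] != "gap"]


def find(source, target):
    """Index in the gap-stripped target of the first match of source, or -1."""
    pattern, text = strip_gaps(source), strip_gaps(target)
    if not pattern:
        return 0
    # Build the KMP failure table for the pattern.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, comparing whole (space, type, instance) tuples.
    k = 0
    for i, code in enumerate(text):
        while k and code != pattern[k]:
            k = fail[k - 1]
        if code == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

Only encoding metadata is compared; the glyph content behind each encoding is never loaded.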
Regarding retrieval of object encodings: as with searching and matching, retrieval of object encodings can be built entirely on existing retrieval methods, which likewise need to be adapted to the characteristics described above.
Regarding searching for personalized text by input: in the new data processing system, all encoded content can be stored in the repository, so searching for user input can be optimized on the basis of the repository's content normalization service. The search proceeds as follows:
1. Enter the text content to be found (the source text) through the text input system;
2. The encoding repository performs normalization matching on the source text;
3. If the source text contains a new encoding (an unmatched encoding), return search failure immediately;
4. If the source text contains a text encoding that does not appear in the target text, return search failure immediately;
5. Search the target encodings for the encoding string corresponding to the text being sought.
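The five steps above can be sketched as one function. The `match` callback is a hypothetical stand-in for the normalization matching service (it returns an existing encoding or `None`), and a plain sliding comparison is used for step 5 to keep the sketch short.

```python
def search(query_glyphs, target_codes, match):
    """Return the index of the query in target_codes, or -1 on failure.

    query_glyphs: glyphs entered by the user (step 1).
    match:        function mapping a glyph to an existing encoding or None.
    """
    codes = [match(glyph) for glyph in query_glyphs]   # step 2: normalization matching
    if any(code is None for code in codes):            # step 3: new (unmatched) encoding
        return -1
    if not set(codes) <= set(target_codes):            # step 4: encoding absent from target
        return -1
    n = len(codes)                                     # step 5: find the encoding string
    for i in range(len(target_codes) - n + 1):
        if target_codes[i:i + n] == codes:
            return i
    return -1
```

Steps 3 and 4 are the optimization: both reject a query without scanning the target sequence position by position.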
Regarding recognition of personalized text: recognizing personalized text is a subset of conventional text recognition. The recognition results can be stored in the repository. Note that one encoding may have several recognition results; for example, the capital letter I may correspond to the digit 1 or to the lowercase letter l. This also occurs in conventional text recognition. The conventional recognition process needs only slight modification: the character or word recognition information in the repository is combined to recognize whole sentences and whole documents.
Regarding multi-level output systems: the encoding repository for object encodings places no restriction on the text content corresponding to an encoding. Two situations can therefore arise:
1. The text content corresponding to an encoding is vectorized/parameterized information that can produce different output under different conditions/parameters;
2. The same encoding may correspond to multiple pieces of text content.
Either situation means that the repository's decoding service must use some content selection mechanism. In the first case, the repository dynamically generates the corresponding content according to the information in the decoding request. In the second case, the repository selects the most suitable text content according to the system settings and the decoding request.
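A content selection mechanism covering both cases might look like the sketch below. The representation is an assumption made for illustration: parameterized content is modeled as a callable taking the request, stored variants as a dictionary keyed by resolution in dpi, and 96 dpi as the system default.

```python
def decode_content(entry, request):
    """Select or generate the content for one encoding.

    entry is either a function of the request parameters (case 1, parameterized
    content) or a dict of stored variants keyed by resolution in dpi (case 2).
    """
    if callable(entry):
        return entry(request)                  # case 1: generate content dynamically
    wanted = request.get("dpi", 96)            # 96 dpi is an assumed system default
    best = min(entry, key=lambda dpi: abs(dpi - wanted))
    return entry[best]                         # case 2: pick the closest stored variant
```

The same decoding call thus serves both a vector glyph that is rendered on demand and an encoding that carries several pre-rendered versions.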
Regarding visual touch editing of personalized text: the new data processing system makes mixed visual editing and typesetting of personalized and traditional text possible. Traditional visual text editing was designed around the keyboard as the main editing device, and rests on two core concepts:
1. Input focus: the position at which text is currently inserted or overwritten. In the text stream it is a one-dimensional position coordinate; in the visual editing area it corresponds to a two-dimensional coordinate (row and column). Its position is usually shown by a blinking cursor. It is moved with the arrow keys, and systems that support a pointing device (such as a mouse) can also position the focus directly with that device.
2. Selected text (the text to be operated on). In the text stream it is a pair of one-dimensional position coordinates. In general, the input focus and selected text cannot exist at the same time; the input focus can be understood as selected text of length zero. Selected text is usually shown in reverse video or highlighted. With the keyboard, the start and end of a selection are defined mainly by combining arrow keys with specific function keys. With a pointing device such as a mouse, text is selected mainly by pressing and holding, dragging, and releasing.
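The relationship between the two concepts can be illustrated with a minimal sketch in which the input focus is simply a selection of length zero. The names `Selection`, `to_row_col` and `apply_input`, and the fixed-width grid used to map a stream offset to a row and column, are assumptions for illustration and are not taken from the description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Selection:
    """A pair of one-dimensional positions in the text stream."""
    start: int
    end: int

    def __post_init__(self) -> None:
        if not 0 <= self.start <= self.end:
            raise ValueError("expected 0 <= start <= end")

    @property
    def is_focus(self) -> bool:
        # The input focus is a selection of length zero.
        return self.start == self.end


def to_row_col(offset: int, columns: int) -> Tuple[int, int]:
    """Map a stream offset to (row, column), assuming a fixed-width grid."""
    return divmod(offset, columns)


def apply_input(text: str, sel: Selection, typed: str) -> Tuple[str, Selection]:
    """Insert at the focus, or replace the selected text, with `typed`."""
    new_text = text[:sel.start] + typed + text[sel.end:]
    caret = sel.start + len(typed)
    return new_text, Selection(caret, caret)
```

Treating both cases through one type means that inserting and overwriting share a single code path, which matches the view that the focus is a degenerate selection.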
Traditional WYSIWYG visual text editing is built on applying commands to selected text. This kind of user interface is not natural on the increasingly common touch devices, and handwriting input fits poorly with existing visual editing. Touch devices, by contrast, are a very natural input device for handwritten text. Building on existing visual text editing, we therefore introduce input modes to allow switching between input methods, and in touch input mode we extend the "input focus" to an area, which improves visual text editing on touch devices. The input modes and input area introduced by the present invention are as follows.
1. Input mode. In addition to the original keyboard input mode, we also allow a handwriting input mode. Input always takes place in exactly one of these two modes, and the user can switch freely between them. In keyboard input mode, the user types text directly with a keyboard (virtual keyboard or numeric keypad) and uses the traditional visual editing interface. In touch input mode, the user inputs in a specific area with a touch device (stylus or finger) and uses a touch-friendly visual editing interface.
2. Input area (the input panel). It is valid only in handwriting input mode and corresponds to the input focus of keyboard input mode. Unlike the input focus of a traditional editing system, the input area corresponds not to a one-dimensional position coordinate but to a two-dimensional region of the editing display. In handwriting input mode the user writes directly in the input area; the written text is presented in WYSIWYG fashion and takes part in typesetting and editing. The input area carries line information corresponding to the current text layout, so that text written in the area maps directly to its position after typesetting. Absent other restrictions, the most direct and natural input area is the display region occupied by a row or a column. The user can change the current input area by tapping outside it, or move it directly with movement commands.
Regarding typesetting: different languages, cultures and scripts have different typesetting rules. For example, Arabic is set horizontally, from top to bottom and from right to left, while traditional Chinese is set vertically, from right to left and from top to bottom. Personalized text must likewise follow the corresponding typesetting rules.
Whatever the typesetting rule, wrapping within a paragraph is based on accumulating character lengths. Like standardized text, personalized text based on open coding carries length information. Unlike standardized text encodings, however, it has no dedicated fixed-length space character; instead it has a space character whose length can vary (the space length is a parameter of the code).
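Wrapping by accumulated length with variable-length spaces can be sketched as a simple greedy procedure. This is an illustrative assumption rather than the typesetting algorithm of the invention; the names `Glyph` and `wrap` are hypothetical, and a space is modelled as an ordinary glyph whose `width` is the length parameter of its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """A coded character with its advance along the line direction."""
    code: int
    width: float
    is_space: bool = False  # a space is a glyph whose width is a parameter


def wrap(glyphs: List[Glyph], line_width: float) -> List[List[Glyph]]:
    """Greedy wrap: start a new line when the accumulated width would overflow."""
    lines: List[List[Glyph]] = []
    current: List[Glyph] = []
    used = 0.0
    for glyph in glyphs:
        # A glyph wider than the whole line still gets a line of its own.
        if current and used + glyph.width > line_width:
            lines.append(current)
            current, used = [], 0.0
        current.append(glyph)
        used += glyph.width
    if current:
        lines.append(current)
    return lines
```

The same loop serves horizontal and vertical layouts, since only the meaning of "width" (advance along the row or along the column) changes.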
In addition, punctuation often takes part in typesetting. In handwritten text, however, punctuation does not necessarily need to be recognized, so personalized punctuation marks are often merged with other characters and treated as ordinary characters.
Two typical typesetting algorithms are given below; algorithms for other typesetting rules can be derived by modifying them.
Regarding input: in handwriting input mode, the user can write directly in the input area. The input does not need to be recognized; it is converted directly into personalized text based on open coding. In this process the characters and the gaps between them must be identified, and the typesetting rules also constrain this identification.
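One simple way to identify characters and the gaps between them, offered only as a hedged sketch, is to group strokes whose extents along the line direction lie closer together than a threshold. The function `group_strokes`, the representation of a stroke as an `(x_min, x_max)` pair, and the single `max_gap` threshold are assumptions for illustration.

```python
from typing import List, Tuple

Extent = Tuple[float, float]               # (x_min, x_max) of one stroke
CharBox = Tuple[float, float, int]         # (x_min, x_max, stroke count)


def group_strokes(strokes: List[Extent], max_gap: float) -> List[CharBox]:
    """Group strokes into characters by their gap along the line direction."""
    chars: List[CharBox] = []
    for x_min, x_max in sorted(strokes):
        if chars and x_min - chars[-1][1] < max_gap:
            # Close enough to (or overlapping) the previous character: merge.
            left, right, count = chars[-1]
            chars[-1] = (left, max(right, x_max), count + 1)
        else:
            # A gap of at least max_gap starts a new character.
            chars.append((x_min, x_max, 1))
    return chars
```

The gaps left between the resulting character boxes can then be stored as variable-length spaces.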
Regarding deployment schemes for the object coding system: a computer data processing system based on open coding separates object codes from the content of data objects. As in traditional data processing systems, text codes can reside in different kinds of storage: memory, files, databases, the network or the cloud. Which storage scheme is used for text codes is therefore determined entirely by the requirements and architecture of the application system, and is independent of the storage scheme of the corresponding code repository. What we discuss here is not the storage scheme for text codes but the deployment scheme of the corresponding code repository. On the other hand, storing the text codes and the code repository in different storage systems effectively improves security: as noted earlier, an attacker must then break into both systems to obtain the text information.
Likewise, the architecture of the traditional application system itself (stand-alone or networked, single-user or multi-user, browser-based or rich-client, and so on) is independent of the code repository deployment scheme. Of course, in the new data processing system, the same application system will have different security levels and performance characteristics under different code repository deployment schemes.
Figure 47 is a schematic diagram of the in-application deployment architecture. As shown in Figure 47, in-application deployment means that each application system has its own code repository. Under this scheme, the text content of an application can be interpreted and displayed only by that application; in other application systems it is uninterpretable "garbled" text.
The text content under this deployment scheme has a relatively high security level, since it is at least isolated between applications. The scheme suits personal applications with high security requirements. A "personal diary" is a typical example: the diary content can be opened only by the authorized application. The drawback of in-application deployment is the flip side of its security: data is hard to share.
Figure 48 is a schematic diagram of the terminal deployment architecture. As shown in Figure 48, unlike in-application deployment, terminal deployment shares the code repository as a system service of the terminal, so that multiple applications can use it at the same time. This scheme is also relatively secure, because text content taken off the terminal cannot be used.
Figure 49 is a schematic diagram of the mobile external device deployment architecture. Terminal deployment of the code repository suits personal applications with little need for sharing. With the spread of mobile terminals and tablets, however, more and more people own several computing devices, so personal information often needs to be shared among them. Deploying the code repository on an accessible mobile device, as shown in Figure 49, meets this need directly. The mobile device can be a smart mobile terminal running a code repository service, a removable storage device holding the code repository, or a dedicated code repository device.
Regarding network deployment: language and writing are used mainly to communicate with other people, so the main deployment mode of the code repository is still network deployment. For an Internet-scale network, this is cloud deployment. As shown in Figure 50, all applications share the same code repository, so everyone using the applications can use and exchange text information under the access control of that repository.
For a local area network or a corporate intranet, network deployment of the code repository is private cloud deployment or internal server deployment, as shown in Figure 51. The code repository is then isolated from the outside world by a firewall, and the corresponding coded content can be used only inside the organization.
Figure 52 is a schematic diagram of the peer-to-peer deployment architecture. Peer-to-peer deployment is a special case of network deployment. As shown in Figure 52, on top of in-application or terminal deployment, a user temporarily or permanently shares the code repository with other users. A typical application is personal instant messaging: during a conversation the two parties share their code repositories with each other, so they can communicate normally. If one party stops sharing its code repository when the conversation ends, the other party can no longer read that party's messages in the conversation record. Such a security effect is sometimes needed in real life.
The code repository deployment scheme used by an application is neither absolute nor fixed; an application system can mix different schemes. Figure 53 is a schematic diagram of a hybrid deployment architecture. As shown in Figure 53, the same application can use three different code repositories, and can therefore be used in three different environments simply by switching to the corresponding code repository.
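Switching between code repositories in a hybrid deployment can be sketched as follows. The classes `InMemoryRepository` and `Application`, the repository names, and the use of the replacement character U+FFFD for codes that the active repository cannot resolve are assumptions for illustration; a real repository could equally be a terminal service or a network endpoint.

```python
from typing import Dict, Iterable, Optional


class InMemoryRepository:
    """Stand-in for any deployment of a code repository."""

    def __init__(self) -> None:
        self._store: Dict[int, str] = {}

    def put(self, code: int, content: str) -> None:
        self._store[code] = content

    def get(self, code: int) -> Optional[str]:
        return self._store.get(code)


class Application:
    """Application that mixes several repositories and switches between them."""

    def __init__(self, repositories: Dict[str, InMemoryRepository]) -> None:
        self._repos = dict(repositories)
        self._active: Optional[str] = None

    def switch(self, name: str) -> None:
        if name not in self._repos:
            raise KeyError(name)
        self._active = name

    def render(self, codes: Iterable[int]) -> str:
        if self._active is None:
            raise RuntimeError("no code repository selected")
        repo = self._repos[self._active]
        out = []
        for code in codes:
            content = repo.get(code)
            # Codes unknown to the active repository stay uninterpretable.
            out.append("\ufffd" if content is None else content)
        return "".join(out)
```

The same stored code sequence thus renders fully in one environment and as uninterpretable text in another, depending only on which repository is active.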
Building on the above description, this embodiment gives examples drawn from practical applications, in order to enhance and retrofit traditional information systems and to support text systems based on object coding.
Text in a traditional information system generally uses the text services provided by the operating system directly for input and display output. Because object codes in the new data processing system are fully compatible with traditional text encodings, support for the new data processing system can be added by modifying the operating system's text services, as shown in Figure 54. A traditional information system can then support input and output of non-standard text (such as personalized handwritten text) without itself being changed.
Specifically, regarding the retrofit of back-end storage based on object coding: in existing software application systems, persistent data objects are loaded and stored by a data access module or component. When storing, the data access component writes the data corresponding to an application object directly to the application storage; when loading, it reads the corresponding data from the application storage and instantiates it as an application object.
At the system application level, the object coding system of the present invention can be implemented as follows, although the implementation is not limited to this. For example, the code repository can be located on the user side, on a third-party server, or anywhere in cloud storage.
Referring to Figure 55: the object coding system systematically numbers the data that needs to be loaded and stored, yielding the corresponding object codes. The application storage thus holds mainly object codes and object code sequences, and the actual application data can be obtained only by presenting these codes to the object coding system. The link between the application system and the application data thereby gains an indirection layer, the "code". This introduces additional runtime and even storage overhead, but it also brings benefits such as security, flexibility and efficiency, which are very valuable in some applications.
As shown in Figure 55, an application based on the object coding system of the present invention stores the codes and code sequences it uses in the application storage. When storing, the data access component converts the data corresponding to an application object into content to be encoded, according to the specific application logic; the object coding system converts each data object into a corresponding code and returns it to the data access component, while the content of the data object itself is stored in the object coding system; the data access component then stores the resulting codes and code sequences in the application storage. When loading, the data access component obtains the required codes from the application storage and has the object coding system restore them to data objects; finally, the data access component converts the data objects into application objects.
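The store and load paths just described can be sketched as below. This is a minimal illustration under stated assumptions: `ObjectCodingSystem` and `DataAccess` are hypothetical names, codes are sequential integers, each field of an application object is treated as one data object, and JSON is used as the serialization; none of these choices is prescribed by the description.

```python
import json
from typing import Dict, List


class ObjectCodingSystem:
    """Holds data object content and hands out codes for it (assumed design)."""

    def __init__(self) -> None:
        self._contents: Dict[int, bytes] = {}
        self._next_code = 1

    def encode(self, content: bytes) -> int:
        code = self._next_code
        self._next_code += 1
        self._contents[code] = content  # content stays inside the coding system
        return code

    def decode(self, code: int) -> bytes:
        return self._contents[code]


class DataAccess:
    """Data access component: application storage only ever sees codes."""

    def __init__(self, coding: ObjectCodingSystem, app_storage: Dict[str, List[int]]):
        self._coding = coding
        self._storage = app_storage

    def save(self, key: str, app_object: Dict[str, str]) -> None:
        # Application object -> data objects (one per field, assumed) -> codes.
        data_objects = [json.dumps([name, value]).encode("utf-8")
                        for name, value in app_object.items()]
        self._storage[key] = [self._coding.encode(d) for d in data_objects]

    def load(self, key: str) -> Dict[str, str]:
        # Codes -> data objects -> application object.
        pairs = [json.loads(self._coding.decode(code).decode("utf-8"))
                 for code in self._storage[key]]
        return {name: value for name, value in pairs}
```

Because the application storage contains only the code sequence, a copy of that storage alone reveals nothing about the field values; both stores are needed to reconstruct the application object.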
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (21)

  1. A method for processing handwritten input characters, characterized by comprising:
    in a currently activated first target row/column, collecting strokes input by a user and corresponding input information, wherein the input information includes the input position of each stroke in the first target row/column;
    for each stroke, creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to the input position of the stroke in the first target row/column and designated characters in the first target row/column.
  2. The method according to claim 1, characterized in that creating a new character for the stroke or determining the character to which the stroke belongs, according to the input position of the stroke in the first target row/column, or according to the input position of the stroke in the first target row/column and designated characters in the first target row/column, comprises:
    comparing the input position of the stroke in the first target row/column with position information corresponding to the designated characters in the first target row/column, and determining the association between the stroke and the characters;
    if the stroke is not associated with any character, creating a new character for the stroke, the stroke belonging to the new character;
    if the stroke is associated with at least one character, performing attribution processing on the stroke according to the associated at least one character.
  3. The method according to claim 2, characterized in that the designated characters are all characters already present in the first target row/column;
    or, the designated characters are the characters in a to-be-compared area of the first target row/column, wherein the distance between the boundary position of the to-be-compared area and the stroke is less than a second preset threshold.
  4. The method according to claim 2 or 3, characterized in that comparing the input position of the stroke in the first target row/column with position information corresponding to the designated characters in the first target row/column, and determining the association between the stroke and the characters, comprises:
    comparing the input position of the stroke in the first target row/column with the position information corresponding to a designated character in the first target row/column, and determining whether the stroke overlaps at least one stroke of the character; if the stroke overlaps at least one stroke of the character, determining that the stroke is associated with the character; if the stroke overlaps none of the strokes of the character, determining that the stroke is not associated with the character;
    or,
    for each designated character in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information corresponding to the character, and determining whether the distance between the stroke and the boundary of the character is less than a third preset threshold; if the distance between the stroke and the boundary of the character is less than the third preset threshold, determining that the stroke is associated with the character; if the distance is not less than the third preset threshold, determining that the stroke is not associated with the character;
    or,
    for each designated character in the first target row/column, comparing the input position of the stroke in the first target row/column with the position information corresponding to each stroke of the character, obtaining the minimum of the distances between the stroke and the individual strokes of the character, and determining whether the minimum distance is less than the third preset threshold; if it is less, the stroke is associated with the character; if it is not less, the stroke is not associated with the character.
  5. The method according to any one of claims 1 to 4, characterized by further comprising:
    upon receiving a storage request, obtaining the metadata of the saved handwritten text according to a preset metadata stripping specification, and stripping the obtained metadata from the handwritten text;
    dividing the handwritten text into at least two data fragments according to a preset data content splitting specification.
  6. The method according to claim 5, characterized by further comprising:
    querying a code repository, selecting or creating an encoding specification according to at least part of the metadata, and generating a meta-code corresponding to the metadata according to the encoding specification; encoding the handwritten text according to the encoding specification to obtain an instance code; and obtaining a text code corresponding to the handwritten text from the meta-code and the instance code;
    or,
    sending the handwritten text and the metadata to the code repository, so that the code repository selects or creates an encoding specification according to at least part of the metadata, generates a meta-code corresponding to the metadata according to the encoding specification, encodes the handwritten text according to the encoding specification to obtain an instance code, and obtains a text code corresponding to the handwritten text from the meta-code and the instance code; and receiving the text code returned by the code repository, the text code being in reference-code form or content-code form.
  7. A data splitting method, characterized by comprising:
    upon receiving a storage request carrying an identifier of data to be stored, obtaining, according to a preset metadata stripping specification, the metadata in the data object corresponding to the identifier, and stripping the obtained metadata from the data object;
    dividing the data content into at least two data fragments according to a preset data content splitting specification.
  8. The method according to claim 7, characterized in that, after dividing the data content into at least two data fragments, the method further comprises:
    encoding each data fragment according to a preset encoding separation specification, to obtain the code corresponding to each data fragment;
    arranging the codes according to the original order of the data fragments in the data content, to obtain arrangement order information of the codes.
  9. The method according to claim 8, characterized in that the method further comprises:
    generating a unique identifier of the code order information based on the arrangement order information of the codes, and/or generating a unique identifier for each data fragment based on that data fragment, and storing the unique identifier of the code order information and/or the unique identifiers of the data fragments as part of the metadata.
  10. The method according to claim 8 or 9, characterized in that encoding each data fragment according to the preset encoding separation specification, to obtain the code corresponding to each data fragment, comprises:
    according to the preset encoding separation specification, querying a code repository, selecting or creating an encoding specification according to at least part of the metadata, and generating a meta-code corresponding to the metadata according to the encoding specification; and encoding each data fragment according to the encoding specification, to obtain the instance code corresponding to each data fragment;
    or,
    according to the preset encoding separation specification, sending the data fragments and the metadata to the code repository, so that the code repository selects or creates an encoding specification according to at least part of the metadata, generates a meta-code corresponding to the metadata according to the encoding specification, and encodes each data fragment according to the encoding specification to obtain instance codes; and receiving the meta-code and the instance codes returned by the code repository.
  11. A data merging method, characterized by comprising:
    receiving a data object acquisition request carrying identification information, wherein the identification information includes locating information, and the locating information is used to locate the storage address of part of the data information of the data object;
    obtaining the stored content corresponding to the locating information, and obtaining the data information in other stored content according to the locating information contained in the stored content already obtained, until all data information of the data object has been obtained;
    merging the obtained pieces of data information according to a preset merging specification contained in the obtained data information, to obtain the data object.
  12. The method according to claim 11, characterized in that, when the data information is a combination of data fragments, codes and code order, merging the obtained pieces of data information according to the preset merging specification contained in the obtained data information, to obtain the data object, comprises:
    decoding the codes according to the merging algorithm in the preset merging specification, to obtain the data fragments corresponding to the codes; and arranging the decoded data fragments according to the code order, to obtain the data object with its data fragments in their original order.
  13. The method according to claim 12, characterized in that decoding the codes according to the merging algorithm in the preset merging specification, to obtain the data fragments corresponding to the codes, comprises:
    disassembling the data information according to the merging algorithm in the preset merging specification, to obtain a meta-code, or the meta-code and an instance code;
    querying a code repository, and obtaining the corresponding metadata and encoding specification according to the meta-code;
    obtaining the data object corresponding to the data information according to the metadata and the encoding specification, or according to the metadata, the encoding specification and the instance code.
  14. An encoding processing method, characterized by comprising:
    obtaining a data object to be encoded and its metadata according to a received encoding processing request;
    obtaining an object code of the data object according to a code repository and the data object and its metadata.
  15. The method according to claim 14, wherein obtaining the object code of the data object according to the encoding repository, the data object, and its metadata comprises:
    selecting or creating an encoding specification according to the encoding repository and at least a part of the metadata, and generating, according to the encoding specification, a meta code corresponding to the metadata;
    encoding the data content of the data object according to the encoding specification to obtain an instance code, and obtaining the object code corresponding to the data object according to the meta code and the instance code;
    wherein the object code is in a reference-encoded form or a content-encoded form.
  16. The method according to claim 15, wherein encoding the data content of the data object according to the encoding specification to obtain the instance code comprises:
    serializing the data content of the data object according to the encoding specification to obtain a serialization result, wherein the instance code is the serialization result;
    or,
    serializing the data content of the data object according to the encoding specification to obtain a serialization result, and saving the serialization result in the encoding repository to obtain an object number in the encoding repository, wherein the instance code is the object number.
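As a non-authoritative sketch of claims 14 to 16, the Python below builds an object code from a meta code and an instance code, in either the content-encoded form (the instance code is the serialization result) or the reference-encoded form (the instance code is an object number). The `EncodingRepository` class, the `M<n>` meta codes, the `meta|form|instance` layout, and JSON as the only encoding specification are all assumptions of this example, not details taken from the claims.

```python
import json
from dataclasses import dataclass, field


@dataclass
class EncodingRepository:
    """Hypothetical in-memory stand-in for the encoding repository."""
    meta_codes: dict = field(default_factory=dict)  # meta code -> metadata
    objects: list = field(default_factory=list)     # saved serialization results

    def meta_code_for(self, metadata):
        # Reuse the existing meta code when this metadata is already known.
        for code, known in self.meta_codes.items():
            if known == metadata:
                return code
        code = f"M{len(self.meta_codes)}"
        self.meta_codes[code] = metadata
        return code

    def store(self, serialized):
        # Save the serialization result and return its object number.
        self.objects.append(serialized)
        return len(self.objects) - 1


def encode(repo, data, metadata, by_reference=False):
    """Return an object code laid out as 'meta|form|instance' (assumed layout)."""
    meta_code = repo.meta_code_for(metadata)
    serialized = json.dumps(data, sort_keys=True)
    if by_reference:
        return f"{meta_code}|R|{repo.store(serialized)}"  # reference form
    return f"{meta_code}|C|{serialized}"                  # content form


repo = EncodingRepository()
content_code = encode(repo, {"x": 1}, {"type": "point"})
reference_code = encode(repo, {"x": 1}, {"type": "point"}, by_reference=True)
```

The content form is self-contained but grows with the data; the reference form stays short but can only be decoded by a party with access to the repository, which is where the access rights of claim 17 would apply.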
  17. The method according to any one of claims 14 to 16, further comprising:
    setting access rights for the data in the encoding repository.
  18. The method according to claim 15 or 16, wherein encoding the data content of the data object according to the encoding specification to obtain the instance code comprises:
    acquiring a context object;
    obtaining a corresponding encoding space according to the context object and the encoding specification;
    encoding, in the encoding space, the data content of the data object to obtain the instance code.
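One possible reading of claim 18, sketched in Python under stated assumptions: the context object is a dict with a `lang` key, the encoding specification is identified by a name, and an encoding space is an ordered symbol table in which each symbol is encoded as its index. None of these choices comes from the claim itself.

```python
# Hypothetical encoding spaces, keyed by (context language, specification name).
ENCODING_SPACES = {
    ("en", "word-index"): ["hello", "world"],
    ("fr", "word-index"): ["bonjour", "monde"],
}


def encode_in_context(words, context, spec_name):
    """Encode each word as its index in the encoding space chosen by the context."""
    space = ENCODING_SPACES[(context["lang"], spec_name)]
    return [space.index(word) for word in words]


instance_code = encode_in_context(["world", "hello"], {"lang": "en"}, "word-index")
```

Because the encoding space is selected by the context, the instance code can stay small: it only has to distinguish symbols within that space.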
  19. The method according to any one of claims 14 to 18, wherein the meta code comprises one of, or a combination and/or nesting of several of, the following: a type code, a space code, and a context code.
  20. A decoding processing method, comprising:
    receiving a decoding processing request, and acquiring, according to the decoding processing request, an object code to be decoded;
    disassembling the object code to obtain a meta code, or the meta code and an instance code;
    querying an encoding repository to obtain the metadata and the encoding specification corresponding to the meta code;
    obtaining the data object corresponding to the object code according to the metadata and the encoding specification, or according to the metadata, the encoding specification, and the instance code.
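The decoding steps of claim 20 can be sketched as follows. This is a minimal Python illustration that assumes an object code laid out as `meta|form|instance`, where form `C` carries the serialized content and form `R` carries an object number; the layout, the `META` and `OBJECTS` tables, and JSON as the encoding specification are all hypothetical.

```python
import json

# Hypothetical repository contents: meta code -> (metadata, encoding specification).
META = {"M0": ({"type": "point"}, "json")}
# Hypothetical saved objects for reference-form codes: object number -> serialization.
OBJECTS = {0: '{"x": 1}'}


def decode(object_code):
    """Disassemble an object code and rebuild the data object."""
    parts = object_code.split("|", 2)
    metadata, spec = META[parts[0]]  # query the repository by meta code
    if len(parts) == 1:
        # Meta code only: the data object is described by its metadata alone.
        return {"metadata": metadata, "data": None}
    _, form, instance_code = parts
    if spec != "json":
        raise ValueError(f"unsupported encoding specification: {spec}")
    # Reference form looks the serialization up; content form carries it inline.
    serialized = OBJECTS[int(instance_code)] if form == "R" else instance_code
    return {"metadata": metadata, "data": json.loads(serialized)}


from_content = decode('M0|C|{"x": 1}')
from_reference = decode("M0|R|0")
```

Both forms yield the same data object; only the place where the serialized content lives differs.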
  21. The method according to claim 20, wherein obtaining the data object corresponding to the object code according to the metadata and the encoding specification, or according to the metadata, the encoding specification, and the instance code, comprises:
    acquiring a context object;
    obtaining a corresponding encoding space according to the context object and the encoding specification;
    decoding the instance code in the encoding space to obtain the corresponding data content;
    obtaining the data object corresponding to the object code according to the metadata and the data content.
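To make the role of the context object in claim 21 concrete, the Python sketch below decodes the same instance code under two different contexts. The symbol-table encoding spaces, the `lang` key of the context object, and the shape of the returned data object are assumptions made for this example only.

```python
# Hypothetical encoding spaces, keyed by (context language, specification name).
ENCODING_SPACES = {
    ("en", "word-index"): ["hello", "world"],
    ("fr", "word-index"): ["bonjour", "monde"],
}


def decode_in_context(instance_code, metadata, context, spec_name):
    """Decode indices in the encoding space selected by the context object."""
    space = ENCODING_SPACES[(context["lang"], spec_name)]
    content = [space[index] for index in instance_code]
    # The data object combines the metadata with the decoded data content.
    return {"metadata": metadata, "content": content}


metadata = {"type": "greeting"}
english = decode_in_context([1, 0], metadata, {"lang": "en"}, "word-index")
french = decode_in_context([1, 0], metadata, {"lang": "fr"}, "word-index")
```

The identical instance code `[1, 0]` produces different data content in each context, which is why the decoder must obtain the same context object that the encoder used.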
PCT/CN2015/086672 2014-08-11 2015-08-11 Methods for processing handwritten inputted characters, splitting and merging data and encoding and decoding processing WO2016023471A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580042761.6A CN106575166B (en) 2014-08-11 2015-08-11 Method for processing hand input character, splitting and merging data and processing encoding and decoding
CN202310088220.3A CN116185209A (en) 2014-08-11 2015-08-11 Processing, data splitting and merging and coding and decoding processing method for handwriting input characters

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410392557 2014-08-11
CN201410392557.4 2014-08-11

Publications (1)

Publication Number Publication Date
WO2016023471A1 (en) 2016-02-18

Family

ID=55303878

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/086672 WO2016023471A1 (en) 2014-08-11 2015-08-11 Methods for processing handwritten inputted characters, splitting and merging data and encoding and decoding processing

Country Status (2)

Country Link
CN (2) CN116185209A (en)
WO (1) WO2016023471A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073913B (en) * 2018-01-05 2022-06-14 南京孜博汇信息科技有限公司 Handwriting datamation data acquisition method
CN110134452B (en) * 2018-02-09 2022-10-25 阿里巴巴集团控股有限公司 Object processing method and device
CN111078907A (en) * 2018-10-18 2020-04-28 中华图象字教育股份有限公司 Chinese character tree processing method and device
GB2578625A (en) * 2018-11-01 2020-05-20 Nokia Technologies Oy Apparatus, methods and computer programs for encoding spatial metadata
CN110032920A (en) * 2018-11-27 2019-07-19 阿里巴巴集团控股有限公司 Text region matching process, equipment and device
CN112230781B (en) * 2019-07-15 2023-07-25 腾讯科技(深圳)有限公司 Character recommendation method, device and storage medium
CN110543243B (en) * 2019-09-05 2023-05-02 北京字节跳动网络技术有限公司 Data processing method, device, equipment and storage medium
CN111401137A (en) * 2020-02-24 2020-07-10 中国建设银行股份有限公司 Method and device for identifying certificate column
CN114077466A (en) * 2020-08-12 2022-02-22 北京智邦国际软件技术有限公司 Automatic layout algorithm for multiple rows and multiple columns of fields in Web interface form
CN113760246B (en) * 2021-09-06 2023-08-11 网易(杭州)网络有限公司 Application text language processing method and device, electronic equipment and storage medium
CN113608646B (en) * 2021-10-08 2022-01-07 广州文石信息科技有限公司 Method and device for erasing strokes, readable storage medium and electronic equipment
CN114221783B (en) * 2021-11-11 2023-06-02 杭州天宽科技有限公司 Data selective encryption and decryption system
CN115022302B (en) * 2022-08-08 2022-11-25 丹娜(天津)生物科技股份有限公司 Equipment fault data remote transmission method and device, electronic equipment and storage medium
TWI821128B (en) * 2023-02-23 2023-11-01 兆豐國際商業銀行股份有限公司 Data checking system and method thereof
CN116827479B (en) * 2023-08-29 2023-12-05 北京航空航天大学 Low-complexity hidden communication coding and decoding method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102375989A (en) * 2010-08-06 2012-03-14 腾讯科技(深圳)有限公司 Method and system for identifying handwriting
CN102455867A (en) * 2011-09-29 2012-05-16 北京壹人壹本信息科技有限公司 Method and device for matching handwritten character information
CN102455845A (en) * 2010-10-14 2012-05-16 北京搜狗科技发展有限公司 Character entry method and device
CN103460225A (en) * 2011-03-31 2013-12-18 松下电器产业株式会社 Handwritten character input device
CN103513898A (en) * 2012-06-21 2014-01-15 夏普株式会社 Handwritten character segmenting method and electronic equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3725877A (en) * 1972-04-27 1973-04-03 Gen Motors Corp Self contained memory keyboard
JP3017740B2 (en) * 1988-08-23 2000-03-13 ソニー株式会社 Online character recognition device and online character recognition method
CN101311887A (en) * 2007-05-21 2008-11-26 刘恩新 Computer hand-written input system and input method and editing method
CN101673408B (en) * 2008-09-10 2012-02-22 汉王科技股份有限公司 Method and device for embedding character information in shape recognition result
CN101739118A (en) * 2008-11-06 2010-06-16 大同大学 Video handwriting character inputting device and method thereof
CN102156608B (en) * 2010-12-10 2013-07-24 上海合合信息科技发展有限公司 Handwriting input method for writing characters continuously
CN102508598B (en) * 2011-10-09 2014-03-05 北京捷通华声语音技术有限公司 Method and device for gradually blanking character strokes
GB2509552A (en) * 2013-01-08 2014-07-09 Neuratron Ltd Entering handwritten musical notation on a touchscreen and providing editing capabilities

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI738717B (en) * 2016-03-04 2021-09-11 香港商阿里巴巴集團服務有限公司 Verification processing method and device based on verification code
CN109804362A (en) * 2016-07-15 2019-05-24 伊欧-塔霍有限责任公司 Primary key-foreign key relationship is determined by machine learning
CN109804362B (en) * 2016-07-15 2023-05-30 日立数据管理有限公司 Determining primary key-foreign key relationships by machine learning
US11775843B2 (en) 2017-09-29 2023-10-03 Oracle International Corporation Directed trajectories through communication decision tree using iterative artificial intelligence
CN111279304A (en) * 2017-09-29 2020-06-12 甲骨文国际公司 Method and system for configuring communication decision tree based on locatable elements connected on canvas
CN111279304B (en) * 2017-09-29 2023-08-15 甲骨文国际公司 Method and system for configuring communication decision tree based on locatable elements connected on canvas
US11900267B2 (en) 2017-09-29 2024-02-13 Oracle International Corporation Methods and systems for configuring communication decision trees based on connected positionable elements on canvas
EP3567507A4 (en) * 2017-12-20 2020-03-04 iSplit Co., Ltd. Data management system
CN109359283B (en) * 2018-09-26 2023-07-25 中国平安人寿保险股份有限公司 Summarizing method of form data, terminal equipment and medium
CN109359283A (en) * 2018-09-26 2019-02-19 中国平安人寿保险股份有限公司 Method of summary, terminal device and the medium of list data
US20220107796A1 (en) * 2018-12-25 2022-04-07 Huawei Technologies Co., Ltd. Application Package Splitting and Reassembly Method and Apparatus, and Application Package Running Method and Apparatus
CN110548290B (en) * 2019-09-11 2023-10-03 珠海金山数字网络科技有限公司 Image-text mixed arrangement method and device, electronic equipment and storage medium
CN110548290A (en) * 2019-09-11 2019-12-10 珠海金山网络游戏科技有限公司 Image-text mixed arranging method and device, electronic equipment and storage medium
CN111046632B (en) * 2019-11-29 2023-11-10 智器云南京信息科技有限公司 Data extraction and conversion method, system, storage medium and electronic equipment
CN111046632A (en) * 2019-11-29 2020-04-21 智器云南京信息科技有限公司 Data extraction and conversion method, system, storage medium and electronic equipment
CN110968592B (en) * 2019-12-06 2023-11-21 深圳前海环融联易信息科技服务有限公司 Metadata acquisition method, metadata acquisition device, computer equipment and computer readable storage medium
CN110968592A (en) * 2019-12-06 2020-04-07 深圳前海环融联易信息科技服务有限公司 Metadata acquisition method and device, computer equipment and computer-readable storage medium
CN113569534A (en) * 2020-04-29 2021-10-29 杭州海康威视数字技术股份有限公司 Method and device for detecting messy codes in document
US11442712B2 (en) * 2020-06-11 2022-09-13 Indian Institute Of Technology Delhi Leveraging unspecified order of evaluation for compiler-based program optimization
CN112181950B (en) * 2020-10-19 2024-03-26 北京米连科技有限公司 Construction method of distributed object database
CN112181950A (en) * 2020-10-19 2021-01-05 北京米连科技有限公司 Method for constructing distributed object database
CN112333256A (en) * 2020-10-28 2021-02-05 常州微亿智造科技有限公司 Data conversion frame system and method during network transmission under industrial Internet of things
CN112333256B (en) * 2020-10-28 2022-02-08 常州微亿智造科技有限公司 Data conversion frame system and method during network transmission under industrial Internet of things
CN112966475A (en) * 2021-03-02 2021-06-15 挂号网(杭州)科技有限公司 Character similarity determining method and device, electronic equipment and storage medium
US11494201B1 (en) * 2021-05-20 2022-11-08 Adp, Inc. Systems and methods of migrating client information
CN113360113A (en) * 2021-05-24 2021-09-07 中国电子科技集团公司第四十一研究所 System and method for dynamically adjusting character display width based on OLED screen
CN113625932A (en) * 2021-08-04 2021-11-09 北京鲸鲮信息系统技术有限公司 Full screen handwriting input method and device
CN113625932B (en) * 2021-08-04 2024-03-22 北京字节跳动网络技术有限公司 Full-screen handwriting input method and device
CN113659993A (en) * 2021-08-17 2021-11-16 深圳市康立生物医疗有限公司 Immune batch data processing method and device, terminal and readable storage medium
CN113723048A (en) * 2021-09-06 2021-11-30 北京字跳网络技术有限公司 Method and device for setting rich text space, storage medium and electronic equipment
CN114900315A (en) * 2022-04-24 2022-08-12 北京优全智汇信息技术有限公司 Document electronic management system based on OCR and electronic signature technology
CN114900315B (en) * 2022-04-24 2024-03-15 北京优全智汇信息技术有限公司 Document electronic management system based on OCR and electronic signature technology
CN117371446A (en) * 2023-12-07 2024-01-09 江西曼荼罗软件有限公司 Medical record text typesetting method, system, storage medium and electronic equipment
CN117371446B (en) * 2023-12-07 2024-04-16 江西曼荼罗软件有限公司 Medical record text typesetting method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN106575166A (en) 2017-04-19
CN116185209A (en) 2023-05-30
CN106575166B (en) 2022-11-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15832430

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15832430

Country of ref document: EP

Kind code of ref document: A1