Binary IEML Representation
Collective Intelligence Lab Technical Report
Steven R. Newcomb
Michel Biezunski
Version 1.1 (3 May 2008)
In IEML processing, it is frequently necessary to calculate such things as:
While it would be possible, at least at the lower layers of IEML, simply to enumerate all of the flows in each category and then to compare the two sets of flows, such a brute-force approach is infeasible at the higher layers. The highest (seme) layer, for example, is a space consisting of roughly 10162 semes. The brief and often-encountered IEML expression I~~~~~_ represents all of them. (Star-IEML is designed to enable the expression of large categories very compactly.)
We offer an approach to the binary representation of IEML categories that minimizes their storage requirements, and at the same time minimizes the processing time required to use them as operands in many kinds of calculations.
The structure of IEML categories is such that either they are primitive categories, or they are recursively constructed entirely of primitive categories.
| Layer | Roles | Role players are IEML categories at layer: |
| 6 - seme | Source Destination Translator | 5 |
| 5 - phrase | Source Destination Translator | 4 |
| 4 - idea | Source Destination Translator | 3 |
| 3 - relation | Source Destination Translator | 2 |
| 2 - event | Source Destination Translator | 1 |
| 1 - primitive | (none) | (no role players) |
Therefore, if we wish to represent IEML categories in a compact and computable fashion, it makes sense to focus first on the question of how best to represent a primitive category.
We propose to represent each primitive category as a six-bit unsigned integer, as follows:
| E: | U: | A: | S: | B: | T: | |
| E: | 1 | 0 | 0 | 0 | 0 | 0 |
| U: | 0 | 1 | 0 | 0 | 0 | 0 |
| A: | 0 | 0 | 1 | 0 | 0 | 0 |
| S: | 0 | 0 | 0 | 1 | 0 | 0 |
| B: | 0 | 0 | 0 | 0 | 1 | 0 |
| T: | 0 | 0 | 0 | 0 | 0 | 1 |
In the above table, the columns are primitives, and the rows show how to represent primitive categories as six bits, when each category has a primitive as its only member.
This approach is inspired by the fact that every Star-IEML upper-case letter symbol represents a set of one or more primitives (a primitive category). There are six primitives, and therefore six bits are needed to represent all possible combinations of them. If a bit is 0, then the corresponding primitive is not present in the category. If 1, then it is present.
The significances of the bits are in the left-to-right order E:, U:, A:, S:, B:, T:. This conforms to IEML's convention that, at least by default, symbols should be presented in the order that best reflects increasing substantiality of their semantics.
The following table demonstrates the 6-bit binary representations of the primitive categories for which the Star-IEML syntax provides a special symbol. The first column presents the Star-IEML symbol; the second column shows the representation of the same primitive category using Star-IEML's non-generative set operation notation. The rest of the columns show the bits.
| E: | U: | A: | S: | B: | T: | ||
| O: | (U:|A:) | 0 | 1 | 1 | 0 | 0 | 0 |
| M: | (S:|B:|T:) | 0 | 0 | 0 | 1 | 1 | 1 |
| F: | (U:|A:|S:|B:|T:) | 0 | 1 | 1 | 1 | 1 | 1 |
| I: | (E:|U:|A:|S:|B:|T:) | 1 | 1 | 1 | 1 | 1 | 1 |
The following table provides examples of other Star-IEML expressions of primitive categories, and their binary representations:
| E: | U: | A: | S: | B: | T: | |
| (U:|S:) | 0 | 1 | 0 | 1 | 0 | 0 |
| (U:|B:) | 0 | 1 | 0 | 0 | 1 | 0 |
| (U:|T:) | 0 | 1 | 0 | 0 | 0 | 1 |
| (A:|B:|T:) | 0 | 0 | 1 | 0 | 1 | 1 |
| (E:|S:|T:) | 1 | 0 | 0 | 1 | 0 | 1 |
The semantics of the most significant (25) bit are worth explaining. If the bit is set (i.e., nonzero), the primitive E: is included in primitive category represented by the six bits. In Star-IEML, E: means the absence of any U:, A:, S:, B: or T: primitive in the primitive category. The Star-IEML expression (U:|E:) represents a set of two possible primitive states, one of which is U: and the other of which is the absence of any U:, A:, S:, B: or T: primitive in the primitive category.
Therefore, while a primitive category can include both the possibility of emptiness (i.e., the total absence of U:, A:, S:, B: or T:) as well as any combination of U:, A:, S:, B: and T:, it can never be true that a primitive category excludes the possibility of emptiness and at the same time excludes the possibility of any U:, A:, S:, B: or T:. In terms of the six-bit representation, therefore, a binary primitive category representation in which all six bits are reset (zero) is an invalid representation. This zero state can be used to initialize the data of a binary representation of a primitive category in any context. Then, when an all-zero representation is encountered, there is a good chance that the problem is that no data have yet been written.
The integer value of a six-bit array used to represent an IEML primitive category is called an IEML "character". There are 63 characters in IEML. In order to reserve the word "character" for this special semantic, the word "glyph" is used to denote symbols used in Star-IEML syntax to represent primitive categories and events, such as O, U, b and wo. (The symbol wo is a single glyph, not two glyphs.)
As demonstrated above, every primitive (i.e., layer 1) category can be represented as a single six-bit unsigned integer ("character"). Every event (i.e., layer 2) category consists of three role players, each of which is a primitive (layer 1) category. A category at layer 3 consists of 3 categories at layer 2, and so on, recursively, until layer 6, which is the highest layer. Therefore, the number of characters required to represent any category can be calculated as a power of 3, as shown in the table below:
| layer | number of IEML characters required to represent a category at this layer |
| 1 | 3(layer-1) == 30 == 1 |
| 2 | 3(layer-1) == 31 == 3 |
| 3 | 3(layer-1) == 32 == 9 |
| 4 | 3(layer-1) == 33 == 27 |
| 5 | 3(layer-1) == 34 == 81 |
| 6 | 3(layer-1) == 35 == 243 |
Therefore, an array of one unsigned 6-bit integer can represent any primitive category, while an array of 243 can represent any seme category. For a category at any layer other than layer 1 (there are no role players at layer 1), the categories of each of the three role players are easily calculated by splitting the array into thirds. The first third represents the player of the source role, the second third represents the player of the destination role, and the third third represents the player of the translator role. For example, in a binary representation of an idea (layer 4) category, the category is always represented by an array of 27 unsigned 6-bit integers. Its source role player is represented by the first 9 elements in that array, its destination role player is the next 9, and the last 9 is the translator.
In the rest of this paper, the term "character array" refers to such an array of unsigned 6-bit integers used to represent an IEML category.
Every distinct category has a single unique character array representation. By contrast, a single distinct category can be expressed in a very large number of different ways using Star-IEML.
The one-to-one mapping between character arrays and all possible IEML categories offers a convenient way to index information about IEML categories.
When it is useful to represent a category in Star-IEML in such a way that the Star representation will be unique among all similarly-canonicalized Star representations of other categories, it makes sense first to convert the original Star-IEML expression into a binary-IEML representation, and then to apply a specific rendition process -- a kind of style sheet -- to that binary representation in order to render it as a canonical Star-IEML expression.
When a Star-IEML expression is converted into a binary-IEML representation, information having to do with the exact way in which the category was originally expressed will not be reflected in the resulting character array. For example, nothing in a character array can reveal the expressive distinction, if any, between (U:|A:). and O:..
Comments and instantiators are two of the syntactic features of Star-IEML that may or may not need to be preserved when rendering binary-IEML representations as Star-IEML expressions. Applications may require that comments and instantiators be represented as additional information that is associated with character arrays. Such applications will incur more overhead.
Undetermined subset parameter identifiers must be accounted for in binary representations of IEML categories. This information may impact calculations involving the categories. For example, at processing time, undetermined subsets may be treated as variables whose actual values are bound to them via processing parameters. In any case, a category that represents an undetermined subset cannot always be considered identical to a category that represents only itself, even if the two categories are represented by the same character array. Implementations will need to represent associations between undetermined subset parameter identifiers and the portions of the character array to which they correspond.
Approaches to implementation will necessarily be constrained by their technical environments. An approach suitable for the Python programming environment will be described in a subsequent writing.