Symbols, Tables and Catalogs

Schematic Overview


  +---------+
  | Catalog |
  +----+----+-------------------------------+
  |    |                                    |
  |    |    +-------------+                 |
  |    +--->| SymbolTable |                 |
  |    |    +---+---------+---------------+ |
  |    |    |   |                         | |
  |    |    |   |     +-------------------+ |
  |    |    |   |     | Symbol (ID, Text) | |
  |    |    |   +---->| Symbol (ID, Text) | |
  |    |    |         | ...               | |
  |    |    +---------+-------------------+ |
  |    |                                    |
  |    |    +-------------+                 |
  |    +--->| SymbolTable |                 |
  |    |    +---+---------+---------------+ |
  |    |    |   |                         | |
  |    .    |   |     +-------------------+ |
  |    .    |   +---->| Symbol (ID, Text) | |
  |    .    |         | ...               | |
  |    .    +---------+-------------------+ |
  |    .                                    |
  +-----------------------------------------+

Catalog

The Catalog holds a collection of ion\Symbol\Table instances that ion\Reader and ion\Writer instances query to look up symbol tables.

See also the ION spec's symbol guide chapter on catalogs.


<?php
$catalog = new ion\Catalog;
$symtab = ion\Symbol\PHP::asTable();
$catalog->add($symtab);
?>

Symbol Table

There are three types of symbol tables: local, shared, and system.

Local symbol tables do not have names, while shared symbol tables require them; only shared symbol tables may be added to a catalog or to a writer’s list of imports.

Local symbol tables are managed internally by Ion readers and writers. No application configuration is required to tell Ion readers or writers that local symbol tables should be used.

Using a shared symbol table

Using local symbol tables requires the local symbol table (including all of its symbols) to be written at the beginning of the value stream. Consider an Ion stream that represents CSV data with many columns. Although local symbol tables will optimize writing and reading each value, including the entire symbol table itself in the value stream adds overhead that increases with the number of columns.

If it is feasible for the writers and readers of the stream to agree on a pre-defined shared symbol table, this overhead can be reduced.
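To get a feel for how this overhead grows, the following rough sketch estimates only the bytes that the symbol texts themselves occupy in a binary Ion 1.0 stream (one type descriptor byte per string, plus a VarUInt length for strings of 14 bytes or more). The helper names `varUIntSize` and `symbolListPayload` are made up for this illustration and are not part of ext-ion; the surrounding list and struct wrappers are deliberately ignored.

```php
<?php

/* Hypothetical helper: number of bytes a VarUInt needs (7 payload bits per byte) */
function varUIntSize(int $n): int {
    $size = 1;
    while ($n >= 0x80) {
        $n >>= 7;
        $size++;
    }
    return $size;
}

/* Hypothetical helper: bytes taken by the symbol texts when encoded as binary Ion strings */
function symbolListPayload(array $names): int {
    $total = 0;
    foreach ($names as $name) {
        $len = strlen($name);
        /* 1 descriptor byte, an extra VarUInt length for long strings, then the text */
        $total += 1 + ($len >= 14 ? varUIntSize($len) : 0) + $len;
    }
    return $total;
}

/* The three columns of a small CSV */
$three = symbolListPayload(["id", "type", "state"]);

/* An assumed wide CSV with 200 generated column names */
$wide = symbolListPayload(array_map(fn(int $i): string => "column_$i", range(1, 200)));

printf("3 columns: %d bytes, 200 columns: %d bytes\n", $three, $wide);
```

With a shared symbol table these bytes stay out of the stream; only a reference to the table (name, version, maximum ID) is written, whatever the number of columns.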

Consider the following CSV in a file called test.csv.


 id,type,state
 1,foo,false
 2,bar,true
 3,baz,true
 ...

An application that wishes to convert this data into the Ion format can generate a symbol table containing the column names. This reduces encoding size and improves read efficiency.

Consider the following shared symbol table that declares the column names of test.csv as symbols. Note that the shared symbol table may have been generated by hand or programmatically.


 $ion_shared_symbol_table::{
   name: "test.csv.columns",
   version: 1,
   symbols: ["id", "type", "state"],
 }

This shared symbol table can be stored in a file (or in a database, etc.) to be resurrected into a symbol table at runtime.
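As a sketch of the programmatic route, the text form of such a table can be derived from the CSV header line with plain PHP. The function name `sharedSymbolTableText` is hypothetical, and the quoting is only meant for simple column names; it is not a general Ion text encoder.

```php
<?php

/* Hypothetical helper: render a shared symbol table in Ion text form */
function sharedSymbolTableText(string $name, int $version, array $symbols): string {
    /* Escape double quotes and backslashes only (assumes simple names) */
    $quote = fn(string $s): string => '"' . addcslashes($s, "\"\\") . '"';
    return sprintf(
        "%s::{\n  name: %s,\n  version: %d,\n  symbols: [%s],\n}\n",
        '$ion_shared_symbol_table',
        $quote($name),
        $version,
        implode(", ", array_map($quote, $symbols))
    );
}

/* The header line of test.csv provides the column names */
$header = "id,type,state";
$columns = str_getcsv($header, ",", "\"", "");

$text = sharedSymbolTableText("test.csv.columns", 1, $columns);
echo $text;
```

The resulting text can be saved as is and later passed to ion\unserialize() to obtain the symbol table again.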

Because the value stream written using the shared symbol table does not contain the symbol mappings, a reader of the stream needs to access the shared symbol table using a catalog.

Consider the following complete example:


<?php

/**
 * Representing a CSV row
 */
class Row {
  public function __construct(
    public readonly int $id,
    public readonly string $type,
    public readonly bool $state = true
  ) {}
}

/* Fetch the shared symbol table from file, db, etc. */
$symtab = ion\unserialize(<<<'SymbolTable'
 $ion_shared_symbol_table::{
   name: "test.csv.columns",
   version: 1,
   symbols: ["id", "type", "state"],
 }
SymbolTable
);

/* Add the shared symbol table to a catalog */
$catalog = new ion\Catalog;
$catalog->add($symtab);

/* Use the catalog when writing the data */
$writer = new class(
  catalog: $catalog,
  outputBinary: true
) extends ion\Writer\Buffer\Writer {
  public function writeRow(Row $row) : void {
    $this->startContainer(ion\Type::Struct);

    $this->writeFieldName("id");
    $this->writeInt($row->id);

    $this->writeFieldName("type");
    $this->writeString($row->type);

    $this->writeFieldName("state");
    $this->writeBool($row->state);

    $this->finishContainer();
  }
};

$writer->writeRow(new Row(1, "foo", false));
$writer->writeRow(new Row(2, "bar"));
$writer->writeRow(new Row(3, "baz"));
$writer->flush();

?>

Let's inspect the binary Ion stream and verify that the column names are actually replaced by symbol IDs:


<?php
  
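foreach (str_split($writer->getBuffer(), 8) as $line) {
    /* hex column: two hex digits per byte, separated by spaces */
    printf("%-26s", chunk_split(bin2hex($line), 2, " "));
    /* text column: printable ASCII as is, everything else as a dot */
    foreach (str_split($line) as $byte) {
        echo $byte >= ' ' && $byte <= '~' ? $byte : ".";
    }
    echo "\n";
}
echo "\n";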

/*
  e0 01 00 ea ee a2 81 83   ........  \ 
  de 9e 86 be 9b de 99 84   ........   |
  8e 90 74 65 73 74 2e 63   ..test.c    > here's ION symbol table metadata
  73 76 2e 63 6f 6c 75 6d   sv.colum   |
  6e 73 85 21 01 88 21 03   ns.!..!.  <
  da 8a 21 01 8b 83 66 6f   ..!...fo   |
  6f 8c 11 da 8a 21 02 8b   o....!..    > here's the actual data
  83 62 61 72 8c 11 da 8a   .bar....   |
  21 03 8b 83 62 61 7a 8c   !...baz.  /
  11                        .
*/

?>
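To read the data part of such a dump by hand, it helps to know that in binary Ion 1.0 each value starts with a type descriptor byte (upper nibble: type code, lower nibble: length) and that each struct field is prefixed by its symbol ID as a VarUInt. The following sketch decodes the bytes of the second row from the dump above. The function `decodeSmallStruct` is a hypothetical illustration and not part of ext-ion; it only handles the three value types in this example and lengths that fit into the descriptor byte.

```php
<?php

/* Hypothetical helper: decode one small binary Ion struct into [ '$<SID>' => value ] */
function decodeSmallStruct(string $bytes): array {
    $pos = 0;
    $head = ord($bytes[$pos++]);
    if (($head >> 4) !== 0xD) {
        throw new InvalidArgumentException("expected a struct");
    }
    /* Lower nibble: length of the struct body in bytes (short structs only) */
    $end = $pos + ($head & 0x0F);
    $fields = [];
    while ($pos < $end) {
        /* Field name: single-byte VarUInt, the high bit marks the last byte */
        $sid = ord($bytes[$pos++]) & 0x7F;
        $descriptor = ord($bytes[$pos++]);
        $type = $descriptor >> 4;
        $len = $descriptor & 0x0F;
        switch ($type) {
            case 0x1: /* bool: the value is stored in the length nibble */
                $value = $len === 1;
                break;
            case 0x2: /* positive int: big-endian magnitude */
                $value = 0;
                for ($i = 0; $i < $len; $i++) {
                    $value = ($value << 8) | ord($bytes[$pos++]);
                }
                break;
            case 0x8: /* string: UTF-8 bytes */
                $value = substr($bytes, $pos, $len);
                $pos += $len;
                break;
            default:
                throw new InvalidArgumentException("unsupported type code $type");
        }
        $fields['$' . $sid] = $value;
    }
    return $fields;
}

/* Bytes of the second row, copied from the hex dump */
$row = decodeSmallStruct(hex2bin("da8a21028b836261728c11"));
var_dump($row);
```

The field names decode to the symbol IDs 10, 11 and 12 (bytes 8a, 8b and 8c); the column names themselves do not appear anywhere in the data part.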

When unserializing without knowing the symbols used, the column names will be just symbol IDs of the form $<SID>:


<?php

var_dump(ion\unserialize($writer->getBuffer(), [
  "multiSequence" => true,
]));

/*
array(3) {
  [0]=>
  array(3) {
    ["$10"]=>
    int(1)
    ["$11"]=>
    string(3) "foo"
    ["$12"]=>
    bool(false)
  }
  [1]=>
  array(3) {
    ["$10"]=>
    int(2)
    ["$11"]=>
    string(3) "bar"
    ["$12"]=>
    bool(true)
  }
  [2]=>
  array(3) {
    ["$10"]=>
    int(3)
    ["$11"]=>
    string(3) "baz"
    ["$12"]=>
    bool(true)
  }
}
*/

?>
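The IDs $10 to $12 are not arbitrary: the Ion 1.0 system symbol table occupies the IDs 1 through 9, so the symbols of the first imported shared table start at 10. The following plain PHP sketch mimics that lookup, assuming the shared table is the only import. The names `ION_SYSTEM_MAX_ID` and `resolveFieldName` are made up for this illustration; ext-ion performs the real resolution through the catalog.

```php
<?php

/* Ion 1.0 reserves the symbol IDs 1 through 9 for system symbols */
const ION_SYSTEM_MAX_ID = 9;

/* Hypothetical helper: map a '$<SID>' field name back to its symbol text */
function resolveFieldName(string $field, array $importedSymbols): string {
    if (!preg_match('/^\$(\d+)$/', $field, $m)) {
        return $field; /* already a plain name */
    }
    /* The first imported symbol has the ID ION_SYSTEM_MAX_ID + 1 */
    $index = (int) $m[1] - ION_SYSTEM_MAX_ID - 1;
    return $importedSymbols[$index] ?? $field;
}

/* Symbols of the shared table "test.csv.columns", in declaration order */
$symbols = ["id", "type", "state"];

/* One row as returned when the symbols are unknown */
$row = ['$10' => 1, '$11' => "foo", '$12' => false];

$resolved = [];
foreach ($row as $field => $value) {
    $resolved[resolveFieldName($field, $symbols)] = $value;
}
var_dump($resolved);
```

IDs outside the imported range are left untouched, which mirrors what happens when a reader has no matching table at hand.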

When unserializing with known symbols, the symbol IDs are resolved by giving the reader a catalog that contains the appropriate symbol tables:


<?php

$reader = new \ion\Reader\Buffer\Reader($writer->getBuffer(),
    catalog: $catalog
);
$unser = new ion\Unserializer\Unserializer(multiSequence: true);
var_dump($unser->unserialize($reader));

/*
  array(3) {
    [0]=>
    array(3) {
      ["id"]=>
      int(1)
      ["type"]=>
      string(3) "foo"
      ["state"]=>
      bool(false)
    }
    [1]=>
    array(3) {
      ["id"]=>
      int(2)
      ["type"]=>
      string(3) "bar"
      ["state"]=>
      bool(true)
    }
    [2]=>
    array(3) {
      ["id"]=>
      int(3)
      ["type"]=>
      string(3) "baz"
      ["state"]=>
      bool(true)
    }
  }
*/

?>