# Symbols, Tables and Catalogs

## Schematic Overview

```
+---------+
| Catalog |
+----+----+-------------------------------+
|    |                                    |
|    |    +-------------+                 |
|    +--->| SymbolTable |                 |
|    |    +---+---------+---------------+ |
|    |    |   |                         | |
|    |    |   |     +-------------------+ |
|    |    |   |     | Symbol (ID, Text) | |
|    |    |   +---->| Symbol (ID, Text) | |
|    |    |         | ...               | |
|    |    +---------+-------------------+ |
|    |                                    |
|    |    +-------------+                 |
|    +--->| SymbolTable |                 |
|    |    +---+---------+---------------+ |
|    |    |   |                         | |
|    .    |   |     +-------------------+ |
|    .    |   +---->| Symbol (ID, Text) | |
|    .    |         | ...               | |
|    .    +---------+-------------------+ |
|    .                                    |
+-----------------------------------------+
```

## Catalog

The Catalog holds a collection of ion\Symbol\Table instances, which ion\Reader and ion\Writer instances query when they need to look up a symbol table.

See also [the Ion spec's symbol guide chapter on catalogs](https://amzn.github.io/ion-docs/docs/symbols.html#the-catalog).

```php
<?php
$catalog = new ion\Catalog;
$symtab = ion\Symbol\PHP::asTable();
$catalog->add($symtab);
?>
```

## Symbol Table

There are three types of symbol tables:

- Local
- Shared
- System (a special shared symbol table)

Local symbol tables do not have names, while shared symbol tables require them; only shared symbol tables may be added to a catalog or to a writer’s list of imports.

Local symbol tables are managed internally by Ion readers and writers. No application configuration is required to tell Ion readers or writers that local symbol tables should be used.

### Using a shared symbol table

Using local symbol tables requires the local symbol table (including all of its symbols) to be written at the beginning of the value stream. Consider an Ion stream that represents CSV data with many columns. Although local symbol tables will optimize writing and reading each value, including the entire symbol table itself in the value stream adds overhead that increases with the number of columns.

If it is feasible for the writers and readers of the stream to agree on a pre-defined shared symbol table, this overhead can be reduced.

Consider the following CSV in a file called `test.csv`.

```
id,type,state
1,foo,false
2,bar,true
3,baz,true
...
```

An application that wishes to convert this data into the Ion format can generate a symbol table containing the column names. This reduces encoding size and improves read efficiency.

Consider the following shared symbol table that declares the column names of `test.csv` as symbols. Note that the shared symbol table may have been generated by hand or programmatically.

```
$ion_shared_symbol_table::{
  name: "test.csv.columns",
  version: 1,
  symbols: ["id", "type", "state"],
}
```

This shared symbol table can be stored in a file (or in a database, etc.) and loaded back into a symbol table at runtime.
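As a rough sketch of the programmatic route, the same table could be built directly in PHP and then looked up by name. This assumes that `ion\Symbol\Table\Shared` accepts a name, a version and a list of symbol texts, and that `ion\Catalog::find()` takes a name and a version; both signatures are taken from memory of the extension's stubs and should be checked against the installed version.

```php
<?php

/* Assumption: the constructor arguments are (name, version, symbols) */
$symtab = new ion\Symbol\Table\Shared(
    "test.csv.columns",
    1,
    ["id", "type", "state"]
);

$catalog = new ion\Catalog;
$catalog->add($symtab);

/* Assumption: find() returns the matching table, or null if there is none */
$found = $catalog->find("test.csv.columns", 1);
var_dump($found instanceof ion\Symbol\Table);

?>
```

Building the table in code avoids parsing Ion text at startup, while the textual form shown above is easier to share between applications written in different languages.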

Because the value stream written using the shared symbol table does not contain the symbol mappings, a reader of the stream needs to access the shared symbol table using a catalog.

Consider the following complete example:

```php
<?php

/**
 * Represents a CSV row
 */
class Row {
    public function __construct(
        public readonly int $id,
        public readonly string $type,
        public readonly bool $state = true
    ) {}
}

/* Fetch the shared symbol table from file, db, etc. */
$symtab = ion\unserialize(<<<'SymbolTable'
$ion_shared_symbol_table::{
  name: "test.csv.columns",
  version: 1,
  symbols: ["id", "type", "state"],
}
SymbolTable
);

/* Add the shared symbol table to a catalog */
$catalog = new ion\Catalog;
$catalog->add($symtab);

/* Use the catalog when writing the data */
$writer = new class(options: new ion\Writer\Options(
    catalog: $catalog,
    outputBinary: true
)) extends ion\Writer\Buffer\Writer {
    public function writeRow(Row $row) : void {
        $this->startContainer(ion\Type::Struct);

        $this->writeFieldName("id");
        $this->writeInt($row->id);

        $this->writeFieldName("type");
        $this->writeString($row->type);

        $this->writeFieldName("state");
        $this->writeBool($row->state);

        $this->finishContainer();
    }
};

$writer->writeRow(new Row(1, "foo", false));
$writer->writeRow(new Row(2, "bar"));
$writer->writeRow(new Row(3, "baz"));
$writer->flush();

?>
```

Let's inspect the binary Ion stream and verify that the column names are actually replaced by symbol IDs:

```php
<?php

/* Print a hex dump of the buffer, eight bytes per line */
foreach (str_split($writer->getBuffer(), 8) as $line) {
    printf("%-26s", chunk_split(bin2hex($line), 2, " "));
    foreach (str_split($line) as $byte) {
        /* Show printable ASCII as-is, everything else as a dot */
        echo $byte >= ' ' && $byte <= '~' ? $byte : ".";
    }
    echo "\n";
}
echo "\n";

/*
e0 01 00 ea ee a2 81 83   ........  \
de 9e 86 be 9b de 99 84   ........   |
8e 90 74 65 73 74 2e 63   ..test.c    >  here's Ion symbol table metadata
73 76 2e 63 6f 6c 75 6d   sv.colum   |
6e 73 85 21 01 88 21 03   ns.!..!.  <
da 8a 21 01 8b 83 66 6f   ..!...fo   |
6f 8c 11 da 8a 21 02 8b   o....!..    >  here's the actual data
83 62 61 72 8c 11 da 8a   .bar....   |
21 03 8b 83 62 61 7a 8c   !...baz.  /
11                        .
*/

?>
```
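To see the symbol IDs in those bytes, the first data row can be picked apart by hand. The following is a deliberately simplified sketch in plain PHP, not a general Ion decoder: it assumes every field name and every length fits into a single byte, which holds for this small row but not for arbitrary streams.

```php
<?php

/* First data row, copied from the hex dump above */
$row = hex2bin("da8a21018b83666f6f8c11");

$type   = ord($row[0]) >> 4;      // 0xd: struct
$length = ord($row[0]) & 0x0f;    // 0xa: ten bytes of fields follow

$sids = [];
for ($pos = 1; $pos <= $length; ) {
    /* Field name: a one-byte VarUInt here, the high bit marks the last byte */
    $sids[] = ord($row[$pos++]) & 0x7f;

    /* Value: type in the high nibble, length in the low nibble */
    $td = ord($row[$pos++]);
    if (($td >> 4) !== 0x1) {     // a bool keeps its value in the low nibble
        $pos += $td & 0x0f;       // skip the value bytes
    }
}

echo implode(" ", array_map(fn($sid) => "\$$sid", $sids)), "\n";
/* prints: $10 $11 $12 */

?>
```

The field names occupy one byte each (`8a`, `8b`, `8c`) instead of the text of the column names, which is where the size reduction comes from.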

When unserializing without knowledge of the symbols used, the column names come back as bare symbol IDs of the form `$<SID>`:

```php
<?php

var_dump(ion\unserialize($writer->getBuffer(), [
    "multiSequence" => true,
]));

/*
array(3) {
  [0]=>
  array(3) {
    ["$10"]=>
    int(1)
    ["$11"]=>
    string(3) "foo"
    ["$12"]=>
    bool(false)
  }
  [1]=>
  array(3) {
    ["$10"]=>
    int(2)
    ["$11"]=>
    string(3) "bar"
    ["$12"]=>
    bool(true)
  }
  [2]=>
  array(3) {
    ["$10"]=>
    int(3)
    ["$11"]=>
    string(3) "baz"
    ["$12"]=>
    bool(true)
  }
}
*/

?>
```
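The IDs start at `$10` because the Ion 1.0 system symbol table always occupies `$1` to `$9`, and imported shared symbols are numbered after it. The plain-PHP sketch below only mirrors that arithmetic for illustration; the real assignment happens inside the Ion library, and the variable names are made up for this example.

```php
<?php

/* Ion 1.0 system symbol table: these nine symbols always occupy $1 to $9 */
$system = [
    '$ion', '$ion_1_0', '$ion_symbol_table', 'name', 'version',
    'imports', 'symbols', 'max_id', '$ion_shared_symbol_table',
];

/* Symbols of the imported shared table, in declaration order */
$shared = ["id", "type", "state"];

/* Number all symbols consecutively, starting at 1 */
$sids = [];
foreach ([...$system, ...$shared] as $index => $text) {
    $sids[$text] = $index + 1;
}

printf("id=\$%d type=\$%d state=\$%d\n", $sids["id"], $sids["type"], $sids["state"]);
/* prints: id=$10 type=$11 state=$12 */

?>
```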

When the reader is given a catalog containing the appropriate symbol tables, the symbol IDs are resolved to their text:

```php
<?php

var_dump(ion\unserialize($writer->getBuffer(), [
    "multiSequence" => true,
    "readerOptions" => [
        "catalog" => $catalog
    ]
]));

/*
array(3) {
  [0]=>
  array(3) {
    ["id"]=>
    int(1)
    ["type"]=>
    string(3) "foo"
    ["state"]=>
    bool(false)
  }
  [1]=>
  array(3) {
    ["id"]=>
    int(2)
    ["type"]=>
    string(3) "bar"
    ["state"]=>
    bool(true)
  }
  [2]=>
  array(3) {
    ["id"]=>
    int(3)
    ["type"]=>
    string(3) "baz"
    ["state"]=>
    bool(true)
  }
}
*/

?>
```