I put together a simple parse of some XML and a parse of an equivalent data set in YAML, just to see which one performs best under a load test. I used PHP for my tests, but rather than parsing in native PHP (that would be daft), I relied on components coded in C and compiled into PHP - namely SimpleXML and Syck.
It could be argued that the tests weren't exactly identical, so here are the source files. Being lazy, and with less than an hour to try this out, I just stole them from an online article about YAML (also here: http://news.php.net/php.general/268422). First, the XML example:
<?php
$xml = <<<EOD
<user id="babooey" on="cpu1">
    <firstname>Bob</firstname>
    <lastname>Abooey</lastname>
    <department>adv</department>
    <cell>555-1212</cell>
    <address password="xxxx">ahunter@example1.com</address>
    <address password="xxxx">babooey@example2.com</address>
</user>
EOD;
$data = new SimpleXMLElement($xml);
print_r($data);
?>
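For what it's worth, the SimpleXMLElement gives you object-style access once the XML is parsed - attributes via array syntax, child elements as properties. As a quick sketch, these lines could be dropped in after the print_r() in xmltest.php:
echo $data['id'];                    // attribute on the root element: "babooey"
echo $data->firstname;               // child element: "Bob"
echo $data->address[1];              // second address element
echo $data->address[0]['password'];  // attribute on a child element: "xxxx"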
And then the YAML example:
<?php
$yaml = <<<EOD
babooey:
    computer : cpu1
    firstname: Bob
    lastname: Abooey
    cell: 555-1212
    addresses:
        - address: babooey@example1.com
          password: xxxx
        - address: babooey@example2.com
          password: xxxx
EOD;
$data = syck_load($yaml);
print_r($data);
?>
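By contrast, syck_load() hands back plain nested PHP arrays, so the equivalent reads look something like this (again, a sketch that could go after the print_r() in yamltest.php, based on the structure above):
echo $data['babooey']['firstname'];               // "Bob"
echo $data['babooey']['computer'];                // "cpu1"
echo $data['babooey']['addresses'][1]['address']; // "babooey@example2.com"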
The next stage was to load-test each of these simple programmes to see whether either one shows a significantly higher throughput. Normally I would turn to http_load for such stress-testing needs, but I saw Rasmus using the 'Siege' application at DrupalCon earlier this year. It looks pretty sweet for stressing a single URL, so here's what the two tests looked like:
siege -c 5 "http://localhost/xmltest.php" -b -t30s
siege -c 5 "http://localhost/yamltest.php" -b -t30s
All I'm doing there is specifying a maximum of 5 concurrent connections to the URL (given in quotes), running in benchmark mode (-b), and asking siege to run the test for 30 seconds. I think it would make sense to increase both of those values to use a higher number of concurrent connections (better representing the state of a live Yahoo! front-end), and next time round I will also run for a longer period - perhaps a couple of minutes.
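Something along these lines would do it - the numbers here are arbitrary, just to show the idea:
siege -c 50 "http://localhost/xmltest.php" -b -t120s
siege -c 50 "http://localhost/yamltest.php" -b -t120s
So, what did the results look like?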
YAML version:
Lifting the server siege... done.
Transactions: 4906 hits
Availability: 100.00 %
Elapsed time: 29.75 secs
Data transferred: 3.00 MB
Response time: 0.02 secs
Transaction rate: 164.91 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 3.15
Successful transactions: 4909
Failed transactions: 0
Longest transaction: 0.15
Shortest transaction: 0.00
XML version:
Lifting the server siege... done.
Transactions: 5759 hits
Availability: 100.00 %
Elapsed time: 29.06 secs
Data transferred: 1.91 MB
Response time: 0.02 secs
Transaction rate: 198.18 trans/sec
Throughput: 0.07 MB/sec
Concurrency: 3.80
Successful transactions: 5759
Failed transactions: 0
Longest transaction: 0.21
Shortest transaction: 0.00
There are a lot of interesting values in those results. Probably the most significant are the 'Response time' and 'Transaction rate'. These results show that an XML parse appears to run more quickly than a YAML parse - roughly 198 transactions per second against 165. The difference isn't huge, but these were very simple tests. It would be fun to try a similar exercise on a much larger data set.
I guess these results are in line with what I would expect. Although YAML is easier to read, its grammar is more complex than XML's, and that makes it more expensive to parse. So in any situation where you have a run-time parse, I would probably head for XML in preference to YAML. However, if parsing occurs off-cycle - say when reading a configuration file into a memory-resident data set - then the decision should be based on who or what is likely to read or modify the file. If it's likely to be a human in either case, YAML will probably win through.
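To illustrate that last point: if the YAML is a config file that only changes when a human edits it, you can parse it once and keep the result around, so the parse cost barely matters. A rough sketch - the file names and the serialize-to-disk cache are just made up for illustration:
<?php
// Hypothetical config loader: parse the YAML once, then reuse a
// PHP-serialized copy until the YAML file changes on disk.
function load_config($yaml_file, $cache_file)
{
    if (file_exists($cache_file) && filemtime($cache_file) >= filemtime($yaml_file)) {
        // Cache is fresh - no YAML parse needed on this request
        return unserialize(file_get_contents($cache_file));
    }
    // Cache is stale or missing - parse the YAML and rewrite the cache
    $config = syck_load(file_get_contents($yaml_file));
    file_put_contents($cache_file, serialize($config));
    return $config;
}

$config = load_config('/etc/myapp/config.yaml', '/tmp/myapp-config.cache');
print_r($config);
?>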