Comparing YAML vs XML - a performance perspective

chris (2008-10-03 21:28:27)
I was looking at some documentation at work the other day, where some engineers were discussing the various merits of XML and YAML, generally debating the benefits of one over the other. I personally don't object to either for basic exchange of data, or configuration management. I tend to find YAML easy to read (which is useful, since that's the intention of YAML), but as it happens, I also find most XML quite easy to read - although that might be because I have been reading XML for a long time. So I asked myself - What would be the overriding factor if I was to decide between either one? The simple answer was driven by the fact that at Yahoo! we write web applications for massive-scale deployments and I am interested in anything which assists us in meeting those scaling needs. Performance to us is key. I decided to run a simple load test to compare the two.

I put together a simple parse of some XML and then a parse of an equivalent data set in YAML - just to see which one performs better under a load test. I used PHP for my tests, but rather than parsing in native PHP (that would be daft), I relied on components coded in C and compiled into PHP - namely SimpleXML and Syck.

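The two tests aren't exactly identical: the XML sample has a department element that the YAML sample lacks, it carries the user id, machine and passwords as attributes rather than as keys, and the first email address differs. Here are the source files. Being lazy, and with less than an hour to try this out, I just stole them from an online article about YAML (also here: http://news.php.net/php.general/268422) - First the XML example: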

<?php
$xml = <<<EOD
    <user id="babooey" on="cpu1">
        <firstname>Bob</firstname>
        <lastname>Abooey</lastname>
        <department>adv</department>
        <cell>555-1212</cell>
        <address password="xxxx">ahunter@example1.com</address>
        <address password="xxxx">babooey@example2.com</address>
    </user>
EOD;

$data = new SimpleXMLElement($xml);

print_r($data);
?>

And then the YAML example:

<?php
$yaml = <<<EOD
babooey:
            computer : cpu1
            firstname: Bob
            lastname: Abooey
            cell: 555-1212
            addresses:
                - address: babooey@example1.com
                  password: xxxx
                - address: babooey@example2.com
                  password: xxxx
EOD;

$data = syck_load($yaml);
print_r($data);
?>

The next stage was to load-test each of these simple programmes to see if either one displays a significantly higher throughput. Normally I would turn to http_load for such stress-testing needs. However, I saw Rasmus using the 'Siege' application at DrupalCon earlier this year. It looks pretty sweet for stressing a single url, so here's what the two tests looked like:

siege -c 5 "http://localhost/xmltest.php" -b -t30s
siege -c 5 "http://localhost/yamltest.php" -b -t30s

All I'm doing there is specifying 5 concurrent connections (-c 5) to the url given in quotes, running in benchmark mode with no delay between requests (-b), and asking siege to run the test for 30 seconds (-t30s). I think it would make sense to increase both the concurrency and the duration to better represent the state of a live Yahoo! front-end. Next time round I will run for a longer period - perhaps a couple of minutes. So what did the results look like?

YAML version:
Lifting the server siege...      done.
Transactions:		        4906 hits
Availability:		      100.00 %
Elapsed time:		       29.75 secs
Data transferred:	        3.00 MB
Response time:		        0.02 secs
Transaction rate:	      164.91 trans/sec
Throughput:		        0.10 MB/sec
Concurrency:		        3.15
Successful transactions:        4909
Failed transactions:	           0
Longest transaction:	        0.15
Shortest transaction:	        0.00

XML version:
Lifting the server siege...      done.
Transactions:		        5759 hits
Availability:		      100.00 %
Elapsed time:		       29.06 secs
Data transferred:	        1.91 MB
Response time:		        0.02 secs
Transaction rate:	      198.18 trans/sec
Throughput:		        0.07 MB/sec
Concurrency:		        3.80
Successful transactions:        5759
Failed transactions:	           0
Longest transaction:	        0.21
Shortest transaction:	        0.00

There are a lot of interesting values in those results. Probably most significant are the 'Response time' and 'Transaction rate'. The response time is reported as 0.02 secs in both runs, so the difference shows up in the transaction rate: 198.18 trans/sec for XML against 164.91 trans/sec for YAML. These results suggest that an XML parse operates more quickly than a YAML parse. The difference isn't huge, but these were very simple tests. It would be fun to try a similar exercise on a much larger data set.
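As a rough sanity check on those figures, here is a small PHP sketch that works out the relative difference in transaction rate and the average response size for each run. The function names are mine, and it assumes that siege's "MB" means 1024 * 1024 bytes.

```php
<?php
// Figures copied from the two siege reports above.
$runs = [
    'yaml' => ['hits' => 4906, 'rate' => 164.91, 'megabytes' => 3.00],
    'xml'  => ['hits' => 5759, 'rate' => 198.18, 'megabytes' => 1.91],
];

// Relative difference in transaction rate, as a percentage of the slower run.
function rate_gain_percent(float $faster, float $slower): float
{
    return ($faster - $slower) / $slower * 100.0;
}

// Average response size in bytes.
// Assumption: siege's "MB" is 1024 * 1024 bytes.
function bytes_per_hit(float $megabytes, int $hits): float
{
    return $megabytes * 1024 * 1024 / $hits;
}

$gain = rate_gain_percent($runs['xml']['rate'], $runs['yaml']['rate']);
printf("XML transaction rate is %.1f%% higher than YAML\n", $gain);

foreach ($runs as $name => $run) {
    printf("%s: about %.0f bytes per response\n", $name, bytes_per_hit($run['megabytes'], $run['hits']));
}
```

On these numbers the XML run's transaction rate comes out roughly 20% higher. The YAML script also returns nearly twice as many bytes per response (roughly 640 against 350), so part of the gap may be down to the larger print_r output rather than the parse itself.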

I guess these results are in line with what I would expect. Although YAML is easier to read, its grammar is arguably more complex than XML's, so the parser has more to work out. So in any situation where you have a run-time parse, I would probably head for XML in preference to YAML. However, if parsing occurs off-cycle - say in the reading of a configuration into a memory-resident data set - then the decision would have to be made based on who or what is likely to read or modify the file. If it's likely to be a human, YAML will probably win through.
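To illustrate what I mean by an off-cycle parse, here is a minimal sketch that parses a configuration file once and then reuses a cached PHP array until the source file changes. The helper name load_config_cached, its parameters and the cache-file layout are all hypothetical; this is one way it could be done, not something we measured.

```php
<?php
// Hypothetical helper: parse a config file once, then reuse a cached PHP array.
// $parse is any callable that turns the file contents into an array,
// for example 'syck_load' when the Syck extension is installed.
function load_config_cached(string $sourceFile, string $cacheFile, callable $parse): array
{
    // Reuse the cache while it is at least as new as the source file.
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($sourceFile)) {
        return include $cacheFile;
    }

    // Cache miss: pay the parse cost once.
    $config = $parse(file_get_contents($sourceFile));

    // var_export() emits valid PHP source for arrays and scalars.
    file_put_contents($cacheFile, '<?php return ' . var_export($config, true) . ';', LOCK_EX);

    return $config;
}
```

With something like this in place, a call such as load_config_cached('app.yaml', '/tmp/app.yaml.cache.php', 'syck_load') (paths made up) would only hit the YAML parser when the file changes, which makes the parser's speed matter much less.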
comment
chris
2008-10-03 21:46:58

longer test

I had a chance to load this up a bit more today. Here are the revised results running with siege at 10 concurrent threads for a duration of 5 minutes (300 seconds). As expected (according to the results above), the XML parse proves to be more efficient than the YAML parse, taking about 13% off the response time.

YAML version:
Transactions:		       51980 hits
Availability:		      100.00 %
Elapsed time:		      301.54 secs
Data transferred:	       31.73 MB
Response time:		        0.04 secs
Transaction rate:	      172.38 trans/sec
Throughput:		        0.11 MB/sec
Concurrency:		        6.72
Successful transactions:       51989
Failed transactions:	           0
Longest transaction:	        0.25
Shortest transaction:	        0.00

XML version:
Transactions:		       59242 hits
Availability:		      100.00 %
Elapsed time:		      301.52 secs
Data transferred:	       19.66 MB
Response time:		        0.04 secs
Transaction rate:	      196.48 trans/sec
Throughput:		        0.07 MB/sec
Concurrency:		        7.06
Successful transactions:       59251
Failed transactions:	           0
Longest transaction:	        0.32
Shortest transaction:	        0.00

I have no doubt that there are refinements that could be made to this test - perhaps making the XML and YAML data sets properly equivalent. Also, the SimpleXML parse is just calling a constructor and passing back an XML object, whereas the syck parse is creating an associative array - what are the implications of comparing two processes that do such different work?
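One possible refinement is to time the parse calls in-process, so that Apache, the network and the print_r output drop out of the measurement. The sketch below is untested against my original setup: time_parse is a hypothetical helper, the sample documents are cut down, and the cast to array only forces the top level of the SimpleXML object into an array, so it is a rough approximation of "equal work" at best.

```php
<?php
// Hypothetical helper: average wall-clock seconds per call of $parse($input).
function time_parse(callable $parse, string $input, int $iterations = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $parse($input);
    }
    return (microtime(true) - $start) / $iterations;
}

// Cut-down sample documents, not the full data sets from the post.
$xml  = '<user id="babooey"><firstname>Bob</firstname><lastname>Abooey</lastname></user>';
$yaml = "babooey:\n  firstname: Bob\n  lastname: Abooey\n";

$parsers = [];
if (class_exists('SimpleXMLElement')) {
    // Cast to array so the XML side also ends up with an array (top level only).
    $parsers['xml'] = [
        function (string $s) { return (array) new SimpleXMLElement($s); },
        $xml,
    ];
}
if (function_exists('syck_load')) {
    // Only runs when the Syck extension is actually loaded.
    $parsers['yaml'] = ['syck_load', $yaml];
}

foreach ($parsers as $name => [$parse, $input]) {
    printf("%s: %.2f microseconds per parse\n", $name, time_parse($parse, $input) * 1e6);
}
```

Run from the command line, this reports a per-parse time for whichever of the two extensions is available. Comparing those numbers with the siege results would show how much of the difference really belongs to the parsers.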