Dig in and create your own parser library

“Understanding Software” by Max Kanat-Alexander is a super fast read and should be required reading for any new developer (engineer). It gives you simple guidelines and ways to solve problems. One part of the book kept ringing in my ears a year or so ago: “Good developers read every line of code.” I had never considered this before. For the application/company you are supporting, you should understand every line of code that’s running in production.

This includes the 3rd-party libraries you’re using. So back in 2021, my team was writing a new application that could ingest all sorts of XML files full of jobs. (I worked for Getwork/LinkUp at the time.)

One big challenge was a feed that contained about 8-10 million jobs; the file (uncompressed) was about 500 GB. We needed to present the fields that were in this XML in a UI, on the fly, so an internal user could do some mappings. Well, trying to 1. decompress, 2. parse, and then 3. find the fields was not an easy endeavor using out-of-the-box parsers. I tried many different parsers, even old ones. None could do it; there were a few that streamed the file, but they couldn’t decompress on the fly.

Some would decompress but then do a horrible job streaming. Another had too many requirements just to initialize.

So I started to read the code and step through each line and each class. It turns out the Symfony XML parser is made up of three or four different modules.

OK, so maybe I could do something with these different parts. I set up a quick proof of concept, and within a few hours I had something working: I could stream chunks of this file using one open source module and then decompress those chunks using another.

All of this was possible because of open source. Some dev (or many devs) decided to break these things apart and then bundle them up into one parser.

The code

Here’s the interface, since we were already using a couple of different off-the-shelf (OTS) parsers.

<?php

namespace App\Feeds\Service\Parser;

use Generator;

interface ParserInterface
{
    /**
     * @param string|null $path
     * @return Generator
     */
    public function getNextJob(string $path = null): Generator;
}
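Any format-specific parser just has to yield one decoded job at a time. As a hypothetical illustration (not from the original codebase), a minimal CSV parser with the same shape might look like this; because `getNextJob()` yields one row at a time, callers can `foreach` over millions of rows without holding the whole file in memory:

```php
<?php
// Hypothetical sketch, not from the original codebase. The interface is
// repeated here so the example is self-contained.

interface ParserInterface
{
    public function getNextJob(string $path = null): Generator;
}

class CsvParser implements ParserInterface
{
    public function getNextJob(string $path = null): Generator
    {
        $handle = fopen($path, 'r');
        $headers = fgetcsv($handle);
        while (($row = fgetcsv($handle)) !== false) {
            // Pair the header row with each data row into one "job" array
            yield array_combine($headers, $row);
        }
        fclose($handle);
    }
}
```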

And here’s the custom XML parser. It does a bunch of setup in the constructor, combining Symfony’s XmlEncoder, Psr\Http\Message\StreamInterface, and GuzzleHttp\Psr7.

use Psr\Http\Message\StreamInterface;
use Symfony\Component\Serializer\Encoder\XmlEncoder;
use Exception;
use Generator;

class XmlPsr7Parser implements ParserInterface
{
    private Feed $feed;
    private XmlEncoder $encoder;
    private string $protocol;
    private Fetcher $fetcher;
    private StreamInterface $stream;

    /**
     * @param Feed $feed
     * @throws Exception
     */
    public function __construct(Feed $feed)
    {
        $this->encoder = new XmlEncoder();
        $this->feed = $feed;
        $this->protocol = Feed::PROTOCOL_NAME_SPACE.$this->feed->getProtocol();
        $this->fetcher = new Fetcher(new $this->protocol($this->feed));
        $this->stream = $this->fetcher->fetchResource();
    }

    /**
     * @param string|null $path
     * @return Generator
     * @throws Exception
     */
    public function getNextJob(string $path = null): Generator
    {
        $data = '';
        $xml_node_name = $this->feed->getXmlNodeName();
        $nodeLen = strlen('</'.$xml_node_name.'>');

        while (!$this->stream->eof()) {
            $data .= $this->stream->read(1024);
            // Match with regex (start tags with attributes) or plain string search
            $posStart1 = strpos($data, '<'.$xml_node_name.'>');
            $posStart2 = preg_match("/<".$xml_node_name."(\s.*)>/i", $data, $matches, PREG_OFFSET_CAPTURE);
            $posEnd = strpos($data, '</'.$xml_node_name.'>');
            if (($posStart1 || $posStart2) && $posEnd) {
                $start = $posStart1 ?: $matches[0][1];
                $xmlString = substr($data, $start, $posEnd + $nodeLen - $start);
                try {
                    yield $this->encoder->decode($xmlString, 'XML');
                } catch (Exception $e) {
                    throw new Exception($e->getMessage()." Url= ".$this->feed->getUrl());
                }
                // Keep the remnant after the closing tag for the next iteration
                $data = "<holder/>".substr($data, $posEnd + $nodeLen);
            }
        }
        $this->stream->close();
    }
}

A lot is going on here, so let’s walk through the “getNextJob” method.

First, it sets the node name based on the Feed entity:

$xml_node_name = $this->feed->getXmlNodeName();

Then, in a while loop, it first reads a chunk from the stream and appends it to $data:

$data .= $this->stream->read(1024);

Next, it tries to find the start and end of the given node name, by regex or plain string match. (The regex is mainly for cases where the start tag has XML attributes.)

$posStart1 = strpos($data,'<'.$xml_node_name.'>');
$posStart2 = preg_match("/<".$xml_node_name."(\s.*)>/i", $data, $matches, PREG_OFFSET_CAPTURE);
$posEnd = strpos($data,'</'.$xml_node_name.'>');
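A quick illustration of why both matches are needed (hypothetical input): the plain strpos search misses start tags that carry attributes, while the regex still finds them and reports their byte offset.

```php
<?php
// Hypothetical input: a node whose start tag carries an attribute.
$node = 'job';
$data = '<job id="42"><title>Dev</title></job>';

// The plain string match only finds a bare '<job>' tag...
$posStart1 = strpos($data, '<'.$node.'>');   // false here: tag has attributes

// ...so the regex branch catches attributed tags and reports the offset.
preg_match("/<".$node."(\s.*)>/i", $data, $matches, PREG_OFFSET_CAPTURE);
$start = $matches[0][1];                     // offset of the start tag
```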

If the current $data contains both a start and an end tag, it extracts $xmlString, tries to XML-decode it, and yields the decoded node (the method as a whole returns a Generator):

$start = $posStart1 ?: $matches[0][1];
$xmlString = substr($data, $start, $posEnd + $nodeLen - $start);
try {
    yield $this->encoder->decode($xmlString, 'XML');
} catch (Exception $e) {
    throw new Exception($e->getMessage()." Url= ".$this->feed->getUrl());
}

Then it removes the extracted node from $data, keeping only the remnant after the closing tag, and continues the loop. ($nodeLen, the length of the closing tag strlen('</'.$xml_node_name.'>'), was computed once before the loop.)

//Take remnant
$data = "<holder/>".substr($data, $posEnd + $nodeLen);
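The whole read-match-yield-trim loop can be exercised in isolation. Here’s a hedged, self-contained sketch of the same chunking technique, under some assumptions: a hypothetical node name `job`, a plain string standing in for the PSR-7 stream, and strict `=== false` checks instead of truthiness.

```php
<?php
// Self-contained sketch of the chunking technique (illustrative, not the
// production class): scan a growing buffer for complete <job>…</job>
// nodes, yield each one, and keep the remainder for the next chunk.

function streamJobs(string $xml, int $chunkSize = 64): Generator
{
    $node = 'job';
    $nodeLen = strlen('</'.$node.'>');
    $data = '';
    $offset = 0;

    while ($offset < strlen($xml)) {
        // Simulate StreamInterface::read() with substr()
        $data .= substr($xml, $offset, $chunkSize);
        $offset += $chunkSize;

        // Keep extracting while the buffer holds a complete node
        while (true) {
            $posStart = strpos($data, '<'.$node.'>');
            $posEnd = strpos($data, '</'.$node.'>');
            if ($posStart === false || $posEnd === false) {
                break;
            }
            yield substr($data, $posStart, $posEnd + $nodeLen - $posStart);
            // Drop everything up to and including the closing tag
            $data = substr($data, $posEnd + $nodeLen);
        }
    }
}
```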

The whole thing stops when the end of the stream is reached and then closes everything down.

And here’s a key method inside Fetcher/Protocol that wraps a lot of the fetching and decompression…

/**
 * @return StreamInterface
 * @throws Exception
 */
public function fetchResource(): StreamInterface
{
    try {
        $remote_file = fopen($this->feed->getRemotePath(), 'r+');
        $stream = Psr7\Utils::streamFor($remote_file);
        $stream = $this->archiver->extract($stream);
    } catch (Throwable $ex) {
        throw new Exception($ex->getMessage().' HTTP authentication failed.');
    }

    return $stream;
}
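The archiver itself isn’t shown, but the decompress-on-the-fly idea can be sketched with PHP’s built-in compress.zlib:// stream wrapper (GuzzleHttp\Psr7’s InflateStream wraps the equivalent zlib.inflate filter around a PSR-7 stream). This is an illustration under the assumption of gzip input, not the production archiver:

```php
<?php
// Sketch (assumes gzip input and the zlib extension): read a compressed
// file in small decompressed chunks without ever inflating it fully.

function openGzipStream(string $path)
{
    // compress.zlib:// transparently inflates gzip data as it is read
    return fopen('compress.zlib://'.$path, 'rb');
}

// Stream fixed-size decompressed chunks, like the parser's read(1024)
function readChunks($handle, int $chunkSize = 1024): Generator
{
    while (!feof($handle)) {
        yield fread($handle, $chunkSize);
    }
    fclose($handle);
}
```

Only one chunk of decompressed data is in memory at a time, which is what made the 500 GB feed tractable.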

Here’s the usage inside our RabbitMQ consumer…

$parser = $this->feedParserRepository->getOne($feed);
$mapper = new FeedMapper($feed);
foreach ($parser->getNextJob() as $row) {
    $job = $mapper->map($row);
    if ($this->processJob($feed, $job)) {
        $processedJobs++;
    }
}

Wrapping it up

Oftentimes we treat these open source libraries as black boxes: things we shouldn’t touch because someone else maintains them, so we shouldn’t bother reading the code.

Since reading Max’s book, I’ve remembered that I should know every line. I took the time to read through these libraries and was then able to create my own parser, one that worked perfectly for the given use case. The class above also made it pretty simple to write unit tests, since a lot of the setup in the constructor can be skipped when running the tests.

This worked great and gave us a simple UI for our internal users. Within their workflow, managers could start a new client’s feed, do the mappings, and ingest jobs, all within a few moments. And they could get instant feedback that it all worked, sometimes while they were on a call with a new prospect.
