Wednesday, February 10, 2016

Solving MiFID II Clock Synchronisation with minimum spend (part 6)

In this series we are attempting to solve a clock synchronisation problem to the degree of accuracy required by the MiFID II regulations, and we're trying to do it without spending a lot of money.

So far we have:
  1. Talked about the regulations and how we might solve this with Linux software
  2. Built a "PTP Bridge" with Puppet
  3. Started recording metrics with collectd and InfluxDB
  4. Finished recording metrics, and
  5. Drawn lots of graphs with Grafana and found contention on our firewall
In this post we will look at introducing a second, ASIC-based firewall to route PTP through, so that PTP accuracy is not affected by our management firewall getting busy.

The New Design

Here is the adjusted design diagram; it's not that much more complicated:


Routing To Another Firewall

In order for this design to work, every PTP client must send its PTP traffic back through the PTP firewall rather than the management firewall, so each client needs a routing rule to reflect this. Similar changes are needed on the PTP Bridge so that it, too, sends via the PTP firewall.

With the new firewall in place and the routing changes applied, everything should "just work", but it doesn't. We see multicast leaving the PTP Bridge correctly:

[root@ptp-bridge ~]# tcpdump -i p4p3 multicast -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on p4p3, link-type EN10MB (Ethernet), capture size 65535 bytes
16:10:07.519357 IP 192.168.0.4.319 > 224.0.1.129.319: UDP, length 44
16:10:07.519405 IP 192.168.0.4.320 > 224.0.1.129.320: UDP, length 44

Restart sfptpd on a client and we see it cannot subscribe to the multicast group:

2015-12-15 16:01:18.306138: info: using SO_TIMESTAMPNS software timestamps
2015-12-15 16:01:18.406259: notice: Now in state: PTP_LISTENING
2015-12-15 16:01:18.406353: info: creating NTP sync-module
2015-12-15 16:01:30.306177: warning: failed to receive Announce within 12.000 seconds

A packet trace shows PIM announcements from the right firewall, and the IGMPv3 reports from sfptpd leaving the client:

16:07:15.444343 IP 192.168.0.2 > 224.0.0.13: PIMv2, Hello, length 34
16:07:18.307942 IP 192.168.0.5 > 224.0.0.22: igmp v3 report, 2 group record(s)
16:07:18.369939 IP 192.168.0.5 > 224.0.0.22: igmp v3 report, 2 group record(s)
16:07:18.686936 IP 192.168.0.5 > 224.0.0.22: igmp v3 report, 2 group record(s)

Looking at the multicast groups on the new firewall, the PTP multicast addresses do not appear as registered groups that the firewall can forward.
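
A throwaway check at this point (purely illustrative, not part of the original toolchain): a few lines of Python can issue the same kind of IGMP join that sfptpd does and then wait to see whether anything is forwarded to the group. Run it on a client with sfptpd stopped so port 319 is free, and as root for the privileged port:

#!/usr/bin/env python
# Illustrative only: join the PTP event multicast group and report whether
# anything is forwarded to us.
import socket
import struct

GROUP = "224.0.1.129"   # PTP primary multicast group
PORT = 319              # PTP event messages (Sync, DelayReq)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# This setsockopt triggers the same kind of IGMP membership report we saw
# leaving the client in the packet trace above.
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

sock.settimeout(15)
try:
    data, addr = sock.recvfrom(128)
    print("received %d bytes from %s - the group is being forwarded" % (len(data), addr[0]))
except socket.timeout:
    print("nothing received - the firewall is not forwarding the group to us")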

To cut a very, very long story short, this was a TTL problem compounded by two firewall brands behaving differently. The ptp4l process sets the TTL of its multicast packets to 1 to restrict them to the same subnet. Our management firewall accepted this and did the multicast forwarding anyway; the PTP firewall did not. Setting udp_ttl=2 for the ptp4l process that's doing the multicast fixed this. The lmaxexchange-linuxptp module has now been changed to accept this option too.
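
To see the TTL behaviour in isolation, here's a small probe in the same spirit - again purely illustrative, not part of the original setup; the group and port come from the traces above, everything else is an assumption. It sends a single packet to the PTP event group with a chosen TTL, so you can watch whether it appears on the far side of the firewall:

#!/usr/bin/env python
# Illustrative only: send one UDP packet to the PTP event multicast group
# with a given TTL, to check whether the firewall will route it onwards.
import socket
import sys

GROUP = "224.0.1.129"   # PTP primary multicast group (from the tcpdump above)
PORT = 319              # PTP event port
ttl = int(sys.argv[1]) if len(sys.argv) > 1 else 1   # default matches ptp4l

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# TTL 1 restricts the datagram to the local subnet; the PTP firewall honours
# this and refuses to route it, whereas the management firewall forwarded it.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
sock.sendto(b"ttl probe", (GROUP, PORT))
print("sent probe to %s:%d with TTL %d" % (GROUP, PORT, ttl))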

The performance improvement moving from the software firewall to an ASIC firewall is quite apparent. Whereas before our "normal" range of variance was ±20μs (top image), we're now getting ±5μs (bottom image):


Our mean accuracy has improved substantially - at a glance the hardware change has saved around 15μs of variance, roughly 15% of the 100μs tolerance the regulations allow.

There are interesting patterns in the most recent data; regular and somewhat uniform spikes in client offset are seen on both clients. The graph below draws both client offsets together:



The spikes are recorded at almost exactly the same second on both clients; they last for a few seconds, appear to come in pairs, and are generally over 9000ns in size. Identifying how regular these spikes are may indicate what the problem is: if they were perfectly regular on both clients, I would start to suspect something on the firewall, like an administrative or housekeeping function.

We can use an InfluxDB query to get the raw data:

SELECT "time","value" FROM "tail_ns" WHERE "instance" = 'sfptpd_stats' AND "type_instance" = 'offset-system' AND ("host" = 'client1') AND ( "value" > '0' OR "value" < '0' ) AND time >= '2016-02-05 09:43:19' AND value > 9000

Now with a little bit of Python we can work out the difference in time between spikes:
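
The script isn't reproduced here, so what follows is just a minimal sketch of the idea; it assumes the query output has been boiled down to one plain "YYYY-MM-DD HH:MM:SS" timestamp per line in a file (the foo used in the pipeline below):

#!/usr/bin/env python
# diff.py - a rough sketch, not the original script: read spike timestamps
# from stdin (one per line) and print the gap between consecutive spikes.
import sys
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"   # assumes the query output was trimmed to bare timestamps

previous = None
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    current = datetime.strptime(line, FMT)
    if previous is not None:
        # e.g. "2016-02-05 09:44:25 -> 0:01:06"; the gap is the fourth
        # whitespace-separated field, which is what the awk below picks out.
        print("%s -> %s" % (current, current - previous))
    previous = current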

Unfortunately there's not as much uniformity in the spikes as I was hoping to see:

$ python diff.py < foo  | awk '{print $4}' | sort | uniq -c  | sort -n | tail
      6 0:01:08
      6 0:01:10
      6 0:01:14
      6 0:01:16
      8 0:01:00
      8 0:01:03
      8 0:01:17
      9 0:01:04
      9 0:01:05
     10 0:01:06

The most common time difference is around 60-70 seconds, which is the gap between the two individual spikes within a pair. The gap between the pairs themselves, though, is quite varied - anywhere from 2 minutes to 8 minutes apart. This doesn't mean it's not the firewall, but the data doesn't reliably pin the problem on some recurring X-minute job in the internals of the firewall's operating system. The network engineers in my team are also quick to point out that this is an ASIC-based firewall doing nothing but shifting a couple of PTP packets per second, so firewall-induced spikes of 20μs would be a very strange thing to be encountering. Before getting lost in a rabbit hole chasing performance spikes (something our engineers here know all too much about), there may be another way to mitigate this.

Attracting Attention

January was an interesting month for this project. We had several Solarflare representatives in to talk specifically about sfptpd, how it might be better utilised here, and what feedback we could give to the Solarflare developers. A few days later I was tracked down on LinkedIn by Wojciech Owczarek, developer of the PTPd project. There are some very interesting features coming into PTPd, most of which we'll get to in a later blog post, but Wojciech did recommend increasing the PTP Sync and DelayReq message rates to give the filters more samples to work with. This is a very simple thing to adjust and measure, so let's see if changing these values smooths out our spikes to any degree.

Faster Message Rates

Changing the sync rate of the ptp4l process on the PTP Bridge requires a minor code addition to the lmaxexchange-linuxptp Puppet module, and at the same time I'll rename the repositories on GitHub to make it clearer that they are actually Puppet modules. Setting logSyncInterval to -2 means ptp4l will send out 4 Sync messages a second. The sfptpd daemons recognise this immediately:

2016-02-05 14:47:24.187549: info: received new Sync interval -2 from master (was 0)

After half an hour it appears the size of the spikes has dropped from upwards of 20μs to around 15μs. Observing over a longer period suggests the intensity of a spike has lowered slightly; see the before (top) and after (bottom) graphs below:


Changing the Delay Request rate of the sfptpd client is also simple. Setting ptp_delayreq_interval to -2 in the sfptpd configuration file means the client will send a DelayReq message 4 times a second. We also have two clients, so by changing only one we can compare the difference between the two.
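
Both settings are log2 values - the message interval is 2^n seconds - which is why -2 works out to four messages per second. A trivial, purely illustrative check:

# The PTP log interval settings express the message interval as 2**n seconds,
# so negative values mean more than one message per second.
for log_interval in (0, -1, -2):
    seconds = 2.0 ** log_interval
    print("log interval %2d -> one message every %.2fs (%g per second)"
          % (log_interval, seconds, 1.0 / seconds))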

An interesting pattern has emerged a day after changing one of our clients to send 4 Delay Requests per second; the spikes are no longer uniform across both clients as seen in the graph below:


This debunks the theory that the firewalls are causing the jitter; each client is seeing its own variance at different intervals. One of our key performance guys, Mark Price, tells me 10-20μs is well within the range of Operating System jitter he's seen, so on an un-tuned system, even with no workload, there's a good chance this is what we are seeing.

A Different Architecture

Let's step back a bit and look at what we've found so far. Routing PTP through a firewall is possible, but we are heavily affected by contention. We can reduce this by using a dedicated firewall, but we are still seeing 20μs spikes of variance, and there's a good chance they come from the Operating System. We could continue down the OS tuning route now - it's definitely something we'll have to do eventually - but it would be a big time investment, and right now we're still experimenting with architecture and design. Finally, our Networking team want their spare ASIC firewall back, so I'm losing my dedicated PTP firewall. The PTP Bridge generally keeps a baseline of 100ns accuracy (excluding jitter), so perhaps it's time to look at switching-only designs.

In the very first post I mused about a design that didn't require a firewall, something that involved connecting the PTP Bridge to multiple separate management networks and having it send multiple independent PTP multicast streams. The idea had security implications - the PTP Bridge would join networks that don't normally touch, networks that are usually protected from each other by a firewall. With careful attention to detail, these implications can be mitigated.

Alongside this, I've got new options in my choice of software. Solarflare are interested in us replacing the PTP Bridge component with a purely Solarflare approach - I'm told there is an undocumented feature that allows sfptpd to work in UDP unicast mode, so it would be able to talk directly to our S300 Grand Master. There have also been some advances in PTPd that make it a viable replacement for linuxptp on the PTP Bridge, as well as for sfptpd on the PTP clients.

In the next post we'll remove the PTP firewall from the mix and look at purely switched designs.
