Fixing out of sync server times


In My server times are out of sync! I showed a formula to smooth the time differences between multiple servers when working a collaborative environment, but I started not to believe the results so set out to validate them.

Development environment


I use cloud9 for development and very nice it is too. However, I'm working on push notification for Ephemeral Exchange where multiple servers are involved in writing and monitoring changes, and I've discovered that their server times are very poorly synchronized. 

For development, my push and api servers are on cloud9 and the redis servers are on Google compute engine in a docker container. For production the api server is on Google App engine flexible environment docker container, and the push and redis servers are on Google compute engine, also dockerized.


The problem

All of these components have different ideas of what the time is. I use the API server as master of all times - so when an item is created the creation time is that of the api server. The push server gets to know about changes by subscribing to a redis channel which publishes notifications when it detects a change. 

A client subscribes to push notification via the API. If the api server is not at the same time as the push server then notifications coming from redis might not be selected because  if the push server's time is ahead of the api server's time, it could look like they were made earlier than the subscription actually happened. 

On Cloud 9, I found there was a 24 second discrepancy between the push and api server. Here's the time on two servers checked at more or less the same time (I had to swap between tabs which took a few seconds).

Push server
$ date
Sun May  7 08:52:47 UTC 2017

Api server
$ date
Sun May  7 08:53:05 UTC 2017

A solution

The most accurate way would have been to include a timestamp on each modification in each item, and read it from redis on each change notification. That would have worked, except that 
  • Because I am monitoring deletes and expirations as well as updates, the items would have been deleted by the time the notification came.
  • Some Redis operations used such as ZADD don't give the opportunity to add a timestamp.
  • I would not be able to monitor changes made by means other than the API.
Here's what I came up with in the end

  1. The push server publish periodic requests for a time synch using a redis channel.
  2. The api server subscribes to that channel and reacts to messages by writing an item to redis
  3. The push server is subscribed to redis changes, and receives a notification when a time sync item is written
This gives me the following information

Using the push server clock
  • now  - when it has received a push notification for redis
  • requestAt - when the push server asked for a timesynch check
  • now - requestAt - elapsed time for the entire process 
Using the api clock
  • writtenAt - when the api server received the request.
Deducing
  • If the clocks were at the same time, then writtenAt would be, requestAt + the network delay (the maximum of which will be approximately (now - requestAt) / 2
  • The clock time error should be writtenAt - now - (now - requestAt)/2

Let's plug in some observed numbers, (which actually came out extremely consistently)
nowrequestAtwrittenAtnow-requestAtwrittenAt-nowoff bynow on api server
14941487991331494148798941149414882319419224061239651494148823098
14941488092361494148809045149414883329719124061239661494148833202
14941488193361494148819145149414884339719124061239661494148843302
14941488294331494148829242149414885349419124061239661494148853399
14941488395341494148839343149414886359519124061239661494148863500
14941488497621494148849445149414887382331724061239031494148873665
14941488598641494148859673149414888392519124061239661494148883830
14941488699681494148869777149414889402919124061239661494148893934
14941488800721494148879881149414890413319124061239661494148904038
14941488901771494148889985149414891423719224060239641494148914141
14941489002791494148900089149414892434119024062239671494148924246
14941489103821494148910191149414893444319124061239661494148934348
14941489204831494148920293149414894454519024062239671494148944450
14941489305941494148930397149414895464919724055239571494148954551
14941489406921494148940501149414896475319124061239661494148964658
14941489507951494148950605149414897485719024062239671494148974762
14941489608991494148960709149414898496119024062239671494148984866
14941489710041494148970812149414899506519224061239651494148994969
14941489811081494148980917149414900516919124061239661494149005074
14941489912141494148991023149414901527419124060239651494149015179
14941490013141494149001123149414902537619124062239671494149025281
14941490114191494149011229149414903548019024061239661494149035385
14941490215251494149021334149414904558619124061239661494149045491
14941490316301494149031439149414905569119124061239661494149055596
14941490417311494149041540149414906579219124061239661494149065697
14941490518281494149051637149414907588919124061239661494149075794
14941490619351494149061743149414908599519224060239641494149085899
14941490720341494149071843149414909609519124061239661494149096000
14941490821361494149081945149414910619719124061239661494149106102
14941490922411494149092049149414911630219224061239651494149116206
14941491023451494149102154149414912640619124061239661494149126311
14941491124451494149112254149414913650619124061239661494149136411
14941491225451494149122354149414914660619124061239661494149146511
14941491326491494149132458149414915671019124061239661494149156615
14941491427531494149142562149414916681419124061239661494149166719
14941491528561494149152665149414917691719124061239661494149176822
14941491629571494149162766149414918701819124061239661494149186923
14941491730621494149172871149414919712319124061239661494149197028
So this tells me the the api server is running 24.2 secs ahead of the push server - compared that with my manual observation from earlier, and adjusting for the few seconds it took me to change tabs between them that looks about right.

Push server
$ date
Sun May  7 08:52:47 UTC 2017

Api server
$ date
Sun May  7 08:53:05 UTC 2017

Using that offset

Now I know that each event I log on the push server typically happens (API server time) about 24 seconds from now (Push server time), so I simply log the event with the adjusted time. I'll keep that timeoffset up to date by applying the formula from My server times are out of sync! to smooth it out and all should be well. 

Does it work ?

Here's some observations comparing the real creation time stamped in the record, and the estimated creation time using this method - typically less that 20ms out - so that'll do just fine.
createdestimatevalues - estimate
1494235792603149423579262118
1494235792596149423579261923
1494235792608149423579262416
1494235792604149423579262218
1494235792613149423579262613
1494235792614149423579262713
1494235792616149423579262812
1494235792612149423579262513
14942357926941494235792692-2
1494235792707149423579271811
1494235792713149423579272512
1494235792705149423579271712
1494235792698149423579271315
1494235792696149423579271216
14942357927731494235792772-1
1494235792708149423579271911
149423579282814942357928324
1494235792711149423579272413
14942357928801494235792879-1

The code

On the push server. This will schedule a second immediately after the first one, because I plan to ignore the very first one since there is always an initialization time associated with a first publishing.
  ns.synchTime = function() {

    // I'm using recursive setTimout rather than setInterval
    function sched () {
      pub()
        .then(function() {
          return Useful.handyTimer(ns.settings.redisSyncFrequency);
        })        
        .then(function() {
          sched();
        });
    }
    
    // publish a timesync request    
    function pub() {
      return redisSyncPub_.publish(ns.settings.redisSyncChannel, JSON.stringify({
        requestAt: new Date().getTime(),
        syncId: syncId,
        expire:ns.settings.redisSyncFrequency * 2
      }));
    }
    
    // the first one will generate misleading initialization times, so it'll be ignored
    pub().then(function () { sched() });

  };
  

And listen for events, simply by extending the current listening algo to include synch items ( I won't go into the detail of this here as it's covered elsewhere).
         else if (db === st.db.ts && method === "set" && message.slice(0, st.syncPrefix.length) === st.syncPrefix) {
            ns.synchTimeUpdate (message);
          }

Here's the smoothing of each observation. Note that it discards the first one (because of the initialization anomaly), but it also uses a smoothed latency for its calculation and discards observations with extreme or unusual latencies to avoid skewing by network blips. I'm finding that the typical redis write to keyspace notification latency in this configuration (redis in a docker container on google compute engine, push and api servers on cloud 9) is about 190ms.
   ns.synchTimeUpdate = function (message) {
    
    var now = new Date().getTime();
    
    // will look something like this '__keyevent@6__:set sy-.1bfgvorjj.1494145143663'
    var s = message.split (".");
    var sId = s[1];
    
    // check this is for me
    if (sId !== syncId) {
      console.log ("received a synch not for me", sId , syncId);
      return Promise.resolve(null);
    }
    
    // now get the item so the times can be processed
    return redisSync_.get (message)
      .then (function (res) { 
        // convert to an object
        var ob = Useful.obify (res).ob;

        // end to end time
        var latency = (now - ob.requestAt);
        
        // smooth the observations
        
        // ignore completely outlying latency as it could be a network blip
        if (latency < ns.settings.redisSyncNetWorkOutlier && (!timeLatency || latency < timeLatency * 10) ) {
        
          // api server is golden for time,

          // ignore the first one because it'll be misleading since it probably did some initialization
          if (timeNobs) {
            timeLatency = timeNobs > 1 ? latency * smooth + timeLatency * (1 - smooth) : latency;
            var offBy = ob.writtenAt - now - timeLatency/2; 
            timeOffset = timeNobs > 1 ? offBy * smooth + timeOffset * (1 - smooth) : offBy;
          }

          timeNobs++;
        }
        
      });
    
    
  };

and the smoothing in action with a few observations. I could probably make the smoothing  a little more reactive to more recent numbers by increasing it, but these are very stable numbers, so I'll leave it for now.
nowerrorsmoothed
14941615723582425924259
14941615824592415424248
14941615925652415424239
14941616026692415424230
14941616127722415424223
14941616228692415524216
14941616329732415424210
14941616430712415424204
14941616531752415424199
14941616632802415424194
14941616733832415424190
14941616834882415424187
14941616935852415424183
14941617036902415424180
14941617137982415324178
14941617239012415524175
14941617340032415424173
14941617441082415224171
14941617542092415524169

On the API server
  ns.synchInit = function () {

  // the push server will ask me to connect 
    return redisSyncSub_.subscribe(ns.settings.redisSyncChannel)
      .then(function(count) {
        if (!count) console.log ('failed to subscribe to ', ns.settings.redisSyncChannel);
        // every time I get one of these, I'll write something
        return redisSyncSub_.on('message', function(channel, message) {
          var ob = Useful.obify (message).ob;
          ob.writtenAt = new Date().getTime();
          return redisSync_.set ([ns.settings.syncPrefix,ob.syncId,ob.writtenAt].join("."), JSON.stringify(ob),"EX", ob.expire)
          .then (function (r) {
            if (!r)console.log ("failed to write sync item",r,ob.syncId);
          });
        });
      });
  };
      

Using the offset

The push server makes an entry in a log file each time a change notification is identified, with a timestamp, and this log file is used to feed push notification subscriptions. These timestamps are used as the official event time, so they need to be modified from the push server time, to the official API server time by adding the latest timeOffset that is continually self adjusting.

 ns.logEvent(method, message, "redis", now + timeOffset );







For more like this, see React, redux, redis, material-UI and firebase. Why not join our forum, follow the blog or follow me on twitter to ensure you get updates when they are available.
Comments