Netatmo triggers not operational
Incident Report for Stringify

When you connect something to Stringify, you type your login and password into the third party's own web page, shown in a web browser inside our app. That way, we have no visibility into your credentials. When you do that, they send us a "code" that is good for only a few seconds. We exchange that code, along with our "partner ID & password", and they give us an "access token" and a "refresh token." This is the OAuth 2.0 authorization process.
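As a rough illustration of the code exchange described above, here is a minimal sketch of the form body a standard OAuth 2.0 authorization-code token request carries. It is not our actual implementation: the function name, the example values, and the choice to send the partner credentials in the body (some providers expect them in an HTTP Basic header instead) are all assumptions.

```python
from urllib.parse import urlencode, parse_qs


def build_code_exchange_body(code, client_id, client_secret, redirect_uri):
    """Build the form-encoded body for an authorization-code token request.

    client_id / client_secret stand in for the "partner ID & password".
    """
    return urlencode({
        "grant_type": "authorization_code",
        "code": code,  # the short-lived code sent back after login
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    })


# Hypothetical values, for illustration only.
body = build_code_exchange_body(
    "abc123", "partner-id", "partner-secret", "https://example.com/callback"
)
fields = parse_qs(body)
```

The body would be POSTed to the provider's token endpoint, and the JSON reply carries the access token and refresh token.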

Most services allow you to revoke the token that they have provided to Stringify at any time on their web page. In the event that your phone is stolen or you feel your Stringify account has been compromised, it's a good idea to do this. Of course if this happens, you can always contact us and we'll be happy to delete all third-party credentials from your account immediately and provide you with detailed logs to make sure your home remains safe.

This access token is what we use to control your devices, and it has an expiration date/time. Every company has a different expiration date/time on access tokens. For Netatmo, it's 3 hours after issue. That's fairly short, but others are shorter (Google: 30 minutes; Honeywell: 1 hour).

When it expires, we have to reach out to the third party service and give them the refresh token along with our "partner ID & password", and then they give us a new access token and a new refresh token - only good for the same amount of time as the original.
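The refresh step can be sketched the same way. This is an assumption-laden example rather than our real code: the function names and the stored-record layout are hypothetical, and it assumes the provider returns the standard `expires_in` field (a lifetime in seconds), which is how most OAuth 2.0 services report expiry.

```python
import json
from urllib.parse import urlencode


def build_refresh_body(refresh_token, client_id, client_secret):
    """Form-encoded body for a standard OAuth 2.0 refresh-token request."""
    return urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    })


def parse_token_response(raw_json, issued_at):
    """Turn the provider's JSON reply into a record worth saving.

    issued_at is a Unix timestamp; expires_at is derived from expires_in.
    """
    data = json.loads(raw_json)
    return {
        "access_token": data["access_token"],
        "refresh_token": data.get("refresh_token"),
        "expires_at": issued_at + data["expires_in"],
    }


# A 3-hour token (10800 seconds), issued at a made-up timestamp of 1000.
record = parse_token_response(
    '{"access_token": "new-a", "refresh_token": "new-r", "expires_in": 10800}',
    issued_at=1000,
)
```

Here `record["expires_at"]` comes out as 11800, the absolute time after which the access token should be treated as expired.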

When these third parties give us the access token & refresh token, they also tell us when it will expire. We save this information in our database but typically don't bother looking at it. Each of our third party integrations uses the same "Thing SDK." This SDK exposes a function to make outgoing HTTP requests and provides a "test" to see if the token has expired. Some third parties do a good job of responding with a "proper" HTTP response code, while others give us a more cryptic response in the body of the reply. This is why each "Thing integration" has to provide its own test - because lots of third parties don't follow the standards (for what it's worth, Netatmo does).
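To make the per-integration "test" concrete, here is a small sketch of two such tests. The `Response` type and both function names are hypothetical stand-ins, not the real Thing SDK interface, and the "cryptic" provider is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    body: str


def standard_token_expired(resp):
    """For well-behaved providers: an expired token yields HTTP 401."""
    return resp.status == 401


def body_token_expired(resp):
    """For a hypothetical provider that replies 200 and hides the error in the body."""
    return "token expired" in resp.body.lower()


ok = Response(200, '{"ok": true}')
unauthorized = Response(401, "")
cryptic = Response(200, '{"error": "Token Expired"}')
```

With these, `standard_token_expired(unauthorized)` and `body_token_expired(cryptic)` are both true, while the status-code test alone would miss the cryptic reply.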

Some third parties send us events; for others, we have to poll. Netatmo sends us events for their Welcome camera, but for the Weather stations, we poll for data every 10 minutes. Our Netatmo integration tries to do a poll, and if it fails the authorization test, it sends a message to a microservice telling it to exchange the refresh token for a new access token. Once it has that new access token, it retries the request. If the request fails again, you'll receive an email and will see a (!) next to the Thing in your Stringify library requiring you to re-authenticate.
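The poll, test, refresh, and retry sequence above can be sketched as follows. All names are hypothetical, and the callables stand in for the HTTP request, the integration's test, the token microservice, and the email/(!) notification.

```python
def poll_with_reactive_refresh(token, fetch, is_expired, refresh, flag_reauth):
    """Try the request first; refresh and retry once only if the test fails."""
    resp = fetch(token["access_token"])
    if not is_expired(resp):
        return resp
    # The first request was rejected: swap the refresh token for new tokens.
    token.update(refresh(token["refresh_token"]))
    resp = fetch(token["access_token"])
    if is_expired(resp):
        flag_reauth()  # the user has to re-authenticate
        return None
    return resp


calls = []


def fake_fetch(access_token):
    calls.append(access_token)
    return 200 if access_token == "fresh" else 401


token = {"access_token": "stale", "refresh_token": "r1"}
result = poll_with_reactive_refresh(
    token,
    fetch=fake_fetch,
    is_expired=lambda status: status == 401,
    refresh=lambda rt: {"access_token": "fresh", "refresh_token": "r2"},
    flag_reauth=lambda: calls.append("reauth"),
)
```

Note that every expired token costs one rejected request (the `"stale"` call) before the refresh happens.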

It's because each service is different that we don't proactively "refresh the tokens." Even if a service expires a token every 30 minutes, we might not need data that often. For example, if we only write to a Google spreadsheet once a day, it would introduce unnecessary load on our API servers to refresh the token every 30 minutes.

Netatmo's policy is to block IP addresses that make requests that result in X number of "unauthorized" responses in a specific time period. Because their tokens only last 3 hours, and we poll every 10 minutes, we reached "critical mass" where enough tokens expired every 10 minutes that we hit that limit nearly every polling cycle. Once we hit it, we can't even refresh the token and retry with the new one because they've blocked us for 24 hours!
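A back-of-the-envelope calculation shows why this scales with the number of connected accounts. A 3-hour token spans 18 ten-minute polling cycles, so if expirations are spread evenly, roughly 1 in 18 tokens is found expired on any given cycle. The sketch below assumes one token per account and an even spread; the account count is made up, since neither our real numbers nor Netatmo's threshold X are stated here.

```python
def unauthorized_per_cycle(accounts, token_lifetime_min=180, poll_interval_min=10):
    """Estimate rejected requests per polling cycle under reactive refresh."""
    cycles_per_token = token_lifetime_min / poll_interval_min  # 180 / 10 = 18
    return accounts / cycles_per_token


# Hypothetical: 1800 connected accounts.
estimate = unauthorized_per_cycle(1800)
```

Under those assumptions, 1800 accounts would produce about 100 "unauthorized" responses every 10 minutes, all from the same IP addresses.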

Every company's security policy is different and is typically in place for a reason, so while we ask for an exception, we don't expect one and must respect the policy. Netatmo refused our request for an exception and asked us to proactively refresh tokens based on their expiration date.

I thought a lot about how to handle this today. I decided to make a change that we could apply, if we choose, to any third party integration whose provider has a policy like this one. That involved changing our "Thing SDK" to give each integration visibility into the token expiration date. Typically, we want individual Thing integrations to be 'as dumb as possible' and know as little as possible; this is part of our security model. For example, a Thing has no idea what it is connected to in a flow. It simply knows what triggers it needs to send and what actions it needs to perform. Even the access tokens and refresh tokens are stored and transmitted to each Thing integration encrypted, and each of our environments (development, staging, production) uses different encryption keys.

Tonight I tested my new changes, and they seem to work. For Netatmo, instead of making a request, testing the response, and requesting a new token if the test fails, we will first look at the token expiration date. If it has expired, we will request a new token prior to making the first request. I pushed this change into production. I think our 24 hour 'ban' expires sometime in the middle of the night California time, so I'll double-check in the morning that everything is working as expected. If not, I promise I'll be all over it! :)
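For the curious, the new check-the-expiration-first behavior can be sketched like this. The names are hypothetical and this is not the code that shipped; the `skew_seconds` margin in particular is an optional extra assumed here, not something described above.

```python
def poll_with_proactive_refresh(token, now, fetch, refresh, skew_seconds=0):
    """Refresh before the first request if the stored expiry has passed."""
    # skew_seconds (an assumption) would refresh slightly early to avoid races.
    if now >= token["expires_at"] - skew_seconds:
        token.update(refresh(token["refresh_token"]))
    return fetch(token["access_token"])


seen = []
token = {"access_token": "stale", "refresh_token": "r1", "expires_at": 100}
status = poll_with_proactive_refresh(
    token,
    now=200,  # past expires_at, so the token is refreshed first
    fetch=lambda access_token: seen.append(access_token) or 200,
    refresh=lambda rt: {
        "access_token": "fresh",
        "refresh_token": "r2",
        "expires_at": 200 + 10800,
    },
)
```

The stale token is never sent, so an expired token no longer produces an "unauthorized" response.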

I'd like to say that Netatmo has been very good at responding to our inquiry and explaining / confirming the issue - some of that is thanks to you! I received prompt replies to my initial email to them, and then from their head of API/Partner development after he heard from some of you.

Thank you all for your patience during this outage. I hope this gives you an idea of what it takes to "Connect Every Thing!"

-Kris

Posted about 1 year ago. Mar 24, 2017 - 05:12 PDT

Resolved
This incident has been resolved.
Posted about 1 year ago. Mar 24, 2017 - 05:11 PDT
Identified
Netatmo has blocked traffic from Stringify. We have contacted them and are awaiting a reply.
Posted about 1 year ago. Mar 21, 2017 - 08:43 PDT