Monitoring Raspberry Pi Devices Using Telegraf, InfluxDB and Grafana
blog.thecloudside.com

If I have a regret in my observability stack, I think it's got to be InfluxDB.
I bought into the TICK stack and planned on using an enterprise support contract when going to production, but every interaction with InfluxData the company has felt a bit sleazy: pushing very hard toward the cloud offering, for example.
That's bad enough, but the documentation and observability of the database itself are quite poor, and it's trivially easy to "vanish" all your data and lock your instance up for hours or days by changing the retention policy of a database (even without changing it much).
Now of course it's not really TICK at all. More like "TI", as Kapacitor and Chronograf (alerting and dashboarding, respectively) are deprecated products that have been rolled into the main offering.

Added to that, they completely changed the query language.
I have to say: pick something better if you can. TimescaleDB and Prometheus (which has its own purpose-built storage engine) are promising.
I looked into TimescaleDB, but didn't find a lot of support for monitoring agents that push data into Postgres. TimescaleDB is built on Postgres and uses the same mechanisms for ingesting data.
There's a plugin for Telegraf that looks promising, but it hasn't been merged yet.
Is anyone else using TimescaleDB? If so, what do you use to push monitoring data to it?
I recently built a janky system which runs nmon (plus a few custom RPi stats) every two minutes or so and pushes the output file to a watch folder. Another service uploads it to the server when the network is available, and the server then ingests it into Timescale. It has been running on two RPis and a few AWS servers for the past few months.
Edit: I'm using Grafana but was considering checking out Apache Superset.
How is the data sent into timescale? Do you run psql to load the data?
Python.
The solution I came up with, but haven't implemented yet, is to use the collectd MQTT output plugin to get the data onto my broker (I use MQTT for other purposes, many of which should also end up in Timescale), and then an MQTT-to-Postgres/TimescaleDB bridge (a sketch of which is below).
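A minimal sketch of what such a bridge might look like, assuming paho-mqtt (1.x API) and psycopg2, a hypothetical `metrics/#` topic carrying JSON payloads, and a matching table; all names here are made up, and a real bridge would batch inserts and handle reconnects:

```python
# mqtt_to_timescale.py -- hypothetical MQTT -> TimescaleDB bridge.
# Assumes JSON payloads like {"host": "rpi1", "metric": "cpu_temp", "value": 51.2}
# and a table created with:
#   CREATE TABLE metrics (time TIMESTAMPTZ NOT NULL, host TEXT, metric TEXT, value DOUBLE PRECISION);
#   SELECT create_hypertable('metrics', 'time');
import json

import paho.mqtt.client as mqtt
import psycopg2

conn = psycopg2.connect("dbname=telemetry user=ingest")
conn.autocommit = True  # one insert per message; batch these in production

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO metrics (time, host, metric, value) VALUES (now(), %s, %s, %s)",
            (payload["host"], payload["metric"], payload["value"]),
        )

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("metrics/#")
client.loop_forever()
```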
Yes, I can see how that could work. I may do something similar where I send my data to one system that can then forward it into Postgres/TimescaleDB.
Good luck waiting for the Telegraf merge! I was watching a PR for another TSDB for two years before I switched jobs and stopped caring. I believe the PR is still open.
I used 1.x for my push-monitoring stack at my last job. (For cases where "pull" is practical, I would always use Prometheus. Prometheus also has "push" now, by the way.) They went into 2.0 mode and kind of neglected 1.x, and I kind of forgot about it.

At the time, I was most familiar with an internal monitoring system at Google, and I found I couldn't do queries that I expected to be able to do. I even mentioned it on HN, and some Influx folks told me that what I wanted to do was too weird to support. (It's not. I was collecting byte counters from fiber CPEs and wanted bandwidth charts based on topology tags I stored with the data -- imagine a SQL table like (serial_number text not null, time timestamp not null, locality text not null, bytes_sent bigint not null, bytes_received bigint not null). The problem was that timestamps would not be aligned between records in the same locality group -- I sampled these occasionally throughout the day, not all at the same instant. And they were counters, not deltas, so the query would have to compute the delta for each serial number and then aggregate across all devices in a locality. Very possible to do -- I literally had that chart in the other monitoring system -- but not possible with Influx v1 querying, as far as I could tell.)
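For what it's worth, the delta-then-aggregate query described here is a few lines of standard SQL with window functions. A sketch against that hypothetical table (the table name `cpe` is assumed):

```sql
-- Per-device byte deltas, then hourly per-locality totals.
WITH deltas AS (
  SELECT
    locality,
    time,
    bytes_sent - lag(bytes_sent)
      OVER (PARTITION BY serial_number ORDER BY time) AS sent_delta
  FROM cpe
)
SELECT
  locality,
  date_trunc('hour', time) AS hour,
  sum(sent_delta) AS bytes_sent
FROM deltas
WHERE sent_delta >= 0  -- ignore counter resets
GROUP BY locality, hour
ORDER BY hour, locality;
```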
I set up 2.x for myself recently, and they have really done a lot of work. The OSS offering has most of the features that cloud/enterprise would have. It was easy to set up -- they don't have any instructions for installing it in Kubernetes, and haven't updated their Helm charts for 2.x, but it was like 3 minutes to write a manifest (https://github.com/jrockway/jrock.us/tree/master/production/...) myself, which I prefer 99.9% of the time anyway.

The new query language is incredibly verbose, but I can see the steps I remember having with Google's internal system -- align, delta, aggregate -- all possible. (I had to scratch my head a lot, though, to make it work. And I really am not able to reason about what operations it's doing, what's indexed or not indexed, why I ingest my data as rows but process it as columns, etc.) The performance is good, and it worked well for my use case of pushing data from my Intranet of Stuff. Generally I like it and I don't think they are being shady in any way. It's on my list of things to set up at work to collect various pieces of time series data outside of the Prometheus ecosystem (CI runtimes, etc.).
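For illustration, an align/delta/aggregate pipeline in Flux might look something like the following; the bucket, measurement, field, and tag names are invented:

```flux
from(bucket: "cpe")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "traffic" and r._field == "bytes_sent")
  |> derivative(unit: 1s, nonNegative: true)  // counters -> per-second rates, per series
  |> group(columns: ["locality"])             // regroup across devices by tag
  |> aggregateWindow(every: 5m, fn: sum)      // align samples into common windows
```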
The reason I picked InfluxDB over TimescaleDB for my personal stuff is that InfluxDB has an HTTP API with built-in authentication. I already have a ton of HTTP services exposed to the Internet, and I understand them well. (Yup, I have SSO and rate limiting and all that stuff for my personal projects ;) I can give each of my devices an API key from the web interface, and I make an HTTP request to write data. Very simple.

(They have a client library, but honestly my main target is a BeagleBone, and it doesn't have enough memory to compile their client library. I've never seen "go build" run out of memory, but their client makes that happen. I shouldn't develop on my IoT device, of course, but it's just easier because it has Emacs and gopls, and all the sensors connected to the right bus. It was easier to just make the API calls manually than to cross-compile on my workstation and push the release build to the actual device.)

TimescaleDB doesn't have that, because it's just Postgres. So I'd basically have to expose port 5432 to the world, create Postgres users for every device, generate passwords, store them somewhere, etc. Then to ingest data, I'd connect to the database, tune my connection pool, retry failed requests manually, etc. Using HTTP gets me all of that for free; I can just configure retries in Envoy.
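For reference, a write against the InfluxDB 2.x HTTP API is just an authenticated POST of line protocol. A sketch using the requests library, with placeholder URL, org, bucket, and token values:

```python
import time

import requests

INFLUX_URL = "https://influx.example.com"  # placeholder
TOKEN = "my-device-api-token"              # placeholder; per-device token from the UI
ORG = "home"
BUCKET = "sensors"

# One point in line protocol: measurement,tags fields timestamp
line = f"air,room=office temperature=21.4 {time.time_ns()}"

resp = requests.post(
    f"{INFLUX_URL}/api/v2/write",
    params={"org": ORG, "bucket": BUCKET, "precision": "ns"},
    headers={"Authorization": f"Token {TOKEN}"},
    data=line,
    timeout=5,
)
resp.raise_for_status()  # 204 No Content on success
```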
But... SQL queries are a lot easier to figure out than Flux queries, and I already have good tools for manipulating raw data in Postgres (DataGrip is my preferred method), so I think I will likely be revisiting TimescaleDB. Honestly, I'd pay for a managed offering right now if they had a button in the Google Cloud Console that said "Create Instance, and by the way this just gets added to your GCP bill for 10% more than a normal Cloud SQL instance".
It is strange that there isn't more overlap between tech-industry monitoring/metrics products and industrial historians and HMI products. OSIsoft was purchased by AVEVA/Schneider for $5 billion despite them already owning Citect, Wonderware, and probably six other historian products.
The industrial historians solve the same problems: collect data at nodes that might have intermittent connectivity, send it to a centralized server/service that can handle lots of data, and allow users to plot it.
I wonder if we'll start to see more open source monitoring on the factory floor. While it may be easy for an open source product to work as well as the industrial offerings, perhaps the commercial vendors' value is in the long-term support (usually close to a decade) and supported upgrade paths.
At my work (a university) we set up a TIG stack to monitor IT systems, and then one of our facilities management people saw it and we got chatting. We have just set up a POC to pipe metrics out of multiple proprietary building management systems into a single Grafana dashboard.
We've also updated all our tender documents for future projects to include a requirement that we can query metrics and logs through an API or direct DB access.
A recent project I worked on identified over 500 applications whose only purpose is to provide monitoring for a bespoke system or tool. This isn't uncommon at a university, as different faculties and departments will buy "the best tool for XYZ" without ever asking IT if perhaps there is a tool we already have that is almost as good.
Yes, agreed. This is basically what we do at my work: plot stuff from PLCs onto Grafana, while coping with intermittent connectivity. I'd love to buy a ready-made product that could do it, but there's literally nothing out there that quite comes close.
As well as the things you identified, I suspect that there's just a lot of mistrust of open source in the industrial world - there's that whole thing of perceived value being directly proportional to product cost, plus commercial vendors also tend to at least offer training and tech support, even if they're not always the most helpful.
A roll-your-own open source approach works if you have a good low-level understanding of what you are listening to and can organize your own alarm thresholds, groupings, etc. Dealing with a large number of Modbus/BACnet endpoints with vendor-specific device profiles and PLC logic means the vendors' backend systems will handle events and trend patterns better than what most facility teams can put together. For plain history retention, though, doing it yourself gives you more flexibility; I do agree with that.
There are open source projects like EdgeX and Fledge that can do this also
Wonderware Historian, Inductive Automation Ignition, and OSIsoft PI can all buffer data at a collection or edge node that collects from a PLC and then forward it to the central historian database when connectivity is restored. Grafana might have a better web UI, but they all offer some client for plotting.
I am curious: in what regard does nothing come close to Grafana? I am currently paying a lot for Citect and Wonderware support across a couple dozen facilities.
We're a bit of a weird case in that we log data at 20Hz and 100Hz for prototypes, so we need custom queries and/or aggregation to reduce that down for display. We had a good chat with Inductive earlier this year, but even they said that might be a bit much for them. Other than that, though, Ignition is literally everything we want in a single product.
You are logging data into Grafana at 100 Hz from a PLC? That's pretty quick for a PLC; I'd expect you'd need either a tiny program or a fast periodic task to generate new samples every 10 ms, although I'm sure there are faster PLCs out there than the M340.
Hehe, yeah. We run B&R PLCs, which have fairly beefy Atom processors under the hood.
Telegraf can do this too, in a more limited capacity. It keeps an in-memory buffer of collected data points, so if a write to an output fails it'll try again on the next interval without losing the data it has already collected. (The buffer size is configurable; see the snippet below.)
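That buffer is controlled from the agent section of the Telegraf config; a minimal excerpt (the values are illustrative, not recommendations):

```toml
# telegraf.conf (excerpt)
[agent]
  interval = "10s"             # how often inputs are gathered
  flush_interval = "10s"       # how often outputs are written
  metric_buffer_limit = 50000  # points held in memory while an output is
                               # unreachable; oldest points drop past the limit
```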
For some use cases you wouldn't even need Grafana; you could use the graphing/visualization and alerting functionality built into InfluxDB 2.0.
That said, Grafana is a more mature product. I can't fault anyone for using InfluxDB just as a time series database and using Grafana for visualization and alerting.
I've had a basic InfluxDB/Grafana setup going for a year or so. Despite having worked with it for a while, I'm always left with the feeling that I'm missing something.
The SQL-like language is similar enough to SQL that it's confusing. And I'm still not sure why a dedicated time series database is supposed to be better than using standard MySQL/Postgres with inserts. At least then you get a wider range of options for data types, indexing, and querying.
I’m likely still not in the correct mindset but not sure what I’m missing.
> And I'm still not sure why a dedicated time series database is supposed to be better than using standard MySQL/Postgres with inserts
It's optimized for time series data, so queries and inserts may be faster and the storage requirements lower. Other than that, Postgres can probably do everything Influx can do, and more.
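TimescaleDB is the usual way to get some of both: plain Postgres semantics, plus time-based partitioning and time-oriented helpers. A sketch, with made-up table and column names:

```sql
-- An ordinary Postgres table...
CREATE TABLE conditions (
  time  TIMESTAMPTZ NOT NULL,
  host  TEXT        NOT NULL,
  value DOUBLE PRECISION
);

-- ...turned into a time-partitioned hypertable (TimescaleDB extension).
SELECT create_hypertable('conditions', 'time');

-- Regular SQL still works; time_bucket() gives time-aligned aggregates.
SELECT time_bucket('5 minutes', time) AS bucket,
       host,
       avg(value)
FROM conditions
GROUP BY bucket, host
ORDER BY bucket;
```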
Telegraf is an awesome agent, and it also pairs really nicely with Prometheus as a TSDB; then you can put Grafana on top.
Maybe not in this specific case, but in general Prometheus is my preferred TSDB sitting between Telegraf and Grafana (see the config sketch below).
If you need further scale-out there are options for federating Prometheus instances as well.
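One common way to wire Telegraf into Prometheus is the prometheus_client output plugin, which exposes everything Telegraf collects as a scrape target; a minimal excerpt (9273 is the plugin's conventional port):

```toml
# telegraf.conf (excerpt) -- expose collected metrics for Prometheus to scrape
[[outputs.prometheus_client]]
  listen = ":9273"  # Prometheus then scrapes http://<host>:9273/metrics
```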
Telegraf really is a swiss army tool for metrics. It has an insane number of plugins: inputs, outputs, aggregators, etc. I used it heavily at a previous company, before Prometheus was an option.
Trying to get a change merged into the codebase is a nightmare, though, especially if it's in a plugin that isn't a money earner for InfluxData.
InfluxData doesn't make money from Telegraf at all. It fills a need (data collection) that InfluxDB users have, so it helps them which helps us, but it also helps our competitors whose users also use Telegraf to collect data.
The reason merges have been a problem is that historically there were only a couple of people involved in the project doing all of the code reviews, and new plugins are usually large chunks of new code that interact with products or protocols those reviewers aren't familiar with, so there is a steep learning curve just to properly review code contributions.
Last year we formed a new maintainers team, a mix of InfluxData staff and community contributors working together to review and land code changes. This has significantly increased the rate at which PRs are getting through, but there's still a very large backlog, plus new stuff coming in all the time.
Anybody who wants to see new code and plugins land in Telegraf faster can ask to join the maintainers team. You'll need to be familiar with the codebase and willing to work on any plugins or functionality that come in (a lot of plugins come from people building things for their own job/product).
I'm intrigued: why would what looks like server hardware monitoring be needed in a vehicle?
The SaaS version of this would be to use Telegraf but send the data to Sumo Logic, Datadog, or another observability vendor; then you don't need to host InfluxDB and Grafana yourself. I would also consider gathering logs and traces to troubleshoot problems. That's straightforward with the top-tier observability vendors, and harder to do on your own.

Disclaimer: I'm an employee of Sumo Logic.
Note that you can gather and display logs with this same stack. Telegraf includes a plugin to consume syslog output (https://github.com/influxdata/telegraf/blob/release-1.14/plu...), and then you can do something like this in Grafana: https://grafana.com/api/dashboards/12433/images/9004/image
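A minimal configuration for that input might look like this (the listen address is an assumption):

```toml
# telegraf.conf (excerpt) -- listen for RFC 5424 syslog messages
[[inputs.syslog]]
  server = "tcp://:6514"  # also supports udp:// and unixgram:// listeners
```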
The one thing I would add to this guide is enabling HTTPS for the whole stack if you are transmitting over the public internet. Fortunately, that is quite straightforward (and free) with Let's Encrypt.