GreenPowerMonitor efficient and scalable big data architecture

GreenPowerMonitor designs an efficient and scalable architecture to process and store time-series data

GreenPowerMonitor started its activity in the development of monitoring solutions in 2007. Since then, the number of monitored plants and variables has raised year after year. As of early 2016, the volume of data that we store reached a trillion (1×1012) rows.

Three facts that illustrate this upward trend:

  • 400 million (4×108) of new rows per day.
  • More than 100,000 million (1×1011) of rows received in 2015, meaning a 30% growth compared to 2014.
  • In 2016, according to conservative estimates, this figure will raise until the 150,000 million (1,5×1011) of rows. A 50% increase compared to 2015.

As seen in the chart below, the number of the data we process and store every day has grown exponentially. Sustaining this increase without compromising the performance of our applications has been a tough challenge for our Dev & Ops teams, who have had to make a tremendous effort to squeeze every CPU cycle and each byte on disk.

Grafico

Evolution of the annual received data volumes. 2016 value is estimated by projecting the average daily reception rate from January and February of 2016.

 

The model we have used to manage the data up to now, developed in the beginning of the company, was based on a relational database management system, and presented significant drawbacks that made the system difficult to scale. Because of that, in GreenPowerMonitor, we decided to carry out a project with the objective to develop and deploy a new architecture that can scale in a coherent way with our current and expected rate of growth for the next years.

In an initial stage of the project we identified all the requirements of the architecture. These include: minimize the latency in read and write operations, allow the ubiquitous execution of all the services, guarantee the high availability of all the elements and provide the system with enough capacity to use data mining tools to extract knowledge of the monitored information. Before starting the design stage, we carried out a study to determine if there existed a solution in the market capable of meeting our requirements and we finally reached the conclusion that none covered all of them.

Therefore, we considered developing our own solution, which would combine concepts and strategies of the current Big Data and NoSQL solutions as well as our own ideas to achieve all of the defined goals. The absence of suitable solutions that meet all the requirements from the management of renewable energy plants has forced us to research and develop new ways to deal with the treatment and storage of the characteristic data flow of this sector.

The architecture that we have implemented is made up of different technologies and in general terms it’s inspired by the lambda architecture (see http://lambda-architecture.net). This strategy allows to create a pipeline where data is processed in real time, and therefore, it’s possible to meet the latency requirements and deliver aggregated data to the monitoring applications in near real-time.

Architecture global overview (article web)

Global overview of the architecture that resembles to the lambda architecture. Data flow is routed to either RT or Batch layer depending on its criticality.

 

 

The lambda architecture model enables us to serve live aggregated daily data as well as historical data views preaggregated in multiple dimensions, in a way that is transparent to applications. With the new architecture, approximately the 99% of data requests are delivered with latencies less than 2ms.

Actually, one of the project goals has been maintaining the scalability of the system while keeping the response time as low as possible. With the new architecture, the time that passes between a data enters the datacenter and when it’s available to be queried in the client applications has been lowered below one second. This opens the door to multiple application where the response time is a critical issue: plant operations or alarm activation, among others.

The data we store at GPM mainly comes from physical devices –sensors, for example- that measure some physical quantity. This information is collected in constant intervals, which gives place to series of measures associated with a time reference and disposed in chronological order. The distinctive traits of this data pattern, commonly called “time series data”, makes it possible to take advantage of novel techniques that allow to optimize the insertion and the queries of data.

One of this techniques are the fractal tree indexes (see http://oreil.ly/1RAezo8). In the architecture that we have developed, we implement this kind of indices to store historical data. This enables us to increase the data insertion efficiency and avoids the performance degradation that suffer traditional indices, therefore we can guarantee that the performance of read operations will be optimal over time.

High availability is one of the main goals of the architecture. To address this requirement, we have deployed several nodes with master-to-master replication, which allows every node in the topology to be functional, even in unfavorable circumstances, like a network partition or a node outage. This kind of replication presents significant differences in relation to traditional master-slave architectures, in which there only exist a single functional node, and the others are in “spare” state.

Distributed architecture emphasys (article web)

Distributed execution of the architecture. Client applications can request data to any replica.

 

 

This is not only limited to the data nodes but has been a design principle present in the development of all the components. In fact, all the elements allow to be executed in a distributed fashion with redundant instances. To avoid centralization mechanisms, which are often used in several distributed systems, we have used strategies and data structures that do not require coordination among different instances (lock-free), so it can be considered that the architecture applies a shared-nothing model. This has significant advantages: (1) because of every component is self-sufficient and independent from each other, a great level of fault tolerance is achieved and (2) ubiquitous execution of the services is enabled, which is especially important in a clear context of internationalization.

Another of the strengths of the architecture is the optimization of disk space usage. The useful life of a photovoltaic plant is, at least, 25 years. This requires to apply out of the ordinary policies for the management of historical data. To do this we use advanced compression algorithms that reduce size of the data up to 10 times in respect to their original size. Thanks to a fine-grained adjustment of the compression block sizes, we achieve a high level of compression rate without penalizing read and insertion operations.

This is relevant because the compression rate and the speed at which compression and decompression operations can be completed strongly depend on the data pattern and on the compression blocks size. In our case we have optimized the compression parameters to maximize the efficiency for the time-series data.

This project has received a grant from the “Plan Avanza 2014” and has involved several multidisciplinary teams inside GreenPowerMonitor, as well as independent experts. Currently, the project is in deploying stage of the architecture to production after subjecting it to an extensive test period in our development environment.

In conclusion: in the last two years we have designed, implemented and deployed a Big Data architecture adapted to our current growth estimates and which allow us to comply with strong requirements of scalability, latency, high availability and robustness. The usage of mechanisms of compression adapted to the time-series data pattern allows us to maximize the cost-effectiveness relation of our storage system.

We consider that this architecture is optimal for our use case because it allows us to support the growth while improving the response capacity of the system: we can deliver raw and aggregated data with latencies below 2ms and serve the data to the applications in less than a second. Thanks to that, we are able to take advantage of the immediacy which is characteristic of monitored data, and convert it to value.

Furthermore, the usage of new algorithms opens the door to the development of data mining and forecasting systems that will contribute to extract valuable information from the renewable energy plants. Specifically, we are considering a future development feature of an expert system that allow to make analysis through data correlation mechanisms as well as machine learning systems.

GreenPowerMonitor is committed with the continuous innovation and we believe that, conceptually, the design of the architecture offers a high value knowledge in the Big Data and NoSQL applications area, considering that it provides a novel and proven solution for the management of renewable energy plants sector and, in general, for those sectors that operate and store a big volume of time-series data.

We are always glad to receive feedback from our projects. If you have any doubt, suggestion or want to know more about our architecture, feel free to contact us by mail at bigdata@greenpowermonitor.com.

Categories Noticias

david