My Cloud Cookbook
A Portfolio of App Architecture from 5 Years at LiveTiles
For many people working in technology, a portfolio is an important artifact for proving your skills. I have certainly tried to maintain my GitHub profile to ensure projects are pinned and up to date; however, during my time at LiveTiles my real work was leadership and architecture.
Architecture — The product of planning, designing, and constructing structures.
How can I capture the work I did engineering, deploying, and maintaining cloud-based systems? Over five years, I would estimate that we built about one new product a year, with various supporting services alongside. In each case, as Chief Architect, I was responsible for planning and prototyping the underlying architecture, and for mentoring the team on the skills necessary to build and maintain it.
I will attempt to walk through how they were designed, the technologies that can implement them (in both Azure and AWS), and the impact each approach had going from zero to 950,000 Monthly Active Users (MAUs) on my team’s most established product, with a suite of others around it whose numbers kept climbing through acquisitions and cross-pollination of customers.
Each approach reflects the evolution that happened across the rise of cloud technologies, starting with the monolith and trending toward the microservice distributed systems that are growing in adoption today. In so many ways, it is a fitting summary of the change in the industry at large across the decade.
Monolithic
Dating back to my time at Pacific Northwest National Laboratory, where I first built and managed production web apps, I’ve been building basic three-tier web applications: a JavaScript web client, supporting web services, and a backing database. More advanced versions of this basic architecture provided the starting framework for many applications, including the LiveTiles Licensing Service (LTLS), originally built in 2015, which tracks over 20,000 installations of our SharePoint product around the world.
In the cloud, this basic monolithic approach usually means distributing front-end assets via CDN, adopting Platform-as-a-Service (PaaS) compute (Web App in Azure or Elastic Beanstalk on AWS), and using a managed SQL Database (Azure SQL Database on Azure, RDS on AWS) for auto-scaling capacity and geo-redundancy in the data layer.
One gotcha about the horizontal scale of the web application (not captured in the diagram) is shared caching in the web app. Behind the scenes, PaaS platforms are running multiple VMs with many instances of your app. Unless it is disabled, the load balancer provides affinity so each user sticks to a single server when requests are received. This means caching techniques for an individual user’s information (e.g., HttpContext.Current.Cache in ASP.NET) work with no changes; however, if the information is shared between users (e.g., organization information for the company where the employees work), then you must introduce a unifying cache to the architecture (Azure Cache for Redis in Azure, ElastiCache in AWS).
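As a minimal sketch of that unifying cache (assuming the StackExchange.Redis client and a hypothetical key convention, not our production code), shared organization data can be read through Redis so every instance sees the same copy:

```csharp
// Minimal sketch of a shared cache for cross-user data (StackExchange.Redis assumed).
// Per-user data can stay in the in-process cache; anything shared between users
// (e.g., organization settings) goes to Redis so every instance sees the same copy.
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public class OrganizationCache
{
    private readonly IDatabase _redis;

    public OrganizationCache(ConnectionMultiplexer connection)
    {
        _redis = connection.GetDatabase();
    }

    public async Task<string> GetOrganizationJsonAsync(string orgId, Func<Task<string>> loadFromDatabase)
    {
        var key = $"org:{orgId}";                      // hypothetical key convention
        var cached = await _redis.StringGetAsync(key);
        if (cached.HasValue)
            return cached;

        var json = await loadFromDatabase();           // fall back to the SQL tier
        await _redis.StringSetAsync(key, json, TimeSpan.FromMinutes(10));
        return json;
    }
}
```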
For virtually all of the applications we built, we also made use of NoSQL table storage (Azure Tables in Azure, DynamoDB on AWS). For instance, this provided an audit trail of activations in our licensing services (partition per day, row per activation) and the lists capability in the SaaS version of our design product. Geo-redundant storage enabled transparent failover in the event of a catastrophe, all without modification of the app configuration.
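Here is a sketch of that audit-trail pattern using the current Azure.Data.Tables SDK (the original 2015 service used an earlier storage SDK, and the entity shape here is illustrative):

```csharp
// Sketch of the activation audit trail pattern: one partition per day, one row per
// activation. (Azure.Data.Tables shown; the table name and properties are illustrative.)
using System;
using System.Threading.Tasks;
using Azure.Data.Tables;

public class ActivationAuditTrail
{
    private readonly TableClient _table;

    public ActivationAuditTrail(string connectionString)
    {
        _table = new TableClient(connectionString, "Activations");
        _table.CreateIfNotExists();   // ensure the table exists before first write
    }

    public async Task RecordAsync(string licenseKey, string machineId)
    {
        var entity = new TableEntity(
            partitionKey: DateTime.UtcNow.ToString("yyyy-MM-dd"), // partition per day
            rowKey: Guid.NewGuid().ToString())                    // row per activation
        {
            ["LicenseKey"] = licenseKey,
            ["MachineId"] = machineId,
            ["ActivatedAtUtc"] = DateTime.UtcNow
        };

        await _table.AddEntityAsync(entity);
    }
}
```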
To provide security for the application, we used an enterprise directory provider for identity, since we were providing platforms to enterprise customers. Most often this was Active Directory (Azure Active Directory in Azure, Directory Services in AWS), though we also supported Okta. For some customers, we also implemented Azure AD B2C, which abstracts authentication over social providers like Microsoft accounts and LinkedIn.
Finally, a service for capturing live telemetry (App Insights in Azure, CloudWatch in AWS) from the application ensures scalability and performance can be monitored, along with providing insights into the audience of the application. This often proved useful for finding hotspots in the code, since it would trace requests from the web server through to the database. The Monthly Active Users (MAU) and user retention counts were also harvested from this system, providing the user engagement metrics to quantitatively prove product success.
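As a sketch of how engagement events can feed those MAU and retention numbers (Application Insights SDK; the event and property names are invented for illustration):

```csharp
// Sketch of emitting a custom engagement event with the Application Insights SDK.
// Setting the authenticated user id lets App Insights roll events up into MAU and
// retention reports. Event and property names here are illustrative, not LiveTiles code.
using System.Collections.Generic;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Extensibility;

public class EngagementTelemetry
{
    private readonly TelemetryClient _telemetry;

    public EngagementTelemetry(TelemetryConfiguration configuration)
    {
        _telemetry = new TelemetryClient(configuration);
    }

    public void TrackPageDesignOpened(string userId, string tenantId)
    {
        // The authenticated user id is what de-duplicates users into MAU counts.
        _telemetry.Context.User.AuthenticatedUserId = userId;

        _telemetry.TrackEvent("PageDesignOpened", new Dictionary<string, string>
        {
            ["tenantId"] = tenantId
        });
    }
}
```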
This architecture was easy to develop and deploy, harnessed the resiliency and scale of the cloud, and made observability a breeze, but it rapidly became brittle when contemplating larger audiences. Shared resources (data and compute) for all tenants had their limits, and the scaling sliders could only take you so far.
Elastic
When we set about creating a pure SaaS version of our page design product in 2016, it was obvious that a Monolithic architecture was not going to provide the scale necessary to support our customers. This “Elastic” approach was utilized for our LiveTiles Page Designer for Azure product, which powers over 1,400 sites in enterprises around the world.
The foundation was providing a means for the services to dynamically allocate new resources when a new tenant joined the platform. This included provisioning a new database and storage containers for the site. These resources could be added to a region of the user’s choice — just like how Office 365 enables admins to pick a region when getting started. This enabled global organizations to have multiple sites in different regions, picking the hardware nearest the team. It also provided for data segregation, a vital security concern for our enterprise clients.
To orchestrate this process, we ended up creating three applications. The “Infrastructure App” was an internal application for managing resources. While we could have used the Azure Portal to achieve the same effect, this enabled application logic to be applied to the resources (like using Entity Framework to deploy the database schema and to seed initial database and storage entities), as in the sketch below. The “Portal App” was the front door for the application, tracking infrastructure in its database and providing initial user identity. Finally, the “Tenant App” was the actual page designer platform with full content management and third-party data integrations.
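A minimal sketch of that Infrastructure App provisioning step, assuming EF Core and illustrative type names (the real service had considerably more orchestration around resource creation):

```csharp
// Sketch of per-tenant provisioning: create the tenant database, apply the schema
// with Entity Framework, then seed initial data. TenantDbContext and Site are
// illustrative names, not the actual LiveTiles schema.
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class Site
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Theme { get; set; }
}

public class TenantDbContext : DbContext
{
    public TenantDbContext(DbContextOptions<TenantDbContext> options) : base(options) { }
    public DbSet<Site> Sites => Set<Site>();
}

public class TenantProvisioner
{
    public async Task ProvisionAsync(string tenantConnectionString, string tenantName)
    {
        var options = new DbContextOptionsBuilder<TenantDbContext>()
            .UseSqlServer(tenantConnectionString)
            .Options;

        using var context = new TenantDbContext(options);

        // Creates the tenant database if needed and applies any pending migrations.
        await context.Database.MigrateAsync();

        // Seed the initial entities the Tenant App expects on first load.
        context.Sites.Add(new Site { Name = tenantName, Theme = "Default" });
        await context.SaveChangesAsync();
    }
}
```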
This approach also used a managed search provider (Cognitive Search in Azure, CloudSearch in AWS) to index data on pages, lists, and files, so users could search to find their content. It was deployed as one instance per region, opening up the possibility of organization-wide searches across sites.
As time has gone on, the horizontal scalability of the elastic databases, along with the ability to manage them in pools, has proven invaluable. If there is a flaw in this newer approach, it’s that the PaaS hosting plans, which were normally shared across all tenants in a region, became a bottleneck in instances where we didn’t do a deployment per tenant. While we never ran out of room on the scaling sliders, there was a newer, better approach infiltrating the industry and we were excited to try it.
Born-In-The-Cloud
Since the first year of LiveTiles, we knew that the ability to design a page would be greatly complemented by direct user feedback. In 2017, we finally sat down to make a scalable cloud service for collecting, analyzing, and presenting real user activity to our community of page designers. The system processes and visualizes data for any designer across our 1,000 organizations — a demanding feat for day one of a product given our history so far.
While we had considered managed services oriented toward micro-services (Service Fabric on Azure, App Mesh on AWS), the newly emerging availability of serverless compute (Azure Functions on Azure, Lambda on AWS) for implementing micro-service patterns immediately drew our attention. It promised the potential for infinite scale, with the minor tradeoffs of being stateless and of developer tooling that was still maturing.
The system was ultimately a small distributed system made of alternating compute and storage aligned in a data pipeline. User telemetry tracking was embedded into our page design tool, which would stream user activity (page views, tile clicks, etc.) to the ingest endpoints. These endpoints would immediately write the events into table storage, partitioned per site and split into half-hour buckets. Periodically, another set of functions would comb through the raw data and aggregate key analytics into a relational database. That reporting database was then accessible to one last set of presentation endpoints that drove visualization on the canvas.
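A sketch of one ingest endpoint in that pipeline, using the in-process Azure Functions model with the current table storage SDK; all names are invented for illustration and the real endpoints were more involved:

```csharp
// Sketch of an ingest endpoint: accept a telemetry event over HTTP and append it to
// table storage, partitioned per site and split into half-hour buckets.
using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Data.Tables;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Newtonsoft.Json;

public static class IngestFunction
{
    public class PageEvent
    {
        public string SiteId { get; set; }
        public string EventType { get; set; }   // e.g. "PageView" or "TileClick"
        public string UserId { get; set; }
    }

    [FunctionName("IngestEvent")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
    {
        var body = await new StreamReader(req.Body).ReadToEndAsync();
        var pageEvent = JsonConvert.DeserializeObject<PageEvent>(body);

        var now = DateTime.UtcNow;
        var bucket = now.Minute < 30 ? "00" : "30";  // half-hour split within the hour

        var entity = new TableEntity(
            partitionKey: $"{pageEvent.SiteId}_{now:yyyyMMddHH}{bucket}",
            rowKey: Guid.NewGuid().ToString())
        {
            ["EventType"] = pageEvent.EventType,
            ["UserId"] = pageEvent.UserId,
            ["ReceivedUtc"] = now
        };

        // "RawEvents" table is assumed to exist; connection string name is illustrative.
        var table = new TableClient(
            Environment.GetEnvironmentVariable("RawEventsConnection"), "RawEvents");
        await table.AddEntityAsync(entity);

        return new OkResult();
    }
}
```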
Working with serverless technology did introduce a few complications. When we first looked at the capabilities, it was painful to include outside libraries and the dev tooling was quite atrocious. We had to build our own security integration with Microsoft Graph. Furthermore, the lack of fully developed web frameworks for the platform meant conventional techniques like dependency injection were unavailable. Finally, we discovered that the “consumption plan” approach (true serverless, where you rely upon the platform to scale without any concept of servers) resulted in unbearable cold start times. This meant managing an App Service Plan for all user-facing functions.
In hindsight, we probably should have picked a NoSQL solution for the reporting database instead of just using SQL. While the relational capabilities made reporting easy, the performance characteristics started to suffer quickly. Within months we were partitioning, indexing, and otherwise tweaking the database multiple times just to ensure the various slicings of the data would load in a bearable timeframe. We were saved only by dusting off my rusty SQL Server admin skills and, especially, by a timely rediscovery of the power of columnstore indexes. The next level of scale would require a new approach to the data: either harnessing a more specialized NoSQL solution or perhaps graduating to a full data warehouse behind the scenes.
The added capabilities of user behavior heatmaps expanded the possibilities for our product. It was a big part of the excitement leading up to Microsoft awarding LiveTiles US Partner of the Year for Modern Workplace Transformation in 2018.
Planet-Scale
The final architectural approach was viewed by my team as the ultimate evolution of our approaches; something we would converge on in due time to provide a global platform where customers didn’t need to choose a region for their infrastructure and could collaborate in real-time. While not fully realized on a global scale, components of this approach were used in 2019 to deploy an integration for our new Salesforce tile.
The “Planet-Scale” approach involves having multiple regions of compute and data, connected such that users see only a single unified service. Users are routed based on geography (Application Gateway in Azure, Application Load Balancer in AWS) to their nearest region. Within each region, compute is serverless (Azure Functions, Lambda) and broken up into micro-services, and the back end is a robust NoSQL solution (Cosmos DB on Azure, DynamoDB on AWS) capable of “multi-master” or “active-active” configuration, coupled with fast storage (Azure Cosmos DB uses SSDs as part of its managed service). Further, movement through the system internals would be event-driven (Event Grid on Azure, SNS in AWS) such that some or all of a transaction would be processed asynchronously.
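A sketch of the regional data tier in this approach, using the Azure Cosmos DB v3 SDK with illustrative names; the account itself would be provisioned with multi-region writes enabled:

```csharp
// Sketch of a regional write path against a multi-master Cosmos DB account.
// Database, container, and document names are illustrative.
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class RegionalPageStore
{
    private readonly Container _pages;

    public RegionalPageStore(string endpoint, string key, string localRegion)
    {
        var client = new CosmosClient(endpoint, key, new CosmosClientOptions
        {
            // Route reads and writes to this deployment's own region; with
            // multi-region writes enabled, every region accepts writes locally.
            ApplicationRegion = localRegion   // e.g. Regions.WestUS2
        });

        _pages = client.GetContainer("designer", "pages");
    }

    public Task SavePageAsync(PageDocument page) =>
        _pages.UpsertItemAsync(page, new PartitionKey(page.TenantId));
}

public class PageDocument
{
    public string id { get; set; }          // Cosmos DB documents require an "id"
    public string TenantId { get; set; }    // partition key keeps a tenant's pages together
    public string Title { get; set; }
}
```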
This takes the scalability lessons around serverless from LiveTiles Intelligence and marries them with established patterns from NoSQL. While the emergence of CosmosDB in the Azure community has a lot more people talking about multi-master database architecture in the world of Microsoft, other NoSQL solutions (such as MongoDB) have had these capabilities for a long time. Ultimately, this enables concurrent, global user interaction with the system with zero consistency locking and potentially infinite scale compute.
While we never quite arrived at the products we had considered with many-to-many user interactions across the platform (e.g., page comments for the designers and/or audience), I knew the total number of users on the coordinating service of the SaaS version of our page designer would ultimately reach a scale where this would become a tempting approach.
Breaking up into a truly distributed system does impose its own challenges. Tracing activity can become very difficult, and monitoring in general becomes harder. I’ve used DataDog in the past to overcome some of these challenges, but it only helps when coupled with immense discipline around logging and a dedicated DevOps/SRE function. The issues with distributed transactions (micro-services often talk to multiple services to accomplish writes to multiple data stores) also mean you need to start trading off scale versus data quality versus availability, since atomicity goes out the window in many cases the moment a variety of transactions are bundled into a higher-level operation.
Endless Forms Most Beautiful
With each product and each iteration, I’ve grown in my approach to meet the ever-increasing demands of an ever-increasing user base. When building the first solutions in year one of LiveTiles, we could rely upon spending more money to solve our scaling issues. Nowadays, a novel feature or a new performance challenge often means refactoring the architecture overall rather than just adding a new database index.
LiveTiles started with practically zero customers and now, after five years, we’re reaching toward an audience across all products of 8.5 million. What could be done to make the next great innovation serve all of them all at once?
In the years ahead, I look forward to not only designing and deploying even bigger and better systems but growing the people who build and support them. I couldn’t have achieved what I did at LiveTiles without Miyuki Gimera, Trey Miller, and Kellan Corbitt coding and creating alongside me. I am thankful to have led their growth this far and wish the LiveTiles team going forward good luck in taking these platforms even further.
Art is never finished, only abandoned — Leonardo Da Vinci
Erik Ralston is an innovator with 13 years of experience, 5 years in technical leadership at the fastest growing tech company in Australia, a BS in Computer Science from Washington State University, and an open calendar for talking about the next step in his career. Erik is also co-founder of Fuse Accelerator in Tri-Cities, WA where he works on connecting people and sharing knowledge to turn new ideas into growing startups. You can find him on LinkedIn, Twitter, or the next Fuse event — whenever we’re allowed to have those again.