Reliability & Availability

At Candu, we care deeply about providing the highest availability and reliability to our customers.

Written by Lauren Cumming
Updated over a week ago

At Candu, we care deeply about providing our customers with the highest availability and reliability. Because Candu is meant to be a seamless part of your UI, our explicit objective is for Candu to be at least as reliable as any of our customers' own infrastructure. Of course, we never want our app to go down, but more importantly, we never want anything to negatively impact your users' experience with your product.

This article discusses all the architectural decisions we have made and implemented to provide our customers with a highly reliable product.

Backend Infrastructure

Candu is deployed on AWS, and most of our servers run on ECS. We use blue-green deployments powered by a custom framework. At any given point, we can roll back to any commit within 3 minutes.
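As a rough illustration of the blue-green idea, the sketch below keeps two slots and moves traffic between them. It is not Candu's framework: the class name `BlueGreenRouter`, the health-check callback, and the commit hashes are all hypothetical, and a real rollback to an arbitrary commit would involve redeploying that commit rather than only swapping slots.

```typescript
// Hypothetical sketch of a blue-green switch; not Candu's actual framework.
type Slot = "blue" | "green";

class BlueGreenRouter {
  // Commit deployed to each slot; null means the slot has never been used.
  private commits: Record<Slot, string | null> = { blue: null, green: null };
  private live: Slot = "blue";

  constructor(initialCommit: string) {
    this.commits.blue = initialCommit;
  }

  // The idle slot is the one that is not receiving traffic.
  private idleSlot(): Slot {
    return this.live === "blue" ? "green" : "blue";
  }

  // Promote a commit: it takes over the idle slot and receives traffic
  // only if the health check passes.
  release(commit: string, isHealthy: (commit: string) => boolean): boolean {
    if (!isHealthy(commit)) return false; // live slot keeps serving traffic
    const target = this.idleSlot();
    this.commits[target] = commit;
    this.live = target;
    return true;
  }

  // Roll back by pointing traffic at the slot that was live before.
  rollback(): boolean {
    const previous = this.idleSlot();
    if (this.commits[previous] === null) return false;
    this.live = previous;
    return true;
  }

  liveCommit(): string {
    const commit = this.commits[this.live];
    if (commit === null) throw new Error("live slot is empty");
    return commit;
  }
}

const router = new BlueGreenRouter("a1b2c3"); // made-up commit hashes
router.release("d4e5f6", () => true); // traffic moves to the new commit
router.rollback(); // traffic returns to "a1b2c3"
```

Because the previous version stays deployed in the idle slot, rolling back is a traffic switch rather than a rebuild, which is what makes fast rollbacks plausible.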

We can also create a staging environment in less than a minute for any commit pushed to our backend. Developers can then test any change in a production-like environment.

For databases, we use a combination of DynamoDB and RDS across multiple availability zones to ensure maximum uptime and reliability.

High-throughput data processing code runs on AWS Lambda to ensure the highest availability and scalability.

Finally, we use a message-driven architecture to ensure critical customer data is processed at least once.
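At-least-once processing means a message stays queued until its handler succeeds, so the same message may be delivered more than once. Consumers are therefore usually made idempotent. The sketch below shows that pairing in memory only; `AtLeastOnceQueue`, `makeIdempotent`, and the message shape are hypothetical names for illustration and say nothing about which queueing service Candu uses.

```typescript
// Hypothetical in-memory sketch of at-least-once delivery; not Candu's code.
interface Message {
  id: string; // unique id used to detect redeliveries
  payload: string;
}

class AtLeastOnceQueue {
  private pending: Message[] = [];

  enqueue(message: Message): void {
    this.pending.push(message);
  }

  // Deliver every pending message. A message is removed only when the
  // handler returns without throwing; otherwise it is kept for redelivery.
  drain(handler: (message: Message) => void): void {
    const retry: Message[] = [];
    for (const message of this.pending) {
      try {
        handler(message);
      } catch {
        retry.push(message);
      }
    }
    this.pending = retry;
  }

  size(): number {
    return this.pending.length;
  }
}

// Wrap a handler so a redelivered message is processed only once.
function makeIdempotent(
  handler: (message: Message) => void
): (message: Message) => void {
  const processedIds = new Set<string>();
  return (message) => {
    if (processedIds.has(message.id)) return; // duplicate delivery, skip
    handler(message);
    processedIds.add(message.id); // recorded only after success
  };
}
```

The id is recorded only after the handler succeeds, so a failed attempt is retried on the next drain instead of being silently dropped.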


Frontend Infrastructure

Candu uses a CDN to publish content to provide high availability and fast delivery.

The changes are saved on our servers whenever a user edits content in our dashboard.

Once content has been published, Candu uploads a version of that content to the CDN. We use this publishing mechanism to:

  1. Ensure you can safely edit content in a draft state before publishing.

  2. Increase the load speed and availability of all content you create.

  3. Provide a versioning process and audit trail for any content.
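The draft-then-publish flow described above can be sketched as a store with one mutable draft and an append-only list of published versions. This is an illustrative model only: `ContentStore`, its method names, and the `publishedBy` field are assumptions, and the real system uploads to a CDN rather than keeping versions in memory.

```typescript
// Hypothetical model of draft/publish versioning; not Candu's data model.
interface PublishedVersion {
  version: number; // 1-based, increases with every publish
  body: string;
  publishedBy: string;
}

class ContentStore {
  private draft = "";
  private history: PublishedVersion[] = []; // append-only audit trail

  // Edits only touch the draft, so they are never visible to end users.
  saveDraft(body: string): void {
    this.draft = body;
  }

  // Publishing snapshots the current draft as a new immutable version.
  publish(publishedBy: string): PublishedVersion {
    const entry: PublishedVersion = {
      version: this.history.length + 1,
      body: this.draft,
      publishedBy,
    };
    this.history.push(entry);
    return entry;
  }

  // What a CDN would serve: the latest published version, never the draft.
  live(): PublishedVersion | undefined {
    return this.history[this.history.length - 1];
  }

  auditTrail(): readonly PublishedVersion[] {
    return this.history;
  }
}
```

Keeping every published snapshot is what provides both the audit trail and a way to see exactly what was live at any point.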

All the assets served through Candu to your customers are hosted on an enterprise-grade CDN.

We use a CDN for the following reasons:

  1. CDNs have extremely high availability and reliability. We selected S3 + Cloudflare as our primary vendors.

  2. CDNs are distributed and geographically close to our customers, meaning we can serve content quickly.

SDK

The Candu SDK is installed within your application to provide dynamic segmentation, user analytics, and UI rendering.

The SDK's main functionalities are:

  1. Rendering Content

  2. Collecting analytics via eventing

Our SDK is engineered to handle multiple failure points.

The first request the SDK executes on initialization retrieves the "customer segment." This is the only SDK request that is not served from the CDN. Customer segments are cached in local storage in case the network fails.

If a network call fails or is too slow for any reason, the customer will see whatever content version they saw last (to create a consistent user experience). If the network call fails the first time before it has been cached, the customer will default to the Everyone Segment.
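The segment fallback order (network, then cache, then the Everyone Segment) might look like the sketch below. It is a simplification under stated assumptions: the network call is modeled as a synchronous function that throws on failure or timeout, and `resolveSegment`, `memoryStore`, the cache key, and the segment ids are all hypothetical. The `KeyValueStore` interface matches the `getItem`/`setItem` subset of the browser's `localStorage`.

```typescript
// Hypothetical sketch of segment resolution with fallbacks; not the SDK's code.
const EVERYONE_SEGMENT = "everyone"; // assumed id for the Everyone Segment

// Subset of the browser Storage API, so localStorage can be passed in.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

function resolveSegment(
  fetchSegment: () => string, // throws when the network fails or times out
  cache: KeyValueStore,
  cacheKey = "candu:segment" // assumed key name
): string {
  try {
    const segment = fetchSegment();
    cache.setItem(cacheKey, segment); // remember the last good answer
    return segment;
  } catch {
    // Network failed: reuse the last known segment, else fall back to Everyone.
    return cache.getItem(cacheKey) ?? EVERYONE_SEGMENT;
  }
}

// In-memory stand-in for localStorage, useful outside a browser.
function memoryStore(): KeyValueStore {
  const data = new Map<string, string>();
  return {
    getItem: (key) => data.get(key) ?? null,
    setItem: (key, value) => {
      data.set(key, value);
    },
  };
}

const offline = (): string => {
  throw new Error("network unavailable");
};
const store = memoryStore();
resolveSegment(offline, store); // first load fails: "everyone"
resolveSegment(() => "trial-users", store); // succeeds and caches "trial-users"
resolveSegment(offline, store); // fails again: cached "trial-users"
```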

After the SDK requests Segment membership, Candu fetches the Content. The Content is also cached locally as an additional failsafe against network failures, even though it is already hosted on the CDN.

In the worst-case scenario, if AWS S3 or Cloudflare fails and there is no local content cached, no content will be served, and the page will remain intact, looking as though Candu was never installed.
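A minimal sketch of this worst-case behavior: try each content source in order and, if none yields anything, leave the page exactly as it was. The names `firstAvailable` and `renderInto`, the `Content` shape, and the modeling of the page as an array of HTML strings are assumptions made for illustration, with sources modeled as synchronous functions for brevity.

```typescript
// Hypothetical sketch of content fallback; not the SDK's implementation.
interface Content {
  id: string;
  html: string;
}

// A source returns null when it has nothing, and may throw when unreachable.
type ContentSource = () => Content | null;

// Return content from the first source that can provide it, or null.
function firstAvailable(sources: ContentSource[]): Content | null {
  for (const source of sources) {
    try {
      const content = source();
      if (content !== null) return content;
    } catch {
      // A failing source must not break the host page; try the next one.
    }
  }
  return null;
}

// The page is modeled as a list of HTML fragments for simplicity.
function renderInto(page: string[], content: Content | null): string[] {
  // Worst case: nothing to render, so the page is returned unchanged.
  return content === null ? page : [...page, content.html];
}

const cdnDown: ContentSource = () => {
  throw new Error("CDN unreachable");
};
const emptyCache: ContentSource = () => null;

const hostPage = ["<header>App</header>"];
const result = renderInto(hostPage, firstAvailable([cdnDown, emptyCache]));
// result is the same array as hostPage: nothing was added
```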

Error Handling

All components (e.g., Content, Segments, etc.) in the SDK are wrapped with error boundaries to prevent JavaScript-related errors from propagating outside the Candu SDK and impacting our clients. If the error boundaries receive any errors, those are logged in the Candu tracking system.

If Candu encounters a JavaScript error in customer code or anywhere in the Candu SDK, the error is logged in the Candu tracking system for immediate review. If, for any reason, an undetected error occurs, Candu automatically stops rendering and does not display any content, to protect the page's performance.
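The containment idea behind an error boundary can be shown without any framework: run the component's render step inside a guard, report whatever it throws, and render nothing instead of letting the error escape. The sketch below is a framework-free approximation; `withErrorBoundary` and `ErrorReporter` are hypothetical names, and the reporter stands in for whatever tracking system receives the logs.

```typescript
// Hypothetical, framework-free sketch of an error boundary; not the SDK's code.
type ErrorReporter = (error: Error, component: string) => void;

// Run a render function; on failure, report the error and render nothing.
function withErrorBoundary<T>(
  component: string,
  render: () => T,
  report: ErrorReporter
): T | null {
  try {
    return render();
  } catch (thrown) {
    // Anything can be thrown in JavaScript, so normalize to an Error.
    const error = thrown instanceof Error ? thrown : new Error(String(thrown));
    try {
      report(error, component);
    } catch {
      // Reporting itself must never break the host page.
    }
    return null; // caller treats null as "display no content"
  }
}

const logged: string[] = [];
const report: ErrorReporter = (error, component) => {
  logged.push(`${component}: ${error.message}`);
};

withErrorBoundary("Content", () => "<p>Welcome</p>", report); // "<p>Welcome</p>"
withErrorBoundary(
  "Segments",
  () => {
    throw new Error("bad segment payload");
  },
  report
); // null, and the error is recorded in `logged`
```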

Code Quality

Before release, each new SDK version must pass an extensive and continuously expanding suite of unit and integration tests to catch potential regressions. Additionally, we aim for 100% TypeScript coverage to minimize type-safety errors and help developers integrate the SDK into their projects.

Versioning

If an SDK release contains a breaking change, we bump the major version for that release, following SemVer. This should prevent accidental installation of a breaking SDK version without proper migration steps in place.
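Under SemVer, a version has the form MAJOR.MINOR.PATCH, and a breaking change requires incrementing MAJOR. A consumer can therefore detect a potentially breaking upgrade by comparing major versions, as in the sketch below. The helper names and version numbers are made up for illustration, and the parser deliberately ignores pre-release and build suffixes.

```typescript
// Illustrative SemVer check; version numbers below are made up.
interface SemVer {
  major: number;
  minor: number;
  patch: number;
}

// Parse a plain MAJOR.MINOR.PATCH string (no pre-release or build metadata).
function parseSemVer(version: string): SemVer {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (match === null) {
    throw new Error(`Not a plain MAJOR.MINOR.PATCH version: ${version}`);
  }
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3]),
  };
}

// An upgrade may contain breaking changes when the major version increases.
function isBreakingUpgrade(from: string, to: string): boolean {
  return parseSemVer(to).major > parseSemVer(from).major;
}

isBreakingUpgrade("1.4.2", "1.5.0"); // false: minor bump, backward compatible
isBreakingUpgrade("1.4.2", "2.0.0"); // true: major bump, check migration steps
```

Note that SemVer treats 0.y.z versions as unstable, so this check is only meaningful from 1.0.0 onward.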

Architecture

An architecture diagram of Candu's engineering system.

SLAs

At Candu, we take our SLA and partner operations seriously. We strive to maintain a 99.9% SLA across all of our APIs and frontend assets.
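To put 99.9% in concrete terms, the allowed downtime is the remaining 0.1% of the measurement period. The helper below does that arithmetic; the function name and the 30-day and 365-day periods are illustrative choices, not terms of Candu's SLA.

```typescript
// Illustrative arithmetic only; measurement periods are assumptions.
function allowedDowntimeMinutes(slaPercent: number, periodDays: number): number {
  if (slaPercent < 0 || slaPercent > 100) {
    throw new RangeError("slaPercent must be between 0 and 100");
  }
  const totalMinutes = periodDays * 24 * 60;
  return totalMinutes * (1 - slaPercent / 100);
}

allowedDowntimeMinutes(99.9, 30); // about 43.2 minutes per 30 days
allowedDowntimeMinutes(99.9, 365); // about 525.6 minutes (8.76 hours) per year
```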

We monitor our SLA through a third-party monitoring integration. We currently ping 10+ APIs for uptime, along with other critical parts of the infrastructure that we rely on to provide our services.

All the tests are performed from 7 locations worldwide (Canada Central, Ohio, Oregon, Sydney, Tokyo, Frankfurt, London) to ensure we maintain availability within and throughout different regions.

All critical integration tests are performed each minute.

Our team is notified immediately if any check fails, as outlined in our escalation policy.
