
What reliable internal platforms require

Shared infrastructure succeeds when it absorbs complexity for product teams without hiding how the system behaves. Two platforms I worked on, one for notifications and one for files, made that lesson concrete.

Notification delivery and file management look like different problems. One routes messages through providers; the other moves assets through storage, scanning, and delivery. Internally, however, both platforms had the same job: replace external or fragmented implementations with a dependable shared capability.

The notification platform gave product teams one API for transactional email, with queued delivery, provider failover, template versioning, suppression handling, and delivery tracking. The file platform gave teams a path away from a third-party file platform and fragmented implementations by standardizing uploads, direct API-to-API access, antivirus scanning, secure delivery, image transforms, and tenant-level access. The file migration reduced third-party infrastructure costs and consolidated ownership, while notification failover and suppression handling improved delivery reliability by about 18%.

Shared infrastructure starts with repeated pain

Before these platforms, notification templates and provider logic were duplicated across services. For files, most teams and products used a third-party file platform, while other implementations were fragmented across SDKs, storage patterns, security controls, and delivery paths. There was no consistent direct API-to-API integration for product services.

Duplication was only the visible cost. The deeper problem was inconsistent behavior. Retries, suppression checks, tenant boundaries, scanning rules, signed URLs, and audit trails varied depending on which product had built the feature. During an incident, there was no single system that could explain the full lifecycle.

A platform becomes worthwhile when centralization can improve correctness and operations, not merely reduce lines of code.

Make the correct path the easiest path

Internal teams should not need to understand every provider failure mode before sending an email, or become storage experts before accepting an upload. The platform API should make the secure and reliable path the natural integration path.

For notifications, a product submitted a message request while the platform handled template rendering, provider selection, retries, suppression checks, and tracking. For files, the API handled authentication, validation, metadata, and orchestration while file bytes moved directly to object storage.

This is an important boundary. A platform should absorb infrastructure complexity, but its contract must remain clear enough that consuming teams understand status, ownership, and failure behavior.

Keep slow and failure-prone work out of request paths

Both systems relied on asynchronous workers because their most important work involved external dependencies or variable processing time.

The notification API validated and stored a request before placing delivery work on a Redis-backed queue. A worker rendered the template, selected a provider, attempted delivery, and recorded each outcome. This kept provider latency away from the request path and provided a clean place for retries and failover.
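
The split between the request path and the worker can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the in-memory `deque` stands in for the Redis-backed queue, and `Message`, `ProviderError`, `accept`, `deliver`, and `run_worker` are hypothetical names. Template rendering and suppression checks are left out to keep the failover logic visible.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class ProviderError(Exception):
    """Raised by a (hypothetical) provider client when a send attempt fails."""


@dataclass
class Message:
    id: str
    to: str
    body: str
    status: str = "accepted"
    # One (provider_name, outcome) entry per delivery attempt.
    attempts: List[Tuple[str, str]] = field(default_factory=list)


Provider = Tuple[str, Callable[[Message], None]]

# In-memory stand-in for the Redis-backed queue (assumption for this sketch).
queue: deque = deque()


def accept(message: Message) -> str:
    """Request path: record the message and enqueue it. No provider calls here."""
    message.status = "queued"
    queue.append(message)
    return message.id


def deliver(message: Message, providers: List[Provider]) -> None:
    """Worker path: try providers in order and record every outcome."""
    for name, send in providers:
        try:
            send(message)
        except ProviderError as exc:
            message.attempts.append((name, f"failed: {exc}"))
            continue  # fail over to the next provider
        message.attempts.append((name, "delivered"))
        message.status = "delivered"
        return
    message.status = "failed"


def run_worker(providers: List[Provider]) -> None:
    """Drain the queue; a real worker would block on the queue instead."""
    while queue:
        deliver(queue.popleft(), providers)
```

Because `accept` never calls a provider, a slow or failing provider only affects the worker, and the attempt history gives operators a record of which provider handled each message.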

The file platform used the same principle for antivirus scanning and optional OCR. An uploaded file entered quarantine and only became deliverable after it passed scanning. Large file bytes never travelled through the API; direct-to-storage and resumable uploads prevented the application layer from becoming a transport bottleneck.
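
The quarantine rule can be expressed as a gate on the delivery path. The sketch below assumes hypothetical names (`StoredFile`, `record_scan_result`, `delivery_path`) and returns a plain path where a real system would issue a signed URL.

```python
from dataclasses import dataclass


class NotDeliverable(Exception):
    """Raised when delivery is requested for a file that has not passed scanning."""


@dataclass
class StoredFile:
    file_id: str
    tenant_id: str
    status: str = "quarantined"  # every upload starts in quarantine


def record_scan_result(stored: StoredFile, clean: bool) -> None:
    """Called by the scanning worker once the antivirus result is known."""
    if stored.status != "quarantined":
        raise ValueError(f"{stored.file_id} is not awaiting a scan result")
    stored.status = "available" if clean else "rejected"


def delivery_path(stored: StoredFile) -> str:
    """Only files that passed scanning are deliverable.

    The returned path is a placeholder; signing is out of scope for this sketch.
    """
    if stored.status != "available":
        raise NotDeliverable(f"{stored.file_id} is {stored.status}")
    return f"/files/{stored.tenant_id}/{stored.file_id}"
```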

Treat state transitions as part of the product

Once work becomes asynchronous, a simple success flag is not enough. Messages move through accepted, queued, attempted, delivered, failed, or suppressed states. Files move through initialization, upload, quarantine, scanning, availability, or rejection.

Explicit lifecycle states make retries safer, improve support workflows, and give operators a common language during incidents. They also make the platform more honest for its consumers: accepting a request is not the same as completing the work.
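
One way to make lifecycle states explicit is a small transition table that rejects illegal moves. The state names below come from the message lifecycle described above; the allowed transitions are an assumption made for illustration, as are the names `MessageState`, `ALLOWED`, `transition`, and `is_terminal`.

```python
from enum import Enum


class MessageState(Enum):
    ACCEPTED = "accepted"
    QUEUED = "queued"
    ATTEMPTED = "attempted"
    DELIVERED = "delivered"
    FAILED = "failed"
    SUPPRESSED = "suppressed"


# Assumed transition rules; a real platform may allow a different set.
ALLOWED = {
    MessageState.ACCEPTED: {MessageState.QUEUED, MessageState.SUPPRESSED},
    MessageState.QUEUED: {MessageState.ATTEMPTED, MessageState.SUPPRESSED},
    # ATTEMPTED -> ATTEMPTED models a retry or a failover attempt.
    MessageState.ATTEMPTED: {
        MessageState.ATTEMPTED,
        MessageState.DELIVERED,
        MessageState.FAILED,
    },
    MessageState.DELIVERED: set(),
    MessageState.FAILED: set(),
    MessageState.SUPPRESSED: set(),
}


class InvalidTransition(Exception):
    """Raised when a caller requests a move the table does not allow."""


def transition(current: MessageState, target: MessageState) -> MessageState:
    """Return the new state, or raise if the move is not permitted."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target


def is_terminal(state: MessageState) -> bool:
    """Terminal states have no outgoing transitions, so retrying them is unsafe."""
    return not ALLOWED[state]
```

A table like this gives retries a clear rule: a worker can check `is_terminal` before reprocessing, so a message that was already delivered or suppressed is never attempted again.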

Enforce tenant isolation at every boundary

Multi-tenancy is not solved by adding a tenant identifier to a table. Tenant context has to flow through authentication, authorization, queries, storage layout, queues, credentials, and delivery.

In the notification platform, tenant-aware API keys, role-based access, rate limits, and repository filters protected templates, messages, and events. In the file platform, isolation extended to storage prefixes, ownership checks, metadata access, and signed delivery paths.

The safest approach was to make scoped access explicit in the architecture rather than relying on every engineer to remember a filtering convention.
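
As a rough illustration of scoped access, the sketch below hands application code a repository that is already bound to one tenant, so there is no unfiltered query to forget. `Template`, `TemplateStore`, and `TenantTemplates` are hypothetical names, and the list-backed store stands in for a real database.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Template:
    tenant_id: str
    name: str
    body: str


class TemplateStore:
    """Unscoped storage. Handlers should only ever receive a scoped view."""

    def __init__(self) -> None:
        self._rows: List[Template] = []

    def add(self, template: Template) -> None:
        self._rows.append(template)

    def rows(self) -> List[Template]:
        return list(self._rows)

    def scoped(self, tenant_id: str) -> "TenantTemplates":
        return TenantTemplates(self, tenant_id)


class TenantTemplates:
    """Repository bound to a single tenant; every read and write is filtered."""

    def __init__(self, store: TemplateStore, tenant_id: str) -> None:
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._store = store
        self._tenant_id = tenant_id

    def create(self, name: str, body: str) -> Template:
        # The tenant comes from the scope, never from caller-supplied data.
        template = Template(self._tenant_id, name, body)
        self._store.add(template)
        return template

    def get(self, name: str) -> Optional[Template]:
        for row in self._store.rows():
            if row.tenant_id == self._tenant_id and row.name == name:
                return row
        return None

    def names(self) -> List[str]:
        return [r.name for r in self._store.rows() if r.tenant_id == self._tenant_id]
```

In this arrangement the authentication layer would resolve the tenant from the API key and pass `store.scoped(tenant_id)` to the handler, which keeps the filter out of individual queries.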

Design observability around the lifecycle

Queue-backed systems gain resilience but can become difficult to debug. A request may succeed while downstream processing fails minutes later. Useful observability therefore has to follow the unit of work across API calls, queues, workers, providers, and storage.

Delivery events, provider-attempt history, explicit file statuses, structured logs, request correlation, and health checks made it possible to answer practical questions: where is this item, what attempted to process it, why did it fail, and is it safe to retry?
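
A sketch of what following the unit of work can look like: every stage records an event under the same correlation ID, and the operator questions become simple lookups. `EventLog` and its methods are hypothetical, and the in-memory list stands in for whatever event store the platform uses.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


class EventLog:
    """Append-only lifecycle events, keyed by a correlation ID."""

    def __init__(self) -> None:
        self._events: List[Dict[str, Any]] = []

    def record(self, correlation_id: str, stage: str, outcome: str,
               **detail: Any) -> Dict[str, Any]:
        event = {
            "correlation_id": correlation_id,
            "stage": stage,       # e.g. "api", "queue", "worker", "provider"
            "outcome": outcome,   # e.g. "ok", "failed"
            "at": datetime.now(timezone.utc).isoformat(),
            "detail": detail,
        }
        self._events.append(event)
        return event

    def history(self, correlation_id: str) -> List[Dict[str, Any]]:
        """What attempted to process this item, in order?"""
        return [e for e in self._events if e["correlation_id"] == correlation_id]

    def where_is(self, correlation_id: str) -> Optional[str]:
        """Where is this item now? Returns the most recent stage, if any."""
        events = self.history(correlation_id)
        return events[-1]["stage"] if events else None

    def failures(self, correlation_id: str) -> List[Dict[str, Any]]:
        """Why did it fail? Returns only the failed events."""
        return [e for e in self.history(correlation_id) if e["outcome"] == "failed"]


def to_log_line(event: Dict[str, Any]) -> str:
    """Render an event as one structured (JSON) log line."""
    return json.dumps(event, sort_keys=True)
```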

Operational tooling was part of the platform product. The notification dashboard exposed templates, providers, and delivery health, while file status and audit trails helped internal teams understand processing and access.

Migration is part of platform architecture

A shared service can be technically strong and still fail if adoption requires every team to stop feature work and migrate at once. We designed integration surfaces that allowed teams to move incrementally.

Notification consumers could migrate one event type at a time. File consumers could move from the third-party platform or a product-specific implementation to the shared API, direct service integration, and shared upload patterns. Those patterns hid secure storage and delivery details without forcing every product into the same user interface.
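
Migrating one event type at a time can be as simple as a routing switch inside the consuming service. The sketch below is an assumption about how such a switch might look; `NotificationRouter` and the sender callables are hypothetical, not the integration surface the platform actually shipped.

```python
from typing import Callable, Dict, Set

Sender = Callable[[str, Dict[str, str]], str]


class NotificationRouter:
    """Routes each event type to the legacy path or the shared platform."""

    def __init__(self, legacy: Sender, platform: Sender) -> None:
        self._legacy = legacy
        self._platform = platform
        self._migrated: Set[str] = set()

    def migrate(self, event_type: str) -> None:
        """Move a single event type onto the platform."""
        self._migrated.add(event_type)

    def rollback(self, event_type: str) -> None:
        """Return an event type to the legacy path if the migration misbehaves."""
        self._migrated.discard(event_type)

    def send(self, event_type: str, payload: Dict[str, str]) -> str:
        sender = self._platform if event_type in self._migrated else self._legacy
        return sender(event_type, payload)
```

Unmigrated event types keep their existing behavior, so a team can move the lowest-risk notification first and roll back without touching the rest.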

Documentation, predictable semantics, and operational clarity mattered as much as the backend architecture. Internal developers are users, and migration effort is part of the cost they experience.

Measure more than availability

Reliability improvements and infrastructure savings were useful outcome measures, but platform value also appeared in less dramatic signals: duplicated code removed, provider-specific logic contained, security controls standardized, migrations completed, and incidents investigated from one operational surface.

The notification platform’s roughly 18% reliability improvement made one part of the impact legible. For the file platform, the value showed up in lower third-party costs, fewer duplicated implementations, consistent security controls, and clearer operational ownership. Future providers, controls, and workflows could be added once inside the platform instead of rebuilt in every product.

Reliable internal platforms do not remove complexity. They put it behind clear ownership, consistent rules, and interfaces that help teams use the capability without recreating its risks.
