10 Practical Tips to Improve Your Observability

04.06

Starting with observability often involves setting up basic logs, metrics and traces. But really turning that raw data into actionable insights – the kind of insights that enable rapid debugging, proactive performance tuning and clear, shared understanding within teams – requires more than just the basics.

At CloudFuel, we partner with businesses to navigate this complexity and realise the full potential of their observability stack. This blog post shares proven tips from our practical experience, illustrated with concrete examples from our webshop demo. 

1. Combine automatic and manual instrumentation

Auto-instrumentation, such as OpenTelemetry’s automatic agents, is an excellent start to quickly gain broad coverage across your application landscape. It captures standard interactions like HTTP requests and database calls with minimal effort, providing you with an immediate basic level of understanding. However, this approach often misses the crucial business context or the fine details about specific internal logic unique to your application, such as complex calculations in the webshop backend. 

So don’t rely solely on auto-instrumentation; supplement it with targeted manual instrumentation. By manually creating specific spans around critical business functions (think ProcessPayment or CalculateShipping in the webshop) or complex algorithmic components, you will gain much more insight. In addition, add custom metrics to track application-specific counters or meters, such as the number of active shopping carts (ActiveShoppingCarts) or failed login attempts (FailedLoginAttempts).  
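To illustrate the idea, here is a minimal, stdlib-only Python sketch of a hand-rolled span wrapper around a hypothetical CalculateShipping function. In a real service you would use your OpenTelemetry tracer's start-span API instead of this stand-in; the function, attribute names and shipping logic are invented for the example.

```python
import time
from contextlib import contextmanager

# Stand-in span store; a real tracer would export these to your backend.
SPANS = []

@contextmanager
def manual_span(name, **attributes):
    """Minimal stand-in for a tracer's start-span context manager."""
    start = time.perf_counter()
    try:
        yield attributes
    finally:
        SPANS.append({
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
            "attributes": attributes,
        })

# Wrap a critical business function from the webshop demo in a manual span.
def calculate_shipping(order_total, country):
    with manual_span("CalculateShipping", order_total=order_total, country=country):
        return 0.0 if order_total > 50 else 4.95

calculate_shipping(72.50, "BE")
print(SPANS[0]["name"])  # CalculateShipping
```

The payoff is that slow or failing shipping calculations now show up as a named span with business attributes attached, instead of disappearing inside a generic HTTP-request span.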

An Azure configuration screenshot: create work item

2. Manage context propagation in distributed systems

Modern applications, such as our e-commerce demo, are often distributed. A single click by a user, such as placing an order, can start a chain of interactions spanning multiple microservices, message queues or serverless functions. Without correct context propagation, you lose end-to-end visibility. The trace of that single click can then become fragmented into disconnected pieces of information, making it impossible to follow the complete flow and identify bottlenecks. 

It is therefore essential to ensure that the trace context is propagated consistently across all service boundaries. Pay particular attention here to asynchronous boundaries, such as message queues or background job processing. These often require explicit actions to ensure the context is correctly propagated with the message or job payload. 
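To make the asynchronous case concrete, here is a small Python sketch that explicitly carries a W3C Trace Context `traceparent` header inside a simulated queue message. The queue and payload are stand-ins, but the header format follows the actual W3C specification (version-traceid-spanid-flags); instrumentation libraries do this for you on HTTP calls, while queue producers and consumers often need it done explicitly.

```python
import secrets

def new_traceparent():
    # W3C Trace Context header: version-traceid-spanid-flags
    trace_id = secrets.token_hex(16)   # 32 hex characters
    span_id = secrets.token_hex(8)     # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"

def publish(queue, payload, traceparent):
    # Explicitly attach the trace context to the message payload,
    # since a queue does not carry HTTP headers for you.
    queue.append({"payload": payload, "traceparent": traceparent})

def consume(queue):
    msg = queue.pop(0)
    # Restore the context on the consumer side before creating child spans.
    trace_id = msg["traceparent"].split("-")[1]
    return msg["payload"], trace_id

queue = []
tp = new_traceparent()
publish(queue, {"order_id": "order-12345"}, tp)
payload, trace_id = consume(queue)
assert trace_id == tp.split("-")[1]  # the same trace continues across the queue
```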

A schema showing context propagation in distributed systems

3. Enrich data with meaningful business properties

Technical traces and logs contain a lot of information, but their value increases exponentially when linked directly to actual business processes. It is useful to know that a transaction failed, but it is far more critical to identify which customer experienced problems, or which type of transaction is consistently slow. 

That’s why you should actively enrich your traces and logs with relevant business properties, also known as tags or baggage. Add meaningful attributes such as CustomerID, OrderID, SessionID, ProductCategory, or BasketValue directly to your spans or structured logs. This allows your team to quickly filter traces for a specific customer reporting issues and aggregate performance metrics based on business dimensions. 
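One way to do this consistently is to enrich every structured log entry at a single point rather than in each call site. A sketch using only Python's standard logging module; the property names come from the examples above, and the values are illustrative.

```python
import json
import logging

class BusinessContextFilter(logging.Filter):
    """Attach business properties (CustomerID, OrderID, ...) to every record."""
    def __init__(self, **props):
        super().__init__()
        self.props = props

    def filter(self, record):
        record.business = self.props
        return True

class JsonFormatter(logging.Formatter):
    """Emit structured JSON logs including the business properties."""
    def format(self, record):
        entry = {"severity": record.levelname, "message": record.getMessage()}
        entry.update(getattr(record, "business", {}))
        return json.dumps(entry)

logger = logging.getLogger("eShop")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.addFilter(BusinessContextFilter(CustomerID="client-12345", OrderID="order-12345"))

logger.error("Payment declined by gateway.")
```

Every log line the service emits now carries the same filterable business dimensions, so "show me everything for this customer" becomes a simple query.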

An Azure configuration screenshot: create work item

4. Export data efficiently with a collector

Sending telemetry directly from each instance of your webshop APIs or frontend to your observability backend may seem simple, but this pattern often causes problems in production environments. It can adversely affect application performance under load and quickly lead to a proliferation of difficult-to-manage configurations. 

A more robust approach is to use an OpenTelemetry Collector (or similar agent) as a central intermediary, especially in production. Configure your application services to send their telemetry to this collector. This collector pattern offers significant advantages: it isolates the impact on application performance, reduces network overhead through batching, centralises the management of sampling and filtering, and provides vendor abstraction allowing you to switch backends more easily. Moreover, the collector itself can also expose metrics about its own health and telemetry throughput.
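As a rough illustration, a minimal collector configuration along these lines might look as follows. The backend endpoint is a placeholder, and your receivers, processors and exporters will differ per environment.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:            # reduces network overhead by batching exports

exporters:
  otlphttp:
    endpoint: https://backend.example.com   # placeholder observability backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Because the applications only know about the collector's OTLP endpoint, swapping the backend later means changing this one file, not redeploying every service.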

A schema showing data export through a collector

5. Be strategic with sampling

Collecting all the telemetry data from a busy webshop – every page view, every product interaction – can lead to excessively high data ingestion and storage costs. An over-abundance of data can also make it harder to find genuinely relevant signals amid the noise of routine operations. 

Consider implementing a thoughtful sampling strategy instead of simply logging everything or sampling randomly. The goal is to strike a balance between maximum visibility and manageable costs. A particularly useful technique is tail-based sampling, often facilitated by a collector setup: all spans for a trace are buffered temporarily, and the decision to retain the trace is made only once it is complete.

A visual to help decide which data to sample.

This allows you to preferentially retain traces that are slow (e.g., long checkout times), contain errors (e.g., failed payments), or meet other specific, important criteria, while sampling routine interactions (such as product browsing) more aggressively. Start by defining what is absolutely essential to retain and then apply sampling to the remainder.
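The retention rules above can be sketched as a small decision function. The thresholds, span shape and keep-rate below are illustrative; in practice a collector's tail-sampling processor would apply policies like these for you.

```python
import random

ERROR, SLOW_MS, KEEP_ROUTINE = "ERROR", 3000, 0.05  # tunable policy values

def keep_trace(spans, rng=random.random):
    """Tail-based decision: runs only once the whole trace has been buffered."""
    if any(s.get("status") == ERROR for s in spans):
        return True                      # always keep failures (e.g. failed payments)
    if sum(s["duration_ms"] for s in spans) > SLOW_MS:
        return True                      # always keep slow traces (e.g. long checkouts)
    return rng() < KEEP_ROUTINE          # sample routine browsing aggressively

checkout = [{"name": "ProcessPayment", "duration_ms": 120, "status": "ERROR"}]
browsing = [{"name": "GET /products", "duration_ms": 35, "status": "OK"}]
print(keep_trace(checkout))                    # True
print(keep_trace(browsing, rng=lambda: 0.9))   # False
```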

6. Consolidate backends where possible

When using a platform like Azure, setting up separate Application Insights instances for each microservice can lead to data silos. This makes it difficult to obtain a unified view when a problem or analysis spans multiple services. 

Therefore, consider configuring multiple Application Insights resources to send their data to a single Log Analytics Workspace. This greatly simplifies cross-service analysis and building overarching dashboards, without losing the ability to still view application-specific insights within each individual Application Insights instance.
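In infrastructure-as-code this is a one-line link per resource. A sketch in Bicep; the resource names are hypothetical and you should check the current API versions for your subscription.

```bicep
resource logs 'Microsoft.OperationalInsights/workspaces@2022-10-01' existing = {
  name: 'shared-observability-logs'
}

resource ordersApi 'Microsoft.Insights/components@2020-02-02' = {
  name: 'appi-orders-api'
  location: resourceGroup().location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: logs.id   // every service's App Insights points here
  }
}
```

Repeating the `WorkspaceResourceId` line for each microservice's Application Insights resource is what funnels all telemetry into the shared workspace.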

A schema showing several application insights sending data to a single log analytics workspace.

7. Link logs, metrics and traces

The well-known “three pillars” of observability – logs, metrics, and traces – offer the most value when they are used in conjunction, not in isolation. Seeing an error message in a log is a starting point, but being able to retrieve the full distributed trace that led to that error provides much more context. Add to that the relevant system metrics (such as CPU and memory usage) of the services involved during that particular trace, and you have a complete picture for effective root cause analysis.

So, make sure your systems are set up to easily switch between these three data types. A crucial step here is consistently adding the TraceID and SpanID to your structured logs. Then, when you examine an error log (for example, a failed webshop payment), the TraceID allows you to retrieve the corresponding end-to-end trace immediately and see the interactions between the frontend, backend APIs, and the payment gateway service. This drastically speeds up the diagnostic process.

{
  "timestamp": "2025-05-15T10:30:45.123Z",
  "severity": "ERROR",
  "message": "Payment declined by gateway.",
  "user_id": "client-12345",
  "ip_address": "192.168.1.100",
  "application": "eShop",
  "order_id": "order-12345",
  "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
  "span_id": "051581bf3cb55c13"
}
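In application code, ids like these can be injected automatically rather than added by hand at every call site. A minimal sketch with Python's standard logging module; the ids are hard-coded here for illustration, whereas a real service would read them from the current span in its tracing SDK.

```python
import logging

class TraceContextFilter(logging.Filter):
    """Copy the active trace/span ids onto every log record so logs can be
    joined to traces. A real service would fetch these from its tracing SDK."""
    def __init__(self, trace_id, span_id):
        super().__init__()
        self.trace_id, self.span_id = trace_id, span_id

    def filter(self, record):
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True

fmt = logging.Formatter('%(levelname)s %(message)s trace_id=%(trace_id)s span_id=%(span_id)s')
record = logging.LogRecord("eShop", logging.ERROR, "checkout.py", 1,
                           "Payment declined by gateway.", None, None)
TraceContextFilter("5b8aa5a2d2c872e8321cf37308d69df2", "051581bf3cb55c13").filter(record)
print(fmt.format(record))
```

Attaching the filter once to the service's root logger means every log line becomes a clickable entry point into the corresponding trace.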

8. Define custom exceptions for clarity

Relying solely on generic framework exceptions, such as a NullReferenceException or RuntimeException during the webshop checkout process, often makes it unnecessarily difficult to quickly understand what precisely went wrong. Without diving deep into the stack trace, the specific cause remains unclear: was it a problem with stock, payment, or invalid user input? 

Therefore, create and use custom exception types for specific, predictable failure conditions within your application logic. ‘Throwing’ a specific exception such as InsufficientStockException or PaymentGatewayTimeoutException immediately tells the team much more about the nature and location of the problem. This enables more precise alerting (you can choose to alert only on critical custom types), speeds up debugging, and provides clearer error reports in your logs and traces. 

throw new CheckoutValidationException("Invalid address format.");
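The same idea, sketched in Python for illustration: the exception names come from the examples above, and the stock-reservation logic is invented. Alerting on the common base class lets you treat the whole family of predictable checkout failures as one signal.

```python
class CheckoutError(Exception):
    """Base class for predictable checkout failures (hypothetical hierarchy)."""

class InsufficientStockException(CheckoutError):
    def __init__(self, product_id, requested, available):
        super().__init__(f"Only {available} of {product_id} left (requested {requested}).")
        self.product_id, self.requested, self.available = product_id, requested, available

class PaymentGatewayTimeoutException(CheckoutError):
    pass

def reserve_stock(product_id, requested, available):
    if requested > available:
        raise InsufficientStockException(product_id, requested, available)
    return available - requested

try:
    reserve_stock("sku-123", 3, 1)
except CheckoutError as exc:           # alert only on this family of errors
    print(type(exc).__name__)          # InsufficientStockException
```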

A screenshot of the Azure interface showing the 'Create an alert rule' screen.

9. Use a local OpenTelemetry collector for development

Constantly forwarding all telemetry data generated during your web shop’s development to your shared staging or even production observability backend is usually not desirable. It causes unnecessary noise, can incur significant costs, and slows down the crucial local feedback loop for developers. 

A better practice is to run an OpenTelemetry Collector locally during development. Configure it initially without an exporter or use a simple console or logging exporter that displays the data locally. Then point your local instance(s) of the webshop services to this local collector.

This allows you to inspect the generated telemetry (logs, traces, metrics) immediately after performing an action, validate your instrumentation changes locally and quickly, and develop and test without impacting shared environments or incurring unnecessary backend costs.
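A minimal local collector configuration along these lines might look as follows. This is illustrative; the debug exporter, which prints received telemetry to the collector's console, replaces the `logging` exporter in recent collector versions.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # point your local webshop services here

exporters:
  debug:
    verbosity: detailed          # print full telemetry to the console

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```

Nothing leaves the developer's machine: spans appear in the collector's terminal within seconds of clicking through the webshop, which keeps the feedback loop tight.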

10. Keep up to date with OpenTelemetry developments

The OpenTelemetry project is a living ecosystem that is actively developed and rapidly evolving. Specifications are becoming more mature, new instrumentation libraries for different languages and frameworks are continuously added, and existing libraries are regularly improved with bug fixes and performance optimisations. 

It is therefore advisable to stay informed about OpenTelemetry updates relevant to your web shop’s tech stack (e.g. for .NET, Java, Node.js, …). Keeping your OpenTelemetry dependencies up to date can bring immediate benefits such as performance improvements, access to new features (such as improved auto-instrumentation for a specific web framework), and ensuring compatibility as the observability landscape matures. Periodically checking the official OpenTelemetry blogs and repositories is a good habit to pick up.

Conclusion

Implementing even just a few of these tips can significantly improve your ability to understand and manage your systems. Moving from basic monitoring to rich, context-aware observability enables your teams to solve problems faster, optimise performance more effectively, and ultimately build more reliable applications. 

At CloudFuel, we specialise in guiding technical teams through this observability improvement journey. While this post has focused on specific technical tips, it is crucial to understand how to apply them systematically within your organisation. Watch our short animation video to learn more about how we approach observability projects at CloudFuel. 
