Transform Your Insights with Strobelight: The Open Source Profiling Service Redefining Data Analysis


  • This post introduces Strobelight, Meta’s profiling orchestrator.
  • Strobelight combines several technologies, many of them open source, into a single service that Meta engineers use to profile their code.
  • Strobelight has driven significant efficiency wins, including one that saved an estimated 15,000 servers’ worth of capacity per year.

Strobelight isn’t a single technology. It combines several, many of them open source, into a service that orchestrates profilers across Meta’s production hosts, and that combination drives significant efficiency improvements. Strobelight collects performance data, such as CPU and memory usage, from running processes. Developers use this information to spot issues and optimize their code.

When engineers have access to detailed performance data, they can catch problems before they reach production and find inefficiencies in existing code. For instance, if a developer inadvertently adds a large object to their service’s critical path, our tools can catch the issue and query Strobelight data to estimate its impact, warning the developer that the change could waste resources equivalent to 20,000 servers.

Sure, static analysis tools can catch some issues, but they often miss the bigger picture of overall compute costs. Sometimes, inefficiencies only surface when a service starts processing millions of requests.

Why Use Profilers?

Profilers collect data to perform statistical analysis. They sample events over time, helping engineers see where and how often specific actions occur. For example, by sampling CPU cycles, a profiler can reveal how much time is spent in particular functions, offering insights into a service’s performance.
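
To make the sampling idea concrete, here is a minimal Python sketch of how a list of samples becomes a per-function estimate. The function names and sample counts are invented for illustration, and this is not Strobelight code: it only shows the statistical principle that the share of samples landing in a function approximates the share of CPU time spent there.

```python
from collections import Counter


def estimate_time_share(samples):
    """Return the fraction of samples attributed to each function.

    `samples` holds one function name per sampled event, namely the
    function that was executing when the sample fired.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {func: n / total for func, n in counts.items()}


# Hypothetical samples collected from one process.
samples = ["parse_request"] * 6 + ["serialize"] * 3 + ["log_metrics"] * 1
shares = estimate_time_share(samples)

for func, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{func}: {share:.0%} of samples")
```

The more samples are collected, the closer these fractions get to the true time distribution, which is why profilers trade sampling rate against overhead.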

Choosing Your Path with Strobelight

While other daemons at Meta track observability metrics, Strobelight focuses specifically on software profiling. It connects resource usage directly to the source code, making it easier for developers to understand what’s happening. Many of Strobelight’s profilers are built using eBPF, a Linux kernel technology that lets engineers safely inject custom code to gather data efficiently.

Currently, Strobelight features 42 different profilers, including:

  • Memory profilers utilizing jemalloc.
  • Function call count profilers.
  • Event-based profilers for various programming languages, including Python and Java.
  • AI/GPU profilers.
  • Profilers that monitor off-CPU time.
  • Profilers for service request latency.

Engineers can easily gather data from servers through Strobelight’s command line tool or web interface.

Figure: The Strobelight web interface.

Users can also set up continuous profiling for any of these profilers by updating a configuration file in Meta’s Configerator. The configuration can target specific services or hosts in specific regions, and it specifies how often the profilers should run, for how long, and much more.

For example, here’s a basic configuration:

add_continuous_override_for_offcpu_data(
    "my_awesome_team", // the team responsible for this service
    Type.SERVICE_ID,
    "my_awesome_service",
    30_000, // desired samples per hour
)

Strobelight has so many profilers because of the diverse technologies and tasks within our systems.

Additionally, Strobelight supports ad-hoc profiling. This is helpful because the type of data needed can vary. While adding a new profiler from scratch takes time, engineers can quickly write a single bpftrace script for immediate use instead.

Although powerful, Strobelight includes safeguards. These prevent performance hits for the workloads being targeted and maintain database integrity. Strobelight also makes sure that profilers don’t interfere with each other. For example, if one profiler is tracking CPU cycles, Strobelight won’t allow another profiler to use the same counter, which avoids conflicts.

Strobelight also employs concurrency rules and a profiler queuing system. Service owners can opt to run extensive data collections for debugging needs when necessary.
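
One way to picture the conflict check and the queue is the following Python sketch. It is an illustration under stated assumptions, not Strobelight's implementation: the class ProfilerScheduler, its methods, and the resource names are all hypothetical. A profiler starts only if none of the counters it needs are in use; otherwise it waits until they are released.

```python
from collections import deque


class ProfilerScheduler:
    """Hypothetical scheduler: one profiler per hardware counter at a time."""

    def __init__(self):
        self._in_use = {}      # resource name -> profiler holding it
        self._queue = deque()  # (profiler, resources) waiting to run

    def request(self, profiler, resources):
        """Start the profiler if its resources are free, else queue it.

        Returns True if the profiler started immediately.
        """
        resources = tuple(resources)
        if any(r in self._in_use for r in resources):
            self._queue.append((profiler, resources))
            return False
        for r in resources:
            self._in_use[r] = profiler
        return True

    def release(self, profiler):
        """Free a profiler's resources and start any queued profilers
        that can now run. Returns the names of the profilers started."""
        self._in_use = {r: p for r, p in self._in_use.items() if p != profiler}
        started, still_waiting = [], deque()
        while self._queue:
            name, resources = self._queue.popleft()
            if any(r in self._in_use for r in resources):
                still_waiting.append((name, resources))
            else:
                for r in resources:
                    self._in_use[r] = name
                started.append(name)
        self._queue = still_waiting
        return started


scheduler = ProfilerScheduler()
first = scheduler.request("event_profiler", ["cpu_cycles"])
second = scheduler.request("lbr_profiler", ["cpu_cycles"])  # counter is busy
third = scheduler.request("offcpu_profiler", ["sched_switch"])
resumed = scheduler.release("event_profiler")
```

Here the second request is queued because the CPU cycles counter is taken, and it starts as soon as the first profiler releases the counter.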

Data for All

From the beginning, Strobelight has prioritized collecting profiling data automatically for all Meta services. Think of it as a flight recorder—something you don’t consider until you need it. It’s critical to have data when a service fails.

Strobelight runs curated profilers automatically on every Meta host, with custom intervals and sampling rates tailored to each workload. This setup collects just the right amount of data without overwhelming the system.

For instance, consider a service called Soft Server running across 1,000 hosts. If we want profiler A to collect 40,000 CPU cycle samples per hour, Strobelight starts with a conservative run probability. Deciding at random when to run avoids bias, since always profiling at the same time of day (at peak, for example) may distort results.

Each day, Strobelight adjusts its sampling rate based on the data collected, tuning the process for every service at Meta.
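
The daily adjustment can be sketched as a simple feedback loop. The Python below is an assumption about how such tuning could work, not the rule Strobelight actually uses: the function adjust_run_probability, the proportional scaling, and the doubling fallback are all hypothetical.

```python
def adjust_run_probability(current_probability, samples_collected,
                           target_samples, min_probability=1e-6):
    """Scale the run probability so the next period's sample count
    moves toward the target (assumed proportional rule)."""
    if samples_collected <= 0:
        # No data at all: raise the probability cautiously (assumption).
        return min(1.0, current_probability * 2)
    scaled = current_probability * target_samples / samples_collected
    # Clamp so the probability stays valid and never reaches zero.
    return max(min_probability, min(1.0, scaled))


target = 40_000  # desired samples per hour, from the example above

# Day 1 collected too few samples, so the probability rises.
day2 = adjust_run_probability(0.01, samples_collected=10_000,
                              target_samples=target)

# Day 2 overshot, so the probability falls again.
day3 = adjust_run_probability(day2, samples_collected=80_000,
                              target_samples=target)
```

Starting low and scaling up is the conservative direction: an undersampled day costs some data, while an oversampled day costs the workload performance.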

But what if multiple services run on the same host? Strobelight uses efficient configurations to ensure both services get sufficient data collection.

Strobelight can still compare and aggregate data across hosts because it assigns each sample a weight that normalizes the data. This means that even when services are profiled at different rates, the data remains comparable.
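
As an illustration of how weighting makes differently sampled profiles comparable, the Python sketch below scales each raw count by the inverse of its run probability. Inverse-probability weighting is an assumption here, since the exact weighting scheme is not described, and the function name and data are hypothetical.

```python
from collections import defaultdict


def aggregate_weighted(profiles):
    """Combine per-host sample counts into normalized totals.

    `profiles` is a list of (run_probability, {function: raw_count}).
    Each raw count is multiplied by 1 / run_probability, so a host that
    was profiled half as often counts twice as much per sample.
    """
    totals = defaultdict(float)
    for run_probability, counts in profiles:
        if not 0 < run_probability <= 1:
            raise ValueError("run_probability must be in (0, 1]")
        weight = 1.0 / run_probability
        for func, n in counts.items():
            totals[func] += n * weight
    return dict(totals)


# Two hosts with the same real behavior, profiled at different rates.
profiles = [
    (0.25, {"parse_request": 250, "serialize": 50}),
    (0.5, {"parse_request": 500, "serialize": 100}),
]
totals = aggregate_weighted(profiles)
```

After weighting, both hosts contribute the same amount to each function's total even though the second one produced twice as many raw samples.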

How Strobelight Saves Resources

Two continuous profilers deserve special mention for their impact on resource savings.

The Last Branch Record (LBR) Profiler

The LBR profiler samples last branch records, an Intel hardware feature that logs the most recently taken branches. Its data feeds into our feedback-directed optimization (FDO) pipeline, which builds the profiles consumed at compile time. This can cut CPU cycles by up to 20% for our largest services, which translates to significant server reductions.

The Event Profiler

Strobelight’s event profiler, similar to Linux’s perf tool, captures stack traces from performance events like CPU cycles and cache misses. Engineers utilize this data to assess critical functions and identify regressions before they impact production.

Performance Insights

Analyzing function call stacks with flame graphs is valuable. However, service owners often find numerous external functions cluttering their analysis. They may only care about latency-critical request paths, or they may want to find specific patterns such as unintended string copies.
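
For readers unfamiliar with how flame graphs are built, the Python sketch below collapses sampled stacks into the "folded" form (frames joined by semicolons, followed by a count) that flame graph tools commonly consume. The stacks and function names are invented for illustration.

```python
from collections import Counter


def fold_stacks(stacks):
    """Collapse identical stacks into 'root;...;leaf' -> sample count."""
    return Counter(";".join(stack) for stack in stacks)


def inclusive_counts(folded):
    """For each function, count the samples whose stack contains it."""
    totals = Counter()
    for key, count in folded.items():
        # Use a set so recursive frames are not counted twice per sample.
        for func in set(key.split(";")):
            totals[func] += count
    return totals


# Hypothetical sampled stacks, root frame first.
stacks = [
    ["main", "handle", "parse"],
    ["main", "handle", "parse"],
    ["main", "handle", "copy_string"],
    ["main", "idle"],
]
folded = fold_stacks(stacks)
inclusive = inclusive_counts(folded)

for key, count in sorted(folded.items()):
    print(key, count)
```

Even in this tiny example every stack shares the same outer frames, which hints at why real profiles with hundreds of framework frames benefit from filtering.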

Stack Schemas

Strobelight enriches its data with mechanisms like Stack Schemas. A Stack Schema lets users tag call stacks or remove unwanted functions using regular expressions, so they can tailor the data to their needs. With these tags, engineers can build dashboards that identify inefficiencies across vast fleets of machines.
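
The Python sketch below shows the general idea of regex-based filtering and tagging on a single call stack. It does not reproduce the actual Stack Schema syntax, which is not described here: the function apply_schema, its parameters, and the frame names are hypothetical.

```python
import re


def apply_schema(stack, drop_patterns, tag_rules):
    """Filter frames and derive tags for one call stack.

    stack:         list of function names, root frame first
    drop_patterns: regex strings; frames matching any of them are removed
    tag_rules:     (regex string, tag) pairs; a tag is attached when any
                   frame in the original stack matches its regex
    """
    drop = [re.compile(p) for p in drop_patterns]
    kept = [f for f in stack if not any(r.search(f) for r in drop)]
    tags = sorted(
        {tag for pattern, tag in tag_rules
         if any(re.search(pattern, f) for f in stack)}
    )
    return kept, tags


# Hypothetical stack with framework frames the service owner wants hidden.
stack = [
    "main",
    "framework::EventLoop::run",
    "MyService::handle",
    "std::string::string",
]
kept, tags = apply_schema(
    stack,
    drop_patterns=[r"^framework::"],
    tag_rules=[(r"^std::string::string$", "string_copy")],
)
```

A dashboard could then group samples by tag, for example to total all samples tagged string_copy across a fleet.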

Strobemeta

Another feature, Strobemeta, attaches dynamic metadata to call stacks at runtime. This is where eBPF is particularly powerful: it lets profilers run custom logic while collecting data, which gives engineers richer options for filtering.

Symbolization Techniques

Symbolization is crucial because it converts virtual instruction addresses into actual function names. Doing this efficiently is challenging, especially because the debug data involved can be massive.

Strobelight tackles this by using various open-source technologies and by organizing the required data for efficient access. It also defers symbolization until after profiling, so that it adds no extra computational load while samples are being collected.
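
At its core, symbolization is a range lookup from an address to the function that contains it. The Python sketch below shows that lookup with a sorted symbol table and binary search, applied after the raw addresses have been collected. The addresses, sizes, and names are invented, and real symbolizers also handle details this sketch ignores, such as inlined functions and address space layout randomization.

```python
import bisect


class SymbolTable:
    """Maps instruction addresses to function names via binary search."""

    def __init__(self, symbols):
        # symbols: iterable of (start_address, size, name)
        self._symbols = sorted(symbols)
        self._starts = [start for start, _, _ in self._symbols]

    def lookup(self, address):
        # Find the last symbol that starts at or before the address.
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None
        start, size, name = self._symbols[i]
        return name if address < start + size else None


def symbolize(raw_stacks, table):
    """Resolve collected stacks after profiling has finished.

    Unknown addresses are kept as hex strings rather than dropped.
    """
    return [[table.lookup(addr) or hex(addr) for addr in stack]
            for stack in raw_stacks]


table = SymbolTable([
    (0x2000, 0x80, "handle_request"),
    (0x1000, 0x100, "main"),
])
raw_stacks = [[0x1010, 0x2004, 0x3000]]  # collected as plain addresses
resolved = symbolize(raw_stacks, table)
```

Collecting only addresses during profiling keeps the work done on the hot path small, and the lookup cost is paid later.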

This streamlined approach is possible thanks to frame pointers included in our user binaries, ensuring efficient stack walking and data collection.

Figure: A simplified graphic showing the Strobelight service.

Making Data Beautiful

Strobelight users primarily rely on Scuba, a powerful query language and visualization tool. Scuba’s interface offers various visualizations for the data, enhancing understanding and insights.

Once a profiling session completes, users can quickly visualize the results in Scuba, sharing insights with their teams efficiently. Additionally, tools like Perfetto extend the querying capabilities, allowing detailed analysis of results.

Figure: An example of a flame graph showing function call stacks for the mononoke service.

Another visualization tool used within Meta is Tracery, which combines different profiles into one display. This tool is perfect for timeline views and allows for custom visualizations that help engineers pinpoint key data points.

Figure: An example trace in Tracery.

A Notable Success

Strobelight has brought tremendous benefits to Meta, increasing efficiency and reducing latency across the board.

One standout success is what we call the “Biggest Ampersand.”

A skilled performance engineer, sifting through Strobelight data, pinpointed an expensive array copy in a C++ function. The copy came from declaring a variable with the ‘auto’ keyword, which deduces a value type and therefore copies unless a reference is requested. Realizing the copy was unintentional, the engineer fixed it by adding a single character, an ‘&’ after ‘auto’, so that the variable became a reference. This tiny change saved around 15,000 servers per year!

Yes, you heard right—just one ampersand!

Looking Ahead

This overview only touches on Strobelight’s capabilities. Our team continuously collaborates with Meta’s engineers to introduce new features for better performance analysis.

We are working on open-sourcing Strobelight’s profilers, making them more robust and accessible. Many technologies behind Strobelight are already open source, and we encourage contributions to these tools!

Acknowledgements

Thanks to Wenlei He, Andrii Nakryiko, Giuseppe Ottaviano, Mark Santaniello, Nathan Slingerland, Anita Zhang, and the Profilers Team at Meta.


