Indexer Operational Dashboard

Slide Overview

Background

After reviewing Indexer-related content on GitHub, Discord, and the forum, as well as chatting with a handful of Indexers, I wrote the following product vision. From here, we can break it down into tactical deliverables that help Indexers while meeting the overall goals of The Graph protocol.

I documented my research here: The Graph Indexer Experience.

Most of the proposed functionality already exists in the ecosystem: in first-class tools like the Explorer, in custom tooling built by Indexers and released to the community, or in proprietary tools.

At a feature level, I am not proposing something new under the sun.

However, the tools are fragmented, and support is sometimes unclear. Despite the plethora of tools, friction and cognitive load remain, leading Indexers to build custom tooling.

I envision a simpler deployment model that harmonizes the different views and tooling, freeing Indexers from the need to build their own from the ground up.

The empathy extends beyond the tooling to the overall goals and challenges of operating a Graph Node as a profitable business.

Benefits to the Ecosystem

Vision

Create an easy-to-install (likely container-based) operational dashboard for Indexers. (Whether a bundled approach like StakeSquid's or a stand-alone tool, I'm still not sure -- more research is needed.)

The dashboard should give Indexers not only peace of mind around their operations, but also enable them to work on higher-level concerns like building their community of Delegators or thinking more strategically about their infrastructure roll-out or allocations.

The dashboard is based on four core competencies: DevOps, GraphOps, FinOps, and DelegatorOps (detailed in the Core User Stories below).

Tackling it this way opens up the design space to impacts on the broader ecosystem.

While there is no proven mechanism where happier "token holders" successfully drive demand (in this case, more queries), it's one of several dynamics worth exploring since Indexers do have direct interaction with them.

However, I focus on the jobs to be done (JTBD) expressed by Indexers (more detail from the research can be found in The Graph Indexer Experience).

The following are centered around metrics, logging, alerting, and telemetry -- which often fall under Observability.

Core User Stories

The user stories break down key "jobs to be done" by the Indexer persona into four areas:

  1. DevOps: I want to maintain high performance and availability of my Indexing infrastructure
  2. GraphOps: I want to make optimal decisions around subgraph selection and allocations; I want an easy option to also set up subgraphs on my own infrastructure if I choose to do so
  3. FinOps: I want to manage the costs of my infrastructure to maximize operating margins and APR
  4. DelegatorOps: I want to build a healthy community of Delegators to support the Indexer

DevOps

  1. I want to see recent actions, filterable by time and attributes, to identify the most important ones
  2. I'd like to be alerted if there's something I need to address in the actions queue (an alerting sketch follows this list)
  3. I want to understand my indexer performance:
    1. By failed queries
    2. By latency
    3. By error logs (filtered by resources, time, error message)
  4. I want the right metrics and tools to optimize the infrastructure
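
To make the alerting story concrete, here is a minimal sketch of threshold-based alert rules evaluated over dashboard metrics. The metric names, thresholds, and rule shapes are illustrative assumptions, not existing Indexer tooling:

```typescript
// Hedged sketch: evaluate simple threshold rules against metric samples.
// Metric names and thresholds are assumptions for illustration only.

interface MetricSample {
  name: string;
  value: number;
  labels: Record<string, string>;
}

interface AlertRule {
  metric: string;
  threshold: number;
  message: (s: MetricSample) => string;
}

const rules: AlertRule[] = [
  { metric: "query_failure_ratio", threshold: 0.05,
    message: s => `Failed-query ratio ${(s.value * 100).toFixed(1)}% exceeds 5%` },
  { metric: "query_latency_p95_ms", threshold: 500,
    message: s => `p95 latency ${s.value}ms exceeds 500ms` },
  { metric: "actions_queue_pending", threshold: 20,
    message: s => `${s.value} actions pending in the queue` },
];

// Returns the human-readable alerts that should be raised for this scrape.
function evaluate(samples: MetricSample[]): string[] {
  return samples.flatMap(s =>
    rules
      .filter(r => r.metric === s.name && s.value > r.threshold)
      .map(r => r.message(s)),
  );
}
```

In practice these rules would more likely live in an existing alert manager than in bespoke code; the point is that the dashboard should ship with sensible defaults.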

GraphOps

  1. I want a report showing allocations and rewards by subgraph
  2. I want to be able to fine-tune an allocation model that maximizes my rewards (a toy optimizer sketch follows this list)
  3. I want to be alerted when Subgraphs have been deprecated or changed
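
As a thought experiment for the allocation story, here is a toy optimizer under a deliberately simplified reward model: each subgraph's reward pool is proportional to its curation signal, and an Indexer's share of that pool is proportional to its allocation. The numbers, field names, and greedy approach are all assumptions for illustration:

```typescript
// Hedged sketch: greedy stake allocation under a simplified reward model.

interface Subgraph {
  id: string;
  signalShare: number;     // assumed fraction of total network signal (0..1)
  othersAllocated: number; // GRT allocated by other Indexers
}

const ISSUANCE_PER_EPOCH = 100_000; // assumed reward pool per epoch, in GRT

// Reward for allocating `mine` GRT to a subgraph under the simplified model.
function reward(s: Subgraph, mine: number): number {
  if (mine <= 0) return 0;
  return ISSUANCE_PER_EPOCH * s.signalShare * (mine / (s.othersAllocated + mine));
}

// Assign stake in fixed steps to whichever subgraph currently offers the
// highest marginal reward. Crude, but it shows the shape of the problem.
function optimize(subgraphs: Subgraph[], stake: number, step = 1_000): Map<string, number> {
  const alloc = new Map<string, number>();
  for (const s of subgraphs) alloc.set(s.id, 0);
  for (let remaining = stake; remaining >= step; remaining -= step) {
    let best: Subgraph | undefined;
    let bestGain = 0;
    for (const s of subgraphs) {
      const current = alloc.get(s.id)!;
      const gain = reward(s, current + step) - reward(s, current);
      if (gain > bestGain) {
        bestGain = gain;
        best = s;
      }
    }
    if (!best) break; // no subgraph adds reward; stop allocating
    alloc.set(best.id, alloc.get(best.id)! + step);
  }
  return alloc;
}
```

A real model would also need to account for allocation lifetimes, gas costs of re-allocating, and how other Indexers react.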

FinOps

  1. I want to see my utilization and a standard cost model for my resources, whether public cloud or my own infrastructure (a margin sketch follows this list)
  2. I want a way to see performance and output of cost models
  3. I want insights to help me optimize my rewards:
    1. I want to find trends in rewards by subgraph and allocations
    2. I want to ensure performance in areas where I earn the most
    3. I want insights into which queries are profitable
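
For the cost-model story, here is a minimal sketch of how the dashboard might combine resource costs with rewards and query fees into an operating margin. The field names, units, and GRT-price conversion are assumptions:

```typescript
// Hedged sketch: estimate operating margin from assumed cost and revenue inputs.

interface ResourceCost {
  name: string;        // e.g. "vCPU-hours" or "storage-GB-months"
  unitsUsed: number;
  costPerUnit: number; // in fiat
}

interface Revenue {
  indexingRewards: number; // in GRT
  queryFees: number;       // in GRT
}

function operatingMargin(costs: ResourceCost[], rev: Revenue, grtPrice: number) {
  const totalCost = costs.reduce((sum, c) => sum + c.unitsUsed * c.costPerUnit, 0);
  const revenue = (rev.indexingRewards + rev.queryFees) * grtPrice;
  return {
    revenue,
    totalCost,
    margin: revenue > 0 ? (revenue - totalCost) / revenue : 0,
  };
}
```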

DelegatorOps

  1. I want to ensure quality customer service and relations with Delegators
  2. I want to better understand how we are delivering on Delegators' financial expectations (an APR sketch follows this list)
  3. I want insights into duration and delegation patterns of Delegators
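
One concrete Delegator-facing insight could be an estimated APR after the Indexer's reward cut. The sketch below assumes the on-chain parts-per-million convention for reward cuts; the field names and epoch cadence are otherwise assumptions:

```typescript
// Hedged sketch: estimated Delegator APR after the Indexer's reward cut.

interface IndexerParams {
  delegatedStake: number;      // GRT delegated to the Indexer
  rewardsPerEpoch: number;     // indexing rewards earned per epoch, in GRT
  indexerRewardCutPpm: number; // reward cut in parts-per-million
  epochsPerYear: number;       // e.g. ~365 if epochs are roughly daily
}

function estimatedDelegatorApr(p: IndexerParams): number {
  const delegatorShare = 1 - p.indexerRewardCutPpm / 1_000_000;
  const yearlyRewards = p.rewardsPerEpoch * delegatorShare * p.epochsPerYear;
  return yearlyRewards / p.delegatedStake; // e.g. 0.08 means ~8% APR
}
```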

Potential Metrics

The following represent metrics that can help Indexers achieve their jobs to be done.

High leverage comes from the right metrics and telemetry. Creative DevOps can repair and optimize with the right insights.

(Note: I recognize there are many community tools that provide these and much more. The fragmentation I identified across them presents an opportunity for a better experience.)
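
To ground this, here is a minimal sketch of how a dashboard collector might expose Indexer-centric metrics for scraping, assuming a Node/TypeScript stack and the prom-client library. The metric names and port are illustrative, not an existing standard:

```typescript
// Hedged sketch: expose illustrative Indexer metrics via prom-client.
import http from "node:http";
import client from "prom-client";

const register = new client.Registry();

const failedQueries = new client.Counter({
  name: "indexer_failed_queries_total", // assumed name, not a standard
  help: "Queries that returned an error, by subgraph deployment",
  labelNames: ["deployment"],
  registers: [register],
});

const queryLatency = new client.Histogram({
  name: "indexer_query_latency_seconds", // assumed name, not a standard
  help: "Query latency distribution, by subgraph deployment",
  labelNames: ["deployment"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],
  registers: [register],
});

// Call this from the query-serving path to record each request.
function recordQuery(deployment: string, seconds: number, ok: boolean) {
  queryLatency.observe({ deployment }, seconds);
  if (!ok) failedQueries.inc({ deployment });
}

// Expose /metrics for the dashboard (or Prometheus) to scrape.
http.createServer(async (_req, res) => {
  res.setHeader("Content-Type", register.contentType);
  res.end(await register.metrics());
}).listen(9102); // assumed port
```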

Telemetry

Metrics, however, are only a starting point.

It seems Indexers may want to be able to drill down into any hot spots and see the relevant logs so they can take action.

There's still research to do on which actions would be helpful and how Indexers would want to see the logs.

Some options include a click-through from the UI into a CLI to see logs in the terminal.
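
A rough sketch of that click-through idea: the dashboard backend could stream logs for a selected container by shelling out to `docker logs`, pushing lines back to the UI. The container name and error-level filter are assumptions:

```typescript
// Hedged sketch: stream container logs behind a UI "view logs" action.
import { spawn } from "node:child_process";

function streamLogs(container: string, onLine: (line: string) => void): () => void {
  // `docker logs --follow --tail 100` replays recent lines, then follows.
  const proc = spawn("docker", ["logs", "--follow", "--tail", "100", container]);
  proc.stdout.on("data", (chunk: Buffer) => {
    for (const line of chunk.toString().split("\n")) {
      if (line.trim()) onLine(line);
    }
  });
  return () => proc.kill(); // caller stops the stream when the view closes
}

// Example: surface only error-level lines from an assumed container name.
// The "ERRO" level marker is an assumption about the log format.
const stop = streamLogs("graph-node", line => {
  if (line.includes("ERRO")) console.log(line);
});
```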

Another option is pre-configured log management tooling like the ELK stack.

Remediation

If we provide basic metrics and telemetry into hotspots, we may want to explore opportunities for remediation from within the dashboard.

Typically at this point, Indexers would go to the CLI. However, even within the DevOps industry there seem to be differing views on what setup is involved, depending on the types of errors.

Possible areas of exploration, surfaced by talking to Indexers, include simplifying visibility or the available remediation actions in areas such as:

Hardware resource allocations

Providing insight into where, and how many, resources should be allocated to improve performance.
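
As a sketch of what such insight might look like, the dashboard could flag components whose sustained utilization suggests re-sizing. The component names and thresholds here are assumptions:

```typescript
// Hedged sketch: turn utilization samples into resource-sizing hints.

interface Utilization {
  component: string; // e.g. "graph-node", "postgres" (assumed names)
  cpuPct: number;    // sustained CPU utilization, 0..100
  memPct: number;    // sustained memory utilization, 0..100
}

function resourceHints(samples: Utilization[]): string[] {
  return samples.flatMap(u => {
    const hints: string[] = [];
    if (u.cpuPct > 85) hints.push(`${u.component}: CPU-bound, consider more cores`);
    if (u.memPct > 90) hints.push(`${u.component}: memory pressure, consider more RAM`);
    if (u.cpuPct < 15 && u.memPct < 20) hints.push(`${u.component}: likely over-provisioned`);
    return hints;
  });
}
```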

Query performance

This can include easier ways to develop indexing filters, or ways to propose data-schema changes back to the Subgraph developer based on observed query data.

Traffic management

If relevant, there might be opportunities to improve load balancing and routing.

These are areas of exploration with the end-goal of simplifying the experience for Indexers.

Doing so could widen the persona to the non-professional node operators discussed in The Graph Indexer Experience.

Design Principles

For the most part, the metrics themselves aren't what is missing (though I spoke to one Indexer who created their own logging agent because they needed more fields).

I feel that a better UX can create higher leverage: better query performance at lower cost, including MTTR-related overhead.

These principles may include the following:

  1. Operational Ease - the UX looks at the holistic operations of being an Indexer
  2. Opinionated with Flexibility - "bring your own tools" isn't the top-level approach; we have an opportunity to be opinionated while retaining some flexibility, with operational ease as the goal
  3. Ecosystem Aware - the ecosystem is the first-class citizen; the metrics give Indexers a way to optimize based on insight into, and empathy for, the entire ecosystem

Deployment Requirements

One focus area is making the dashboard itself easy to set up and operate.

Setting Up

  1. The installation should be a "one-click" package (Docker image or pre-made script to install from source)
  2. Logging endpoints for each Indexer instance should ideally be auto-discoverable, with operator confirmation (a discovery sketch follows this list)
  3. Easy ingress for any external-facing dashboards
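
A minimal sketch of the auto-discovery idea: probe candidate hosts on a few conventional metrics ports and let the operator confirm what was found. The port list is an assumption, and this relies on the built-in fetch from Node 18+:

```typescript
// Hedged sketch: probe hosts for /metrics endpoints and collect the hits.

const CANDIDATE_PORTS = [8040, 7300, 9090]; // assumed common metrics ports

async function discoverEndpoints(hosts: string[]): Promise<string[]> {
  const found: string[] = [];
  for (const host of hosts) {
    for (const port of CANDIDATE_PORTS) {
      const url = `http://${host}:${port}/metrics`;
      try {
        const res = await fetch(url, { signal: AbortSignal.timeout(1_000) });
        if (res.ok) found.push(url); // operator confirms before it is saved
      } catch {
        // unreachable or timed out: skip this candidate
      }
    }
  }
  return found;
}
```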

Configurability

  1. Ability to ingest and display custom fields if added to the logging agent (sketched after this list)
  2. Different slices and views
  3. Ability to add remediation actions
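
For custom-field ingestion, here is a sketch of parsing structured log lines, separating an assumed base schema from whatever extra fields an operator's agent adds:

```typescript
// Hedged sketch: split known log fields from operator-added custom fields.
// The base schema is an assumption, not graph-node's actual log format.

const BASE_FIELDS = new Set(["timestamp", "level", "msg"]);

interface ParsedLine {
  timestamp?: string;
  level?: string;
  msg?: string;
  custom: Record<string, unknown>; // anything extra the logging agent added
}

function parseLine(raw: string): ParsedLine | null {
  try {
    const obj = JSON.parse(raw) as Record<string, unknown>;
    const custom: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(obj)) {
      if (!BASE_FIELDS.has(key)) custom[key] = value;
    }
    return {
      timestamp: obj.timestamp as string | undefined,
      level: obj.level as string | undefined,
      msg: obj.msg as string | undefined,
      custom,
    };
  } catch {
    return null; // not JSON: fall back to plain-text handling
  }
}
```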

Questions and Initial Thoughts

It appears there are different sets of tooling being built or used by Indexers.

More areas to flesh out and get feedback on:

  1. How will specific data be displayed on the dashboard? Which charts, graphs, or statistics make sense, are used, or are perhaps not currently used but would add value?
  2. How will the data be aggregated and calculated (and can it be)? For example, to show total network indexing demand, would Indexers self-report their demand cryptographically to protect sensitive data? How would excess capacity be measured?
  3. How customizable will the dashboard be? This is not only at the UI but at the agent level.
  4. What is the development process? I see GIPs as the way ideas are introduced, but would maintenance be handled entirely by a core dev team?
  5. Are there potential downsides or unintended consequences of an Indexer dashboard? For example, will certain metrics introduce harmful incentives or competitive dynamics in the network? How do we avoid dashboard metrics being misinterpreted or misused?