A Time I Failed

Wednesday, August 16th, 2023

At first, it seemed like we had a clear plan.

Building out data and analytics was a priority, and the scale at the time, 1 billion events per day, was large.

The goal was to create valuable, public-facing data (and from there, enable customer-specific analytics).

What I attempted to do was work backwards: what could this look like?

So I began with questions about what would be useful as global data (imagine a free dashboard for a sampling of global traffic) and what would be useful for customers.

This ranged from traffic sources and common referrers (we couldn't use visited websites because that identified customers and was proprietary) to IPs and IP ranges, ASNs, geography, attack vectors, and content types.

Once this table was in hand as a starting point, I was able to socialize it with other people in the company and collect feedback on what could be a valuable v1.

Then came discussion with engineering on whether we could operationalize this, and the challenges began. In retrospect, I would have tried a different set of approaches.

When I talked with the Data EM and the lead, they said that an engineer was already working on analytics and aggregation, and told me to work with him.

So I presented the proposed data representation, and he said it wasn't possible to do.

This already put us on a difficult path. What I would have done differently is bring a more directed problem-solving approach much earlier, from beginning to end. That would have meant sharing the requirements, adding my understanding of what the specific engineering challenges were, setting a true north for what we wanted to do, and then doing a "fill in the blank" of what was missing.

Instead, I talked to the EM, who then sat down next to the engineer and said, "Can you make this work?" and the engineer said, "I can try."

I proposed that we scope down to a single example that we could use publicly: the mobile device identifier sent with requests.

The engineer said that this would take specific work, first to capture the field and then to aggregate it. I took notes on what was involved, but I was focused on getting the "spike" done: it would help us flesh out problems and give us a short-term win.

We worked on it, but I noticed that the process was brittle. I would ask why there was an error in the data (I would sanity-check dates and totals), the engineer would be surprised, then come back and say there was an error (the schema was wrong, a system was down, the aggregation tooling was wrong), and we went back and forth.

I had an informal conversation with the EM about how we could systematically check data quality, and his response was that engineering just needed to stop f****ing up.

What I would have done differently is a) ask how we could reduce the human errors, and b) insist that we still needed better data verification, or it would bite us.
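To make that second point concrete, here is a minimal sketch, in Python, of the kind of verification I was doing by hand. The table shape and field names (daily aggregate rows with a day and a total, plus independently computed raw counts) are hypothetical, not our actual schema; the point is that date continuity and total reconciliation can be checked mechanically rather than by a PM eyeballing a report.

```python
# Hypothetical example: automated checks on a daily aggregate table.
# The "aggregates" here are just dicts with a day and a total; the real
# pipeline's schema was more complex and is not reproduced here.

from datetime import date, timedelta

def check_date_continuity(rows, start, end):
    """Return the days in [start, end] that have no aggregate row."""
    seen = {r["day"] for r in rows}
    expected = {start + timedelta(days=i) for i in range((end - start).days + 1)}
    return sorted(expected - seen)  # empty list means the check passed

def check_totals(rows, raw_daily_counts, tolerance=0.0):
    """Return (day, aggregated, raw) tuples where totals disagree beyond tolerance."""
    mismatches = []
    for r in rows:
        raw = raw_daily_counts.get(r["day"], 0)
        if abs(r["total"] - raw) > tolerance * max(raw, 1):
            mismatches.append((r["day"], r["total"], raw))
    return mismatches

if __name__ == "__main__":
    aggregates = [
        {"day": date(2023, 8, 14), "total": 1_000},
        {"day": date(2023, 8, 16), "total": 950},  # Aug 15 is missing entirely
    ]
    raw_counts = {date(2023, 8, 14): 1_000, date(2023, 8, 16): 940}

    print("missing days:", check_date_continuity(aggregates, date(2023, 8, 14), date(2023, 8, 16)))
    print("total mismatches:", check_totals(aggregates, raw_counts))
```

Even a couple of checks like these, run after every aggregation job, might have caught the kind of summation error described below before it reached launch night.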

Well, fast forward: I planned a launch article based on the data, and the night before, I checked the data and it was wrong (the summations were incorrect, I believe).

I called the engineer at home, shared my concerns with him, and he agreed to work on a fix so we could do another run and post the results the next morning.

This was definitely a problem, and I wish now that we had done a post-mortem with the formalities I described earlier, reviewing: what are the goals, what is the engineering design, what are the hurdles to get there, what is the timeline, and who are the owners.

I focused on delivering it, and we shipped the fix, but we still had the ongoing problem of more reports.

So then I thought we should at least sanity-check these, and we did, manually reconstructing each one. But none of them could be turned into an online analytical reporting tool, either globally or for specific customers.

At this point, I raised again with the EM and the lead that this was not going to be a good way to proceed and that I didn't have clarity on how we were going to deliver, and they said they were going to get a new engineer to work on it and that that should solve it.

A new engineer came, but the leadership left.

A new EM, without a data background, came in.

So we restarted with the requirements, now more fully fleshed out, and on top of that I had working prototypes because, to fill the gap with enterprise customers, we had worked with a third-party tool.

So we had detailed documents and a working temporary stopgap solution. This pleased the field, but the CEO was very unhappy.

The new team said they would experiment with Spark, and nothing materialized; then with Flink, and still nothing materialized after months. I wasn't able to work with the EM because he was unfamiliar with the data space, so I worked with individual engineers, and they said it was hard and that they were building it themselves.

My mistake was not opening a broader conversation and inviting more people in after discussing my concerns with the EM. Instead, I descoped and asked whether we could pick just one shippable feature that had a high-cardinality problem (we called these Top-N).
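To give a sense of why Top-N was its own class of problem: with high-cardinality dimensions like referrers, IPs, or ASNs, you can't keep an exact counter per key at a billion events a day. Below is a minimal sketch of one standard technique for this, a Misra-Gries summary; it is purely illustrative, and I am not claiming this is what the team built.

```python
# Illustration only: a Misra-Gries summary, one standard technique for approximate
# Top-N over a high-cardinality stream. Not the team's actual implementation.

from collections import Counter

class MisraGries:
    """Keep at most k counters; any item occurring more than n/(k+1) times survives."""

    def __init__(self, k):
        self.k = k
        self.counters = Counter()

    def add(self, item):
        if item in self.counters or len(self.counters) < self.k:
            self.counters[item] += 1
        else:
            # No room for a new counter: decrement everything and evict zeros.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def top(self, n):
        return self.counters.most_common(n)

if __name__ == "__main__":
    # e.g. a stream of referrers: two heavy hitters buried in a long tail
    stream = ["a.example"] * 500 + ["b.example"] * 300 + [f"tail-{i}" for i in range(5_000)]
    summary = MisraGries(k=100)
    for referrer in stream:
        summary.add(referrer)
    print(summary.top(5))  # the two heavy hitters surface at the top of the list
```

The appeal of a summary like this is that memory stays bounded by k no matter how many distinct keys appear; the trade-off is that counts are underestimates, off by at most n/(k+1).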

Struggles ensued, but finally we produced something.

But I was failing, and I was not sure what to do. I had requirements, but when I went to the VP of Engineering, he asked me where they were. I hadn't brought the full, broad scope to a wider audience to shine a light on the issue. My issue was that I was trying to solve it single-threaded.

So I approached it another way: I looked at what other companies were doing at a comparable scale, met with them, and asked what tooling they used. One used the open-source project Druid, and I arranged for the project's creator and commercial founder to speak with our team.

We opened with the use cases (the functional requirements as well as the non-functional ones), described the challenges, and said let's have an open forum. The engineers said this seemed good.

Afterwards, I debriefed with the engineers to see what they thought, and they said they wouldn't go through with it. I went back to the founder/tech lead and asked him whether there had been sidebar conversations, or whether he had read the room differently from how I had.

He said he thought there hadn't been any tough questions asked, just some seemingly irrelevant ones, but nothing major.

So we were stuck, again.

I then went to lunch with the former data engineering leadership and asked what they thought of the situation.

He said that the normal engineering to do this was possible and known, except that the number of nodes required exceeded the budget from the CEO: it needed to be cut by a factor of 10x. He felt that was an impossibility.

With that information, I wasn't sure what to do. But I should have raised the concern, again, with a broader team, tasked whoever could validate it (on either the engineering side or the budget side) with helping me understand the full trade-offs and issues, and then workshopped a meaningful solution.

When I moved off the project (I was already carrying three full products at this point, not features but products), the next PM also wasn't able to deliver for a year; then new engineers were brought in, along with Clickhouse.

This resulted in internal tools based on Clickhouse and eventually a customer-facing version.

The global analytics that were the original vision still hadn't materialized.

So what were my takeaways from this?