A Tale of a Typo: How a Small Error Revealed UX Issues in Datadog
As an ardent user of Datadog's monitoring and analytics platform, I recently ran into a problem that I believe highlights an opportunity to improve the platform significantly.
In a recent tweet, I briefly touched on my encounter, stating:
"Dear @datadoghq, an enhancement request - we could really use a conf.yaml linter and quick tester. Here's my journey: On 2023-06-26 at 17:55:41, I started the agent on the host, followed by a datadog-agent check. A painstaking 15+ minutes later at 18:02:22, I discover a typo in my Postgres database name. Frustrating! #UXissues
Why this matters? This delayed feedback and requirement for host access turned what should've been a quick process into a 3-day investigation! After initially launching the agent and waiting in vain for 5-10 minutes, I moved on to other work... until I finally revisited and checked the logs today. Can't help but think a UI feature for adding a target without needing to edit a host file would've saved so much time! #BetterMonitoring"
A Journey Begins
On June 26, 2023, at exactly 17:55:41, I embarked on what seemed like a routine task. The mission was straightforward: start the Datadog agent on the host using the command
sudo service datadog-agent start
and subsequently run a
datadog-agent check
I ran the check with this command:
sudo DBM_THREADED_JOB_RUN_SYNC=true \
DD_LOG_LEVEL=debug \
datadog-agent check postgres -t 2 | tee /tmp/dd.debug
For those not fully immersed in the intricacies of the tech world, this command might come across as complex jargon. I agree! That command for monitoring the agent launch is world-class propeller-head material. However, for me and many others in my field, it's a standard part of our workflow. I expected to watch the agent launch and see it pick up my database, affectionately named "dumbo."
An Unexpected Delay
The monitoring process unfolded as expected, or so it seemed at first. After waiting for about 5-10 minutes, I realized that "dumbo" was not present in the output from the datadog-agent check nor in the output file /tmp/dd.debug. Puzzled by this absence, I decided to switch gears and focus on my other work, keeping the unresolved issue on the back burner.
Fast forward three days. A colleague of mine asked about the replication delay on "dumbo." I turned to Datadog, hoping for insights, only to find that the database was still missing from the monitoring dashboard. My curiosity piqued, I decided it was time to revisit the issue and dig into the logs.
Upon inspecting the debug file on the host running the agent, I found a service check timestamped at 18:02:22, nearly 7 minutes after I had initially launched the agent at 17:55:41. The service check read:
=== Service Checks ===
[
  {
    "check": "postgres.can_connect",
    "host_name": "dumbo",
    "timestamp": 1687802542,
    "status": 2,
    "message": "Error establishing connection to postgres://dumbo:/kylelf, error is FATAL: database \"kylelf\" does not exist\n",
    "tags": [
      "db:kylelf",
      "port:socket",
      "server:dumbo"
    ]
  },
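For anyone who wants to find an entry like this without scrolling through pages of debug output, here is a small sketch. The helper names `find_connect_errors` and `epoch_to_utc` are made up for this post, and the timestamp conversion assumes GNU date (BSD and macOS date use different options).

```shell
# Hypothetical helper: list connection failures recorded in a saved check output.
# The search string is the message prefix seen in the service check above.
find_connect_errors() {
  grep -n 'Error establishing connection' "$1"
}

# Hypothetical helper: decode a service check's epoch timestamp into UTC.
# Assumes GNU date, which accepts the "@seconds" form for -d.
epoch_to_utc() {
  date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# Usage against the file written by tee earlier in this post:
#   find_connect_errors /tmp/dd.debug
#   epoch_to_utc 1687802542    # prints 2023-06-26 18:02:22
```

The grep prints the line number of each match, which makes it easy to jump to the surrounding service check in the file.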
The Epiphany
Suddenly, the missing piece of the puzzle came to light. The feedback delay and the subsequent three-day investigation were due to a simple typo in the Postgres database name. What was initially intended to be a quick check turned into a drawn-out and time-consuming ordeal, all due to delayed error feedback and an inconvenient requirement for host access.
Reflecting on this experience, I couldn't help but think: there has to be a better way.
And indeed, I believe there is.
The Case for a conf.yaml Linter and Quick Tester
My experience highlights the pressing need for a conf.yaml linter and quick tester. Concretely, I would like to see:
a conf.yaml linter
a quick connectivity check driven by conf.yaml
a UI in Datadog for adding a new target, with no host file to edit - what a thought!
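To make the request concrete, here is a rough sketch of the kind of quick tester I have in mind. It is not a Datadog feature and not a real YAML parser: `extract_dbnames` and `lint_dbnames` are hypothetical helpers that do a naive line scan for the `dbname` key and compare each value against a list of databases read from stdin. The psql user, the maintenance database, and the config path in the usage comment are assumptions about a typical Linux install.

```shell
# Hypothetical helper: print every dbname value found in a conf.yaml-style file.
# Naive line-based scan; commented-out lines are skipped, quotes are stripped.
extract_dbnames() {
  sed -n 's/^[[:space:]-]*dbname:[[:space:]]*//p' "$1" \
    | tr -d "\"'" \
    | sed 's/[[:space:]]*#.*$//; s/[[:space:]]*$//'
}

# Hypothetical helper: fail if a configured dbname is not in the list of
# existing databases supplied on stdin (one name per line).
lint_dbnames() {
  conf=$1
  existing=$(cat)
  rc=0
  for db in $(extract_dbnames "$conf"); do
    if ! printf '%s\n' "$existing" | grep -Fxq -- "$db"; then
      echo "ERROR: database \"$db\" in $conf does not exist on the server" >&2
      rc=1
    fi
  done
  return $rc
}

# Usage (assumed user "datadog", assumed database "postgres", assumed path):
#   psql -h dumbo -U datadog -d postgres -Atc 'SELECT datname FROM pg_database' \
#     | lint_dbnames /etc/datadog-agent/conf.d/postgres.d/conf.yaml
```

Run before starting the agent, something along these lines would have flagged "kylelf" in seconds rather than days. Reading the database list from stdin keeps the check easy to try against a fixture without touching a live server.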