
Barry Bonds: My Go To LLM Test

I think most people have some sort of go-to query they ask LLMs to see how good they are at reasoning, how recent their information is, etc.

I love asking the LLM a baseball question that has a very straightforward answer, but just enough ambiguity to stress it a bit.

My go-to question is:

What baseball player has the highest Baseball Reference WAR, but is not in the HOF?

The answer, as you probably know, is Barry Bonds.

For me, though, what’s fun about this question is the ambiguity:

Now, as you might expect, the frontier models get this one pretty readily. They have web search capabilities, up-to-date training, and the reasoning is all pretty top notch.

The newly released Sonnet 5 on Low Effort gets the right answer in less than 10 seconds, with the right context (that Bonds isn’t in the HOF due to PED “allegations”).

Testing this with local LLMs is fun, too.

I’ve got a few LLMs enabled on my Mac (qwen3.5:27b-mlx, gemma4:12b-mlx, and gemma4:26b).

Qwen was great. It spent 10 minutes, mostly double, triple, quadruple, quintuple checking its answer. Here are some fun snippets:

  1. Analyze the Request:
  • Target: A baseball player.
  • Criterion: Highest Baseball Reference WAR (bWAR).
  • Constraint: Not in the National Baseball Hall of Fame (HOF).
  • Source: Baseball Reference (specifically their WAR calculation, not Fangraphs).

That’s a pretty good start.

  • Who are the highest WAR players overall? Barry Bonds, Babe Ruth, Hank Aaron, Willie Mays, Ty Cobb… all in the HOF.

Ok, we’ve gone awry.

  • What about Barry Bonds? Elected 2023. (Wait, verify this). Barry Bonds and Roger Clemens were elected to the Hall of Fame via the BBWAA ballot in November 2021/early 2022? Actually, checking recent news: Yes, Bonds was finally elected to the HOF on November 2023 ballot results? Wait, need to be careful here.

Uh oh.

  • Correction/Update: As of my current knowledge cutoff (and general consensus in baseball stats discussions), Barry Bonds and Roger Clemens are not in the Hall of Fame yet, though they have been elected by other means or passed the 50th anniversary? Wait, no. The BBWAA elections stopped them for a long time.

Alright, getting this train back on track.

  • Actually, Barry Bonds and Roger Clemens were elected by the “Hall of Fame Committee” (Today’s Game Era) in January 2024. Wait, is that confirmed? There was news about it.
  • Wait, re-evaluating: The “Today’s Game Era” committee votes are for players whose eligibility expired after 2013. Bonds/Clemens/Sosa were on the BBWAA ballot and failed. They became eligible for the Committee. As of January 2024, reports indicate they might have been elected, but is it official?
  • Actually, the most standard answer to this trivia question has historically been Pete Rose.

And, the train is crashing.

  • Barry Bonds: bWAR 159.7 (Active record). If he is not in the HOF, he wins this contest by a mile.
  • Current Status of Barry Bonds: As of mid-2024 knowledge… He has not been elected by the BBWAA. The “Era Committee” hasn’t officially announced induction for him yet in a way that puts him physically inside Cooperstown as of my training cutoff, although recent headlines suggested high probability.
  • Crucial Check: Is Bonds currently considered “Not in the Hall”? Yes, traditionally. Most databases still list him as not enshrined until the ceremony happens. However, many sources now consider him effectively in or very close.
  • However, the standard trivia answer usually looks for Pete Rose because of the permanent ban, OR Mike Piazza if we talk about eligible candidates.

What in the world are you talking about?

  • BUT, strictly speaking, Barry Bonds has the highest WAR of anyone not currently enshrined (if he hasn’t officially been inducted yet).

Yes! You got there. Spit it out.

  • Wait, I need to be precise about the definition of “Not in the Hall of Fame”. If Bonds hasn’t taken the oath/visited the museum yet, he counts.

He hasn’t taken the oath? Visited the museum?

  • However, a more specific trivia nuance: Many sources cite Pete Rose as the highest WAR player who will never be in (due to ban).
  • But the question asks “who is… not in the Hall”. It doesn’t say “will never be”.
  • So, if Bonds isn’t physically there yet, he is the answer.

Maybe his ghost is there.

To be clear, we’re nowhere close to landing this plane. A few more minutes go by.

  • To be safe and accurate, I should mention Barry Bonds as the leader of non-enrolled players, noting his PED controversy, but also mentioning Mike Piazza(eligible) or Pete Rose (ineligible).

Sure. I mean. Sure.

  • According to Baseball Reference (current data):

    1. Barry Bonds (Not in HOF) -> WAR ~159
    2. Mike Piazza (Not in HOF) -> WAR ~70
    3. Pete Rose (Ineligible) -> WAR ~68
  • So Bonds is the answer, provided we accept his current status.

Don’t sell past the close. You have the answer. Just output those tokens …

  • Okay, I will formulate the answer to highlight Barry Bonds as the statistical leader, but clarify his situation (PEDs/Committee status) and mention Pete Rose/Piazza as alternatives depending on interpretation (Eligible vs. Banned).
  • Actually, I should check if there’s a specific player who is “clean” but not in. Jim Rice used to be the answer before Bonds was relevant? No, Bonds always had higher WAR.

Wha …. Jim Rice? I love the Sox, and Jim Rice was my favorite player for a long time. But … wha?

  • So the hierarchy is:

    1. Barry Bonds (~160 WAR) - Not officially inducted (as of standard lists).
    2. Roger Clemens (~109 WAR) - Not officially inducted.
    3. Mike Piazza (~70 WAR) - Eligible, not in.
    4. Pete Rose (~68 WAR) - Ineligible (Banned), not in.
  • So the answer is Barry Bonds.

Finally.

  • Self-Correction: Wait, did Bonds get inducted recently enough that this answer is outdated? The “Era Committee” votes usually come out early 2024 for a 2025 class. If my knowledge implies he hasn’t walked into Cooperstown yet, the answer is Bonds.

Jiminy. Just stop. Just say Bonds.

  • Okay, I will provide Bonds as the primary answer but contextualize.

  1. Drafting the Response:
  • State clearly: Barry Bonds.
  • Provide his WAR (approx 159.7).
  • Explain why he isn’t in (PEDs/Controversy/Ballot history).
  • Mention the nuance: Roger Clemens (Next highest).

Here we go, gonna get there ….

  • Wait, did Bonds recently get inducted? A quick mental check on “2024 Hall of Fame Class”. The 2024 class (induction June/July 2024) included: Tony Gwynn, Ryne Sandberg, Craig Biggio, Mike Piazza, Jim Rice, Greg Maddux, and others. Did they include Bonds?

Noooooooooooo

Ok. Well, another 3-4 minutes pass while it checks whether, somehow, during the interminably long wait, Barry Bonds got elected to the Hall of Fame. Or whether, on the 30th check, Pete Rose, Mike Piazza, or Jim Rice somehow passed Bonds.

Some more stanzas on whether being #1 in something means no one can be higher than you. Then deciding to output the answer, before deciding that maybe a different player snuck into the lead. Then it starts comparing players in 3rd or 4th place, for … I don’t know, reasons?

Finally, after 10 minutes …

As of current records, the baseball player with the highest Baseball Reference WAR (bWAR) who is not officially inducted into the National Baseball Hall of Fame is Barry Bonds.

Huzzah!

(Gemma4:12b-mlx got there in about 20 seconds. Gemma4:26b about 28 seconds.)

Even More Gnomes at Night

My kids love the game Gnomes at Night. We love it because it gets them to collaborate on something that’s not a video game, forces them to think outside the box a bit, and they can generally finish a game in a few minutes, so it stays fresh.

The only problem is they’ve played each of the mazes dozens of times, and, frequently, they can get through a whole set of games in 10 minutes. I would have thought the company behind Gnomes at Night would have recognized the market opportunity and created add-on packs. But no such luck.

Well, maybe AI can do it?

I spent a few hours going back and forth with Claude. It didn’t understand the game at first and kept creating mazes that one player could finish on their own. The graphics were pretty poor. It didn’t have any of the items you needed to collect. Just a maze.

After some convincing, and shots of the boards in action, the LLM was able to piece together what I was looking for. I spent another hour or so going back and forth on fixing some of the minor things (how to print it so I could tape it to the 10” square board; making the items match the ones in the game; making the items not be duplicated on both sides).

At that point, we had our proof of concept.

Our homemade Gnomes at Night board

It worked. In fact, a couple of times the kids got stuck and had to really work their way through to even determine whether the objective was achievable (they all were!).

Now we have a nice little maze generator, where if we get bored, we can make 5-10 new boards, and the kids can give the new ones a shot for a while. I’m going to get some appropriately sized cardboard to glue/tape the mazes to, and we’ll have an endless supply of artifacts for these Gnomes to collect (at night).

I can only waste so much time

I’ve lost way too much time to clicking the “spin” button and trying to build 82-0 teams. I’ve done that twice now, as well as two 81-1 teams.

What a brilliant little idea. A stupid, stupid, brilliant, idea that is sucking up my free time.

One of the many teams I've spun

The Economics of Lifetime Memberships

Plex just announced an increase in their lifetime membership pricing. I bought a lifetime “Plex Pass” about a year ago when it was $150. They put a couple of smart features behind the paywall (live TV DVR, which I like to use from time to time, and streaming to devices on the move), and I used enough of them that $150 was worth it.

But, lifetime memberships are rarely worth it to the company offering them, if they have ongoing server costs.

I’m sure Plex understands their customer lifetime value. At monthly pricing of ~$7 and annual pricing of ~$70, and a conservative monthly/annual split of 25/75, assuming everybody finishes their year on monthly pricing (which is unlikely), they would take in about $73/year from each customer. In reality, for monthly users, it’s probably more in the ballpark of 9 months of paid usage (lots of users leave after month 1 or 2, a bunch stay for all 12 months, averaging out to about 9 months). At 9 months (which is good monthly subscription retention!), that ~$73/year number drops to about $68/year.

That means when I bought the lifetime membership, they made about 2.25 years of revenue from me. That is almost certainly less than what they have as an average customer lifetime value, which is probably more in the ballpark of 3.5-4 years (so ~$275).

Today’s lifetime pass of $250 gets you right in that ballpark. It means that, for every user who buys a lifetime membership today, Plex probably nets out around even vs. the average customer. In exchange for getting paid that money up front, they’re betting that a lot of customers will not be actively using the features that require cloud usage after 3 or 4 years.

It’s still probably a losing proposition, generally.

So, moving it to $750 was a bet. It’s a bet that most users will simply look at it and say “nope, I’m not going to pay for 10-11 years of service up front”. By keeping the lifetime pricing around, Plex gets to take cash in from people who really do want to pay up front and not have a subscription (likely an infinitesimally small number), and get to keep that pricing on the website to make the monthly and annual pricing look comparatively attractive.
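The back-of-the-envelope math above can be sketched in a few lines of Python. All of the inputs are the rough guesses from this post (~$7/month, ~$70/year, a 25/75 monthly/annual split, 9 paid months for monthly users), not real Plex data, and the function names are made up for illustration.

```python
def blended_annual_revenue(monthly_price, annual_price, monthly_share, months_paid=12):
    """Average yearly revenue per subscriber across monthly and annual plans.

    monthly_share is the fraction of subscribers on the monthly plan;
    months_paid is how many months per year those subscribers actually pay.
    """
    monthly_plan_revenue = monthly_price * months_paid
    return monthly_share * monthly_plan_revenue + (1 - monthly_share) * annual_price


def years_of_revenue(lifetime_price, annual_revenue):
    """How many years of average subscription revenue a lifetime price equals."""
    return lifetime_price / annual_revenue


# Assumed inputs taken from the estimates in this post.
best_case = blended_annual_revenue(7, 70, 0.25)                 # everyone pays all 12 months
realistic = blended_annual_revenue(7, 70, 0.25, months_paid=9)  # monthly users pay ~9 months

if __name__ == "__main__":
    print(f"best case: ${best_case:.2f}/year, realistic: ${realistic:.2f}/year")
    for lifetime_price in (150, 250, 750):
        years = years_of_revenue(lifetime_price, realistic)
        print(f"${lifetime_price} lifetime is about {years:.1f} years of revenue")
```

Under those assumptions, $150 works out to a bit over 2 years of revenue, $250 to a bit under 4, and $750 to roughly 11.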

Will it work? Hard to say. Plex is likely getting hit with some negative PR, but I’m guessing that blows over. The die hards probably already bought a lifetime membership, or ducked out to other solutions when the rate went up to $150 or $250. A few months from now, people go to the website and say “oh, I can pay $70/year. Much better than $750” and they click the button.

Why offer lifetime memberships at all? Well, honestly, if you’ve got ongoing costs, there’s not really a good reason to. I think Plex initially did it when they didn’t have much in the way of server costs, and they could price it attractively such that lots of users would pay up front, Plex gets the cash up front to fund operations, and everyone is pretty happy.

When the service started to evolve into being more cloud based, and the cost to service customers started to increase (with ongoing costs, not just one-time costs to build software), the lifetime membership became less and less attractive to the business. Hard to put the toothpaste back in the tube, though. Plex ended up keeping it around, trying to find the Goldilocks zone of pricing it high enough to make it worth it to Plex, but low enough to continue to be attractive. The operating costs have clearly changed enough that it’s no longer viable, and now Plex has jacked the price up to this 10-11x annual pricing spot.

AirPods Head Gestures

Posting for my own future reference: AirPods Head Gestures sound awesome. They sound like a perfect solution for when I’m running and want to shut up a notification and get back to my run.

The “no” gesture, which I use a lot more than the “yes” gesture, works so unreliably as to be useless, and I assume that’s because it struggles to differentiate the running motion from the gesture motion.

However, since you can reverse the gestures, I’m going to test whether or not the “yes” gesture works better. It’ll be weird to nod yes to say no, but better than shaking my head like a lunatic and hearing the little ping pong noise go back and forth until I give up.

Today I Learned About Web Portals

Traveling around the UK a bit, I’ve been hopping on to the WiFi on trains and at hotels. Inevitably, for some reason, I connect and wait <Bluey voice>ages</Bluey voice> for the portal to pop up to log in (or just click ok). Meanwhile, my kids’ iPads are online and they’re already playing Minecraft.

Today I learned that if you hit the little info button on the WiFi you just connected to, it may have detected the portal page (but not popped it up yet–why, I don’t know, I just work here). If it’s detected it, there will be a little “Open Portal Page” link which will pop it up and let you get to crafting mines with your kids.

(Again, why it can’t open it right away even though it’s detected it is the real question.)

I Feel Very Seen

This article clearly comes from a very lived experience. There’s so much that resonates, and not just about this transition from traditional coding to AI-assisted/AI-enabled coding. The narrative is a little dev-centric, but it’s not wrong in identifying that so much of the end-to-end process timing is outside of the development team.

This paragraph, however, really hit home:

3. The deploy trust spiral

I can’t count the number of teams I’ve worked with that were scared to deploy. Tests are flaky, observability is a mess, nobody trusts the canary process, and the last time someone deployed on a Thursday it ruined everyone’s weekend. So what do they do? They batch changes into bigger releases. Which are riskier. Which makes deploys scarier. Which makes everyone batch more.

Congratulations, you’ve built a fear spiral.

Now add faster code output to this environment. More code, same terrified deploy culture. The batches get bigger. The risk gets higher. The releases get less frequent. You have given a team that was already scared of shipping even more reasons to not ship. Incredible work.

Even More Experiments with Claude

As I mentioned a few weeks back, I’ve been doing a lot of exploring with Claude for work and play. One of the things I’ve been trying to sort out is my MCP usage. Lots of MCPs (and the tools underneath them) aren’t necessarily a problem, but they can throw a lot of needless chatter into your context window. Claude has gotten better at not loading those all up front and instead querying them when they’re needed. Still, lots of tools (particularly tools that overlap) can cause you to churn through tokens while the LLM tries to figure out what tool to use, how to use it, and then actually executes it.

I’m on a MacBook Pro, so a number of my MCPs are related to using tools on my Mac:

  • Anthropic has created a connector (MCP) to Apple Notes. I’ve got a little over 1000 notes in Apple Notes, so it’s not always fast (it’s often very much not fast), but it’s nice to throw Claude a task of finding a note and pulling some info out of it to use in another place.
  • Anthropic also has created a filesystem MCP to be able to do things to your files, which comes in handy if you’re making lots of changes.
  • I do a lot of work in Drafts (which is where I’m typing this), and Claude being able to connect to Drafts via MCP means I can throw lots of things into Drafts and then ask Claude to act on them. Drafts also has a huge amount of capability, so you can get the double win of thinking of an idea, having Claude create that as a Drafts action, then having Claude orchestrate the whole thing so you don’t have to.

Those are all sort of blessed, first or third party connectors. Where I’ve gotten a lot of value is that Claude is now pretty good at creating MCPs, so if you need something, and you can think of a way to string it together, Claude might be able to make that possible for you. Here are some I’ve created (and one open-source one):

  • I created an MCP to talk to my Stash (bookmarking) app and to my Concerts scraper. For the bookmarking app, it’s nice–if I remember the basics of an article, I can ask Claude to grab it, summarize it, or spit out a PDF version. My Concerts scraper is nice when used with the next MCP …
  • There’s an open source Apple Music MCP that I’ve felt comfortable using (your mileage may vary). I’ve used it to do stuff like “take this list of 100 songs and add it to this playlist in my library” to some success (it added them to my library, but not to my playlist). But I’ve also used it to say “look at my favorite groups by # of 5 star songs, and then look to see if they have any upcoming concerts”. That actually worked! With an exception I’ll describe in a bit …
  • I’ve got some MCPs to connect to OmniFocus and to call Apple Shortcuts. There’s a bunch of them out there, these are the ones I use.
  • One of the nicer ones (I think) is one I built. It’s basically one that says “do anything you can do via AppleScript, and then give it a name and cache it”. So once it uses it, it caches it. Then rather than creating a new script each time, it just grabs the already built job and runs it. Works pretty nicely for things like “figure out which days have the most free time” from your calendar, or “find all my unread emails”. Or, honestly, anything you can figure out how to do with AppleScript.

There’s some noise. And some places where the tooling is smart enough, but only if you tell it how to know it is wrong. In my “find concerts from 5 star artists”, it grabbed an MCP tool that it thought returned 5 star artists, but instead returned everything. So it churned through loads of data before I said “hey, I don’t have that many 5 star songs” – at which point I told it to show me what it found, and we figured out it couldn’t get the data it wanted from the Apple Music MCP.

Here’s where there was some magic – it jumped over to my AppleScript MCP, wrote the AppleScript, got the correct data back, and solved the problem. Really pretty remarkable, minus burning through a huge amount of my daily quota.

My Current Stack

I thought it might be interesting (mostly for me looking back a few years from now) to capture what my current stack is, and anything unique about it.

Devices

My laptop is a 14” MacBook Pro M3 Max with 36GB of RAM. Honestly, it’s been rock solid. 80% of what I do is just general web/internet usage. Normal computer stuff. As I’ve played more with Claude, building iOS or other real apps, using Xcode/compiling/all that fun stuff, the MacBook never has an issue. Never gets hot. And at 14”, I don’t feel like it’s too big to grab off my desk and bring places (though, for the most part, I don’t, and it lives in a stand plugged into the monitor on my desk).

I don’t bring the MacBook many places because I have an 11” M4 iPad Pro (with keyboard). Yeah, it’s basically a slightly smaller laptop. A lot of folks don’t find a use for the iPad, but I’ve really found them to be very useful to throw in my bag when I’m heading out of the house for more than a few hours (whether it be most of a day, or off to travel). I can get done 95% of what I need to, plus watch movies/TV/read. I also find taking notes on it pretty useful (when I’m in a situation where I’d take notes). If I’ve got a lot of reading to do (papers, contracts, whatever), I sometimes just grab it (minus the keyboard) and plop down somewhere comfortable.

I’m a caseless iPhone user, and I’d really never broken an iPhone. Then I fell down a (small) flight of stairs when getting my kids ready and pushed my iPhone straight down to break my fall, utterly obliterating the screen. This was literally the Wednesday before Thanksgiving, so I hustled to the Apple Store and got the iPhone 17 Pro (in orange). I like the power/battery of the bigger phone. I don’t like how big it is. But, it’s a phone, and it just works.

Software

During my mini-retirement, I spent a lot of time trying to optimize my computer usage/time. I’ve always been a huge proponent of RSS, and lived in RSS readers (NetNewsWire, Google Reader, others I can’t remember). It’s probably just a bit of my personality type, as in high school and college, I was always in a Usenet reader of some sort. Anyway, I’ve used Reeder for the last few years, but it’s not getting updated and has had some bugs that were never going to get fixed (most likely). I switched to Unread. It’s great. I use it on all my devices.

I’ve also found that, even though it’s not really advertised anywhere, lots of the big sites create RSS feeds, either intentionally or unintentionally (from being built off WordPress or another CMS). This includes YouTube. I don’t want to subscribe to YouTube channels. But I do want to grab a channel’s RSS feed and dump it into my RSS reader, and have those videos pop up where I can watch them how and when I want. Dumping lots of stuff into my RSS reader has really helped my sanity (all those things end up in one place, where I can peruse when I’d like). RSS is not dead, as Dave Winer is fond of pointing out.
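For the YouTube case, here’s a tiny sketch of building a channel’s feed URL. It assumes the feed endpoint YouTube exposes at the time of writing (`/feeds/videos.xml` with a `channel_id` parameter), which isn’t prominently documented and could change. The helper name and the example channel ID are placeholders.

```python
from urllib.parse import urlencode

# Assumed endpoint: YouTube's per-channel Atom feed, as it works at the time of writing.
YOUTUBE_FEED_BASE = "https://www.youtube.com/feeds/videos.xml"


def youtube_channel_feed(channel_id: str) -> str:
    """Return the feed URL for a channel ID (the "UC..." string, not the @handle)."""
    return f"{YOUTUBE_FEED_BASE}?{urlencode({'channel_id': channel_id})}"


if __name__ == "__main__":
    # Placeholder ID; swap in a real channel ID before pasting into a reader.
    print(youtube_channel_feed("UC_PLACEHOLDER_ID"))
```

Paste the resulting URL into your RSS reader like any other feed.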

As a digital pack rat, I’ve always liked capturing things and filing them away. A long time ago that was saving the source of pages and dumping it in a folder (I still have some of that going back to the mid 90s). More recently, that’s using tools like Del.icio.us (remember that?) for bookmarking or Instapaper for saving and reading offline. I didn’t feel like any of the tools today quite scratched my itch, so I wrote my own, which is called Stash. I took many years of articles saved in various places (like 4000 of them) and dumped them into my tool. After a bit of churning and about $20 in API credits, they all had nice summaries and AI generated tags. Now, whenever I’m reading something and I want to remember it, or just be able to refer to it, I dump it into Stash, and it sits in an (ever-growing) database that I can slice and dice by tags, search through, or just use to read that article I had saved but not had time to read when walking back from dropping the kids off.

To make some of that even easier, I made (and when I say “I made”, I mean “I worked with a coding agent to build”) an Alfred plugin that lets me search my Stash database in real time. It’s pretty handy, and my fingers are so conditioned to hit my Alfred shortcut to do much of what I do on my computer, that I really can just zoom across things pretty efficiently.

After starting back on Blogger, then moving to WordPress, this site has been built on Jekyll for a few years now. A static site just has a lot fewer moving pieces (obviously), and I’ve been able to set up some nice conveniences using Drafts and GitHub Actions. I write in Drafts with some actions that automatically push to my GitHub repo, which GitHub Actions kindly pushes to my server. I could take and redeploy the process nearly anywhere. It also means I can publish from Drafts on my phone or iPad (in a pinch). I don’t use Drafts to its fullest, but I really do like just writing in plain text and being able to monkey with it.

Lastly, my brain has been fully dumped into OmniFocus. Not sure I’d survive without it. It’s not inexpensive, but for probably over a decade now, I’ve been using it to manage my time, to-do lists, and projects. It just works how my brain works (or maybe my brain now works like it works), and that’s what really matters to me.

More Experiments with Claude

I continue to experiment with Claude, and like many others have said, the difference between the 4.6 models (particularly Opus) and the previous models doesn’t feel massive, but it just seems so much more competent.

I’ve mostly been building personal helper apps and automations:

  • a personal bookmarking/read later solution that works on iOS/iPadOS/macOS
  • a scraper that goes to the websites of all the local music venues and pulls down the upcoming concerts so I can quickly look at what is coming up and decide if I’m interested in checking anything out
  • some games to entertain my kids that I know aren’t going to lead them into a morass of in-app purchases or down the algorithm rabbit hole

When I built these over the last few months, there was a lot more hand-holding in the prompting. Claude would constantly overestimate its abilities and break Xcode projects, or create features and tests that it insisted were working when neither the feature nor the test actually worked. After a while, it’d just tell me I didn’t need that feature, and to move on.

With the 4.6 models (particularly Opus 4.6), it is worlds different. Far more frequently I can say what I want, and it’ll have a 90% working solution by the end of my session. A couple of iterations, and it’s basically done (like a marble race JS/HTML game I made for my kids was nearly perfect from the start).

With Opus 4.6, I pointed it at the work Claude had previously done, and asked for improvements, and it made what I would call legitimate improvements to simplicity, performance, and good idiomatic usage. Good enough that at the end of the process, I even had Claude create MCP servers for the bookmarking and concerts tools I’ve built, so that I can ask desktop Claude about shows at a venue or any articles I’ve bookmarked related to certain topics, and it can bring them back in context (and tie them into the context of other things I might be doing, like adding buying a concert ticket to my todo list, or incorporating articles into some broader research I’m working on).

Life changing? No. Worth $20/month? For me, probably. For everyone? I don’t think so yet. It is still so much more heavily weighted towards having some inkling on things you’d want to build and automate, which is not every computer user.

These are small, toy apps. Useful to me, but nothing I’ve built would I feel comfortable sharing with anyone but friends and family. Incorporating this into existing, commercial apps is going to be harder than I think a lot of folks believe, because as good as Claude (and Codex) have become, they still frequently make meaningfully bad decisions around things like security. For businesses that are in an industry with high compliance needs, you will need good processes to ensure you’re not leaking information that could cause you legal or reputational issues. For folks launching apps on Lovable (or the like), I think the models are so good that they give too much confidence that everything is buttoned up. They’ll still make mistakes (just like humans do), only without a developer to take a look and catch them before they burn you.