mstdn.games is one of the many independent Mastodon servers you can use to participate in the fediverse.
We are a gaming-focused space on Mastodon. We welcome everyone who enjoys any type of gaming - it doesn't just need to be video games. Let's build a diverse and inclusive community together!

#claude

15 posts · 10 participants · 1 post today

I've been unemployed since 2023 or so, so I haven't gotten to use LLMs at work much yet. I'm actually kind of excited for it; I'm far more dangerous now. Maybe I can be one of those 10x devs who brings down prod on a Friday because he's refactoring the whole codebase for no reason!!

#ai #copilot #chatgpt
Replied in thread

Hmmm... the KML file looks OK to me, but in BRouter I get errors like

"Error loading tracks: GeoJSON has no valid layers."

or

"Error loading tracks: The operation is insecure."

And individual markers keep going missing, or the text is incomplete in the way just described.

Before I bug @mjaschen about it, I'll test it in Google Earth first.

Replied in thread

I've now taught the script to get the feed data from JSON files that I downloaded beforehand, and to fetch only the details for each individual game over HTTP. For now, that's basically a substitute for implementing paging.

Now I should get roughly 5x as many places (everything since mid-April).
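
A minimal Python sketch of the approach described in this post: feed entries come from previously downloaded JSON files, and only the per-game details are requested over HTTP. The post does not show the actual script, so everything here is an assumption made for illustration: the file layout (one JSON list per saved page), the `id` field, the `DETAIL_URL` endpoint, and all function names are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable, Iterator
from urllib.request import urlopen

# Hypothetical endpoint; the post does not name the real one.
DETAIL_URL = "https://example.invalid/games/{game_id}"


def load_feed_items(feed_dir: Path) -> Iterator[dict]:
    """Yield feed entries from every saved *.json page, skipping duplicate ids."""
    seen = set()
    for path in sorted(feed_dir.glob("*.json")):
        page = json.loads(path.read_text(encoding="utf-8"))
        # Assumed layout: each file holds a JSON list of entries with an "id".
        for entry in page:
            if entry["id"] not in seen:
                seen.add(entry["id"])
                yield entry


def http_get_json(url: str) -> dict:
    """Fetch one URL and decode its JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def collect_games(
    feed_dir: Path, fetch: Callable[[str], dict] = http_get_json
) -> list[dict]:
    """Merge offline feed entries with per-game details fetched over HTTP."""
    games = []
    for entry in load_feed_items(feed_dir):
        details = fetch(DETAIL_URL.format(game_id=entry["id"]))
        games.append({**entry, **details})
    return games
```

Passing the fetch function as a parameter keeps the offline part testable without network access. Reading several saved pages in sorted order and skipping repeated ids stands in for real paging, which is the trade-off the post describes.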

Continued thread

Okay... I blew through the context window and had to start a new conversation and first describe what this is about and what needs to be done now, but computing the answer to that blew through the context window again. 😂

The free plan also seems to be very limited (or you just aren't allowed to attach three files).

@willmcgugan I'm super excited to read your #Toad announcement!

willmcgugan.github.io/announci

I think making it available to sponsors for 5K/mo is an excellent idea and I wish you every success not just with this endeavor but with getting the funding you so richly deserve to work on it!

And I will eagerly but patiently await the public release! I use #Claude #Code #AI a bunch and it would be nice to have something better than the default!

Will McGugan · Announcing Toad - a universal UI for agentic coding in the terminal
I’m a little salty that neither Anthropic nor Google reached out to me before they released their terminal-based AI coding agents.

"While the risk of a billion-dollar-plus jury verdict is real, it’s important to note that judges routinely slash massive statutory damages awards — sometimes by orders of magnitude. Federal judges, in particular, tend to be skeptical of letting jury awards reach levels that would bankrupt a major company. As a matter of practice (and sometimes doctrine), judges rarely issue rulings that would outright force a company out of business, and are generally sympathetic to arguments about practical business consequences. So while the jury’s damages calculation will be the headline risk, it probably won’t be the last word.

On Thursday, the company filed a motion to stay — a request to essentially pause the case — in which they acknowledged the books covered likely number “in the millions.” Anthropic’s lawyers also wrote, “the specter of unprecedented and potentially business-threatening statutory damages against the smallest one of the many companies developing [large language models] with the same books data” (though it’s worth noting they have an incentive to amplify the stakes in the case to the judge).

The company could settle, but doing so could still cost billions given the scope of potential penalties."

obsolete.pub/p/anthropic-faces

Obsolete · Anthropic Faces Potentially “Business-Ending” Copyright Lawsuit
By Garrison Lovely

What are the results of the '#AccountingBench' #benchmark, which tests an #AI model for monthly #accounting tasks?

> #Gemini 2.5 Pro, #chatGPT o3, and o4-mini were unable to close the books for a month and gave up midway. #Claude 4 and #Grok 4 maintained accuracy of over 95% for the first few months, but Grok's score dropped sharply in the fifth month. Claude 4's score also gradually dropped, eventually falling below 85%.

gigazine.net/gsc_news/en/20250

GIGAZINE · What are the results of the 'AccountingBench' benchmark, which tests an AI model for monthly accounting tasks?

AccountingBench, developed by accounting software developer Penrose, is a benchmark designed to evaluate how accurately large-scale language models can process the long-term, complex task of monthly closing in a real business environment. The biggest feature of this benchmark is that, unlike traditional question-and-answer style tests, it reproduces real-world work in which a single action has a lasting effect on subsequent tasks and errors accumulate over time.

Can LLMs Do Accounting? | Penrose https://accounting.penrose.com/

AccountingBench is a highly realistic test that tests how accurately an AI can perform a year's worth of monthly accounting for a real company. The AI agent uses a variety of tools similar to those used by accountants to perform monthly accounting, checking the company's financial records against bank balances, outstanding payments from customers, and other data to ensure they match up accurately.

Penrose has summarized the results of running AccountingBench on Claude 4 (Opus/Sonnet), Grok 4, Gemini 2.5 Pro, o3, and o4 mini in the graph below. Gemini 2.5 Pro, o3, and o4-mini were unable to close the books for a month and gave up midway. Claude 4 and Grok 4 maintained accuracy of over 95% for the first few months, but Grok's score dropped sharply in the fifth month. Claude 4's score also gradually dropped, eventually falling below 85%.

The reason why AccountingBench is a harsh benchmark for AI is that 'one small mistake can cause a big problem later.' For example, if the AI mistakenly classifies an expense as 'software expense' in the first month, it is a small mistake at that time, but the mistake will remain as a record from the next month onwards. When the AI looks back at the books a few months later, it will be confused by the data it made in the past and make an even bigger mistake.

In addition, the AI's 'human-like' behavior in passing automated checks is also highlighted. For example, when Claude and Grok's figures did not match the bank balance, they would 'cheat' by finding completely unrelated transactions from the database to make up for the difference. In addition, when GPT and Gemini got into a complicated situation, they were reported to give up midway without completing the task, get into a loop where they repeated the same process over and over again, or abandon the task by reporting that 'the necessary information is not enough to complete the accounting.'

Penrose points out that there is a big gap between the high performance of LLMs in a simulated environment and their actual ability to perform complex tasks in the real world: LLMs can outperform humans on question-and-answer tests or short tasks, but the situation is completely different when they are used to perform tasks using real business data over a year, such as AccountingBench.

Given these results, Penrose states that the most important challenge in future LLM development is to 'shift the focus from the ability to simply complete a task to the ability to complete it correctly.' Even the latest AI models at the time of writing still attempted to pass validation despite instructions, leaving clear room for improvement. Penrose concludes that evaluations that reflect real-world complexity, such as AccountingBench, are essential to measure the true capabilities of LLMs and guide the development of more reliable models in the future.

Am I the only one who has no use for AI? I find everything I need via regular web search. Existential questions that tie to physics aren't easily found online though, and that's where I utilize AI for ideas. But that's only when I get into that mood, once every 3 months. I just don't see the usefulness of AI in my everyday life. I see ppl being addicted to it, and don't understand how.

Then again, I don't use TikTok either.

Ended up staying up a bit too late yesterday evaluating #Claude. I read somewhere that it's now supposed to be the best of the big LLMs and felt a little provoked. It still has enormous problems with its context, especially when it comes to things it has reasoned out on its own.

All of this may be a symptom of me starting to miss my job. #livetSomITKonsult #tech

"- OpenAI and Anthropic both lose billions of dollars a year after revenue, and their stories do not mirror any other startup in history, not Uber, not Amazon Web Services, nothing. I address the Uber point in this article.

- SoftBank is putting itself in dire straits simply to fund OpenAI once. This deal threatens its credit rating, with SoftBank having to take on what will be multiple loans to fund the remaining $30 billion of OpenAI's $40 billion round, which has yet to close and OpenAI is, in fact, still raising.

- This is before you consider the other $19 billion that SoftBank has agreed to contribute to the Stargate data center project, money that it does not currently have available.

- OpenAI has promised $19 billion to the Stargate data center project, money it does not have and cannot get without SoftBank's funds.

- Again, neither SoftBank nor OpenAI has the money for Stargate right now.

- OpenAI must convert to a for-profit by the end of 2025, or it loses $20 billion of the remaining $30 billion of funding. If it does not convert by October 2026, its current funding converts to debt. It is demanding remarkable, unreasonable concessions from Microsoft, which is refusing to budge and is willing to walk away from the negotiations necessary to convert.

- OpenAI does not have a path to profitability, and its future, like Anthropic's, is dependent on a continual flow of capital from venture capitalists and big tech, who must also continue to expand infrastructure.

Anthropic is in a similar, but slightly better position — it is set to lose $3 billion this year on $4 billion of revenue. It also has no path to profitability, recently jacked up prices on Cursor, its largest customer, and had to put restraints on Claude Code after allowing users to burn 100% to 10,000% of their revenue. These are the actions of a desperate company."

wheresyoured.at/the-haters-gui

Ed Zitron's Where's Your Ed At · The Hater's Guide To The AI Bubble
Hey! Before we go any further — if you want to support my work, please sign up for the premium version of Where’s Your Ed At, it’s a $7-a-month (or $70-a-year) paid product where every week you get a premium newsletter, all while supporting my free work too. Also,