• 4 Posts
  • 32 Comments
Joined 2 years ago
Cake day: July 13th, 2023


  • The other interesting thing is that if you try it a bunch of times, sometimes it uses the calculator and sometimes it doesn’t. However, it always claims that it used the calculator, unless it didn’t use it and you tell it that the answer is wrong.

    I think something very fishy is going on, along the lines of them having done empirical research and found that fucking up the numbers and lying about it makes people more likely to believe that Gemini is sentient. It is a lot weirder (and a lot more dangerous, if someone uses it to calculate things) than “it doesn’t have a calculator” or “poor LLMs can’t do math”. It gets a lot of digits correct somehow.

    Frankly this is ridiculous. They have a calculator integrated into Google search. That they don’t have one in their AIs feels deliberate, particularly given that there are plenty of LLMs that actually run a calculator almost all of the time.

    edit: lying that it used a calculator is rather strange, too. Humans don’t say “code interpreter” or “direct calculator” when asked to multiply two numbers. What the fuck is a “direct calculator”? Why is it talking about a “code interpreter” and a “direct calculator” conditional on there being digits (I never saw it say that it used a “code interpreter” when the problem wasn’t mathematical), rather than conditional on there being a [run tool] token output earlier?

    The whole thing is utterly ridiculous. Clearly, for it to say that it used a “code interpreter” or a “direct calculator” (whatever that is), it had to be fine-tuned to say that, in response to a bunch of numbers rather than in response to a [run tool] token it uses to run an actual tool (a rough sketch of what a grounded flow would look like is at the end of this comment).

    edit: basically, congratulations Google, you have halfway convinced me that an “artificial lying sack of shit” is possible after all. I don’t believe that tortured phrases like “code interpreter” and a “direct calculator” actually came from the internet.

    These assurances - coming from an “AI” - seem like they would make the person asking the question less likely to double-check the answer (and perhaps less likely to click the downvote button). In my book this qualifies them as a lie, even though I consider an LLM to be no more sentient than a sack of shit.
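
    To illustrate what I mean by a [run tool] token, here’s a rough Python sketch (all names made up, not anyone’s actual API) of a grounded tool-call flow: the claim “I used a calculator” follows from an actual tool-call event recorded in the transcript, not from the question happening to contain digits.

```python
# Minimal sketch, all names hypothetical: "I used a calculator" should be
# derived from a recorded tool-call event, not from the presence of digits.

from dataclasses import dataclass, field

@dataclass
class Transcript:
    events: list = field(default_factory=list)  # ordered tool-call / tool-result events

def run_calculator(expression: str) -> str:
    # The actual tool: deterministic arithmetic, not next-token prediction.
    return str(eval(expression, {"__builtins__": {}}))  # toy example only

def answer_with_tool(transcript: Transcript, expression: str) -> str:
    # 1. The model emits a structured tool call (the [run tool] token plus arguments).
    transcript.events.append({"type": "tool_call", "tool": "calculator", "args": expression})
    # 2. The runtime actually executes the tool and feeds the result back.
    result = run_calculator(expression)
    transcript.events.append({"type": "tool_result", "value": result})
    return result

def used_calculator(transcript: Transcript) -> bool:
    # Any "I used a calculator" claim should be checked against the transcript.
    return any(e["type"] == "tool_call" and e["tool"] == "calculator"
               for e in transcript.events)

t = Transcript()
print(answer_with_tool(t, "352731 * 91625"))            # 32318977875
print("calculator actually used:", used_calculator(t))  # True
```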


  • Try asking Google Gemini my question a bunch of times: sometimes it gets it right, sometimes it doesn’t. It seems to be about 50/50, but I quickly ran out of free access.

    And Google is planning to replace their search (which includes a working calculator) with this stuff. So it is absolutely the case that there’s a plan to replace one of the world’s most popular calculators, if not the most popular, with it.








  • It would have to be more than just river crossings, yeah.

    Although I’m also dubious that their LLM is good enough for universal river crossing puzzle solving using a tool. It’s not that simple: the constraints have to be translated into a format that the tool understands, and the answer translated back (a rough sketch of what that means for the classic puzzle is below). I was told that o3 solves my river crossing variant, but the chat log they gave had incorrect code being run and then a correct answer magically appearing, so I think it wasn’t anything quite as general as that.
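
    For reference, here’s a rough Python sketch of what “the format that the tool understands” means for the classic wolf/goat/cabbage puzzle. A novel variant means rewriting the state space and the safety constraint by hand, which is exactly the translation step I doubt the LLM can do reliably.

```python
# Rough sketch: encoding the classic river crossing as a breadth-first search.
# This is the kind of work "use a tool" actually requires; a novel variant
# means rewriting ITEMS and is_safe(), not just calling a solver.

from collections import deque

ITEMS = frozenset({"wolf", "goat", "cabbage"})

def is_safe(left, boat_on_left):
    # Constraint: wolf/goat or goat/cabbage may not be left without the farmer.
    for side, farmer_here in ((left, boat_on_left), (ITEMS - left, not boat_on_left)):
        if not farmer_here:
            if {"wolf", "goat"} <= side or {"goat", "cabbage"} <= side:
                return False
    return True

def solve():
    start = (ITEMS, True)        # everything and the boat on the left bank
    goal = (frozenset(), False)  # everything on the right bank
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, boat_left), path = queue.popleft()
        if (left, boat_left) == goal:
            return path
        here = left if boat_left else ITEMS - left
        for cargo in list(here) + [None]:      # take one passenger or none
            new_left = set(left)
            if cargo is not None:
                (new_left.remove if boat_left else new_left.add)(cargo)
            state = (frozenset(new_left), not boat_left)
            if state not in seen and is_safe(*state):
                seen.add(state)
                queue.append((state, path + [cargo or "nothing"]))
    return None

print(solve())  # e.g. ['goat', 'nothing', 'wolf', 'goat', 'cabbage', 'nothing', 'goat']
```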




  • Further support for the memorization claim: I posted examples of novel river crossing puzzles where LLMs completely fail (on this forum).

    Note that Apple’s actors / agents river crossing is a well-known “jealous husbands” variant, which you can ask a chatbot to explain to you. It gladly explains, even as it can’t follow its own explanation (since of course it isn’t its own explanation but a plagiarized one, even if it changes the words).

    edit: https://awful.systems/post/4027490 and earlier https://awful.systems/post/1769506

    I think what I need to do is write up a bunch of puzzles, assign them randomly to two sets, and test & post one set while holding back the second set (not even testing it on any online chatbots). Then in a year or two, see how much performance improves on the public set vs the held-back one (rough sketch of the split below).
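
    Something like this (file names made up), just a reproducible random split into a public set and a held-back set:

```python
# Sketch of the held-out protocol: split the puzzles at random with a fixed
# seed, publish and test one half now, keep the other half completely offline
# until the comparison in a year or two.

import json
import random

def split_puzzles(puzzles, seed=0):
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(puzzles)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

puzzles = [f"river crossing variant #{i}" for i in range(20)]   # placeholders
public_set, held_back = split_puzzles(puzzles)

with open("public_set.json", "w") as f:
    json.dump(public_set, f, indent=2)
with open("held_back_do_not_test_online.json", "w") as f:
    json.dump(held_back, f, indent=2)
```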






  • I think it could work as a minor gimmick, like the terminal hacking minigame in Fallout. You have to convince the LLM to tell you the password, or you get to talk to a demented robot whose brain was fried by radiation exposure, or the like. Relatively inconsequential stuff, like being able to talk your way through or just shoot your way through.

    Unfortunately this shit is too slow and too huge to embed a local copy of into a game, and you need a lot of hardware compatibility. And running it in the cloud would cost too much.




  • When confronted with a problem like “your search engine imagined a case and cited it”, the next step is to wonder what else it might be making up, not to just quickly slap a bit of tape over the obvious immediate problem and declare everything to be great.

    Exactly. Even if you ensure the cited cases or articles are real, it will misrepresent what said articles say.

    Fundamentally, it is just blah blah blah-ing until the point comes when a citation would be likely to appear, and then it blah blah blahs the citation based on the preceding text that it just made up. It plain should not be producing real citations. That it can produce real citations is deeply at odds with it being able to pretend at reasoning, for example.

    Ensuring the citation is real, RAG-ing the articles in there, having AI rewrite drafts: none of these hacks do anything to address any of the underlying problems (toy illustration below).
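
    Toy illustration of the point (the case name and database are invented): a “citation is real” check only validates the reference string, and says nothing about whether the source supports the sentence it’s attached to.

```python
# Toy illustration, with an invented case name: an existence filter passes
# real citations even when they have nothing to do with the claim.

KNOWN_CASES = {
    # hypothetical database of real citations and their actual holdings
    "Smith v. Jones (1999)": "Holding: the contract was void for lack of consideration.",
}

def citation_exists(citation: str) -> bool:
    return citation in KNOWN_CASES   # the "hack": existence check only

draft = ("Punitive damages are available for ordinary breach of contract, "
         "see Smith v. Jones (1999).")

# The filter passes because the case is "real", even though the cited holding
# does not support the claim it is attached to.
print(citation_exists("Smith v. Jones (1999)"))   # True, and the draft is still wrong
```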