
Internet of Bugs, by Carl, tackles the tangled situation around claims made about an AI called Devin, purportedly the first AI software engineer. Carl, with 35 years of experience in the software industry, criticizes the excessive excitement and misinformation that have sprung up around this technology. In his video he stresses that many of the people hyping Devin multiply lies and false promises about its skills and capabilities. Carl notes that while generative AI is exciting, such deception harms everyone, including real programmers. He also calls out practices that lead to critical aspects of a software engineer's job being overlooked, such as communicating with the client and understanding their actual needs.

In a separate part of the video, Carl discusses in detail the specific task Devin was supposed to complete on Upwork. He points out that the task was cherry-picked, which suggests Devin might fare worse on other jobs in the same domain. He also shows that the deliverables Devin produced bore no relation to the client's expectations, which in his eyes amounts to deception. He argues that AI in its current form struggles with the key skills of communication and understanding requirements, further undermining its ability to complete the task.

Carl walks through the process in detail using a concrete example from the project, showing how important it is to understand the client's real requirements. For instance, he points out that Devin failed to provide basic information about environment configuration and library usage, which was crucial to the final result. He also notes how the imprecise, convoluted commands Devin generated undermine coding efficiency and can complicate the project's further development. Training generative AI on such examples may only worsen the problems for programmers and clients alike.

Carl also brings in his own experience, having completed the task in far less time and in a more considered way. He argues that the work Devin delivered was not actually as complicated as the video made it appear. He notes that reproducing the result, understanding the code, and fixing the one real error took him about 36 minutes, a stark contrast to Devin's working time of more than 6 hours. Carl emphasizes that such distorted presentations can create false impressions of AI's ability to genuinely help.

In summary, Carl uses the video and the available data to present a sober picture of Devin's capabilities as an AI in the context of software engineering work. The video's statistics, currently 542,237 views and 21,033 likes, suggest the topic is timely and resonates with the wider tech community. Carl urges people to stay skeptical and do their own critical analysis before believing promises about AI, reminding us that the internet is full of misinformation and 'bugs'.

Timeline summary

  • 00:00 Introduction to the concept of AI in software engineering.
  • 00:01 A statement about the claims made for Devin.
  • 00:08 Outline of the video's three-part structure.
  • 00:12 Discussion of what should have been done versus what was done.
  • 00:22 Carl's background in software and his stance on AI hype.
  • 00:39 Introduction of Devin as an AI software engineer.
  • 00:55 Debunking the claim that Devin makes money on Upwork tasks.
  • 01:22 Personal take on generative AI tools.
  • 01:37 Acknowledgment of Devin's impressive aspects.
  • 02:04 Criticism of misinformation about AI capabilities.
  • 02:11 A call for companies to avoid deceptive advertising.
  • 02:30 Introduction to part two, detailing the task assigned to Devin.
  • 04:15 Analysis of the Upwork job assigned to Devin.
  • 04:39 Discussion of Devin's limitations in doing the work.
  • 05:46 Assessment of the quality of Devin's output.
  • 08:13 What the deliverables for the job assigned to Devin should have looked like.
  • 17:40 Comparison of Devin's results with the manual process.
  • 19:54 Comparison of Carl's and Devin's completion times.
  • 23:11 Carl acknowledges Devin's real capabilities.
  • 24:03 A call for transparency and honesty in marketing AI products.
  • 24:59 Closing emphasizing skepticism toward information online.

Transcription

This is the Internet of Bugs. My name is Carl, and that is a lie. So this video is in three parts. First, we're going to talk about that claim. We're going to talk about what should have been done, what Devin actually did, how it did it, and how well it did it. I have been a software professional for 35 years. I am not anti-AI, but I really am anti-hype, and that's why I'm doing this. Devin was introduced not quite a month ago now, and it was touted as the world's first AI software engineer. I don't believe that it's the first AI software engineer, and I already made a video about that. I'll put links in the description. But today is about the specific claim that's the first line of the video description, which says: watch Devin make money taking on messy Upwork tasks. That statement is a lie. You cannot watch that in the video. It does not happen in the video. It does not happen. What's worse, though, is the hype and the fear, uncertainty, and doubt from people repeating and embellishing on that claim because they're trying to get clicks, or they're trying to go viral, or they just want to be part of the zeitgeist. The hype around Devin in general is just crazy, and that statement seems to be what a lot of it is pinned on. For the record, personally, I think generative AI is cool. I use GitHub Copilot on a regular basis. I use ChatGPT, Llama 2, Stable Diffusion. All that kind of stuff is cool, but lying about what these tools can do does everyone a disservice. So Devin does some impressive things, and I wish the company had just been truthful and taken the win, but they didn't, and they had to pretend that it did a lot more than it actually did. Now, I don't want to take anything away from the engineers that actually built Devin. I think Devin is impressive in many ways, and I'm especially not trying to pick on the guy that's in the video. The lies are not in the video itself. They're in the description and in the tweets that the company made pointing to it, and then they're in a lot of places and people that have repeated that lie over and over again. It shouldn't be okay. Companies should not be allowed to lie without getting called out on it, and people shouldn't repeat things they heard on the internet without checking for themselves. I realize that's tilting at windmills, but I'm going to die on that hill. Since nobody else that I've seen seems to be explaining why this is a lie, I guess if it's going to get done, I'm going to have to do it, so here I go. Before you think this is harmless, understand that this kind of lie does real damage. You're watching this, so you're probably at least somewhat technical. Keep in mind that there are a lot of people out there who see headlines, don't read the articles, and are not technical, and what these lies do is cause non-technical people to believe that AI is far more capable than it is at the moment, and that causes all kinds of problems. People end up being a lot less skeptical of AI than they should be. They're a lot less skeptical of the output of AI than they really should be, and taking AI at face value these days is getting a lot of people in trouble. Just Google "AI lawyer fake cases" or "AI fake scientific papers", and those are just the prominent ones. This hurts real software professionals too, because there are going to be folks who trust the code that AIs generate, and that just means more bugs on the internet, and there are already way too many. It's already a mess.
There are already too many exploits. There are already too many hacks, and the more bad code that gets out there, the worse the ecosystem becomes for everyone.

Enough of that. On to section two. What was the job that Devin was supposed to have done? So this is early in the video. Note that in the bottom left-hand corner of your screen, I have stuck the time code of every frame that I'm going to be breaking down for you, so this is 2.936 seconds into the video, so you can go look yourself if you're curious about any particular thing or want to know the context around something that I'm talking about. This is the job that Devin supposedly did on Upwork. We'll talk about it in a minute. First off, look at the top left of your screen. Notice that they searched for this, so this is not some random job. This is not a case of Devin being able to do any job on Upwork. They cherry-picked this. That isn't necessarily deceptive. You would kind of expect them to, but keep in mind that what that means is that chances are Devin is actually worse at most jobs than it turned out to be on this one, which wasn't great. So zooming in on that particular request, there at the bottom, that's what the customer actually wanted: I want to make inferences with this repository. Your deliverable is detailed instructions. I'm not going to talk about the estimate-to-complete-the-job thing. Devin didn't do that. That's fine. I'm not worried about that. But look at this. This is what Devin was actually told. This is what was copied and pasted into Devin: I'm looking to make inferences with this model in the repository. Here's the repository. Please figure it out. Okay, back to the job. Your deliverable will be detailed instructions on how to do it in EC2 on AWS. "Please figure it out" is not the same as detailed instructions on how to do it in an EC2 instance in AWS. For the record, this, at the end of the video, is the report that Devin generated. There is nothing in it at all about what the customer was actually asking for. So what should the results of this job actually look like? To start with, this is what you really need to know in order to figure out how to do this. You're going to have to have some kind of instance in the cloud. You need to figure out what size, what type, how much memory, all that kind of stuff. You need to find out from the customer: would you rather have one that runs faster and is more expensive, or one that's cheaper and runs slower? Is this going to be something that's always up, so you can throw stuff at it whenever and have it give you an answer, or are you going to launch it, run it, and then turn it off to save money? How are you going to get the stuff you want to make inferences on, the images you want to analyze, onto the server? Do you want a web interface for that? You can SSH them up, you can put them in an S3 bucket. How are you going to get access to the output? These are all questions that you need answered, right? This is, going back to another video that I made, the part of the job of a software developer that the AIs are bad at. The hard part, the important part, the difficult part, the time-consuming part of being a software engineer is communication with the customer, with your boss, with the stakeholders.
Figuring out what actually needs to get done, going back and forth, saying, okay, this would be a lot easier, how about we do that? Those are the kinds of things that AI just isn't capable of doing, and those are some of the most important things that we do. This just starts right off as AI doing the wrong thing. Unfortunately, this is Upwork. So, for those of you who are ever going to be in this situation: requests for proposals like this are bad. If you can avoid doing them, avoid it. A competent request-for-proposal process is going to have a Q&A section. They tell you, this is what we want; you send them questions, other vendors send them questions, they answer all the questions and send the answers to everybody, and then the bidding happens. We can't do that on Upwork, because it's not set up that way. The next best thing, which isn't actually a good thing, but the next best thing, is you write down all your questions and you pick, for each one, the answer that would cause the least amount of work for you. Then at the top of your proposal you say: okay, here are all the assumptions I'm making; if any of these assumptions turn out not to be true, that's negotiable, but it means the cost is going to go up. You want to bid as low as you can, but you want to make sure the customer understands that you're bidding that value with these assumptions, and if they want any of them done differently, they're going to have to pay more. It's not a good bidding process, but if you're going to have to do that kind of bidding, that's how you do it. So a deliverable for this particular job should contain: what kind of cloud instance type to use; what kind of operating system and image to use; how to set up the install environment, so CUDA, Apex, PyTorch (don't worry if you don't know what any of those are, it's not really important for this purpose); and how to install that repo. It's a four-year-old repo, so you're either going to need to update it for modern Python and modern libraries, or you're going to have to explain how to install a four-year-old environment. One of those two things has to happen. You're also going to have to explain to the customer how the data gets onto the instance, how they get their output off the instance, all that kind of stuff.
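As a rough illustration of the first of those decisions, here is a minimal sketch of launching a GPU instance on EC2 with boto3. Everything in it is a placeholder: the AMI ID, key pair name, and instance type are hypothetical, and making the right choices here is exactly what the deliverable should spell out for the customer.

    import boto3

    # Hypothetical sketch: launch a single-GPU instance for inference.
    # Instance type and image are the decisions the customer needs to
    # sign off on (cost vs. speed, always-on vs. launch-and-stop).
    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder deep-learning AMI ID
        InstanceType="g4dn.xlarge",       # common single-GPU type; illustrative
        KeyName="customer-keypair",       # placeholder key pair for SSH access
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])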
I actually reproduced what Devin did myself. We'll talk more about that later. This is the actual instance size that I used. I used a company called Vultr instead of AWS, because AWS's interface is a mess and wouldn't make good video, and on top of that, by the time this video got edited and uploaded, a new version of something would probably have been released and I would have the numbers wrong. So Vultr is just more stable and easier for this. For this job, for the customer, I would have actually done it on AWS. We have no idea what kind of image Devin used. They didn't tell us anything about it. If you are a masochist, there is a link, and I'll put it in the description, for the whole uncut version of me spending 35 minutes and 55 seconds, or however long it took, actually reproducing what Devin ended up doing. So if you have no life, you're welcome to watch that. I think transparency is important. It's really boring to watch, but it's important, and I wish that the company that made Devin, and anybody else making these kinds of claims on the internet, would just post the raw footage of what actually happened, so that we can verify their claims if we need to.

All right, so on to the next section. Given that we know that Devin didn't do what the customer asked, that Devin's report did not have any of the stuff the customer wanted, and that Devin didn't actually get paid for any of this, what did Devin actually do? If it didn't make money, what did it make, and how good a job did it do? So here's a screenshot from the video. This is the repo in question. We'll come back to screens like this later. This is the first thing that Devin really changed. There's a thing called a requirements.txt file. It determines which versions of dependent libraries your code is going to run against, and something had to change, because some of the libraries this repo originally used four years ago aren't downloadable anymore; they're too old. Here it says that Devin is actually updating the code. I guess that's arguably true. I would say it's more a configuration file than code, but I'll allow it. It is really cool that Devin can do this. If what the tool did was just change all of the requirements so they lined up, that would save me time, so that would be a cool thing to do. So it's good that Devin can do this. I don't know that I'd call it code, but it's a very, very small part of what actually needs to get done. Instead of what the customer asked for, which is basically, I want to be able to make my own inferences, Devin was told that just using the sample data is fine. So that's what I did when reproducing what Devin did. Normally it would be more complicated than that, but that's what we're going to show that Devin actually did. Okay, so Devin fairly early on hits an error. I did not hit this error, and you'll see why in a sec. So zooming in, here's this command-line error. Here at the top, we have this error: image open, file not found, no such file or directory. This error is in a code file called visualizeddetections.py. And the reason that I didn't run into this problem is that there is no file called visualizeddetections.py in that repository. I don't know where that file came from, but we'll talk more about that in a sec. So back to that command line. If you zoom in on the other part of that window, you see this: Devin is echoing a bunch of stuff into a file called inspectresults.py, then running Python on it, and getting a syntax error. You can't put backslash-n in a Python file. It doesn't work that way. Echo doesn't work that way. None of this works that way. This is just nonsensical. This is the kind of thing that you might do as a human because you're not paying attention, and then you go, oh yeah, I need to change the way I did that. But what seems to be happening is that Devin is creating files that have errors in them, and then fixing the errors.
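To make the backslash problem concrete, here is a minimal sketch of that failure class. The exact contents of Devin's inspectresults.py aren't fully visible in the video, so this is illustrative:

    import subprocess
    import sys

    # If the \n escapes in an echoed string are not interpreted, the
    # generated file ends up with literal backslashes in its source.
    bad_source = r"print('hello')\nprint('world')"  # one line, literal \n
    with open("inspectresults_demo.py", "w") as f:
        f.write(bad_source)

    # Running it fails the same way the video shows: Python treats the
    # backslash as a line continuation and rejects the character after it.
    result = subprocess.run([sys.executable, "inspectresults_demo.py"],
                            capture_output=True, text=True)
    print(result.stderr)  # SyntaxError: unexpected character after line continuation character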
So here the video says that Devin is actually doing println debugging, and that's cool. That's something a lot of us do. There are always times when printf debugging or println debugging ends up being useful. So it's cool that Devin can do that in at least some circumstances. But here's another error I didn't see, and Devin is coming in trying to figure it out. The commentary here says Devin is adding statements to track down these data flows until Devin understands. Now, I'm okay with that. I don't know if the word "understands" there is technically true. I don't know that Devin actually understands anything; I would doubt it. But we anthropomorphize stuff like that all the time, and it's a handy way of using language, so I'm not going to give them a hard time for that. But that said, let's look at what Devin's actually doing here. Zooming in on this, we've got this weird loop that it's doing. It's going through this file and reading stuff into a buffer. This is the updateimageids.py file. And again, this file does not exist anywhere in the repository that the customer wanted us to use. In fact, I searched all of GitHub, and there are only two places where a file with this name exists at all. The reason there were three on the screen is that one of them is a fork of the other. And none of them look anything like the one Devin is using. So I don't know where this came from. We don't have any idea. But the problem is, Devin is here debugging a file that it created itself, a file that's not in the repo at all. This is pretty insidious. It gives the viewer who's not paying that much attention, who didn't take the time or effort to look at the repo, the impression that Devin is finding errors in the repository that the Upwork user asked us to look at, and fixing them. That's not the case. Devin is generating its own errors, and then debugging and fixing the errors that it made itself. That's not what Devin seems to be doing. It's not what Devin is implied to be doing. It's not what many people who have written articles and posted videos about Devin thought Devin was doing. But in fact, Devin isn't fixing code that it found on the internet. Devin isn't fixing code that a customer asked it to fix. Devin is fixing code that it generated, with errors in it. And that's not at all what most people who watch this video will think it's doing. What's worse is that there's no reason for this. This is the README file from that repo. I told you we'd come back to this page. There is a file called infer.py that is in that repo, and it does exactly what Devin does in this video. The README file tells you that it does it. It tells you how to use it. There on the right, there's even a little button you can click to copy the whole command line, paste it into your window, and hit return. And if you watch the long video where I reproduce the result, that's exactly what I did. I copied and pasted the thing, changed the path name, hit return, and it worked. I don't think the person who wrote this road-damage-detection repository could have made it any easier to understand how we were supposed to use it. But Devin didn't seem to be able to figure that out, and so Devin had to create this other thing that was a mess. This code right here, this reading-into-a-buffer thing, is bad, right? This is the way we had to read files a decade ago in C, in really low-level languages. Python has much better ways to handle this.
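For illustration, assuming the goal was simply to read and parse a JSON results file (the file Devin was wrestling with isn't in the repo, so the file name here is illustrative), compare the two shapes:

    import json

    # Roughly the shape of the manual approach: read fixed-size chunks
    # into a buffer and stitch them together before parsing. It can work,
    # but it is easy to get the bookkeeping subtly wrong.
    def load_results_chunked(path, chunk_size=1024):
        chunks = []
        with open(path) as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                chunks.append(chunk)
        return json.loads("".join(chunks))

    # The idiomatic version: let the standard library do the work.
    def load_results(path):
        with open(path) as f:
            return json.load(f)

    # Usage, assuming a results.json exists in the working directory:
    # results = load_results("results.json")

Both return the same parsed object; the second leaves no room for off-by-a-few-characters mistakes.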
As Devin is finding out, this kind of thing is hard to debug. It's complicated, it's difficult, it's easy to get off by a little bit, which I think is what Devin is trying to debug here. I'm not exactly sure what was going wrong, but it seems like it got off by some characters, so the JSON didn't parse right. But this is not how you would do it these days. This is not how you would do it in Python. This is not something I would accept in a code review from a junior developer. It's causing more problems than it solves. It's just bad. In addition, there is a real error in the repo, and Devin didn't find it or fix it. Devin just created a bunch of other stuff. So like I said, I replicated Devin's work myself. There's the link again; it'll be in the description. I used Torch 2.2.2, which is a much more current version than the one Devin used. If you go back to that requirements.txt file, the hard part of what I did was getting a software package called Apex installed with the right version of CUDA, which is NVIDIA's driver stuff. It was a pain. I ended up having to build it from source, which took about 16 of the 36 minutes I was working on the thing. There might have been an easier way to do it, but for a 16-minute build time, that seemed the most expedient way. I did remove the hard-coding from the requirements.txt file. Devin just changed some of the numbers. I think my way is better, but either way, technically it's okay. As you'll see in the next slides, there is actually one error that needed to get fixed. I'll show you what that is. It took me about 36 minutes, 35 minutes and 55 seconds, I think, to actually do what I did. That will become important later, when we talk about how long Devin took. Okay, so this is a screenshot from that long video I posted. It's unlisted, but I gave you a link to it if you want to watch the whole thing. Zooming in, this is where the actual error was. It's in a file called dataset.py, on line 33. And the error is that the module called torch has no attribute called underscore six. I did a Google search, found a comment on a GitHub issue, and changed that line of code the way the issue said would fix it. It did fix it. I put in a link to show where I got the idea to do that, because I'm not an expert in exactly how Apex works, and it was good that I found somebody on the internet. The entire time on task it took me to fix that error was about a minute and seven seconds. It was a quick Google search.
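The video doesn't show the exact line, but this class of error comes from torch._six, a private compatibility shim that was removed in recent versions of PyTorch, and the usual fix has this shape. A sketch of the pattern, not the literal change to dataset.py:

    import torch

    # Code written against torch 1.x often references the private
    # torch._six module, which no longer exists in modern releases:
    print(hasattr(torch, "_six"))  # False on a current torch, e.g. 2.2.2

    # Before (raises "module 'torch' has no attribute '_six'"):
    #     if isinstance(value, torch._six.string_classes): ...
    # After (the aliases were plain builtins, so use those directly):
    value = "road_damage_image.jpg"  # illustrative value
    print(isinstance(value, str))    # True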
So here is the change that I made, in context. This is a diff between what I started with and what I ended up with, a diff of the requirements.txt file. Torch 1.4.0 is what it started with. I used the most recent version of torch, which is 2.2.2, or at least a relatively recent one. There might have been a newer one released in the last hour, for all I know.
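Schematically, the difference between the two approaches to the pins looks something like this (illustrative, not the literal diff from the video):

    # requirements.txt, before: a four-year-old hard-coded pin
    torch==1.4.0

    # Devin's approach: bump the pinned number to something current
    torch==2.2.2

    # Carl's approach: drop the hard-coded pin and take a current version
    torch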
And then here, on the right, is one of the last screens from Devin's video, and on the left is my video. The final outputs were both more or less the same. My box is yellow, their box is red. I don't know which one might be better or worse, but it only took me 36 minutes. Devin took slightly longer than that. Here is the early part of the Devin video. There's a timestamp at 3:25 p.m. on March the 9th. Later in the video, you see a timestamp from 9:41 p.m. on March the 9th. So we're looking at six hours and 20 minutes. I have no idea what would have been happening for six hours and 20 minutes. Hopefully Devin was waiting on people for a lot of that, because it doesn't make any sense that it would take that long. That's just crazy, because like I said, it took me a little over half an hour. There's another timestamp, from 6 p.m. the next day, and I'm assuming they just left it overnight and came back to it. Hopefully it wasn't doing stuff over that whole time. So I'm assuming it just took six hours, but it could have taken a day and two hours. I don't know why it would have taken that long. It's not efficient. It's not what I would call competent.

Some weird command-line use popped up in one of the screens when you go through it frame by frame. So here's a weird one. Let me zoom in on that: head -n5 results.json | tail -n5. What that says is: take the first five lines of this JSON file, and then take the last five lines of those first five lines. There's no reason to do that. No human would do that. And it's the kind of thing that AI does that just doesn't make any sense, so that when you come around later trying to debug what's going on, there's all this extraneous stuff all over the place, and it makes it really, really hard to figure out what the point was. In fact, the right way to do this is head -5 results.json. The -n is redundant; you can just say -5. That extra stuff is in there for no good reason, and it's the kind of thing that just makes everything more complicated when AI generates stuff right now. Hopefully that will get better. But at the moment, AI generates a lot of stupid stuff. It does things in Python the way you would do them in C, when no one would do it that way in Python these days. Even when it gets things to work, right now the state of the art of generative AI is that it does a bad, complicated, convoluted job that just makes more work for everybody else if you ever try to maintain it, fix a bug in it, or update it to a new version anytime in the future.

Let's look at the list of things that Devin thought it needed to do. If you look at the left there, there's this series of checkboxes. I'm going to run through some pages. Exactly what they are isn't really important; just look how many there are. This list of checkboxes gives the impression that Devin did something complicated or difficult, and when you're watching the video and you see all this scroll by, you're like, wow, Devin must have done a bunch of stuff. All I had to do to replicate Devin's results was get an environment set up on a cloud instance with the right hardware and run literally two commands with the right paths. All of this stuff makes it look like Devin did a bunch of work. It makes it look like Devin accomplished a lot. And really, all you had to do was run two commands once you set the environment up. None of those code fixes are relevant at all, because it's all code that Devin generated itself. And at the end, the person narrating that video says, good job, Devin. Now, what Devin actually got done was kind of cool for an AI. If you had asked me a couple of months ago what an AI would have done given that problem, I would have guessed an output worse than what Devin actually produced. So honestly, as far as I'm concerned, it is kind of impressive. But in the context of what an Upwork job should have been, and especially in the context of a bunch of people saying that Devin is taking jobs off of Upwork and doing them, and especially in the context of the company saying that this video will let us watch Devin get paid for doing work, which is, again, just a lie, I don't know that I would agree with saying, good job.

So look: if you make AI products, that's great. AI is good. I use it a lot. I want it to get better. Please make AI products. Just please tell people the truth about them. If you're a journalist or a blogger or an influencer, please don't blindly repeat and amplify things that people say on the internet, things that you read on the internet, without doing some due diligence, without looking to see if they're actually true. If you don't understand whether they're true, if you can't figure it out on your own, ask someone, or just don't amplify it. Because there are a lot of people who are never going to look at the original source; they're just going to see the headline and think it's true. That's unfortunate, but that's just the way we are. And if you're just someone who uses the internet, please, for the love of all that's holy, be skeptical of everything you see on the internet or on the news, especially anything that might possibly be AI-related. There's so much hype out there, and so much stuff that people are bouncing around and telling each other is true that's just not true. So please don't forget to be skeptical. It's important. Okay, that's what I have for this video. Until next time, always keep in mind that the internet is full of bugs, and anyone who says differently is trying to sell you something. Have a good one, everybody.