UK bank fined £49M over IT system meltdown
bbc.com
I have a bank account with TSB and got compensation as a result of this mix-up.
Some rather personal experiences of the fiasco:
– Rather pointlessly, the website changed from being mostly static to entirely written in a very JS-heavy, "dynamic" way. I still can't use it in my normal browser (FF) with its extensions because it relies heavily upon CORS requests and referrer information that my somewhat privacy-paranoid extensions block.
– This was introduced at the time of the switchover, and until that point the IT system used looked identical between Lloyds, TSB and Halifax / BOS systems (I have accounts with some of those)
– The online browser-based system was telemetry and JS heavy, replacing a far leaner page
– I was unable to log in during the time of the fiasco, mostly due to 403 errors or timeouts. Often the page would just hang as an async request wasn't answered.
– Once I did manage to log in, I was amazed to see another person's account details (!!!), replete with (their) name and statement.
– I was unable to use online banking to pay bills or check my balance – I could see someone else's account in detail but was too honest to do anything with that knowledge. I can't remember if my card stopped working but I was effectively forced to make other arrangements for quite an extended period of time.
> – The online browser-based system was telemetry and JS heavy, replacing a far leaner page
I remember one of those banks using the "leaner" page also had heavy telemetry turned on at some point. I type very fast, so I noticed that when I was entering my user id, it was lagging heavily. Then I turned on developer tools only to see that they were logging all keystrokes to analytics. Including username and password. At first I thought I got a virus or something, but these appeared to be legit scripts from the bank. So I decided to not use that bank account for a while. I wonder why would they turn something like that on.
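For what it's worth, keeping credentials out of analytics is not hard. A minimal sketch, assuming a hypothetical event shape ({ fieldName, fieldType, autocomplete, value }) and a hypothetical sanitizeAnalyticsEvent helper. None of this is taken from any bank's actual scripts, and the name pattern is deliberately over-cautious:

```javascript
// Hypothetical helper: decides what an analytics layer may record about an
// input event. The event shape and every name here are illustrative assumptions.
const SENSITIVE_TYPES = new Set(["password"]);
// Standard HTML autocomplete tokens that mark credential or card fields.
const SENSITIVE_AUTOCOMPLETE = new Set([
  "username",
  "current-password",
  "new-password",
  "one-time-code",
  "cc-number",
  "cc-csc",
]);
// Fallback heuristic on the field name; errs on the side of redacting.
const SENSITIVE_NAME = /(user|login|pass|pin|secret|otp|card)/i;

function isSensitiveField({ fieldName = "", fieldType = "", autocomplete = "" }) {
  return (
    SENSITIVE_TYPES.has(fieldType.toLowerCase()) ||
    SENSITIVE_AUTOCOMPLETE.has(autocomplete.toLowerCase()) ||
    SENSITIVE_NAME.test(fieldName)
  );
}

// Returns a copy of the event that is safe to send. The typed value is always
// dropped; for sensitive fields even its length is withheld.
function sanitizeAnalyticsEvent(event) {
  const { value = "", ...meta } = event;
  if (isSensitiveField(event)) {
    return { ...meta, redacted: true };
  }
  return { ...meta, redacted: false, valueLength: value.length };
}
```

With something like this in front of the analytics call, a per-keystroke logger would at worst learn that someone typed into the login form, not what they typed.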
Report that to the regulators.
If you're in the US I know for a fact the regulators listen to and review complaints.
You can also report serious problems to FinCEN and the OCC
Honest question - why do you still have an account with them?
In general, banks compete on other attributes - a small difference in a mortgage interest rate is a lot of money and makes up for a HUGE difference in the quality of internet services; and whenever loan market becomes tight (and thus it's not attractive to refinance to another bank) people are pretty much locked in.
Exactly this. They're not great, but there's something about being a customer somewhere for a very long time that genuinely does (or did) seem to offer you a mortgage rate that was advantageous.
I'm surprised to hear UK banks value loyalty. It's certainly not my experience.
Why don't people get divorced or break up?
Abusive relationships can trap you, personal or business.
As a privacy-aware user, when making a contract with a bank (or buying a flight ticket or whatever) you should get assertions that their web site meets certain quality standards so you can use your browser to access the account or actually check in.
Paper did not have those incompatibility problems...
However, from the BBC article I conclude that even customers with a default browser could not necessarily use their account
Edit: Forgotten not added.
Businesses can change, too. My credit union[1] recently made a web site change causing me to no longer be able to log in. The new shiny red login button they probably paid $millions for an incompetent developer to provide does nothing when you click it (desktop Safari). I vetted the old site which worked perfectly, but now it doesn’t. I’m working on moving my business elsewhere.
Hopefully they didn't pay $millions...the source is completely unminified, it checks a cookie, calls Google Analytics, then changes the login link from display:none to display:block.
The direct login link is then visible, you can bookmark https://online.techcu.com/User/AccessSignin/Start for later...but yeah, nonfunctional for ~10% of desktop browsers is not a good look for a technology credit union.
    function LoginButtonClick() {
        var selAccount = $("#accounts").val();
        LoginCookieSet(selAccount);
        if (typeof ga !== 'undefined') ga('techcu.send', 'event', 'button', 'Click', 'Member Login');
        var formAction = "https://online.techcu.com/User/AccessSignin/Password";
        // testing url
        switch (selAccount) {
            case "1":
                formAction = "https://online.techcu.com/User/AccessSignin/Password";
                if (window.location.host === "dev.techcu.com" || window.location.host === "qa.techcu.com") {
                    formAction = "https://onlinetest.techcu.com/User/AccessSignin/Password";
                }
                break;
            case "3":
                formAction = "https://businessbanking.techcu.com/";
                break;
            case "2":
                formAction = "https://businessbanking.techcu.com/smallbusiness";
                break;
            default:
                formAction = "http://online.techcu.com/User/AccessSignin/Username";
        }
        if ($('#UsernameField1').val().substr(0, 2) == "**") {
            $('#onlineBankingLogin #UsernameField').val($('#UserNameHidden').val());
        } else {
            $('#onlineBankingLogin #UsernameField').val($('#UsernameField1').val());
        }
        $('#onlineBankingLogin #PasswordField').val($('#PasswordField1').val());
        $('#onlineBankingLogin').attr('action', formAction);
        $('#onlineBankingLogin').submit();
    }
Nice debugging and thanks! Didn't expect someone to actually dive in and figure it out. I'll bookmark the direct link to hold me over until I find a new bank but I've already totally lost confidence in the business. They can't even be assed to test their main web site. I wonder if they see the failure from analyzing the before-and-after browser share in their logs. I wonder if anyone's even monitoring the logs.
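On the "can't even be assed to test" point: the URL-picking part of the handler quoted above could be pulled out into a pure function and checked without a browser at all. A rough sketch, where resolveFormAction is a made-up name and the URLs are just the ones from the quoted code. One deliberate difference, and an assumption on my part, is that the default branch uses https instead of the plain http in the original:

```javascript
// Sketch only: resolveFormAction is a hypothetical name. URLs are copied from
// the quoted handler, except that the default branch is switched to https
// (assumption: the Username endpoint is reachable over https like the others).
const PROD_BASE = "https://online.techcu.com";
const TEST_BASE = "https://onlinetest.techcu.com";
const TEST_HOSTS = new Set(["dev.techcu.com", "qa.techcu.com"]);

function resolveFormAction(selectedAccount, host) {
  switch (selectedAccount) {
    case "1": {
      // Personal banking: dev and qa hosts post to the test backend.
      const base = TEST_HOSTS.has(host) ? TEST_BASE : PROD_BASE;
      return `${base}/User/AccessSignin/Password`;
    }
    case "2":
      return "https://businessbanking.techcu.com/smallbusiness";
    case "3":
      return "https://businessbanking.techcu.com/";
    default:
      return `${PROD_BASE}/User/AccessSignin/Username`;
  }
}
```

That wouldn't have caught a button that never becomes visible in Safari, but it does show how little of this logic actually needs the DOM.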
Found similar kinds of things happening with the occasional website too, mainly due to my use of Firefox. eg buttons that used to work, suddenly "do nothing"
Pretty sure it's caused by "Chrome-only" developers, as going through the hassle of installing a Chrome/WebKit-based browser gets things working. But really, fuck that. ;)
Ubereats made this change recently. Naturally, there's no way on their website to contact them about it. :/
Yes, they can change. But if technical compatibility were part of your contract they might end up paying damages if they break it. Well, I am dreaming...
The login button works for me on Safari (15.2).
> you should get assertions
I think you’d get blank looks if you asked that question, followed by a generic "we use modern blah and new improvement next blah".
How did you get compensation? Thank goodness for Monzo and Revolut being so quick to set up, but I had money trapped in TSB for some time. I thought it would only last a day or two at most. The services and the ability to get support were non-existent during that time; I totally stopped trying to call. I closed my TSB account shortly after.
I went in-store to get money out in person, mentioned the problems and very nicely got handed details of how to complain. I did so, with screenshots, and got something like £150 -- completely unrequested -- in my account about six months later. I think they handled the whole thing very well actually, although I'd probably feel different if I had gone into mortgage arrears because of it, as I understand some customers did...
> I still can't use it in my normal browser (FF) with its extensions because it relies heavily upon CORS requests and referrer information that my somewhat privacy-paranoid extensions block.
So you have extensions that literally break normal browser behaviour and you are blaming them somehow? CORS is part of browser security and should be respected.
Not saying that TSB aren't clearly a shitshow but maybe just disable the extension for that site.
"I could see someone else's account in detail but was too honest to do anything with that knowledge"
Are you patting yourself on the back for not committing fraud?
I meant more that I didn't answer the question "if I make a bank transaction, like I want to, will it come out of my account or theirs"?
In today's world, that's no small feat. Not the patting on the back, but to be honest about it and not do anything untoward.
The 250+ page analysis of the incident was an excellent insight into how large IT projects fail: https://www.tsb.co.uk/news-releases/slaughter-and-may/slaugh...
money quote: > This situation has all the hallmarks of business management strong-arming the IT organization into an unrealistic timeline. When business leaders push for overly-aggressive timelines, or regulators ask for multiple competing risk frameworks and excessive after-the-fact incident reporting, this all puts a strain on the delivery organization’s ability to untangle the complexity before ‘go live’.
The report is by Slaughter & May, one of the more delightful company names in the City of London.
My understanding was that they’re a law firm, perhaps they’ve also branched into IT consultancy?
Law firms are often hired to conduct independent reviews when things go wrong or when allegations of wrongdoing are made, e.g. RBS in 2013,[1] RICS in 2018,[2] Baker McKenzie in 2018[3], and UNICEF in 2020.[4][5]
1. https://www.natwestgroup.com/news-and-insights/feature-conte...
2. https://www.rics.org/uk/about-rics/corporate-governance/inde...
3. https://www.legalbusiness.co.uk/blogs/metoo-latest-bakers-ap...
4. https://www.unicef.org.uk/press-releases/unicef-uk-confirms-...
5. https://www.civilsociety.co.uk/news/unicef-appoints-differen...
Generally legal firms like this will have domain specialists.
No this kind of thing is called an audit and law firms are typically involved.
> they’re a law firm, perhaps they’ve also branched into IT consultancy?
And that's exactly how such IT disasters begin.
Sadly, far too often managers and decision-makers think timelines and deadlines are a topic for haggling, not discussion. They treat it the same as haggling at a bazaar.
In many cases it can be though.
If I'm asked for an estimate to do X and I say I can deliver in three weeks, and my boss says customer needs it for golive in three days, I'll try to find some way of making that work. Perhaps they can live without a full solution for the first few weeks, instead requiring only a subset of the requirements in that period. Or, if I insist it just cannot be done, I'll tell the boss who'll try to push back golive if it's important enough.
There needs to be respect for each other and the project though.
That's a discussion. Not haggling.
Demanding the same for less is haggling. As if we can magic time reductions out of our asses without cutting features.
Fair enough, I interpreted it a bit differently.
Of course on any project I might be able to deliver the same in less time, but then at the expense of something else. That might be acceptable if the boss thinks he can manage the other clients whose work gets delayed.
Another write up:
https://www.computerweekly.com/news/252474170/TSB-programme-...
2008: "The UK-based IT department of the fifth largest bank continues to dwindle as more jobs go overseas... This round of cuts, starting in June and lasting 12 months, involves up to 250 permanent IT roles and 200 contractors from the bank's technical delivery division, responsible for software development and design." [1]
2018: "Timeline of trouble: how the TSB IT meltdown unfolded". [2]
It's probably more complicated than that, but perhaps not much more complicated.
[1] https://www.itpro.co.uk/197982/lloyds-tsb-cuts-more-uk-it-jo...
[2] https://www.theguardian.com/business/2018/jun/06/timeline-of...
That probably has very little to do with it. The immediate cause of all the problems was that Lloyds TSB was forcibly split up in order to try and increase competition and the Lloyds half kept the IT department, and when the TSB half tried to move over to the existing IT platform of their new parent company everything broke.
True as far as it goes, but "everything broke" as a predictable result of poor decisions they made, such as moving everyone over in one go.
Note that "they" in this case is the new parent company. IIRC, they were a fairly new bank, heavily reliant on technology. They had the tech but not the customers. TSB were the opposite. The parent thought their tech, which up to that point was only dealing with 100,000s of records, could deal with billions with little change. They were spectacularly wrong and it showed in all testing. But their management pushed ahead to go live anyway.
Their actions are not far from criminal negligence IMHO.
TSB’s parent company is Sabadell which is a massive Spanish corporation.
I was a Solbank (one of their brands) customer and can safely say tech isn’t their strong point. Awful UX and a pain to deal with.
> such as moving everyone over in one go.
It's the core banking system of a big bank. Handling consistent state between the old and new systems while progressively migrating customers would probably have been extremely complex. They also get to have a maintenance window (few people will complain if they get warned their bank and all transactions won't work for 2 hours in the middle of the night on Monday). A "big bang" migration makes more sense, if everything is properly prepared and tested, which it wasn't.
I've been a part of multiple "big bang" migrations in banking (generally scheduled to coincide with a local bank holiday next to a weekend, so you can afford multiple days of semi-downtime) and all of them had multiple explicit testing gates for potential rollback during the migration. After pretty much all the stuff is on the new system, the board convenes, looks at the difficulties (there inevitably are some unexpected difficulties) and decides whether they "accept" the switch to the new system or postpone it.
Part of your preparation and testing is the rollback of a partial migration - if you're irreversibly committed to the "big bang" before you know its outcome, then your preparation and testing has failed.
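To make the "gates plus rollback" idea concrete, here's a toy sketch. Everything in it is hypothetical (runCutover, the step names, the in-memory state); a real cutover involves people, runbooks and reconciliation reports rather than a for loop, but the shape is the same: every step carries its own undo, and a go/no-go check runs after each one.

```javascript
// Hypothetical sketch of a gated cutover. Each step has apply/gate/rollback;
// a failed gate (or an exception) undoes everything started so far, newest first.
function runCutover(steps, state) {
  const started = [];
  const log = [];
  for (const step of steps) {
    let passed = false;
    try {
      started.push(step); // recorded first so a half-applied step is undone too
      step.apply(state);
      passed = step.gate(state) === true;
    } catch (err) {
      log.push(`${step.name}: error: ${err.message}`);
    }
    if (!passed) {
      log.push(`${step.name}: no-go, rolling back`);
      for (const s of started.reverse()) {
        s.rollback(state);
        log.push(`${s.name}: rolled back`);
      }
      return { committed: false, log };
    }
    log.push(`${step.name}: go`);
  }
  return { committed: true, log };
}

// Toy run: the second gate fails, so both steps are undone.
const cutoverState = { customersOnNew: 0, paymentsOnNew: false };
const cutoverSteps = [
  {
    name: "migrate-customers",
    apply: (s) => { s.customersOnNew = 100; },
    gate: (s) => s.customersOnNew === 100,
    rollback: (s) => { s.customersOnNew = 0; },
  },
  {
    name: "switch-payments",
    apply: (s) => { s.paymentsOnNew = true; },
    gate: () => false, // simulate reconciliation failing
    rollback: (s) => { s.paymentsOnNew = false; },
  },
];
const cutoverResult = runCutover(cutoverSteps, cutoverState);
console.log(cutoverResult.committed, cutoverState);
```

The point of the structure is that rollback is exercised by the same code path as the migration itself, so it can be rehearsed rather than improvised on the night.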
> A "big bang" migration makes more sense, if everything is properly prepared and tested, which it wasn't.
That sounds like ‘A “big bang” migration makes more sense, if it works.’
Your rollback plan should never be an afterthought. Your rollback plan should be designed like you expect to use it. If your rollback plan is “Burn the ships” and fix-forward, you shouldn’t be working at a bank.
Definitely been the plan at a variety of places I've worked. Including a bank.
It's funny how there's always a cheap offshored bodyshop involved in these stories, yet it's never their fault. Cue the $9/hour Indian coders working for Boeing.
Here's the thing I've learned over the years: Never touch offshored code. Always go for a complete re-write. Don't add features to it, don't refactor it, don't extend it. Just re-write. In my experience it's the best approach.
I know guys who made it their whole business to go and completely re-write projects from scratch after offshoring efforts failed.
There was also a heavy enforcement of IR35 in the banking sector, so that substantially reduced the access to talent pool.
There was also a tightening of posted worker regulations, so that banks couldn't ship workers from overseas as a source of cheap talented workforce.
I will always remember this incident as the time when the UK general public were exposed, en masse, to Spring error messages.
The confusion caused by ordering a member of the general public not to request a bean from a bean factory in a destroy method implementation still makes me laugh, even now.
“Just want to see my balance and these guys @tsb think I'm robbing a bean factory with a bomb, jesus”
Brilliant. Reminisce with a screenshot here:
https://twitter.com/thejackthomson_/status/98856435451268710...
I remember when we were telling devs to stop returning Java 503s with stack traces to the user.
Devs fixed it by returning 200s with stack traces.
And as the page was ESI-stitched together on Varnish, when they fucked up there wasn't just one stack trace, but a bunch of different ones in various parts of the page.
I really feel this is an infra problem rather than a dev problem though; the reverse-proxy should strip 5xx response bodies before the egress no?
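Roughly what I have in mind, as a sketch. The { status, headers, body } response shape and the scrubUpstreamError name are made up for illustration; real proxies expose this differently:

```javascript
// Hypothetical proxy-side filter. `upstream` is a plain object of the form
// { status, headers, body }. Returns what to send to the client and what to
// keep in the proxy's own log.
function scrubUpstreamError(upstream, requestId) {
  if (upstream.status < 500 || upstream.status > 599) {
    // Anything that is not a 5xx passes through untouched.
    return { client: upstream, logEntry: null };
  }
  return {
    client: {
      status: upstream.status,
      headers: {
        "content-type": "text/plain; charset=utf-8",
        "cache-control": "no-store",
      },
      // Generic text plus a reference the user can quote to support.
      body: `Sorry, something went wrong. Reference: ${requestId}`,
    },
    // The original body stays server-side, keyed by the same reference.
    logEntry: { requestId, status: upstream.status, upstreamBody: upstream.body },
  };
}
```

Obviously this does nothing about the "200 with a stack trace" trick mentioned above, since the proxy only has the status code to go on.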
Eh, depends. For some apps a 503 is a legit response that should be returned to the customer/app; in other cases it's the app being badly designed. We did that a few times when it made sense, but in most cases it does not. There is no error code for "down for maintenance", so 503 was also sometimes used for that purpose (although we recommended that devs just fail healthchecks so the load balancer displays its own error page in those cases).
The other problem (let's just say devs were not... that great with architecture) was that they were getting exceptions in the logs without the URL and other metadata attached, so they kinda wanted to get that exception directly on the webpage.
I'll paint you a picture of how shoddy some stuff was: they were using a templating language to generate JSON (and had many bugs in it too) for years, instead of just natively encoding JSON in Java, partly because of shoddy architecture, partly because of a years-long war between the frontend and backend departments. They migrated to Git in... 2018, I think? No CI/CD of any sort till recently.
The end result (of them insisting that they would send the exception to the frontend to be seen) was that exceptions were sent as a signed and encrypted blob, wrapped in a bunch of JS that gathers all the errors (incl. errors that JS on the site might have thrown), adds all the metadata it sees from the "browser perspective" and sends it to a monitoring endpoint, where it is shoved into an ES cluster.
Sort of retarded version of distributed tracing that is now in vogue... done somewhere in 2013. But it did catch a bunch of bugs that were "only" showing to users in browser
The fact that those messages were visible to external users is a major problem and sign of incompetence.
I really hope my bank runs cobol rather than any of this spring bean junk.
If I can't raise a specific exception, or it's in a block of code where I don't expect an error, I generally make the exception memorable or dangerous-sounding just so the user is more inclined to report it (hopefully not via a tweet, though).
Sounds like classic A type personalities with zero technical chops deciding how long a technical project should take to further their own agenda:
> The Migration Programme experienced delays from the outset and fell behind the IMP timings. While progress had been made, on 20 September 2017 the firm decided that the Migration Programme would have to be re-planned. However, nine days after it had resolved to re-plan, and before it had concluded its re-planning exercise, TSB publicly announced it would now migrate in Q1 2018.
It's not just the CEO who should've been fired. The COO, CTO and CIO also should've left the building with a cardboard box in their hands.
This is a shameless indulgence in incompetence and recklessness. They didn't even bother to test large swaths of the transitional data or have a fallback plan if things went wrong.
Most likely their customers will now simply leave and the bank will be shut down.
I sneer at the emphasis on “1.4 billion records!” in the article as if it’s a lot.
At a recent place of employment I created and was responsible for a database that had about that many records, and in actuality it was a single 2 TB Postgres DB and completely unremarkable.
I never claimed to have worked with big data.
It's not really the quantity of data that is important in migrations like this. It's:
– what is and isn't in the data (often a lot of junk, in my experience, if the source system is a legacy system that has evolved)
– what meaning that data has within a completely different system
– what the demands on the completeness of that data are in the new system
– how to deal with exceptions
– and whether that data can ever be frozen, or whether it is still online (as in the case of banking transactions)
This is unlikely to be simply a technical problem of ETLing tables, changing date ranges from inclusive to exclusive and mapping some address fields.
Of course the size of the data after a certain point does make a big difference to risk planning and business continuity planning. It's not possible to rollback and try again within the migration window should a catastrophic issue occur, and it's not possible to simply run some bulk updates to fix issues during the go-live validation.
It is noted though in this project that the data migration itself was not found to contribute to the failure.
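Even the supposedly simple part hides some of the "junk in the data" problem. A small sketch of the inclusive-to-exclusive date conversion, assuming (my assumption, not anything from the report) that the dates are stored as ISO YYYY-MM-DD strings; toExclusiveEnd is a made-up name:

```javascript
// Converts an inclusive end date ("valid up to and including 2018-12-31")
// into the exclusive form ("valid before 2019-01-01"). Arithmetic is done in
// UTC so daylight-saving changes cannot shift the day.
function toExclusiveEnd(inclusiveEnd) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(inclusiveEnd);
  if (!match) {
    throw new Error(`not an ISO date: ${inclusiveEnd}`);
  }
  const [, y, m, d] = match.map(Number);
  // Legacy data can contain impossible dates such as 2018-02-30. Date.UTC
  // would silently roll those over into March, so check the round trip first.
  const given = new Date(Date.UTC(y, m - 1, d));
  if (given.toISOString().slice(0, 10) !== inclusiveEnd) {
    throw new Error(`impossible date: ${inclusiveEnd}`);
  }
  // Day overflow (e.g. the 32nd) rolls into the next month or year.
  const next = new Date(Date.UTC(y, m - 1, d + 1));
  return next.toISOString().slice(0, 10);
}
```

The conversion itself is one line; deciding what to do with the rows that throw is where the real work goes.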
What were your latency requirements on pulling a record out and how complex were the joins to pull said records?
If you have a simple db structure with a few tables and very clear data/index rules then billions and billions of records is pretty easy. Your indexes cut out 99% of the work and everything runs smooth and efficient.
But then you can have eldritch horrors where your stored procedures look like seedy detective novels where you chase join after join and have scary high memory requirements on execution.
When you say "emphasis on", you mean the single mention with no distinction between that stat and any of the other facts of the case? I don't think they really put any weight on whether that's a lot or not.
Personally I would say it is a lot. No other number mentioned in the article even approaches a billion. Billions are big. It might not be true "big data" big, but it is still a lot of customer records for a migration project (depending on what exactly they were trying to do within the migration) and it does illustrate why they had so many issues, because there was a lot to deal with.
Just wondering if the migration disaster at this scale can be avoided using modern cluster and orchestration technology like Kubernetes?
No technology can compensate for poor planning and technical incompetence. From all I read, that was the root cause of the problem. So the same people and processes using Kubernetes: No.
(Of course this is just speculation. I have no insider knowledge.)
I think downvotes are unnecessary and this is a finely crafted joke.
Without even having got around to reading the whole report yet, I can promise you that a f*ckup on this scale cannot be avoided solely through technology decisions. The problem was (is always) with the people and the structures they were working in.
To the contrary, switching to a new technology is a favorite reductive excuse of poor management. They choose one early technical decision and try to hang all the failure on that.
As a sibling comment states a screw up of this magnitude is never simply a technology issue — it requires bad management at many levels.
Kubernetes is there to help you scale. It does not fix incompetence. It increases the complexity of the stack and, if anything, would make things even worse for the incompetent.
k8s does absolutely jack shit when it comes to data migration so not really.
Still need to write all the procedures, test it, then do it on live system again.
It might make prototyping easier (...or harder) but that's about it
Hopefully Virgin Money will get one too. They broke their Android app earlier this year and since they make you verify web logins using the app I was unable to access any of my business accounts for ~3 weeks.
If something really urgent had come up I could have done what I needed via telephone banking or in a branch, but it was a huge pain in the arse because of a single point of failure.
Just let me use a Yubikey as my second factor damnit.
+1 on the Yubikey. I'm pretty good at moving my savings around and getting the best interest rate possible - the side effect is a ton of accounts, which means I'm drowning in 'secure memorable passcode key PINs' and my SMS inbox is full of SMS 2FA codes, and I'm wondering what it would take to get a bank to offer Webauthn/FIDO.
How about a website where we pledged to open an account and deposit £X into savings, or switch current account, if they offered Webauthn/FIDO?
I'd love FIDO for online banking auth. But AIUI, there's some EU regulation that requires 2FA, but that 2FA must also verify some other data (like the recipient of a transfer, amount being transferred and suchlike). I don't remember the details, but unfortunately that rules out FIDO for 2FA to make transactions. For initial authentication it would work, but it would have to be yet another system on top of the 2FA they have to use for transaction validation.
That makes sense, thanks for the info.
The actual report by the FCA: https://www.fca.org.uk/publication/final-notices/tsb-bank-pl...
Quite thorough report, some points that stand out from the summary:
>SABIS was TSB’s principal outsourced provider
>SABIS relied extensively on 85 third parties (TSB’s fourth parties) to deliver the systems required for the migration and the operation of the platform, which required it to act as a service aggregator.
It amazes me the sheer complexity of a retail bank software system and I suspect most of it is due to legacy systems, legal requirements and lack of regular spring cleaning.
Discussed variously at the time, eg.
https://news.ycombinator.com/item?id=16910947
Others
https://hn.algolia.com/?dateEnd=1600300800&dateRange=custom&...
I remember this. For at least a week people couldn't access their money. It was chaos. The bank lost lots of money and customers due to this botched transfer.
People were seeing random balances in their accounts, or seeing other people's accounts, according to some news reports at the time: https://www.bbc.co.uk/news/uk-43860449
Not your keys, not your coins.
Nobody actually lost any money. The "coins" were still mine (I was affected) and heavy banking regulation in the UK probably reduced the harm to effectively zero for the vast majority of retail customers.
I received compensation for the harm caused and got all of my money back.
Not my main bank so this did not affect me badly but their online (web based) banking portal is still glitchy and not very good.
Even if it's not your main account, it could be bad if your account details were leaked.
I remember when this all happened. It would have been interesting if it had been the result of some really interesting technical bug that nobody could have foreseen, followed by a fascinating effort to save the migration.
In fact it was all quite boring and simply the result of the sheer incompetence of the bank’s leadership in running bank IT.