They snarl traffic and strand passengers, but information technology experts say computer outages are a reality of flying, no matter how hard airlines try to prevent them.
In air travel, there are two main culprits that ground an airline: an act of God or a “glitch.” While fliers are never thrilled to spend a night in a terminal due to a blizzard, at least the snowed-in runway is incontrovertible evidence that nothing can be done.
Perhaps more maddening is the invisible glitch that can cripple an airline’s reservations and flight management systems. That’s what happened to US Airways and United Airlines earlier this month, when the two carriers together experienced three outages in a 9-day span. Two separate power failures in Phoenix and Charlotte, N.C., knocked out US Airways’ computer systems on June 10 and June 19, resulting in delayed flights but no cancellations. On June 18, United experienced a “network connectivity issue” at 7:15 p.m. Five hours later, the airline’s systems were back up, but not before the outage canceled 41 flights, delayed 100, and rippled into the next day as passengers missed connecting flights.
Blizzards can cause far more havoc — a snowstorm last December canceled 10,000 flights over three days — but minor and major non-weather-related outages are unpredictable and often remain a mystery to passengers.
To industry insiders, though, there is usually a simple explanation: a power or mechanical failure, a hardware or software malfunction, a breakdown in a telecommunication provider’s services, or planned system changes, like the installation of new servers, that aren’t successful. Airlines try to plan for and prevent all of these possibilities with everything from batteries and generators to multiple copies of backups, but no IT system is perfect and an airline’s fleet will be grounded at some point thanks to an outage.
“While [an airline] may be able to bullet-proof their company from a specific problem, there are lots of problems that can happen in lots of different ways,” said Scott Nason, a Dallas-based industry consultant and former chief information officer for American Airlines. “It’s too expensive and too complicated to make sure all of these things don’t happen.”
That may be a grim prognosis for passengers who expect a glitch-free flying experience, but Nason and other experts consider outages a part of running an airline, provided major ones are infrequent. While minor outages, like a short-lived power failure at a regional airport, can happen a few times a year, Nason said major ones, in which the airline is grounded across the country, should happen only every few years.
While US Airways was explicit about the source of its outages (backup generators didn’t kick in when power failed), United has been vague about the problem. Airline spokesperson Charles Hobart declined to elaborate on the nature of United’s “network connectivity issue,” though the carrier insists the outage was not caused by a hacker.
“I’ll tell you what it wasn’t,” said R. Craig Murphy, a Denver-based airline IT consultant. “It wasn’t a glitch. I don’t know what that means. It was some real issue that a whole bunch of technicians are reacting to now, trying to make sure it never happens again.”
Murphy, former chief technology officer of the travel IT company Sabre, believes that many outages are due to planned updates and changes to technology infrastructure that go awry, even if they’ve been tested multiple times on smaller practice systems.
Airlines, Murphy said, are making dozens of changes varying in scale each week, which can include tweaks to software as well as the installation of new hardware. While airlines typically try to make these changes between midnight and 2 a.m. to avoid interrupting peak travel in the U.S., doing so isn’t always possible. Still, if a planned update is the cause of an outage, an airline is unlikely to admit this in public, Murphy said. “They’ll say it was a glitch.”
Dwayne Ingram, executive vice president of airline distribution and IT for Amadeus, said that there are more risks for system failures and outages with the increasing number of carrier mergers, alliances and joint ventures. Such partnerships require one airline’s system to talk to another’s, which introduces new changes to the IT infrastructure.
In the wake of its merger with Continental, United recently canceled a contract with Amadeus to use the company’s reservation system. Ingram said he had no insight into United’s outage, and Hobart, United's spokesperson, said the outage was not related to technical difficulties in merging with Continental.
Ingram also said that many U.S. airlines continue to rely on a technology known as transaction processing facility (TPF), which was developed in the 1960s by IBM and operates on a mainframe. That can leave an airline more vulnerable to outages, since there is less redundancy and more chances for single points of failure. Newer technology, known in the industry as “open” systems, distributes the airline’s workflow across different servers, thereby creating a more robust barrier against outages by helping to ensure that a backup will kick in.
Murphy said that both approaches have their advantages. Since TPF technology has been around for decades, it has seen many challenges and consequently developed contingency plans. As it’s become more sophisticated, however, the technology has become more complicated and there are fewer experts who know its quirks intimately. And while open systems are more nimble and redundant, they are still new, which means there are fewer contingency plans.
Murphy said that while open systems are unquestionably the “future” of airline IT, he’s not worried about the airlines’ present ability to prevent outages using older technology.
None of the experts msnbc.com interviewed thought hacking was a major concern. “Most of the time when hacking is going on,” Ingram said, “it’s to try to get on the money. Most of the airlines systems themselves wouldn’t help anyone.” He added that a hacker would have to be trained in an airline’s back-end system in order to interfere with a departure control system, to which few people in the airline itself have access.
Murphy said that despite the recent bout of outages, airlines do a good job of preventing system failures, particularly as business increasingly moves at the Internet’s pace.
“Anytime you mix machinery and humans and change, you're going to have problems,” he said.
Passengers, he said, are also unlikely to notice most outages since they are resolved quickly. “It’s just, unfortunately, everyone expects it to run clickety-click and when it doesn’t, it’s front page news.”
Rebecca Ruiz is a senior editor at msnbc.com.


Those 'IT' people need to be replaced (by cheaper people). The reality is that 'glitches' are the result of two and only two things: not resourcing for true real-time redundancy, and systems not properly designed and tested for failure recovery!
There is no other reason for glitches.
The reality is more likely, we did the best we could with the budget we had.
The reality is that there will always be bugs in software, always. This is because companies cannot spend unlimited funds to find every single bug and some bugs are damn near impossible to find because a very specific set of criteria may be required for the bug to occur.
This problem, though, was more of a technical issue than a glitch. Power failures usually can't be prevented, but one wonders why these systems had no UPS (universal power supply, for those who did not know). And if they did, why were they dead? Did no one regularly run maintenance on them?
Bug and glitch prevention is fine, but if there is not maintenance on the back end, spending all of that money to chase down bugs that could cause things like this is wasted money.
A UPS is an UNINTERRUPTIBLE power supply, Geowil. 'Uninterruptible' is kind of the entire point. Sigh.
Any system with a generator typically has a UPS too, but a UPS will only carry you for a short time, usually less than 30 minutes. Its purpose is to let you ride through short power outages, and to keep you powered while the generators start and warm up.
If the generators don't start for some reason, the UPS won't save you.
Within the world of IT, there is an increasingly useless layer of "management" that escalates cost without producing code. The actual business of programming has been outsourced to people who speak English at a grammar school level. Then we took the expensive mainframe and replaced it with the "cost effective" client/server model. That might have been OK except the operating system couldn't stand up to a 12 year old script kiddie.
Is it any wonder that system reliability is less than it was 20 years ago? The glitches CAN be eliminated, but not if we are going to rely on technologically feeble management handing the work to people who can barely read an e-mail message.
Obviously not an IT person, but then again neither am I (Oceanographer), but I have a backup mirror hard drive, & a backup to that. All with UPS systems.
I have a horrible feeling Ms. Ruiz was paid to write this piece by the airline industry. Live with glitches, HA! Try telling my bosses/customers that
When was the last time Google went down? Sounds like someone needs to fire the consultants who don't have a clue about fault tolerant architecture.
Google servers go down all the time. It's just that you don't know it because the redundant systems kick in and take over.
Google suffers outages all the time and all over the place. We use their Google Apps functions. Because of things running on multiple servers in multiple data centers, a service outage involving our 30 users will take out one or two people's access for a period of time, but thankfully not the whole company due to the distributed nature of Google's network. It doesn't happen often, but it does happen.
You, the average person using the average public Google services never notice this as your interaction is on such a basic level that any minor bump and hiccup you experience gets attributed by you to a network issue with your local ISP and that's about it.
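The failover behavior described in the comments above can be sketched in a few lines. This is only an illustration under stated assumptions, not how Google or any airline actually does it: the function and class names (`call_with_failover`, `AllReplicasDown`, `broken_server`, `healthy_server`) are hypothetical, and a "server" here is just a Python function that either answers or raises `ConnectionError`.

```python
from typing import Callable, Sequence


class AllReplicasDown(Exception):
    """Raised only when every redundant copy has failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful answer.

    The caller never sees an individual failure as long as at least
    one replica is still working.
    """
    failures = 0
    for replica in replicas:
        try:
            return replica()
        except ConnectionError:
            failures += 1  # this copy is down, move on to the next one
    raise AllReplicasDown(f"{failures} of {len(replicas)} replicas failed")


# Hypothetical stand-ins for real servers.
def broken_server() -> str:
    raise ConnectionError("server unreachable")


def healthy_server() -> str:
    return "search results"


if __name__ == "__main__":
    # The first server is down, but the user still gets an answer.
    print(call_with_failover([broken_server, healthy_server]))
```

The outage only becomes visible when the list runs out, which is roughly the situation US Airways described: the primary power failed and the backup generators did not start.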
Bingo... "redundant", "fault tolerant". Whatever you want to label it... JUST DO IT.
Wrong people in those jobs, wrong management, wrong design, wrong testing, wrong development.
Thank the almighty. It wasn't the CIA (oops, I mean Hackers) this time!!!
CIA AGENDA: Legally have an "Internet Kill Switch" to control the Sheeple. That would shut down all of the alternative media.
How To Advance the AGENDA: Stage "False Flag" Hacking Attacks!!! Look at all of the Hacking Happening Lately. Then do some research about it!!!
I love it that the Monday morning quarterbacks are posting here. They don't know scat from shinola, but they are posting away. One has even proposed that the IT people should be replaced with cheaper people.
I expect that the root cause is that the team that deployed the patch / release did not do adequate testing. Either because the test platforms are inadequate (due to lack of funds to properly build it) or the testing plan was inadequate (again - due to hiring cheaper people -- incompetent people tend to cost less), or management demanded the release go live prior to bug fixes (due to promises to top management).
There is a high probability (based on my 30 years experience in this field) that this goes back to management budget cutting and incompetent management. But this would never be admitted by management. CYA all the way.
Funny that you jump on Monday morning quarterbacks, yet without any information you "know" the source of the issue. Do you have any proof? Are you one of those people who simply blame management whenever your code fails? Sure seems like it.
Funny, when I went and got my degree in Computer Programming if I had a glitch in a program I failed the assignment. Do not tell me that glitches are inevitable, that is bull. Prior Planning, prevents, Piss Poor, Performance. Remember the Software development life cycle and stop taking short cuts and you will get rid of glitches. If you have a program running your business and you didn't load test it the glitch that is coming your way is your fault. Any programmer worth anything will tell you when you load test a program and system you test it against a minimum of twice the Maximum anticipated load. You always have an exact copy of the live system to do patch testing, if you don't you never know where the real issue is whether it is software or hardware related. If you accepted the system without testing it under a load then you screwed up and who ever was involved in accepting that system should be terminated.
To quote: "no IT system is perfect". I have 35 years in the IT industry -- programming mainframes. This is 100% true. No matter how well designed, programmed and tested a system is, it will have a problem. This is a fact of IT life. And I have to admit, my systems have had a few minor problems. It happens. We fix them and move on.
You need to retire then, because in parallel, redundant, cloud systems, you don't have to beg the machine to come back up. Nobody's saying you won't have problems; what people are saying is that having problems need not bring the company and all its customers to their knees (in prayer to the IT god). Too many people with 30+ years experience just don't have the energy or vision to accept that the limitations they grew up with don't need to exist today.
<< what people are saying is that having problems need not bring the company and all its customers to their knees>> and yet - time after time - that is exactly what happens. So tell me again - what's your point/solution?
Machines don't make mistakes, they do exactly as they are told by their human programmers, right or wrong
@rrobeson : "In addition to taking down the sites of dozens of high-profile companies for hours (and, in some cases, days), Amazon's huge EC2 cloud services crash permanently destroyed some data"
Care to explain that ?
How about needing to reboot every node in an open system once a week to recover lost memory, or the inability to run most open systems at more than 30% of capacity because they can't handle a surge?
You should change careers, I doubt you could write lines of code in a row that actually work, and even then you stole them from another programmer
When I worked in IT prior to retirement, a "glitch" resulting in such outage would result in new IT people. These "IT consultants" are nit-wits! A properly designed system does not have "glitches"
Actually there is no such thing as a fool-proof system. A system might look air tight, but that is because no one has found the gremlin that is living under the stairs yet.
Not one piece of software nor hardware created has ever been completely bug free, maybe back in the infancy of technology but as things grow more complex the more bugs a system will inherently have because of their components and potential points of failure.
I started in the computer industry in the fifties. Tubes. Drums. punch cards. My employer at that time, UNIVAC, installed an early R.T. system for a major airline back east and I cannot recall it ever failing to any degree. This in the sixties. My last employer used an all Cisco global WAN and with a duplicate backup site 12 miles away, complete with duplicate everything, we never had an outage. We did get annoyed by a virus one AM, but it resulted only in annoyance to me as it was squelched. Only one other remote site got hit. A properly designed system does not have glitches.
And people bitching about the never ending increases for airfare are unavoidable too!
Are we truly so STUPID and COMPLACENT we accept this drivel? Computer systems crash because there are problems that no one anticipated or would acknowledge.
How can an AMERICAN carrier admit they are so stooooooopid they are unable to anticipate problems or have sufficient staff to fix such problems?
Are American corporations so afraid of Wall Street/the WSJ/Fox News that it has become impossible to treat their customers with respect and dignity? Why is the shareholder MORE important than the actual consumer?
Why do we Americans revile those of you who sell us products? Get a clue: treating us as an afterthought is not a business plan no matter what Donald Trump, Jamie Dimon and Rupert Murdoch say. There is no excuse for an airline failing to explain to its customers why there is a problem. Nor is it acceptable to expect the customer to pay for lodging and meals simply because you will NOT hire competent staff.
Outsourcing your enterprise does not work no matter what Rupert Murdoch says.
Does anyone with a critical system ever practice a manual fallback process? Sorta like performing a fire drill periodically? Pencil & paper – telephones – that 'old school stuff'. We all get too reliant on the almighty computer system - Folks, they were built and programmed by the likes of you and me - and some of us are fallible.
The scale of our large business systems today makes operation without computers impossible. Could you balance 10 million bank accounts in a day by hand with a dozen people and pencil and paper? Could you do it with 1,000 people? No. And you won't have that many people to do it with. You couldn't efficiently route 30,000 flights a day by hand either. Nor process 50 million passengers. Computers ARE almighty. They make possible the civilization we take for granted.
Manual fallback system, thanks, good joke, let me wipe my monitor off and clean the keyboard.
And go out and break out the Morse Code keys and alert the phone tree so we can have all the managers ready to do voice relay on the gigabytes of info that needs to be passed.
to 'yet another news reader' ... <make operation without computers impossible> ... <They make possible the civilization we take for granted> ... and this doesn't worry you, or your broker?
Sean - didn't say it had to replicate it - just survive.
Just surviving is failure, failure is not an option in today's data transmission world. Given the emergency services backup I've participated in, the back up has to work as well as the real thing or the various agencies needing it reject it as a crude hack prone to failure when they need it the most. They've usually been proven right in their estimation.
An article a few days ago pointed out that United chose not to spend money to update their aging computer system. Having some IT guy say no system is perfect doesn't mean squat.
@Woodsyr. So you designed toaster components in the 60's, cool!! Geowil is correct, bugs will be around whether they are apparent bugs or possibly an overloaded system. If everything was designed "perfect" and there were no bugs the first time out, then why would any company even have a technology division?? You would simply pay the big 5 to do your next project and rest easy knowing that everything will be perfect forever. lol. In today's age of tech, everything we build from a programming or infrastructure standpoint uses 3rd party products no matter which side of the fence you choose to stand on. Almost anything is so abstracted with frameworks, adapters, etc., and you expect all of them to always work 100% of the time?
"everyone expects it to run clickety-click"
I must admit that's what I expect. Or at worst 'clackety clack'.
First of all, I have no idea what exactly happened and we never will hear the real details. I work in IT in a 911 center that has to have multiple layers of redundancy. Each additional layer of redundancy has a cost. The cost is weighed against the reduction in downtime caused by a failure. IT "stuff" will fail. Buggy software is part of it, but worse yet is software that talks to other software developed by a different company. On top of that is the OS; for cost savings most software is developed on top of an existing OS (like Windows) that has its own set of issues. It would be ridiculous to think airline IT would involve 1 software program. Right offhand I count 12 different critical pieces of software that make the dispatch center work, with others that are not as critical. One piece of software is altered (an update or just a simple "let's change this setting") and the result is a chain reaction that breaks lots of stuff. Hardware fails as well, as do the backups to the hardware. How many backups are you willing to pay for to lower downtime? The knee-jerk reaction is to blame IT when most IT staff don't get what they need because of "budgetary reasons". You will never hear from IT, or at least the IT staff (maybe an upper manager), the truth of what happened. Cost/benefit: would you be willing to pay more for a ticket at an airline that has additional redundancy that would reduce the potential downtime? Or would you go on the cheaper airline that normally provides a good service with a slightly increased risk of downtime due to IT issues (this increased risk would not be known to you, just the price)? Hardware and software will fail, software will need to be updated/altered for additional needs and hardware will need to be replaced due to age. All of those are absolutes in the world of IT. Ask yourself this...do you have a backup for your primary computer (laptop/tablet/desktop computer)? If yes, is that backup kept up to date with exactly what you have on your primary device?
Do you have a backup for the backup device? Do you test the backup device on a regular basis with a controlled fail of the primary device? Do you have a set replacement schedule on the primary and backups that lowers the risk of failure to an acceptable level (replacement every couple of years)? Do you know what it takes to create a system with multiple backup devices that are identical to the primary at all times, use controlled fails of the primary to test the backup systems, replace the hardware devices on a schedule and put them in production while running on a backup system, keep software updates and patches working together....then the kicker--budget. Reduction in IT staff, outsourcing systems to save money, stretching hardware months or years beyond what it should be (because it seems to be working OK). Before you just blame IT...it's never that simple. Is an airline (or any company) willing to spend a lot of extra money on redundant backup systems that may never be used? With that cost pushing up ticket prices? I'll stop now. :)
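The cost/benefit trade-off in the comment above can be put into rough numbers. What follows is a minimal sketch, not anything taken from the article or the commenters: it assumes every redundant layer has the same availability and that layers fail independently of each other, which real systems rarely manage (the US Airways outages are a case where the backup failed together with the primary). The function names and the 99% figure are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, layers: int) -> float:
    """Availability of `layers` identical, independent redundant copies.

    The whole system is down only when every copy is down at once,
    so the combined downtime fraction is (1 - single) ** layers.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if layers < 1:
        raise ValueError("need at least one layer")
    return 1.0 - (1.0 - single) ** layers


def expected_downtime_hours(single: float, layers: int) -> float:
    """Expected hours per year with no working copy at all."""
    return (1.0 - combined_availability(single, layers)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Assumed figure: each layer is up 99% of the time.
    for layers in (1, 2, 3):
        hours = expected_downtime_hours(0.99, layers)
        print(f"{layers} layer(s): about {hours:.3f} hours down per year")
```

Under these assumptions a single 99% layer is down roughly 88 hours a year and a second layer brings that to under an hour, while a third layer buys comparatively little. That diminishing return is the budget question the commenter is pointing at: every extra layer costs about the same, but each one removes less downtime than the one before.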
Redundant systems are not always better, keeping multiple copies of a database in sync while maintaining access to the data without adding unacceptable latency is not an easy task, and often impossible - meaning redundancy is likely to have incorrect data, which can be just as damaging as an outage.
The cost of HW and SW is indeed a factor, but so are things like recovery time - and in many cases it makes more sense to make your system bomb proof rather than just cloning it, and focusing on recovery.
Attn: All News media
Do not deviate from the most critical now. Focus only on how to create:
JOBS
JOBS
JOBS
JOBS
JOBS
JOBS
JOBS
JOBS
JOBS
..................................................
or....................... 25 million+ unemployed will revolt..................
I hate to tell you people who think that because you have a "UPS", that is what it really is. If you look on your "UPS" it will give a millisecond measurement to switch to the battery. That is not what a true UPS is. What you have as a "UPS" that you get from a store is really a switching power supply. You couldn't afford a true UPS; they cost a few thousand dollars, and industries like hospitals and government facilities may have a true UPS. If you have a true UPS you are running from the battery and not the power line coming from your wall.
@rrobeson, WOW, that was probably the most ignorant thing I have ever heard.
"Those 'IT' people need to be replaced (by cheaper people)" Does that really make any sense to anyone...other than rrobeson????
The truth behind computer glitches and the airlines is that they want "proprietary" and cheap applications to handle their workload, when in reality, this causes issues. If you employ the cheaper IT staff that "rrobeson" is talking about, the issues will become much worse than they are now.
The airlines need quality, professionally built applications, i.e. not customizable to their corporate heads' needs. B/C in reality the corporate world has absolutely no idea what it needs in terms of productivity and reliability in terms of software applications.
This is why it takes sooooo freaking long at the ticket line to get checked in. I have been in IT for over 14 years, and I have never seen the airlines take any leaps towards bettering themselves in technology. It is quite sad, seeing as how much an airline ticket costs nowadays.
Hint to all airlines, try putting some of those profits back into the business!!!!
Why not just wait until you land and turn him in to TSA? They will grope him, feel his 'pelvic bone,' take a naked picture of him and run their fingers up his backside. That ought to at least embarrass him--it certainly does everyone else.
This is irresponsible reporting by MSNBC - UAL called off a contract with Amadeus 2 months ago and you use an executive from Amadeus to comment on UAL ? Can you say self-serving ?
Amadeus is attempting to market their solution against TPF (oh, and in the mean time guess what system Amadeus' entire business relies on), they are years (probably a decade) late in delivering parts of their system, they've lost customers like UAL because of their failures, and their executives are pathetic enough to run around using lies such as in this case to make themselves look like they are not losers.
But let's get to facts. Let's see some proof that open systems are better than TPF for the airlines' reservation business. Both technologies have their benefits and weaknesses, but let's see some proof, Dwayne - put up or shut up.
TPF systems have availability records which would make open systems run home crying to their mommies (and they don't get hacked - ever).