bjarvis: (Default)
The past few months, I've been working on a major project for work: building out a new cage in our data center near Sterling, VA. Our current two cages are working reasonably well, but the equipment has aged enough that much of it is no longer viable under the Payment Card Industry Data Security Standard (PCI DSS) 3.2. Even on the hardware we could keep using, we desperately need operating system upgrades and extra capacity.

The new cage has 9 racks, compared to the 27 racks in the old cage. Nearly everything is virtualized and clustered, all of it running the latest patches of whatever OS it uses, and we have RAM, storage & CPU cycles to spare.

And this past weekend, we went live.

It was a bit rocky in parts. One of the first tasks I had in the migration plan was to fix some issues in our Sun Microsystems/Oracle database servers and our Veritas Cluster Server (VCS) setup. It took more hours than I was hoping or expecting, but I did get through it all. I think I spent more time delving into the depths of VCS that one night than I had in the previous ten years.
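
(For the curious: most of that VCS spelunking comes down to driving its command-line tools. Below is only a sketch of the general pattern --check cluster state, then freeze a service group before touching its resources-- written as a small Python wrapper. It assumes the standard hastatus/haconf/hagrp utilities are on the PATH, and the group name is invented for illustration, not our actual configuration.)

    #!/usr/bin/env python
    # Sketch only: check VCS health, then freeze a service group so the
    # cluster won't try to fail it over while its resources are repaired.
    # Assumes the standard VCS CLI (hastatus, haconf, hagrp) is installed;
    # the group name below is hypothetical.
    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    print(run(["hastatus", "-sum"]))              # systems, groups & resources at a glance

    group = "oradb_sg"                            # hypothetical service group name
    run(["haconf", "-makerw"])                    # open the cluster config for writes
    run(["hagrp", "-freeze", group, "-persistent"])
    run(["haconf", "-dump", "-makero"])           # save the config, back to read-only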

By midday Saturday, everything had been migrated and we started running test traffic through it. We found some issues with routing, permissions, ownerships and such, but not many. Most of the effort was focused on getting the F5 traffic managers fully tuned to our requirements.

Today was our first regular business day since the cutover, and although we've had some trouble getting our new IP ranges whitelisted with a couple of our larger customers, plus some performance problems with our hotel search databases, the day has been a success. We're getting great comments about the vastly improved speed & performance of our systems as well.

The road ahead is still a long one. While the travel portion of our systems has migrated, our Purchase and Car Service divisions have not yet. The Sun servers moved to the new cage, but their data still resides on a storage array in the old cage. At this moment, I'm still waiting for the license codes to build out a new monitoring system.

Once all of that has finished, there's a tonne of dismantling & disposal to do with the old cages & equipment. Some will be redeployed in our non-production environments, but 80% will be trashed completely. The most intensive part of the project is over, but I have work for the rest of the summer.
bjarvis: (Default)
This morning, a fruit basket & candy package arrived, sent by my COO, CTO & head of HR, addressed to the "Jarvis Family." The note attached thanked my clan for supporting me as I worked extra hours on the latest set of office projects. Nice touch --although I'd rather have had a small bonus than a $50 gift package. Thoughtful, though.

I've been working extra hours (extra extra hours?) this week as we run down the clock to our big data center migration. Today, two people who now report to me (one is my former boss, on contract!) are flying from San Francisco to Sterling, VA. Tomorrow morning, I also head to Virginia to meet with them, ensure their badges & keys work at the data center, and generally show them the cages, servers & tools we have on-site for this migration. I also hope we can discuss in person the sequence of steps we're taking once the site goes offline Friday night, filling in any details I may have overlooked.

I'm staying at a hotel in Virginia Friday through Sunday so I can be as close as possible to the data center. I'm also expecting that after many extended hours of battle, I'll be in no shape for a 40-minute drive home and a return trip the following day.

Friday night, about 10pm Eastern time, our site will go down and the fun begins. In our preliminary testing, we were getting speed improvements of 5x or so, but I think that's just a happy dream of what we'll be able to do in another two months: even after the apps move from the old cage to the new, the databases will largely be reaching back to the storage arrays in the old cage over a 4Gbps fibre link until we can migrate the data. I do intend to start migrating the data after we go live this weekend, but it will take weeks of effort to finish (I'm hoping to have the bulk of it done by July 1).

I cannot say how relieved I am to get the new cage online. It's not just that we've been working on it constantly the past several months, but we've been letting maintenance of the old cage slide a bit, and it had inherent issues we couldn't easily fix anyway. Killing the old cage removes a lot of legacy equipment & unfortunate architecture decisions: the slate gets swept clean.

And even when we shut down the old cage, there is still much to do. We could only make this deadline by physically moving the Sun T4-1 servers with their Oracle databases to the new cage. We were originally planning to retire that equipment, but that aspect is a huge project in itself. In the next three-month work sprint, we're going to: identify the data we need to retain, convert the data from SPARC data word format into Intel data word format, and restructure the database & LUN layout. All on live systems. The database team has 80% of this workload, but I'm still in the mix, doing the storage allocations and helping where I can with optimizing the process.
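
(A note on the "data word format" piece: SPARC stores multi-byte values big-endian while Intel is little-endian, so every word's bytes sit in the opposite order on disk. The actual conversion will be done with Oracle's own cross-platform tooling; the toy Python snippet below is illustration only, showing why a straight copy of the datafiles can't work.)

    # Toy illustration of the SPARC vs. Intel byte-order problem.
    # The same 32-bit value serialises to opposite byte orders on the two
    # architectures, so datafiles written on SPARC can't simply be copied
    # to an Intel host and read back.
    import struct

    value = 0x12345678
    big    = struct.pack(">I", value)    # SPARC (big-endian) layout
    little = struct.pack("<I", value)    # Intel (little-endian) layout

    print(big.hex())                     # 12345678
    print(little.hex())                  # 78563412

    # Reading SPARC-written bytes with Intel assumptions yields garbage:
    print(hex(struct.unpack("<I", big)[0]))   # 0x78563412, not 0x12345678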

At this moment, here and now, I'm feeling rather serene. I've just ticked off the last of my pre-migration tasks, and finished scripting a lot of Veritas cluster stuff I need to do the moment the site is offline Friday night. In all, I'm now in wait mode. There is nothing left to do but wait for the dawn.
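
(For the technically inclined, the scripted bits are mostly of this shape: take the application service groups offline in a fixed order, waiting for each to settle before moving to the next. A minimal Python sketch follows; the group and node names are invented, and the real scripts are longer and site-specific.)

    #!/usr/bin/env python
    # Sketch of the go-time sequence: offline VCS service groups in order,
    # polling until each one reports fully offline before starting the next.
    # Group and system names are hypothetical.
    import subprocess
    import time

    GROUPS = ["web_sg", "app_sg", "oradb_sg"]   # hypothetical shutdown order
    SYSTEM = "node1"                            # hypothetical cluster node

    def offline(group, system):
        subprocess.run(["hagrp", "-offline", group, "-sys", system], check=True)
        while True:                             # wait for VCS to finish the offline
            state = subprocess.run(["hagrp", "-state", group, "-sys", system],
                                   check=True, capture_output=True, text=True).stdout
            if "OFFLINE" in state:
                return
            time.sleep(10)

    for g in GROUPS:
        offline(g, SYSTEM)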
bjarvis: (Default)
Folks are storing their data in the cloud, things are happening in the cloud, our businesses are becoming more cloud-based, etc.

Great buzzwords. Still a lot of hype.

There is no cloud, just other people's computers. There are still servers and hard drives out there, stuffed into data centers: you just get to rent a little part of it all to store your files. The "cloud" nomenclature was invented precisely to signal that you, the customer, have absolutely no idea where those servers, drives, and your data actually are. The sales pitch is that you don't need to know; the reality is that you can't know.

As one who has worked in major data centers for decades, trust me: I know where the cloud actually is. I've had a direct hand in building small portions of it. Indeed, since each machine has a number of sharp corners & edges, I've had more than a few injuries getting the equipment assembled & racked, and I'm of course not the only one.

It's an interesting thought: the 'cloud' contains not just your data, but an awful lot of very real blood, my own included.

Solaris EOL

Dec. 2nd, 2016 04:46 am
bjarvis: (Default)
I read rumours this morning that Oracle was going to be shutting down all further Solaris development. The gist of what's circulating:

"Solaris being canned, at least 50% of teams to be RIF'd in short term. All hands meetings being cancelled on orders from legal to prevent news from spreading. Hardware teams being told to cease development. There will be no Solaris 12, final release will be 11.4. Orders coming straight from Larry."

Even if development stops, support for existing versions is still promised for a couple more years, but once the last version runs its course, the game is over.

I have mixed feelings about this, if it is true. I've been working with Sun Microsystems gear since the Sun 3 line and SunOS 3.5, back in the 1980s when the Motorola 68000 CPU was hot stuff. Hell, in those heady days, the OS included a compiler! The machines were sturdy, the screens were huge (cathode ray tubes, naturally) and while they were expensive, they sold like hot cakes. I worked for a Sun VAR in Toronto in the early 1990s, then for the University of Toronto, caring for a Sun 3/280 server.

The transition to SPARC and the Sun 4 line was joyful and traumatic. I loved the faster & more powerful CPUs, and the upgrade of our machine was as simple as swapping out a VME board. I did not love Solaris, however. Yeah, SunOS 4.1.5 at that time needed a complete refresh to handle newer communications technologies, extra cores, multi-CPU architectures and such, but it was a solid OS and worked well. Slowlaris was a painfully poor performer and a resource pig by comparison. And it didn't come on quarter-inch tapes: one had to lay down serious money for a CD drive since that was the only distribution method available. And adding insult to injury, it didn't have a development environment by default: it was an extra.

Over the years, my Sun 4/280 gained extra memory and SCSI drives. It was running better than ever, albeit two versions of Solaris later.

After a few more years, a couple more jobs and a move to the US, I landed at Fannie Mae for ten years. We were told Fannie Mae was the second largest Sun customer on the east coast (after NASA): I was part of the team which built and maintained their MornetPlus system, running mostly on Sun 250 and 450 machines for data processing and a large pair of Sun 6800 machines for the core cluster. The 6800 machines were standalone, but the 450 models would fit two to a rack --and they weighed a tonne. We were mostly running Solaris 2.6 when I arrived, transitioned to Solaris 8 during my tenure, and began migrating to Solaris 10 as I left (now eight years ago). I loved having a single operating system for our entire enterprise: it made support so much easier, and Solaris 8 was again pretty solid.

While I used Solaris 10 at Fannie Mae and again at Talaris/Rearden Commerce/Deem where I work currently, I've never loved it. Solaris 10 and I tolerated each other. It felt snobbish and repressed. It ran solidly and had some interesting new features (zones, for one), but other kids on the block (e.g. Linux) seemed to be moving faster and offered more flexibility. And most of all, the new kids were vastly cheaper.

Fannie Mae paid an enormous amount to Sun Microsystems every year for support. Millions of dollars. Oracle bought up Sun Microsystems and continued to support Solaris and release new models of the Sun hardware, but they added their own special Oracle DNA, that is, their desperate desire to drain customers of every penny they had. Support costs soared and purchase prices spiked, although discounts could sometimes be had if you bundled in other Oracle products, especially their software.

Even now, I'm typing this while monitoring a storage issue on a Sun T4-1 machine running Solaris 10. It's fine, nothing much to write about. But we're also building a new data center cage, refreshing our entire hardware base and allowing us to retire & scrap our old systems by spring of 2017. Sun will not be part of the new cage: the Solaris stops here.

As I said, Solaris 10 and I never loved each other, but after Larry Ellison got his mitts on it all, I knew it was time for me to start dating other operating systems. Our on-again-off-again affair had run its full course.

So reading that Oracle is tossing in the metaphoric towel on Solaris (and presumably the hardware line too) is like seeing an obituary notice in the newspaper for an old boyfriend. It's a sad thing and I'll remember the good times, but I let go a long time ago.
bjarvis: (Default)
My dear employer purchased another company about six years ago, adding ground & car bookings to our business travel portfolio. Of all of the acquisitions we've done over the years, this was practically the only one which made sense, the only one which has been financially worthwhile and the only one still operating, but that's another story.

This particular car service division has been largely independent of the larger firm: our travel systems make calls into the car service systems, but we haven't tried doing a full integration of their services or their staff. Our core travel systems are all based on Linux, with Solaris/Oracle handling the backend databases, while the car service machines are all Windows Server-based, with Microsoft products and some cloud-based services to supplement.

In the past month, two of the primary people from the car services division have left the firm, and because we have no other staffing, the care & feeding of their systems has fallen to my systems engineering team. And now we're seeing the true nature of the nightmare...

These car service systems require constant care. Constant. We've learned that the core database has been receiving manual maintenance daily for the entire eight years it has been in service. We've learned the logging system has been manually restarted every 48 hours or so for years. There's a stack of little things like this which have been consuming the full attention of two full-time staff on a daily basis.

I'm horrified by the amount of work that has been required daily, if not hourly, to maintain uptime for these systems. I'm horrified that no one in management seems to have noticed and thought it odd. I'm horrified that no one has seen fit to fix any of these problems, especially the guys who have been doing the work. And I'm horrified that, even if the guys couldn't correct the root problems, they didn't even attempt to automate the required recovery steps. Seriously?!

My team is now trying to pick up the pieces, but I have little Windows experience, and the training hand-off occurred while I was on vacation, so I'm missing huge chunks of knowledge about the architecture, single points of failure, and other gems one could collect from those who built & maintained these things. It doesn't take great knowledge, though, to know that This Isn't Right.

Remember your training, young padawan:
1. Automate everything.
2. Automate recoveries as much as possible.
3. If something breaks daily, fix it.
4. Document everything so the people coming after you have a guide.
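
On point 2 in particular: even without fixing the root cause, a scheduled health check that restarts the wedged service beats a human doing it every 48 hours. Something like the Python sketch below, run periodically from the Windows Task Scheduler, would do --the service name and health URL are invented stand-ins, since I still don't know the real architecture.

    #!/usr/bin/env python
    # Minimal recovery-automation sketch: probe a health endpoint and
    # restart the Windows service only if the check fails. Intended to be
    # run periodically from Task Scheduler. Service name and URL are
    # hypothetical placeholders.
    import subprocess
    import time
    import urllib.request

    SERVICE = "CarSvcLogging"                     # hypothetical service name
    HEALTH_URL = "http://localhost:8080/health"   # hypothetical health endpoint

    def healthy():
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
                return resp.status == 200
        except Exception:
            return False

    if not healthy():
        subprocess.run(["sc", "stop", SERVICE], check=False)   # may already be dead
        time.sleep(15)                                         # give it a moment to wind down
        subprocess.run(["sc", "start", SERVICE], check=True)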
