Disaster Recovery in Action - American Trim IT Staffing and Prior Strategies


This presentation covers American Trim's disaster recovery program: its IT staffing, the state of recovery planning before 2008, the risk-scoring worksheet used to prioritize servers and applications, the QAD / Progress replication setup, and lessons learned from a mock disaster and from a May 2012 event.

  • Disaster Recovery
  • American Trim
  • IT Staffing
  • Strategic Planning
  • Organizational Structure

Uploaded on Mar 03, 2025 | 0 Views


Presentation Transcript


  1. Disaster Recovery in Action
     • Paul Pellegrini, Director of IT
     • Colin Watkins, DBA
     • James Blankenship, Sr. QAD Consultant, QAD Certified Professional

  2. Who is American Trim?
     • QAD customer since 1993
     • Facilities: Lima, OH; Wapakoneta, OH; Sidney, OH; Erie, PA; Cullman, AL; Monterrey, NL, MX
     • Acquired Angell-Demmel, NA, with facilities in Lebanon, KY and Dayton, OH (Dec 2012)
     • Eight manufacturing facilities in four states, plus Mexico

  3. What do we do? www.amtrim.com

  4. American Trim IT Staffing
     • DBA
     • Network Engineer
     • Systems Analyst
     • Programmer Analyst
     • EDI Manager
     • Programmer/Analyst
     • Applications Integration / Support Manager
     • PC Technician
     • Help Desk
     • Director of IT reports to CFO
     • Trusted Partners

  5. Prior to 2008
     • No formal plan existed
     • Acted on task-oriented opportunities
     • Natural gas generator installation
     • Liebert Room Battery backup
     • Servers were stand-alone and numerous (26 physical machines)
     • Backups to tape daily, carried off-site

     Starting in 2008
     • Failure Modes and Effects Analysis (FMEA)
     • Acknowledged that no real plan existed
     • Began documenting in earnest: server hardware and server applications
     • Iterative cycle of evaluation

  6. Document Servers / Applications

  7. Column key
     • A Application Name: lists each application supported by American Trim Information Technology
     • B Application Importance
         • 9/8/7 = PRI 1: impacts at least 70% of the company
         • 7/6/5 = PRI 2: impacts at least 40% of the company
         • 5/4/3 = PRI 3: impacts a large group, but a workaround exists
         • 3/2/1/0 = PRI 4 & 5: small impact, delays are acceptable during disaster mode
     • C Status of Planning
         • 9 = Active real-time plan and documentation
         • 7/8 = Tested and documented plan
         • 5/6/7 = Documented (untested) plan
         • 3/4/5 = Mental plan
         • 1/2 = Clueless
     • D Forecasted Recovery Time (application)
         • 9/8/7 = no idea, or at least 10 days
         • 6/5/4 = mental plan, > 3 days and < 10
         • 3/2/1 = known / specific # of days
         • 0 = same day recovery
     • E Target Recovery Time (application)
         • 9/8/7 = okay to be down > 5 days
         • 6/5/4 = okay to be down 3 to 5 days
         • 3/2/1 = specific # of days
         • < 1 = same day recovery
     • F Employee count: hidden column (not used)
     • G Recovery Gap (application): simple variance, Forecasted [D] - Target [E] + 1; if less than 1, then force = 1
     • H Application Risk Factor: a simple metric that values the application impact versus preparedness
         • Application Importance [B] * ( 10 - Status of Planning [C] ) * Recovery Gap [G]
     • I Host Name: name of the server or virtual server that hosts the application
     • J Hardware Name: name of the server hardware that hosts virtual servers
     • K Host City: location of the actual hardware
     • L Virtual: simple indicator of whether this is a virtual server (yes / no)
     • M Current Host Recovery Time (hardware)
         • 8/9 = No defined alternatives AND no active maintenance plan AND backwards data loss of up to 48 hours. Hardware/application loss > 4 days expected
         • 6/7 = No confirmed alternatives OR ( hardware maint plan expires within 18 months AND server hardware > 5 years old ) AND backwards data loss of up to 36 hours. Hardware/application loss of 3 to 4 days expected
         • 4/5 = Have confirmed alternatives OR hardware maint plan of Next Business Day or better AND backwards data loss of no more than 24 hours. Hardware/application loss of 1 to 2 days expected
         • 3 = Have confirmed alternatives OR hardware maint plan w/ Same Day or better plan AND data loss of less than 18 hours. Hardware/application recovery within 24 hours expected
         • 2 = Have recurring replication of host automated AND hardware maint plan w/ 6 hours to repair AND backwards data loss of less than 6 hours. Hardware/application recovery within 8 hours expected
         • 1 = Have recurring replication of host automated with backwards data loss of less than 4 hours AND hardware maint plan w/ 6 hours to repair. Hardware/application recovery within 1 hour expected
     • N Target Host Recovery Time (hardware): same key as "Current Host Recovery Time" [M]
     • O Application Hardware Risk
         • Application Importance [B] * ( Current Host Recovery [M] - Target Host Recovery [N] + 1 ); if less than 1, then = Application Importance [B]
     • P Total Risk: Application Risk Factor [H] multiplied by Application Hardware Risk [O]
     • Q Backup Hardware: used to call out confirmed alternatives where disk, CPU, and memory are guaranteed in sufficient volume to accommodate specific servers
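The scoring arithmetic in columns G, H, O, and P can be sketched in a few lines of Python. This is an illustration only, not the presenters' spreadsheet: the `AppRow` class and its field names are hypothetical, and the floor on column O is read here as "if the bracketed term is less than 1, treat it as 1", which yields Application Importance [B] as the slide states.

```python
from dataclasses import dataclass


@dataclass
class AppRow:
    """One worksheet row; field names are hypothetical labels for the columns."""
    importance: int          # B: Application Importance (0-9)
    planning_status: int     # C: Status of Planning (1-9)
    forecast_recovery: int   # D: Forecasted Recovery Time score
    target_recovery: int     # E: Target Recovery Time score
    current_host: int        # M: Current Host Recovery Time score
    target_host: int         # N: Target Host Recovery Time score


def recovery_gap(row: AppRow) -> int:
    # G: Forecasted - Target + 1, forced to 1 when it would be lower.
    return max(1, row.forecast_recovery - row.target_recovery + 1)


def application_risk(row: AppRow) -> int:
    # H: B * (10 - C) * G
    return row.importance * (10 - row.planning_status) * recovery_gap(row)


def hardware_risk(row: AppRow) -> int:
    # O: B * (M - N + 1); assumed reading: the bracketed term is floored at 1.
    return row.importance * max(1, row.current_host - row.target_host + 1)


def total_risk(row: AppRow) -> int:
    # P: H multiplied by O
    return application_risk(row) * hardware_risk(row)


# Made-up scores for a critical application with only a mental plan.
example = AppRow(importance=9, planning_status=3, forecast_recovery=8,
                 target_recovery=2, current_host=7, target_host=3)
print(total_risk(example))  # 441 * 45 = 19845
```

Because H and O both include Application Importance, the total weights importance twice, so high-importance applications with poor planning rise quickly to the top of the list.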

  8. Document Options
     • Top 12 risks
     • Common strategies
     • Pros / cons

  9. Document Strategies
     1. Immediate action opportunities
     2. Add hardware capacity
     3. Backup frequency with immediate off-site copies
     4. Implement redundant systems
     5. Application specific solutions

  10. Confirm Assumptions

  11. Quick Timeline Activities
     • Timeline: 2009 through 2013, with dated events on January 21, 2012; May 7, 2012; January 26, 2013; and April 6, 2013
     • Standardize hardware, software, and procedures. Consistency is key!
     • Document your plan: if you're not there, how is someone else supposed to know where to start?
     • Document your Trusted Partners

  12. Environment
     American Trim environment:
     • HP ProLiant DL370 G6
     • VMware ESXi 5.1
     • Red Hat Linux 5.5
     • QAD 2012SE (346 named users)
     • Progress 10.2B07
     • Lima: QADDR, QADTEST, QADDEV
     • Sidney: QADPROD

  13. Synchronizing QADPROD
     • All Progress databases start with AI (after-imaging) set to start a new file every 15 minutes, moving the rolled AI file to the /backup/ai/mfg folder
     • QADPROD CRON, every 15 minutes:
         • GZIP all AI files, then copy them to QADDR
         • rsync selected folders, e.g. apps, cups, data, home
     • QADDR CRON, every 15 minutes:
         • Apply the copied AI files to each DB on DR
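The compress-and-stage half of that 15-minute job might look roughly like the sketch below. It is not taken from the presenters' scripts: the function name `stage_ai_files`, the `outbox` directory, and the `.part` temporary suffix are all assumptions, and the copy to QADDR and the roll forward on the DR side are deliberately left out.

```python
import gzip
import shutil
from pathlib import Path


def stage_ai_files(ai_dir: Path, outbox: Path) -> list[Path]:
    """Gzip each rolled AI file in ai_dir into outbox, oldest first.

    Returns the newly staged .gz files in the order they should be applied.
    Directory layout and naming are hypothetical.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    staged = []
    # Oldest first, so the DR side can apply the files in sequence.
    candidates = sorted(ai_dir.iterdir(),
                        key=lambda p: (p.stat().st_mtime, p.name))
    for src in candidates:
        if not src.is_file() or src.suffix == ".gz":
            continue
        dest = outbox / (src.name + ".gz")
        if dest.exists():
            continue  # already staged by an earlier run
        # Write under a temporary name so a half-written file is never copied.
        tmp = dest.with_name(dest.name + ".part")
        with src.open("rb") as fin, gzip.open(tmp, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        tmp.replace(dest)
        staged.append(dest)
    return staged
```

A cron entry on the production host could call something like `stage_ai_files(Path("/backup/ai/mfg"), outbox)` and then hand the returned list to whatever copy step is in use. Skipping files that already exist in the outbox keeps the job safe to re-run if a cycle overlaps or fails partway.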

  14. Synchronizing OTHER
     • QADDR CRON, 23:40 daily: take a probackup, then copy it to tape
     • AI roll forward is suspended during backups
     • Additional logic:
         • Sends QADTEST & QADDEV files to QADDR and then to tape
         • Sends QADDEV /apps to QADPROD for off-site copies
         • Takes local file backups on QADPROD, kept on the same server, to recover from human mistakes such as an errant file deletion
     • Refer to the actual scripts on the flash drive you have for additional details and notification logic
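One common way to suspend a recurring job while a backup runs is a flag file that the backup creates and the recurring job checks. The slides do not say how American Trim's scripts do it, so the sketch below is an assumption throughout, including the names `backup_in_progress` and `roll_forward_allowed` and the flag path.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def backup_in_progress(flag: Path):
    """Hold a flag file for the duration of the nightly backup.

    O_EXCL makes creation fail if the flag already exists, so two
    backups cannot run at once. The flag path is hypothetical.
    """
    fd = os.open(flag, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    try:
        yield
    finally:
        flag.unlink()  # always release, even if the backup raised


def roll_forward_allowed(flag: Path) -> bool:
    """The 15-minute roll forward job skips its run while the flag exists."""
    return not flag.exists()
```

The backup job would wrap its work in `with backup_in_progress(flag):`, and the roll forward job would return early when `roll_forward_allowed(flag)` is false. Skipped AI files are not lost: they stay queued and are applied on the next cycle after the backup finishes.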

  15. Mock DR Lessons Learned
     • Shut off EDI jobs before bringing the DR box up as PROD
     • Enhanced the step-by-step instructions for EDI process shutdown and startup
     • Most end clients and servers needed ipconfig /flushdns
     • Found Loftware DR problems (since resolved)
     • Found MS Terminal Service licensing problems (since resolved)
     • Enhanced the step-by-step instructions for Factivity client startup
     • Our Progress environment was up and fully accessible in less than 30 minutes

  16. May 2012 Lessons Learned
     • Communicate!
         • Engage Trusted Partners immediately
         • Inform IT staff, management, family
         • Inform operations (customers); include how they should get updates
         • Send status updates (email blasts) as scope is understood
     • Positive Attitude
         • Take time to get your mental state under control
         • Avoid quick/rash decisions
         • Work as a team / sounding board
         • Define contingencies & milestones
         • Document first, prioritize, then act
     • Health
         • Don't forget to eat
         • Walk away to clear your head
         • Humor is a good pressure release!
     • This is not the time to try things!

  17. Prevention
     An ounce of prevention:
     • VMware hardware probes
     • VMware virtual server probes
     • VMware baselines (CPU, Memory)
     • The Dude WAN/LAN link probes
     • The Dude performance (WAN link utilization)
     • The Dude critical services probes
     • The Dude device heartbeats (simple ping): Access Points, Batteries, ILO/DRAC, NAS, DVRs, Kronos clocks, Shift Bells, Room Alerts, etc.
     • Microsoft Performance Monitor / Free disk space monitors

  18. Questions?

  19. Vendors
     American Trim leverages:
     • BravePoint
     • ProStar Software TailorPro
     • Factivity
     • Eagle Consulting & Development
     • Cyberscience Cyberquery
     • ACOM Solutions
     • Trubiquity
