Job Title: Senior Linux System Administrator (L3)
Location: Irving, TX
Employment Type: Full-Time (FTE)
Job Description
Role Summary
Seeking a hands-on L3 Linux Administrator to own stability, availability, and performance across large-scale Linux environments. The role demands deep troubleshooting skills, strong exposure to Veritas Cluster Server (VCS) and SAN/NAS storage, and close coordination with data center teams on hardware incidents. The ideal candidate will work independently, lead incident resolution, and improve BAU operations through automation and best practices.
Key Responsibilities
Linux Administration (L3)
• Administer and troubleshoot RHEL, Oracle Linux, CentOS, and SUSE in production.
• Diagnose complex OS issues: kernel panics, boot/GRUB failures, filesystem corruption, resource contention (CPU/RAM/I/O/Network), SELinux/AppArmor denials.
• Patch and upgrade OS at scale; manage package repositories and kernel updates with rollback strategies.
• Implement and audit security hardening (firewalld/iptables, CIS benchmarks, PAM, sudo, SSH, auditd).
• Manage system services (systemd), cron/timers, users/groups, sudoers, and system-wide configuration.
Veritas Cluster Server (VCS/InfoScale)
• Install, configure, and administer VCS for HA/DR across multi-node clusters.
• Create/maintain service groups, resources, dependency trees; configure LLT/GAB, I/O fencing, and quorum.
• Integrate VxVM/VxFS (disk groups, volumes, file systems) with application failover.
• Conduct DR drills, failover testing, and root cause analysis for cluster events.
Storage: SAN & NAS
• Liaise with storage teams for LUN provisioning, zoning, masking; validate multipathing (DM Multipath/PowerPath).
• Build and maintain filesystems (ext4/xfs/VxFS), mount policies, fstab and autofs.
• Manage NFS/CIFS/SMB exports/mounts, permissions, quotas, and locking issues.
• Troubleshoot pathing, latency, and I/O bottlenecks using OS, HBA, and array-side telemetry.
Data Center & Hardware Coordination
• Coordinate with DC teams for racking/stacking, cabling, console access, and physical triage.
• Diagnose hardware faults (CPU, memory, NIC/HBA, disks/RAID/SSD, backplane, PSU, fans) and verify firmware/BIOS alignment.
• Raise and track OEM tickets (Dell/HP/IBM/Cisco), manage RMAs, and oversee replacements and post-fix validation.
BAU Operations & Incident Management
• Act as L3 escalation for P1/P2 incidents; drive bridge calls and lead technical recovery.
• Perform deep-dive log analysis (journald, syslog, dmesg, audit logs, application logs).
• Create/run SOPs/runbooks, maintain KB articles, and implement problem management (RCA, corrective actions).
• Support on-call rotation and scheduled maintenance windows (change management, CAB, MOPs).
Networking (Host-Level)
• Troubleshoot TCP/IP, routing, VLANs/bonding/teaming, MTU, host firewalls, DNS/DHCP, NTP/chrony.
• Collaborate with network teams on L2/L3 connectivity, load balancers, and firewall rules.
Required Experience & Skills
• 8–12+ years in enterprise Linux system administration with proven L3 ownership.
• Strong hands-on experience with VCS (Veritas Cluster Server), VxVM, VxFS, and HA/DR patterns.
• Solid SAN/NAS experience: LUNs, zoning, multipath, NFS/SMB.
• Demonstrated success working independently and leading during critical incidents.
• Advanced troubleshooting: kernel, performance, storage, and cluster-level failures.
• Scripting proficiency (Bash; Python preferred) and familiarity with Ansible.
• Familiarity with virtualization (VMware/KVM) and basic cloud concepts (AWS, Azure, running Linux in the cloud).
• Strong documentation discipline (SOPs, MOPs, RCAs) and ITIL-aligned processes.