Site Reliability Engineer

As a Site Reliability Engineer (SRE) on our team, you will apply your monitoring subject matter expertise to improve the reliability of the VA’s applications through enterprise monitoring tools. You will be responsible for determining why an application covered by enterprise monitoring still experienced a high priority incident (HPI) or a critical priority incident (CPI). You’ll work with the client’s Business Line Management (BLM) Teams, the Event Management (EM) Team, and the Enterprise Command Operations (ECO) Incident Management Team to detect, investigate, and diagnose monitoring problems and defects across enterprise-level applications and technology stacks. This position is on a team dedicated to providing recommendations and instrumenting approved recommendations in client monitoring tools to improve VA enterprise reliability and the quality of services provided to veterans. The monitoring tools in scope are Splunk Enterprise/ITSI, AppDynamics, DynaTrace, SolarWinds, ScienceLogic, and Aternity. You will work with system and application owners to understand existing design and functionality, apply your knowledge of workflow systems and application processes across multiple system environments, and work across technology and development teams to diagnose outages caused by inadequate monitoring instrumentation and recommend changes to increase reliability.

Required Experience:
  • 6+ years of monitoring and troubleshooting experience with two or more of the following monitoring tools: AppDynamics, DynaTrace, Splunk/ITSI, SolarWinds, ScienceLogic, or Aternity
  • 8+ years of experience working with key indicators for IT system operability, reliability, application performance, and code quality
  • 8+ years of experience deploying, maintaining, and troubleshooting complex applications at an enterprise scale while working with cross-functional teams
  • Experience in one or more Technology Areas (Network, Windows, Desktop, Unix/Linux, AWS or Azure Cloud, WebSphere Middleware, Java/JS Development, Microsoft or Oracle Database)
  • 1+ years of experience in service virtualization, AWS or Azure Cloud technologies, and SaaS and PaaS implementation
  • 2+ years of experience leading teams
  • Experience with using Microsoft Office, including Word, Excel, and PowerPoint
  • Ability to work independently with little supervision
  • Master’s degree in Computer Science, Engineering, or equivalent, and 10 total years of experience; or 20 total years of experience in lieu of a degree

Preferred Experience:

  • Experience with test-driven development, distributed systems, microservices, and cloud-native application implementation
  • Experience with the following tools: Oracle Enterprise Manager, Power BI, and ServiceNow
  • Possession of excellent written and verbal communication skills
  • Possession of strong critical thinking and error assessment capabilities
  • Experience working in an Agile framework such as Kanban or Scrum
  • Public Trust Clearance

To apply for consideration, please submit a comprehensive resume tailored to this job description. All work experience must include start and end dates (month and year). Education must cite school, degree, and year degree completed. Minimum required experience and education must be clearly illustrated in your resume. Offer to candidates is contingent upon successful background and clearance adjudication. Please submit your resume to with the role title in the subject line.

Full-time W2 position, 100% remote. Candidates must be based in the U.S. and be U.S. citizens. Full benefits include medical, dental, vision, STD, LTD, life insurance, PTO, and a 401(k) matching program. Compensation for this role is $120,000-$125,000/year.

Candidates who do not meet minimum requirements will not be considered.