
Description
Title: Site Reliability Engineer (SRE)
Location: Houston, TX
Education: Bachelor’s Degree
Note: Candidates must relocate within 2 to 3 weeks. (Please get RTTO from the candidate for relocation.)

Job Description: As a Site Reliability Engineer for our technology teams, you will have the opportunity to instrument, build and maintain complex applications, and to maintain vendor applications from a development and risk perspective.

Required Skills:
• Excellent debugging and troubleshooting skills.
• Expert in performance monitoring and capacity management of large systems using various tools.
• Expert in at least one technology stack (Java/J2EE/Python), including designing, coding, testing, and delivering software.
• Expert in at least one relational database (SQL Server, Oracle, DB2, etc.).
• Hands-on experience with cloud technologies (Cloud Foundry, Kubernetes, AWS).
• Hands-on experience with big data services (Hadoop, HDFS, Hive, YARN, HBase, Kafka, ZooKeeper).
• Working knowledge of Groovy, batch scripting, PowerShell or shell scripting.
• Experience developing, deploying and debugging distributed systems in a Linux, Hadoop environment.
• Experience with monitoring tools such as AppD, Splunk, ELK, Geneos.
• Analysis of SLI metrics and performance data, interpreting and correlating them with SLOs and SLAs.
• Experience with deployment automation, CI/CD, DevOps, Jenkins, Git, Bitbucket.
• Experience with cloud/container environments, big data, analytical tools (Tableau, Alteryx).
• Expert practitioner in one or more technology domains; may be a cross-domain expert able to solve complex and mission-critical problems within a business or across the firm.
• Working knowledge of infrastructure components such as routers, load balancers and networks.
• Comfortable working in Agile mode and proficient in continuous integration and continuous delivery.
Primary Responsibilities:
• Troubleshoots incidents, conducts blameless post-mortems and ensures permanent closure of incidents.
• Engages with the development team throughout the life cycle to help develop software for reliability.
• Applies analytics to historical data, such as incidents and usage patterns, to predict issues and take proactive action.
• Drives adoption of self-healing and resiliency patterns such as circuit breaker, bulkhead, etc.
• Designs and conducts performance tests, identifies bottlenecks and opportunities for optimization.
• Defines and drives adoption of best-in-class monitoring frameworks to accomplish end-to-end flow monitoring and noiseless alerting.
• Designs, develops, tests and delivers software to automate manual operational work.
• Deploys software and product upgrades.
• Adds value to team delivery, works with the team to complete tasks to a high quality and actively learns new skills.
• Facilitates maximum speed of delivery by objectively binding to error budgets of the service.
• Manages the effort split between manual operational work and engineering work.
• Coaches other team members and manages teams as needed.
Marvel Infotech, Inc., 371 Hoes Lane, Suite #200, Piscataway, New Jersey 08854 Phone: 7329060444
