Operations and Monitoring Engineer - Sydney/Adelaide/Auckland
hummgroup (ASX-HUM) is one of Australasia’s most successful and enduring fintech organisations with a proud legacy of rewriting the playbook for digital spending. We help people buy everything, everywhere, every day. Our philosophy is to deliver creative and tailored financial solutions to our customers around the globe. We enable and empower our people to build lifetime relationships of value and opportunity for the benefit of you, our colleagues, customers and shareholders.
Sitting in the Technology portfolio, the Operations & Monitoring team is responsible for the execution, management, monitoring, support, and maintenance of our product technology.
We are seeking a highly motivated and experienced Operations and Monitoring Engineer to join our team. As an Operations and Monitoring Engineer, your primary responsibility will be proactively monitoring systems and troubleshooting complex issues to ensure our clients receive exceptional customer service.
The purpose of this role is to proactively support the day-to-day IT operations of the humm products.
You Will Be Responsible For
- Proactively monitor production systems, ensuring availability across all critical IT services meets or exceeds agreed service levels.
- Design, implement, maintain, and operate monitoring and alerting platforms.
- Develop and maintain dashboards across platforms, applications and infrastructure to display the real-time health status of systems and enable rapid resolution of incidents, service degradation and problems.
- Implement and maintain metrics, traces and logs to measure and monitor the overall performance and health of our systems and platforms.
- Perform full monitoring and analysis of systems, including availability, capacity and performance for infrastructure and application.
- Provide support for critical event management, including warning, alerting, and troubleshooting of infrastructure and application issues.
- Provide regular reporting of historical, current, and forecast health, including capacity and performance trends and recommendations.
- Design, implement and maintain machine learning models for Anomaly Detection.
- Implement, support and manage integration of ITSM systems and Alerting systems.
- Provide recommendations for continuous improvement of observability and event management.
- Identify new candidates and provide input for additional monitoring and alerting.
- Keep abreast of industry standards, trends and practices related to observability and event management.
- Provide ongoing training and support to our technology group.
- Work with Product Owners and Squads to establish requirements for observability and alerting.
- Work with our technology group to implement, configure and maintain agents such as OpenTelemetry across our infrastructure, platforms and other IT-related resources and assets, using CI/CD and Infrastructure as Code methodologies.
- Automate daily system checks to ensure system availability and interpret the results.
- Resolve and mitigate the impact of missed SLAs.
- Provide Level 2 operational support for internally or externally developed applications on which our internal and external customers depend.
- Investigate, analyse and troubleshoot issues, providing key input to problem investigation of application incidents from an operational perspective.
- Ensure IT ticket requests and incidents are prioritised, allocated and updated, working towards same-day resolution of tickets where possible.
- Support operations overnight and on weekends on a roster.
- Demonstrate high levels of communication and support to business stakeholders to maintain business confidence in application support.
- Produce and review operational documentation where a gap in documentation is identified to reduce key point dependencies.
- Must have proven knowledge and working experience with leading telemetry tools e.g. Elastic, Splunk, Dynatrace. Specific experience with Elastic Cloud will be regarded very favourably.
- Proven experience in the development, implementation and operation of infrastructure and application monitoring systems.
- Experience in monitoring traditional technology stacks including Windows OS, Linux OS, Microsoft IIS, Microsoft SQL DB, Oracle DB and standards such as SNMP.
- Solid working knowledge of observability in Public Cloud environments including Azure and AWS.
- A solid understanding and experience working with container technologies, including Kubernetes clusters in AWS.
- Experience with cloud-native observability technologies such as AWS CloudWatch or similar.
- An understanding of OpenTelemetry within cloud-native and on-prem environments will be highly regarded.
- A proven and demonstrated passion for constantly learning and using industry-leading DevOps/TechOps/SRE tools, best practices, patterns and technologies.
- Strong familiarity with C#/Java/Go/Python programming languages.
- In-depth experience working with RESTful APIs and Web Services.
- Solid understanding of core networking concepts (IP, DNS).
- Proven experience working with SQL, Linux and Oracle DB will be advantageous.
- Good problem-solving and troubleshooting skills, an analytical mind and the ability to multitask.
- Willingness to grow and learn new skills, such as Machine Learning and Artificial Intelligence.
- High degree of accuracy and attention to detail.
- Demonstrated high level of interpersonal and communication skills.
- Proven ability to work effectively in a team environment and independently as required.
- Proven working experience in operations support or development.
- The ability to work under pressure.
- Willing to be rostered on overnight/weekend support.
- Bonus Skills: Demonstrated experience working with Elastic, Linux, Docker, k8s and configuration management (Ansible, Terraform, Puppet) is a big plus.
We are looking for someone who is passionate about technology, enjoys working in a team environment, and has excellent problem-solving skills. The successful candidate will be able to work independently, communicate effectively, and show strong attention to detail. If you have a proven track record in application, operations or support roles, or as a systems engineer, with a hunger and passion for troubleshooting, we want to hear from you.
humm group is proud to be an equal opportunity employer. We adopt and encourage diversity through an open and inclusive culture that values and respects all our employees, customers and the communities we live in, work in and are a part of.
We recognise you have a life outside work, and we encourage flexible working to enable you to balance your work and family commitments.
We have a recognition scheme for you, your colleagues, and peers - we encourage everyone to say thanks regularly, and to nominate each other for our company awards.