CALLGOOSE

Prometheus - Incident Auto Remediation

This documentation provides a step-by-step guide on how to set up a Prometheus monitoring system and utilize Callgoose SQIBS Incident Auto Remediation to automatically resolve alerts generated by Prometheus Alertmanager.


Overview of Incident Auto Remediation Process


This process includes the following high-level steps:

  1. Prometheus Alertmanager sends alerts to the Callgoose SQIBS API.
  2. The Callgoose SQIBS API generates incidents based on the predefined filter values created by the Callgoose SQIBS user.
  3. Callgoose SQIBS invokes the automation workflow created by the user to resolve the incident.


Detailed Steps for Setting Up Prometheus and Callgoose SQIBS Incident Auto Remediation


1. Set Up Prometheus Using Podman

a) Configure Podman as a Rootless Container in Rocky Linux 9.x

1. Install Podman, create a normal (non-root) user, and enable linger for that user so that its rootless containers keep running after logout:

bash

sudo dnf install -y podman
sudo useradd podmanuser
sudo loginctl enable-linger podmanuser

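2. Grant sudo privileges to podmanuser. Create /etc/sudoers.d/podmanuser containing the single line shown in the comment below, then restrict the file permissions:

bash

sudo vim /etc/sudoers.d/podmanuser
# File content:
# podmanuser ALL = NOPASSWD:ALL
sudo chmod 440 /etc/sudoers.d/podmanuser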

3. Log in as podmanuser and verify the linger status:

bash

sudo su - podmanuser
loginctl user-status | grep Linger

The output should show that linger is enabled (Linger: yes).


4. Set up the runtime directory. First look up the UID and GID of podmanuser:

bash

id -a podmanuser

Then add the following line to /home/podmanuser/.bash_profile, replacing <UID> with the uid value from the output above:

export XDG_RUNTIME_DIR=/run/user/<UID>/

Log out and log back in, or run source ~/.bash_profile in the current terminal, for the change to take effect.
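
As a minimal sketch, the runtime directory can also be derived from the UID instead of being typed by hand. The helper name runtime_dir_for_uid is hypothetical and not part of Podman; it only assumes the systemd convention, used in the step above, that a user's runtime directory is /run/user/<UID>.

```shell
# Hypothetical helper (not a Podman command): print the systemd runtime
# directory for a numeric UID, following the /run/user/<UID> convention.
runtime_dir_for_uid() {
  printf '/run/user/%s\n' "$1"
}

# Derive the value for whichever user runs this snippet (e.g. podmanuser).
XDG_RUNTIME_DIR="$(runtime_dir_for_uid "$(id -u)")"
export XDG_RUNTIME_DIR
```

Placing these lines in .bash_profile avoids hard-coding the UID, which helps if the same profile is reused on another host where podmanuser has a different UID.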


b) Create Directories and Monitoring Network

  • Create necessary directories for containers:
bash

sudo mkdir -p /container_home/alertmanager/conf
sudo mkdir -p /container_home/blackbox_exporter/conf
sudo mkdir -p /container_home/prometheus/{conf,data}
sudo chown -R podmanuser:podmanuser /container_home/{prometheus,blackbox_exporter,alertmanager}


  • Set up the monitoring network for containers:
bash

podman network create monitoring-network

Use the commands below to verify that the network exists:

podman network ls
podman network inspect monitoring-network
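
If the setup is scripted and may run more than once, creating the network unconditionally fails on the second run. The following is a small sketch of an idempotent variant; the function name ensure_network is hypothetical, while podman network exists and podman network create are standard Podman subcommands.

```shell
# Hypothetical helper: create a Podman network only if it is missing.
# `podman network exists NAME` exits 0 when the network is present.
ensure_network() {
  net_name="$1"
  if podman network exists "$net_name"; then
    echo "network $net_name already exists"
  else
    podman network create "$net_name"
  fi
}

# Usage:
# ensure_network monitoring-network
```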


c) Configure and Run Alertmanager


1. Create the /container_home/alertmanager/conf/alertmanager.yml configuration file for Alertmanager (this is the path that is mounted into the container in the next step):

yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 10m
# A default receiver
  receiver: 'callgoose-sqibs'
receivers:
  - name: 'callgoose-sqibs'
    webhook_configs:
      - url: 'https://xxxxx.callgoose.com/xxx/xx/xxxxxx?from=prometheus&token=ReplaceWithCGToken'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
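
Before starting the container, the configuration file can be validated. The sketch below assumes that the prom/alertmanager image ships the amtool binary on its PATH and overrides the image entrypoint to run amtool check-config; the function name check_alertmanager_config is hypothetical.

```shell
# Hypothetical helper: validate an Alertmanager config file with amtool,
# run from the same image that will later serve it.
# Assumption: amtool is available on the PATH inside prom/alertmanager.
check_alertmanager_config() {
  conf_file="$1"
  podman run --rm \
    -v "$conf_file:/etc/alertmanager/alertmanager.yml:ro" \
    --entrypoint amtool \
    prom/alertmanager check-config /etc/alertmanager/alertmanager.yml
}

# Usage:
# check_alertmanager_config /container_home/alertmanager/conf/alertmanager.yml
```

A non-zero exit status indicates that the file failed validation, which is cheaper to discover here than through a container that exits right after start.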


2. Run Alertmanager inside a Podman container:

Alertmanager and Prometheus run as the nobody user inside their containers, so nobody:nobody must own /container_home/alertmanager. Because the containers are rootless, apply the ownership change with podman unshare so that it is made inside the user namespace of podmanuser:

bash

podman unshare chown -R nobody:nobody /container_home/alertmanager

podman run -d --name alertmanager \
  -p 9093:9093 \
  -v /container_home/alertmanager/conf/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  --network monitoring-network \
  prom/alertmanager

If Podman prompts you to select an image registry, choose the one you prefer. This example uses docker.io.


Refer to the following documentation for more information about the Callgoose SQIBS API token and API endpoint:


Callgoose SQIBS API Token Documentation

Callgoose SQIBS API Endpoint Documentation

API Filter Instructions and FAQ

How to Send API


d) Set Up Blackbox Exporter


1. Create the /container_home/blackbox_exporter/conf/config.yml for Blackbox Exporter:

yaml

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4  # Force use of IPv4
      valid_http_versions: ["HTTP/1.1", "HTTP/2", "HTTP/2.0"]
      valid_status_codes: [200, 201, 202, 204, 401, 403, 404, 301, 302, 303, 307]  # Accept only these specific status codes
      method: GET
      fail_if_ssl: false  # Allow SSL
      fail_if_not_ssl: false # Allow non-SSL if needed
      tls_config:
        insecure_skip_verify: true # This skips SSL verification


2. Run Blackbox Exporter in Podman:

bash

podman unshare chown -R root:root /container_home/blackbox_exporter

podman run -d --name blackbox_exporter \
  -p 9115:9115 \
  -v /container_home/blackbox_exporter/conf/config.yml:/etc/blackbox_exporter/config.yml \
  --network monitoring-network \
  prom/blackbox-exporter
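
Once the container is running, a probe can be triggered by hand to confirm that the http_2xx module behaves as expected. Blackbox Exporter serves probes on /probe with target and module query parameters; the helper name blackbox_probe_url is hypothetical, and the sketch assumes the exporter is reachable on localhost:9115 as published above.

```shell
# Hypothetical helper: build the Blackbox Exporter probe URL for a target.
# Note: a target that itself contains '&' or '?' would need URL encoding.
blackbox_probe_url() {
  exporter="$1"
  target="$2"
  module="$3"
  printf 'http://%s/probe?target=%s&module=%s\n' "$exporter" "$target" "$module"
}

# Usage: in the output, probe_success 1 means the probe passed, 0 means it failed.
# curl -s "$(blackbox_probe_url localhost:9115 https://incident-auto-remediation.callgoose.com http_2xx)" | grep '^probe_success'
```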


e) Set Up Prometheus


1. Create the Prometheus configuration file /container_home/prometheus/conf/prometheus.yml:


yaml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 10s # Evaluate rules every 10 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - /etc/prometheus/alert_rules.yml
  # - "first_rules.yml"
  # - "second_rules.yml"

# Scrape configurations. The first job scrapes Prometheus itself;
# the jobs after it scrape the exporters.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

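  # Optional: this job expects a node_exporter container on monitoring-network,
  # which this guide does not set up. Remove the job if you do not run one.
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node_exporter:9100']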

  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Change this to the relevant module
    static_configs:
      - targets:
          - https://incident-auto-remediation.callgoose.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: instance  # Add this line to set the actual website as the instance
        source_labels: [__param_target]
      - target_label: __address__
        replacement: blackbox_exporter:9115  # Blackbox Exporter address


Replace https://incident-auto-remediation.callgoose.com with the URL of your website.


2. Create the alert rule file /container_home/prometheus/conf/alert_rules.yml:


yaml

groups:
  - name: blackbox_alerts
    rules:
      - alert: WebsiteDown
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Website {{ $labels.instance }} is down"
          description: "The website {{ $labels.instance }} has been down for more than 1 minute."
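
Both files can be checked before Prometheus is started. The sketch below assumes that the prom/prometheus image ships promtool on its PATH; promtool check config validates the main configuration together with the rule files it references. The function name check_prometheus_files is hypothetical.

```shell
# Hypothetical helper: validate prometheus.yml (and the rule files it lists)
# with promtool, run from the same image that will later serve it.
# Assumption: promtool is available on the PATH inside prom/prometheus.
check_prometheus_files() {
  conf_dir="$1"
  podman run --rm \
    -v "$conf_dir:/etc/prometheus:ro" \
    --entrypoint promtool \
    prom/prometheus check config /etc/prometheus/prometheus.yml
}

# Usage:
# check_prometheus_files /container_home/prometheus/conf
```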


3. Run Prometheus in Podman:

bash

podman unshare chown -R nobody:nobody /container_home/prometheus

podman run -d --name prometheus \
  -p 9090:9090 \
  -v /container_home/prometheus/data:/prometheus \
  -v /container_home/prometheus/conf:/etc/prometheus \
  --network monitoring-network \
  prom/prometheus


Access Prometheus web interface:


http://<your_prometheus_server_IP>:9090


4. Verify the Podman containers

At this stage, Prometheus, Alertmanager, and Blackbox Exporter should all be running in Podman containers. Check their status with:

podman ps -a
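
For a scripted check, the list of running container names can be compared against the three containers created in this guide. This is only a sketch; the function name missing_containers is hypothetical, and it reads the names from standard input so that it works with any source of names.

```shell
# Hypothetical helper: read container names (one per line) from stdin and
# print every expected name, given as an argument, that is not in the list.
missing_containers() {
  running="$(cat)"
  for name in "$@"; do
    printf '%s\n' "$running" | grep -qxF "$name" || printf '%s\n' "$name"
  done
}

# Usage (podman ps without -a lists running containers only):
# podman ps --format '{{.Names}}' | missing_containers prometheus alertmanager blackbox_exporter
```

Empty output means that all three containers are running; any name that is printed deserves a look with podman logs <name>.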


2. Create Automation Action in Callgoose SQIBS


1. Go to Actions → Select Team and choose a type (e.g., bash, python, ansible).

2. Add the Action, fill in the details (e.g., Action Name, Description), and upload your script.

3. Save the action, which can later be used in workflows.


“CALLGOOSE ACTIONS” offers many ready-made automation scripts, or you can upload your own.

Detailed explanation of Step 2.2

Click on + Add Action

Select action type: <for example bash, python, ansible, terraform, Kubernetes>

Copy from: <use this to copy an action from “CALLGOOSE ACTIONS” or from another source>

Enter Action Name:

Enter Description:

Success response: <the exit code of the program or script that indicates success, if available>

File name:

Editor or File Upload: <upload the script from your system using File Upload, or write it in the Editor>

Click on “Save”

This creates your Action profile. You can select this Action profile when you create the Incident Workflow.


Refer to the Callgoose SQIBS Automation Action documentation for more details.


3. Create Automation Profile in Callgoose SQIBS


You can run actions either through the Callgoose SQIBS Runner or through the out-of-the-box integrations of the Callgoose SQIBS Automation SaaS Platform. Follow the appropriate method as outlined in the Runner Documentation.


Detailed information about Automation Profile

A Profile in Callgoose SQIBS defines the context or environment in which your actions are executed. You can run Actions using two methods:

  • By using Callgoose SQIBS Runner
  • By using Out of the box integration from Callgoose SQIBS Automation SaaS Platform

First, create a Profile and select one of these two options. Both options are covered below.


3a) By using Callgoose SQIBS Runner


The Runner program is deployed in the customer environment, behind the firewall. When there is an automation request in the Callgoose SQIBS Automation Platform, the Runner securely connects to the Callgoose SQIBS platform, fetches the automation job details, and executes the job in the client environment. The Runner can execute any integration from the customer environment, as long as that integration is installed and configured on the server where the Runner is installed.


For example, if you want the Runner to execute ansible, terraform, kubectl, ssh, python, bash scripts, powershell, and so on, each tool must be installed on the Runner server, and the OS user that runs the Runner must have the privileges to execute those automation files.


Please refer to this documentation for more details about how to set up Callgoose SQIBS Runner in your IT environment

Please refer to this page for more information about Callgoose SQIBS Automation


3b) By using Out of the box integration from Callgoose SQIBS Automation SaaS Platform


The Callgoose SQIBS Automation Platform has several out-of-the-box integrations, for example ansible, terraform, kubectl, ssh, python, bash script, powershell, and many more.

Clients can use any of these integrations in their automation workflow to connect to their IT infrastructure. The platform securely connects to the client IT infrastructure using these integrations and executes the automation workflow accordingly.


Please refer to this page for more information about Callgoose SQIBS Automation


3c) How to create Automation profile using Callgoose SQIBS Runner


Go to Callgoose Dashboard → Automation

Click on Profile → Select your Team

Select Type → this example uses “Runner”.

Click on + Add Profiles

Select Profile type: runner

Name:

Description:


You will be prompted to download the credentials key for the newly created Runner instance. This key is required when you install the Callgoose SQIBS Runner in your environment: the Runner JAR reads these credentials from the YAML file.

Note: Store the file securely.


Click on Download to download the Callgoose SQIBS Runner.yaml file


Click on “Close” to close the window


A new option now appears in “Profile”, where you can download the Runner JAR.

Download the Runner JAR

Download the Callgoose SQIBS Runner JAR and the YAML key to set up the Callgoose SQIBS Runner deployment behind the firewall.


For more details about Callgoose SQIBS Runner installation, see

https://docs.callgoose.com/sqibs/cg_automation_runner


Go to the profile again

Select Type → this example uses “Runner”.

There you will see the previously created Runner profile.

Click on Edit

Click on “Show Internal Profiles”


Internal Profiles

Here you can see many other options such as ssh, bash, python, ansible, terraform, and Kubernetes. You can also find them under “Out of the box Integration Profile”.

Whichever Internal Profile you choose (ansible, terraform, or another), the corresponding tool must be installed and configured on the server where the Runner program is running. Ensure that the Callgoose SQIBS Runner has outbound connectivity to the Callgoose SQIBS platform. If outbound connectivity is restricted in your environment, allow outbound connections to the Callgoose SQIBS platform server IPs.


Please refer to this page for more details about the Callgoose SQIBS platform servers to allow for outbound connectivity from your environment.


In this example, “bash” is chosen as the Internal Profile

Name:

Description:

Command to Run: this example uses bash

This is the path to the executable. If bash is on the PATH of the OS user that runs the Runner, you can enter just bash; otherwise, enter the complete path, for example /opt/bin/bash

Example 2: /opt/ansible/2.17/bin/ansible
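
To decide which form to enter, you can check on the Runner server whether the tool resolves through PATH. The following is a minimal sketch; command_to_run is a hypothetical helper name, and it should be run as the same OS user that runs the Runner, because PATH differs between users.

```shell
# Hypothetical helper: print the bare command name if it resolves via PATH,
# otherwise the given full path if that file is executable.
command_to_run() {
  name="$1"
  fallback="$2"
  if command -v "$name" >/dev/null 2>&1; then
    printf '%s\n' "$name"
  elif [ -x "$fallback" ]; then
    printf '%s\n' "$fallback"
  else
    echo "neither $name nor $fallback is executable" >&2
    return 1
  fi
}

# Usage:
# command_to_run bash /opt/bin/bash
# command_to_run ansible /opt/ansible/2.17/bin/ansible
```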


Refer to the following documentation for more details about Automation, Profiles, Actions, and more

Automation

Automation Action

Automation Profile

Automation Runner Program


3d) How to add SSH keys to your server for Callgoose SQIBS Runner


Generate the SSH keys on the server where you installed the Callgoose SQIBS Runner.

The Callgoose SQIBS Runner program must have a passwordless SSH connection from that server to the server where the script will be executed.

In this example, the Callgoose SQIBS Runner server must have a passwordless SSH connection to myserver1.callgoose.com, which hosts the nginx website https://incident-auto-remediation.callgoose.com


Please refer to this documentation for more details about how to create a passwordless SSH connection

https://docs.callgoose.com/sqibs/ssh-passwordless-connection
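
After the keys are in place, the connection can be verified from the Runner server. The sketch below uses the standard OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting for a password; the function name check_passwordless_ssh and the user name autouser are illustrative assumptions.

```shell
# Hypothetical helper: succeed only if key-based login works without any prompt.
# BatchMode=yes disables password prompts; ConnectTimeout bounds the wait.
check_passwordless_ssh() {
  user="$1"
  host="$2"
  port="${3:-22}"
  ssh -o BatchMode=yes -o ConnectTimeout=5 -p "$port" "$user@$host" true
}

# Usage (run as the OS user that runs the Runner):
# check_passwordless_ssh autouser myserver1.callgoose.com 22 && echo "passwordless SSH works"
```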


3e) Server access for Callgoose SQIBS Runner


The Callgoose SQIBS Runner usually runs in your on-premises IT infrastructure, behind a firewall. In this case, the Runner needs access to the target server, as in the example below.


In this example, the Callgoose SQIBS Runner system must have access to myserver1.callgoose.com and must be able to connect to it using SSH (this example runs the Callgoose SQIBS Runner on Rocky Linux 9.x)


ssh_username : root or any other username

remote_server : myserver1.callgoose.com

ssh_remote_server_port : 22 or any other SSH port

webserver_service_name : nginx

website_url : https://incident-auto-remediation.callgoose.com


The Callgoose SQIBS Runner program executes the Incident Workflow and its bash automation script (in this example). The script connects to the server myserver1.callgoose.com, checks the website status, and restarts the nginx web service to fix the issue. The result is then reported to the Callgoose SQIBS platform, which resolves the Incident.


Refer to the documentation for more details about using Callgoose SQIBS Runner


3f) By using Out of the box integration from Callgoose SQIBS Automation SaaS Platform


Go to Callgoose Dashboard → Automation

Click on Profile → Select your Team

Click on + Add Profiles

Select Type → this example uses Bash. Many other options are available

Name:

Description:


3g) How to add SSH keys to your server for Out of the box integration


Go to Callgoose SQIBS dashboard → Automation → Profiles

You will see the option “Add the following key as an authorized SSH key in your server to work with Out of the Box integrations”

Copy that SSH public key into the authorized keys of your server so that the automation script can access it over SSH without a password


Please refer to this documentation for more details about how to create a passwordless SSH connection

https://docs.callgoose.com/sqibs/ssh-passwordless-connection


3h) Server access for Out of the box integration from Callgoose SQIBS Automation SaaS Platform


Prerequisites: you must allow the Callgoose automation platform IPs to access the server myserver1.callgoose.com via SSH


Please check this website for more information about our automation platform server IPs


When you use the Out of the box integration from the Callgoose SQIBS Automation SaaS Platform, the platform connects directly to the affected server.

In this example, the Callgoose SQIBS SaaS platform connects to the server myserver1.callgoose.com using SSH and restarts the nginx web server. This fixes the issue, and the Incident is resolved automatically.


ssh_username : root or any other username

remote_server : myserver1.callgoose.com

ssh_remote_server_port : 22

webserver_service_name : nginx

website_url : https://incident-auto-remediation.callgoose.com


4. Create Incident Workflow


4.1) In the Callgoose dashboard, go to Incident Workflow → Select Team → Add Workflow.

4.2) Select the created action, specify the necessary arguments, and save the workflow.

Refer to the detailed guide on Creating Incident Workflows.


4.3) More detailed information about steps 4.1 and 4.2


Click on Incident Workflow from the Callgoose SQIBS dashboard.

Select your Team

Click on + Add Workflow

Name:

Description:

Category:

In ACTIONS

Type: Action

Action: Click on the drop-down to choose the Action created earlier.


Callgoose provides many free Bash, Python, Ansible, Terraform, and other scripts under “CALLGOOSE ACTIONS”.


This example uses the cg_check_website_status action under bash from “CALLGOOSE ACTIONS”.


The help guide link next to “CALLGOOSE ACTIONS” gives more information about each action: what the automation script or program does and which arguments it needs.


For cg_check_website_status.bsh, the help page shows the following:


Usage: ./cg_check_website_status.bsh ssh_username remote_server ssh_remote_server_port webserver_service_name website_url


The script needs the above values as arguments:


ssh_username : root or any other username

remote_server : myserver1.callgoose.com

ssh_remote_server_port : 22 or other ssh port

webserver_service_name : nginx

website_url : https://incident-auto-remediation.callgoose.com
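
The internals of cg_check_website_status.bsh are not shown in this guide. Purely as an illustration of what a check-and-restart action taking the same five arguments might look like, here is a hypothetical sketch; it is not the Callgoose-provided script. The function names, the RECHECK_DELAY variable, and the use of systemctl on the remote host are all assumptions.

```shell
# Hypothetical sketch; NOT the Callgoose-provided cg_check_website_status.bsh.

# Succeed if the URL answers without an HTTP error (-f fails on status >= 400).
site_is_up() {
  curl -fsS -o /dev/null --max-time 10 "$1"
}

# Arguments follow the order in the usage line above:
# ssh_username remote_server ssh_remote_server_port webserver_service_name website_url
remediate_website() {
  ssh_username="$1"
  remote_server="$2"
  port="$3"
  service="$4"
  url="$5"
  if site_is_up "$url"; then
    echo "OK: $url is reachable"
    return 0
  fi
  echo "DOWN: $url, restarting $service on $remote_server"
  # Assumption: the SSH user may restart the service (a non-root user would need sudo).
  ssh -o BatchMode=yes -p "$port" "$ssh_username@$remote_server" \
    "systemctl restart $service" || return 1
  sleep "${RECHECK_DELAY:-5}"
  # The exit status of the final check is the result of the whole action.
  site_is_up "$url"
}

# Usage:
# remediate_website autouser myserver1.callgoose.com 22 nginx https://incident-auto-remediation.callgoose.com
```

An exit status of 0 after a successful re-check is the kind of value that can be entered as the "Success response" of the action, so that the workflow can tell a fixed website from a failed restart.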


After choosing the “CALLGOOSE ACTIONS” action “cg_check_website_status_action”, add the “Arguments” that this particular automation script requires. Each script has its own requirements, and you are free to add whatever arguments or variables your script needs.


Click on + inside “Arguments”

Add the values in the order listed above, starting with the SSH username (autouser in this example).

Click on + inside “Arguments” again for each remaining value, until all the arguments are entered.

Finally, the Action Arguments look like this:


autouser myserver1.callgoose.com 22 nginx https://incident-auto-remediation.callgoose.com


Click on Save to save the Incident Workflow


Refer to this website for more information about how to create an Automation workflow in Callgoose SQIBS


5. Set Up Callgoose SQIBS API Filter


5.1) Go to Services → Select Team → Add or Update API Integration.

5.2) Choose the Prometheus template and customize the filter as required.

5.3) Enable the workflow created earlier.

5.4) More detailed information about steps 5.1 to 5.3


To customize the filter and add Incident Auto Remediation, do the following:

Go to Callgoose SQIBS dashboard

Select Services → Select your Team

Click on “Add or Update API integration”

Select Integration template → Choose Prometheus if not already done; it automatically fills in the filter for you.

Click on + (plus symbol) to add a new filter content check item

Add the Payload JSON key as

"alerts".[0]."annotations"."summary"



Add the value as

Website https://incident-auto-remediation.callgoose.com is down

This value corresponds to the summary annotation of the WebsiteDown alert rule. Replace https://incident-auto-remediation.callgoose.com with your website
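
To see where the filter key points, it helps to look at the shape of the JSON body that Alertmanager posts to a webhook receiver. The sketch below prints a trimmed example (real payloads carry more fields, such as groupKey, startsAt, and externalURL) together with the summary text that the WebsiteDown rule renders for a given instance. The function names are hypothetical. Whether the filter comparison is case-sensitive is not stated in this guide, so matching the case of the rendered summary exactly is the safer assumption.

```shell
# Hypothetical helper: the summary rendered by the WebsiteDown rule
# ("Website {{ $labels.instance }} is down") for a given instance label.
rendered_summary() {
  printf 'Website %s is down\n' "$1"
}

# Hypothetical helper: print a trimmed example of an Alertmanager webhook body.
# The filter key "alerts".[0]."annotations"."summary" selects the summary below.
sample_webhook_payload() {
  cat <<EOF
{
  "version": "4",
  "status": "firing",
  "receiver": "callgoose-sqibs",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "WebsiteDown", "instance": "$1", "severity": "critical"},
      "annotations": {"summary": "$(rendered_summary "$1")"}
    }
  ]
}
EOF
}

# Usage:
# sample_webhook_payload https://incident-auto-remediation.callgoose.com
```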


Enable the Incident Workflow

Select Workflow: choose the workflow created earlier using “CALLGOOSE ACTIONS”


If success then make incident resolved: select this option


Waiting time for escalate to escalation policies: 4 minutes. With this setting, the incident is escalated only if the automation workflow cannot fix the issue or complete the task within 4 minutes. If the automation workflow fails, the incident is escalated immediately.


Click on “Save”


For more information, refer to the Callgoose API Documentation.


Callgoose SQIBS API Token Documentation

Callgoose SQIBS API Endpoint Documentation

API Filter Instructions and FAQ

How to Send API


6. Generate Incident and Test Callgoose SQIBS Incident Auto Remediation


1. Stop the web server (e.g., nginx) on the monitored server:

bash

systemctl stop nginx

2. Verify that the alert is triggered in Prometheus, which should send it to Callgoose SQIBS.

3. Callgoose will automatically initiate the remediation process and resolve the incident if successful.

You can monitor the workflow log in Callgoose SQIBS to check the status of the incident remediation.


Open the incident created by Prometheus in the Callgoose dashboard → Click on “Show Workflow Log”. There you can see details about the performed Workflows and the Result of Actions.
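
To confirm from the command line that the alert is pending or firing, you can query the Prometheus HTTP API endpoint /api/v1/alerts. This is a minimal sketch: the function name alert_is_active is hypothetical, and the grep pattern assumes the compact JSON formatting (no spaces after colons) that Prometheus normally returns.

```shell
# Hypothetical helper: succeed if the named alert is listed by Prometheus
# as pending or firing. Assumes compact JSON in the API response.
alert_is_active() {
  prom_url="$1"
  alert_name="$2"
  curl -fsS "$prom_url/api/v1/alerts" | grep -q "\"alertname\":\"$alert_name\""
}

# Usage (after stopping nginx; the rule's "for: 1m" delays firing by a minute):
# alert_is_active http://localhost:9090 WebsiteDown && echo "WebsiteDown is active"
```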

