Prometheus - Incident Auto Remediation
This guide explains, step by step, how to set up a Prometheus monitoring system and use Callgoose SQIBS Incident Auto Remediation to automatically resolve incidents created from Prometheus Alertmanager alerts.
Overview of Incident Auto Remediation Process
This process includes the following high-level steps:
- Prometheus Alertmanager sends alerts to the Callgoose SQIBS API.
- The Callgoose SQIBS API generates incidents based on the predefined filter values created by the Callgoose SQIBS user.
- Callgoose SQIBS invokes the automation workflow created by the user to resolve the incident.
Detailed Steps for Setting Up Prometheus and Callgoose SQIBS Incident Auto Remediation
1. Set Up Prometheus Using Podman
a) Configure Podman as a Rootless Container in Rocky Linux 9.x
1. Install Podman and enable linger for a normal user (rootless container):
    sudo dnf install -y podman
    sudo useradd podmanuser
    sudo loginctl enable-linger podmanuser
2. Grant sudo privileges to podmanuser:
    sudo vim /etc/sudoers.d/podmanuser
Add the following line to the file:
    podmanuser ALL = NOPASSWD:ALL
Then restrict the file permissions:
    sudo chmod 0440 /etc/sudoers.d/podmanuser
3. Login and verify linger status:
    su - podmanuser
    loginctl user-status | grep Linger
The output should show that linger is enabled.
4. Set up runtime directory and ensure correct UID/GID mappings:
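Find the UID of podmanuser:
    id -a podmanuser
Add the following line to /home/podmanuser/.bash_profile, replacing <UID> with the UID reported above:
    export XDG_RUNTIME_DIR=/run/user/<UID>/
Log out and log back in so that .bash_profile takes effect, or run source .bash_profile in the current terminal.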
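The UID does not have to be copied by hand; it can be derived with id -u. A minimal sketch, assuming the lines are run as podmanuser (or appended to that user's .bash_profile):

```bash
# Derive the runtime directory from the current user's numeric UID
# instead of typing the UID manually.
XDG_RUNTIME_DIR="/run/user/$(id -u)"
export XDG_RUNTIME_DIR
echo "XDG_RUNTIME_DIR=${XDG_RUNTIME_DIR}"
```

Because the value is computed at login, the same line keeps working if the user is ever recreated with a different UID.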
b) Create Directories and Monitoring Network
- Create necessary directories for containers:
    sudo mkdir -p /container_home/alertmanager/conf
    sudo mkdir -p /container_home/blackbox_exporter/conf
    sudo mkdir -p /container_home/prometheus/{conf,data}
    sudo chown -R podmanuser:podmanuser /container_home/{prometheus,blackbox_exporter,alertmanager}
- Set up the monitoring network for containers:
    podman network create monitoring-network
Use the commands below to verify the network:
    podman network ls
    podman network inspect monitoring-network
c) Configure and Run Alertmanager
1. Create the /container_home/alertmanager/conf/alertmanager.yml configuration file for Alertmanager:
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 30s
      repeat_interval: 10m
      # A default receiver
      receiver: 'callgoose-sqibs'
    receivers:
      - name: 'callgoose-sqibs'
        webhook_configs:
          - url: 'https://xxxxx.callgoose.com/xxx/xx/xxxxxx?from=prometheus&token=ReplaceWithCGToken'
            send_resolved: true
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
2. Run Alertmanager inside a Podman container:
Alertmanager and Prometheus run as the nobody user inside their Podman containers, so nobody:nobody must own /container_home/alertmanager:
    podman unshare chown -R nobody:nobody /container_home/alertmanager
Then start the container:
    podman run -d --name alertmanager \
      -p 9093:9093 \
      -v /container_home/alertmanager/conf/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
      --network monitoring-network \
      prom/alertmanager
If Podman asks you to choose a registry for the image, select the one you prefer. Here I'm choosing docker.io.
Refer to the following documentation for more information about the Callgoose SQIBS API token and API endpoint:
Callgoose SQIBS API Token Documentation
Callgoose SQIBS API Endpoint Documentation
API Filter Instructions and FAQ
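Once Alertmanager is running, the webhook receiver can be exercised without waiting for a real outage by posting a test alert to Alertmanager's v2 API. The sketch below is illustrative only: the function names and the test alert's labels are hypothetical, and it assumes Alertmanager is published on localhost:9093 as in the podman run command above.

```bash
# Hypothetical helpers; the label and annotation values are test data.

# Print a one-element JSON array in the format accepted by POST /api/v2/alerts.
build_test_alert() {
  local name="${1:-WebhookSmokeTest}"
  cat <<EOF
[
  {
    "labels": {
      "alertname": "${name}",
      "severity": "warning",
      "instance": "manual-test"
    },
    "annotations": {
      "summary": "Manual test alert sent from the command line"
    }
  }
]
EOF
}

# Post the test alert to Alertmanager (assumed base URL: http://localhost:9093).
send_test_alert() {
  local base_url="${1:-http://localhost:9093}"
  build_test_alert "${2:-WebhookSmokeTest}" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "${base_url}/api/v2/alerts"
}
```

Running send_test_alert should cause Alertmanager to forward a notification to the configured webhook after the group_wait period. Depending on your API filter, Callgoose SQIBS may create an incident for it.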
d) Set Up Blackbox Exporter
1. Create the /container_home/blackbox_exporter/conf/config.yml for Blackbox Exporter:
    modules:
      http_2xx:
        prober: http
        timeout: 5s
        http:
          preferred_ip_protocol: ip4 # Force use of IPv4
          valid_http_versions: ["HTTP/1.1", "HTTP/2", "HTTP/2.0"]
          valid_status_codes: [200, 201, 202, 204, 401, 403, 404, 301, 302, 303, 307] # Accept only these specific status codes
          method: GET
          fail_if_ssl: false # Allow SSL
          fail_if_not_ssl: false # Allow non-SSL if needed
          tls_config:
            insecure_skip_verify: true # This skips SSL verification
2. Run Blackbox Exporter in Podman:
    podman unshare chown -R root:root /container_home/blackbox_exporter
    podman run -d --name blackbox_exporter \
      -p 9115:9115 \
      -v /container_home/blackbox_exporter/conf/config.yml:/etc/blackbox_exporter/config.yml \
      --network monitoring-network \
      prom/blackbox-exporter
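Before wiring the exporter into Prometheus, a probe can be triggered by hand through its /probe endpoint. This is a minimal sketch with hypothetical function names, assuming the exporter is published on localhost:9115 as in the command above:

```bash
# Read Blackbox Exporter probe output on stdin and print the probe_success value.
# Comment lines (# HELP, # TYPE) are skipped because their first field is "#".
probe_success_value() {
  awk '$1 == "probe_success" { print $2 }'
}

# Probe a target through the exporter and print 1 (success) or 0 (failure).
probe_target() {
  local target="$1" module="${2:-http_2xx}"
  curl -fsS -G "http://localhost:9115/probe" \
    --data-urlencode "target=${target}" \
    --data-urlencode "module=${module}" | probe_success_value
}
```

For example, probe_target https://incident-auto-remediation.callgoose.com should print 1 while the website answers with one of the status codes listed in valid_status_codes.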
e) Set Up Prometheus
1. Create the Prometheus configuration file /container_home/prometheus/conf/prometheus.yml:
    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 10s # Evaluate rules every 10 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - /etc/prometheus/alert_rules.yml
      # - "first_rules.yml"
      # - "second_rules.yml"

    # Scrape configurations. The first job scrapes Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: "prometheus"
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
        static_configs:
          - targets: ["localhost:9090"]

      # Assumes a node_exporter container on the same network (not covered in this guide).
      - job_name: 'node_exporter'
        static_configs:
          - targets: ['node_exporter:9100']

      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx] # Change this to the relevant module
        static_configs:
          - targets:
              - https://incident-auto-remediation.callgoose.com
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance # Set the actual website as the instance label
          - target_label: __address__
            replacement: blackbox_exporter:9115 # Blackbox Exporter address
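The three relabel steps in the blackbox job are easier to follow with a concrete result. The sketch below uses a hypothetical helper to print the request that Prometheus effectively sends for one target. It only illustrates the relabeling outcome and is not part of Prometheus; the real request URL-encodes the target value.

```bash
# Show the scrape request that results from the relabel_configs above.
effective_probe_url() {
  local target="$1"                               # original __address__
  local module="${2:-http_2xx}"                   # from params.module
  local exporter="${3:-blackbox_exporter:9115}"   # replacement for __address__
  local metrics_path="/probe"
  # Step 1: __address__ is copied to __param_target (becomes ?target=...)
  # Step 2: __param_target is copied to the instance label (not part of the URL)
  # Step 3: __address__ is replaced with the exporter address
  printf 'http://%s%s?module=%s&target=%s\n' \
    "$exporter" "$metrics_path" "$module" "$target"
}

effective_probe_url "https://incident-auto-remediation.callgoose.com"
```

The output shows that Prometheus scrapes the exporter rather than the website, while the instance label still carries the website URL used later in the alert summary.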
2. Create alert rule file /container_home/prometheus/conf/alert_rules.yml
    groups:
      - name: blackbox_alerts
        rules:
          - alert: WebsiteDown
            expr: probe_success == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Website {{ $labels.instance }} is down"
              description: "The website {{ $labels.instance }} has been down for more than 1 minute."
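Both files can be checked for syntax errors before the container is started by using promtool, which is included in the prom/prometheus image. This is a sketch under the assumption that the configuration lives in /container_home/prometheus/conf as described above; the function name is hypothetical.

```bash
# Validate prometheus.yml and the rule files it references with promtool
# from the prom/prometheus image (one-off container, removed afterwards).
check_prometheus_config() {
  local conf_dir="${1:-/container_home/prometheus/conf}"
  podman run --rm \
    --entrypoint promtool \
    -v "${conf_dir}:/etc/prometheus" \
    prom/prometheus \
    check config /etc/prometheus/prometheus.yml
}
```

Calling check_prometheus_config returns a non-zero exit status if promtool reports a problem, which makes it usable as a guard before podman run.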
3. Run Prometheus in Podman:
    podman unshare chown -R nobody:nobody /container_home/prometheus
    podman run -d --name prometheus \
      -p 9090:9090 \
      -v /container_home/prometheus/data:/prometheus \
      -v /container_home/prometheus/conf:/etc/prometheus \
      --network monitoring-network \
      prom/prometheus
Access Prometheus web interface:
http://<your_prometheus_server_IP>:9090
4. Verify the Podman containers
At this stage, Prometheus, Alertmanager, and Blackbox Exporter should be running in Podman containers. Check their status with:
    podman ps -a
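A running container is not necessarily a healthy service, so it can help to query each component's /-/healthy endpoint as well. A minimal sketch with hypothetical function names, assuming the three ports are published on localhost as in the commands above:

```bash
# Return 0 unless the request fails or the server answers with an HTTP error (4xx/5xx).
is_healthy() {
  curl -fsS -o /dev/null --max-time 5 "$1"
}

# Check all three services and return non-zero if any of them is unhealthy.
check_stack() {
  local failed=0 entry name url
  for entry in \
    "prometheus=http://localhost:9090/-/healthy" \
    "alertmanager=http://localhost:9093/-/healthy" \
    "blackbox_exporter=http://localhost:9115/-/healthy"; do
    name="${entry%%=*}"
    url="${entry#*=}"
    if is_healthy "$url"; then
      echo "OK   ${name}"
    else
      echo "FAIL ${name} (${url})"
      failed=1
    fi
  done
  return "$failed"
}
```

Run check_stack on the Podman host; any FAIL line points at the container whose logs (podman logs <name>) should be inspected first.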
2. Create Automation Action in Callgoose SQIBS
1. Go to Actions → Select Team and choose a type (e.g., bash, python, ansible).
2. Add the Action, fill in the details (e.g., Action Name, Description), and upload your script.
3. Save the action, which can later be used in workflows.
"CALLGOOSE ACTIONS" has many automation scripts available, or you can upload your own.
Detailed explanation of Step 2.2
Click on + Add Action
Select action type: <for example bash, python, ansible, terraform, Kubernetes, etc.>
Copy from: <use this if you want to copy an action from "CALLGOOSE ACTIONS" or from others>
Enter Action Name:
Enter Description:
Success response: <the exit code of the program or script that indicates success, if available>
File name:
Editor or File Upload: <you can upload the script from your system using File Upload or write it in the Editor>
Click on "Save"
This creates your Action profile. You can call this Action profile when you create the Incident Workflow.
Refer to the Callgoose SQIBS Automation Action for more details.
3. Create Automation Profile in Callgoose SQIBS
You can run actions either by using the Callgoose Runner or by using the out-of-the-box integration from the Callgoose SQIBS Automation SaaS Platform. Follow the appropriate method as outlined in the Runner Documentation.
Detailed information about Automation Profile
A Profile in Callgoose SQIBS defines the context or environment in which your actions are executed. You can run Actions using one of two methods:
- By using Callgoose SQIBS Runner
- By using Out of the box integration from Callgoose SQIBS Automation SaaS Platform
First, we need to create a Profile.
You need to select one option. We will go through both options here.
3a) By using Callgoose SQIBS Runner
The Runner program is deployed in the customer environment behind the firewall. When there is an automation request in the Callgoose SQIBS Automation Platform, the Runner program securely connects to the Callgoose SQIBS platform, fetches the automation job details, and executes the job in the client environment. The Runner program can execute any of the integrations from the customer environment as long as the integration is installed and configured on the server where the Runner program is installed.
For example, if you want the Runner program to execute ansible, terraform, kubectl, ssh, python, bash scripts, powershell, and so on, each tool must be installed on the Runner server, and the Runner program's OS user must have the privilege to execute those automation files.
Please refer to this documentation for more details about how to set up Callgoose SQIBS Runner in your IT environment.
Please refer to this page for more information about Callgoose SQIBS Automation.
3b) By using Out of the box integration from Callgoose SQIBS Automation SaaS Platform
The Callgoose SQIBS Automation Platform has several out-of-the-box integrations, for example ansible, terraform, kubectl, ssh, python, bash script, powershell, and many more.
Clients can use any of these integrations in their automation workflow to connect to their IT infrastructure. The platform securely connects to the client's IT infrastructure using these integrations and executes the automation workflow accordingly.
Please refer to this page for more information about Callgoose SQIBS Automation.
3c) How to create Automation profile
Using Callgoose SQIBS Runner:
Go to Callgoose Dashboard → Automation
Click on Profile → Select your Team
Select Type → here I'm choosing "Runner".
Click on + Add Profiles
Select Profile type: runner
Name:
Description:
You will be prompted to download the key. This key is required when you install the Callgoose SQIBS Runner in your environment.
Download the credentials key for the newly created runner instance. You will need these credentials, in the YAML file, to run the Runner JAR.
Note: Store the file securely.
Click on Download to download the Callgoose SQIBS Runner.yaml file.
Click on "Close" to close the window.
A new option now appears under "Profile", where you can download the JAR.
Download the Runner JAR
Download the Callgoose SQIBS Runner JAR and the YAML key to set up the Callgoose SQIBS Runner deployment option behind the firewall.
For more details about installing the Callgoose SQIBS Runner, refer to:
https://docs.callgoose.com/sqibs/cg_automation_runner
Go to the profile again
Select Type → here I'm choosing "Runner".
There you will see the previously created Runner profile.
Click on Edit
Click on "Show Internal Profiles".
The Internal Profiles section lists many other options, such as ssh, bash, python, ansible, terraform, and kubernetes.
You can find these in "Out of the box Integration Profile". Ensure that whichever Internal Profile you choose (ansible, terraform, or another) is installed and configured on the server where the Runner program is running. Ensure that the Callgoose SQIBS Runner has outbound connectivity to the Callgoose SQIBS platform. If outbound connectivity is restricted in your environment, the Callgoose SQIBS platform server IPs must be allowed for outbound connections.
Please refer to this page for more details about the Callgoose SQIBS platform servers for outbound connectivity from your environment.
In this example, I'm choosing "bash" as the Internal Profile.
Name:
Description:
Command to Run: here I'm typing bash
This is the path to the bash executable. If bash is on the PATH, you can enter just bash. If it is not on the PATH, enter the complete path, for example /opt/bin/bash
Example 2: /opt/ansible/2.17/bin/ansible
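To decide what to enter in "Command to Run", you can check on the Runner server whether the command resolves through the PATH. A small sketch with a hypothetical helper name; run it as the OS user that runs the Runner program, because the PATH can differ between users:

```bash
# Print the resolved path of a command, or explain that a full path is needed.
resolve_command() {
  local name="$1" path
  if path="$(command -v "$name")"; then
    printf '%s\n' "$path"
  else
    echo "'$name' is not on PATH; enter its full path in \"Command to Run\"" >&2
    return 1
  fi
}

resolve_command bash
```

If the helper prints a path, the bare command name is sufficient. If it reports that the command is not on the PATH, locate the binary and enter the complete path instead.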
Refer to the documentation for more details about Automation, Profiles, Actions, and more.
3d) How to add SSH keys to your server for Callgoose SQIBS Runner
You need to generate the SSH keys on the server where you installed the Callgoose SQIBS Runner.
The Callgoose SQIBS Runner program must have a passwordless SSH connection from that server to the server where the script will be executed.
In this example, the Callgoose SQIBS Runner server must have a passwordless SSH connection to myserver1.callgoose.com, where the nginx website https://incident-auto-remediation.callgoose.com is running.
Please refer to this documentation for more details about how to create a passwordless SSH connection:
https://docs.callgoose.com/sqibs/ssh-passwordless-connection
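As a rough illustration of what a passwordless setup involves, the sketch below uses the standard OpenSSH tools. The function names, the ed25519 key type, and the default key path are assumptions made for this example; the linked documentation remains the reference for the supported procedure.

```bash
# Create an ed25519 key pair without a passphrase if none exists yet.
ensure_ssh_key() {
  local key_file="${1:-$HOME/.ssh/id_ed25519}"
  mkdir -p "$(dirname "$key_file")"
  chmod 700 "$(dirname "$key_file")"
  [ -f "$key_file" ] || ssh-keygen -t ed25519 -N '' -f "$key_file"
}

# Install the public key on the remote server (asks for the password once).
install_ssh_key() {
  local user="$1" host="$2" port="${3:-22}" key_file="${4:-$HOME/.ssh/id_ed25519}"
  ssh-copy-id -i "${key_file}.pub" -p "$port" "${user}@${host}"
}

# BatchMode=yes makes ssh fail instead of prompting for a password,
# so a zero exit status means key-based login works.
verify_passwordless_ssh() {
  local user="$1" host="$2" port="${3:-22}"
  ssh -o BatchMode=yes -o ConnectTimeout=5 -p "$port" "${user}@${host}" true
}
```

Run the three functions in order as the Runner's OS user, for example ensure_ssh_key, then install_ssh_key autouser myserver1.callgoose.com 22, then verify_passwordless_ssh autouser myserver1.callgoose.com 22.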
3e) Server access for Callgoose SQIBS Runner
When you use the Callgoose SQIBS Runner, it usually runs in your on-premises IT infrastructure behind a firewall. In this case, the Callgoose SQIBS Runner needs access to the server, as in the example described here.
In this example, the Callgoose SQIBS Runner system must have access to myserver1.callgoose.com and be able to connect to it using SSH (I'm using Rocky Linux 9.x in this example for running the Callgoose SQIBS Runner).
ssh_username : root or any other username
remote_server : myserver1.callgoose.com
ssh_remote_server_port : 22 or any other SSH port
webserver_service_name : nginx
website_url : https://incident-auto-remediation.callgoose.com
The Callgoose SQIBS Runner program executes the Incident Workflow and its bash automation script (in this example). The script connects to the server myserver1.callgoose.com, checks the website status, and restarts the nginx web service to fix the issue. It then reports back to the Callgoose SQIBS platform, which resolves the incident.
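The check-and-restart flow described above can be sketched as a short bash function that takes the same five parameters. This is an illustration only and is not the cg_check_website_status.bsh script shipped by Callgoose. The function names and the RECHECK_DELAY variable are hypothetical, and the sketch assumes the SSH user is allowed to run systemctl restart on the remote server (a non-root user would typically need sudo).

```bash
# Illustrative only: NOT the script provided under "CALLGOOSE ACTIONS".
RECHECK_DELAY="${RECHECK_DELAY:-5}"   # seconds to wait after the restart

# Succeed if the URL answers without an HTTP error within 10 seconds.
website_is_up() {
  curl -fsS -o /dev/null --max-time 10 "$1"
}

remediate_website() {
  local ssh_username="$1" remote_server="$2" ssh_port="$3" service="$4" website_url="$5"

  if website_is_up "$website_url"; then
    echo "Website is already up; nothing to do."
    return 0
  fi

  echo "Website is down; restarting ${service} on ${remote_server}."
  ssh -o BatchMode=yes -p "$ssh_port" "${ssh_username}@${remote_server}" \
    "systemctl restart ${service}" || return 1

  sleep "$RECHECK_DELAY"
  if website_is_up "$website_url"; then
    echo "Website recovered."
    return 0
  fi
  echo "Website is still down after restarting ${service}." >&2
  return 1
}
```

The function returns 0 only when the website answers again, so the exit status can serve as the success indicator of an automation action.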
Refer to the documentation for more details about using Callgoose SQIBS Runner
3f) By using Out of the box integration from Callgoose SQIBS Automation SaaS Platform
Go to Callgoose Dashboard → Automation
Click on Profile → Select your Team
Click on + Add Profiles
Select Type → here I'm choosing Bash. There are many options available.
Name:
Description:
3g) How to add SSH keys to your server for Out of the box integration
Go to Callgoose SQIBS dashboard → Automation → Profiles
You will see the option "Add the following key as an authorized SSH key in your server to work with Out of the Box integrations".
Copy that SSH public key to your server so that the automation script can access the server over passwordless SSH.
Please refer to this documentation for more details about how you can create an ssh passwordless connection
https://docs.callgoose.com/sqibs/ssh-passwordless-connection
3h) Server access for Out of the box integration from Callgoose SQIBS Automation SaaS Platform
Prerequisites: You must allow our IPs to access the server myserver1.callgoose.com via SSH.
Please check this website for more information about our automation platform server IPs.
When you use the Out of the box integration from the Callgoose SQIBS Automation SaaS Platform, the platform connects directly to the affected server.
In this example, the Callgoose SQIBS SaaS platform connects to the server myserver1.callgoose.com using SSH and restarts the nginx web server. That fixes the issue, and the incident is resolved automatically.
ssh_username : root or any other username
remote_server : myserver1.callgoose.com
ssh_remote_server_port : 22
webserver_service_name : nginx
website_url : https://incident-auto-remediation.callgoose.com
4. Create Incident Workflow
4.1) In the Callgoose dashboard, go to Incident Workflow → Select Team → Add Workflow.
4.2) Select the created action, specify the necessary arguments, and save the workflow.
Refer to the detailed guide on Creating Incident Workflows.
4.3) More detailed information about steps 4.1 and 4.2
Click on Incident Workflow from the Callgoose SQIBS dashboard.
Select your Team
Click on + Add Workflow
Name:
Description:
Category:
In ACTIONS
Type: Action
Action: Click on drop-down to choose the Action created earlier.
Callgoose provides many free Bash, Python, Ansible, Terraform, and other scripts under "CALLGOOSE ACTIONS".
Here I'm choosing the cg_check_website_status action under bash from "CALLGOOSE ACTIONS".
You can find more information about "CALLGOOSE ACTIONS" in the help guide link next to "CALLGOOSE ACTIONS".
It describes what each automation script or program does and which arguments it needs.
In this case, I'm choosing cg_check_website_status.bsh, and its help page shows the following:
Usage: ./cg_check_website_status.bsh ssh_username remote_server ssh_remote_server_port webserver_service_name website_url
It needs the following arguments:
ssh_username : root or any other username
remote_server : myserver1.callgoose.com
ssh_remote_server_port : 22 or other ssh port
webserver_service_name : nginx
website_url : https://incident-auto-remediation.callgoose.com
After choosing the action "cg_check_website_status_action" from "CALLGOOSE ACTIONS", you need to add the "Arguments" required by this particular automation script. Each script has its own requirements, so the arguments depend entirely on the script. You are free to add whatever arguments or variables the script needs.
Click on + inside "Arguments"
Add the values below in the order shown:
autouser
Click on + inside "Arguments" again and keep adding until all the argument entries are complete.
Finally, the Action Arguments look like this:
autouser myserver1.callgoose.com 22 nginx https://incident-auto-remediation.callgoose.com
Click on Save to save the Incident Workflow
Refer to this website for more information about how to create an Automation workflow in Callgoose SQIBS
5. Set Up Callgoose SQIBS API Filter
5.1) Go to Services → Select Team → Add or Update API Integration.
5.2) Choose the Prometheus template and customize the filter as required.
5.3) Enable the workflow created earlier.
5.4) More detailed information about steps 5.1 to 5.3
To customize the filter and add Incident Auto Remediation, do the following:
Go to Callgoose SQIBS dashboard
Select Services → Select your Team
Click on "Add or Update API integration"
Select Integration template → Choose Prometheus if not already done, and it will automatically fill in the filter for you.
Click on + (plus symbol) to add a new filter content check item
Add the Payload JSON key as
"alerts".[0]."annotations"."summary"
Add the value as
Website https://incident-auto-remediation.callgoose.com is down
Replace https://incident-auto-remediation.callgoose.com with your website. The value corresponds to the summary annotation defined in alert_rules.yml.
Enable the Incident Workflow
Select Workflow: here you choose the workflow we created using "CALLGOOSE ACTIONS".
"If success then make incident resolved": select this option.
Waiting time for escalate to escalation policies: 4 minutes (here I'm choosing 4 minutes). This means the incident is escalated only after 4 minutes if the automation workflow can't fix the issue or can't complete the task in time. If the automation workflow fails, the incident is escalated immediately.
Click on "Save"
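To see what the filter key "alerts".[0]."annotations"."summary" points at, it helps to look at the shape of the JSON that Alertmanager posts to a webhook. The sample below is abbreviated and hand-written for the WebsiteDown rule in this guide; a real payload contains additional fields. The function name is hypothetical, and the sed expression is a line-based shortcut that only works for this pretty-printed sample.

```bash
# Print an abbreviated sample of the Alertmanager webhook payload.
sample_webhook_payload() {
  cat <<'EOF'
{
  "version": "4",
  "status": "firing",
  "receiver": "callgoose-sqibs",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "WebsiteDown",
        "instance": "https://incident-auto-remediation.callgoose.com",
        "job": "blackbox",
        "severity": "critical"
      },
      "annotations": {
        "summary": "Website https://incident-auto-remediation.callgoose.com is down",
        "description": "The website https://incident-auto-remediation.callgoose.com has been down for more than 1 minute."
      }
    }
  ]
}
EOF
}

# Print the summary of the first alert, the value the filter key refers to.
sample_webhook_payload | sed -n 's/^ *"summary": "\(.*\)",$/\1/p'
```

If jq is installed, the equivalent and more robust extraction is jq -r '.alerts[0].annotations.summary'.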
For more information, refer to the Callgoose API Documentation.
Callgoose SQIBS API Token Documentation
Callgoose SQIBS API Endpoint Documentation
API Filter Instructions and FAQ
6. Generate Incident and Test Callgoose SQIBS Incident Auto Remediation
1. Stop the web server (e.g., nginx) on the monitored server:
    systemctl stop nginx
2. Verify that the alert is triggered in Prometheus, which should send it to Callgoose SQIBS.
3. Callgoose will automatically initiate the remediation process and resolve the incident if successful.
You can monitor the workflow log in Callgoose SQIBS to check the status of the incident remediation.
Open the incident created by Prometheus in the Callgoose dashboard → Click on "Show Workflow Log". There you can see details about the workflows that were run and the results of their actions.
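Whether the alert has actually reached the firing state can also be checked from the command line by querying Prometheus for its built-in ALERTS series. This is a minimal sketch with hypothetical function names, assuming Prometheus is published on localhost:9090. The JSON handling is deliberately crude (string matching instead of a JSON parser).

```bash
# Read a /api/v1/query response on stdin; succeed only if the query
# succeeded and returned at least one series.
query_has_result() {
  local body
  body="$(tr -d ' \n')"
  case "$body" in
    *'"status":"success"'*) ;;
    *) return 1 ;;
  esac
  case "$body" in
    *'"result":[]'*) return 1 ;;
  esac
  return 0
}

# Succeed if the named alert is currently firing in Prometheus.
alert_is_firing() {
  local name="$1" base_url="${2:-http://localhost:9090}"
  curl -fsS -G "${base_url}/api/v1/query" \
    --data-urlencode "query=ALERTS{alertname=\"${name}\",alertstate=\"firing\"}" |
    query_has_result
}
```

After stopping nginx, alert_is_firing WebsiteDown should start succeeding once the for: 1m period of the rule has elapsed, and fail again after the remediation has restored the website.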