Troubleshooting tips ¶
This guide contains general tips on how to investigate an application deployment that doesn't work correctly.
How to troubleshoot a failed Helm release installation?
¶
Symptom
The install script fails when installing a Helm release. This applies to all DC products. You will see an error similar to the following:
module.confluence[0].helm_release.confluence: Still creating... [20m10s elapsed]
Warning: Helm release "confluence" was created but has a failed status. Use the `helm` command to investigate the error, correct it, then run Terraform again.
with module.confluence[0].helm_release.confluence,
on modules/products/confluence/helm.tf line 4, in resource "helm_release" "confluence":
4: resource "helm_release" "confluence" {
Error: timed out waiting for the condition
with module.confluence[0].helm_release.confluence,
on modules/products/confluence/helm.tf line 4, in resource "helm_release" "confluence":
4: resource "helm_release" "confluence" {
Releasing state lock. This may take a few moments...
Helm gives up waiting for a successful release. Usually this means that the Confluence (or other product) pod failed to pass its readiness probe, or that the pod is stuck in a Pending state.
Solution
To troubleshoot the error, run the following script:
scripts/collect_k8s_logs.sh atlas-dcapt-confluence-small-cluster us-east-2 /path/to/local/directory
The cluster name and region may differ (look at the environment name and region in your config.tfvars). For example, if your environment_name is dcapt-confluence-small, then your cluster name is atlas-dcapt-confluence-small-cluster. The last argument is the destination path for the tar.gz archive with logs that the script produces.
Share the archive in the #data-center-app-performance-toolkit Slack channel along with your support request. You can also look at the pod and its logs, e.g.:
confluence-0_log.log
confluence-0_describe.log
Odds are that the logs will shed some light on why the pod isn't ready. The product_describe.log file contains K8S events that may also help explain why the pod isn't in a Running state.
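If you prefer to inspect the cluster directly instead of the archive, the same information can be pulled with kubectl (the atlassian namespace and the confluence-0 pod are the ones used in the examples above; substitute your own):
kubectl describe pod confluence-0 -n atlassian
kubectl logs confluence-0 -n atlassian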
It's also a good idea to get logs that are not sent to stdout/err:
kubectl exec confluence-0 -n atlassian -i -t -- cat /var/atlassian/application-data/confluence/logs/atlassian-confluence.log
Typically, if the pod is Running but not marked as Ready, it's the application that failed to start, i.e. it isn't an infrastructure issue.
How to troubleshoot instances that failed to join the Kubernetes cluster
¶
Symptom
When Terraform creates the EKS infrastructure, the EKS cluster (control plane) is created first. Once the cluster has been created, a NodeGroup (backed by an ASG) is created, and EC2 instances join the cluster as worker nodes.
If a node fails to join its cluster, you will typically see the following error:
Error: waiting for EKS Node Group (atlas-dcapt-jira-small-cluster:appNode-t3_xlarge-20240521085758213900000012) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
* i-0dd1a9dc64303a10b: NodeCreationFailure: Instances failed to join the kubernetes cluster
with module.base-infrastructure.module.eks.module.eks.module.eks_managed_node_group["appNodes"].aws_eks_node_group.this[0],
on .terraform/modules/base-infrastructure.eks.eks/modules/eks-managed-node-group/main.tf line 272, in resource "aws_eks_node_group" "this":
272: resource "aws_eks_node_group" "this" {
Solution
The node join failure is commonly related to the AWS Security Token Service (STS): if the regional STS endpoint is deactivated, worker nodes cannot authenticate with the cluster. Enable STS for the deployment region (IAM > Account settings in the AWS console). After enabling STS, destroy the existing environment and re-run the installation.
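For more detail on why the nodes failed to join, you can inspect the node group health issues with the AWS CLI; the cluster and node group names below are the ones from the example error, so substitute your own:
aws eks describe-nodegroup --cluster-name atlas-dcapt-jira-small-cluster --nodegroup-name appNode-t3_xlarge-20240521085758213900000012 --query 'nodegroup.health.issues'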
How to fix 'exec plugin is configured to use API version' error?
¶
Symptom
When running the install.sh script, the installation fails with an error:
module.base-infrastructure.kubernetes_namespace.products: Creating...
module.base-infrastructure.module.ingress.helm_release.ingress: Creating...
Error: Post "https://1D2E0AC7AE5EC290740D816BD53A68AB.gr7.us-east-1.eks.amazonaws.com/api/v1/namespaces": getting credentials: exec plugin is configured to use API version client.authentication.k8s.io/v1beta1, plugin returned version client.authentication.k8s.io/v1alpha1
with module.base-infrastructure.kubernetes_namespace.products,
on modules/common/main.tf line 43, in resource "kubernetes_namespace" "products":
43: resource "kubernetes_namespace" "products" {
Error: Kubernetes cluster unreachable: Get "https://1D2E0AC7AE5EC290740D816BD53A68AB.gr7.us-east-1.eks.amazonaws.com/version": getting credentials: decoding stdout: no kind "ExecCredential" is registered for version "client.authentication.k8s.io/v1alpha1" in scheme "pkg/runtime/scheme.go:100"
with module.base-infrastructure.module.ingress.helm_release.ingress,
on modules/AWS/ingress/main.tf line 44, in resource "helm_release" "ingress":
44: resource "helm_release" "ingress" {
The error is caused by an API version mismatch: the AWS CLI returns v1alpha1 credentials, while the Kubernetes and Helm providers expect v1beta1.
Solution
Update AWS CLI to the most recent version.
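To confirm which authentication API version your client produces, you can inspect the token the AWS CLI generates for the cluster (the cluster name is a placeholder; a current AWS CLI v2 returns client.authentication.k8s.io/v1beta1):
aws eks get-token --cluster-name <cluster-name> --query apiVersion --output text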
How do I uninstall an environment using a different Terraform configuration file?
¶
Symptom
If you try to uninstall an environment by using a different configuration file than the one you used to install it or by using a different version of the code, you may encounter some issues during uninstallation. In most cases, Terraform reports that the resource cannot be removed because it's in use.
Solution
Identify the resource and delete it manually from the AWS console, and then restart the uninstallation process. Make sure to always use the same configuration file that was used to install the environment.
How do I deal with persistent volumes that do not get removed as part of the Terraform uninstallation?
¶
Symptom
Uninstall fails to remove the persistent volume.
Error: Persistent volume atlassian-dc-share-home-pv still exists (Bound)
Error: context deadline exceeded
Error: Persistent volume claim atlassian-dc-share-home-pvc still exists with
Solution
If pod termination stalls, it blocks deletion of the PVC and PV. To fix this, force-delete the product pod first, then run the uninstall command again.
kubectl delete pod <stalled-pod> -n atlassian --force
kubectl get pods -n atlassian
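Once the stalled pod is gone, you can confirm that the persistent volume and claim have been released before re-running the uninstall:
kubectl get pvc -n atlassian
kubectl get pv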
How do I deal with suspended AWS Auto Scaling Groups during Terraform uninstallation?
¶
Symptom
If for any reason an Auto Scaling Group process gets suspended, AWS does not allow Terraform to delete the node group. In such cases the uninstall process is interrupted with the following error:
Error: error waiting for EKS Node Group (atlas-<environment_name>-cluster:appNode) to delete: unexpected state 'DELETE_FAILED', wanted target ''. last error: 2 errors occurred:
* i-06a4b4afc9e7a76b0: NodeCreationFailure: Instances failed to join the kubernetes cluster
* eks-appNode-3ebedddc-2d97-ff10-6c23-4900d1d79599: AutoScalingGroupInvalidConfiguration: Couldn't terminate instances in ASG as Terminate process is suspended
Solution
Delete the reported Auto Scaling Group in the AWS console and run the uninstall command again.
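To confirm which processes are suspended before deleting the group, you can inspect the ASG with the AWS CLI; the group name below is the one from the example error, so substitute your own:
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names eks-appNode-3ebedddc-2d97-ff10-6c23-4900d1d79599 --query 'AutoScalingGroups[0].SuspendedProcesses'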
How do I deal with Terraform AWS authentication issues during installation?
¶
Symptom
The following error is thrown:
An error occurred (ExpiredToken) when calling the GetCallerIdentity operation: The security token included in the request is expired
Solution
Terraform cannot deploy resources to AWS if your security token has expired. Renew your token and retry.
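To check whether your current credentials are valid before retrying, run:
aws sts get-caller-identity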
How do I deal with Terraform state lock acquisition errors?
¶
If a user interrupts the installation or uninstallation process, Terraform won't be able to unlock the resources. In that case, Terraform cannot acquire the state lock on the next attempt.
Symptom
The following error is thrown:
Acquiring state lock. This may take a few moments...
Error: Error acquiring the state lock
Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
ID: 26f7b9a8-1bef-0674-669b-1d60800dea4d
Path: atlassian-data-center-terraform-state-xxxxxxxxxx/bamboo-xxxxxxxxxx/terraform.tfstate
Operation: OperationTypeApply
Who: xxxxxxxxxx@C02CK0JYMD6V
Version: 1.0.9
Created: 2021-11-04 00:50:34.736134 +0000 UTC
Info:
Solution
Forcibly unlock the state by running the following command:
terraform force-unlock <ID>
where <ID> is the value that appears in the error message.
There are two Terraform locks: one for the infrastructure and another for the Terraform state. If you are still experiencing lock issues, change the directory to ./modules/tfstate and retry the same command.
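For example, to release a lock held on the Terraform state itself (the lock ID below is the one shown in the example error above; substitute your own):
cd modules/tfstate
terraform force-unlock 26f7b9a8-1bef-0674-669b-1d60800dea4d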
How do I deal with state data in S3 that does not have the expected content?
¶
If the Terraform state is locked and you forcefully unlock it using terraform force-unlock <id>, Terraform may not get a chance to update the Digest value in DynamoDB. This prevents Terraform from reading the state data.
Symptom
The following error is thrown:
Error refreshing state: state data in S3 does not have the expected content.
This may be caused by unusually long delays in S3 processing a previous state
update. Please wait for a minute or two and try again. If this problem
persists, and neither S3 nor DynamoDB are experiencing an outage, you may need
to manually verify the remote state and update the Digest value stored in the
DynamoDB table to the following value: 531ca9bce76bbe0262f610cfc27bbf0b
Solution
- Open the DynamoDB page in the AWS console and find the table named atlassian_data_center_<region>_<aws_account_id>_tf_lock in the same region as the cluster.
- Click on Explore Table Items and find the item with the LockID <table_name>/<environment_name>/terraform.tfstate-md5.
- Click on the item and replace the Digest value with the value given in the error message.
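If you prefer the command line, the same update can be sketched with the AWS CLI; this assumes the example table name, LockID, and digest value shown above, so substitute your own:
aws dynamodb update-item --table-name atlassian_data_center_<region>_<aws_account_id>_tf_lock --key '{"LockID": {"S": "<table_name>/<environment_name>/terraform.tfstate-md5"}}' --update-expression 'SET #d = :d' --expression-attribute-names '{"#d": "Digest"}' --expression-attribute-values '{":d": {"S": "531ca9bce76bbe0262f610cfc27bbf0b"}}'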
How do I deal with pre-existing state in multiple environments?
¶
If you start installing a new environment while you already have another active environment installed, you should NOT use the pre-existing state. The same applies when you want to uninstall a non-active environment.
What is an active environment?
The active environment is the latest environment you installed or uninstalled.
Tip
Answer 'NO' when you get a similar message during installation or uninstallation:
Do you want to copy existing state to the new backend? Pre-existing state was found while migrating
the previous "s3" backend to the newly configured "s3" backend. An existing non-empty state already
exists in the new backend. The two states have been saved to temporary files that will be removed
after responding to this query.
Do you want to overwrite the state in the new backend with the previous state? Enter "yes" to copy
and "no" to start with the existing state in the newly configured "s3" backend.
Enter a value:
Symptom
Installation or uninstallation breaks after you choose to use the pre-existing state.
Solution
- Clean up the project before proceeding. In the root directory of the project run:
./scripts/cleanup.sh -s -t -x -r .
terraform init -var-file=<config file>
- Then re-run the install/uninstall script.
How do I deal with Module not installed error during uninstallation?
¶
There are some Terraform-specific modules that are required when performing an uninstall. These modules are generated by Terraform during the install process and are stored in the .terraform folder. If Terraform cannot find these modules, it won't be able to perform an uninstall of the infrastructure.
Symptom
Error: Module not installed
on main.tf line 7:
7: module "tfstate-bucket" {
This module is not yet installed. Run "terraform init" to install all modules required by this configuration.
Solution
- In the root directory of the project run:
./scripts/cleanup.sh -s -t -x -r .
cd modules/tfstate
terraform init -var-file=<config file>
- Go back to the root of the project and re-run the uninstall.sh script.
How to deal with getting credentials: exec: executable aws failed with exit code 2 error?
¶
Symptom
After running the install.sh script, the following error is encountered:
Error: Post "https://0839E580E6ADB7B784AECE0E152D8AF2.gr7.eu-west-1.eks.amazonaws.com/api/v1/namespaces": getting credentials: exec: executable aws failed with exit code 2
with module.base-infrastructure.kubernetes_namespace.products,
on modules/common/main.tf line 39, in resource "kubernetes_namespace" "products":
39: resource "kubernetes_namespace" "products" {
Solution
Ensure you are using AWS CLI version 2 or later. The version can be checked by running:
aws --version
How to ssh to application nodes?
¶
Sometimes you need to ssh to the application nodes. This can be done by running:
kubectl exec -it <pod-name> -n atlassian -- /bin/bash
where <pod-name> is the name of the application pod you want to ssh into, such as bitbucket-0 or jira-1. To get the pod names you can run:
kubectl get pods -n atlassian
How to access the application log files?
¶
A simple way to access the application log content is to run the following command:
kubectl logs <pod-name> -n atlassian
where <pod-name> is the name of the application pod whose log you want to see, such as bitbucket-0 or jira-1. To get the pod names you can run:
kubectl get pods -n atlassian
However, another approach to see the full log files produced by the application is to ssh to the application pod and access the log folder directly.
kubectl exec -it <pod-name> -n atlassian -- /bin/bash
cd /var/atlassian/<application>/logs
where <application> is the name of the application, such as confluence, bamboo, bitbucket, or jira.
Note that for some applications the log folder is /var/atlassian/<application>/log and for others it is /var/atlassian/<application>/logs.
If you need to copy the log files to a local machine, you can use the following command:
kubectl cp atlassian/<pod-name>:/var/atlassian/<application>/logs/<log_files> <local-path>
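For example, to copy the main Confluence log to the current directory (note that Confluence keeps its logs under application-data, as shown in the kubectl exec example earlier; adjust the path for your product):
kubectl cp atlassian/confluence-0:/var/atlassian/application-data/confluence/logs/atlassian-confluence.log ./atlassian-confluence.log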
How to deal with persistent volume claim destroy failed error?
¶
The PVC cannot be destroyed while it is bound to a pod. Overcome this by scaling down to 0 pods before deleting the PVC.
helm upgrade PRODUCT atlassian-data-center/PRODUCT --set replicaCount=0 --reuse-values -n atlassian
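For example, to scale Confluence down to zero replicas before deleting its PVC (substitute your own release name if it differs):
helm upgrade confluence atlassian-data-center/confluence --set replicaCount=0 --reuse-values -n atlassian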
How to manually clean up resources when uninstall has failed?
¶
Sometimes Terraform is unable to destroy resources for various reasons. This normally happens at EKS level. One quick solution is to manually delete the EKS cluster, and re-run uninstall, so that Terraform will pick up from there.
To delete the EKS cluster, go to the AWS console > EKS service > the cluster you're deploying. Open the 'Configuration' tab > 'Compute' tab and click into the node group. In the node group screen, go to Details and click into the Auto Scaling group. You'll be directed to the EC2 > Auto Scaling Groups screen with the ASG selected; delete the chosen ASG. Wait for the ASG to be deleted, then go back to the EKS cluster and delete it.
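The same cleanup can also be sketched with the AWS CLI; the names below are placeholders and each deletion is asynchronous, so wait for one step to finish before the next:
aws eks delete-nodegroup --cluster-name <cluster-name> --nodegroup-name <nodegroup-name>
aws eks delete-cluster --name <cluster-name>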
How to deal with This object does not have an attribute named error when running uninstall.sh
¶
If the installation has failed, it is possible that the uninstall script will return an error like:
module.base-infrastructure.module.eks.aws_autoscaling_group_tag.this["Name"]: Refreshing state... [id=eks-appNode-t3_xlarge-50c26268-ea57-5aee-4523-68f33af7dd71,Name]
Error: Unsupported attribute
on dc-infrastructure.tf line 142, in module "confluence":
142: ingress = module.base-infrastructure.ingress
├────────────────
│ module.base-infrastructure is object with 5 attributes
This object does not have an attribute named "ingress".
Error: Unsupported attribute
In this case, run the uninstall script with the -s argument. This will add -refresh=false to the terraform destroy command.
How to deal with Error: Kubernetes cluster unreachable: the server has asked for the client to provide credentials error
¶
You may see this error when running the uninstall script with the -s argument. If it isn't possible to destroy the infrastructure without it, delete the offending module from the Terraform state, for example:
terraform state rm module.base-infrastructure.module.eks.helm_release.cluster-autoscaler
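To find the exact address of the offending resource, you can list the state first and filter it; the grep pattern is just an example:
terraform state list | grep cluster-autoscaler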
How to deal with EIP AddressLimitExceeded error
¶
If you encounter the error below during the installation stage, it means the VPC was created successfully, but no Elastic IP addresses are available.
Error: Error creating EIP: AddressLimitExceeded: The maximum number of addresses has been reached.
status code: 400, request id: 0061b744-ced3-4d0e-9905-503c85013bcc
with module.base-infrastructure.module.vpc.module.vpc.aws_eip.nat[0],
on .terraform/modules/base-infrastructure.vpc.vpc/main.tf line 1078, in resource "aws_eip" "nat":
1078: resource "aws_eip" "nat" {
This happens when an old VPC was deleted but its associated Elastic IPs were not released. Refer to the AWS documentation on how to release an Elastic IP address.
Another option is to increase the Elastic IP address limit.
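You can also list the Elastic IPs in the region with the AWS CLI (entries without an AssociationId are unassociated) and release the ones you no longer need; the allocation ID is a placeholder:
aws ec2 describe-addresses
aws ec2 release-address --allocation-id <allocation-id>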
How to deal with Nginx Ingress Helm deployment error
¶
If you encounter the error below when providing 25+ CIDRs in the whitelist_cidr variable, it may be caused by the service controller being unable to create a Load Balancer because the inbound rules quota in a security group is exceeded:
module.base-infrastructure.module.ingress.helm_release.ingress: Still creating... [5m50s elapsed]
Warning: Helm release "ingress-nginx" was created but has a failed status. Use the `helm` command to investigate the error, correct it, then run Terraform again.
kubectl describe service ingress-nginx-controller -n ingress-nginx
Warning SyncLoadBalancerFailed 112s service-controller Error syncing load balancer: failed to ensure load balancer: error authorizing security group ingress: "RulesPerSecurityGroupLimitExceeded: The maximum number of rules per security group has been reached.\n\tstatus code: 400, request id: 7de945ea-0571-48cd-99a1-c2ca528ad412"
The service controller creates several inbound rules for ports 80 and 443 for each source CIDR, so the quota is reached when there are 25+ CIDRs in the whitelist_cidr list.
To mitigate the problem, you can either file a ticket with AWS to increase the quota of inbound rules per security group (60 by default) or set enable_https_ingress to false in config.tfvars if you don't need HTTPS ingresses. Port 443 is then removed from the Nginx service, and as a result fewer inbound rules are created in the security group.
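For example, to disable HTTPS ingress, set the variable in your config.tfvars:
enable_https_ingress = false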
With an increased inbound rules quota or enable_https_ingress set to false (or both), it is recommended to delete the Nginx Helm chart before re-running install.sh:
helm delete ingress-nginx -n ingress-nginx