
Troubleshooting tips

This guide contains general tips on how to investigate an application deployment that doesn't work correctly.

How to troubleshoot a failed Helm release installation?

Symptom

The install script fails when installing a Helm release. This applies to all DC products. You will see the following error:

module.confluence[0].helm_release.confluence: Still creating... [20m10s elapsed]
Warning: Helm release "confluence" was created but has a failed status. Use the `helm` command to investigate the error, correct it, then run Terraform again.
  with module.confluence[0].helm_release.confluence,
  on modules/products/confluence/helm.tf line 4, in resource "helm_release" "confluence":

  4: resource "helm_release" "confluence" {

Error: timed out waiting for the condition
  with module.confluence[0].helm_release.confluence,
  on modules/products/confluence/helm.tf line 4, in resource "helm_release" "confluence":

4: resource "helm_release" "confluence" {
Releasing state lock. This may take a few moments...

Helm gives up waiting for a successful release. Usually, this means that the Confluence (or any other product) pod failed to pass its readiness probe, or that the pod is stuck in a Pending state.

Solution

To troubleshoot the error, run the following script:

scripts/collect_k8s_logs.sh atlas-dcapt-confluence-small-cluster us-east-2 /path/to/local/directory

The cluster name and region may differ (check environment_name and region in your config.tfvars). For example, if your environment_name is dcapt-confluence-small, then your cluster name is atlas-dcapt-confluence-small-cluster. The last argument is the destination path for the tar.gz archive with logs that the script will produce.

Share the archive in the #data-center-app-performance-toolkit Slack channel along with your support request. You can also look at the pod and its logs, e.g.:

confluence-0_log.log
confluence-0_describe.log

The logs may shed some light on why the pod isn't ready. The describe log file (e.g. confluence-0_describe.log) contains Kubernetes events that may also help you understand why the pod isn't in a Running state.

It's also a good idea to get logs that are not sent to stdout/err:

kubectl exec confluence-0 -n atlassian -i -t -- cat /var/atlassian/application-data/confluence/logs/atlassian-confluence.log

Typically, if the pod is Running but not marked as Ready, it's the application that failed to start, i.e. it isn't an infrastructure issue.
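
For a quick check without collecting the full log archive, you can also inspect the pod status and recent events directly (assuming the Confluence pod confluence-0 in the atlassian namespace):

kubectl get pods -n atlassian
kubectl describe pod confluence-0 -n atlassian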

How to troubleshoot instances that failed to join the Kubernetes cluster

Symptom

When Terraform creates the EKS infrastructure, the EKS cluster (control plane) is created first. Once the cluster has been created, a node group (backed by an Auto Scaling Group) is created, and EC2 instances join the cluster as worker nodes.

If a node fails to join its cluster, you will typically see the following error:

Error: waiting for EKS Node Group (atlas-dcapt-jira-small-cluster:appNode-t3_xlarge-20240521085758213900000012) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
* i-0dd1a9dc64303a10b: NodeCreationFailure: Instances failed to join the kubernetes cluster

with module.base-infrastructure.module.eks.module.eks.module.eks_managed_node_group["appNodes"].aws_eks_node_group.this[0],
on .terraform/modules/base-infrastructure.eks.eks/modules/eks-managed-node-group/main.tf line 272, in resource "aws_eks_node_group" "this":
272: resource "aws_eks_node_group" "this" {

Solution

There can be several reasons why nodes can't join the cluster. Permission issues are the most common. Make sure STS is enabled for your account in the target region. With STS disabled, the EKS control plane will deny join requests from the nodes.

After enabling STS, destroy the existing environment and re-run the installation.
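
A quick way to verify that STS works in the target region is to call it directly; a sketch, assuming AWS CLI v2 (which uses regional STS endpoints) and us-east-2 as the region from your config.tfvars:

aws sts get-caller-identity --region us-east-2

If the call fails with an error saying STS is not activated in the region, activate it in the IAM console under Account settings, then proceed as described above.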

How to fix 'exec plugin is configured to use API version' error?

Symptom

When running the install.sh script, the installation fails with an error:

module.base-infrastructure.kubernetes_namespace.products: Creating...
module.base-infrastructure.module.ingress.helm_release.ingress: Creating...

Error: Post "https://1D2E0AC7AE5EC290740D816BD53A68AB.gr7.us-east-1.eks.amazonaws.com/api/v1/namespaces": getting credentials: exec plugin is configured to use API version client.authentication.k8s.io/v1beta1, plugin returned version client.authentication.k8s.io/v1alpha1

  with module.base-infrastructure.kubernetes_namespace.products,
  on modules/common/main.tf line 43, in resource "kubernetes_namespace" "products":
  43: resource "kubernetes_namespace" "products" {


Error: Kubernetes cluster unreachable: Get "https://1D2E0AC7AE5EC290740D816BD53A68AB.gr7.us-east-1.eks.amazonaws.com/version": getting credentials: decoding stdout: no kind "ExecCredential" is registered for version "client.authentication.k8s.io/v1alpha1" in scheme "pkg/runtime/scheme.go:100"

  with module.base-infrastructure.module.ingress.helm_release.ingress,
  on modules/AWS/ingress/main.tf line 44, in resource "helm_release" "ingress":
  44: resource "helm_release" "ingress" {

The error is caused by an API version mismatch: an outdated AWS CLI returns exec credentials in the v1alpha1 format, while the Kubernetes and Helm providers expect v1beta1.

Solution

Update the AWS CLI to the most recent version.
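
After updating, you can verify that the CLI returns credentials in the expected format; a quick check, assuming your cluster name follows the atlas-<environment_name>-cluster pattern used elsewhere in this guide:

aws eks get-token --cluster-name atlas-<environment_name>-cluster | grep apiVersion

The output should report client.authentication.k8s.io/v1beta1.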

How do I uninstall an environment using a different Terraform configuration file?

Symptom

If you try to uninstall an environment by using a different configuration file than the one you used to install it or by using a different version of the code, you may encounter some issues during uninstallation. In most cases, Terraform reports that the resource cannot be removed because it's in use.

Solution

Identify the resource and delete it manually from the AWS console, and then restart the uninstallation process. Make sure to always use the same configuration file that was used to install the environment.

How do I deal with persistent volumes that do not get removed as part of the Terraform uninstallation?

Symptom

Uninstall fails to remove the persistent volume.

Error: Persistent volume atlassian-dc-share-home-pv still exists (Bound)

Error: context deadline exceeded

Error: Persistent volume claim atlassian-dc-share-home-pvc still exists with 

Solution

If pod termination stalls, it blocks PVC and PV deletion. To fix this problem, force-delete the product pod first and then run the uninstall command again.

kubectl delete pod <stalled-pod> -n atlassian --force

To find the name of the stalled pod, run:

kubectl get pods -n atlassian

How do I deal with suspended AWS Auto Scaling Groups during Terraform uninstallation?

Symptom

If for any reason the Auto Scaling Group gets suspended, AWS does not allow Terraform to delete the node group. In cases like this, the uninstall process is interrupted with the following error:

Error: error waiting for EKS Node Group (atlas-<environment_name>-cluster:appNode) to delete: unexpected state 'DELETE_FAILED', wanted target ''. last error: 2 errors occurred:
    * i-06a4b4afc9e7a76b0: NodeCreationFailure: Instances failed to join the kubernetes cluster
    * eks-appNode-3ebedddc-2d97-ff10-6c23-4900d1d79599: AutoScalingGroupInvalidConfiguration: Couldn't terminate instances in ASG as Terminate process is suspended

Solution

Delete the reported Auto Scaling Group in the AWS console and run the uninstall command again.
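
If you prefer the CLI, a sketch using the Auto Scaling Group name from the sample error above (the name and region will differ in your environment):

aws autoscaling delete-auto-scaling-group --auto-scaling-group-name eks-appNode-3ebedddc-2d97-ff10-6c23-4900d1d79599 --force-delete --region <region>

The --force-delete flag deletes the group along with any instances still attached to it.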

How do I deal with Terraform AWS authentication issues during installation?

Symptom

The following error is thrown:

An error occurred (ExpiredToken) when calling the GetCallerIdentity operation: The security token included in the request is expired

Solution

Terraform cannot deploy resources to AWS if your security token has expired. Renew your token and retry.
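
How you renew the token depends on how you authenticate. For example, if you use AWS IAM Identity Center (SSO), re-authenticating typically looks like this (the profile name is an assumption):

aws sso login --profile <profile-name>

If you use temporary credentials exported as environment variables, request a fresh set and re-export AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN before re-running the script.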

How do I deal with Terraform state lock acquisition errors?

If the user interrupts the installation or uninstallation process, Terraform won't be able to release the lock on the affected resources. As a result, Terraform cannot acquire the state lock on the next attempt.

Symptom

The following error is thrown:

Acquiring state lock. This may take a few moments...

 Error: Error acquiring the state lock

 Error message: ConditionalCheckFailedException: The conditional request failed
 Lock Info:
   ID:        26f7b9a8-1bef-0674-669b-1d60800dea4d
   Path:      atlassian-data-center-terraform-state-xxxxxxxxxx/bamboo-xxxxxxxxxx/terraform.tfstate
   Operation: OperationTypeApply
   Who:       xxxxxxxxxx@C02CK0JYMD6V
   Version:   1.0.9
   Created:   2021-11-04 00:50:34.736134 +0000 UTC
   Info:

Solution

Forcibly unlock the state by running the following command:

terraform force-unlock <ID>

Where <ID> is the value that appears in the error message.
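
For example, using the lock ID from the sample error above:

terraform force-unlock 26f7b9a8-1bef-0674-669b-1d60800dea4d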

There are two Terraform locks: one for the infrastructure and another for the Terraform state. If you are still experiencing lock issues, change directory to ./modules/tfstate and retry the same command.

How do I deal with state data in S3 that does not have the expected content?

If the Terraform state is locked and you forcefully unlock it using terraform force-unlock <id>, Terraform may not get a chance to update the Digest value in DynamoDB. This prevents Terraform from reading the state data.

Symptom

The following error is thrown:

Error refreshing state: state data in S3 does not have the expected content.

This may be caused by unusually long delays in S3 processing a previous state
update.  Please wait for a minute or two and try again. If this problem
persists, and neither S3 nor DynamoDB are experiencing an outage, you may need
to manually verify the remote state and update the Digest value stored in the
DynamoDB table to the following value: 531ca9bce76bbe0262f610cfc27bbf0b

Solution

  1. Open the DynamoDB page in the AWS console and find the table named atlassian_data_center_<region>_<aws_account_id>_tf_lock in the same region as the cluster.

  2. Click on Explore Table Items and find the LockID named <table_name>/<environment_name>/terraform.tfstate-md5.

  3. Click on the item and replace the Digest value with the value given in the error message.
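
If you prefer the CLI, the same fix can be applied with aws dynamodb update-item; a sketch using the placeholders from the steps above and the Digest value from the sample error message (substitute your own values):

aws dynamodb update-item \
  --table-name atlassian_data_center_<region>_<aws_account_id>_tf_lock \
  --key '{"LockID": {"S": "<table_name>/<environment_name>/terraform.tfstate-md5"}}' \
  --update-expression "SET #d = :v" \
  --expression-attribute-names '{"#d": "Digest"}' \
  --expression-attribute-values '{":v": {"S": "531ca9bce76bbe0262f610cfc27bbf0b"}}'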

How do I deal with pre-existing state in multiple environments?

If you start installing a new environment while you already have an active environment installed, you should NOT use the pre-existing state.

The same applies when you want to uninstall a non-active environment.

What is an active environment?

The active environment is the latest environment you installed or uninstalled.

Tip

Answer 'no' when you get a message similar to the following during installation or uninstallation:

Do you want to copy existing state to the new backend? Pre-existing state was found while migrating 
the previous "s3" backend to the newly configured "s3" backend. An existing non-empty state already 
exists in the new backend. The two states have been saved to temporary files that will be removed 
after responding to this query. 

Do you want to overwrite the state in the new backend with the previous state? Enter "yes" to copy 
and "no" to start with the existing state in the newly configured "s3" backend.

Enter a value:

Symptom

Installation or uninstallation breaks after you chose to use the pre-existing state.

Solution

  1. Clean up the project before proceeding. In the root directory of the project, run:
    ./scripts/cleanup.sh -s -t -x -r .
    terraform init -var-file=<config file>
    
  2. Then re-run the install/uninstall script.

How do I deal with Module not installed error during uninstallation?

There are some Terraform-specific modules that are required when performing an uninstall. These modules are generated by Terraform during the install process and are stored in the .terraform folder. If Terraform cannot find these modules, it won't be able to perform an uninstall of the infrastructure.

Symptom

Error: Module not installed

  on main.tf line 7:
   7: module "tfstate-bucket" {

This module is not yet installed. Run "terraform init" to install all modules required by this configuration.

Solution

  1. In the root directory of the project run:
    ./scripts/cleanup.sh -s -t -x -r .
    cd modules/tfstate
    terraform init -var-file=<config file>
    
  2. Go back to the root of the project and re-run the uninstall.sh script.

How to deal with getting credentials: exec: executable aws failed with exit code 2 error?

Symptom

After running install.sh, the following error is encountered:

Error: Post "https://0839E580E6ADB7B784AECE0E152D8AF2.gr7.eu-west-1.eks.amazonaws.com/api/v1/namespaces": getting credentials: exec: executable aws failed with exit code 2

with module.base-infrastructure.kubernetes_namespace.products,
on modules/common/main.tf line 39, in resource "kubernetes_namespace" "products":
39: resource "kubernetes_namespace" "products" {

Solution

Ensure you are using AWS CLI version 2 or later. The version can be checked by running:

aws --version

How to ssh to application nodes?

Sometimes you need to ssh to the application nodes. This can be done by running:

kubectl exec -it <pod-name> -n atlassian -- /bin/bash

where <pod-name> is the name of the application pod you want to ssh into, such as bitbucket-0 or jira-1. To get the pod names, you can run:

kubectl get pods -n atlassian

How to access the application log files?

A simple way to access the application log content is to run the following command:

kubectl logs <pod-name> -n atlassian

where <pod-name> is the name of the application pod whose logs you want to see, such as bitbucket-0 or jira-1. To get the pod names, you can run:

kubectl get pods -n atlassian

However, another approach to see the full log files produced by the application is to ssh into the application pod and access the log folder directly.

kubectl exec -it <pod-name> -n atlassian -- /bin/bash
cd /var/atlassian/<application>/logs

where <application> is the name of the application such as confluence, bamboo, bitbucket, or jira.

Note that for some applications the log folder is /var/atlassian/<application>/log, while for others it is /var/atlassian/<application>/logs.

If you need to copy the log files to a local machine, you can use the following command:

kubectl cp atlassian/<pod-name>:/var/atlassian/<application>/logs/<log_files> <local-path>
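
For example, to copy the Confluence application log to the current directory (assuming the pod is confluence-0 and the log file is atlassian-confluence.log):

kubectl cp atlassian/confluence-0:/var/atlassian/confluence/logs/atlassian-confluence.log ./atlassian-confluence.log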

How to deal with persistent volume claim destroy failed error?

A PVC cannot be destroyed while it is bound to a pod. Overcome this by scaling the product down to 0 pods before deleting the PVC.

helm upgrade PRODUCT atlassian-data-center/PRODUCT --set replicaCount=0 --reuse-values -n atlassian
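
For example, assuming the Confluence release is named confluence:

helm upgrade confluence atlassian-data-center/confluence --set replicaCount=0 --reuse-values -n atlassian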

How to manually clean up resources when uninstall has failed?

Sometimes Terraform is unable to destroy resources for various reasons. This normally happens at the EKS level. One quick solution is to manually delete the EKS cluster and re-run the uninstall, so that Terraform picks up from there.

To delete the EKS cluster, go to the AWS console > EKS service > the cluster you're deploying. Go to the 'Configuration' tab > 'Compute' tab and click into the node group. In the node group screen, open Details and click into the Auto Scaling group. You'll be directed to the EC2 > Auto Scaling Groups screen with the ASG selected; delete the chosen ASG. Wait for the ASG to be deleted, then go back to the EKS cluster and delete it.
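
If you prefer the CLI, a sketch of the same sequence (the cluster, node group, and region names are placeholders and will differ in your environment):

aws eks delete-nodegroup --cluster-name atlas-<environment_name>-cluster --nodegroup-name <nodegroup-name> --region <region>
aws eks delete-cluster --name atlas-<environment_name>-cluster --region <region>

Wait for the node group deletion to finish before deleting the cluster, then re-run the uninstall.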

How to deal with This object does not have an attribute named error when running uninstall.sh

It is possible that if the installation has failed, the uninstall script will return an error like:

module.base-infrastructure.module.eks.aws_autoscaling_group_tag.this["Name"]: Refreshing state... [id=eks-appNode-t3_xlarge-50c26268-ea57-5aee-4523-68f33af7dd71,Name]
Error: Unsupported attribute
on dc-infrastructure.tf line 142, in module "confluence":
142: ingress = module.base-infrastructure.ingress
├────────────────
│ module.base-infrastructure is object with 5 attributes
This object does not have an attribute named "ingress".
Error: Unsupported attribute

This happens because some of the modules failed to install. To fix the error, run the uninstall script with the -s argument. This will add -refresh=false to the terraform destroy command.

How to deal with Error: Kubernetes cluster unreachable: the server has asked for the client to provide credentials error

You may see this error when running the uninstall script with the -s argument. If the infrastructure still can't be destroyed, delete the offending module from the Terraform state, for example:

terraform state rm module.base-infrastructure.module.eks.helm_release.cluster-autoscaler

Once done, re-run the uninstall script.
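
To find the exact resource address to remove, you can list the relevant entries in the state first, e.g.:

terraform state list | grep helm_release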

How to deal with EIP AddressLimitExceeded error

If you encounter the error below during the installation stage, it means the VPC was created successfully, but no Elastic IP addresses are available.

Error: Error creating EIP: AddressLimitExceeded: The maximum number of addresses has been reached.
status code: 400, request id: 0061b744-ced3-4d0e-9905-503c85013bcc

with module.base-infrastructure.module.vpc.module.vpc.aws_eip.nat[0],
on .terraform/modules/base-infrastructure.vpc.vpc/main.tf line 1078, in resource "aws_eip" "nat":
1078: resource "aws_eip" "nat" {

This happens when an old VPC was deleted but its associated Elastic IPs were not released. Refer to the AWS documentation on how to release an Elastic IP address.
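
For example, you can list unassociated Elastic IPs and release them via the CLI (the region and allocation ID are placeholders):

aws ec2 describe-addresses --region <region> --query 'Addresses[?AssociationId==null]'
aws ec2 release-address --allocation-id <allocation-id> --region <region>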

Another option is to increase the Elastic IP address limit.

How to deal with Nginx Ingress Helm deployment error

If you encounter the error below when providing 25+ CIDRs in the whitelist_cidr variable, it may be caused by the service controller failing to create a load balancer because the inbound rules quota of a security group is exceeded:

module.base-infrastructure.module.ingress.helm_release.ingress: Still creating... [5m50s elapsed]
Warning: Helm release "ingress-nginx" was created but has a failed status. Use the `helm` command to investigate the error, correct it, then run Terraform again.

To check whether this is really the case, log in to the cluster and run:

kubectl describe service ingress-nginx-controller -n ingress-nginx
and look for the following error in the Events section:

Warning  SyncLoadBalancerFailed  112s  service-controller  Error syncing load balancer: failed to ensure load balancer: error authorizing security group ingress: "RulesPerSecurityGroupLimitExceeded: The maximum number of rules per security group has been reached.\n\tstatus code: 400, request id: 7de945ea-0571-48cd-99a1-c2ca528ad412"

The service controller creates several inbound rules for ports 80 and 443 for each source CIDR, so the quota is reached if there are 25+ CIDRs in the whitelist_cidr list.

To mitigate the problem, you can either file a ticket with AWS to increase the quota of inbound rules in a security group (60 by default), or set enable_https_ingress to false in config.tfvars if you don't need HTTPS ingresses. Port 443 will then be removed from the Nginx service, and as a result fewer inbound rules are created in the security group.
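
For example, in config.tfvars:

enable_https_ingress = false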

With an increased inbound rules quota or enable_https_ingress set to false (or both), it is recommended to delete the Nginx Helm release before re-running install.sh:

helm delete ingress-nginx -n ingress-nginx