How to update a custom stack version number in Ambari

If you have custom stacks installed in Ambari, for example HAProxy or other custom services, and you want to update the version number shown on the UI, you can follow the steps below.

Here is an example of updating the version in the UI for the Mahout service. You can do the same for your own custom service, such as HAProxy.

Take a backup:-
create table repo_version_bkp as select * from repo_version;

Query to check the data before the update:-


select replace(version_xml,'name="MAHOUT" version="0.9.0"','name="MAHOUT" version="0.9.1"') from repo_version where repo_version_id=1;


Check the output of this select SQL; it should show what the replaced value will be. Here I am replacing the string name="MAHOUT" version="0.9.0" in the column version_xml, and the replaced value will be name="MAHOUT" version="0.9.1".


Update:

Once you confirm the output looks correct, update the table:


update repo_version set version_xml=replace(version_xml,'name="MAHOUT" version="0.9.0"','name="MAHOUT" version="0.9.1"') where repo_version_id=1;


Note: use the correct repo_version_id number in the above query based on your environment.
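If your Ambari backend is the default embedded PostgreSQL database, the whole sequence might look like the following sketch; the database name, user, and repo_version_id here are assumptions to adapt to your environment:

# Hedged sketch: assumes the default embedded PostgreSQL backend (database/user "ambari")
psql -U ambari -d ambari <<'SQL'
-- back up the table first
create table repo_version_bkp as select * from repo_version;
-- preview the replacement
select replace(version_xml,'name="MAHOUT" version="0.9.0"','name="MAHOUT" version="0.9.1"')
  from repo_version where repo_version_id=1;
-- apply it
update repo_version set version_xml=replace(version_xml,'name="MAHOUT" version="0.9.0"','name="MAHOUT" version="0.9.1"')
  where repo_version_id=1;
SQL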

After the table change, restart the Ambari server (ambari-server restart).

Installing CDSW on HDP Cluster


Steps to install Cloudera Data Science Workbench (CDSW) with HDP

Install prerequisites


Cluster Requirement:-

HDP-2.6.5 or HDP-3.1


Edge node requirements:

  • Enable memory cgroups on your operating system (on RHEL 7.5 they should be enabled by default).
  • Disable swap for optimum stability and performance:
sudo sysctl -w vm.swappiness=1
(add vm.swappiness=1 to /etc/sysctl.conf to persist the setting across reboots)
  • Cloudera Data Science Workbench uses uid 8536 for an internal service account. Make sure that this user ID is not assigned to any other service or user account.
cat /etc/passwd |grep -i 8536
  • JDK 8
  • Disable all pre-existing iptables rules.
sudo iptables -P INPUT ACCEPT
sudo iptables -P FORWARD ACCEPT
sudo iptables -P OUTPUT ACCEPT
sudo iptables -t nat -F
sudo iptables -t mangle -F
sudo iptables -F
sudo iptables -X
  • Disable SELinux, or set it to permissive mode.
To disable:-
vi /etc/selinux/config
Change to "SELINUX=disabled" (takes effect after a reboot)

(or, for the current boot only)

setenforce 0
  • No DNS server running on port 53 on the CDSW machines (check by running lsof -i:53).

  • yum -y install bzip2  

  • Install Anaconda2. Run the installer, read the license agreement, and type "yes" to accept it. Note the install location the installer reports, for example:
Anaconda2 will now be installed into this location:
/root/anaconda2
We will need this path later to set the ANACONDA_DIR variable in cdsw.conf.
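A hedged sketch of the installer run (the Anaconda2 release and URL below are illustrative; pick the version you need from the Anaconda archive):

# Download and run an Anaconda2 installer (example release; adjust as needed)
wget https://repo.anaconda.com/archive/Anaconda2-5.3.1-Linux-x86_64.sh
sudo bash Anaconda2-5.3.1-Linux-x86_64.sh
# Accept the license ("yes") and note the install prefix the installer prints
# (e.g. /root/anaconda2); ANACONDA_DIR in cdsw.conf must point to it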


Install CDSW:



  1. Add a new host to the cluster using Ambari. Go to the Hosts page and select Actions > + Add New Hosts.
  2. On the edge node, change to the yum repo directory and download the CDSW repo file (the URL below mirrors the GPG key path; adjust it for your CDSW version):
cd /etc/yum.repos.d
sudo wget https://archive.cloudera.com/cdsw1/1.5.0/redhat7/yum/cloudera-cdsw.repo
  3. sudo rpm --import https://archive.cloudera.com/cdsw1/1.5.0/redhat7/yum/RPM-GPG-KEY-cloudera
  4. sudo yum install cloudera-data-science-workbench
  5. vi /etc/cdsw/config/cdsw.conf
DOMAIN="rraman-docker-1.openstacklocal"
MASTER_IP="172.26.75.53"
DOCKER_BLOCK_DEVICES="/dev/vdb"
APPLICATION_BLOCK_DEVICE=""
JAVA_HOME="/usr/jdk64/jdk1.8.0_112"
TLS_ENABLE="false"
TLS_CERT=""
TLS_KEY=""
HTTP_PROXY=""
HTTPS_PROXY=""
ALL_PROXY=""
NO_PROXY=""
NVIDIA_GPU_ENABLE=false
NVIDIA_LIBRARY_PATH=
DISTRO="HDP"
DISTRO_DIR=""
ANACONDA_DIR="/root/anaconda2"

  6. cdsw init
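Once the init finishes, a quick sanity check before opening the UI (cdsw status is part of the CDSW CLI):

cdsw status
# all pods and services should report healthy before you proceed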


Hit the browser URL:-

Once cdsw init completes, open http://<DOMAIN> (the DOMAIN value from cdsw.conf, here rraman-docker-1.openstacklocal) in a browser and sign up. The first account created becomes the site administrator.
Reference:

Deploying Cloudera Data Science Workbench 1.5.x on Hortonworks Data Platform (Cloudera documentation)


How to rename an existing HDP cluster

To rename an HDP cluster:-

1. Rename the cluster through Manage Ambari --> Rename Cluster.

2. Rename the existing Ranger repositories for the services where the Ranger plugin is enabled. You can do this in the Ranger admin UI, or via the REST API (see the sketch after these steps).

for example:-

Old cluster name is ClusterA and the renamed cluster name ClusterA_NEW

The HDFS Ranger repository name will look like "ClusterA_hadoop"; it needs to be renamed to "ClusterA_NEW_hadoop" to match the new cluster name.

Repeat this for all the service repositories.

3. Restart the relevant services where the changes are made.
for example: HDFS, YARN, Hive, Atlas, Kafka, etc.

4. (Optional) The principals/keytabs can still function with the old cluster name in them. To avoid confusion, if you need all the service principals to carry the new cluster name, you can regenerate the keytabs through Ambari. If you do this, you can skip the restart in step 3 and restart after the new keytabs are generated.
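For step 2, if you prefer to rename the Ranger repositories from the command line, the Ranger admin REST API can do it. A minimal sketch, assuming a Ranger admin at rangerhost.example.com:6080 and a service id of 1 (host, credentials, and id are placeholders to adapt):

# List the repositories (services) and note the id and JSON of "ClusterA_hadoop"
curl -u admin:admin http://rangerhost.example.com:6080/service/public/v2/api/service

# Save the service JSON, change "name" to "ClusterA_NEW_hadoop", then PUT it back by id
curl -u admin:admin -H "Content-Type: application/json" -X PUT \
  -d @updated_service.json \
  http://rangerhost.example.com:6080/service/public/v2/api/service/1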

How to Fix Ranger Usersync Failure on Your HDP / CDP Cluster

Problem:-

If you're setting up a cluster and experiencing issues with Ranger usersync, you may encounter error messages in the /var/log/ranger/usersync/usersync.log file. Specifically, you might see errors like the following:

11 Feb 2022 15:15:46 ERROR CustomSSLSocketFactory [UnixUserSyncThread] - Unable to obtain keystore from file [/usr/hdp/current/ranger-usersync/conf/mytruststore.jks]

11 Feb 2022 15:15:46 ERROR UserGroupSync [UnixUserSyncThread] - Failed to initialize UserGroup source/sink. Will retry after 3600000 milliseconds. Error details: javax.naming.CommunicationException: adhost1.example.com:636 [Root exception is java.lang.NullPointerException]

These errors indicate that usersync cannot open its truststore when connecting to AD over LDAPS. The fix is to extract the Active Directory (AD) certificate, import it into the Ranger usersync truststore, and then update the truststore password through Ambari. By following these steps, you can get Ranger usersync up and running smoothly.


Solution: How to Fix Ranger Usersync Failure on Your Cluster

To resolve this issue, follow these simple steps:

Step 1: Extract the AD cert

To extract the AD cert, use the following command:

echo -n | openssl s_client -connect adhost1.example.com:636 | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > /tmp/ad_cert.cert

Step 2: Import the extracted cert into the Ranger usersync truststore

To import the extracted cert, use the following command:

keytool -import -trustcacerts -alias AD_cert -keystore /usr/hdp/current/ranger-usersync/conf/mytruststore.jks -file /tmp/ad_cert.cert

keytool will prompt for a keystore password; choose the password you want to set for this truststore and note it for step 3.
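Optionally, verify the cert landed in the truststore (you will be prompted for the password you just set):

keytool -list -keystore /usr/hdp/current/ranger-usersync/conf/mytruststore.jks | grep -i ad_cert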

Step 3: Update the Ranger usersync truststore password

To update the password for the Ranger usersync truststore, follow these steps:

  1. Go to Ambari.
  2. Navigate to Ranger --> Configs --> Advanced --> Advanced ranger-ugsync-site --> ranger.usersync.truststore.password.
  3. Update the password.

By following these simple steps, you should be able to fix the Ranger usersync failure on your cluster.

How to install LocalStack - to use S3 APIs on an on-prem cluster

LocalStack install steps:-


curl -sL https://rpm.nodesource.com/setup_10.x | bash -
yum -y install python-pip python-devel gcc gcc+ nodejs maven lsof wget
pip install --upgrade pip
pip install localstack awscli-local


These steps install all the relevant binaries required by LocalStack.
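A quick usage sketch once installed (awslocal comes from the awscli-local package above; the S3 edge port depends on the LocalStack version, 4566 on recent releases):

# Start LocalStack in one shell
localstack start
# In another shell, exercise the S3 API locally
awslocal s3 mb s3://test-bucket
awslocal s3 ls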

Ambari - How to persist the KDC admin credential and regenerate keytabs using an API call


Step 1:

ambari-server setup-security

Choose option [2] "Encrypt passwords stored in ambari.properties file", then set the master key password you want.

Step 2:

curl -H "X-Requested-By:ambari" -u admin:admin -X  POST -d '{ "Credential" : { "principal" : "admin@EXAMPLE.COM", "key" : "paswwd12345", "type" : "persisted" } }' http://ambari-node1.example.com:8080/api/v1/clusters/hdp_cluster1/credentials/kdc.admin.credential

--> Update the principal and key with your actual KDC admin principal name and password.
--> Update admin:admin with your Ambari admin user ID and password.
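As a quick check that the credential was stored as persisted, you can read it back; Ambari should return the credential type but not the key itself:

curl -H "X-Requested-By:ambari" -u admin:admin http://ambari-node1.example.com:8080/api/v1/clusters/hdp_cluster1/credentials/kdc.admin.credential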


Step 3:

 curl -H "X-Requested-By:ambari" -u admin:admin -X PUT -d '{ "Clusters": { "security_type" : "KERBEROS" } }' http://ambari-node1.example.com:8080/api/v1/clusters/hdp_cluster1/?regenerate_keytabs=ALL

OS commands


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
To watch CPU core interrupts:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
/usr/bin/watch -d 'cat /proc/interrupts'

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
To increase WAN transfers, increase txqueuelen in eth interface:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
“For WAN transfers, it was discovered that a setting of 2,000 for the txqueuelen is sufficient to prevent any send stalls from occurring.” The default value for txqueuelen is 1000. I have successfully tested a value of 2500. Type the following command to change it:

Example:

ifconfig eth4 txqueuelen 2500

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# To run the application or commands to a specific core "numactl" 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#Get the CPU physical core info from the file below and use the core id in numactl.
less /proc/cpuinfo

numactl --physcpubind=1

Ex:  numactl --physcpubind=1 top

This will run the top command using physical cpu core id "1"
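To see the available NUMA nodes and the CPUs attached to each before pinning:

numactl --hardware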
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#To Check the ethernet interface driver version:
ethtool -i eth0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#To get the inode number for a group of files in a folder
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
cd /test
stat * |grep -i -E 'File:'\|'Inode' | awk '{ if($3 == "Inode:") print "   "$3" "$4; else print $0 }'

#Good Format:
stat * |grep -i -E 'File:'\|'Inode' | awk '{ if($3 == "Inode:") print "\t"$3" "$4; else print $1 $2 }' |awk 'NR%2{printf "%s ",$0;next;}1'
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#To process a file line by line in a while loop (example):

filename=/tmp/file1.txt
while read -r line
do
    name="$line"
    echo "Name read from file - $name"
done < "$filename"
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#To extract the logs between two timestamps 

Example:

sort hiveserver2.log |sed -n '/2017-04-26 20:40:44/,/2017-04-26 20:43:44/p' >> extracted_file.log

or

sed -n '/2017-04-26 20:40:44/,/2017-04-26 20:43:44/p' hiveserver2.log  >> extracted_file.log

cat hiveserver2_knox_25_april1 |grep -a ''  |sort  |sed -n '/2017-04-25 20:/,/2017-04-25 21:/p' > hiveserver2_8:30PM_9:30PM
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

##Monitor I/O using SAR command:

sar -d -p 1

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Mac OS : converting delimiter $ to ','

perl -pi -w -e 's/\$/,/g;' test1.txt

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
HiveServer2 (HS2) open-files limit and open files grouped by PID:-

grep 'open files' /proc/$(ps aux | grep -i "hiveserver2"|grep -v 'grep'|  awk '{print $2}')/limits

lsof -u hive |awk '{print $2}' |uniq

lsof -u hive |awk '{print $2}' |sort |uniq -c
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
