By Rajarao Vijjapu, Data Architect – Big Data, Analytics and AI/ML at AWS
By Kiran Erra, Data Architect – Big Data, Analytics and AI/ML at AWS
By Unni Ravindranathan – Director of Product Management, Okta Integration Network, Okta
As organizations continue to build data lakes on Amazon Web Services (AWS) and adopt Amazon EMR, especially when consuming data at enterprise scale, it’s critical to govern your data lakes by establishing federated access and having fine-grained controls to access your data.
A common use case that we hear is how to use the existing system of record (like Okta) for authentication while enforcing column-level authorization to address data consumption patterns based on the organization’s data classification standards. For example, if the marketing department must not see certain columns of sales data, column-level authorization needs to be in place, even if marketing has access to other columns in the same table.
In this post, we walk you through implementing SAML-based authentication (AuthN) using Okta for Amazon EMR, querying data using Zeppelin notebooks, and applying column-level authorization (AuthZ) using AWS Lake Formation.
Okta is an AWS Partner Network (APN) Advanced Technology Partner with AWS Competencies in Security and Government. Okta is a leading independent provider of identity for the enterprise. The Okta Identity Cloud enables organizations to securely connect the right people to the right technologies at the right time.
The following diagram depicts our solution’s infrastructure.
In this walkthrough, we set up two users (Analyst1 and Analyst2) for authorization using AWS Lake Formation. As shown in the figure below, Analyst1 is allowed to SELECT only two columns from the table, whereas Analyst2 can SELECT all the columns from the table.
This solution uses Okta to manage users and implement SAML-based authentication for those Okta users when connecting to the Zeppelin notebook running on a Kerberized Amazon EMR cluster (with a cluster-dedicated KDC).
To complete this walkthrough, you must have the following:
- An Okta developer account.
- A valid AWS account with access to AWS services.
- Port 8442 must not be blocked when testing this solution. Your desktop needs it to communicate with the proxy agent running on Amazon EMR.
- An AWS Identity and Access Management (IAM) role as a Lake Formation administrator (for example, lfblog). Grant data lake admin permissions to the role on the Lake Formation console. You use this role when launching the provided AWS CloudFormation stack.
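The port prerequisite can be checked from your desktop once the cluster is running. The following is a minimal sketch that uses only the Python standard library; the function name `port_is_reachable` is our own, and the host you pass in is whatever primary public DNS your cluster reports.

```python
import socket


def port_is_reachable(host: str, port: int = 8442, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (replace the placeholder with your EMR primary public DNS):
# port_is_reachable("<EMR primary DNS>")
```

A result of False only tells you that the TCP connection failed. The cause could be a local firewall, the cluster security group, or a cluster that is not yet up.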
The CloudFormation templates used in this post are designed to work only in the us-east-1 region. They are not intended for production use without modification.
Setting Up the Okta Users and Application
In this implementation, we use an Okta single sign-on (SSO) application to integrate with Amazon EMR.
1. Sign up and activate an Okta developer account.
2. Log in to the Okta account and switch to Classic UI.
3. In the Directory menu, on the People tab, choose Add person.
4. Add the following users:
- Analyst One – email@example.com
- Analyst Two – firstname.lastname@example.org
5. For Password, choose Set by admin and enter a valid password.
6. Deselect the option User must change password on first login.
7. Choose Save.
8. Edit the profile attributes for the newly created users to specify the Display name as analyst1 and analyst2, respectively.
- Do not specify a domain name (for example, @dsar.com) in the Display name field.
9. On the Applications tab, add a new application.
10. For Platform, choose Web.
11. For Sign on method, select SAML 2.0.
12. Choose Create.
13. For the SAML settings, enter the following five parameters:
|1||General||Single Sign-On URL||
|2||General||Audience URI (SP Entity ID)||
A filled-out SAML application template with these settings is shown below.
After you launch the Amazon EMR cluster and before you validate the solution, update the public DNS in Single Sign-On URL (parameter 1) and the SAML role ARN and OktaSAMLProvider identity provider (IdP) ARN (parameter 3).
14. On the Assignments tab, assign analyst1 and analyst2 to the newly created application.
- Do not specify a domain name (for example, @dsar.com) in the User Name field when assigning people to the application.
15. On the Sign On tab, under Sign On Methods, choose the link location for the IdP metadata file (right-click) and choose Copy Link Location.
The link should have the following format: https://dev-<account>.okta.com/app/<randomString>/sso/saml/metadata.
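Before pasting the copied link into the CloudFormation parameters, it can help to confirm that it matches the format quoted above. The sketch below is a loose shape check only, written with the Python standard library; the function name is hypothetical, and the check does not contact Okta or validate the metadata itself.

```python
import re
from urllib.parse import urlparse

# Mirrors the path quoted above: /app/<randomString>/sso/saml/metadata
_METADATA_PATH = re.compile(r"^/app/[^/]+/sso/saml/metadata$")


def looks_like_okta_metadata_url(url: str) -> bool:
    """Loosely check a copied link against the expected Okta metadata URL shape."""
    parts = urlparse(url)
    return (
        parts.scheme == "https"
        and (parts.hostname or "").endswith(".okta.com")
        and _METADATA_PATH.match(parts.path) is not None
    )
```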
Setting Up AWS Lake Formation
If this is your first time accessing AWS Lake Formation, you need to add administrators.
1. On the Lake Formation console, in the Welcome to Lake Formation window, choose Add administrators.
2. For IAM users and roles, add lfblog.
3. Choose Save.
4. Under Data catalog, choose Settings.
5. For Data catalog settings, deselect the two permissions check boxes.
6. Choose Save.
7. Choose Admins and database creators from the navigation pane.
8. In the Revoke permissions section, for IAM users and roles, choose IAMAllowedPrincipals.
9. Choose Revoke.
10. Choose Databases from the navigation pane.
11. Choose Create database.
12. Create two databases:
- default – Leave the Location field blank.
- lfoktasamlblogdb – For Location, enter your Amazon Simple Storage Service (Amazon S3) location path (for example,
13. Choose Create database.
The following screenshot shows your databases listed on the Databases page.
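The console steps above can also be approximated with the AWS SDK for Python (boto3). This is a sketch under assumptions, not the method used by this post: the helper names are ours, the role ARN and S3 path are values you supply, and we assume that empty default-permission lists correspond to deselecting the two check boxes. Note that `put_data_lake_settings` replaces the existing settings, so any administrators already configured would need to be included as well.

```python
from typing import Optional


def data_lake_settings(admin_role_arn: str) -> dict:
    """Build a DataLakeSettings payload with one admin and no default grants."""
    return {
        "DataLakeAdmins": [{"DataLakePrincipalIdentifier": admin_role_arn}],
        # Assumption: empty lists match deselecting the two permissions check boxes.
        "CreateDatabaseDefaultPermissions": [],
        "CreateTableDefaultPermissions": [],
    }


def database_input(name: str, location_uri: Optional[str] = None) -> dict:
    """Build a Glue DatabaseInput; the location is omitted when left blank."""
    db = {"Name": name}
    if location_uri:
        db["LocationUri"] = location_uri
    return db


def apply_lake_formation_setup(admin_role_arn: str, s3_location: str) -> None:
    """Apply the settings with boto3 (requires credentials; not run on import)."""
    import boto3  # deferred so the builders above work without boto3 installed

    lf = boto3.client("lakeformation", region_name="us-east-1")
    glue = boto3.client("glue", region_name="us-east-1")
    lf.put_data_lake_settings(DataLakeSettings=data_lake_settings(admin_role_arn))
    # Steps 7-9: revoke database creation from the IAMAllowedPrincipals group.
    lf.revoke_permissions(
        Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        Resource={"Catalog": {}},
        Permissions=["CREATE_DATABASE"],
    )
    # create_database raises AlreadyExistsException if the database already exists.
    glue.create_database(DatabaseInput=database_input("default"))
    glue.create_database(DatabaseInput=database_input("lfoktasamlblogdb", s3_location))
```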
Setting Up the Amazon EMR Cluster
To set up your Amazon EMR cluster, you deploy the following CloudFormation template to your account:
Provide the following parameters:
- StackName –
- userBucketName –
s3://<User bucket name>
- OktaAppMetadataURL –
- SAMLProviderName –
- Realm –
- KdcAdminPassword – Must be at least eight characters containing letters, numbers, and symbols; the default value is
- ReleaseLabel –
- InstanceType –
- VPCSubnet –
- myIPCidr –
<Your IP Address>/32
- oktaUser1 –
- oktaUser2 –
- EC2KeyPair (Optional) –
<keypair, if any>
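If you prefer to launch the stack programmatically, the parameters listed above need to be converted into the key/value list that CloudFormation expects. The following sketch shows that conversion together with a check of the KdcAdminPassword rule stated above. The helper names are hypothetical, and treating "symbols" as Python's `string.punctuation` set is our assumption.

```python
import string


def kdc_password_ok(password: str) -> bool:
    """Check the stated rule: 8+ characters with letters, numbers, and symbols."""
    return (
        len(password) >= 8
        and any(c.isalpha() for c in password)
        and any(c.isdigit() for c in password)
        # Assumption: "symbols" means ASCII punctuation characters.
        and any(c in string.punctuation for c in password)
    )


def to_cfn_parameters(values: dict) -> list:
    """Convert {name: value} into the Parameters list used by create_stack."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in values.items()
    ]


# Example with the two placeholder values shown in the list above:
# to_cfn_parameters({
#     "userBucketName": "s3://<User bucket name>",
#     "myIPCidr": "<Your IP Address>/32",
# })
```

The resulting list can be passed as the `Parameters` argument of the boto3 CloudFormation client's `create_stack` call.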
The CloudFormation stack creates the following resources:
- The necessary IAM roles for AWS Lambda functions, Amazon EMR, and AWS Lake Formation.
- A Lambda function to move a sample dataset (New York taxi data) from the AWS-provided S3 bucket (source) to the user-created S3 bucket (target) and set up the IdP by uploading Okta metadata XML to IAM.
- An Amazon Athena named query to create an external table pointing to the dataset stored in the user S3 bucket.
- An Amazon EMR cluster associated with a security configuration and the appropriate IAM roles.
- A Lambda function to grant Lake Formation permissions for the Okta users: Analyst1 is granted SELECT permissions on two columns, and Analyst2 is granted SELECT permissions on all columns of the same table.
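To make the column-level grants concrete, here is a sketch of the request payloads that such a Lambda function might send to Lake Formation. It is not the code shipped in the template. The helper names, the placeholder column names `col_a` and `col_b`, and the account ID are hypothetical, and the SAML user principal format is our assumption about how Lake Formation identifies federated users.

```python
from typing import Optional, Sequence


def saml_user_principal(account_id: str, provider_name: str, user_name: str) -> str:
    """Build a principal identifier for a SAML-federated user (assumed format)."""
    return f"arn:aws:iam::{account_id}:saml-provider/{provider_name}:user/{user_name}"


def select_grant(
    principal: str,
    database: str,
    table: str,
    columns: Optional[Sequence[str]] = None,
) -> dict:
    """Build grant_permissions arguments for SELECT on some or all columns."""
    table_with_columns = {"DatabaseName": database, "Name": table}
    if columns:
        # Restrict the grant to the named columns only.
        table_with_columns["ColumnNames"] = list(columns)
    else:
        # An empty wildcard grants SELECT on every column.
        table_with_columns["ColumnWildcard"] = {}
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal},
        "Resource": {"TableWithColumns": table_with_columns},
        "Permissions": ["SELECT"],
    }


# Example (placeholder account ID and column names):
# analyst1 = saml_user_principal("111122223333", "OktaSAMLProvider", "analyst1")
# request = select_grant(analyst1, "lfoktasamlblogdb", "taxi_data", ["col_a", "col_b"])
# boto3.client("lakeformation").grant_permissions(**request)
```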
Validating the Solution
To validate our solution, complete the following steps:
1. Make sure that port 8442 is not blocked.
2. On the CloudFormation console, on the Outputs tab for your stack, record the primary DNS, SAML role ARN, and IdP ARN.
3. Update the SAML attributes in your Okta application.
4. Launch a web browser in incognito mode and open a Zeppelin notebook using the URL https://<EMR primary DNS>:8442/gateway/default/zeppelin/ (replace <EMR primary DNS> with the primary public DNS of the EMR cluster).
The Single Sign-On login page appears. Okta validates the login credentials with the system of record, like Active Directory, and returns a SAML assertion. The assertion is parsed, and the next page is displayed based on the redirect URL parameter.
5. Log in to Zeppelin as analyst1.
6. After login, choose Create a note and run this SQL statement:
spark.sql("select * from lfoktasamlblogdb.taxi_data limit 10").show()
The following screenshot shows that analyst1 can only see the two columns that you specified in Lake Formation.
7. Open another web browser in incognito mode and log in to Zeppelin as analyst2.
The same select query shows all the columns, as shown in the following screenshot.
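Steps 2 and 4 above involve reading values from the stack outputs and building the Zeppelin URL. The sketch below shows one way to do both from a `describe_stacks` response. The helper names are ours, and the actual output key names depend on the template, so look them up on the Outputs tab rather than assuming them.

```python
def outputs_to_dict(stack_description: dict) -> dict:
    """Flatten the Outputs list of one describe_stacks entry into {key: value}."""
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stack_description.get("Outputs", [])
    }


def zeppelin_url(emr_primary_dns: str) -> str:
    """Build the notebook URL in the format given in step 4."""
    return f"https://{emr_primary_dns}:8442/gateway/default/zeppelin/"


# Example (requires boto3 and credentials; the stack name is a placeholder):
# import boto3
# cfn = boto3.client("cloudformation", region_name="us-east-1")
# stack = cfn.describe_stacks(StackName="<StackName>")["Stacks"][0]
# print(outputs_to_dict(stack))
```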
Delete the AWS CloudFormation stack to clean up all the resources created for this solution. Also, clean up the Amazon S3 log bucket that you specified in the CloudFormation parameters.
Okta Adaptive Multi-factor Authentication
One way to make this access even more secure is by using Okta’s Adaptive Multi-factor Authentication (MFA), which allows for dynamic policy changes and step-up authentication in response to changes in user and device behavior, location, or other contexts.
Adaptive MFA supports detection and authentication challenges for riskier situations, such as use of weak passwords, proxy use, geographic location, new devices, and anomalous behavior.
In this post, we went through setting up and validating SAML-based authentication for an Amazon EMR-based notebook using Okta. This decouples the authentication mechanism from Amazon EMR and uses the existing system of record, like Active Directory. We showed column-level authorization using Lake Formation and how this can help you grant appropriate permissions based on a user’s role.