Anaconda Enterprise 5¶
Anaconda Enterprise is an enterprise-ready, secure and scalable data science platform that empowers teams to govern data science assets, collaborate and deploy their data science projects.
With Anaconda Enterprise, you can do the following:
Develop: ML/AI pipelines in a central development environment that scales from laptops to thousands of nodes
Govern: Complete reproducibility from laptop to cluster with the ability to configure access control
Automate: Model training and deployment on scalable, container-based infrastructure
Installing Anaconda Enterprise¶
When you initially install Anaconda Enterprise, you can install it on one to five nodes. You are not bound to that initial configuration, however. After completing the installation, you can add or remove nodes on the cluster as needed, including GPUs.
When you’ve determined an initial topology for your cluster, follow this high-level process to install Anaconda Enterprise:
Installation requirements¶
For your Anaconda Enterprise installation to complete successfully, your systems must meet the requirements outlined below. The installation requirements for Anaconda Enterprise are the same whether you choose to install the platform on-premises, on hosted vSphere, or on a cloud server. There are cloud-specific requirements related to performance, however, so ensure your chosen cloud platform meets the minimum specifications outlined here before you begin.
The installer performs pre-flight checks, and only allows installation to continue on nodes that are configured correctly and include the required kernel modules. If you want to perform the system check yourself before installation, you can run the gravity check command on your intended master and worker nodes after you download and extract the installer.
When you initially install Anaconda Enterprise, you can install the cluster on one to five nodes. You are not bound to that initial configuration, however. After completing the installation, you can add or remove nodes on the cluster as needed. For more information, see Adding and removing nodes.
A rule of thumb for sizing your system is 1 CPU core, 1 GB of RAM, and 5 GB of disk space for each project session or deployment. For example, supporting 20 concurrent sessions would call for roughly 20 CPU cores, 20 GB of RAM, and 100 GB of disk space beyond the base platform requirements. For more information about sizing for a particular component, see the following minimum requirements:
To use Anaconda Enterprise with a cloud platform, refer to Cloud performance requirements for cloud-specific performance requirements.
To use Spark Hadoop data sources with Anaconda Enterprise, refer to Installing Livy server for Hadoop Spark access and Configuring Livy server for Hadoop Spark access.
To verify your systems meet the requirements, see Verifying system requirements.
Hardware requirements
The following are minimum specifications for the master and worker nodes, as well as the entire cluster:
| Master node | Minimum |
|---|---|
| CPU | 32 cores |
| RAM | 64GB |
| Disk space in /opt/anaconda | 500GB* |
| Disk space in /var/lib/gravity | 300GB** |
| Disk space in /tmp or $TMPDIR | 50GB |

| Worker nodes | Minimum |
|---|---|
| CPU | 16 cores |
| RAM | 64GB |
| Disk space in /var/lib/gravity | 300GB |
| Disk space in /tmp or $TMPDIR | 50GB |

| Cluster totals | Minimum |
|---|---|
| CPU | 96 cores |
| RAM | 128GB |
*NOTES regarding the minimum disk space in /opt/anaconda:
This total includes project and package storage (including mirrored packages).
Currently, /opt and /opt/anaconda must be an ext4 or xfs filesystem, and cannot be an NFS mountpoint. Subdirectories of /opt/anaconda may be mounted through NFS. See Mounting an NFS share for more information.
If you are installing Anaconda Enterprise on an xfs filesystem, it needs to support d_type to work properly. If your XFS filesystem has been formatted with the -n ftype=0 option, it won't support d_type, and will therefore need to be recreated using a command similar to the following before installing Anaconda Enterprise:
mkfs.xfs -n ftype=1 /path/to/your/device
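If you're unsure whether an existing XFS filesystem already supports d_type, you can check before reformatting; a quick sketch, assuming /opt/anaconda is the mount point in question:

# ftype=1 in the output means d_type is supported; ftype=0 means the
# filesystem must be recreated as shown above
xfs_info /opt/anaconda | grep ftype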
**NOTES regarding the minimum disk space in /var/lib/gravity:
This volume MUST be mounted on local storage. Core components of Kubernetes run from this directory, some of which are extremely intolerant of disk latency. Network-Attached Storage (NAS) and Storage Area Network (SAN) solutions are susceptible to latency, and are therefore not supported.
This total includes additional space to accommodate upgrades. We recommend having this space available during installation, as it can be difficult to add space after the fact.
We strongly recommend that you set up the /opt/anaconda and /var/lib/gravity partitions using Logical Volume Management (LVM), to provide the flexibility needed to accommodate easier future expansion.
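For example, a minimal LVM sketch under these assumptions: two spare disks named /dev/sdb and /dev/sdc (hypothetical; adjust for your hardware) and a volume group name, vg_anaconda, chosen for illustration:

# Create physical volumes and a volume group from the spare disks
sudo pvcreate /dev/sdb /dev/sdc
sudo vgcreate vg_anaconda /dev/sdb /dev/sdc
# Carve out logical volumes sized to the documented minimums
sudo lvcreate -L 500G -n lv_opt_anaconda vg_anaconda
sudo lvcreate -L 300G -n lv_gravity vg_anaconda
# Format with xfs (ftype=1 for d_type support) and mount
sudo mkfs.xfs -n ftype=1 /dev/vg_anaconda/lv_opt_anaconda
sudo mkfs.xfs -n ftype=1 /dev/vg_anaconda/lv_gravity
sudo mkdir -p /opt/anaconda /var/lib/gravity
sudo mount /dev/vg_anaconda/lv_opt_anaconda /opt/anaconda
sudo mount /dev/vg_anaconda/lv_gravity /var/lib/gravity

Add matching /etc/fstab entries so the mounts persist across reboots. With LVM, you can later grow either volume using lvextend followed by xfs_growfs.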
To check the number of cores, run nproc.
Disk IOPS requirements
Master and worker nodes require a minimum of 3000 concurrent input/output operations per second (IOPS); installation will fail on nodes with fewer than 3000 concurrent IOPS. Cloud providers report concurrent disk IOPS.
Hard disk manufacturers report sequential IOPS, which differ from concurrent IOPS. On-premises installations require servers with disks that support a minimum of 50 sequential IOPS. We recommend using SSDs or better.
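If you want to measure IOPS yourself before installing, one option is the fio benchmarking tool; this is a sketch, not part of the installer, and assumes you've installed fio from your distribution's package manager. Compare the reported write IOPS against the 3000 minimum:

# 4K random-write test against the gravity volume (creates a 1GB test file)
sudo fio --name=iops-test --directory=/var/lib/gravity \
    --rw=randwrite --bs=4k --size=1G --numjobs=1 --iodepth=32 \
    --direct=1 --ioengine=libaio --runtime=60 --time_based --group_reporting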
Storage and memory requirements
Approximately 50GB of available free space on each node is required for the Anaconda Enterprise installer to temporarily decompress files to the /tmp directory during the installation process.
If adequate free space is not available in the /tmp directory, you can specify the location of the temporary directory to be used during installation by setting the TMPDIR environment variable to a different location.
EXAMPLE:
sudo TMPDIR=/tmp2 ./gravity install
Note
When using sudo to install, the temporary directory must be set explicitly in the command line to preserve TMPDIR. The master node and each worker node all require a temporary directory of the same size, and should each use the TMPDIR variable as needed.
To check your available disk space, use the built-in Linux df utility with the -h parameter for human readable format:
df -h /var/lib/gravity
df -h /opt/anaconda
df -h /tmp
# or
df -h $TMPDIR
To show the free memory size in GB, run:
free -g
Operating system requirements
Anaconda Enterprise cannot be installed on a cluster with heterogeneous operating system versions. Before installing, verify that all cluster nodes are running the same version of the OS.
Anaconda Enterprise currently supports the following Linux versions:
RHEL/CentOS 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 8.0
Ubuntu 16.04
SUSE 12 SP2, 12 SP3 (Requirement: set DefaultTasksMax=infinity in /etc/systemd/system.conf)
To find your operating system version, run cat /etc/*release* or lsb_release -a.
Optionally, create a new directory and set TMPDIR. User 1000 (or the UID for the service account) needs to be able to write to this directory; that is, they must be able to read, write, and execute on $TMPDIR. For example, to give write access to UID 1000, run the following command:
sudo chown 1000 -R $TMPDIR
Note
When installing Anaconda Enterprise on a system with multiple nodes, verify that the clock of each node is in sync with the others prior to starting the installation process, to avoid potential issues. We recommend using the Network Time Protocol (NTP) to synchronize computer system clocks automatically over a network. See instructions here.
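For example, on RHEL/CentOS 7 you can synchronize clocks with either chrony or ntpd; a sketch, assuming the default public NTP pool is reachable from your nodes:

# Using chrony (the default time service on RHEL/CentOS 7)
sudo yum install -y chrony
sudo systemctl enable --now chronyd
chronyc tracking        # verify the node is synchronized

# Or using classic ntpd
sudo yum install -y ntp
sudo systemctl enable --now ntpd
ntpstat                 # reports "synchronised" when in sync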
Security requirements
Verify you have sudo access.
Make sure that the firewall is permanently set to keep the required ports open, and will save these settings across reboots. Then restart the firewall to load these settings immediately.
Various tools may be used to configure firewalls and open required ports, including iptables, firewall-cmd, susefirewall2, and others.
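For example, on systems using firewalld, a sketch that permanently opens the externally accessible ports listed under Network requirements below (the full list also includes install-only and internal cluster ports):

# Open the Anaconda Enterprise UI and Operations Center ports permanently
sudo firewall-cmd --permanent --add-port=80/tcp
sudo firewall-cmd --permanent --add-port=443/tcp
sudo firewall-cmd --permanent --add-port=32009/tcp
# Reload so the rules take effect immediately
sudo firewall-cmd --reload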
For all CentOS and RHEL nodes:
Ensure that SELinux is not in enforcing mode, by either disabling it or putting it in permissive mode in the /etc/selinux/config file.
After rebooting, run the following command to verify that SELinux is not being enforced:
getenforce
The result should be either Disabled or Permissive.
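For example, a sketch that switches SELinux to permissive mode immediately and persists the change across reboots (assuming /etc/selinux/config currently reads SELINUX=enforcing):

# Switch to permissive mode for the running system
sudo setenforce 0
# Persist the change so it survives reboots
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
getenforce   # should now report Permissive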
Kernel module requirements
The Anaconda Enterprise installer checks to see if the following modules required for Kubernetes to function properly are present, and alerts you if any are not loaded:
| Linux Distribution | Version | Modules |
|---|---|---|
| CentOS | 7.2 | bridge, ebtables, iptable_filter, overlay |
| RedHat Linux | 7.2 | bridge, ebtables, iptable_filter |
| CentOS | 7.3, 7.4, 7.5, 7.6, 7.7, 8.0 | br_netfilter, ebtables, iptable_filter, overlay |
| RedHat Linux | 7.3, 7.4, 7.5, 7.6, 7.7, 8.0 | br_netfilter, ebtables, iptable_filter, overlay |
| Ubuntu | 16.04 | br_netfilter, ebtables, ebtable_filter, iptable_filter, overlay |
| SUSE | 12 SP2, 12 SP3 | br_netfilter, ebtables, iptable_filter, overlay |
| Module name | Purpose |
|---|---|
| bridge | Required for Kubernetes iptables-based proxy to work correctly |
| br_netfilter | Required for Kubernetes iptables-based proxy to work correctly |
| overlay | Required to use overlay or overlay2 Docker storage driver |
| ebtables | Required to allow a service to communicate back to itself via internal load balancing when necessary |
| iptable_filter | Required to make sure that the firewall rules that Kubernetes sets up function properly |
| iptable_nat | Required to make sure that the firewall rules that Kubernetes sets up function properly |
To check if a particular module is loaded, run the following command:
lsmod | grep <module_name>
If the command doesn’t produce any result, the module is not loaded.
Run the following command to load the module:
sudo modprobe <module_name>
If your system does not load modules at boot, run the following—for each module—to ensure they are loaded upon reboot:
echo '<module_name>' | sudo tee /etc/modules-load.d/<module_name>.conf
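As a convenience, a sketch that loads and persists all required modules in one pass, using the module list for CentOS/RHEL 7.3 and later from the table above (substitute the list for your distribution):

# Load each module now and persist it for future boots
for mod in br_netfilter ebtables iptable_filter overlay; do
    sudo modprobe "$mod"
    echo "$mod" | sudo tee /etc/modules-load.d/"$mod".conf > /dev/null
done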
System control settings
Anaconda Enterprise requires the following sysctl settings to function properly:
| System setting | Purpose |
|---|---|
| net.bridge.bridge-nf-call-iptables | Works with bridge kernel module to ensure Kubernetes iptables-based proxy works correctly |
| net.bridge.bridge-nf-call-ip6tables | Works with bridge kernel module to ensure Kubernetes iptables-based proxy works correctly |
| fs.may_detach_mounts | Can cause conflicts with the docker daemon, and leave pods in stuck state if not enabled |
| net.ipv4.ip_forward | Required for internal load balancing between servers to work properly |
| fs.inotify.max_user_watches | Set to 1048576 to improve cluster longevity |
Run the following command for each setting to apply it immediately, using the value 1 (or 1048576 for fs.inotify.max_user_watches):
sudo sysctl -w <system_setting>=1
To persist system settings on boot, run the following for each setting:
echo "<system_setting> = 1" | sudo tee /etc/sysctl.d/10-<system_setting>.conf
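Alternatively, a sketch that applies and persists all five settings from a single consolidated sysctl.d file (load the bridge-related kernel modules first, or the net.bridge.* keys won't exist):

# Write all required settings to one file
sudo tee /etc/sysctl.d/10-anaconda-enterprise.conf > /dev/null <<'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
fs.may_detach_mounts = 1
net.ipv4.ip_forward = 1
fs.inotify.max_user_watches = 1048576
EOF
# Load the new settings immediately
sudo sysctl --system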
Verifying system requirements
Anaconda Enterprise performs system checks during the install
to verify CPU, RAM and other system requirements. The system checks
can also be performed manually before the installation using the following commands
from the installer directory, ~/anaconda-enterprise-<installer-version>.
Note
You can perform this check after downloading and extracting the installer.
To perform system checks on a master node, run the following command as sudo or root user:
sudo ./gravity check --profile ae-master
To perform system checks on a worker node, run the following command as sudo or root user:
sudo ./gravity check --profile ae-worker
If all of the system checks pass and all requirements are met, the output from the above commands will be empty. If the system checks fail and some requirements are not met, the output will indicate which system checks failed.
GPU requirements
To use GPUs with Anaconda Enterprise, you'll need to install version 9.2 or 10.0 of the NVIDIA CUDA driver on the host operating system of any GPU worker nodes. You can install the drivers using the package manager or the NVIDIA runfile, using rpm (local) or rpm (network) for SLES, CentOS, and RHEL, or using deb (local) or deb (network) for Ubuntu.
GPU deployments should use one of the following models:
Tesla V100 (recommended)
Tesla P100 (adequate)
Network requirements
Anaconda Enterprise requires the following network ports to be externally accessible:
| Port | Protocol | Description |
|---|---|---|
| 80 | TCP | Anaconda Enterprise UI (plaintext) |
| 443 | TCP | Anaconda Enterprise UI (encrypted) |
| 32009 | TCP | Operations Center Admin UI |
These ports need to be externally accessible during installation only, and can be closed after completing the install process:
| Port | Protocol | Description |
|---|---|---|
| 4242 | TCP | Bandwidth checker utility |
| 61009 | TCP | Install wizard UI access required during cluster installation |
| 61008, 61010, 61022-61024 | TCP | Installer agent ports |
The following ports are used for cluster operation, and therefore must be open internally, between cluster nodes:
| Port | Protocol | Description |
|---|---|---|
| 53 | TCP and UDP | Internal cluster DNS |
| 2379, 2380, 4001, 7001 | TCP | Etcd server communication |
| 3008-3012 | TCP | Internal Anaconda Enterprise service |
| 3022-3025 | TCP | Teleport internal SSH control panel |
| 3080 | TCP | Teleport Web UI |
| 5000 | TCP | Docker registry |
| 6443 | TCP | Kubernetes API Server |
| 6990 | TCP | Internal Anaconda Enterprise service |
| 7496, 7373 | TCP | Peer-to-peer health check |
| 7575 | TCP | Cluster status gRPC API |
| 8081, 8086-8091, 8095 | TCP | Internal Anaconda Enterprise service |
| 8472 | UDP | Overlay network |
| 9080, 9090, 9091 | TCP | Internal Anaconda Enterprise service |
| 10248-10250, 10255 | TCP | Kubernetes components |
| 30000-32767 | TCP | Kubernetes internal services range |
You’ll also need to update your firewall settings to ensure that the 10.244.0.0/16 pod subnet and 10.100.0.0/16 service subnet are accessible to every node in the cluster, and grant all nodes the ability to communicate via their primary interface.
For example, if you’re using iptables:
iptables -A INPUT -s 10.244.0.0/16 -j ACCEPT
iptables -A INPUT -s 10.100.0.0/16 -j ACCEPT
iptables -A INPUT -s <node_ip> -j ACCEPT
Where <node_ip> specifies the internal IP address(es) used by all nodes in the cluster to connect to the AE5 master.
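Note that rules added with iptables -A don't survive a reboot on their own; persistence is distro-specific. A sketch for RHEL/CentOS 7, assuming the iptables-services package provides /etc/sysconfig/iptables:

# Save the running ruleset so it's restored at boot
sudo iptables-save | sudo tee /etc/sysconfig/iptables > /dev/null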
If you plan to use online package mirroring, you’ll need to whitelist the following domains:
repo.anaconda.com
anaconda.org
conda.anaconda.org
binstar-cio-packages-prod.s3.amazonaws.com
If any Anaconda Enterprise users will use the local graphical program Anaconda Navigator in online mode, they will need access to these sites, which may need to be whitelisted in your network’s firewall settings.
https://repo.anaconda.com (or for older versions of Navigator and Conda, https://repo.continuum.io)
https://conda.anaconda.org if any users will use conda-forge and other channels on Anaconda Cloud (anaconda.org)
https://vscode-update.azurewebsites.net/ if any users will install Visual Studio Code
google-public-dns-a.google.com (8.8.8.8:53) to check internet connectivity with Google Public DNS
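To verify these domains are reachable from a node before you begin, a quick sketch using curl and dig:

# Confirm DNS resolution and HTTPS reachability for each mirroring domain
for host in repo.anaconda.com anaconda.org conda.anaconda.org \
            binstar-cio-packages-prod.s3.amazonaws.com; do
    curl -sI "https://$host" > /dev/null && echo "$host reachable" || echo "$host BLOCKED"
done
# Check internet connectivity via Google Public DNS
dig @8.8.8.8 repo.anaconda.com +short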
TLS/SSL certificate requirements
Anaconda Enterprise uses certificates to provide transport layer security for the cluster. To get you started, self-signed certificates are generated during the initial installation. You can configure the platform to use organizational TLS/SSL certificates after completing the installation.
You may purchase certificates commercially, or generate them using your organization’s internal public key infrastructure (PKI) system. When using an internal PKI-signed setup, the CA certificate is inserted into the Kubernetes secret.
In either case, the configuration will include the following:
a certificate for the root certificate authority (CA),
an intermediate certificate chain,
a server certificate, and
a private server key.
See Updating TLS/SSL certificates for more information.
DNS requirements
Web browsers use domain names and web origins to separate sites, so that sites cannot tamper with each other. Anaconda Enterprise includes deployments from many users; if these deployments had addresses on the same domain, such as https://anaconda.yourdomain.com/apps/001 and https://anaconda.yourdomain.com/apps/002, one app could access the cookies of the other, and JavaScript in one app could access the other app.
To prevent this potential security risk, Anaconda assigns deployments unique addresses such as https://uuid001.anaconda.yourdomain.com and https://uuid002.anaconda.yourdomain.com, where yourdomain.com is replaced with your organization's domain name, and uuid001 and uuid002 are replaced with dynamically generated universally unique identifiers (UUIDs).
To facilitate this, Anaconda Enterprise requires the use of wildcard DNS entries that apply to a set of domain names such as *.anaconda.yourdomain.com.
For example, if you are using the fully qualified domain name (FQDN) anaconda.yourdomain.com with a master node IP address of 12.34.56.78, the DNS entries would be as follows:
anaconda.yourdomain.com IN A 12.34.56.78
*.anaconda.yourdomain.com IN A 12.34.56.78
The wildcard subdomain’s DNS entry points to the Anaconda Enterprise master node.
The master node’s hostname and the wildcard domains must be resolvable with DNS
from the master nodes, the worker nodes, and the end user machines. To ensure
the master node can resolve its own hostname, any /etc/hosts entries used
must be propagated to the gravity environment.
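To spot-check the DNS configuration from the master node, worker nodes, and an end-user machine, a sketch using dig (test-uuid is an arbitrary placeholder subdomain; both queries should return the master node's IP, 12.34.56.78 in the example above):

dig +short anaconda.yourdomain.com
dig +short test-uuid.anaconda.yourdomain.com
# getent exercises the same resolver path the system itself uses,
# including any /etc/hosts entries
getent hosts anaconda.yourdomain.com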
Existing installations of dnsmasq will conflict with Anaconda Enterprise. If dnsmasq is installed on the master node or any worker nodes, you’ll need to remove it from all nodes before installing Anaconda Enterprise.
Run the following commands to ensure dnsmasq is stopped and disabled:
To stop dnsmasq: sudo systemctl stop dnsmasq
To disable dnsmasq: sudo systemctl disable dnsmasq
To verify dnsmasq is disabled: sudo systemctl status dnsmasq
Browser requirements
Anaconda Enterprise supports the following web browsers:
Chrome 39+
Firefox 49+
Safari 10+
The minimum browser screen size for using the platform is 800 pixels wide and 600 pixels high.
Note
JupyterLab and Jupyter Notebook don’t currently support Internet Explorer, so Anaconda Enterprise users will have to use another editor for their Notebook sessions if they choose to use that browser to access the AE platform.
Cloud performance requirements¶
The installation requirements for Anaconda Enterprise are the same whether you choose to install the platform on-premises, on hosted vSphere, or on a cloud server. The only cloud-specific requirements for running Anaconda Enterprise relate to performance, so ensure your chosen cloud platform meets these minimum specifications before you begin:
Amazon Web Services (AWS)
Due to etcd’s sensitivity to disk latency, only use EC2 instances with a minimum of 3000 IOPS. We recommend an instance type no smaller than m4.4xlarge for both master and worker nodes.
Microsoft Azure
To meet CPU and disk I/O requirements, the minimum size for the selected VM should be Standard D16s v3 (16 VCPUs, 64 GB memory).
Google Cloud Platform (GCP)
No requirements for installing Anaconda Enterprise are unique to Google Cloud Platform.
After you’ve verified that your system meets these performance requirements—as well as all system requirements—you are ready to install the cluster.
Pre-install checklist¶
It’s essential that the systems in your environment where you will install Anaconda Enterprise meet ALL of the requirements outlined here. The installer performs some pre-flight checks, and only allows installation to continue on nodes that are configured correctly, but doesn’t verify all requirements are met.
We’ve created this pre-install checklist to help you verify that you’ve accounted for everything before you begin, and ensure your installation is successful. Consider printing out a copy and physically checking off items as you go.
Note
We’ve also packaged a pre-flight script that you can use to verify whether the systems on which you plan to install Anaconda Enterprise meet the minimum requirements to install successfully. Follow these instructions to install and run the script.
[ ] I’ve verified that all nodes in the cluster meet the minimum or recommended specifications for CPU, RAM and disk space.
[ ] I’ve verified that all nodes in the cluster meet the minimum IOPS required for reliable performance.
[ ] I’ve verified that there is 50GB of free space available in the /tmp directory (or another temporary directory to be used during installation) on each node in the cluster.
[ ] I’ve verified that all cluster nodes are operating the same version of the OS, and that the OS version is supported by Anaconda Enterprise.
[ ] I’ve used the Network Time Protocol (NTP) to synchronize computer system clocks, and I’ve verified that the clock of each node in the cluster is in sync with the others. (Instructions for using NTP are provided here.)
[ ] I’ve verified that I have sudo access on all systems, and the firewall is configured correctly.
[ ] I’ve verified that all required kernel modules are loaded.
[ ] I’ve verified that the system control settings are set correctly.
[ ] I’ve verified that any GPUs to be used with Anaconda Enterprise have a supported NVIDIA CUDA driver installed.
[ ] I’ve verified that the system meets all network port requirements, whether the specified ports need to be open internally, externally, or during installation only.
[ ] I’ve verified that any firewalls used for network security have been temporarily disabled, for the window of time when Anaconda Enterprise is being installed.
[ ] I’ve verified that the domains required for online package mirroring have been whitelisted, if applicable.
[ ] If I intend to replace the self-signed certificates generated during installation with others, I’ve gathered all the information and files for the TLS/SSL certificates I will use.
[ ] I’ve verified that the Anaconda Enterprise domain is resolved to the IP address of the master node, whether through an address (A) record or canonical name (CNAME).
[ ] I’ve verified that any wildcard DNS entries for my organization’s domain names meet the DNS requirements outlined here.
[ ] I’ve verified that the /etc/resolv.conf file on all the nodes DOES NOT include the rotate option.
[ ] I’ve verified that any existing installations of Docker (and dockerd), dnsmasq, and lxd have been removed from all nodes, as they will conflict with Anaconda Enterprise.
[ ] I’ve verified that all web browsers to be used to access Anaconda Enterprise are supported by the platform.
Installing the cluster¶
After you have determined the initial topology for your Anaconda Enterprise cluster, and verified that your system meets all of the installation requirements, you’re ready to install the cluster.
Before you begin:
Note
If you haven’t already, consider using the pre-install checklist provided, to verify that you’ve accounted for everything before you begin.
By default, Anaconda Enterprise installs using a service account with the user ID (UID) 1000. You can change the UID of the service account by using the --service-uid option or the GRAVITY_SERVICE_USER environment variable at installation time. To do so, you need to have first created a group for that user with the UID.
For example, to use UID 1001, run the following commands on each node of the cluster:
root$ groupadd mygroup -g 1001
root$ useradd --no-create-home -u 1001 -g mygroup myuser
The installer uses the TMPDIR directory that's configured on the master node, so be sure the default directory contains sufficient space, or create an alternate directory (with sufficient space) for the installer to use. If you choose to use an alternate directory, ensure it has the correct permissions enabled (drwxrwxrwx), and either add it to /etc/environment or explicitly specify the directory during installation.
Determine your install method
The method you use to install the cluster will vary, depending on your ability to access the target machine. If you have network access to the target machine, we recommend you install Anaconda Enterprise using a web browser. Otherwise, you’ll need to use a command line.
With both methods, you can install anywhere from one to five nodes. You can also add or remove nodes at any time after installation. For more information, see Adding and removing nodes.
If the cluster where you will install AE cannot connect to the internet, follow the instructions for Installing in an air-gapped environment.
Using a web browser (recommended)
On the master node, download and decompress the installer, replacing <location_of_installer> with the location of the installer, and <version> with your installer version:
curl -O <location_of_installer>.tar.gz
tar xvf anaconda-enterprise-<version>.tar.gz
cd anaconda-enterprise-<version>
On the master node, run the pre-installation system checks as sudo or root user before proceeding with the installation:
sudo ./gravity check --profile ae-master
To perform system checks on a worker node, run the following command as sudo or root user:
sudo ./gravity check --profile ae-worker
If all of the system checks pass and all requirements are met, the output from the above commands will be empty. If the system checks fail and some requirements are not met, the output will indicate which system checks failed.
After doing the pre-installation system checks, run the installer on the master node as sudo or root user:
sudo ./gravity wizard
Note
If you're using a service account UID that's different from the default 1000, append the command with the actual UID. For example, to use UID 1001, run sudo ./gravity wizard --service-uid=1001.
If you're using an alternate TMPDIR, prepend the command with the directory. For example, sudo TMPDIR=/mytmp ./gravity wizard.
Tue Oct 29 14:22:22 UTC Starting enterprise installer
To abort the installation and clean up the system,
press Ctrl+C two times in a row.
If you get disconnected from the terminal, you can reconnect to the installer
agent by issuing 'gravity resume' command.
If the installation fails, use 'gravity plan' to inspect the state and
'gravity resume' to continue the operation.
See https://gravitational.com/gravity/docs/cluster/#managing-an-ongoing-operation for details.
Tue Oct 29 14:22:22 UTC Connecting to installer
Tue Oct 29 14:22:27 UTC Connected to installer
Tue Oct 29 14:22:28 UTC Starting web UI install wizard
Tue Oct 29 14:22:28 UTC Open this URL in browser: https://172.31.67.113:61009/web/installer/new/gravitational.io/AnacondaEnterprise/5.4.0-36.gdf45da616?install_token=9954bf9f357b0eff8d2d2a4a48c8d9e6
Tue Oct 29 14:22:28 UTC Waiting for the operation to start
To start the browser-based install, copy the full URL that is generated into your browser. Ensure that you are connecting to the public network interface.
NOTES:
- If you're using an alternate TMPDIR and DID NOT add it to /etc/environment, edit the copied URL to include the directory in the sudo bash command. For example, sudo TMPDIR=/mytmp bash.
- If you're unable to connect to the URL due to security measures in place at your organization, select File > New Incognito Window to launch the installer.
The installer will install a self-signed TLS/SSL certificate, so you can click the link at the bottom of this warning message to proceed:
After reviewing the License Agreement, check I Agree To The Terms and click Accept.
Enter the name to use for your deployment in the Cluster Name field. The Bare Metal option is already selected, so you can click Continue.
Select the number of nodes—between one and five—that you want to install in the cluster. One node will act as the master node, and any remaining nodes will be worker nodes. See Fault tolerance for more information on how to size your cluster.
On each node where you plan to install Anaconda Enterprise, copy and run the command that's provided, as it applies to the master node and any worker nodes. As you run the command on each node, that node's host name appears in the list of nodes.
Use the IP Address drop-down to select the IP address for each node.
Accept the default directory for installing application data (/opt/anaconda/) or enter another location.
After all nodes are listed, click Continue. This process can take approximately 20 minutes to complete.
Note
To view the install logs, click the EXECUTABLE LOGS pulldown
at the bottom of the panel.
When the installation is complete, the following screen is displayed:
Click Continue & Finish Setup to proceed to Post-install configuration.
Note
The installer running in the terminal will note that installation is complete and that you can stop the installer process. Do not do so until you have completed the post-install configuration.
Using a command line
If you cannot connect to the server from a browser—because you’re installing from a different network, for example—you can install Anaconda Enterprise using a command line.
On each node in the cluster, download and decompress the installer, replacing <location_of_installer> with the location of the installer, and <version> with your installer version:
curl -O <location_of_installer>.tar.gz
tar xvf anaconda-enterprise-<version>.tar.gz
cd anaconda-enterprise-<version>
On the master node, run the pre-installation system checks—as sudo or root user—before proceeding with the installation:
sudo ./gravity check --profile ae-master
Create a file named values.yaml with the following values, replacing HOSTNAME with the fully-qualified domain name (FQDN) of the host server:
apiVersion: v1
kind: ConfigMap
metadata:
  name: anaconda-enterprise-install
data:
  values: |
    hostname: HOSTNAME
    generateCerts: true
    keycloak:
      includeMasterRealm: true
After running the pre-installation system checks and creating the YAML file, run the following command on the master node as sudo or root user, where you replace:
The advertise-addr IP address with the address you want to be visible to the other nodes
CLUSTERNAME with a name; otherwise a random cluster name will be assigned
/path/to/values.yaml with the path to the values.yaml file you created
For flavor, choose from the following options the one that represents the number and type of nodes you want to install in the cluster:
small: installs a single-node cluster (one ae-master node). This is the default flavor.
medium: installs three nodes (one ae-master node and two ae-worker nodes)
large: installs five nodes (one ae-master node, two k8s-master nodes and two ae-worker nodes)
sudo ./gravity install --advertise-addr=192.168.1.1 --token=anaconda-enterprise --cluster=CLUSTERNAME --flavor=small --config /path/to/values.yaml
NOTES:
If you’re using a service account UID that’s different than the default 1000, append the command with the actual UID. For example, to use UID 1001, run:
sudo ./gravity install --advertise-addr=192.168.1.1 --token=anaconda-enterprise --cluster=CLUSTERNAME --flavor=small --config /path/to/values.yaml --service-uid=1001
-or-
sudo GRAVITY_SERVICE_USER=1001 ./gravity install --advertise-addr=192.168.1.1 --token=anaconda-enterprise --cluster=CLUSTERNAME --flavor=small --config /path/to/values.yaml
If you're using an alternate TMPDIR, prepend the command with the directory. For example:
sudo TMPDIR=/mytmp ./gravity install --advertise-addr=192.168.1.1 --token=anaconda-enterprise --cluster=CLUSTERNAME --flavor=small --config=/path/to/values.yaml
The command line displays the installer’s progress:
* [0/100] starting installer
* [0/100] preparing for installation... please wait
* [0/100] application: AnacondaEnterprise:5.2.x
* [0/100] starting non-interactive install
* [0/100] still waiting for 1 nodes of role "worker" to join
* [0/100] still waiting for 1 nodes of role "worker" to join
* [0/100] still waiting for 1 nodes of role "worker" to join
* [0/100] initializing the operation
* [20/100] configuring packages
* [50/100] installing software
If you're installing AE on AWS, use the --cloud-provider option when installing the master. The installer automatically detects EC2 and uses the VPC-based flannel backend instead of VXLAN. To force the use of VXLAN, use the --cloud-provider generic option.
On each worker node, run the following command, replacing the advertise-addr IP address with the address you want to be visible to the other nodes:
sudo ./gravity join 192.168.1.1 --advertise-addr=192.168.1.2 --token=anaconda-enterprise --role=ae-worker
The command line displays the installer’s progress:
* [0/100] joining cluster
* [0/100] connecting to cluster
* [0/100] connected to installer at 192.168.1.1
* [0/100] initializing the operation
* [20/100] configuring packages
* [50/100] installing software
This process takes approximately 20 minutes.
After you’ve finished installing Anaconda Enterprise, you’ll need to create a local user account and password to log into the Anaconda Enterprise Operations Center.
First, enter the Anaconda Enterprise environment on any of the master or worker nodes:
sudo gravity enter
Then, run the following command to create a local user account and password for the Anaconda Enterprise Operations Center, replacing <your-email> and <your-password> with the email address and password you want to use.
Note
Passwords must be at least six characters long.
gravity --insecure user create --type=admin --email=<your-email> --password=<your-password> --ops-url=https://gravity-site.kube-system.svc.cluster.local:3009
Installing in an air-gapped environment
If the cluster where you will install Anaconda Enterprise cannot connect to the internet, follow these instructions:
Download the installer tarball file to a jumpbox or USB key.
Move the installer tarball file to a designated head node in the cluster.
Untar the installer file and run sudo ./gravity wizard for browser-based installation or sudo ./gravity install for CLI-based installation.
Installation and post-install configuration steps are the same for air-gapped and internet-connected installations, so you can continue the installation process from this point, choosing your preferred method:
Browser installation (must be on the same network as the target machines)
Post-install configuration
After completing either installation path, complete the post-install configuration steps.
Post-install configuration¶
There are a few platform settings that need to be updated after installing Anaconda Enterprise, before you can begin using it. Follow the instructions below, based on whether you used a web browser or a command-line to install the platform. Then you’ll be ready to test your installation and perform additional configuration, specific to your organization.
Browser-based instructions¶
If you installed Anaconda Enterprise using a web browser, a UI will guide you through some post-install configuration steps.
Note
It may take a moment for the Post-Install Setup screen to appear. If you see an error immediately after clicking Continue at the end of the installation process, please refresh your browser after a few seconds to display the UI.
Enter the cluster Admin account credentials that you will use to log in to the Anaconda Enterprise Operations Center initially, and click Next. (You can change these, or authorize additional Operations Center Admins, as needed.)
Note
The installer will generate self-signed SSL certificates that you can use temporarily to get started. See Updating TLS/SSL certificates for information on how to change them later, if desired.
Enter the fully-qualified domain name (FQDN) where the cluster will be accessed and click Finish Setup.
Log in to the Anaconda Enterprise Operations Center using the cluster Admin credentials you provided in Step 1, and follow the instructions below to update the platform settings with the FQDN of the host server.
Command-line instructions¶
If you performed an unattended installation using the command-line instructions, follow the instructions below to generate self-signed SSL certificates that you can use temporarily to get started. See Updating TLS/SSL certificates for information on how to change them later, if desired.
Note
You need to have OpenJDK installed to be able to use the following method to generate self-signed SSL certificates.
On the master node for your Anaconda Enterprise installation, run the following commands to save your secrets file to a location where Anaconda Enterprise can access it, replacing YOUR_FQDN with the fully-qualified domain name of the cluster on which you installed Anaconda Enterprise:
cd path/to/Anaconda/Enterprise/unpacked/installer
cd DIY-SSL-CA
bash create_noprompt.sh YOUR_FQDN
cp out/DESIRED_FQDN/secret.yml /var/lib/gravity/planet/share/secret.yml
Now /var/lib/gravity/planet/share/secret.yml is accessible as /ext/share/secret.yml within the Anaconda Enterprise environment, which can be accessed with the following command:
sudo gravity enter
Replace the default secrets cert with the contents of your secret.yml file by running the following commands from within the Anaconda Enterprise environment:
$ kubectl delete secrets anaconda-enterprise-certs
secret "anaconda-enterprise-certs" deleted
$ kubectl create -f /ext/share/secret.yml
secret "anaconda-enterprise-certs" created
Note
If the post-install process doesn’t complete after using the CLI install, you can complete the process by running the following command within gravity.
To complete the post-install process:
gravity --insecure site complete
Now you are ready to follow the instructions below to test your installation.
Testing your installation¶
After you’ve finished installing Anaconda Enterprise, and completed the post-install configuration steps, you can do the following to verify that your installation succeeded:
Access the Anaconda Enterprise console by entering the URL of your AE server in a web browser: https://anaconda.example.com, replacing anaconda.example.com with the fully-qualified domain name of the host server.
Log in with the default username and password anaconda-enterprise/anaconda-enterprise. After testing your installation, update the credentials for this default login. See Configuring user access for more information.
You can verify a successful installation by doing any or all of the following:
Creating a new project and starting an editing session
Deploying a project
Generating a token from a deployment
Note
Some of the sample projects can only be deployed after mirroring the package repository. To test your installation without doing this first, you can deploy the “Hello Anaconda Enterprise” sample project.
Next steps:
Now that you’ve completed these essential steps, you can do any of the following optional steps:
Updating TLS/SSL certificates¶
You can replace the self-signed certificates generated as part of the initial post-install configuration at any time.
Before you begin, follow the processes outlined below. Then you can update the Anaconda Enterprise platform to use your own certificates using the Anaconda Enterprise Admin Console or the command line.
Before you begin:
Ask all users to save their work, stop any sessions and deployments, and log out of the platform while you update the certificates.
Backup your current Anaconda Enterprise configuration following the backup process.
Gather all of the following information and files related to your certificates, so you have it available to copy and paste from in the procedure that follows:
Registered domain name for the server
SSL certificate for servername.domain.tld, named tls.crt
SSL private key for servername.domain.tld, named tls.key
Root SSL certificate (such as this default Root CA), named rootca.crt. A root certificate is optional but recommended.
Intermediate SSL certificate chain/bundle, named intermediate.pem. (This certificate may also appear as the second entry in your fullchain.pem file.)
Wildcard domain name
Wildcard certificate for *.servername.domain.tld, named wildcard.crt
Wildcard private key for *.servername.domain.tld, named wildcard.key
After you’ve gathered all the information above, follow the steps below that correspond to whether you will use the Admin console or the command line to update the Anaconda Enterprise platform to use your certificates.
To update the platform using the Admin console:
Log in to Anaconda Enterprise, select the Menu icon in the top right corner, and click the Administrative Console link displayed at the bottom of the slide-out window.
Log in to the console using the Administrator credentials configured after installation.
Select Web Certificates from the left menu.
Copy and paste the certificate and key information from the files you gathered previously into the appropriate fields.
Click Save to update the platform with your changes.
Note
The default SSL certificate file names generated by the installer vary slightly between versions. If you have upgraded from a previous version of Anaconda Enterprise, you may need to update your configuration to make sure all services are referencing the correct SSL certificate filenames.
Note
The keystore.jks filename remains unchanged.
To update the platform using the command line:
On the system where the certificate and private key reside:
Install openjdk. For example, use the following command to install java-1.8.0-openjdk on CentOS 7.5:
yum install java-1.8.0-openjdk -y
Run the following command to create the keystore.jks file that will be used by Java:
openssl pkcs12 -passout pass:anaconda -export -in CERT.PEM -inkey KEY.PEM -out certificate.p12 -name auth
keytool -importkeystore -deststorepass anaconda -destkeypass anaconda -destkeystore keystore.jks -srckeystore certificate.p12 -srcstoretype PKCS12 -srcstorepass anaconda -alias auth
Note
If you’re using a certificate provided by Let’s Encrypt, use FULLCHAIN.PEM instead of CERT.PEM.
Create an updated Root CA to use with the system:
cat ROOT.CA /etc/ssl/certs/ca-bundle.trust.crt > updated-trust-ca.crt
Note
If you're using a certificate provided by Let's Encrypt, you can obtain the Root CA here.
You must also prepend the CHAIN.PEM to the Root CA.
Note
For RHEL-based systems, the path to the trusted CA is: /etc/ssl/certs/ca-bundle.trust.crt. For Ubuntu-based systems, the path to the system CA is /etc/ssl/certs/ca-certificates.crt.
Set up the basic structure of the certificates.yaml file that you'll be updating in the next several steps:
cat > certificates.yaml <<EOL
apiVersion: v1
kind: Secret
metadata:
  name: anaconda-enterprise-certs
type: Opaque
data:
EOL
Add the main domain for the SSL certificate. For example, test.anaconda.com:
printf " tls.crt: " >> certificates.yaml
base64 -i --wrap=0 CERT.PEM >> certificates.yaml
Add the private key for the certificate:
printf "\n tls.key: " >> certificates.yaml
base64 -i --wrap=0 KEY.PEM >> certificates.yaml
Add the SAN certificate to the file. For example, *.test.anaconda.com:
printf "\n wildcard.crt: " >> certificates.yaml
base64 -i --wrap=0 CERT.PEM >> certificates.yaml
Add the private key for the SAN certificate:
printf "\n wildcard.key: " >> certificates.yaml
base64 -i --wrap=0 KEY.PEM >> certificates.yaml
Add the keystore you generated in Step 2:
printf "\n keystore.jks: " >> certificates.yaml
base64 -i --wrap=0 keystore.jks >> certificates.yaml
Add the updated Root CA that you created in Step 3:
printf "\n rootca.crt: " >> certificates.yaml
base64 -i --wrap=0 updated-trust-ca.crt >> certificates.yaml
Add a new line at the end of the file:
printf '\n' >> certificates.yaml
Copy the file to the share directory inside gravity:
cp certificates.yaml /var/lib/gravity/planet/share
Run the following commands to enter gravity and list your secrets:
gravity enter
kubectl get secrets
In the next step you’ll be removing and recreating a secret, so create a backup of the existing secrets first:
kubectl get secret anaconda-enterprise-certs -o yaml --export > anaconda_certs.backup
Remove the existing secret, and recreate it from the file you placed in the share directory (in Step 12):
kubectl delete secret anaconda-enterprise-certs
kubectl create -f /ext/share/certificates.yaml
Restart all pods to update Anaconda Enterprise to use your certificate:
kubectl get pods | cut -d' ' -f1 | xargs kubectl delete pods
Extracting TLS/SSL certificates¶
Run the following command for each certificate file you wish to extract, replacing rootca.crt below with the name of the specific file:
kubectl get secrets anaconda-enterprise-certs -o jsonpath="{.data['rootca\.crt']}" | base64 -d > rootca.crt
The following certificate files are available:
rootca.crt: the root certificate authority bundle
tls.crt: the SSL certificate for individual services
tls.key: the private key for the above certificate
wildcard.crt: the SSL certificate for "wildcard" services, such as deployed apps and sessions
wildcard.key: the private key for the above certificate
keystore.jks: the Java Key Store containing these certificates, used by some services
To copy the extracted root certificate and add it to the default RHEL/CentOS or Ubuntu trusted CA bundles, run the following commands:
# On Ubuntu
$ cp rootca.crt /usr/share/ca-certificates
$ update-ca-certificates
# RHEL/CentOS
$ cp rootca.crt /etc/pki/ca-trust/source/anchors/
$ update-ca-trust
Verifying TLS/SSL certificates¶
If you are using privately signed certificates, extract the rootca, then use openssl to verify the certificates and make sure the final Verify return code is 0:
# On Ubuntu
$ openssl s_client -connect anaconda.example.com:443 -CAfile /etc/ssl/certs/ca-certificates.crt
...
Verify return code: 0 (ok)
# On RHEL/CentOS
$ openssl s_client -connect anaconda.example.com:443 -CAfile /etc/pki/tls/certs/ca-bundle.crt
...
Verify return code: 0 (ok)
Note
The root CA for the self-signed certificates generated as part of the installation is contained in the certificate bundle at /etc/pki/tls/certs/ca-bundle.crt.
You can now install and use the Anaconda Enterprise CLI to configure the certificates for the platform repository.
Installing conda for packaging mirroring¶
To help improve performance and security, Anaconda Enterprise enables you to create a local copy of an online package repository so users can access the packages from a centralized, on-premises location. This copy is called a mirror. A mirror can be complete, partial, or include specific packages or types of packages.
The Anaconda Enterprise installer contains a bootstrap executable that you can run to install conda.
Prerequisites:
To install conda:
In a terminal window, navigate to the directory where you downloaded and extracted the Anaconda Enterprise installer, replacing <version> with your specific version number:
$ cd anaconda-enterprise-<version>/installer
Run the following command to verify whether the bzip2 package is installed:
which bunzip2
If the command returns a valid package, you can run the bootstrap executable. Otherwise use your package manager to install the binary, using either yum install bzip2 or apt-get install bzip2.
Run the following command to run the bootstrap executable:
$ ./conda-bootstrap-<version>
Type yes when prompted to accept the end user license agreement (EULA).
Accept the default path, or enter an alternate path when prompted.
When prompted, type yes to activate the conda command at shell initialization.
Re-initialize your terminal for the previous steps to take effect:
source ~/.bashrc
Now that you’ve installed conda, you can configure access to the source of the packages to be mirrored, whether an online repository or a tarball (if an air-gapped installation). Then you’ll be ready to begin mirroring channels and packages.
Installing the Anaconda Enterprise CLI¶
Warning
The following process for installing the Anaconda Enterprise CLI results in a broken conda env. Follow the workaround described here instead, until the issue is addressed in a future release of Anaconda Enterprise.
If you want to be able to create and share channels and packages from your Anaconda repository using conda commands, you need to download and install the Anaconda Enterprise CLI. If you updated the platform’s TLS/SSL certificate, you can also use the AE CLI to configure the TLS/SSL certs for the repository.
Prerequisites:
To install the CLI, run the following command, replacing anaconda.example.com with the fully-qualified domain name (FQDN) of your Anaconda Enterprise instance:
conda install -kc https://anaconda.example.com/repository/conda/anaconda-enterprise anaconda-enterprise-cli cas-mirror git
Note
You’ll notice that running this command also installs cas-mirror, the package mirroring tool. For more information on package mirroring, see Configuring channels and packages.
After the list of package dependencies has been resolved, type y to proceed with the installation.
Configuring the Anaconda Enterprise CLI¶
To add the URL of the Anaconda repository to the set of available sites, run the following command with the fully-qualified domain name (FQDN) of your Anaconda Enterprise instance:
anaconda-enterprise-cli config set sites.master.url https://<your.domain.com>/repository/api
Run the following command to configure the instance of Anaconda repository you will be using as the default site:
anaconda-enterprise-cli config set default_site master
To see a consolidated view of the configuration, run the following command:
anaconda-enterprise-cli config view
The Anaconda Enterprise CLI reads configuration information from the following places:
System-level configuration: /etc/anaconda-platform/cli.yml
User-level configuration: $INSTALL_PREFIX/anaconda-platform/cli.yml and $HOME/.anaconda/anaconda-platform/cli.yml
To change how it’s configured, modify the appropriate cli.yml file(s), based on your needs.
Note
Changing configuration settings at the user level overrides any system-level configuration.
If you updated the platform’s TLS/SSL certificate, you can use the AE CLI to configure the certificates for the repository using the following commands:
$ anaconda-enterprise-cli config set ssl_verify true
# On Ubuntu
$ anaconda-enterprise-cli config set sites.master.ssl_verify /etc/ssl/certs/ca-certificates.crt
# On RHEL/CentOS
$ anaconda-enterprise-cli config set sites.master.ssl_verify /etc/pki/tls/certs/ca-bundle.crt
Logging in to the Anaconda Enterprise CLI¶
Run this command to access the CLI:
anaconda-enterprise-cli login
Log in to the CLI using the same username and password that you use to log in to the Anaconda Enterprise web interface:
Username: <your-username>
Password: <your-password>
Next Steps: You can now configure access to the source of the packages to be mirrored, whether an online repository or a tarball (if an air-gapped installation). Then you’ll be ready to begin mirroring channels and packages.
Installing Livy server for Hadoop Spark access¶
To support your organization’s data analysis operations, Anaconda Enterprise enables platform users to connect to remote Apache Hadoop or Spark clusters. Anaconda Enterprise uses Apache Livy to handle session management and communication to Apache Spark clusters, including different versions of Spark, independent clusters, and even different types of Hadoop distributions.
Livy provides all the authentication layers that Hadoop administrators are used to, including Kerberos. AE also authenticates to HDFS with Kerberos. Kerberos Impersonation must be enabled.
When Livy is installed, users can connect to a remote Spark cluster when creating projects by selecting the Spark template. They can either use the Python libraries available on the platform, or package a specific environment to target for the job. For more information, see Hadoop / Spark.
Before you begin:
Verify the connection requirements. The following table outlines the supported configurations for connecting to remote Hadoop and Spark clusters with Anaconda Enterprise.
| Software | Version |
|---|---|
| Hadoop and HDFS | 2.6.0+ |
| Spark and Spark API | 1.6+ and 2.X |
| Sparkmagic | 0.12.7 |
| Livy | 0.5 |
| Hive | 1.1.0+ |
| Impala | 2.11+ |
Note
The Hive metastore may be Postgres or MySQL. The Livy server must run on an "edge node" or client in the Hadoop/Spark cluster. Verify that the spark-submit and/or Spark REPL commands work on this machine.
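For example, one way to verify spark-submit on the intended edge node is to run the bundled SparkPi example; the jar path below is a placeholder that varies by Hadoop distribution and Spark version:

spark-submit --version
spark-submit --class org.apache.spark.examples.SparkPi \
    /usr/lib/spark/lib/spark-examples.jar 10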
Installing Livy
Follow the instructions below to install Livy into an existing Spark cluster, or download and install the official version of Livy.
Note
This example is specific to a Red Hat-based Linux distribution, with a Hadoop installation based on Cloudera CDH. To use other systems, you’ll need to look up the corresponding commands and locations.
Locate the directory that contains Anaconda Livy. Typically this will be anaconda-enterprise-X.X.X-X.X/installer/anaconda-livy-0.5.0, where X.X.X-X.X corresponds to the Anaconda Enterprise version.
Copy the entire directory that contains Anaconda Livy to an edge node on the Spark/Hadoop cluster, as in the sketch below.
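A sketch of that copy step, assuming SSH access from the Anaconda Enterprise master and a hypothetical edge-node hostname and destination path:

scp -r anaconda-enterprise-X.X.X-X.X/installer/anaconda-livy-0.5.0 user@edge-node.example.com:/opt/anaconda-livy-0.5.0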
After installing Livy server, you’ll need to configure it to work with Anaconda Enterprise. For example, you’ll need to enable impersonation, so users running Spark sessions are able to log in to each machine in the Spark cluster. For more information, see Configuring Livy server for Hadoop Spark access.
Upgrading Anaconda Enterprise¶
The process of moving from one version of Anaconda Enterprise to another varies slightly, depending on which version you are migrating to and from, so follow the instructions that correspond to your Anaconda Enterprise implementation. If you’re moving from an implementation of AE 4 to AE 5, we consider that a migration. If you’re moving between point releases of the same version, we consider that an upgrade.
Migrating between major releases of Anaconda Enterprise requires Administrators to migrate the package repository and project owners to migrate their notebooks.
Upgrading Anaconda Enterprise generally involves exporting or backing up your current package repository and all project data, uninstalling the existing version and installing the newer version, then importing or restoring this information on the new platform.
Upgrading between versions of AE5¶
Due to the potential complexity of your custom configuration, please contact Anaconda Support before initiating the upgrade.
After you have determined the topology for your Anaconda Enterprise cluster, and verified that your system meets all of the installation requirements, you’re ready to upgrade the cluster.
Before you begin:
Configure your A record in DNS for the master node with the actual domain name you will use for your Anaconda Enterprise installation.
If you are using a firewall for network security, we recommend you temporarily disable it while you upgrade Anaconda Enterprise.
When installing Anaconda Enterprise on a system with multiple nodes, verify that the clock of each node is in sync with the others prior to starting the installation process, to avoid potential issues. We recommend using the Network Time Protocol (NTP) to synchronize computer system clocks automatically over a network. See instructions here.
Back up the anaconda-enterprise-anaconda-platform.yml file used to configure the platform, as config map settings such as external Git configuration are not automatically migrated to the new cluster as part of the upgrade process.
Back up your custom cas-mirror and anaconda-enterprise-cli configurations (see Step 4 below), as $HOME/cas-mirror will be overwritten during the upgrade process. To avoid any compatibility issues, we recommend you upgrade your mirror tools as part of the upgrade process. Afterwards, simply copy over the configuration files you backed up to restore your custom configuration.
Warning
After the upgrade or backup process has begun, it won't be possible to capture or back up data for any open sessions or deployments. We therefore recommend that you ask all users to save their work, stop any sessions and deployments, and log out of the platform during the upgrade window. The backup.sh script that runs as part of the upgrade process will restart all pods, so users who don't log out will lose any unsaved work. They may also encounter a 404 error after the upgrade. The workaround for the error message is to stop and restart the session or deployment that generated the error, but there is no way to retrieve lost data.
The upgrade process varies slightly, depending on your current version and which version you’re installing. To update an existing Anaconda Enterprise installation to a newer version, follow the process that corresponds to your particular scenario:
Upgrading from AE 5.3.0/5.3.1 to 5.4.x¶
Anaconda Enterprise 5.3.0 and 5.3.1 support in-place upgrades, so you can follow these simple steps to update your 5.3.0 or 5.3.1 installation to the latest version.
Ensure that all AE users have closed any open sessions, stopped any deployed applications, and logged out of the platform. The backup.sh script that runs as part of the upgrade process will restart all pods, so users who don't log out will lose any unsaved work.
On the master node running your current installation of AE, download and decompress the new installer, and then cd into the install directory, replacing <location_of_installer> with the location of the installer, and <version> with your installer version:
curl -O <location_of_installer>.tar.gz
tar xvzf anaconda-enterprise-<version>.tar.gz
cd anaconda-enterprise-<version>
Run the following command to upload the installer to the AE environment:
sudo ./upload
When the upload process finishes, run the following command to start the upgrade process:
sudo ./gravity upgrade
cd into the install directory:
cd ../anaconda-enterprise-<version>
Depending on your implementation, the upgrade process may take an hour or more to complete. You can check the status of the upgrade process by running
sudo ./gravity status.
If you encounter errors while upgrading, you can check the status of the operation by running sudo ./gravity plan. You can then roll back any step in the upgrade process by running the rollback command against the name of the phase, as it’s listed in the Phase column:
sudo ./gravity rollback --phase=/<name-of-phase>
After addressing the error(s), you can resume the upgrade by running the following command:
sudo ./gravity upgrade --resume --force
After the upgrade process completes, follow the steps to verify that your upgrade was successful.
After you’ve confirmed that your upgrade was successful—and everything works as expected—you can run a script to remove images leftover from the previous installation and free up space. This will help prevent the cluster from running out of disk space on the master node.
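The cleanup referenced here uses the gravity garbage collection command covered under Verify installation below; a minimal invocation is:
sudo gravity gc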
Upgrading from AE 5.2.x/5.3.0 to 5.3.1¶
Anaconda Enterprise 5.2.x and 5.3.0 support in-place upgrades, so you can follow these simple steps to update your 5.2.x or 5.3.0 installation to the latest version.
Ensure that all AE users have closed any open sessions, stopped any deployed applications, and logged out of the platform. The backup.sh script that runs as part of the upgrade process restarts all pods, so any users who don't will lose their unsaved work.
On the master node running your current installation of AE, download and decompress the new installer, replacing <location_of_installer> with the location of the installer and <version> with your installer version:
curl -O <location_of_installer>.tar.gz
tar xvzf anaconda-enterprise-<version>.tar.gz
cd anaconda-enterprise-<version>
Run the following command to upload the installer to the AE environment:
sudo ./upload
When the upload process finishes, run the following command to start the upgrade process:
sudo ./gravity upgrade
The upgrade process may take up to an hour to complete. You can check the status of the upgrade process by running
sudo ./gravity status.
If you encounter errors while upgrading, you can check the status of the operation by running sudo ./gravity plan. You can then roll back any step in the upgrade process by running the rollback command against the name of the phase, as it’s listed in the Phase column:
sudo ./gravity rollback --phase=/<name-of-phase>
After addressing the error(s), you can resume the upgrade by running the following command:
sudo ./gravity upgrade --resume --force
After the upgrade process completes, follow the steps to verify that your upgrade was successful.
After you’ve confirmed that your upgrade was successful—and everything works as expected—you can run a script to remove images leftover from the previous installation and free up space. This will help prevent the cluster from running out of disk space on the master node.
Verify installation¶
After you’ve verified that all pods are running and updated the Anaconda Enterprise URLs, you can confirm that your upgrade was successful by doing the following:
Return to the Authentication Center and select Users in the Manage menu on the left.
Click View all users and verify that all user data has also been restored.
Access the Anaconda Enterprise user console by visiting https://example.anaconda.com/ in your browser, replacing example.anaconda.com with the FQDN of your server, and logging in with the same credentials you used in your previous installation.
Review the Projects list to verify that all project data has been restored.
Note
If you didn’t configure SSL certificates as part of the post-install configuration, do so now. See Updating TLS/SSL certificates for more information.
If you’re upgrading a cluster with external Git configured:
Note
The git section of the anaconda-enterprise-anaconda-platform.yml file used to configure Anaconda Enterprise 5.3.1 includes parameter changes. If you backed up your Anaconda Enterprise config map before upgrading, and copied it onto the newly-updated master node, you’ll need to update your config map with the new information as described here.
If you’re upgrading a Spark/Hadoop configuration:
After you successfully restore your Anaconda Enterprise data, run the following commands on the master node of the newly-installed Anaconda Enterprise server:
kubectl replace -f <path-to-anaconda-config-files-secrets.yaml>
To verify that your configuration upgraded correctly:
Log in to Anaconda Enterprise.
If your configuration uses Kerberos authentication, open a Hadoop terminal and authenticate yourself through Kerberos using the same credentials you used previously. For example, kinit <username>.
Open a Jupyter Notebook that uses Sparkmagic, and verify that it behaves as expected. For example, run the sc command to connect to Sparkmagic and start Spark.
After you’ve confirmed that your upgrade was successful, we recommend you run the following command to remove all unused packages and images from previous versions of the application, and repopulate the registry to include only those images required by the current version of the application:
sudo gravity gc
The command’s progress is displayed in the terminal, so you can watch as it marks packages associated with the latest version as required, and deletes older versions.
If running the command generates an error, you can resume the command (after you fix the issue that caused the error) by running the following command:
sudo gravity gc --resume
Backing up and restoring AE¶
Before you begin any upgrade, you must back up your Anaconda Enterprise configuration and data files. You may also choose to back up AE regularly, based on your organization’s disaster recovery policies.
CAUTION: After the backup process has begun, it won't be possible to back up data for any open sessions or deployments. We therefore recommend that you ask all users to save their work, stop any sessions and deployments, and log out of the platform during the upgrade window. If they don't, they will lose any unsaved work. They may also encounter a 404 error after the upgrade. The workaround for the error message is to stop and restart the session or deployment that generated the error, but there is no way to retrieve lost data.
If you are backing up Anaconda Enterprise as part of an upgrade, note that after installing AE 5.2.x you'll need to re-configure your SSL certificates, so ensure all certificate-related information, including the private key, is accessible at that point in the process. See upgrading between versions for AE5 for the complete upgrade process.
Backing up Anaconda Enterprise¶
The number of channels and packages being backed up will impact the amount of free space and time required to perform the backup, so ensure you have sufficient free space and time available to complete the process. To prevent potential disk pressure issues, you can create another volume and specify that location instead of the default /opt/anaconda. See Troubleshooting known issues for more information.
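Before starting the backup, it can help to confirm how much free space is available and how much the installation currently occupies. A minimal sketch:
df -h /opt/anaconda       # free space on the backup volume
sudo du -sh /opt/anaconda # current size of the installation directory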
All of the following commands should be run on the master node.
Copy the backup.sh script from the location where you saved the installer tarball to the Anaconda Enterprise environment using the following command:
sudo cp backup.sh /opt/anaconda
Back up Anaconda Enterprise by running the following commands:
cd /opt/anaconda
bash backup.sh
The following backup files are created and saved to /opt/anaconda:
ae5-data-backup-${timestamp}.tar
ae5-state-backup-${timestamp}.tar.gz
Move the backup files to a remote location to preserve them (for example with scp, as sketched after these steps), as the /opt/anaconda directory will be deleted in future steps. After uninstalling AE, you'll copy ae5-data-backup-${timestamp}.tar back to your local filesystem.
Exit the Anaconda Enterprise environment by typing exit.
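For example, you might copy the backup files to a remote host with scp. The host name and destination path here are hypothetical; substitute your own:
scp /opt/anaconda/ae5-data-backup-*.tar backup-host:/backups/ae5/
scp /opt/anaconda/ae5-state-backup-*.tar.gz backup-host:/backups/ae5/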
If your existing configuration includes Spark/Hadoop, perform these additional steps to migrate configuration information specific to your cluster:
Run the following command to retrieve configuration information from the 5.1.x server and generate the anaconda-config-files-secret.yaml file:
kubectl get secret anaconda-config-files -o yaml > <path-to-anaconda-config-files-secret.yaml>
NOTE: This file will be deleted in future steps, so move it to a remote location to preserve it, and ensure that you can access this file from the server where you’re installing the newer version of AE 5.2.x.
Open the anaconda-config-files-secret.yaml file, locate the metadata section, and delete everything under it except for the following: name: anaconda-config-files.
For example, if it looks like this to begin with:
apiVersion: v1
data:
  xxxx
kind: Secret
metadata:
  creationTimestamp: 2018-07-31T19:30:54Z
  name: anaconda-config-files
  namespace: default
  resourceVersion: "981426"
  selfLink: /api/v1/namespaces/default/secrets/anaconda-config-files
  uid: 3de10e2b-94f8-11e8-94b8-1223fab00076
type: Opaque
It will look like this afterwards:
apiVersion: v1
data:
  xxxx
kind: Secret
metadata:
  name: anaconda-config-files
type: Opaque
Restoring Anaconda Enterprise¶
If you backed up your Anaconda Enterprise installation, you can restore configuration information from the backup files. The restore script restores data, and can optionally be used to restore state information.
NOTE: When upgrading from 5.1.x to 5.2.x, we recommend restoring only data from the backup, and using the state generated during installation of 5.2.0. See upgrading between versions for AE5 for the complete upgrade process.
Copy the restore.sh script from the location where you saved the installer tarball to the Anaconda Enterprise environment using the following command:
sudo cp restore.sh /opt/anaconda
To restore only data, run:
cd /opt/anaconda/
bash restore.sh <path-to-data-backup-file>
NOTE: Replace path-to-data-backup-file with the path to the data backup file generated when you ran the Anaconda Enterprise backup script.
To restore data and state, run:
cd /opt/anaconda/
bash restore.sh <path-to-data-backup-file> <path-to-state-backup-file>
For help, run the bash restore.sh -h command.
After recovery, manually stop and restart all active sessions, deployments, and job runs using the UI.
Uninstalling AE¶
Before using the following instructions to uninstall Anaconda Enterprise, be sure to follow the steps to back up your current installation so you'll be able to restore your data from the backup after installing Anaconda Enterprise 5.2.
To uninstall Anaconda Enterprise on a healthy cluster's worker nodes, run:
sudo gravity leave
sudo killall gravity
sudo killall planet
To uninstall Anaconda Enterprise on a healthy cluster's master node, run:
sudo gravity system uninstall
sudo killall gravity
sudo killall planet
sudo rm -rf /var/lib/gravity /opt/anaconda
To uninstall a failed or faulty cluster node, run:
sudo gravity remove --force
To remove an offline node that cannot be reached from the cluster, run:
sudo gravity remove <node>
Where <node> specifies the node to be removed. This value can be the node’s assigned
hostname, its IP address (the one that was used as an “advertise address” or
“peer address” during install), or its Kubernetes name (which you can obtain by running kubectl get nodes).
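For example, to look up a node's Kubernetes name and then remove it from the master (the IP address shown is illustrative):
kubectl get nodes
sudo gravity remove 10.0.0.12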
Migrating from AE 4 to AE 5¶
The process of migrating from AE 4 to AE 5 involves the following tasks:
For Administrators:
Export all packages and package information from your AE 4 Repository.
Import the packages into Anaconda Enterprise 5.
For Notebook users:
Export each project environment to a .yml file.
Due to architectural changes between versions of the platform, there are some additional steps you may need to follow to migrate code between AE4 and AE5. These steps vary based on your current and new platform configurations.
Exporting packages¶
Anaconda Enterprise enables you to create a site dump of all packages used by your organization, including the owners and permissions associated with each package.
Log in to the AE 4 Repo and switch to the anaconda-server user.
To export your packages, run the following command on the server hosting your Anaconda Repository:
anaconda-server-admin export-site
Running this command creates a directory structure containing all files and user information from your Anaconda Repository. For example:
site-dump/
├── anaconda-user-1
│ ├── 59921152446b5703f430383f--moto
│ ├── 5992115f446b5703fa30383e--pysocks
│ └── meta.json
├── anaconda-organization
│ ├── 5989fbd1446b575b99032652--future
│ ├── 5989fc1d446b575b99032786--iso8601
│ ├── 5989fc1f446b575b990327a8--simplejson
│ ├── 5989fc26446b575b99032802--six
│ ├── 5989fc31446b575b990328b0--xz
│ ├── 5989fc35446b575b990328c6--zlib
│ └── meta.json
└── anaconda-user-2
└── meta.json
Each subdirectory of site-dump contains the contents of the Repository as it pertains to a particular user. For example, anaconda-user-1 has two packages, moto and pysocks. The meta.json file in each user directory contains information about any groups of end users that user belongs to, as well as their organizations.
Package directories contain the package files, prefixed with their database ID. The meta.json file in each package directory contains metadata about the packages, including version, build number, dependencies, and build requirements.
Note
Other files included in the site-dump—such as projects and environments—are NOT imported by the package import tool. That’s why users have to export their Notebook projects separately.
Importing packages¶
You can choose whether to import packages into Anaconda Enterprise 5 by username or organization, or import all packages.
Before you begin:
We recommend you compare the import options before proceeding, so you can choose the option that most closely aligns with the desired outcome for your organization.
You’ll be using the Anaconda Enterprise command line interface (CLI) to import the packages you exported, so be sure to install the AE CLI if you haven’t already.
Log into the command line interface using the following command:
anaconda-enterprise-cli login
Follow the instructions below for the method you want to use to import packages.
To import packages by username or organization:
As you saw in the example above, the packages for each user are put in a separate directory in the site-dump. This means that the import process is the same whether you specify a directory based on a username or organization.
Import a single directory from the site-dump using the following command:
anaconda-enterprise-cli admin import site-dump/name
Replace name with the actual name of the directory you want to import.
Note
You can also pass a list of directories to import.
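For example, to import the two user directories from the site dump shown earlier in a single invocation:
anaconda-enterprise-cli admin import site-dump/anaconda-user-1 site-dump/anaconda-organization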
To import all packages:
Run the following command to import all packages in the site dump:
anaconda-enterprise-cli admin import site-dump/*
How channels of imported packages are named
When you import packages by username, a new channel is created for each unique label the user has applied to their packages, using the username as a prefix. (The default package label “main” is not included in channel names.)
For example, if anaconda-user-1 has the following packages:
moto-0.4.31-2.tar.bz2 with label main
pysocks-1.6.6-py35_0.tar.bz2 with label test
The following channels are created:
anaconda-user-1 containing the package file moto-0.4.31-2.tar.bz2
anaconda-user-1/test containing the package file pysocks-1.6.6-py35_0.tar.bz2
When you import all packages in an organization, a new channel is created for each organization, group, and label. The script appends any groups associated with the organization to the channel name it creates. (The default package label “main” and default organization label “Owner” are not included in channel names.)
For example, if anaconda-organization includes a group called Devs, and the site dump for anaconda-organization contains a package file named xz-5.2.2-1.tar.bz2 with the label Test, running the script will create the following channels:
anaconda-organization – This channel contains all packages that the organization owner can access.
anaconda-organization/Devs – This channel contains all packages that the Devs group can access.
anaconda-organization/Devs/Test – This channel contains all packages labeled Test that the Devs group can access.
Granting access to channels and packages
After everything is uploaded, each channel created as part of the import process is shared with the appropriate users and groups. In the case of the example above, anaconda-user-1 is granted read-write access to the anaconda-user-1 and anaconda-user-1/test channels, and all members of the Devs group will have read permission for everything in the Devs channel.
You can change these access permissions as needed using the Anaconda Enterprise UI or CLI. See Managing channels and packages for more information.
Migrating AE 4 Notebook Projects¶
Before you begin:
If your project refers to channels in your on-premises repository or other channels on anaconda.org, ask your System Administrator to mirror those channels and make them available to you in AE 5.
If your project uses non-conda packages, you'll need to upload those packages to AE 5.
If your notebook refers to multiple kernels or environments, set the kernel to a single environment.
If your project contains several notebooks, verify that they all are using the same kernel or environment.
Exporting your project¶
Exporting a project creates a .yml file that includes all the environment information for the project.
Log in to your Anaconda Enterprise Notebooks server.
Open a terminal window and activate conda environment 2.6 for your project.
Install anaconda-project in the environment:
conda install anaconda-project=0.6.0
If you get a not found message, install it from anaconda.org:
conda install -c anaconda anaconda-project=0.6.0
Export your environment to a file:
conda env export -n default -f _env.yml
Here, default is the name of the environment where the notebook runs.
Verify that the format of the environment file looks similar to the following, and that the dependencies for each notebook in the project are listed:
channels:
  - wakari
  - r
  - https://conda.anaconda.org/wakari
  - defaults
  - anaconda-adam
prefix: /projects/anaconda/MigrationExample/envs/default
dependencies:
  - _license=1.1=py27_1
  - accelerate=2.3.1=np111py27_0
  - accelerate_cudalib=2.0=0
  - alabaster=0.7.9=py27_0
  # ... etc ...
If it contains any warning messages, run this script to modify the encoding and remove the warnings:
import ruamel_yaml

# Load the exported environment file, then re-dump it with the
# round-trip dumper to normalize the encoding and drop the warnings.
with open("_env.yml") as env_fd:
    env = ruamel_yaml.load(env_fd)
with open("environment.yml", "w") as env_fd:
    ruamel_yaml.dump(env, env_fd, Dumper=ruamel_yaml.RoundTripDumper)
Converting your project¶
To create a project that's compatible with Anaconda Enterprise 5, perform these steps:
Run the following command from an interactive shell:
anaconda-project init
AE 4 supports Linux only, so run the following command to remove the Windows and macOS platforms from the project's anaconda-project.yml configuration file:
anaconda-project remove-platforms win-64 osx-64
Run the following command to verify the platforms were removed:
anaconda-project list-platforms
Add /.indexer.pid and .git to the .projectignore file.
Run the following command to compress your project:
anaconda-project archive FILENAME.tar.gz
Note
There is a 1GB file size limit for project files, and project names cannot contain spaces or special characters.
In Anaconda Enterprise Notebooks, from your project home page, open the Workbench. Locate your project file (e.g., AENProject.tar.gz) in the file list, right-click and select Download.
Now your project is ready to be uploaded into Anaconda Enterprise 5.
Uploading your project to AE 5¶
Log in to the Enterprise v5 interface and upload your project file FILENAME.tar.gz. See Working with projects for help.
Note
To maintain performance, there is a 1GB file size limit for project files you upload. Anaconda Enterprise projects are based on Git, so we recommend you commit only text-based files relevant to a project, and keep them under 100MB. Binary files are difficult for version control systems to manage, so we recommend using storage solutions designed for that type of data, and connecting to those data sources from within your Anaconda Enterprise sessions.
Migrating code¶
AE4 and AE5 are based on different architectures, so some code inside your AE4 notebooks might not run as expected in AE5. AE4 sessions ran directly on the host filesystem, where the libraries, drivers, packages, and connectors required to run them were available. AE5 sessions run in isolated containers with their own independent file system, so they don't necessarily have access to everything on the host.
This difference in architecture primarily impacts the following:
Connecting to external data sources¶
If you currently rely on ODBC/JDBC drivers to connect to specific databases such as Oracle and Impala, we recommend you use services that support this style of access, such as Apache Impala and Apache Hive, instead. Additionally, using a language- and platform-agnostic connector such as Thrift allows you to create reproducible code that is more portable.
For best practices on how to connect to different external systems inside AE5, see Connecting to the Hadoop and Spark ecosystem.
Service/System | Recommended |
|---|---|
Apache Impala | |
Apache Hive | |
Oracle | build conda package with their driver |
If this is not possible, we recommend you obtain or build conda packages for the connectors and drivers you need. This enables you to add them as package dependencies for your project that will be installed when you start a Notebook session or deploy the project.
This has the added benefit of enabling you to update dependencies on connectors on a per-project basis.
Sharing custom Python and R libraries¶
It’s quite common to share custom libraries by adding them to a location in the filesystem where all users can access the libraries they need. AE5 sessions and deployments run in isolated containers, so users cannot use this method to access shared libraries.
Instead, we recommend you create a conda package for each library. This enables you to control access to each package library and version it—both essential to managing software at the enterprise level.
After you create the package, upload it to the internal AE5 repository, where it can be shared with users and included as a dependency in user sessions and deployments.
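As a rough sketch of that workflow, assuming you have a conda recipe directory for your library; the recipe directory, package filename, and upload subcommand below are illustrative, so check your CLI version's help for the exact syntax:
conda build my-library/                                          # build the package from a recipe (hypothetical directory)
anaconda-enterprise-cli upload ./my-library-1.0-py37_0.tar.bz2   # assumed upload subcommand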
Installing external dependencies¶
If you typically install dependencies using system package managers such as apt and yum, you can continue to do so in Anaconda Enterprise 5. Dependencies installed from the command line are available during the current session only, however.
If you want them to persist across project sessions and deployments, add them as packages in the project’s anaconda-project.yml configuration file. See Configuring project settings for more information.
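For example, a project's anaconda-project.yml might declare its dependencies like this (the package names are illustrative):
packages:
  - python=3.6
  - unixodbc
  - my-internal-library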
If your project depends on a package that is not available in your internal Anaconda Enterprise repository, search anaconda.org or build your own conda package using conda-build, then upload the conda package to the AE5 repository.
If you don’t have the expertise required to build the custom packages your organization needs, consider engaging our consulting team to make your mission-critical analytics libraries available as conda packages.
Administering Anaconda Enterprise¶
There are several aspects of Anaconda Enterprise that can be configured to meet your organization’s specific requirements, including the following:
Configuring and monitoring the utilization of cluster resources.
Configuring user access to the platform and its resources.
Configuring channels of packages, plus environments and custom installers to distribute software.
Configuring advanced platform settings, including configuring Livy server for Hadoop Spark access, configuring external version control, and mounting NFS shares.
Administrators use different consoles to perform tasks in each of these areas, with credentials required to access each console. This gives enterprises the flexibility they need to choose whether to grant the permissions required to access a particular console to a single Admin, or different individuals, based on their area(s) of expertise within the organization.
Some configuration options fall outside of these general categories—and you may not necessarily follow this linear process—however, the following offers a high-level overview of the configuration workflow you’re likely to follow:
Managing cluster resources¶
After you’ve installed an Anaconda Enterprise cluster, you’ll want to continue to manage and monitor the cluster to ensure that it scales with your organization as needs change. These on-going management and monitoring tasks include the following:
When you’ve outgrown your initial Anaconda Enterprise cluster installation,
you can easily add new nodes—including GPUs. To make these nodes available to platform users, you’ll configure resource profiles.
To help you manage your organization’s cluster resources more efficiently, Anaconda Enterprise enables you to track which sessions and deployments are running on specific nodes or by specific users. You can also monitor cluster resource usage in terms of CPU, memory, disk space, network and GPU utilization.
To help you gain insights into user services and troubleshoot issues, Anaconda Enterprise provides detailed logs and debugging information related to the Kubernetes services it uses, as well as all activity performed by users. See fault tolerance in Anaconda Enterprise for information about what to do if a master node fails.
Adding and removing nodes¶
You can view, add, edit and delete server nodes from Anaconda Enterprise using the Admin Console’s Operations Center. If you would prefer to use a command line to join additional nodes to the AE master, follow the instructions provided below.
NOTES:
Each installation can only support a single AE master node, as this node includes storage for the platform. DO NOT add an additional AE master node to your installation.
As a best practice for etcd optimal cluster size, we recommend you add any additional Kubernetes master nodes in pairs, so that the total number (including the AE master) is an odd number.
Anaconda Enterprise doesn't support running heterogeneous versions in the same cluster. Before adding a new node, verify that the node is running the same version of the OS as the rest of the cluster (a quick check is sketched after these notes).
If you’re adding a GPU node, make sure it meets the GPU requirements.
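A quick way to compare OS versions is to run the following on the new node and on an existing cluster node, and confirm the output matches:
cat /etc/os-release   # distribution name and version
uname -r              # kernel release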
To manage the servers on your system:
Log in to Anaconda Enterprise, select the Menu icon in the top right corner, and click the Administrative Console link displayed at the bottom of the slide-out window. You must be logged in as a user assigned the ae-admin role.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Select Nodes from the menu on the left to display the configured nodes in your cluster, their IP address, hostname and profile.
To add an existing server to Anaconda Enterprise:
Click the Add Node button at the top right.
Select an appropriate profile for the server and click Continue.
Copy and paste the command provided into a terminal window to add the server.
When you refresh the page, your server will appear in the list.
To remove a server node:
Click the Actions menu at the far right of the node you want to remove and select Delete….
To log on to a server:
Click the terminal icon of the server you want to work with and select root to open a terminal window in a new browser tab.
When you are finished, simply close the console window.
Using the command line to add nodes¶
On the server you're adding to the cluster, download the gravity binary that corresponds to your version of Anaconda Enterprise from the S3 location provided to you by Anaconda.
Rename the file to something simpler, then make it executable. For example:
mv gravity-binary-6.1.9 gravity
chmod +x gravity
On the AE master, run the following command to obtain the join token and IP address for the AE master node:
gravity status
From the command's output, copy the join token for the cluster and the IP address of the AE master somewhere accessible. You'll need to provide this information when you add a new worker node. You'll also need the IP address of the server node you're adding.
On the worker node, run the following command to add the node to the cluster:
./gravity join --token JOIN-TOKEN --advertise-addr=NODE-IP --role=NODE-ROLE --cloud-provider=CLOUD-PROVIDER MASTER-IP-ADDR
Where:
JOIN-TOKEN = The join token that you obtained in Step 3.
NODE-IP = The IP address of the worker node. This can be a private IP address, as long as the network it’s on can access the AE master.
NODE-ROLE = The type of node you’re adding: ae-worker, gpu-worker, or k8s-master.
CLOUD-PROVIDER = This is auto-detected, and can therefore be excluded unless you don’t have Internet access. In this case, use generic.
MASTER-IP-ADDR = The IP address of the AE master that you obtained in Step 3.
Warning
The --role flag must be provided and assigned to either ae-worker,
gpu-worker or k8s-master. Without it the node will be added
with the role ae-master and may cause your cluster to crash.
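Putting it together, a join command with illustrative placeholder values might look like this:
./gravity join --token a1b2c3d4e5 --advertise-addr=10.0.0.12 --role=ae-worker 10.0.0.5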
The progress of the join operation is displayed in the terminal.
To monitor the impact of the join operation on the cluster, run the gravity status command on the AE master.
In the output, note that the size of the cluster is expanding and the status of the new node being added is offline. When the node has successfully joined, the cluster returns to an active state, and the status of the new node changes to healthy.
Setting resource limits for sessions and deployments¶
Note
You can separate system-level pods from user-level sessions and deployments as long as you have a multi-node setup (that is, a master node and at least one worker node). Contact support to complete this operation.
Each project editor session and deployment uses compute resources on the Anaconda Enterprise cluster. If Anaconda Enterprise users need to run applications which require more memory or compute power than provided by default, you can customize your installation to include these resources and allow users to access them while working within AE.
After the server resources are installed as nodes in the cluster, you create custom resource profiles to configure the number of cores and amount of memory/RAM available to users—so that it corresponds to your specific system configuration and the needs of your users.
For example, if your installation includes nodes with GPUs, add a GPU resource profile so users can use the GPUs to accelerate computation within their projects—essential for machine learning model training. For installation requirements, see Installation requirements.
Resource profiles apply to all nodes, users, editor sessions, and deployments in the cluster. So if your installation includes nodes with GPUs that you want to make available for users to accelerate computation within their projects, you'd create a GPU resource profile. Any resource profiles you configure are listed for users to select from when configuring and deploying a project. Anaconda Enterprise finds the node that matches their request.
To add a resource profile for a resource you have installed:
Log in to Anaconda Enterprise, select the Menu icon in the top right corner, and click the Administrative Console link displayed at the bottom of the slide-out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Select Configuration from the menu on the left.
Use the Config map drop-down menu to select the anaconda-enterprise-anaconda-platform.yml configuration file.
Make a manual backup copy of this file before editing it, as any changes you make will impact how Anaconda Enterprise functions.
Scroll down to the resource-profiles section.
Add an additional resource following the format of the default specification. For example, to create a GPU resource profile, add the following to the resource-profiles section of the Config map:
gpu-profile:
  description: 'GPU resource profile'
  user_visible: true
  resources:
    limits:
      cpu: '4'
      memory: '8Gi'
      nvidia.com/gpu: 1
By default, CPU sessions and deployments are also allowed to run on GPU nodes. To reserve GPU nodes for only those sessions and deployments that require a GPU, comment out the additional specification included after the gpu-profile entry to prevent CPU sessions and deployments from accessing GPU nodes.
Note
Resource profiles are listed in alphabetical order—after any defaults—so if you want them to appear in a particular order in the drop-down list that users see, be sure to name them accordingly.
Click Apply to save your changes.
To update the Anaconda Enterprise server with your changes, you’ll need to do the following:
Restart the workspace and deploy services by running the following command:
kubectl delete pods -l 'app in (ap-workspace, ap-deploy)'
Then check the project Settings and Deploy UI to verify that each resource profile you added or edited appears in the Resource Profile drop-down menu.
Monitoring cluster utilization¶
Anaconda Enterprise enables you to monitor cluster resource usage in terms of CPU, memory, disk space, network and GPU utilization.
To access the Operations Center:
Log in to Anaconda Enterprise, select the Menu icon in the top right corner, and click the Administrative Console link displayed at the bottom of the slide-out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Total cluster resource utilization¶
The Dashboard tab in the Operations Center displays the total CPU and memory utilization aggregated across all (master and worker) nodes in the Anaconda Enterprise cluster.
Monitoring dashboard¶
Click Monitoring in the menu on the left.
The graphs displayed include the following:
Overall Cluster CPU Usage
CPU Usage by Node
Individual CPU Usage
Overall Cluster Memory Usage
Memory Usage by Node
Individual Node Memory Usage
Overall Cluster Network Usage
Network Usage by Node
Individual Node Network Usage
Overall Cluster Filesystem Usage
Filesystem Usage by Node
Individual Filesystem Usage
Use the control in the upper right corner to specify the range of time for which you want to view usage information, and how often you want to refresh the results.
Monitoring Kubernetes¶
To view the status of your Kubernetes nodes, pods, services, jobs, daemon sets and deployments from the Operations Center, click Kubernetes in the menu on the left and select Pods.
See Monitoring sessions and deployments for more information.
To view the status or progress of a cluster installation, click Operations in the menu on the left, and select an operation in the list. Clicking on a specific operation switches to the Logs view, where you can also view logs based on container or pod.
Monitoring sessions and deployments¶
Anaconda Enterprise enables you to see which sessions and deployments are running on specific nodes or by specific users, so you can monitor cluster resource usage. You can also view session details for a specific user in the Authorization Center. See Managing users for more information.
Log in to Anaconda Enterprise, select the Menu icon in the top right corner, and click the Administrative Console link displayed at the bottom of the slide-out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Select Monitoring from the menu on the left to display the monitoring dashboards.
Individual pod¶
To display the monitoring graph for a user session or deployment you’ll need to identify the appropriate Kubernetes pod name.
For an editor session the Kubernetes pod name corresponds to the hostname
of the session container. Run hostname in a terminal window. For
deployments the pod name is available from the logs tab of the deployment
under the heading name.
Click the Monitoring tab from the menu on the left.
Click Cluster at the top left of the dashboard.
Select Compute Resource / Workload.
To display the monitoring graph for an individual pod:
Select default from the namespace menu.
Select the desired pod from the workload menu.
Scroll down further to display the memory usage.
Using the CLI:
Note
For more expanded monitoring, see AE5 Tools.
Open an SSH session on the master node in a terminal by logging into the Operations Center and selecting Servers from the menu on the left.
Click on the IP address for the Anaconda Enterprise master node and select SSH login as root.
In the terminal window, run sudo gravity enter.
To view total node CPU and memory utilization, run:
kubectl top nodes --heapster-namespace=monitoring
To view CPU and memory utilization per pod, run:
kubectl top pods --heapster-namespace=monitoring
Viewing system logs¶
To help you gain insights into user services and troubleshoot issues, Anaconda Enterprise provides detailed logs and debugging information related to the Kubernetes services and containers it uses.
To access these logs from the Operations Center:
Log in to Anaconda Enterprise, select the Menu icon in the top right corner, and click the Administrative Console link displayed at the bottom of the slide-out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Click Logs in the left menu to display the complete system log.
Use the Filter drop-down to view logs based on a container.
Note
You can also access the logs for a specific pod by clicking Kubernetes in the left menu, clicking the Pods tab, clicking the name of a pod, and selecting Logs.
Individual pods¶
To display the logs for a user session or deployment you’ll need to identify the appropriate Kubernetes pod name.
For an editor session the Kubernetes pod name corresponds to the hostname
of the session container. Run hostname in a terminal window. For
deployments the pod name is available from the logs tab of the deployment
under the heading name.
Click the Kubernetes tab from the menu on the left
Click the Pods tab to display a list of all pods and containers. Editor sessions are named anaconda-session-XXXXX and deployments are named anaconda-app-XXXX.
For the chosen pod click the pull-down button on an individual container to view the Logs or to gain SSH access.
To use the CLI:
Open an SSH session on the master node in a terminal by logging into the Operations Center and selecting Servers from the menu on the left.
Click on the IP address for the Anaconda Enterprise master node and select SSH login as root.
In the terminal window, run sudo gravity enter.
Run kubectl get pods to view a list of all running session pods.
Run kubectl logs <POD-NAME> to display the logs for the pod specified.
Viewing activity logs¶
Anaconda Enterprise logs all activity performed by users, including the following:
Each system login.
All Admin actions.
Each time a project is created and updated.
Each time a project is deployed.
In each case, the user who performed the action and when it occurred are tracked, along with any other important details.
As an Administrator, you can log in to the Administrative Console’s Authentication Center to view the log of all login and Admin events:
Log in to Anaconda Enterprise, select the Menu icon
in the top right corner and click the Administrative Console link displayed at the bottom of the slide out window.
Click Manage Users.
Log in to the Authentication Center using the Administrator credentials required to access it.
Click Events in the left menu to display a log of all Login Events.
Click the Admin Events tab to view a summary of all actions performed by Admin users.
To filter events:
Event data can become difficult to manage as it accumulates, so Anaconda Enterprise provides a few options to make it more manageable:
Click the Config tab to configure the types of events you want Anaconda Enterprise to log, to clear events, and to schedule periodic deletion of event logs from the database.
Use the Filter options available on both the Login Events and Admin Events windows to control the results displayed based on variables such as event or operation, user or resource, and a range of dates.
Click Update to refresh the results based on the filter you configured, and Reset to return to the original log results.
Select the maximum number of results you want displayed: 5, 10, 50 or 100.
To view activity at the project level:
Switch to the User Console and click Projects in the top menu.
Select the project you want to view information about to display a list of all actions performed on the project in the Activity window.
Fault tolerance in Anaconda Enterprise¶
Anaconda Enterprise employs automatic service restarts and health monitoring to remain operational if a process halts or a worker node becomes unavailable. Additional levels of fault tolerance, such as service migration, are provided if there are at least three nodes in the deployment. However, the master node cannot currently be configured for automatic failover and does present a single point of failure.
When Anaconda Enterprise is deployed to a cluster with three or more nodes, the core services are automatically configured into a fault tolerant mode—whether Anaconda Enterprise is initially configured this way or changed later. As soon as there are three or more nodes available, the service fault tolerance features come into effect.
This means that in the event of any service failure:
Anaconda Enterprise core services will automatically be restarted or, if possible, migrated.
User-initiated project deployments will automatically be restarted or, if possible, migrated.
If a worker node becomes unresponsive or unavailable, it will be flagged while the core services and backend continue to run without interruption. If additional worker nodes are available the services that had been running on the failed worker node will be migrated or restarted on other still-live worker nodes. This migration may take a few minutes.
The process for adding new worker nodes to the Anaconda Enterprise cluster is described in Adding and removing nodes.
Storage and persistency layer
Anaconda Enterprise does not automatically configure storage or persistency layer fault tolerance when using the default storage and persistency services. This includes the database, Git server, and object storage. If you have configured Anaconda Enterprise to use external storage and persistency services then you will need to configure these for fault tolerance.
Recovering after node failure
Other than storage-related services (database, Git server, and object storage), all core Anaconda Enterprise services are resilient to master node failure.
To maintain operation of Anaconda Enterprise in the event of a master node failure, /opt/anaconda/ on the master node should be located on a redundant disk array or backed up frequently to avoid data loss. See Backing up and restoring AE for more information.
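For example, you could schedule the backup.sh script described in Backing up and restoring AE to run regularly from the master node's crontab; the schedule below is illustrative:
0 2 * * 0 cd /opt/anaconda && bash backup.sh   # weekly, Sundays at 2:00 am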
To restore Anaconda Enterprise operations in the event of a master node failure:
Create a new master node. Follow the installation process for adding a new cluster node, described in command-line installations.
Note
To create the new master node, specify --role=ae-master instead of --role=ae-worker.
Restore data from a backup. After the installation of the new master node is complete, follow the instructions in Backing up and restoring AE.
Configuring user access¶
As an Administrator, you’ll need to authorize users so they can use Anaconda Enterprise. This involves adding users to the system, setting their credentials, mapping them to roles, and optionally assigning them to one or more groups.
To help expedite the process of authorizing large groups of users, you can connect to an external identity provider such as LDAP or Active Directory and federate those users.
You’ll need access to the Administrative Console’s Authentication Center to be able to use it to configure identity and access management for Anaconda Enterprise. Follow these instructions to grant Admins permission to manage AE users.
Connecting to external identity providers¶
Anaconda Enterprise comes with out-of-the-box support for LDAP, Active Directory, SAML and Kerberos. As each enterprise configuration is different, coordinate with your LDAP/AD Administrator to obtain the provider-specific information you need to proceed. We’ve also provided an example of an LDAP setup to help guide you through the process.
Note
You must have pagination turned off before starting.
Adding a provider¶
You’ll use the Administrative Console’s Authentication Center to add an identity provider:
Log in to Anaconda Enterprise, click the Menu icon in the top right corner, then click the Administrative Console link at the bottom of the slide-out menu.
Click Manage Users.
Log in to the Authentication Center using the Administrator credentials required to access it.
In the Configure menu on the left, select User Federation.
Select ldap from the Add provider selector to display the initial Required Settings screen.
Multiple fields are required. The most important is the Vendor drop-down list, which will prefill default settings based on the LDAP provider you select. Make sure you select the correct one: Active Directory, Red Hat Directory Server, Tivoli, or Novell eDirectory. If none of these matches, select Other and coordinate with your LDAP Administrator to provide values for the required fields:
- Username LDAP attribute
Name of the LDAP attribute that will be mapped to the username. Active Directory installations may use cn or sAMAccountName. Others often use uid.
- RDN LDAP attribute
Name of the LDAP attribute that will be used as the RDN for a typical user DN lookup. This is often the same as the above “Username LDAP attribute”, but does not have to be. For example, Active Directory installations may use
cn for this attribute while using sAMAccountName for the "Username LDAP attribute".
Name of an LDAP attribute that will be unique across all users in the tree. For example, Active Directory installations should use
objectGUID. Other LDAP vendors typically define a UUID attribute, but if your implementation does not have one, any other unique attribute (such as uid or entryDN) may be used.
- User Object Classes
Values of the LDAP
objectClass attribute for users, separated by a comma. This is used in the search term for looking up existing LDAP users, and if read-write sync is enabled, new users will be added to LDAP with these objectClass values as well.
- Connection URL
The URL used to connect to your LDAP server. Click Test connection to make sure your connection to the LDAP server is configured correctly.
- Users DN
The full DN of the LDAP tree that is the parent of LDAP users. For example, 'ou=users,dc=example,dc=com'.
- Authentication Type
The LDAP authentication mechanism to use. The default is
simple, which requires the Bind DN and password of the LDAP Admin.
- Bind DN
The DN of the LDAP Admin, required to access the LDAP server.
- Bind Credential
The password of the LDAP Admin, required to access the LDAP server. After supplying the DN and password, click Test authentication to confirm that your connection to the LDAP server can be authenticated.
Configuring sync settings¶
By default, users will not be synced from the LDAP / Active Directory store until they log in. If you have a large number of users to import, it can be helpful to set up batch syncing and periodic updates.
Configuring mappers¶
After you complete the initial setup, the auth system generates a set of “mappers” for your configuration. Each mapper takes a value from LDAP and maps it to a value in the internal auth database.
Go through each mapper and make sure it is set up appropriately.
Check that each mapper reads the correct “LDAP attribute” and maps it to the right “User Model Attribute”.
Check that the attribute’s “read-only” setting is correct.
Check whether the attribute should always be read from the LDAP store and not from the internal database.
For example, the username mapper sets the Anaconda Enterprise username from the configured LDAP attribute.
Configuring advanced mappers¶
Instead of manually configuring each user, you can automatically import user data from LDAP using additional mappers. The following mappers are available:
- User Attribute Mapper (user-attribute-ldap-mapper)
Maps LDAP attributes to attributes on the AE5 user. These are the default mappers set up in the initial configuration.
- FullName Mapper (full-name-ldap-mapper)
Maps the full name of the user from LDAP into the internal database.
- Role Mapper (role-ldap-mapper)
Sets role mappings from LDAP into realm role mappings. One role mapper can be used to map LDAP roles (usually groups from a particular branch of an LDAP tree) into realm roles with corresponding names.
Multiple role mappers can be configured for the same provider. It's possible to map roles to a particular client (such as the anaconda-deploy service), but it's usually best to map in realm-wide roles.
- Hardcoded Role Mapper (hardcoded-ldap-role-mapper)
Grants a specified role to each user linked with LDAP.
- Hardcoded Attribute Mapper (hardcoded-ldap-attribute-mapper)
Sets a specified attribute to each user linked with LDAP.
- Group Mapper (group-ldap-mapper)
Sets group mappings from LDAP. Can map LDAP groups from a branch of an LDAP tree into groups in the Anaconda Platform realm. It will also propagate user-group membership from LDAP. We generally recommend using roles and not groups, so the role mapper may be more useful.
Warning
The group mapper provides a setting, Drop non-existing groups during sync. If this setting is turned on, groups in the Anaconda Enterprise Authentication Center that do not exist in LDAP will be erased during the sync.
- MSAD User Account Mapper (msad-user-account-control-mapper)
Microsoft Active Directory (MSAD) specific mapper. Can tightly integrate the MSAD user account state into the platform account state, including whether the account is enabled, whether the password is expired, and so on. Uses the userAccountControl and pwdLastSet LDAP attributes.
For example, if pwdLastSet is 0, the user is required to update their password and there will be an UPDATE_PASSWORD required action added to the user. If userAccountControl is 514 (disabled account), the platform user is also disabled.
Mapper configuration example¶
To map LDAP group membership to Anaconda Platform roles, use a role mapper.
Add a mapper of the role-ldap-mapper type:
In consultation with your LDAP administrator and internal LDAP documentation, define which LDAP group
tree will be mapped into roles in the Anaconda Platform realm. The roles are mapped directly by
name, so an LDAP membership of ae-deployer will map to the role of the same name in Anaconda Platform.
Authorizing LDAP groups and roles¶
To authorize LDAP group members or roles synced from LDAP to perform various functions, add them
to the anaconda-enterprise-anaconda-platform.yml configmap.
EXAMPLE: To give users in the LDAP group “AE5”, and users with the LDAP-synced role “Publisher”, permission to deploy apps, the deploy section would look like this:
deploy:
  port: 8081
  prefix: '/deploy'
  url: https://abc.demo.anaconda.com/deploy
  https:
    key: /etc/secrets/certs/privkey.pem
    certificate: /etc/secrets/certs/cert.pem
  hosts:
    - abc.demo.anaconda.com
  db:
    database: anaconda_deploy
  users: '*'
  deployers:
    users: []
    groups:
      - developers
      - AE5
    roles:
      - Publisher
After editing the configmap, restart all pods for your changes to take effect:
kubectl get pods | grep ap- | cut -d' ' -f1 | xargs kubectl delete pods
Configuring LDAPS (Outbound SSL)¶
To make requests to secure internal resources, such as internal enterprise LDAP servers using corporate SSL certificates, you must configure a "trust store". If your internal servers instead use certificates issued by a public root CA, the default trust store is sufficient and this step is optional.
To create a trust store, you must have the public certificates you wish to trust available.
Note
These are certificates for your trusted server such as Secure LDAP, not for Anaconda Enterprise.
Option 1¶
If the CA certificates are directly available to you, run the following command, replacing CAFILE.cert with your CA certificate file:
keytool -import -file CAFILE.cert -alias auth -keystore LDAPS.jks
Note
If you want to add an intermediate certificate, run this command again with a unique alias, to include it in the LDAPS.jks file.
Option 2¶
Alternatively, if you also have the server certificate and key, you can construct a full trust chain in the store.
Convert the certificate and key files to PKCS12 format—if they are not already—by running the following command:
openssl pkcs12 -export -chain -in CERT.pem -inkey CERT-KEY.pem -out PKCS-CHAIN.p12 -name auth -CAfile CA-CHAIN.pem
In this example, replace CERT.pem with the server’s certificate, CERT-KEY.pem with the server’s key,
PKCS-CHAIN.p12 with a temporary file name, and CA-CHAIN.pem with the trust chain file (up to
the root certificate of your internal CA).
Create a Java keystore to store the trusted certs:
keytool -importkeystore -destkeystore LDAPS.jks -srckeystore PKCS-CHAIN.p12 -alias auth
You will be prompted to set a password. Record the password.
Final steps¶
For both options, you’ll need to follow the steps below to expose the certificates to the Anaconda Enterprise Auth service:
Export the existing SSL certificates for your system by running the following commands:
sudo gravity enter
kubectl get secrets anaconda-enterprise-certs --export -o yaml > /opt/anaconda/secrets-exported.yml
Exit the gravity environment, and back up the secrets file before you edit it:
cp secrets-exported.yml secrets-exported-orig.yml
Run the following command to encode the newly created truststore as base64:
echo " ldaps.jks: "$(base64 -i --wrap=0 OUTPUT.jks)
Copy the output of this command and paste it into the data section of the secrets-exported.yml file.
Run the following commands to update Anaconda Enterprise with the secrets certificate:
sudo gravity enter
kubectl replace -f /opt/anaconda/secrets-exported.yml
Verify that the LDAPS.jks entry has been added to the secret:
kubectl describe secret anaconda-enterprise-certs
Edit the platform configuration by setting the auth.https.truststore configuration key to /etc/secrets/certs/ldaps.jks, and auth.https.truststore-password to the matching password. For example, after editing, it should resemble the sketch below.
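The original sample for this step is not reproduced here; as a sketch, the relevant portion of the platform configuration might resemble the following, with the surrounding structure assumed:
auth:
  https:
    truststore: /etc/secrets/certs/ldaps.jks
    truststore-password: <your-truststore-password>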
Run the following commands to restart the auth service:
sudo gravity enter
kubectl get pods | grep ap-auth | cut -d' ' -f1 | xargs kubectl delete pods
Managing users¶
Managing access to Anaconda Enterprise involves adding and removing users, setting passwords, mapping users to roles, and optionally assigning them to groups. To help expedite the process of authorizing large groups of users at once, you can connect to an external identity provider using LDAP, Active Directory, SAML, or Kerberos to federate those users.
Note
To be able to perform these actions, you’ll need the appropriate login credentials required to access the Administrative Console’s Authentication Center.
The process of authorizing Operations Center Admins is slightly different. See Managing System Administrators for more information.
To access the Authentication Center:
Log in to Anaconda Enterprise, click the Menu icon in the top right corner, then click the Administrative Console link at the bottom of the slide-out menu.
Click Manage Users.
If this is the first time accessing the Authentication Center, log in using the default admin credentials. Otherwise, use the credentials that grant you Admin privileges in the Authentication Center.
Note
To create and manage other Authentication Center Admins, use the realm selector in the upper left corner to switch to the Master realm before proceeding.
In the Manage menu on the left, click Users.
On the Lookup tab, click View all users to list every user in the system, or search the user database for all users that match the criteria you enter, based on their first name, last name, or email address.
Note
This will search the local user database and not the federated database (such as LDAP) because not all external identity provider systems include a way to page through users. If you want users from a federated database to be synced into the local database, select User Federation in the Configure menu on the left, and adjust the Sync Settings for your user federation provider.
To create a new Anaconda Enterprise user, click Add user and specify a user name—and optionally provide values for the other fields—before clicking Save.
Warning
User names containing Unicode characters—special characters, punctuation, symbols, or spaces—are not permitted.
To configure a user, click the user’s ID in the list and use the available tabs as follows:
Use the Details tab to specify information for the user, optionally enable user registration and required actions, and impersonate the user. If you include an email address, an invitation to join Anaconda Enterprise will be sent to the address specified.
Use the Credentials tab to manage the user’s password. If the Temporary switch is on, this new password can only be used once—the user will be asked to change their password after they use it to log in to Anaconda Enterprise.
Use the Role Mappings tab to assign the user one or more roles, and the Groups tab to add them to one or more groups. See managing roles and groups for more information.
Note
To grant Authentication Center Administrators sufficient authority to manage AE users, you’ll need to assign them the admin role.
Use the Sessions tab to view a summary of all sessions the user has started, and log them out of all sessions in a single click. This is handy if a user goes on vacation without logging out of their sessions. You can use the Operations Center to view a summary of all sessions running on specific nodes or by specific users. See monitoring sessions and deployments for more information.
To view and edit the fine-grained permissions you can enable to define policies for allowing other users to manage users in the selected realm, return to the Users list and select the Permissions tab.
Enabling user registration¶
You can use the Authentication Center to enable users to self-register and create their own accounts. When enabled, the login page will include a Register link users can click to open the registration page, where they enter the user profile information and password required to create their new account.
Click Realm Settings under Configure in the menu on the left.
Click the Login tab, and enable the User registration switch.
You can change the look and feel of the registration form, as well as remove or add fields that must be entered. See the Server Developer Guide for more information.
Enabling required actions¶
You can use the Required User Actions drop-down list—on the Details tab for each user—to select the tasks that a user must complete (after providing their credentials) before they are allowed to log in to Anaconda Enterprise:
- Update Profile
This requires the user to update their profile information, such as their name, address, email, and phone number.
- Update Password
When set, a user must change their password.
- Configure OTP
When set, a user must configure a one-time password generator on their mobile device using either the Free OTP or Google Authenticator application.
Setting default required actions¶
You can specify default required actions that will be added to all new user accounts. Select Authentication from the Configure menu on the left and use the Required Actions tab to specify whether you want each required action to be enabled—available for selection—or also pre-populated as a default for all new users.
Note
A required action must be enabled to be specified as a default.
Using terms and conditions¶
Many organizations have a requirement that when a new user logs in for
the first time, they need to agree to the terms and conditions of the
website. This functionality can be implemented as a
required action, but it requires some configuration. In addition to enabling
Terms and Conditions as a required action, you must also edit the terms.ftl file
in the base login theme. See the Server Developer Guide
for more information on extending and creating themes.
Impersonating users¶
It is often useful for an Administrator to impersonate a user. For example, a user may be experiencing an issue using an application and an Admin may want to impersonate the user to see if they can duplicate the problem.
Note
Any user with the realm’s impersonation role can impersonate a user.
The Impersonate command is available from both the Users list and the Details tab for a user.
Click Impersonate to display a list of applications the user has accessed on the platform, including editor sessions and deployments.
Click the Anaconda Platform link to interact with Anaconda Enterprise as the impersonated user.
Note
If the Admin and the user are in the same realm, the Admin will be logged out and automatically logged in as the user being impersonated. If the Admin and user are not in the same realm, the Admin will remain logged in and be logged in as the user in that user’s realm.
Managing roles and groups¶
Assigning access and permissions to individual users can be too fine-grained and cumbersome for organizations to manage, so Anaconda Enterprise enables you to assign access permissions to specific roles, then use groups to assign one or more roles to sets of users. Users inherit the attributes and role mappings of every group they belong to, whether that's one group, several, or none.
The use of groups to assign permissions is entirely optional, so you can rely solely on roles to assign users permission to perform certain actions in Anaconda Enterprise.
You’ll use the Admin Console’s Authentication Center to create and manage roles and groups. This includes creating new roles and groups, configuring defaults for each, and assigning roles to groups.
You’ll use the Admin Console’s Operations Center to configure permissions for any roles you create, and optionally the default system roles provided by Anaconda Enterprise.
Note
When naming users and groups that you create, consider that Anaconda Enterprise users can add collaborators by user or group name when sharing their projects and deployments, as well as packages and channels.
To access the Authentication Center:
Log in to Anaconda Enterprise, click the Menu icon in the top right corner, then click the Administrative Console link at the bottom of the slideout menu.
Click Manage Users.
Log in to the Authentication Center using the Administrator credentials configured after installation.
To manage roles:
Use roles to authorize individual users or groups of users to perform specific actions within Anaconda Enterprise. Default roles allow you to automatically assign user role mappings when any user is newly created or imported (for example, through LDAP).
You’ll use the Authentication Center to configure new roles and specify default roles to be automatically added to all new user accounts.
In the Configure menu on the left, click Roles to display a list of roles configured for use with Anaconda Enterprise.
To get you started, Anaconda Enterprise provides a set of “realm” roles. You can use these system roles as is, or as a basis for creating your own.
- ae-admin
Allows a user to access the Administrative console.
- ae-creator
Allows a user to create new projects.
- ae-deployer
Allows a user to create new deployments from projects.
- ae-uploader
Allows a user to upload packages.
Note
To define roles that are global to Anaconda Enterprise, use the realm selector in the upper left corner to switch to the Master realm before proceeding.
To create a new role, click Add Role on the Realm Roles tab.
Enter a name and description of the role, and click Save.
Note
Roles can be assigned to users automatically or require an explicit request. If a user has to explicitly request a realm role, enable the Scope Param Required switch. The role must then be specified using the scope parameter when requesting a token.
The new role is now available to be used as a default role, or to be assigned to groups of users.
To configure default roles, click the Default Roles tab.
When working with the AnacondaPlatform realm, you can configure default roles for Anaconda Enterprise users using the list of available and default Realm Roles.
When working with the Master realm, you can configure default roles for a specific client or service namespace using the list of available and default roles for the client you select from the Client Roles drop-down list.
Note
To customize the list of roles available for Anaconda Enterprise Admins to use, select AnacondaPlatform-realm from the list.
To manage groups:
In the Manage menu on the left, click Groups to display a list of groups configured for use with Anaconda Enterprise.
To get you started, Anaconda Enterprise provides a set of default groups, with different role mappings for each. You can use these defaults as is, or as a basis for creating your own. Default groups allow you to automatically assign group membership whenever a new user is created or imported.
Double-click the name of a group to view information about the group and modify it:
Use the Role Mappings tab to assign roles to the group from the list of available Realm Roles and Client Roles. See managing roles for information on how to create new roles. Permission to perform certain actions in Anaconda Enterprise is based on a user’s role, so you can grant permissions to a group of users by mapping the associated role(s) to the group. See the section below for the steps to configure permissions by role.
Use the Members tab to view all users who currently belong to the group. You add users to groups at the user level using the Groups tab for the user. See managing users for more information.
Use the Permissions tab to enable a set of fine-grained permissions that define policies for allowing Admin users to manage the group. See the section below to understand how to configure permissions by role.
To configure permissions for roles:
Log in to Anaconda Enterprise, select the Menu icon in the top right corner and click the Administrative Console link displayed at the bottom of the slide-out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Select Configuration from the menu on the left to display the config map for Anaconda Enterprise.
Note
If anaconda-platform.yml is not displayed, be sure anaconda-enterprise-anaconda-platform.yml is selected in the Config maps drop-down list.
The following sections of the config map have permissions associated with them:
deploy:deployers—used to configure which users can deploy projects
workspace:users—used to configure which users can open project sessions
storage:creators—used to configure which users can create projects
repository:uploaders—used to configure which users can upload packages to the AE repository
Save a copy of this file before making any changes to anaconda-platform.yml. Any changes you make to the platform configuration will impact how Anaconda Enterprise functions, so you’ll want a backup if the need to restore a previous configuration arises.
Add each new role you create to the appropriate section—based on the permission you want to grant the role—and click Apply to save your changes.
For example, if you create a new role called ae-managers, and you want users with this role to be able to deploy applications, you need to add that role to the list of roles under deploy:deployers to map the permission to the role.
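As a rough sketch, assuming the default ae-deployer role is already listed and omitting the rest of the file, the edited section might resemble:
deploy:
  deployers:
    - ae-deployer
    - ae-managers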
Managing System Administrators¶
Anaconda Enterprise distinguishes between System Administrators responsible for authorizing AE platform users, and System Administrators responsible for managing AE resources. This enables enterprises to grant the permissions required for configuring each to different individuals, based on their area of responsibility within the organization.
Sys Admins who are granted permission to access the Authentication Center can configure authentication for all platform users, including platform Admins. See managing users for information on how to create and manage Authentication Center Admins.
Sys Admins who are granted permission to access the Operations Center can manage AE resources and configure advanced platform settings.
Note
The login credentials for the Operations Center are initially set as part of the post-install configuration process. Follow the steps outlined below to authorize additional Admin users to manage cluster resources, using the Operations Center UI or a command line. If you prefer to use OpenID Connect (OIDC), see Configuring Operations Center Admins using Google OIDC.
Managing Operations Center Admins using the UI¶
Log in to Anaconda Enterprise, select the Menu icon
in the top right corner and click the Administrative Console link displayed at the bottom of the slide-out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Select Settings in the login menu in the upper-right corner.
In the left menu, select Users, then click + New User in the upper-right corner.
Select @teleadmin from the Roles drop-down list, and click Create invite link.
Copy the invitation URL that is generated, replace the private IP address with the fully-qualified domain name of the host, if necessary, and send it to the individual using your preferred method of secure communication. They’ll use it to set their password, and will be automatically logged in to the Operations Center when they click Continue.
To generate a new invitation URL, select Renew invitation in the Actions menu for the user.
Select Revoke invitation to prevent them from being able to use the invitation to create a password and access the Operations Center. This effectively deletes the user before they have a chance to set their credentials.
To delete—or otherwise manage—an Operations Center user after they have set their credentials and completed the authorization process, select the appropriate option from the Actions menu.
Managing Operations Center Admins using a command line¶
To create a new Admin:
Run the following commands on the Anaconda Enterprise master node, replacing <email> and <yourpass> with the email address and password for the user:
sudo gravity enter
gravity --insecure user create --type=admin --email=<email> --password=<yourpass> --ops-url=https://gravity-site.kube-system.svc.cluster.local:3009
To verify that the user was created, run the following command:
sudo gravity resource get users
To update an Admin user’s password:
To update an Admin user’s password, you’ll need to delete the user account, then re-create it, replacing <email> and <yourpass> with the email address and new password:
sudo gravity enter
gravity --insecure user delete --email=<email> --ops-url=https://gravity-site.kube-system.svc.cluster.local:3009
gravity --insecure user create --type=admin --email=<email> --password=<yourpass> --ops-url=https://gravity-site.kube-system.svc.cluster.local:3009
Configuring session timeouts¶
As an Administrator, you can configure session timeouts for Anaconda Enterprise platform users, to help you adhere to your organization’s security standards or enforce policies.
You’ll use the Administrative Console’s Authentication Center to set the various parameters related to session timeouts:
Log in to Anaconda Enterprise, click the Menu icon in the top right corner, then click the Administrative Console link at the bottom of the slideout menu.
Click Manage Users.
Log in to the Authentication Center using the Administrator credentials required to access it.
In the Configure menu on the left, select Realm Settings.
Click the Tokens tab at the top to display the following:
Use the available configuration options to specify maximum thresholds for each aspect of user sessions, including the following:
Time limits for idle browser sessions and single sign on (SSO) tokens
Lifespans for OpenID access tokens
Time limits for login-related actions, such as resetting a forgotten password
| Configuration option | Description |
|---|---|
| Revoke Refresh Token | If enabled, limits refresh tokens to one-time use |
| SSO Session Idle | User will be logged out of the session if inactive for this length of time |
| SSO Session Max | Maximum time a user session can remain active, regardless of activity |
| Offline Session Idle | Amount of time an offline session can be idle before the access token is revoked |
| Access Token Lifespan | Amount of time an access token will remain valid, before expiring |
| Access Token Lifespan For Implicit Flow | Timeout for access tokens created with Implicit Flow—no refresh token is provided |
| Client login timeout | Maximum time a client can take to complete the authorization process |
| Login timeout | Maximum time a user can take to authenticate before the process restarts |
| Login action timeout | Maximum time a user can spend on any one page in the authentication process |
| User-Initiated Action Lifespan | Maximum time before a user-initiated action (e.g., forgot password email) expires |
| Default Admin-Initiated Action Lifespan | Maximum time before an admin-initiated action (e.g., issue token to user) expires |
| Override User-Initiated Action Lifespan | Use to optionally configure different timeouts for each user-initiated action |
Click Save to save your changes to the Anaconda Enterprise platform.
LDAP setup example¶
Configuring identity and access management is complex, and each enterprise has a different LDAP directory structure. While your implementation will be based on the specific structure and needs of your organization, the principles and processes outlined here will enable you to:
Reduce the number of users that need to be mapped into Anaconda Enterprise (by mapping a functional role—AE5 User—to an LDAP group). This also simplifies license management through a single group membership.
Reduce the number of groups that are mapped into Anaconda Enterprise (by filtering groups to include only relevant functional roles and team memberships).
Automate the import of new groups for team memberships based on filters.
Automate the provision of AE5 roles to users based on group membership of functional roles.
Roles are used to determine the types of objects in Anaconda Enterprise that users with the role can access using the platform, such as packages or projects. This example is provided to help guide you through the process of mapping default Anaconda Enterprise roles to the following common functional business roles:
Business Analyst
Data Scientist
Data Engineer
DevOps
Administrator
Follow the general processes outlined below for your specific implementation:
Retrieving directory structures and user attributes¶
The organizational structure of your enterprise is represented in LDAP by a directory structure or tree. You’ll need to request the bind user credentials from your Security Administrator.
While you can make assumptions about the directory structure based on the bind user credentials, it’s extremely difficult to set up an identity provider without the complete structure. For example, if the bind user credentials are uid=binduser,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io, we can deduce that the root or base of the tree is dc=tools,dc=continuum,dc=io.
Tools are available to help you visualize your organization’s directory structure. For example, phpldapadmin generated the following view:
The rest of the bind user credentials become apparent after looking at the directory structure. In this example, we can see that users live under cn=accounts > cn=users, and groups live under cn=accounts > cn=groups.
Now that you know the directory structure, you can gather information about the user and group entries that you’ll need later.
You can use the ldapsearch tool—along with the binduser credentials—to learn details about an individual user based on their uid. Here’s a sample command for the user gandalf:
ldapsearch -D 'uid=binduser,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io' -W -H ldap://ipa.tools.continuum.io -b dc=tools,dc=continuum,dc=io "(uid=gandalf)"
Results will resemble the following:
# gandalf, users, compat, tools.continuum.io
dn: uid=gandalf,cn=users,cn=compat,dc=tools,dc=continuum,dc=io
objectClass: posixAccount
objectClass: ipaOverrideTarget
objectClass: top
gecos: gandalf the grey
cn: gandalf the grey
uidNumber: 1666600031
gidNumber: 1666600031
loginShell: /bin/sh
homeDirectory: /home/gandalf
ipaAnchorUUID:: OklQQTp0b29scy5jb250aW51dW0uaW86OTEyYTMwNjgtZDhmYy0xMWU4LTgzYT
UtMTIyYTE3YWNlMzJh
uid: gandalf
# gandalf, users, accounts, tools.continuum.io
dn: uid=gandalf,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io
displayName: gandalf the grey
uid: gandalf
krbCanonicalName: gandalf@TOOLS.CONTINUUM.IO
objectClass: top
objectClass: person
objectClass: organizationalperson
objectClass: inetorgperson
objectClass: inetuser
objectClass: posixaccount
objectClass: krbprincipalaux
objectClass: krbticketpolicyaux
objectClass: ipaobject
objectClass: ipasshuser
objectClass: ipaSshGroupOfPubKeys
objectClass: mepOriginEntry
loginShell: /bin/sh
initials: gt
gecos: gandalf the grey
sn: the grey
homeDirectory: /home/gandalf
mail: gandalf@tools.continuum.io
krbPrincipalName: gandalf@TOOLS.CONTINUUM.IO
givenName: gandalf
cn: gandalf the grey
ipaUniqueID: 912a3068-d8fc-11e8-83a5-122a17ace32a
uidNumber: 1666600031
gidNumber: 1666600031
krbPasswordExpiration: 20181026085310Z
krbLastPwdChange: 20181026085310Z
memberOf: cn=ipausers,cn=groups,cn=accounts,dc=tools,dc=continuum,dc=io
memberOf: cn=grp-ae5-user,cn=groups,cn=accounts,dc=tools,dc=continuum,dc=io
memberOf: cn=grp-ae5-wizards,cn=groups,cn=accounts,dc=tools,dc=continuum,dc=io
memberOf: cn=grp-lord-of-the-rings,cn=groups,cn=accounts,dc=tools,dc=continuum
,dc=io
# search result
search: 2
result: 0 Success
# numResponses: 3
# numEntries: 2
Within these results, you’ll find the information you need to set up user federation for LDAP.
Setting up LDAP user federation¶
You’ll use the Anaconda Enterprise Administrative Console’s Authentication Center to add LDAP as your identity provider:
Log in to Anaconda Enterprise, click the Menu icon in the top right corner, then click the Administrative Console link at the bottom of the slideout menu.
Click Manage Users and log in to the Authentication Center using the Administrator credentials configured after installation.
In the Configure menu on the left, select User Federation.
Select ldap from the Add provider selector to display the Add user federation provider Required Settings.
Configure the fields as follows; several of the settings are described in more detail below the table:
| Field | Setting |
|---|---|
| Enabled | ON |
| Console Display Name | ldap (tools.continuum.io) |
| Priority | 0 |
| Import Users | ON |
| Edit Mode | READ_ONLY |
| Sync Registration | OFF |
| Vendor | Red Hat Directory Server |
| Username LDAP attribute | uid |
| RDN LDAP attribute | uid |
| UUID LDAP attribute | uidNumber |
| User Object Classes | person,organizationalperson,inetorgperson |
| Connection URL | ldap://ipa.tools.continuum.io |
| Users DN | cn=users,cn=accounts,dc=tools,dc=continuum,dc=io |
| Authentication Type | simple |
| Bind DN | uid=binduser,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io |
| Bind Credential | |
| Custom User LDAP Filter | (&(objectClass=person)(uid=*)(memberOf=cn=grp-ae5-user,cn=groups,cn=accounts,dc=tools,dc=continuum,dc=io)) |
| Search Scope | One level |
| Validate Password Policy | OFF |
| Use Truststore SPI | Only for ldaps |
| Connection Pooling | ON |
| Connection Timeout | |
| Read Timeout | |
| Pagination | ON |
| Allow Kerberos authentication | OFF |
| Use Kerberos for Password Authentication | OFF |
| Batch Size | 1000 |
| Periodic Full Sync | OFF |
| Periodic Changed Users Sync | OFF |
| Cache Policy | DEFAULT |
Vendor
When you select a vendor from the drop-down list, default values for the most commonly used attributes are prefilled. Be sure to select the correct one, and note that the default values may not match the way your organization has set up its application. Our example uses Red Hat Directory Server, which is based on FreeIPA.
Username, RDN, UUID, User Object Classes, Users DN and Bind DN
Locate the values for these fields in the results of the ldapsearch command you ran previously. The following table outlines how the fields map to the relevant values from our gandalf user example:
| Field | LDAP Search Value | Description |
|---|---|---|
| Username | uid: gandalf | The unique ID used to identify the user. |
| RDN | uid: gandalf | Usually the same as the Username, but may default to something else depending on the vendor selected |
| UUID | uidNumber: 1666600031 | Unique identifier |
| User Object Classes | objectClass: person, objectClass: organizationalperson, objectClass: inetorgperson | User object classes combined in a single field |
| Users DN | dn: uid=gandalf,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io | The dn less the uid component |
| Bind DN | uid=binduser,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io | Usually provided by Security Admin |
Custom User LDAP Filter
You can use a custom filter to restrict which users are returned from LDAP. In this case, we want only those persons (objectClass=person) with any uid (uid=*) that are a member of the group grp-ae5-user (memberOf=cn=grp-ae5-user,cn=groups,cn=accounts,dc=tools,dc=continuum,dc=io). No other users will be able to log in, thereby preventing unauthorized access. This is also useful for managing licenses, as users must be explicitly added to this group to access the platform.
Filters also limit the need to synchronize a large number of objects from LDAP, which will help prevent out of memory errors in the auth pod.
Note
Avoid the temptation to add new groups into the Custom User LDAP Filter. LDAP search criteria are notorious for their complexity, and if it’s implemented incorrectly, all user access could be suspended or functionality disabled.
Testing your provider setup¶
Use the Test connection and Test authentication buttons to verify that the platform can connect to the provider with the credentials provided. You’ll need to resolve any errors before continuing.
By default, users will not be synced from LDAP until they log in. To test whether the Custom User LDAP Filter is working correctly, you can add or remove users in LDAP, then enable the sync settings to see if your changes are picked up and user authentication works as expected.
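As an additional sanity check, you can run the same filter through ldapsearch, reusing the bind credentials from earlier, to confirm it returns exactly the users you expect:
ldapsearch -D 'uid=binduser,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io' -W -H ldap://ipa.tools.continuum.io -b dc=tools,dc=continuum,dc=io "(&(objectClass=person)(uid=*)(memberOf=cn=grp-ae5-user,cn=groups,cn=accounts,dc=tools,dc=continuum,dc=io))"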
After you save the Required Settings, the provider is listed under User Federation:
Configuring group mappers¶
After you have successfully set up user federation, set up a group mapper for your identity provider using the Mappers tab. For example, you can create one called ldap-group-mapper and configure it based on the results generated by the ldapsearch command. In this case, we ran the command against a known group to retrieve the additional information needed:
ldapsearch -D 'uid=binduser,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io' -W -H ldap://ipa.tools.continuum.io -b dc=tools,dc=continuum,dc=io "(cn=grp-ae5-user)"
With the following results:
# grp-ae5-user, groups, compat, tools.continuum.io
dn: cn=grp-ae5-user,cn=groups,cn=compat,dc=tools,dc=continuum,dc=io
objectClass: posixGroup
objectClass: ipaOverrideTarget
objectClass: ipaexternalgroup
objectClass: top
gidNumber: 1666600026
memberUid: czhang
memberUid: dlawrence
memberUid: edill
memberUid: escissorhands
memberUid: gcavanaugh
memberUid: jsandhu
memberUid: rbarthelmie
memberUid: vghadban
memberUid: gandalf
ipaAnchorUUID:: OklQQTp0b29scy5jb250aW51dW0uaW86NGFhOTQ4NzYtZDg4YS0xMWU4LWE2ZD
ctMTIyYTE3YWNlMzJh
cn: grp-ae5-user
# grp-ae5-user, groups, accounts, tools.continuum.io
dn: cn=grp-ae5-user,cn=groups,cn=accounts,dc=tools,dc=continuum,dc=io
objectClass: top
objectClass: groupofnames
objectClass: nestedgroup
objectClass: ipausergroup
objectClass: ipaobject
objectClass: posixgroup
cn: grp-ae5-user
ipaUniqueID: 4aa94876-d88a-11e8-a6d7-122a17ace32a
gidNumber: 1666600026
member: uid=czhang,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io
member: uid=dlawrence,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io
member: uid=edill,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io
member: uid=escissorhands,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io
member: uid=gcavanaugh,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io
member: uid=jsandhu,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io
member: uid=rbarthelmie,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io
member: uid=vghadban,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io
member: uid=gandalf,cn=users,cn=accounts,dc=tools,dc=continuum,dc=io
# search result
search: 2
result: 0 Success
# numResponses: 3
# numEntries: 2
| Field | LDAP Search Value |
|---|---|
| Name | ldap-group-mapper |
| Mapper Type | group-ldap-mapper |
| LDAP Groups DN | cn=groups,cn=accounts,dc=tools,dc=continuum,dc=io |
| Group Name LDAP Attribute | cn |
| Group Object Classes | groupOfNames |
| Preserve Group Inheritance | ON |
| Ignore Missing Groups | OFF |
| Membership LDAP Attribute | member |
| Membership Attribute Type | DN |
| Membership User LDAP Attribute | uid |
| LDAP Filter | (cn=grp-ae5*) |
| Mode | READ_ONLY |
| User Groups Retrieve Strategy | LOAD_GROUPS_BY_MEMBER_ATTRIBUTE |
| Member-Of LDAP Attribute | memberOf |
| Mapped Group Attributes | |
| Drop non-existing groups during sync | OFF |
Note
Avoid the temptation to add new groups into the LDAP Filter in the Group Mapper. LDAP search criteria are notorious for their complexity, and if the filter is implemented incorrectly, all user access could be suspended or functionality disabled.
LDAP Groups DN
Derived from the ldapsearch field dn: cn=grp-ae5-user,cn=groups,cn=accounts,dc=tools,dc=continuum,dc=io. The Groups DN is the portion following the group's cn: cn=groups,cn=accounts,dc=tools,dc=continuum,dc=io.
Group Name LDAP Attribute
Derived from the ldapsearch field: cn: grp-ae5-user
Group Object Classes
A default should have been selected. In this case it is objectClass: groupofnames.
LDAP Filter
All relevant groups, whether they are based on functional role or team membership, have been set up with the prefix grp-ae5-. This prefix is used to filter the relevant groups from the User Federation provider, preventing any unnecessary groups from being pulled into the AE platform.
For example, the user gandalf is a member of the following groups (see the ldapsearch results above): ipausers, grp-ae5-user, grp-ae5-wizards, and grp-lord-of-the-rings.
If you perform a group synchronization, only the grp-ae5-prefixed groups will be imported. Additionally, when gandalf logs in, only the grp-ae5-prefixed groups from his profile will be imported. You can test this by deleting the grp-ae5-wizards group, then logging in as the user gandalf. His team membership group grp-ae5-wizards will be visible in the Auth Center again, but the group grp-lord-of-the-rings will be filtered out and therefore not imported.
Mapping group roles¶
As a final step, you can map Anaconda Enterprise roles to the LDAP groups that are imported into the platform.
In this example, we’ll assign functional role groups the default roles that will allow them to interact with the platform in a way that makes sense for the business. You can also create custom roles, if needed.
| LDAP Group | ae-admin | ae-creator | ae-deployer | ae-uploader | offline_access | uma_authorization | Description |
|---|---|---|---|---|---|---|---|
| grp-ae5-biz-analyst | | | | | X | | Business Analysts can access the system. They cannot create projects or grant others access to the system. |
| grp-ae5-data-scientist | | X | | X | X | X | Data Scientists can create and share projects, but cannot deploy them. |
| grp-ae5-data-engineer | | X | X | X | X | X | Data Engineers can additionally deploy projects, as well as grant access to others. |
| grp-ae5-devops | | | X | X | X | | DevOps can deploy projects and upload packages, but cannot create projects. |
| grp-ae5-sec-admin | | | | | | | This group should be used to administer user access within the system, so no roles should be defined for it. |
| grp-ae5-sysadmin | X | | | | | | By default, the ae-admin role allows members of this group to access the Administrative console. |
| grp-ae5-sysacct | | | | | | | The roles for system accounts are yet to be defined. These could be used for automated CI/CD tasks. |
| grp-ae5-user | | | | | | | This is used as a coarse-grained control for access to AE5, so no roles are defined. |
| grp-ae5-wizards | | | | | | | This is a team membership role, so no AE roles are defined for it. |
Note
Functional role groups should be set up once and left alone.
Use the Role Mappings tab to assign the appropriate role(s) to each group:
Google IAM setup example¶
In addition to providing out-of-the-box support for LDAP, Active Directory, SAML and Kerberos, Anaconda Enterprise also enables you to configure the platform to use other external identity providers to authenticate users. If your enterprise uses Google’s Cloud IAM (Identity and Access Management) to manage access to Google Cloud Platform (GCP) resources, for example, you can use the following process to configure the platform to use Cloud IAM as your identity provider. This will allow users to log in to the platform using their Google (or G-Suite) credentials.
Before you begin:
You’ll need to configure a Google Cloud project on GCP.
You’ll need to enable the Google+ API for the project.
You’ll need to create the credentials to use to authorize the platform to connect to Google IAM.
Enabling the Google+ API¶
With your project selected in Google Cloud Platform:
Select APIs & Services from the menu on the left.
Select ENABLE APIs AND SERVICES, then locate and select the Google+ API card in the API library.
Click ENABLE.
Now you can create credentials for the platform to access your Google Cloud project.
Creating Google+ credentials¶
With your project selected in Google Cloud Platform:
Select APIs & Services > Credentials from the menu on the left.
Click Create credentials and select Help me choose from the drop-down menu.
Note
If you haven’t already, be sure to enable the Google+ API before proceeding.
Select Google+ API from the API drop-down list, Web server from the next drop-down, and User data for the last question.
Click What credentials do I need? to create the appropriate credentials for the platform.
Enter a meaningful name, such as Anaconda Enterprise, to identify the platform (and help differentiate it from any other web applications you may have configured to use Google IAM).
In the Authorized JavaScript origins field, provide the FQDN of the Anaconda Enterprise server instance.
Open the Anaconda Enterprise Auth Center (see instructions below), and copy and paste the value from the Redirect URI field into the Authorized redirect URIs field here.
Note
If the domain is not an authorized domain, you’ll see an Invalid Redirect error, and be prompted to add it to the authorized domains list before proceeding.
Click Create OAuth client ID.
On the OAuth consent screen tab:
Set the Application type to Public.
Set the Application name to Anaconda Enterprise (or something else meaningful to platform users).
Optionally, upload a logo to help users recognize Anaconda Enterprise.
Provide a Support email address for users to reach out for help.
Provide the full path to the authorized homepage where users will access Anaconda Enterprise.
Optionally, provide authorized links to your organization’s privacy policy and terms of service.
Click Create to display the OAuth client credentials that you’ll need to copy and paste into Anaconda Enterprise, to enable the platform to authenticate with Google. (See Step 5 below.)
Configuring Google to be your identity provider¶
Now that you’ve configured your GCP project to work with Anaconda Enterprise, you need to use the Anaconda Enterprise Administrative Console’s Authentication Center to configure Google as your external identity provider:
Log in to Anaconda Enterprise, click the Menu icon in the top right corner, then click the Administrative Console link at the bottom of the slideout menu.
Click Manage Users and log in to the Authentication Center using the Administrator credentials configured after installation.
In the Configure menu on the left, select Identity Providers and select Google from the Add provider drop-down list.
The Settings tab displays the Redirect URI you need to copy to the Google Cloud project’s configuration. The Redirect URI will look similar to this: https://<fully-qualified-domain-name>/auth/realms/AnacondaPlatform/broker/google/endpoint.
Copy and paste the credentials from GCP (Step 9 above) into the Client ID and Client Secret fields, and click Save.
Now that you’ve completed the configuration, the Anaconda Enterprise login screen will include a Google login option.
Note
When users choose this option and log in to the platform, they’ll be automatically added as new AE users. As an Administrator, you can then configure their group assignments and role mappings. For more information, see Managing roles and groups.
Configuring channels and packages¶
Anaconda Enterprise enables you to distribute software through the use of channels and packages. Channels represent locations of repositories where Anaconda Enterprise looks for packages. Packages are used to bundle software files and information about the software—such as its name, specific version and description—into a single file that can be easily installed and managed.
NOTE: Anaconda Enterprise supports the use of both conda and pip packages in its repository.
The process for distributing packages within an organization resembles the following:
Configure access to a cloud-based repository or a private location on a remote or local repository that you or your organization created. See Accessing remote package repositories for more information.
Mirror the entire Anaconda repository or specific packages. You can also mirror packages in a repository in an airgapped environment without internet access.
Share channels with specific users or groups to give them access to the packages within the channel. You can copy packages from one channel into another, customize each channel by including different versions of packages, and delete channels when they are no longer needed. See Managing channels and packages for more information.
Your organization can also optionally configure Anaconda Enterprise to point conda to an on-premises repository, or use a proxy for conda packages.
Accessing remote package repositories¶
As an Administrator, you can configure Anaconda Enterprise to use packages from an online package repository, such as the anaconda and r repositories.
You can then mirror channels and packages into your organization’s internal AE repository so users can access the packages from a centralized, on-premises location.
If users are permitted to install packages from off-site package repositories, you can make it easier for users to access them from within their editing sessions by configuring them as default channels.
To do so, edit your Anaconda Enterprise configuration—anaconda-enterprise-anaconda-platform.yml—to include the appropriate channels, as follows:
conda:
channels:
- defaults
default_channels:
- https://repo.anaconda.com/pkgs/main
- https://repo.anaconda.com/pkgs/free
- https://repo.anaconda.com/pkgs/r
channel_alias: https://<ANACONDA_ENTERPRISE_FQDN>/repository/conda
To update Anaconda Enterprise with your changes to the configuration, restart its services:
sudo gravity enter
kubectl get pods | grep 'ap-' | cut -d' ' -f1 | xargs kubectl delete pods
Mirroring channels and packages¶
Anaconda Enterprise enables you to create a local copy of a repository so users can access the packages from a centralized, on-premises location.
The mirror can be complete, partial, or include specific packages or types of packages. You can also create a mirror in an air-gapped environment to help improve performance and security.
Note
It can take hours to mirror the full repository.
Before you can use Anaconda Enterprise’s convenient syncing tools to configure local mirrors for channels and packages, you’ll need to configure access to the source of the packages to be mirrored, whether an online repository or a tarball (if an airgapped installation).
Prerequisites:
Types of mirroring:
To create a complete mirror, see Mirroring the Anaconda repository or Mirroring a PYPI repository.
To create partial mirror, see Mirroring specific packages.
To mirror a repository in a system without internet access, see Mirroring in an air-gapped environment.
To share mirrors, see Configuring Anaconda Enterprise and Sharing channels.
Configuration options:
Log into Anaconda Enterprise as an existing user using the following command:
$ anaconda-enterprise-cli login
Username: anaconda-enterprise
Password:
Logged anaconda-enterprise in!
Note
If Anaconda Enterprise 5 is installed in a proxied environment, see Mirroring in a proxied environment for information on setting the NO_PROXY variable.
Mirroring the Anaconda repository¶
We recommend the following process as a best practice for mirroring the Anaconda repository.
Instead of using the default anaconda.yaml file included in the mirror tool installation, create two yaml files: one for mirroring the main channel, and another for mirroring the free channel.
Example main.yaml file:
dest_channel: main
channels:
- https://repo.anaconda.com/pkgs/main
platforms:
- linux-64
- noarch
Example free.yaml file:
dest_channel: free
channels:
- https://repo.anaconda.com/pkgs/free
platforms:
- linux-64
- noarch
If you saved both of these files to the home directory, you can use the following commands to mirror these channels. Otherwise, amend the path so that it corresponds to where you saved the files:
cas-sync-api-v5 --file ~/main.yaml
cas-sync-api-v5 --file ~/free.yaml
This mirrors all of the packages from these channels in the Anaconda repository. If the channel doesn’t already exist, it will be automatically created and shared with all authenticated users. You can customize the permissions on the mirrored packages by sharing the channel.
Tip
If you plan to mirror these channels on a regular basis, consider adding the -c flag to get a clean mirror each time. This will automatically remove any packages that have been removed from the Anaconda repository between mirrors from your internal repository—excluding any packages your organization has blacklisted.
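For example, a recurring mirror of the main channel from the earlier example would then look like this:
cas-sync-api-v5 --file ~/main.yaml -c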
Verify that the mirror was successful by logging into your account and navigating to the Packages tab. You should see a list of the mirrored packages.
Mirroring a PyPI repository¶
The full PyPI mirror size is currently close to 4TB, so ensure that your file storage location has sufficient disk space before proceeding. Rather than mirror the entire PyPI repository, you can use a configuration file such as
$PREFIX/etc/anaconda-platform/mirrors/pypi.yaml to customize the mirror behavior and specify the subset of packages you want to mirror.
To create a PyPI mirror:
anaconda-enterprise-cli mirror pypi --config pypi.yaml
This command loads the packages on https://pypi.org into the pypi user account. Mirrored packages can be viewed at <https://anaconda.example.com>/repository/pypi/pypi/simple/, replacing <https://anaconda.example.com> with the actual URL of your installation of Anaconda Enterprise. (The second pypi in the URL should match the user configuration value described below.)
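One quick way to spot-check the mirror is to fetch the simple index and confirm your packages are listed, again substituting your installation's actual URL:
curl https://anaconda.example.com/repository/pypi/pypi/simple/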
The following configuration options are available for you to customize your configuration file:
| Name | Description |
|---|---|
| user | The local user under which the PyPI packages are imported. Default: pypi |
| pkg_list | A list of packages to mirror. Only packages listed are mirrored. If this is set, … |
| whitelist | A list of packages to mirror. Only packages listed are mirrored. If the list is empty, all packages are checked. Default: … |
| blacklist | A list of packages to skip. The packages listed are ignored. Default: … |
| latest_only | Only download the latest versions of the packages. Default: … |
| remote_url | The URL of the PyPI mirror. |
| … | A custom value for the XML RPC URL. If this value is present, it takes precedence over the URL built using remote_url. |
| … | A custom value for the simple index URL. If this value is present, it takes precedence over the URL built using remote_url. |
| use_xml_rpc | Whether to use the XML RPC API as specified by PEP381. If this is set to … |
| … | Whether to use the serial number provided by the XML RPC API. Only packages updated since the last serial saved are checked. If this is set to false, all PyPI packages are checked for updates. Default: … |
| … | Create the mirror user as an organization instead of a regular user account. All superusers are added to the “Owners” group of the organization. Default: … |
Note that all mirrored PyPI-like channels are publicly available to pull packages from both inside and outside the cluster (i.e. no auth token required).
EXAMPLE:
whitelist:
- requests
- six
- numpy
- simplejson
latest_only: true
remote_url: https://pypi.org/
use_xml_rpc: true
Configuring pip¶
To configure pip to use this new mirror, create pip.conf as follows:
[global]
index-url=<https://anaconda.example.com>/repository/pypi/pypi/simple/
replacing <https://anaconda.example.com> with the actual URL to your Anaconda Enterprise.
To configure Anaconda Enterprise sessions and deployments to automatically use the pip.conf, run the following command:
anaconda-enterprise-cli spark-config --config /etc/pip.conf pip.conf
Alternately, you can use the --index-url flag directly when invoking pip.
For example,
pip install --index-url <https://anaconda.example.com>/repository/pypi/pypi/simple/ <package_name>
replacing <https://anaconda.example.com> with the actual URL to your Anaconda Enterprise
installation, and <package_name> with the name of a package that is in your local mirror. In
the example URL, the second pypi should match the user configuration value described
above.
For more specific information on configuring pip, refer to the official documentation at https://pip.pypa.io/en/stable/user_guide/#config-file.
Mirroring specific packages¶
Alternately, you may not wish to mirror all packages. In this case, you can specify which platforms or specific packages you want to mirror, or use the whitelist, blacklist, or license_blacklist functionality to control which packages are mirrored, by editing the provided mirror files. You cannot combine these methods. For more information, see Mirror configuration options.
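For example, a my-custom-anaconda.yaml restricted to a few Linux packages might look like the following sketch; the package names here are purely illustrative:
dest_channel: main
channels:
  - https://repo.anaconda.com/pkgs/main
platforms:
  - linux-64
  - noarch
pkg_list:
  - numpy
  - pandas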
cas-sync-api-v5 --file ~/my-custom-anaconda.yaml
Mirroring R packages¶
An example configuration file for mirroring R packages is also provided:
# This is destination channel of mirrored packages on your local repository.
dest_channel: r
# conda packages from these channels are mirrored to dest_channel on your local repository.
channels:
- https://repo.anaconda.com/pkgs/r/
# if doing a mirror from an airgap tarball, the channels should point to the tarball:
# channels:
# - file:///path-to-expanded-tarball/repo-mirrors-<date>/r/pkgs/
# Only conda packages of these platforms are mirrored.
# Omitting this will mirror packages for all platforms available on specified channels.
# If the repository will only be used to install packages on the v5 system, it only needs linux-64 packages.
platforms:
- linux-64
cas-sync-api-v5 --file ~/cas-mirror/etc/anaconda-platform/mirrors/r.yaml
Mirroring in an air-gapped environment¶
To mirror the repository in a system with no internet access, create a local
copy of the repository using a USB drive provided by Anaconda, and point
cas-sync-api-v5 to the extracted tarball.
First, mount the USB drive and extract the tarball. In this example we will
extract to /tmp:
cd /tmp
tar xvf <path to>/mirror.tar
Note
Replace <path to> with the actual path to the mirror file.
Now you have a local file-system repository located at /tmp/mirror/pkgs. You can
mirror this repository by editing <path to cas-mirror>/etc/anaconda-platform/mirrors/anaconda.yaml to contain:
channels:
- /tmp/mirror/pkgs
And then run the command:
cas-sync-api-v5 --file etc/anaconda-platform/mirrors/anaconda.yaml
This mirrors the contents of the local file-system repository to your
Anaconda Enterprise installation under the username anaconda.
Configuring Anaconda Enterprise¶
After creating the mirror, edit your Anaconda Enterprise configuration to add this new mirrored channel to the default Anaconda Enterprise channels and make the packages available to users.
conda:
channels:
- defaults
default_channels:
- main
- free
- r
channel_alias: https://<anaconda.example.com>/repository/conda
Replacing <anaconda.example.com> with the actual URL to your installation of Anaconda Enterprise.
Note
The ap-workspace pod must be restarted for the configuration change to take effect on new project editor sessions.
To update the Anaconda Enterprise server with your changes, you’ll need to do the following:
Run the following command in an interactive shell to identify the pod associated with the workspace services:
kubectl get pods
Restart the workspace services by running the following command:
kubectl delete pod anaconda-enterprise-ap-workspace-<pod ID>
Sharing channels¶
To make your new channels visible to your users in their Channels list, you need to share the channels with them.
EXAMPLE: To share new channels main, free, and r with group everyone for read access:
anaconda-enterprise-cli channels share --group everyone --level r main
anaconda-enterprise-cli channels share --group everyone --level r free
anaconda-enterprise-cli channels share --group everyone --level r r
After running the share command, verify by logging onto the user interface and viewing the Channels list.
For more information, see Sharing channels and packages.
Mirror configuration options¶
You can use the following options to configure your mirror:
remote_url
Specifies the remote URL from which the conda packages and the Anaconda and
Miniconda installers are downloaded. The default value is: https://repo.continuum.io/.
channels
Specifies the remote channels from which conda packages are downloaded. The
default is a list of the channels <remote_url>/pkgs/free/ and <remote_url>/pkgs/pro/.
All specification information should be included in the same file, and can be
passed to the cas-sync-api-v5 command via the --file argument:
cas-sync-api-v5 --file ~/cas-mirror/etc/anaconda-platform/mirrors/anaconda.yaml
destination channel
The configuration option dest_channel specifies where files will be uploaded.
The default value is: anaconda.
SSL verification¶
The mirroring tool uses two different settings for configuring SSL verification.
When the mirroring tool connects to its destination, it uses the ssl_verify setting
from anaconda-enterprise-cli to determine how to validate certificates. For example,
to use a custom certificate authority:
anaconda-enterprise-cli config set sites.master.ssl_verify /etc/ssl/certs/ca-certificates.crt
The mirroring tool uses conda’s configuration to determine how to validate certificates when connecting to the source that it is pulling packages from. For example, to disable certificate validation when connecting to the source:
conda config --set ssl_verify false
Mirroring in a proxied environment¶
If Anaconda Enterprise 5 is installed in a proxied environment, set the
NO_PROXY variable. This ensures the mirroring tool does not use the proxy when
communicating with the repository service, and prevents errors such as Max
retries exceeded, Cannot connect to proxy, and Tunnel connection failed:
503 Service Unavailable.
export NO_PROXY=<master-node-domain-name>
Platform-specific mirroring¶
By default, the cas-sync-api-v5 tool mirrors all platforms. If you do
not need all platforms, edit the YAML file to specify the platform(s)
you want mirrored:
platforms:
- linux-64
- osx-64
- win-64
Note
The platform argument is evaluated before any other argument.
Package-specific mirroring¶
In some cases you may want to mirror only a small subset of the repository. Rather than blacklisting a long list of packages you do not want mirrored, you can instead simply enumerate the list of packages you DO want mirrored.
Note
This argument cannot be used with the blacklist, whitelist or license_blacklist arguments—it can only be combined with platform-specific and version-specific mirroring.
EXAMPLE:
pkg_list:
- accelerate
- pyqt
- zope
This example mirrors only the three packages: Accelerate, PyQt & Zope. All other packages will be completely ignored.
Python version-specific mirroring¶
Mirror the repository with a Python version or versions specified.
EXAMPLE:
python_versions:
- 3.3
Mirrors only Anaconda packages built for Python 3.3.
License blacklist mirroring¶
The mirroring script supports license blacklisting for the following license families:
AGPL
GPL2
GPL3
LGPL
BSD
MIT
Apache
PSF
Public-Domain
Proprietary
Other
EXAMPLE:
license_blacklist:
- GPL2
- GPL3
- BSD
This example mirrors all the packages in the repository EXCEPT those that are GPL2-, GPL3-, or BSD-licensed, because those three licenses have been blacklisted.
Blacklist mirroring¶
The blacklist allows access to all packages EXCEPT those explicitly listed. If the license_blacklist and blacklist arguments are combined, license_blacklist is evaluated first, and blacklist is a supplemental modifier.
EXAMPLE:
blacklist:
- bzip2
- tk
- openssl
This example mirrors the entire repository EXCEPT the bzip2, Tk,
and OpenSSL packages.
Whitelist mirroring¶
The whitelist argument adds or includes packages that would be otherwise excluded by the blacklist and/or license_blacklist functions.
EXAMPLE:
license_blacklist:
- GPL2
- GPL3
whitelist:
- readline
This example mirrors the entire repository EXCEPT any GPL2- or GPL3-licensed packages, but includes readline, despite the fact that it is GPL3-licensed.
Combining multiple mirror configurations¶
You may find that combining two or more of the arguments above is the easiest way to get the exact combination of packages that you want.
Note
The platform argument is evaluated before any other argument.
EXAMPLE: This example mirrors only Linux-64 distributions of the dnspython, Shapely and GDAL packages:
platforms:
- linux-64
pkg_list:
- dnspython
- shapely
- gdal
If the license_blacklist and blacklist arguments are combined, license_blacklist is evaluated first, and blacklist is a supplemental modifier.
EXAMPLE: In this example, the mirror configuration does not mirror GPL2-licensed packages. It does not mirror the GPL3-licensed package pyqt because it has been blacklisted. It does mirror all other packages in the repository:
license_blacklist:
- GPL2
blacklist:
- pyqt
If the blacklist and whitelist arguments are both employed, the blacklist is
evaluated first, with the whitelist functioning as a modifier.
EXAMPLE: This example mirrors all packages in the repository except astropy and pygments.
Despite being listed on the blacklist, accelerate is mirrored because it is
listed on the whitelist.
blacklist:
- accelerate
- astropy
- pygments
whitelist:
- accelerate
Managing channels and packages¶
Anaconda Enterprise makes it easy for you to manage the various channels and packages used by your organization—whether you prefer using the UI or the CLI.
Log in to the console using the Administrator credentials required to access the Administrative Console.
Select Channels in the left menu to view the list of existing channels, each channel’s owner and when the channel was last updated.
Note
Private channels are displayed with a lock
next to their name in the list, to indicate their secure status.
Click on a channel name to view details about the packages in the channel, including the supported platforms, versions and when each package in the channel was last modified. You can also see the number of times each package has been downloaded.
To add a package to an existing channel, click Upload and browse for the package.
Note
There is a 1GB file size limit for package files you upload.
Click on a package name to view the list of files that comprise the package, and the command used to install the package.
To remove a package from a channel, select Delete from the command menu
for the package in the list.
Warning
The anaconda-enterprise channel is used for internal purposes only, and should not be modified.
Sharing channels¶
To share a public channel, click Share, copy the URL location of the channel, and distribute it to the people with whom you want to share the channel.
To give other platform users read-write access to the channel, click Share and add them as a collaborator. You can share a channel with individual users or groups of users—the easiest way to control access to a channel. See Managing roles and groups for more information.
Note
The default is to grant all collaborators read-write access, so if you want to prevent them from adding and removing packages from the channel, be sure they have read-only access. You’ll need to use the CLI to grant read-only access to specific users or groups (see below).
To create a new channel and add packages to the channel for others to access:
Click Create in the top right corner, enter a meaningful name for the channel and click Create.
Note
Channels are Public—accessible by non-authenticated users—by default. To make the channel Private, and therefore available to authenticated users only, disable the toggle to switch the channel setting from Public to Private.
Click Upload to select the packages you want to add to the channel.
Using the CLI:¶
Get a list of all the channels on the platform with the channels list command:
anaconda-enterprise-cli channels list
Share a channel with a specific user using the share command:
anaconda-enterprise-cli channels share --user username --level r <channelname>
You can also share a channel with an existing group:
anaconda-enterprise-cli channels share --group GROUPNAME --level r <channelname>
Replacing GROUPNAME with the actual name of the group.
Note
Adding --level r grants this group read-only access to the channel.
You can “unshare” a channel using the following command:
anaconda-enterprise-cli channels share --user <username> --remove <channelname>
Run anaconda-enterprise-cli channels --help to see more information about
what you can do with channels.
For help with a specific command, enter that command followed by --help:
anaconda-enterprise-cli channels share --help
Pointing conda to an on-premises repository¶
Anaconda Enterprise users who are familiar with conda may use it to install the packages they need, rather than rely on you to make them available for download via shared channels.
If your organization wants to limit platform users to only access packages in your on-premises repository, you can configure conda accordingly. When you do this at the system level, it overrides any user-level configuration files installed by the user, or on individual machines.
Listing channel locations in the .condarc file overrides conda defaults, causing conda to search only the channels listed, in the order specified.
To configure conda, create or update the .condarc system configuration file in the root directory of the environment to add the repository channel:
channel_alias: https://<your-server.domain.com>/repository/conda/
Replacing <your-server.domain.com> with the fully-qualified domain name (FQDN) of your installation of Anaconda Enterprise.
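For example, a complete .condarc that sets the alias and restricts conda to two channels served from the on-premises repository might look like this (the channel names here are hypothetical):

channel_alias: https://<your-server.domain.com>/repository/conda/
channels:
  - main
  - my-internal-channel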
See this section of the conda docs for more information.
Using a proxy for conda packages¶
You can configure Anaconda Enterprise to use a proxy for conda packages, if your organization’s network security policy requires it. To do so, you’ll need to do the following:
Installing Miniconda¶
Install Miniconda, a mini version of Anaconda that includes conda, its dependencies, and Python.
Download the Miniconda installer to the current working directory.
Note
If you want the file saved in a different directory, make sure you cd to the working directory before running this command.
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Run the following command to install Miniconda to the home directory (e.g., ~centos/miniconda3):

sh Miniconda3-latest-Linux-x86_64.sh
Re-initialize your terminal for the previous steps to take effect:
source ~/.bashrc
Whitelist the local repository by running the following command to set the NO_PROXY environment variable, providing the FQDN of the local repo:

export NO_PROXY=https://<your-server.domain.com>
Installing and configuring the Anaconda Enterprise CLI¶
You’ll need to use the Anaconda Enterprise CLI in subsequent steps, so install it now if you haven’t already done so.
Run the following command to install the Anaconda Enterprise CLI and package mirroring tool:
conda install -kc https://<your-server.domain.com>/repository/conda/anaconda-enterprise anaconda-enterprise-cli cas-mirror git
After the list of package dependencies has been resolved, type y to proceed with the installation.
To configure the Anaconda CLI, run the following commands, using the FQDN of your Anaconda Enterprise instance:

anaconda-enterprise-cli config set sites.master.url https://<your-server.domain.com>/repository/api
anaconda-enterprise-cli config set default_site master
anaconda-enterprise-cli config set ssl_verify false
Configuring Anaconda Enterprise¶
After you’ve installed and configured the required tools, you can update your Anaconda Enterprise configuration:
Log in to the Operations Center UI at https://<your-server.domain.com>:32009, using either the default credentials aeplatform@anaconda.com / aeplatform or the credentials of another Operations Center Admin user.
Click Configuration in the menu on the left and use the Config maps drop-down menu to select the anaconda-enterprise-anaconda-platform.yml configuration file.
Warning
We strongly recommend you make a manual backup copy of this file before editing it, as any changes you make will impact how Anaconda Enterprise functions.
Scroll down to the conda section and ensure it looks like the following:

conda:
  # Common conda settings for editing sessions and deployments
  channels:
    - defaults
  default_channels:
    # List of channels that should be used for channel 'defaults'
    - https://repo.anaconda.com/pkgs/main
    - https://repo.anaconda.com/pkgs/free
    - https://repo.anaconda.com/pkgs/
Run this command to access the CLI:
anaconda-enterprise-cli login
Log in using the same username and password that you use to log in to the Anaconda Enterprise web interface (or the default Admin credentials anaconda-enterprise/anaconda-enterprise).
Create a config file (condarc.secret.txt) for conda proxying with the following content, and mount it at /etc/conda/.condarc:

proxy_servers:
  http: http://proxy.url.com:<port>
  https: https://proxy.url.com:<port>
Run the following command to create a Kubernetes secret:
anaconda-enterprise-cli spark-config --config /etc/conda/.condarc condarc.secret.txt
Upload the secret to Kubernetes.
Warning
This will delete any existing custom Kubernetes secrets in anaconda-config-files-secret.yaml, so if you’ve already configured other secrets (e.g., for Hadoop Spark access) make sure you include those secrets and move the existing file to a remote location to preserve it.
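For example, you might preserve the current copy before regenerating the file:

cp anaconda-config-files-secret.yaml /path/to/backup/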
sudo kubectl replace -f anaconda-config-files-secret.yaml -n default
Restart the relevant pods:
sudo gravity enter
kubectl get pods | grep 'ap-deploy\|ap-workspace\|ap-ui' | cut -d' ' -f1 | xargs kubectl delete pods
Verifying the proxy works¶
After you’ve configured the platform, you can test your changes to verify that it’s using the proxy.
Log into Anaconda Enterprise.
Click Projects, and open the project you want to use to test the proxy.
Note
If the project already has an open session, you’ll need to stop the current session and start a new session.
Open a terminal window within JupyterLab and run the following command to display the conda configuration:
conda config --show
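The output should include the proxy settings you placed in condarc.secret.txt, along these lines:

proxy_servers:
  http: http://proxy.url.com:<port>
  https: https://proxy.url.com:<port>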
Verify the proxy config information from condarc.secret.txt is being set.
Run the following command to prepare the project:
anaconda-project prepare
Packages should resolve and be pulled from the public Anaconda repositories, via the proxy.
Generating custom Anaconda installers¶
As an Anaconda Enterprise Administrator, you can create custom environments. These environments include specific packages and their dependencies. You can then create a custom installer for the environment that can be shipped to HDFS and used in Spark jobs.
Custom installers enable IT and Hadoop administrators to maintain close control of a Hadoop cluster while also making these tools available to data scientists who need Python and R libraries. They provide an easy way to ship multiple custom Anaconda distributions to multiple Hadoop clusters.
Creating an environment¶
Log in to the console using the Administrator credentials configured after installation.
Select Environments in the left menu.
Click Create in the upper right corner, give the environment a unique name and click Save.
Note
Environment names can contain alphanumeric characters and underscores only.
Check the channel you want to choose packages from, then select the specific packages, and the version of each, that you want to include in the installer.
Click Save in the window banner to create the environment.
Anaconda Enterprise resolves all the package dependencies and displays the environment in the list. If there is an issue resolving the dependencies, you’ll be notified and prompted to edit the environment.
You can now use the environment as a basis for creating additional versions of the environment or other environments.
To edit an existing environment:
Click on an environment name to view details about the packages included in the environment, then click Edit.
Change the channels and/or packages included in the environment, and enter a version number for the updated package before clicking Save. The new version is displayed in the list of environments.
To copy an environment:
Enter a unique name for the environment and click Save. The new environment is displayed in the list of environments.
Now that you’ve created an environment, you can create an installer for it.
Creating a custom installer for an environment¶
Select the environment in the list, click the Create installer icon, and select the type of installer you want to create:
Anaconda Enterprise creates the installer and displays it in the Installers list:
To view the relevant logs, or to download or delete the installer, click the icon and choose the appropriate command.
If you created a management pack, you’ll need to install it on your Hortonworks HDP cluster and add it to your local Ambari server to make it available to users. For more information, see this blog post about generating custom management packs.
If you created a parcel, you’ll need to install it on your Cloudera CDH cluster to make it available to users:
Note
If you are using CDH 5.x, you’ll need to manually download the parcel, move it to the Cloudera Manager node, then configure Cloudera Manager for a local parcel repository. This is because CDH 5.x does not work with TLS 1.2 that Anaconda Enterprise uses to serve the parcel, so you’ll see a protocol version error if you attempt to use AE as a remote parcel repository with CDH 5.x.
If you are using CDH 6.x with parcels, you can configure Anaconda Enterprise as a remote parcel repository, or you can manually download the parcel and configure a local parcel repository.
In the Installers list, click the parcel name to view its details—including the logs generated during the creation process.
Depending on the version of CDH you are using (see NOTE above), either copy the path to the parcel or download the parcel installer.
From the Cloudera Manager Admin Console, click the Parcels indicator in the top navigation bar.
Click the Configuration button on the top right of the Parcels page to display the Parcel Settings.
If you downloaded the parcel from AE in Step 2 above, copy it to the Local Parcel Repository Path you’ve configured for Cloudera Manager.
–or–
To configure AE as a remote parcel repository, add the URL you copied in Step 2 above to the Remote Parcel Repository URLs section and click Save Changes.
If automatic downloading and distribution are not enabled, go to the Parcels page and select Distribute to install the parcel across your CDH cluster. The custom-generated Anaconda parcel is now ready to use with Spark or other distributed frameworks on your Cloudera CDH cluster.
For more information, see these instructions from Cloudera.
Advanced platform settings¶
After installing Anaconda Enterprise, there are default settings that you may want to update with information specific to your installation, including the password for the database and the redirect URLs for the AE platform.
If you’ve installed Livy server, you’ll need to configure it to work with the platform so users can access your Hadoop Spark cluster.
If your organization already uses a repository such as GitHub, Bitbucket, or GitLab for version control, you can configure Anaconda Enterprise to use that repository instead of the internal Git server.
You can also add one or more NFS shares to your organization’s configuration, for platform users to store data and source code that they can access within their sessions and deployments.
You may want to replace the self-signed certificates generated during installation with your organization’s own certificates—or change other default security settings—after initial installation.
Editing platform settings¶
You configure the Anaconda Enterprise platform settings using a configuration file or Config map. The configuration file, anaconda-enterprise-anaconda-platform.yml, contains both global and per-service configuration settings for your Anaconda Enterprise installation.
You can modify the default configuration using the Operations Center UI, or a command shell. Any changes you make will impact how Anaconda Enterprise functions, so we strongly recommend that you save a copy of the original file and familiarize yourself with the configuration options before making any changes.
To modify the platform configuration using the UI:
Log in to Anaconda Enterprise, select the Menu icon
in the top right corner and click the Administrative Console link displayed at the bottom of the slide out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Select Configuration from the menu on the left.
Use the Config map drop-down menu to select the
anaconda-enterprise-anaconda-platform.ymlconfiguration file.
Warning
Unless you want to configure global environment variables, please ignore the other entries in the Config maps and Namespace drop-downs. They impact the underlying Kubernetes system, so making changes to them may have unintended consequences or cause the platform to behave unpredictably.
You’ll notice that it contains GLOBAL CONFIGURATION specifications related to the following:
The Authentication Center client URL
The internal database
Optional NFS server volume mounts
HTTPS certificate settings
Resource profiles
The Kubernetes cluster
Any users, groups or roles with Admin authorization
The git commit file size limit (The default limit is 50MB, though this limit is configurable. We recommend keeping files under 100MB.)
It also contains PER-SERVICE CONFIGURATION settings, related to these services:
The authentication server used to secure access
The deployment server used to deploy apps
The workspace server used to run sessions
The storage server used to store and version projects
The local repository server used for channels and packages
The S3 endpoint and Git server used to store objects and data
The local documentation server URL and platform UI configuration
Edit the specification in the section that corresponds to the setting you want to update, and click Apply to save your changes.
Note
If you navigate away from the Config map without saving your edits, you will be warned that you have unsaved changes. You can abandon your edits by clicking Disregard and continue, or return to editing by clicking Close.
To edit the platform configuration using a command line:
Enter the following commands in an interactive shell on the master node:
sudo gravity enter
kubectl edit cm anaconda-enterprise-anaconda-platform.yml
Make your changes to the file, and save it.
Restart all pods using the following command:
kubectl get pods | grep ap- | cut -d' ' -f1 | xargs kubectl delete pods
Changing the database password¶
You can change the password for the Anaconda Enterprise database as needed, to adhere to your organization’s policies. To do so, you’ll need to connect to the associated pod, make the change, and update the platform with the new password.
Run the following command to determine the id of the postgres pod:
kubectl get pod | grep postgres
Run the following command to connect to the postgres pod, where <id> represents the id of the pod:

kubectl exec -it anaconda-enterprise-postgres-<id> /bin/sh
Run this psql command to connect to the database:

psql -h localhost -U postgres
Set the password by running the following command:
ALTER USER postgres WITH PASSWORD 'new_password';
To update the platform settings with the database password of the host server:
Access the Anaconda Enterprise Operations Center by entering this URL in your browser: https://anaconda.example.com:32009, replacing anaconda.example.com with the FQDN of the host server.
Log in with the default username and password: aeplatform@yourcompany.com / aeplatform. You'll be asked to change the default password when you log in.
Click Configuration in the left menu to display the Anaconda Enterprise Config map.
In the GLOBAL CONFIGURATION section of the configuration file, locate the db section and enter the password you just set; a sketch of what that entry might look like is shown below. Then click Apply to update the platform with your changes.
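A minimal sketch of that entry; the exact keys in your file may differ, so verify against your backup copy of the configuration file:

db:
  password: new_password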
Restart all the service pods using the following command:
kubectl get pods | grep ap- | cut -d' ' -f1 | xargs kubectl delete pods
Changing the platform redirect URLs¶
You’ll use the Anaconda Enterprise Authentication Center to update the redirect URLs for the platform.
Enter the following URL in your browser: https://<server-name.domain.com>/auth/, replacing server-name.domain.com with the fully-qualified domain name of the host server.
Log in with the username and password configured to authorize access to the platform. See Managing System Administrators for instructions on setting these credentials, if you haven't already done so.
Verify that AnacondaPlatform is displayed as the current realm, then select Clients from the Configure menu on the left.
In the Clients list, click anaconda-platform to display the platform settings.
On the Settings tab, update all URLs in the following fields with the FQDN of the Anaconda Enterprise server, or the following symbols:
Note
If you choose to provide the FQDN of your AE server, be sure each field also ends with the symbols shown. For example, the Valid Redirect URIs would look something like this: https://server-name.domain.com/*.
Click Save to update the server with your changes.
Configuring Livy server for Hadoop Spark access¶
After installing Livy server, there are three main aspects you need to configure on the Apache Livy server for Anaconda Enterprise users to be able to access Hadoop Spark within Anaconda Enterprise:
If the Hadoop cluster is configured to use Kerberos authentication, you’ll need to allow Livy to access the services. Additionally, you can configure Livy as a secure endpoint. For more information, see Configuring Livy to use HTTPS below.
Configuring Livy impersonation¶
To enable users to run Spark sessions within Anaconda Enterprise, they need to be able to log in to each machine in the Spark cluster. The easiest way to accomplish this is to configure Livy impersonation as follows:
Add hadoop.proxyuser.livy to your authenticated hosts, users, or groups.
Check the option to Allow Livy to impersonate users and set the value to all (*), or a list of specific users or groups.
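If you manage the Hadoop configuration files directly (rather than through Ambari or Cloudera Manager), these impersonation settings correspond to standard proxyuser entries in core-site.xml. A minimal sketch, allowing the livy user to impersonate anyone from any host:

<!-- core-site.xml -->
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>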
If impersonation is not enabled, the user executing the livy-server (livy) must exist on every machine. You can add this user to each machine by running the following command on each node:
sudo useradd -m livy
Note
If you have any problems configuring Livy, try setting the log level to DEBUG in the conf/log4j.properties file.
Configuring cluster access¶
Livy server enables users to submit jobs from any remote machine or analytics cluster—even where a Spark client is not available—without requiring you to install Jupyter and Anaconda directly on an edge node in the Spark cluster.
To configure Livy server, put the following environment variables into a user’s .bashrc file, or the conf/livy-env.sh file that’s used to configure the Livy server.
These values are accurate for a Cloudera install of Spark with Java version 1.8:
export JAVA_HOME=/usr/java/jdk1.8.0_121-cloudera/jre/
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark/
export SPARK_CONF_DIR=$SPARK_HOME/conf
export HADOOP_HOME=/etc/hadoop/
export HADOOP_CONF_DIR=/etc/hadoop/conf
Note that the port parameter that's defined as livy.server.port in conf/livy.conf is the same port that will generally appear in the Sparkmagic user configuration.
The minimum required parameter is livy.spark.master. Other possible values include the following:
local[*] — for testing purposes
yarn-cluster — for use with the YARN resource allocation system
a full Spark URI like spark://masterhost:7077 — if the Spark scheduler is on a different host
Example with YARN:
livy.spark.master = yarn-cluster
The YARN deployment mode is set to cluster for Livy. The livy.conf file, typically located in $LIVY_HOME/conf/livy.conf, may include settings similar to the following:
livy.server.port = 8998
# What spark master Livy sessions should use: yarn or yarn-cluster
livy.spark.master = yarn
# What spark deploy mode Livy sessions should use: client or cluster
livy.spark.deployMode = cluster
# Kerberos settings
livy.server.auth.type = kerberos
livy.impersonation.enabled = true
# livy.server.launch.kerberos.principal = livy/$HOSTNAME@ANACONDA.COM
# livy.server.launch.kerberos.keytab = /etc/security/livy.keytab
# livy.server.auth.kerberos.principal = HTTP/$HOSTNAME@ANACONDA.COM
# livy.server.auth.kerberos.keytab = /etc/security/httplivy.keytab
# livy.server.access_control.enabled = true
# livy.server.access_control.users = livy,hdfs,zeppelin
# livy.superusers = livy,hdfs,zeppelin
After configuring Livy server, you’ll need to restart it:
./bin/anaconda-livy-server stop
./bin/anaconda-livy-server start
Consider using a process control mechanism to restart Livy server, to ensure that it’s reliably restarted in the event of a failure.
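For example, a minimal systemd unit could supervise the server. The install path, service user, and the assumption that the bare livy-server script runs in the foreground are all guesses to adapt to your installation:

[Unit]
Description=Apache Livy server
After=network.target

[Service]
# Assumed service account and install path; adjust to your environment
User=livy
ExecStart=/opt/livy/bin/livy-server
Restart=on-failure

[Install]
WantedBy=multi-user.target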
Using Livy with Kerberos authentication¶
If the Hadoop cluster is configured to use Kerberos authentication, you’ll need to do the following to allow Livy to access the services:
Generate 2 keytabs for Apache Livy using kadmin.local.
IMPORTANT: The keytab principals for Livy must match the hostname that the Livy server is deployed on, or you’ll see the following exception: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos credentials).
These are hostname and domain dependent, so edit the following example according to your Kerberos settings:
$ sudo kadmin.local
kadmin.local: addprinc livy/ip-172-31-3-131.ec2.internal
WARNING: no policy specified for livy/ip-172-31-3-131.ec2.internal@ANACONDA.COM; defaulting to no policy
Enter password for principal "livy/ip-172-31-3-131.ec2.internal@ANACONDA.COM":
Re-enter password for principal "livy/ip-172-31-3-131.ec2.internal@ANACONDA.COM":
kadmin.local: xst -k livy-ip-172-31-3-131.ec2.internal.keytab livy/ip-172-31-3-131.ec2.internal@ANACONDA.COM
...
kadmin.local: addprinc HTTP/ip-172-31-3-131.ec2.internal
WARNING: no policy specified for HTTP/ip-172-31-3-131.ec2.internal@ANACONDA.COM; defaulting to no policy
Enter password for principal "HTTP/ip-172-31-3-131.ec2.internal@ANACONDA.COM":
Re-enter password for principal "HTTP/ip-172-31-3-131.ec2.internal@ANACONDA.COM":
kadmin.local: xst -k HTTP-ip-172-31-3-131.ec2.internal.keytab HTTP/ip-172-31-3-131.ec2.internal@ANACONDA.COM
...
This will generate two files: livy-ip-172-31-3-131.ec2.internal.keytab and HTTP-ip-172-31-3-131.ec2.internal.keytab.
Change the permissions of these two files so they can be read by livy-server, as in the example below.
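For example, assuming the Livy service runs under a local livy account (substitute your actual service user and keytab file names):

sudo chown livy:livy livy-ip-172-31-3-131.ec2.internal.keytab HTTP-ip-172-31-3-131.ec2.internal.keytab
sudo chmod 400 livy-ip-172-31-3-131.ec2.internal.keytab HTTP-ip-172-31-3-131.ec2.internal.keytab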
Enable Kerberos authentication and reference these two keytab files in the conf/livy.conf configuration file, as shown:

livy.server.auth.type = kerberos
livy.impersonation.enabled = false  # see notes below
# principals and keytabs to exactly match those generated before
livy.server.launch.kerberos.principal = livy/ip-172-31-3-131@ANACONDA.COM
livy.server.launch.kerberos.keytab = /home/centos/conf/livy-ip-172-31-3-131.keytab
livy.server.auth.kerberos.principal = HTTP/ip-172-31-3-131@ANACONDA.COM
livy.server.auth.kerberos.keytab = /home/centos/conf/HTTP-ip-172-31-3-131.keytab
# this may not be required when delegating auth to kerberos
livy.server.access-control.enabled = true
livy.server.access-control.allowed-users = livy,zeppelin,testuser
livy.superusers = livy,zeppelin,testuser
NOTES:
The hostname and domain are not the same—verify that they match your Kerberos configuration.
livy.server.access-control.enabled = true is only required if you're going to also whitelist the allowed users with the livy.server.access-control.allowed-users key.
Configuring project access¶
After you’ve installed Livy and configured cluster access, some additional configuration is required before Anaconda Enterprise users will be able to connect to a remote Hadoop Spark cluster from within their projects. For more information, see Connecting to the Hadoop Spark ecosystem.
If the Hadoop installation used Kerberos authentication, add the krb5.conf to the global configuration using the following command:

anaconda-enterprise-cli spark-config --config /etc/krb5.conf krb5.conf
To use Sparkmagic, pass two flags to the previous command to configure a Sparkmagic configuration file:
anaconda-enterprise-cli spark-config --config /etc/krb5.conf krb5.conf --config /opt/continuum/.sparkmagic/config.json config.json
This creates a yaml file—anaconda-config-files-secret.yaml—with the data converted for Anaconda Enterprise.
Use the following command to upload the yaml file to the server:
sudo kubectl replace -f anaconda-config-files-secret.yaml
To update the Anaconda Enterprise server with your changes, run the following command to identify the pod associated with the workspace services:
kubectl get pods
Restart the workspace services by running:
kubectl delete pod anaconda-enterprise-ap-workspace-<unique ID>
Now, whenever a new project is created, /etc/krb5.conf will be populated with the appropriate data.
Configuring Livy to use HTTPS¶
If you want to use Sparkmagic to communicate with Livy via HTTPS, you need to do the following to configure Livy as a secure endpoint:
Generate a keystore file, certificate, and truststore file for the Livy server—or use a third-party SSL certificate.
Update Livy with the keystore details.
Update your Sparkmagic configuration.
Restart the Livy server.
If you’re using a self-signed certificate:
Generate a keystore file for Livy server using the following command:
keytool -genkey -alias <host> -keyalg RSA -keysize 1024 -dname CN=<host>,OU=hw,O=hw,L=paloalto,ST=ca,C=us -keypass <keyPassword> -keystore <keystore_file> -storepass <storePassword>
Create a certificate:
keytool -export -alias <host> -keystore <keystore_file> -rfc -file <cert_file> -storepass <StorePassword>
Create a truststore file:
keytool -import -noprompt -alias <host> -file <cert_file> -keystore <truststore_file> -storepass <truststorePassword>
Update livy.conf with the keystore details. For example:

livy.keystore = /home/centos/livy-0.5.0-incubating-bin/keystore.jks
livy.keystore.password = anaconda
livy.key-password = anaconda
Update ~/.sparkmagic/config.json. For example:

"kernel_python_credentials" : {
  "username": "",
  "password": "",
  "url": "https://35.172.121.109:8998",
  "auth": "None"
},
"ignore_ssl_errors": true,
Note
In this example, ignore_ssl_errors is set to true because this configuration uses self-signed certificates. Your production cluster setup may be different.
Warning
If you misconfigure a .json file, all Sparkmagic kernels will fail to launch. You can test your Sparkmagic configuration by running the following Python command in an interactive shell: python -m json.tool config.json.
If you have formatted the JSON correctly, this command will run without error. Additional edits may be required, depending on your Livy settings.
Restart the Livy server.
The Livy server should now be accessible over https. For example, https://<livy host>:<livy port>.
To test your SSL-enabled Livy server, run the following Python code in an interactive shell to create a session (this assumes the requests and requests-kerberos packages are installed):

import json
import requests
from requests_kerberos import HTTPKerberosAuth, REQUIRED

livy_url = "https://<livy host>:<livy port>/sessions"
data = {'kind': 'spark', 'numExecutors': 1}
headers = {'Content-Type': 'application/json'}
r = requests.post(livy_url, data=json.dumps(data), headers=headers, auth=HTTPKerberosAuth(mutual_authentication=REQUIRED, sanitize_mutual_error_response=False), verify=False)
r.json()
Run the following Python code to verify the status of the session:
session_url = "https://<livy host>:<livy port>/sessions/0"
headers = {'Content-Type': 'application/json'}
r = requests.get(session_url, headers=headers, auth=HTTPKerberosAuth(mutual_authentication=REQUIRED, sanitize_mutual_error_response=False), verify=False)
r.json()
Then submit the following statement (statements are submitted via POST, with the code payload in the request body):

statements_url = "https://<livy host>:<livy port>/sessions/0/statements"
data = {"code": "sc.parallelize(1 to 10).count()"}
headers = {'Content-Type': 'application/json'}
r = requests.post(statements_url, data=json.dumps(data), headers=headers, auth=HTTPKerberosAuth(mutual_authentication=REQUIRED, sanitize_mutual_error_response=False), verify=False)
r.json()
If you’re using a third-party certificate:
Note
Ensure that Java JDK is installed on the Livy server.
Create the keystore.p12 file using the following command:

openssl pkcs12 -export -in [path to certificate] -inkey [path to private key] -certfile [path to certificate] -out keystore.p12
Use the following command to create the keystore.jks file:

keytool -importkeystore -srckeystore keystore.p12 -srcstoretype pkcs12 -destkeystore keystore.jks -deststoretype JKS
If you don't already have the rootca.crt, you can run the following command to extract it from your Anaconda Enterprise installation:

kubectl get secrets anaconda-enterprise-certs -o jsonpath="{.data['rootca\.crt']}" | base64 -d > /ext/share/rootca.crt
Add the rootca.crt to the keystore.jks file:

keytool -importcert -keystore keystore.jks -storepass <password> -alias rootCA -file rootca.crt
Add the keystore.jks file to the livy.conf file. For example:

livy.keystore = /home/centos/livy-0.5.0-incubating-bin/keystore.jks
livy.keystore.password = anaconda
livy.key-password = anaconda
Restart the Livy server.
Run the following command to verify that you can connect to the Livy server (using your actual host and port):
openssl s_client -connect anaconda.example.com:8998 -CAfile rootca.crt
If running this command returns a Verify return code of 0, you've successfully configured Livy to use HTTPS.
To add the trusted root certificate to the AE server, do the following:
Install the ca-certificates package:

yum install ca-certificates
Enable dynamic CA configuration:
update-ca-trust force-enable
Add your rootca.crt as a new file:

cp rootca.crt /etc/pki/ca-trust/source/anchors
Update the certificate authority trust:
update-ca-trust extract
To connect to Livy within a session, open the project and run the following command in an interactive shell:
import os
os.environ['REQUESTS_CA_BUNDLE'] = '/path/to/root.ca'
You can also edit the anaconda-project.yml file for the project and set the environment variable there. See Hadoop / Spark for more information.
Connecting to an external version control repository¶
If your organization already uses a shared repository for version control, you can configure Anaconda Enterprise to use that repository instead of the internal Git server. To associate an external repository with Anaconda Enterprise, you simply need to provide the information required to connect to it.
After you do so, platform users will be able to access the repository within their sessions and deployments without having to leave the platform. Anaconda Enterprise creates a repository for each project that’s created by platform users.
Anaconda Enterprise supports integration with the following external repositories:
| External repository | Supported versions |
|---|---|
| Bitbucket Enterprise | 5.9.1, 5.12.1, and 6.2.0 |
| Bitbucket Cloud | bitbucket.org |
| GitHub Enterprise | 2.15, 2.16, and 2.17 |
| GitHub Cloud | github.com |
| GitLab Enterprise | 10.4.2, 10.7.1, 11.10.0, and 12.1.6 |
| GitLab Cloud | gitlab.com |
Warning
If you are going to use an external repository for version control, we strongly recommend you set it up before users start creating projects in Anaconda Enterprise. If your organization changes Git hosting services, and you therefore need to migrate projects from one version control repository to another, we recommend you follow the process outlined here.
NOTES:
Neither Bitbucket Cloud nor GitLab.com supports versioning of archive downloads and app deployments. In other words, the latest revision will always be downloaded or deployed.
To provide permission granularity and maintain parity with your Git hosting solution, Anaconda Enterprise will grant individual platform users access to individual repositories. To prevent default permissions being applied to all users within a group, users cannot belong to the given organization or group.
Platform users will be prompted for their personal access token before they create their first project in Anaconda Enterprise. We recommend you advise users to create a non-expiring token, so they retain permanent access to their files from within Anaconda Enterprise. The specific auth token permissions required for each repository are outlined here.
Before you begin, gather the following information:
The fully qualified domain name (FQDN) of your version control server
The organization, team or group name associated with your service account
The username of the Administrator for the organization, team or group. This user will require full Admin permissions.
The personal access token or password required to connect to your version control repository
To associate a specific version control repository with a project:
Log in to Anaconda Enterprise, select the Menu icon
in the top right corner and click the Administrative Console link displayed at the bottom of the slide out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Select Configuration from the menu on the left.
Use the Config map drop-down menu to select the
anaconda-enterprise-anaconda-platform.ymlconfiguration file.
Warning
Please ignore the other entries in the Config maps and Namespace drop-downs. They impact the underlying Kubernetes system, so making changes to them may have unintended consequences or cause the platform to behave unpredictably.
Locate the git section of the configuration file. The default behavior is to use the internal Anaconda Enterprise repository for version control.
To override this default setting, uncomment the Example external repo configuration section of the Config map, and replace the placeholder settings with the correct values for your organization's repository:
Where:
name = A descriptive name for the service your organization uses.
type = The type of version control repository your organization uses: github-v3-api (GitHub Enterprise and Cloud), bitbucket-v1-api (Bitbucket Server), bitbucket-v2-api (Bitbucket Cloud), or gitlab-v4-api (GitLab Cloud and GitLab server).
NOTE: The values for this parameter have changed from AE 5.3.0.
url = The URL of the API (e.g., https://api.github.com/, https://api.bitbucket.org, or https://gitlab.com).
credential-url = The URL to authenticate against for repository operations such as cloning and pushing.
NOTE: This parameter replaces the credential-hostname parameter used in AE 5.3.0.
repository = Must be '{owner}-{id}' encased in single quotes.
organization = The name of your Github organization, Bitbucket team, or GitLab group. (Bitbucket does not support dashes in team names.)
username = The username associated with the Administrator account at Github, Bitbucket, or GitLab. This account must have full Admin permissions.
auth-token = The Github personal access token, Bitbucket app password, or GitLab access token for the Administrator account associated with the username. (You must enable 2FA to get personal access tokens in GitLab.)
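Assembled, the uncommented section might look roughly like the following sketch. The values are hypothetical, and the exact key layout in your Config map may differ, so treat this as an illustration of the parameters above rather than a template to paste:

# Example external repo configuration (hypothetical values)
name: github
type: github-v3-api
url: https://api.github.com/
credential-url: https://github.com
repository: '{owner}-{id}'
organization: example-org
username: example-admin
auth-token: <personal-access-token>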
Comment out the Internal repo configuration section of the Config map that follows, as it relates to the Anaconda Enterprise internal Git server settings that you are overriding.
Click Apply to save your changes to the Config map.
To update the Anaconda Enterprise server with your changes, restart services by running these commands on the master node:
sudo gravity enter
kubectl get pods | grep ap- | cut -d' ' -f1 | xargs kubectl delete pods
To verify that Anaconda Enterprise users can access the version control repository you added, create a project. See Working with projects for more information.
Migrating projects between version control repositories¶
If your organization has changed Git hosting services, and you therefore need to migrate projects from one supported version control repository to another, we recommend you follow this high-level process:
Prerequisites:
Update the Anaconda Enterprise config map with the information required to connect to the external version control repository.
To run the project migration script, you’ll need Administrator access to a command line tool that can run bash or Python scripts on the master node of the Anaconda Enterprise cluster.
You’ll also need the Postgres database password, origin Git host token/password, and destination Git host token/password.
Pre-migration setup¶
If you haven’t already done so, install the version of conda provided with the Anaconda Enterprise installer on the master node:
bash anaconda-enterprise-5.3.1-56.gf54c3abad/installer/conda-bootstrap-4.5.12
After conda is finished installing, log in to the terminal again.
Install git, using the command that’s appropriate for your environment:
On RHEL/CentOS:
yum install gitOn Ubuntu/Debian:
apt install gitUse the following command to create the conda environment:
conda create --name migrate --file anaconda-enterprise-5.3.1-56.gf54c3abad/environment.txt
Use the following command to activate the conda environment:
conda activate migrate
Temporarily disable reverse proxy authentication by adding the following key-value pair to the git section (outside of the storage section in the config map) of the anaconda-enterprise-anaconda-platform.yml file used to configure the platform to use an external version control repository:

reverse-proxy-auth: false
This should look similar to the following:
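A sketch, assuming your git section already contains the external repo settings you configured earlier:

git:
  # ...existing external repo settings...
  reverse-proxy-auth: false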
Run the following command to restart the associated pod on the master node:
kubectl delete pod -l 'app=ap-git-storage'
Create a user mappings file that maps Anaconda Enterprise user IDs to Git user IDs. This is a colon-separated text file where the first field is the AE user name, and the second field is the corresponding Git user name. For example:
ae-admin:git-admin
ae-user1:git-user1
ae-user2:git-user2
Using the migration tool¶
Note
If you’ve migrated to https://github.com, whenever a user is added to a project as a collaborator, they’ll be sent an invitation to collaborate via email. They’ll need to accept this invitation to be able to commit changes to the repository associated with the project. This does not apply to Github Enterprise.
The migration tool is a Python script, migrate_projects.py, found in the AE5 installation tarball. It can be used in the following ways:
usage: migrate_projects.py [-h] [--parallel PARALLEL] [--log-file LOG_FILE]
[--force-migrate] [--scratch-dir SCRATCH_DIR]
--postgres-host POSTGRES_HOST
[--postgres-user POSTGRES_USER]
[--postgres-passwd POSTGRES_PASSWD]
[--origin-api-type {internal,bitbucket-v1-api,bitbucket-v2-api,github-v3-api,gitlab-v4-api}]
--origin-api-url ORIGIN_API_URL
[--origin-username ORIGIN_USERNAME]
[--origin-token ORIGIN_TOKEN]
[--origin-organization ORIGIN_ORGANIZATION]
[--dest-api-type {internal,bitbucket-v1-api,bitbucket-v2-api,github-v3-api,gitlab-v4-api}]
--dest-api-url DEST_API_URL
[--dest-username DEST_USERNAME]
[--dest-token DEST_TOKEN]
[--dest-organization DEST_ORGANIZATION]
--dest-user-mappings DEST_USER_MAPPINGS
optional arguments:
-h, --help show this help message and exit
--parallel PARALLEL Number of parallel migration jobs to spawn
--log-file LOG_FILE Path prefix to log directory, suffixed with a
timestamp, e.g. migrate-projects-
log-1559234750640867208
--force-migrate Forces migration by replacing local and destination
repositories
--scratch-dir SCRATCH_DIR
The scratch directory for cloning project repositories
--postgres-host POSTGRES_HOST
Hostname of AE5 Postgres DB
--postgres-user POSTGRES_USER
Username of AE5 postgres DB
--postgres-passwd POSTGRES_PASSWD
Password of AE5 postgres DB
--origin-api-type {internal,bitbucket-v1-api,bitbucket-v2-api,github-v3-api,gitlab-v4-api}
Origin git host API type
--origin-api-url ORIGIN_API_URL
Origin git host API URL
--origin-username ORIGIN_USERNAME
Origin git host username
--origin-token ORIGIN_TOKEN
Origin git host auth token
--origin-organization ORIGIN_ORGANIZATION
Origin git host organization
--dest-api-type {internal,bitbucket-v1-api,bitbucket-v2-api,github-v3-api,gitlab-v4-api}
Destination git host API type
--dest-api-url DEST_API_URL
Destination git host API URL
--dest-username DEST_USERNAME
Destination git host username
--dest-token DEST_TOKEN
Destination git host auth token
--dest-organization DEST_ORGANIZATION
Destination git host organization
--dest-user-mappings DEST_USER_MAPPINGS
Colon-separated AE-to-git-host mappings file, e.g. ae-
user1:github-user1
For example, the tool can be used in the following way:
python migrate_projects.py --postgres-host localhost --origin-api-url https://localhost:8443/ --origin-username root --dest-api-type gitlab-v4-api --dest-api-url https://mbrock-gitlab.anacondaenterprise.com/ --dest-username root --dest-organization demo --dest-user-mappings user-mappings-gitea-to-gitlab.txt --force-migrate --parallel 4
To ensure tokens are not visible in bash history, you can omit them from the command line and enter them via stdin when running the script.
Post-migration cleanup¶
After the script finishes migrating the projects, re-enable reverse proxy authentication by editing the key-value pair you previously added to the git section of the anaconda-enterprise-anaconda-platform.yml file, so it looks like the following:
reverse-proxy-auth: true
Warning
If you do not re-enable reverse proxy authentication, Anaconda Enterprise will not work.
To verify that the new repository is being used by Anaconda Enterprise, edit an existing project and commit your changes to it.
Disabling sudo for yum¶
By default, sudo access for yum is enabled on the Anaconda Enterprise platform. You can easily disable it, however, if your organization requires it.
Log in to Anaconda Enterprise, select the Menu icon
in the top right corner and click the Administrative Console link displayed at the bottom of the slide out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Select Configuration from the menu on the left.
Verify that the anaconda-enterprise-anaconda-platform.yml configuration file is selected in the Config map drop-down menu.
Note
We recommend that you make a backup copy of this file since you will be editing it directly.
Scroll down to the sudo-yum section of the Config map.
Change the setting from default to disable:

sudo-yum: disable
Click Apply to save your changes.
To update the Anaconda Enterprise server with your changes, restart services by running these commands on the master node:
sudo gravity enter
kubectl get pods | grep ap- | cut -d' ' -f1 | xargs kubectl delete pods
To re-enable sudo yum, simply change this Config map setting back to default, save your changes, and restart services.
Specifying alternate wildcard domains¶
By default, Anaconda Enterprise expects the wildcard domain to be the same for the primary platform server and the application domain.
If your particular implementation uses different domains, you’ll need to update the configuration file for the platform with the fully qualified domain name (FQDN) for each server.
Note
Make sure the wildcard domain has a TLS cert and DNS entry that meets these requirements before you follow the process below to specify it as an apps-host or workspace-host.
Log in to Anaconda Enterprise, select the Menu icon
in the top right corner and click the Administrative Console link displayed at the bottom of the slide out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Select Configuration from the menu on the left.
Verify that the anaconda-enterprise-anaconda-platform.yml configuration file is selected in the Config map drop-down menu.
Note
Any changes you make will impact how Anaconda Enterprise functions, so we strongly recommend that you save a copy of the original file before making any changes.
Scroll down to the Deployment server configuration section of the Config map.
Search for and update the apps-host setting with the FQDN of the host server you'll be deploying apps to, if it's different than the default Kubernetes server.
Scroll down to the Workspace server configuration section of the Config map.
Update the workspace-host setting with the FQDN of the host server you'll be using as a workspace server, if it's different than the default Kubernetes server.
Click Apply to save your changes.
To update the Anaconda Enterprise server with your changes, restart services by running these commands on the master node:
sudo gravity enter
kubectl get pods | grep ap- | cut -d' ' -f1 | xargs kubectl delete pods
Setting global config variables¶
Anaconda Enterprise provides a secondary config map (named anaconda-enterprise-env-var-config) that you can use to configure the platform. Any environment variables that you add to this config map will be available to sessions, deployments and schedules. This is a convenient alternative to using the Anaconda Enterprise CLI, as you can add any variable supported by conda configuration.
Log in to Anaconda Enterprise, select the Menu icon
in the top right corner and click the Administrative Console link displayed at the bottom of the slide out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
Select Configuration from the menu on the left.
Use the Config map drop-down menu to select the anaconda-enterprise-env-var-config.yml configuration file. The default config map contains a placeholder only: ENV_VAR_PLACEHOLDER: foo.
To add an environment variable, replace this placeholder with an actual entry. For example, to configure Anaconda Enterprise to use a proxy for conda packages, you might add entries that resemble the following:

HTTP_PROXY=proxy.url.com:3128
NO_PROXY=anaconda-test.url.com
Click Apply to save your changes.
To update Anaconda Enterprise with your changes, restart services by running these commands on the master node:
sudo gravity enter
kubectl get pods | grep ap- | cut -d' ' -f1 | xargs kubectl delete pods
Using Anaconda Enterprise¶
Working with projects¶
Anaconda Enterprise makes it easy for you to create and share interactive data visualizations, live notebooks, or machine learning models built using popular languages and libraries such as Python, R, Bokeh, and Shiny.
AE uses projects to encapsulate all of the components necessary to use or run an application: the relevant packages, channels, scripts, notebooks and other related files, environment variables, services and commands, along with a configuration file named anaconda-project.yml. For more information, see Developing a project.
Project components are all compressed into a .tar.bz2, .tar.gz or .zip file to make the project portable, so it's easier to store and share with others.
To get you started, Anaconda Enterprise provides several sample projects, including the following:
Anaconda Distribution for Python 2.7, 3.5 and 3.6
Minimal Python templates for versions 2.7, 3.5, 3.6, and 3.7
R notebooks & R Shiny apps
Matplotlib and HvPlots written in Jupyter Notebooks
Panel and HoloViz tutorials
Dashboards for Gapminder data set, oil and gas exploration, NYC taxi data, and attractor equations
TensorFlow apps for Flask, Tornado and MNIST trained data
Tutorial on the Intake data catalog package
Tutorials for database access and time series modeling
You can access them by clicking Sample Projects from the Projects view. To use a sample project as a starting point, you can copy it to your project list.
To work with a project, click on it or select View details from its menu in the list view. Then use the menu on the left as follows:
Click Session to open the project in the default editor. This is Jupyter Notebook, unless you’ve specified a different editor under Settings.
Click Deployments to view deployments initiated from this project.
Click Schedules to view and schedule deployments of the project.
Click Runs to view a list of all project deployments that have run based on a schedule.
Click Share to share the project with selected collaborators.
Click Audit Trail to view a list of all actions performed on the project.
Click Settings to change the project name or the default editor (Jupyter Notebook) for the project. For example, if you prefer to work with Apache Zeppelin or JupyterLab, choose it as your default editor.
You can also select a resource profile that meets or exceeds your requirements for the project, or delete the project. With admin configurations, your projects (sessions/deployments) can now run separately from the master node.
Warning
Deleting a project is irreversible, and therefore can only be done if it is not shared.
To make changes to the project files, click Open session.
Note
If the system gets overloaded and there are issues copying, opening, or saving changes to a project, the platform will visually notify you by displaying it in red—in addition to generating a text notification. We recommend you check the notifications in the Audit Trail for additional information about the error, or delete the project and try again.
To work with the contents offline, you can download the compressed file, and then upload it later to continue working with it within AE.
You can also create new—or upload existing—projects to add them to the server.
To update the project repository with your changes, you commit your changes to the project.
Note
To maintain performance, there is a 1GB file size limit for project files you upload. Anaconda Enterprise projects are versioned using Git, so we recommend you commit only text-based files relevant to a project, and keep them under 100MB. Binary files are difficult for version control systems to manage, so we recommend using storage solutions designed for that type of data, and connecting to those data sources from within your Anaconda Enterprise sessions.
If your organization would prefer to use its own supported external version control repository, your Administrator can configure Anaconda Enterprise to use that repository instead of the internal Git server. After they do so, you will be prompted for your personal access token before you create your first project in Anaconda Enterprise. We recommend you create a non-expiring token, so you can retain permanent access to your files from within Anaconda Enterprise. See Configuring your user settings for the permissions that must be set for your auth token, and the steps to configure connectivity to your version control repository.
Editing a project¶
After you have created your project so that it appears in your Projects list, you can open a project session to make changes to the project.
To edit a project:
Make your changes to the project, and save them locally. A badge is displayed on the Commit Changes icon
to indicate that you’ve made changes that haven’t been committed to the server.
When you're ready to update the repository with your changes, click the icon to commit your changes. If the project is shared, others will then be able to access your changes. See Collaborating on projects for important things to consider when working with others on shared projects.
When you're done working with the project, click the Stop session icon. The session is listed in the Audit Trail for the project.
You can also leave a project session open, and click the Return to session icon when you're ready to resume work.
Tip
See Developing your project to learn how to manage the dependencies for your project, so you can run it and deploy it.
Developing a project¶
To enable Anaconda Enterprise to manage the dependencies for your project—so you can run it and deploy it—you need to configure the following settings for each project you create or upload:
Include all the packages used by the project (e.g., conda, pip, system).
Specify environment variables to use in editor sessions and deployments.
All dependencies are tracked in a project’s anaconda-project.yml file.
While there are various ways to modify this file—using the user interface or
a command-line interface—any changes to a project’s configuration will
persist for future project sessions and deployments, regardless of the method
you use.
Note
This is different from using conda install to add a package to the conda environment during a session, as that method impacts the project temporarily, during the current session only.
Jupyter Notebook supports anaconda-project commands only.
You’ll need to run these commands in a terminal.
To open a terminal window within a Jupyter Notebook editor session:
If you prefer to use the UI to configure your project settings, you’ll need to change the default editor from Jupyter Notebook to JupyterLab. Do this in the project’s Settings and restart the editor session.
Adding packages to a project¶
Anaconda Enterprise offers several ways to add packages to a project, so you can choose the method you prefer:
In a JupyterLab editing session, click the Project tab on the far left and click the Edit pencil icon in the PACKAGES field. Add your packages and click Save.
–or–
In a terminal, run anaconda-project add-packages followed by the package names and, optionally, the versions.
EXAMPLE: anaconda-project add-packages hvplot pandas=0.25
The command may take a moment to run as it collects the dependencies and downloads the packages. The packages will be visible in the project’s anaconda-project.yml file. If this file is already open, close it and reopen it to see your changes.
To install packages from a specific channel:
EXAMPLE: anaconda-project add-packages -c conda-forge tranquilizer
Warning
anaconda-project commands must be run from the lab_launch environment.
This is the default environment when using the Jupyter Notebook terminal. For
JupyterLab it will be the first terminal on left. If your terminal prompt is not
(lab_launch) you can activate it with the command conda activate lab_launch.
Note
The default channel_alias for conda in Anaconda Enterprise is configured to point to the internal package repository, which means that short channel names will refer to channels in the internal package repository.
To use packages from an external or online package repository, you will need to specify the full channel URL such as anaconda-project add-packages bokeh -c https://conda.anaconda.org/pyviz in a command or in anaconda-project.yml. The channel_alias can be customized by an administrator, which affects all sessions and deployments.
If you are working in an air-gapped environment (without internet access), your Administrator will need to mirror the packages into your organization’s internal package repository for you to be able to access them.
To install pip packages:¶
List the packages in the pip: section of anaconda-project.yml. For example:
packages:
  - six>=1.4.0
  - gunicorn==19.1.0
  - pip:
    - python-mimeparse
    - falcon==1.0.0
After editing the anaconda-project.yml file to include the pip packages you want to install, run the anaconda-project prepare command to install the packages.
To install system packages:¶
In a terminal, run sudo yum install followed by the package name.
EXAMPLE: sudo yum install sqlite
Note
Any system packages you install from the command line are available during the current session only. If you want them to persist, add them to the project’s anaconda-project.yml file. The system package must be available in an Anaconda Enterprise channel for it to be installed correctly via the anaconda-project.yml file.
Custom project environment¶
Note
Each project only supports the use of a single environment.
For the standard template projects, the conda environments have been pre-built as a bootstrap to reduce initialization time when additional packages are added as described above. However, you may wish to create a custom environment specification.
You may use either of these methods to specify the environment for a project:
In a JupyterLab editing session, click the Project tab on the far left and click the plus sign to the right of the ENVIRONMENTS field. Choose whether you want to Prepare all environments or Add environments.
Select an environment and then select Run, Check or Edit. Running an environment opens a terminal window with that environment active.
When creating an environment, you may choose to inherit from an existing environment, and choose the environment’s supported platforms, its channels, and its packages.
–or–
You can use the terminal and command line. For example, to create an environment called new_env with notebook, pandas, and panel:

anaconda-project add-env-spec --name new_env
anaconda-project add-packages --env-spec new_env notebook pandas=0.25 panel=0.6
Remove the original environment that corresponds to the template you chose when you initially created the project. For example, to remove the Python 3.6 environment:
anaconda-project remove-env-spec anaconda50_py36
Warning
For your changes to take effect, you must commit all changes to the project, then stop and re-start the project.
Note
You must include the notebook package in the environment to edit and run notebooks in either the Jupyter Notebook or JupyterLab editors.
Tip
Using the anaconda-project command ensures that the environment will prepare correctly when the session is restarted. For more information about anaconda-project commands type anaconda-project --help.
To verify whether an environment has been initialized for a Notebook session:
Within the Notebook session, open a terminal window.
Run the following command to list the contents of the parent directory:
ls /opt/continuum
If the environment is being initialized, you’ll see a file named preparing. When the environment has finished initializing, it will be replaced by a file named prepare.log.
Tip
If you need to troubleshoot session startup, you can use a terminal to view the session startup logs. When session startup begins, the output of the anaconda-project prepare command is written to /opt/continuum/preparing, and when the command completes, the log is moved to /opt/continuum/prepare.log.
Adding deployment commands to a project¶
You can use Anaconda Enterprise to deploy projects containing notebooks, Bokeh applications, and generic scripts or web frameworks. Before you can deploy a project, it needs to have an appropriate deployment command associated with it.
Each of the following methods can be used to add a deployment command in the project’s config file anaconda-project.yml:
In a JupyterLab editing session, click the Project tab on the far left and click the plus sign to the right of the COMMANDS field. Add information about the command and click Save.
Note
This method is available within the JupyterLab editor only, so you’ll need to set that as your default editor—in the project’s Settings—and restart the project session to see this option in the user interface. The two methods described below do not show notifications in the user interface.
–or–
Use the command line interface:
EXAMPLE: anaconda-project add-command --type notebook default data-science-notebook.ipynb
The following are example deployment commands you can use:
For a Notebook:
commands:
default:
notebook: your-notebook.ipynb
For a project with a Bokeh (version 0.12) app defined in a main.py file:
commands:
default:
bokeh_app: .
supports_http_options: True
For a Panel dashboard (panel must be installed in your project):
commands:
default:
unix: panel serve script-or-notebook-file
supports_http_options: True
For a generic script or web framework, including Python or R:
commands:
default:
unix: bash run.sh
supports_http_options: true
commands:
default:
unix: python your-script.py
supports_http_options: true
commands:
default:
unix: Rscript your-script.R
supports_http_options: true
Note
For deployment commands that can handle the HTTP options passed by anaconda-project (like Panel), supports_http_options: True must be added to the command.
To validate your anaconda-project.yml and verify your project will deploy successfully:
Within the Notebook session, open a terminal window.
Run the following command, replacing anaconda44_py35 with the name of your environment, if it’s different:
anaconda-project prepare --env-spec anaconda44_py35
If the environment includes everything needed to deploy the project, you’ll see a message confirming success.
Otherwise, any errors preventing a successful deployment will be identified.
If you want to test the deployment immediately after preparing the environment, run the following command instead:
anaconda-project run <command-name>
If there are any errors preventing a successful deployment, they will be displayed in the terminal.
Environment variables¶
You can add environment variables that will be set when you run notebooks in an editor session and at the start of a deployment command.
In a JupyterLab editing session, click the Project tab on the far left and click the + button next to VARIABLES. Provide the name, description and default value of each variable you require.
–or–
You can use the terminal and command line. For example, to add an environment variable that sets MY_VAR to hello:
anaconda-project add-variable --default hello MY_VAR
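Once added, the variable is set in the environment of your sessions and deployments, so your code can read it directly. A minimal sketch (MY_VAR is the variable added above):
import os

# The default from anaconda-project.yml is used unless a different
# value was supplied when the session or deployment started.
my_var = os.environ.get('MY_VAR', 'hello')
print(my_var)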
Saving and committing changes in a project¶
Saving changes to files within an editor or project is different from committing those changes to the Anaconda Enterprise server. For example, when you select File > Save within an editor, you save your changes to your local copy of the file to preserve the work you’ve done.
Warning
File names containing unicode characters—special characters, punctuation, symbols—can’t be committed to the server, so avoid them when naming files.
When you’re ready to update the server with your changes, you commit your changes to the project. This also allows others to access your changes, if the project is shared. See collaborating on projects for important things to consider when working with others on shared projects.
Note
If the size of your stored git files totals more than 1GB, your system may become bogged down and inoperative. As such, we recommend keeping file sizes under 50MB. If a file of greater size is required for your work, please contact your administrator.
Binary files are difficult for version control systems to manage, so we recommend using storage solutions designed for that type of data, and connecting to those data sources from within your Anaconda Enterprise sessions.
To commit your changes:
Select the files you have modified and want to commit. If a file that you changed isn’t displayed in this list, make sure you saved it locally.
Note
Editors create temporary files that may be displayed in the file list. For example, Jupyter Notebook and JupyterLab both create a hidden folder named .ipynb_checkpoints for each notebook project you create. This folder is hidden because the editor uses it internally, to capture the state of your .ipynb file between auto-save operations. We recommend you add this and any other hidden folders to your .gitignore file, so they are excluded from the list of project files that are checked into version control.
Enter a message that briefly describes the changes you made to the files or project. This information is useful for differentiating your commit from others.
Enter a meaningful label or version number for your project commit in the Tag field. You can use tags to create multiple versions of a single project so that you—or collaborators—can easily deploy a specific version of the project. See deploying a project for more information.
Click Commit.
Collaborating on projects¶
Effective collaboration is key to the success of any enterprise-level project, so it’s essential to understand how to work well with others in shared projects.
To give others access to a project that you’ve created, you can add them as a collaborator.
When you add a user (or group of users) as collaborators on a project, it means that they have permission to edit the project files and commit changes to the master copy on the server while you may be actively working on a local copy. The only project setting they’ll be able to change is the default editor—all other project settings will be disabled for editing.
Note
Anaconda Enterprise creates a repository for each project that you create, and will authorize only those users who have been explicitly added as project collaborators to update the version control repository configured for your organization with their changes to the project.
Anaconda Enterprise tracks all changes to a project and lets you know when files have been updated, so you can choose which version to use.
Sharing a project¶
You can share a project with specific users or groups of users:
Click Projects to view all of your projects.
Click the project you want to share and select Share in the left menu.
Start typing the name of the user or group in the Add New Collaborator drop-down to search for matches. Select the one that corresponds to what you want and click Add.
To unshare—or remove access to—a project, click the large X next to the collaborator you want to remove and click Remove to confirm your selection.
Note
If you remove a collaborator from a project while they have a session open for that project, they might see a 500 Internal Server Error message. To avoid this, ask them to close their running session before you remove them from the project.
Any collaborators you share your project with will see the project in their Projects list when they log in to AE, and if others share their projects with you, they’ll appear in yours.
Getting updates from other users¶
When a collaborator makes a change to the project, a badge will appear beside the Fetch Changes icon.
Click this icon to pull changes from the server and update your local copy of the project with any changes made by other collaborators.
Anaconda Enterprise compares the copy of the project files you have locally with those on the server and notifies if any files have a conflict. If there is no file conflict, your local copies are updated.
Note
Fetching the latest changes may overwrite or delete your local copy of files without warning if a collaborator has committed changes to the server and you have not made changes to the same files, as there is no file conflict.
EXAMPLE:
Alice and Bob are both collaborators on a project that includes file1.txt.
Alice deletes file1.txt from her local copy of the project and commits her changes to the server.
Bob pulls the latest changes from the server. Bob hasn’t edited file1.txt, so file1.txt is deleted from Bob’s local version of the project. Bob’s local copy of the project and the version on the server now match exactly.
If the updates on the server conflict with changes you have made locally, you can choose one of the following options:
Cancel the Pull.
Keep theirs and Pull—discards your local changes in favor of theirs. Your changes will be lost.
Keep mine and Pull—discards changes on the server in favor of your local changes. Their changes will be overwritten.
Keep both and Pull—saves the conflicting files with different filenames so you can compare the content of the files and decide how you want to reconcile the differences. See resolving file conflicts below for more information.
Note
If you have a file open that has been modified by fetching changes, close and reopen the file for the changes to be reflected. Otherwise, the next time you save the file, you may see a “File has been overwritten on disk” alert in JupyterLab. This alert lets you choose whether to cancel the save, discard the current version and open the version of the file on disk, or overwrite the file on disk with the current version.
Committing your changes¶
After you have saved your changes locally, click the Commit Changes icon to update the master copy on the server with your changes.
If your changes conflict with updates made by other collaborators, a list of the files impacted will be highlighted in red. You may choose how you want to proceed from the following options:
Cancel the Commit.
Proceed with the Commit—overwrites your collaborators’ changes. Proceed with caution when choosing this option. Collaborators may not appreciate having their work overwritten, and important work may be lost in the process.
Selectively Commit—commit only those files which don’t have conflicts by unchecking the ones highlighted in red.
Committing changes to the server involves a full sync, so any changes made to the project on the server that do not conflict with yours are pulled in the process. This means that after committing your changes, your local copy will match the master copy on the server.
Resolving file conflicts¶
File conflicts result whenever you have updated a file locally, while a collaborator has changed that same file in their copy of the project and committed their changes to the master copy on the server.
In these cases, you may want to select Keep both and Pull to save the conflicting files with different filenames. This enables you to compare the content of the files and decide the best approach to take to reconcile the differences. The solution will likely involve manually editing the file to combine both sets of changes and then committing the file.
EXAMPLE: If a file is named Some Data.txt and Alice has committed updates to that file on the server, your new local copy of the file from the server—containing Alice’s changes—will be named Some Data.txt (Alice's conflicted file). Your local copy named Some Data.txt will not change.
Using project templates¶
In addition to sample projects, Anaconda Enterprise provides project templates to help you get started with configuring and developing in your project. Project templates provide pre-built Conda environments in your project editor session where a number of packages have already been installed.
Templates are provided for common environments such as Python, R, Spark and Hadoop, and SAS.
Each template environment includes many of the most popular data science packages. You can use anaconda-project commands to customize them as needed.
To use one of the available templates, simply select it from the Environment list when you create your project:
Python templates¶
Anaconda Enterprise provides template environments for Python versions 2.7, 3.5, and 3.6.
In a running project session, the Python 2.7 environment includes all of the packages in the Anaconda distribution for Python 2.7 with a check mark in the In Installer column. The same is true for the Python 3.5 and Python 3.6 environments.
Additional Conda and pip packages can be added using the process described in Developing a project.
For example, to upgrade to a newer version of pandas and add the hvPlot package, run the following in a terminal:
anaconda-project add-packages pandas=0.25 hvplot
Python notebooks can be edited with any of the editors provided with Anaconda Enterprise: Jupyter Notebooks, JupyterLab, or Apache Zeppelin. To change the default editor for your Python project, click on it or select View details from its menu in the list view. Then click Settings to select your preferred editor. For more information, see Working with projects.
R templates¶
The R template contains the R Essentials bundle of approximately 80 packages: r-base version 3.4.2, plus the most commonly used R packages for data science, including caret, dplyr, ggplot2, glmnet, irkernel, rbokeh, shiny, and tidyverse.
You can add other R packages as described in Developing a project. You’ll need to be able to connect to the appropriate repository to do so. Otherwise, your Administrator may need to mirror the channels and packages for you to be able to access them.
R notebooks can be edited with any of the editors provided with Anaconda Enterprise: Jupyter Notebooks, JupyterLab, or Apache Zeppelin. To change the default editor for your R project, click on it or select View details from
its menu in the list view. Then click Settings to select your preferred editor. For more information, see Working with projects.
Hadoop / Spark¶
If your Anaconda Enterprise Administrator has configured Livy server for Hadoop and Spark access, you’ll be able to access them within the platform.
The Hadoop/Spark project template includes sample code to connect to resources such as Spark, HDFS, Hive, and Impala, with and without Kerberos authentication, as described in the sections below.
In the editor session there are two environments created. anaconda50_hadoop
contains the packages consistent with the Python 3.6 template plus additional
packages to access Hadoop and Spark resources. The anaconda50_impyla
environment contains packages consistent with the Python 2.7 template plus
additional packages to access Impala tables using the Impyla Python package.
Using Kerberos authentication¶
If the Hadoop cluster is configured to use Kerberos authentication—and your Administrator has configured Anaconda Enterprise to work with Kerberos—you can use it to authenticate yourself and gain access to system resources. The process is the same for all services and languages: Spark, HDFS, Hive, and Impala.
Note
You’ll need to contact your Administrator to get your Kerberos principal, which is the combination of your username and security domain.
To perform the authentication, open an environment-based terminal in the interface. This is normally in the Launchers panel, in the bottom row of icons, and is the right-most icon.
When the interface appears, run this command:
kinit myname@mydomain.com
Replace myname@mydomain.com with the Kerberos principal, the
combination of your username and security domain, which was
provided to you by your Administrator.
Executing the command requires you to enter a password. If there is no error
message, authentication has succeeded. You can verify by issuing the klist
command. If it responds with some entries, you are authenticated.
You can also use a keytab to do this. Upload it to a project and execute a command like this:
kinit myname@mydomain.com -kt mykeytab.keytab
Note
Kerberos authentication will lapse after some time, requiring you to repeat the above process. The length of time is determined by your cluster security administration, and on many clusters is set to 24 hours.
For deployments that require Kerberos authentication, we recommend generating a
shared Kerberos keytab that has access to the resources needed by the
deployment, and adding a kinit command that uses the keytab as part of the
deployment command.
Alternatively, the deployment can include a form that asks for user credentials
and executes the kinit command.
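For example, a deployment command can run kinit with the keytab before launching the app. A minimal sketch in anaconda-project.yml (the principal, keytab, and app.py names are hypothetical):
commands:
  default:
    unix: kinit myname@mydomain.com -kt mykeytab.keytab && python app.py
    supports_http_options: true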
Using Spark¶
Apache Spark is an open source analytics engine that runs on compute clusters to provide in-memory operations, data parallelism, fault tolerance, and very high performance. Spark is a general purpose engine and highly effective for many uses, including ETL, batch, streaming, real-time, big data, data science, and machine learning workloads.
Note
Using Anaconda Enterprise with Spark requires Livy and Sparkmagic. The Hadoop/Spark project template includes Sparkmagic, but your Administrator must have configured Anaconda Enterprise to work with a Livy server.
Supported versions¶
The following combinations of tools are supported:
Python 2 and Python 3, Apache Livy 0.5, Apache Spark 2.1, Oracle Java 1.8
Python 2, Apache Livy 0.5, Apache Spark 1.6, Oracle Java 1.8
Livy¶
Apache Livy is an open source REST interface to submit and manage jobs on a Spark cluster, including code written in Java, Scala, Python, and R. These jobs are managed in Spark contexts, and the Spark contexts are controlled by a resource manager such as Apache Hadoop YARN. This provides fault tolerance and high reliability as multiple users interact with a Spark cluster concurrently.
With Anaconda Enterprise, you can connect to a remote Spark cluster using Apache Livy with any of the available clients, including Jupyter notebooks with Sparkmagic. Anaconda Enterprise provides Sparkmagic, which includes Spark, PySpark, and SparkR notebook kernels for deployment.
The Apache Livy architecture gives you the ability to submit jobs from any remote machine or analytics cluster, even where a Spark client is not available. It removes the requirement to install Jupyter and Anaconda directly on an edge node in the Spark cluster.
Livy and Sparkmagic work as a REST server and client that:
Retains the interactivity and multi-language support of Spark
Does not require any code changes to existing Spark jobs
Maintains all of Spark’s features such as the sharing of cached RDDs and Spark Dataframes, and
Provides an easy way of creating a secure connection to a Kerberized Spark cluster.
When Livy is installed, you can connect to a remote Spark cluster when creating a new project by selecting the Spark template.
Kernels¶
When you copy the project template “Hadoop/Spark” and open a Jupyter editing session, you will see several kernels such as these available:
Python 3
PySpark
PySpark3
R
Spark
SparkR
Python 2
To work with Livy and Python, use PySpark. Do not use
PySpark3.
To work with Livy and R, use R with the sparklyr
package. Do not use the kernel SparkR.
To work with Livy and Scala, use Spark.
You can use Spark with Anaconda Enterprise in two ways:
Starting a notebook with one of the Spark kernels, in which case all code will be executed on the cluster and not locally. Note that a connection and all cluster resources will be assigned as soon as you execute any ordinary code cell, that is, any cell not marked as %%local.
Starting a normal notebook with a Python kernel, and using %load_ext sparkmagic.magics. That command enables a set of functions to run code on the cluster. See examples (external link).
To display graphical output directly from the cluster, you must use SQL commands. This is also the only way to have results passed back to your local Python kernel, so that you can manipulate them further with pandas or other packages.
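For instance, in a Sparkmagic kernel the %%sql magic accepts a -o option that stores the query result in your local Python kernel as a pandas DataFrame. A minimal sketch (mytable is a placeholder table name):
%%sql -o top_rows
SELECT * FROM mytable LIMIT 10
You can then work with the result locally in a %%local cell:
%%local
top_rows.head()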
In the common case, the configuration provided for you in the Session will be correct and not require modification. However, in other cases you may need to use sandbox or ad-hoc environments that require the modifications described below.
Overriding session settings¶
Certain jobs may require more cores or memory, or custom environment variables
such as Python worker settings. The configuration passed to Livy is generally
defined in the file ~/.sparkmagic/conf.json.
You may inspect this file, particularly the section "session_configs", or
you may refer to the example file in the spark directory,
sparkmagic_conf.example.json. Note that the example file has not been
tailored to your specific cluster.
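As an illustration of the shape of that section, a session_configs block might look like the following (the values are placeholders, not recommendations):
{
  "session_configs": {
    "executorMemory": "4G",
    "executorCores": 4
  }
}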
In a Sparkmagic kernel such as PySpark, SparkR, or similar, you can change the
configuration with the magic %%configure. This syntax is pure JSON, and the
values are passed directly to the driver application.
EXAMPLE:
%%configure -f
{"executorMemory": "4G", "executorCores":4}
To use a different environment, use the Spark configuration to set
spark.driver.python and spark.executor.python on all compute nodes in
your Spark cluster.
EXAMPLE:
If all nodes in your Spark cluster have Python 2 deployed at /opt/anaconda2
and Python 3 deployed at /opt/anaconda3, then you can select Python 2 on all
execution nodes with this code:
%%configure -f
{"conf": {"spark.driver.python": "/opt/anaconda2/bin/python", "spark.executor.python": "/opt/anaconda2/bin/python"}}
If all nodes in your Spark cluster have Python 2 deployed at /opt/anaconda2
and Python 3 deployed at /opt/anaconda3, then you can select Python 3 on all
execution nodes with this code:
%%configure -f
{"conf": {"spark.driver.python": "/opt/anaconda3/bin/python", "spark.executor.python": "/opt/anaconda3/bin/python"}}
If you are using a Python kernel and have done %load_ext sparkmagic.magics,
you can use the %manage_spark command to set configuration options. The
session options are in the “Create Session” pane under “Properties”.
Overriding session settings can be used to target multiple Python and R interpreters, including Python and R interpreters coming from different Anaconda parcels.
Using custom Anaconda parcels and management packs¶
Anaconda Enterprise Administrators can generate custom parcels for Cloudera CDH or custom management packs for Hortonworks HDP to distribute customized versions of Anaconda across a Hadoop/Spark cluster using Cloudera Manager for CDH or Apache Ambari for HDP. See Using installers, parcels and management packs for more information.
As a platform user, you can then select a specific version of Anaconda and Python on a per-project basis by including the following configuration in the first cell of a Sparkmagic-based Jupyter Notebook.
For example:
%%configure -f
{"conf": {"spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/opt/anaconda/bin/python",
"spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON": "/opt/anaconda/bin/python",
"spark.yarn.executorEnv.PYSPARK_PYTHON": "/opt/anaconda/bin/python",
"spark.pyspark.python": "/opt/anaconda/bin/python",
"spark.pyspark.driver.python": "/opt/anaconda/bin/python"
}
}
Note
Replace /opt/anaconda/ with the prefix of the name and location for the particular parcel or management pack.
Overriding basic settings¶
In some more experimental situations, you may want to change the Kerberos or Livy connection settings. This could be done when first configuring the platform for a cluster, usually by an administrator with intimate knowledge of the cluster’s security model.
Users could override basic settings if their administrators have not configured Livy, or to connect to a cluster other than the default cluster.
In these cases, we recommend creating a krb5.conf file and a
sparkmagic_conf.json file in the project directory so they will be saved
along with the project itself. An example Sparkmagic configuration is included,
sparkmagic_conf.example.json, listing the fields that are typically set. The
"url" and "auth" keys in each of the kernel sections are especially
important.
The krb5.conf file is normally copied from the Hadoop cluster, rather than
written manually, and may refer to additional configuration or certificate
files. These files must all be uploaded using the interface.
To use these alternate configuration files, set the default of the KRB5_CONFIG variable to point to the full path of krb5.conf, and set the values of
SPARKMAGIC_CONF_DIR and SPARKMAGIC_CONF_FILE to point to the Sparkmagic
config file. You can set these either by using the Project pane on the left of
the interface, or by directly editing the anaconda-project.yml file.
For example, the final file’s variables section may look like this:
variables:
KRB5_CONFIG:
description: Location of config file for kerberos authentication
default: /opt/continuum/project/krb5.conf
SPARKMAGIC_CONF_DIR:
description: Location of sparkmagic configuration file
default: /opt/continuum/project
SPARKMAGIC_CONF_FILE:
description: Name of sparkmagic configuration file
default: sparkmagic_conf.json
Note
You must perform these actions before running kinit or starting any notebook/kernel.
Warning
If you misconfigure a .json file, all Sparkmagic kernels will fail to launch. You can test your Sparkmagic configuration by running the following command in a terminal: python -m json.tool sparkmagic_conf.json.
If you have formatted the JSON correctly, this command will run without error. Additional edits may be required, depending on your Livy settings. See Installing Livy server for Hadoop Spark access and Configuring Livy server for Hadoop Spark access for information on installing and configuring Livy.
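Equivalently, you can perform the same check from Python (a minimal sketch; json.load raises an error identifying the malformed location):
import json

# Raises json.JSONDecodeError, pointing at the offending line and column,
# if the file is not valid JSON.
with open('sparkmagic_conf.json') as f:
    json.load(f)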
Python¶
Example code showing Python with a Spark kernel:
# `sc` is the SparkContext provided automatically by the Spark kernel.
sc
data = sc.parallelize(range(1, 100))
data.mean()

import pandas as pd
df = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("col1", "col2"))
# `sqlContext` is also provided by the kernel; convert the pandas
# DataFrame into a Spark DataFrame to run distributed queries.
sparkdf = sqlContext.createDataFrame(df)
sparkdf.select("col1").show()
sparkdf.filter(sparkdf['col2'] == 2).show()
Using HDFS¶
The Hadoop Distributed File System (HDFS) is an open source, distributed, scalable, and fault tolerant Java based file system for storing large volumes of data on the disks of many computers. It works with batch, interactive, and real-time workloads.
Dependencies¶
python-hdfs
Supported versions¶
Hadoop 2.6.0, Python 2 or 3
Kernels¶
[anaconda50_hadoop] Python 3
Connecting¶
To connect to an HDFS cluster you need the address and port to the HDFS Namenode, normally port 50070.
To use the hdfscli command line, configure the ~/.hdfscli.cfg file:
[global]
default.alias = dev
[dev.alias]
url = http://<Namenode>:50070
Once the library is configured, you can use it to perform actions on HDFS with
the command line by starting a terminal based on the [anaconda50_hadoop] Python 3
environment and executing the hdfscli command. For example:
$ hdfscli
Welcome to the interactive HDFS python shell.
The HDFS client is available as `CLIENT`.
In [1]: CLIENT.list("/")
Out[1]: ['hbase', 'solr', 'tmp', 'user']
Python¶
Sample code showing Python with HDFS without Kerberos:
from hdfs import InsecureClient
client = InsecureClient('http://<Namenode>:50070')
client.list("/")
Python with HDFS with Kerberos:
from hdfs.ext.kerberos import KerberosClient
client = KerberosClient('http://<Namenode>:50070')
# For example:
# client = KerberosClient('http://ip-172-31-14-99.ec2.internal:50070')
client.list("/")
Using Hive¶
Hive is an open source data warehouse project for queries and data analysis. It provides an SQL-like interface called HiveQL to access distributed data stored in various databases and file systems.
Hive is very flexible in its connection methods and there are multiple ways to connect to it, such as JDBC, ODBC and Thrift. Anaconda recommends Thrift with Python and JDBC with R.
Dependencies¶
pyhive, RJDBC
Supported versions¶
Hive 1.1.0, JDK 1.8, Python 2 or Python 3
Kernels¶
[anaconda50_hadoop] Python 3
Drivers¶
Using JDBC requires downloading a driver for the specific version of Hive that you are using. This driver is also specific to the vendor you are using.
For example, Cloudera provides JDBC drivers for its distribution of Hive.
We recommend downloading the respective JDBC drivers and committing them to the project so that they are always available when the project starts.
Once the drivers are located in the project, Anaconda recommends using the RJDBC library to connect to Hive. Sample code for this is shown below.
Connecting¶
To connect to a Hive cluster you need the address and port to a running Hive Server 2, normally port 10000.
To use PyHive, open a Python notebook based on the [anaconda50_hadoop] Python 3
environment and run:
from pyhive import hive
conn = hive.connect('<Hive Server 2>', port=10000)
cursor = conn.cursor()
cursor.execute('SHOW DATABASES')
cursor.fetchall()
Python¶
Anaconda recommends the Thrift method to connect to Hive from Python. With Thrift you can use all the functionality of Hive, including security features such as SSL connectivity and Kerberos authentication. Thrift does not require special drivers, which improves code portability.
Instead of using an ODBC driver for connecting to the SQL engines, a Thrift client uses its own protocol based on a service definition to communicate with a Thrift server. This definition can be used to generate libraries in any language, including Python.
Hive using PyHive:
from pyhive import hive
conn = hive.connect('<Hive Server 2>', port=10000, auth='KERBEROS', kerberos_service_name='hive')
cursor = conn.cursor()
cursor.execute('SHOW TABLES')
cursor.fetchall()
# This prints: [('iris',), ('t1',)]
cursor.execute('SELECT * FROM iris')
cursor.fetchall()
# This prints the output of that table
Note
The output will be different, depending on the tables available on the cluster.
R¶
Anaconda recommends the JDBC method to connect to Hive from R.
Using JDBC allows for multiple types of authentication including Kerberos. The only difference between the types is that different flags are passed to the URI connection string on JDBC. Please follow the official documentation of the driver you picked and for the authentication you have in place.
Hive using RJDBC:
library("RJDBC")
hive_classpath <- list.files("<PATH TO JDBC DRIVER>", pattern="jar$", full.names=T)
drv <- JDBC(driverClass = "com.cloudera.hive.jdbc4.HS2Driver", classPath = hive_classpath, identifier.quote="'")
url <- "jdbc:hive2://<HIVE SERVER 2 HOST>:10000/default;SSL=1;AuthMech=1;KrbRealm=<KRB REALM>;KrbHostFQDN=<KRB HOST>;KrbServiceName=hive"
conn <- dbConnect(drv, url)
dbGetQuery(conn, "SHOW TABLES")
dbDisconnect(conn)
Note
The output will be different, depending on the tables available on the cluster.
Using Impala¶
Apache Impala is an open source, native analytic SQL query engine for Apache Hadoop. It uses massively parallel processing (MPP) for high performance, and works with commonly used big data formats such as Apache Parquet.
Impala is very flexible in its connection methods and there are multiple ways to connect to it, such as JDBC, ODBC and Thrift. Anaconda recommends Thrift with Python and JDBC with R.
Dependencies¶
impyla, implyr, RJDBC
Supported versions¶
Impala 2.12.0, JDK 1.8, Python 2 or Python 3
Kernels¶
Python 2
Drivers¶
Using JDBC requires downloading a driver for the specific version of Impala that you are using. This driver is also specific to the vendor you are using.
For example, Cloudera provides JDBC drivers for its distribution of Impala.
We recommend downloading the respective JDBC drivers and committing them to the project so that they are always available when the project starts.
Once the drivers are located in the project, Anaconda recommends using the RJDBC library to connect to both Hive and Impala. Sample code for this is shown below.
Connecting¶
To connect to an Impala cluster you need the address and port to a running Impala Daemon, normally port 21050.
To use Impyla, open a Python Notebook based on the Python 2
environment and run:
from impala.dbapi import connect
conn = connect('<Impala Daemon>', port=21050)
cursor = conn.cursor()
cursor.execute('SHOW DATABASES')
cursor.fetchall()
Python¶
Anaconda recommends the Thrift method to connect to Impala from Python. With Thrift you can use all the functionality of Impala, including security features such as SSL connectivity and Kerberos authentication. Thrift does not require special drivers, which improves code portability.
Instead of using an ODBC driver for connecting to the SQL engines, a Thrift client uses its own protocol based on a service definition to communicate with a Thrift server. This definition can be used to generate libraries in any language, including Python.
Impala using Impyla:
from impala.dbapi import connect
conn = connect(host='<Impala Daemon>', port=21050, auth_mechanism='GSSAPI', kerberos_service_name='impala')
cursor = conn.cursor()
cursor.execute('SHOW TABLES')
results = cursor.fetchall()
results
# This prints: [('iris',),]
cursor.execute('SELECT * FROM iris')
cursor.fetchall()
# This prints the output of that table
Note
The output will be different, depending on the tables available on the cluster.
R¶
Anaconda recommends the JDBC method to connect to Impala from R.
Using JDBC allows for multiple types of authentication including Kerberos. The only difference between the types is that different flags are passed to the URI connection string on JDBC. Please follow the official documentation of the driver you picked and for the authentication you have in place.
Anaconda recommends implyr to manipulate tables from Impala. This library provides a dplyr interface for Impala tables that is familiar to R users. implyr uses RJDBC for the connection.
Impala using RJDBC and Implyr:
library(implyr)
library(RJDBC)
impala_classpath <- list.files(path = "<PATH TO JDBC DRIVER>", pattern = "\\.jar$", full.names = TRUE)
drv <- JDBC(driverClass = "com.cloudera.hive.jdbc4.HS2Driver", classPath = impala_classpath, identifier.quote="'")
url <- "jdbc:impala://<IMPALA DAEMON HOST>:21050/default;SSL=1;AuthMech=1;KrbRealm=<KRB REALM>;KrbHostFQDN=<KRB HOST>;KrbServiceName=impala"
# Use implyr to create a dplyr interface
impala <- src_impala(drv, url)
# This will show all the available tables
src_tbls(impala)
Note
The output will be different, depending on the tables available on the cluster.
Working with SAS¶
With Anaconda Enterprise, you can connect to a remote SAS server process using the official sas_kernel and saspy. This allows you to merge SAS and Python/R workflows in a single interface, and to share your SAS-based work with your colleagues within the Enterprise platform.
Note
SAS is currently available in interactive development sessions only, not in deployments.
sas_kernel is distributed under the Apache 2.0 License, and requires SAS version 9.4 or later. SAS is (c) SAS Institute, Inc.
Anaconda Enterprise and sas_kernel¶
Anaconda connects to a remote SAS server application over a secure SSH connection.
After you configure and establish the connection with the provided SAS kernel, SAS commands are sent to the remote server, and results appear in your notebook.
Note
Each open notebook starts a new SAS session on the server, which stays alive while the notebook is being used. This may affect your SAS license utilization.
Configuration¶
The file sascfg_personal.py in the project root directory provides the
configuration for the SAS kernel to run.
Normally your system administrator will provide the values to be entered here.
The connection information is stored in a block like this:
default = {
'saspath' : '/opt/sas9.4/install/SASHome/SASFoundation/9.4/bin/sas_u8',
'ssh' : '/usr/bin/ssh',
'host' : 'username@55.55.55.55',
'options' : ["-fullstimer"]
}
'saspath' must match the exact full path of the SAS binary on the remote system.
'host' must be a connection string that SSH can understand. Note that it
includes both a login username and an IP or hostname. A successful connection
requires that both are correct. The IP or hostname may have an optional suffix
of a colon and a port number, so both username@55.55.55.55 and
username@55.55.55.55:2022 are possible values.
Establishing a Connection¶
Whenever you start a new editing session, you must perform the following steps before creating or running a notebook with a SAS kernel:
From the SAS project, edit the configuration file sascfg_personal.py with your SAS path and host as mentioned in the configuration section.
In the following example, replace the default values with your own:
SAS_config_names = ['default']
SAS_config_options = {'lock_down': False}
SAS_output_options = {'output': 'html5'}
default = {
    'saspath' : '/opt/sas94/sashome/SASFoundation/9.4/sas',
    'ssh'     : '/usr/bin/ssh',
    'host'    : '<username>@<ip-addr>',
    'options' : ["-fullstimer"]
}
Open the Terminal from the project and run the following to generate a key:
ssh-keygen
When prompted for the file in which to save the key, press Enter to save it as id_rsa in /opt/continuum/.ssh.
Next, it will prompt you for a passphrase. Press Enter for no passphrase.
You will see two files, id_rsa and id_rsa.pub. The file ending in .pub must be known to the SAS server. You can view this file in the terminal with the cat command:
cat id_rsa.pub
Log in to your SAS server using your username and password. Edit the file ~/.ssh/authorized_keys as that user and append the contents of the id_rsa.pub file there. You can edit the file with any console text editor on your system, such as nano or vi.
From the project Terminal, run the following command to test the connection to your SAS server:
ssh <connection-string> -o StrictHostKeyChecking=no echo OK
Replace <connection-string> with the host entry in sascfg_personal.py. You should not be prompted for the SSH key’s passphrase or a password.
Now you can start the notebooks with the SAS kernel from the launcher pane, or switch the kernel of any notebook that is already open.
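If you prefer to drive SAS from a Python notebook instead of the SAS kernel, saspy can use the same configuration block. A minimal sketch, assuming the default block shown above (sashelp.class is a standard SAS sample table):
import saspy

# 'default' refers to the block defined in sascfg_personal.py
sas = saspy.SASsession(cfgname='default')
result = sas.submit('proc means data=sashelp.class; run;')
print(result['LST'])  # the listing (output) portion of the result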
Working with packages¶
Anaconda Enterprise uses packages to bundle software files and information about the software—such as its name, specific version and description—into a single file that can be easily installed and managed.
Packages are distributed via channels. Channels may point to a cloud-based repository or a private location on a remote or local repository that you or someone else in your organization created. For more information, see Configuring channels and packages.
Note
Anaconda Enterprise supports the use of both conda and pip packages in its repository. To create and share channels and packages from your Anaconda Repository using conda commands, first install anaconda-enterprise-cli and log in to your AE instance.
Creating a package requires familiarity with the conda package manager and command line interface (CLI), so not all AE users will create packages and channels.
Many Anaconda Enterprise users may interact with packages primarily within the context of projects and deployments. In this case, they will likely do the following:
Access and download any packages and installers they need from the list of those available under Channels.
Work with the contents of the package as they create models and dashboards, then
Add any packages the project depends on to the project before deploying it.
Other users may primarily build packages, upload them to channels and share them with others to access and download.
Building a conda package¶
You can build a conda package to bundle software files and information about the software—such as its name, specific version and description—into a single file that can be easily installed and managed.
Building a conda package requires installing conda build and creating a conda build recipe.
You then use the conda build command to build the conda package from the conda recipe.
Tip
If you are new to building packages with conda, here are some video tutorials that you may find helpful:
- Production-grade Packaging with Anaconda | AnacondaCon 2018
This 41-minute presentation by Mahmoud Hashemi covers using conda and conda environments to build an OS package (RPM) and Docker images.
- The Sheer Joy of Packaging | SciPy 2018 Tutorial
This 210-minute presentation by Michael Sarahan, Filipe Fernandes, Chris Barker, Matt Craig, Matt McCormick, Jean-Christophe Fillion-Robin, Jonathan Helmus, and Ray Donnelly provides end-to-end examples of packaging with PyPI and conda. You can find materials from the tutorial here.
- Making packages and packaging “just work” | PyData 2017 Tutorial
This 40-minute presentation by Michael Sarahan walks you through critical topics such as the anatomy of a Python package, tools available to make packaging easier, plus how to automate builds and why you might want to do so.
You can build conda packages from a variety of source code projects, most notably Python. For help packaging a Python project, see the Setuptools documentation.
Note
Setuptools is a package development process library designed to facilitate packaging Python projects, and is not part of Anaconda, Inc. Conda-build uses the build system that’s native to the language, so in the case of Python that’s setuptools.
After you build the package, you can upload it to a channel for others to access.
Uploading a conda package¶
After you build a conda package, you can upload it to a channel to make it available for others to use.
A channel is a specific location for storing packages, and may point to a cloud-based repository or a private location on a remote or local repository that you or your organization created. See Accessing remote package repositories for more information.
Note
There is a 1GB file size limit for package files you upload.
To add a package to an existing channel:¶
Click Channels in the top menu to display your existing channels.
Select the specific channel you want to add your package to—information about any packages already in the channel is displayed.
Click Upload, browse for the package and click Upload. The package is added to the list.
Now you can share the channel and packages with others.
To create a new channel to add packages to:¶
Click Create in the upper right corner, enter a meaningful name for the channel and click Create.
Note
Channels are Public—accessible by non-authenticated users—by default. To make the channel Private, and therefore available to authenticated users only, disable the toggle to switch the channel setting from Public to Private.
Click Upload to add your package(s) to the channel.
Using the CLI:
You can also create a channel by running the following in a terminal window:
anaconda-enterprise-cli channels create <channelname>
Note
The channel name <channelname> you enter must not already exist.
Now you can upload a package to the channel by entering the following:
anaconda-enterprise-cli upload path/to/pkgs/notebookname.tar.bz2 --channel <channelname>
Replacing path/to/pkgs/notebookname.tar.bz2 with the actual path to the package you want to
upload, and <channelname> with the actual channel name.
To remove a package from a channel, select Delete from the command menu for the package:
Note
If the Delete command is not available, you don’t have permission to remove the package from the channel.
Setting a default channel¶
There is no default_channel in a fresh install, so you’ll have to enter a specific channel each time.
If you don’t want to enter the --channel option with each command, you
can set a default channel:
anaconda-enterprise-cli config set default_channel <channelname>
To display your current default channel:
$ anaconda-enterprise-cli config get default_channel
'<channelname>'
After setting the default channel, upload to your default channel:
anaconda-enterprise-cli upload <path/to/pkgs/packagename.tar.bz2>
Replacing <path/to/pkgs/packagename.tar.bz2> with the actual path to the
package you want to upload.
Sharing channels and packages¶
After you build a package and upload it to a channel, you can enable others to access it by sharing the channel with them. You can share a channel with specific users, or groups of users.
To share multiple packages with the same set of users, you can upload all of the packages to a channel and share that channel. This enables you to create channels for each type of user you support, and add the packages they need to each.
Anyone you share the channel with will see it in their Channels list when they log in to Anaconda Enterprise. They can then download the packages in the channel they want to work with, and add any packages their project depends on to their project before deploying it.
Note
The default is to grant collaborators read-write access, so if you want to prevent them from adding and removing packages from the channel, be sure to give them read-only access. You’ll need to use the CLI to make a channel read-only.
Using the CLI:¶
Get a list of all the channels on the platform with the channels list command:
anaconda-enterprise-cli channels list
Share a channel with a specific user using the share command:
anaconda-enterprise-cli channels share --user username --level r <channelname>
You can also share a channel with an existing group created by your Administrator:
anaconda-enterprise-cli channels share --group GROUPNAME --level r <channelname>
Replacing GROUPNAME with the actual name of your group.
Note
Adding --level r grants this group read-only access to the channel.
You can “unshare” a channel using the following command:
anaconda-enterprise-cli channels share --user <username> --remove <channelname>
Run anaconda-enterprise-cli channels --help to see more information about
what you can do with channels.
For help with a specific command, enter that command followed by --help:
anaconda-enterprise-cli channels share --help
Configuring conda¶
If you are familiar with conda and want to use it to install the packages you need, you can configure conda to search a specific set of channels for packages. Listing channel locations in the .condarc file overrides conda defaults, causing conda to search only the channels listed, in the order specified.
The channels you specify can be public or private. Private channels will require you to authenticate before you can conda install packages from them.
If your organization has configured conda at the system level to limit platform users to only access packages in your on-premises repository, this will override your user-level configuration file.
To configure conda, create or update the ~/.condarc configuration file in your home directory to include your preferred repository channels. For example:
channels:
- <anaconda_dot_org_username>
- http://some.custom/channel
- file:///some/local/directory
- defaults
For more information, see this section of the conda docs.
Using installers, parcels and management packs¶
In addition to Anaconda and Miniconda installers, your Administrator may create custom installers, Cloudera Manager parcels, or Hortonworks HDP management packs for you and your colleagues to use. They make these specific packages and their dependencies available to you via channels.
To view the installers available to you, select the top Channels menu, then click the Installers link in the top right corner.
To download an installer, simply click on its name in the list.
Note
If you don’t see an installer that you expected to see, please contact your Administrator and ask them to generate the installer you need.
Working with data¶
Loading data into your project¶
Anaconda Enterprise uses projects to encapsulate all of the components necessary to use or run an application: the relevant packages, channels, scripts, notebooks and other related files, environment variables, services and commands, along with a configuration file named anaconda-project.yml.
You can also access and load data in a variety of formats, stored in common sources including the following:
Distributed version control repositories such as Git and Bitbucket (if configured by your Administrator).
The amount of data you read into your project will impact the resources required to successfully run the project, whether in a notebook session or deployment. See the following section on understanding resource profiles to learn more.
Understanding resource profiles¶
Resource profiles are used to limit the amount of CPU cores and RAM available for use when running a project session or deployment.
Note
Choosing a resource profile with a greater number of available cores is not guaranteed to improve performance—it will also depend on whether the libraries used by the project can take advantage of multiple cores, for example.
Memory limits are enforced by the Linux kernel, so when the memory limit is exceeded the most recent process will crash. Be sure to select a resource profile that offers sufficient runtime resources required by your project to avoid such errors. A best practice recommendation is to choose a resource profile with roughly double the amount of memory required by the size of data you need to read.
To see the total memory in use, open a terminal and run the following command:
cat /sys/fs/cgroup/memory/memory.usage_in_bytes | awk '{print $1/1024/1024}'
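To estimate how much memory a dataset occupies once loaded, pandas can report a DataFrame’s in-memory footprint. A minimal sketch (iris.csv stands in for your own file):
import pandas as pd

df = pd.read_csv('iris.csv')
# deep=True accounts for the actual size of object (string) columns
mib = df.memory_usage(deep=True).sum() / 1024**2
print(f'{mib:.1f} MiB')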
Uploading files to a project¶
Open an editing session for the project, then choose the file you want to upload. The process of uploading files varies slightly, based on the editor used:
In Jupyter Notebook, click Upload and select the file to upload. Then click the blue Upload button displayed in the file’s row to add the file to the project.
In JupyterLab, click the Upload files icon and select the file. In the top right corner, click Commit Changes to add the file to your project.
In Zeppelin, use the Import note feature to select a JSON file or add data from a URL.
Once a file is in the project, you can use code to read it. For example, to load the iris dataset from a comma separated value (CSV) file into a pandas DataFrame:
import pandas as pd
irisdf = pd.read_csv('iris.csv')
Accessing data stored in databases¶
You can also connect to the following database engines to access data stored within them:
See Storing secrets for information about adding credentials to the platform, to make them available in your projects. Any secrets you add will be available across all sessions and deployments associated with your user account.
Hadoop Distributed File System (HDFS), Spark, Hive, and Impala¶
Loading data from HDFS, Spark, Hive, and Impala is discussed in Hadoop / Spark.
SAS¶
You can connect to SAS servers and load data from SAS files as described in Working with SAS.
Exploring project data¶
With Anaconda Enterprise, you can explore project data using visualization libraries such as Bokeh and Matplotlib, and numeric libraries such as NumPy, SciPy, and Pandas.
Use these tools to discover patterns and relationships in your datasets, and develop approaches for your analysis and deployment pipelines.
The following examples use the Iris flower data set, and this mini customer data set (customers.csv):
customer_id,title,industry
1,data scientist,retail
2,data scientist,academia
3,compiler optimizer,academia
4,data scientist,finance
5,compiler optimizer,academia
6,data scientist,academia
7,compiler optimizer,academia
8,data scientist,retail
9,compiler optimizer,finance
Begin by importing libraries, and reading data into a Pandas DataFrame:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
irisdf = pd.read_csv('iris.csv')
customerdf = pd.read_csv('customers.csv')
%matplotlib inline
Then list column / variable names:
print(irisdf.columns)
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')
Summary statistics include minimum, maximum, mean, median, percentiles, and more:
print('length:', len(irisdf)) # length of data set
print('shape:', irisdf.shape) # length and width of data set
print('size:', irisdf.size) # length * width
print('min:', irisdf['sepal_width'].min())
print('max:', irisdf['sepal_width'].max())
print('mean:', irisdf['sepal_width'].mean())
print('median:', irisdf['sepal_width'].median())
print('50th percentile:', irisdf['sepal_width'].quantile(0.5)) # 50th percentile, also known as median
print('5th percentile:', irisdf['sepal_width'].quantile(0.05))
print('10th percentile:', irisdf['sepal_width'].quantile(0.1))
print('95th percentile:', irisdf['sepal_width'].quantile(0.95))
length: 150
shape: (150, 5)
size: 750
min: 2.0
max: 4.4
mean: 3.0573333333333337
median: 3.0
50th percentile: 3.0
5th percentile: 2.3449999999999998
10th percentile: 2.5
95th percentile: 3.8
Use the value_counts function to show the number of items in each category, sorted from largest to smallest. You can also set the ascending argument to True to display the list from smallest to largest.
print(customerdf['industry'].value_counts())
print()
print(customerdf['industry'].value_counts(ascending=True))
academia 5
finance 2
retail 2
Name: industry, dtype: int64
retail 2
finance 2
academia 5
Name: industry, dtype: int64
Categorical variables¶
In statistics, a categorical variable may take on a limited number of possible values. Examples could include blood type, nation of origin, or ratings on a Likert scale.
Like numbers, the possible values may have an order, such as from disagree to neutral to agree. The values cannot, however, be used for numerical operations such as addition or division.
Categorical variables tell other Python libraries how to handle the data, so those libraries can default to suitable statistical methods or plot types.
The following example converts the class variable of the Iris dataset from object to category.
print(irisdf.dtypes)
print()
irisdf['class'] = irisdf['class'].astype('category')
print(irisdf.dtypes)
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
class object
dtype: object
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
class category
dtype: object
Within Pandas, this creates an array of the possible values, where each value appears only once, and replaces the strings in the DataFrame with indexes into the array. In some cases, this saves significant memory.
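You can check the effect by comparing the column’s footprint before and after the conversion (a sketch using the Iris data loaded above; the savings depend on the data):
print(irisdf['class'].astype('object').memory_usage(deep=True))
print(irisdf['class'].memory_usage(deep=True))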
A categorical variable may have a logical order different than the lexical order. For
example, for ratings on a Likert scale, the lexical order could alphabetize the
strings and produce agree, disagree, neither agree nor disagree, strongly
agree, strongly disagree. The logical order could range from most
negative to most positive as strongly disagree, disagree, neither agree nor
disagree, agree, strongly agree.
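Pandas can encode that logical order explicitly with an ordered categorical. A minimal sketch (the responses are illustrative):
import pandas as pd

levels = ['strongly disagree', 'disagree', 'neither agree nor disagree',
          'agree', 'strongly agree']
responses = pd.Series(['agree', 'disagree', 'strongly agree'])
ordered = responses.astype(pd.CategoricalDtype(categories=levels, ordered=True))
print(ordered.min(), '->', ordered.max())  # respects the logical order, not the lexical one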
Time series data visualization¶
The following code sample creates four series of random numbers over time, calculates the cumulative sums for each series over time, and plots them.
timedf = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2015', periods=1000), columns=list('ABCD'))
timedf = timedf.cumsum()
timedf.plot()
This example was adapted from http://pandas.pydata.org/pandas-docs/stable/visualization.html.
Histograms¶
This code sample plots a histogram of the sepal length values in the Iris data set:
plt.hist(irisdf['sepal_length'])
plt.show()
Bar charts¶
The following sample code produces a bar chart of the industries of customers in the customer data set.
industries = customerdf['industry'].value_counts()
fig, ax = plt.subplots()
ax.bar(np.arange(len(industries)), industries)
ax.set_xlabel('Industry')
ax.set_ylabel('Customers')
ax.set_title('Customer industries')
ax.set_xticks(np.arange(len(industries)))
ax.set_xticklabels(industries.index)
plt.show()
This example was adapted from https://matplotlib.org/gallery/statistics/barchart_demo.html.
Scatter plots¶
This code sample makes a scatter plot of the sepal lengths and widths in the Iris data set:
fig, ax = plt.subplots()
ax.scatter(irisdf['sepal_length'], irisdf['sepal_width'], color='green')
ax.set(
xlabel="length",
ylabel="width",
title="Iris sepal sizes",
)
plt.show()
Sorting¶
To show the customer data set:
customerdf
| row | customer_id | title | industry |
|---|---|---|---|
| 0 | 1 | data scientist | retail |
| 1 | 2 | data scientist | academia |
| 2 | 3 | compiler optimizer | academia |
| 3 | 4 | data scientist | finance |
| 4 | 5 | compiler optimizer | academia |
| 5 | 6 | data scientist | academia |
| 6 | 7 | compiler optimizer | academia |
| 7 | 8 | data scientist | retail |
| 8 | 9 | compiler optimizer | finance |
To sort by industry and show the results:
customerdf.sort_values(by=['industry'])
| row | customer_id | title | industry |
|---|---|---|---|
| 1 | 2 | data scientist | academia |
| 2 | 3 | compiler optimizer | academia |
| 4 | 5 | compiler optimizer | academia |
| 5 | 6 | data scientist | academia |
| 6 | 7 | compiler optimizer | academia |
| 3 | 4 | data scientist | finance |
| 8 | 9 | compiler optimizer | finance |
| 0 | 1 | data scientist | retail |
| 7 | 8 | data scientist | retail |
To sort by industry and then title:
customerdf.sort_values(by=['industry', 'title'])
| row | customer_id | title | industry |
|---|---|---|---|
| 2 | 3 | compiler optimizer | academia |
| 4 | 5 | compiler optimizer | academia |
| 6 | 7 | compiler optimizer | academia |
| 1 | 2 | data scientist | academia |
| 5 | 6 | data scientist | academia |
| 8 | 9 | compiler optimizer | finance |
| 3 | 4 | data scientist | finance |
| 0 | 1 | data scientist | retail |
| 7 | 8 | data scientist | retail |
The sort_values function can also use the following arguments:
axis to sort either rows or columns
ascending to sort in either ascending or descending order
inplace to perform the sorting operation in place, without copying the data, which can save space
kind to use the quicksort, merge sort, or heapsort algorithm
na_position to sort not-a-number (NaN) entries at the beginning or end
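For example, a hypothetical call combining several of these arguments (a sketch, not part of the original walkthrough):

# Sort by industry in descending order with the stable merge sort,
# placing any NaN entries first
customerdf.sort_values(by='industry', ascending=False,
                       kind='mergesort', na_position='first')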
Grouping¶
customerdf.groupby('title')['customer_id'].count() counts the items in each
group, excluding missing values such as not-a-number values (NaN). Because
there are no missing customer IDs, this is equivalent to
customerdf.groupby('title').size().
print(customerdf.groupby('title')['customer_id'].count())
print()
print(customerdf.groupby('title').size())
print()
print(customerdf.groupby(['title', 'industry']).size())
print()
print(customerdf.groupby(['industry', 'title']).size())
title
compiler optimizer 4
data scientist 5
Name: customer_id, dtype: int64
title
compiler optimizer 4
data scientist 5
dtype: int64
title industry
compiler optimizer academia 3
finance 1
data scientist academia 2
finance 1
retail 2
dtype: int64
industry title
academia compiler optimizer 3
data scientist 2
finance compiler optimizer 1
data scientist 1
retail data scientist 2
dtype: int64
By default groupby sorts the group keys. You can use the sort=False
option to prevent this, which can make the grouping operation faster.
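For example, a minimal sketch:

# Groups appear in order of first occurrence instead of sorted key order
print(customerdf.groupby('title', sort=False).size())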
Binning¶
Binning or bucketing moves continuous data into discrete chunks, which can be used as ordinal categorical variables.
You can divide the range of the sepal length measurements into four equal bins:
pd.cut(irisdf['sepal_length'], 4).head()
0 (4.296, 5.2]
1 (4.296, 5.2]
2 (4.296, 5.2]
3 (4.296, 5.2]
4 (4.296, 5.2]
Name: sepal_length, dtype: category
Categories (4, interval[float64]): [(4.296, 5.2] < (5.2, 6.1] < (6.1, 7.0] < (7.0, 7.9]]
Or make a custom bin array to divide the sepal length measurements into integer-sized bins from 4 through 8:
custom_bin_array = np.linspace(4, 8, 5)
custom_bin_array
array([4., 5., 6., 7., 8.])
Copy the Iris data set, and apply the binning to it:
iris2 = irisdf.copy()
iris2['sepal_length'] = pd.cut(iris2['sepal_length'], custom_bin_array)
iris2['sepal_length'].head()
0 (5.0, 6.0]
1 (4.0, 5.0]
2 (4.0, 5.0]
3 (4.0, 5.0]
4 (4.0, 5.0]
Name: sepal_length, dtype: category
Categories (4, interval[float64]): [(4.0, 5.0] < (5.0, 6.0] < (6.0, 7.0] < (7.0, 8.0]]
Then plot the binned data:
plt.style.use('ggplot')
categories = iris2['sepal_length'].cat.categories
ind = np.array([x for x, _ in enumerate(categories)])
plt.bar(ind, iris2.groupby('sepal_length').size(), width=0.5, label='Sepal length')
plt.xticks(ind, categories)
plt.show()
This example was adapted from http://benalexkeen.com/bucketing-continuous-variables-in-pandas/.
Data preparation¶
Anaconda Enterprise supports data preparation using numeric libraries such as NumPy, SciPy, and Pandas.
These examples use this small data file vendors.csv:
Vendor Number,Vendor Name,Month,Day,Year,Active,Open Orders,2015,2016,Percent Growth
"104.0",ACME Inc,2,15,2014,"Y",200,"$45,000.00",$54000.00,20.00%
205,Apogee LTD,8,12,2015,"Y",150,"$29,000.00","$30,450.00",5.00%
143,Zenith Co,4,5,2014,"Y",290,"$18,000.00",$23400.00,30.00%
166,Hollerith Propulsion,9,25,2015,"Y",180,"$48,000.00",$48960.00,2.00%
180,Airtek Industrial,8,2,2014,"N",Closed,"$23,000.00",$17250.00,-25.00%
The columns are the vendor ID number, vendor name, month day and year of first purchase from the vendor, whether the account is currently active, the number of open orders, purchases in 2015 and 2016, and percent growth in orders from 2015 to 2016.
Converting data types¶
Computers handle many types of data, including integer numbers such as 365, floating point numbers such as 365.2425, strings such as “ACME Inc”, and more.
An operation such as division may work for integers and floating point numbers, but produce an error if used on strings.
Often data libraries such as pandas will automatically use the correct types, but they do provide ways to correct and change the types when needed. For example, you may wish to convert between an integer such as 25, the floating point number 25.0, and strings such as “25”, “25.0”, or “$25.00”.
Pandas data types or dtypes correspond to similar Python types.
Strings are called str in Python and object in pandas.
Integers are called int in Python and int64 in pandas, indicating that
pandas stores integers as 64-bit numbers.
Floating point numbers are called float in Python and float64 in pandas,
also indicating that they are stored with 64 bits.
A boolean value, named for logician George Boole, can be either True or False.
These are called bool in Python and bool in pandas.
Pandas includes some data types with no corresponding native Python type:
datetime64 for date and time values, timedelta[ns] for storing the
difference between two times as a number of nanoseconds, and category where
each item is one of a list of strings.
Here we import the vendor data file and show the dtypes:
import pandas as pd
import numpy as np
df = pd.read_csv('vendors.csv')
df.dtypes
Vendor Number float64
Vendor Name object
Month int64
Day int64
Year int64
Active object
Open Orders object
2015 object
2016 object
Percent Growth object
dtype: object
Try adding the 2015 and 2016 sales:
df['2015']+df['2016']
0 $45,000.00$54000.00
1 $29,000.00$30,450.00
2 $18,000.00$23400.00
3 $48,000.00$48960.00
4 $23,000.00$17250.00
dtype: object
These columns were stored as the type “object”, and concatenated as strings, not added as numbers.
Examine more information about the DataFrame:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Vendor Number 5 non-null float64
Vendor Name 5 non-null object
Month 5 non-null int64
Day 5 non-null int64
Year 5 non-null int64
Active 5 non-null object
Open Orders 5 non-null object
2015 5 non-null object
2016 5 non-null object
Percent Growth 5 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 480.0+ bytes
Vendor Number is a float and not an int. 2015 and 2016 sales, percent
growth, and open orders are stored as objects and not numbers. The month, day,
and year values should be converted to datetime64, and the active column
should be converted to a boolean.
The data can be converted with the astype() function, custom functions, or
pandas functions such as to_numeric() or to_datetime().
astype()¶
The astype() function can convert the Vendor Number column to int:
df['Vendor Number'].astype('int')
0 104
1 205
2 143
3 166
4 180
Name: Vendor Number, dtype: int64
astype() returns a copy, so to convert the original data you must assign the result back to the column. You can verify the conversion by showing the dtypes.
df['Vendor Number'] = df['Vendor Number'].astype('int')
df.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active object
Open Orders object
2015 object
2016 object
Percent Growth object
dtype: object
However, trying to convert the 2015 column to a float or the Open Orders
column to an int returns an error.
df['2015'].astype('float')
ValueError: could not convert string to float: '$23,000.00'
df['Open Orders'].astype('int')
ValueError: invalid literal for int() with base 10: 'Closed'
Even worse, trying to convert the Active column to a bool completes with no
errors, but converts both Y and N values to True, because any non-empty
string is treated as True.
df['Active'].astype('bool')
0 True
1 True
2 True
3 True
4 True
Name: Active, dtype: bool
astype() works if the data is clean and can be interpreted simply as a
number, or if you want to convert a number to a string. Other conversions
require custom functions or pandas functions such as to_numeric() or
to_datetime().
Custom conversion functions¶
This small custom function converts a currency string like the ones in the 2015
column to a float by first removing the comma (,) and dollar sign ($)
characters.
def currency_to_float(a):
    return float(a.replace(',', '').replace('$', ''))
Test the function on the 2015 column with the apply() function:
df['2015'].apply(currency_to_float)
0 45000.0
1 29000.0
2 18000.0
3 48000.0
4 23000.0
Name: 2015, dtype: float64
Convert the 2015 and 2016 columns and show the dtypes:
df['2015'] = df['2015'].apply(currency_to_float)
df['2016'] = df['2016'].apply(currency_to_float)
df.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active object
Open Orders object
2015 float64
2016 float64
Percent Growth object
dtype: object
Convert the Percent Growth column:
def percent_to_float(a):
    return float(a.replace('%', '')) / 100
df['Percent Growth'].apply(percent_to_float)
0 0.20
1 0.05
2 0.30
3 0.02
4 -0.25
Name: Percent Growth, dtype: float64
df['Percent Growth'] = df['Percent Growth'].apply(percent_to_float)
df.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active object
Open Orders object
2015 float64
2016 float64
Percent Growth float64
dtype: object
NumPy’s np.where() function is a good way to convert the Active column to
bool. This code converts “Y” values to True and all other values to
False, then shows the dtypes:
np.where(df["Active"] == "Y", True, False)
array([ True, True, True, True, False])
df["Active"] = np.where(df["Active"] == "Y", True, False)
df.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active bool
Open Orders object
2015 float64
2016 float64
Percent Growth float64
dtype: object
Pandas helper functions¶
The Open Orders column has several integers, but one string. Using astype()
on this column would produce an error, but the pd.to_numeric() function
built into pandas will convert the numeric values to numbers and any other
values to the "not a number" (NaN) value defined by the floating point
standard:
pd.to_numeric(df['Open Orders'], errors='coerce')
0 200.0
1 150.0
2 290.0
3 180.0
4 NaN
Name: Open Orders, dtype: float64
In this case, a non-numeric value in this field indicates that there are zero
open orders, so we can convert NaN values to zero with the function
fillna():
pd.to_numeric(df['Open Orders'], errors='coerce').fillna(0)
0 200.0
1 150.0
2 290.0
3 180.0
4 0.0
Name: Open Orders, dtype: float64
Similarly, the pd.to_datetime() function built into pandas can convert the
Month, Day, and Year columns to datetime64[ns]:
pd.to_datetime(df[['Month', 'Day', 'Year']])
0 2014-02-15
1 2015-08-12
2 2014-04-05
3 2015-09-25
4 2014-08-02
dtype: datetime64[ns]
Use these functions to change the DataFrame, then show the dtypes:
df['Open Orders'] = pd.to_numeric(df['Open Orders'], errors='coerce').fillna(0)
df['First Purchase Date'] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active bool
Open Orders float64
2015 float64
2016 float64
Percent Growth float64
First Purchase Date datetime64[ns]
dtype: object
Converting data as it is read¶
You can apply dtype and converters in the pd.read_csv() function.
Defining dtype is like performing astype() on the data.
A dtype or a converter can only be applied once to a specified column.
If you try to apply both to the same column, the dtype is skipped.
After converting as much of the data as possible in pd.read_csv(), use code
similar to the previous examples to convert the rest.
df2 = pd.read_csv('vendors.csv',
dtype={'Vendor Number': 'int'},
converters={'2015': currency_to_float,
'2016': currency_to_float,
'Percent Growth': percent_to_float})
df2["Active"] = np.where(df2["Active"] == "Y", True, False)
df2['Open Orders'] = pd.to_numeric(df2['Open Orders'], errors='coerce').fillna(0)
df2['First Purchase Date'] = pd.to_datetime(df2[['Month', 'Day', 'Year']])
df2
Vendor Number Vendor Name Month Day Year Active Open Orders 2015 2016 Percent Growth First Purchase Date
0 104 ACME Inc 2 15 2014 True 200.0 45000.0 54000.0 0.20 2014-02-15
1 205 Apogee LTD 8 12 2015 True 150.0 29000.0 30450.0 0.05 2015-08-12
2 143 Zenith Co 4 5 2014 True 290.0 18000.0 23400.0 0.30 2014-04-05
3 166 Hollerith Propulsion 9 25 2015 True 180.0 48000.0 48960.0 0.02 2015-09-25
4 180 Airtek Industrial 8 2 2014 False 0.0 23000.0 17250.0 -0.25 2014-08-02
df2.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active bool
Open Orders float64
2015 float64
2016 float64
Percent Growth float64
First Purchase Date datetime64[ns]
dtype: object
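To see the precedence rule described above, you can deliberately target the same column with both a dtype and a converter. This is a sketch assuming the same vendors.csv file and the currency_to_float function defined earlier:

# The converter is applied and the dtype is skipped
# (recent pandas versions also emit a ParserWarning)
df3 = pd.read_csv('vendors.csv',
                  dtype={'2015': str},
                  converters={'2015': currency_to_float})
print(df3['2015'].dtype)  # float64, showing the converter was applied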
We thank http://pbpython.com/pandas_dtypes.html for providing data preparation examples that inspired these examples.
Merging and joining data sets¶
You can use pandas to merge DataFrames:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on='key')
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
The available merge methods are left to use keys from the left frame only,
right to use keys from the right frame only, outer to use the union of
keys from both frames, and the default inner to use the intersection of keys
from both frames.
This merge using the default inner join omits key combinations found in only one of the source DataFrames:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
This example omits the rows with (key1, key2) pairs of (K0, K1), (K2, K1),
and (K2, K0), because those key combinations appear in only one of the source
DataFrames.
Joins also copy information when necessary. The left DataFrame had one row with
the keys set to K1, K0 and the right DataFrame had two. The output DataFrame
has two, with the information from the left DataFrame copied into both rows.
The next example shows the results of a left, right, and outer merge on the same inputs. Empty cells are filled in with NaN values.
pd.merge(left, right, how='left', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
pd.merge(left, right, how='right', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
pd.merge(left, right, how='outer', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
If a key combination appears more than once in both tables, the output will contain the Cartesian product of the associated data.
In this small example a key that appears twice in the left frame and three times in the right frame produces six rows in the output frame.
left = pd.DataFrame({'A' : [1,2], 'B' : [2, 2]})
right = pd.DataFrame({'A' : [4,5,6], 'B': [2,2,2]})
pd.merge(left, right, on='B', how='outer')
A_x B A_y
0 1 2 4
1 1 2 5
2 1 2 6
3 2 2 4
4 2 2 5
5 2 2 6
To prevent very large outputs and memory overflow, manage duplicate values in keys before joining large DataFrames.
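One way to do this, sketched here using the small frames above, is to drop duplicate key values from one side before merging:

# Keep only the first row for each value of the key column B
right_unique = right.drop_duplicates(subset='B')
pd.merge(left, right_unique, on='B', how='outer')

This produces one output row per left-frame row instead of the full Cartesian product.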
While merging uses one or more columns as keys, joining uses the indexes, also known as row labels.
Join can also perform left, right, inner, and outer merges, and defaults to left.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
left.join(right)
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
left.join(right, how='outer')
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
left.join(right, how='inner')
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
This is equivalent to using merge with arguments instructing it to use the indexes:
pd.merge(left, right, left_index=True, right_index=True, how='inner')
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
You can join a frame indexed by a join key to a frame where the key is a column:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'C': ['C0', 'C1'],
'D': ['D0', 'D1']},
index=['K0', 'K1'])
left.join(right, on='key')
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K0 C0 D0
3 A3 B3 K1 C1 D1
You can join on multiple keys if the passed DataFrame has a MultiIndex:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1']})
index = pd.MultiIndex.from_tuples([('K0', 'K0'), ('K1', 'K0'),
('K2', 'K0'), ('K2', 'K1')])
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=index)
right
C D
K0 K0 C0 D0
K1 K0 C1 D1
K2 K0 C2 D2
K1 C3 D3
left.join(right, on=['key1', 'key2'])
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A3 B3 K2 K1 C3 D3
Note that this defaulted to a left join, but other types are also available:
left.join(right, on=['key1', 'key2'], how='inner')
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
2 A2 B2 K1 K0 C1 D1
3 A3 B3 K2 K1 C3 D3
For more information, including examples of using merge to join a single index to a multi-index or to join two multi-indexes, see the pandas documentation on merging.
When column names in the input frames overlap, pandas appends suffixes to
disambiguate them. These default to _x and _y but you can customize
them:
left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]})
right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]})
pd.merge(left, right, on='k')
k v_x v_y
0 K0 1 4
1 K0 1 5
pd.merge(left, right, on='k', suffixes=['_l', '_r'])
k v_l v_r
0 K0 1 4
1 K0 1 5
Join has similar arguments lsuffix and rsuffix.
left = left.set_index('k')
right = right.set_index('k')
left.join(right, lsuffix='_l', rsuffix='_r')
v_l v_r
k
K0 1 4.0
K0 1 5.0
K1 2 NaN
K2 3 NaN
You can join a list or tuple of DataFrames on their indexes:
right2 = pd.DataFrame({'v': [7, 8, 9]}, index=['K1', 'K1', 'K2'])
left.join([right, right2])
v_x v_y v
K0 1 4.0 NaN
K0 1 5.0 NaN
K1 2 NaN 7.0
K1 2 NaN 8.0
K2 3 NaN 9.0
If you have two frames with similar indices and want to fill in missing values
in the left frame with values from the right frame, use the combine_first()
method:
df1 = pd.DataFrame([[np.nan, 3., 5.],
[-4.6, np.nan, np.nan],
[np.nan, 7., np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2],
[-5., 1.6, 4]],
index=[1, 2])
df1.combine_first(df2)
0 1 2
0 NaN 3.0 5.0
1 -4.6 NaN -8.2
2 -5.0 7.0 4.0
The method update() overwrites values in a frame with values from another
frame:
df1.update(df2)
df1
0 1 2
0 NaN 3.0 5.0
1 -42.6 NaN -8.2
2 -5.0 1.6 4.0
The pandas documentation on merging has more information, including examples of combining time series and other ordered data, with options to fill and interpolate missing data.
We thank the pandas documentation for many of these examples.
Filtering data¶
This example uses a vendors DataFrame similar to the one we used above:
import pandas as pd
import numpy as np
df = pd.DataFrame({'VendorNumber': [104, 205, 143, 166, 180],
'VendorName': ['ACME Inc', 'Apogee LTD', 'Zenith Co', 'Hollerith Propulsion', 'Airtek Industrial'],
'Active': [True, True, True, True, False],
'OpenOrders': [200, 150, 290, 180, 0],
'Purchases2015': [45000.0, 29000.0, 18000.0, 48000.0, 23000.0],
'Purchases2016': [54000.0, 30450.0, 23400.0, 48960.0, 17250.0],
'PercentGrowth': [0.20, 0.05, 0.30, 0.02, -0.25],
'FirstPurchaseDate': ['2014-02-15', '2015-08-12', '2014-04-05', '2015-09-25', '2014-08-02']})
df['FirstPurchaseDate'] = df['FirstPurchaseDate'].astype('datetime64[ns]')
df
VendorNumber VendorName Active OpenOrders Purchases2015 Purchases2016 PercentGrowth FirstPurchaseDate
0 104 ACME Inc True 200 45000.0 54000.0 0.20 2014-02-15
1 205 Apogee LTD True 150 29000.0 30450.0 0.05 2015-08-12
2 143 Zenith Co True 290 18000.0 23400.0 0.30 2014-04-05
3 166 Hollerith Propulsion True 180 48000.0 48960.0 0.02 2015-09-25
4 180 Airtek Industrial False 0 23000.0 17250.0 -0.25 2014-08-02
To filter only certain rows from a DataFrame, call the query method with a
boolean expression based on the column names.
df.query('OpenOrders>160')
VendorNumber VendorName Active OpenOrders Purchases2015 Purchases2016 PercentGrowth FirstPurchaseDate
0 104 ACME Inc True 200 45000.0 54000.0 0.20 2014-02-15
2 143 Zenith Co True 290 18000.0 23400.0 0.30 2014-04-05
3 166 Hollerith Propulsion True 180 48000.0 48960.0 0.02 2015-09-25
Filtering can be done with indices instead of queries:
df[(df.OpenOrders < 190) & (df.Active == True)]
VendorNumber VendorName Active OpenOrders Purchases2015 Purchases2016 PercentGrowth FirstPurchaseDate
1 205 Apogee LTD True 150 29000.0 30450.0 0.05 2015-08-12
3 166 Hollerith Propulsion True 180 48000.0 48960.0 0.02 2015-09-25
Using statistics¶
Anaconda Enterprise supports statistical work using the R language and Python libraries such as NumPy, SciPy, Pandas, Statsmodels, and scikit-learn.
The following Python examples, drawn from a Jupyter notebook, show how to use these libraries to calculate correlations, examine distributions, fit regressions, and perform principal component analysis.
These examples also include plots produced with the libraries seaborn and Matplotlib.
We have adapted some of the code in these examples from the sites credited inline below.
Start by importing necessary libraries and functions, including Pandas, SciPy, scikit-learn, Statsmodels, seaborn, and Matplotlib.
This code imports load_boston to provide the Boston housing dataset from the
datasets included with scikit-learn.
import pandas as pd
import seaborn as sns
from scipy.stats import pearsonr
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import statsmodels.formula.api as sm
%matplotlib inline
Load the Boston housing data into a Pandas DataFrame:
#Load dataset and convert it to a Pandas dataframe
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['target'] = boston.target
In the Boston housing dataset, the target variable is MEDV, the median home value.
Print the dataset description:
#Description of the dataset
print(boston.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
Show the first five records of the dataset:
#Check the first five records
df.head()
row CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
=== ======= ==== ===== ==== ===== ===== ==== ====== === ===== ======= ====== ===== ======
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
Show summary statistics for each variable: count, mean, standard deviation, minimum, 25th 50th and 75th percentiles, and maximum.
#Descriptions of each variable
df.describe()
stat CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
===== ========== ========== ========== ========== ========== ========== ========== ========== ========== ========== ========== ========== ========== ==========
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000
Correlation matrix¶
The correlation matrix lists the correlation of each variable with each other variable.
Positive correlations mean one variable tends to be high when the other is high, and negative correlations mean one variable tends to be high when the other is low.
Correlations close to zero are weak and cause a variable to have less influence in the model, and correlations close to one or negative one are strong and cause a variable to have more influence in the model.
# Show the basic correlation matrix
corr = df.corr()
corr
variable CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
======== ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =========
CRIM 1.000000 -0.199458 0.404471 -0.055295 0.417521 -0.219940 0.350784 -0.377904 0.622029 0.579564 0.288250 -0.377365 0.452220 -0.385832
ZN -0.199458 1.000000 -0.533828 -0.042697 -0.516604 0.311991 -0.569537 0.664408 -0.311948 -0.314563 -0.391679 0.175520 -0.412995 0.360445
INDUS 0.404471 -0.533828 1.000000 0.062938 0.763651 -0.391676 0.644779 -0.708027 0.595129 0.720760 0.383248 -0.356977 0.603800 -0.483725
CHAS -0.055295 -0.042697 0.062938 1.000000 0.091203 0.091251 0.086518 -0.099176 -0.007368 -0.035587 -0.121515 0.048788 -0.053929 0.175260
NOX 0.417521 -0.516604 0.763651 0.091203 1.000000 -0.302188 0.731470 -0.769230 0.611441 0.668023 0.188933 -0.380051 0.590879 -0.427321
RM -0.219940 0.311991 -0.391676 0.091251 -0.302188 1.000000 -0.240265 0.205246 -0.209847 -0.292048 -0.355501 0.128069 -0.613808 0.695360
AGE 0.350784 -0.569537 0.644779 0.086518 0.731470 -0.240265 1.000000 -0.747881 0.456022 0.506456 0.261515 -0.273534 0.602339 -0.376955
DIS -0.377904 0.664408 -0.708027 -0.099176 -0.769230 0.205246 -0.747881 1.000000 -0.494588 -0.534432 -0.232471 0.291512 -0.496996 0.249929
RAD 0.622029 -0.311948 0.595129 -0.007368 0.611441 -0.209847 0.456022 -0.494588 1.000000 0.910228 0.464741 -0.444413 0.488676 -0.381626
TAX 0.579564 -0.314563 0.720760 -0.035587 0.668023 -0.292048 0.506456 -0.534432 0.910228 1.000000 0.460853 -0.441808 0.543993 -0.468536
PTRATIO 0.288250 -0.391679 0.383248 -0.121515 0.188933 -0.355501 0.261515 -0.232471 0.464741 0.460853 1.000000 -0.177383 0.374044 -0.507787
B -0.377365 0.175520 -0.356977 0.048788 -0.380051 0.128069 -0.273534 0.291512 -0.444413 -0.441808 -0.177383 1.000000 -0.366087 0.333461
LSTAT 0.452220 -0.412995 0.603800 -0.053929 0.590879 -0.613808 0.602339 -0.496996 0.488676 0.543993 0.374044 -0.366087 1.000000 -0.737663
target -0.385832 0.360445 -0.483725 0.175260 -0.427321 0.695360 -0.376955 0.249929 -0.381626 -0.468536 -0.507787 0.333461 -0.737663 1.000000
Format with asterisks¶
Format the correlation matrix by rounding the numbers to two decimal places and adding asterisks to denote statistical significance:
def calculate_pvalues(df):
    df = df.select_dtypes(include=['number'])
    pairs = pd.MultiIndex.from_product([df.columns, df.columns])
    pvalues = [pearsonr(df[a], df[b])[1] for a, b in pairs]
    pvalues = pd.Series(pvalues, index=pairs).unstack().round(4)
    return pvalues

# code adapted from https://stackoverflow.com/questions/25571882/pandas-columns-correlation-with-statistical-significance/49040342
def correlation_matrix(df, columns):
    rho = df[columns].corr()
    pval = calculate_pvalues(df[columns])
    # create four formatted versions, with zero to three asterisks
    r0 = rho.applymap(lambda x: '{:.2f}'.format(x))
    r1 = rho.applymap(lambda x: '{:.2f}*'.format(x))
    r2 = rho.applymap(lambda x: '{:.2f}**'.format(x))
    r3 = rho.applymap(lambda x: '{:.2f}***'.format(x))
    # apply marks: each later mask overwrites the earlier ones, so every
    # cell ends up with the asterisks for its significance level
    rho = rho.mask(pval > 0.01, r0)
    rho = rho.mask(pval <= 0.1, r1)
    rho = rho.mask(pval <= 0.05, r2)
    rho = rho.mask(pval <= 0.01, r3)
    return rho

columns = df.columns
correlation_matrix(df, columns)
variable CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
======== ======== ======== ======== ======== ======== ======== ======== ======== ======== ======== ======== ======== ======== ========
CRIM 1.00*** -0.20*** 0.40*** -0.06 0.42*** -0.22*** 0.35*** -0.38*** 0.62*** 0.58*** 0.29*** -0.38*** 0.45*** -0.39***
ZN -0.20*** 1.00*** -0.53*** -0.04 -0.52*** 0.31*** -0.57*** 0.66*** -0.31*** -0.31*** -0.39*** 0.18*** -0.41*** 0.36***
INDUS 0.40*** -0.53*** 1.00*** 0.06 0.76*** -0.39*** 0.64*** -0.71*** 0.60*** 0.72*** 0.38*** -0.36*** 0.60*** -0.48***
CHAS -0.06 -0.04 0.06 1.00*** 0.09** 0.09** 0.09* -0.10** -0.01 -0.04 -0.12*** 0.05 -0.05 0.18***
NOX 0.42*** -0.52*** 0.76*** 0.09** 1.00*** -0.30*** 0.73*** -0.77*** 0.61*** 0.67*** 0.19*** -0.38*** 0.59*** -0.43***
RM -0.22*** 0.31*** -0.39*** 0.09** -0.30*** 1.00*** -0.24*** 0.21*** -0.21*** -0.29*** -0.36*** 0.13*** -0.61*** 0.70***
AGE 0.35*** -0.57*** 0.64*** 0.09* 0.73*** -0.24*** 1.00*** -0.75*** 0.46*** 0.51*** 0.26*** -0.27*** 0.60*** -0.38***
DIS -0.38*** 0.66*** -0.71*** -0.10** -0.77*** 0.21*** -0.75*** 1.00*** -0.49*** -0.53*** -0.23*** 0.29*** -0.50*** 0.25***
RAD 0.62*** -0.31*** 0.60*** -0.01 0.61*** -0.21*** 0.46*** -0.49*** 1.00*** 0.91*** 0.46*** -0.44*** 0.49*** -0.38***
TAX 0.58*** -0.31*** 0.72*** -0.04 0.67*** -0.29*** 0.51*** -0.53*** 0.91*** 1.00*** 0.46*** -0.44*** 0.54*** -0.47***
PTRATIO 0.29*** -0.39*** 0.38*** -0.12*** 0.19*** -0.36*** 0.26*** -0.23*** 0.46*** 0.46*** 1.00*** -0.18*** 0.37*** -0.51***
B -0.38*** 0.18*** -0.36*** 0.05 -0.38*** 0.13*** -0.27*** 0.29*** -0.44*** -0.44*** -0.18*** 1.00*** -0.37*** 0.33***
LSTAT 0.45*** -0.41*** 0.60*** -0.05 0.59*** -0.61*** 0.60*** -0.50*** 0.49*** 0.54*** 0.37*** -0.37*** 1.00*** -0.74***
target -0.39*** 0.36*** -0.48*** 0.18*** -0.43*** 0.70*** -0.38*** 0.25*** -0.38*** -0.47*** -0.51*** 0.33*** -0.74*** 1.00***
Heatmap¶
Heatmap of the correlation matrix:
sns.heatmap(corr,
xticklabels=corr.columns,
yticklabels=corr.columns)
Target variable distribution¶
Histogram showing the distribution of the target variable. In this dataset this is “Median value of owner-occupied homes in $1000’s”, abbreviated MEDV.
plt.hist(df['target'])
plt.show()
Simple linear regression¶
The variable MEDV is the target that the model predicts. All other variables are used as predictors, also called features.
The target variable is continuous, so use a linear regression instead of a logistic regression.
# Define features as X, target as y.
X = df.drop('target', axis='columns')
y = df['target']
Split the dataset into a training set and a test set:
# Splitting the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
A linear regression consists of a coefficient for each feature and one intercept.
To make a prediction, each feature is multiplied by its coefficient. The intercept and all of these products are added together. This sum is the predicted value of the target variable.
The residual sum of squares (RSS) is calculated to measure the difference between the prediction and the actual value of the target variable.
The function fit calculates the coefficients and intercept that minimize the
RSS when the regression is used on each record in the training set.
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# The intercept
print('Intercept: \n', regressor.intercept_)
# The coefficients
print('Coefficients: \n', pd.Series(regressor.coef_, index=X.columns, name='coefficients'))
Intercept:
36.98045533762056
Coefficients:
CRIM -0.116870
ZN 0.043994
INDUS -0.005348
CHAS 2.394554
NOX -15.629837
RM 3.761455
AGE -0.006950
DIS -1.435205
RAD 0.239756
TAX -0.011294
PTRATIO -0.986626
B 0.008557
LSTAT -0.500029
Name: coefficients, dtype: float64
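As a quick check of the description above, you can reproduce one prediction by hand from the intercept and coefficients. This is a minimal sketch assuming the X_test split created earlier:

import numpy as np

# The prediction is the intercept plus the sum of each feature
# multiplied by its coefficient
first_row = X_test.iloc[0]
manual = regressor.intercept_ + np.dot(first_row.values, regressor.coef_)
print(manual)
print(regressor.predict(X_test.iloc[[0]])[0])  # matches the manual sum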
Now check the accuracy when this linear regression is used on new data that it was not trained on. That new data is the test set.
# Predicting the Test set results
y_pred = regressor.predict(X_test)
# Visualising the Test set results
# code adapted from https://joomik.github.io/Housing/
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, color='green')
ax.set(
xlabel="Prices: $Y_i$",
ylabel="Predicted prices: $\hat{Y}_i$",
title="Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$",
)
plt.show()
This scatter plot shows that the regression is a good predictor of the data in the test set.
The mean squared error quantifies this performance:
# The mean squared error as a way to measure model performance.
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
Mean squared error: 29.79
Ordinary least squares (OLS) regression with Statsmodels¶
model = sm.ols('target ~ AGE + B + CHAS + CRIM + DIS + INDUS + LSTAT + NOX + PTRATIO + RAD + RM + TAX + ZN', df)
result = model.fit()
result.summary()
OLS Regression Results
==============================================================================
Dep. Variable: target R-squared: 0.741
Model: OLS Adj. R-squared: 0.734
Method: Least Squares F-statistic: 108.1
Date: Thu, 23 Aug 2018 Prob (F-statistic): 6.95e-135
Time: 07:29:16 Log-Likelihood: -1498.8
No. Observations: 506 AIC: 3026.
Df Residuals: 492 BIC: 3085.
Df Model: 13
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 36.4911 5.104 7.149 0.000 26.462 46.520
AGE 0.0008 0.013 0.057 0.955 -0.025 0.027
B 0.0094 0.003 3.500 0.001 0.004 0.015
CHAS 2.6886 0.862 3.120 0.002 0.996 4.381
CRIM -0.1072 0.033 -3.276 0.001 -0.171 -0.043
DIS -1.4758 0.199 -7.398 0.000 -1.868 -1.084
INDUS 0.0209 0.061 0.339 0.735 -0.100 0.142
LSTAT -0.5255 0.051 -10.366 0.000 -0.625 -0.426
NOX -17.7958 3.821 -4.658 0.000 -25.302 -10.289
PTRATIO -0.9535 0.131 -7.287 0.000 -1.211 -0.696
RAD 0.3057 0.066 4.608 0.000 0.175 0.436
RM 3.8048 0.418 9.102 0.000 2.983 4.626
TAX -0.0123 0.004 -3.278 0.001 -0.020 -0.005
ZN 0.0464 0.014 3.380 0.001 0.019 0.073
==============================================================================
Omnibus: 178.029 Durbin-Watson: 1.078
Prob(Omnibus): 0.000 Jarque-Bera (JB): 782.015
Skew: 1.521 Prob(JB): 1.54e-170
Kurtosis: 8.276 Cond. No. 1.51e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Principal component analysis¶
The initial dataset has a number of feature or predictor variables and one target variable to predict.
Principal component analysis (PCA) converts these features into a set of principal components, which are linearly uncorrelated variables.
The first principal component has the largest possible variance and therefore accounts for as much of the variability in the data as possible.
Each of the other principal components is orthogonal to all of its preceding components, but has the largest possible variance within that constraint.
Graphing a dataset by showing only the first two or three of the principal components effectively projects a complex dataset with high dimensionality into a simpler image that shows as much of the variance in the data as possible.
PCA is sensitive to the relative scaling of the original variables, so begin by scaling them:
# Feature Scaling
x = StandardScaler().fit_transform(X)
Calculate the first three principal components and show them for the first five rows of the housing dataset:
# Project data to 3 dimensions
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(
data = principalComponents,
columns = ['principal component 1', 'principal component 2', 'principal component 3'])
principalDf.head()
| row | principal component 1 | principal component 2 | principal component 3 |
|---|---|---|---|
| 0 | -2.097842 | 0.777102 | 0.335076 |
| 1 | -1.456412 | 0.588088 | -0.701340 |
| 2 | -2.074152 | 0.602185 | 0.161234 |
| 3 | -2.611332 | -0.005981 | -0.101940 |
| 4 | -2.457972 | 0.098860 | -0.077893 |
Show a 2D graph of this data:
plt.scatter(principalDf['principal component 1'], principalDf['principal component 2'], color ='green')
plt.show()
Show a 3D graph of this data:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(principalDf['principal component 1'], principalDf['principal component 2'], principalDf['principal component 3'])
plt.show()
Measure how much of the variance is explained by each of the three components:
# Variance explained by each component
explained_variance = pca.explained_variance_ratio_
explained_variance
array([0.47097344, 0.11015872, 0.09547408])
Each value will be less than or equal to the previous value, and each value will be in the range from 0 through 1.
The sum of these three values shows the fraction of the total variance explained by the three principal components, in the range from 0 (none) through 1 (all):
sum(explained_variance)
0.6766062376563704
Predict the target variable using only the three principal components:
y_test_linear = y_test
y_pred_linear = y_pred
X = principalDf
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
Plot the predictions from the linear regression in green again, and the new predictions in blue:
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, color='skyblue')
ax.scatter(y_test_linear, y_pred_linear, color = 'green')
ax.set(
xlabel="Prices: $Y_i$",
ylabel="Predicted prices: $\hat{Y}_i$",
title="Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$",
)
plt.show()
The blue points are somewhat more widely scattered, but similar.
Calculate the mean squared error:
print("Linear regression mean squared error: %.2f" % mean_squared_error(y_test_linear, y_pred_linear))
print("PCA mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
Linear regression mean squared error: 29.79
PCA mean squared error: 43.49
Working with deployments¶
When you deploy a project, Anaconda Enterprise finds and builds all of the software dependencies—the libraries on which the project depends in order to run—and encapsulates them, so they are completely self-contained and easy to share with others. This is called a deployment.
Whether you deploy a notebook, Bokeh application or REST API, everything needed to deploy and run the project is included. You can then share your deployment with others so they can interact with it.
Note
You can create multiple deployments from a single project. Each deployment can be a different version, and can be shared with different users.
After logging in to Anaconda Enterprise, click Deployments to view a list of all of the deployments you have created—or that others have shared with you. Simply click on a deployment to open the deployed Notebook or application and interact with it.
Anaconda Enterprise maintains a log of all deployments created by all users in the Administrator’s Authentication Center.
Deploying a project¶
When you are ready to use your interactive visualization, live notebook or machine learning model, you deploy the associated project. You can also deploy someone else’s project if you have been added as a collaborator on the project. See Collaborating on projects for more information.
When you deploy a project, Anaconda Enterprise finds and builds the software dependencies—all of the libraries required for it to run—and encapsulates them so they are completely self-contained. This allows you to easily share it with others.
You configure how a project is deployed by adding the appropriate command to run the project in the configuration file anaconda-project.yml, or you can accept the project's default command.
See Configuring project settings for more information about adding deployment commands to project.
To deploy a project:¶
Select it in the Projects list and click Deploy.
Choose the runtime resources your project requires to run from the Resource Profile drop-down, or accept the default. Your Administrator configures the options in this list, so check with them if you aren’t sure.
If there are multiple versions of the project, select the version you want to deploy.
Select the command to use to deploy the project. If there is no deployment command listed, you cannot deploy the project.
Return to the project and add a deployment command, or ask the project owner to do so if it’s not your project. See Configuring project settings for more information about adding deployment commands.
Enter the URL where you want the deployment to be hosted in the Static URL field.
Note
This is the URL you’ll use to call the deployment from within a web application, and therefore it must be unique. Disable the Static URL toggle if you want Anaconda Enterprise to automatically generate a URL for the deployment.
Choose whether you want to keep the deployment Private, and therefore accessible only to authenticated platform users, or make it Public and available to non-authenticated users. After it's deployed, you can share the deployment with others.

Click Deploy. Anaconda Enterprise displays the status of the deployment, then lists it in the project's Deployments. Private deployments are displayed with a lock icon next to their name, to indicate their secure status.
Note
It may take a few minutes to obtain and build all the dependencies for the project deployment.
To view or interact with a deployment, click its name in the list.
You can also schedule a project to be deployed on a regular basis or at a specific time.
Deploying a REST API¶
Anaconda Enterprise enables you to deploy your machine learning or predictive models as a REST API endpoint so others can query and consume results from them. REST APIs are web server endpoints, or callable URLs, which provide results based on a query, allowing developers to create applications that programmatically query and consume them via other user interfaces or applications.
Rather than sharing your model with other data scientists and having them run it, you can give them an endpoint to query the model, which you can continue to update, improve and redeploy as needed.
REST API endpoints deployed with Anaconda Enterprise are secure and only accessible to users that you’ve shared the deployment with or users that have generated a token that can be used to query the REST API endpoint outside of Anaconda Enterprise.
The process of deploying a REST API involves the following steps:
Create a project to encapsulate all of the components necessary to use or run your model.
Deploy the project with the rest_api command (shown in Step 4 below) to build the software dependencies (all of the libraries required for it to run) and encapsulate them so they are completely self-contained.

Share the deployment, or the URL of the endpoint, and generate a unique token so that others can connect to the deployment and use it from within notebooks, APIs, or other applications.
Using the API wrapper¶
As an alternative to using the REST API wrapper provided with Anaconda Enterprise, you can construct an API endpoint using any web framework and serve the endpoint on port 8086 within your deployment, to make it available as a secure REST API endpoint in Anaconda Enterprise.
Follow this process to wrap your code with an API:
Open the Jupyter Notebook and add this code to be able to handle HTTP requests. Define a global REQUEST JSON string that will be replaced on each invocation of the API.
import json

REQUEST = json.dumps({'path': {}, 'args': {}, 'body': {}})
Import the Anaconda Enterprise publish function:

from anaconda_enterprise import publish

@publish(methods=['GET', 'POST'])
def function():
    ...
    return json.dumps(...)
Add the deployment command and an appropriate environment variable to the anaconda-project.yml file:

commands:
  deploy-api:
    rest_api: {notebook}.ipynb
    supports_http_options: true
    default: true
variables:
  KG_FORCE_KERNEL_NAME:
    default: python3
Ensure the anaconda-enterprise channel is listed under channels: and the anaconda-enterprise-web-publisher package is listed under packages:. For example:

packages:
  - python=3.6
  - pandas
  - dask
  - matplotlib
  - scikit-learn
  - requests
  - anaconda-enterprise-web-publisher
channels:
  - defaults
  - anaconda-enterprise
Use the following command to test the API within your notebook session (without deploying it):
anaconda-project run deploy-api
Now if you visit http://localhost:8888/{function} from within a notebook session, you will see the results of your function.

From within a notebook session, execute the following command:

curl localhost:8888/{function}
Click the Deploy icon in the toolbar to deploy the project as an API.
This deploys the notebook as an API which you can then query.
To query externally, create a token and find the URL of the running project.
Example using curl:

export TOKEN="<generated-token-goes-here>"  # save the long string of text in a variable
curl -L -H "Authorization: Bearer $TOKEN" <url-of-project>
The -L option tells curl to follow redirects. The -H option adds a header; in this case it adds the token required to authorize the client to visit that URL.

If you deploy the project as described above, you can add the -X POST option to curl to access that function.
Deploying a Flask application¶
The process of deploying a Flask application (website and REST APIs) on Anaconda Enterprise involves the following:
Configuring Flask to run behind a proxy
Enabling Anaconda Project HTTP command-line arguments
Running Flask on the deployed host and port
Here is a small Flask application that includes the call to .run(). The file is saved to server.py.
This Flask application was written using Blueprints, which is useful for separating components when working with a large Flask application.
Here, the nested block in if __name__ == '__main__' could be in a separate
file from the 'hello' Blueprint.
from flask import Flask, Blueprint

hello = Blueprint('hello', __name__)

@hello.route('/')
def hello_world():
    return 'Hello, World!'

if __name__ == '__main__':
    app = Flask(__name__)
    app.register_blueprint(hello, url_prefix='/')
    app.run()
Running behind an HTTPS proxy¶
Anaconda Enterprise maintains all HTTPS connections into and out of the server and deployed instances. When writing a Flask app, you only need to inform it that it will be accessed from behind the proxy provided by Anaconda Enterprise.
The simplest way to do this is with the ProxyFix function from werkzeug.
More information about proxies is provided here.
from flask import Flask, Blueprint
from werkzeug.contrib.fixers import ProxyFix

hello = Blueprint('hello', __name__)

@hello.route('/')
def hello_world():
    return 'Hello, World!'

if __name__ == '__main__':
    app = Flask(__name__)
    app.register_blueprint(hello, url_prefix='/')
    app.wsgi_app = ProxyFix(app.wsgi_app)
    app.run()
Enabling command-line arguments¶
In your anaconda-project.yml file, you define a deployable command as follows:
commands:
  default:
    unix: python ${PROJECT_DIR}/server.py
    supports_http_options: true
The supports_http_options flag means that server.py is expected to act on the
command-line arguments defined in the Anaconda Project Reference.
This is easily accomplished by adding the following argparse code before calling
app.run() in server.py.
import sys
from argparse import ArgumentParser

# ... the Flask application blueprint

if __name__ == '__main__':
    # arg parser for the standard anaconda-project options
    parser = ArgumentParser(prog="hello_world",
                            description="Simple Flask Application")
    parser.add_argument('--anaconda-project-host', action='append', default=[],
                        help='Hostname to allow in requests')
    parser.add_argument('--anaconda-project-port', action='store', default=8086, type=int,
                        help='Port to listen on')
    parser.add_argument('--anaconda-project-iframe-hosts',
                        action='append',
                        help='Space-separated hosts which can embed us in an iframe per our Content-Security-Policy')
    parser.add_argument('--anaconda-project-no-browser', action='store_true',
                        default=False,
                        help='Disable opening in a browser')
    parser.add_argument('--anaconda-project-use-xheaders',
                        action='store_true',
                        default=False,
                        help='Trust X-headers from reverse proxy')
    parser.add_argument('--anaconda-project-url-prefix', action='store', default='',
                        help='Prefix in front of urls')
    parser.add_argument('--anaconda-project-address',
                        action='store',
                        default='0.0.0.0',
                        help='IP address the application should listen on.')

    args = parser.parse_args()
Running your Flask application¶
The final step is to configure the Flask application with the
Anaconda Project HTTP values and call app.run(). Note that
registering the Blueprint provides a convenient way to deploy
your application without having to rewrite the routes.
Here is the complete code for the Hello World application.
import sys
from flask import Flask, Blueprint
from argparse import ArgumentParser
from werkzeug.contrib.fixers import ProxyFix

hello = Blueprint('hello', __name__)

@hello.route('/')
def hello_world():
    return "Hello, World!"

if __name__ == '__main__':
    # arg parser for the standard anaconda-project options
    parser = ArgumentParser(prog="hello_world",
                            description="Simple Flask Application")
    parser.add_argument('--anaconda-project-host', action='append', default=[],
                        help='Hostname to allow in requests')
    parser.add_argument('--anaconda-project-port', action='store', default=8086, type=int,
                        help='Port to listen on')
    parser.add_argument('--anaconda-project-iframe-hosts',
                        action='append',
                        help='Space-separated hosts which can embed us in an iframe per our Content-Security-Policy')
    parser.add_argument('--anaconda-project-no-browser', action='store_true',
                        default=False,
                        help='Disable opening in a browser')
    parser.add_argument('--anaconda-project-use-xheaders',
                        action='store_true',
                        default=False,
                        help='Trust X-headers from reverse proxy')
    parser.add_argument('--anaconda-project-url-prefix', action='store', default='',
                        help='Prefix in front of urls')
    parser.add_argument('--anaconda-project-address',
                        action='store',
                        default='0.0.0.0',
                        help='IP address the application should listen on.')

    args = parser.parse_args()

    app = Flask(__name__)
    app.register_blueprint(hello, url_prefix=args.anaconda_project_url_prefix)
    app.config['PREFERRED_URL_SCHEME'] = 'https'
    app.wsgi_app = ProxyFix(app.wsgi_app)
    app.run(host=args.anaconda_project_address, port=args.anaconda_project_port)
Sharing deployments¶
After you have deployed a project, you can share the deployment with others. You can share a deployment publicly, with other Anaconda Enterprise users, or both.
Any collaborators you add to your deployment will see your deployment in their Deployments list when they log in to AE.
Note
Your Anaconda Enterprise Administrator creates the users and groups with whom you can share your deployments, so check with them if you need a new group created.
To enable others to reference a deployment from within their code:¶
Rather than sharing your model with other data scientists and having them run it, you can give them an endpoint to query the model, which you can continue to update, improve and redeploy as needed.
Note
If the deployment is going to be used as an endpoint that’s called by other code, you’ll want to provide a static URL when deploying the project, and NOT use an auto-generated URL.
If your deployment is Private, you’ll also need to generate a token that can be used to connect to the associated Notebooks, APIs or other running code. People will need both the deployment URL and the token to access a private deployment. Tokens are powerful and should be protected like passwords.
Click the deployment you want to generate a token for and select Settings in the left menu.
Scroll to the Generate Tokens setting and click Generate. Copy the generated token to the clipboard using the copy icon, or with mouse or keyboard shortcuts like any other text.
You can then share this token, and the Deployment URL, with others to enable them to connect to the deployment from within Notebooks, APIs and other running code.
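For example, a Python client could query a private deployment as follows. The URL and token here are placeholders that you would replace with your own deployment URL and generated token:

import requests

url = "<url-of-deployment>"            # placeholder: your deployment's URL
token = "<generated-token-goes-here>"  # placeholder: your generated token

# The Authorization header carries the token, as in the curl example above
response = requests.get(url, headers={"Authorization": "Bearer " + token})
print(response.status_code)
print(response.text)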
To remove a deployment from the server—thereby making it unavailable to yourself and others—you terminate the deployment. This also frees up its resources.
Scheduling deployments¶
If you want to deploy a project on a regular basis, Anaconda Enterprise enables you to schedule the deployment. For example, you can schedule a deployment that’s resource intensive to run after regular business hours, or to import new data on a weekly basis.
Note
A task that’s run via a scheduled deployment can read data previously committed to the project from an editor session, but cannot be used to commit any new data to it. Any data written to a scheduled deployment’s container will be deleted immediately after the scheduled task runs, so we recommend that you ensure data is read from and written to external data sources.
To schedule a deployment:
Open the project you want to schedule a deployment for by clicking on it in the Projects list.
Click Schedules in the menu on the left.
Click Create a Schedule if it’s the first schedule to be created for the project, or the Schedule button if there are existing schedules.
Give the schedule a meaningful name to help differentiate it from any other schedules.
Specify whether you want to deploy the latest version of the project, or select a particular version.
Specify the Deployment Command to use to deploy the project. Schedules are intended for automatic or non-interactive execution of script files or notebooks, therefore only unix: commands are supported. See an example here.
Note
If there is no deployment command listed, you cannot deploy the project. Return to the project and add a deployment command, or ask the project owner to do so if it’s not your project. See Configuring project settings for more information about adding deployment commands.
Choose the runtime resources your project requires to run from the Resource Profile drop-down, or accept the default. Your Administrator configures the options in this list, so check with them if you aren’t sure.
Use the controls to specify how often and when you want to schedule the deployment, or select Custom and enter a valid cron expression (examples below). To help ensure your schedule runs when you intend it to, we recommend you verify your cron expression before saving your schedule.
Note
All scheduled times are in UTC (Coordinated Universal Time).
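For example, using standard five-field cron syntax (minute, hour, day of month, month, day of week), the expression 30 2 * * 1 runs the deployment every Monday at 02:30 UTC, and 0 22 * * 1-5 runs it at 22:00 UTC on weekdays.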
Alternatively, if you want it to run now—instead of scheduling it—select Run Now.
Click Schedule to create the schedule, and display it in the list of schedules for the project.
Click on a schedule in the list to view and edit its details.
Use the controls above the schedule to pause, edit, or delete a selected schedule.
Note
If you attempt to delete a schedule that is currently running or is scheduled to run, you will be prompted to confirm that you want to force the deletion.
To view a list of all the scheduled deployments that are currently running or have already run, click Runs in the menu on the left.
Select a specific run in the list to enable the controls to refresh, stop or delete it.
Terminating a deployment¶
When a deployment is no longer required, you can terminate it to stop it from running. This will remove it from the server and free up the resources it’s currently using. Terminating a deployment does not affect the original project from which the deployment was created—only the deployment. It does make the deployment unavailable to any users you had shared it with, however.
To terminate a deployment:
Click the top-level Deployments menu item to display all of your deployments.
Click the specific deployment you want to terminate, and click Settings in the menu on the left.
Scroll down until the Terminate button is visible, and click it.
Confirm that you want to stop the deployment. The deployment stops, and is removed from the list of deployments.
Using GPUs in sessions and deployments¶
Anaconda Enterprise enables you to leverage the compute power of graphics processing units (GPUs) from within your editor sessions. To do so, you can select a resource profile that features a GPU when you first create the project, or use the project’s Settings tab to select a resource profile after the project is created.
To enable access to a GPU while running a deployed application, select the appropriate resource profile when you deploy the associated project.
In either case, if the resource profile you need isn’t listed, ask your Administrator to configure one for you to use.
Configuring your user settings¶
Anaconda Enterprise maintains settings related to your user account, based on how the system was configured by your Administrator. There are times when you may need to update the information related to your user account—to change your password, add credentials required to access a version control repository, or add secrets that can be used to access file systems, data stores and other resources implemented by your organization, for example.
To access your account settings, click the User icon
in the upper-right corner and select the Settings option in the pull-down.
Click Advanced Settings to configure the following settings for your Anaconda Enterprise account:
To change the email or name associated with your account, edit the associated field for the Account.
To change the password you use to log in to Anaconda Enterprise, select Password.
To enable two-factor authentication for your account, select Authenticator.
To view a history of your sessions using Anaconda Enterprise, select Sessions. You can also log out of all sessions in one click here.
To view a list of AE applications currently running and the permissions you have been granted, select Applications.
To view a log of all activity related to your account, select Log.
Note
Fields that you are not permitted to edit appear grayed out (disabled).
Configuring access to version control¶
If your Administrator has configured Anaconda Enterprise to use a supported version control repository other than the internal Git server, you’ll need to provide your credentials to be able to access that repository. We recommend you create a non-expiring token, so you retain permanent access to your files from within Anaconda Enterprise.
Your auth token must also have the following permissions:
| External Repository | Permissions Required |
|---|---|
| Bitbucket Enterprise | Admin access for Projects and Repositories |
| GitHub Enterprise | repo:status, repo_deployment, public_repo, repo:invite, and delete_repo |
| GitLab Enterprise | Check the api box |
Note
You’ll be prompted to configure your personal access token when you attempt to create your first project in Anaconda Enterprise, if you haven’t already done so.
Under External Version Control Credentials, click Add.
Enter the username and personal access token you use to access the repository in the relevant fields.
Click Add to update the platform with your credentials.
To manage credentials that you’ve added, click on the command menu for the credentials, then choose whether you want to edit or delete them.
Now that you’ve configured access, you’ll be able to access the repository within your sessions and deployments without having to leave the platform. Anaconda Enterprise creates a repository for each project that you create.
Storing secrets¶
Anaconda Enterprise enables you to securely store information such as user names, passwords, API keys, or authentication tokens. Any secrets you add will be available across sessions and deployments for all projects associated with your account–but the values are not shared with other users.
Secrets are mounted into deployments and sessions as files, where the name of the file matches the name of the secret. Each file stores the value provided for that secret. You can access the contents of these files from within your projects, to access file systems, data stores and other resources implemented by your organization.
Note
We highly recommend you use the secrets store over including credentials in your project, due to the potential security risk associated with storing them in version control.
Under Secrets, click Add.
Enter a Name and Value for the secrets you want to store, then click Add.
Note
Secret names can contain alphanumeric characters and underscores only—not special characters or paths.
Any secrets you add are listed by name. To manage your secrets, click on the command menu for the item, then choose whether you want to edit, delete or copy the name of the secret.
To access credentials you’ve added within a session, deployment, or scheduled job:
Open a new terminal window.
Change directory to the location where the secrets are stored: /var/run/secrets/user_credentials/.
Run cat <credential_key>—replacing credential_key with the actual key name—to display the value you entered when you added the secret.
Use the value to access the file system, data store or other resource as needed. See Loading data for more information.
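To read a secret programmatically instead, the following is a minimal Python sketch, assuming a secret named DB_PASSWORD was added as described above (substitute your own secret name):
from pathlib import Path

secrets_dir = Path('/var/run/secrets/user_credentials')
# Read the secret's value from the file whose name matches the secret
db_password = (secrets_dir / 'DB_PASSWORD').read_text().strip()
# Use the value to authenticate against your database, data store, or other resource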
Visualizations and dashboards¶
Anaconda Enterprise makes it easy for you to create and share interactive data visualizations, live notebooks or machine learning models built using popular libraries such as Bokeh and HoloViews.
To get you started quickly, Anaconda Enterprise provides sample projects of Bokeh applications for clustering and cross filtering data. There are also several examples of AE5 projects that use PyViz here.
Follow these steps to create an interactive plot:
From the Projects view, select Create + > New Project and create a project from the Anaconda 3.6 (v5.0.1) template.
Open the project in a session, select New > Terminal to open a terminal, and run the following command to install the hvplot, panel, pyct, and bokeh packages (the command adds hvplot and panel directly; pyct and bokeh are pulled in as dependencies):
anaconda-project add-packages hvplot panel
Select New > Python 3 to create a new notebook, rename it tips.ipynb, and add the following code to create an interactive plot:
import pandas as pd
import hvplot.pandas
import panel as pn

pn.extension()

df = pd.read_csv('http://bit.ly/tips-csv')
p = df.hvplot.scatter(x='total_bill', y='tip', hover_cols=['sex', 'day', 'size'])
pn.panel(p).servable()
Note
In this example, the data is being read from the Internet. Alternatively, you could download the .csv and upload it to the project.
Open the project’s anaconda-project.yml file, and add the following lines after the description. This is the deployment command that Anaconda Enterprise will use when you deploy the notebook:
commands:
  scatter-plot:
    unix: panel serve tips.ipynb
    supports_http_options: True
Save and commit your changes.
Now you’re ready to deploy the project.
To interact with the notebook—executing its cells without making changes to it—click the deployment’s name.
Tip
To dive deeper into the world of data visualization, follow this HoloViz tutorial.
To view and monitor the logs for the deployment while it’s running, click Logs in the left menu. The app section records the initialization steps and any messages printed to standard output by the command used in your project.
You can also share the deployment with others.
Machine learning and deep learning¶
Anaconda Enterprise facilitates machine learning and deep learning by enabling you to develop models, train them, and deploy them. You can also use AE to query and score models that have been deployed as a REST API.
To help get you started, Anaconda Enterprise includes several sample notebooks for common repetitive tasks. You can access them from the gallery of Sample Projects available from Projects. See Working with projects for more information.
We’ve also provided a walkthrough of the process for creating an interactive data visualization.
Developing models¶
Anaconda Enterprise makes it easy for you to create models that you can train to make predictions and facilitate machine learning based on deep learning neural networks.
You can deploy your trained model as a REST API, so that it can be queried and scored.
The following libraries are available in Anaconda Enterprise to help you develop models:
Scikit-learn–for algorithms and model training.
TensorFlow–to express numerical computations as stateful dataflow graphs.
XGBoost–a gradient boosting framework for C++, Java, Python, R and Julia.
Theano–expresses numerical computations & compiles them to run on CPUs or GPUs.
Keras–contains implementations of commonly used neural network building blocks to make working with image and text data easier.
Lasagne–contains recipes for building and training neural networks in Theano.
Neon–deep learning framework for building models using Python, with Math Kernel Library (MKL) support.
MXNet–framework for training and deploying deep neural networks.
Caffe–deep learning framework with a Python interface geared towards image classification and segmentation.
CNTK–cognitive toolkit for working with massive datasets to facilitate distributed deep learning. Describes neural networks as a series of computational steps via a directed graph.
Training models¶
Anaconda Enterprise provides machine learning libraries such as scikit-learn and TensorFlow that you can use to train the models you create.
To train a model:
When you are ready to run an algorithm against your model and tune it, download the scikit-learn or TensorFlow package from the anaconda channel. If you don’t see this channel or these packages in your Channels list, contact your Administrator to mirror these packages to make them available to you.
Serializing your model:
When you are ready to convert your model or application into a format that can be easily distributed and reconstructed by others, serialize it and use Anaconda Enterprise to deploy it. Common serialization formats include:
YAML – supports non-hierarchical data structures & scalar data
JSON – for client-server communication in web apps
HDF5 – designed to store large amounts of hierarchical data; works well for time series data (stored in arrays)
Note
Your model or app must be written in a programming language that supports object serialization, such as Python, PHP, R or Java.
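For instance, here is a minimal sketch of serializing a model to HDF5 with Keras, one of the libraries listed earlier (the model itself is illustrative only):
from tensorflow import keras

# Build and compile a tiny illustrative model
model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(4,)),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

model.save('model.h5')  # serialize architecture and weights to HDF5
restored = keras.models.load_model('model.h5')  # reconstruct it elsewhere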
Deploying models as endpoints¶
Anaconda Enterprise enables you to deploy machine learning models as endpoints to make them available to others, so the models can be queried and scored. You can then save users’ input data as part of the training data, and retrain the model with the new training dataset.
Versioning your model:
To enable you to test variations of a model, you can deploy multiple versions of the model. You can then direct different sets of users to each of the versions, to facilitate A/B testing.
Deploying your model as an endpoint:
Deploying a model as an endpoint involves these simple steps:
Create a project to tell Anaconda Enterprise where to look for the artifacts that comprise the model.
Deploy the project to build the model and all of its dependencies. Now you—and others with whom you share the deployment—can interact with the app, and select different datasets and algorithms.
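As a sketch, the deployment command in the project's anaconda-project.yml might mirror the earlier notebook example, with the unix: line pointing at whatever script serves your model (serve_model.py is a hypothetical name):
commands:
  model-api:
    unix: python serve_model.py
    supports_http_options: True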
Querying and scoring models¶
Anaconda Enterprise enables you to query and score models created in Python, R, or other languages, using clients such as curl, a command line interface, Java or JavaScript. The model doesn’t have to have been created using AE, as long as it has been deployed as an endpoint.
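For example, here is a minimal Python sketch of querying such an endpoint; the URL, input fields, and response shape are placeholders to adapt to your own model's API:
import requests

resp = requests.post(
    'https://scoring.anaconda.example.com/score',  # hypothetical endpoint URL
    headers={'Authorization': 'Bearer <token>'},  # needed only for private deployments
    json={'features': [5.1, 3.5, 1.4, 0.2]},  # illustrative input record
)
resp.raise_for_status()
print('score:', resp.json())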
Scoring can be incredibly useful to an organization, including the following “real world” examples:
By financial institutions, to determine the level of risk that a loan applicant represents.
By debt collectors, to predict the likelihood of a debtor to repay their debt.
By marketers, to predict the likelihood of a subscriber list member to respond to a campaign.
By retailers, to determine the probability of a customer to purchase a product.
A scoring engine calculates predictions or makes recommendations based on your model. A model’s score is computed based on the model and query operators used:
Boolean queries—specify a formula
Vector space queries—support free text queries (with no query operators necessarily connecting them)
Wildcard queries—match any pattern
Using an external scoring engine
Advanced scoring techniques used in machine learning algorithms can automatically update models with new data gathered. If you have an external scoring engine that you prefer to use on your models, you can do so within Anaconda Enterprise.
Troubleshooting¶
Anaconda Enterprise provides detailed logs and monitoring information related to the Kubernetes services and containers it uses. You can use the Operations Center and Kubernetes CLI to access this information, to help diagnose and debug errors that you or other users may encounter while using the platform.
The Anaconda Enterprise cluster¶
As an Operations Center Admin, you can use the Operations Center to configure and monitor the platform.
To access the Operations Center:
Log in to Anaconda Enterprise, select the Menu icon
in the top right corner, and click the Administrative Console link displayed at the bottom of the slide out window.
Click Manage Resources.
Log in to the Operations Center using the Administrator credentials configured after installation.
To view resource utilization:
Select Servers in the menu on the left.
Click on the Private IP address of the Anaconda Enterprise master node, and select SSH login as root.
To display the current resource utilization of each node in the cluster, run this command:
kubectl top nodes --heapster-namespace=monitoring
Note
This is actual resource utilization, not limits or requests.
To view utilization and requests for a particular node, run the
kubectl describe nodecommand against the IP address for the node (listed underNAME). For example:kubectl describe node 172.31.25.175
To view the resource utilization per pod, run this command:
kubectl top pods --heapster-namespace=monitoring
To view the current status of all pods in the cluster, run
kubectl get pods.
The following table summarizes common pod states:
| Status | Description |
|---|---|
| Running | The pod has been bound to a node, and at least one container is running. |
| Pending | The pod is waiting for one or more container images to be created. |
| Terminating | The pod is in the process of being terminated. |
| Error | An error has occurred with the pod. |
| Init:CrashLoopBackOff | The pod failed to start, and will make another attempt in a few minutes. |
To view information for a particular pod, run the
kubectl describe podcommand against the pod (listed underNAME). For example:kubectl describe pod anaconda-session-89747d7fdb154b89b182d5eaa25b2e59-7f497db55wl9g
You can also use the Operations Center Logs to gain insights into pod behavior and troubleshoot issues. See logging for more information.
User errors¶
If a user experiences issues within a Notebook session, have them send you the name of the pod associated with their project session. They can obtain this information by running the hostname command from within a Jupyter Notebook or terminal window.
You can then use the commands described above or the Operations Center’s Monitoring and Logs features to investigate the issue. See Monitoring sessions and deployments for more information.
Tip
As an Administrator, you can also use the Authentication Center to impersonate a user to try to reproduce the problem they are experiencing.
To access the Authentication Center:
Log in to Anaconda Enterprise, click the Menu icon in the top right corner, then click the Administrative Console link at the bottom of the slideout menu.
Click Manage Users.
In the Manage menu on the left, click Users.
On the Lookup tab, click View all users to list every user in the system, or search the user database for all users that match the criteria you enter, based on their first name, last name, or email address.
Click Impersonate in the row of Actions for the user to display a table of all Applications this user has interacted with on the platform, including editor sessions and deployments.
Click the Anaconda Platform link to interact with Anaconda Enterprise as the user.
See Managing users for more information on managing users.
Editor sessions¶
To help you troubleshoot issues with editor sessions, it might be helpful to understand what is happening “behind the scenes”.
When a user starts a session, Anaconda Enterprise launches the appropriate editor for them to work with their project files. In the background, the editor environment and other services are running in Docker containers.
To improve startup time for projects, the editor container includes conda environments for each of the project template environments provided by the platform. These environments are stored in /opt/continuum/anaconda/envs, along with any custom environments created during the editor session.
The project repository is cloned into /opt/continuum/project. (Only changes to files in this directory can be saved to the repository.)
The anaconda-project prepare command runs, scans the project’s anaconda-project.yml file for new packages and environments, and installs them into the running session.
During this phase, you can monitor the progress by watching the output of /opt/continuum/preparing. When this process completes, the /opt/continuum/prepare.log is created.
Warning
Any changes made to the container image will be lost when the session stops, so any packages installed from the command line are available during the current session only. To persist package installs across sessions, they must be added to the project’s anaconda-project.yml file.
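For example, running anaconda-project add-packages scipy in a session terminal records the dependency in the project's anaconda-project.yml (scipy is just an illustrative package), so it is reinstalled automatically the next time the project is prepared, rather than being lost when the session stops.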
Reference materials¶
The following information is provided for your reference, to help you understand some of the core terminology used in Anaconda Enterprise, and what changes were made between releases.
We also include answers to common questions you may have, and workarounds for known issues you may encounter while using the platform.
Additional information to help you get the most out of Anaconda features is available at https://support.anaconda.com/.
Glossary¶
Anaconda
Sometimes used as shorthand for Anaconda Distribution. Anaconda, Inc. is the company behind Anaconda Distribution, conda, conda-build and Anaconda Enterprise.
Anaconda Cloud
A cloud package repository hosting service at https://www.anaconda.org. With a free account, you can publish packages you create to be used publicly.
Anaconda Distribution
Open source repository of hundreds of popular data science packages, along with the conda package and virtual environment manager for Windows, Linux, and MacOS. Conda makes it quick and easy to install, run, and upgrade complex data science and machine learning environments like scikit-learn, TensorFlow, and SciPy.
Anaconda Enterprise
A software platform for developing, governing, and automating data science and AI pipelines from laptop to production. Enterprise enables collaboration between teams of thousands of data scientists running large-scale model deployments on high-performance production clusters.
Anaconda Navigator
A desktop Graphical User Interface (GUI) included in Anaconda Distribution that allows you to easily use and manage IDEs, conda packages, environments, channels, and notebooks without the need to use the Command Line Interface (CLI).
Anaconda project
An encapsulation of your data science assets to make them easily
portable. Projects may include files, environment variables, runnable
commands, services, packages, channels, environment specifications,
scripts, and notebooks. Each project also includes an
anaconda-project.yml configuration file to automate setup, so you
can easily run and share it with others. You can create and configure
projects from the Enterprise web interface or command line interface.
Channel
A location in the repository where Anaconda Enterprise looks for packages. Enterprise Administrators and users can define channels, determine which packages are available in a channel, and restrict access to specific users or groups.
Commit
To make a set of local changes permanent by copying them to the remote server. Anaconda Enterprise checks to see if your work will conflict with any commits that your colleagues have made on the same project, so files will not be overwritten unless you choose to do so.
Conda
An open source package and environment manager that makes it quick and easy to install, run, and upgrade complex data science and machine learning environments like scikit-learn, TensorFlow, and SciPy. Thousands of Python and R packages can be installed with conda on Windows, MacOS X, Linux and IBM Power.
Conda-build
A tool used to build conda packages from recipes.
Conda environment
A superset of Python virtual environments, conda environments make it easy to create projects with different versions of Python and avoid issues related to dependencies and version requirements. A conda environment maintains its own files, directories, and paths so that you can work with specific versions of libraries and/or Python itself without affecting other Python projects.
Conda package
A binary tarball file containing system-level libraries, Python and R modules, executable programs, or other components. Conda tracks dependencies between specific packages and platforms, making it simple to create operating system-specific environments using different combinations of packages.
Conda recipe
Instructions used to tell conda-build how to build a package.
Deployment
A deployed Anaconda project containing a Notebook, web app, dashboard or machine learning model (exposed via an API). When you deploy a project, Anaconda Enterprise builds a container with all the required dependencies and runtime components—the libraries on which the project depends in order to run—and launches it with the security and access permissions defined by the user. This allows you to easily run and share the application with others.
Interactive data application
Visualizations with sliders, drop-downs and other widgets that allow users to interact with them. Interactive data applications can drive new computations, update plots and connect to other programmatic functionality.
Interactive development environment (IDE)
A suite of software tools that combines everything a developer needs to write and test software. It typically includes a code editor, a compiler or interpreter, and a debugger that the developer accesses through a single Graphical User Interface (GUI). An IDE may be installed locally, or it may be included as part of one or more existing and compatible applications accessed through a web browser.
Jupyter
A popular open source IDE for building interactive Notebooks by the Jupyter Foundation.
JupyterHub
An open source system for hosting multiple Jupyter Notebooks in a centralized location.
JupyterLab
Jupyter Foundation’s successor IDE to Jupyter, with flexible building blocks for interactive and collaborative computing. For Jupyter Notebook users, the interface for JupyterLab is familiar and still contains the notebook, file browser, text editor, terminal, and outputs.
Jupyter Notebook
The default browser-based IDE available in Anaconda Enterprise. It combines the notebook, file browser, text editor, terminal and outputs.
Live notebook
JupyterLab and Jupyter Notebooks are web-based IDE applications that allow you to create and share documents that contain live code in R or Python, equations, visualizations, and explanatory text.
Package
Software files and information about the software—such as its name, description, and specific version—bundled into a file that can be installed and managed by a package manager. Packages can be encapsulated into Anaconda projects for easy portability.
Project template
Contains all the base files and components to support a particular programming environment. For example, a Python Spark project template contains everything you need to write Python code that connects to Spark clusters. When creating a new project, you can select a template that contains a set of packages and their dependencies.
Repository
Any storage location from which software or software assets may be retrieved and installed on a local computer.
REST API
A common way to operationalize a machine learning model is through a REST API. A REST API is a web server endpoint, or callable URL, which provides results based on a query. REST APIs allow developers to create applications that incorporate machine learning and prediction, without having to write models themselves.
Session
An open project, running in an editor or IDE.
Spark
A distributed data processing engine and project of the Apache Software Foundation. While Spark has historically been tightly associated with Apache Hadoop and run on Hadoop clusters, recently the Spark project has sought to separate itself from Hadoop by releasing support for Spark on Kubernetes. The core data structure in Spark is the RDD (Resilient Distributed Dataset)—a collection of data types, distributed in redundant fashion across many systems. To improve performance, RDDs are cached in memory by default, but can also be written to disk for persistence. Apache Ignite is a related project that offers Spark RDDs that can be shared in-memory across applications.
Release notes¶
The following notes are provided to help you understand the major changes made between releases, and therefore may not include minor bug fixes and updates. If you are experiencing issues using Anaconda Enterprise, consider reviewing the known issues documented here to find workarounds.
Anaconda Enterprise 5.4.1¶
Released: April 15, 2020
Administrator-facing changes
You can now configure size limits for files being committed to the internal Git repository (default: 50MB) by changing the related values on the config map flag. This ensures that projects don’t get bogged down by oversized internal storage. We recommend keeping files below 50MB and using external file storage for large data sets. (AENT-5922)
You can now set the number of max concurrent queue jobs and enable/disable project creation with a queue using a config map flag. By implementing a queue, Kubernetes jobs for project creation are performed only when resources are available, ensuring that project creation doesn’t fail due to lack of cluster resources. (AENT-5801)
Default SSO Timeout increased to 1 day.
User-facing changes
You now have the ability to see whether your project is in the queue or actively being created.
You will now be alerted when your commits fail, saving time and work.
You can now schedule your deployment in multiple timezones via a dropdown in the Scheduler UI. Note that these scheduled deployment times will be displayed in UTC.
You can now access public channels and deployments if not added as collaborators.
CRON string validation has been added to schedules UI.
Backend improvements (non-visible changes)
There was an issue with users trying to create multiple projects at a time, overwhelming the cluster resources and ultimately causing some projects to fail to create. We’ve fixed that by implementing a job queue, limiting the number of simultaneous project creations based on configuration and available system resources.
GPU support fixed, built on CUDA 10.x.
Job pods automatically clean up upon completion of jobs.
Anaconda Enterprise 5.4.0¶
Released: October 31, 2019
Administrator-facing changes
Added support for installing the Anaconda Enterprise cluster on CentOS/RHEL 7.7 and 8.0
Upgraded Gravity to version 6.1.9 (and Kubernetes 1.15.05) with updated monitoring dashboards
Ability to configure external Postgres database
Authenticated NFS mounts
Customize Sample Gallery and new project template collection. Requires AE5 Tools.
User-facing changes
New UI look-and-feel
New sample gallery projects
Fixed JupyterLab and Jupyter Notebook timeout
Upgraded Conda to version 4.6.14 and anaconda-project to version 0.8.3, providing faster package installs and improved error messages
NFS mounts now work with scheduled jobs
Backend improvements (non-visible changes)
Upgraded nginx to version 1.17.2, which uses nginx-ingress version 1.5.2, to address CVEs.
Anaconda Enterprise 5.3.1¶
Released: July 17, 2019
Administrator-facing changes
Added support for using on-premises versions of Bitbucket and GitLab, and removed the previous requirement to connect to your repository endpoint over SSL.
Added support for installing the Anaconda Enterprise cluster on RHEL/CentOS 7.6.
Added support for NVIDIA CUDA 10.0 drivers on GPU worker nodes.
Added the ability to set global environment variables via a new configuration file, making them available across all containers. This method can be used to address the issue where values in custom .condarc files could be overwritten if the file was placed in a directory of “lower priority” than the user’s home directory.
User-facing changes
Patched JupyterLab and Jupyter Notebook to address “session timeout” and “failed to fetch” issues. Users may still see an error, but if they reload their notebook, they can continue working without losing any work.
Fixed issue where users were being asked to confirm the environment when creating a project from the Hadoop-Spark template.
Fixed issue where the UI makes it appear that changes made by collaborators on a project have not been committed, when they have been, leading the user to believe that an error has occurred.
Improved the usability of the Schedules UI.
Backend improvements (non-visible changes)
Upgraded Jupyter Notebook to version 5.7.8 to address CVEs.
Anaconda Enterprise 5.3.0¶
Released: March 22, 2019
Administrator-facing changes
Increased the minimum and recommended disk space requirements for the master node.
Added recommendation to set up partitions on the master node using Logical Volume Management (LVM) to accommodate easier future expansion.
Added noarch to the default platforms in the anaconda.yaml mirror config file.
Added a bootstrap executable to the Anaconda Enterprise installer that you can run to install conda.
Slightly changed the process for installing and configuring the Anaconda Enterprise CLI and cas-mirror.
User-facing changes
Added ability to deploy projects to user-supplied, static URLs.
Improved UI notifications on behind-the-scenes processes, and added a Notification Center.
Optimized database operations and made other performance improvements.
Added a sample project for connecting to an S3 bucket.
Fixed issue where users couldn’t use Kerberos authentication (kinit) to access a Spark/Hadoop cluster from within a notebook.
Fixed issue where incorrect default kernels were being used for projects created from the Hadoop-Spark template.
Improved error message handling to clarify errors and provide instructions on how to work around or recover from them.
Added usability improvements related to scheduling deployment runs, audit trail logging, and session initialization.
Anaconda Enterprise 5.2.4¶
Released: January 21, 2019
Administrator-facing changes
Fixed issue where custom resource profiles weren’t being captured during in-place upgrades.
Added security fixes.
Anaconda Enterprise 5.2.3¶
Released: January 2, 2019
Included fix to address a vulnerability in Kubernetes which allowed for permission escalation. You can learn more about the vulnerability here.
User-facing changes
Added ability for users to store secrets that can be used to access file systems, data stores and other enterprise resources from within sessions and deployments. Any secrets added to the platform will be available across all projects associated with the user’s account. For more information, see Storing secrets.
Fixed issue that required users to modify the
anaconda-project.ymlfile to make the Hadoop-Spark environment template work properly.Added ability to view each project’s owner, and sort the list of projects based on this column.
Fixed various issues to improve project and session performance.
Anaconda Enterprise 5.2.2¶
Released: October 10, 2018
Administrator-facing changes
Added ability to configure an external Git repository (instead of the internal Git repository) to store projects containing version-controlled notebooks, code, and other files. Supported external Git version control systems include Atlassian BitBucket, GitHub and GitHub Enterprise, and GitLab.
Administrators can optionally configure GPU worker nodes to be used only for sessions and deployments that require a GPU (by preventing CPU-only sessions and deployments from accessing GPU resources).
In-place upgrades can now be performed from AE 5.2.x to AE 5.2.2.
Improved functionality in backup script related to backup location and disk capacity requirements.
Implemented multiple security enhancements related to cache control headers, HTTP strict transport security, and default ciphers and protocols across all services.
Administrators no longer need to generate separate TLS/SSL certificates for the Operations Center.
Improved validation of custom TLS/SSL certificates in the Administrator Console.
Administrators can now disable access to
sudo yumoperations in sessions across the platform.Fixed an issue related to orphaned clients for sessions and deployments not being removed from Authentication Center.
Tokens for user notebook sessions and deployments are now stored in encrypted format.
Renamed platform-wide conda settings to default_channels, channel_alias, and ssl_verify in the conda section of the configmap, to be consistent with conda configuration settings.
Administrators can now specify the channel priority order when creating environments/installers.
Fixed an issue related to sorting of package versions when creating environments/installers.
Fixed an issue with download links for custom Anaconda parcels.
Improved behavior of package mirroring tool to only remove existing packages when clean mode is active.
Fixed an issue related to mirroring pip packages from PyPI repository.
Added support for noarch packages in package mirroring tool.
Improved logging and error handling in package mirroring tool.
Fixed an issue related to projects failing to be created due to special characters in usernames.
Fixed an issue related to authorization center errors when syncing a large number of users from external identity providers.
Added logout functionality to anaconda-enterprise-cli.
User-facing changes
Apache Zeppelin is now available as a notebook editor for projects (in addition to Jupyter Notebooks and JupyterLab). Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with interpreters for Python, R, Spark, Hive, HDFS, SQL, and more.
Conda channels in the repository can be made publicly available (default), or access can be restricted to specific authenticated users or groups.
A single notebook kernel (associated with the active conda environment used within a project) is now displayed by default in Jupyter Notebooks and JupyterLab.
Collaborators can now select a different default editor for projects that have been shared with them.
Implemented various fixes to configuration parameters for scheduled jobs within a project.
Improved input/form validation related to projects, deployments, packages, and settings across the platform.
Improved error messaging/handling across the platform, along with the ability to view errors and logs from underlying services.
Improved notifications for tasks such as uploading projects and copying sample projects.
Users are now prompted to delete all related sessions, deployments, jobs, and runs (including those used by collaborators) when deleting a project.
Fixed an issue that caused numerous erroneous job runs to be spawned based on the default job scheduling parameters.
Anaconda Enterprise 5.2.1¶
Released: August 30, 2018
User-facing changes
Fixed issue with loading spinner appearing on top of notebook sessions
Fixed issue related to missing projects and copying sample projects when upgrading from AE 5.1.x
Improved visual feedback when loading notebook sessions/deployments and performing actions such as creating/copying projects
Anaconda Enterprise 5.2.0¶
Released: July 27, 2018
Administrator-facing changes
New administrative console with workflows for managing channels and packages, creating installers, and other distinct administrator tasks
Added ability to mirror pip packages from PyPI repository
Added ability to define custom hardware resource profiles based on CPU, RAM, and GPU for user sessions and deployments
Added support for GPU worker nodes that can be defined in resource profiles
Added ability to explicitly install different types of master nodes for high availability
Added ability to specify NFS file shares that users can access within sessions and deployments
Significantly reduced the amount of time required for backup/restore operations
Added channel and package management tasks to UI, including downloading/uploading packages, creating/sharing channels, and more
Anaconda Livy is now included in the Anaconda Enterprise installer to enable remote Spark connectivity
All network traffic for services is now routed on standard HTTPS port 443, which reduces the number of external ports that need to be configured and accessed by end users
Notebook/editor sessions are now accessed via subdomains for security and isolation
Reworked documentation for administrator workflows, including managing cluster resources, configuring authentication, generating custom installers, and more
Reduced verbosity of console output from anaconda-enterprise-cli
Suppressed superfluous database errors/warnings
User-facing changes
Added support for selecting GPU hardware in project sessions and deployments, to accelerate model training and other computations with GPU-enabled packages
Added ability to select custom hardware resource profiles based on CPU, RAM, and GPU for individual sessions and deployments
Added support for scheduled and batch jobs, which can be used for recurring tasks such as model training or ETL pipelines
Added support for connecting to external Git repositories in a project session or deployment using account-wide credentials (SSH keys or API tokens)
New, responsive user interface, redesigned for data science workflows
Added ability to share deployments with unauthenticated users outside of Anaconda Enterprise
Changed the default editor in project sessions to Jupyter Notebooks (formerly JupyterLab)
Added ability to specify default editor on a per-project basis, including Jupyter Notebooks and JupyterLab
Added ability to work with data in mounted NFS file shares within sessions and deployments
Added ability to export/download projects from Anaconda Enterprise to local machine
Added package and channel management tasks to UI, including uploading/downloading packages, creating/sharing channels, and more
Reworked documentation for data science workflows, including working with projects/deployments/packages, using project templates, machine learning workflows, and more
Added ability to use plotting/Javascript libraries in JupyterLab
Added ability to force delete a project with running sessions, shared collaborators, etc.
Improved messaging when a session or deployment cannot be scheduled due to limited cluster resources
The last modified date/time for projects now accounts for commits to the project
Unique names are now enforced for projects and deployments
Fixed bug in which project creator role was not being enforced
Backend improvements (non-visible changes)
Updated to Kubernetes 1.9.6
Added RHEL/CentOS 7.5 to supported platforms
Added support for SELinux passive mode
Anaconda Enterprise now uses the Helm package manager to manage and upgrade releases
New version (v2) of backend APIs with more comprehensive information around projects, deployments, packages, channels, credentials and more
Fixed various bugs related to custom Anaconda installer builds
Fixed issue with kube-router and a CrashLoopBackOff error
Anaconda Enterprise 5.1.3¶
Released: June 4, 2018
Backend improvements (non-visible changes)
Fixed issue when generating custom Anaconda installers that contain packages with duplicate files
Fixed multiple issues related to memory errors, file size limits, and network transfer limits that affected the generation of large custom Anaconda installers
Improved logging when generating custom Anaconda installers
Anaconda Enterprise 5.1.2¶
Released: March 16, 2018
Administrator-facing changes
Fixed issue with image/version tags when upgrading AE
Backend improvements (non-visible changes)
Updated to Kubernetes 1.7.14
Anaconda Enterprise 5.1.1¶
Released: March 12, 2018
Administrator-facing changes
Ability to specify custom UID for service account at install-time (default UID: 1000)
Added pre-flight checks for kernel modules, kernel settings, and filesystem options when installing or adding nodes
Improved initial startup time of project creation, sessions, and deployments after installation. Note that all services will be in the ContainerCreating state for 5 to 10 minutes while all AE images are being pre-pulled, after which the AE user interface will become available.
Improved upgrade process to automatically handle upgrading AE core services
Improved consistency between GUI- and CLI-based installation paths
Improved security and isolation between internal database from user sessions and deployments
Added capability to configure a custom trust store and LDAPS certificate validation
Simplified installer packaging using a single tarball and consistent naming
Updated documentation for system requirements, including XFS filesystem requirements and kernel modules/settings
Updated documentation for mirroring packages from channels
Added documentation for configuring AE to point to online Anaconda repositories
Added documentation for securing the internal database
Added documentation for configuring RBAC, role mapping, and access control
Added documentation for LDAP federation and identity management
Improved documentation for backup/restore process
Fixed issue when deleting related versions of custom Anaconda parcels
Added command to remove channel permissions
Fixed issue related to Ops Center user creation in post-install configuration
Silenced warnings when using verify_ssl setting with anaconda-enterprise-cli
Fixed issue related to default admin role (ae-admin)
Fixed issue when generating TLS/SSL certificates with FQDNs greater than 64 characters
Fixed issue when using special characters with AE Ops Center accounts/passwords
Fixed bug related to Administrator Console link in menu
User-facing changes
Improvements to collaborative workflow: Added notification when collaborators make changes to a project, ability to pull changes into a project, and ability to resolve conflicting changes when saving or pulling changes into a project.
Additional documentation and examples for connecting to remote data and compute sources: Spark, Hive, Impala, and HDFS
Optimized startup time for Spark and SAS project templates
Improved initial startup time of project creation, sessions, and deployments by pre-pulling images after installation.
Increased upload limit of projects from 100 MB to 1 GB
Added capability to sudo yum install system packages from within project sessions
Fixed issue when uploading projects that caused them to fail during partial import
Fixed R kernel in R project template
Fixed issue when loading sparklyr in Spark Project
Fixed issue related to displaying kernel names and Spark project icons
Improved performance when rendering large number of projects, packages, etc.
Improved rendering of long version names in environments and projects
Render full names when sharing projects and deployments with collaborators
Fixed issue when sorting collaborators and package versions
Fixed issue when saving new environments
Fixed issues when viewing installer logs in IE 11 and Safari
Anaconda Enterprise 5.1.0¶
Released: January 19, 2018
Administrator-facing changes
New post-installation administration GUI with automated configuration of TLS/SSL certificates, administrator account, and DNS/FQDN settings; significantly reduces manual steps required during post-installation configuration process
New functionality for administrators to generate custom Anaconda installers, parcels for Cloudera CDH, and management packs for Hortonworks HDP
Improved backup and restore process with included scripts
Switched from groups to roles for role-based access control (RBAC) for Administrator and superuser access to AE services
Clarified system requirements related to system modules and IOPS in documentation
Added ability to specify fractional CPUs/cores in global container resource limits
Fixed consistency of TLS/SSL certificate names in configuration and during creation of self-signed certificates
Changed use of verify_ssl to ssl_verify throughout AE CLI for consistency with conda
Fixed configuration issue with licenses, including field names and online/offline licensing documentation
User changes
Updated default project environments to Anaconda Distribution 5.0.1
Improved configuration and documentation on using Sparkmagic and Livy with Kerberos to connect to remote Spark clusters
Fixed R environment used in sample projects and project template
Fixed UI rendering issue on package detail view of channels, downloads, and versions
Fixed multiple browser compatibility issues with Microsoft Edge and Internet Explorer 11
Fixed multiple UI issues with Anaconda Project JupyterLab extension
Backend improvements (non-visible changes)
Updated to Kubernetes 1.7.12
Updated to conda 4.3.32
Added SUSE 12 SP2/SP3, and RHEL/CentOS 7.4 to supported platform matrix
Implemented TLS 1.2 as default TLS protocol; added support for configurable TLS protocol versions and ciphers
Fixed default superuser roles for repository service, which is used for initial/internal package configuration step
Implemented secure flag attribute on all session cookies containing session tokens
Fixed issue during upgrade process that failed to vendor updated images
Fixed DiskNodeUnderPressure and cluster stability issues
Fixed Quality of Service (QoS) issue with core AE services on under-resourced nodes
Fixed issue when using access token instead of ID token when fetching roles from authentication service
Fixed issue with authentication proxy and session cookies
Known issues
IE 11 compatibility issue when using Bokeh in notebooks (including sample projects)
IE 11 compatibility issue when downloading custom installers
Anaconda Enterprise 5.0.6¶
Released: November 9, 2017
Anaconda Enterprise 5.0.5¶
Released: November 7, 2017
Anaconda Enterprise 5.0.4¶
Released: September 12, 2017
Anaconda Enterprise 5.0.3¶
Released: August 31, 2017 (General Availability Release)
Anaconda Enterprise 5.0.2¶
Released: August 15, 2017 (Early Adopter Release)
Anaconda Enterprise 5.0.1¶
Released: March 8, 2017 (Early Adopter Release)
Features:
Simplified, one-click deployment of data science projects and deployments, including live Python and R notebooks, interactive data visualizations and REST APIs.
End-to-end secure workflows with SSL/TLS encryption.
Seamlessly managed scalability of the entire platform.
Industry-grade productionization, encapsulation, and containerization of data science projects and applications.
Known issues¶
We are aware of the following issues using Anaconda Enterprise. If you’re experiencing other unexpected behavior, consider checking our Support Knowledge Base.
Unable to obtain Zeppelin credentials¶
After selecting Credential and clicking the question mark icon in the Zeppelin editor, the user should be redirected to Zeppelin documentation explaining the process for obtaining credentials. However, that link is broken.
Workaround
Rather than committing something sensitive in your code/repository through Zeppelin, create a Kubernetes secret in JSON format.
Process for installing the Anaconda Enterprise CLI doesn’t work¶
The process of installing the Anaconda Enterprise CLI downgrades packages that are essential to the AE CLI, resulting in a conda env that won’t work with the tool.
Workaround
Follow this process to create a working conda environment, and activate it:
conda create -n cli-test -c https://anaconda.example.com/repository/conda/anaconda-enterprise anaconda-enterprise-cli git python=3.6 cas-mirror
conda activate cli-test
To access help for using the Anaconda Enterprise CLI, run anaconda-enterprise-cli --help.
Attempting to install new PyViz packages in JupyterLab results in error¶
The new PyViz libraries aren’t compatible with the version of JupyterLab used in Anaconda Enterprise. For more information on PyViz compatibility, see https://github.com/pyviz/pyviz_comms#compatibility.
Workaround
Open the project in Jupyter Notebook.
Unable to download files when running JupyterLab in Chrome browser¶
If you attempt to download a file from within a JupyterLab project running in Chrome, you may see a Failed/Forbidden error that prevents you from downloading the file.
Workaround
Open the project in Jupyter Notebook or another supported browser, such as Firefox or Safari, and download the file.
Unexpected metadata in a package breaks AE channel¶
The cspice and spiceypy packages mirrored from conda-forge include incompatible metadata, which causes a channeldata.json build failure, and makes the entire channel inaccessible.
Workaround
Remove these packages from the AE channel, or update your conda-forge mirror to pull in the latest packages.
Custom conda configuration file may be overwritten¶
If you add a custom .condarc file to your project using the anaconda-enterprise-cli spark-config command, it may get overwritten with the default config options when you deploy the project.
Workaround
Place the .condarc file in a directory other than your home directory (/opt/continuum/.condarc).
Note that the conda config settings are loaded from all of the files on the conda config search path. The config settings are merged together, with keys from higher priority files taking precedence over keys from lower priority files. If you need extra settings, start by adding them to a .condarc file in a lower priority location and see if that works for you.
For more information on how directory locations are prioritized, see this blog post.
Starting in Anaconda Enterprise 5.3.1, you can also set global config variables via a config map, as an alternative to using the AE CLI.
Incorrect information in command output¶
When running the anaconda-enterprise-cli spark-config command to connect to a remote Hadoop Spark cluster from within a project, the output says you need to specify the namespace by including -n anaconda-enterprise.
Workaround
You must omit -n anaconda-enterprise from the command, as AE is installed in the default namespace.
Error creating an environment immediately after installation¶
At least one project must exist on the platform before you can create an environment. If you attempt to create an environment first, the logs will say that the associated job is running, and the container isn’t ready.
Workaround
Create a project first. The environment creation process will continue and successfully complete after a few minutes.
Cluster performance may degrade after extended use¶
The default limit for max_user_watches may be insufficient, and can be increased to improve cluster longevity.
Workaround
Run the following command on each node in the cluster, to help the cluster remain active:
sysctl -w fs.inotify.max_user_watches=1048576
To ensure this change persists across reboots, you’ll also need to write the setting to a file under /etc/sysctl.d. Note that sudo echo ... > file does not work as expected, because the redirection is performed by your non-root shell; pipe through sudo tee instead:
echo "fs.inotify.max_user_watches = 1048576" | sudo tee /etc/sysctl.d/10-fs.inotify.max_user_watches.conf
Invalid issuer URL causes library to get stuck in a sync loop¶
When using the Anaconda Enterprise Operations Center to create an OIDC Auth Connector, if you enter an invalid issuer url in the spec, the go-oidc library can get stuck in a sync loop. This will affect all connectors.
Workaround
On a single node cluster, you’ll need to do the following to shut down gravity:
Find the gravity services: systemctl list-units | grep gravity. You will see output like this:
# systemctl list-units | grep gravity
gravity__gravitational.io__planet-master__0.1.87-1714.service loaded active running Auto-generated service for the gravitational.io/planet-master:0.1.87-1714 package
gravity__gravitational.io__teleport__2.3.5.service loaded active running Auto-generated service for the gravitational.io/teleport:2.3.5 package
Shut down the teleport service:
systemctl stop gravity__gravitational.io__teleport__2.3.5.service
Shut down the planet-master service:
systemctl stop gravity__gravitational.io__planet-master__0.1.87-1714.service
On a multi-node cluster, you’ll need to shut down gravity AND all gravity-site pods:
kubectl delete pods -n kube-system gravity-site-XXXXX
In both cases, you’ll need to restart gravity services:
systemctl start gravity__gravitational.io__planet-master__0.1.87-1714.service
systemctl start gravity__gravitational.io__teleport__2.3.5.service
GPU affinity setting reverts to default during upgrade¶
When upgrading Anaconda Enterprise from a version that supports the ability to reserve GPU nodes to a newer version (e.g., 5.2.x > 5.2.3), the nodeAffinity setting reverts to the default value, thus allowing CPU sessions and deployments to run on GPU nodes.
Workaround
If you had commented out the nodeAffinity section of the Config map in your previous installation, you’ll need to do so again after completing the upgrade process. See Setting resource limits for more information.
Install and post-install problems¶
Failed installations
If an installation fails, you can view the failed logs as part of the support bundle in the failed installation UI.
After executing sudo gravity enter, you can check /var/log/messages to troubleshoot a failed installation or the types of errors described below.
You can also run journalctl from within gravity to inspect the logs for a specific service:
journalctl -u gravity-23423lkqjfefqpfh2.service
Note
Replace gravity-23423lkqjfefqpfh2.service with the name of your gravity service.
You may see messages in /var/log/messages related to errors such as
“etcd cluster is misconfigured” and “etcd has no leader” from one of the
installation jobs, particularly gravity-site. This usually indicates that
etcd needs more compute power, more disk space, or a faster disk.
Anaconda Enterprise is very sensitive to disk latency, so we usually recommend
using a better disk for /var/lib/gravity on target machines and/or putting
etcd data on a separate disk. For example, you can mount etcd under
/var/lib/gravity/planet/etcd on the hosts.
After a failed installation, you can uninstall Anaconda Enterprise and start over with a fresh installation.
Failed on pulling gravitational/rbac
If the node refuses to install and fails on pulling gravitational/rbac, create
a new directory to use as TMPDIR before installing and give write access
to user 1000, as in the sketch below.
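A minimal sketch (the directory path here is an arbitrary example, not a required location):

sudo mkdir /home/installer-tmp        # any location with enough free space
sudo chown 1000 /home/installer-tmp   # user 1000 needs write access
export TMPDIR=/home/installer-tmp
# Run the installer from this same shell so it picks up TMPDIR.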
“Cannot continue” error during install
This error is caused by a previous failure of a kernel module check or other preflight check, followed by an attempt to reinstall.
Stop the install, make sure the preflight check failure is resolved, and then restart the install.
Problems during post-install or post-upgrade steps
Post-install and post-upgrade steps run as Kubernetes jobs. When they finish running, the pods used to run them are not removed. These and other stopped pods can be found using:
kubectl get pods -A
The logs in each of these three pods will be helpful for diagnosing issues in the following steps:
| Pod | Issues in this step |
|---|---|
|  | post-install UI |
|  | installation step |
|  | post-update steps |
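To read the logs for any of these pods, use kubectl logs with the pod name and namespace reported by the previous command (the names below are placeholders):

kubectl logs <pod-name> -n <namespace>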
Post-install configuration doesn’t complete
After completing the post-install steps, clicking FINISH SETUP may not close the screen, preventing you from continuing.
You can complete the process by running the following commands within gravity.
To determine the site name:
SITE_NAME=$(gravity status --output=json | jq '.cluster.token.site_domain' -r)
To complete the post-install process:
gravity --insecure site complete $SITE_NAME
Re-starting the post-install configuration
In order to reinitialize the post-install configuration UI—to regenerate temporary (self-signed) SSL certificates or reconfigure the platform based on your domain name—you must re-create and re-expose the service on a new port.
First, export the deployment’s resource manifest:
helm template --name anaconda-enterprise /var/lib/gravity/local/packages/unpacked/gravitational.io/AnacondaEnterprise/5.X.X/resources/Anaconda-Enterprise/ -x /var/lib/gravity/local/packages/unpacked/gravitational.io/AnacondaEnterprise/5.X.X/resources/Anaconda-Enterprise/templates/wagonwheel.yaml > wagon.yaml
Edit wagon.yaml, replacing image: ae-wagonwheel:5.X.X with image: leader.telekube.local:5000/ae-wagonwheel:5.X.X
Then recreate the ae-wagonwheel deployment using the updated YAML file:
kubectl create -f wagon.yaml -n kube-system
NOTE: Replace 5.X.X with your actual version number.
To ensure the deployment is running in the system namespace, execute sudo gravity enter and run:
kubectl get deploy -n kube-system
One of these should be ae-wagonwheel, the post-install configuration UI. To make this visible to the outside world, run:
kubectl expose deploy ae-wagonwheel --port=8000 --type=NodePort --name=post-install -n kube-system
This will run the UI on a new port, allocated by Kubernetes, under the name post-install.
To find out which port it is listening under, run:
kubectl get svc -n kube-system | grep post-install
Then navigate to http://<your domain>:<this port> to access the post-install UI.
Kernel parameters may be overwritten and cause networking errors¶
If networking starts to fail in Anaconda Enterprise, it may be because a kernel parameter related to networking was inadvertently overwritten.
Workaround
On the master node running AE, run gravity status and verify that all kernel parameters are set correctly. If the Status for a particular parameter is degraded, follow the instructions here to reset the kernel parameter.
Removing collaborator from project with open session generates error¶
If you remove a collaborator from a project while they have a session open for that project, they might see a 500 Internal Server Error message.
Workaround
Add the user as a collaborator to the project, have them stop their notebook session, then remove them as a collaborator. For more information, see how to share a project.
To prevent collaborators from seeing this error, ask them to close their running session before you remove them from the project.
Affected versions
5.2.x
AE auth pod throws OutOfMemory Error¶
If you see an exception similar to the following, Anaconda Enterprise has exceeded the maximum heap size for the JVM:
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "default task-248"
2018-08-29 23:13:26.327 UTC ERROR XNIO001007: A channel event listener threw an exception: java.lang.OutOfMemoryError: Java heap space (default I/O-36) [org.xnio.listener]
2018-08-29 23:12:32.823 UTC ERROR UT005023: Exception handling request to /auth/realms/AnacondaPlatform/protocol/openid-connect/token: java.lang.OutOfMemoryError: Java heap space (default task-86) [io.undertow.request]
2018-08-29 23:13:01.353 UTC ERROR XNIO001007: A channel event listener threw an exception: java.lang.OutOfMemoryError: Java heap space
Workaround
Increase the JVM max heap size by doing the following:
Open the anaconda-enterprise-ap-auth deployment spec by running the following command in a terminal:

$ kubectl edit deploy anaconda-enterprise-ap-auth

Increase the value for JAVA_OPTS (example below):

spec:
  containers:
  - args:
    - cp /standalone-config/standalone.xml /opt/jboss/keycloak/standalone/configuration/ && /opt/jboss/keycloak/bin/standalone.sh -Dkeycloak.migration.action=import -Dkeycloak.migration.provider=singleFile -Dkeycloak.migration.file=/etc/secrets/keycloak/keycloak.json -Dkeycloak.migration.strategy=IGNORE_EXISTING -b 0.0.0.0
    command:
    - /bin/sh
    - -c
    env:
    - name: DB_URL
      value: anaconda-enterprise-postgres:5432
    - name: SERVICE_MIGRATE
      value: auth_quick_migrate
    - name: SERVICE_LAUNCH
      value: auth_quick_launch
    - name: JAVA_OPTS
      value: -Xms64m -Xmx2048m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m
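After saving the edited spec, Kubernetes restarts the auth pod with the new heap settings; you can watch the rollout finish (standard kubectl usage):

kubectl rollout status deploy anaconda-enterprise-ap-auth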
Affected versions
5.2.1
Fetch changes behavior in Apache Zeppelin may not be obvious to new users¶
A Fetch changes notification appears, but the changes do not get applied to the editor. This is how Zeppelin works, but users unfamiliar with the editor may find it confusing.
If a collaborator makes changes to a notebook that's also open by another user, the user needs to pull the changes that the collaborator made AND click the small reload arrows to refresh their notebook with the changes.
Affected versions
5.2.2
Apache Zeppelin can’t locate conflicted files or non-Zeppelin notebook files¶
If you need to access files other than Apache Zeppelin notebooks within a project, you can use the %sh interpreter from within a Zeppelin notebook to work with files via bash commands, or use the Settings tab to change the default editor to Jupyter Notebooks or JupyterLab and use the file browser or terminal.
Affected versions
5.2.2
Create and Installers buttons are not visible on Channels page¶
When the Channels page is viewed initially, the Create and Installers buttons are not visible on the top right section of the screen. This prevents the user from creating channels or viewing a list of installers.
Workaround
To make the Create and Installers buttons visible on the Channels page, perform one of the following steps:
Click on the top-level Channels navigation link again when viewing the Channels page
Click on a specific channel to view its detail page, then return to the Channels page
Affected versions
5.2.1
Updating a package from the Anaconda metapackage¶
When updating a package dependency of a project, if that dependency is part of
the anaconda metapackage, the new version will be installed once, but a subsequent
anaconda-project call will uninstall the upgraded package.
Workaround
When updating a package dependency, remove the anaconda metapackage from the
list of dependencies and, at the same time, add the new version of the dependency
that you want to update, as in the sketch below.
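A sketch using the anaconda-project CLI (pandas=0.25 is only an illustrative dependency; substitute the package you are updating):

anaconda-project remove-packages anaconda
anaconda-project add-packages pandas=0.25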
Affected versions
5.1.0, 5.1.1, 5.1.2, 5.1.3
File size limit when uploading files¶
Files larger than the current restrictions cannot be uploaded to a project:

The limit for file uploads in JupyterLab is 15 MB
Affected versions
5.1.0, 5.1.1, 5.1.2, 5.1.3, 5.2.0, 5.2.1, 5.2.2, 5.2.3
IE 11 compatibility issue when using Bokeh in projects (including sample projects)¶
Bokeh plots and applications have had a number of issues with Internet Explorer 11, which typically result in the user seeing a blank screen.
Workaround
Upgrade to the latest version of Bokeh available. On Anaconda 4.4 the latest is 0.12.7. On Anaconda 5.0 the latest version of Bokeh is 0.12.13. If you are still having issues, consult the Bokeh team or support.
Affected versions
5.1.0, 5.1.1, 5.1.2, 5.1.3
IE 11 compatibility issue when downloading custom Anaconda installers¶
Unable to download a custom Anaconda installer from the browser when using Internet Explorer 11 on Windows 7. Attempting to download a custom installer with this setup will result in an error that “This page can’t be displayed”.
Workaround
Custom installers can be downloaded by refreshing the page with the error message, clicking the “Fix Connection Error” button, or using a different browser.
Affected versions
5.1.0, 5.1.1, 5.1.2, 5.1.3
Project names over 40 characters may prevent JupyterLab launch¶
If a project name is more than 40 characters long, launching the project in JupyterLab may fail.
Workaround
Rename the project to a name less than 40 characters long and launch the project in JupyterLab again.
Affected versions
5.1.1, 5.1.2, 5.1.3
Long-running jobs may falsely report failure¶
If a job (such as an installer, parcel, or management pack build) runs for more than 10 minutes, the UI may falsely report that the job has failed. The apparent job failure occurs because the session/access token in the UI has expired.
However, the job will continue to run in the background, the job run history will indicate a status of “running job” or “finished job”, and the job logs will be accessible.
Workaround
To prevent false reports of failed jobs from occurring in the UI, you can extend the access token lifespan (default: 10 minutes).
To extend the access token lifespan, log in to the Anaconda Enterprise Authentication Center, navigate to Realm Settings > Tokens, then increase the Access Token Lifespan to be at least as long as the jobs being run (e.g., 30 minutes).
Affected versions
5.1.0, 5.1.1, 5.1.2, 5.1.3
New Notebook not found on IE11¶
On Internet Explorer 11, creating a new Notebook in a Classic Notebook editing session may produce the error “404: Not Found”. This is an artifact of the way that Internet Explorer 11 locates files.
Workaround
If you see this error, click “Back to project”, then click “Return to Session”. This refreshes the file list and allows IE11 to find the file. You should see the new notebook in the file list. Click on it to open the notebook.
Affected versions
5.0.4, 5.0.5
Disk pressure errors on AWS¶
If your Anaconda Enterprise instance is on Amazon Web Services (AWS), overloading the system with reads and writes to the directory /opt/anaconda can cause disk pressure errors, which may result in the following:
Slow project starts.
Project failures.
Slow deployment completions.
Deployment failures.
If you see these problems, check the logs to verify whether disk pressure is the cause:
To list all nodes, run:
kubectl get node
Identify which node is experiencing issues, then run the following command against it, to view the log for that node:
kubectl describe node <master-node-name>
If the node is under disk pressure, the output's Conditions section will report a DiskPressure condition, along with related eviction events.
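To filter for just that condition (a quick check using standard kubectl and grep):

kubectl describe node <master-node-name> | grep DiskPressure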
Workaround
To relieve disk pressure, you can add disks to the instance by adding another Elastic Block Store (EBS) volume. If the disk pressure is being caused by a back up, you can move the backed up file somewhere else (e.g., to an NFS mount). See Backing up and restoring AE for more information.
To add disks to the instance by attaching another Elastic Block Store (EBS) volume, follow these steps:
Open the AWS console and add a new EBS volume provisioned to 3000 IOPS. A typical disk size is 500 GB.
Attach the volume to your AE 5 master.
To find your new disk's name, run fdisk -l. Our example disk's name is /dev/nvme1n1. In the rest of the commands on this page, replace /dev/nvme1n1 with your disk's name.

Format the new disk:

fdisk /dev/nvme1n1

To create a new partition, at the first prompt press n and then the return key.

Accept all default settings.

To write the changes, press w and then the return key. This will take a few minutes.

To find your new partition's name, examine the output of the last command. If the name is not there, run fdisk -l again to find it. Our example partition's name is /dev/nvme1n1p1. In the rest of the commands on this page, replace /dev/nvme1n1p1 with your partition's name.

Make a file system on the new partition:

mkfs /dev/nvme1n1p1

Make a temporary directory to capture the contents of /opt/anaconda:

mkdir /opt/aetmp

Mount the new partition to /opt/aetmp:

mount /dev/nvme1n1p1 /opt/aetmp

Shut down the Kubernetes system:

Find the gravity services:

systemctl list-units | grep gravity

You will see output like this:

# systemctl list-units | grep gravity
gravity__gravitational.io__planet-master__0.1.87-1714.service loaded active running Auto-generated service for the gravitational.io/planet-master:0.1.87-1714 package
gravity__gravitational.io__teleport__2.3.5.service loaded active running Auto-generated service for the gravitational.io/teleport:2.3.5 package

Shut down the teleport service:

systemctl stop gravity__gravitational.io__teleport__2.3.5.service

Shut down the planet-master service:

systemctl stop gravity__gravitational.io__planet-master__0.1.87-1714.service

Copy everything from /opt/anaconda to /opt/aetmp:

rsync -vpoa /opt/anaconda/* /opt/aetmp

Include the new disk at the /opt/anaconda mount point by adding this line to your file systems table at /etc/fstab:

/dev/nvme1n1p1 /opt/anaconda ext4 defaults 0 0

Use mixed spaces and tabs in this pattern:
/dev/nvme1n1p1<tab>/opt/anaconda<tab>ext4<tab>defaults<tab>0<space>0

Move the old /opt/anaconda out of the way to /opt/anaconda-old:

mv /opt/anaconda /opt/anaconda-old

If you're certain the rsync was successful, you may instead delete /opt/anaconda:

rm -r /opt/anaconda

Unmount the new disk from the /opt/aetmp mount point:

umount /opt/aetmp

Make a new /opt/anaconda directory:

mkdir /opt/anaconda

Mount all the disks defined in fstab:

mount -a

Restart the gravity services:

systemctl start gravity__gravitational.io__planet-master__0.1.87-1714.service
systemctl start gravity__gravitational.io__teleport__2.3.5.service
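Once the services are back up, verify that the new partition is now serving /opt/anaconda (a quick sanity check):

df -h /opt/anaconda
# The Filesystem column should show the new partition, e.g. /dev/nvme1n1p1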
Disk pressure error during backup¶
If a disk pressure error occurs while backing up your configuration, the amount of data being backed up has likely exceeded the amount of space available to store the backup files. This triggers the Kubernetes eviction policy defined in the kubelet startup parameter and causes the backup to fail.
To check your eviction policy, run the following commands on the master node:
sudo gravity enter
systemctl status | grep "/usr/bin/kubelet"
Workaround
Restart the backup process, and specify a location with sufficient space (e.g., an NFS mount) to store the backup files. See Backing up and restoring AE for more information.
General diagnostic and troubleshooting steps¶
Entering Anaconda Enterprise environment
To enter the Anaconda Enterprise environment and gain access to kubectl and
other commands within Anaconda Enterprise, use the command:
sudo gravity enter
Moving files and data
Occasionally you may need to move files and data from the host machine to the Anaconda Enterprise environment. If so, there are two shared mounts to pass data back and forth between the two environments:
host: /opt/anaconda/ -> AE environment: /opt/anaconda/

host: /var/lib/gravity/planet/share -> AE environment: /ext/share

If data is written to either of these locations, it will be available both on the host machine and within the Anaconda Enterprise environment.
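For example, to pass a file from the host into the AE environment (the file name is hypothetical):

# On the host:
cp ./mydata.csv /var/lib/gravity/planet/share/

# Inside the AE environment, after sudo gravity enter:
ls /ext/share/mydata.csv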
Debugging
On AWS, traffic must be able to reach the platform's public IPs and ports. Either use a canonical security group with the proper ports opened, or manually add the specific ports listed in Network Requirements.
Problems during air gap project migration
The command anaconda-project lock over-specifies the channel list, triggering a conda bug where conda adds defaults from the internet to the list of channels.
Solution:
Add default_channels to the .condarc file. This way, when conda adds “defaults” to the command, it resolves to the internal repo server rather than the repo.continuum.io URLs.
EXAMPLE:
default_channels:
- anaconda
channels:
- our-internal
- our-partners
- rdkit
- bioconda
- defaults
- r-channel
- conda-forge
channel_alias: https://:8086/conda
auto_update_conda: false
ssl_verify: /etc/ssl/certs/ca.2048.cer
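After editing .condarc, you can confirm that conda sees the intended configuration (standard conda usage):

conda config --show default_channels
conda config --show channels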
LDAP error in ap-auth
[LDAP: error code 12 - Unavailable Critical Extension]; remaining name 'dc=acme, dc=com'
This error can occur when pagination is turned on. Pagination is a server-side extension and is not supported by some LDAP servers, notably the Sun Directory Server.
Session startup errors
If you need to troubleshoot session startup, you can use a terminal to view the
session startup logs. When session startup begins, the output of the
anaconda-project prepare command is written to /opt/continuum/preparing,
and when the command completes, the log is moved to
/opt/continuum/prepare.log.
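For example, from a terminal in the session (paths as documented above):

tail -f /opt/continuum/preparing   # while startup is still in progress
cat /opt/continuum/prepare.log     # after startup has completed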
Frequently asked questions¶
General¶
When was the general availability release of Anaconda Enterprise v5?
Our GA release was August 31, 2017 (version 5.0.3). Our most recent version was released April 15, 2020 (version 5.4.1).
Which notebooks or editors does Anaconda Enterprise support?
Anaconda Enterprise supports the use of Jupyter Notebooks and JupyterLab, which are the most popular integrated data science environments for working with Python and R notebooks. In version 5.2.2 we added support for Apache Zeppelin, a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with interpreters for Python, R, Spark, Hive, HDFS, SQL, and more.
Can I deploy multiple data science applications to Anaconda Enterprise?
Yes, you can deploy multiple data science applications and languages across an Anaconda Enterprise cluster. Each data science application runs in a secure and isolated environment with all of the dependencies from Anaconda that it requires.
A single node can run multiple applications based on the amount of compute resources (CPU and RAM) available on a given node. Anaconda Enterprise handles all of the resource allocation and application scheduling for you.
Does Anaconda Enterprise support high availability deployments?
Partially. Some Anaconda Enterprise services and user-deployed apps are automatically configured for fault tolerance when installed on three or more nodes. Anaconda Enterprise provides several automatic mechanisms for fault tolerance and service continuity, including automatic restarts, health checks, and service migration.
For more information, see Fault tolerance in Anaconda Enterprise.
Which identity management and authentication protocols does Anaconda Enterprise support?
Anaconda Enterprise comes with out-of-the-box support for the following:
LDAP / AD
SAML
Kerberos
For more information, see Connecting to external identity providers.
Does Anaconda Enterprise support two-factor authentication (including one-time passwords)?
Yes, Anaconda Enterprise supports single sign-on (SSO) and two-factor authentication (2FA) using FreeOTP, Google Authenticator, or Google Authenticator-compatible 2FA.
You can configure one-time password policies in Anaconda Enterprise by navigating to the authentication center and clicking on Authentication and then OTP Policy.
System requirements¶
What operating systems are supported for Anaconda Enterprise?
Please see operating system requirements.
Note
Linux distributions other than those listed in the documentation can be supported on request.
What are the minimum system requirements for Anaconda Enterprise nodes?
Please see system requirements.
Which browsers are supported for Anaconda Enterprise?
Please see browser requirements.
Does Anaconda Enterprise come with a version control system?
Yes, Anaconda Enterprise includes an internal Git server, which allows users to save and commit versions of their projects.
Can Anaconda Enterprise integrate with my own Git server?
Yes, as described in Connecting to an external version control repository.
Installation¶
How do I install Anaconda Enterprise?
The Anaconda Enterprise installer is a single tarball that includes Docker, Kubernetes, system dependencies, and all of the components and images necessary to run Anaconda Enterprise. The system administrator runs one command on each node.
Can Anaconda Enterprise be installed on-premises?
Yes, including airgapped environments.
Can Anaconda Enterprise be installed on cloud environments?
Yes, including Amazon AWS, Microsoft Azure, and Google Cloud Platform.
Does Anaconda Enterprise support air gapped (off-line) environments?
Yes, the Anaconda Enterprise installer includes Docker, Kubernetes, system dependencies, and all of the components and images necessary to run Anaconda Enterprise on-premises or on a private cloud, with or without internet connectivity. We can deliver the installer to you on a USB drive.
Can I build Docker images for the install of Anaconda Enterprise?
No. The installation of Anaconda Enterprise is supported only by using the single-file installer. The Anaconda Enterprise installer includes Docker, Kubernetes, system dependencies, and all of the components and images necessary for Anaconda Enterprise.
Can I install Anaconda Enterprise on my own instance of Kubernetes?
No. The Anaconda Enterprise installer already includes Kubernetes.
Can I get the AE installer packaged as a virtual machine (VM), Amazon Machine Image (AMI) or other installation package?
No. The installation of Anaconda Enterprise is supported only by using the single-file installer.
Which ports are externally accessible from Anaconda Enterprise?
Please see network requirements.
Can I use Anaconda Enterprise to connect to my Hadoop/Spark cluster?
Yes. Anaconda Enterprise supports connectivity from notebooks to local or remote Spark clusters by using the Sparkmagic client and a Livy REST API server. Anaconda Enterprise provides Sparkmagic, which includes Spark, PySpark, and SparkR notebook kernels for deployment.
How can I manage Anaconda packages on my Hadoop/Spark cluster?
An administrator can generate custom Anaconda parcels for Cloudera CDH or custom Anaconda management packs for Hortonworks HDP using Anaconda Enterprise. A data scientist can use these Anaconda libraries from a notebook as part of a Spark job.
On how many nodes can I install Anaconda Enterprise?
You can install Anaconda Enterprise in the following configurations during the initial installation:
One node (one master node)
Two nodes (one master node, one worker node)
Three nodes (one master node, two worker nodes)
Four nodes (one master node, three worker nodes)
Five nodes (one master node, four worker nodes)
After the initial installation, you can add or remove worker nodes from the Anaconda Enterprise cluster at any time.
One node serves as the master node and writes storage to disk, and the other nodes serve as worker nodes. Anaconda Enterprise services and user-deployed applications run seamlessly on the master and worker nodes.
Can I generate certificates manually?
Yes, if automatic TLS/SSL certificate generation fails for any reason, you can generate the certificates manually. Follow these steps:
Generate self-signed temporary certificates. On the master node, run:
cd path/to/Anaconda/Enterprise/unpacked/installer
cd DIY-SSL-CA
bash create_noprompt.sh DESIRED_FQDN
cp out/DESIRED_FQDN/secret.yaml /var/lib/gravity/planet/share/secrets.yaml

Replace DESIRED_FQDN with the fully-qualified domain of the cluster to which you are installing Anaconda Enterprise.

Saving this file as /var/lib/gravity/planet/share/secrets.yaml on the Anaconda Enterprise master node makes it accessible as /ext/share/secrets.yaml within the Anaconda Enterprise environment, which can be entered with the command sudo gravity enter.

Update the certs secret. Replace the built-in certs secret with the contents of secrets.yaml. Enter the Anaconda Enterprise environment and run these commands:

$ kubectl delete secrets certs
secret "certs" deleted
$ kubectl create -f /ext/share/secrets.yaml
secret "certs" created
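You can confirm the replacement secret is in place (standard kubectl usage):

kubectl get secret certs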
GPU Support¶
How can I make GPUs available to my team of data scientists?
If your data science team plans to use version 5.2 of the Anaconda Enterprise AI enablement platform, here are a few approaches to consider when planning your GPU cluster:
Build a dedicated GPU-only cluster.
If GPUs will be used by specific teams only, creating a separate cluster allows you to more carefully control GPU access.
Build a heterogeneous cluster.
Not all projects require GPUs, so a cluster containing a mix of worker nodes—with and without GPUs—can serve a variety of use cases in a cost-effective way.
Add GPU nodes to an existing cluster.
If your team’s resource requirements aren’t clearly defined, you can start with a CPU-only cluster, and add GPU nodes to create a heterogeneous cluster when the need arises.
Anaconda Enterprise supports heterogeneous clusters by allowing you to create different “resource profiles” for projects. Each resource profile describes the number of CPU cores, the amount of memory, and the number of GPUs the project needs. Administrators typically will create “Regular”, “Large”, and “Large + GPU” resource profiles for users to select from when running their project. If a project requires a GPU, AE will run it on only those cluster nodes with an available GPU.
What software is GPU accelerated?
Anaconda provides a number of GPU-accelerated packages for data science. For deep learning, these include:
Keras (keras-gpu)

TensorFlow (tensorflow-gpu)

Caffe (caffe-gpu)

PyTorch (pytorch)

MXNet (mxnet-gpu)
For boosted decision tree models:
XGBoost (py-xgboost-gpu)
For more general array programming, custom algorithm development, and simulations:
CuPy (cupy)

Numba (numba)
Note
Unless a package has been specifically optimized for GPUs (by the authors) and built by Anaconda with GPU support, it will not be GPU-accelerated, even if the hardware is present.
What hardware does each of my cluster nodes require?
Anaconda recommends installing Anaconda Enterprise in a cluster configuration. Each installation should have an odd number of master nodes, and we recommend at least one worker node. The master node runs all Anaconda Enterprise core services and does not need a GPU.
Using EC2 instances, a minimal configuration is one master node running on a m4.4xlarge instance and one GPU worker node running on a p3.2xlarge instance. More users will require more worker nodes—and possibly a mix of CPU and GPU worker nodes.
See Installation requirements for the baseline hardware requirements for Anaconda Enterprise.
How many GPUs does my cluster need?
A best practice for machine learning is for each user to have exclusive use of their GPU(s) while their project is running. This ensures they have sufficient GPU memory available for training, and provides more consistent performance.
When an Anaconda Enterprise user launches a notebook session or deployment that requires GPUs, those resources are reserved for as long as the project is running. When the notebook session or deployment is stopped, the GPUs are returned to the available pool for another user to claim.
The number of GPUs required in the cluster can therefore be determined by the number of concurrently running notebook sessions and deployments that are expected. Adding nodes to an Anaconda Enterprise cluster is straightforward, so organizations can start with a conservative number of GPUs and grow as demand increases.
To get more out of your GPU resources, Anaconda Enterprise supports scheduling and running unattended jobs. This enables you to execute periodic retraining tasks—or other resource-intensive tasks—after regular business hours, or at times GPUs would otherwise be idle.
What kind of GPUs should I use?
Although the Anaconda Distribution supports a wide range of NVIDIA GPUs, enterprise deployments for data science teams developing models should use one of the following GPUs:
Tesla V100 (recommended)
Tesla P100 (adequate)
Can I mix GPU models in one cluster?
Kubernetes cannot currently distinguish between different GPU models in the same cluster node, so Anaconda Enterprise requires all GPU-enabled nodes within a given cluster to have the same GPU model (for example, all Tesla V100). Different clusters (e.g., “production” and “development”) can use different GPU models, of course.
Can I use cloud GPUs?
Yes, Anaconda Enterprise 5.2 can be installed on cloud VMs with GPU support. Amazon Web Services (AWS), Google Cloud Platform, and Microsoft Azure all offer Tesla GPU options.
Anaconda Project¶
What operating systems and Python versions are supported for Anaconda Project?
Anaconda Project supports Windows, macOS and Linux, and tracks the latest Anaconda releases with Python 2.7, 3.5, 3.6, and 3.7.
How is encapsulation with Anaconda Project different from creating a workspace or project in Spyder, PyCharm, or other IDEs?
A workspace or project in an IDE is a directory of files on your desktop. Anaconda Project encapsulates those files, but also includes additional parameters to describe how to run a project with its dependencies. Anaconda Project is portable and allows users to run, share, and deploy applications across different operating systems.
What types of projects can I deploy?
Anaconda Project is very flexible and can deploy many types of projects with conda or pip dependencies. Deployable projects include:
Notebooks (Python and R)
Bokeh applications and dashboards
REST APIs in Python and R (including machine learning scoring and predictions)
Python and R scripts
Third-party apps, web frameworks, and visualization tools such as Tensorboard, Flask, Falcon, deck.gl, plot.ly Dash, and more.
Any generic Python or R script or webapp can be configured to serve on port 8086, which makes the app visible in Anaconda Enterprise when deployed.
Does Anaconda Enterprise include Docker images for my data science projects?
Anaconda Enterprise includes data science application images for the editor and deployments. You can install additional packages in either environment using Anaconda Project. Anaconda Project includes the information required to reproduce the project environment with Anaconda, including Python, R, or any other conda package or pip dependencies.
After upgrading AE5 my projects no longer work
If you’ve upgraded to AE 5.4 and are getting package install errors you may
need to re-write your anaconda-project.yml file.
If you were using modified template anaconda-project.yml files for Python
2.7, 3.5, or 3.6, it is best to leave the package list empty in the env_specs
section. Then add your required packages and their versions to the
global package list.
Here’s an example using the Python 3.6 template anaconda-project.yml
file from AE version 5.3.1 where the package list has been removed from the
env_specs and the required packages added to the global list.
name: Python 3.6
description: A comprehensive project template that contains all of the packages available in the Anaconda Distribution v5.0.1 for Python 3.6. Get started with the most popular and powerful packages in data science.
channels: []
packages:
- python=3.6
- notebook
- pandas=0.25
- psycopg2
- holoviews
platforms:
- linux-64
- osx-64
- win-64
env_specs:
  anaconda50_py36:
    packages: []
    channels: []
Notebooks¶
Are the deployed, self-service notebooks read-only?
Yes, the deployed versions of self-service notebooks are read-only, but they can be executed by collaborators or viewers. Owners of the project that contain the notebooks can edit the notebook and deploy (or re-deploy) them.
What happens when other people run the notebook? Does it overwrite any file, if notebook is writing to a file?
A deployed, self-service notebook is read-only but can be executed by other collaborators or viewers. If multiple users are running a notebook that writes to a file, the file will be overwritten unless the notebook is configured to write data based on a username or other environment variable.
Can I define environment variables as part of my data science project?
Yes, Anaconda Project supports environment variables that can be defined when deploying a data science application. Only project collaborators can view or edit environment variables, and they cannot be accessed by viewers.
How are Anaconda Project and Anaconda Enterprise available?
Anaconda Project is free and open-source. Anaconda Enterprise is a commercial product.
Where can I find example projects for Anaconda Enterprise?
Sample projects are included as part of the Anaconda Enterprise installation, which include sample workflows and notebooks for Python and R such as financial modeling, natural language processing, machine learning models with REST APIs, interactive Bokeh applications and dashboards, image classification, and more.
The sample projects include examples with visualization tools (Bokeh, deck.gl), pandas, scipy, Shiny, Tensorflow, Tensorboard, xgboost, and many other libraries. Users can save the sample projects to their Anaconda Enterprise account or download the sample projects to their local machine.
Does Anaconda Enterprise support batch scoring with REST APIs?
Yes, Anaconda Enterprise can be used to deploy machine learning models with REST APIs (including Python and R) that can be queried for batch scoring workflows. The REST APIs can be made available to other users and accessed with an API token.
Does Anaconda Enterprise provide tools to help define and implement REST APIs?
Yes. A data scientist can expose a model with minimal API development work. Anaconda Enterprise includes an API wrapper for Python frameworks that builds on top of existing web frameworks in Anaconda, making it easy to expose your existing data science models with minimal code. You can also deploy REST APIs using existing API frameworks for Python and R.
Help and training¶
Do you offer support for Anaconda Enterprise?
Yes, we offer full support with Anaconda Enterprise.
Do you offer training for Anaconda Enterprise?
Yes, we offer product training for collaborative, end-to-end data science workflows with Anaconda Enterprise.
Do you have a question not answered here?
Please contact us for more information.