# Atlas API

This class allows for programmatic interactions with Atlas. Initialize an AtlasProject in any Python context, such as a script or a Jupyter Notebook, to access your web-based map interactions.

=== "Atlas Client Example"

    ``` py title="map_embeddings.py"
    from nomic import atlas
    import numpy as np

    num_embeddings = 10000
    embeddings = np.random.rand(num_embeddings, 256)

    response = atlas.map_embeddings(embeddings=embeddings, is_public=True)
    print(response)
    ```

=== "Atlas Client Private Map Example"

    ``` py title="map_embeddings_private.py"
    from nomic import atlas
    import numpy as np

    num_embeddings = 10000
    embeddings = np.random.rand(num_embeddings, 256)

    response = atlas.map_embeddings(embeddings=embeddings, is_public=False)
    print(response)
    ```

## Map Embedding API

::: nomic.atlas.map_embeddings
    options:
      show_root_heading: True

## Map Text API

::: nomic.atlas.map_text
    options:
      show_root_heading: True

## AtlasProject API

::: nomic.project.AtlasProject
    options:
      show_root_heading: True

## AtlasProjection API

::: nomic.project.AtlasProjection
    options:
      show_root_heading: True

---

# Collection of Maps

[Twitter](https://atlas.nomic.ai/map/twitter) (5.4 million tweets)

[Stable Diffusion](https://atlas.nomic.ai/map/stablediffusion) (6.4 million images)

[NeurIPS Proceedings](https://atlas.nomic.ai/map/neurips) (16,623 documents)

[ICLR 2018-2023 Submissions](https://atlas.nomic.ai/map/b06c5cd7-6946-43ed-b515-7934970c8ed7/6e643208-03fb-4b94-ae01-69ce5395ee5b)

[MNIST Logits](https://atlas.nomic.ai/map/2a222eb6-8f5a-405b-9ab8-f5ab23b71cfd/1dae224b-0284-49f7-b7c9-5f80d9ef8b32)

## GLUE (General Language Understanding Evaluation)

Created from the GLUE benchmark as uploaded to the [Hugging Face Hub](https://huggingface.co/datasets/glue).
[COLA](https://atlas.nomic.ai/map/2d5544f1-124e-4d28-b9de-f7165c000fe0/62fefbab-8c0d-4039-857e-d6f79c475f49) (10,657 datums)

[SST2](https://atlas.nomic.ai/map/0e4facdc-f707-4b8d-aed3-4e47b30e3b23/5458da4d-1956-4ae7-bff3-f8c97d8c3436) (70,042 datums)

[MRPC](https://atlas.nomic.ai/map/63374bb4-f7de-4709-8935-bba0a018b0e6/a80fdb79-98fa-4109-8504-50088340d8fd) (5,801 datums)

[QQP](https://atlas.nomic.ai/map/a63789f5-9e29-44c7-8153-2977e1155751/9004a23e-072d-417c-affb-dd22f6675b53) (795,241 datums)

[STS-B](https://atlas.nomic.ai/map/4f802e26-a007-4234-b02d-247845b75344/e20e7b05-7823-4d1a-80ad-de8065beb470) (8,628 datums)

[MNLI](https://atlas.nomic.ai/map/5e7d74d7-739f-4048-8e33-fed722d259c0/7654ca81-d43a-41d9-a9de-9941e1a59756) (431,992 datums)

[MNLI Mismatched](https://atlas.nomic.ai/map/2abb7e80-42b4-44b4-8c2a-3e51fc7c604d/2910d366-b6a0-48aa-bdd5-5e4d75c936cf) (19,679 datums)

[MNLI Matched](https://atlas.nomic.ai/map/8e74d920-abce-4ba1-8f5d-a7c0a695715d/28106df8-d29f-4951-b096-007933eef9fd) (19,611 datums)

[QNLI](https://atlas.nomic.ai/map/e14b375b-4f26-4e92-810c-161b44df896c/6dc04862-7838-44d7-beaa-19b190129115) (115,669 datums)

[RTE](https://atlas.nomic.ai/map/32217c03-defd-4204-8f0d-879c86439cb4/4568200b-4506-463f-a498-a84918dc5ecf) (5,767 datums)

[WNLI](https://atlas.nomic.ai/map/35a42a5b-2d47-4217-8451-f56d272ffe7c/1becaa35-d7b2-4114-8d0e-6a9496b81608) (852 datums)

[AX](https://atlas.nomic.ai/map/d691163c-42ec-460a-9631-1df166c7b6b5/148193ef-a315-43c0-a3ea-f98da28062ee) (1,104 datums)

---

# Advanced Walkthrough

Maps made in Atlas dynamically update to reflect the underlying data stored in the project. When you add, update or delete data in an AtlasProject, the underlying map records your changes.

!!! note "Project Lock"

    Addition, deletion and update operations on a project's data can only occur when the project's transaction lock is released. This lock is present while any map is building on the project. You can check whether the lock is set with the `is_locked` property.
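The lock pattern described above can be sketched as control flow. A stand-in class is used here so the snippet runs without an Atlas account (`StubProject` and its behavior are illustrative); on a real project you would use the same two members of AtlasProject, the `is_locked` property and the `wait_for_project_lock()` context manager.

```python
from contextlib import contextmanager

class StubProject:
    """Stands in for an AtlasProject so this sketch runs offline."""

    def __init__(self):
        self.is_locked = True  # a map is currently building

    @contextmanager
    def wait_for_project_lock(self):
        # A real project blocks here until every map finishes building.
        self.is_locked = False
        yield self

project = StubProject()
print(project.is_locked)      # True: add/update/delete operations would fail now

with project.wait_for_project_lock():
    print(project.is_locked)  # False: safe to mutate project data here
```

With a real AtlasProject, `wait_for_project_lock()` blocks the current thread until every map in the project finishes building.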
Changes you make to your project's data do not immediately appear on your map. You must explicitly rebuild your map to incorporate your changes into the map's state. You can commit project data manipulations to your map's state by running the [rebuild_maps](atlas_api.md) method on your AtlasProject.

## Adding data

In the below example, we will create a map and then add data to it. To add data to a project, use the `add_embeddings` or `add_text` method depending on your project's modality.

The first set of data added to the project contains 1000 random embeddings in 10 dimensions. For tracking purposes, we associate and color each embedding with a metadata field called `upload`, signifying whether the embedding was part of the first or second set of data added to the project.

=== "First Upload"

    ``` py
    from nomic import atlas
    import numpy as np

    num_embeddings = 1000
    embeddings = np.random.rand(num_embeddings, 10)
    data = [{'upload': '1', 'id': i} for i in range(len(embeddings))]

    project = atlas.map_embeddings(embeddings=embeddings,
                                   data=data,
                                   id_field='id',
                                   name='A Map That Gets Updated',
                                   colorable_fields=['upload'],
                                   reset_project_if_exists=True)

    map = project.get_map('A Map That Gets Updated')
    print(map)
    ```

The second upload contains 1000 random embeddings of dimension 10 but with a shifted mean. The resulting map contains two clusters: one for the first upload and one for the second upload with the shifted mean.

=== "Second upload"

    ``` py
    total_datums = project.total_datums

    # embeddings with shifted mean.
    embeddings += np.ones(shape=(num_embeddings, 10))
    data = [{'upload': '2', 'id': total_datums + i} for i in range(len(embeddings))]

    with project.wait_for_project_lock():
        project.add_embeddings(embeddings=embeddings, data=data)
        project.rebuild_maps()
    ```
!!! note "Project Lock Context Manager"

    Place any logic that needs to wait for a project lock to be released behind the `project.wait_for_project_lock()` context manager. This context manager blocks the currently running thread until all maps in your project are done building. For large projects, this may take a long time.

## Deleting Data

You can delete data with the `delete_data` method on an AtlasProject. Following the previous example:

=== "Deleting data"

    ``` py
    with project.wait_for_project_lock():
        project.delete_data(ids=[i for i in range(1100, 2000)])
        project.rebuild_maps()
    ```

One cluster in your map should now be 1/10 the size of the other cluster.

---

# How Atlas Works

Atlas is a platform for visually and programmatically interacting with massive unstructured datasets of text documents, images and embeddings.

## Data model

Atlas lets you store and manipulate data like a standard noSQL document engine. On upload, your data is stored in an abstraction called a `Project`. You can add, update, read and delete (CRUD) data in a project via API calls from the Atlas Python client.

#### What kind of data can I store in Atlas?

Atlas can natively store:

* [Embedding vectors](https://vaclavkosar.com/ml/Embeddings-in-Machine-Learning-Explained)
* Text documents

Our roadmap includes first-class support for data modalities such as images, audio and video. You can still store images, audio and video in Atlas now, but you must generate embeddings for them yourself.

Data stored in an Atlas `Project` is semantically indexed by Atlas. This indexing allows you to interact with, view and search through your dataset via meaning instead of matching on words.

#### How does Atlas semantically index data?

Atlas semantically indexes unstructured data by:

1. Converting data points into embedding vectors (if they aren't embeddings already)
2. Organizing the embedding vectors for *fast semantic search* and *human interpretability*

If you have embedding vectors for your data from an embedding API such as OpenAI or Cohere, you can attach them during upload. If you don't already have embedding vectors for your data points, Atlas creates them by running your data through neural networks that semantically encode your data points. For example, if you upload text documents, Atlas runs them through neural networks that semantically encode text. It is often cheaper and faster to use Atlas' internal embedding models than external embedding APIs.

## How is Atlas different from a noSQL database?

Unlike existing data stores, Atlas is built with embedding vectors as first-class citizens. [Embedding vectors](https://vaclavkosar.com/ml/Embeddings-in-Machine-Learning-Explained) are representations of data that computers can semantically manipulate. Under the hood, most operations you do in Atlas are performed on embeddings.

## Atlas makes embeddings human interpretable

Despite their utility, embeddings cannot be easily interpreted because they reside in high dimensions. During indexing, Atlas builds a contextual [two-dimensional data map](https://atlas.nomic.ai/map/stablediffusion) of embeddings. This map preserves the high-dimensional relationships present between embeddings in a two-dimensional, human-interpretable view.

### Reading an Atlas Map

Atlas Maps lay out your dataset contextually. We will use the above [map of news articles](https://atlas.nomic.ai/map/22bb6eb0-04c9-4aa0-a138-d860b83c1057/229deb96-fc59-4d40-acb6-52b32590887f) generated by Atlas to describe how to read Maps.

An Atlas Map has the following properties:

1. **Points close to each other on the map are semantically similar/related**. For example, all news articles about sports are at the bottom of the map. Inside the sports region, the map breaks down by type of sport because news articles about a fixed sport (e.g.
baseball) have more similarity to each other than to news articles about other types of sports (e.g. tennis).
2. **Relative distances between points correlate with semantic relatedness, but the numerical distance between 2D point positions does not have meaning**. For example, the observation that the Tennis and Golf news article clusters are adjacent signifies a relationship between Tennis and Golf in the embedding space. You should not, however, make claims or draw conclusions using the Euclidean distance between points in the two clusters. Distance information is only meaningful in the ambient embedding space and can be retrieved with [vector_search](vector_search_in_atlas.md).
3. **Floating labels correspond to distinct topics in your data**. For example, the Golf cluster has the label 'Ryder Cup'. Labels are automatically determined from the textual contents of your data and are crucial for navigating the Map.
4. **Topics have a hierarchy**. As you zoom around the Map, more granular versions of topics emerge.
5. **Maps update as your data updates**. When new data enters your project, Atlas can reindex the map to reflect how the new data relates to existing data.

All information and operations that are visually presented on an Atlas map have a programmatic analog. For example, you can access topic information and vector search through the Python client.

#### Technical Details

Atlas visualizes your embeddings in two dimensions using a non-linear dimensionality reduction algorithm. Atlas' dimensionality reduction algorithm is custom-built for scale, speed and dynamic updates. Nomic cannot share the technical details of the algorithm at this time.

#### Data Formats and Integrity

Atlas stores and transfers data using a subset of the [Apache Arrow](https://arrow.apache.org) standard. `pyarrow` is used to convert Python, pandas, and numpy data types to Arrow types; you can also pass any Arrow table (created by polars, duckdb, pyarrow, etc.)
directly to Atlas and the types will be automatically converted.

Before being uploaded, all data is converted with the following rules:

* Strings are converted to Arrow strings and stored as UTF-8.
* Integers are converted to 32-bit integers. (If you have larger integers, they are probably either IDs, in which case you should convert them to strings, or a field you want to perform analysis on, in which case you should convert them to floats.)
* Floats are converted to 32-bit (single-precision) floats.
* Embeddings, regardless of precision, are uploaded as 16-bit (half-precision) floats and stored in Arrow as a FixedSizeList.
* All dates and datetimes are converted to Arrow timestamps with millisecond precision and no time zone. (If you have a use case that requires timezone information or micro/nanosecond precision, please let us know.)
* Categorical types (called 'dictionary' in Arrow) are supported, but values stored as categorical must be strings.

Other data types (including booleans, binary, lists, and structs) are not supported. All fields besides embeddings and the user-specified ID field are nullable.

## Permissions and Privacy

To create a Project in Atlas, you must first sign up for an account and obtain an API key. Projects you create in Atlas have configurable permissions and privacy levels.

When you create a project, its ownership is assigned to your Atlas team. You can add people to this team to collaborate on projects together. For example, if you want to invite someone to help you tag points on an Atlas Map, you would add them to your team and give them the appropriate editing permissions on your project.

---

# Atlas

Meet Atlas - a platform for interacting with both small and internet-scale unstructured datasets.
Atlas enables you to:

* Store, update and organize multi-million point datasets of unstructured text, images and embeddings.
* [Visually interact](how_does_atlas_work.md#atlas-makes-embeddings-human-interpretable) with embeddings of your data from a web browser.
* Operate over unstructured data and embeddings with [topic modeling](map_state/topics.md), [semantic duplicate clustering](map_state/duplicates.md) and [semantic search](vector_search_in_atlas.md).
* [Generate high-dimensional and two-dimensional](map_state/embeddings.md) embeddings of your data.

Use Atlas to:

- [Visualize, interact, collaborate and share large datasets of text and embeddings.](map_your_data.md)
- [Collaboratively curate your unstructured datasets (clean, tag and label)](data_exploration_cleaning_tagging_in_atlas.ipynb)
- [Build high-availability apps powered by semantic search](https://langchain.readthedocs.io/en/latest/ecosystem/atlas.html)
- [Understand and debug the latent space of your AI model training runs](pytorch_embedding_explorer.ipynb)

Read about [how Atlas works](how_does_atlas_work.md) or get started below!

## Quickstart

Install the Nomic client with:

```bash
pip install nomic
```

Login/create your Nomic account:

```bash
nomic login
```

Follow the instructions to obtain your access token. Enter your access token with:

```bash
nomic login [token]
```

You are ready to interact with Atlas. Continue on to [make your first data map](map_your_data.md).
=== "Mapping Embeddings"

    ``` py title="map_embeddings.py"
    from nomic import atlas
    import numpy as np

    num_embeddings = 10000
    embeddings = np.random.rand(num_embeddings, 256)

    project = atlas.map_embeddings(embeddings=embeddings)
    print(project.maps)
    ```

## Resources

[Make your first neural map.](map_your_data.md)

[How does Atlas work?](how_does_atlas_work.md)

[Collection of maps.](collection_of_maps.md)

## Example maps

[Twitter](https://atlas.nomic.ai/map/twitter) (5.4 million tweets)

[Stable Diffusion](https://atlas.nomic.ai/map/stablediffusion) (6.4 million images)

[NeurIPS Proceedings](https://atlas.nomic.ai/map/neurips) (16,623 documents)

[ICLR 2018-2023 Submissions](https://atlas.nomic.ai/map/iclr)

[MNIST Logits](https://atlas.nomic.ai/map/2a222eb6-8f5a-405b-9ab8-f5ab23b71cfd/1dae224b-0284-49f7-b7c9-5f80d9ef8b32)

## About us

[Nomic](https://home.nomic.ai) is the world's first *information cartography* company. We believe that the fastest way to understand your data is to look at it.

---

Atlas stores your original project data, such as text and numeric fields, providing a unified source for all your dataset's information. These fields are displayed with each point on your Atlas Map. You can access these uploaded fields programmatically by using the `data` attribute of an AtlasMap. This is helpful if you would like to perform operations on Atlas artifacts, such as embedding or topic information, alongside your original data.

```python
from nomic import AtlasProject

map = AtlasProject(name='My Project').maps[0]
map.data
```

::: nomic.data_operations.AtlasMapData
    options:
      show_root_heading: True

---

Atlas groups your data into semantically similar duplicate clusters powered by latent information contained in your embeddings. Under the hood, Atlas utilizes an algorithm similar to [SemDeDup](https://arxiv.org/abs/2303.09540). You can access and operate on semantic duplicate clusters programmatically by using the `duplicates` attribute of an AtlasMap.
Make sure to enable duplicate clustering by setting `detect_duplicate=True` when building a map.

```python
from nomic import AtlasProject

map = AtlasProject(name='My Project').maps[0]
map.duplicates
```

::: nomic.data_operations.AtlasMapDuplicates
    options:
      show_root_heading: True

---

Atlas stores, manages and generates embeddings for your unstructured data. You can access your embeddings in Atlas' latent (high-dimensional) space or in their two-dimensional projected representation.

```python
from nomic import AtlasProject

map = AtlasProject(name='My Project').maps[0]

projected_embeddings = map.embeddings.projected
latent_embeddings = map.embeddings.latent

print(f"The datapoint with id {projected_embeddings['id'][0]} is located at "
      f"({projected_embeddings['x'][0]}, {projected_embeddings['y'][0]}) "
      f"with latent embedding {latent_embeddings[0]}")
```

::: nomic.data_operations.AtlasMapEmbeddings
    options:
      show_root_heading: True

---

Atlas allows you to visually and programmatically associate tags with datapoints. Tags can be added collaboratively by anyone allowed to edit your Atlas Project. You can access and operate on your assigned tags by using the `tags` attribute of an AtlasMap.

```python
from nomic import AtlasProject

map = AtlasProject(name='My Project').maps[0]
map.tags
```

::: nomic.data_operations.AtlasMapTags
    options:
      show_root_heading: True

---

Atlas pre-organizes your data into topics informed by the latent contents of your embeddings. Visually, these are represented by regions of homogeneous color on an Atlas map. You can access and operate on topics programmatically by using the `topics` attribute of an AtlasMap.

```python
from nomic import AtlasProject

map = AtlasProject(name='My Project').maps[0]
map.topics
```

::: nomic.data_operations.AtlasMapTopics
    options:
      show_root_heading: True

---

# Map Your Embeddings

Atlas ingests unstructured data such as embeddings or text and organizes it. Once your data is in Atlas, you can view *all of it* at once on an interactive map.
Any interaction you do on the map (e.g. tagging, topic labeling, vector search) can be accessed programmatically in this Python client.

## Your first neural map

The following code snippet shows you how to map your embeddings with Atlas. Upload 10,000 random embeddings and see them instantly organized on an interactive map.

[Random Embedding Map](https://atlas.nomic.ai/map/82e15baf-5de2-4191-bc60-61ce9d76bd17/91e63b2d-b8af-4de2-a4d2-e6e96d879274)

=== "Basic Example"

    ``` py title="map_embeddings.py"
    from nomic import atlas
    import numpy as np

    num_embeddings = 10000
    embeddings = np.random.rand(num_embeddings, 256)

    project = atlas.map_embeddings(embeddings=embeddings)
    ```

=== "Output"

    ``` bash
    https://atlas.nomic.ai/map/82e15baf-5de2-4191-bc60-61ce9d76bd17/91e63b2d-b8af-4de2-a4d2-e6e96d879274
    ```

## Add some colors

Now let's add colors. To do this, specify the `data` key in the map call. This field should contain a list of dictionaries, one for each of your embeddings. In the `map_embeddings` call, specify the key you want to be able to color by. In our example, this key is `category`.

=== "Advanced Example"

    ``` py title="map_embeddings_with_colors.py"
    from nomic import atlas
    import numpy as np

    num_embeddings = 10000
    embeddings = np.random.rand(num_embeddings, 256)

    categories = ['rhizome', 'cartography', 'lindenstrauss']
    data = [{'category': categories[i % len(categories)], 'id': i}
            for i in range(len(embeddings))]

    project = atlas.map_embeddings(embeddings=embeddings,
                                   data=data,
                                   id_field='id',
                                   colorable_fields=['category'])
    ```

---

# Map your images, video and audio

Email us at `work@nomic.ai` and help build this!

---

# Map Your Text

Map your text documents with Atlas using the `map_text` function. Atlas will ingest your documents, organize them with state-of-the-art AI and then serve you back an interactive map. Any interaction you do with your data (e.g. tagging) can be accessed programmatically with the Atlas Python API.
## Map text with Atlas

When sending text, you should specify an `indexed_field` in the `map_text` function. This lets Atlas know which metadata field to use when building your map.

=== "Atlas Embed"

    ``` py title="map_text_with_atlas.py"
    from nomic import atlas
    import numpy as np
    from datasets import load_dataset

    # Make a dataset with the shape [{'col1': 'val', 'col2': 'val', ...}, ...]
    # Tip: if you're working with a pandas DataFrame,
    # use pandas.DataFrame.to_dict('records')
    dataset = load_dataset('ag_news')['train']

    max_documents = 10000
    subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
    documents = [dataset[i] for i in subset_idxs]

    project = atlas.map_text(data=documents,
                             indexed_field='text',
                             name='News 10k Example',
                             colorable_fields=['label'],
                             description='News 10k Example.')
    ```

=== "Output"

    ``` bash
    https://atlas.nomic.ai/map/0642e9a1-12d9-4504-a987-9ca50ecd5327/699afdee-cea0-4805-9c84-12eca6dbebf8
    ```

## Map text with your own models

Nomic integrates with embedding providers such as [co:here](https://cohere.ai/) and [Hugging Face](https://huggingface.co/models) to help you build maps of text.

### Text maps with a 🤗 Hugging Face model

This code snippet is a complete example of how to make a map with a Hugging Face model.

[Example Hugging Face Map](https://atlas.nomic.ai/map/60e57e91-c573-4d1f-85ac-2f00f2a075ae/f5bf58cf-f40b-439d-bd0d-d3a4a8b98496)

!!! note
    This example requires additional packages.
Install them with:

```bash
pip install datasets transformers torch
```

=== "Hugging Face Example"

    ``` py title="map_with_huggingface.py"
    from nomic import atlas
    from transformers import AutoTokenizer, AutoModel
    import numpy as np
    import torch
    from datasets import load_dataset

    # make dataset
    max_documents = 10000
    dataset = load_dataset("sentiment140")['train']
    documents = [dataset[i] for i in np.random.choice(len(dataset), size=max_documents, replace=False).tolist()]

    model = AutoModel.from_pretrained("prajjwal1/bert-mini")
    tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-mini")

    embeddings = []

    with torch.no_grad():
        batch_size = 10  # lower this if needed
        for i in range(0, len(documents), batch_size):
            batch = [document['text'] for document in documents[i:i+batch_size]]
            encoded_input = tokenizer(batch, return_tensors='pt', padding=True)
            cls_embeddings = model(**encoded_input)['last_hidden_state'][:, 0]
            embeddings.append(cls_embeddings)

    embeddings = torch.cat(embeddings).numpy()

    response = atlas.map_embeddings(embeddings=embeddings,
                                    data=documents,
                                    colorable_fields=['sentiment'],
                                    name="Huggingface Model Example",
                                    description="An example of building a text map with a huggingface model.")
    print(response)
    ```

=== "Output"

    ``` bash
    https://atlas.nomic.ai/map/60e57e91-c573-4d1f-85ac-2f00f2a075ae/f5bf58cf-f40b-439d-bd0d-d3a4a8b98496
    ```

### Text maps with a Cohere model

Obtain an API key from [cohere.ai](https://os.cohere.ai) to embed your text data. Add your Cohere API key to the below example to see how their large language model organizes text from a sentiment analysis dataset.

[Sentiment Analysis Map](https://atlas.nomic.ai/map/63b3d891-f807-44c5-abdf-2a95dad05b41/db0fa89e-6589-4a82-884b-f58bfb60d641)

!!! note
    This example requires additional packages.
Install them with:

```bash
pip install datasets
```

=== "Co:here Example"

    ``` py title="map_hf_dataset_with_cohere.py"
    from nomic import atlas
    from nomic import CohereEmbedder
    import numpy as np
    from datasets import load_dataset

    cohere_api_key = ''

    dataset = load_dataset("sentiment140")['train']

    max_documents = 10000
    subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
    documents = [dataset[i] for i in subset_idxs]

    embedder = CohereEmbedder(cohere_api_key=cohere_api_key)

    print(f"Embedding {len(documents)} documents with Cohere API")
    embeddings = embedder.embed(texts=[document['text'] for document in documents],
                                model='small')

    if len(embeddings) != len(documents):
        raise Exception("Embedding job failed")
    print("Embedding job complete.")

    response = atlas.map_embeddings(embeddings=np.array(embeddings),
                                    data=documents,
                                    colorable_fields=['sentiment'],
                                    name='Sentiment 140',
                                    description='A 10,000 point sample of the huggingface sentiment140 dataset embedded with the co:here small model.')
    print(response)
    ```

=== "Output"

    ``` bash
    https://atlas.nomic.ai/map/ff2f89df-451e-49c4-b7a3-a608d7375961/f433cbd1-e728-49da-8c83-685cd613788b
    ```

---

# Mapping FAQ

Frequently asked questions about Atlas maps.

## Mapping Latency

Map creation latency once Nomic has received your embeddings.

| Number of datums/embeddings | Map availability latency (s) |
|:---------------------------:|:----------------------------:|
| 10,000                      | instant                      |
| 10,001 - 99,999             | 10-40                        |
| 100,000 - 499,999           | 40-180                       |
| 500,000 - 999,999           | 180-600                      |
| 1,000,000 - 9,999,999       | 600+                         |

## Who can see my maps?

When you create a map, you can toggle it as private or public. Private maps are only accessible by authenticated individuals in your Nomic organization. Public maps are accessible by anyone with a link.
=== "Atlas Client Private Map Example"

    ``` py title="map_embeddings_private.py"
    from nomic import atlas
    import numpy as np

    num_embeddings = 10000
    embeddings = np.random.rand(num_embeddings, 256)

    response = atlas.map_embeddings(embeddings=embeddings,
                                    is_public=False,
                                    organization_name='my_organization')
    print(response)
    ```

## How do I login from the client?

You can login to your Atlas account from the Python client by getting an API key. If you are logged into the Atlas dashboard in your web browser, you can find it [here](https://atlas.nomic.ai/cli-login). Either login in a command shell by running `nomic login` or in a Python file with:

```py
import nomic
nomic.login('Nomic API KEY')
```

## Making maps under an organization

If you are added to a Nomic organization by someone (such as your employer), you can create projects under it by specifying an `organization_name` in the `map_embeddings` method of the AtlasClient. By default, projects are made under your own account.

## Working with Dates and Timestamps

Atlas treats metadata as timestamps when it is passed as Python `date` or `datetime` objects. Under the hood, these are converted into timestamps compatible with the Apache Arrow standard. Remember, you can directly pass pandas DataFrame objects and Arrow tables to the `add_*` endpoints.

## How do I make maps of a dataset I have already uploaded?

You need to make a new index on the project you uploaded your data to. See [How does Atlas work?](how_does_atlas_work.md) for details.

## Disabling logging

Nomic utilizes the `loguru` module for logging. We recognize that logging can sometimes be annoying. You can disable or change the logging level by including the following snippet at the top of any script.

```py
from loguru import logger
import sys

logger.remove(0)
logger.add(sys.stderr, level="ERROR", filter='nomic')
```

---

# Release Notes

## v2.0.16

Bugfix affecting automatic renewal of enterprise login credentials.
## v2.0.14

Improvements for stability and scale.

## v2.0.0

Atlas map state is now accessible as top-level attributes of the AtlasProjection ("the Map") class.

```python
from nomic import AtlasProject

project = AtlasProject(name='My Project')
map = project.maps[0]
```

- Topics (`map.topics`)
- Embeddings (`map.embeddings`)
- Semantic Duplicate Clusters (`map.duplicates`)
- Tagging (`map.tags`)

The `Accessing Atlas State` section of the documentation shows some of the ways you can use this data.

## v1.1.7

Raise correct errors on bad id uploads.

## v1.1.0

### New Data validation

1. Uploads are now internally handled as Arrow tables, allowing greater type safety and data throughput.
2. In addition to passing lists of dicts, you can directly pass pandas DataFrames or pyarrow tables to any upload methods.
3. Datetime formats are now passed as native Python dates or datetimes (or as pandas dates or datetimes). ISO-formatted strings will no longer be automatically coerced; pass your own.
4. Null values are now allowed in any fields except for embeddings and ids. These can be passed either by setting the key to None, omitting a key from a dictionary, or using a pandas null type.
5. Typechecking is stricter than before, with the aim of raising errors on the client side sooner.

### Deprecations

- `shard_size` and `num_workers` are deprecated.

## v1.0.25

**Tagging**: The `get_tags` method will retrieve tags you have assigned to datapoints on the map.

**Progressive projects**: You can now call the `map_*` endpoints multiple times and specify the same project each time. Doing this will add data to the project. See the [documentation](dynamic_maps.md) for examples.

**shard_size**: You can now specify a `shard_size` in the `map_*` endpoints. If each datum is too large, you may want to use a smaller shard size to successfully send data to Atlas.

## v1.0.22

**ID fields**: Every datum by default has an id field attached. You no longer have to specify an id field when mapping data.
**Bug fixes**: Numerous bugs involving error handling are now squashed.

## v1.0.14

**Progressive Maps**: Maps can now be built progressively in Atlas. See the progressive map documentation for more information.

## v1.0.13

**Documentation Improvements**: Documentation was significantly altered for clarity and organization.

**Maps of text**: Atlas can now ingest your raw text data and handle the embedding for you. See the text map documentation for more details.

---

# Visualizing a Vector Database

Atlas is an interactive visual layer and debugger for vector databases. This tutorial will show you how you can visualize your Weaviate and Pinecone vector databases with Atlas.

## Why Visualize A Vector Database

Vector databases allow you to query your data semantically by indexing embedding vectors. By interactively visualizing embeddings, you can quickly:

- Understand the space of possible query results from your vector database
- Identify bad embeddings in your index which may produce poor query results

## Weaviate

!!! warning "Required Properties"
    When adding data to your Weaviate database, be sure to include the additional properties `id` and `vector`. This can be done by adding this code when importing data to the database: `_additional = {"vector", "id"}`

First you need your Atlas API token and a Weaviate database URL. If your database requires more authorization, add it to the client object.

```python
import weaviate
from nomic import AtlasProject
import numpy as np
import nomic

nomic.login("NOMIC API KEY")

client = weaviate.Client(
    url="WEAVIATE DATABASE URL",
)
```

Next we'll gather all of the classes and their respective properties from the database. To do this, we iterate through the database schema and append to the classes and properties lists.
```python
schema = client.schema.get()

classes = []
props = []
for c in schema["classes"]:
    classes.append(c["class"])
    temp = []
    for p in c["properties"]:
        if p["dataType"] == ["text"]:
            temp.append(p["name"])
    props.append(temp)
```

Now we will make a helper function that lets us map classes larger than 10,000 data points. It queries the database while allowing us to use a cursor to store our place.

```python
def get_batch_with_cursor(
    client, class_name, class_properties, batch_size, cursor=None
):
    query = (
        client.query.get(class_name, class_properties)
        .with_additional(["vector", "id"])
        .with_limit(batch_size)
    )
    if cursor is not None:
        return query.with_after(cursor).do()
    else:
        return query.do()
```

The rest of the tutorial happens inside a for loop. This allows us to create an Atlas Map for each of the classes in the database.

```python
for c, p in zip(classes, props):
```

!!! note "Map out only one class"
    If you would like to map only a single class, set `c` equal to the class name and `p` equal to a list of the class properties.

We will now create an Atlas Project, which will eventually contain all of our embeddings and data.

```python
project = AtlasProject(
    name=c,
    unique_id_field="id",
    modality="embedding",
)
```

Now we use a while loop to access all of the data from each class, which we do in batches using our helper function; in this case we use a batch size of 10,000. We break out of the while loop when a call to the helper function returns no values. We then set our cursor to the id of the datapoint we left off at and append the vectors to a list, which we convert into a numpy array.

!!! note "To Not Include Properties"
    To exclude a property, add the property name to the list titled `not_data`. If the property is an additional property, add the property name to `un_data`.

We then parse our data, including only the properties we want. Finally, we add the embeddings to our Atlas project along with our parsed data.
```python
cursor = None
while True:
    response = get_batch_with_cursor(client, c, p, 10000, cursor)
    if len(response["data"]["Get"][c]) == 0:
        break
    cursor = response["data"]["Get"][c][-1]["_additional"]["id"]
    vectors = []
    for i in response["data"]["Get"][c]:
        vectors.append(i["_additional"]["vector"])
    embeddings = np.array(vectors)
    data = []
    not_data = ["_additional"]
    un_data = ["vector"]
    for i in response["data"]["Get"][c]:
        j = {key: value for key, value in i.items() if key not in not_data}
        k = {
            key: value
            for key, value in i["_additional"].items()
            if key not in un_data
        }
        j = j | k
        data.append(j)
    with project.wait_for_project_lock():
        project.add_embeddings(
            embeddings=embeddings,
            data=data,
        )
```

Finally, we build our map with the given parameters using `create_index()`.

!!! note "Add Topic Labels"

    If you want labels on your Atlas map, add the following line of code with the name of the property you want to build the labels from: `topic_label_field="PROPERTY NAME"`

```python
project.create_index(
    name=c,
    colorable_fields=p,
    build_topic_model=True,
)
```

You can find the source code [here](https://github.com/nomic-ai/maps/blob/main/maps/weaviate_script.py).

## Pinecone

First, find your Pinecone and Atlas API keys.

```python
import pinecone
import numpy as np
from nomic import atlas
import nomic

pinecone.init(api_key='YOUR PINECONE API KEY', environment='us-east1-gcp')
nomic.login('YOUR NOMIC API KEY')
```

Below we will create an example Pinecone vector database index and fill it with 1,000 random embeddings.

!!! note "Use your own index"

    If you have an existing Pinecone index, you can skip this step and just import the index as usual.
```python
pinecone.create_index("quickstart", dimension=128, metric="euclidean", pod_type="p1")
index = pinecone.Index("quickstart")

num_embeddings = 1000
embeddings_for_pinecone = np.random.rand(num_embeddings, 128)

index.upsert([(str(i), embeddings_for_pinecone[i].tolist()) for i in range(num_embeddings)])
```

Next, you'll need the IDs of all of your embeddings to extract them from your Pinecone index. In our previous example, we just used the integers 0-999 as our IDs. Then, extract the embeddings out into a numpy array. Once you have embeddings, send them over to Atlas.

```python
vectors = index.fetch(ids=[str(i) for i in range(num_embeddings)])

ids = []
embeddings = []
for id, vector in vectors['vectors'].items():
    ids.append(id)
    embeddings.append(vector['values'])

embeddings = np.array(embeddings)

atlas.map_embeddings(embeddings=embeddings,
                     data=[{'id': id} for id in ids],
                     id_field='id')
```

You can find the full source code [here](https://github.com/nomic-ai/maps/blob/main/maps/pinecone_index.py).

---

# Similarity and Vector Search

Atlas supports vector search over maps. You can think of vector search as a programmatic way to access areas of a map. When you pick a point on a map and use it as input to the `vector_search` function, you get as output points close to your input point. These outputs are called neighbors.

```python
project = AtlasProject(name='Example Map')
map = project.maps[0]

print(project.get_data(ids=['42']))

neighbors, distances = map.embeddings.vector_search(ids=['42'])
print(project.get_data(ids=neighbors[0]))
```

## Applications

Vector search can be used to:

1. Programmatically access clusters or neighborhoods of datapoints on your map.
2. Clean your data by finding near duplicates and data points similar to unwanted ones.
3. Label your data by retrieving points near a point with a known label.
4. Similarity search your data for use cases like recommendation.

!!! note "Vector Search Operates in the Ambient Space"

    Vector search operates on the high dimensional (ambient) vectors corresponding to your data, not on the two dimensional map positions.

## Example

The following example showcases creating a map from 25,000 news articles and then performing a vector search.

First, create the map:

```python
from nomic import atlas, AtlasProject
import numpy as np
from datasets import load_dataset

dataset = load_dataset('ag_news')['train']

np.random.seed(0)
max_documents = 25000
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]
for idx, document in enumerate(documents):
    document['id'] = idx

project = atlas.map_text(data=documents,
                         indexed_field='text',
                         id_field='id',
                         name='News Dataset 25k',
                         colorable_fields=['label'],
                         description='News Dataset 25k'
                         )
```

Then run a vector search:

```python
project = AtlasProject(name='News Dataset 25k')
map = project.maps[0]

# batch two vector search queries into one request
query_document_ids = [0, 42]

with project.wait_for_project_lock():
    neighbors, distances = map.embeddings.vector_search(ids=query_document_ids, k=10)
    print(neighbors)

    data = project.get_data(ids=query_document_ids)
    for datum, datum_neighbors in zip(data, neighbors):
        neighbor_data = project.get_data(ids=datum_neighbors)
        print(f"The ten nearest neighbors to the query point {datum} are {neighbor_data}")
```

!!! note "Project Lock and Vector Search"

    You cannot run a `vector_search` against a map while its project is locked (e.g. while Atlas is building the map).
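To illustrate what "ambient space" means here, the sketch below computes nearest neighbors directly over synthetic high dimensional vectors with NumPy. This is not the Atlas API, just a minimal illustration of the distance computation vector search performs: neighbors are ranked by distance in the full embedding space, not by how close the points happen to land on the rendered two dimensional map.

```python
import numpy as np

# Synthetic stand-ins for ambient embeddings (illustration only).
rng = np.random.default_rng(0)
embeddings = rng.random((1000, 256))  # 1000 points in 256-dim ambient space

def nearest_neighbors(query_idx, k=10):
    # Euclidean distance from the query vector to every vector in the set.
    dists = np.linalg.norm(embeddings - embeddings[query_idx], axis=1)
    order = np.argsort(dists)
    # Skip position 0: the query is its own nearest neighbor at distance 0.
    return order[1:k + 1], dists[order[1:k + 1]]

neighbors, distances = nearest_neighbors(42, k=10)
print(len(neighbors))   # 10
```

The same idea also underlies the data cleaning application above: near-duplicate records show up as neighbor pairs with very small ambient distances.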