# Contents

Embedding your browser history with ChromaDB turns URLs and page metadata into numerical vectors, so you can search your history by meaning rather than by exact keywords.

This guide covers setting up a development environment, embedding browser history, and running advanced searches with ChromaDB. It also explains how to use local models and OpenAI-compatible services to convert web activity into structured, searchable data.

This guide suits developers exploring embeddings and anyone who wants to manage their digital footprint. You'll set up the tools, write search queries with filters, and finish with a working ChromaDB project you can build on.

TL;DR
  • Export Browser History: Use the Export Chrome History extension to get a CSV file

  • Set Up Environment: Install Daytona and create a workspace

  • Embed History: Run the `search.py` script to embed your history data

  • Search History: Perform searches using various filters and options

Preparations

Installing Prerequisites

Before you begin, ensure you have Daytona installed.

You can install Daytona by running the following command in your terminal:

curl -L https://download.daytona.io/daytona/install.sh | sudo bash

Daytona creates a controlled development environment where you can work on the project with all dependencies correctly configured, reducing the likelihood of compatibility issues.

Exporting Browser History

To work with your browser history, you’ll need to export it from your browser:

  • Install the Export Chrome History Extension: Download and install the Export Chrome History Chrome extension.

  • Export to CSV: Use the extension to export your browser history as a CSV file. Save it to a convenient location, such as ~/Downloads/history.csv.

Example:

visit_time,url,title,visit_count,typed_count
2023-08-10 12:34:56,http://example.com,Example Site,5,2
2023-08-09 11:22:33,http://anotherexample.com,Another Example,3,1
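To make the export format concrete, here is a minimal sketch of reading such a file with Python's standard `csv` module. The column names come from the sample above; the `read_history` helper is hypothetical and is not taken from the repository, and a real export may contain extra columns.

```python
import csv
import io

# Sample data copied from the example export above.
SAMPLE = """visit_time,url,title,visit_count,typed_count
2023-08-10 12:34:56,http://example.com,Example Site,5,2
2023-08-09 11:22:33,http://anotherexample.com,Another Example,3,1
"""

def read_history(handle):
    """Parse an exported history CSV into a list of dicts with integer counts."""
    rows = []
    for row in csv.DictReader(handle):
        # CSV values arrive as strings; convert the counters for later filtering.
        row["visit_count"] = int(row["visit_count"])
        row["typed_count"] = int(row["typed_count"])
        rows.append(row)
    return rows

rows = read_history(io.StringIO(SAMPLE))
print(rows[0]["title"], rows[0]["visit_count"])
```

For a file on disk, the same helper can be called with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.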

Setting Up the Development Environment

Creating a Daytona Workspace

With Daytona installed, you can create a workspace for the nkkko/history repository:

daytona create https://github.com/nkkko/history

This command clones the repository and sets up the necessary environment. Daytona manages dependencies and configurations, ensuring your environment is correctly set up for running the project scripts.

You can open the workspace in your IDE with the following command:

daytona code

Set Up OpenAI Service (Optional)

There are many hosted models compatible with the OpenAI API. Here’s a basic rundown of available services:

| | Azure | OpenAI | Cohere | Anthropic | Jurassic-2 |
|---|---|---|---|---|---|
| Model Size | 175B+ parameters | 1B-175B+ parameters | 6B-12B parameters | 100B+ parameters | 20B-178B parameters |
| Context Length | 4,000-32,000 tokens | 4,000-32,000 tokens | Up to 2,048 tokens | Up to 100,000 tokens | Up to 2,048 tokens |
| Vector Size | 1,536 dimensions | 1,536 dimensions | 512-768 dimensions | 768-1,024 dimensions | 1,024-2,048 dimensions |
| Time for Embedding | 0.5 - 6 seconds (varies) | 0.5 - 6 seconds (varies) | 0.2 - 4 seconds (varies) | 0.5 - 6 seconds (varies) | 1 - 5 seconds (varies) |
| API Cost | $0.0004-$0.0120 per token | $0.0004-$0.0120 per token | $0.0001-$0.0005 per token | $0.0005-$0.0030 per token | $0.0005-$0.0020 per token |

Key Notes

  1. Model Size:

    Anthropic: Claude models are estimated to be in the 100B+ range, but exact figures are not officially confirmed.

    Azure & OpenAI: The sizes refer to GPT-3 and GPT-4 models. GPT-3 has 175B parameters; GPT-4 is larger, but the exact size is proprietary.

  2. Context Length:

    OpenAI and Anthropic: Both platforms offer models with larger context lengths, particularly in newer versions.

    Cohere & Jurassic-2: Smaller context lengths compared to OpenAI’s latest models.

  3. Time for Embedding:

    This is highly variable and depends on multiple factors, such as the specific model, server load, and input length. The provided ranges should be taken as rough estimates rather than definitive benchmarks.

  4. API Cost:

    Cohere, Anthropic, and Jurassic-2: These ranges are more general and reflect common usage scenarios. Always check the latest pricing from providers, as it can change.

    Azure & OpenAI: The cost can vary significantly based on the model, with GPT-4 generally costing more per token than GPT-3.

Environment Variables

Ensure you have the necessary environment variables configured in a local .env file. For example, this is how you would configure Azure:

AZURE_API_VERSION=<your_azure_api_version>
AZURE_ENDPOINT=<your_azure_endpoint>
AZURE_OPENAI_API_KEY=<your_azure_api_key>
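As an illustration of what a `.env` file holds, the sketch below parses `KEY=VALUE` lines using only the standard library. The `parse_env` helper is hypothetical; projects commonly rely on a package such as python-dotenv for this, and whether the repository does so is an assumption.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

# Placeholder values, exactly as in the example above.
example = """
# Azure settings
AZURE_API_VERSION=<your_azure_api_version>
AZURE_ENDPOINT=<your_azure_endpoint>
AZURE_OPENAI_API_KEY=<your_azure_api_key>
"""

settings = parse_env(example)
print(sorted(settings))
```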

Embedding and Searching Your Browser History

The script requires the sentence_transformers module, so install it before running any of the commands below.

You can install the dependency with the following command:

pip install sentence_transformers

Embedding

With your development environment ready, you can now embed your browser history into ChromaDB.

To embed using the local model (multi-qa-distilbert-cos-v1), run:

python search.py --embed path/to/your/history.csv # Ex: ~/Downloads/history.csv

This command reads each entry from the CSV, processes it into an embedding, and stores it in ChromaDB. The local model multi-qa-distilbert-cos-v1 is lightweight and works well for most purposes.
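The following is a rough sketch of what such an embedding step could look like, not the actual contents of `search.py`. It assumes the CSV columns from the sample export, a hypothetical collection name `history`, a hypothetical storage path `chroma_db`, and metadata field names chosen for illustration. The `chromadb` and `sentence_transformers` imports sit inside the function so the pure-Python helper can be tried without them.

```python
from urllib.parse import urlparse

def row_to_record(index, row):
    """Turn one CSV row (a dict) into an id, a document string, and metadata."""
    document = f"{row['title']} {row['url']}"
    metadata = {
        "url": row["url"],
        "domain": urlparse(row["url"]).netloc,
        "visit_time": row["visit_time"],
        "visit_count": int(row["visit_count"]),
        "typed_count": int(row["typed_count"]),
    }
    return f"visit-{index}", document, metadata

def embed_rows(rows, db_path="chroma_db", collection_name="history"):
    """Embed rows with the local model and store them in a Chroma collection."""
    # Third-party packages, imported lazily (assumed to be installed).
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("multi-qa-distilbert-cos-v1")
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(name=collection_name)

    records = [row_to_record(i, row) for i, row in enumerate(rows)]
    ids = [r[0] for r in records]
    documents = [r[1] for r in records]
    metadatas = [r[2] for r in records]
    # encode() returns a NumPy array; Chroma accepts plain lists of floats.
    embeddings = model.encode(documents).tolist()
    collection.add(ids=ids, documents=documents,
                   embeddings=embeddings, metadatas=metadatas)
    return collection

sample_row = {
    "visit_time": "2023-08-10 12:34:56",
    "url": "http://example.com",
    "title": "Example Site",
    "visit_count": "5",
    "typed_count": "2",
}
record = row_to_record(0, sample_row)
print(record)
```

Storing the domain and counts as metadata is a design choice made here so that later searches can filter on them; the real script may organize its data differently.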

Using Azure Embeddings

If you’re leveraging Azure for embedding, use the --azure flag:

python search.py --embed path/to/your/history.csv --azure

Searching

Once your history is embedded, you can perform targeted searches using the search.py script.

To start a basic search, simply use:

python search.py "search query"
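Conceptually, an embedding search converts the query into a vector and ranks stored entries by how close their vectors are to it. The sketch below shows that idea with cosine similarity on made-up three-dimensional vectors; real embeddings have hundreds of dimensions, and ChromaDB performs this comparison internally, so the helpers here are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vector, stored):
    """Return stored titles ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), title)
              for title, vec in stored.items()]
    return [title for _, title in sorted(scored, reverse=True)]

# Toy vectors invented for this example, not real model output.
stored = {
    "AI ethics overview": [0.9, 0.1, 0.0],
    "Pasta recipes": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.1]
ranking = rank(query, stored)
print(ranking)
```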

To refine your results, consider using filters. For example, if you recently read an article on “AI ethics” on example.com but can’t recall the exact URL or title, you can narrow your search to that specific domain:

python search.py "AI ethics" --domain example.com

This command restricts the search to example.com, making it easier to locate the specific page.

If you want to find the most recent page you visited about “quantum computing,” you can sort the results by the latest entries:

python search.py "quantum computing" --newest

This will display the most recent pages related to “quantum computing,” which is useful for quickly revisiting the latest content.

For frequently visited pages, such as those related to a project management tool, you can filter by the number of visits:

python search.py "project tasks" --visit-count 10

This filter highlights pages you’ve visited at least ten times, likely indicating their importance.

If you’re trying to locate a specific coding tutorial site that you’ve manually typed the URL for multiple times, use:

python search.py "python tutorial" --typed-count 3

This focuses on pages where you’ve intentionally typed the URL, helping you find the exact tutorial.

To find a document where you remember directly typing the URL, use:

python search.py "project proposal" --transition typed

This command searches for pages accessed by typing the URL directly, excluding those reached through links.

For a more complex search, such as finding the most recent paper on “deep learning” that you’ve visited multiple times on researchsite.com, you can combine multiple filters:

python search.py "deep learning" --domain researchsite.com --newest --visit-count 5

This command combines domain restriction, recent sorting, and visit count filtering, allowing you to locate the most relevant content quickly.
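One plausible way to picture these flags is as a metadata filter plus an optional sort. The sketch below builds a filter in the dictionary form that Chroma's `where` argument accepts (`$eq`, `$gte`, and `$and` are Chroma operators) and sorts results by visit time. The metadata field names and both helper functions are assumptions for illustration; how `search.py` actually implements its flags is not shown in this guide.

```python
from datetime import datetime

def build_where(domain=None, min_visits=None, min_typed=None):
    """Build a Chroma-style metadata filter from optional constraints."""
    clauses = []
    if domain is not None:
        clauses.append({"domain": {"$eq": domain}})
    if min_visits is not None:
        clauses.append({"visit_count": {"$gte": min_visits}})
    if min_typed is not None:
        clauses.append({"typed_count": {"$gte": min_typed}})
    if not clauses:
        return None
    # Chroma expects one condition, or an explicit $and over several.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

def newest_first(metadatas):
    """Sort result metadata so the latest visit_time comes first."""
    return sorted(
        metadatas,
        key=lambda m: datetime.strptime(m["visit_time"], "%Y-%m-%d %H:%M:%S"),
        reverse=True,
    )

where = build_where(domain="researchsite.com", min_visits=5)
hits = newest_first([
    {"url": "http://researchsite.com/a", "visit_time": "2023-08-09 11:22:33"},
    {"url": "http://researchsite.com/b", "visit_time": "2023-08-10 12:34:56"},
])
print(where)
print([hit["url"] for hit in hits])
```

A dictionary like `where` could then be passed to a Chroma collection's `query` call through its `where` parameter, with the sort applied to the returned metadata.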

Real-world Use Cases

Understanding how to apply the concepts in this guide to real-world scenarios can help you better appreciate the power of embedding your browser history. Below are some practical examples of how this tool can be used in various contexts:

Personal Productivity Tracking

This tool helps you review how you spend your time online by analyzing your browser history. Compare how often you visit work sites (Google Docs, Trello) versus leisure sites (social media, entertainment), and use these insights for better time management.

Example:

python search.py "to do" --domain trello.com --newest
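The sample export records visit counts rather than time spent, so one rough proxy for where your attention goes is the total number of visits per domain. The sketch below adds these up from rows shaped like the exported CSV; the rows are invented for illustration and the `visits_per_domain` helper is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def visits_per_domain(rows):
    """Sum visit_count per domain across history rows."""
    totals = Counter()
    for row in rows:
        totals[urlparse(row["url"]).netloc] += int(row["visit_count"])
    return totals

# Made-up rows using the url and visit_count columns of the export.
rows = [
    {"url": "https://trello.com/b/abc", "visit_count": "12"},
    {"url": "https://trello.com/b/xyz", "visit_count": "3"},
    {"url": "https://example.com/feed", "visit_count": "7"},
]
totals = visits_per_domain(rows)
print(totals.most_common())
```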

Content Curation and Management

For content creators, managing online resources is vital. Embedding and indexing your browsing history enables quick retrieval of articles, videos, or research papers. When writing a blog post, you can find past sources with keyword searches or visit count filters.

Example:

python search.py "python" --typed-count 3 --newest

Academic Research

Researchers can streamline literature reviews and data gathering by embedding and searching browser history. For ongoing projects, they can easily retrieve previously visited papers, articles, or datasets. They can also filter results by academic databases or journals to focus on credible sources.

Example:

python search.py "machine learning" --domain arxiv.org --visit-count 2

Personalized Content Recommendations

Embedding your browser history can serve as a basis for personalized content recommendations. For example, if you frequently visit sites like Medium for data science articles, searching this history helps you quickly revisit posts or find similar content you have already read.

Example:

python search.py "data science" --domain medium.com --newest

Common Issues and Troubleshooting

CSV Embedding Failure

Problem:
The script fails to embed the CSV file.

Solution:
Ensure the CSV file is correctly formatted and accessible from the path provided. Double-check your Python environment and dependencies.
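A quick way to check the formatting is to compare the file's header against the columns in the sample export. This is a minimal sketch: the required column set is taken from that sample and may not match exactly what the script expects, and the `missing_columns` helper is hypothetical.

```python
import csv
import io

# Columns from the sample export; the script's real requirements may differ.
REQUIRED = {"visit_time", "url", "title", "visit_count", "typed_count"}

def missing_columns(handle):
    """Return the required column names absent from the CSV header."""
    reader = csv.DictReader(handle)
    # fieldnames is None for an empty file, so fall back to an empty list.
    return sorted(REQUIRED - set(reader.fieldnames or []))

good = io.StringIO("visit_time,url,title,visit_count,typed_count\n")
bad = io.StringIO("url,title\n")
print(missing_columns(good))
print(missing_columns(bad))
```

To check a real export, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory samples.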

Hosted Embeddings Not Applied

Problem:
The hosted embedding model is not being used despite setting environment variables.

Solution:
Verify that the .env file is correctly configured and located in the root directory. Ensure that the keys and endpoints are correct and match the required format.
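To narrow down which setting is the problem, a small check like the following can report variables that are unset or empty. It uses the three Azure variable names from the example earlier in this guide; the `unset_variables` helper is hypothetical, and it only sees variables present in the current process environment, so values that exist only in `.env` show up here after something has loaded that file.

```python
import os

AZURE_VARS = ("AZURE_API_VERSION", "AZURE_ENDPOINT", "AZURE_OPENAI_API_KEY")

def unset_variables(environ=os.environ):
    """Return the names of Azure variables that are missing or blank."""
    return [name for name in AZURE_VARS if not environ.get(name, "").strip()]

missing = unset_variables()
if missing:
    print("Missing or empty:", ", ".join(missing))
else:
    print("All Azure variables are set.")
```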

Inaccurate Search Results

Problem:
Search results don’t seem to match your query.

Solution:
Check if the correct embedding model is being used. If the local model is inadequate, use a hosted embedding service for more accurate results. Also, review your filters and search queries for specificity.

Conclusion

Following this guide, you’ve created a system to embed and search your browser history using ChromaDB. You now have a powerful tool for exploring your browsing habits, whether for productivity analysis, research, or personal curiosity.

You can continue experimenting with different embedding models and search parameters to refine your results. You can also consider expanding this setup to include other data types or integrating it into larger data analysis projects.