Embedding your browser history with ChromaDB transforms URLs, titles, and metadata into numerical vectors, enabling semantic search over your web activity.
This guide covers setting up a development environment, embedding browser history, and running advanced searches with ChromaDB. It also explains how to use local models and OpenAI-compatible services to convert web activity into structured, searchable data.
This guide is perfect for those eager to understand embeddings and explore their various applications. You'll learn to set up tools, craft detailed search queries with filters, and gain practical experience for future projects. This step-by-step guide is ideal for developers exploring embeddings or anyone looking to manage their digital footprint, helping you create a functional project with ChromaDB.
TL;DR
Export Browser History: Use the Export Chrome History extension to get a CSV file
Set Up Environment: Install Daytona and create a workspace
Embed History: Run the `search.py` script to embed your history data
Search History: Perform searches using various filters and options
Preparations
Installing Prerequisites
Before you begin, ensure you have Daytona installed.
You can install Daytona by running the following command in your terminal:
curl -L https://download.daytona.io/daytona/install.sh | sudo bash
Daytona creates a controlled development environment where you can work on the project with all dependencies correctly configured, reducing the likelihood of compatibility issues.
Exporting Browser History
To work with your browser history, you’ll need to export it from your browser:
Install the Export Chrome History Extension: Download and install the Export Chrome History Chrome extension.
Export to CSV: Use the extension to export your browser history as a CSV file. Save it to a convenient location, such as `~/Downloads/history.csv`.
Example:
visit_time,url,title,visit_count,typed_count
2023-08-10 12:34:56,http://example.com,Example Site,5,2
2023-08-09 11:22:33,http://anotherexample.com,Another Example,3,1
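Rows in this format are easy to turn into the documents and metadata that get embedded. Here is a minimal sketch in plain Python; the column names follow the sample above, but the exact text `search.py` actually embeds may differ:

```python
import csv
import io

# Sample matching the export format shown above.
SAMPLE = """visit_time,url,title,visit_count,typed_count
2023-08-10 12:34:56,http://example.com,Example Site,5,2
2023-08-09 11:22:33,http://anotherexample.com,Another Example,3,1
"""

def rows_to_documents(csv_text):
    """Turn each history row into a 'title url' document plus a metadata dict."""
    docs, metadatas = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        docs.append(f"{row['title']} {row['url']}")
        metadatas.append({
            "visit_time": row["visit_time"],
            "visit_count": int(row["visit_count"]),
            "typed_count": int(row["typed_count"]),
        })
    return docs, metadatas

docs, metas = rows_to_documents(SAMPLE)
print(docs[0])   # Example Site http://example.com
```

The documents are what the embedding model sees; the metadata stays alongside each vector so it can be used for filtering later.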
Setting Up the Development Environment
Creating a Daytona Workspace
With Daytona installed, you can create a workspace for the nkkko/history repository:
daytona create https://github.com/nkkko/history
This command clones the repository and sets up the necessary environment. Daytona manages dependencies and configurations, ensuring your environment is correctly set up for running the project scripts.
You can open the workspace in your IDE with the following command:
daytona code
Set Up OpenAI Service (Optional)
There are many hosted models compatible with the OpenAI API. Here’s a basic rundown of available services:
| | Azure | OpenAI | Cohere | Anthropic | Jurassic-2 |
|---|---|---|---|---|---|
| Model Size | 175B+ parameters | 1B-175B+ parameters | 6B-12B parameters | 100B+ parameters | 20B-178B parameters |
| Context Length | 4,000-32,000 tokens | 4,000-32,000 tokens | Up to 2,048 tokens | Up to 100,000 tokens | Up to 2,048 tokens |
| Vector Size | 1,536 dimensions | 1,536 dimensions | 512-768 dimensions | 768-1,024 dimensions | 1,024-2,048 dimensions |
| Time for Embedding | 0.5-6 seconds (varies) | 0.5-6 seconds (varies) | 0.2-4 seconds (varies) | 0.5-6 seconds (varies) | 1-5 seconds (varies) |
| API Cost | $0.0004-$0.0120 per token | $0.0004-$0.0120 per token | $0.0001-$0.0005 per token | $0.0005-$0.0030 per token | $0.0005-$0.0020 per token |
Key Notes
Model Size:
Anthropic: Claude models are estimated to be in the 100B+ range, but exact figures are not officially confirmed.
Azure & OpenAI: The sizes refer to GPT-3 and GPT-4 models. GPT-3 has 175B parameters; GPT-4 is larger, but the exact size is proprietary.
Context Length:
OpenAI and Anthropic: Both platforms offer models with larger context lengths, particularly in newer versions.
Cohere & Jurassic-2: Smaller context lengths compared to OpenAI’s latest models.
Time for Embedding:
This is highly variable and depends on multiple factors, such as the specific model, server load, and input length. The provided ranges should be taken as rough estimates rather than definitive benchmarks.
API Cost:
Cohere, Anthropic, and Jurassic-2: These ranges are more general and reflect common usage scenarios. Always check the latest pricing from providers, as it can change.
Azure & OpenAI: The cost can vary significantly based on the model, with GPT-4 generally costing more per token than GPT-3.
Environment Variables
Ensure you have the necessary environment variables configured in a local `.env` file. For example, this is how you would configure Azure:
AZURE_API_VERSION=<your_azure_api_version>
AZURE_ENDPOINT=<your_azure_endpoint>
AZURE_OPENAI_API_KEY=<your_azure_api_key>
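Loading such a file amounts to reading `KEY=value` lines into the process environment. Projects typically use the python-dotenv package for this; the stdlib-only sketch below shows the underlying logic and is handy for sanity-checking that your variables are actually being picked up:

```python
import os

def load_env_file(path=".env"):
    """Minimal .env reader: put KEY=value lines into os.environ (sketch only).

    Skips blank lines and comments; does not handle quoting or multiline
    values the way python-dotenv does.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Usage: load_env_file(), then read e.g. os.environ["AZURE_ENDPOINT"]
```

Note that `setdefault` means variables already present in the environment win over values in the file, which matches the common dotenv convention.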
Embedding and Searching Your Browser History
To use the script, the `sentence_transformers` package must be installed.
You can install the dependency with the following command:
pip install sentence_transformers
Embedding
With your development environment ready, you can now embed your browser history into ChromaDB.
To embed using the local model (multi-qa-distilbert-cos-v1), run:
python search.py --embed path/to/your/history.csv # Ex: ~/Downloads/history.csv
This command reads each entry from the CSV, processes it into an embedding, and stores it in ChromaDB. The local model `multi-qa-distilbert-cos-v1` is lightweight and works well for most purposes.
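Conceptually, what happens at search time is vector comparison: each page and each query become vectors, and results are ranked by cosine similarity. The toy illustration below uses made-up 3-dimensional vectors purely for clarity; the real model produces 768-dimensional ones:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up page vectors (in reality, produced by the embedding model).
pages = {
    "Example Site http://example.com": [0.9, 0.1, 0.0],
    "Another Example http://anotherexample.com": [0.2, 0.8, 0.1],
}
query_vec = [0.85, 0.15, 0.05]  # made-up vector for a search query

ranked = sorted(pages, key=lambda p: cosine(pages[p], query_vec), reverse=True)
print(ranked[0])  # Example Site http://example.com
```

ChromaDB performs this ranking for you at scale; the point here is only that "semantic search" means nearest-vector lookup, not keyword matching.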
Using Azure Embeddings
If you’re leveraging Azure for embedding, use the `--azure` flag:
python search.py --embed path/to/your/history.csv --azure
Searching
Once your history is embedded, you can perform targeted searches using the `search.py` script.
To start a basic search, simply use:
python search.py "search query"
To refine your results, consider using filters. For example, if you recently read an article on “AI ethics” on example.com but can’t recall the exact URL or title, you can narrow your search to that specific domain:
python search.py "AI ethics" --domain example.com
This command restricts the search to example.com, making it easier to locate the specific page.
If you want to find the most recent page you visited about “quantum computing,” you can sort the results by the latest entries:
python search.py "quantum computing" --newest
This will display the most recent pages related to “quantum computing,” which is useful for quickly revisiting the latest content. For frequently visited pages, such as those related to a project management tool, you can filter by the number of visits:
python search.py "project tasks" --visit-count 10
This filter highlights pages you’ve visited at least ten times, likely indicating their importance.
If you’re trying to locate a specific coding tutorial site that you’ve manually typed the URL for multiple times, use:
python search.py "python tutorial" --typed-count 3
This focuses on pages where you’ve intentionally typed the URL, helping you find the exact tutorial.
To find a document where you remember directly typing the URL, use:
python search.py "project proposal" --transition typed
This command searches for pages accessed by typing the URL directly, excluding those reached through links.
For a more complex search, such as finding the most recent paper on “deep learning” that you’ve visited multiple times on researchsite.com, you can combine multiple filters:
python search.py "deep learning" --domain researchsite.com --newest --visit-count 5
This command combines domain restriction, recent sorting, and visit count filtering, allowing you to locate the most relevant content quickly.
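Filters like these typically translate into a metadata filter applied alongside the vector search. The sketch below shows how the CLI flags might map onto a ChromaDB-style `where` clause; the flag-to-field mapping is an assumption about how `search.py` works, while the `$eq`/`$gte`/`$and` operators are ChromaDB's actual filter syntax:

```python
def build_where(domain=None, visit_count=None, typed_count=None, transition=None):
    """Build a ChromaDB-style metadata filter from optional CLI-flag values.

    Returns None when no filters apply, a single clause for one filter,
    and an $and of clauses for several.
    """
    clauses = []
    if domain is not None:
        clauses.append({"domain": {"$eq": domain}})
    if visit_count is not None:
        clauses.append({"visit_count": {"$gte": visit_count}})
    if typed_count is not None:
        clauses.append({"typed_count": {"$gte": typed_count}})
    if transition is not None:
        clauses.append({"transition": {"$eq": transition}})
    if not clauses:
        return None
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

print(build_where(domain="researchsite.com", visit_count=5))
```

A filter built this way would be passed as the `where` argument of a collection query, restricting candidates before similarity ranking.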
Real-world Use Cases
Understanding how to apply the concepts in this guide to real-world scenarios can help you better appreciate the power of embedding your browser history. Below are some practical examples of how this tool can be used in various contexts:
Personal Productivity Tracking
This tool can help you make better use of your online time by analyzing your browser history. Compare time spent on work sites (Google Docs, Trello) with leisure sites (social media, entertainment), and use these insights for better time management.
Example:
python search.py "to do" --domain trello.com --newest
Content Curation and Management
For content creators, managing online resources is vital. This guide helps embed and index browsing history, enabling quick retrieval of articles, videos, or research papers. When writing a blog post, easily find past sources with keyword searches or visit count filters.
Example:
python search.py "python" --typed-count 3 --newest
Academic Research
Researchers can streamline literature reviews and data gathering by embedding and searching browser history. For ongoing projects, they can easily retrieve previously visited papers, articles, or datasets. They can also filter results by academic databases or journals to focus on credible sources.
Example:
python search.py "machine learning" --domain arxiv.org --visit-count 2
Personalized Content Recommendations
Embedding your browser history enhances personalized content recommendations. For example, if you frequently visit sites like Medium for data science articles, this history helps you quickly find similar content or revisit posts, ensuring better-aligned recommendations.
Example:
python search.py "data science" --domain medium.com --newest
Common Issues and Troubleshooting
CSV Embedding Failure
Problem:
The script fails to embed the CSV file.
Solution:
Ensure the CSV file is correctly formatted and accessible from the path provided. Double-check your Python environment and dependencies.
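A quick way to rule out formatting problems is to check that the file's header row contains the expected columns. The sketch below assumes the column set from the CSV sample earlier in this guide; `search.py`'s actual requirements may differ:

```python
import csv

# Columns expected per the export sample earlier in this guide (an assumption).
REQUIRED = {"visit_time", "url", "title", "visit_count", "typed_count"}

def check_history_csv(path):
    """Return the set of required columns missing from the CSV header.

    An empty set means the header looks fine; a non-empty set names
    the columns you need to add or rename.
    """
    with open(path, newline="", encoding="utf-8") as f:
        header = set(next(csv.reader(f)))
    return REQUIRED - header
```

Running this before `--embed` turns a vague embedding failure into a concrete list of missing columns.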
Hosted Embeddings Not Applied
Problem:
The hosted embedding model is not being used despite setting environment variables.
Solution:
Verify that the `.env` file is correctly configured and located in the root directory. Ensure that the keys and endpoints are correct and match the required format.
Inaccurate Search Results
Problem:
Search results don’t seem to match your query.
Solution:
Check if the correct embedding model is being used. If the local model is inadequate, use a hosted embedding service for more accurate results. Also, review your filters and search queries for specificity.
Conclusion
Following this guide, you’ve created a system to embed and search your browser history using ChromaDB. You now have a powerful tool for exploring your browsing habits, whether for productivity analysis, research, or personal curiosity.
You can continue experimenting with different embedding models and search parameters to refine your results. You can also consider expanding this setup to include other data types or integrating it into larger data analysis projects.