Embedding your browser history with ChromaDB transforms URLs, titles, and metadata into numerical vectors, enabling semantic search over your web activity.
This guide covers setting up a development environment, embedding browser history, and running advanced searches with ChromaDB. It also explains how to use local models and OpenAI-compatible services to convert web activity into structured, searchable data.
This guide is perfect for those eager to understand embeddings and explore their various applications. You'll learn to set up tools, craft detailed search queries with filters, and gain practical experience for future projects. This step-by-step guide is ideal for developers exploring embeddings or anyone looking to manage their digital footprint, helping you create a functional project with ChromaDB.
TL;DR
Export Browser History: Use the Export Chrome History extension to get a CSV file
Set Up Environment: Install Daytona and create a workspace
Embed History: Run the `search.py` script to embed your history data
Search History: Perform searches using various filters and options
Preparations
Installing Prerequisites
Before you begin, ensure you have Daytona installed.
You can install Daytona by running the following command in your terminal:
curl -L https://download.daytona.io/daytona/install.sh | sudo bash
Daytona creates a controlled development environment where you can work on the project with all dependencies correctly configured, reducing the likelihood of compatibility issues.
Exporting Browser History
To work with your browser history, you’ll need to export it from your browser:
Install the Export Chrome History Extension: Download and install the Export Chrome History Chrome extension.
Export to CSV: Use the extension to export your browser history as a CSV file. Save it to a convenient location, such as `~/Downloads/history.csv`.
Example:
visit_time,url,title,visit_count,typed_count
2023-08-10 12:34:56,http://example.com,Example Site,5,2
2023-08-09 11:22:33,http://anotherexample.com,Another Example,3,1
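Rows in this format are easy to turn into the documents and metadata that get embedded. Here is a minimal sketch in plain Python; the column names follow the sample above, but the exact text `search.py` actually embeds may differ:

```python
import csv
import io

# Sample matching the export format shown above.
SAMPLE = """visit_time,url,title,visit_count,typed_count
2023-08-10 12:34:56,http://example.com,Example Site,5,2
2023-08-09 11:22:33,http://anotherexample.com,Another Example,3,1
"""

def rows_to_documents(csv_text):
    """Turn each history row into a 'title url' document plus a metadata dict."""
    docs, metadatas = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        docs.append(f"{row['title']} {row['url']}")
        metadatas.append({
            "visit_time": row["visit_time"],
            "visit_count": int(row["visit_count"]),
            "typed_count": int(row["typed_count"]),
        })
    return docs, metadatas

docs, metas = rows_to_documents(SAMPLE)
print(docs[0])   # Example Site http://example.com
```

The documents are what the embedding model sees; the metadata stays alongside each vector so it can be used for filtering later.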
Setting Up the Development Environment
Creating a Daytona Workspace
With Daytona installed, you can create a workspace for the nkkko/history repository:
daytona create https://github.com/nkkko/history
This command clones the repository and sets up the necessary environment. Daytona manages dependencies and configurations, ensuring your environment is correctly set up for running the project scripts.
You can open the workspace in your IDE with the following command:
daytona code
Set Up OpenAI Service (Optional)
There are many hosted models compatible with the OpenAI API. Here’s a basic rundown of available services:
| | Azure | OpenAI | Cohere | Anthropic | Jurassic-2 |
|---|---|---|---|---|---|
| Model Size | 175B+ parameters | 1B-175B+ parameters | 6B-12B parameters | 100B+ parameters | 20B-178B parameters |
| Context Length | 4,000-32,000 tokens | 4,000-32,000 tokens | Up to 2,048 tokens | Up to 100,000 tokens | Up to 2,048 tokens |
| Vector Size | 1,536 dimensions | 1,536 dimensions | 512-768 dimensions | 768-1,024 dimensions | 1,024-2,048 dimensions |
| Time for Embedding | 0.5-6 seconds (varies) | 0.5-6 seconds (varies) | 0.2-4 seconds (varies) | 0.5-6 seconds (varies) | 1-5 seconds (varies) |
| API Cost | $0.0004-$0.0120 per token | $0.0004-$0.0120 per token | $0.0001-$0.0005 per token | $0.0005-$0.0030 per token | $0.0005-$0.0020 per token |
Key Notes
Model Size:
Anthropic: Claude models are estimated to be in the 100B+ range, but exact figures are not officially confirmed.
Azure & OpenAI: The sizes refer to GPT-3 and GPT-4 models. GPT-3 has 175B parameters; GPT-4 is larger, but the exact size is proprietary.
Context Length:
OpenAI and Anthropic: Both platforms offer models with larger context lengths, particularly in newer versions.
Cohere & Jurassic-2: Smaller context lengths compared to OpenAI’s latest models.
Time for Embedding:
This is highly variable and depends on multiple factors, such as the specific model, server load, and input length. The provided ranges should be taken as rough estimates rather than definitive benchmarks.
API Cost:
Cohere, Anthropic, and Jurassic-2: These ranges are more general and reflect common usage scenarios. Always check the latest pricing from providers, as it can change.
Azure & OpenAI: The cost can vary significantly based on the model, with GPT-4 generally costing more per token than GPT-3.
Environment Variables
Ensure you have the necessary environment variables configured in a local `.env` file. For example, this is how you would configure Azure:
AZURE_API_VERSION=<your_azure_api_version>
AZURE_ENDPOINT=<your_azure_endpoint>
AZURE_OPENAI_API_KEY=<your_azure_api_key>
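Loading such a file amounts to reading `KEY=value` lines into the process environment. Projects typically use the python-dotenv package for this; the stdlib-only sketch below shows the underlying logic and is handy for sanity-checking that your variables are actually being picked up:

```python
import os

def load_env_file(path=".env"):
    """Minimal .env reader: put KEY=value lines into os.environ (sketch only).

    Skips blank lines and comments; does not handle quoting or multiline
    values the way python-dotenv does.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Usage: load_env_file(), then read e.g. os.environ["AZURE_ENDPOINT"]
```

Note that `setdefault` means variables already present in the environment win over values in the file, which matches the common dotenv convention.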
Embedding and Searching Your Browser History
To use the script, the `sentence_transformers` package must be installed.
You can install the dependency with the following command:
pip install sentence_transformers
Embedding
With your development environment ready, you can now embed your browser history into ChromaDB.
To embed using the local model (multi-qa-distilbert-cos-v1), run:
python search.py --embed path/to/your/history.csv # Ex: ~/Downloads/history.csv
This command reads each entry from the CSV, processes it into an embedding, and stores it in ChromaDB. The local model `multi-qa-distilbert-cos-v1` is lightweight and works well for most purposes.
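Conceptually, what happens at search time is vector comparison: each page and each query become vectors, and results are ranked by cosine similarity. The toy illustration below uses made-up 3-dimensional vectors purely for clarity; the real model produces 768-dimensional ones:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up page vectors (in reality, produced by the embedding model).
pages = {
    "Example Site http://example.com": [0.9, 0.1, 0.0],
    "Another Example http://anotherexample.com": [0.2, 0.8, 0.1],
}
query_vec = [0.85, 0.15, 0.05]  # made-up vector for a search query

ranked = sorted(pages, key=lambda p: cosine(pages[p], query_vec), reverse=True)
print(ranked[0])  # Example Site http://example.com
```

ChromaDB performs this ranking for you at scale; the point here is only that "semantic search" means nearest-vector lookup, not keyword matching.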
Using Azure Embeddings
If you’re leveraging Azure for embedding, use the `--azure` flag:
python search.py --embed path/to/your/history.csv --azure
Searching
Once your history is embedded, you can perform targeted searches using the `search.py` script.
To start a basic search, simply use:
python search.py "search query"
To refine your results, consider using filters. For example, if you recently read an article on “AI ethics” on example.com but can’t recall the exact URL or title, you can narrow your search to that specific domain:
python search.py "AI ethics" --domain example.com
This command restricts the search to example.com, making it easier to locate the specific page.
If you want to find the most recent page you visited about “quantum computing,” you can sort the results by the latest entries:
python search.py "quantum computing" --newest
This will display the most recent pages related to “quantum computing,” which is useful for quickly revisiting the latest content. For frequently visited pages, such as those related to a project management tool, you can filter by the number of visits:
python search.py "project tasks" --visit-count 10
This filter highlights pages you’ve visited at least ten times, likely indicating their importance.
If you’re trying to locate a specific coding tutorial site that you’ve manually typed the URL for multiple times, use:
python search.py "python tutorial" --typed-count 3
This focuses on pages where you’ve intentionally typed the URL, helping you find the exact tutorial.
To find a document where you remember directly typing the URL, use:
python search.py "project proposal" --transition typed
This command searches for pages accessed by typing the URL directly, excluding those reached through links.
For a more complex search, such as finding the most recent paper on “deep learning” that you’ve visited multiple times on researchsite.com, you can combine multiple filters:
python search.py "deep learning" --domain researchsite.com --newest --visit-count 5
This command combines domain restriction, recent sorting, and visit count filtering, allowing you to locate the most relevant content quickly.
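Filters like these typically translate into a metadata filter applied alongside the vector search. The sketch below shows how the CLI flags might map onto a ChromaDB-style `where` clause; the flag-to-field mapping is an assumption about how `search.py` works, while the `$eq`/`$gte`/`$and` operators are ChromaDB's actual filter syntax:

```python
def build_where(domain=None, visit_count=None, typed_count=None, transition=None):
    """Build a ChromaDB-style metadata filter from optional CLI-flag values.

    Returns None when no filters apply, a single clause for one filter,
    and an $and of clauses for several.
    """
    clauses = []
    if domain is not None:
        clauses.append({"domain": {"$eq": domain}})
    if visit_count is not None:
        clauses.append({"visit_count": {"$gte": visit_count}})
    if typed_count is not None:
        clauses.append({"typed_count": {"$gte": typed_count}})
    if transition is not None:
        clauses.append({"transition": {"$eq": transition}})
    if not clauses:
        return None
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

print(build_where(domain="researchsite.com", visit_count=5))
```

A filter built this way would be passed as the `where` argument of a collection query, restricting candidates before similarity ranking.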
Real-world Use Cases
Understanding how to apply the concepts in this guide to real-world scenarios can help you better appreciate the power of embedding your browser history. Below are some practical examples of how this tool can be used in various contexts:
Personal Productivity Tracking
This tool can help you make better use of your online time by analyzing your browser history. Compare time spent on work sites (Google Docs, Trello) with leisure sites (social media, entertainment), and use these insights for better time management.
Example:
python search.py "to do" --domain trello.com --newest
Content Curation and Management
For content creators, managing online resources is vital. This guide helps embed and index browsing history, enabling quick retrieval of articles, videos, or research papers. When writing a blog post, easily find past sources with keyword searches or visit count filters.
Example:
python search.py "python" --typed-count 3 --newest
Academic Research
Researchers can streamline literature reviews and data gathering by embedding and searching browser history. For ongoing projects, they can easily retrieve previously visited papers, articles, or datasets. They can also filter results by academic databases or journals to focus on credible sources.
Example:
python search.py "machine learning" --domain arxiv.org --visit-count 2
Personalized Content Recommendations
Embedding your browser history enhances personalized content recommendations. For example, if you frequently visit sites like Medium for data science articles, this history helps you quickly find similar content or revisit posts, ensuring better-aligned recommendations.
Example:
python search.py "data science" --domain medium.com --newest
Common Issues and Troubleshooting
CSV Embedding Failure
Problem:
The script fails to embed the CSV file.
Solution:
Ensure the CSV file is correctly formatted and accessible from the path provided. Double-check your Python environment and dependencies.
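A quick way to rule out formatting problems is to check that the file's header row contains the expected columns. The sketch below assumes the column set from the CSV sample earlier in this guide; `search.py`'s actual requirements may differ:

```python
import csv

# Columns expected per the export sample earlier in this guide (an assumption).
REQUIRED = {"visit_time", "url", "title", "visit_count", "typed_count"}

def check_history_csv(path):
    """Return the set of required columns missing from the CSV header.

    An empty set means the header looks fine; a non-empty set names
    the columns you need to add or rename.
    """
    with open(path, newline="", encoding="utf-8") as f:
        header = set(next(csv.reader(f)))
    return REQUIRED - header
```

Running this before `--embed` turns a vague embedding failure into a concrete list of missing columns.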
Hosted Embeddings Not Applied
Problem:
The hosted embedding model is not being used despite setting environment variables.
Solution:
Verify that the `.env` file is correctly configured and located in the root directory. Ensure that the keys and endpoints are correct and match the required format.
Inaccurate Search Results
Problem:
Search results don’t seem to match your query.
Solution:
Check if the correct embedding model is being used. If the local model is inadequate, use a hosted embedding service for more accurate results. Also, review your filters and search queries for specificity.
Conclusion
Following this guide, you’ve created a system to embed and search your browser history using ChromaDB. You now have a powerful tool for exploring your browsing habits, whether for productivity analysis, research, or personal curiosity.
You can continue experimenting with different embedding models and search parameters to refine your results. You can also consider expanding this setup to include other data types or integrating it into larger data analysis projects.