Problem Statement
Currently, ii-agent requires significant manual setup and lacks persistent knowledge capabilities:
- Complex deployment: Users must manually install dependencies, configure databases, set up workspace directories
- Limited data sources: Only web search available, no crawling or social media ingestion
- No knowledge persistence: Agent forgets everything between sessions, no RAG system
- Poor developer experience: Multiple config files, manual API key management, scattered setup steps
Current user journey: Clone repo → Install Python deps → Configure workspace → Set up database → Configure API keys → Debug connection issues → Finally test agent
Desired user journey: Clone repo → docker compose up → Add API keys to .env → Agent ready with crawling + RAG
Proposed Solution
Add a comprehensive data ingestion and knowledge management system with containerized deployment.
Core Components
1. Docker Infrastructure
- Full docker-compose stack (agent + UI + postgres + redis + neo4j + graphiti + crawler workers)
- Environment-based configuration
- Volume persistence for data
- Health checks and dependency management
2. Web Crawling System with Crawl4AI
- WebCrawlTool using Crawl4AI v0.5.0 for content extraction (sketched after this list)
- Deep crawling capabilities with configurable strategies (BFS, DFS, Best-First)
- Memory-Adaptive Dispatcher for large-scale crawls with dynamic concurrency
- LLM-friendly markdown output perfect for RAG pipeline ingestion
- Redis-based job queue for async crawling
- Automatic content deduplication and chunking
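To make the crawl path concrete, here is a minimal sketch of the worker-side call, assuming Crawl4AI v0.5's deep-crawl API (the crawl_site helper and return shape are illustrative, not existing code):

# Minimal sketch of a worker-side deep crawl; assumes Crawl4AI v0.5's deep-crawl API
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def crawl_site(url: str, max_depth: int = 3, max_pages: int = 50) -> list[str]:
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=max_depth, max_pages=max_pages)
    )
    async with AsyncWebCrawler() as crawler:
        # In non-streaming mode, a deep crawl returns one result per visited page
        results = await crawler.arun(url, config=config)
        return [r.markdown for r in results]  # LLM-ready markdown for the RAG pipeline

asyncio.run(crawl_site("https://example.com"))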
3. Social Media Connectors
- Reddit: PRAW-based fetcher for user comments/posts, up to the 1,000-item API limit (see the sketch after this list)
- Twitter/X: Tweepy v2 integration with rate limit handling
- LinkedIn: Graph API v22 for public posts/analytics
- TikTok: Research API support (graceful 403 handling for non-academic tokens)
- Instagram: Graph API integration
- Other social platforms as needed...
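As a concrete example of the Reddit path, a PRAW-based fetcher could look roughly like this (the function name is illustrative; credentials come from the .env values shown later):

# Illustrative PRAW-based Reddit fetcher; the function name is a placeholder
import os
import praw

def fetch_reddit_comments(username: str, limit: int = 1000) -> list[str]:
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="ii-agent social ingestion",
    )
    # Reddit listings cap out around 1000 items, hence the default limit
    return [c.body for c in reddit.redditor(username).comments.new(limit=limit)]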
4. Graphiti Knowledge Graph Integration
- Neo4j-backed knowledge graph with semantic search and relationship extraction
- Episode-based memory: Stores conversation turns as episodes with automatic fact extraction
- Advanced search capabilities:
  - Reciprocal Rank Fusion (RRF) combining BM25 and semantic search (illustrated after this list)
  - Maximal Marginal Relevance (MMR) for diverse results
  - Cross-encoder reranking with OpenAI or BGE models
- RAGQueryTool for agent to retrieve relevant context with graph relationships
- Session-scoped knowledge with cross-session learning
- Entity relationship mapping between people, topics, and content sources
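For reference, Reciprocal Rank Fusion itself is simple enough to sketch in a few lines (k=60 is the conventional constant; this is an illustration, not Graphiti's internal code):

# RRF illustration: each ranking contributes 1/(k + rank) per document
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:  # e.g. [bm25_doc_ids, semantic_doc_ids]
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)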
Technical Implementation
File Structure
├── docker-compose.yml
├── Dockerfile.crawler
├── .env.example
├── src/
│   ├── ii_agent/tools/
│   │   ├── web_crawl_tool.py
│   │   ├── social_ingestion_tool.py
│   │   └── rag_query_tool.py
│   ├── crawler/
│   │   ├── worker.py
│   │   ├── crawl4ai_config.py
│   │   └── processors/
│   ├── social/
│   │   ├── reddit_fetcher.py
│   │   ├── twitter_fetcher.py
│   │   └── linkedin_fetcher.py
│   └── graphiti/
│       ├── client.py
│       └── episode_manager.py
Key APIs
WebCrawlTool (using Crawl4AI deep crawling)
{
  "name": "web_crawl",
  "input": {
    "urls": ["https://example.com"],
    "strategy": "BFS",
    "max_depth": 3,
    "max_pages": 50,
    "url_filters": ["*/blog/*", "*/docs/*"],
    "crawler_mode": "browser"
  }
}
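Under the hood, web_crawl would enqueue the request for the crawler workers rather than crawling inline; a minimal redis-py sketch (the crawl_jobs queue name is an assumption):

# Hypothetical enqueue from the tool into the Redis-backed crawl queue
import json
import redis

r = redis.Redis.from_url("redis://redis:6379")
job = {"urls": ["https://example.com"], "strategy": "BFS", "max_depth": 3, "max_pages": 50}
r.lpush("crawl_jobs", json.dumps(job))  # workers BRPOP from the other end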
SocialIngestionTool
{
  "name": "social_ingest",
  "input": {
    "platform": "reddit",
    "count": 500,
    "filters": {"subreddits": ["python", "MachineLearning"]}
  }
}
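Inside the tool, the platform field can dispatch straight to the per-platform fetchers from the file structure above; a rough sketch (the import path and fetch() signature are assumptions, not existing code):

# Hypothetical platform dispatch; module names follow the proposed file structure
from social import reddit_fetcher, twitter_fetcher, linkedin_fetcher  # assumed import path

FETCHERS = {
    "reddit": reddit_fetcher,
    "twitter": twitter_fetcher,
    "linkedin": linkedin_fetcher,
}

def social_ingest(platform: str, count: int, filters: dict):
    fetcher = FETCHERS.get(platform)
    if fetcher is None:
        raise ValueError(f"unsupported platform: {platform}")
    return fetcher.fetch(count=count, filters=filters)  # fetch() signature is an assumption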
RAGQueryTool (with Graphiti graph search)
{
  "name": "rag_query",
  "input": {
    "query": "What topics do I comment about most?",
    "search_type": "hybrid",
    "rerank_method": "rrf",
    "include_relationships": true,
    "top_k": 10
  }
}
Graphiti Integration Details
Episode Management
# src/graphiti/episode_manager.py
from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType
from graphiti_core.utils.bulk_utils import RawEpisode

class GraphitiEpisodeManager:
    def __init__(self, neo4j_uri, neo4j_user, neo4j_password):
        self.client = Graphiti(neo4j_uri, neo4j_user, neo4j_password)

    async def add_crawled_content(self, content, metadata):
        """Add crawled content as an episode in the knowledge graph."""
        episode = RawEpisode(
            name=f"crawl_{metadata['url']}",
            content=content,
            source=EpisodeType.text,  # crawled markdown is plain text
            source_description=metadata['url'],
            reference_time=metadata['timestamp'],
        )
        await self.client.add_episode_bulk([episode])

    async def search_knowledge(self, query, num_results=10):
        """Search the knowledge graph. Graphiti's search() runs hybrid
        BM25 + semantic retrieval fused with RRF by default; MMR and
        cross-encoder reranking are exposed via its search recipes."""
        return await self.client.search(query=query, num_results=num_results)
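For illustration, a crawler worker would hand each crawled page to the manager roughly like this (page_markdown, page_url, and crawled_at are placeholders):

# Hypothetical worker-side wiring, inside the worker's async loop
manager = GraphitiEpisodeManager("bolt://neo4j:7687", "neo4j", "graphiti_password")
await manager.add_crawled_content(
    content=page_markdown,
    metadata={"url": page_url, "timestamp": crawled_at},
)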
Docker Compose Services
services:
  ii-agent:
    build: .
    environment:
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/iiagent
      - NEO4J_URI=bolt://neo4j:7687
      - NEO4J_USER=neo4j
      - NEO4J_PASSWORD=graphiti_password
      - REDIS_URL=redis://redis:6379
      - CRAWL4AI_MEMORY_THRESHOLD=0.8
    depends_on: [postgres, redis, neo4j, crawl4ai]

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: iiagent
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine

  neo4j:
    image: neo4j:5.23
    environment:
      - NEO4J_AUTH=neo4j/graphiti_password
      - NEO4J_PLUGINS=["apoc"]
    ports:
      - "7474:7474"
      - "7687:7687"
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs

  # Crawl4AI as scalable service
  crawl4ai:
    image: unclecode/crawl4ai:latest
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN}
    ports:
      - "8001:8000"
    deploy:
      replicas: 2

  crawler-worker:
    build:
      context: .
      dockerfile: Dockerfile.crawler
    environment:
      - CRAWL_QUEUE_URL=redis://redis:6379
      - CRAWL4AI_URL=http://crawl4ai:8000
      - NEO4J_URI=bolt://neo4j:7687
      - NEO4J_USER=neo4j
      - NEO4J_PASSWORD=graphiti_password
    depends_on: [redis, neo4j, crawl4ai]
    deploy:
      replicas: 3

volumes:
  postgres_data:
  neo4j_data:
  neo4j_logs:
Environment Configuration
.env.example
# Core
ANTHROPIC_API_KEY=your_anthropic_key
DATABASE_URL=postgresql://postgres:password@postgres:5432/iiagent
# Neo4j & Graphiti Configuration
NEO4J_URI=bolt://neo4j:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=graphiti_password
# Crawl4AI Configuration
CRAWL4AI_API_TOKEN=your_crawl4ai_token
CRAWL4AI_MEMORY_THRESHOLD=0.8
CRAWL4AI_MAX_CONCURRENT=10
# Social Media APIs
REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_CLIENT_SECRET=your_reddit_secret
TWITTER_BEARER_TOKEN=your_twitter_token
LINKEDIN_CLIENT_ID=your_linkedin_id
LINKEDIN_CLIENT_SECRET=your_linkedin_secret
# Optional
TAVILY_API_KEY=your_tavily_key
JINA_API_KEY=your_jina_key
Benefits
- Zero-setup deployment: Single command gets full stack running including Neo4j
- Persistent knowledge graph: Agent builds and maintains relationships between entities
- Advanced search capabilities: RRF, MMR, and cross-encoder reranking for better retrieval
- Rich data sources: Web crawling + social media beyond just web search
- Scalable architecture: Crawl4AI's memory-adaptive dispatcher + Redis queues
- LLM-optimized content: Crawl4AI generates clean markdown perfect for RAG
- Deep crawling capabilities: Explore entire websites with intelligent strategies
- Graph-based memory: Graphiti automatically extracts entities and relationships
- Developer friendly: Standard docker-compose, clear API structure
- Production ready: Health checks, volume persistence, environment configuration
Use Cases Enabled
- Personal knowledge assistant: "Deep crawl my blog and find connections between my ideas"
- Research automation: "Monitor these 50 websites and track how topics evolve"
- Social media analysis: "What topics am I engaging with most and how do they relate?"
- Documentation assistant: "Crawl our docs and understand product relationships"
- Content discovery: "Find unexpected connections between my saved articles and projects"
- Competitive intelligence: "Track competitor content and identify relationship patterns"
- Academic research: "Build knowledge graphs from paper citations and topics"