Skip to content

Commit 58528d5

Browse files
Merge pull request #6 from maint/formats
Format documentation
2 parents 4bc788f + c06f2ed commit 58528d5

File tree

4 files changed

+858
-2
lines changed

4 files changed

+858
-2
lines changed
Lines changed: 326 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,326 @@
1+
---
2+
layout: default
3+
title: Flow Classes
4+
parent: Configuration
5+
grand_parent: Reference
6+
nav_order: 1
7+
permalink: /reference/configuration/flow-classes
8+
---
9+
10+
# Flow Class Configuration
11+
12+
Flow classes define complete dataflow pattern templates in TrustGraph. When instantiated, they create interconnected networks of processors that handle data ingestion, processing, storage, and querying as a unified system.
13+
14+
## Overview
15+
16+
A flow class serves as a blueprint for creating flow instances. Each flow class defines:
17+
- **Shared services** that are used by all flow instances of the same class
18+
- **Flow-specific processors** that are unique to each flow instance
19+
- **Interfaces** that define how external systems interact with the flow
20+
- **Queue patterns** that route messages between processors
21+
22+
Flow classes are stored in TrustGraph's configuration system with the configuration type `flow-classes` and are managed through dedicated CLI commands.
23+
24+
## Structure
25+
26+
Every flow class definition has four main sections:
27+
28+
### 1. Class Section
29+
30+
Defines shared service processors that are instantiated once per flow class. These processors handle requests from all flow instances of this class.
31+
32+
```json
33+
{
34+
"class": {
35+
"embeddings:{class}": {
36+
"request": "non-persistent://tg/request/embeddings:{class}",
37+
"response": "non-persistent://tg/response/embeddings:{class}"
38+
},
39+
"text-completion:{class}": {
40+
"request": "non-persistent://tg/request/text-completion:{class}",
41+
"response": "non-persistent://tg/response/text-completion:{class}"
42+
}
43+
}
44+
}
45+
```
46+
47+
**Characteristics:**
48+
- Shared across all flow instances of the same class
49+
- Typically expensive or stateless services (LLMs, embedding models)
50+
- Use `{class}` template variable for queue naming
51+
- Examples: `embeddings:{class}`, `text-completion:{class}`, `graph-rag:{class}`
52+
53+
### 2. Flow Section
54+
55+
Defines flow-specific processors that are instantiated for each individual flow instance. Each flow gets its own isolated set of these processors.
56+
57+
```json
58+
{
59+
"flow": {
60+
"chunker:{id}": {
61+
"input": "persistent://tg/flow/chunk:{id}",
62+
"output": "persistent://tg/flow/chunk-load:{id}"
63+
},
64+
"pdf-decoder:{id}": {
65+
"input": "persistent://tg/flow/document-load:{id}",
66+
"output": "persistent://tg/flow/chunk:{id}"
67+
}
68+
}
69+
}
70+
```
71+
72+
**Characteristics:**
73+
- Unique instance per flow
74+
- Handle flow-specific data and state
75+
- Use `{id}` template variable for queue naming
76+
- Examples: `chunker:{id}`, `pdf-decoder:{id}`, `kg-extract-relationships:{id}`
77+
78+
### 3. Interfaces Section
79+
80+
Defines the entry points and interaction contracts for the flow. These form the API surface for external systems and internal component communication.
81+
82+
Interfaces can take two forms:
83+
84+
**Fire-and-Forget Pattern** (single queue):
85+
```json
86+
{
87+
"interfaces": {
88+
"document-load": "persistent://tg/flow/document-load:{id}",
89+
"triples-store": "persistent://tg/flow/triples-store:{id}"
90+
}
91+
}
92+
```
93+
94+
**Request/Response Pattern** (object with request/response fields):
95+
```json
96+
{
97+
"interfaces": {
98+
"embeddings": {
99+
"request": "non-persistent://tg/request/embeddings:{class}",
100+
"response": "non-persistent://tg/response/embeddings:{class}"
101+
},
102+
"text-completion": {
103+
"request": "non-persistent://tg/request/text-completion:{class}",
104+
"response": "non-persistent://tg/response/text-completion:{class}"
105+
}
106+
}
107+
}
108+
```
109+
110+
**Types of Interfaces:**
111+
- **Entry Points**: Where external systems inject data (`document-load`, `agent`)
112+
- **Service Interfaces**: Request/response patterns for services (`embeddings`, `text-completion`)
113+
- **Data Interfaces**: Fire-and-forget data flow connection points (`triples-store`, `entity-contexts-load`)
114+
115+
### 4. Metadata
116+
117+
Additional information about the flow class:
118+
119+
```json
120+
{
121+
"description": "Standard RAG pipeline with document processing and query capabilities",
122+
"tags": ["rag", "document-processing", "embeddings", "graph-query"]
123+
}
124+
```
125+
126+
## Template Variables
127+
128+
Flow class definitions use template variables that are replaced when flow instances are created:
129+
130+
### {id}
131+
- **Purpose**: Creates isolated resources for each flow instance
132+
- **Usage**: Flow-specific processors and data pathways
133+
- **Example**: `persistent://tg/flow/chunk-load:{id}` becomes `persistent://tg/flow/chunk-load:customer-A-flow`
134+
135+
### {class}
136+
- **Purpose**: Creates shared resources across flows of the same class
137+
- **Usage**: Shared services and expensive processors
138+
- **Example**: `non-persistent://tg/request/embeddings:{class}` becomes `non-persistent://tg/request/embeddings:standard-rag`
139+
140+
## Queue Patterns
141+
142+
Flow classes use Apache Pulsar for messaging. Queue names follow the Pulsar format:
143+
144+
```
145+
<persistence>://<tenant>/<namespace>/<topic>
146+
```
147+
148+
### Queue Components
149+
150+
| Component | Description | Examples |
151+
|-----------|-------------|----------|
152+
| **persistence** | Pulsar persistence mode | `persistent`, `non-persistent` |
153+
| **tenant** | Organization identifier | `tg` (TrustGraph) |
154+
| **namespace** | Messaging pattern | `flow`, `request`, `response` |
155+
| **topic** | Queue/topic name | `chunk-load:{id}`, `embeddings:{class}` |
156+
157+
### Persistent Queues
158+
159+
Used for fire-and-forget services and durable data flow:
160+
161+
```
162+
persistent://tg/flow/<topic>:{id}
163+
```
164+
165+
**Characteristics:**
166+
- Data persists in Pulsar storage across restarts
167+
- Used for document processing pipelines
168+
- Ensures data durability and reliability
169+
- Examples: `persistent://tg/flow/chunk-load:{id}`, `persistent://tg/flow/triples-store:{id}`
170+
171+
### Non-Persistent Queues
172+
173+
Used for request/response messaging patterns:
174+
175+
```
176+
non-persistent://tg/request/<topic>:{class}
177+
non-persistent://tg/response/<topic>:{class}
178+
```
179+
180+
**Characteristics:**
181+
- Ephemeral, not persisted to disk
182+
- Lower latency, suitable for RPC-style communication
183+
- Used for shared services like embeddings and LLM calls
184+
- Examples: `non-persistent://tg/request/embeddings:{class}`, `non-persistent://tg/response/text-completion:{class}`
185+
186+
## Complete Example
187+
188+
Here's a simplified flow class definition for a standard RAG pipeline:
189+
190+
```json
191+
{
192+
"description": "Standard RAG pipeline with document processing and query capabilities",
193+
"tags": ["rag", "document-processing", "embeddings"],
194+
195+
"class": {
196+
"embeddings:{class}": {
197+
"request": "non-persistent://tg/request/embeddings:{class}",
198+
"response": "non-persistent://tg/response/embeddings:{class}"
199+
},
200+
"text-completion:{class}": {
201+
"request": "non-persistent://tg/request/text-completion:{class}",
202+
"response": "non-persistent://tg/response/text-completion:{class}"
203+
}
204+
},
205+
206+
"flow": {
207+
"pdf-decoder:{id}": {
208+
"input": "persistent://tg/flow/document-load:{id}",
209+
"output": "persistent://tg/flow/chunk:{id}"
210+
},
211+
"chunker:{id}": {
212+
"input": "persistent://tg/flow/chunk:{id}",
213+
"output": "persistent://tg/flow/chunk-load:{id}"
214+
},
215+
"vectorizer:{id}": {
216+
"input": "persistent://tg/flow/chunk-load:{id}",
217+
"output": "persistent://tg/flow/doc-embeds-store:{id}"
218+
}
219+
},
220+
221+
"interfaces": {
222+
"document-load": "persistent://tg/flow/document-load:{id}",
223+
"embeddings": {
224+
"request": "non-persistent://tg/request/embeddings:{class}",
225+
"response": "non-persistent://tg/response/embeddings:{class}"
226+
},
227+
"text-completion": {
228+
"request": "non-persistent://tg/request/text-completion:{class}",
229+
"response": "non-persistent://tg/response/text-completion:{class}"
230+
}
231+
}
232+
}
233+
```
234+
235+
## Flow Instantiation
236+
237+
When a flow instance is created from this class:
238+
239+
**Given:**
240+
- Flow Instance ID: `customer-A-flow`
241+
- Flow Class: `standard-rag`
242+
243+
**Template Expansions:**
244+
- `persistent://tg/flow/chunk-load:{id}``persistent://tg/flow/chunk-load:customer-A-flow`
245+
- `non-persistent://tg/request/embeddings:{class}``non-persistent://tg/request/embeddings:standard-rag`
246+
247+
**Result:**
248+
- Isolated document processing pipeline for `customer-A-flow`
249+
- Shared embedding service for all `standard-rag` flows
250+
- Complete dataflow from document ingestion through querying
251+
252+
## Dataflow Architecture
253+
254+
Flow classes create unified dataflows where:
255+
256+
1. **Document Processing Pipeline**: Flows from ingestion through transformation to storage
257+
2. **Query Services**: Integrated processors that query the same data stores and services
258+
3. **Shared Services**: Centralized processors that all flows can utilize
259+
4. **Storage Writers**: Persist processed data to appropriate stores
260+
261+
All processors (both `{id}` and `{class}`) work together as a cohesive dataflow graph, not as separate systems.
262+
263+
## Benefits
264+
265+
### Resource Efficiency
266+
- Expensive services (LLMs, embedding models) are shared across flows
267+
- Reduces computational costs and resource usage
268+
269+
### Flow Isolation
270+
- Each flow has its own data processing pipeline
271+
- Prevents data mixing between different flows
272+
273+
### Scalability
274+
- Can instantiate multiple flows from the same template
275+
- Horizontal scaling by adding more flow instances
276+
277+
### Modularity
278+
- Clear separation between shared and flow-specific components
279+
- Easy to modify and extend flow capabilities
280+
281+
### Unified Architecture
282+
- Query and processing are part of the same dataflow
283+
- Consistent data handling across ingestion and retrieval
284+
285+
## Common Patterns
286+
287+
### Standard RAG Flow
288+
- Document ingestion → chunking → embedding → storage
289+
- Query interface for retrieval and generation
290+
291+
### Knowledge Graph Flow
292+
- Document ingestion → entity extraction → relationship extraction → graph storage
293+
- Query interface for graph traversal and reasoning
294+
295+
### Object Extraction Flow
296+
- Document ingestion → structured data extraction → object storage
297+
- Query interface for structured data retrieval
298+
299+
## Best Practices
300+
301+
### Queue Design
302+
- Use persistent queues for data that must survive restarts
303+
- Use non-persistent queues for fast request/response patterns
304+
- Include template variables in queue names for proper isolation
305+
306+
### Service Sharing
307+
- Share expensive services (LLMs, embeddings) at the class level
308+
- Keep data processing isolated at the flow level
309+
310+
### Interface Design
311+
- Provide clear entry points for external systems
312+
- Use request/response patterns for synchronous operations
313+
- Use fire-and-forget patterns for asynchronous data flow
314+
315+
### Template Variables
316+
- Use `{id}` for flow-specific resources
317+
- Use `{class}` for shared resources
318+
- Be consistent with naming conventions
319+
320+
## See Also
321+
322+
- [tg-put-flow-class](../cli/tg-put-flow-class) - Create or update flow classes
323+
- [tg-get-flow-class](../cli/tg-get-flow-class) - Retrieve flow class definitions
324+
- [tg-show-flow-classes](../cli/tg-show-flow-classes) - List available flow classes
325+
- [Flow Processor Reference](../extending/flow-processor) - Building custom processors
326+
- [Pulsar Configuration](pulsar) - Message queue configuration

reference/configuration/index.md

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
---
2+
layout: default
3+
title: Configuration
4+
parent: Reference
5+
has_children: true
6+
nav_order: 7
7+
---
8+
9+
# Configuration Schemas
10+
11+
Reference documentation for TrustGraph configuration file formats and data structures.
12+
13+
## Configuration Types
14+
15+
### Flow Configuration
16+
- **[Flow Classes](flow-classes)** - Define dataflow pattern templates and processor networks
17+
18+
### Data Configuration
19+
- **Structure Descriptor Language (SDL)** - For structured data import (see [SDL Reference](../sdl))
20+
21+
### System Configuration
22+
- **Pulsar Configuration** - Message queue and topic configuration (see [Pulsar API](../apis/pulsar))
23+
24+
## Overview
25+
26+
TrustGraph uses various configuration schemas to define:
27+
- How data flows through the system
28+
- How processors are connected and configured
29+
- How external systems integrate with TrustGraph
30+
- How data is transformed and stored
31+
32+
These configuration files are typically stored in JSON format and can be managed through the CLI tools or the web interface.
33+
34+
## Common Patterns
35+
36+
### Template Variables
37+
Many configuration schemas support template variables for dynamic naming:
38+
- `{id}` - Replaced with instance identifiers
39+
- `{class}` - Replaced with class names
40+
- `{collection}` - Replaced with collection names
41+
42+
### JSON Schema Validation
43+
Configuration files are validated against JSON schemas to ensure correctness before deployment.
44+
45+
### Versioning
46+
Configuration schemas support versioning to maintain backward compatibility as the system evolves.
47+
48+
## See Also
49+
50+
- [CLI Reference](../cli/) - Commands for managing configurations
51+
- [API Reference](../apis/) - REST APIs for configuration management
52+
- [Extension Reference](../extending/) - Creating custom configurations

0 commit comments

Comments
 (0)