@jjoyce0510 jjoyce0510 commented Nov 4, 2025

Introducing Documents in DataHub (Context)

This PR introduces a new Document entity to DataHub, enabling users to create, manage, and organize first-party knowledge base content directly within the platform. Documents can be hierarchically organized, linked to data assets, and managed through a complete lifecycle including draft/publish workflows.

Core Data Models

Introduces comprehensive metadata models for the Document entity in DataHub:

Entity Definition

  • New document entity with key aspect documentKey and search capabilities
  • Full support for standard DataHub aspects: ownership, domains, tags, glossary terms, structured properties, institutional memory

Core Aspects (PDL Models)

  • DocumentKey - Unique identifier for documents
  • DocumentInfo - Primary aspect containing:
    • Title and text contents
    • Document status (PUBLISHED/UNPUBLISHED)
    • Source information (distinguishes first-party vs third-party ingested documents)
    • Audit stamps (created/lastModified with actor and timestamp)
    • Hierarchical parent-child relationships
    • Related assets (datasets, dashboards, etc.) and related documents
    • Draft workflow support via draftOf field
  • DocumentContents - Text content storage
  • DocumentStatus & DocumentState - Publication state management
  • DocumentSource - Tracking external sources for third-party integrations
  • ParentDocument, RelatedAsset, RelatedDocument - Relationship models
  • DraftOf - Draft-to-published document linking

GraphQL APIs

Comprehensive GraphQL API surface in knowledge.graphql:

Mutations

  1. createDocument - Create new documents with content, relationships, and hierarchy
    • Supports custom IDs or auto-generated UUIDs
    • Can create as draft or published
    • Automatic ownership assignment to creator
  2. updateDocumentContents - Update document text and title
  3. updateDocumentRelatedEntities - Manage relationships to assets and other documents
  4. moveDocument - Relocate documents within the hierarchy
  5. deleteDocument - Remove documents and their references
  6. updateDocumentStatus - Toggle between PUBLISHED/UNPUBLISHED states
  7. mergeDraft - Merge draft content into the published document, with optional draft deletion

Queries

  1. document(urn) - Fetch document by URN with full metadata
  2. searchDocuments - Hybrid semantic search with rich filtering:
    • Semantic query support
    • Filter by parent document (hierarchical browsing)
    • Filter by types, domains, states
    • Option to include/exclude drafts
    • Faceted search support

Special Features

  • drafts field - Lists all draft versions of a published document
  • changeHistory field - Chronological audit log of document modifications, covering content changes, parent changes (moves), relationship changes, and state changes, among others

Authorization & Privileges

New Platform Privilege

  • MANAGE_DOCUMENTS - Platform-level privilege for managing all documents

Entity-Level Privileges

Documents support standard DataHub entity privileges:

  • VIEW_ENTITY_PAGE / GET_ENTITY - View document
  • EDIT_ENTITY_DOCS / EDIT_ENTITY - Edit document content
  • CREATE_ENTITY - Create documents
  • EDIT_ENTITY_OWNERS - Manage ownership
  • EDIT_ENTITY_DOMAINS - Assign domains
  • SHARE_ENTITY - Share documents
  • EDIT_ENTITY_PROPERTIES - Edit structured properties

Authorization Logic

  • canCreateDocument() - Requires CREATE_ENTITY for documents or MANAGE_DOCUMENTS
  • canEditDocument() - Requires EDIT_ENTITY_DOCS, EDIT_ENTITY, or MANAGE_DOCUMENTS
  • canGetDocument() - Requires VIEW_ENTITY_PAGE or MANAGE_DOCUMENTS
  • canDeleteDocument() - Requires delete authorization or MANAGE_DOCUMENTS
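
The checks above all follow the same shape: an entity-level privilege or the platform-level MANAGE_DOCUMENTS privilege is sufficient. A minimal sketch of that rule, assuming a simple set of granted privileges; the `Privilege` enum and the class are hypothetical and are not the PR's actual authorization code (which goes through DataHub's authorizer). canDeleteDocument() is omitted because the description does not name its entity-level privilege.

```java
import java.util.EnumSet;
import java.util.Set;

final class DocumentAuthSketch {
    // Privilege names come from the PR description; this enum is illustrative only.
    enum Privilege { CREATE_ENTITY, EDIT_ENTITY_DOCS, EDIT_ENTITY, VIEW_ENTITY_PAGE, MANAGE_DOCUMENTS }

    // MANAGE_DOCUMENTS acts as an override for every check.
    static boolean canCreateDocument(Set<Privilege> granted) {
        return granted.contains(Privilege.MANAGE_DOCUMENTS)
            || granted.contains(Privilege.CREATE_ENTITY);
    }

    static boolean canEditDocument(Set<Privilege> granted) {
        return granted.contains(Privilege.MANAGE_DOCUMENTS)
            || granted.contains(Privilege.EDIT_ENTITY_DOCS)
            || granted.contains(Privilege.EDIT_ENTITY);
    }

    static boolean canGetDocument(Set<Privilege> granted) {
        return granted.contains(Privilege.MANAGE_DOCUMENTS)
            || granted.contains(Privilege.VIEW_ENTITY_PAGE);
    }

    public static void main(String[] args) {
        Set<Privilege> editor = EnumSet.of(Privilege.EDIT_ENTITY_DOCS, Privilege.VIEW_ENTITY_PAGE);
        Set<Privilege> admin = EnumSet.of(Privilege.MANAGE_DOCUMENTS);

        System.out.println(canEditDocument(editor));   // true
        System.out.println(canCreateDocument(editor)); // false
        System.out.println(canCreateDocument(admin));  // true
    }
}
```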

Backend Services

DocumentService

Complete service layer implementation in metadata-service/services:

  • CRUD operations with validation
  • Draft workflow management (create, merge, track)
  • Hierarchical structure management (move operations)
  • Relationship management (assets and documents)
  • Ownership management
  • State transition handling
  • Full audit trail via lastModified timestamps
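
Move operations need to reject a move that would make a document its own ancestor; the reviewed service code in this thread includes such a parent check. A minimal sketch of the idea, assuming the hierarchy is available as an in-memory map from document URN to parent URN. The class, method, map, and example URNs are hypothetical; the PR's service resolves parents through the entity client instead.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class MoveCycleCheck {
    // parentOf maps a document URN to its parent URN; root documents have no entry.
    static boolean wouldCreateCycle(
            String documentUrn, String newParentUrn, Map<String, String> parentOf) {
        Set<String> visited = new HashSet<>();
        String current = newParentUrn;
        // Walk up from the proposed parent. Reaching the moved document means
        // the proposed parent is the document itself or one of its descendants.
        while (current != null && visited.add(current)) {
            if (current.equals(documentUrn)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false; // reached a root (or an already-visited node) without finding the document
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("urn:li:document:b", "urn:li:document:a"); // a -> b
        parentOf.put("urn:li:document:c", "urn:li:document:b"); // b -> c

        // Moving a under its own grandchild c would create a cycle.
        System.out.println(wouldCreateCycle("urn:li:document:a", "urn:li:document:c", parentOf)); // true
        // Moving c directly under a is fine.
        System.out.println(wouldCreateCycle("urn:li:document:c", "urn:li:document:a", parentOf)); // false
    }
}
```

Passing a null parent (a move to the root) never reports a cycle, and the visited set keeps the walk from looping forever if the stored hierarchy is already inconsistent.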

Timeline Support

  • DocumentInfoChangeEventGenerator - Generates change events for audit history
  • Tracks all modifications to document aspects
  • Integrates with DataHub's timeline service

Factory Beans

  • DocumentServiceFactory - Spring factory for service instantiation
  • Integration with GraphQL engine

Test Coverage

Smoke Tests

  • document_test.py (410 lines) - End-to-end document lifecycle tests
  • document_draft_test.py (326 lines) - Draft creation, merging, and workflows
  • document_change_history_test.py (281 lines) - Timeline and change tracking

Unit Tests

  • DocumentServiceTest.java (486 lines) - Service layer business logic
  • GraphQL resolver tests for all mutations and queries
  • DocumentMapperTest.java - Type mapping validation
  • DocumentInfoChangeEventGeneratorTest.java - Timeline event generation

Key Features & Use Cases

  1. Knowledge Base Management - Create and organize internal documentation, FAQs, tutorials, and runbooks
  2. Asset Documentation - Link documents to data assets for enriched context
  3. Draft Workflows - Work on document updates without publishing immediately
  4. Hierarchical Organization - Structure documents in parent-child relationships
  5. Semantic Search - Find relevant documents through hybrid search
  6. Change Tracking - Full audit history of all document modifications
  7. Third-Party Integration Ready - Source field supports ingesting external docs (Confluence, Notion, etc.)

This PR lays the foundation for DataHub to become a central knowledge hub, combining first-party documentation with data asset management in a unified platform.

Coming in a followup PR:

  • Add browse paths for docs, enabling us to replicate hierarchical structure from other systems.
  • Add the "container" story for docs. One option is to define a parent container type as a Dataset entity (e.g. Dataset = Collection of Documents), which is then itself within a container.
  • Models for document-level lineage, and UI support for creating document-level lineage links.
  • Support Document Tags, Glossary Terms, and inclusion in Data Products

Status

Ready for review.

@github-actions github-actions bot added the product, devops, and smoke_test labels Nov 4, 2025
@jjoyce0510 jjoyce0510 marked this pull request as ready for review November 5, 2025 22:06
@datahub-cyborg datahub-cyborg bot added the needs-review label Nov 5, 2025
@abedatahub abedatahub self-requested a review November 6, 2025 16:48

@abedatahub abedatahub left a comment

Looks good to land; a few minor comments.

/**
* Information about the external source of this document.
* Only populated for third-party documents ingested from external systems.
* If null, the document is first-party (created directly in DataHub).

If this is the convention then you can remove the sourceType field in DocumentSource


/**
* Returns true if the current user is able to create Knowledge Articles. This is true if the user
* has the 'Create Entity' privilege for Knowledge Articles or 'Manage Knowledge Articles'

Are there plans for fine grained (per-document view, create-child, edit) privileges?


return true;
} catch (Exception e) {
log.error(

No need to log here since the caller can log the exception which has all the needed debugging info.

implements DataFetcher<CompletableFuture<List<DocumentChange>>> {

private final TimelineService _timelineService;
private static final long DEFAULT_LOOKBACK_MILLIS = 30L * 24 * 60 * 60 * 1000; // 30 days

TimeUnit.DAYS.toMillis(30)
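
A minimal sketch of the suggested constant, for illustration; the wrapper class name is hypothetical. `TimeUnit.DAYS.toMillis(30)` yields the same value as the hand-written multiplication, with the unit conversion made explicit.

```java
import java.util.concurrent.TimeUnit;

final class LookbackConstant {
    // Same value as 30L * 24 * 60 * 60 * 1000, without the manual arithmetic.
    static final long DEFAULT_LOOKBACK_MILLIS = TimeUnit.DAYS.toMillis(30);

    public static void main(String[] args) {
        System.out.println(DEFAULT_LOOKBACK_MILLIS); // 2592000000
    }
}
```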

Comment on lines +59 to +62
long endTime = endTimeMillis != null ? endTimeMillis : System.currentTimeMillis();
long startTime =
startTimeMillis != null ? startTimeMillis : (endTime - DEFAULT_LOOKBACK_MILLIS);
int maxResults = limit != null ? limit : DEFAULT_LIMIT;

You should use the java.time APIs to do time math.
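
One way this could look, as a sketch only: the lookback becomes a `Duration` and the window is computed on `Instant` values, converting back to epoch milliseconds at the boundary. The class name, the `resolveRange` helper, and the injected `now` parameter are hypothetical and not part of the PR.

```java
import java.time.Duration;
import java.time.Instant;

final class TimeRangeSketch {
    static final Duration DEFAULT_LOOKBACK = Duration.ofDays(30);

    // Resolves an optional [start, end] window; null inputs fall back to defaults.
    // Returns {startMillis, endMillis}.
    static long[] resolveRange(Long startTimeMillis, Long endTimeMillis, Instant now) {
        Instant end = endTimeMillis != null ? Instant.ofEpochMilli(endTimeMillis) : now;
        Instant start = startTimeMillis != null
            ? Instant.ofEpochMilli(startTimeMillis)
            : end.minus(DEFAULT_LOOKBACK);
        return new long[] {start.toEpochMilli(), end.toEpochMilli()};
    }

    public static void main(String[] args) {
        long[] range = resolveRange(null, null, Instant.now());
        System.out.println(range[1] - range[0]); // 2592000000 (30 days in milliseconds)
    }
}
```

Passing the current time in as a parameter is a design choice made here to keep the helper deterministic in tests; the resolver could equally call `Instant.now()` directly.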

// Batch ingest all proposals
entityClient.batchIngestProposals(opContext, mcps, false);

log.info("Updated contents for document {}", documentUrn);

This looks more like a debug log.

log.error(
"Failed to clear entity references for Document with URN {}: {}",
documentUrn,
e.getMessage());

I think we want the stacktrace?
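
To illustrate what is lost when only `e.getMessage()` is logged, here is a small self-contained sketch; the class name and exception messages are made up. The top-level message omits the cause chain, while the stack trace includes it. Assuming `log` is an SLF4J logger (which the `{}` placeholders suggest), passing the exception as the final argument, as in `log.error("Failed to clear entity references for Document with URN {}", documentUrn, e)`, makes the logger print the stack trace.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

final class StackTraceVsMessage {
    // Renders the full stack trace, including "Caused by:" sections, as a string.
    static String stackTraceOf(Throwable t) {
        StringWriter out = new StringWriter();
        t.printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    public static void main(String[] args) {
        Exception e = new RuntimeException(
            "Failed to clear references", new IllegalStateException("index unavailable"));

        // Only the outer message; the root cause is not visible.
        System.out.println(e.getMessage());
        // Full trace; contains "Caused by: java.lang.IllegalStateException: index unavailable".
        System.out.println(stackTraceOf(e));
    }
}
```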

}
} catch (Exception e) {
// If we can't get parent info, assume no cycle for safety
log.warn("Failed to check parent info for {}: {}", currentParent, e.getMessage());

Log the exception object

infoProposal.setAspect(GenericRecordUtils.serializeAspect(publishedInfo));
entityClient.ingestProposal(opContext, infoProposal, false);

log.info("Merged draft {} into published document {}", draftUrn, publishedUrn);

This and others look like they should be debug level. We can always turn up log levels for specific modules at runtime.

@datahub-cyborg datahub-cyborg bot added the pending-submitter-merge label and removed the needs-review label Nov 10, 2025