AI Agent Using Confluence as a Knowledge Base

In this article, I’ll show you how to easily create an AI agent that uses Confluence as a knowledge base to answer your questions.

First, you’ll need a vector database. In this guide, we’ll use Milvus. The following configuration is required:

docker-compose.yaml:

services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ./volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: [ "CMD", "etcdctl", "endpoint", "health" ]
      interval: 30s
      timeout: 20s
      retries: 3

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - ./volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:9000/minio/health/live" ]
      interval: 30s
      timeout: 20s
      retries: 3

  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.4.9
    command: [ "milvus", "run", "standalone" ]
    security_opt:
      - seccomp:unconfined
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ./milvus.yaml:/milvus/configs/milvus.yaml
      - ./volumes/milvus:/var/lib/milvus
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:9091/healthz" ]
      interval: 30s
      start_period: 90s
      timeout: 20s
      retries: 3
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"

Next, download the milvus.yaml configuration file from https://github.com/milvus-io/milvus/blob/master/configs/milvus.yaml and place it next to docker-compose.yaml. Then start the stack:

docker compose up

And that's it: the vector database is running.
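If you want to verify that Milvus is reachable before going further, a quick smoke test from Python looks like this (assuming the default root:Milvus credentials, which we'll also use later):

from pymilvus import connections, utility

# Connect to the standalone Milvus exposed on the default port 19530.
connections.connect("default", uri="http://localhost:19530", token="root:Milvus")
print(utility.get_server_version())  # prints the server version if the connection works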

Vector databases are specialized systems optimized for storing and searching vector embeddings — mathematical representations of data in multi-dimensional space. These vector embeddings represent the semantic meaning or features of various types of data (text, images, audio, etc.) as points in high-dimensional space. Unlike traditional databases that look for exact matches, vector databases excel at finding similar items through approximate nearest neighbor (ANN) search algorithms.

Once the Confluence pages are split into smaller chunks, each chunk will be converted into a vector embedding using a pre-trained machine learning model, such as OpenAI's text-embedding-ada-002, Sentence Transformers, or another embedding model. These embeddings capture the semantic meaning of each text fragment, enabling efficient and intelligent search. After generating the embeddings, we will store them in the vector database.

When users query the database, their query will be converted into an embedding using the same model. The vector database will then use Approximate Nearest Neighbor (ANN) search to retrieve the most relevant results based on similarity in high-dimensional space rather than exact keyword matching.
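To make the idea concrete, here is a toy illustration of how similarity is scored in vector space. It uses plain NumPy with made-up three-dimensional vectors standing in for real embeddings, which typically have hundreds or thousands of dimensions:

import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = {
    "How to reset your password": np.array([0.9, 0.1, 0.0]),
    "Office parking rules": np.array([0.0, 0.2, 0.9]),
}
query = np.array([0.8, 0.2, 0.1])  # e.g. "changing my login credentials"

for title, vec in docs.items():
    print(title, round(cosine(query, vec), 3))

# The password document scores highest even though the query shares
# no keywords with it. That is the point of semantic search.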

The following packages are needed for the implementation:

pip install atlassian-python-api langchain langchain-openai langgraph "pymilvus[model]" pytesseract langchain-community

The core implementation consists of a system that connects to Confluence, extracts content, and transforms it into vector embeddings for storage in Milvus. The ConfluenceConnection class stores essential parameters like URL, authentication token, space key, and content processing options. When the load_confluence function is called, it uses the ConfluenceLoader to extract documents from your Confluence workspace, including attachments if specified. These documents are then split into smaller, semantically meaningful chunks using a RecursiveCharacterTextSplitter. We implement a dual-embedding approach, generating both sparse and dense vector representations for each document chunk, which enables more robust semantic search capabilities. The vectors, along with metadata such as the original document title and source, are stored in a Milvus collection with appropriate indexing for efficient retrieval. For testing purposes, we recreate the collection in every run.

from typing import List

from pymilvus import (
    connections, utility, Collection, CollectionSchema, FieldSchema, DataType,
)
from pymilvus.model.hybrid import BGEM3EmbeddingFunction
from langchain_community.document_loaders import ConfluenceLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Connect to the Milvus instance started by docker compose.
connections.connect("default", uri="http://localhost:19530", token="root:Milvus")

# BGE-M3 produces both dense and sparse embeddings in a single pass.
EMBEDDINGS = BGEM3EmbeddingFunction(model_name="BAAI/bge-m3")
COLLECTION_NAME = "KNOWLEDGE_BASE"

class ConfluenceConnection:
    """Settings for connecting to and loading a Confluence space."""
    url: str
    token: str
    space_key: str
    include_attachments: bool
    limit: int
    ocr_languages: str
    max_pages: int


def create_collection():
    """Create the Milvus collection with both sparse and dense vector fields."""
    fields = [
        FieldSchema(name="pk", dtype=DataType.VARCHAR,
                    is_primary=True, auto_id=True, max_length=100),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=10_000),
        FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
        FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=EMBEDDINGS.dim["dense"]),
        FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=2048),
        FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=5012)
    ]

    schema = CollectionSchema(fields, "Confluence knowledge base chunks")
    col = Collection(COLLECTION_NAME, schema)

    # Sparse vectors get an inverted index; dense vectors use an exact (FLAT)
    # index. Both are searched with inner product (IP) similarity.
    sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
    col.create_index("sparse_vector", sparse_index)
    dense_index = {"index_type": "FLAT", "metric_type": "IP"}
    col.create_index("dense_vector", dense_index)
    col.load()


def load_confluence(confluence) -> None:
    """
    Get data from confluence and load to a vector db.
    Args:
        confluence: Confluence settings

    Returns:

    """
    loader = ConfluenceLoader(
        url=confluence.url,
        token=confluence.token,
        space_key=confluence.space_key,
        include_attachments=confluence.include_attachments,
        limit=confluence.limit,
        ocr_languages=confluence.ocr_languages,
        max_pages=confluence.max_pages
    )

    docs = loader.load()

    chunks = RecursiveCharacterTextSplitter().split_documents(docs)

    # For testing purposes, rebuild the collection on every run.
    if utility.has_collection(COLLECTION_NAME):
        utility.drop_collection(COLLECTION_NAME)
    create_collection()
    insert_doc_to_collection(chunks, collection_name=COLLECTION_NAME)


def insert_doc_to_collection(
        documents: List[Document],
        collection_name: str
):
    col = Collection(collection_name)
    contents = []
    titles = []
    sources = []
    for doc in documents:
        contents.append(doc.page_content)
        titles.append(doc.metadata["title"])
        sources.append(doc.metadata["source"])

    docs_embeddings = EMBEDDINGS.encode_documents(contents)
    entities = [contents, docs_embeddings["sparse"], docs_embeddings["dense"], titles, sources]

    col.insert(entities)
    col.flush()

With our Confluence content now stored in Milvus, we’ll implement the search functionality and create the AI agent. The system uses a hybrid search approach that combines the strengths of both dense and sparse vector embeddings. The hybrid_search function serves as a tool for our agent, performing a dual search with both embedding types and merging results using the Reciprocal Rank Fusion (RRF) algorithm. This approach achieves better retrieval accuracy than using either search method alone.
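The exact body of hybrid_search can vary; a minimal sketch using pymilvus's AnnSearchRequest and RRFRanker against the schema defined above, exposed as a LangChain tool, might look like this (the result formatting and limit of 5 are illustrative choices):

from pymilvus import AnnSearchRequest, RRFRanker
from langchain_core.tools import tool

@tool
def hybrid_search(query: str) -> str:
    """Search the Confluence knowledge base for passages relevant to the query."""
    col = Collection(COLLECTION_NAME)
    q = EMBEDDINGS.encode_queries([query])

    # One ANN request per vector field, matching the metric of each index.
    sparse_req = AnnSearchRequest(data=q["sparse"], anns_field="sparse_vector",
                                  param={"metric_type": "IP"}, limit=5)
    dense_req = AnnSearchRequest(data=q["dense"], anns_field="dense_vector",
                                 param={"metric_type": "IP"}, limit=5)

    # Merge the two ranked result lists with Reciprocal Rank Fusion.
    hits = col.hybrid_search([sparse_req, dense_req], rerank=RRFRanker(),
                             limit=5, output_fields=["text", "title", "source"])[0]

    return "\n\n".join(
        f"{h.entity.get('title')} ({h.entity.get('source')}):\n{h.entity.get('text')}"
        for h in hits
    )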

Now let’s set up our agent to answer questions using our Confluence knowledge base. The following code demonstrates how to initialize the required components and start interacting with the agent:

import os
from dotenv import load_dotenv

# Load environment variables (API keys, etc.)
load_dotenv()

# 1. First, set up your OpenAI API key
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OpenAI API key not found. Please set it in your environment variables.")

# 2. Define your Confluence connection parameters
confluence = ConfluenceConnection()
confluence.url = "YOUR_CONFLUENCE_URL"
confluence.token = os.getenv("CONFLUENCE_TOKEN")
confluence.space_key = "SPACE_KEY"
confluence.include_attachments = False
confluence.limit = 10
confluence.ocr_languages = "hun+eng"
confluence.max_pages = 20

# 3. Load data from Confluence into Milvus
load_confluence(confluence)

# 4. Create the agent executor
agent_executor = create_agent_executor(
    openai_api_key=openai_api_key,
    model_name="gpt-4o"
)

# 5. Example: Use the agent to answer questions based on your Confluence knowledge base
result = agent_executor.invoke({"input": "Your question"})

print(result)

This code first loads environment variables, including the OpenAI API key and Confluence token. Then it configures the Confluence connection with parameters such as the URL, space key, and content processing options. The load_confluence function extracts the content and stores it in the Milvus vector database. Finally, we create the agent executor with the specified OpenAI model and use it to answer questions by querying the knowledge base.
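The body of create_agent_executor is not shown above; a minimal sketch of one possible implementation, wiring the hybrid_search tool into a LangChain tool-calling agent, could look like the following (the system prompt is an illustrative placeholder):

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

def create_agent_executor(openai_api_key: str, model_name: str) -> AgentExecutor:
    llm = ChatOpenAI(model=model_name, api_key=openai_api_key)
    prompt = ChatPromptTemplate.from_messages([
        ("system",
         "Answer the user's question using the Confluence knowledge base. "
         "Call the hybrid_search tool to retrieve relevant passages and "
         "mention their sources in your answer."),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ])
    agent = create_tool_calling_agent(llm, [hybrid_search], prompt)
    return AgentExecutor(agent=agent, tools=[hybrid_search], verbose=True)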

In this article, we’ve built a powerful AI agent that can answer questions based on your organization’s Confluence knowledge base. By leveraging vector embeddings and a hybrid search approach, the system can understand and retrieve information semantically, going beyond simple keyword matching.

