In this article, I’ll show you how to easily create an AI agent that uses Confluence as a knowledge base to answer your questions.
First, you’ll need a vector database. In this guide, we’ll use Milvus. The following configuration is required:
docker-compose.yaml:
services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ./volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: [ "CMD", "etcdctl", "endpoint", "health" ]
      interval: 30s
      timeout: 20s
      retries: 3
  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - ./volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:9000/minio/health/live" ]
      interval: 30s
      timeout: 20s
      retries: 3
  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.4.9
    command: [ "milvus", "run", "standalone" ]
    security_opt:
      - seccomp:unconfined
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ./milvus.yaml:/milvus/configs/milvus.yaml
      - ./volumes/milvus:/var/lib/milvus
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:9091/healthz" ]
      interval: 30s
      start_period: 90s
      timeout: 20s
      retries: 3
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"
Next, download the milvus.yaml configuration file from https://github.com/milvus-io/milvus/blob/master/configs/milvus.yaml and place it next to docker-compose.yaml. Then run: docker compose up. That's it: the vector database is now running.
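If you want to confirm the server is reachable before moving on, a quick check with pymilvus works. This is just a sketch; it assumes the default port from the compose file above and the root:Milvus credentials used later in this article:

from pymilvus import connections, utility

# Connect to the standalone Milvus instance started by docker compose
connections.connect("default", uri="http://localhost:19530", token="root:Milvus")
print(utility.get_server_version())  # prints the running version, e.g. v2.4.9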
Vector databases are specialized systems optimized for storing and searching vector embeddings — mathematical representations of data in multi-dimensional space. These vector embeddings represent the semantic meaning or features of various types of data (text, images, audio, etc.) as points in high-dimensional space. Unlike traditional databases that look for exact matches, vector databases excel at finding similar items through approximate nearest neighbor (ANN) search algorithms.
Once the Confluence spaces are divided into smaller chunks, each chunk will be converted into vector embeddings using a pre-trained machine learning model, such as OpenAI's text-embedding-ada-002, Sentence Transformers, or other embedding models. These embeddings capture the semantic meaning of each text fragment, enabling efficient and intelligent search capabilities. After generating the embeddings, we will store them in a vector database.
When users query the database, their query will be converted into an embedding using the same model. The vector database will then use Approximate Nearest Neighbor (ANN) search to retrieve the most relevant results based on similarity in high-dimensional space rather than exact keyword matching.
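As a small illustration of this idea (a standalone sketch, not part of the pipeline below), here is how a query and a document chunk can be embedded with the BGE-M3 model used later in this article and compared by inner product on their dense vectors:

import numpy as np
from pymilvus.model.hybrid import BGEM3EmbeddingFunction

embeddings = BGEM3EmbeddingFunction(model_name="BAAI/bge-m3")

doc = embeddings.encode_documents(["Milvus is a vector database."])
query = embeddings.encode_queries(["What stores vector embeddings?"])

# Inner-product similarity between dense vectors: higher means more similar
score = float(np.dot(query["dense"][0], doc["dense"][0]))
print(score)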
The following packages will be needed for the implementation (note the quotes around pymilvus[model]; some shells, such as zsh, require them):
pip install atlassian-python-api langchain langchain-openai langgraph pymilvus "pymilvus[model]" pytesseract langchain-community
The core implementation consists of a system that connects to Confluence, extracts content, and transforms it into vector embeddings for storage in Milvus. The ConfluenceConnection class stores essential parameters like URL, authentication token, space key, and content processing options. When the load_confluence function is called, it uses the ConfluenceLoader to extract documents from your Confluence workspace, including attachments if specified. These documents are then split into smaller, semantically meaningful chunks using a RecursiveCharacterTextSplitter. We implement a dual-embedding approach, generating both sparse and dense vector representations for each document chunk, which enables more robust semantic search capabilities. The vectors, along with metadata such as the original document title and source, are stored in a Milvus collection with appropriate indexing for efficient retrieval. For testing purposes, we recreate the collection on every run.
from typing import List

from langchain_community.document_loaders import ConfluenceLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
    utility,
)
from pymilvus.model.hybrid import BGEM3EmbeddingFunction

connections.connect("default", uri="http://localhost:19530", token="root:Milvus")

EMBEDDINGS = BGEM3EmbeddingFunction(model_name="BAAI/bge-m3")
COLLECTION_NAME = "KNOWLEDGE_BASE"


class ConfluenceConnection:
    url: str
    token: str
    space_key: str
    include_attachments: bool
    limit: int
    ocr_languages: str
    max_pages: int


def create_collection():
    fields = [
        FieldSchema(name="pk", dtype=DataType.VARCHAR,
                    is_primary=True, auto_id=True, max_length=100),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=10_000),
        FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
        FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR,
                    dim=EMBEDDINGS.dim["dense"]),
        FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=2048),
        FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=5012),
    ]
    schema = CollectionSchema(fields, "Confluence knowledge base")
    col = Collection(COLLECTION_NAME, schema)
    # Sparse vectors get an inverted index, dense vectors a flat (exact) index;
    # both use inner product (IP) as the similarity metric.
    sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
    col.create_index("sparse_vector", sparse_index)
    dense_index = {"index_type": "FLAT", "metric_type": "IP"}
    col.create_index("dense_vector", dense_index)
    col.load()


def load_confluence(confluence: ConfluenceConnection) -> None:
    """Get data from Confluence and load it into the vector database.

    Args:
        confluence: Confluence connection settings.
    """
    loader = ConfluenceLoader(
        url=confluence.url,
        token=confluence.token,
        space_key=confluence.space_key,
        include_attachments=confluence.include_attachments,
        limit=confluence.limit,
        ocr_languages=confluence.ocr_languages,
        max_pages=confluence.max_pages,
    )
    docs = loader.load()
    chunks = RecursiveCharacterTextSplitter().split_documents(docs)
    # For testing purposes we rebuild the collection from scratch on every run.
    if utility.has_collection(COLLECTION_NAME):
        utility.drop_collection(COLLECTION_NAME)
    create_collection()
    insert_doc_to_collection(chunks, collection_name=COLLECTION_NAME)


def insert_doc_to_collection(
    documents: List[Document],
    collection_name: str,
):
    col = Collection(collection_name)
    contents = []
    titles = []
    sources = []
    for doc in documents:
        contents.append(doc.page_content)
        titles.append(doc.metadata["title"])
        sources.append(doc.metadata["source"])
    # Generate both sparse and dense embeddings in a single pass
    docs_embeddings = EMBEDDINGS.encode_documents(contents)
    # Field order must match the collection schema (pk is auto-generated)
    entities = [contents, docs_embeddings["sparse"],
                docs_embeddings["dense"], titles, sources]
    col.insert(entities)
    col.flush()
With our Confluence content now stored in Milvus, we'll implement the search functionality and create the AI agent. The system uses a hybrid search approach that combines the strengths of both dense and sparse vector embeddings. The hybrid_search function serves as a tool for our agent, performing a dual search with both embedding types and merging results using the Reciprocal Rank Fusion (RRF) algorithm, which scores each hit by summing 1/(k + rank) across the two result lists. This approach achieves better retrieval accuracy than using either search method alone.
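The body of hybrid_search is not listed above, so here is a minimal sketch of what it might look like, using pymilvus's AnnSearchRequest and built-in RRFRanker together with LangChain's tool decorator. It assumes the COLLECTION_NAME and EMBEDDINGS objects defined earlier:

from langchain_core.tools import tool
from pymilvus import AnnSearchRequest, Collection, RRFRanker


@tool
def hybrid_search(query: str) -> str:
    """Search the Confluence knowledge base for passages relevant to the query."""
    col = Collection(COLLECTION_NAME)
    query_embeddings = EMBEDDINGS.encode_queries([query])

    # One ANN request per vector field: sparse and dense
    sparse_req = AnnSearchRequest(
        data=query_embeddings["sparse"],
        anns_field="sparse_vector",
        param={"metric_type": "IP"},
        limit=5,
    )
    dense_req = AnnSearchRequest(
        data=query_embeddings["dense"],
        anns_field="dense_vector",
        param={"metric_type": "IP"},
        limit=5,
    )

    # Milvus merges the two result lists with Reciprocal Rank Fusion
    results = col.hybrid_search(
        [sparse_req, dense_req],
        rerank=RRFRanker(),
        limit=5,
        output_fields=["text", "title", "source"],
    )
    return "\n\n".join(hit.entity.get("text") for hit in results[0])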
Now let’s set up our agent to answer questions using our Confluence knowledge base. The following code demonstrates how to initialize the required components and start interacting with the agent:
import os

from dotenv import load_dotenv

# Load environment variables (API keys, etc.)
load_dotenv()
# 1. First, set up your OpenAI API key
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OpenAI API key not found. Please set it in your environment variables.")
# 2. Define your Confluence connection parameters
confluence = ConfluenceConnection()
confluence.url = "YOUR_CONFLUENCE_URL"
confluence.token = os.getenv("CONFLUENCE_TOKEN")
confluence.space_key = "SPACE_KEY"
confluence.include_attachments = False
confluence.limit = 10
confluence.ocr_languages = "hun+eng"
confluence.max_pages = 20
# 3. Load data from Confluence into Milvus
load_confluence(confluence)
# 4. Create the agent executor
agent_executor = create_agent_executor(
openai_api_key=openai_api_key,
model_name="gpt-4o"
)
# 5. Example: Use the agent to answer questions based on your Confluence knowledge base
result = agent_executor.invoke({"input": "Your question"})
print(result)
This code first loads environment variables, including the OpenAI API key and Confluence token. Then it configures the Confluence connection with parameters such as the URL, space key, and content processing options. The load_confluence function extracts the content and stores it in the Milvus vector database. Finally, we create the agent executor with the specified OpenAI model and use it to answer questions by querying the knowledge base.
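Like hybrid_search, create_agent_executor is not spelled out above. Here is one possible implementation: a sketch built on LangChain's tool-calling agent, which accepts the {"input": ...} payload used earlier and assumes the hybrid_search tool defined above:

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


def create_agent_executor(openai_api_key: str, model_name: str) -> AgentExecutor:
    llm = ChatOpenAI(model=model_name, api_key=openai_api_key)
    prompt = ChatPromptTemplate.from_messages([
        ("system",
         "You answer questions using the Confluence knowledge base. "
         "Always call the hybrid_search tool before answering."),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),  # holds intermediate tool calls
    ])
    agent = create_tool_calling_agent(llm, [hybrid_search], prompt)
    return AgentExecutor(agent=agent, tools=[hybrid_search])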
In this article, we’ve built a powerful AI agent that can answer questions based on your organization’s Confluence knowledge base. By leveraging vector embeddings and a hybrid search approach, the system can understand and retrieve information semantically, going beyond simple keyword matching.