Create Knowledge Base & Upload Documents
Steps to upload documents into Knowledge:
- Select the document you need to upload from your local files;
- Segment and clean the document, and preview the effect;
- Choose and configure Index Mode and Retrieval Settings;
- Wait for the chunks to be embedded;
- Once the upload is complete, you can use the knowledge base in your applications 🎉
1 Creating a Knowledge Base
Click on Knowledge in the main navigation bar of Dify. On this page, you can see your existing knowledge bases. Click Create Knowledge to enter the setup wizard:
- If you have not prepared any documents yet, you can first create an empty knowledge base;
- When creating a knowledge base with an external data source (such as Notion or Sync from website), the knowledge base type becomes immutable. This restriction prevents the management complexity that could arise from combining multiple data sources in a single knowledge base.
For scenarios requiring multiple data sources, we recommend creating separate knowledge bases for each source. You can then utilize the Multiple-Retrieval feature to reference multiple knowledge bases within the same application.
Limitations for uploading documents:
- The upload size limit for a single document is 15MB;

Create Knowledge Base
2 Text Preprocessing and Cleaning
After uploading content to the knowledge base, it needs to undergo chunking and data cleaning. This stage can be understood as content preprocessing and structuring.
Two strategies are supported:
- Automatic mode
- Custom mode
Automatic
Automatic mode is designed for users unfamiliar with segmentation and preprocessing techniques. In this mode, Dify automatically segments and cleans content files, streamlining the document preparation process.

Automatic mode
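Conceptually, automatic segmentation splits a long document into overlapping chunks before cleaning. The sketch below is a minimal illustration of the idea; the chunk size and overlap values are arbitrary placeholders, not Dify's actual defaults:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap, so that context
    spanning a chunk boundary is not lost."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "word " * 300            # a toy 1500-character document
chunks = chunk_text(doc)
```

Overlap is a common trade-off in chunking: larger overlap preserves more cross-boundary context at the cost of storing (and embedding) more duplicated text.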
3 Indexing Mode
Choose an indexing method for the text to determine how it will be matched during retrieval. The indexing strategy is closely tied to the retrieval method, so select the retrieval settings appropriate to your scenario.
- High-Quality Mode
- Economical Mode
In High-Quality mode, the system uses a configurable Embedding model (which can be switched later) to convert chunk text into numerical vectors. This enables efficient compression and persistent storage of large-scale textual data, while improving the accuracy of LLM-user interactions.
The High-Quality indexing method offers three retrieval settings: vector retrieval, full-text retrieval, and hybrid retrieval. For more details on retrieval settings, please check “Retrieval Settings”.

High Quality
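To make the vectorization step concrete, here is a toy bag-of-words "embedding" standing in for a real Embedding model. The vocabulary and counting scheme are purely illustrative; real Embedding models produce dense vectors of a fixed dimension that capture semantics rather than word counts:

```python
from collections import Counter

VOCAB = ["knowledge", "base", "upload", "document", "retrieval"]

def embed(text: str) -> list[float]:
    """Toy embedding: count vocabulary words in the text.
    A real Embedding model produces dense semantic vectors instead."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

# Indexing stores a (vector, chunk) pair per chunk for later retrieval.
index = [(embed(c), c) for c in ["upload a document",
                                 "create a knowledge base"]]
```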
4 Retrieval Settings
In high-quality indexing mode, Dify offers three retrieval settings:
- Vector Search
- Full-Text Search
- Hybrid Search
Definition: The system vectorizes the user’s input query to generate a query vector. It then computes the distance between this query vector and the text vectors in the knowledge base to identify the most semantically proximate text chunks.

Vector Search Settings
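The definition above can be sketched as a nearest-neighbour lookup by cosine similarity. The two-dimensional vectors and chunk texts below are made up for illustration; real query and chunk vectors come from the Embedding model:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def vector_search(query_vec, index, top_k=3):
    """Rank stored (vector, chunk) pairs by similarity to the query vector."""
    scored = [(cosine(query_vec, vec), chunk) for vec, chunk in index]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]

index = [([1.0, 0.0], "chunk about apples"),
         ([0.0, 1.0], "chunk about bicycles"),
         ([0.9, 0.1], "another chunk about apples")]
results = vector_search([1.0, 0.0], index, top_k=2)
```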
Vector Search Settings:
Rerank Model: After configuring the API key for a Rerank model on the “Model Provider” page, you can enable “Rerank Model” in the retrieval settings. The system will then semantically reorder the retrieved document results, optimizing the ranking. Once a Rerank model is enabled, the TopK and Score Threshold settings only take effect during the reranking step.
TopK: This parameter selects the text chunks most similar to the user’s question. The system also dynamically adjusts the number of chunks based on the context window size of the selected model. The default value is 3; a higher value retrieves more text chunks.
Score Threshold: This parameter sets the similarity threshold for filtering text chunks. Only text chunks whose score exceeds the threshold will be recalled. By default, this setting is off, meaning recalled text chunks are not filtered by similarity score. When enabled, the default value is 0.5. A higher value demands greater similarity between the text and the question, so fewer chunks are recalled.
The TopK and Score configurations are only effective during the Rerank phase. Therefore, to apply either of these settings, it is necessary to add and enable a Rerank model.
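The interaction between these settings can be sketched as follows: rerank scores are computed first, then Score Threshold filters the results and TopK truncates them. The scores and chunk labels below are made up for illustration:

```python
def apply_retrieval_settings(scored_chunks, top_k=3, score_threshold=None):
    """scored_chunks: list of (rerank_score, chunk) pairs.
    Score Threshold drops low-similarity chunks (off when None),
    then TopK keeps the best-scoring remainder."""
    if score_threshold is not None:
        scored_chunks = [(s, c) for s, c in scored_chunks if s >= score_threshold]
    scored_chunks.sort(key=lambda sc: sc[0], reverse=True)
    return scored_chunks[:top_k]

reranked = [(0.91, "A"), (0.42, "B"), (0.77, "C"), (0.55, "D")]
kept = apply_retrieval_settings(reranked, top_k=3, score_threshold=0.5)
```

Note how chunk "B" is dropped by the threshold before TopK ever applies, which is why both settings only matter once reranking has produced the scores.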
In the Economical indexing mode, Dify offers a single retrieval setting:
Inverted Index
An inverted index is an index structure designed for rapid keyword retrieval in documents. Its fundamental principle involves mapping keywords from documents to lists of documents containing those keywords, thereby enhancing search efficiency. For a detailed explanation of the underlying mechanism, please refer to the “Inverted Index”.
TopK: This parameter selects the text chunks most similar to the user’s question. The system also dynamically adjusts the number of chunks based on the context window size of the selected model. The default value is 3; a higher value retrieves more text chunks.

Inverted Index
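The mapping described above can be sketched in a few lines. The example chunks are made up for illustration; a production inverted index would also handle tokenization, stemming, and scoring:

```python
from collections import defaultdict

def build_inverted_index(chunks: list[str]) -> dict:
    """Map each keyword to the set of chunk ids that contain it."""
    index = defaultdict(set)
    for chunk_id, chunk in enumerate(chunks):
        for word in chunk.lower().split():
            index[word].add(chunk_id)
    return index

def keyword_search(index, query: str) -> set:
    """Return ids of chunks containing every keyword in the query."""
    ids = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*ids) if ids else set()

chunks = ["upload a document",
          "create a knowledge base",
          "document retrieval settings"]
idx = build_inverted_index(chunks)
hits = keyword_search(idx, "document")
```

Because lookup is a direct dictionary access per keyword, query time is independent of how many chunks do not contain the keyword, which is where the efficiency gain comes from.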
After specifying the retrieval settings, you can refer to Retrieval Test/Citation Attribution to check the matching between keywords and content chunks.
5 Complete Upload
After configuring all the settings mentioned above, simply click “Save and Process” to complete the creation of your knowledge base. You can refer to Integrate Knowledge Base Within Application to build an LLM application that can answer questions based on the knowledge base.
Reference
Optional ETL Configuration
In production-level applications of RAG, to achieve better data recall, multi-source data needs to be preprocessed and cleaned, i.e., ETL (extract, transform, load). To enhance the preprocessing capabilities of unstructured/semi-structured data, Dify supports optional ETL solutions: Dify ETL and Unstructured ETL.
Unstructured can efficiently extract and transform your data into clean data for subsequent steps.
| Dify ETL | Unstructured ETL |
|---|---|
| txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv | txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv, eml, msg, pptx, ppt, xml, epub |
Different ETL solutions may have differences in file extraction effects. For more information on Unstructured ETL’s data processing methods, please refer to the official documentation.
Embedding Model
Embedding transforms discrete variables (words, sentences, documents) into continuous vector representations, mapping high-dimensional data to lower-dimensional spaces. This technique preserves crucial semantic information while reducing dimensionality, enhancing content retrieval efficiency.
Embedding models, specialized large language models, excel at converting text into dense numerical vectors, effectively capturing semantic nuances for improved data processing and analysis.
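As a minimal illustration of the fixed-dimension mapping, the "hashing trick" projects arbitrary text into a vector of a chosen length. Unlike a learned Embedding model, this preserves no semantics at all; it only shows the shape of the transformation:

```python
def hashed_embedding(text: str, dim: int = 8) -> list[float]:
    """Map any text into a fixed dim-length vector by hashing each word
    into a bucket. A learned Embedding model instead places semantically
    similar texts near each other in the vector space."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec
```

Whatever the input length, the output dimension is constant, which is what allows large text collections to be stored and compared efficiently as vectors.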