Custom Model Integration
Introduction
After completing vendor integration, the next step is to integrate the models under that vendor. To help you understand the entire integration process, we will use Xinference as an example and walk through a full vendor integration step by step.
It is important to note that for custom models, each model integration requires a complete vendor credential.
Unlike predefined models, a custom vendor integration always carries two parameters, `model_type` and `model_name`, which do not need to be defined in the vendor YAML file.
In the previous section, we learned that vendors do not need to implement `validate_provider_credential`; the Runtime will automatically call the corresponding model layer's `validate_credentials`, based on the model type and model name selected by the user, to perform validation.
Writing Vendor YAML
First, we need to determine which model types the vendor supports.

Currently supported model types are as follows:

- `llm` Text Generation Model
- `text_embedding` Text Embedding Model
- `rerank` Rerank Model
- `speech2text` Speech to Text
- `tts` Text to Speech
- `moderation` Moderation
Xinference supports LLM, Text Embedding, and Rerank, so we will start writing `xinference.yaml`.
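As a starting point, the top of `xinference.yaml` might look roughly like the following sketch. The labels and icon file names are illustrative placeholders, and the exact spelling of the type values (for example `text-embedding`) should follow the runtime's model type definitions:

```yaml
provider: xinference          # provider identifier
label:                        # display name shown to users
  en_US: Xorbits Inference
icon_small:                   # icons live in the provider's _assets directory
  en_US: icon_s_en.svg
icon_large:
  en_US: icon_l_en.svg
supported_model_types:        # Xinference supports LLM / Text Embedding / Rerank
- llm
- text-embedding
- rerank
configurate_methods:          # locally deployed, no predefined models,
- customizable-model          # so only customizable models are offered
```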
Next, we need to consider what credentials are required to define a model in Xinference:

- Xinference supports three different model types, so `model_type` is needed to specify which type is being added (see the credential schema sketch after this list).
- Each model has its own name, so a `model_name` field is needed.
- The address of the locally deployed Xinference server must be provided.
- Each deployed model has a unique `model_uid`, so a field for it is needed as well.
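One possible shape of the model credential section is sketched below, assuming the `model_credential_schema` structure used by the model runtime: a `model` block describing how the model name field is presented, plus form schemas for the Xinference-specific credentials. The variable names `server_url` and `model_uid`, the labels, and the placeholders are illustrative:

```yaml
model_credential_schema:
  model:                      # the model name entered by the user
    label:
      en_US: Model Name
    placeholder:
      en_US: Input model name
  credential_form_schemas:
  - variable: server_url      # address of the local Xinference deployment
    label:
      en_US: Server url
    type: text-input
    required: true
    placeholder:
      en_US: Enter the url of your Xinference server, e.g. https://example.com:9997
  - variable: model_uid       # unique uid of the deployed model
    label:
      en_US: Model uid
    type: text-input
    required: true
    placeholder:
      en_US: Enter the model uid
```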
Now, we have completed the basic definition of the vendor.
Writing Model Code
Next, we will take the `llm` type as an example and write `xinference.llm.llm.py`.

In `llm.py`, create a Xinference LLM class, which we will name `XinferenceAILargeLanguageModel` (the name is arbitrary), inheriting from the `__base.large_language_model.LargeLanguageModel` base class, and implement the following methods:
- LLM Invocation

  Implement the core method for LLM invocation, supporting both streaming and synchronous returns.

  When implementing it, note that you need two separate functions to return data: one for synchronous returns and one for streaming returns. Python treats any function containing the `yield` keyword as a generator function whose return type is fixed to `Generator`, so the two paths must be implemented separately, as shown in the first sketch after this list (the sketch uses simplified parameters; the actual implementation should follow the parameter list defined by the interface).
- Precompute Input Tokens

  If the model does not provide a precompute-tokens interface, this method can simply return 0.

  If you prefer not to return 0, you can use `self._get_num_tokens_by_gpt2(text: str)` to get an estimated token count. This method lives in the `AIModel` base class and uses GPT-2's tokenizer for the calculation; it is only a fallback and is not completely accurate.
- Model Credential Validation

  Similar to vendor credential validation, but it validates the credentials of a single model.
- Model Parameter Schema

  Unlike predefined models, no YAML file declares which parameters a custom model supports, so we need to generate the model parameter schema dynamically.

  For example, Xinference supports the `max_tokens`, `temperature`, and `top_p` parameters. However, some vendors support different parameters for different models. For instance, the vendor OpenLLM supports `top_k`, but not every model provided by this vendor supports it; say Model A supports `top_k` while Model B does not. In such cases we need to generate each model's parameter schema dynamically, as shown in the second sketch after this list.
- Invocation Error Mapping Table

  When a model invocation error occurs, it needs to be mapped to the Runtime-specified `InvokeError` types so that Dify can handle different errors differently. A sketch of the mapping appears after this list.

  Runtime Errors:

  - `InvokeConnectionError` Invocation connection error
  - `InvokeServerUnavailableError` Invocation server unavailable
  - `InvokeRateLimitError` Invocation rate limit reached
  - `InvokeAuthorizationError` Invocation authorization failed
  - `InvokeBadRequestError` Invocation parameter error
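The sketches below outline these methods; they are simplified illustrations rather than drop-in implementations. Parameter lists are abbreviated, `_call_xinference`, `_handle_sync_response`, and `_handle_stream_response` are hypothetical helpers, and the import paths assume the layout used by Dify's built-in model providers, so adjust them to your environment. First, LLM invocation with the sync/stream split, together with token precomputation and model credential validation:

```python
from collections.abc import Generator
from typing import Union

from core.model_runtime.entities.llm_entities import LLMResult
from core.model_runtime.errors.validate import CredentialsValidateFailedError
from core.model_runtime.model_providers.__base.large_language_model import LargeLanguageModel


class XinferenceAILargeLanguageModel(LargeLanguageModel):

    def _invoke(self, stream: bool, **kwargs) -> Union[LLMResult, Generator]:
        # Python treats any function containing `yield` as a generator function,
        # so the synchronous and streaming paths must live in separate helpers.
        response = self._call_xinference(stream=stream, **kwargs)  # hypothetical HTTP helper
        if stream:
            return self._handle_stream_response(response)
        return self._handle_sync_response(response)

    def _handle_stream_response(self, response) -> Generator:
        # Streaming return: yield result chunks as they arrive.
        for chunk in response:
            yield chunk

    def _handle_sync_response(self, response) -> LLMResult:
        # Synchronous return: assemble the full response into a single LLMResult.
        return LLMResult(**response)

    def get_num_tokens(self, model: str, credentials: dict, prompt_messages: list, **kwargs) -> int:
        # If the backend offers no token-counting endpoint, returning 0 is acceptable;
        # otherwise fall back to the GPT-2 tokenizer helper inherited from AIModel.
        text = " ".join(str(message.content) for message in prompt_messages)
        return self._get_num_tokens_by_gpt2(text)

    def validate_credentials(self, model: str, credentials: dict) -> None:
        # Validate a single model's credentials, e.g. by issuing a minimal request.
        try:
            self._call_xinference(stream=False, model=model, credentials=credentials)
        except Exception as ex:
            raise CredentialsValidateFailedError(str(ex))

    def _call_xinference(self, stream: bool, **kwargs):
        # Placeholder for the actual request to the Xinference server.
        raise NotImplementedError
```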
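Next, a sketch of the dynamic model parameter schema. It assumes the `get_customizable_model_schema` method and the `AIModelEntity` / `ParameterRule` entities exposed by the model runtime; the `model == 'A'` check simply stands in for however models that support `top_k` are identified:

```python
from typing import Optional

from core.model_runtime.entities.common_entities import I18nObject
from core.model_runtime.entities.model_entities import (
    AIModelEntity,
    FetchFrom,
    ModelType,
    ParameterRule,
    ParameterType,
)
from core.model_runtime.model_providers.__base.large_language_model import LargeLanguageModel


# (continuation of the XinferenceAILargeLanguageModel sketch above)
class XinferenceAILargeLanguageModel(LargeLanguageModel):

    def get_customizable_model_schema(self, model: str, credentials: dict) -> Optional[AIModelEntity]:
        # Parameters supported by every Xinference LLM.
        rules = [
            ParameterRule(
                name='temperature', type=ParameterType.FLOAT,
                use_template='temperature',
                label=I18nObject(en_US='Temperature'),
            ),
            ParameterRule(
                name='top_p', type=ParameterType.FLOAT,
                use_template='top_p',
                label=I18nObject(en_US='Top P'),
            ),
            ParameterRule(
                name='max_tokens', type=ParameterType.INT,
                use_template='max_tokens',
                min=1, default=512,
                label=I18nObject(en_US='Max Tokens'),
            ),
        ]

        # Only some models support top_k, so add that rule conditionally.
        if model == 'A':
            rules.append(
                ParameterRule(
                    name='top_k', type=ParameterType.INT,
                    use_template='top_k',
                    min=1, default=50,
                    label=I18nObject(en_US='Top K'),
                )
            )

        return AIModelEntity(
            model=model,
            label=I18nObject(en_US=model),
            fetch_from=FetchFrom.CUSTOMIZABLE_MODEL,
            model_type=ModelType.LLM,
            model_properties={},   # e.g. mode or context size, omitted in this sketch
            parameter_rules=rules,
        )
```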
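Finally, a sketch of the invocation error mapping. The `_invoke_error_mapping` property name and the `InvokeError` subclasses follow the runtime's error definitions; the concrete exception classes mapped here (from `requests`) are only examples:

```python
import requests

from core.model_runtime.errors.invoke import (
    InvokeAuthorizationError,
    InvokeBadRequestError,
    InvokeConnectionError,
    InvokeError,
    InvokeRateLimitError,
    InvokeServerUnavailableError,
)
from core.model_runtime.model_providers.__base.large_language_model import LargeLanguageModel


# (continuation of the XinferenceAILargeLanguageModel sketch above)
class XinferenceAILargeLanguageModel(LargeLanguageModel):

    @property
    def _invoke_error_mapping(self) -> dict[type[InvokeError], list[type[Exception]]]:
        # Map exceptions raised by the underlying client to the Runtime's InvokeError
        # types so Dify can react to each error category appropriately.
        return {
            InvokeConnectionError: [requests.exceptions.ConnectionError],
            InvokeServerUnavailableError: [requests.exceptions.HTTPError],
            InvokeRateLimitError: [],
            InvokeAuthorizationError: [],
            InvokeBadRequestError: [ValueError, KeyError],
        }
```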
For an explanation of interface methods, see: Interfaces. For specific implementations, refer to: llm.py.