Inference Reference

GenAI Studio supports the following model and hardware pair configurations.

Mistral-7b Configuration

T4
- slots_per_trial: 4
- max_new_tokens: 4000
- batch_size: 10
- swap_space: 8
- torch_dtype: "float16"

T4 (alternative)
- slots_per_trial: 2
- max_new_tokens: 4000
- batch_size: 10
- swap_space: 8
- torch_dtype: "float16"

V100
- slots_per_trial: 2
- max_new_tokens: 4000
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"

A100
- slots_per_trial: 1
- max_new_tokens: 4000
- batch_size: 10
- swap_space: 8
- torch_dtype: "float16"
Llama-2-7b Configuration

T4
- slots_per_trial: 4
- max_new_tokens: 4000
- batch_size: 10
- swap_space: 8
- torch_dtype: "float16"

V100
- slots_per_trial: 2
- max_new_tokens: 4000
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"

A100
- slots_per_trial: 1
- max_new_tokens: 4000
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"

Llama-2-13b Configuration

T4
- slots_per_trial: 4
- max_new_tokens: 4000
- batch_size: 10
- swap_space: 8
- torch_dtype: "float16"

T4 (alternative)
- slots_per_trial: 2
- max_new_tokens: 4000
- batch_size: 10
- swap_space: 8
- torch_dtype: "float16"

V100
- slots_per_trial: 2
- max_new_tokens: 1500
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"

V100 (alternative)
- slots_per_trial: 4
- max_new_tokens: 4000
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"

A100
- slots_per_trial: 1
- max_new_tokens: 4000
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"

A100 (alternative)
- slots_per_trial: 2
- max_new_tokens: 4000
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"
Llama-2-70b Configuration

A100
- slots_per_trial: 4
- max_new_tokens: 4000
- batch_size: 100
- swap_space: 16
- torch_dtype: "float16"
falcon-7b Configuration

A100
- slots_per_trial: 1
- max_new_tokens: 2000
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"

falcon-40b Configuration

A100
- slots_per_trial: 4
- max_new_tokens: 2000
- batch_size: 100
- swap_space: 16
- torch_dtype: "float16"

V100
- slots_per_trial: 8
- max_new_tokens: 2000
- batch_size: 100
- swap_space: 16
- torch_dtype: "float16"
mpt-7b Configuration

A100
- slots_per_trial: 1
- max_new_tokens: 2000
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"

V100
- slots_per_trial: 2
- max_new_tokens: 2000
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"

T4
- slots_per_trial: 2
- max_new_tokens: 2000
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"

T4 (alternative)
- slots_per_trial: 4
- max_new_tokens: 2000
- batch_size: 100
- swap_space: 8
- torch_dtype: "float16"

mpt-30b Configuration

A100
- slots_per_trial: 2
- max_new_tokens: 4000
- batch_size: 100
- swap_space: 16
- torch_dtype: "float16"

A100 (alternative)
- slots_per_trial: 4
- max_new_tokens: 4000
- batch_size: 100
- swap_space: 16
- torch_dtype: "float16"

V100
- slots_per_trial: 8
- max_new_tokens: 4000
- batch_size: 1
- swap_space: 16
- torch_dtype: "float16"
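
As a sketch of how these values might appear in an experiment configuration file, the fragment below uses the Mistral-7b A100 values from the table above. Note that the `resources` and `hyperparameters` section names are illustrative assumptions; only the five documented parameters are taken from this reference, and the exact schema depends on your GenAI Studio version.

```
# Illustrative sketch only -- section names are assumptions, not the
# official schema. Values: Mistral-7b on a single A100 (see table above).
resources:
  slots_per_trial: 1        # one A100 GPU per inference trial
hyperparameters:
  max_new_tokens: 4000      # upper bound on newly generated tokens
  batch_size: 10            # inputs processed per batch
  swap_space: 8             # GB of disk space used as virtual memory
  torch_dtype: "float16"    # half-precision weights and computation
```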

Configuration Description

  • slots_per_trial: The number of slots (GPUs) each trial (for example, a single inference run) will use. For example, if slots_per_trial is set to 8 and the hardware type is V100, then one inference task needs 8 V100 GPUs.

  • max_new_tokens: The maximum number of new tokens that can be processed or generated during a training or inference task. Tokens are units of text, such as words or subwords, used in natural language processing. For example, if a model has a max_new_tokens value of 4000, it can generate or process up to 4000 new tokens.

  • batch_size: The number of inputs (for example, prompts) the model processes together in a single batch.

  • swap_space: The region of the computer’s hard drive used as virtual memory when physical RAM is fully utilized, letting the system temporarily store data that does not fit in physical memory. A swap_space of 16 indicates that 16 GB of hard-drive space is allocated as virtual memory.

  • torch_dtype: Specifies the data type used for model weights and computation. For example, float16 halves memory usage relative to float32 and can speed up computation.
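
The memory effect of torch_dtype is easy to verify. The sketch below uses NumPy arrays as a stand-in for PyTorch weight tensors (the principle is identical): a float16 element occupies 2 bytes versus 4 bytes for float32, halving the memory footprint of the weights.

```python
import numpy as np

# A stand-in "weight tensor" of one million parameters.
weights_fp32 = np.ones(1_000_000, dtype=np.float32)

# Casting to half precision, as torch_dtype: "float16" does for model weights.
weights_fp16 = weights_fp32.astype(np.float16)

print(weights_fp32.nbytes)  # 4000000 bytes: 4 bytes per float32 element
print(weights_fp16.nbytes)  # 2000000 bytes: 2 bytes per float16 element
```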