Jina Code Embeddings 1.5b
Jina AI
Efficient code embeddings from code generation models
jina-code-embeddings-1.5b
jina-code-embeddings-1.5b is a 1.54-billion-parameter code embedding model designed for:
- Retrieving code from natural language queries
- Technical Q&A
- Identifying similar code across programming languages
Built on the Qwen2.5-Coder-1.5B backbone, it generates embeddings via last-token pooling and addresses the limitations of traditional code embedding models, which rely on limited aligned data such as comments and docstrings.
Instead, this model leverages vast unaligned code and documentation corpora from LLM training datasets, enabling robust generalization and superior code understanding.
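The last-token pooling mentioned above can be sketched as follows: for each sequence, the embedding is the decoder's hidden state at the final non-padding position. This is a minimal NumPy illustration of the pooling operation only, not the model's actual implementation:

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last non-padding token in each sequence.

    hidden_states:  (batch, seq_len, dim) decoder outputs
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    # Index of the last real token per row (masks are left-aligned here).
    last_idx = attention_mask.sum(axis=1) - 1
    batch_idx = np.arange(hidden_states.shape[0])
    return hidden_states[batch_idx, last_idx]  # (batch, dim)

# Toy example: batch of 2 sequences, max length 4, hidden size 3.
h = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # 3 real tokens, 1 pad
                 [1, 1, 1, 1]])  # 4 real tokens
pooled = last_token_pool(h, mask)  # rows h[0, 2] and h[1, 3]
```

Because the backbone is a causal decoder, the last token is the only position that has attended to the entire input, which is why it serves as the sequence summary.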
It supports five task categories with specific instruction prefixes: NL2Code, TechQA, Code2Code, Code2NL, and Code2Completion.
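Selecting a task prefix at inference time amounts to prepending a task-specific instruction string to the query or passage before encoding. The sketch below shows the mechanism only; the prefix strings here are placeholders, and the exact prefixes the model was trained with should be taken from the official model card:

```python
# Placeholder instruction prefixes, one per supported task category.
# NOTE: these strings are illustrative assumptions -- consult the
# jina-code-embeddings-1.5b model card for the real trained prefixes.
TASK_PREFIXES = {
    "nl2code":         "Instruction for NL2Code: ",
    "techqa":          "Instruction for TechQA: ",
    "code2code":       "Instruction for Code2Code: ",
    "code2nl":         "Instruction for Code2NL: ",
    "code2completion": "Instruction for Code2Completion: ",
}

def build_input(task: str, text: str) -> str:
    """Prepend the task's instruction prefix before passing text to the encoder."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_PREFIXES[task] + text

query = build_input("nl2code", "reverse a linked list in place")
```

The prefixed string is then tokenized and embedded as usual; queries and documents for the same task typically use different prefixes on each side of the retrieval pair.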
The model implements Matryoshka representation learning for truncatable embeddings, allowing flexible precision-resource trade-offs. Despite its larger size, it maintains practical deployment characteristics while achieving performance competitive with much larger alternatives.
Highlights
- Multilingual support: Covers 15+ programming languages and works across various domains including web development, software engineering, machine learning, data science, and educational coding problems.
- Task-specific instruction prefixes are available for all five supported tasks and can be selected at inference time.
- Flexible embedding size: Default is 1536 dimensions, but embeddings can be truncated down to as low as 128 dimensions with minimal performance loss.
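Matryoshka-style truncation, as described above, means a consumer can keep only the first k dimensions of the 1536-dimensional output and re-normalize. A minimal sketch of that post-processing step (assuming unit-normalized embeddings compared by cosine similarity):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka embedding,
    then re-normalize so cosine similarity remains a dot product."""
    v = vec[:dim]
    return v / np.linalg.norm(v)

# Toy stand-in for a model output at the default 1536 dimensions.
rng = np.random.default_rng(0)
full = rng.standard_normal(1536)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 128)  # 12x smaller vector, unit norm
```

Smaller truncations cut index memory and similarity-computation cost roughly linearly, at the price of some retrieval quality, which is the precision-resource trade-off the model card refers to.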