Jina Code Embeddings 0.5b
Jina AI
Efficient code embeddings from code generation models
jina-code-embeddings-0.5b
jina-code-embeddings-0.5b is a 494-million-parameter code embedding model designed for:
- Retrieving code from natural language queries
- Technical Q&A
- Identifying similar code across programming languages
Built on the Qwen2.5-Coder-0.5B backbone, it generates embeddings via last-token pooling and addresses the core limitations of traditional code embedding models, which typically rely on scarce aligned data like comments and docstrings.
Instead, this model leverages the abundant unaligned code and documentation used in LLM training, achieving state-of-the-art performance despite its compact size.
It supports five task categories with specific instruction prefixes: NL2Code, TechQA, Code2Code, Code2NL, and Code2Completion.
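To make the prefix mechanism concrete, here is a minimal sketch of how a task-specific instruction prefix might be prepended to an input before embedding. The prefix strings below are illustrative placeholders only, not the model's official prompts; the real instruction strings are published with the model.

```python
def build_input(task: str, text: str, prefixes: dict) -> str:
    """Prepend the instruction prefix for the given task category."""
    if task not in prefixes:
        raise ValueError(f"unknown task category: {task}")
    return prefixes[task] + text

# Placeholder prefixes for illustration; consult the model card for
# the actual instruction strings shipped with jina-code-embeddings-0.5b.
PREFIXES = {
    "nl2code": "Represent this query for retrieving relevant code: ",
    "techqa": "Represent this question for retrieving relevant answers: ",
    "code2code": "Represent this code for finding similar code: ",
    "code2nl": "Represent this code for retrieving its description: ",
    "code2completion": "Represent this code prefix for retrieving completions: ",
}

prefixed = build_input("nl2code", "parse a JSON file in Python", PREFIXES)
```

The prefixed string, rather than the raw query or snippet, is what gets tokenized and embedded, so the same backbone can specialize its representation per task at inference time.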
The model implements Matryoshka representation learning for truncatable embeddings, enabling flexible precision-resource trade-offs.
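A short sketch of the Matryoshka truncation idea: keep only the leading components of the full embedding and re-normalize. The 896 and 64 dimension counts come from the model description; the vector here is random data for illustration, not a real model output.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length,
    which Matryoshka-trained embeddings are designed to tolerate."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

# Stand-in for a full 896-dimensional embedding (random, for illustration).
full = np.random.default_rng(0).normal(size=896)
full /= np.linalg.norm(full)

# Truncate to 64 dimensions for a smaller index with lower memory cost.
small = truncate_embedding(full, 64)
```

Because similarity is typically computed with cosine distance on unit vectors, re-normalizing after truncation keeps downstream retrieval code unchanged while trading a small amount of accuracy for an ~14x smaller footprint.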
Highlights
- Multilingual support: Covers 15+ programming languages and works across various domains including web development, software engineering, machine learning, data science, and educational coding problems.
- Task-specific instruction prefixes are available for all five supported tasks and can be selected at inference time.
- Flexible embedding size: Default is 896 dimensions, but embeddings can be truncated to as few as 64 dimensions with minimal performance loss.