Jina Code Embeddings 1.5b

Jina AI

Efficient code embeddings from code generation models

jina-code-embeddings-1.5b

jina-code-embeddings-1.5b is a 1.54-billion-parameter code embedding model designed for:

  • Retrieving code from natural language queries
  • Technical Q&A
  • Identifying similar code across programming languages

Built on the Qwen2.5-Coder-1.5B backbone, it generates embeddings via last-token pooling and addresses the limitations of traditional code embedding models, which rely on limited aligned data such as comments and docstrings.
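Last-token pooling takes the hidden state of the final non-padding token as the sequence embedding. A minimal sketch of the idea, using NumPy arrays in place of real transformer outputs and assuming right-padded sequences with a standard attention mask:

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last non-padding token per sequence.

    hidden_states: (batch, seq_len, dim) transformer outputs
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    # Index of the last real token in each sequence
    last_idx = attention_mask.sum(axis=1) - 1  # shape (batch,)
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

# Toy batch: 2 sequences, 4 positions, 3-dim hidden states
h = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # 3 real tokens -> pool position 2
                 [1, 1, 1, 1]])  # 4 real tokens -> pool position 3
emb = last_token_pool(h, mask)   # shape (2, 3)
```

With left padding (common for decoder-style models), the last position is always a real token and the pooled vector is simply `hidden_states[:, -1]`; the mask-based indexing above handles either convention when padding sits on the right.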

Instead, this model leverages vast unaligned code and documentation corpora from LLM training datasets, enabling robust generalization and superior code understanding.

It supports five task categories with specific instruction prefixes: NL2Code, TechQA, Code2Code, Code2NL, and Code2Completion.
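In practice, the instruction prefix for the chosen task is prepended to the input text before encoding. The sketch below illustrates the mechanism only; the prefix strings are placeholders invented for this example, not the model's actual instructions, which are defined in the model's documentation:

```python
# NOTE: these prefix strings are hypothetical placeholders for illustration;
# consult the model documentation for the exact wording per task.
TASK_PREFIXES = {
    "nl2code": "Represent this query for retrieving relevant code: ",
    "techqa": "Represent this question for retrieving relevant answers: ",
    "code2code": "Represent this code for finding similar code: ",
    "code2nl": "Represent this code for retrieving its description: ",
    "code2completion": "Represent this code prefix for retrieving completions: ",
}

def with_prefix(task: str, text: str) -> str:
    """Prepend the instruction prefix for the given task category."""
    return TASK_PREFIXES[task] + text

q = with_prefix("nl2code", "binary search in a sorted list")
```

Selecting the prefix at inference time lets a single set of weights serve all five tasks, since the instruction steers the embedding toward the intended matching behavior.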

The model implements Matryoshka representation learning for truncatable embeddings, allowing flexible precision-resource trade-offs. Despite its relatively large size, it remains practical to deploy while achieving performance competitive with much larger models.
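Because Matryoshka training concentrates the most useful signal in the leading dimensions, a full embedding can be shortened by keeping a prefix of its components and re-normalizing before computing cosine similarity. A minimal sketch with a random stand-in vector:

```python
import numpy as np

def truncate_embedding(v: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    t = v[:dim]
    return t / np.linalg.norm(t)

# Stand-in for a unit-normalized 1536-dim model output
rng = np.random.default_rng(0)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 128)  # 128-dim unit vector
```

Truncated vectors cut index storage and similarity-search cost roughly in proportion to the dimension, at the price of a small drop in retrieval quality.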

Highlights

  • Multilingual support: Covers 15+ programming languages and works across various domains including web development, software engineering, machine learning, data science, and educational coding problems.
  • Task-specific instruction prefixes are available for all five supported tasks and can be selected at inference time.
  • Flexible embedding size: Default is 1536 dimensions, but embeddings can be truncated to as few as 128 dimensions with minimal performance loss.

Please visit here for example usage.