Spaces: Running on Zero
Initial Commit
- .gitignore +184 -0
- README.md +100 -6
- app.py +635 -0
- requirements.txt +14 -0
- src/inference/inference.py +231 -0
- src/model/layers.py +86 -0
- src/model/transformer.py +281 -0
.gitignore
ADDED
@@ -0,0 +1,184 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# ML/AI specific files
*.pkl
*.pickle
*.h5
*.hdf5
*.ckpt
*.pt
*.pth
*.safetensors

# Model training artifacts
wandb/
runs/
logs/
tensorboard/

# Data directories
data/
datasets/

# Model checkpoints (excluding checkpoint directory as user mentioned git LFS)
# User tracks model files in checkpoints/ with git LFS, so we won't ignore it

# Temporary files
*.tmp
*.temp
.DS_Store
Thumbs.db

# IDE files
.vscode/
.idea/
*.swp
*.swo
*~

# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Gradio temporary files
gradio_cached_examples/
flagged/

# HuggingFace cache
.cache/
cache/

# Local configuration files
config.local.*
.secrets
README.md
CHANGED
@@ -1,14 +1,108 @@
-title:
-emoji:
-colorFrom:
-colorTo:
---
title: Custom LLM - Foundational Language Model
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.43.1
app_file: app.py
pinned: false
license: mit
models:
  - gpt2
datasets:
  - tiiuae/falcon-refinedweb
tags:
  - text-generation
  - transformer
  - pytorch
  - custom-model
  - llm
  - foundational-model
short_description: A custom 2B parameter foundational language model with streaming generation
---

# 🤖 Custom LLM - Foundational Language Model

A custom-trained foundational language model with **2 billion parameters**, built with a modern transformer architecture and deployed with streaming text generation.

## 🚀 Features

- **Custom Architecture**: Modern transformer with RoPE (Rotary Position Embedding), RMSNorm, and SwiGLU activation
- **Streaming Generation**: Real-time text generation with token-by-token streaming
- **Flexible Sampling**: Configurable temperature, top-p, top-k, and repetition penalty
- **ZeroGPU Integration**: Optimized for Hugging Face Spaces with GPU acceleration
- **Responsive UI**: Clean, intuitive Gradio interface

## 📊 Model Details

| Specification | Value |
|---------------|-------|
| **Parameters** | ~2 billion |
| **Architecture** | Custom Transformer |
| **Context Length** | 2,048 tokens |
| **Vocab Size** | 50,257 (GPT-2 tokenizer) |
| **Layers** | 24 |
| **Attention Heads** | 32 |
| **Hidden Size** | 2,048 |
| **Intermediate Size** | 8,192 |

## 🏗️ Architecture Components

- **RMSNorm**: Root Mean Square Layer Normalization for better training stability
- **RoPE**: Rotary Position Embeddings for better length extrapolation
- **SwiGLU**: Swish-gated GLU activation function for improved performance
- **Causal Attention**: Standard autoregressive attention mechanism

## 🎯 Training Details

- **Dataset**: Falcon RefinedWeb (curated web text)
- **Training Steps**: 100,000 steps
- **Learning Rate**: 6e-4 with warmup and decay
- **Batch Size**: 32 (4 per device × 8 accumulation steps)
- **Optimization**: AdamW with β1=0.9, β2=0.95
- **Precision**: Mixed precision (FP16)

## 🛠️ Generation Parameters

- **Max Tokens**: Control the length of generated text (1-1024)
- **Temperature**: Sampling randomness (0.1-2.0, higher = more creative)
- **Top-p**: Nucleus sampling threshold (0.1-1.0)
- **Top-k**: Top-k sampling limit (0-200, 0 = disabled)
- **Repetition Penalty**: Reduce repetitive text (1.0-2.0)

## 💡 Usage Tips

1. **For Creative Writing**: Use higher temperature (1.0-1.5) and top-p (0.9-0.95)
2. **For Factual Content**: Use lower temperature (0.3-0.7) and top-p (0.8-0.9)
3. **For Code Generation**: Use temperature ~0.2 with top-k filtering
4. **Longer Context**: The model handles up to 2,048 tokens of context

## 🚨 Limitations

- **Knowledge Cutoff**: Training data knowledge cutoff varies by source
- **Biases**: May reflect biases present in training data
- **Factuality**: Generated content should be verified for factual accuracy
- **Context Window**: Limited to 2,048 tokens (approximately 1,500 words)

## 🔧 Technical Implementation

The model uses a custom PyTorch implementation with:
- Efficient attention mechanisms
- Memory-optimized layer implementations
- Streaming generation with proper token handling
- GPU acceleration via ZeroGPU

## 📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Hugging Face for the Spaces platform and ZeroGPU infrastructure
- The open-source community for transformer implementations and best practices
- TII UAE for the Falcon RefinedWeb dataset

---

**Note**: This is a foundational language model trained for research and educational purposes. Please use responsibly and be aware of potential biases and limitations.
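The generation parameters above correspond to standard logits post-processing before sampling. The following is a minimal sketch of how temperature, top-k, and top-p are typically chained together; it mirrors the filtering logic in `app.py` below (repetition penalty omitted), and the vocabulary size and random logits are placeholders, not model output.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_p=0.9, top_k=50):
    """Pick one token id from a [vocab_size] logits vector."""
    logits = logits / temperature                       # sharpen or flatten the distribution
    if top_k > 0:                                       # keep only the k most likely tokens
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    if top_p < 1.0:                                     # nucleus: keep the smallest set with mass >= top_p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum > top_p
        remove[1:] = remove[:-1].clone()                # always keep the single most likely token
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")
    return torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).item()

next_id = sample_next_token(torch.randn(50257))         # e.g. over the GPT-2 vocabulary
```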
app.py
ADDED
@@ -0,0 +1,635 @@
"""Gradio app for the custom LLM with streaming support and ZeroGPU integration."""

import gradio as gr
import torch
import torch.nn.functional as F
from typing import Iterator, Optional, Union, List
from transformers import AutoTokenizer
import json
import warnings
import sys
from pathlib import Path

# Add src to path
sys.path.append(str(Path(__file__).parent))

warnings.filterwarnings("ignore")

try:
    import spaces
    HAS_SPACES = True
except ImportError:
    HAS_SPACES = False
    # Mock decorator for local testing
    def spaces_decorator(gpu_memory=None):
        def decorator(func):
            return func
        return decorator
    spaces = type('MockSpaces', (), {'GPU': spaces_decorator})

from src.model.transformer import TransformerForCausalLM


class StreamingTextGenerator:
    """Streaming text generation for the custom LLM."""

    def __init__(self, model, tokenizer, device='cuda'):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.model.to(device)
        self.model.eval()

    def generate_stream(
        self,
        prompt: str,
        max_new_tokens: int = 512,
        temperature: float = 0.8,
        top_p: float = 0.9,
        top_k: Optional[int] = 50,
        repetition_penalty: float = 1.1,
        do_sample: bool = True,
    ) -> Iterator[str]:
        """Generate text with streaming output."""

        # Tokenize prompt
        inputs = self.tokenizer(
            prompt,
            return_tensors='pt',
            padding=False,
            truncation=True,
            max_length=1024,  # Leave room for generation
        ).to(self.device)

        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']

        # Initialize generated sequence
        generated_ids = input_ids.clone()
        generated_text = prompt

        with torch.no_grad():
            for step in range(max_new_tokens):
                # Get model predictions
                outputs = self.model(
                    input_ids=generated_ids,
                    attention_mask=attention_mask,
                )

                # Get logits for the last token
                next_token_logits = outputs.logits[0, -1, :].clone()

                # Apply repetition penalty
                if repetition_penalty != 1.0:
                    for token_id in set(generated_ids[0].tolist()):
                        next_token_logits[token_id] /= repetition_penalty

                # Apply temperature
                if temperature > 0:
                    next_token_logits = next_token_logits / temperature

                # Apply top-k filtering
                if top_k is not None and top_k > 0:
                    top_k_logits, _ = torch.topk(next_token_logits, min(top_k, next_token_logits.size(-1)))
                    min_top_k = top_k_logits[-1]
                    next_token_logits = torch.where(
                        next_token_logits < min_top_k,
                        torch.full_like(next_token_logits, float('-inf')),
                        next_token_logits
                    )

                # Apply top-p (nucleus) filtering
                if top_p < 1.0:
                    sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
                    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

                    # Remove tokens with cumulative probability above threshold
                    sorted_indices_to_remove = cumulative_probs > top_p
                    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
                    sorted_indices_to_remove[0] = False

                    indices_to_remove = sorted_indices_to_remove.scatter(0, sorted_indices, sorted_indices_to_remove)
                    next_token_logits[indices_to_remove] = float('-inf')

                # Sample next token
                if do_sample and temperature > 0:
                    probs = F.softmax(next_token_logits, dim=-1)
                    next_token = torch.multinomial(probs, num_samples=1)
                else:
                    next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)

                # Check for EOS token
                if next_token.item() == self.tokenizer.eos_token_id:
                    break

                # Append to generated sequence
                generated_ids = torch.cat([generated_ids, next_token.unsqueeze(0)], dim=-1)

                # Update attention mask
                attention_mask = torch.cat([
                    attention_mask,
                    torch.ones((1, 1), device=self.device, dtype=attention_mask.dtype)
                ], dim=-1)

                # Decode and yield new token
                new_text = self.tokenizer.decode(
                    generated_ids[0],
                    skip_special_tokens=True,
                    clean_up_tokenization_spaces=False
                )

                # Only yield the new part
                if len(new_text) > len(generated_text):
                    generated_text = new_text
                    yield generated_text


def download_model_from_hf():
    """Download model from HuggingFace repository."""
    from huggingface_hub import hf_hub_download
    import os

    model_repo = "dixisouls/VelocityLM"
    cache_dir = Path("model_cache")
    cache_dir.mkdir(exist_ok=True)

    print("📥 Downloading model from HuggingFace...")

    # Download config.json
    config_path = hf_hub_download(
        repo_id=model_repo,
        filename="config.json",
        cache_dir=cache_dir,
        local_files_only=False
    )

    # Download pytorch_model.bin
    model_path = hf_hub_download(
        repo_id=model_repo,
        filename="pytorch_model.bin",
        cache_dir=cache_dir,
        local_files_only=False
    )

    print("✅ Model downloaded successfully!")
    return config_path, model_path


def load_model_and_tokenizer():
    """Load the trained model and tokenizer."""
    import os

    # Check if model exists locally, if not download from HF
    cache_dir = Path("model_cache")
    local_config = None
    local_model = None

    # Try to find cached files
    if cache_dir.exists():
        for root, dirs, files in os.walk(cache_dir):
            if "config.json" in files:
                local_config = Path(root) / "config.json"
            if "pytorch_model.bin" in files:
                local_model = Path(root) / "pytorch_model.bin"

    # Download if not found locally
    if not local_config or not local_model:
        config_path, model_path = download_model_from_hf()
    else:
        config_path = str(local_config)
        model_path = str(local_model)
        print("📂 Using cached model files")

    # Load config
    with open(config_path, 'r') as f:
        config = json.load(f)

    # Create model config object
    class ModelConfig:
        def __init__(self, config_dict):
            for key, value in config_dict.items():
                setattr(self, key, value)

    model_config = ModelConfig(config['model'])

    # Load model
    print("🔧 Initializing model...")
    model = TransformerForCausalLM(model_config)

    # Load state dict from pytorch_model.bin
    print("📦 Loading model weights...")
    model_state_dict = torch.load(
        model_path,
        map_location='cpu'
    )

    model.load_state_dict(model_state_dict, strict=False)
    print("✅ Model weights loaded!")

    # Load tokenizer
    print("🔤 Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(config['tokenizer']['tokenizer_name'])
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    print("🎉 Model and tokenizer ready!")
    return model, tokenizer


# Global variables for model and generator
model = None
tokenizer = None
generator = None

def initialize_model():
    """Initialize model and tokenizer."""
    global model, tokenizer, generator

    if model is None:
        print("Loading model and tokenizer...")
        model, tokenizer = load_model_and_tokenizer()
        device = "cuda" if torch.cuda.is_available() else "cpu"
        generator = StreamingTextGenerator(model, tokenizer, device=device)
        print(f"Model loaded on {device}")


@spaces.GPU(duration=120) if HAS_SPACES else lambda x: x
def generate_response(
    prompt: str,
    max_new_tokens: int = 512,
    temperature: float = 0.8,
    top_p: float = 0.9,
    top_k: int = 50,
    repetition_penalty: float = 1.1,
) -> Iterator[str]:
    """Generate streaming response."""

    # Initialize model if needed
    initialize_model()

    if not prompt.strip():
        yield "Please enter a prompt."
        return

    try:
        # Generate with streaming
        for partial_text in generator.generate_stream(
            prompt=prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k if top_k > 0 else None,
            repetition_penalty=repetition_penalty,
            do_sample=temperature > 0,
        ):
            yield partial_text

    except Exception as e:
        yield f"Error generating text: {str(e)}"


# Create Gradio interface
def create_interface():
    """Create the Gradio interface."""

    # Custom CSS for enhanced UI
    custom_css = """
    .gradio-container {
        max-width: 1200px !important;
        margin: 0 auto !important;
    }

    .header-text {
        text-align: center;
        background: linear-gradient(45deg, #667eea 0%, #764ba2 100%);
        -webkit-background-clip: text;
        -webkit-text-fill-color: transparent;
        background-clip: text;
        font-size: 2.5em !important;
        font-weight: bold !important;
        margin-bottom: 0.5em !important;
    }

    .subtitle-text {
        text-align: center;
        color: #666;
        font-size: 1.2em !important;
        margin-bottom: 2em !important;
    }

    .parameter-box {
        background: linear-gradient(135deg, #2d3748 0%, #1a202c 100%) !important;
        border-radius: 15px !important;
        padding: 20px !important;
        border: 1px solid #4a5568 !important;
    }

    .parameter-box summary {
        color: #ffffff !important;
        font-weight: bold !important;
        background: rgba(255, 255, 255, 0.1) !important;
        padding: 10px !important;
        border-radius: 10px !important;
    }

    .parameter-box details summary {
        color: #ffffff !important;
        font-weight: bold !important;
    }

    /* Make ALL text white in the parameter box */
    .parameter-box,
    .parameter-box *,
    .parameter-box label,
    .parameter-box span,
    .parameter-box p,
    .parameter-box div,
    .parameter-box small {
        color: #ffffff !important;
    }

    /* Ensure input values are also white */
    .parameter-box input[type="number"],
    .parameter-box .gr-textbox input {
        color: #ffffff !important;
        background: rgba(255, 255, 255, 0.1) !important;
        border: 1px solid #4a5568 !important;
    }

    /* Make the centered description text white too */
    .parameter-box > p {
        color: #ffffff !important;
        text-align: center !important;
    }

    .output-box {
        border-radius: 15px !important;
        border: 1px solid #e1e5e9 !important;
    }

    .generate-btn {
        background: linear-gradient(45deg, #667eea 0%, #764ba2 100%) !important;
        border: none !important;
        color: white !important;
        font-weight: bold !important;
        font-size: 1.1em !important;
        padding: 15px 30px !important;
        border-radius: 25px !important;
        box-shadow: 0 4px 15px rgba(102, 126, 234, 0.4) !important;
        transition: all 0.3s ease !important;
    }

    .generate-btn:hover {
        transform: translateY(-2px) !important;
        box-shadow: 0 6px 20px rgba(102, 126, 234, 0.6) !important;
    }

    .clear-btn {
        background: linear-gradient(45deg, #ff6b6b 0%, #ee5a24 100%) !important;
        border: none !important;
        color: white !important;
        font-weight: bold !important;
        border-radius: 20px !important;
        padding: 10px 20px !important;
        box-shadow: 0 2px 10px rgba(255, 107, 107, 0.3) !important;
    }

    .info-box {
        background: linear-gradient(135deg, #ffecd2 0%, #fcb69f 100%) !important;
        border-radius: 15px !important;
        padding: 20px !important;
        border: 1px solid #f0c27b !important;
        margin-top: 20px !important;
    }

    .example-box {
        background: linear-gradient(135deg, #e8f5e8 0%, #d4edda 100%) !important;
        border-radius: 15px !important;
        padding: 15px !important;
        border: 1px solid #c3e6cb !important;
    }

    .metric-card {
        background: white !important;
        border-radius: 10px !important;
        padding: 15px !important;
        text-align: center !important;
        box-shadow: 0 2px 10px rgba(0,0,0,0.1) !important;
        border-left: 4px solid #667eea !important;
    }

    .progress-bar {
        background: linear-gradient(45deg, #667eea 0%, #764ba2 100%) !important;
    }
    """

    with gr.Blocks(
        title="VelocityLM - Fast Text Generation",
        theme=gr.themes.Soft(
            primary_hue="blue",
            secondary_hue="purple",
            neutral_hue="gray"
        ),
        css=custom_css
    ) as demo:

        # Header with gradient text
        gr.HTML("""
        <div style="text-align: center; margin-bottom: 2rem;">
            <h1 class="header-text">VelocityLM</h1>
            <p class="subtitle-text">Advanced 2B Parameter Foundational Language Model</p>
            <div style="display: flex; justify-content: center; gap: 2rem; margin: 1.5rem 0;">
                <div class="metric-card">
                    <h3 style="margin: 0; color: #667eea;">2B+</h3>
                    <p style="margin: 5px 0 0 0; color: #666; font-size: 0.9em;">Parameters</p>
                </div>
                <div class="metric-card">
                    <h3 style="margin: 0; color: #667eea;">2048</h3>
                    <p style="margin: 5px 0 0 0; color: #666; font-size: 0.9em;">Context Length</p>
                </div>
            </div>
        </div>
        """)

        gr.Markdown(
            """
            <div style="text-align: center; background: linear-gradient(135deg, #f8f9ff 0%, #e8f0ff 100%);
                        padding: 20px; border-radius: 15px; margin-bottom: 2rem; border: 1px solid #e1e8f7;">
                <p style="margin: 0; font-size: 1.1em; color: #4a5568;">
                    🎯 <strong>Modern Architecture:</strong> RoPE • RMSNorm • SwiGLU • Multi-Head Attention<br>
                    ✨ <strong>Features:</strong> Text Generation • Configurable Sampling • GPU Accelerated
                </p>
            </div>
            """,
            elem_classes=["info-box"]
        )

        with gr.Row(equal_height=True):
            # Input Column
            with gr.Column(scale=2, min_width=400):
                gr.HTML("<div style='margin-bottom: 1rem;'><h3 style='color: #667eea; margin: 0;'>💬 Input Prompt</h3></div>")

                prompt_input = gr.Textbox(
                    lines=6,
                    placeholder="✨ Enter your creative prompt here...\n\nExample: Write a story about a future where AI and humans collaborate to solve climate change...",
                    label="Your Prompt",
                    show_copy_button=True,
                    container=True,
                    elem_classes=["input-box"]
                )

                # Advanced Parameters Section
                with gr.Accordion("🎛️ Advanced Generation Parameters", open=False, elem_classes=["parameter-box"]):
                    gr.HTML("<p style='text-align: center; color: #333; margin-bottom: 1rem;'>Fine-tune your generation settings</p>")

                    with gr.Row():
                        max_new_tokens = gr.Slider(
                            minimum=1,
                            maximum=1024,
                            value=512,
                            step=1,
                            label="🔢 Max New Tokens",
                            info="Maximum number of tokens to generate"
                        )
                        temperature = gr.Slider(
                            minimum=0.1,
                            maximum=2.0,
                            value=0.8,
                            step=0.1,
                            label="🌡️ Temperature",
                            info="Higher = more creative, lower = more focused"
                        )

                    with gr.Row():
                        top_p = gr.Slider(
                            minimum=0.1,
                            maximum=1.0,
                            value=0.9,
                            step=0.05,
                            label="🎯 Top-p",
                            info="Nucleus sampling threshold"
                        )
                        top_k = gr.Slider(
                            minimum=0,
                            maximum=200,
                            value=50,
                            step=5,
                            label="📊 Top-k",
                            info="Top-k sampling limit (0 = disabled)"
                        )

                    repetition_penalty = gr.Slider(
                        minimum=1.0,
                        maximum=2.0,
                        value=1.1,
                        step=0.05,
                        label="🔄 Repetition Penalty",
                        info="Reduce repetitive text (higher = less repetition)"
                    )

                # Generate Button with enhanced styling
                gr.HTML("<div style='margin: 1.5rem 0;'>")
                generate_btn = gr.Button(
                    "🚀 Generate Text",
                    variant="primary",
                    size="lg",
                    elem_classes=["generate-btn"],
                    scale=1
                )
                gr.HTML("</div>")

                # Quick Settings Presets
                gr.HTML("<div style='margin-top: 1rem;'><h4 style='color: #667eea; margin-bottom: 0.5rem;'>⚡ Quick Presets</h4></div>")
                with gr.Row():
                    creative_btn = gr.Button("🎨 Creative", size="sm", variant="secondary")
                    balanced_btn = gr.Button("⚖️ Balanced", size="sm", variant="secondary")
                    precise_btn = gr.Button("🎯 Precise", size="sm", variant="secondary")

            # Output Column
            with gr.Column(scale=3, min_width=500):
                gr.HTML("<div style='margin-bottom: 1rem; display: flex; justify-content: space-between; align-items: center;'><h3 style='color: #667eea; margin: 0;'>📝 Generated Output</h3></div>")

                output_text = gr.Textbox(
                    lines=22,
                    label="Generated Text",
                    show_copy_button=True,
                    interactive=False,
                    placeholder="Your generated text will appear here...\n\n✨ Streaming in real-time\n🚀 Powered by custom 2B parameter model",
                    elem_classes=["output-box"],
                    container=True
                )

                # Action buttons
                with gr.Row():
                    clear_btn = gr.Button("🗑️ Clear All", variant="secondary", elem_classes=["clear-btn"])

        # Enhanced Examples Section
        gr.HTML("<div style='margin: 2rem 0;'><h3 style='color: #667eea; text-align: center; margin-bottom: 1rem;'>🎯 Example Prompts</h3></div>")

        with gr.Accordion("📚 Prompt Examples", open=True, elem_classes=["example-box"]):
            gr.Examples(
                examples=[
                    ["Once upon a time in a distant galaxy, there lived a civilization that had never seen the stars."],
                    ["The old lighthouse keeper noticed something strange about the fog that night."],
                    ["In the depths of the Amazon rainforest, Dr. Martinez made a discovery that would change everything."],
                    ["The last bookstore on Earth was about to close its doors forever when"],
                    ["As the spaceship approached the mysterious planet, the crew realized"],
                    ["The clockmaker's shop had been abandoned for fifty years, but every morning at precisely 9 AM"],
                    ["Deep beneath the city, in tunnels forgotten by time, archaeologist Elena found"],
                    ["The message in a bottle had traveled across three oceans before washing ashore"],
                ],
                inputs=[prompt_input],
                label="Click any example to get started!",
                examples_per_page=4
            )

        # Event handlers for main functionality
        generate_btn.click(
            fn=generate_response,
            inputs=[
                prompt_input,
                max_new_tokens,
                temperature,
                top_p,
                top_k,
                repetition_penalty,
            ],
            outputs=[output_text],
            show_progress=True,
        )

        # Preset button handlers
        creative_btn.click(
            fn=lambda: (1.2, 0.95, 40, 1.05),
            outputs=[temperature, top_p, top_k, repetition_penalty]
        )

        balanced_btn.click(
            fn=lambda: (0.8, 0.9, 50, 1.1),
            outputs=[temperature, top_p, top_k, repetition_penalty]
        )

        precise_btn.click(
            fn=lambda: (0.3, 0.8, 20, 1.2),
            outputs=[temperature, top_p, top_k, repetition_penalty]
        )

        # Utility button handlers
        clear_btn.click(
            fn=lambda: ("", ""),
            outputs=[prompt_input, output_text]
        )

    return demo


if __name__ == "__main__":
    # Initialize for local testing
    demo = create_interface()
    demo.launch(
        server_name="127.0.0.1",
        server_port=7860,
        share=False,
        debug=False,
    )
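Outside the web UI, the same streaming generator can be driven from Python. A hedged usage sketch, assuming the Space's repository root as the working directory and the pinned requirements installed; the first call downloads the checkpoint from `dixisouls/VelocityLM`:

```python
# Hypothetical programmatic use of the streaming generator defined in app.py above.
from app import generate_response

final = ""
for partial in generate_response(
    "The old lighthouse keeper",
    max_new_tokens=64,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
):
    final = partial          # each yield is the full text generated so far
print(final)
```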
requirements.txt
ADDED
@@ -0,0 +1,14 @@
# Gradio app requirements for HuggingFace Spaces
gradio==4.44.0
spaces==0.29.4

# Core ML dependencies
torch==2.2.0
transformers==4.36.0
tokenizers==0.15.0

# Numerical computing
numpy==1.26.4

# Utilities
tqdm==4.66.1
src/inference/inference.py
ADDED
@@ -0,0 +1,231 @@
"""Text generation utilities for the trained model."""

import torch
import torch.nn.functional as F
from typing import List, Optional, Union
from transformers import AutoTokenizer
import logging

logger = logging.getLogger(__name__)


class TextGenerator:
    """Text generation with various decoding strategies."""

    def __init__(self, model, tokenizer, device='cuda'):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.model.to(device)
        self.model.eval()

    @torch.no_grad()
    def generate(
        self,
        prompt: Union[str, List[str]],
        max_length: int = 100,
        temperature: float = 1.0,
        top_k: Optional[int] = 50,
        top_p: Optional[float] = 0.9,
        num_return_sequences: int = 1,
        do_sample: bool = True,
        repetition_penalty: float = 1.0,
    ) -> List[str]:
        """Generate text from prompt(s)."""

        # Handle single string input
        if isinstance(prompt, str):
            prompts = [prompt]
        else:
            prompts = prompt

        # Tokenize prompts
        inputs = self.tokenizer(
            prompts,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=max_length,
        ).to(self.device)

        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']

        # Generate
        batch_size = input_ids.shape[0]
        generated_ids = input_ids.clone()

        for _ in range(max_length - input_ids.shape[1]):
            # Get model predictions
            outputs = self.model(
                input_ids=generated_ids,
                attention_mask=attention_mask,
            )

            # Get logits for the last token
            next_token_logits = outputs.logits[:, -1, :]

            # Apply repetition penalty
            if repetition_penalty != 1.0:
                for i in range(batch_size):
                    for token_id in set(generated_ids[i].tolist()):
                        next_token_logits[i, token_id] /= repetition_penalty

            # Apply temperature
            if temperature != 1.0:
                next_token_logits = next_token_logits / temperature

            # Apply top-k filtering
            if top_k is not None:
                indices_to_remove = next_token_logits < torch.topk(next_token_logits, top_k)[0][..., -1, None]
                next_token_logits[indices_to_remove] = float('-inf')

            # Apply top-p (nucleus) filtering
            if top_p is not None:
                sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

                # Remove tokens with cumulative probability above the threshold
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices_to_remove.scatter(
                    1, sorted_indices, sorted_indices_to_remove
                )
                next_token_logits[indices_to_remove] = float('-inf')

            # Sample from the distribution
            if do_sample:
                probs = F.softmax(next_token_logits, dim=-1)
                next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
            else:
                next_tokens = torch.argmax(next_token_logits, dim=-1)

            # Append to generated sequence
            generated_ids = torch.cat([generated_ids, next_tokens.unsqueeze(1)], dim=1)

            # Update attention mask
            attention_mask = torch.cat([
                attention_mask,
                torch.ones((batch_size, 1), device=self.device)
            ], dim=1)

            # Check for EOS token
            if (next_tokens == self.tokenizer.eos_token_id).all():
                break

        # Decode generated sequences
        generated_texts = []
        for i in range(batch_size):
            generated_text = self.tokenizer.decode(
                generated_ids[i],
                skip_special_tokens=True,
                clean_up_tokenization_spaces=True
            )
            generated_texts.append(generated_text)

        return generated_texts

    def beam_search(
        self,
        prompt: str,
        max_length: int = 100,
        num_beams: int = 4,
        length_penalty: float = 1.0,
        early_stopping: bool = True,
    ) -> str:
        """Generate text using beam search."""
        # Implementation of beam search
        # This is a simplified version - full implementation would be more complex

        inputs = self.tokenizer(
            prompt,
            return_tensors='pt',
            truncation=True,
            max_length=max_length,
        ).to(self.device)

        # For now, fallback to greedy decoding
        return self.generate(
            prompt,
            max_length=max_length,
            do_sample=False,
            num_return_sequences=1
        )[0]


def load_generator(checkpoint_path: str, device: str = 'cuda'):
    """Load model and create generator."""
    import yaml
    from pathlib import Path
    import sys
    sys.path.append(str(Path(__file__).parent.parent.parent))

    from src.model.transformer import TransformerForCausalLM

    # Load config
    config_path = Path(checkpoint_path) / 'config.json'
    with open(config_path, 'r') as f:
        import json
        config = json.load(f)

    # Create model config
    class ModelConfig:
        def __init__(self, config_dict):
            for key, value in config_dict.items():
                setattr(self, key, value)

    model_config = ModelConfig(config['model'])

    # Load model
    model = TransformerForCausalLM(model_config)
    state_dict = torch.load(
        Path(checkpoint_path) / 'pytorch_model.bin',
        map_location=device
    )
    model.load_state_dict(state_dict)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(config['tokenizer']['tokenizer_name'])
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Create generator
    generator = TextGenerator(model, tokenizer, device)

    return generator


if __name__ == '__main__':
    """Example usage."""
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--checkpoint', type=str, required=True, help='Path to model checkpoint')
    parser.add_argument('--prompt', type=str, required=True, help='Input prompt')
    parser.add_argument('--max-length', type=int, default=100, help='Maximum generation length')
    parser.add_argument('--temperature', type=float, default=0.8, help='Sampling temperature')
    parser.add_argument('--top-k', type=int, default=50, help='Top-k filtering')
    parser.add_argument('--top-p', type=float, default=0.9, help='Top-p (nucleus) filtering')
    parser.add_argument('--device', type=str, default='cuda', help='Device to use')

    args = parser.parse_args()

    # Load generator
    print("Loading model...")
    generator = load_generator(args.checkpoint, args.device)

    # Generate text
    print(f"Prompt: {args.prompt}")
    print("Generating...")

    generated = generator.generate(
        args.prompt,
        max_length=args.max_length,
        temperature=args.temperature,
        top_k=args.top_k,
        top_p=args.top_p,
    )

    print(f"Generated: {generated[0]}")
src/model/layers.py
ADDED
@@ -0,0 +1,86 @@
"""Custom layers for the transformer model."""

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple

import warnings
warnings.filterwarnings("ignore")

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization."""

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
        return self.weight * hidden_states.to(input_dtype)


class RotaryEmbedding(nn.Module):
    """Rotary Position Embedding."""

    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float() / self.dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        # Build cached cos/sin
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings,
            device=self.inv_freq.device,
            dtype=torch.get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)

    def forward(self, x, seq_len=None):
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
        return (
            self.cos_cached[:seq_len].to(dtype=x.dtype),
            self.sin_cached[:seq_len].to(dtype=x.dtype),
        )

    @staticmethod
    def rotate_half(x):
        x1 = x[..., : x.shape[-1] // 2]
        x2 = x[..., x.shape[-1] // 2 :]
        return torch.cat((-x2, x1), dim=-1)

    def apply_rotary_pos_emb(self, q, k, cos, sin, position_ids):
        cos = cos[position_ids].unsqueeze(1)
        sin = sin[position_ids].unsqueeze(1)
        q_embed = (q * cos) + (self.rotate_half(q) * sin)
        k_embed = (k * cos) + (self.rotate_half(k) * sin)
        return q_embed, k_embed


class SwiGLU(nn.Module):
    """SwiGLU activation function."""

    def __init__(self, hidden_size, intermediate_size, hidden_act="silu"):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = F.silu if hidden_act == "silu" else F.gelu

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
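A quick shape check for these layers; illustrative only, with arbitrary small sizes rather than the Space's 2,048-dimensional configuration, and assuming the repository root is the working directory so `src.model.layers` is importable:

```python
import torch
from src.model.layers import RMSNorm, RotaryEmbedding, SwiGLU

x = torch.randn(2, 16, 64)                    # [batch, seq_len, hidden]

print(RMSNorm(64)(x).shape)                   # torch.Size([2, 16, 64]); per-token RMS scaling

rope = RotaryEmbedding(dim=32, max_position_embeddings=128)
cos, sin = rope(x, seq_len=16)
print(cos.shape, sin.shape)                   # torch.Size([16, 32]) each: one rotation per position

mlp = SwiGLU(hidden_size=64, intermediate_size=256)
print(mlp(x).shape)                           # torch.Size([2, 16, 64]); gated expand-then-project
```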
src/model/transformer.py
ADDED
@@ -0,0 +1,281 @@
"""State-of-the-art Transformer model implementation."""

import math
from typing import Optional, Tuple
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss
from dataclasses import dataclass

import warnings
warnings.filterwarnings("ignore")

from .layers import RMSNorm, RotaryEmbedding, SwiGLU


@dataclass
class ModelOutput:
    """Model output container."""
    loss: Optional[torch.Tensor] = None
    logits: Optional[torch.Tensor] = None
    hidden_states: Optional[Tuple[torch.Tensor]] = None
    attentions: Optional[Tuple[torch.Tensor]] = None


class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with causal mask and RoPE."""

    def __init__(self, config):
        super().__init__()
        assert config.hidden_size % config.num_attention_heads == 0

        self.num_attention_heads = config.num_attention_heads
        self.head_dim = config.hidden_size // config.num_attention_heads
        self.hidden_size = config.hidden_size

        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)

        self.attention_dropout = nn.Dropout(config.attention_dropout)
        self.rotary_emb = RotaryEmbedding(
            self.head_dim,
            max_position_embeddings=config.max_position_embeddings,
            base=config.rope_theta,
        )

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        use_cache: bool = False,
    ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor]]]:
        bsz, q_len, _ = hidden_states.size()

        q = self.q_proj(hidden_states)
        k = self.k_proj(hidden_states)
        v = self.v_proj(hidden_states)

        q = q.view(bsz, q_len, self.num_attention_heads, self.head_dim).transpose(1, 2)
        k = k.view(bsz, q_len, self.num_attention_heads, self.head_dim).transpose(1, 2)
        v = v.view(bsz, q_len, self.num_attention_heads, self.head_dim).transpose(1, 2)

        # Apply rotary embeddings
        cos, sin = self.rotary_emb(v, seq_len=q_len)
        q, k = self.rotary_emb.apply_rotary_pos_emb(q, k, cos, sin, position_ids)

        # Flash attention or standard attention
        attn_weights = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(self.head_dim)

        if attention_mask is not None:
            attn_weights = attn_weights + attention_mask

        attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
        attn_weights = self.attention_dropout(attn_weights)

        attn_output = torch.matmul(attn_weights, v)
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
        attn_output = self.o_proj(attn_output)

        return attn_output, None


class TransformerBlock(nn.Module):
    """Transformer block with RMSNorm and SwiGLU."""

    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = CausalSelfAttention(config)
        self.mlp = SwiGLU(
            hidden_size=config.hidden_size,
            intermediate_size=config.intermediate_size,
            hidden_act=config.hidden_act,
        )
        self.input_layernorm = RMSNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.hidden_dropout = nn.Dropout(config.hidden_dropout)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        use_cache: bool = False,
    ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor]]]:
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            use_cache=use_cache,
        )
        hidden_states = self.hidden_dropout(hidden_states)
        hidden_states = residual + hidden_states

        # MLP
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = self.hidden_dropout(hidden_states)
        hidden_states = residual + hidden_states

        return hidden_states, present_key_value


class TransformerModel(nn.Module):
    """Main transformer model."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.vocab_size = config.vocab_size
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList(
            [TransformerBlock(config) for _ in range(config.num_hidden_layers)]
        )
        self.norm = RMSNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.gradient_checkpointing = False

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)

    def get_input_embeddings(self):
        return self.embed_tokens

    def set_input_embeddings(self, value):
        self.embed_tokens = value

    def forward(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: bool = False,
        output_hidden_states: bool = False,
        return_dict: bool = True,
    ) -> torch.Tensor:
        batch_size, seq_length = input_ids.shape

        # Embed tokens
        hidden_states = self.embed_tokens(input_ids)

        # Create position IDs
        if position_ids is None:
            position_ids = torch.arange(
                seq_length, dtype=torch.long, device=input_ids.device
            ).unsqueeze(0)

        # Create causal mask
        causal_mask = torch.triu(
            torch.full((seq_length, seq_length), float('-inf'), device=input_ids.device),
            diagonal=1
        ).unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, seq_len]

        if attention_mask is not None:
            # Convert padding mask [batch, seq_len] to 4D [batch, 1, 1, seq_len]
            # and combine with causal mask
            expanded_mask = attention_mask[:, None, None, :]  # [batch, 1, 1, seq_len]
            expanded_mask = (1.0 - expanded_mask) * -10000.0  # Convert 0s to -inf
            attention_mask = expanded_mask + causal_mask.expand(input_ids.shape[0], -1, -1, -1)
        else:
            attention_mask = causal_mask.expand(input_ids.shape[0], -1, -1, -1)

        # Forward through layers
        for layer in self.layers:
            if self.gradient_checkpointing and self.training:
                hidden_states, _ = torch.utils.checkpoint.checkpoint(
                    layer,
                    hidden_states,
                    attention_mask,
                    position_ids,
                    None,
                    False,
                    use_reentrant=False,
                )
            else:
                hidden_states, _ = layer(
                    hidden_states,
                    attention_mask=attention_mask,
                    position_ids=position_ids,
                    past_key_value=None,
                    use_cache=False,
                )

        hidden_states = self.norm(hidden_states)
        return hidden_states


class TransformerForCausalLM(nn.Module):
    """Transformer model with language modeling head."""

    def __init__(self, config):
        super().__init__()
        self.model = TransformerModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Tie weights
        self.lm_head.weight = self.model.embed_tokens.weight

    def forward(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: bool = False,
        output_hidden_states: bool = False,
        return_dict: bool = True,
    ) -> ModelOutput:
        hidden_states = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        logits = self.lm_head(hidden_states)

        loss = None
        if labels is not None:
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1)
            )

        return ModelOutput(
            loss=loss,
            logits=logits,
            hidden_states=hidden_states,
            attentions=None,
        )

    def gradient_checkpointing_enable(self):
        self.model.gradient_checkpointing = True
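A minimal smoke test for the model class, assuming a config object that simply exposes the attributes referenced in the code above; the field names come from that code, while the tiny sizes are placeholders rather than the deployed 2B configuration:

```python
import torch
from types import SimpleNamespace
from src.model.transformer import TransformerForCausalLM

# Hypothetical config: only the attributes the modules above actually read.
cfg = SimpleNamespace(
    vocab_size=50257, hidden_size=64, num_hidden_layers=2,
    num_attention_heads=4, intermediate_size=256, hidden_act="silu",
    max_position_embeddings=128, rope_theta=10000,
    attention_dropout=0.0, hidden_dropout=0.0,
    layer_norm_eps=1e-6, initializer_range=0.02,
)

model = TransformerForCausalLM(cfg)
ids = torch.randint(0, cfg.vocab_size, (1, 16))      # one sequence of 16 token ids
out = model(input_ids=ids, labels=ids)               # labels trigger the shifted LM loss
print(out.logits.shape, out.loss.item())              # torch.Size([1, 16, 50257]) and a scalar loss
```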