Vishwas1 committed on
Commit 411c845 · verified · 1 Parent(s): 41b7476

Upload 7 files

Files changed (7)
  1. BLOG.md +1 -0
  2. DEMO_README.md +1 -0
  3. FEATURE_SUMMARY.md +178 -0
  4. README.md +4 -3
  5. SPACE_BLOG.md +1 -0
  6. app.py +196 -21
  7. requirements.txt +2 -1
BLOG.md CHANGED
@@ -365,3 +365,4 @@ As enterprises generate ever more complex documents, the need for intelligent, a
365
  ### Q: What about languages other than English?
366
 
367
  **A:** Currently optimized for English, with beta support for Spanish, French, and German. Multi-language support is on our roadmap based on customer demand.
368
+
DEMO_README.md CHANGED
@@ -246,3 +246,4 @@ python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained
246
  ```
247
 
248
  🚀 **Your Active Reading demo will be live in minutes!**
249
+
FEATURE_SUMMARY.md ADDED
@@ -0,0 +1,178 @@
1
+ # 🎯 New Features Added to Active Reading Demo
2
+
3
+ ## 📂 **Category Selection Feature**
4
+
5
+ ### What It Does
6
+ Users can now select a document category manually, overriding automatic detection:
7
+
8
+ **Available Categories:**
9
+ - **Auto-Detect** (default) - AI detects domain automatically
10
+ - **Finance** - Financial reports, earnings, budgets
11
+ - **Legal** - Contracts, agreements, policies
12
+ - **Technical** - API docs, manuals, specifications
13
+ - **Medical** - Clinical trials, research, treatments
14
+ - **General** - Any other document type
15
+
16
+ ### Category-Specific Extraction Patterns
17
+
18
+ #### 📊 Finance Category
19
+ - **Revenue**: `$150 million revenue`, `sales of $2.5B`
20
+ - **Profit**: `profit margin 25%`, `net profit $50M`
21
+ - **Growth**: `15% growth`, `increased by 20%`
22
+ - **Dates**: `Q3 2024`, `fiscal year 2023`
23
+ - **Employees**: `hire 200 engineers`, `workforce of 5000`
24
+ - **Market Cap**: `market cap $10B`
25
+
26
+ #### ⚖️ Legal Category
27
+ - **Parties**: `between Company A and Company B`
28
+ - **Terms**: `term of 36 months`, `duration 3 years`
29
+ - **Liability**: `liability not to exceed $1M`
30
+ - **Termination**: `90 days written notice`
31
+ - **Governing Law**: `governed by laws of Delaware`
32
+ - **Effective Date**: `effective January 1, 2024`
33
+
34
+ #### 🔧 Technical Category
35
+ - **API Endpoints**: `GET /api/users`, `POST /auth/login`
36
+ - **Versions**: `version 2.1.0`, `v3.5`
37
+ - **Response Time**: `response time 150ms`
38
+ - **Rate Limits**: `1000 requests per minute`
39
+ - **Authentication**: `OAuth 2.0`, `JWT tokens`
40
+ - **Status Codes**: `HTTP 200`, `status code 404`
41
+
42
+ #### 🏥 Medical Category
43
+ - **Dosage**: `50mg daily`, `100ml twice daily`
44
+ - **Duration**: `treatment for 12 weeks`
45
+ - **Efficacy**: `85% efficacy rate`
46
+ - **Side Effects**: `side effects in 12% of patients`
47
+ - **Patient Count**: `500 patients enrolled`
48
+ - **P-Values**: `p<0.001`, `p=0.025`
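The pattern lists above can be exercised with a short sketch. It reuses two of the Finance regular expressions from the `app.py` hunk in this commit; the helper name `extract_finance` and the sample sentence are assumptions made for illustration, not part of the demo.

```python
import re

# Two Finance patterns copied from the app.py hunk in this commit.
FINANCE_PATTERNS = {
    "revenue": r'\$?[\d,]+\.?\d*\s*(?:million|billion|thousand|M|B|K)?\s*(?:revenue|sales|income)',
    "growth": r'(?:growth|increase|decrease).*?[\d,]+\.?\d*%',
}


def extract_finance(text: str) -> dict:
    """Hypothetical helper: return all matches per pattern, skipping empty keys."""
    found = {}
    for key, pattern in FINANCE_PATTERNS.items():
        matches = re.findall(pattern, text, re.IGNORECASE)
        if matches:
            found[key] = [m.strip() for m in matches]
    return found


sample = "The company reported $150 million revenue. Growth of 15% beat guidance."
print(extract_finance(sample))
```

As written, the `growth` pattern expects the keyword before the figure, so `Growth of 15%` matches while the bare phrase `15% growth` does not. Checking documented examples against the patterns in this way is a cheap sanity test.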
49
+
50
+ ## 🔑 **Custom Keys Feature**
51
+
52
+ ### What It Does
53
+ Users can specify their own extraction terms as comma-separated values:
54
+
55
+ **Example Inputs:**
56
+ ```
57
+ CEO, budget, deadline, timeline
58
+ risk assessment, compliance, audit
59
+ performance, scalability, security
60
+ treatment, dosage, clinical trial
61
+ ```
62
+
63
+ ### How It Works
64
+ - **Smart Extraction**: Finds sentences containing the custom terms
65
+ - **Context Preservation**: Returns full sentences, not just keywords
66
+ - **Confidence Scoring**: Shows extraction confidence levels
67
+ - **JSON Output**: Structured data for easy integration
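A minimal sketch of the sentence lookup described above, assuming the same `[^.]*key[^.]*` pattern that the `app.py` hunk builds for each custom key. The function name `extract_custom_sentences` and the sample text are hypothetical.

```python
import re


def extract_custom_sentences(text: str, custom_keys: list[str]) -> dict[str, list[str]]:
    """Hypothetical helper: map each custom key to the sentences that mention it."""
    results = {}
    for key in custom_keys:
        key = key.strip()
        if not key:
            continue  # ignore empty entries such as a trailing comma
        # Grab the run of non-period characters on either side of the key.
        pattern = f"[^.]*{re.escape(key)}[^.]*"
        matches = [m.strip() for m in re.findall(pattern, text, re.IGNORECASE)]
        if matches:
            results[key] = matches
    return results


doc = "The CEO announced a new budget. Hiring continues. The deadline is March."
print(extract_custom_sentences(doc, ["CEO", "deadline", " "]))
```

Because the pattern treats every period as a sentence boundary, a decimal such as `$2.5B` would cut a sentence short. A production version would probably want a proper sentence splitter.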
68
+
69
+ ## 🎯 **New Strategy: Category-Specific Extraction**
70
+
71
+ ### What's New
72
+ Added a specialized strategy that combines:
73
+ 1. **Category-specific patterns** for targeted extraction
74
+ 2. **Custom key extraction** for user-defined terms
75
+ 3. **Structured output** with confidence scores
76
+ 4. **Domain expertise** for each business category
77
+
78
+ ### Example Output
79
+ ```json
80
+ {
81
+ "category": "Finance",
82
+ "extracted_data": {
83
+ "revenue": ["$150 million", "$2.5 billion sales"],
84
+ "growth": ["15% increase", "20% growth rate"],
85
+ "date": ["Q3 2024", "fiscal year 2023"]
86
+ },
87
+ "custom_extractions": {
88
+ "CEO": ["CEO announced plans to expand", "CEO John Smith reported"],
89
+ "investment": ["$50M investment in AI", "investment in new markets"]
90
+ },
91
+ "confidence_scores": {
92
+ "revenue": 8.5,
93
+ "custom_CEO": 6.2
94
+ }
95
+ }
96
+ ```
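For orientation, the `confidence_scores` values can be reproduced with the ratio that the `app.py` hunk stores: match count divided by word count, times 100. The sketch below is illustrative only; the function name `confidence_score` and the zero-word guard are assumptions.

```python
def confidence_score(match_count: int, text: str) -> float:
    """Matches per 100 whitespace-separated words (the ratio app.py stores)."""
    word_count = len(text.split())
    if word_count == 0:
        # Guard added for this sketch; process_document rejects empty text earlier.
        return 0.0
    return match_count / word_count * 100


print(confidence_score(1, "revenue rose fifteen percent"))  # 25.0
```

Read this way, the score is a match density rather than a calibrated probability, so it is best compared only across fields extracted from the same document.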
97
+
98
+ ## 🎨 **Enhanced UI Elements**
99
+
100
+ ### New Input Controls
101
+ - **📂 Category Dropdown**: Manual category selection
102
+ - **🔑 Custom Keys Input**: Text field for custom extraction terms
103
+ - **📊 Enhanced Strategy Selection**: Added "Category-Specific Extraction"
104
+
105
+ ### New Output Tabs
106
+ - **🎯 Category Analysis**: Dedicated tab for category-specific results
107
+ - **Enhanced JSON**: Structured category extraction data
108
+ - **Confidence Scores**: Shows extraction reliability
109
+
110
+ ### Improved User Experience
111
+ - **Dynamic Help Text**: Context-aware guidance
112
+ - **Example Suggestions**: Sample custom keys for each category
113
+ - **Better Visual Organization**: Clearer result presentation
114
+
115
+ ## 🚀 **Usage Examples**
116
+
117
+ ### Finance Document Analysis
118
+ ```
119
+ Document Category: Finance
120
+ Custom Keys: CEO, quarterly results, investment
121
+ Strategy: Category-Specific Extraction
122
+ ```
123
+
124
+ **Result**: Extracts revenue figures, profit margins, growth rates PLUS CEO mentions, quarterly data, and investment information.
125
+
126
+ ### Legal Contract Review
127
+ ```
128
+ Document Category: Legal
129
+ Custom Keys: liability, termination, governing law
130
+ Strategy: Category-Specific Extraction
131
+ ```
132
+
133
+ **Result**: Finds contract parties, terms, dates PLUS specific liability clauses, termination conditions, and jurisdiction details.
134
+
135
+ ### Technical Documentation
136
+ ```
137
+ Document Category: Technical
138
+ Custom Keys: security, performance, scalability
139
+ Strategy: Category-Specific Extraction
140
+ ```
141
+
142
+ **Result**: Extracts API endpoints, versions, rate limits PLUS security features, performance metrics, and scalability considerations.
143
+
144
+ ## 🎯 **Why This Makes Active Reading Better**
145
+
146
+ ### 1. **Adaptive Intelligence**
147
+ - AI now adapts not just to document type, but to user-specific needs
148
+ - Combines automated domain detection with custom requirements
149
+
150
+ ### 2. **Enterprise Flexibility**
151
+ - Users can extract exactly what they need for their business case
152
+ - Supports diverse enterprise document analysis workflows
153
+
154
+ ### 3. **Structured Output**
155
+ - Category-specific patterns ensure consistent extraction
156
+ - Custom keys add user-defined flexibility
157
+ - JSON format enables easy integration
158
+
159
+ ### 4. **Demonstrable Value**
160
+ - Shows how Active Reading adapts to different business domains
161
+ - Proves the framework can handle real enterprise requirements
162
+ - Highlights the superiority over one-size-fits-all approaches
163
+
164
+ ## 🎨 **Implementation Impact**
165
+
166
+ ### What Changed in Code
167
+ - **Added**: `extract_category_specific_info()` method
168
+ - **Enhanced**: `process_document()` function with category/custom key parameters
169
+ - **New**: Category-specific regex patterns for each domain
170
+ - **Improved**: UI with additional input controls and output tabs
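The two new `process_document()` parameters are normalized before extraction runs. The sketch below isolates that logic as it appears in the `app.py` hunk. The helper names `resolve_category` and `parse_custom_keys` are hypothetical, since the commit performs both steps inline.

```python
from typing import List, Optional


def resolve_category(category: Optional[str], detected_domain: str) -> str:
    """Prefer a manual category unless it is missing or set to Auto-Detect."""
    if category and category != "Auto-Detect":
        return category
    return detected_domain


def parse_custom_keys(raw: str) -> List[str]:
    """Split a comma-separated string and drop blank entries."""
    return [key.strip() for key in raw.split(",") if key.strip()] if raw else []


print(resolve_category("Auto-Detect", "Finance"))
print(parse_custom_keys("CEO, budget, , deadline "))
```

Keeping `Auto-Detect` as the default is what lets the existing strategies behave as before when neither new input is filled in.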
171
+
172
+ ### Backward Compatibility
173
+ - ✅ All existing strategies still work
174
+ - ✅ Auto-detection remains the default
175
+ - ✅ Original demo functionality preserved
176
+ - ✅ Enhanced with new capabilities
177
+
178
+ This makes your Active Reading demo much more interactive and showcases the adaptive intelligence that makes it superior to traditional document processing approaches! 🚀
README.md CHANGED
@@ -4,7 +4,7 @@ emoji: 🧠
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: gradio
7
- sdk_version: 5.44.0
8
  app_file: app.py
9
  pinned: false
10
  license: mit
@@ -63,7 +63,7 @@ This is a simplified demo. The complete Enterprise Active Reading Framework incl
63
  - **Scalable Deployment**: Docker, Kubernetes, monitoring
64
  - **Advanced Evaluation**: Custom benchmarks and performance metrics
65
 
66
-
67
 
68
  ## Citation
69
 
@@ -71,4 +71,5 @@ Based on the research paper:
71
  ```
72
  Lin, J., Berges, V.P., Chen, X., Yih, W.T., Ghosh, G., & Oğuz, B. (2024).
73
  Learning Facts at Scale with Active Reading. arXiv:2508.09494.
74
- ```
 
 
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: gradio
7
+ sdk_version: 4.0.0
8
  app_file: app.py
9
  pinned: false
10
  license: mit
 
63
  - **Scalable Deployment**: Docker, Kubernetes, monitoring
64
  - **Advanced Evaluation**: Custom benchmarks and performance metrics
65
 
66
+ For the full implementation, visit: [GitHub Repository](https://github.com/your-repo/active-reader)
67
 
68
  ## Citation
69
 
 
71
  ```
72
  Lin, J., Berges, V.P., Chen, X., Yih, W.T., Ghosh, G., & Oğuz, B. (2024).
73
  Learning Facts at Scale with Active Reading. arXiv:2508.09494.
74
+ ```
75
+
SPACE_BLOG.md CHANGED
@@ -157,3 +157,4 @@ Contribute new reading strategies and domain adaptations!
157
  *Built on cutting-edge research, optimized for real-world enterprise use.*
158
 
159
  **Tags:** `#ActiveReading` `#AI` `#NLP` `#DocumentAnalysis` `#MachineLearning` `#Enterprise`
160
+
app.py CHANGED
@@ -122,6 +122,92 @@ class SimpleActiveReader:
122
  return "Medical"
123
  else:
124
  return "General"
125
 
126
  # Initialize the model
127
  try:
@@ -130,22 +216,31 @@ except Exception as e:
130
  logger.error(f"Failed to initialize model: {e}")
131
  active_reader = None
132
 
133
- def process_document(text: str, strategy: str) -> tuple:
134
  """
135
- Process document with selected strategy
136
 
137
- Returns: (result_text, facts_json, questions_json, summary_text, domain)
138
  """
139
  if not active_reader:
140
- return "Error: Model not loaded", "", "", "", ""
141
 
142
  if not text.strip():
143
- return "Please enter some text to analyze.", "", "", "", ""
144
 
145
  try:
146
  # Detect domain
147
  domain = active_reader.detect_domain(text)
148
 
 
 
 
 
 
 
 
 
 
149
  # Apply selected strategy
150
  if strategy == "Fact Extraction":
151
  facts = active_reader.extract_facts(text)
@@ -173,7 +268,7 @@ def process_document(text: str, strategy: str) -> tuple:
173
  questions = active_reader.generate_questions(text)
174
  summary = active_reader.generate_summary(text)
175
 
176
- result = f"""**Domain:** {domain}
177
 
178
  **Summary:**
179
  {summary}
@@ -187,12 +282,45 @@ def process_document(text: str, strategy: str) -> tuple:
187
  facts_json = json.dumps(facts, indent=2)
188
  questions_json = json.dumps(questions, indent=2)
189
  summary_text = summary
190
 
191
- return result, facts_json, questions_json, summary_text, domain
192
 
193
  except Exception as e:
194
  logger.error(f"Processing error: {e}")
195
- return f"Error processing document: {str(e)}", "", "", "", ""
196
 
197
  def create_demo():
198
  """Create the Gradio demo interface"""
@@ -255,11 +383,25 @@ def create_demo():
255
 
256
  # Strategy selection
257
  strategy_selector = gr.Radio(
258
- choices=["Fact Extraction", "Question Generation", "Summarization", "Complete Analysis"],
259
  value="Complete Analysis",
260
  label="Active Reading Strategy"
261
  )
262
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
263
  # Process button
264
  process_btn = gr.Button("🚀 Apply Active Reading", variant="primary", size="lg")
265
 
@@ -282,6 +424,9 @@ def create_demo():
282
 
283
  with gr.Tab("📝 Summary"):
284
  summary_output = gr.Textbox(lines=5, label="Document Summary")
 
 
 
285
 
286
  # Event handlers
287
  def load_sample_text(sample_choice):
@@ -297,8 +442,8 @@ def create_demo():
297
 
298
  process_btn.click(
299
  fn=process_document,
300
- inputs=[text_input, strategy_selector],
301
- outputs=[results_output, facts_output, questions_output, summary_output, domain_output]
302
  )
303
 
304
  # How it works and blog section
@@ -331,6 +476,35 @@ def create_demo():
331
  - 🔧 **Technology**: API docs, technical specifications, system manuals
332
  - 🏥 **Healthcare**: Clinical trials, research papers, treatment protocols
333
  - 🏢 **General Business**: Proposals, memos, strategic documents
334
  """)
335
 
336
  with gr.Tab("📖 About the Research"):
@@ -373,10 +547,11 @@ def create_demo():
373
 
374
  **🎮 5-Minute Demo:**
375
  1. Select **"Financial Report"** from sample documents
376
- 2. Choose **"Complete Analysis"** strategy
377
- 3. Click **"🚀 Apply Active Reading"**
378
- 4. Explore the extracted facts, questions, and domain detection
379
- 5. Try different strategies to see how AI adapts!
380

381
  **🔍 Advanced Exploration:**
382
  1. **Upload your own document** (paste text up to 2000 words)
@@ -386,12 +561,12 @@ def create_demo():
386
 
387
  ### Sample Documents Available
388
 
389
- | Document Type | What You'll Learn |
390
- |---------------|-------------------|
391
- | 📊 **Financial Report** | How AI extracts metrics, growth data, and financial insights |
392
- | ⚖️ **Legal Contract** | How AI identifies key terms, obligations, and risk factors |
393
- | 🔧 **Technical Manual** | How AI processes specifications, procedures, and system details |
394
- | 🏥 **Medical Research** | How AI handles clinical data, statistics, and medical terminology |
395
 
396
  ### Next Steps
397
 
 
122
  return "Medical"
123
  else:
124
  return "General"
125
+
126
+ def extract_category_specific_info(self, text: str, category: str, custom_keys: List[str]) -> Dict[str, Any]:
127
+ """Extract information based on selected category and custom keys"""
128
+ results = {
129
+ "category": category,
130
+ "extracted_data": {},
131
+ "custom_extractions": {},
132
+ "confidence_scores": {}
133
+ }
134
+
135
+ # Category-specific extraction patterns
136
+ category_patterns = {
137
+ "Finance": {
138
+ "revenue": r'\$?[\d,]+\.?\d*\s*(?:million|billion|thousand|M|B|K)?\s*(?:revenue|sales|income)',
139
+ "profit": r'profit.*?\$?[\d,]+\.?\d*|margin.*?[\d,]+\.?\d*%',
140
+ "growth": r'(?:growth|increase|decrease).*?[\d,]+\.?\d*%',
141
+ "date": r'\b(?:Q[1-4]|quarter|fiscal|FY)\s*\d{4}|\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}',
142
+ "employees": r'(?:employees|staff|workforce).*?[\d,]+',
143
+ "market_cap": r'market\s*cap.*?\$?[\d,]+\.?\d*\s*(?:million|billion|M|B)'
144
+ },
145
+ "Legal": {
146
+ "parties": r'between\s+([^,]+)\s+and\s+([^,]+)|party.*?([A-Z][a-z]+\s+[A-Z][a-z]+)',
147
+ "term": r'term.*?(\d+)\s*(?:years?|months?|days?)',
148
+ "liability": r'liability.*?\$?[\d,]+\.?\d*',
149
+ "termination": r'terminat.*?(\d+)\s*days?\s*notice',
150
+ "governing_law": r'governed?\s*by.*?laws?\s*of\s*([^,.]+)',
151
+ "effective_date": r'effective.*?(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})'
152
+ },
153
+ "Technical": {
154
+ "api_endpoint": r'(?:GET|POST|PUT|DELETE)\s+[/\w-]+|endpoint.*?[/\w-]+',
155
+ "version": r'version\s*[\d.]+|v[\d.]+',
156
+ "response_time": r'response.*?(\d+).*?(?:ms|milliseconds|seconds)',
157
+ "rate_limit": r'rate.*?limit.*?(\d+).*?(?:per|/)\s*(?:minute|hour|second)',
158
+ "authentication": r'auth.*?(OAuth|JWT|API\s*key|token)',
159
+ "status_code": r'status.*?(\d{3})|HTTP.*?(\d{3})'
160
+ },
161
+ "Medical": {
162
+ "dosage": r'(\d+)\s*(?:mg|ml|units?)\s*(?:daily|twice|once)',
163
+ "duration": r'(?:for|duration).*?(\d+)\s*(?:days?|weeks?|months?)',
164
+ "efficacy": r'efficacy.*?(\d+)%|success.*?(\d+)%',
165
+ "side_effects": r'side\s*effects?.*?(\d+)%',
166
+ "patient_count": r'(?:patients?|subjects?).*?(\d+)',
167
+ "p_value": r'p[<>=]\s*([\d.]+)'
168
+ },
169
+ "General": {
170
+ "numbers": r'\b\d+(?:,\d{3})*(?:\.\d+)?\b',
171
+ "dates": r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b\d{4}\b',
172
+ "percentages": r'\d+(?:\.\d+)?%',
173
+ "names": r'\b[A-Z][a-z]+\s+[A-Z][a-z]+\b',
174
+ "organizations": r'\b[A-Z][a-zA-Z\s&]+(?:Inc|LLC|Corp|Company|Ltd)\b'
175
+ }
176
+ }
177
+
178
+ # Extract category-specific information
179
+ patterns = category_patterns.get(category, category_patterns["General"])
180
+
181
+ for key, pattern in patterns.items():
182
+ matches = re.findall(pattern, text, re.IGNORECASE)
183
+ if matches:
184
+ # Clean up matches
185
+ cleaned_matches = []
186
+ for match in matches:
187
+ if isinstance(match, tuple):
188
+ # Handle tuple results from groups
189
+ match = ' '.join([m for m in match if m])
190
+ cleaned_matches.append(str(match).strip())
191
+
192
+ results["extracted_data"][key] = cleaned_matches
193
+ results["confidence_scores"][key] = len(cleaned_matches) / len(text.split()) * 100
194
+
195
+ # Extract custom keys if provided
196
+ if custom_keys:
197
+ for custom_key in custom_keys:
198
+ custom_key = custom_key.strip()
199
+ if not custom_key:
200
+ continue
201
+
202
+ # Create a pattern to find sentences containing the custom key
203
+ pattern = f'[^.]*{re.escape(custom_key)}[^.]*'
204
+ matches = re.findall(pattern, text, re.IGNORECASE)
205
+
206
+ if matches:
207
+ results["custom_extractions"][custom_key] = [match.strip() for match in matches]
208
+ results["confidence_scores"][f"custom_{custom_key}"] = len(matches) / len(text.split()) * 100
209
+
210
+ return results
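The tuple branch in the loop above matters because `re.findall` returns tuples when a pattern has more than one capture group, and returns only the captured group when it has exactly one. The sketch below illustrates both cases with two Legal patterns from this hunk; the helper name `flatten_matches` and the sample sentences are assumptions.

```python
import re


def flatten_matches(pattern: str, text: str) -> list[str]:
    """Hypothetical helper mirroring the clean-up loop in the method above."""
    cleaned = []
    for match in re.findall(pattern, text, re.IGNORECASE):
        if isinstance(match, tuple):
            # Several capture groups: keep the non-empty ones, joined by spaces.
            match = " ".join(part for part in match if part)
        cleaned.append(str(match).strip())
    return cleaned


parties = r'between\s+([^,]+)\s+and\s+([^,]+)|party.*?([A-Z][a-z]+\s+[A-Z][a-z]+)'
term = r'term.*?(\d+)\s*(?:years?|months?|days?)'

print(flatten_matches(parties, "This agreement is between Acme Corp and Beta LLC, effective today."))
print(flatten_matches(term, "The initial term of 36 months begins at signing."))
```

One consequence worth knowing: for single-group patterns such as `term`, only the captured number comes back, so the unit (`months`) is not part of the extracted value.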
211
 
212
  # Initialize the model
213
  try:
 
216
  logger.error(f"Failed to initialize model: {e}")
217
  active_reader = None
218
 
219
+ def process_document(text: str, strategy: str, category: str = None, custom_keys: str = "") -> tuple:
220
  """
221
+ Process document with selected strategy, category, and custom keys
222
 
223
+ Returns: (result_text, facts_json, questions_json, summary_text, domain, category_data)
224
  """
225
  if not active_reader:
226
+ return "Error: Model not loaded", "", "", "", "", ""
227
 
228
  if not text.strip():
229
+ return "Please enter some text to analyze.", "", "", "", "", ""
230
 
231
  try:
232
  # Detect domain
233
  domain = active_reader.detect_domain(text)
234
 
235
+ # Use manual category if provided, otherwise use detected domain
236
+ selected_category = category if category and category != "Auto-Detect" else domain
237
+
238
+ # Parse custom keys
239
+ custom_keys_list = [key.strip() for key in custom_keys.split(",") if key.strip()] if custom_keys else []
240
+
241
+ # Extract category-specific information
242
+ category_data = active_reader.extract_category_specific_info(text, selected_category, custom_keys_list)
243
+
244
  # Apply selected strategy
245
  if strategy == "Fact Extraction":
246
  facts = active_reader.extract_facts(text)
 
268
  questions = active_reader.generate_questions(text)
269
  summary = active_reader.generate_summary(text)
270
 
271
+ result = f"""**Domain:** {domain} | **Category:** {selected_category}
272
 
273
  **Summary:**
274
  {summary}
 
282
  facts_json = json.dumps(facts, indent=2)
283
  questions_json = json.dumps(questions, indent=2)
284
  summary_text = summary
285
+
286
+ elif strategy == "Category-Specific Extraction":
287
+ # New strategy for category-specific extraction
288
+ extracted_data = category_data["extracted_data"]
289
+ custom_extractions = category_data["custom_extractions"]
290
+
291
+ result = f"""**Category:** {selected_category}
292
+
293
+ **Category-Specific Extractions:**
294
+ """
295
+
296
+ for key, values in extracted_data.items():
297
+ if values:
298
+ result += f"\n**{key.replace('_', ' ').title()}:**\n"
299
+ for value in values[:3]: # Show first 3 matches
300
+ result += f"• {value}\n"
301
+ if len(values) > 3:
302
+ result += f"• ... and {len(values) - 3} more\n"
303
+
304
+ if custom_extractions:
305
+ result += "\n**Custom Key Extractions:**\n"
306
+ for key, values in custom_extractions.items():
307
+ result += f"\n**{key}:**\n"
308
+ for value in values[:2]: # Show first 2 matches
309
+ result += f"• {value}\n"
310
+ if len(values) > 2:
311
+ result += f"• ... and {len(values) - 2} more\n"
312
+
313
+ facts_json = json.dumps(extracted_data, indent=2)
314
+ questions_json = json.dumps(custom_extractions, indent=2)
315
+ summary_text = f"Extracted {len(extracted_data)} category-specific fields and {len(custom_extractions)} custom fields"
316
+
317
+ category_json = json.dumps(category_data, indent=2)
318
 
319
+ return result, facts_json, questions_json, summary_text, domain, category_json
320
 
321
  except Exception as e:
322
  logger.error(f"Processing error: {e}")
323
+ return f"Error processing document: {str(e)}", "", "", "", "", ""
324
 
325
  def create_demo():
326
  """Create the Gradio demo interface"""
 
383
 
384
  # Strategy selection
385
  strategy_selector = gr.Radio(
386
+ choices=["Fact Extraction", "Question Generation", "Summarization", "Complete Analysis", "Category-Specific Extraction"],
387
  value="Complete Analysis",
388
  label="Active Reading Strategy"
389
  )
390
 
391
+ # Category selection
392
+ category_selector = gr.Dropdown(
393
+ choices=["Auto-Detect", "Finance", "Legal", "Technical", "Medical", "General"],
394
+ value="Auto-Detect",
395
+ label="📂 Document Category (overrides auto-detection)"
396
+ )
397
+
398
+ # Custom keys input
399
+ custom_keys_input = gr.Textbox(
400
+ placeholder="e.g., budget, deadline, CEO, risk assessment (comma-separated)",
401
+ label="🔑 Custom Extraction Keys",
402
+ info="Enter specific terms you want to extract information about"
403
+ )
404
+
405
  # Process button
406
  process_btn = gr.Button("🚀 Apply Active Reading", variant="primary", size="lg")
407
 
 
424
 
425
  with gr.Tab("📝 Summary"):
426
  summary_output = gr.Textbox(lines=5, label="Document Summary")
427
+
428
+ with gr.Tab("🎯 Category Analysis"):
429
+ category_output = gr.Code(language="json", label="Category-Specific Extractions")
430
 
431
  # Event handlers
432
  def load_sample_text(sample_choice):
 
442
 
443
  process_btn.click(
444
  fn=process_document,
445
+ inputs=[text_input, strategy_selector, category_selector, custom_keys_input],
446
+ outputs=[results_output, facts_output, questions_output, summary_output, domain_output, category_output]
447
  )
448
 
449
  # How it works and blog section
 
476
  - 🔧 **Technology**: API docs, technical specifications, system manuals
477
  - 🏥 **Healthcare**: Clinical trials, research papers, treatment protocols
478
  - 🏢 **General Business**: Proposals, memos, strategic documents
479
+
480
+ ### 🎯 Category-Specific Extraction
481
+
482
+ **Finance Category extracts:**
483
+ - Revenue, profit margins, growth rates
484
+ - Financial dates (Q1 2024, fiscal year)
485
+ - Employee counts, market cap
486
+
487
+ **Legal Category extracts:**
488
+ - Contract parties, terms, liability amounts
489
+ - Termination clauses, governing law
490
+ - Effective dates and obligations
491
+
492
+ **Technical Category extracts:**
493
+ - API endpoints, version numbers
494
+ - Response times, rate limits
495
+ - Authentication methods, status codes
496
+
497
+ **Medical Category extracts:**
498
+ - Dosages, treatment duration
499
+ - Efficacy rates, side effects
500
+ - Patient counts, statistical significance
501
+
502
+ ### 🔑 Custom Keys Feature
503
+
504
+ Add your own extraction terms like:
505
+ - `budget, timeline, deliverables` for project docs
506
+ - `CEO, board, shareholders` for corporate docs
507
+ - `security, compliance, audit` for IT policies
508
  """)
509
 
510
  with gr.Tab("📖 About the Research"):
 
547
 
548
  **🎮 5-Minute Demo:**
549
  1. Select **"Financial Report"** from sample documents
550
+ 2. Choose **"Category-Specific Extraction"** strategy
551
+ 3. Set category to **"Finance"** (or leave as Auto-Detect)
552
+ 4. Add custom keys: **"CEO, growth, investment"**
553
+ 5. Click **"🚀 Apply Active Reading"**
554
+ 6. Check the **"🎯 Category Analysis"** tab to see targeted extraction!
555

556
  **🔍 Advanced Exploration:**
557
  1. **Upload your own document** (paste text up to 2000 words)
 
561
 
562
  ### Sample Documents Available
563
 
564
+ | Document Type | Category | Example Custom Keys | What You'll Learn |
565
+ |---------------|----------|-------------------|-------------------|
566
+ | 📊 **Financial Report** | Finance | `CEO, growth, investment, Q3` | Revenue extraction, profit analysis, growth metrics |
567
+ | ⚖️ **Legal Contract** | Legal | `termination, liability, governing law` | Contract terms, obligations, risk factors |
568
+ | 🔧 **Technical Manual** | Technical | `endpoint, authentication, rate limit` | API specs, system requirements, procedures |
569
+ | 🏥 **Medical Research** | Medical | `efficacy, patients, side effects` | Clinical data, statistical analysis, treatment outcomes |
570
 
571
  ### Next Steps
572
 
requirements.txt CHANGED
@@ -1,5 +1,6 @@
1
  # Minimal requirements for Hugging Face Spaces demo
2
  torch>=2.0.0
3
  transformers>=4.30.0
4
- gradio
5
  numpy>=1.24.0
 
 
1
  # Minimal requirements for Hugging Face Spaces demo
2
  torch>=2.0.0
3
  transformers>=4.30.0
4
+ gradio>=4.0.0
5
  numpy>=1.24.0
6
+