CoderSilicon/sget

sget

The CLI data tracer: a data extractor built on top of curl.

Built by CoderSilicon | License: MIT


⚡ The Core Philosophy

sget operates directly on raw HTTP streams. It treats the web as a queryable data pipeline rather than a visual canvas, making it incredibly lightweight and lightning-fast.

  • curl-Powered Core: Built straight on top of the industry standard for network requests. If curl can reach it, sget can extract from it.
  • Zero Engine Overhead: No Chromium instances, no heavy JS evaluation loops, and zero memory leaks. Just pure data parsing.
  • Unix-Pipeline Native: Seamlessly fits into your existing workflow. Pipe HTML/JSON into sget, or pipe sget's structured output straight into jq, grep, or local files.
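
The pipeline idea in the last bullet can be sketched in a few lines of standard C++. This is only an illustration of the stdin-to-stdout filter pattern, not sget's actual code: `filter_stream` is a hypothetical helper, and regex matching stands in for whatever extraction sget performs internally.

```cpp
#include <cstddef>
#include <iostream>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper (not part of sget): reads `in` line by line, writes every
// regex match to `out` on its own line, and returns the number of matches.
std::size_t filter_stream(std::istream& in, std::ostream& out, const std::regex& pattern) {
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        // Iterate over all non-overlapping matches within the current line.
        for (auto it = std::sregex_iterator(line.begin(), line.end(), pattern);
             it != std::sregex_iterator(); ++it) {
            out << it->str() << '\n';
            ++count;
        }
    }
    return count;
}
```

Called from `main` as `filter_stream(std::cin, std::cout, pattern)`, a helper like this sits between two pipes: `curl -s` output goes in on stdin, and one match per line comes out for `grep`, `sort`, or a file redirect.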

🏗️ Architecture Flow


[ Target Webpage ]
│
▼  (Optimized HTTP Fetch)
┌───────┐
│ curl  │
└───────┘
│
▼  (Raw Data Stream)
┌───────┐
│ sget  │ ──► [ Extraction Engine: CSS Selectors / XPath / Regex ]
└───────┘
│
▼  (Structured Output Flush)
[ JSON / CSV / Text ] ──► Pipe to next tool (e.g., jq, redirect to file)

  1. The Fetch: sget hands the request to curl, which performs the stable, low-level HTTP transfer.
  2. The Stream: The response payload is fed straight into sget's memory-efficient stream parser, without downloading unnecessary visual assets.
  3. The Extraction: Your declarative filters (CSS selectors, XPath expressions, or regex patterns) are applied to the parsed document structure.
  4. The Output: Structured data is flushed to stdout in your format of choice, completely ready for automated consumption.
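
Steps 2 to 4 can be sketched with the standard library alone. Everything here is an assumption made for illustration: `append_chunk`, `extract_hrefs`, and `to_json_array` are hypothetical names, the whole payload is buffered rather than parsed as a stream, and a regex over `href` attributes stands in for a real CSS/XPath engine. The fetch itself is left out; `append_chunk` only mirrors the callback signature that libcurl's `CURLOPT_WRITEFUNCTION` option expects.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Step 2 (stream): same signature as a libcurl write callback. Each chunk is
// appended to the std::string passed through `userdata`.
std::size_t append_chunk(char* ptr, std::size_t size, std::size_t nmemb, void* userdata) {
    auto* buffer = static_cast<std::string*>(userdata);
    buffer->append(ptr, size * nmemb);
    return size * nmemb;  // returning a different count signals an error to libcurl
}

// Step 3 (extraction): collect the value of every double-quoted href attribute.
std::vector<std::string> extract_hrefs(const std::string& html) {
    static const std::regex href_re(R"re(href\s*=\s*"([^"]*)")re",
                                    std::regex::ECMAScript | std::regex::icase);
    std::vector<std::string> links;
    for (auto it = std::sregex_iterator(html.begin(), html.end(), href_re);
         it != std::sregex_iterator(); ++it) {
        links.push_back((*it)[1].str());  // capture group 1 is the attribute value
    }
    return links;
}

// Step 4 (output): serialize as a JSON array of strings. Only quotes,
// backslashes, and newlines are escaped, which is a simplification.
std::string to_json_array(const std::vector<std::string>& items) {
    std::string out = "[";
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i > 0) out += ",";
        out += '"';
        for (char c : items[i]) {
            if (c == '"' || c == '\\') { out += '\\'; out += c; }
            else if (c == '\n') { out += "\\n"; }
            else { out += c; }
        }
        out += '"';
    }
    out += "]";
    return out;
}
```

Because chunks are appended before matching, an attribute split across two network chunks is still found. A true stream parser would avoid holding the full payload in memory, at the cost of tracking partial matches between chunks.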

🛠️ Stack & Mechanics

| Layer       | Engine/Protocol    | Purpose |
|-------------|--------------------|---------|
| Libraries   | curl               | Low-overhead HTTP transfer, custom headers, proxy handling, and cookie jars. |
| Language    | C++ with vcpkg     | It is fast. That's it. |
| Formatting  | Native colors (OS) | Uses the colors defined by the operating system itself, to keep latency low. |
| Environment | Docker             | Containers to run standalone inside any Linux, macOS, or Windows terminal. |

🔒 Optimization & Speed

sget is engineered from the ground up for high-performance automation:

  • Minimal Memory Footprint: Uses a fraction of the RAM required by headless browsers (Puppeteer/Playwright).
  • Parallel Scrapes: Launch multiple data tracing threads concurrently without melting your CPU.
  • Bypass Anti-Bot: Native integration for rapid User-Agent rotation, custom request delays, and upstream proxy chains.
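
A minimal sketch of how User-Agent rotation and request delays could be modeled, assuming simple round-robin selection and a fixed delay between requests. `UserAgentRotator`, `user_agent_header`, and `start_offset` are hypothetical names; the README does not say how sget implements these features, and this sketch is not thread-safe, so parallel workers would each need their own rotator or a lock.

```cpp
#include <chrono>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: hands out User-Agent strings in round-robin order.
class UserAgentRotator {
public:
    explicit UserAgentRotator(std::vector<std::string> agents)
        : agents_(std::move(agents)) {
        if (agents_.empty()) {
            throw std::invalid_argument("at least one User-Agent is required");
        }
    }

    // Returns the current agent, then advances and wraps at the end of the list.
    const std::string& next() {
        const std::string& agent = agents_[index_];
        index_ = (index_ + 1) % agents_.size();
        return agent;
    }

private:
    std::vector<std::string> agents_;
    std::size_t index_ = 0;
};

// Builds the header line for the next request.
std::string user_agent_header(UserAgentRotator& rotator) {
    return "User-Agent: " + rotator.next();
}

// Start offset of the request with the given 0-based index when requests are
// spaced by a fixed delay: request 0 starts immediately, request 1 after
// one delay, and so on.
std::chrono::milliseconds start_offset(std::size_t request_index,
                                       std::chrono::milliseconds delay) {
    return delay * static_cast<std::chrono::milliseconds::rep>(request_index);
}
```

Round-robin keeps the distribution of agents even and predictable; random selection would be an equally valid choice if predictability is not wanted.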

Built with 💻 by CoderSilicon
"It is always better to differ from others."

About

A robust CLI tool for deep web data extraction. Leveraging native libcurl for high-performance network analysis and metadata harvesting.
