5 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Arik Jones (aider) | 1869dae89a | docs: update configuration section in README.md | 2024-09-22 18:36:17 -05:00 |
| Arik Jones (aider) | d3ff7cb862 | docs: Update README.md CLI flag documentation | 2024-09-22 18:33:24 -05:00 |
| Arik Jones (aider) | ea410e4abb | feat: Update README.md to reflect recent changes in functionality | 2024-09-22 18:31:06 -05:00 |
| Arik Jones (aider) | 7d8e25b1ad | docs: Add CHANGELOG.md with v0.0.3 release notes | 2024-09-22 18:20:25 -05:00 |
| Arik Jones | 691832e282 | fix: Update expectation | 2024-09-22 18:18:03 -05:00 |

3 changed files with 64 additions and 18 deletions

@@ -4,16 +4,18 @@ Rollup aggregates the contents of text-based files and webpages into a markdown
 ## Features
-- File type filtering
-- Ignore patterns for excluding files
-- Support for code-generated file detection
-- Advanced web scraping functionality
-- Verbose logging option for detailed output
-- Exclusionary CSS selectors for web scraping
-- Support for multiple URLs in web scraping
+- File type filtering for targeted content aggregation
+- Ignore patterns for excluding specific files or directories
+- Support for code-generated file detection and exclusion
+- Advanced web scraping functionality with depth control
+- Verbose logging option for detailed operation insights
+- Exclusionary CSS selectors for precise web content extraction
+- Support for multiple URLs in web scraping operations
 - Configurable output format for web scraping (single file or separate files)
-- Configuration file support (YAML)
-- Generation of default configuration file
+- Flexible configuration file support (YAML)
+- Automatic generation of default configuration file
+- Custom output file naming
+- Concurrent processing for improved performance
 ## Installation
@@ -74,14 +76,27 @@ ignore:
code_generated:
- **/generated/**
scrape:
urls:
- url: https://example.com
sites:
- base_url: https://example.com
css_locator: .content
exclude_selectors:
- .ads
- .navigation
max_depth: 2
allowed_paths:
- /blog
- /docs
exclude_paths:
- /admin
output_alias: example
path_overrides:
- path: /special-page
css_locator: .special-content
exclude_selectors:
- .special-ads
output_type: single
requests_per_second: 1.0
burst_limit: 3
```
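The `allowed_paths` and `exclude_paths` keys in the sample configuration above are not explained in this diff. The Go sketch below shows one plausible reading, assuming path-prefix matching on the same host as `base_url`, with excludes taking precedence. The `siteRules` type and the `hasPathPrefix` and `shouldScrape` helpers are hypothetical names for illustration, not Rollup's actual code.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// siteRules mirrors a few keys from the sample config. The matching
// semantics below (prefix comparison, excludes win) are assumptions.
type siteRules struct {
	BaseURL      string
	AllowedPaths []string
	ExcludePaths []string
}

// hasPathPrefix reports whether p equals prefix or sits beneath it,
// so "/blog" matches "/blog/post" but not "/blogger".
func hasPathPrefix(p, prefix string) bool {
	prefix = strings.TrimSuffix(prefix, "/")
	return p == prefix || strings.HasPrefix(p, prefix+"/")
}

// shouldScrape decides whether rawURL falls inside the site's rules.
func shouldScrape(r siteRules, rawURL string) bool {
	base, err := url.Parse(r.BaseURL)
	if err != nil {
		return false
	}
	u, err := url.Parse(rawURL)
	if err != nil || u.Host != base.Host {
		return false // unparsable, or a different host than base_url
	}
	for _, ex := range r.ExcludePaths {
		if hasPathPrefix(u.Path, ex) {
			return false // excludes take precedence
		}
	}
	if len(r.AllowedPaths) == 0 {
		return true // no allow-list means everything else is allowed
	}
	for _, al := range r.AllowedPaths {
		if hasPathPrefix(u.Path, al) {
			return true
		}
	}
	return false
}

func main() {
	rules := siteRules{
		BaseURL:      "https://example.com",
		AllowedPaths: []string{"/blog", "/docs"},
		ExcludePaths: []string{"/admin"},
	}
	for _, u := range []string{
		"https://example.com/blog/post-1",
		"https://example.com/admin/users",
		"https://example.com/pricing",
	} {
		fmt.Printf("%-34s scrape=%v\n", u, shouldScrape(rules, u))
	}
}
```

Matching on whole path segments rather than raw string prefixes keeps `/blog` from accidentally admitting `/blogger`. Whether Rollup draws the same distinction is not shown in this diff.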
## Examples
@@ -92,10 +107,10 @@ scrape:
 rollup files
 ```
-2. Web scraping with multiple URLs:
+2. Web scraping with multiple URLs and increased concurrency:
 ```bash
-rollup web --urls=https://example.com,https://another-example.com
+rollup web --urls=https://example.com,https://another-example.com --concurrent=8
 ```
 3. Generate a default configuration file:
@@ -104,15 +119,25 @@ scrape:
 rollup generate
 ```
-4. Use a custom configuration file:
+4. Use a custom configuration file and specify output:
 ```bash
-rollup files --config=my-config.yml
+rollup files --config=my-config.yml --output=project_summary.md
 ```
-5. Web scraping with separate output files:
+5. Web scraping with separate output files and custom timeout:
 ```bash
-rollup web --urls=https://example.com,https://another-example.com --output=separate
+rollup web --urls=https://example.com,https://another-example.com --output=separate --timeout=60
 ```
+6. Rollup files with specific types and ignore patterns:
+```bash
+rollup files --types=.go,.md --ignore=vendor/**,*_test.go
+```
+7. Web scraping with depth and CSS selector:
+```bash
+rollup web --urls=https://example.com --depth=2 --css=.main-content
+```
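The `--depth=2` flag in example 7 and the `max_depth` config key suggest a depth-limited crawl. The Go sketch below illustrates that idea as a breadth-first walk over an in-memory link graph. The `crawl` function, its `links` callback (a stand-in for fetching a page and extracting its links), and the convention that depth 0 means the start page only are all assumptions, not Rollup's implementation.

```go
package main

import "fmt"

// crawl walks links breadth-first from start and returns the pages visited,
// going at most maxDepth hops away from start. links is a hypothetical
// stand-in for "fetch the page and extract its hrefs".
func crawl(start string, maxDepth int, links func(string) []string) []string {
	type item struct {
		url   string
		depth int
	}
	seen := map[string]bool{start: true}
	queue := []item{{start, 0}}
	var visited []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		visited = append(visited, cur.url)
		if cur.depth >= maxDepth {
			continue // do not follow links beyond the depth limit
		}
		for _, next := range links(cur.url) {
			if !seen[next] {
				seen[next] = true // mark on enqueue so each page is queued once
				queue = append(queue, item{next, cur.depth + 1})
			}
		}
	}
	return visited
}

func main() {
	// A tiny fake site: each page maps to the pages it links to.
	site := map[string][]string{
		"/":    {"/a", "/b"},
		"/a":   {"/a/1", "/"},
		"/a/1": {"/a/1/x"},
	}
	pages := crawl("/", 2, func(u string) []string { return site[u] })
	fmt.Println(pages) // prints [/ /a /b /a/1]
}
```

With a limit of 2, `/a/1/x` is never visited because it sits three hops from the start page.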
## Contributing

@@ -67,7 +67,7 @@ func TestIsIgnored(t *testing.T) {
 {"subdir/file.log", true},
 {"subdir/file.txt", false},
 {".git/config", true},
-{"src/.git/config", true},
+{"src/.git/config", false},
 {"vendor/package/file.go", true},
 {"internal/vendor/file.go", false},
 }
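The updated expectation implies that directory patterns are anchored at the repository root: `.git/config` is ignored while `src/.git/config` is not, just as `vendor/...` is ignored while `internal/vendor/...` is not. The sketch below is one hypothetical matcher that satisfies the table above. The pattern list (`*.log`, `.git/**`, `vendor/**`) and the body of `isIgnored` are assumptions, since neither appears in this diff.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
	"strings"
)

// isIgnored is an illustrative matcher, not Rollup's implementation.
// Patterns ending in "/**" are anchored at the root and match everything
// beneath that directory. Every other pattern is matched against the
// file's base name with path.Match. Patterns with a leading "**/" are
// not handled by this sketch.
func isIgnored(name string, patterns []string) bool {
	name = filepath.ToSlash(name)
	for _, p := range patterns {
		if strings.HasSuffix(p, "/**") {
			dir := strings.TrimSuffix(p, "/**")
			if strings.HasPrefix(name, dir+"/") {
				return true
			}
			continue
		}
		if ok, err := path.Match(p, path.Base(name)); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	patterns := []string{"*.log", ".git/**", "vendor/**"} // assumed patterns
	for _, name := range []string{
		"subdir/file.log",
		".git/config",
		"src/.git/config",
		"internal/vendor/file.go",
	} {
		fmt.Printf("%-26s ignored=%v\n", name, isIgnored(name, patterns))
	}
}
```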

docs/CHANGELOG.md (new file, 21 lines)


@@ -0,0 +1,21 @@
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.0.3] - 2024-09-22
### Added
- Implemented web scraping functionality using Playwright
- Added support for CSS selectors to extract specific content
- Introduced rate limiting for web requests
- Created configuration options for scraping settings
### Changed
- Improved error handling and logging throughout the application
- Enhanced URL parsing and validation
### Fixed
- Resolved issues with concurrent scraping operations
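The rate limiting mentioned under "Added" lines up with the `requests_per_second: 1.0` and `burst_limit: 3` keys in the README's sample configuration. A token bucket is one common way to get that behaviour. The sketch below is a minimal single-goroutine version using only the standard library. The `tokenBucket` type and its methods are hypothetical and do not show how Rollup actually limits requests.

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket is a minimal token bucket: it holds up to burst tokens and
// refills at rate tokens per second. It is not safe for concurrent use.
type tokenBucket struct {
	rate   float64 // tokens added per second (cf. requests_per_second)
	burst  float64 // bucket capacity (cf. burst_limit)
	tokens float64
	last   time.Time
}

// newTokenBucket returns a full bucket whose clock starts now.
func newTokenBucket(rate float64, burst int) *tokenBucket {
	return &tokenBucket{
		rate:   rate,
		burst:  float64(burst),
		tokens: float64(burst),
		last:   time.Now(),
	}
}

// allow refills the bucket for the time elapsed since the last refill and
// reports whether one request may proceed at the instant now.
func (b *tokenBucket) allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	b := newTokenBucket(1.0, 3)
	t0 := b.last
	// Five requests at the same instant: the burst of 3 passes, the rest wait.
	for i := 1; i <= 5; i++ {
		fmt.Printf("request %d at t=0s allowed=%v\n", i, b.allow(t0))
	}
	// Two seconds later the bucket has refilled by two tokens.
	fmt.Println("request at t=2s allowed =", b.allow(t0.Add(2*time.Second)))
}
```

Passing the timestamp into `allow` keeps the sketch deterministic and easy to test. For real use, the `golang.org/x/time/rate` package provides a concurrency-safe limiter with the same rate and burst parameters.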