YaCy 'search-newsgroup-site': Crawl Start

Advanced Crawler

Crawler/Spider

Network Harvesting

Click the API button to see documentation of the POST request parameters for crawl starts.

Expert Crawl Start

Start Crawling Job: You can define URLs as start points for web page crawling and start crawling here. "Crawling" means that YaCy will download the given website, extract all links in it and then download the content behind these links. This is repeated up to the depth specified under "Crawling Depth". A crawl can also be started using wget and the POST arguments for this web page.
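
The remark about starting a crawl with POST arguments can be illustrated with a small Python sketch that only assembles such a request and does not send it. The servlet path Crawler_p.html, the port 8090 and the parameter names crawlingMode, crawlingURL, crawlingDepth and crawlingstart are assumptions, not taken from this page; check them against the API documentation before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed peer address and servlet name; adjust to your installation.
PEER = "http://localhost:8090/Crawler_p.html"

def build_crawl_request(start_url, depth):
    """Build (but do not send) a POST request that would start a crawl."""
    # All parameter names below are assumptions, see the text above.
    fields = {
        "crawlingMode": "url",
        "crawlingURL": start_url,
        "crawlingDepth": str(depth),
        "crawlingstart": "1",
    }
    body = urlencode(fields).encode("ascii")
    return Request(PEER, data=body, method="POST")

req = build_crawl_request("https://example.org/", 2)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Sending the request would additionally need urllib.request.urlopen(req) and, most likely, administrator authentication on the peer; both are left out of this sketch.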

Crawl Job

A Crawl Job consists of one or more start points, crawl limitations and document freshness rules.

Start Point

One Start URL or a list of URLs (must start with http://, https://, ftp://, smb:// or file://): Define the start URL(s) here. You can submit more than one URL, one URL per line. Each of these URLs is the root for a crawl start; existing start URLs are always re-loaded. Other already visited URLs are sorted out as "double" unless they are allowed by the re-crawl option.

From Link-List of URL
From Sitemap
From File (enter a path within your local file system)

Index Attributes

Add Crawl result to collection (important for Index Pack generation): A crawl result can be tagged with names which are candidates for a collection request. These tags can be selected with the GSA interface using the 'site' operator. To use this option, the 'collection_sxt' field must be switched on in the Solr Schema.
Do not use an underscore '_' in the collection name, use '-' instead. When useful, add a language code to the collection name, e.g. 'top-100-en'.
Time Zone Offset: The time zone is required when the parser detects a date in the crawled web page. Content can be searched with the on: modifier, which also requires a time zone when a query is made. To normalize all given dates, the date is stored in the UTC time zone. To get the right offset from dates without time zones to UTC, this offset must be given here. The offset is given in minutes; time zone offsets for locations east of UTC must be negative; offsets for zones west of UTC must be positive.
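
The sign convention for the time zone offset is easy to get backwards, so here is a minimal sketch of it in Python. The function names are hypothetical, and the sketch assumes that a date without time zone is normalized by adding the offset in minutes, which is what the description implies.

```python
from datetime import datetime, timedelta, timezone

def form_offset_minutes(tz):
    """Offset to enter in the form: minutes, negative east of UTC."""
    utc_offset = tz.utcoffset(None)  # e.g. +02:00 for a zone east of UTC
    return -int(utc_offset.total_seconds() // 60)

def to_utc(naive_local, offset_minutes):
    """Normalize a date without time zone to UTC using the form offset."""
    shifted = naive_local + timedelta(minutes=offset_minutes)
    return shifted.replace(tzinfo=timezone.utc)

east = timezone(timedelta(hours=2))    # a zone two hours east of UTC
west = timezone(timedelta(hours=-5))   # a zone five hours west of UTC
print(form_offset_minutes(east))  # -120
print(form_offset_minutes(west))  # 300
print(to_utc(datetime(2020, 1, 1, 12, 0), form_offset_minutes(east)))
```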

Crawler Filter

These are limitations on the crawl stacker. The filters will be applied before a web page is loaded.

Indexing

This enables indexing of the web pages the crawler will download. This should be switched on by default, unless you want to crawl only to fill the Document Cache without indexing. Index text: Index media:

Do Remote Indexing

If checked, the crawler will contact other peers and use them as remote indexers for your crawl. If you need your crawling results locally, you should switch this off. Only Senior and Principal peers can initiate or receive remote crawls. A YaCy News message will be created and sent to all peers in the network, so that they can avoid starting a crawl with the same start point.

Remote crawl results won't be added to the local index as the remote crawler is disabled on this peer. You can activate it in the Remote Crawl Configuration page.
Describe your intention to start this global crawl (optional): This message will appear in the "Other Peer Crawl Start" table of other peers.

Crawling Depth

This defines how far the crawler will follow links (of links ...) embedded in websites. 0 means that only the page you enter as the start point will be added to the index. 2-4 is good for normal indexing. Values over 8 are not useful, since a depth-8 crawl would index approximately 25,600,000,000 pages, which may be the whole WWW. also all linked non-parsable documents
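
The figure of about 25,600,000,000 pages for depth 8 is what one gets when every page contributes roughly 20 new links per level, since 20^8 = 25,600,000,000. The branching factor of 20 is inferred from that number and is not stated on this page; the sketch below only shows how quickly the page count grows with depth under that assumption.

```python
# Assumed average number of new links per page (inferred, not documented).
LINKS_PER_PAGE = 20

def pages_at_depth(depth, links_per_page=LINKS_PER_PAGE):
    """Pages reached at exactly this depth from a single start URL."""
    return links_per_page ** depth

for depth in (0, 2, 4, 8):
    print(depth, pages_at_depth(depth))
```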

Unlimited crawl depth for URLs matching with

Maximum Pages per Domain

With this option you can limit the maximum number of pages that are fetched and indexed from a single domain. You can combine this limitation with the "Auto-Dom-Filter", so that the limit is applied to all domains within the given depth. Domains outside the given depth are then simply sorted out. Use: Page-Count:
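
As a rough illustration of a per-domain page limit, the following sketch keeps at most a given number of URLs per host. The function name is hypothetical and the sketch says nothing about how YaCy counts pages internally.

```python
from collections import Counter
from urllib.parse import urlsplit

def limit_per_domain(urls, max_pages):
    """Keep URLs in order, dropping those beyond max_pages for their host."""
    seen = Counter()
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < max_pages:
            seen[host] += 1
            kept.append(url)
    return kept

urls = [
    "https://a.example/1",
    "https://a.example/2",
    "https://b.example/1",
    "https://a.example/3",
]
print(limit_per_domain(urls, 2))
```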

Misc. Constraints

A question mark is usually a hint for a dynamic page. URLs pointing to dynamic content should usually not be crawled. However, there are sometimes web pages with static content that are only accessible through a URL containing a question mark. If you are unsure, do not check this option, to avoid crawl loops. Following frames is NOT done by Gxxg1e, but we do it by default to have richer content. 'nofollow' in robots metadata can be overridden; this does not affect obeying robots.txt, which is never ignored.
Accept URLs with query-part ('?'):
Obey html-robots-noindex:
Obey html-robots-nofollow:

Media Type detection

Not loading URLs with unsupported file extension is faster but less accurate. Indeed, for some web resources the actual Media Type is not consistent with the URL file extension. Here are some examples:

https://en.wikipedia.org/wiki/.de : the .de extension is unknown, but the actual Media Type of this page is text/html
https://en.wikipedia.org/wiki/Ask.com : the .com extension is not supported (executable file format), but the actual Media Type of this page is text/html
https://commons.wikimedia.org/wiki/File:YaCy_logo.png : the .png extension is a supported image format, but the actual Media Type of this page is text/html

Do not load URLs with an unsupported file extension
Always cross check file extension against Content-Type header
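
To see why an extension-only check misjudges the three example URLs, the sketch below extracts the apparent extension from the last path segment. The set of supported extensions is hypothetical and only serves the illustration.

```python
from urllib.parse import urlsplit

# Hypothetical set of supported extensions, for illustration only.
SUPPORTED_EXTENSIONS = {"html", "htm", "pdf", "png", "jpg"}

def url_extension(url):
    """Return the text after the last dot of the last path segment, or ''."""
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return ""
    return last_segment.rsplit(".", 1)[-1].lower()

def skipped_by_extension(url):
    """True if an extension-only check would refuse to load this URL."""
    ext = url_extension(url)
    return ext != "" and ext not in SUPPORTED_EXTENSIONS

for url in (
    "https://en.wikipedia.org/wiki/.de",
    "https://en.wikipedia.org/wiki/Ask.com",
    "https://commons.wikimedia.org/wiki/File:YaCy_logo.png",
):
    print(url_extension(url), skipped_by_extension(url))
```

The first two pages would be skipped although they are text/html, and the third would be treated as an image; only the Content-Type header of the actual response settles the question.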

Load Filter on URLs

The filter is a regular expression. Example: to allow only URLs that contain the word 'science', set the must-match filter to '.*science.*'. You can also use the automatic domain restriction to fully crawl a single domain. Attention: you can test the functionality of your regular expressions using the Regular Expression Tester within YaCy.

must-match
Restrict to start domain(s)
Restrict to sub-path(s)
Use filter	(must not be empty)
must-not-match
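
The '.*science.*' example suggests that a filter has to match the whole URL, which is why the word is wrapped in '.*'. The sketch below reproduces that behaviour with Python's re.fullmatch. Whole-URL matching and the function name are assumptions, and Python's regular expression syntax may differ in details from the one YaCy uses, so the built-in Regular Expression Tester remains the reference.

```python
import re

def passes_load_filter(url, must_match=".*", must_not_match=""):
    """True if the URL matches must-match and does not match must-not-match."""
    if re.fullmatch(must_match, url) is None:
        return False
    if must_not_match and re.fullmatch(must_not_match, url) is not None:
        return False
    return True

print(passes_load_filter("https://example.org/science/news", ".*science.*"))
print(passes_load_filter("https://example.org/sports", ".*science.*"))
print(passes_load_filter("https://example.org/science/a.pdf", ".*science.*", r".*\.pdf"))
```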

Load Filter on URL origin of links

The filter is a regular expression. Example: to allow loading only links from pages on the example.org domain, set the must-match filter to '.*example.org.*'. Attention: you can test the functionality of your regular expressions using the Regular Expression Tester within YaCy.

must-match	(must not be empty)
must-not-match
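
One detail of the '.*example.org.*' pattern is worth knowing: an unescaped dot matches any character, so the pattern is slightly broader than the domain name it spells out. The following sketch, using Python's re module as a stand-in for the filter engine and a made-up URL, shows the difference that escaping the dot makes.

```python
import re

loose = re.compile(r".*example.org.*")    # '.' matches any character
strict = re.compile(r".*example\.org.*")  # '\.' matches a literal dot only

lookalike = "https://examplexorg.test/page"  # made-up URL for illustration
genuine = "https://www.example.org/page"

print(bool(loose.fullmatch(lookalike)), bool(strict.fullmatch(lookalike)))
print(bool(loose.fullmatch(genuine)), bool(strict.fullmatch(genuine)))
```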

Load Filter on IPs

must-match	(must not be empty)
must-not-match

Must-Match List for Country Codes

Crawls can be restricted to specific countries. This uses the country code that is computed from the IP of the server that hosts the page. The filter is not a regular expression but a plain list of country codes, separated by commas.
No country code restriction
Use filter
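
Because the country filter is a plain comma-separated list rather than a regular expression, checking it amounts to a set lookup. The sketch below assumes that codes are compared case-insensitively and that an empty list means no restriction; both are assumptions, and the function names are hypothetical.

```python
def parse_country_codes(text):
    """Split a comma-separated list such as 'DE, fr,us' into a set of codes."""
    return {code.strip().upper() for code in text.split(",") if code.strip()}

def country_allowed(country_code, allowed):
    """True if there is no restriction or the code is on the list."""
    return not allowed or country_code.upper() in allowed

allowed = parse_country_codes("DE, fr,us")
print(sorted(allowed))
print(country_allowed("FR", allowed), country_allowed("JP", allowed))
```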

Document Filter

These are limitations on the index feeder. The filters will be applied after a web page has been loaded.

Filter on URLs

The filter is a regular expression that must not match the URL for the content of that URL to be indexed. Attention: you can test the functionality of your regular expressions using the Regular Expression Tester within YaCy.

must-match	(must not be empty)
must-not-match
No Indexing when Canonical present and Canonical != URL

Filter on Content of Document (all visible text, including camel-case-tokenized url and title)

must-match	(must not be empty)
must-not-match

Filter on Document Media Type (aka MIME type)

The filter is a regular expression that must match the document Media Type (also known as MIME Type) for the URL to be indexed. Standard Media Types are described at the IANA registry. Attention: you can test the functionality of your regular expressions using the Regular Expression Tester within YaCy.

must-match
must-not-match

Solr query filter on any active indexed field(s)

Each parsed document is checked against the given Solr query before being added to the index. The query must be written in standard Solr query syntax.

must-match

must-not-match

Content Filter

These are limitations on parts of a document. The filter will be applied after a web page has been loaded. You can choose to:

Evaluate by default

Use all words in document by default until a CSS class as listed below appears; then ignore all

Ignore by default

Ignore all words in document by default until a CSS class as listed below appears, then evaluate all

Filter div or nav class names

Comma-separated list of <div> or <nav> element class names which should be filtered out or in, according to the switch above.
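
The two modes can be pictured with a small sketch based on Python's html.parser. It assumes that a listed class switches the mode only for the text inside that <div> or <nav> element; whether YaCy scopes the switch this way is not stated here, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class ClassFilter(HTMLParser):
    """Collect words inside or outside <div>/<nav> elements with listed classes."""

    def __init__(self, class_names, evaluate_by_default=True):
        super().__init__()
        self.class_names = set(class_names)
        self.evaluate_by_default = evaluate_by_default
        self.depth = 0   # nesting depth inside a listed element, 0 = outside
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("div", "nav"):
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.depth > 0 or classes & self.class_names:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("div", "nav") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        inside = self.depth > 0
        # Evaluate by default: keep text outside listed elements.
        # Ignore by default: keep text inside listed elements.
        if inside != self.evaluate_by_default:
            self.words.extend(data.split())

def filtered_words(html, class_names, evaluate_by_default=True):
    parser = ClassFilter(class_names, evaluate_by_default)
    parser.feed(html)
    parser.close()
    return parser.words

page = '<div class="menu">Home About</div><p>Real content</p>'
print(filtered_words(page, ["menu"], evaluate_by_default=True))
print(filtered_words(page, ["menu"], evaluate_by_default=False))
```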

Clean-Up before Crawl Start

Clean up search events cache: Check this option to be sure to get fresh search results including newly crawled documents. Beware that it will also interrupt any refreshing or re-sorting of search results currently requested from the browser side.
No Deletion: After a crawl has been done, documents may become stale, and eventually they may also be deleted on the target host. To remove old files from the search index it is not sufficient to just consider them for re-load; it may be necessary to delete them because they simply do not exist any more. Use this in combination with re-crawl, and choose a longer time here than for the re-crawl. Do not delete any document before the crawl is started.
Delete sub-path: For each host in the start URL list, delete all documents (in the given sub-path) from that host.
Delete only old: Treat documents that were loaded longer ago than the selected age as stale and delete them before the crawl is started.

Double-Check Rules

No Doubles: The crawler checks every URL it finds against its internal database. If the URL is found there, it is treated as a double when the "No Doubles" option is selected. A URL may be loaded again once it has reached a certain age; to use that, check the 're-load' option. Never load any page that is already known. Only the start URL may be loaded again.
Re-load: Treat documents that were loaded longer ago than the selected age as stale and load them again. If they are younger, they are ignored.
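
The re-load rule boils down to comparing a document's load time with an age threshold. A minimal sketch with a hypothetical function name, where the threshold stands for the age chosen in the form:

```python
from datetime import datetime, timedelta

def is_stale(loaded_at, now, max_age):
    """True if the document was loaded longer ago than max_age."""
    return now - loaded_at > max_age

now = datetime(2024, 1, 10, 12, 0)
max_age = timedelta(days=7)  # stands for the age selected in the form
print(is_stale(datetime(2024, 1, 1, 12, 0), now, max_age))  # loaded 9 days ago
print(is_stale(datetime(2024, 1, 8, 12, 0), now, max_age))  # loaded 2 days ago
```

A stale document would be loaded again; a younger one would be ignored as a double.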

Document Cache

Store to Web Cache: This option is used by default for the proxy, but is not needed for crawling.
Policy for usage of Web Cache: The caching policy states when to use the cache during crawling:
no cache: never use the cache, take all content fresh from the online source;
if fresh: use the cache if the entry exists in the cache and is fresh;
if exist: use the cache whenever possible, without checking the date. Otherwise use the online source;
cache only: never go online, take content only from the cache. If no cache entry exists, treat the content as unavailable.
no cache / if fresh / if exist / cache only
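
The four cache policies (no cache, if fresh, if exist, cache only) can be summarized as a small decision function. This is only a sketch of the rules as described; the function name and the string return values are made up for the illustration.

```python
def choose_source(policy, in_cache, is_fresh):
    """Return 'cache', 'network' or 'unavailable' for one URL."""
    if policy == "no cache":
        return "network"                      # always load from the online source
    if policy == "if fresh":
        return "cache" if in_cache and is_fresh else "network"
    if policy == "if exist":
        return "cache" if in_cache else "network"   # no freshness check
    if policy == "cache only":
        return "cache" if in_cache else "unavailable"  # never go online
    raise ValueError(f"unknown policy: {policy}")

for policy in ("no cache", "if fresh", "if exist", "cache only"):
    print(policy, choose_source(policy, in_cache=True, is_fresh=False))
```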

Robot Behaviour

Use Special User Agent and robot identification: Because YaCy can be used as a replacement for commercial search appliances (like the Google Search Appliance, aka GSA), the user must be able to crawl all web pages that are granted to such commercial platforms. Not having this option would be a strong handicap for professional usage of this software. Therefore you are able to select alternative user agents here, which have different crawl timings, identify themselves with another user agent and obey the corresponding robots rule.

Snapshot Creation

Max Depth for Snapshots: Snapshots are XML metadata and pictures of web pages that can be created during crawling time. The XML data is stored in the same way as a Solr search result with one hit, and the pictures will be stored as PDF in subdirectories of HTCACHE/snapshots/. The JPG thumbnails are computed from the PDFs. Snapshot generation can be controlled using a depth parameter; that means a snapshot is only generated if the crawl depth of a document is smaller than or equal to the number given here. If the number is set to -1, no snapshots are generated.
Multiple Snapshot Versions:
replace old snapshots with new one
add new versions for each crawl
must-not-match filter for snapshot generation
Image Creation
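
The depth rule for snapshots can be written as a one-line check. The function name is hypothetical; the rule itself is the one described under Max Depth for Snapshots.

```python
def should_snapshot(crawl_depth, max_depth):
    """True if a document at crawl_depth gets a snapshot under max_depth."""
    if max_depth == -1:  # -1 switches snapshot generation off
        return False
    return crawl_depth <= max_depth

print(should_snapshot(0, 1), should_snapshot(1, 1), should_snapshot(2, 1))
print(should_snapshot(0, -1))
```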
