Graph Page

(by Nicholas, last updated: Jun 2013)

headers

headers returned by server

userinfo

?

issue_category

?

scheme

url scheme (http/https/ftp/...)

duplication_types

?

issue_ids

issues this page has by id

description

meta-description from this page

total count of links on the page

all urls for all the links on the page

redirect_to_string

redirect location if 30x

last_crawled

last crawl date

count_of_words

word count

duplicate_page_ids

reference to duplicate pages

inbound links count, internal links count: total number of links to this page from within this domain.

host

host this page is found on

keywords

No idea; seems to be always empty

status

Active/inactive. No idea what it stands for.

s3_histories

reference to the histories stored on S3

title_simhash

similarity hash for the title

description_simhash

similarity hash for the description

code

returned status code

url

url of the page

content

No idea, is always empty

updated_at

last updated timestamp

response_time

response time of the page

no idea

images

list of images on the page

duplicate_simhash

simhash to detect duplicate pages (body)
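The simhash fields on this page (title_simhash, description_simhash, duplicate_simhash, simhash) are presumably fingerprints used for near-duplicate detection. How the crawler actually computes them is not documented here, so the following is only a minimal sketch of the general simhash technique. The function names, the whitespace tokenizer, the use of MD5 as the token hash and the 64-bit width are all assumptions made for illustration.

```python
import hashlib

BITS = 64  # assumed fingerprint width, not confirmed by this page


def simhash(text: str) -> int:
    """Return a 64-bit simhash of the whitespace-separated tokens in text."""
    weights = [0] * BITS
    for token in text.lower().split():
        # Hash each token to 64 bits (first 8 bytes of its MD5 digest).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two texts would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below some threshold; the threshold used by the crawler, if any, is not known from this page.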

redirect_to

redirect location if 30x (duplicate of redirect_to_string)
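As an illustration of how a redirect location relates to the code and headers fields, here is a minimal sketch. It assumes the redirect target comes from the standard HTTP Location header of a 3xx response; the function name redirect_target is hypothetical and this is not necessarily how the crawler fills redirect_to or redirect_to_string.

```python
from typing import Mapping, Optional


def redirect_target(code: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the Location header for a 3xx response, otherwise None."""
    if not 300 <= code < 400:
        return None
    # Header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == "location":
            return value
    return None
```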

port

port used to fetch page

opengraph_site_name

open graph site name if present

query

query of the url if present

data

??

potential_keywords

list of potential keywords

digest

hash of cleaned body
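What "cleaned" means is not documented on this page. The sketch below only illustrates the idea of hashing a normalized body versus hashing the raw body (compare body_digest). The cleaning steps (strip tags, collapse whitespace, lower-case), the choice of SHA-1 and both function names are assumptions for illustration, not the crawler's actual behaviour.

```python
import hashlib
import re


def raw_body_hash(body: str) -> str:
    """Hash of the body exactly as received."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


def cleaned_body_hash(body: str) -> str:
    """Hash of the body after an assumed cleaning step."""
    text = re.sub(r"<[^>]+>", " ", body)       # drop anything that looks like a tag
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
```

With a scheme like this, two pages that differ only in markup or whitespace share a cleaned hash while their raw hashes differ.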

kind

This is always "Page". TODO: ask William

authority

classification of authority

  • siphon

  • sinkhole

  • conversion

  • bridge

title

page title

title_duplicate_pages

title of duplicate pages

opengraph_type

opengraph type if present on page

path

url path

created_at

creation timestamp

absolute_url

absolute url

duplicate_pages

ids of duplicate pages (possibly a duplicate of duplicate_page_ids?)

fragment

fragment of url
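Several fields on this page (scheme, userinfo, host, port, path, query, fragment) look like the standard components of a URL. As a hedged illustration of how they relate to one another, the sketch below splits a URL with Python's urllib.parse. The function name url_components and the example URL are made up, and the crawler may derive these fields differently (for instance, port here is None when the URL has no explicit port, whereas the stored port is described as the port used to fetch the page).

```python
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Split a URL into the component names used on this page."""
    parts = urlsplit(url)
    netloc = parts.netloc
    # userinfo is the "user:password" part before "@", if present.
    userinfo = netloc.rsplit("@", 1)[0] if "@" in netloc else None
    return {
        "scheme": parts.scheme,
        "userinfo": userinfo,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL has no explicit port
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


example = url_components("https://user:pw@example.com:8443/a/b?x=1#top")
```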

full_text

full text of the page

canonical

No idea. TODO: ask William

tags

list of tags (duplicate of body, conversion, ...)

backlink count

list of inlinks

last_rechecked

timestamp of last recheck

domain_id

domain it belongs to

(??? always empty)

body_digest

hash of body

list of links on the page with detail information

robot_tag

robot tags attached to this page (all, index, nofollow, ...) (??? how is this calculated)

description_duplicate_pages

list of duplicate description pages

simhash

total simhash

anchor_texts

list of anchor texts on the page
