The first question I expect to spring to mind after reading the title is “Why the hell should I connect my GitHub and Google Search Appliance (GSA) instances? As far as I know, GitHub has its own search engine, plus if I wanted to have GitHub contents in my GSA I would just use the built-in GSA crawler!”
Yes, you’re right, but there are several constraints to bear in mind before dismissing this option. These drawbacks are what drove us to integrate the two instances.
GitHub information in GSA?
At King, we have a GSA instance helping us to centralise all our searches because it feeds from diverse sources – content management, issue tracking, forums, etc. So, why not include GitHub as well? It’s very convenient to have a single search across as many sources as you can.
Of course we didn’t want to include all the GitHub content (for example, we excluded the source code lines) mainly because we have hundreds of repositories and adding too much content would inflate our GSA with millions of records and reduce ‘searchability.’
We decided to include the following GitHub information as ‘searchable:’
- Account Name (User or Organisation)
- Repository Name
- Repository Description
- Readme.md file contents
And these fields as ‘filters’ for the search results:
- Repository language (Java, C, Python, CSS, etc.)
- Record Type (Organisation, User)
- Number of Forks
- Number of Stargazers
Why not use the built-in GSA crawler?
The crawler built into GSA accepts a starting URL and recursively crawls every link referenced in the pages it processes. This means that crawling from the main GitHub page would gather everything, because the main page links to repositories, files, file contents, etc. That is exactly the flood of content we wanted to keep out of our GSA.
By the way, a standard GitHub instance prevents crawling by default. You must modify the robots.txt file to allow it.
Another decision – Feed Client or GSA Connector?
GSA provides two types of integration to ‘inject’ contents – Feed Client or Connector.
You can check the official page for detailed information, but here’s a brief overview:
Feed Client:
- It pushes static and/or dynamic documents to GSA and then exits
- For static documents you must provide a Doc ID, metadata (optional), and the document contents that GSA will index
- For dynamic documents you must provide a Doc ID, metadata (optional), and a URL that GSA crawls by itself. Crawling occurs according to the GSA settings, and the gathered contents are then indexed
Connector:
- It’s always running in a web server
- Once configured in GSA, the connector is the ‘intermediary’ between GSA and the original source
- GSA requests the list of Doc IDs to be indexed (the periodicity of this action can be configured)
- GSA asks the connector for the contents and metadata of each Doc ID. The periodicity is configured through the GSA Admin panel
- The connector is solely responsible for gathering the information from the source
First phase – feed client
We decided to implement the feed client as a phase one approach because it’s easier to implement and it provides the desired functionality.
Take a look at the source code here: https://github.com/king/github-gsa-feedclient
This approach has some constraints, such as the need to schedule its execution and the impact on GSA performance, because we feed all the information in one shot.
How does it work?
It uses the GitHub API to go through all accounts and repositories to gather ‘static’ searchable contents – the account description and repo description. It also collects additional information to be used as metadata, such as repo language, record type, number of forks, and number of stargazers. An XML document is populated with those ‘static documents’ and pushed to GSA.
Important – All Doc IDs are ‘fake’ URLs, just to prevent GSA from crawling them again. We send the real URL in a displayURL parameter to be displayed in the GSA search results.
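As an illustration of the static documents described above, here is a minimal sketch in Python of how such a feed XML could be assembled. It is not the code from our feed client. The repository field names follow the GitHub REST API response, while the `github-feed.example.com` host used for the ‘fake’ Doc IDs, the `github` data source name, the metadata names and the choice of an `incremental` feed type are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical host used only to build 'fake' Doc IDs that GSA will not crawl.
FAKE_BASE = "http://github-feed.example.com/"

# Metadata name (our choice) -> field in the GitHub API repository object.
META_FIELDS = {
    "language": "language",
    "forks": "forks_count",
    "stargazers": "stargazers_count",
}


def build_static_feed(repos, datasource="github"):
    """Return a GSA content feed (XML string) with one record per repository."""
    feed = ET.Element("gsafeed")
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "incremental"  # assumption
    group = ET.SubElement(feed, "group")
    for repo in repos:
        record = ET.SubElement(group, "record", {
            "url": FAKE_BASE + repo["full_name"],  # 'fake' Doc ID
            "displayurl": repo["html_url"],        # real URL shown in results
            "mimetype": "text/plain",
        })
        metadata = ET.SubElement(record, "metadata")
        ET.SubElement(metadata, "meta", {
            "name": "recordtype", "content": repo["owner"]["type"]})
        for name, key in META_FIELDS.items():
            value = repo.get(key)
            if value is not None:  # keep zero counts, skip missing values
                ET.SubElement(metadata, "meta", {
                    "name": name, "content": str(value)})
        # The indexed text: the repository description.
        ET.SubElement(record, "content").text = repo.get("description") or ""
    body = ET.tostring(feed, encoding="unicode")
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n'
            + body)
```

The resulting string is what a feed client would then post to the GSA feed endpoint. That step is left out here because it depends on your appliance’s host and settings.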
Using the same GitHub API, it gets the URL of each repository’s readme.md file in raw format. This information, together with the metadata fields, populates an XML document as ‘dynamic’ docs, which is then pushed to GSA.
Important – As these are dynamic contents, we expect GSA to crawl the Doc ID (URL). That’s why we use the raw file contents page: those pages don’t contain links to other pages, so there is no recursive crawling.
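The ‘dynamic’ case can be sketched in the same way. In the example below (again a hedged illustration, not our actual client) each record carries only the raw readme URL and its metadata, and the feed type is `metadata-and-url` so that GSA fetches the content itself. The `download_url` key matches the field returned by the GitHub API for a repository’s README. The `github-readme` data source name and the shape of the `entries` input are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def build_readme_feed(entries, datasource="github-readme"):
    """Return a metadata-and-url feed: GSA crawls each record URL itself.

    `entries` is a list of dicts with a raw readme `download_url` and a
    `metadata` dict (hypothetical input shape for this sketch).
    """
    feed = ET.Element("gsafeed")
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(feed, "group")
    for entry in entries:
        # The Doc ID is the real raw URL; there is no <content> element.
        record = ET.SubElement(group, "record", {
            "url": entry["download_url"],
            "mimetype": "text/plain",
        })
        metadata = ET.SubElement(record, "metadata")
        for name, value in entry["metadata"].items():
            ET.SubElement(metadata, "meta", {
                "name": name, "content": str(value)})
    return ET.tostring(feed, encoding="unicode")
```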
Two configuration changes are needed for this to work:
- Modify the GitHub robots.txt to allow GSA to crawl the raw readme.md file contents
- In the GSA admin panel, add the URL patterns representing the Doc IDs so that they can be indexed: the ‘fake’ URLs for the static contents and the raw readme.md URLs for the dynamic contents
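Before pushing dynamic documents it can be worth checking that the modified robots.txt really allows the raw readme URLs and nothing else. The sketch below uses Python’s standard `urllib.robotparser` for that. The rules, the `/raw/` path prefix and the `gsa-crawler` user agent are assumptions for illustration, so check the actual raw URL layout and crawler user agent of your own instances.

```python
from urllib import robotparser

# Hypothetical robots.txt rules: allow only raw file pages for the GSA crawler.
ROBOTS_TXT = """\
User-agent: gsa-crawler
Allow: /raw/
Disallow: /
"""


def crawler_may_fetch(url, robots_txt=ROBOTS_TXT, user_agent="gsa-crawler"):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Note that this parser applies the first matching rule, so the `Allow` line has to come before the catch-all `Disallow`.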
Second phase – connector
We implemented a GSA Connector to add more functionality to the existing feed client.
Benefits compared with the feed client:
- No need to schedule client execution
- Crawling configuration is managed through the GSA Admin panel, with no impact on GSA performance
- Deleted or non-existent contents are handled, so the GSA information is always up to date
- GitHub repo issues are added as ‘searchable’ contents
How does it work?
Once connected, the GSA requests the list of Doc IDs to be indexed. Our connector uses the GitHub API to provide the following list:
- Accounts (Organizations and Users)
- Repositories’ readme.md files
- Repositories’ issues
Important – The submitted Doc IDs are relative to our connector URL, so that GSA contacts the connector again to gather the documents’ contents.
Once the GSA has the list of docs, and according to the configured crawling schedule, it executes ‘GetDocContents’ in our connector for each Doc ID. By parsing the Doc ID, the connector knows which contents it must retrieve from GitHub using the API:
- Accounts. Account (Organization or User) description
- Repositories. Repository description
- Repositories README.md file. Readme.md file contents
- Repositories issues. Issue title and body
- Response also contains metadata (Record type, repo language, number of forks, etc.)
Important – In cases where the document does not exist in GitHub we can respond ‘Not Found’ and GSA will remove this Doc Id from the indexation list.
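The Doc ID handling described above can be sketched as follows. This is a simplified Python illustration and does not use the GSA connector library. The Doc ID scheme (`account/<login>`, `readme/<owner>/<repo>`, `issue/<owner>/<repo>/<number>`), the `DocRef` type and the injected `fetch` callable are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class DocRef:
    """What a Doc ID points at (hypothetical scheme for this sketch)."""
    kind: str                      # "account", "readme" or "issue"
    owner: str
    repo: Optional[str] = None
    number: Optional[int] = None


def parse_doc_id(doc_id: str) -> Optional[DocRef]:
    """Parse a Doc ID relative to the connector URL; None if unrecognised."""
    parts = doc_id.strip("/").split("/")
    kind, args = parts[0], parts[1:]
    if not all(args):
        return None
    if kind == "account" and len(args) == 1:
        return DocRef(kind, args[0])
    if kind == "readme" and len(args) == 2:
        return DocRef(kind, args[0], args[1])
    if kind == "issue" and len(args) == 3 and args[2].isdigit():
        return DocRef(kind, args[0], args[1], int(args[2]))
    return None


def get_doc_contents(
    doc_id: str, fetch: Callable[[DocRef], Optional[str]]
) -> Tuple[int, str]:
    """Return (status, body); 404 tells GSA to drop the document.

    `fetch` stands in for the GitHub API call and returns None when the
    account, readme or issue no longer exists.
    """
    ref = parse_doc_id(doc_id)
    content = fetch(ref) if ref is not None else None
    if content is None:
        return 404, ""
    return 200, content
```

Passing `fetch` in as a parameter keeps the parsing and ‘Not Found’ logic testable without calling GitHub.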
Both implementations have some limitations or features that were not implemented:
- Only public repositories are indexed
- There’s no document-level authorisation: no GSA security mechanisms such as ACLs (Access Control Lists) or SAML (Security Assertion Markup Language) are used
- GitHub Wiki contents are not indexed. There’s no GitHub API to retrieve that information, at least at this point in time
- Last modification date in submitted documents is not taken into account during the crawling process. It would help to re-index only modified files