Title: | Split Chromosome 'Fasta' File |
---|---|
Description: | Chromosome files in the 'Fasta' format usually contain large sequences, such as the human genome. Users sometimes need to split these files into separate files by chromosome number, and 'chromseq' handles this, so that the sequence of a selected chromosome can be used for downstream analysis such as motif finding. Howard Y. Chang (2019) <doi:10.1038/s41587-019-0206-z>. |
Authors: | Shaoqian Ma [aut, cre] |
Maintainer: | Shaoqian Ma <[email protected]> |
License: | Artistic-2.0 |
Version: | 0.1.3 |
Built: | 2025-02-28 05:47:21 UTC |
Source: | https://github.com/msq-123/chromseq |
This dataset is sampled from the hg19 blacklist. When splitting a chromosome Fasta file, the Fasta identifier is sometimes too complicated to manipulate. This data can be used to show how to simplify the Fasta identifier.
data(id)
A character vector with 20 elements.
Satpathy A T, Granja J M, Yost K E, et al. (2019) Nature Biotechnology 37, 925–936.
data(id)
Make a list from a large chromosome Fasta file
readToList(id = id, text = text, con = con)
id | The id list made by the subFasID function |
text | Large character vector read in by the readLines function from the Fasta file |
con | A connection object or a character string; the connection must refer to the same Fasta file as text |
Chromosome Fasta file in list format.
data("text")
id <- subFasID(text = text)
fil <- tempfile(fileext = ".data")
write(text, file = fil)
con0 <- file(fil, "r")
tex <- readToList(id, text = text, con = con0)
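To make "list format" concrete, here is a minimal base R sketch that groups Fasta lines into one list element per record. It assumes that a record is a header line plus the sequence lines that follow it. The toy lines and the names `fasta_lines` and `records` are invented for illustration, and this is not necessarily how readToList is implemented.

```r
# Toy Fasta content, one element per line (made-up records, not package data)
fasta_lines <- c(
  ">chr1:100-110",
  "ACGTACGTAC",
  ">chr10:200-210",
  "GGGTTTAAAC",
  "CC"
)

# Every header line starts a new record, so a running count of headers
# labels each line with the record it belongs to
record_index <- cumsum(startsWith(fasta_lines, ">"))

# Split the lines by record label; each list element holds one record
records <- unname(split(fasta_lines, record_index))
```

Each element of `records` then holds a header followed by its sequence lines, which is the kind of structure the sorting and splitting steps operate on.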
Simplify chromosome ids starting with ">" into a simple format like ">chr:1091194-1093520...". This is helpful for sorting the chromosomes according to their number.
replaceText(type = "text", input = input)
type | Either "text" or "list". "text" is a large character vector containing each line of the Fasta file; "list" is a list in which each element contains one unit of the Fasta file |
input | The large character vector or list containing the ids that need to be simplified |
The large character vector or list of the chromosome Fasta file, with simplified ids.
Shaoqian Ma
data("id")
simpleID <- replaceText(type = "text", input = id)
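As an illustration of what simplifying an identifier can look like, the sketch below rewrites a header with a regular expression in base R. The exact form of the complicated identifier is not shown in this documentation, so the header used here is a hypothetical one with a `range=` field, assumed only for the example. The pattern is not taken from replaceText itself.

```r
# Hypothetical complicated header (an assumption for illustration) plus a sequence line
lines_in <- c(
  ">hg19_ct_UserTrack_3545_1 range=chr1:1091194-1093520 5'pad=0 3'pad=0 strand=+ repeatMasking=none",
  "ACGT"
)

# Keep only the "chrN:start-end" part of header lines.
# Lines that do not match the pattern (the sequence lines) are returned unchanged.
lines_out <- sub("^>.*range=(chr[^ ]+).*$", ">\\1", lines_in)
```

With ids in this short form, the chromosome name sits directly after ">", which is what makes sorting by chromosome straightforward.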
Sort the chromosome list according to the chromosome number
sortList(id = id, tex = tex, chrsig = "single")
id | The identifier list of the Fasta file made by subFasID |
tex | A chromosome Fasta file in list format made by the readToList function |
chrsig | The number of characters in the chromosome name, either "single" or "double". "single" means one character follows "chr" in the Fasta identifier; "double" means two characters follow "chr". For example, "chr1", "chrX", "chrY" and "chrM" are "single"; "chr10" and "chr11" are "double". To obtain sorted lists for both groups, call the function once with "single" and once with "double" |
The sorted chromosome Fasta file in list format.
data("tex")
data("text")
text <- replaceText(type = "text", input = text)
id <- subFasID(text = text)
tex2 <- sortList(id = id, tex = tex, chrsig = "single")
tex3 <- sortList(id = id, tex = tex, chrsig = "double")
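The "single" versus "double" distinction can be illustrated with a few lines of base R. The sketch assumes simplified ids of the form ">chrN:start-end". The toy ids and variable names are invented, and the sketch shows only the classification described for chrsig, not the internals of sortList.

```r
# Toy simplified ids (made-up values)
ids <- c(">chr10:5-9", ">chr2:1-4", ">chrX:7-8", ">chr1:3-6", ">chr11:2-3")

# The chromosome label is whatever sits between "chr" and ":"
chr_label <- sub("^>chr([^:]+):.*$", "\\1", ids)

# "single": one character after "chr"; "double": two characters
is_single <- nchar(chr_label) == 1

# Single-character labels are ordered as strings (digits before letters,
# independent of locale thanks to method = "radix")
single_sorted <- ids[is_single][order(chr_label[is_single], method = "radix")]

# Two-character labels here are all numeric, so order them as integers
double_sorted <- ids[!is_single][order(as.integer(chr_label[!is_single]))]
```

Handling the two groups separately avoids the usual pitfall of string sorting, where "chr10" would otherwise come before "chr2".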
Split all chromosomes from the sorted chromosome list
splitChr(tex = tex, chr = chr, sex = FALSE, outdir = ".")
tex | The sorted chromosome list made by the sortList function |
chr | The sequence of chromosome numbers. If the chromosome list is "single" (one character follows "chr" in the Fasta identifier), the sequence must start at 1 and end at 9; if the list is "double" (two characters follow "chr"), the sequence must start at 10, but the end can vary |
sex | Whether to output the sex chromosomes, such as the X and Y chromosomes |
outdir | The output directory |
Writes the split chromosome Fasta sequences to separate txt files according to the chromosome number.
Shaoqian Ma
data(tex)
data(text)
# Simplify the Fasta id
text <- replaceText(type = "text", input = text)
# Extract the ids
id <- subFasID(text = text)
# Sort the Fasta records according to the chromosome number in id
tex2 <- sortList(id = id, tex = tex, chrsig = "single")
tex3 <- sortList(id = id, tex = tex, chrsig = "double")
outdir <- tempdir()
# Output the results
splitChr(tex = tex2, chr = seq(1, 9), sex = TRUE, outdir = outdir)
splitChr(tex = tex3, chr = seq(10, 22), sex = FALSE, outdir = outdir)
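For readers who want to see the idea of per-chromosome output in isolation, here is a small base R sketch that writes one txt file per chromosome. The records, the output folder name, and the file naming scheme (`chr1.txt`, `chr2.txt`) are all assumptions made for the example. The file names that splitChr actually produces are not documented here.

```r
# Toy records in list format: header followed by sequence lines (made-up data)
records <- list(
  c(">chr1:100-110", "ACGTACGTAC"),
  c(">chr2:50-60", "TTTTGGGGCC"),
  c(">chr1:300-310", "CCCCAAAATT")
)

# Chromosome name of each record, taken from its header line
chr_name <- vapply(
  records,
  function(rec) sub("^>(chr[^:]+):.*$", "\\1", rec[1]),
  character(1)
)

# Write into a scratch folder under the session's temporary directory
out_dir <- file.path(tempdir(), "chrom_split_sketch")
dir.create(out_dir, showWarnings = FALSE)

# One file per chromosome, containing all records for that chromosome
for (chr in unique(chr_name)) {
  lines_for_chr <- unlist(records[chr_name == chr])
  writeLines(lines_for_chr, file.path(out_dir, paste0(chr, ".txt")))
}
```

Writing to a folder under `tempdir()` mirrors the package example above and keeps the working directory clean.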
Extract chromosome ids from a Fasta file
subFasID(text = text)
text | Large character vector read by readLines from the chromosome Fasta file |
The id list of the Fasta file.
data("text")
text <- replaceText(type = "text", input = text)
id <- subFasID(text = text)
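Conceptually, collecting the ids amounts to picking out the lines that begin with ">". A minimal base R sketch of that idea follows. The toy lines and variable names are invented, and the return structure of subFasID may differ from the plain vectors shown here.

```r
# Toy Fasta content, one element per line (made-up records)
fasta_lines <- c(
  ">chr1:100-110",
  "ACGTACGTAC",
  ">chr10:200-210",
  "GGGTTTAAAC",
  "CC"
)

# Header lines are the ones that begin with ">"
is_header <- startsWith(fasta_lines, ">")

# The ids themselves and the line numbers where they occur
header_ids <- fasta_lines[is_header]
header_pos <- which(is_header)
```

Keeping the line positions alongside the ids is useful because they mark where each record starts in the original file.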
Data from "Three representative inter and intra-subspecific crosses reveal the genetic architecture of reproductive isolation in rice."
data(tex)
A large list containing 62 elements.
Li, G. et al. (2017) The Plant Journal 92, 349–362.
data(tex)
A downsampled dataset containing the hg19 chromosome sequence from the hg19 blacklist. The hg19 blacklist was obtained from the supplementary dataset of "Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion." The dataset was submitted to the UCSC Table Browser to obtain the corresponding sequence file. The sequence file was processed with the replaceText function to simplify the Fasta id. To best illustrate the usage, the sequence file was downsampled.
data(text)
A character vector with 2099 elements.
Satpathy A T, Granja J M, Yost K E, et al. (2019) Nature Biotechnology 37, 925–936.
data(text)